7c0h.com

Bringing your NLP research to the World

Good practices on sharing your research with end-users.

Most NLP researchers I know have at some point in their life built a prototype to show that their research works. This prototype may have even been used in a paper or two, only to be forgotten and cursed to sleep forever on a hard drive. Even worse, this research is often fragile. A single hard drive failure can wipe out months of work in an instant. In software development terms, we say that its bus factor is less than one [1].

This talk introduces tools and concepts that will help you share your prototype with the world, ensuring in the process that your code is easy to understand, install, and use. The talk is split into three parts: sharing with future you, sharing with other scientists, and sharing with the world.

And for those of you who don’t have the time to go through this presentation, let me give you the one piece of advice you should always keep in mind:

Good practices are about clear communication. Every step we’ll see today is simply a way to ensure that you can clearly communicate with other researchers and systems. When in doubt, always choose the path that makes your research, code, or system as easy to understand as possible.

One final detail: all examples in this presentation are written in Python. This is simply because Python is very popular in NLP, but most advice here translates to any other language.

Part 1: Sharing with future you

The first piece of advice I’ll give is one you’ve heard a hundred times: comment your code. And because I believe that good advice should be actionable, I’m going to be even more specific:

all I ask from you is that, every time you write a function, you write down…
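The exact checklist from the slide is not reproduced in this text, so what follows is only a sketch of one common convention: a docstring that states what the function does, what it expects as input, and what it returns. The name count_happy and its implementation are hypothetical; they mirror the word-counting example that comes up later in the talk.

```python
def count_happy(sentence):
    """Count how many times the word 'happy' appears in a sentence.

    Args:
        sentence: the text to search, as a string. Matching is
            case-insensitive and works on whitespace-separated words
            with surrounding punctuation removed.

    Returns:
        The number of occurrences, as an integer.
    """
    # Hypothetical implementation, written for illustration only.
    words = sentence.lower().split()
    return sum(1 for word in words if word.strip(".,;:!?\"'") == "happy")


print(count_happy("Happy people make other people happy!"))  # prints 2
```

Even a docstring this short tells future you what the function promises without having to re-read its body.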

Moving on, I am going to suggest that you put your code under version control.

If you have never used version control, it can be described as a mechanism for keeping track of all your changes over time. For practical reasons I am going to suggest that you stick to Git [2], but I’ll deviate from standard advice and say:

My third and final suggestion is that you add a Readme.

A Readme is nothing more than a text file explaining what your project does, how to install it, and how to use it. And while you probably don’t care about it at this time, plenty of people will not touch your code with a 10-foot pole unless you add a license too.

If you don’t know what else would make sense to add, the website Readme.so has a nice interface for building a Readme file with the most commonly used options.

Part 2: Sharing with other scientists

There is a fair chance that your code works on your machine and nowhere else. You almost certainly installed that one library that one time, forgot about it, and then had to spend a week retracing your steps when your computer installed an update and everything stopped working. While this may be fine for you (and it really isn’t), sharing your code with other scientists requires that we raise the bar a bit. The key is ensuring that anyone can run your code at any time. I’ll quickly present three tools that you can use, from easiest to use to most powerful.

The easiest way is using a virtual environment, a program that keeps track of every library you installed and isolates them from the rest of the system.

Sharing your code is then as easy as dumping your environment to a single file and adding it to your Git repository, where other users can use it to reproduce your exact environment in a matter of minutes. As a plus, it also lets you install Python dependencies without calling your system administrator every time.
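In practice the dump is usually produced with pip freeze > requirements.txt and restored with pip install -r requirements.txt. To show what such a file contains, here is a rough Python sketch that builds the same kind of pinned list with the standard library. The function names and the requirements.txt default are assumptions made for illustration, not part of any tool.

```python
from importlib import metadata


def dump_environment():
    """Return one 'name==version' line per installed distribution, sorted."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta["Name"] if meta is not None else None
        if name:  # skip broken installs that carry no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pinned list to `path` (a conventional, assumed filename)."""
    lines = dump_environment()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

Run inside an activated virtual environment, write_requirements() only lists what that environment contains, which is what makes the file useful to other people.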

If you have Python installed, you almost certainly have the venv module, the standard-library counterpart of virtualenv, ready to go. If you use TensorFlow and need GPU support, or if you need to use the RDKit library, you might want to give Anaconda and its free cousin Conda-Forge a try instead. virtualenv only keeps track of Python libraries, but Anaconda and Conda-Forge can also install non-Python dependencies such as the CUDA libraries that GPU support relies on (the GPU driver itself still has to be installed on the system).

Moving up the chain, you can keep track of your entire operating system using Docker.

When you create a Docker image you are re-creating a system from scratch, ensuring that absolutely every detail about your code can be found in a single file. Docker is powerful, but it’s also resource-heavy (hard drive in particular) and not super easy to use. But if you are aiming for the highest possible replication score in your next paper review, this is what you’ve been looking for.

And finally, you could go all the way and publish a package with your code.

Once you’ve done this, all your users need to do to use your code is type pip install <package> on their computers, and they are done. There is no better way to share code. This is a lot of work, though, so I won’t delve into the details here. Feel free to read either the official guide or the Poetry website. And if you are researching Transformers, consider pushing your models to the Hugging Face Hub.

Part 3: Sharing with the world

Now that everyone can run your code, it is time to put it on the internet. But before we get into how to put your research into grandma’s hands, let’s talk about security.

Before you release anything in the open, you should stop and ask yourself “how could this go wrong?”. This is what security expert Bruce Schneier calls having a Security Mindset: having the certainty that users will misuse your code (maliciously or not) and being prepared for it.

Here’s an example: we wrote this function earlier today, which is supposed to count how many times the word ‘happy’ appears in a sentence.

But do you know what happens if a user gives it a number as input? Because I do: it crashes. Because I didn’t foresee that someone might use my function in a way that wasn’t what I expected, I didn’t plan for it. So let’s fix that.
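The slide with the fixed function is not part of this text. A minimal sketch of one possible fix, assuming the original function called string methods directly on its input: check the type up front and fail with a clear message. The name count_happy is hypothetical.

```python
def count_happy(sentence):
    """Count occurrences of the word 'happy' in a sentence (a string)."""
    # Reject anything that is not text, instead of failing further down
    # with a confusing AttributeError.
    if not isinstance(sentence, str):
        raise TypeError(
            f"sentence must be a string, got {type(sentence).__name__}"
        )
    words = sentence.lower().split()
    return sum(1 for word in words if word.strip(".,;:!?\"'") == "happy")
```

The function still refuses a number, but now it does so on purpose and tells the caller what it expected.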

Now let me ask you: how well does this function scale? If I were researching tweets there’s a good chance that I never tested it beyond 280 characters. So what would happen if I fed it the entire text of “War and Peace”? What if I fed it the film of the same name, which runs for 3 hours and 28 minutes? Will my function still work? Will my computer still work, or will it consume all of its memory and crash? If you don't know the answer, then maybe it's easier if you just ensure that this problem never comes up.
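One way to make sure the problem never comes up is to refuse inputs larger than anything you have tested. The sketch below assumes a limit of 10,000 characters, which is an arbitrary number chosen for illustration; count_happy_bounded and MAX_CHARS are hypothetical names.

```python
MAX_CHARS = 10_000  # assumed limit; use the largest size you actually tested


def count_happy_bounded(sentence, max_chars=MAX_CHARS):
    """Count the word 'happy', refusing non-text and oversized inputs."""
    if not isinstance(sentence, str):
        raise TypeError("sentence must be a string")
    if len(sentence) > max_chars:
        raise ValueError(
            f"sentence has {len(sentence)} characters, limit is {max_chars}"
        )
    words = sentence.lower().split()
    return sum(1 for word in words if word.strip(".,;:!?\"'") == "happy")
```

A caller who really needs to process a whole novel can split it into chunks, and you never have to find out what happens when they don't.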

When limited to code, a good way to be prepared is through defensive programming. But a security mindset extends beyond that: are you absolutely sure that your crowdsourced workers are actually fluent in English? What if they are using a proxy to appear as living in a different country? What if they are sharing results with each other? What if they answer your questions supernaturally fast? A security mindset is wondering “how would an unreasonable person who hates me misuse this system?” and planning accordingly [3].
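As an illustration of the "supernaturally fast" case, here is a small sketch that flags workers whose typical answer time is below a threshold. The data layout (worker ID mapped to a list of per-question durations in seconds) and the 2-second threshold are assumptions made for the example; a real threshold would have to come from your own task.

```python
from statistics import median


def flag_fast_workers(response_times, min_seconds=2.0):
    """Return the IDs of workers whose median answer time is suspiciously low.

    response_times maps a worker ID to a list of durations in seconds.
    Workers without any recorded answers are ignored.
    """
    flagged = []
    for worker_id, times in response_times.items():
        if times and median(times) < min_seconds:
            flagged.append(worker_id)
    return sorted(flagged)


# Hypothetical data: worker "a" answers in well under a second.
print(flag_fast_workers({"a": [0.5, 0.7, 0.6], "b": [8.0, 12.5, 9.1]}))
```

A flag like this is a reason to look at a worker's answers more closely, not proof of misuse on its own.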

Now that you are ready to share your code, you need to decide whether you want to build the interface yourself or if you’d prefer someone else to do it for you. In the latter case what you need is an API, a standard interface for other computers to talk with your system. If you have existing code then your best friend is Flask, a Python library that adds a thin wrapper around your code to quickly build such an API.
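A minimal sketch of what such a wrapper could look like, assuming Flask is installed. The /count route, the "text" field, the 1 MB body limit, and the count_happy function are all hypothetical choices made for this example.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
# Refuse request bodies larger than 1 MB (an assumed limit).
app.config["MAX_CONTENT_LENGTH"] = 1_000_000


def count_happy(sentence):
    """Stand-in for your existing research code."""
    words = sentence.lower().split()
    return sum(1 for word in words if word.strip(".,;:!?\"'") == "happy")


@app.route("/count", methods=["POST"])
def count_endpoint():
    # silent=True returns None instead of raising on a malformed body.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict) or not isinstance(payload.get("text"), str):
        message = "expected a JSON object with a string field 'text'"
        return jsonify(error=message), 400
    return jsonify(count=count_happy(payload["text"]))


# app.run(port=5000)  # uncomment to start Flask's development server
```

The endpoint validates its input before touching the research code, which is the security mindset from the previous section applied at the boundary of the system.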

Once this is done, you use your favorite search engine and choose whichever interface provider you like the most, connect your API with theirs, and call it a day.

But if you’d prefer to take care of everything then I’d suggest using Django instead. Django has support for everything you need to get started, from user management to database connection.

You may notice that Django websites tend to look boring, but don’t worry: the Bootstrap library can turn even the most boring website into a responsive, professional-looking one.

And with that your model is now on the internet for everyone to use, and this talk comes to an end.

I’d like to thank you for staying with me until the end, and feel free to reach out if you have any questions.

Footnotes

  1. A bus factor of 1 means that losing a single person is enough for a project to come to a halt. A factor less than 1 means that you don't even need the entire person to go away - all it would take to derail the project is a single glass of water on their fragile, non-waterproof, easy-to-misplace laptop.
  2. If it were up to me, I would suggest using Mercurial instead of Git. Unfortunately Mercurial lost the Version Control Wars in 2019, and I am professionally obligated to suggest the tool you are most likely to use instead of the best one. If you need a GitHub-like interface I am personally a fan of Heptapod, but that’s far from the only one.
  3. The relationship between researchers and crowdsourced workers is complicated. Researchers would like workers to be as dedicated as a university student for a fraction of the price. Workers are trying to earn a living through honest work that pays fractions of a cent. Being strict on the technical side and generous on pay is a strategy that has never let me down.