What are training and testing?
This article is the first of a series in which I explain what my research is about in (I hope) a simple and straightforward manner. For more details, feel free to check the Research section.
In research, we often want to teach computers how to do a new task, but that is difficult because computers are not too smart, and teaching them even a simple task takes a lot of work. So let's say I want my computer to tell me whether an e-mail is important or not. If I could teach my computer that, then it could show me important e-mails first and save me the trouble of sorting through them daily.
One way of teaching a task to a computer is to do the job myself and then have the computer repeat what I did. Scientists have been doing this for a long time, and today we have a set of steps that every researcher should follow.
The first step is to collect as many e-mails as possible, both important and not. Researchers call a large collection of texts like this a corpus.
Now, just as you wouldn't know which e-mails I consider important, neither does a computer. So the second step is to go through all the e-mails I collected and mark which ones are important. Then I'll split them into two groups, one called "training" and the other called "testing". The first group will contain 4 out of every 5 e-mails, picked at random, while the second group will have the remaining ones.
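For the curious, here is a minimal sketch in Python of what such a split could look like. The function name `split_corpus`, the made-up e-mails, and the fixed random seed are my own assumptions for illustration; real projects often rely on a library for this.

```python
import random

def split_corpus(emails, train_fraction=0.8, seed=0):
    """Shuffle the labeled e-mails and split them into training and testing groups."""
    shuffled = list(emails)                 # copy, so the original corpus is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy corpus: (text, is_important) pairs, invented for this example.
corpus = [(f"e-mail number {i}", i % 2 == 0) for i in range(10)]
training, testing = split_corpus(corpus)
print(len(training), len(testing))  # 8 2
```

Picking the e-mails at random matters: if the testing group only contained, say, last week's e-mails, the test would not tell me much about e-mails in general.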
The third step, unsurprisingly called the training stage, requires the computer to analyze all the e-mails I put in the training group and decide what makes an e-mail important. We would expect our computer to understand, for instance, that since every e-mail containing the word "SALE" was marked as unimportant, it might be a good idea to mark all e-mails with commercial offers as unimportant. This is by far the hardest step, and there are many ways in which I can influence how well the computer will learn.
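To make this less abstract, here is a deliberately simple sketch of a training stage in Python. It only counts how many important and unimportant e-mails contain each word. This is one of the simplest possible approaches, chosen for illustration; the function name `train` and the example e-mails are assumptions, and real systems are considerably more elaborate.

```python
from collections import Counter

def train(training_group):
    """Count, for every word, how many important and unimportant e-mails contain it."""
    important_counts = Counter()
    unimportant_counts = Counter()
    for text, is_important in training_group:
        words = set(text.lower().split())   # each word counts once per e-mail
        if is_important:
            important_counts.update(words)
        else:
            unimportant_counts.update(words)
    return important_counts, unimportant_counts

# Toy training group: (text, is_important) pairs, invented for this example.
training_group = [
    ("Big SALE this weekend", False),
    ("SALE ends today", False),
    ("Meeting moved to Monday", True),
]
important_counts, unimportant_counts = train(training_group)
print(unimportant_counts["sale"], important_counts["sale"])  # 2 0
```

After training, the counts show that "sale" only ever appeared in unimportant e-mails, which is exactly the kind of pattern we hope the computer picks up.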
The fourth and final step is to give our computer a test, to see whether it learned something useful or not. For this step, called the testing stage, I'll go through each e-mail from the testing group, show the computer the e-mail's text, and ask whether it's important or not. Then I'll compare the computer's answers with mine, and use that result to decide how well (or how badly) my computer learned the task. If the results are not good enough, I can always go back, change how the e-mails are analyzed, and try again. If the results are good, on the other hand, I can trust my program to sort my e-mail from now on.
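The comparison in the testing stage can be sketched in a few lines of Python. Here I measure the fraction of test e-mails where the computer's answer matches mine, which is only one of several possible ways to score a test. The names `accuracy` and `toy_classifier` are hypothetical, and the classifier is a hand-written stand-in for whatever the computer learned during training.

```python
def accuracy(classifier, testing_group):
    """Return the fraction of test e-mails where the computer agrees with my label."""
    correct = sum(
        1 for text, is_important in testing_group
        if classifier(text) == is_important
    )
    return correct / len(testing_group)

def toy_classifier(text):
    # Hypothetical stand-in for a trained model:
    # anything containing "sale" is considered unimportant.
    return "sale" not in text.lower()

# Toy testing group: (text, is_important) pairs, invented for this example.
testing_group = [
    ("Huge SALE on shoes", False),
    ("Project deadline tomorrow", True),
    ("Wholesale price list you asked for", True),
    ("Lunch on Friday?", True),
]
score = accuracy(toy_classifier, testing_group)
print(score)  # 0.75
```

The one mistake comes from "Wholesale", which contains "sale" but was important to me. Errors like this are what send me back to change how the e-mails are analyzed.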
This is pretty much half my daily work. Collecting enough data (e-mails) can be complicated, expensive, slow, or all of those at once. And remember I said there are many ways in which a computer might learn? We have to try some of those alternatives too.
Finally, training is usually very slow: in my last project, it took almost a week.
I usually dedicate that time to play Solitaire.