7c0h

A more polite Taylor Swift with NLP and word vectors

My relationship with Taylor Swift is complicated: I don't hate her. In fact, she seems like a very nice person. But I definitely hate her songs: her public persona always comes across to me as entitled, abusive, and/or an unpleasant person overall. But what if she didn't have to be? What if we could take her songs and make them more polite? What would that be like?

In today's post we will use the power of science to answer this question. In particular, the power of Natural Language Processing (NLP) and word embeddings.

The first step is deciding on a way to model songs. We will reach into our NLP toolbox and take out distributional semantics, a research area that investigates whether words that show up in similar contexts also have similar meanings. This research introduced the idea that once you treat a word like a number (a vector, to be precise, called the embedding of the word), you can apply regular math operations to it and obtain results that make sense. The classic example is a result shown in this paper, where Mikolov and his team managed to represent words in such a way that the result of the operation King - man + woman ended up being very close to Queen.
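To make the King - man + woman idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented by hand for illustration (real embeddings are learned from text and have hundreds of dimensions), and the helper names cosine and closest are my own, not part of any library.

```python
import math

# Hand-made toy "embeddings", invented purely for illustration.
# Real word2vec vectors are learned from a corpus, not written by hand.
embeddings = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest(query, exclude=()):
    """Return the word whose vector is most similar to the query vector."""
    candidates = [w for w in embeddings if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# king - man + woman, computed component by component
query = [k - m + w for k, m, w in
         zip(embeddings["king"], embeddings["man"], embeddings["woman"])]

# The words used in the query are excluded, as is usual for analogy tests
result = closest(query, exclude={"king", "man", "woman"})
print(result)
```

With these made-up numbers the closest remaining word is "queen", but that is by construction: the point is only to show what "doing math with words" means in practice.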

The picture below shows an example. If we apply this technique to all the Sherlock Holmes novels, we can see that the names of the main characters are placed in a way that intuitively makes sense if you also plot the locations for "good", "neutral", and "evil" as I've done. Mycroft, Sherlock Holmes' brother, barely cares about anything and therefore is neutral; Sherlock, on the other hand, is much "gooder" than his brother. Watson and his wife Mary are the least morally-corrupt characters, while the criminals end up together in their own corner. "Holmes" is an interesting case: the few sentences where people refer to the detective by saying just "Sherlock" are friendly scenes, while the scenes where they call him "Mr. Holmes" are usually tense, serious, or may even refer to his brother. As a result, the word "Sherlock" ends up with a positive connotation that "Holmes" doesn't have.

Embeddings for characters in the Sherlock Holmes novels

This technique is implemented by word2vec, a series of models that receive documents as input and turn their words into vectors. For this project, I've chosen the gensim Python library. This library implements not only word2vec but also doc2vec, a model that will do all the heavy lifting for us when it comes to turning a list of words into a song.

This model needs data to be trained, and here our choices are a bit limited. The biggest corpus of publicly available lyrics right now is (probably) the musiXmatch Dataset, a dataset containing information for 327K+ songs. Unfortunately, and thanks to copyright laws, working with this dataset is complicated. Therefore, our next best bet is this corpus of 55K+ songs in English, which is much easier to get and work with.

The next steps are more or less standard: for each song we take its words, convert them into vectors, and define a "song" as a special word whose meaning is a combination of its individual words. Once we have that, we can start performing some tests. The following code does all of this, and then asks an important question: what would happen if we took Aerosmith's song Amazing, removed the amazing part, and chose the song that's most similar to the result?


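import csv
import gzip
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

documents = []
# Open in text mode ('rt') so csv.DictReader receives strings, not bytes.
# Note: gzip.open expects a gzip-compressed CSV; if your copy is a real
# .zip archive, open it with the zipfile module instead.
with gzip.open('songlyrics.zip', 'rt', encoding='utf-8', newline='') as f:
    csv_reader = csv.DictReader(f)
    # Read the lyrics, pre-process the words,
    # and turn each song into a tagged document
    for counter, row in enumerate(csv_reader):
        words = simple_preprocess(row['text'])
        doc = TaggedDocument(words, ['SONG_{}'.format(counter)])
        documents.append(doc)

# Train a Doc2Vec model. These names follow gensim 4; older releases
# use size= instead of vector_size= and model.docvecs instead of model.dv
model = Doc2Vec(vector_size=150, window=10, min_count=2, workers=10, epochs=10)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# Apply some simple math to a song, and obtain a list of the 10
# most similar songs to the result.
# In our lyrics database, song 22993 is "Amazing", by Aerosmith
song = model.dv['SONG_22993']
query_vector = song - model.wv['amazing']
# most_similar returns (tag, similarity) pairs, best match first
for tag, similarity in model.dv.most_similar([query_vector]):
    print(tag)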

One would expect that Amazing minus amazing would be... well, boring. And you would be right! Predictably, when we do exactly that we end up with...

  • ...Margarita, a song about a man who meets a woman in a bar and cooks soup with her.
  • ...Alligator, a song about an alligator lying by the river.
  • ...Pony Express, a song about a mailman delivering mail.

We can use this same model to answer all kinds of important questions I didn't know I had:

  • Have you ever wondered what would be "amazingly lame"? I can tell you! Amazing + lame = History in the making, a song where a rapper tells us how much money he has.
  • Do you sometimes think "I like We are the World, but I wish it had more violence"? If so, Blood on the World's hands is the song for you.
  • What if we take Roxette's You don't understand me and add understanding to it? As it turns out, we end up with It's you, a song where a man breaks up with his wife/girlfriend because he can't be the man she's looking for. I guess he does understand her now but still: dude, harsh.
  • On the topic of hypotheticals: if we take John Lennon's Imagine and we take away the imagination, all that's left is George Gershwin's Strike up the band, a song about nothing but having "fun, fun, fun". On the other hand, if we add even more imagination we end up with Just my imagination, dreaming all day of a person who doesn't even know us.

This is all very nice, but what about our original question: what if we took Taylor Swift's songs and removed all the meanness? We can start with her Grammy-winning songs, and the results are actually amazing: the song that best captures the essence of Mean minus the meanness is Blues is my middle name, going from a song where a woman swears vengeance to a song where a man quietly laments his life and hopes that one day things will come his way. Adding politeness to We Are Never Ever Getting Back Together results in Everybody knows, a song where a man lets a woman know he's breaking up with her in a very calm and poetic way. The change is even more apparent when the bitter Christmases when you were mine turns into the (slightly too) sweet memories of Christmas brought by Something about December.

Finally, and on the other side, White Horse works better with the anger in. While this song is about a woman enraged at a man who let her down, taking the meanness out results in the hopeless laments of Yesterday's Hymn.

So there you have it. I hope it's clear that these are completely accurate results, that everything I've done here is perfectly scientific, and that any kind of criticism from Ms. Swift's fans can be safely disregarded. But on a more serious note: I hope it's clear that this is only the tip of the iceberg, and that you can take the ideas I've presented here in many cool directions. Need a hand? Let me know!

Eye-tracking and visual salience

This article is the fifth of a series in which I explain what my research is about in (I hope) a simple and straightforward manner. For more details, feel free to check the Research section.

In my last post we faced a hard problem: If a person visits a museum, for instance, we could give them information on the piece they are looking at. But computers don't have eyes! We could use a camera, sure, but that only works if there is only one art piece nearby. If there are several paintings close to each other, how do we decide which one of them is the interesting one?

One way is through what we call eye-tracking. This technology works like a regular camera, but with a catch: it doesn't only look forward, but it also looks backwards, at you! If you wear one of these so-called eye-trackers, it follows the movement of your eyes and records not only the entire scene (like a regular camera) but also a tiny dot that points out what you were looking at. Some colleagues and I found that eye-movement gives you a very good guess at what has captured someone's attention. After all, if you are interested in something, you are probably looking at it.

But there's a complication: eye-trackers are bulky, expensive, and take a long time to set up. And most people feel uncomfortable knowing that someone is recording their activity all the time. It is safe to say that we won't be wearing eye-trackers for fun anytime soon, and that's not great: what good are our results, if no one wants to use them?

Luckily, two men named John Kelleher and Josef van Genabith came up with a smart idea: whenever we are interested in an object, we look at it and get closer. They then applied this idea backwards: if we are looking in a certain direction and walking towards it, all we need to do is figure out what is right in front of us - that must be the object we care about. This technique is called visual salience, and it's a good alternative to an eye-tracker: rather than asking people to wear expensive glasses, all we need to know is the direction in which they are walking. It might not be as effective, but it's good enough for us.
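Here is a rough sketch of the "what is right in front of us" step. To be clear, this is my own simplification and not the algorithm from Kelleher and van Genabith's work: it assumes a flat 2D floor plan, a known walking direction in degrees, and a made-up cutoff angle (max_angle_deg) beyond which nothing counts as "in front". The function name most_salient and the museum layout are hypothetical.

```python
import math

def most_salient(position, heading_deg, objects, max_angle_deg=30.0):
    """Return the name of the object closest to straight ahead, or None.

    position: (x, y) of the visitor.
    heading_deg: walking direction in degrees (0 = +x axis, 90 = +y axis).
    objects: dict mapping object names to (x, y) positions.
    max_angle_deg: assumed cutoff; objects further off-axis are ignored.
    """
    best_name, best_angle = None, max_angle_deg
    for name, (x, y) in objects.items():
        dx, dy = x - position[0], y - position[1]
        if dx == 0 and dy == 0:
            continue  # standing exactly on the object: no direction to compare
        bearing = math.degrees(math.atan2(dy, dx))
        # Smallest absolute difference between bearing and heading, in [0, 180]
        angle = abs((bearing - heading_deg + 180) % 360 - 180)
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name

# Hypothetical room with three paintings
paintings = {"portrait": (5, 0), "landscape": (5, 3), "still life": (0, 5)}
ahead = most_salient((0, 0), 0, paintings)      # walking along the +x axis
upward = most_salient((0, 0), 90, paintings)    # walking along the +y axis
behind = most_salient((0, 0), 180, paintings)   # walking away from everything
print(ahead, upward, behind)
```

A real system would also want to take distance and object size into account, which this sketch ignores on purpose.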

Following people's attention is important if we want our computers to cooperate with us: if a computer asks you to turn on the lights, but you start walking towards the fire alarm, it should warn you immediately that you are about to make a mistake. How to correct that mistake, however, is the topic of the next (and final) article.

Sentiments are the new Spam - Part 2: user groups

So, you have successfully created an online community. People seem genuinely engaged, and you have interesting discussions going on. And then one day I show up, decide that "it would be a shame if something were to happen to your little community", and start harassing your users because... well, because. Call it 4chan, Gamergate, MRA or trolls, there's always a group ready to drag a community into the ground.

Like I said last time, one of the main characteristics of the internet is that you can't block me, you can only block my user. So let's focus, from the simplest to the more complex, on how you could keep me from being annoying and/or harassing other people in your community.

Privileged users

The first step I suggest you take is a hierarchical scale of users. It doesn't have to be too complex - I'd start with something like this:

Anonymous users are those that have not yet logged in. Usually they are allowed read-only access to the site, but in some cases not even that. As a counter-example, Slashdot is known for allowing anonymous users to post and comment on the site, although with a catch that I'll discuss later.

New users should have limited posting capabilities - maybe they can only vote but not comment, or their comments are given partial visibility by default. Getting out of this category should be relatively easy for a "good" user (although time-consuming - no less than an hour, perhaps even days), but it should definitely annoy those that are only "giving the website a try".

Your regular users are the ones that actually use your site as intended. They can post and comment at will. And finally, the power users are allowed some extra permissions - usually this means they can edit or remove other people's posts. This level should be pretty hard to achieve.
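If it helps to see the four levels written down, here is one possible way to encode them. The exact permissions per level are my own assumption for the sake of the example (the text above only says new users are "limited" and power users can edit or remove posts), so treat the REQUIRED_LEVEL table as a hypothetical policy and not a recommendation.

```python
from enum import IntEnum

class Level(IntEnum):
    """User levels, ordered from least to most trusted."""
    ANONYMOUS = 0
    NEW = 1
    REGULAR = 2
    POWER = 3

# Minimum level required for each action (hypothetical policy)
REQUIRED_LEVEL = {
    "read": Level.ANONYMOUS,
    "vote": Level.NEW,
    "comment": Level.REGULAR,
    "post": Level.REGULAR,
    "edit_others": Level.POWER,
}

def can(level, action):
    """Return True if a user at this level may perform the action."""
    return level >= REQUIRED_LEVEL[action]

print(can(Level.NEW, "vote"))       # new users may vote...
print(can(Level.NEW, "comment"))    # ...but not comment yet
```

Keeping the policy in a single table makes it easy to tighten or relax a level later without touching the rest of the site.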

The iron fist of justice

Now that you have user levels, new users are your main concern: it is not unusual for trolls to create thousands of accounts (automatically, of course) and use them to assault a particular user. Remember: any regular user should be able to stop the noise in a simple and straightforward way - otherwise you risk becoming an online harassment platform, and you'll have to publicly apologize like Twitter's CEO often does.

Our first moderation tool will be karma points. Each time a user contributes to our website, other users can rate this contribution positively or negatively. Contributions with "high karma" will be given a predominant position, while contributions with "low karma" will be buried. This is how Slashdot can allow anonymous contributions without being buried in dumb comments: every comment posted anonymously will have very low karma by default, but if enough users vote it up, it will eventually be seen by everyone else. Similarly, Hacker News will not allow users to vote negatively if they haven't yet reached a certain karma threshold.

Sidenote: you don't want to rank your posts/comments simply based on who has the highest number of votes. Instead, take a look at reddit's comment sorting system.
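For the curious: as far as I know, the reddit system mentioned in the sidenote ranks comments by the lower bound of the Wilson score confidence interval, which rewards comments that have both a high share of upvotes and enough votes to trust that share. The sketch below is my own implementation of that formula; the function name and the default z=1.96 (roughly 95% confidence) are choices I made for the example.

```python
import math

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the upvote ratio."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # no votes, no evidence
    p = upvotes / n
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denominator

# One lonely upvote (100% positive) versus 90 up / 10 down (90% positive)
single = wilson_lower_bound(1, 0)
popular = wilson_lower_bound(90, 10)
print(single < popular)
```

Note how the comment with a single upvote ranks below the one with 90 upvotes and 10 downvotes, even though its raw ratio is higher: one vote is simply not enough evidence.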

Another tool you'll find useful is the good old ban. A temporary ban means that a given user cannot post for a given period of time, while a permaban (permanent ban) means that the user is kicked out forever. This is a standard tool in every forum, but we can still do better: given that nothing stops a banned user from creating a new account and continuing their toxic behavior (and remember, now they are pissed about being banned), you can use a hellban. When a user is hellbanned, no one but them can see their activity. The user can still log in, comment and post, but this activity is invisible to everyone else. From their point of view, it looks as if no one cares about them anymore, and it's not unusual for them to just leave.
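The hellban boils down to a single visibility rule, which can be sketched in a few lines. The Comment class and the visible_comments function are hypothetical names for this example; a real forum would apply the same rule inside its database query.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    author: str
    text: str

def visible_comments(comments, viewer, hellbanned):
    """Hide comments by hellbanned users from everyone except themselves."""
    return [c for c in comments
            if c.author not in hellbanned or c.author == viewer]

thread = [
    Comment("alice", "Nice article!"),
    Comment("troll", "You are all wrong."),
    Comment("bob", "Thanks for sharing."),
]
hellbanned = {"troll"}

seen_by_bob = visible_comments(thread, "bob", hellbanned)
seen_by_troll = visible_comments(thread, "troll", hellbanned)
print(len(seen_by_bob), len(seen_by_troll))
```

Bob sees two comments while the troll still sees all three, including their own, so nothing looks out of place from the troll's side.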

Finally, you might also want to consider a "report" button, through which users can report unruly behavior. This should be more or less automated, but you cannot blindly trust these reports: you risk trolls banding together and reporting users at will. To prevent this, an automated recourse method should be enough - a moderator is notified, and the user is not fully banned until a final decision is reached. And finally, if you want to go the extra mile, you could have a "protected" flag that keeps certain users from being reported.

That's about all you can do at this level. There are no new ideas here, which is good - now you know that these concepts have been tried and tested before. In the next two posts I'll be discussing things that might not make as much sense, so stay tuned.

Sentiments are the new Spam - Prologue

Once upon a time, you would create an e-mail account and use it for a long time without receiving spam. In fact, whenever you received your first spam message, you'd know exactly who to blame: that one cousin of yours who'd send you every single motivational powerpoint she came across, along with a list of 1500 other e-mail addresses. We could argue about who's the spammer in this situation, but that discussion will have to wait.

That kind of control over your account is no longer possible: even if you never share your account with anyone, you will at some point get spam. It's just the way things are, the "background radiation" of the internet. Luckily for us, things got so bad that a lot of smart people sat down to think really hard about this, and came up with Bayesian filtering, a technique so effective that most of us don't even bother checking our Spam folders anymore.
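For those who have never seen it, the core of Bayesian filtering fits in a few lines. What follows is a bare-bones naive Bayes classifier with add-one smoothing, written by me as an illustration; real spam filters use far more data and many extra tricks, and the four training messages below are invented.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam) pairs. Returns the learned counts."""
    word_counts = {True: Counter(), False: Counter()}
    message_counts = Counter()
    for text, is_spam in messages:
        message_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, message_counts

def spam_probability(text, word_counts, message_counts):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    total_messages = sum(message_counts.values())
    log_scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from the prior: how common is this class overall?
        score = math.log(message_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Turn the two log scores back into a probability
    highest = max(log_scores.values())
    spam = math.exp(log_scores[True] - highest)
    ham = math.exp(log_scores[False] - highest)
    return spam / (spam + ham)

# Invented training data: two spam messages, two legitimate ones
training = [
    ("win money now", True),
    ("cheap money offer", True),
    ("lunch at noon", False),
    ("meeting notes attached", False),
]
word_counts, message_counts = train(training)
spammy = spam_probability("win cheap money", word_counts, message_counts)
friendly = spam_probability("lunch meeting", word_counts, message_counts)
print(spammy > 0.5, friendly < 0.5)
```

The filter never "understands" a message: it only counts which words tend to show up in which pile, and that turns out to be enough.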

So we^1 succeeded once. It's a good thing to remember, because we have a much harder battle to fight now: trolling, and its ugly cousin, online harassment.

Let's say you post a message on an online board. These are some of the things that could happen, in no particular order:

  • You could get an interesting, well thought reply (note that "well thought" doesn't mean "agrees with you"). It happens.
  • You could be modded down by people that disagree with what you just posted, even if the rules say they shouldn't.
  • You could be flooded by negative messages, because a certain group decided to impose their point of view. This is called brigading, by the way, and it's usually not personal - they oppose your point of view, but not you.
  • You could be flooded by negative messages, because a group has decided to target you online for something you said, or did, or are.
  • You could be posting on behalf of a company, speaking in favor of its products while posing as anyone-but-an-employee. This is called being a shill, and most websites either pretend that it doesn't happen or they don't care.
  • You could be trying to derail a discussion, in order to make sure a certain point is not brought to light, or is drowned in the noise. This usually implies that you work for a government agency, it's being done right now, and it works.

We used to believe that everyone on the internet would eventually behave nicely, and that we could build our services based on trusting the 95% of users that have no hidden agenda. This is sadly not so, because

  1. ... people have not behaved nicely on the Internet since September 1993.
  2. ... 5% of very loud users are a lot more noticeable than 95% of the quiet ones. A post-mortem of a DARPA Challenge showed that a single person can sabotage the work of thousands of well-meaning volunteers.

In the follow-up articles I'm going to comment on what I perceive to be three main points in which this issue could be attacked. They are

  • Anonymity: there's no way of taking measures against a person, only against a user. This is by design, and I'm not arguing that we should get rid of anonymity. We should instead focus on identifying toxic users, which I think can be done by implementing user groups.
  • Flamewars: derailing discussions in order to kill them. This may be a job for pattern matching, identifying when the shape of a discussion is tending towards known anti-patterns. We might also want to add clustering, in order to identify brigades.
  • Harassment: perhaps the hardest one. It requires sentiment analysis techniques to identify negative comments and kill them before they reach their destination.

In the follow-up essays I'll present some papers about how one would go about attacking each point. I have no reason to believe that these techniques are unknown (some of them are already implemented), but I post them hoping that, much like Bayesian filtering, someone will read them and have an "oh, wait" moment.

Coming up next: anonymous users and user groups.

Footnotes

^1 Of course, by "we" I mean "the computer science community in general". I did not create Bayesian filtering.

Genius MousePen i608 in Debian Linux

I'm the proud owner of a Genius MousePen i608 graphic tablet (also known as UC-LOGIC Tablet WP8060U). This tablet is quite old and cheap, which is more often than not a recipe for headaches.

One very specific problem that I have: my tablet has an aspect ratio of 4:3, like old computers did, but both my desktop and laptop's screens have an aspect ratio of 16:9. Why is this a problem? Because my computer believes that the tablet and screen have the same aspect ratio, and whenever I draw a circle on my tablet it comes up on screen as an oval.

There are two possible solutions to this issue. One is changing my screen's resolution to match the 4:3 aspect ratio, which is annoying: I have to change the screen settings, then fiddle with my actual, physical screen so it doesn't stretch the image, and then I have to undo both steps once I'm done. The second solution requires a few more calculations, but it's the right way: we'll configure the tablet in such a way that Linux recognizes the difference in ratios.

To be more precise: We will define a rectangle with the same height as the screen and a proportional width (sticking to the 4:3 ratio between width and height), we will position that rectangle in the center of the screen, and all movements in our tablet will only apply to that section of the screen. All movements on the tablet will translate to this rectangle without distortion, and if we need to interact with the screen outside this area we can still use our mouse.
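Before getting to the actual script, here is the arithmetic on its own as a small Python sketch, so the numbers are easier to follow. The function name tablet_matrix and the 1920x1080 example resolution are my own choices; the nine values are laid out row by row, which is the order in which the Coordinate Transformation Matrix property expects them.

```python
def tablet_matrix(screen_width, screen_height, ratio_w=4, ratio_h=3):
    """Return the 9 matrix values (row by row) that map the tablet onto a
    centered, full-height rectangle with the tablet's own aspect ratio."""
    # Width of the usable area, keeping the tablet's 4:3 proportion
    area_width = screen_height * ratio_w / ratio_h
    c0 = area_width / screen_width                        # x scale
    c1 = (screen_width - area_width) / 2 / screen_width   # x offset (centered)
    c2 = 1.0                                              # y scale (full height)
    c3 = 0.0                                              # y offset
    return [c0, 0.0, c1,
            0.0, c2, c3,
            0.0, 0.0, 1.0]

# Worked example for a 16:9 screen of 1920x1080 pixels:
# the usable area is 1440 pixels wide, leaving 240 pixels on each side
matrix = tablet_matrix(1920, 1080)
print(matrix[0], matrix[2])
```

For 1920x1080 this gives an x scale of 0.75 and an x offset of 0.125, which you can compare against whatever the script computes for your own screen.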

The following code will run all the numbers for us. In essence, it will calculate the required set of parameters, and then it will use xinput to modify the tablet's Coordinate Transformation Matrix property accordingly:

# Get the current screen resolution
resolution=`xrandr | grep '*' | cut -f 4 -d ' '`
width=`echo ${resolution} | cut -f 1 -d 'x'`
height=`echo ${resolution} | cut -f 2 -d 'x'`

# Get the proper tablet width, according to the 4:3 proportion
tablet_width=`echo "${height} 3 / 4 * n" | dc`

# We need to calculate four parameters c0, c1, c2, c3. For that, we use the
# 'dc' utility, which uses postfix notation (i.e., you write "7/3" as "7 3 /").
#
# Note: if you want to move the usable section of the screen left or right,
# take a look at the 'x offset' value. Also note that, since we are using the
# entire height of the screen, the 'y offset' is simply 0.

# Touch area width / width
c0=0`echo "7 k ${tablet_width} ${width} / n" | dc`
# Touch area x offset / width
c1=0`echo "7 k ${width} ${tablet_width} - 2 / ${width} / n" | dc`
# Touch area height / height
c2=1.0
# Touch area y offset / height
c3=0.0

# Obtain the device ID for the graphics tablet. Note that UC-LOGIC is the name
# my tablet reports in the xinput list, but yours may be different
device=`xinput | grep UC-LOGIC | head -n 1 | cut -f 2 -d '=' | cut -f 1`
# Set the Coordinate Transformation Matrix
xinput set-prop ${device} --type=float "Coordinate Transformation Matrix" ${c0} 0 ${c1} 0 ${c2} ${c3} 0 0 1

And that's it! It happens to me often that the transformation doesn't work straight away, in which case unplugging and plugging the tablet again solves the problem. A second issue with every reinstall is that the X server sometimes refuses to recognize my tablet. I solved that problem by adding the following lines to the /etc/X11/xorg.conf file:

Section "InputClass"
    Identifier "evdev tablet catchall"
    MatchIsTablet "on"
    MatchDevicePath "/dev/input/event*"
    Driver "evdev"
EndSection