7c0h

Hierarchical loss for multi-label classification - Part II

In my previous post on hierarchical loss for multi-label classification I gave an implementation of a specific algorithm for calculating the loss between two trees. I then added a quick edit mentioning that "this algorithm doesn't work too well in practice", and today I want to delve into why.

Imagine you want to predict the cities where someone lived based on some data. The hierarchy of locations is a tree with country at the first level, province or state second, and city at its third level. This tree has ca. 195 nodes on its first level and a lot more as we go down the tree.

Let's now say that I was supposed to choose Argentina.Misiones.Posadas (which corresponds to a city in Argentina) but I predicted Congo.Bouenza.Loutété; (which is the 10th most popular city in the Republic of Congo). The loss for this prediction is 0.01, which is surprisingly low - seeing as I wasn't even close to the real answer, I would have expected something near 1.

As we go deeper into the tree, the loss goes down real quick. If I had predicted Argentina.Chubut.Puerto Madryn (a city 1900km away in one of the other 23 possible provinces) the loss would be 0.00043, and if I had predicted Argentina.Misiones.Wanda (one of the other 64 cities in the correct province) my loss would have been 0.000019. If your tree is deeper than this then you will soon start running into numerical issues.

The problem here is the nature of the problem itself. Because my predictions are multi-label there is no limit to the number of cities where a person may have lived while, simultaneously, there is no limit to how many cities I may predict. If I predict that a person has lived in every single city in America, from Ward Hunt Island Camp in Canada down to Ushuaia in Argentina and everything in between, but it turns out that the person has lived in all other cities in the world, my loss would only then be 1. And if it turns out that the person has briefly lived in Argentina.Misiones.Posadas then my loss goes down to ~0.995 because getting one city right also means that I got the country right.

Now you see why this algorithm is very good in theory but not useful in practice: if you are trying to predict one or two points in a big tree then your losses will always be negligible. No matter how wrong your prediction is, the loss for a "normal" person will never be high enough to be useful.

On the other hand, if you are expecting your predictions to cover a good chunk of the tree then this algorithm is still right for you. Otherwise a good alternative is to use the Jaccard distance instead and represent Argentina.Misiones.Posadas as the set {"Argentina", "Argentina.Misiones", "Argentina.Misiones.Posadas"}. This is not as fair a measure as I would like (it punishes small errors a bit too harshly) but it still works well in practice. You could also look deeper into the paper and see if the non-normalized algorithms work for you.