Hierarchical loss for multi-label classification - Part II
In my previous post on hierarchical loss for multi-label classification, I gave an implementation of a specific algorithm for calculating the loss between two trees. I then added a quick edit mentioning that "this algorithm doesn't work too well in practice", and today I want to dig into why.
Imagine you want to predict the cities where someone has lived based on some data. The hierarchy of locations is a tree with country at the first level, province or state at the second, and city at the third. This tree has roughly 195 nodes at its first level (one per country) and many more as we go down.
Let's now say that I was supposed to choose Argentina.Misiones.Posadas (which corresponds to a city in Argentina) but I predicted Congo.Bouenza.Loutété (which is the 10th most popular city in the Republic of Congo). The loss for this prediction is 0.01, which is surprisingly low - seeing as I wasn't even close to the real answer, I would have expected something near 1.
As we go deeper into the tree, the loss goes down real quick. If I had predicted Argentina.Chubut.Puerto Madryn (a city 1,900 km away in one of the other 23 possible provinces) the loss would be 0.00043, and if I had predicted Argentina.Misiones.Wanda (one of the other 64 cities in the correct province) my loss would have been 0.000019. If your tree is deeper than this, then you will soon start running into numerical issues.
The problem lies in the nature of the task itself. Because my predictions are multi-label, there is no limit to the number of cities where a person may have lived and, simultaneously, no limit to how many cities I may predict. If I predict that a person has lived in every single city in the Americas, from Ward Hunt Island Camp in Canada down to Ushuaia in Argentina and everything in between, but it turns out that the person has lived in every other city in the world, only then would my loss be 1. And if it turns out that the person also briefly lived in Argentina.Misiones.Posadas, then my loss goes down to ~0.995, because getting one city right also means I got the country right.
Now you see why this algorithm is very good in theory but not useful in practice: if you are trying to predict one or two points in a big tree then your losses will always be negligible. No matter how wrong your prediction is, the loss for a "normal" person will never be high enough to be useful.
On the other hand, if you are expecting your predictions to cover a good chunk of the tree, then this algorithm is still right for you. Otherwise, a good alternative is to use the Jaccard distance instead and represent Argentina.Misiones.Posadas as the set {"Argentina", "Argentina.Misiones", "Argentina.Misiones.Posadas"}. This is not as fair a measure as I would like (it punishes small errors a bit too harshly) but it still works well in practice. You could also look deeper into the paper and see if the non-normalized algorithms work for you.
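To make the set representation concrete, here is a minimal Python sketch of that alternative. The helper names (to_node_set, jaccard_distance) are mine, for illustration only, and not part of the previous post's implementation: each path is expanded into the set of its ancestors, and the distance is one minus the intersection-over-union of the two sets.

```python
def to_node_set(path):
    """Expand a label like "Argentina.Misiones.Posadas" into the set of
    all its prefixes, one per level of the tree."""
    parts = path.split(".")
    return {".".join(parts[:i + 1]) for i in range(len(parts))}


def jaccard_distance(true_paths, predicted_paths):
    """Jaccard distance between two multi-label predictions, where every
    true or predicted path is first expanded into its ancestor set."""
    truth = set().union(*(to_node_set(p) for p in true_paths))
    pred = set().union(*(to_node_set(p) for p in predicted_paths))
    if not truth and not pred:
        return 0.0
    return 1.0 - len(truth & pred) / len(truth | pred)


# Wrong country: the two sets share nothing, so the loss is maximal.
print(jaccard_distance(["Argentina.Misiones.Posadas"],
                       ["Congo.Bouenza.Loutété"]))     # 1.0

# Right country and province, wrong city: 2 of the 4 nodes match.
print(jaccard_distance(["Argentina.Misiones.Posadas"],
                       ["Argentina.Misiones.Wanda"]))  # 0.5
```

The second example shows the harshness I mentioned: getting the country and province right but missing the city still costs 0.5, but at least the values stay in a range where they are useful for comparing predictions.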