Hierarchical loss for multi-label classification - Part II
In my previous post on hierarchical loss for multi-label classification I gave an implementation of a specific algorithm for calculating the loss between two trees. I then added a quick edit mentioning that "this algorithm doesn't work too well in practice", and today I want to delve into why.
Imagine you want to predict the cities where someone lived based on some data. The hierarchy of locations is a tree with country at the first level, province or state second, and city at its third level. This tree has ca. 195 nodes on its first level and a lot more as we go down the tree.
Let's now say that I was supposed to choose Argentina.Misiones.Posadas (which corresponds to a city in Argentina) but I predicted Congo.Bouenza.Loutété (the 10th most popular city in the Republic of Congo). The loss for this prediction is 0.01, which is surprisingly low: seeing as I wasn't even close to the real answer, I would have expected something near 1.
As we go deeper into the tree, the loss drops off very quickly. If I had predicted Argentina.Chubut.Puerto Madryn (a city 1,900 km away, in one of the other 23 possible provinces) the loss would be 0.00043, and if I had predicted Argentina.Misiones.Wanda (one of the other 64 cities in the correct province) my loss would have been 0.000019. If your tree is deeper than this, you will soon start running into numerical issues.
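To see where these numbers come from, here is a minimal sketch of the weight scheme behind the loss (the full implementation is in the original post, reproduced further down): the first level splits a total weight of 1 among all countries, and every node splits its own weight evenly among its children, so each extra level divides the penalty by another branching factor. The branching factors below are assumptions for illustration, not the exact counts in my tree.

# Minimal sketch of the weight scheme: the root level splits a total weight
# of 1.0 evenly, and every node splits its own weight evenly among its
# children. Branching factors here are illustrative assumptions.
n_countries = 195  # first level of the tree
n_provinces = 24   # provinces per country (assumed)
n_cities = 65      # cities per province (assumed)

country_cost = 1.0 / n_countries            # ~0.005
province_cost = country_cost / n_provinces  # ~0.0002
city_cost = province_cost / n_cities        # ~0.000003

# A mismatch contributes the weight of the topmost differing node on the
# true side plus the one on the predicted side. With these assumed branching
# factors the losses come out to roughly 0.01, 0.0004 and 0.000007; the
# exact values depend on the real tree.
print("wrong country: %g" % (2 * country_cost))
print("wrong province: %g" % (2 * province_cost))
print("wrong city: %g" % (2 * city_cost))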
The problem here lies in the nature of the task itself. Because my predictions are multi-label, there is no limit to the number of cities where a person may have lived and, at the same time, no limit to how many cities I may predict. If I predict that a person has lived in every single city in the Americas, from Ward Hunt Island Camp in Canada down to Ushuaia in Argentina and everything in between, but it turns out that the person has lived in every other city in the world, only then would my loss be 1. And if it turns out that the person briefly lived in Argentina.Misiones.Posadas, then my loss goes down to ~0.995, because getting one city right also means that I got the country right.
Now you see why this algorithm is very good in theory but not useful in practice: if you are trying to predict one or two points in a big tree then your losses will always be negligible. No matter how wrong your prediction is, the loss for a "normal" person will never be high enough to be useful.
On the other hand, if you are expecting your predictions to cover a good chunk of the tree, then this algorithm is still right for you. Otherwise, a good alternative is to use the Jaccard distance instead and represent Argentina.Misiones.Posadas as the set {"Argentina", "Argentina.Misiones", "Argentina.Misiones.Posadas"}. This is not as fair a measure as I would like (it punishes small errors a bit too harshly), but it still works well in practice. You could also look deeper into the paper and see if the non-normalized algorithms work for you.
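For reference, here is a minimal sketch of that Jaccard alternative, assuming the same dot-separated label format used by the code in the original post; the `expand` helper plays the same role as `all_subtrees` there.

def expand(labels):
    """ Expand each dot-separated label into the set of all its prefixes,
    e.g. 'Argentina.Misiones.Posadas' also yields 'Argentina' and
    'Argentina.Misiones'. """
    out = set()
    for label in labels:
        parts = label.split('.')
        for i in range(len(parts)):
            out.add('.'.join(parts[:i + 1]))
    return out

def jaccard_distance(labels1, labels2):
    """ 1 - |intersection| / |union| over the expanded label sets. """
    set1, set2 = expand(labels1), expand(labels2)
    union = set1 | set2
    if not union:
        return 0.0
    return 1.0 - len(set1 & set2) / float(len(union))

# Getting the country right but the province wrong still shares one node
# out of the five in the union, so the distance is high but not maximal:
print(jaccard_distance({'Argentina.Misiones.Posadas'},
                       {'Argentina.Chubut.Puerto Madryn'}))  # 0.8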
Hierarchical loss for multi-label classification
Here's one of those problems that sounds complicated but, when you take a deep dive into it, turns out to be just as complicated as it sounds.
Suppose you build a classifier that takes a book and returns its classification according to the Dewey Decimal System. This classifier would take a book such as "The return of Sherlock Holmes" and classify it as, say, "Fiction".
Of course, life is rarely this easy. This book in particular is more often than not classified as 823.8, "Literature > English > Fiction > Victorian period 1837-1900". The stories, however, were written between 1903 and 1904, meaning that some librarians would rather file it under 823.912, "Literature > English > Fiction > Modern Period > 20th Century > 1901-1945".
Other books are more complicated. Tina Fey's autobiography Bossypants can be classified under any of the following categories:
- Arts and Recreation > Amusements and Recreation > Public Entertainments, TV, Movies > Biography And History > Biography
- Arts and Recreation > Amusements and Recreation > Stage presentations > Biography And History > Biography
- Literature > American And Canadian > Authors, American and American Miscellany > 21st Century
This is known as a hierarchical multi-label classification problem:
- It is hierarchical because the expected classification is part of a hierarchy. We could argue whether Sherlock Holmes should be classified as "Victorian" or "Modern", but we would all agree that either case is not as bad as classifying it under "Natural Science and Mathematics > Chemistry".
- It is multi-label because there is more than one possible valid class. Tina Fey is both a Public entertainer and an American. There is no need to choose just one.
- It is classification because we need to choose the right bin for this book.
- It is a problem because I had to solve it this week and it wasn't easy.
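To make the multi-label part concrete: the implementation further down represents every category as a dot-separated path and every book as a set of such paths, so Bossypants simply gets three entries. The paths below are made-up, shortened stand-ins for the real Dewey categories; only the set-of-paths structure matters.

# Hypothetical, shortened stand-ins for the three categories listed above.
bossypants = set([
    "Arts.Amusements.TVAndMovies.Biography",
    "Arts.Amusements.StagePresentations.Biography",
    "Literature.AmericanAndCanadian.Authors.21stCentury",
])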
There seems to be exactly one paper on this topic, Incremental algorithms for hierarchical classification, and it is not as easy to read as one would like (and not just because it refers to Section 4 when it really should be Section 5). Luckily, this survey on multi-label learning presents a simpler version.
I ended up writing a test implementation to ensure I had understood the solution correctly, and decided that it would be a shame to just throw it away. So here it is. This version separates levels in a tree with '.' characters and is optimized for clarity.
Edit June 17: this algorithm doesn't work too well in practice. I'll write about its shortcomings soon, but until then you should think twice about using it as it is.
Edit June 26: Part II of this article is now up
#!/usr/bin/python
from collections import defaultdict
def parent(node):
""" Given a node in a tree, returns its parent node.
Parameters
----------
node : str
Node whose parent I'm interested in.
Returns
-------
str
Parent node of the input node or None if the input Node is already a
root node.
Notes
-----
In truth, returning '' for root nodes would be acceptable. However,
None values force us to think really hard about our assumptions at every
moment.
"""
parent_str = '.'.join(node.split('.')[:-1])
if parent_str == '':
parent_str = None
return parent_str
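# For example, parent('01.02.03') returns '01.02' and parent('01') returns
# None, matching the dot-separated node format used throughout this script.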
def nodes_to_cost(taxonomy):
""" Calculates the costs associated with errors for a specific node in a
taxonomy.
Parameters
----------
taxonomy : set
Set of all subtrees that can be found in a given taxonomy.
Returns
-------
dict
A cost for every possible node in the taxonomy.
References
----------
Implements the weight function from
    Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. "Incremental algorithms
    for hierarchical classification". Journal of Machine Learning Research,
    7:31-54, 2006.
"""
assert taxonomy == all_subtrees(taxonomy), \
"There are missing subnodes in the input taxonomy"
# Set of nodes at every depth
depth_to_nodes = defaultdict(set)
# How many children does a node have
num_children = defaultdict(int)
for node in taxonomy:
depth = len(node.split('.'))-1
depth_to_nodes[depth].add(node)
parent_node = parent(node)
if parent_node is not None:
num_children[parent_node] += 1
cost = dict()
for curr_depth in range(1+max(depth_to_nodes.keys())):
for node in depth_to_nodes[curr_depth]:
if curr_depth == 0:
# Base case: parent node
cost[node] = 1.0/len(depth_to_nodes[curr_depth])
else:
# General case: node guaranteed to have a parent
parent_node = parent(node)
cost[node] = cost[parent_node]/num_children[parent_node]
return cost
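# For example, nodes_to_cost({'01', '02', '01.01', '01.02'}) assigns a cost
# of 0.5 to each of the two root nodes and 0.25 to each child of '01'.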
def all_subtrees(leaves):
""" Given a set of leafs, ensures that all possible subtrees are
included in the set too.
Parameters
----------
leaves : set
A set of selected subtrees from the overall category tree.
Returns
-------
set
A set containing the original subtrees plus all possible subtrees
contained in these leaves.
Notes
-----
Example: if leaves = {"01.02", "01.04.05"}, then the returned value is the
set {"01", "01.02", "01.04", "01.04.05"}.
"""
    full_set = set()
    for leaf in leaves:
        parts = leaf.split('.')
        for i in range(len(parts)):
            full_set.add('.'.join(parts[:i+1]))
    return full_set
def h_loss(labels1, labels2, node_cost):
""" Calculates the Hierarchical loss for the given two sets.
Parameters
----------
labels1 : set
First set of labels
labels2 : set
Second set of labels
node_cost : dict
A map between tree nodes and the weight associated with them.
Notes
-----
    If you want a loss bounded between 0 and 1, the `nodes_to_cost` function
    generates weights that guarantee it.
Returns
-------
float
Loss between the two given sets.
References
----------
The nicer reference of the algorithm is to be found in
Sorower, Mohammad S. "A literature survey on algorithms for multi-label
learning." Oregon State University, Corvallis (2010).
"""
# We calculate the entire set of subtrees, just in case.
all_labels1 = all_subtrees(labels1)
all_labels2 = all_subtrees(labels2)
# Symmetric difference between sets
sym_diff = all_labels1.union(all_labels2) - \
all_labels1.intersection(all_labels2)
loss = 0
for node in sym_diff:
parent_node = parent(node)
if parent_node not in sym_diff:
loss += node_cost[node]
return loss
if __name__ == '__main__':
# Simple usage example
taxonomy = set(["01", "01.01", "01.02", "01.03", "01.04", "01.05",
"02", "02.01", "02.02", "02.03", "02.03.01"])
weights = nodes_to_cost(taxonomy)
    node_1 = set(['01'])
    node_2 = set(['01.01', '02'])
print(h_loss(node_1, node_2, weights))
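For this toy taxonomy the script prints 0.6: the two root nodes split the total weight of 1 between them, so wrongly predicting the '02' subtree costs 0.5, and missing '01.01' costs another 0.1 because '01' splits its 0.5 evenly among its five children.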