So, this is a thing that happened:
I was invited to give a talk at the social event organized by
LatinX in AI during the
NAACL 2021 conference.
I talked about best practices for publishing your code on the internet for
everyone to see, starting with how to collaborate with your future self (aka
"please write comments"), with scientists, with nice APIs that will do the
web design for you, and finally directly with end users. I have published the
slides in this PDF, and will publish
the video (or even better, a transcription) as soon as I get my hands on it.
Update July 11th: the presentation with notes is now available
here.
Here's one of those problems that sounds complicated but, when you take a deep
dive into it, turns out to be just as complicated as it sounds.
Suppose you build a classifier that takes a book and returns its classification
according to the Dewey Decimal System.
This classifier would take a book such as "The return of Sherlock
Holmes" and
classify it as, say, "Fiction".
Of course, life is rarely this easy. This book in
particular is more
often than not classified as 823.8, "Literature > English > Fiction > Victorian period 1837-1900".
The stories, however, were written between 1903 and 1904, meaning that some
librarians would rather file it under 823.912,
"Literature > English > Fiction > Modern Period > 20th Century > 1901-1945".
Other books are more complicated. Tina Fey's autobiography
Bossypants
can be classified under any of the following categories:
- Arts and Recreation > Amusements and Recreation > Public Entertainments, TV, Movies > Biography And History > Biography
- Arts and Recreation > Amusements and Recreation > Stage presentations > Biography And History > Biography
- Literature > American And Canadian > Authors, American and American Miscellany > 21st Century
This is known as a hierarchical multi-label classification problem:
- It is hierarchical because the expected classification is part of a
hierarchy. We could argue whether Sherlock Holmes should be classified as
"Victorian" or "Modern", but we would all agree that either case is not as
bad as classifying it under "Natural Science and Mathematics > Chemistry".
- It is multi-label because there is more than one possible valid class.
Tina Fey is both a Public entertainer and an American. There is no need
to choose just one.
- It is classification because we need to choose the right bin for this
book.
- It is a problem because I had to solve it this week and it wasn't easy.
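To make the "hierarchical" and "multi-label" parts concrete, here is a minimal sketch, assuming labels are written as dot-separated paths (the same convention the code further down uses). The node names and both helper functions are made up for illustration, and this is a simple shared-prefix measure, not the loss from the paper.

```python
def shared_depth(label_a, label_b):
    """Count how many leading levels two dotted labels have in common."""
    depth = 0
    for part_a, part_b in zip(label_a.split("."), label_b.split(".")):
        if part_a != part_b:
            break
        depth += 1
    return depth


def closest_match(predicted, valid_labels):
    """Multi-label case: compare a prediction against every valid label and
    keep the one that shares the most levels with it."""
    return max(valid_labels, key=lambda label: shared_depth(predicted, label))


# Hypothetical node names, loosely modeled on the Dewey examples above
victorian = "literature.english.fiction.victorian"
modern = "literature.english.fiction.modern"
chemistry = "science.chemistry"

print(shared_depth(victorian, modern))     # 3
print(shared_depth(victorian, chemistry))  # 0

# A book with several valid labels is simply a set of paths
bossypants = {
    "arts.amusements.public_entertainment.biography",
    "arts.amusements.stage.biography",
    "literature.american.21st_century",
}
print(closest_match("literature.american.20th_century", bossypants))
```

Filing Sherlock Holmes under "modern" instead of "victorian" still gets three levels right, while "chemistry" gets none, which is the intuition a hierarchical loss has to capture.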
There seems to be exactly one paper on this topic, Incremental algorithms for
hierarchical classification,
and it is not as easy to read as one would like (and not just because it refers to
Section 4 when it should really refer to Section 5). Luckily, this survey on
multi-label learning presents a simpler version.
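For reference, this is how I would write down the loss that the code below computes. The notation is my own shorthand, so treat the symbols as assumptions; they may not match the paper's exactly.

```latex
% c_i      : cost of node i
% par(i)   : parent of node i
% anc(i)   : all ancestors of node i
% sibl(i)  : children of par(i), including i itself
\[
c_i =
\begin{cases}
  1 / |\mathrm{root}| & \text{if } i \text{ is a root node} \\
  c_{\mathrm{par}(i)} / |\mathrm{sibl}(i)| & \text{otherwise}
\end{cases}
\]
\[
\ell_H(\hat{y}, y) = \sum_{i} c_i \,
  \mathbb{1}\bigl[\hat{y}_i \neq y_i \;\wedge\;
  \hat{y}_j = y_j \ \ \forall j \in \mathrm{anc}(i)\bigr]
\]
```

In words: a node only contributes to the loss if it is wrong and all of its ancestors are right, so a mistake is charged once, at the highest level where the two label sets disagree.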
I ended up writing a test implementation to ensure I had understood the solution
correctly, and decided that it would be a shame to just throw it away. So here
it is. This version separates levels in a tree with '.' characters and is
optimized for clarity.
Edit June 17: this algorithm doesn't work too well in practice. I'll write
about its shortcomings soon, but until then you should think twice about using
it as it is.
Edit June 26: Part II of this article is now
up.
#!/usr/bin/python
from collections import defaultdict


def parent(node):
    """ Given a node in a tree, returns its parent node.

    Parameters
    ----------
    node : str
        Node whose parent I'm interested in.

    Returns
    -------
    str
        Parent node of the input node or None if the input Node is already a
        root node.

    Notes
    -----
    In truth, returning '' for root nodes would be acceptable. However,
    None values force us to think really hard about our assumptions at every
    moment.
    """
    parent_str = '.'.join(node.split('.')[:-1])
    if parent_str == '':
        parent_str = None
    return parent_str


def nodes_to_cost(taxonomy):
    """ Calculates the costs associated with errors for a specific node in a
    taxonomy.

    Parameters
    ----------
    taxonomy : set
        Set of all subtrees that can be found in a given taxonomy.

    Returns
    -------
    dict
        A cost for every possible node in the taxonomy.

    References
    ----------
    Implements the weight function from
    Cesa-Bianchi, N., Zaniboni, L., and Collins, M. "Incremental algorithms for
    hierarchical classification". In Journal of Machine Learning Research,
    pages 31–54. MIT Press, 2004.
    """
    assert taxonomy == all_subtrees(taxonomy), \
        "There are missing subnodes in the input taxonomy"

    # Set of nodes at every depth
    depth_to_nodes = defaultdict(set)
    # How many children does a node have
    num_children = defaultdict(int)
    for node in taxonomy:
        depth = len(node.split('.'))-1
        depth_to_nodes[depth].add(node)
        parent_node = parent(node)
        if parent_node is not None:
            num_children[parent_node] += 1

    cost = dict()
    for curr_depth in range(1+max(depth_to_nodes.keys())):
        for node in depth_to_nodes[curr_depth]:
            if curr_depth == 0:
                # Base case: root node, all roots share the total cost equally
                cost[node] = 1.0/len(depth_to_nodes[curr_depth])
            else:
                # General case: node guaranteed to have a parent
                parent_node = parent(node)
                cost[node] = cost[parent_node]/num_children[parent_node]
    return cost


def all_subtrees(leaves):
    """ Given a set of leaves, ensures that all possible subtrees are
    included in the set too.

    Parameters
    ----------
    leaves : set
        A set of selected subtrees from the overall category tree.

    Returns
    -------
    set
        A set containing the original subtrees plus all possible subtrees
        contained in these leaves.

    Notes
    -----
    Example: if leaves = {"01.02", "01.04.05"}, then the returned value is the
    set {"01", "01.02", "01.04", "01.04.05"}.
    """
    full_set = set()
    for leaf in leaves:
        parts = leaf.split('.')
        for i in range(len(parts)):
            full_set.add('.'.join(parts[:i+1]))
    return full_set


def h_loss(labels1, labels2, node_cost):
    """ Calculates the Hierarchical loss for the given two sets.

    Parameters
    ----------
    labels1 : set
        First set of labels
    labels2 : set
        Second set of labels
    node_cost : dict
        A map between tree nodes and the weight associated with them.

    Notes
    -----
    If you want a loss between 0 and 1, the `nodes_to_cost` function implements
    such a function.

    Returns
    -------
    float
        Loss between the two given sets.

    References
    ----------
    The nicer reference of the algorithm is to be found in
    Sorower, Mohammad S. "A literature survey on algorithms for multi-label
    learning." Oregon State University, Corvallis (2010).
    """
    # We calculate the entire set of subtrees, just in case.
    all_labels1 = all_subtrees(labels1)
    all_labels2 = all_subtrees(labels2)
    # Symmetric difference between sets
    sym_diff = all_labels1.union(all_labels2) - \
        all_labels1.intersection(all_labels2)
    loss = 0
    for node in sym_diff:
        parent_node = parent(node)
        if parent_node not in sym_diff:
            loss += node_cost[node]
    return loss


if __name__ == '__main__':
    # Simple usage example
    taxonomy = set(["01", "01.01", "01.02", "01.03", "01.04", "01.05",
                    "02", "02.01", "02.02", "02.03", "02.03.01"])
    weights = nodes_to_cost(taxonomy)
    node_1 = set(['01'])
    node_2 = set(['01.01', '02'])
    print(h_loss(node_1, node_2, weights))
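
As a sanity check, the usage example can be worked out by hand. This is my own arithmetic based on the code above, not a result taken from either reference, and the shorthand c(x) for the cost of node x is just notation I am assuming here.

```latex
% Two root nodes share the total cost; node 01 has five children.
\[
c(01) = c(02) = \tfrac{1}{2}, \qquad
c(01.01) = \frac{c(01)}{5} = \tfrac{1}{10}
\]
% Expanded label sets: {01} versus {01, 01.01, 02}.
% Their symmetric difference is {01.01, 02}.
% 01.01 counts because its parent 01 is in both sets; 02 counts because it is a root.
\[
\ell_H = c(01.01) + c(02) = \tfrac{1}{10} + \tfrac{1}{2} = 0.6
\]
```

So the script should print 0.6.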
I recently confirmed that high quality audio makes you sound
smarter, which
is exactly what I always wanted: a way to look smart without having to actually
work for it. This discovery led me to an internet rabbit hole on how to look
and sound good online, with this guide being the end result.
If you are trapped inside online-meeting purgatory like me, hopefully this guide
will give you a small edge in your next important meeting.
I divided this guide into three sections:
- Video: while video is not as important as audio, it's the easiest one to
improve. You may not know exactly how to better equalize your voice, but
identifying which part of your face needs light is easy.
- Audio: you can live with bad video (or no video at all) but bad audio is
a different issue.
- Delivery: once you are clearly seen and heard, let's talk about how to
improve your message.
Before we start, a couple words of meta-advice: by caring about how you look and
sound you are already ahead of everyone else who just turns their computer on
and shows up. And since you can't improve when you don't know what there is to
improve, your first step is to go get some feedback. If you can't find someone
willing to have a meeting with you then you should at least have a test meeting
alone. For instructions click here: Zoom, Teams,
Jitsi.
Video
As this guide points
out, you should
ensure there is no strong light coming from behind you. Ideally you
want a three-point lighting
setup but a good
compromise is a general strong light source (such as an open window or room
light) plus diffuse light behind your monitor (either point a lamp towards the
wall behind your monitor or a full-screen, white document opened on your
screen). The next step up would be ring
lights, but we have other,
more pressing issues to worry about first.
Your background comes next. Most meeting software nowadays includes a "blurry
background" filter that you can use to hide what's going on behind you. These
filters don't work as well as I wish they did, but they have nonetheless been a
blessing for those of us sitting in shared living rooms. Still: consider
re-orienting your camera (or your desk!) to keep a clear, distraction-free
background.
Which brings us to the final point: the camera itself. Whichever
camera you have around is likely to be fine. A more expensive camera will
give better results, but it might not be worth the cost.
Pro tip: if you have a DSLR camera lying around, it may double as an
amazing webcam. Check your manual.
Pay attention to the camera angle.
Keep your camera at eye level either
by repositioning your webcam (if it's an external one) or by getting yourself
a laptop stand (which you can also build out of
cardboard). Say no to cameras looking at you
from below!
Audio
Audio is tricky: it is more important than video during conferences but it's
harder to tune adequately. Let's get the obvious out of the way: ideally you
want a quiet room for your meeting, but there's only so much you can do with the
rooms your apartment already has. So let's not dwell on that.
Unlike video, where you can get far with what you have, in audio you really,
really want to have a better microphone. You don't have to go pro (in fact, an
expensive mic can easily backfire by being too sensitive) but you should at
least get a decent, dedicated one. If you know nothing about audio then I would
recommend a USB mic - I have had bad experiences with microphones picking up
line noise and USB should help with that.
And since using your speakers is guaranteed to cause echo sooner or later, save
yourself the trouble and get some headphones too.
If you want to tweak your voice even more you can try a software equalizer.
There are plenty of
guides around
courtesy of the internet, but getting into details goes beyond this guide.
Delivery
Once you have optimized your environment as much as possible, it is time to talk
about delivery. That's a topic by itself, so I'll limit myself to two tips:
- Dress appropriately and keep a neat background. Bookshelves are
particularly nice. If you show a messy room your audience will assume you
are also a messy person with messy ideas, and no one wants that.
- You don't have to go and get a voice coach, but it might be worth your time
to watch a couple videos on the topic. I have personally learned a lot from
the Broadcast Voice Handbook
but you might be better served by more casual online courses. Youtubers
have created an explosion of content in that area, so it should be easy
to find.
Have you heard of the LatinX in AI social
event? It is a
social event organized by LXAI intended to
bring together Latin American researchers in NLP and AI. I joined as a
participant in their EMNLP 2020 edition, and I am now volunteering to run the
2021 version, which will take place soon in parallel with EACL 2021.
One of my tasks as part of the organizing committee is to send invitations to
those who joined last year. And you can probably see where this is going:
even though I have written permission from these 116 participants to contact
them and even though I followed Google's best
practices for sending
emails, my 18-year-old Gmail account was nonetheless blocked and it has
remained so ever since.
I have now spent several days in Google support hell leaving no stone unturned
and no link unclicked. If you have never tried to get support from Google, the
following diagram illustrates all the maddening steps I have followed during the
account recovery process with no luck so far:
The "Number not accepted" box is particularly annoying: I have always refused to
give Google my personal telephone number because there is no guarantee that they
won't use it for tracking me like Facebook was caught
doing,
and you cannot enable two-factor authentication without providing
one first (trust me, I tried). As a result, Google will not trust any number I
give now - it is mildly funny to read that the telephone number of the Fortune
Global 500 company where I work "has already been used too many times for
verification" even though I had never used it before. Either whoever used my
desk before me was a serious spammer, or Google is not being as honest as one
would expect.
But you know what hurts the most? That all of this could have been avoided if I
hadn't insisted on personalizing the emails. I hate emails addressed to "Dear
sir or madam" and therefore went out of my way to write the script that would
pull people's names and display them properly. If I had written a generic email
instead and dumped 100+ addresses in the web interface I would probably still
have my account. I know it isn't much, but that's all I could do to show people
that I care about them receiving their invitation. No good deed goes
unpunished.
Maybe in the future I will write about all my complaints.
One particularly mean example is the emails I still get in my recovery account
letting me know that someone has been trying to access my account but that I
shouldn't worry because they didn't let them in. But for today, I want to leave
you with two parting thoughts.
First: this story is not new, and if you have all your eggs in the Google basket
it is only a matter of time before you lose something important with no
recourse. Maybe they will remove your browser extension, ruin your startup,
kill your game,
terminate your Android app,
delete your YouTube channel, or who
knows what else. So be prepared. I can assure you that if I had not started
using my own email domain years ago I would now be truly screwed with no way
forward. If you are not willing to leave Google products for good, at the very
least get a local copy of your data with Google
Takeout and keep it safe.
And second: I still need that account to organize the event. So if you know
someone who works for Google, please tell them to
write me an email to get this sorted out.
I would be slightly sad to lose the epic burn "Google closed my 18-year-old
account forever for helping Latin American researchers", but I'll let it go if
that means moving this event forward.
Further reading