Therac-25, LLMs and the Zen of Python
If you are a professional software developer you probably heard at some point the cautionary tale of the Therac-25, a radiation therapy machine that gave massive overdoses of radiation to at least six patients in the 80s and killed at least three.
There are several layers of problems explaining how all of this came to be, but today I'd like to focus on a specific thread. The (in)famous race condition in the Therac-25 was caused by two simultaneous bugs, but these bugs were not new: the software for the Therac-25 was ported from a previous version of the machine (the Therac-20) and those bugs were already present there. The key difference being that, while the Therac-20 had hardware interlocks preventing those faults, the Therac-25 engineers decided to implement them exclusively in software.
I'm digging this story up because today I read yet again someone suggesting that using ChatGPT to rewrite code from one programming language to another is a great idea. I disagree: a programming language is an opinionated way on how to better achieve a certain task and switching between world views without understanding how and why they do things the way they do is a recipe for inefficient code at best and weird bugs at worse. Simple example: if you're a Python programmer who came from C then you probably wrote loops like this:
# C programmer version
for i in range(len(mylist)):
item = mylist[i]
f(item)
# Pythonic code
for item in mylist:
f(item)
This is a very simple example that highlights a fundamental misunderstanding of the language. The first code will run just fine, but it's generally not the proper way of writing Python code. For a more "real-life" example you can read the introduction to Learn Rust With Entirely Too Many Linked Lists where the author points out that writing linked lists in Rust is, from a Rust point of view, overall a bad idea and something to avoid entirely.
Real-case LLM
I decided to test my theory with Google's Gemini - I've seen students using it in their actual coding (probably because it's free) making it a fair choice. I asked the following:
Convert the following code from Python to Elixir:
def change_eight(my_arr):
my_arr[8] = None
And the answer was:
def change_eight(my_arr) do
if length(my_arr) >= 9 do
List.replace_at(my_arr, 8, nil)
else
my_arr
end
end
# This Elixir code ensures that the index 8 exists in the list before
# attempting to replace the value with nil. If the index is out of bounds,
# the original list is returned unchanged.
In my original code passing a list with less than 9 elements would throw an
IndexError
exception, which is consistent with The Zen of Python
principle "Errors should never pass silently". But if my code relies on the
exception to detect that a precondition is not being met, this rewrite has now
ensured that the error will not be detected because the exception won't be
triggered.
I added more context to the prompt, passing this code instead:
def change_eight(my_arr):
my_arr[8] = None
if __name__ == '__main__':
mylist = list(range(10))
try:
change_eight(mylist)
except IndexError:
print("Precondition failed")
mylist = list(range(20))
try:
change_eight(mylist)
except IndexError:
print("Second precondition failed")
which lead to this rewrite:
def change_eight(my_arr) do
if length(my_arr) >= 9 do
List.replace_at(my_arr, 8, nil)
else
raise RuntimeError, "Precondition failed"
end
end
if __MODULE__ == __ENV__["__MODULE__"] do
mylist = Enum.to_list(0..9)
try do
change_eight(mylist)
rescue RuntimeError -> error
IO.puts(error.message)
end
mylist = Enum.to_list(0..19)
try do
change_eight(mylist)
rescue RuntimeError -> error
IO.puts(error.message)
end
end
# This Elixir code is similar to the previous response, but it includes two
# test cases to demonstrate how the RuntimeError is raised in different scenarios.
The new version of the function is marginally better: Gemini now throws exceptions where expected, but it has also hard-coded the first error message - if the code fails on the second call the user will be shown an error corresponding to the first. I later asked for a C version instead of Elixir, at which point the error messages where gone entirely.
This is how critical bugs are born: when your software depends on (reasonable!) assumptions about the world and you send it through a pipeline (be it human or automated) that has no idea about what those preconditions are. And if you send your code through a system that's been designed to understand code exclusively at a syntactic level then you don't get to be surprised when the resulting code is not semantically equivalent to your original one.
Final words
Software development is not just about writing code. Software development is about understanding the environment in which your code runs and the decisions that lead to it - some of them reasonable ("this condition can never happen, the hardware will catch it"), some of them arbitrary ("let's write it in Perl"). The Therac-25 incident was made possible because someone decided to use code on an unfamiliar environment without considering the repercussions, the same way that Gemini did not consider "The Zen of Python" nor my error reporting strategy while rewriting my code.
There is more to software development than "data comes in, data comes out". Thinking about systems in terms of the context in which they run (software, hardware and social) is the best way to avoid finding yourself one day unpleasantly surprised.
Or, perhaps more relevant, unpleasantly shocked.
Further reading
If you haven't already, consider giving the classical paper "Four dark corners of Software Engineering" a try.
Hierarchical loss for multi-label classification - Part II
In my previous post on hierarchical loss for multi-label classification I gave an implementation of a specific algorithm for calculating the loss between two trees. I then added a quick edit mentioning that "this algorithm doesn't work too well in practice", and today I want to delve into why.
Imagine you want to predict the cities where someone lived based on some data. The hierarchy of locations is a tree with country at the first level, province or state second, and city at its third level. This tree has ca. 195 nodes on its first level and a lot more as we go down the tree.
Let's now say that I was supposed to choose Argentina.Misiones.Posadas
(which
corresponds to a city in Argentina) but I predicted
Congo.Bouenza.Loutété;
(which is the 10th most popular city in
the Republic of Congo). The loss for this prediction is 0.01, which is
surprisingly low - seeing as I wasn't even close to the real answer, I would
have expected something near 1.
As we go deeper into the tree, the loss goes down real quick. If I had predicted
Argentina.Chubut.Puerto Madryn
(a city 1900km away in one of the other 23
possible provinces) the loss would be 0.00043, and if I had predicted
Argentina.Misiones.Wanda
(one of the other 64 cities in the correct province)
my loss would have been 0.000019. If your tree is deeper than this then you
will soon start running into numerical issues.
The problem here is the nature of the problem itself. Because my predictions
are multi-label there is no limit to the number of cities where a person may
have lived
while, simultaneously, there is no limit to how many cities I may
predict. If I predict that a person has lived in every single city in America,
from Ward Hunt Island Camp in
Canada down to Ushuaia in
Argentina and everything in between,
but it turns out that the person has lived in all other cities in the
world, my loss would only then be 1. And if it turns out that the person has
briefly lived in Argentina.Misiones.Posadas
then my loss goes down to ~0.995
because getting one city right also means that I got the country right.
Now you see why this algorithm is very good in theory but not useful in practice: if you are trying to predict one or two points in a big tree then your losses will always be negligible. No matter how wrong your prediction is, the loss for a "normal" person will never be high enough to be useful.
On the other hand, if you are expecting your predictions to cover a good
chunk of the tree then this algorithm is still right for you. Otherwise
a good alternative is to use the
Jaccard distance instead and
represent Argentina.Misiones.Posadas
as the set
{"Argentina", "Argentina.Misiones", "Argentina.Misiones.Posadas"}
. This is
not as fair a measure as I would like (it punishes small errors a bit too
harshly) but it still works well in practice. You could also look deeper into
the paper
and see if the non-normalized algorithms work for you.
Good practices on sharing your research with end users
So, this is a thing that happened:
I was invited to give a talk to the Social event organized by LatinX in AI during the NAACL 2021 conference.
I talked about best practices for publishing your code on the internet for everyone to see, starting from how to collaborate with your future self (aka "please write comments"), with scientists, with nice APIs who will do the web design for you, and finally directly with final users. I have published the slides in this PDF, and will publish the video (or even better, a transcription) as soon as I get my hands on it.
Update July 11th: the presentation with notes is now available here.
Hierarchical loss for multi-label classification
Here's one of those problems that sounds complicated but, when you take a deep dive into it, turns out to be just as complicated as it sounds.
Suppose you build a classifier that takes a book and returns its classification according to the Dewey Decimal System. This classifier would take a book such as "The return of Sherlock Holmes" and classify it as, say, "Fiction".
Of course, life is rarely this easy. This book in particular is more often than not classified as 823.8, "Literature > English > Fiction > Victorian period 1837-1900". The stories, however, were written between 1903 and 1904, meaning that some librarians would rather file it under 823.912, "Literature > English > Fiction > Modern Period > 20th Century > 1901-1945".
Other books are more complicated. Tina Fey's autobiography Bossypants can be classified under any of the following categories:
- Arts and Recreation > Amusements and Recreation > Public Entertainments, TV, Movies > Biography And History > Biography
- Arts and Recreation > Amusements and Recreation > Stage presentations > Biography And History > Biography
- Literature > American And Canadian > Authors, American and American Miscellany > 21st Century
This is known as a hierarchical multi-label classification problem:
- It is hierarchical because the expected classification is part of a hierarchy. We could argue whether Sherlock Holmes should be classified as "Victorian" or "Modern", but we would all agree that either case is not as bad as classifying it under "Natural Science and Mathematics > Chemistry".
- It is multi-label because there is more than one possible valid class. Tina Fey is both a Public entertainer and an American. There is no need to choose just one.
- It is classification because we need to choose the right bin for this book.
- It is a problem because I had to solve it this week and it wasn't easy.
There seems to be exactly one paper on this topic, Incremental algorithms for hierarchical classification, and is not as easy to read as one would like (and not just because it refers to Section 4 when in reality should be Section 5). Luckily, this survey on multi-label learning presents a simpler version.
I ended up writing a test implementation to ensure I had understood the solution correctly, and decided that it would be a shame to just throw it away. So here it is. This version separates levels in a tree with '.' characters and is optimized for clarity.
Edit June 17: this algorithm doesn't work too well in practice. I'll write about its shortcomings soon, but until then you should think twice about using it as it is.
Edit June 26: Part II of this article is now up
#!/usr/bin/python
from collections import defaultdict
def parent(node):
""" Given a node in a tree, returns its parent node.
Parameters
----------
node : str
Node whose parent I'm interested in.
Returns
-------
str
Parent node of the input node or None if the input Node is already a
root node.
Notes
-----
In truth, returning '' for root nodes would be acceptable. However,
None values force us to think really hard about our assumptions at every
moment.
"""
parent_str = '.'.join(node.split('.')[:-1])
if parent_str == '':
parent_str = None
return parent_str
def nodes_to_cost(taxonomy):
""" Calculates the costs associated with errors for a specific node in a
taxonomy.
Parameters
----------
taxonomy : set
Set of all subtrees that can be found in a given taxonomy.
Returns
-------
dict
A cost for every possible node in the taxonomy.
References
----------
Implements the weight function from
Cesa-bianchi, N., Zaniboni, L., and Collins, M. "Incremental algorithms for
hierarchical classification". In Journal of Machine Learning Research,
pages 31–54. MIT Press, 2004.
"""
assert taxonomy == all_subtrees(taxonomy), \
"There are missing subnodes in the input taxonomy"
# Set of nodes at every depth
depth_to_nodes = defaultdict(set)
# How many children does a node have
num_children = defaultdict(int)
for node in taxonomy:
depth = len(node.split('.'))-1
depth_to_nodes[depth].add(node)
parent_node = parent(node)
if parent_node is not None:
num_children[parent_node] += 1
cost = dict()
for curr_depth in range(1+max(depth_to_nodes.keys())):
for node in depth_to_nodes[curr_depth]:
if curr_depth == 0:
# Base case: parent node
cost[node] = 1.0/len(depth_to_nodes[curr_depth])
else:
# General case: node guaranteed to have a parent
parent_node = parent(node)
cost[node] = cost[parent_node]/num_children[parent_node]
return cost
def all_subtrees(leaves):
""" Given a set of leafs, ensures that all possible subtrees are
included in the set too.
Parameters
----------
leaves : set
A set of selected subtrees from the overall category tree.
Returns
-------
set
A set containing the original subtrees plus all possible subtrees
contained in these leaves.
Notes
-----
Example: if leaves = {"01.02", "01.04.05"}, then the returned value is the
set {"01", "01.02", "01.04", "01.04.05"}.
"""
full_set = set()
for leave in leaves:
parts = leave.split('.')
for i in range(len(parts)):
full_set.add('.'.join(parts[:i+1]))
return full_set
def h_loss(labels1, labels2, node_cost):
""" Calculates the Hierarchical loss for the given two sets.
Parameters
----------
labels1 : set
First set of labels
labels2 : set
Second set of labels
node_cost : dict
A map between tree nodes and the weight associated with them.
Notes
-----
If you want a loss between 0 and 1, the `nodes_to_cost` function implements
such a function.
Returns
-------
float
Loss between the two given sets.
References
----------
The nicer reference of the algorithm is to be found in
Sorower, Mohammad S. "A literature survey on algorithms for multi-label
learning." Oregon State University, Corvallis (2010).
"""
# We calculate the entire set of subtrees, just in case.
all_labels1 = all_subtrees(labels1)
all_labels2 = all_subtrees(labels2)
# Symmetric difference between sets
sym_diff = all_labels1.union(all_labels2) - \
all_labels1.intersection(all_labels2)
loss = 0
for node in sym_diff:
parent_node = parent(node)
if parent_node not in sym_diff:
loss += node_cost[node]
return loss
if __name__ == '__main__':
# Simple usage example
taxonomy = set(["01", "01.01", "01.02", "01.03", "01.04", "01.05",
"02", "02.01", "02.02", "02.03", "02.03.01"])
weights = nodes_to_cost(taxonomy)
node_1=set(['01'])
node_2=set(['01.01', '02'])
print(h_loss(node_1, node_2, weights))
Write-only code
The compiler as we know it is generally attributed to Grace Hopper, who also popularized the notion of machine-independent programming languages and served as technical consultant in 1959 in the project that would become the COBOL programming language. The second part is not important for today's post, but not enough people know how awesome Grace Hopper was and that's unfair.
It's been at least 60 years since we moved from assembly-only code into what we now call "good software engineering practices". Sure, punching assembly code into perforated cards was a lot of fun, and you could always add comments with a pen, right there on the cardboard like well-educated cavemen and cavewomen (cavepeople?). Or, and hear me out, we could use a well-designed programming language instead with fancy features like comments, functions, modules, and even a type system if you're feeling fancy.
None of these things will make our code run faster. But I'm going to let
you into a tiny secret: the time programmers spend actually coding
pales in comparison to the time programmers spend thinking about what
their code should do. And that time is dwarfed by the time programmers
spend cursing other people who couldn't add a comment to save their
life, using variables named var
and cramming lines of code as tightly
as possible because they think it's good for the environment.
The type of code that keeps other people from strangling you is what we call "good code". And we can't talk about "good code" without it's antithesis: "write-only" code. The term is used to describe languages whose syntax is, according to Wikipedia, "sufficiently dense and bizarre that any routine of significant size is too difficult to understand by other programmers and cannot be safely edited". Perl was heralded for a long time as the most popular "write-only" language, and it's hard to argue against it:
open my $fh, '<', $filename or die "error opening $filename: $!";
my $data = do { local $/; <$fh> };
This is not by far the worse when it comes to Perl, but it highlights the type of code you get when readability is put aside in favor of shorter, tighter code.
Some languages are more propense to this problem than others. The International Obfuscated C Code Contest is a prime example of the type of code that can be written when you really, really want to write something badly. And yet, I am willing to give C a pass (and even to Perl, sometimes) for a couple reasons:
- C was always supposed to be a thin layer on top of assembly, and was designed to run in computers with limited capabilities. It is a language for people who really, really need to save a couple CPU cycles, readability be damned.
- We do have good practices for writing C code. It is possible to write okay code in C, and it will run reasonably fast.
- All modern C compilers have to remain backwards compatible. While some edge cases tend to go away with newer releases, C wouldn't be C without its wildest, foot-meet-gun features, and old code still needs to work.
Modern programming languages, on the other hand, don't get such an easy pass: if they are allowed to have as many abstraction layers and RAM as they want, have no backwards compatibility to worry about, and are free to follow 60+ years of research in good practices, then it's unforgivable to introduce the type of features that lead to write-only code.
Which takes us to our first stop: Rust. Take a look at the following code:
let f = File::open("hello.txt");
let mut f = match f {
Ok(file) => file,
Err(e) => return Err(e),
};
This code is relatively simple to understand: the variable f
contains
a file descriptor to the hello.txt
file. The operation can either
succeed or fail. If it succeeded, you can read the file's contents by extracting
the file descriptor from Ok(file)
, and if it failed you can either do something
with the error e
or further propagate Err(e)
. If you
have seen functional programming before, this concept may sound familiar
to you. But more important: this code makes sense even if you have
never programmed with Rust before.
But once we introduce the ?
operator, all that clarity is thrown off
the window:
let mut f = File::open("hello.txt")?;
All the explicit error handling that we saw before is now hidden from you.
In order to save 3 lines of code, we have now put our error handling logic
behind an easy-to-overlook, hard-to-google ?
symbol. It's literally there to
make the code easier to write, even if it makes it harder to read.
And let's not forget that the operator also facilitates the "hot potato" style of catching exceptions1, in which you simply... don't:
File::open("hello.txt")?.read_to_string(&mut s)?;
Python is perhaps the poster child of "readability over conciseness". The Zen of Python explicitly states, among others, that "readability counts" and that "sparser is better than dense". The Zen of Python is not only a great programming language design document, it is a great design document, period.
Which is why I'm still completely puzzled that both f-strings and the infamous walrus operator have made it into Python 3.6 and 3.8 respectively.
I can probably be convinced of adopting f-strings. At its core, they are designed to bring variables closer to where they are used, which makes sense:
"Hello, {}. You are {}.".format(name, age)
f"Hello, {name}. You are {age}."
This seems to me like a perfectly sane idea, although not one without
drawbacks. For instance, the fact that the f
is both
important and easy to overlook. Or that there's no way to know what
the =
here does:
some_string = "Test"
print(f"{some_string=}")
(for the record: it will print some_string='Test'
). I also hate that
you can now mix variables, functions, and formatting in a way
that's almost designed to introduce subtle bugs:
print(f"Diameter {2 * r:.2f}")
But this all pales in comparison to the walrus operator, an operator designed to save one line of code2:
# Before
myvar = some_value
if my_var > 3:
print("my_var is larger than 3")
# After
if (myvar := some_value) > 3:
print("my_var is larger than 3)
And what an expensive line of code it was! In order to save one or two variables, you need a new operator that behaves unexpectedly if you forget parenthesis, has enough edge cases that even the official documentation brings them up, and led to an infamous dispute that ended up with Python's creator taking a "permanent vacation" from his role. As a bonus, it also opens the door to questions like this one, which is answered with (paraphrasing) "those two cases behave differently, but in ways you wouldn't notice".
I think software development is hard enough as it is. I cannot convince the Rust community that explicit error handling is a good thing, but I hope I can at least persuade you to really, really use these type of constructions only when they are the only alternative that makes sense.
Source code is not for machines - they are machines, and therefore they couldn't care less whether we use tabs, spaces, one operator, or ten. So let your code breath. Make the purpose of your code obvious. Life is too short to figure out whatever it is that the K programming language is trying to do.