Therac-25, LLMs and the Zen of Python
If you are a professional software developer, you have probably heard at some point the cautionary tale of the Therac-25, a radiation therapy machine that gave massive overdoses of radiation to at least six patients in the 80s and killed at least three of them.
There are several layers of problems explaining how all of this came to be, but today I'd like to focus on a specific thread. The (in)famous race condition in the Therac-25 was caused by two simultaneous bugs, but these bugs were not new: the software for the Therac-25 was ported from a previous version of the machine (the Therac-20), and those bugs were already present there. The key difference was that, while the Therac-20 had hardware interlocks that caught those faults, the Therac-25 engineers decided to implement them exclusively in software.
I'm digging this story up because today I read, yet again, someone suggesting that using ChatGPT to rewrite code from one programming language to another is a great idea. I disagree: a programming language is an opinionated take on how best to achieve a certain task, and switching between worldviews without understanding how and why they do things the way they do is a recipe for inefficient code at best and weird bugs at worst. A simple example: if you're a Python programmer who came from C, then you probably wrote loops like this:
# C programmer version
for i in range(len(mylist)):
    item = mylist[i]
    f(item)

# Pythonic code
for item in mylist:
    f(item)
This is a very simple example that highlights a fundamental misunderstanding of the language. The first version will run just fine, but it's generally not the proper way of writing Python code. For a more "real-life" example you can read the introduction to Learn Rust With Entirely Too Many Linked Lists, where the author points out that writing linked lists in Rust is, from a Rust point of view, a bad idea overall and something to avoid entirely.
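(And if you genuinely do need the index, the idiomatic bridge between the two styles is enumerate rather than range(len(...)); a minimal sketch, reusing the mylist and f from the snippet above:)

# Pythonic code when the index is also needed
for i, item in enumerate(mylist):
    f(item)    # i is available here without ever indexing into mylist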
A real-world LLM case
I decided to test my theory with Google's Gemini - I've seen students using it for their actual coding (probably because it's free), which makes it a fair choice. I asked the following:
Convert the following code from Python to Elixir:
def change_eight(my_arr):
    my_arr[8] = None
And the answer was:
def change_eight(my_arr) do
  if length(my_arr) >= 9 do
    List.replace_at(my_arr, 8, nil)
  else
    my_arr
  end
end

# This Elixir code ensures that the index 8 exists in the list before
# attempting to replace the value with nil. If the index is out of bounds,
# the original list is returned unchanged.
In my original code, passing a list with fewer than 9 elements would throw an IndexError exception, which is consistent with The Zen of Python principle "Errors should never pass silently". But if my code relies on that exception to detect that a precondition is not being met, this rewrite has now ensured that the error will not be detected, because the exception won't be triggered.
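To make the difference concrete, here is a minimal Python sketch of how Gemini's "safe" version behaves once you translate it back (the name change_eight_defensive is mine, and it's a loose back-translation since Elixir lists are immutable):

def change_eight(my_arr):
    my_arr[8] = None                  # raises IndexError if the list is too short

def change_eight_defensive(my_arr):
    # Back-translation of Gemini's rewrite: out-of-bounds is silently ignored
    if len(my_arr) >= 9:
        my_arr[8] = None

short = list(range(5))
try:
    change_eight_defensive(short)
except IndexError:
    print("Precondition failed")      # never printed: the violation passes silently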
I added more context to the prompt, passing this code instead:
def change_eight(my_arr):
    my_arr[8] = None

if __name__ == '__main__':
    mylist = list(range(10))
    try:
        change_eight(mylist)
    except IndexError:
        print("Precondition failed")

    mylist = list(range(20))
    try:
        change_eight(mylist)
    except IndexError:
        print("Second precondition failed")
which led to this rewrite:
def change_eight(my_arr) do
  if length(my_arr) >= 9 do
    List.replace_at(my_arr, 8, nil)
  else
    raise RuntimeError, "Precondition failed"
  end
end

if __MODULE__ == __ENV__["__MODULE__"] do
  mylist = Enum.to_list(0..9)
  try do
    change_eight(mylist)
  rescue RuntimeError -> error
    IO.puts(error.message)
  end

  mylist = Enum.to_list(0..19)
  try do
    change_eight(mylist)
  rescue RuntimeError -> error
    IO.puts(error.message)
  end
end

# This Elixir code is similar to the previous response, but it includes two
# test cases to demonstrate how the RuntimeError is raised in different scenarios.
The new version of the function is marginally better: Gemini now raises exceptions where expected, but it has also hard-coded the first error message - if the code fails on the second call, the user will be shown the error corresponding to the first. I later asked for a C version instead of Elixir, at which point the error messages were gone entirely.
This is how critical bugs are born: when your software depends on (reasonable!) assumptions about the world and you send it through a pipeline (be it human or automated) that has no idea about what those preconditions are. And if you send your code through a system that's been designed to understand code exclusively at a syntactic level then you don't get to be surprised when the resulting code is not semantically equivalent to your original one.
Final words
Software development is not just about writing code. Software development is about understanding the environment in which your code runs and the decisions that led to it - some of them reasonable ("this condition can never happen, the hardware will catch it"), some of them arbitrary ("let's write it in Perl"). The Therac-25 incident was made possible because someone decided to reuse code in an unfamiliar environment without considering the repercussions, the same way that Gemini did not consider "The Zen of Python" nor my error reporting strategy while rewriting my code.
There is more to software development than "data comes in, data comes out". Thinking about systems in terms of the context in which they run (software, hardware and social) is the best way to avoid finding yourself one day unpleasantly surprised.
Or, perhaps more relevantly, unpleasantly shocked.
Further reading
If you haven't already, consider giving the classic paper "Four dark corners of requirements engineering" a try.
Making stuff up with LLMs
I am in the middle of updating my LinkedIn profile and I decided to have an AI write it for me. I fed my CV to LLaMa and prompted it to generate the following blurb (which I tweaked slightly to remove hallucinations):
Me: Hey, LLaMa, why should I hire this candidate?
LLaMa: This candidate has a research background, which is one of the qualities you desire in an applicant. He was able to publish his works and had a good performance in conferences too. His work experience is promising since he worked as an expert of Applied Mathematics which is relevant with your field of specialization.
Me: Why is this candidate great for my company?
LLaMa: The candidate will be able to use their experience in the NLP field, especially in building and training Machine Learning models. They are also highly experienced in programming languages and systems administration. With their volunteer work at the ACL Anthology they have been exposed to a lot of natural language generation, which is something we plan on moving into in our company.
This is the type of job that LLMs do well: one where all the information they need to perform the task is right there in the prompt. But once you let them go on for a while, the facts start taking a back seat:
LLaMa: (...) There are some red flags in the interview that you didn't catch, but you were unable to follow up on. For example, the candidate said something about their past coworkers not liking them because of a few jokes that they wrote for an email joke club. The candidate claims to have learned from this experience and now knows how to use humor without being insensitive. (...) The candidate had experience in the military and was in charge of a large team of soldiers for several years.
Suffice it to say, none of this is true: there was no interview, the joke incident never happened, and I was never in the military. On the plus side, made-up facts stated confidently are a goldmine for comedy, and LLaMa doesn't disappoint:
Me: Why is this candidate probably a serial killer?
LLaMa: The candidate's background in medical system development and image reconstruction indicates that the candidate has been trained to use CAT scans. This means that he or she could be capable of using it in order to scan your brain and read off all your thoughts at any moment, which is highly suspicious behavior for a job interview!
Rest assured that I am not a serial killer, I have not been trained to use CAT scans, and that I will not be reading your thoughts during the job interview. That would indeed be rude, as this behavior is clearly only appropriate from the third interview on.
And finally, in the interest of fairness, I asked LLaMa why you should not hire me. The result is not entirely inaccurate, as this post proves by itself:
Me: Why should I not hire this candidate?
LLaMa: The candidate is quite interested in Natural Language Processing. You will need to keep him away from the keyboard when he is talking about this topic, so you can work with the rest of your team and actually get something done.
I tried asking a different way, but it looks like LLaMa has my back:
Me: Why is this candidate bad for my company?
LLaMa: The candidate seems to have a good experience (PhD, work and volunteer experience). What are some reasons why I would not hire someone with such qualifications?
I don't know, LLaMa, you tell me!