7c0h

Therac-25, LLMs and the Zen of Python

If you are a professional software developer you have probably heard at some point the cautionary tale of the Therac-25, a radiation therapy machine that gave massive overdoses of radiation to at least six patients in the 80s and killed at least three.

There are several layers of problems explaining how all of this came to be, but today I'd like to focus on a specific thread. The (in)famous race condition in the Therac-25 was caused by two simultaneous bugs, but these bugs were not new: the software for the Therac-25 was ported from a previous version of the machine (the Therac-20) and those bugs were already present there. The key difference is that, while the Therac-20 had hardware interlocks that kept those faults from causing harm, the Therac-25 engineers decided to implement the safety checks exclusively in software.

I'm digging this story up because today I read yet again someone suggesting that using ChatGPT to rewrite code from one programming language to another is a great idea. I disagree: a programming language is an opinionated view of how best to achieve a certain task, and switching between world views without understanding how and why they do things the way they do is a recipe for inefficient code at best and weird bugs at worst. Simple example: if you're a Python programmer who came from C then you probably wrote loops like this:

# C programmer version
for i in range(len(mylist)):
    item = mylist[i]
    f(item)

# Pythonic code
for item in mylist:
    f(item)

This is a very simple example that highlights a fundamental misunderstanding of the language. The first code will run just fine, but it's generally not the proper way of writing Python code. For a more "real-life" example you can read the introduction to Learn Rust With Entirely Too Many Linked Lists where the author points out that writing linked lists in Rust is, from a Rust point of view, overall a bad idea and something to avoid entirely.

Real-case LLM

I decided to test my theory with Google's Gemini - I've seen students using it in their actual coding (probably because it's free), which makes it a fair choice. I asked the following:

Convert the following code from Python to Elixir:
def change_eight(my_arr):
    my_arr[8] = None

And the answer was:

def change_eight(my_arr) do
  if length(my_arr) >= 9 do
    List.replace_at(my_arr, 8, nil)
  else
    my_arr
  end
end
# This Elixir code ensures that the index 8 exists in the list before
# attempting to replace the value with nil. If the index is out of bounds,
# the original list is returned unchanged.

In my original code, passing a list with fewer than 9 elements would raise an IndexError exception, which is consistent with The Zen of Python principle "Errors should never pass silently". But if my code relies on the exception to detect that a precondition is not being met, this rewrite has now ensured that the error will not be detected because the exception won't be triggered.
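To make the difference concrete, here is a minimal Python sketch of both behaviors. The names change_eight_guarded and precondition_detected are hypothetical helpers of mine, and the guarded function only approximates the error handling of the Elixir rewrite (the Elixir version returns a new list, while this one mutates in place).

```python
def change_eight(my_arr):
    # Original: raises IndexError when the list has fewer than 9 elements.
    my_arr[8] = None


def change_eight_guarded(my_arr):
    # Hypothetical Python rendering of the rewrite's error behavior:
    # a short list is returned unchanged and no error is signaled.
    if len(my_arr) >= 9:
        my_arr[8] = None
    return my_arr


def precondition_detected(func, data):
    """Return True if calling func(data) raises IndexError."""
    try:
        func(data)
    except IndexError:
        return True
    return False


short_list = [0, 1, 2]
print(precondition_detected(change_eight, list(short_list)))          # True
print(precondition_detected(change_eight_guarded, list(short_list)))  # False
```

A caller that relied on the exception to notice bad input gets no signal at all from the guarded version.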

I added more context to the prompt, passing this code instead:

def change_eight(my_arr):
    my_arr[8] = None

if __name__ == '__main__':
    mylist = list(range(10))
    try:
        change_eight(mylist)
    except IndexError:
        print("Precondition failed")

    mylist = list(range(20))
    try:
        change_eight(mylist)
    except IndexError:
        print("Second precondition failed")

which led to this rewrite:

def change_eight(my_arr) do
  if length(my_arr) >= 9 do
    List.replace_at(my_arr, 8, nil)
  else
    raise RuntimeError, "Precondition failed"
  end
end

if __MODULE__ == __ENV__["__MODULE__"] do
  mylist = Enum.to_list(0..9)
  try do
    change_eight(mylist)
  rescue RuntimeError -> error
    IO.puts(error.message)
  end

  mylist = Enum.to_list(0..19)
  try do
    change_eight(mylist)
  rescue RuntimeError -> error
    IO.puts(error.message)
  end
end
# This Elixir code is similar to the previous response, but it includes two
# test cases to demonstrate how the RuntimeError is raised in different scenarios.

The new version of the function is marginally better: Gemini now raises exceptions where expected, but it has also hard-coded the first error message into the function - if the code fails on the second call the user will be shown the message belonging to the first. I later asked for a C version instead of Elixir, at which point the error messages were gone entirely.
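The hard-coded message problem can be sketched in Python as well. This is my own approximation of the rewrite's behavior, not Gemini's output, and the names change_eight_rewritten, original_caller and rewritten_caller are hypothetical.

```python
def change_eight(my_arr):
    # Original: the function knows nothing about error messages.
    my_arr[8] = None


def change_eight_rewritten(my_arr):
    # Hypothetical rendering of the rewrite: the message that belonged to
    # the first call site is now baked into the function itself.
    if len(my_arr) < 9:
        raise RuntimeError("Precondition failed")
    my_arr[8] = None


def original_caller(data, message):
    """Original style: each call site chooses its own message."""
    try:
        change_eight(data)
    except IndexError:
        return message
    return None


def rewritten_caller(data):
    """Rewritten style: the call site can only echo the function's message."""
    try:
        change_eight_rewritten(data)
    except RuntimeError as error:
        return str(error)
    return None


too_short = [0, 1, 2]
print(original_caller(list(too_short), "Second precondition failed"))
# prints "Second precondition failed"
print(rewritten_caller(list(too_short)))
# prints "Precondition failed"
```

In the original, the call site decides what a failure means. In the rewrite, that decision has moved into the function and the second call site has lost its own message.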

This is how critical bugs are born: when your software depends on (reasonable!) assumptions about the world and you send it through a pipeline (be it human or automated) that has no idea about what those preconditions are. And if you send your code through a system that's been designed to understand code exclusively at a syntactic level then you don't get to be surprised when the resulting code is not semantically equivalent to your original one.

Final words

Software development is not just about writing code. Software development is about understanding the environment in which your code runs and the decisions that led to it - some of them reasonable ("this condition can never happen, the hardware will catch it"), some of them arbitrary ("let's write it in Perl"). The Therac-25 incident was made possible because someone decided to use code in an unfamiliar environment without considering the repercussions, the same way that Gemini considered neither "The Zen of Python" nor my error reporting strategy while rewriting my code.

There is more to software development than "data comes in, data comes out". Thinking about systems in terms of the context in which they run (software, hardware and social) is the best way to avoid finding yourself one day unpleasantly surprised.

Or, perhaps more relevant, unpleasantly shocked.

Further reading

If you haven't already, consider giving the classic paper "Four Dark Corners of Requirements Engineering" a try.