Notes from the day's log. This is day 1/n; the hope is that this becomes a useful output after a few days or weeks of practice.
Debugging Go, the Linux kernel, and GCC signal handling. There's an interesting and very deep set of bugs being explored in the Linux kernel and in the latest Go. Brad Fitzpatrick notes this on Twitter - "Pretty fun debugging". The Go bug is 35326 "Corrupt binary export data" and the upstream kernel bug is 205663 "AVX register corruption from signal delivery". At the moment the assembled team is staring at differences between the GCC 8 and GCC 9 compiler output, and bisecting 5.x kernel versions.
A previous interesting kernel debug exercise is 201685 "Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption". This was in the 4.19 era of kernel development, and the interesting piece of it is just how tricky it was to reproduce the condition. Eventually after some weeks of effort this was pinned down, and this commit to blktests has the reproducer based on original code from Lukáš Krejčí and contributed by Omar Sandoval.
In firmware engineering news, certain HPE SSDs will brick themselves after 32,768 hours (less than 4 years) of operation, with no recovery possible. Details are at the HPE Support Center. A firmware update will save the drives from certain doom if it is applied in time. Note that the worst case scenario is that all of the drives of this type which you put into production at the same time will fail at the same time. There is no news yet of reported failures in the field.
A common theme in each of these reports is "how do we reproduce this bug". Any number of systems will crash in mysterious ways if you have heavy loads and unstable components. The code paths that you reason about when the system is doing ordinary work can have very little to do with what actually happens in overload conditions. Composing a minimal reproducing case for a complex bug is a very special and useful skill, and it often takes people with an unusual command of system innards, as well as a faithful and exact mental model of how a complex piece of silicon actually works, to pull that off.