Learning from Software Failures

Students in Vincent St-Amour’s new Responsible Software Engineering course are analyzing case studies of software failures and exploring tools and techniques to prevent similar disasters

Software failures can have devastating consequences.

In 2017, an Equifax data breach exposed the personal information of 147 million people. Last December, following a US National Highway Traffic Safety Administration probe into a series of collisions related to Tesla's Autopilot software, the company recalled more than 2 million vehicles.

And UK authorities are still investigating the Horizon accounting system failure, which resulted in the wrongful prosecution of hundreds of UK Post Office employees over shortfalls that did not exist.

Forty students in the new COMP_SCI 396: Responsible Software Engineering course launched by Northwestern Engineering’s Vincent St-Amour are analyzing these types of incidents and exploring tools and techniques that can prevent similar failures in the future.

“The welfare of the public is the professional engineer’s number one priority,” said St-Amour, associate professor of instruction at the McCormick School of Engineering. “I'm increasingly of the thinking that ethics needs to be an omnipresent presence in our curriculum, something that must be reinforced at every turn, because what students are building is going to affect people’s lives.”

Inspired by the core curriculum ethics integration model he piloted last year in collaboration with fellows in the Northwestern Tech Ethics Initiative, St-Amour structured Responsible Software Engineering as a team-based course focused on problems and solutions.

Each Tuesday this winter quarter, students examine a case study of a software disaster. Teams respond to a set of questions posed by St-Amour, then pose their own discussion points to delve into aspects of the disaster they find most interesting or relevant.

“The discussion is wide-ranging because a lot of the causes of these failures are things that we can't necessarily tackle in a classroom setting, but they are nonetheless really important to talk about,” St-Amour said. “For instance, one of the big problems in the UK Post Office scandal is an excessive faith in technology coupled with a lack of understanding around the limits of technology — and students will encounter this type of issue.”

On Thursdays, the class analyzes one specific root cause of the software disaster and applies potential technical and process solutions.

“One central failure in the Post Office case was that different parts of the system could not actually communicate with one another,” St-Amour explained. “The fact that a postmaster had sent a payment was lost to the ether, so it looked like they had just kept the money.”

Students practiced using validation tools to test and check message models that represent the types of messages that would be transmitted in a Horizon-type system.

Students also work through hypothetical ethical conflicts that can arise in the workplace using frameworks such as the ACM Code of Ethics and Professional Conduct and ACM Proactive CARE (Consider, Analyze, Review, and Evaluate) to help make informed decisions.

“Students will confront ethical conflicts — pressure to cut corners to save money or ship something on time,” St-Amour said. “We discussed how to approach these things in practice to prevent compromising user safety or data security.”

For their final projects, each team of five students can opt to present a case study or a technical solution.

Lucy Beck, a fourth-year student pursuing a combined bachelor of arts and master of science in computer science degree, will present a case study with her team on the 2019 grounding of Boeing’s 737 Max airliner. The grounding followed two fatal plane crashes resulting from faulty sensor data fed to retrofitted flight stability software called the Maneuvering Characteristics Augmentation System (MCAS).

Beck, an incoming software engineer at Dropbox, enrolled in the course to learn about past software failures so she could understand what went wrong and how to avoid those types of failures in her own career.

“Throughout my time at Northwestern, my view has shifted from prioritizing technical skills to emphasizing the broader societal implications of technology,” Beck said. “Enrolling in courses like Responsible Software Engineering, alongside other advanced CS classes, has prompted me to critically examine the ethical dimensions of my work.”

Jack Burkhardt, a fourth-year student earning a combined bachelor of arts and master of science in computer science degree, believes a responsible software engineer can recognize their shortcomings, take ownership of their mistakes, and advocate for others whose voices may otherwise not be heard during the development process.

“A common theme in many of the failure case studies we examined during class involved engineers letting their hubris get the best of them,” Burkhardt said. “To be responsible is to do the best you can to recognize the ramifications of your work and ensure that a particular group will not be negatively impacted if the software fails.”

Burkhardt’s team is working on a case study of the 2022 Southwest Airlines holiday travel meltdown, in which operational failures of two proprietary and internally maintained software systems used for managing and crewing flights resulted in the cancellation of 16,900 flights and the stranding of more than two million passengers.

“The airline's software systems were already known to be dated and marred with issues, and a blizzard was the perfect storm for a software crisis,” Burkhardt said.


McCormick News Article