Tuesday, July 23, 2019

Not To Pick On Toyota

Just under five years ago Prof. Phil Koopman gave a talk entitled A Case Study of Toyota Unintended Acceleration and Software Safety (slides, video). I only just discovered it, and it's an extraordinarily valuable resource for understanding the risks of embedded software, especially in life-critical products, and the processes needed to avoid failures such as those that caused deaths from sudden unintended acceleration (SUA) of Toyota cars, and from unintended pitch-down of Boeing 737 MAX aircraft. I doubt Toyota is an outlier in this respect, and I would expect that the multi-billion dollar costs of the problems Koopman describes have motivated much improvement in their processes. Follow me below the fold for the details.

Wikipedia has a summary of the background to Koopman's talk, which states:
A subsequent investigation ... revealed that bad software design, antiquated ECU hardware fueled by a poor company culture were the likely cause of the SUA in the Toyota Camry incidents. ... on March 19, 2014, the DOJ issued a deferred prosecution agreement with a $1.2 billion criminal penalty for issuing misleading and deceptive statements to its consumers and federal regulators, as well as hiding another cause of unintended acceleration, the sticky pedal, from the NHTSA. This fine was separate from the $1.2 billion settlement of a class action suit paid to the drivers of Toyota cars who claimed that their cars had lost value as a result of the SUA problems gaining publicity in 2012, and was at the time the largest criminal fine against an automaker in US history. Toyota was also forced to pay a total of $66.2 million in fines to the Department of Transportation for failing to handle recalls properly and $25.5 million to Toyota shareholders whose stock lost value due to recalls. Nearly 400 wrongful-death and personal injury cases were also privately settled by Toyota as a result of unintended acceleration.
In 2010 the NHTSA reported that:
from 2000 to mid-May, it had received more than 6,200 complaints involving sudden acceleration in Toyota vehicles. The reports include 89 deaths and 57 injuries over the same period.
Under a tight timeline and with limited access to the Electronic Throttle Control System (ETCS) software, NASA investigated. Their conclusion was (Slide 9):
Proof for the hypothesis that the ETCS-i caused the large throttle opening UAs ... could not be found with the hardware and software testing performed. Because proof that the ETCS-i caused the reported UAs was not found does not mean it could not occur.
Nevertheless (Slide 9):
U.S. Transportation Secretary Ray LaHood said, “We enlisted the best and brightest engineers to study Toyota’s electronics systems, and the verdict is in. There is no electronic-based cause for unintended high-speed acceleration in Toyotas."
Koopman describes two major problems with the NASA investigation (Slide 10):
  • The ETCS has both a main CPU, and a monitor CPU. NASA only examined the main CPU.
  • Apparently based on misinformation from Toyota, NASA reported that the ETCS SRAMs had error detection and correction (EDAC), but for at least the 2005 model year, they didn't.
He goes on to examine in detail the flawed software in both CPUs, and the system architecture that provided multiple single points of failure. It is well worth studying the slides. I just want to make a few points from them.

1. Testing can never be enough

From (Slide 20):
  • Toyota tested about 35 million miles at system level
    • Plus 11 million hours module level software testing ... covering 2005-2010 period
    • In 2010 Toyota sold 2.1 million vehicles
  • Total testing is perhaps 1-2 hours per vehicle produced
    • Fleet will see thousands of times more field exposure
    • Vehicle testing simply can’t find all uncommon failures
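The "1-2 hours per vehicle" figure can be roughly reproduced from the numbers on the slide. The sketch below is only a back-of-the-envelope check: the average test speed, the number of production years, and the hours a car spends on the road over its life are assumptions made for illustration, not figures from Koopman's slides.

```python
# Figures quoted from Slide 20.
SYSTEM_TEST_MILES = 35e6     # system-level test miles
MODULE_TEST_HOURS = 11e6     # module-level software test hours, 2005-2010
VEHICLES_PER_YEAR = 2.1e6    # vehicles sold in 2010

# Assumptions for illustration only (not from the slides).
ASSUMED_AVG_TEST_SPEED_MPH = 30.0   # converts test miles to test hours
ASSUMED_PRODUCTION_YEARS = 6        # 2005 through 2010, at the 2010 sales rate
ASSUMED_FIELD_HOURS_PER_VEHICLE = 10 * 365 * 1.0  # 1 hour/day for 10 years


def hours_of_testing_per_vehicle(
    avg_speed_mph=ASSUMED_AVG_TEST_SPEED_MPH,
    production_years=ASSUMED_PRODUCTION_YEARS,
):
    """Total test hours divided by the number of vehicles produced."""
    system_hours = SYSTEM_TEST_MILES / avg_speed_mph
    total_test_hours = system_hours + MODULE_TEST_HOURS
    vehicles = VEHICLES_PER_YEAR * production_years
    return total_test_hours / vehicles


def field_to_test_exposure_ratio():
    """How many times more hours one car sees in the field than in test."""
    return ASSUMED_FIELD_HOURS_PER_VEHICLE / hours_of_testing_per_vehicle()


if __name__ == "__main__":
    print(f"test hours per vehicle: {hours_of_testing_per_vehicle():.2f}")
    print(f"field/test exposure ratio: {field_to_test_exposure_ratio():.0f}")
```

Under these assumptions the result is about one test hour per vehicle, and each car then accumulates thousands of times that exposure in the field. Changing the assumed speed or production years moves the number around, but not by enough to change the conclusion.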
Koopman quotes Ricky Butler and George Finelli's 1993 paper The infeasibility of quantifying the reliability of life-critical real-time software, which points out that:
life-testing of ultrareliable software is infeasible (i.e. to quantify 10^-8/h failure rate requires more than 10^8 h of testing)
This is essentially the same argument as in my Petabyte For A Century series, and by those exploring exascale super-computers; the product would be obsolete before testing could confirm it met the required reliability.
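To get a feel for the scale, here is a minimal sketch of a standard zero-failure reliability demonstration calculation. It assumes a constant failure rate (exponential model) and failure-free testing; that model is an assumption chosen for illustration, not necessarily the analysis used in the paper, and the function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def required_test_hours(max_failure_rate_per_hour, confidence):
    """Failure-free test hours needed to claim, at the given confidence,
    that the failure rate is no worse than max_failure_rate_per_hour.

    Assumes a constant failure rate, so the chance of surviving T hours
    is exp(-rate * T). Solving exp(-rate * T) = 1 - confidence for T
    gives T = -ln(1 - confidence) / rate.
    """
    if max_failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / max_failure_rate_per_hour


if __name__ == "__main__":
    hours = required_test_hours(1e-8, 0.95)
    print(f"test hours needed: {hours:.3g}")
    print(f"years on one test unit: {hours / HOURS_PER_YEAR:,.0f}")
    print(f"years on 1,000 units in parallel: {hours / HOURS_PER_YEAR / 1000:,.1f}")
```

For a target of 10^-8 failures per hour at 95% confidence this comes to roughly 3 x 10^8 failure-free hours, which is tens of thousands of years on a single test unit and still decades on a thousand units running in parallel.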

2. The scale of the problem is huge

A decade and a half ago Toyota's ETCS main CPU software was well over a quarter of a million lines of C code (Slide 18). Imagine how much bigger today's ETCS software must be, let alone all the other software embedded in a car's systems. One hopes that coding standards and development processes have improved, and that there isn't legacy code hanging around from those days.

3. Responses will always start with denial

Just as Boeing did when it initially blamed the 737 MAX crash pilots, Toyota blamed the drivers. And just as the FAA initially did, the government backed them up. It took a whistleblower to reveal what was really going on:
In April 2013, Betsy Benjaminson, a freelance translator working for Toyota to translate internal documents, released a personal statement about Toyota covering up facts about the sudden unintended acceleration problem. Benjaminson stated she “read many descriptions by executives and managers of how they had hoodwinked regulators, courts, and even congress, by withholding, omitting, or misstating facts.” Benjaminson also compared Toyota’s press releases and mentioned that they were obviously meant to “maintain public belief in the safety of Toyota’s cars—despite providing no evidence to support those reassurances.” This public statement was released when Benjaminson decided to name herself as a whistleblower after she had been providing evidence to Iowa Senator Charles Grassley.
To some extent, this is knee-jerk lawyerly caution, but to a much greater extent it is because the manufacturer doesn't understand the depth of their problem, and the regulator doesn't understand the extent to which they've been captured. Once they've both started on a PR strategy of denial, it takes something dramatic (a whistleblower, another crash) to force a U-turn.

4. Resilience

Four years ago, in Brittle Systems, I contrasted the typical brittle failure of computer systems, which work fine until they abruptly fail, with the way life-critical physical systems are engineered to fail: gradually, while making alarming noises. The canonical example is the stranded cables used in suspension bridges. Each strand acts as a Fault Containment Region (FCR); a crack will not propagate from one strand to its neighbors.

Koopman discusses FCRs (Slides 31 & 32) in software systems, as part of the "defense in depth" strategy necessary to make these life-critical systems resilient against the inevitable failures.
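As a rough illustration of the idea, the sketch below models two independent channels, each treated as its own fault containment region, with a cross-check that falls back to a safe output when they disagree. Everything here (the `Channel` class, the `arbitrate` function, the tolerance, and the idle fail-safe) is hypothetical and chosen for illustration; it is not Toyota's design or anything taken from Koopman's slides.

```python
from dataclasses import dataclass

SAFE_THROTTLE = 0.0  # hypothetical fail-safe: command idle


@dataclass
class Channel:
    """One fault containment region: it owns its state and shares none."""

    name: str
    gain: float = 1.0  # corrupting this simulates a fault inside the region

    def throttle_from_pedal(self, pedal: float) -> float:
        """Map pedal position (0..1) to a throttle command clamped to 0..1."""
        return max(0.0, min(1.0, self.gain * pedal))


def arbitrate(main_cmd: float, monitor_cmd: float, tolerance: float = 0.05) -> float:
    """Accept the main command only if the independent monitor agrees.

    On disagreement the fault is contained: the output goes to the safe
    value instead of propagating the possibly corrupted command.
    """
    if abs(main_cmd - monitor_cmd) > tolerance:
        return SAFE_THROTTLE
    return main_cmd


if __name__ == "__main__":
    main, monitor = Channel("main"), Channel("monitor")
    pedal = 0.3

    # Healthy case: both channels agree, so the main command is used.
    print(arbitrate(main.throttle_from_pedal(pedal), monitor.throttle_from_pedal(pedal)))

    # Simulated fault confined to the main channel's state.
    main.gain = 5.0
    print(arbitrate(main.throttle_from_pedal(pedal), monitor.throttle_from_pedal(pedal)))
```

The point of the sketch is that the cross-check only helps if the two channels really are independent. If they share memory, inputs, or a task scheduler, a single fault can corrupt both and the containment is lost.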


Clifford Atiyeh's Toyota Prius Stalling Problems Continue, Highlighted by a Dealer’s Lawsuit over Five-Year Problem reports that the inevitable software problems continue:
A Toyota Prius recall involving defective electric powertrains has reached a boiling point, with a California dealership suing the automaker. Behind the business disputes is a safety concern that could affect more than 800,000 Prius hybrids on U.S. roads. According to a report in the Los Angeles Times, the software fixes Toyota has released in three recalls between 2014 and 2018 aren't working.

Reporting on courtroom testimony from Toyota executives, the Times said as many as 20,000 Prius owners have reported electric powertrain failures since the recall was issued in February 2014. At that time, Toyota recalled nearly 700,000 cars from 2010 to 2014 model years for inverter transistors that could fail and shut the whole car down while driving. Toyota expanded this recall in July 2015 to include another 109,000 Prius V models, then released another recall for all of these cars in October 2018. Each involved software updates that were supposed to place the cars in limp-home mode if they detected a power failure. But the lawsuit alleges this hasn't been working as Toyota intended, according to the Times.

1 comment:

Unknown said...

Thank you for this article. I own a 2010 Toyota Scion with a Corolla engine. I just had a terrifying unintended acceleration incident with my vehicle 2 weeks ago. It was very poorly handled. They blamed it on my floor mats, but I honestly don't believe that was the case. I was not even allowed to speak to the technician who did the assessment of my car. They attributed it to floor mat entrapment because they couldn't find anything else wrong with the vehicle. I'm afraid to drive the car, nor do I feel it's ethical to sell it.