Tuesday, June 8, 2021

Unreliability At Scale

Thomas Claburn's FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold, some commentary on both papers.

I started writing about the fundamental engineering problem posed by keeping large numbers of bits for long periods of time in 2007's A Petabyte For A Century. A little later I summarized my argument:
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.
The "koan" I used was to require that the storage system have a 50% probability that every bit survived the century unchanged. The test to confirm in one year that the requirement was met would be economically infeasible, costing around three orders of magnitude more than the system itself. These papers study a problem with similar characteristics, silent data corruption during computation rather than during storage.

Facebook

The Facebook team introduce their paper thus:
Facebook infrastructure initiated investigations into silent data corruptions in 2018. In the past 3 years, we have completed analysis of multiple detection strategies and the performance cost associated. For brevity, this paper does not include details on the performance vs cost tradeoff evaluation. A follow up study would dive deep into the details. In this paper, we provide a case study with an application example of the corruption and are not using any fault injection mechanisms. This corruption represents one of the hundreds of CPUs we have identified with real silent data corruption through our detection techniques.
In other words, they tell the story of how a specific silent data corruption was detected and the root cause determined:
In one such computation, when the file size was being computed, a file with a valid file size was provided as input to the decompression algorithm, within the decompression pipeline. The algorithm invoked the power function provided by the Scala library ... Interestingly, the Scala function returned a 0 size value for a file which was known to have a non-zero decompressed file size. Since the result of the file size computation is now 0, the file was not written into the decompressed output database.

Imagine the same computation being performed millions of times per day. This meant for some random scenarios, when the file size was non-zero, the decompression activity was never performed. As a result, the database had missing files. The missing files subsequently propagate to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. This chain of dependencies causes the application to fail. Eventually the querying infrastructure reports critical data loss after decompression. The problem’s complexity is magnified as this manifested occasionally when the user scheduled the same workload on a cluster of machines. This meant the patterns to reproduce and debug were non-deterministic.
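To make the failure mode concrete, here is a minimal sketch in Scala, with hypothetical types and names rather than Facebook's actual pipeline code, of how a silently wrong size computation drops files from the output:

    // Stand-in for the real size computation, which went through the
    // power function in the Scala library; on the defective core it
    // sporadically returned 0 for files known to be non-empty.
    case class CompressedFile(name: String, sizeExponent: Int)

    def decompressedSize(f: CompressedFile): Long =
      math.pow(1.1, f.sizeExponent).toLong

    // A corrupt 0 looks like a legitimately empty result, so the file
    // is silently never written to the decompressed output database.
    def decompressStep(f: CompressedFile): Option[String] =
      if (decompressedSize(f) > 0) Some(s"decompressed ${f.name}")
      else None

There is no error path here; the loss only surfaces much later, as missing files in the key-value store mappings and failing queries.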
Even at first sight this is a nightmare to debug. And so it turned out to be:
With concerted debugging efforts and triage by multiple engineering teams, logging was enabled across all the individual worker machines at every step. This helped narrow down the host responsible for this issue. The host had clean system event logs and clean kernel logs. From a system health monitoring perspective, the machine showed no symptoms of failure. The machine sporadically produced corrupt results which returned zero when the expected results were non-zero.
Once they had a single machine on which it was possible to reproduce the data corruption they could investigate in more detail:
From the single machine workload, we identified that the failures were truly sporadic in nature. The workload was identified to be multi-threaded, and upon single threading the workload, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. The sporadic nature associated with multi-threading was eliminated but the sporadic nature associated with the data values persisted. After a few iterations, it became obvious that the computation of
Int(1.1^53) = 0
as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set
Int(1.1^52) = 142
the result was accurate.
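A minimal sketch of the check this boils down to (the expected values are simple arithmetic, not taken from the paper):

    // On a healthy core both results match the mathematics; on Core 59
    // of the affected CPU the first expression always returned 0.
    val bad  = math.pow(1.1, 53).toInt   // expected 156, observed 0 on the bad core
    val good = math.pow(1.1, 52).toInt   // expected 142, correct everywhere
    assert(good == 142)

The data dependence is the nasty part: only particular operand values trip the defect, so a generic manufacturing or burn-in test is unlikely to catch it.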
Next they needed to understand the specific sequence of instructions causing the corruption. This turned out to be as much of a nightmare as anything else in the story. The application, like most similar applications in hyperscale environments, ran in a virtual machine that used Just-In-Time compilation, rendering the exact instruction sequence inaccessible. They had to use multiple tools to figure out what the JIT compiler was doing to the source code, before finally arriving at an assembly-language test:
The assembly code accurately reproducing the defect is reduced to a 60-line assembly level reproducer. We started with a 430K line reproducer and narrowed it down to 60 lines.
The Facebook team conclude:
Silent data corruptions are real phenomena in datacenter applications running at scale. We present an example here which illustrates one of the many scenarios that we encounter with these data dependent, reclusive and hard to debug errors. Understanding these corruptions helps us gain insights into the silicon device characteristics; through intricate instruction flows and their interactions with compilers and software architectures. Multiple strategies of detection and mitigation exist, with each contributing additional cost and complexity into a large-scale datacenter infrastructure. A better understanding of these corruptions has helped us evolve our software architecture to be more fault tolerant and resilient. Together these strategies allow us to mitigate the costs of data corruption at Facebook’s scale.

Google

The Google team's paper takes a broader view of the problem that they refer to as corrupt execution errors (CEEs):
As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent” – the only symptom is an erroneous computation.

We refer to a core that develops such behavior as “mercurial.” Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem – one that will require collaboration between hardware designers, processor vendors, and systems software architects.
They recount how CEEs were initially detected:
Imagine you are running a massive-scale data-analysis pipeline in production, and one day it starts to give you wrong answers – somewhere in the pipeline, a class of computations are yielding corrupt results. Investigation fingers a surprising cause: an innocuous change to a low-level library. The change itself was correct, but it caused servers to make heavier use of otherwise rarely-used instructions. Moreover, only a small subset of the server machines are repeatedly responsible for the errors.

This happened to us at Google. Deeper investigation revealed that these instructions malfunctioned due to manufacturing defects, in a way that could only be detected by checking the results of these instructions against the expected results; these are “silent” corrupt execution errors, or CEEs. Wider investigation found multiple different kinds of CEEs; that the detected incidence is much higher than software engineers expect; that they are not just incremental increases in the background rate of hardware errors; that these can manifest long after initial installation; and that they typically afflict specific cores on multi-core CPUs, rather than the entire chip. We refer to these cores as “mercurial.”

Because CEEs may be correlated with specific execution units within a core, they expose us to large risks appearing suddenly and unpredictably for several reasons, including seemingly-minor software changes. Hyperscalers have a responsibility to customers to protect them against such risks. For business reasons, we are unable to reveal exact CEE rates, but we observe on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook [8]. The problem is serious enough for us to have applied many engineer-decades to it.
The "few mercurial cores per several thousand machines" have caused a wide range of problems:
Some specific examples where we have seen CEE:
  • Violations of lock semantics leading to application data corruption and crashes.
  • Data corruptions exhibited by various load, store, vector, and coherence operations.
  • A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
  • Corruption affecting garbage collection, in a storage system, causing live data to be lost.
  • Database index corruption leading to some queries, depending on which replica (core) serves them, being non-deterministically corrupted.
  • Repeated bit-flips in strings, at a particular bit position (which stuck out as unlikely to be coding bugs).
  • Corruption of kernel state resulting in process and kernel crashes and application malfunctions.
Be thankful it isn't your job to debug these problems! Each of them is likely as much of a nightmare as the Facebook example.

Conclusion

The similarity between CEEs and silent data corruption in at-scale long-term storage is revealed when the Google team ask "Why are we just learning now about mercurial cores?":
There are many plausible reasons: larger server fleets; increased attention to overall reliability; improvements in software development that reduce the rate of software bugs. But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse manufacturing defects – especially those defects that manifest in corner cases, or only after post-deployment aging.
In other words, it is technically and economically infeasible to implement tests, for either storage systems or CPUs, that assure errors will not occur in at-scale use. Even if hardware vendors can be persuaded to devote more resources to reliability, that will not remove the need for software to be appropriately skeptical of the results that cores return from computation, just as software needs to be skeptical of the data returned by storage devices and network interfaces.
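What "appropriately skeptical" might look like is an open question; one crude and costly form is to re-execute critical pure computations and compare the results, ideally on a different core or machine. A minimal sketch of the comparison logic only, with hypothetical names:

    // Run a pure computation twice and insist the results agree.
    // A real system would pin the second run to a different core or
    // machine and decide what to do on divergence; this shows only
    // the check itself.
    def checkedCompute[A](f: () => A): A = {
      val first  = f()
      val second = f()
      require(first == second, s"divergent results: $first vs $second")
      first
    }

    // Echoing the Facebook example:
    // checkedCompute(() => math.pow(1.1, 53).toInt)

The cost is at least a doubling of compute, which is exactly the detection-versus-cost trade-off the Facebook paper defers to a follow-up study.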

Both papers are must-read horror stories.

6 comments:

Steven McGeady said...

It was a central conceit of the Intel/Siemens BiiN venture, the follow-on to the Intel i432, that every result in a high-trust compute system needed to be verified by *two* independent compute processes. In addition, that system was predicated on 33-bit memory (so-called "capability" systems), separating code from data, and more.

None of these innovations were commercially successful. This system emerged at about the same time as RISC systems, a completely different design ideology, and one better adjusted to the times.

So here we are, we've populated a massive cloud with single-thread computational resources that are inherently unreliable, if only at a small rate. It's an interesting discussion as to whether one should solve this at a hardware, software, or some "middle-ware" place. But it seems clear that it should probably be solved.

Unknown said...

Great overview! Just to add a little more fuel to the fire, in case anyone missed it, NTT reported similar challenging scenarios with networking (Technology to prevent information and communications equipment trouble from cosmic rays): https://www.ntt.co.jp/news2013/1303e/130321a.html

The Holley's said...

Good comment, Steven. I wasn’t aware of BiiN. My thought as I read through this article was that perhaps we should introduce a two- or three-channel voting architecture. However, we’d be increasing the cost of machines dramatically and might only receive a limited increase in reliability, as corner cases are, well, corner. Still, the earlier we detect voting failures, the sooner we can research manufacturing mitigations.

As a developer I shook my head at technology enthusiasts who insisted that if the software was tested there was no way a machine could flip a bit when it wasn’t supposed to. We have so much to learn.

Anonymous said...

The i432 was not the first shadow-execution reliable computer. Such machines are commercially successful in aviation, vehicles, power plants, high-risk machinery, even telephone switching systems, but mostly they are RISC-like machines. The Plessey S/500 was a rare capability machine with shadow execution which may have been an inspiration to Intel on the 432. Intel was doing a lot of work with DARPA in the late 70s on highly secure computing, which influenced the segmented and ring architecture of the 286. So you can see the 432 as the last hurrah of that clique among its designers.

It will not be easy to get shadow execution to work in a world of speculative execution and non-deterministic timing. Some innovation will be needed.

Blissex2 said...

Many thanks to our blogger for this excellent overview, and here are some cynical comments about what he reports, not his overview.

«"Why are we just learning now about mercurial cores?"»

I saw all of these things happen even on discrete-logic CPU computers, XDS Sigma IIs, PDP-11s, and, much more rarely, mainframes. They are nothing new, but have been part of the experience of IT people for many decades (even if it was usually the buses, memory, and peripherals rather than the CPUs that had "glitches"). If they are learning only now about such possibilities, it is more because of the "we" than the "mercurial cores".

«"There are many plausible reasons:"»

My impression is that those "we" at Google and Facebook and many other places were simply not looking for them, and when they manifested they were dismissed because of social, not technical, factors:

* A lot of the engineering people at Google, Facebook, Twitter, Amazon, etc. were/are newbies who had never been exposed to such issues on their Nintendos and Playstations :-).

* A lot of the managerial people at Google, Facebook, Twitter, Amazon, etc. cared very little about spending money and time on occasional glitches as long as eyeballs were booming and so were stock valuations. For managerial types "it works" means "my bonus/option package is doing well".

That is, “Hyperscalers have a responsibility to customers to protect them against such risks” used to be the last thing big web corps cared about; my guess is that things changed for one or both of these reasons:

* Some big web corps started to offer paid-for services, like Google Docs or AWS, where individual customers do complain if glitches happen, whereas glitches in users-as-a-product services matter only in the aggregate. If you lose Google Docs files or advertising videos, someone with a thick wallet is going to complain; not so if you lose someone's search history or random posts.

* Those big web corps, as they grew, split into customer-provider departments with well-defined boundaries, internal as-if-paid-for interfaces, and blame dumping, and therefore the customer departments would dump blame for glitches in services onto the provider departments, requiring responses.

Surely there have also been technical reasons, but they too are driven by social factors:

* “ever-smaller feature sizes that push closer to the limits of CMOS scaling”: that really means "cheaper" sells more than "greater reliability", something particularly evident in GPUs.
* “coupled with ever-increasing complexity in architectural design”: that really means "more features" and "backwards compatible forever" sell more than "greater reliability".

Overall, however, the really important "detail" of these reports from big web corps is:

“The problem is serious enough for us to have applied many engineer-decades to it.”

That level of spending is not commonly affordable. As our excellent blogger keeps reminding us, funding issues are very important in computing at scale, whether of size or of time, even just for auditing (something of which I often have to remind optimists), as in his point “The test to confirm in one year that the requirement was met would be economically infeasible, costing around three orders of magnitude more than the system itself”, which is often applicable in other situations.

For the future I am cynical, and we shall live with glitchy computer systems, both hardware and software, just as we have long learned to live with glitchy biological and social systems. Science fiction writers have long anticipated that size, time, and complexity scales will result in unreliable automation; right now I remember "Feersum Endjinn" and "A Deepness in the Sky".

David. said...

John Markoff covers this topic for the general reader in Tiny Chips, Big Headaches:

"In their paper “Cores That Don’t Count,” the Google researchers noted that the problem was challenging enough that they had already dedicated the equivalent of several decades of engineering time to solving it.

Modern processor chips are made up of dozens of processor cores, calculating engines that make it possible to break up tasks and solve them in parallel. The researchers found a tiny subset of the cores produced inaccurate results infrequently and only under certain conditions. They described the behavior as sporadic. In some cases, the cores would produce errors only when computing speed or temperature was altered.

Increasing complexity in processor design was one important cause of failure, according to Google. But the engineers also said that smaller transistors, three-dimensional chips and new designs that create errors only in certain cases all contributed to the problem."