Tuesday, June 8, 2021

Unreliability At Scale

Thomas Claburn's "FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof" discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold, some commentary on both papers.

I started writing about the fundamental engineering problem posed by keeping large numbers of bits for long periods of time in 2007's A Petabyte For A Century. A little later I summarized my argument:
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.
The "koan" I used was to require that the storage system have a 50% probability that every bit survived the century unchanged. The test to confirm in one year that the requirement was met would be economically infeasible, costing around three orders of magnitude more than the system itself. These papers study a problem with similar characteristics, silent data corruption during computation rather than during storage.

Facebook

The Facebook team introduce their paper thus:
Facebook infrastructure initiated investigations into silent data corruptions in 2018. In the past 3 years, we have completed analysis of multiple detection strategies and the performance cost associated. For brevity, this paper does not include details on the performance vs cost tradeoff evaluation. A follow up study would dive deep into the details. In this paper, we provide a case study with an application example of the corruption and are not using any fault injection mechanisms. This corruption represents one of the hundreds of CPUs we have identified with real silent data corruption through our detection techniques.
In other words, they tell the story of how a specific silent data corruption was detected and the root cause determined:
In one such computation, when the file size was being computed, a file with a valid file size was provided as input to the decompression algorithm, within the decompression pipeline. The algorithm invoked the power function provided by the Scala library ... Interestingly, the Scala function returned a 0 size value for a file which was known to have a non-zero decompressed file size. Since the result of the file size computation is now 0, the file was not written into the decompressed output database.

Imagine the same computation being performed millions of times per day. This meant for some random scenarios, when the file size was non-zero, the decompression activity was never performed. As a result, the database had missing files. The missing files subsequently propagate to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. This chain of dependencies causes the application to fail. Eventually the querying infrastructure reports critical data loss after decompression. The problem’s complexity is magnified as this manifested occasionally when the user scheduled the same workload on a cluster of machines. This meant the patterns to reproduce and debug were non-deterministic.
Even at first sight this is a nightmare to debug. And so it turned out to be:
With concerted debugging efforts and triage by multiple engineering teams, logging was enabled across all the individual worker machines at every step. This helped narrow down the host responsible for this issue. The host had clean system event logs and clean kernel logs. From a system health monitoring perspective, the machine showed no symptoms of failure. The machine sporadically produced corrupt results which returned zero when the expected results were non-zero.
Once they had a single machine on which it was possible to reproduce the data corruption they could investigate in more detail:
From the single machine workload, we identified that the failures were truly sporadic in nature. The workload was identified to be multi-threaded, and upon single threading the workload, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. The sporadic nature associated with multi-threading was eliminated but the sporadic nature associated with the data values persisted. After a few iterations, it became obvious that the computation of
Int(1.1^53) = 0
as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set
Int(1.1^52) = 142
the result was accurate.
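In current Scala that computation is essentially math.pow(1.1, 53).toInt, which should be 156. A minimal sketch of the check (my illustration, not Facebook's actual reproducer; pinning it to one core at a time is left to an external tool such as taskset on Linux):

object PowCheck {
  def main(args: Array[String]): Unit = {
    // On a healthy core both values are non-zero; on the defective Core 59
    // the first computation came back as 0.
    val a = math.pow(1.1, 53).toInt   // expected 156
    val b = math.pow(1.1, 52).toInt   // expected 142, as in the paper
    println(s"Int(1.1^53) = $a, Int(1.1^52) = $b")
  }
}
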
Next they needed to understand the specific sequence of instructions causing the corruption. This turned out to be as much of a nightmare as anything else in the story. The application, like most similar applications in hyperscale environments, ran in a virtual machine that used Just-In-Time compilation, rendering the exact instruction sequence inaccessible. They had to use multiple tools to figure out what the JIT compiler was doing to the source code, and then finally arrive at an assembly-language test:
The assembly code accurately reproducing the defect is reduced to a 60-line assembly level reproducer. We started with a 430K line reproducer and narrowed it down to 60 lines.
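For readers who want to look at that layer themselves, a generic starting point (a sketch only, nothing like the tooling effort described above, and assuming a stock HotSpot JVM) is the JVM's diagnostic output, which can dump the machine code the JIT generates for a hot method:

object JitPeek {
  // Compile this and run it on a JVM started with
  //   -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
  // (which requires the hsdis disassembler plugin) to see the generated code.
  def hot(exp: Double): Int = math.pow(1.1, exp).toInt
  def main(args: Array[String]): Unit = {
    var acc = 0L
    var i = 0
    while (i < 10000000) { acc += hot(53); i += 1 }  // enough iterations for the JIT to compile `hot`
    println(acc)
  }
}
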
The Facebook team conclude:
Silent data corruptions are real phenomena in datacenter applications running at scale. We present an example here which illustrates one of the many scenarios that we encounter with these data dependent, reclusive and hard to debug errors. Understanding these corruptions helps us gain insights into the silicon device characteristics; through intricate instruction flows and their interactions with compilers and software architectures. Multiple strategies of detection and mitigation exist, with each contributing additional cost and complexity into a large-scale datacenter infrastructure. A better understanding of these corruptions has helped us evolve our software architecture to be more fault tolerant and resilient. Together these strategies allow us to mitigate the costs of data corruption at Facebook’s scale.

Google

The Google team's paper takes a broader view of the problem that they refer to as corrupt execution errors (CEEs):
As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent” – the only symptom is an erroneous computation.

We refer to a core that develops such behavior as “mercurial.” Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem – one that will require collaboration between hardware designers, processor vendors, and systems software architects.
They recount how CEEs were initially detected:
Imagine you are running a massive-scale data-analysis pipeline in production, and one day it starts to give you wrong answers – somewhere in the pipeline, a class of computations are yielding corrupt results. Investigation fingers a surprising cause: an innocuous change to a low-level library. The change itself was correct, but it caused servers to make heavier use of otherwise rarely-used instructions. Moreover, only a small subset of the server machines are repeatedly responsible for the errors.

This happened to us at Google. Deeper investigation revealed that these instructions malfunctioned due to manufacturing defects, in a way that could only be detected by checking the results of these instructions against the expected results; these are “silent” corrupt execution errors, or CEEs. Wider investigation found multiple different kinds of CEEs; that the detected incidence is much higher than software engineers expect; that they are not just incremental increases in the background rate of hardware errors; that these can manifest long after initial installation; and that they typically afflict specific cores on multi-core CPUs, rather than the entire chip. We refer to these cores as “mercurial.”

Because CEEs may be correlated with specific execution units within a core, they expose us to large risks appearing suddenly and unpredictably for several reasons, including seemingly-minor software changes. Hyperscalers have a responsibility to customers to protect them against such risks. For business reasons, we are unable to reveal exact CEE rates, but we observe on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook [8]. The problem is serious enough for us to have applied many engineer-decades to it.
The "few mercurial cores per several thousand machines" have caused a wide range of problems:
Some specific examples where we have seen CEE:
  • Violations of lock semantics leading to application data corruption and crashes.
  • Data corruptions exhibited by various load, store, vector, and coherence operations.
  • A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
  • Corruption affecting garbage collection, in a storage system, causing live data to be lost.
  • Database index corruption leading to some queries, depending on which replica (core) serves them, being non-deterministically corrupted.
  • Repeated bit-flips in strings, at a particular bit position (which stuck out as unlikely to be coding bugs).
  • Corruption of kernel state resulting in process and kernel crashes and application malfunctions.
Be thankful it isn't your job to debug these problems! Each of them is likely as much of a nightmare as the Facebook example.
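The common thread in detecting them is the one the Google team name above: checking the results of real instructions against values known in advance. A minimal sketch of a per-core known-answer screen (my illustration, not either company's production mechanism; pinning the process to each core in turn is assumed to be done externally):

object KnownAnswerScreen {
  // Each check computes something whose correct value is fixed in advance;
  // a mismatch flags the core this process is pinned to as a CEE suspect.
  final case class Check(name: String, compute: () => Long, expected: Long)

  val checks = Seq(
    Check("pow", () => math.pow(1.1, 52).toInt.toLong, 142L),          // the value from the Facebook case
    Check("sum", () => (1L to 1000000L).sum,           500000500000L)  // closed form n(n+1)/2
  )

  def main(args: Array[String]): Unit =
    for (c <- checks) {
      val got = c.compute()
      if (got != c.expected)
        println(s"CEE suspect: ${c.name} returned $got, expected ${c.expected}")
    }
}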

Conclusion

The similarity between CEEs and silent data corruption in at-scale long-term storage is revealed when the Google team ask "Why are we just learning now about mercurial cores?":
There are many plausible reasons: larger server fleets; increased attention to overall reliability; improvements in software development that reduce the rate of software bugs. But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse manufacturing defects – especially those defects that manifest in corner cases, or only after post-deployment aging.
In other words, it is technically and economically infeasible to implement tests, for either storage systems or CPUs, that assure errors will not occur in at-scale use. Even if hardware vendors can be persuaded to devote more resources to reliability, that will not remove the need for software to be appropriately skeptical of the data that cores return as the result of computation, just as it needs to be skeptical of the data returned by storage devices and network interfaces.
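At the application level the simplest form of that skepticism is the same trick already used against unreliable storage and networks: do the work more than once and compare. A minimal sketch (an illustration only; to catch a core that is consistently wrong for a given input, the two runs must land on different cores or machines, which is assumed to be arranged by the scheduler):

object SkepticalCompute {
  // Run a computation twice and refuse to accept disagreeing answers.
  def checked[A](compute: () => A): A = {
    val first  = compute()
    val second = compute()
    require(first == second, s"silent corruption suspected: $first != $second")
    first
  }

  def main(args: Array[String]): Unit =
    println(checked(() => math.pow(1.1, 53).toInt))   // 156 on healthy cores
}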

Both papers are must-read horror stories.

1 comment:

Steven McGeady said...

It was a central conceit of the Intel/Siemens BiiN venture, the follow-on to the Intel i432, that every result in a high-trust compute system needed to be verified by *two* independent compute processes. In addition, that system was predicated on 33-bit memory (so-called "capability" systems), separating code from data, and more.

None of these innovations were commercially successful. This system emerged at about the same time as RISC systems, a completely different design ideology, and one more well-adjusted to the times.

So here we are: we've populated a massive cloud with single-thread computational resources that are inherently unreliable, if only at a small rate. It's an interesting discussion as to whether one should solve this at a hardware, software, or some "middle-ware" level. But it seems clear that it should probably be solved.