Thursday, March 25, 2021

Internet Archive Storage

The Internet Archive is a remarkable institution, which has become increasingly important during the pandemic. It has been among the world's top 300 Web sites for many years and is currently ranked #209, sustaining almost 60Gb/s outbound bandwidth from its collection of almost half a trillion archived Web pages and much other content. It does this on a budget of under $20M/yr, yet maintains 99.98% availability.

Jonah Edwards, who runs the Core Infrastructure team, gave a presentation on the Internet Archive's storage infrastructure to the Archive's staff. Below the fold, some details and commentary.

Tuesday, March 16, 2021

Correlated Failures

The invaluable statistics published by Backblaze show that, despite being built from technologies close to the physical limits (Heat-Assisted Magnetic Recording, 3D NAND Flash), modern digital storage media are extraordinarily reliable. However, I have long believed that the models that attempt to project the reliability of digital storage systems from the statistics of media reliability are wildly optimistic. They ignore foreseeable causes of data loss such as Coronal Mass Ejections and ransomware attacks, which cause correlated failures among the media in the system. No matter how many replicas there are, if all of them are destroyed or corrupted the data is irrecoverable.
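To make the argument above concrete, here is a minimal sketch of why models that assume independent media failures look so optimistic. The function name, the per-replica failure probability and the common-mode event probability are all hypothetical values chosen for illustration; none of them come from Backblaze's statistics or from any published model.

```python
def loss_probability(p_media: float, replicas: int, p_common: float = 0.0) -> float:
    """Probability that every replica is lost in some fixed period.

    p_media:  assumed per-replica loss probability from independent causes
    p_common: assumed probability of a single event (e.g. ransomware)
              that destroys all replicas at once
    Both parameters are illustrative assumptions, not measured values.
    """
    # With independent failures only, all replicas must fail separately.
    independent = p_media ** replicas
    # A common-mode event loses everything regardless of replica count.
    return p_common + (1.0 - p_common) * independent


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        naive = loss_probability(0.01, n)
        correlated = loss_probability(0.01, n, p_common=1e-4)
        print(f"replicas={n}  independent-only={naive:.3e}  with common mode={correlated:.3e}")
```

Under these assumed numbers the independent-only estimate keeps shrinking as replicas are added, while the estimate that includes a common-mode event never drops below the probability of that event.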

Modelling these "black swan" events is clearly extremely difficult, but much less dramatic causes are in practice important too. It has been known at least since Talagala's 1999 Ph.D. thesis that media failures in storage systems are significantly correlated, and at least since Jiang et al.'s 2008 Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics that only about half the failures in storage systems are traceable to media failures. The rest happen in the pipeline from the media to the CPU. Because this pipeline typically aggregates data from many media components, failures in it naturally cause correlations.

As I wrote in 2015's Disk reliability, discussing Backblaze's experience of a 40% Annual Failure Rate (AFR) in over 1,100 Seagate 3TB drives:
Alas, there is a long history of high failure rates among particular batches of drives. An experience similar to Backblaze's at Facebook is related here, with an AFR over 60%. My first experience of this was nearly 30 years ago in the early days of Sun Microsystems. Manufacturing defects, software bugs, mishandling by distributors, vibration resonance, there are many causes for these correlated failures.
Despite plenty of anecdotes, there is little useful data on which to base models of correlated failures in storage systems. Below the fold I summarize and comment on an important paper by a team from the Chinese University of Hong Kong and Alibaba that helps remedy this.

Thursday, March 4, 2021

History Of Window Systems

Alan Kay's Should web browsers have stuck to being document viewers? makes important points about the architecture of the infrastructure for user interfaces, but also sparked comments and an email exchange that clarified the early history of window systems. This is something I've written about previously, so below the fold I go into considerable detail.