FAST opened with a richly-deserved IEEE award for my friend Kirk McKusick, recognizing his 30-year stewardship of the Fast File System and his many contributions to the evolution of file systems in general. There's a prevalent mindset that software is unstable and evanescent; it is nice to draw attention to areas that have demonstrated long-term stability. In 30 years of consistent gradual improvement that have kept FFS competitive with the best in the industry, as the underlying disks have grown from megabytes to terabytes and the code has grown from 12K to 55K lines, there have been no incompatible changes in either the application interface or the on-disk format. Today's FFS could read any disk it has ever written, even had one by some miracle survived from 30 years ago.
This was followed by a fascinating keynote by Alyssa Henry, who manages Amazon's S3 storage service. As of a few months ago, it held over 40 billion objects and was growing exponentially. Designing systems to survive at these scales and growth rates is a major challenge; you will never know the important parameters, and failures of all kinds are an everyday occurrence. Alyssa enumerated the system's design goals:
- Don't lose or corrupt objects
- 99.99% uptime
- Scale as the competitive advantage
- Security, authentication and logs
- Low latency compared to the Internet's latency
- Simple API
- Cheap and pay-as-you-go so as to eliminate under-utilization.
She also identified a number of important techniques:
- Redundancy: but only as much as you need; although it increases durability and availability, it also increases cost and complexity.
- Retry: making operations idempotent means that they can be retried to leverage redundancy.
- Surge Protection: rate limiting, exponential backoff and cache TTL reduce the impact of the inevitable load peaks.
- Eventual consistency: sacrifice some consistency to improve availability, sacrifice some availability to improve durability. For example, writes are not acknowledged to the application until enough redundant writes have completed, but at that time only some of the indexes that provide access to the written data may have been updated.
- Routine failure: make sure to exercise all code paths all the time. For example, when equipment or software is end-of-lifed or taken down for maintenance, they just pull the plug rather than attempting a graceful shutdown.
- Integrity checking: the application delivers a checksum with the data which is checked on ingest, regularly during storage, and on dissemination.
- Telemetry: internal, external, real-time and historical, per host and aggregate data is essential to managing the system.
- Autopilot: humans fail and are slow. Don't blame the person, blame the tool design.
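To make a few of these techniques concrete, here is a minimal sketch of how idempotent writes, retry with exponential backoff, and checksum verification on ingest and dissemination might fit together. It is not S3's API or implementation; `ToyStore`, `TransientError`, `put_with_retry`, and all the parameters are hypothetical names invented for illustration, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout or a failed replica."""


class ToyStore:
    """In-memory stand-in for a storage service (an assumption, not S3)."""

    def __init__(self, failures=0):
        self.objects = {}
        self.failures = failures  # number of initial put calls that fail

    def put(self, key, data, checksum):
        if self.failures > 0:
            self.failures -= 1
            raise TransientError("simulated failure")
        # Integrity check on ingest: reject data that does not match.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise ValueError("checksum mismatch on ingest")
        # Idempotent: repeating the same put leaves the same state.
        self.objects[key] = (data, checksum)

    def get(self, key):
        data, checksum = self.objects[key]
        # Integrity check on dissemination.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise ValueError("stored object is corrupt")
        return data


def put_with_retry(store, key, data, attempts=5, base_delay=0.01,
                   sleep=time.sleep):
    """Retry an idempotent put, backing off exponentially with jitter.

    Returns the number of tries used; re-raises after the last attempt.
    """
    checksum = hashlib.sha256(data).hexdigest()
    for attempt in range(attempts):
        try:
            store.put(key, data, checksum)
            return attempt + 1
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Delay doubles on each attempt; jitter spreads out retries
            # so that many clients do not hammer the service in step.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Because the put is idempotent, the client can retry blindly without knowing whether an earlier attempt partially succeeded; the backoff is the client-side half of the surge protection described above.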
Her conclusion was that "Storage is a lasting relationship, it requires trust. Reliability comes from engineering, experience, and scale." It was interesting that although she gave a numeric goal of 4-nines availability, and said that S3 offers a guarantee of 3-nines with financial penalties if they don't deliver, she carefully avoided giving any numbers for reliability except that their goal was 100%. Well, Duh! Although she did state that they measured it, she didn't give any indication of what the measurements showed. Nor did she show any willingness to offer any stronger statement for reliability than their existing EULA, which basically states that if they lose your data it is your problem, not theirs. It is hard for me to see any responsible way to use S3 as a long-term data repository without some commitment to or measurement of a level of reliability.
The papers were interesting, but there was only one directly relevant to digital preservation. This was Tiered Fault Tolerance for Long-Term Integrity by Byung-Gon Chun, Petros Maniatis, Scott Shenker and John Kubiatowicz (disclosure, I have been a co-author with Petros). The paper essentially answers the question:
What is the minimum necessary hardware support that would allow a guarantee of long-term integrity in a distributed, replicated system equivalent to the guarantee Byzantine Fault Tolerance (BFT) offers for short-term integrity?
The answer turns out to be surprisingly little. Just a small amount of state to hold the root of a tree of hashes, protected in a specific way that is easy to implement in hardware.
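To see why so little state can suffice, consider a generic tree of hashes (a Merkle tree): the root is a single fixed-size digest that changes if any of the underlying data changes. The sketch below is a textbook construction, not the paper's design; `merkle_root`, the choice of SHA-256, and the leaf/interior prefix bytes are my assumptions for illustration.

```python
import hashlib


def merkle_root(blocks):
    """Return the root digest of a binary hash tree over `blocks`.

    Leaves and interior nodes are hashed with different prefix bytes so
    that a leaf can never be mistaken for an interior node.
    """
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(b"\x00" + b).digest() for b in blocks]
    while len(level) > 1:
        # Hash adjacent pairs to form the next level up.
        nxt = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        # An unpaired last node is promoted unchanged to the next level.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

However much data sits beneath it, the root is always 32 bytes here, so only that one value would need to be kept somewhere tamper-resistant; everything else can be checked against it.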
I'm particularly interested in this paper because, even before I started work on LOCKSS, I had realized that long-term integrity with a BFT-like guarantee wasn't possible without hardware assistance, and that I had no idea what such hardware would look like. The work that led to the major LOCKSS papers was a response to the fact that requiring special-purpose hardware that hadn't even been designed yet would have been a major barrier to entry for libraries wishing to preserve content. Our papers addressed the question:
What is the strongest reassurance we can offer without special-purpose hardware?
Taking the long view, it is likely that the work of Chun and his co-authors will eventually prove to be the starting point of the right track for preservation.