Typically, Jason Scott was first to respond with a outline proposal to back up the Internet Archive, by greatly expanding the collaborative efforts of ArchiveTeam. I think Jason is trying to do something really important, and extremely difficult.
The Internet Archive's collection is currently around 15PB. It has doubled in size in about 30 months. Suppose it takes another 30 months to develop and deploy a solution at scale. We're talking crowd-sourcing a distributed backup of at least 30PB growing at least 3PB/year.
To get some idea of what this means, suppose we wanted to use Amazon's Glacier. This is, after all, exactly the kind of application Glacier is targeted at. As I predicted shortly after Glacier launched, Amazon has stuck with the 1c/GB/mo price. So in 2017 we'd be paying Amazon $3.6M a year just for the storage costs. Alternately, suppose we used Backblaze's Storage Pod 4.5 at their current price of about 5c/GB, for each copy we'd have paid $1.5M in hardware cost and be adding $150K worth per year. This ignores running costs and RAID overhead.
It will be very hard to crowd-source resources on this scale, which is why I say this is the opposite of Lots Of Copies Keep Stuff Safe. The system is going to be short of storage; the goal of a backup for the Internet Archive must be the maximum of reliability for the minimum of storage.
Nevertheless, I believe it would be well worth trying some version of his proposal and I'm happy to help any way I can. Below the fold, my comments on the design of such a system.
ReliabilityWhy is reliability so important for this system? After all, I've been arguing that reliability isn't as important as people think elsewhere. Lets suppose that somehow we have a single copy of the 30PB on disks, and that we perform an integrity check via checksums 10 times a year. Optimistically, we assume these disks never fail for any reason, but they achieve their specified Unrecoverable Bit Error Rate (UBER) of 10-15. There are 2.4*1017 bits in the copy, so on average every time we do an integrity check we will get 240 bad bits. Pessimistically, we assume these bits are randomly distributed (this makes the analysis much easier).
If, as Jason suggests, the backup is divided into 70K 500GB blocks, the probability that any of them will have more than 1 bad bit is small, so we will lose 10*240*500GB of data every year, or 1.2PB, or about a third of the incoming data. Of course, we can repair these failures from the Internet Archive itself, at the cost of increasing the bandwidth impact from about an additional quarter of the Archive's current bandwidth to about a third (see below). But the probability that the Archive and the backup would lose the same data becomes significant.
This argues for much smaller blocks, to reduce the impact of the UBER at the cost of increasing the overhead of the system. Smaller blocks would also make it possible for more people to contribute storage, both from the cost of their contribution and from the impact on their bandwidth. Downloading 500GB on my DSL link would take its entire capacity for two weeks.
In real life, even in data centers disks fail in all sorts of ways that make UBER fairly unimportant. The crowd-sourced disks are likely to be much less reliable still. So the system needs to replicate the data.
ReplicationThe discussions I've seen so far assume that the data is simply replicated, as it is in the LOCKSS system, but only three times. Even replicating by a factor of three means the demand for storage in the backup network by 2018 is nearly 100PB. Clearly, some scheme that gave adequate reliability but used less storage would help significantly. There are two techniques that might help, erasure coding and entanglement. Warning: the following discussion is radically simplified, see here.
Erasure CodingErasure coding is like a distributed version of RAID; files are divided into storage blocks. These data blocks are organized into groups of N. For each group (M>N) blocks are stored, containing the data from the N blocks mixed together so that from any N blocks in the group the original data can be recovered. This allows for non-integer replication factors; the replication factor is (M/N). There are two ways to do this:
- The N data blocks can be stored unchanged, and (M-N) parity blocks computed from the N blocks can be added. This is the way RAID typically works; it has the advantage that, if nothing has gone wrong, reading a data block requires accessing only a single block. Writing a data block requires writing (1+M-N) blocks, as the parity blocks need to be updated to reflect the new data.
- The N data blocks are not stored. Instead M blocks are stored each computed from all of the N data blocks in the group. Reading a block requires accessing N blocks, writing a data block requires accessing M blocks.
The real importance is that no individual storage node can, if compromised, reveal any data. In the context of a crowd-sourced backup of the Internet Archive, this is important. If a node in the backup network contains data from the Archive "in the clear" the owner of the node might be in trouble if the relevant authorities considered that content undesirable. If the owner has deniability, in the sense that they can say "there is no way I can know what the data I am storing is, and no way anyone can recover usable data from my disk alone" it is much harder for the authorities to claim that the owner is doing something bad.
The second form of erasure coding has desirable properties for a backup of the Archive, and it can significantly reduce the demand for storage. Examples of systems using the second form of erasure coding are Tahoe-LAFS and Cleversafe.
EntanglementEntanglement was introduced in two 2001 papers, Tangler: A Censorship-Resistant Publishing System Based On Document Entanglements by Marc Waldman and David Mazières, and Dagster: Censorship-Resistant Publishing Without Replication by Adam Stubblefield and Dan Wallach. It has recently been revived in a strengthened form by Verónica Estrada Galiñanes and Pascal Felber in their paper Helical Entanglement Codes: An Efficient Approach for Designing Robust Distributed Storage Systems.
Like the second form of erasure coding, entanglement does not store the data blocks themselves. Each stored block contains data derived from multiple data blocks. The key difference is that erasure coding mixes the data from a fixed group of blocks, whereas entanglement does not organize the blocks into groups but mixes each incoming block with a pseudo-randomly chosen set of stored blocks. This has the following effects:
- Since the information from which a data block can be recovered is spread across the whole set of stored blocks, deleting or over-writing a data block will affect other data blocks. If the spread is wide enough, selective censorship is effectively blocked and the system is append-only. For the Internet Archive backup application, this is a good thing.
- Entanglement supports only integer replication factors:
- Tangler's publishing algorithm takes two stored blocks and a data block and outputs two new stored blocks, thus its replication factor is two. A data block can be recovered from any three of the four (two input and two output) stored blocks.
- In the default three-strand configuration of the Helical Entanglement Code (HEC) system publishing takes three stored blocks, one from each strand, and a data block and outputs three new stored blocks, one for each strand. Its replication factor is thus three. Absent data loss, recovering a data block requires accessing two successive stored blocks from any of the three strands. If data loss means that none of the three strands can supply the necessary blocks, a search process can recover the lost stored blocks from information in other stored blocks.
- A system provides document dependency if a document cannot be recovered if any document it is entangled with is lost.
- A system provides all-or-nothing integrity if no document can be recovered if any document is lost.
Entanglement has several desirable properties for a backup of the Archive, but it has too high a replication factor to be practical.
RequirementsIf I were doing the design I would start from the end I haven't seen any discussion of so far. Its neat to have a backup copy of the archive, but if it won't actually work when it is needed what's the point? So I'd start the design by looking not at how the data gets out there, but at the use cases when the data needs to get back. Two obvious cases are:
- The archive loses say 10TB, perhaps because it suffers a rash of correlated disk failures. How does the archive get it back from the backup?
- The Big One hits the Bay Area and the entire archive is lost. How can the service be re-constituted from the backup?
Once I had a good recovery design, then I'd figure out:
- How to get data from the archive into a system like that.
- How to continually audit the system verify that it was in good enough shape to work when needed, something systems used only in an emergency frequently fail to do.
- Capacity of 35PB by late 2017.
- Replication factor less than 1.5, to limit storage demand by late 2017 to less than about 50PB, or say 100K volunteers each providing 500GB.
- Provides deniability (see above).
- Ingest bandwidth of 15PB/year by late 2017 (the current content of the archive needs to be backed up in say 2 years while it is growing say 5PB/year). Note that this is about 4Gb/s leaving the archive, or roughly an additional 25% outbound bandwidth.
- 95% probability of correctly recovering 10TB in 5 days (it is assumed that much of the content will be off-line most of the time, so instant recovery cannot be a requirement).
- 95% probability of correctly recovering 95% of the entire archive in 90 days.
- Meet these requirements in the face of 5% malign nodes conspiring with each other, and realistic error and availability probabilities for the non-malign nodes.
- The system self-configures as available storage resources change so as to:
- Ensure all content is stored.
- Minimize the variation of replication factor across the content.