Bruce Li introduces his
The Long Now of the Web: Inside the Internet Archive’s Fight Against Forgetting thus:
This report delves into the mechanics of the Internet Archive with the precision of a teardown. We will strip back the chassis to examine the custom-built PetaBox servers that heat the building without air conditioning. We will trace the evolution of the web crawlers—from the early tape-based dumps of Alexa Internet to the sophisticated browser-based bots of 2025. We will analyze the financial ledger of this non-profit giant, exploring how it survives on a budget that is a rounding error for its Silicon Valley neighbors. And finally, we will look to the future, where the "Decentralized Web" (DWeb) promises to fragment the Archive into a million pieces to ensure it can never be destroyed.
It is long, detailed, comprehensive and well worth reading in full. Below the fold I comment on the part about storage.
This picture shows Brewster Kahle with the Internet Archive's very first storage, tape drives holding crawls from
Alexa Internet, the Web traffic analysis company founded by Brewster and Bruce Gilliat in 1996 and, three years later, acquired by Amazon for $250M.
The first custom-designed Internet Archive storage system was version 1 of the
PetaBox, a vividly red-painted 1U system introduced in 2004. This picture shows a rack of them. I'm familiar with them because the
LOCKSS Program used them for some years.
They were great! Li
explains:
The first PetaBox rack, operational in June 2004, was a revelation in storage density. It held 100 terabytes (TB) of data—a massive sum at the time—while consuming only about 6 kilowatts of power.1 To put that in perspective, in 2003, the entire Wayback Machine was growing at a rate of just 12 terabytes per month.
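A quick back-of-envelope check on those figures (the 100 TB, ~6 kW and 12 TB/month numbers are from the quote; the rest is simple arithmetic):

```python
# Back-of-envelope arithmetic on the first PetaBox rack, using the figures
# quoted above: 100 TB and ~6 kW per rack, 12 TB/month of Wayback growth in 2003.
rack_capacity_tb = 100
rack_power_w = 6000
wayback_growth_tb_per_month = 12

watts_per_tb = rack_power_w / rack_capacity_tb
months_per_rack = rack_capacity_tb / wayback_growth_tb_per_month

print(f"{watts_per_tb:.0f} W/TB")                                            # ~60 W/TB
print(f"~{months_per_rack:.1f} months of 2003-era growth fits in one rack")  # ~8.3 months
```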
The LOCKSS Program eventually
transitioned to 4U systems similar to those then in use at
Backblaze. Around the same time the Archive also moved to 4U systems for their
PetaBox.
As disk technology progressed the Archive continued
evolving the system:
The fourth-generation PetaBox, introduced around 2010, exemplified this density. Each rack contained 240 disks of 2 terabytes each, organized into 4U high rack mounts. These units were powered by Intel Xeon processors (specifically the E7-8870 series in later upgrades) with 12 gigabytes of RAM. The architecture relied on bonding pair of 1-gigabit interfaces to create a 2-gigabit pipe, feeding into a rack switch with a 10-gigabit uplink.10
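Taking the quoted figures at face value, a quick sanity check on the fourth-generation rack's capacity, and on how long its bonded 2-gigabit pipe would need to stream a whole rack out (an idealized bound that ignores protocol overhead):

```python
# Rough arithmetic from the fourth-generation PetaBox figures quoted above.
disks_per_rack = 240
tb_per_disk = 2
rack_tb = disks_per_rack * tb_per_disk             # 480 TB per rack

pipe_bits_per_s = 2e9                              # bonded pair of 1-gigabit links
seconds = rack_tb * 1e12 * 8 / pipe_bits_per_s

print(f"{rack_tb} TB per rack")
print(f"~{seconds / 86400:.0f} days to stream a full rack at 2 Gbit/s")   # ~22 days
```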
Over time older, smaller drives were phased out and replaced by
newer, bigger drives:
By 2025, the storage landscape had shifted again. The current PetaBox racks provide 1.4 petabytes of storage per rack. This leap is achieved not by adding more slots, but by utilizing significantly larger drives—8TB, 16TB, and even 22TB drives are now standard. In 2016, the Archive managed around 20,000 individual disk drives. Remarkably, even as storage capacity tripled between 2012 and 2016, the total count of drives remained relatively constant due to these density improvements.
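For a sense of what the drive-size leap means, here is the number of drives of each quoted size needed to reach 1.4 PB (the actual racks mix drive sizes, so the real count per rack will differ):

```python
# Drives needed per rack to reach the quoted 1.4 PB, for each quoted drive size.
# Real racks mix drive sizes, so the actual count per rack will differ.
rack_tb = 1400
for drive_tb in (8, 16, 22):
    print(f"{drive_tb:2d} TB drives: ~{rack_tb / drive_tb:.0f} per rack")
# 8 TB: ~175   16 TB: ~88   22 TB: ~64
```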
This is indeed an impressive story, and Li
writes:
The trajectory of the PetaBox is a case study in Moore's Law applied to magnetic storage.
Here I have to correct Li. Moore's Law applies to silicon such as solid-state storage; it is
Kryder's Law that applies to magnetic storage such as the hard disks in the PetaBox.
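Both laws describe the same kind of exponential improvement; only the substrate differs. In generic form, with $D_0$ the density at time $t_0$ and $T_d$ the doubling time,

$$D(t) = D_0 \cdot 2^{(t - t_0)/T_d},$$

where for Kryder's Law $D$ is the areal density of magnetic media and for Moore's Law it is the transistor count of an integrated circuit.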
Li notes the Archive's innovative approach to
cooling the racks:
The Archive's primary data center is located in the Richmond District of San Francisco, a neighborhood known for its perpetual fog and cool maritime climate. The building utilizes this ambient air for cooling. There is no traditional air conditioning in the PetaBox machine rooms. Instead, the servers are designed to run at slightly higher operational temperatures, and the excess heat generated by the spinning disks is captured and recirculated to heat the building during the damp San Francisco winters.9
This "waste heat" system is a closed loop of efficiency. The 60+ kilowatts of heat energy produced by a storage cluster is not a byproduct to be eliminated but a resource to be harvested. This design choice dramatically lowers the Power Usage Effectiveness (PUE) ratio of the facility, allowing the Archive to spend its limited funds on hard drives rather than electricity bills.
In the event, unlikely for San Francisco, that the day is too hot, less-urgent tasks can be delayed, or some of the racks can have their clock rates reduced, their disks put into sleep mode, or even be powered down. Redundancy means that the data will be available elsewhere.
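For reference, the Power Usage Effectiveness ratio mentioned above is simply

$$\mathrm{PUE} = \frac{\text{total facility energy}}{\text{IT equipment energy}},$$

so a facility with no chillers, cooled by ambient air, approaches the ideal value of 1.0: almost every watt it draws goes into the PetaBoxes themselves rather than into cooling them.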
Brian Wilson, CTO of Backblaze,
pointed out back in 2014 that in their long-term storage environment
"Double the reliability is only worth 1/10th of 1 percent cost increase". And thus that the moral of the story was "design for failure and buy the cheapest components you can".
The Archive has followed this advice for a long time, as Li
explains:
With over 28,000 spinning disks in operation, drive failure is a statistical certainty. ...
The PetaBox software is designed to be fault-tolerant. Data is mirrored across multiple machines, often in different physical locations (including data centers in Redwood City and Richmond, California, and copies in Europe and Canada).12 Because the data is not "mission-critical" in the sense of a live banking transaction, the Archive can tolerate a certain number of dead drives in a node before physical maintenance is required.
Here is Brian Wilson's
analysis:
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
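Wilson's arithmetic is easy to check (the drive count, failure rates, swap time and dollar figures are all from his quote):

```python
# Reproducing Brian Wilson's back-of-envelope labor numbers from the quote above.
drives = 30_000
swap_hours = 15 / 60                      # 15 minutes per drive replacement

for annual_failure_rate in (0.02, 0.01):
    failed = drives * annual_failure_rate
    hours = failed * swap_hours
    print(f"{annual_failure_rate:.0%} failures: {failed:.0f} drives, {hours:.0f} hours of swapping")
# 2%: 600 drives, 150 hours (about a month of 8-hour days)
# 1%: 300 drives,  75 hours; the saving is roughly $5,000 of labor against a $4M drive purchase
```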
Back in 2006 I started asking
How Hard Is "A Petabyte for a Century"? and concluded that it was
really hard. By analogy between "bit rot" and radioactivity, I estimated that keeping a petabyte for a century with a 50% chance of no bit flips required a bit half-life
a hundred million times the age of the universe. But that wasn't the
fundamental problem:
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.
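For the record, the arithmetic behind that half-life estimate is simple if we treat each bit as decaying independently, like a radioactive atom (a deliberately naive model). A petabyte is $N = 8\times10^{15}$ bits, so the probability that none of them flips in $T = 100$ years is

$$P = 2^{-NT/t_{1/2}} \ge \frac{1}{2} \quad\Longrightarrow\quad t_{1/2} \ge NT = 8\times10^{17}\ \text{years},$$

several tens of millions of times the roughly $1.4\times10^{10}$-year age of the universe, which is the order of magnitude quoted above.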
We have to assume that, despite their best efforts, the Archive will over time lose some data. This isn't a problem for two reasons:
- The Archive's collection is, fundamentally, a sample of cyberspace. Stuff missing is vastly more likely to be because it wasn't collected than because it was collected but subsequently lost.
- Driving the already low probability of loss down further gets exponentially more expensive. The more of the Archive's limited resources are devoted to doing so, the less can be devoted to collecting and storing more stuff. Thus devoting more to eliminating data loss results in less data surviving.
Since at least
2011 I have written repeatedly about the fact that preserving data for the long term is an economic, not a technical, problem. With a lavish budget it is simple; the smaller the budget the harder it gets. Li correctly points out that the Archive's budget, in the range of $25-30M/year, is vastly
lower than that of any comparable website:
By owning its hardware, using the PetaBox high-density architecture, avoiding air conditioning costs, and using open-source software, the Archive achieves a storage cost efficiency that is orders of magnitude better than commercial cloud rates.
But in making this point he uses a
flawed comparison:
Consider the cost of storing 100 petabytes on Amazon S3. At standard rates (~$0.021 per GB per month), the storage alone would cost over $2.1 million per month. The Internet Archive’s entire annual operating budget—for staff, buildings, legal defense, and hardware—is less than what it would cost to store their data on AWS for a year.
First, Li ignores bandwidth charges which, for one of the most-visited sites on the Web, would dwarf the storage charges. Second, Li compares S3, intended for and priced as instantly accessible, replicated data for the back-end of mission-critical online services, with the Archive's storage, intended for the long-term storage of data, most of which is accessed very rarely. Amazon has products aimed at this market. As an example, writing a petabyte to Amazon Glacier, storing it for a decade, and then reading it out at last year's prices would
cost just under $160K. This too is a misleading comparison; Glacier is intended for
archival data that has a much lower probability of access than the Archive's.
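The S3 side of that arithmetic is easy to reproduce; the same helper can be pointed at whatever Glacier or Deep Archive rate applies, bearing in mind that bandwidth, retrieval and request charges are all extra (the function below is illustrative, not an AWS pricing tool):

```python
# Illustrative storage-only cost arithmetic. The S3 rate is the ~$0.021/GB/month
# figure quoted above; real AWS pricing varies by region, class and usage, and
# bandwidth, retrieval and request charges are not included.
PB_IN_GB = 1_000_000

def monthly_storage_cost(petabytes, dollars_per_gb_month):
    return petabytes * PB_IN_GB * dollars_per_gb_month

s3 = monthly_storage_cost(100, 0.021)
print(f"100 PB on S3: ${s3/1e6:.1f}M/month, ${12*s3/1e6:.0f}M/year")   # ~$2.1M/month, ~$25M/year
```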
The key to making storage affordable is
tiering, moving data that hasn't been accessed recently down the storage hierarchy to cheaper media. Kestutis Patiejunas gave an excellent talk about this back in 2014. I discussed it in
More on Facebook's "Cold Storage". Facebook was in the happy position of having an extremely accurate model of the access patterns to each of the small number of data types they stored, so it could do a very good job of matching the data to the performance of the layer in their storage hierarchy that held it. Their expectation was that the primary reason data in the lowest layer would be accessed was that it had been subpoenaed.
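A minimal sketch of the idea, with tier names and idle-time thresholds invented purely for illustration (this is not Facebook's or the Archive's actual policy):

```python
from datetime import datetime, timedelta

# Toy tiering policy: the longer an object goes without being read, the further
# down the storage hierarchy (and the cheaper the media) it is allowed to live.
# Tier names and thresholds are invented for illustration only.
TIERS = [
    ("hot",  timedelta(days=30)),    # recently accessed: fast, expensive media
    ("warm", timedelta(days=365)),   # occasionally accessed
    ("cold", None),                  # rarely accessed: the cheapest media available
]

def tier_for(last_access, now=None):
    """Return the name of the tier an object belongs in, given its last access time."""
    idle = (now or datetime.now()) - last_access
    for name, max_idle in TIERS:
        if max_idle is None or idle <= max_idle:
            return name

now = datetime.now()
print(tier_for(now - timedelta(days=3)))      # hot
print(tier_for(now - timedelta(days=200)))    # warm
print(tier_for(now - timedelta(days=2000)))   # cold
```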