Bruce Li introduces his
The Long Now of the Web: Inside the Internet Archive’s Fight Against Forgetting thus:
This report delves into the mechanics of the Internet Archive with the precision of a teardown. We will strip back the chassis to examine the custom-built PetaBox servers that heat the building without air conditioning. We will trace the evolution of the web crawlers—from the early tape-based dumps of Alexa Internet to the sophisticated browser-based bots of 2025. We will analyze the financial ledger of this non-profit giant, exploring how it survives on a budget that is a rounding error for its Silicon Valley neighbors. And finally, we will look to the future, where the "Decentralized Web" (DWeb) promises to fragment the Archive into a million pieces to ensure it can never be destroyed.
It is long, detailed, comprehensive and well worth reading in full. Below the fold I comment on the part about storage.
This picture shows Brewster Kahle with the Internet Archive's very first storage, tape drives holding crawls from
Alexa Internet, the Web traffic analysis company founded by Brewster and Bruce Gilliat in 1996 and, three years later, acquired by Amazon for $250M.
The first custom-designed Internet Archive storage system was version 1 of the
PetaBox, a vividly red-painted 1U system introduced in 2004. This picture shows a rack of them. I'm familiar with them because the
LOCKSS Program used them for some years.
They were great! Li
explains:
The first PetaBox rack, operational in June 2004, was a revelation in storage density. It held 100 terabytes (TB) of data—a massive sum at the time—while consuming only about 6 kilowatts of power.1 To put that in perspective, in 2003, the entire Wayback Machine was growing at a rate of just 12 terabytes per month.
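A quick back-of-envelope check on those figures (the 100 TB, ~6 kW and 12 TB/month numbers are from the quote; the rest is simple arithmetic):

```python
# Back-of-envelope arithmetic on the first PetaBox rack, using the figures
# quoted above: 100 TB and ~6 kW per rack, 12 TB/month of Wayback growth in 2003.
rack_capacity_tb = 100
rack_power_w = 6000
wayback_growth_tb_per_month = 12

watts_per_tb = rack_power_w / rack_capacity_tb
months_per_rack = rack_capacity_tb / wayback_growth_tb_per_month

print(f"{watts_per_tb:.0f} W/TB")                                            # ~60 W/TB
print(f"~{months_per_rack:.1f} months of 2003-era growth fits in one rack")  # ~8.3 months
```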
The LOCKSS Program eventually
transitioned to 4U systems similar to those then in use at
Backblaze. Around the same time the Archive also moved to 4U systems for their
PetaBox.
As disk technology progressed the Archive continued
evolving the system:
The fourth-generation PetaBox, introduced around 2010, exemplified this density. Each rack contained 240 disks of 2 terabytes each, organized into 4U high rack mounts. These units were powered by Intel Xeon processors (specifically the E7-8870 series in later upgrades) with 12 gigabytes of RAM. The architecture relied on bonding pair of 1-gigabit interfaces to create a 2-gigabit pipe, feeding into a rack switch with a 10-gigabit uplink.10
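Taking the quoted figures at face value, a quick sanity check on the fourth-generation rack's capacity, and on how long its bonded 2-gigabit pipe would need to stream a whole rack out (an idealized bound that ignores protocol overhead):

```python
# Rough arithmetic from the fourth-generation PetaBox figures quoted above.
disks_per_rack = 240
tb_per_disk = 2
rack_tb = disks_per_rack * tb_per_disk             # 480 TB per rack

pipe_bits_per_s = 2e9                              # bonded pair of 1-gigabit links
seconds = rack_tb * 1e12 * 8 / pipe_bits_per_s

print(f"{rack_tb} TB per rack")
print(f"~{seconds / 86400:.0f} days to stream a full rack at 2 Gbit/s")   # ~22 days
```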
Over time older, smaller drives were phased out and replaced by
newer, bigger drives:
By 2025, the storage landscape had shifted again. The current PetaBox racks provide 1.4 petabytes of storage per rack. This leap is achieved not by adding more slots, but by utilizing significantly larger drives—8TB, 16TB, and even 22TB drives are now standard. In 2016, the Archive managed around 20,000 individual disk drives. Remarkably, even as storage capacity tripled between 2012 and 2016, the total count of drives remained relatively constant due to these density improvements.
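For a sense of what the drive-size leap means, here is the number of drives of each quoted size needed to reach 1.4 PB (the actual racks mix drive sizes, so the real count per rack will differ):

```python
# Drives needed per rack to reach the quoted 1.4 PB, for each quoted drive size.
# Real racks mix drive sizes, so the actual count per rack will differ.
rack_tb = 1400
for drive_tb in (8, 16, 22):
    print(f"{drive_tb:2d} TB drives: ~{rack_tb / drive_tb:.0f} per rack")
# 8 TB: ~175   16 TB: ~88   22 TB: ~64
```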
This is indeed an impressive story, and Li
writes:
The trajectory of the PetaBox is a case study in Moore's Law applied to magnetic storage.
Here I have to correct Li. Moore's Law applies to silicon such as solid-state storage; it is
Kryder's Law that applies to magnetic storage such as the hard disks in the PetaBox.
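Both laws describe the same kind of exponential improvement; only the substrate differs. In generic form, with $D_0$ the density at time $t_0$ and $T_d$ the doubling time,

$$D(t) = D_0 \cdot 2^{(t - t_0)/T_d},$$

where for Kryder's Law $D$ is the areal density of magnetic media and for Moore's Law it is the transistor count of an integrated circuit.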
Li notes the Archive's innovative approach to
cooling the racks:
The Archive's primary data center is located in the Richmond District of San Francisco, a neighborhood known for its perpetual fog and cool maritime climate. The building utilizes this ambient air for cooling. There is no traditional air conditioning in the PetaBox machine rooms. Instead, the servers are designed to run at slightly higher operational temperatures, and the excess heat generated by the spinning disks is captured and recirculated to heat the building during the damp San Francisco winters.9
This "waste heat" system is a closed loop of efficiency. The 60+ kilowatts of heat energy produced by a storage cluster is not a byproduct to be eliminated but a resource to be harvested. This design choice dramatically lowers the Power Usage Effectiveness (PUE) ratio of the facility, allowing the Archive to spend its limited funds on hard drives rather than electricity bills.
In the event, unlikely for San Francisco, that the day is too hot, less-urgent tasks can be delayed, or some of the racks can have their clock rates reduced, their disks put into sleep mode, or even be powered down. Redundancy means that the data will be available elsewhere.
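For reference, the Power Usage Effectiveness ratio mentioned above is simply

$$\mathrm{PUE} = \frac{\text{total facility energy}}{\text{IT equipment energy}},$$

so a facility with no chillers, cooled by ambient air, approaches the ideal value of 1.0: almost every watt it draws goes into the PetaBoxes themselves rather than into cooling them.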
Brian Wilson, CTO of Backblaze,
pointed out back in 2014 that in their long-term storage environment
"Double the reliability is only worth 1/10th of 1 percent cost increase". And thus that the moral of the story was "design for failure and buy the cheapest components you can".
The Archive has followed this advice for a long time, as Li
explains:
With over 28,000 spinning disks in operation, drive failure is a statistical certainty. ...
The PetaBox software is designed to be fault-tolerant. Data is mirrored across multiple machines, often in different physical locations (including data centers in Redwood City and Richmond, California, and copies in Europe and Canada).12 Because the data is not "mission-critical" in the sense of a live banking transaction, the Archive can tolerate a certain number of dead drives in a node before physical maintenance is required.
Here is Brian Wilson's
analysis:
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
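Wilson's arithmetic is easy to check (the drive count, failure rates, swap time and dollar figures are all from his quote):

```python
# Reproducing Brian Wilson's back-of-envelope labor numbers from the quote above.
drives = 30_000
swap_hours = 15 / 60                      # 15 minutes per drive replacement

for annual_failure_rate in (0.02, 0.01):
    failed = drives * annual_failure_rate
    hours = failed * swap_hours
    print(f"{annual_failure_rate:.0%} failures: {failed:.0f} drives, {hours:.0f} hours of swapping")
# 2%: 600 drives, 150 hours (about a month of 8-hour days)
# 1%: 300 drives,  75 hours; the saving is roughly $5,000 of labor against a $4M drive purchase
```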
Back in 2006 I started asking
How Hard Is "A Petabyte for a Century"? and concluded that it was
really hard. By analogy between "bit rot" and radioactivity, I estimated that keeping a petabyte for a century with a 50% chance of no bit flips required a bit half-life
a hundred million times the age of the universe. But that wasn't the
fundamental problem:
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.
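For the record, the arithmetic behind that half-life estimate is simple if we treat each bit as decaying independently, like a radioactive atom (a deliberately naive model). A petabyte is $N = 8\times10^{15}$ bits, so the probability that none of them flips in $T = 100$ years is

$$P = 2^{-NT/t_{1/2}} \ge \frac{1}{2} \quad\Longrightarrow\quad t_{1/2} \ge NT = 8\times10^{17}\ \text{years},$$

several tens of millions of times the roughly $1.4\times10^{10}$-year age of the universe, which is the order of magnitude quoted above.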
We have to assume that, despite their best efforts, the Archive will over time lose some data. This isn't a problem for two reasons:
- The Archive's collection is, fundamentally, a sample of cyberspace. Stuff missing is vastly more likely to be because it wasn't collected than because it was collected but subsequently lost.
- Driving the already low probability of loss down further gets exponentially more expensive. The more of the Archive's limited resources are devoted to doing so, the less can be devoted to collecting and storing more stuff. Thus devoting more to eliminating data loss results in less data surviving.
Since at least
2011 I have written repeatedly about the fact that preserving data for the long term is an economic, not a technical, problem. With a lavish budget it is simple; the smaller the budget the harder it gets. Li correctly points out that the Archive's budget, in the range of $25-30M/year, is vastly
lower than that of any comparable website:
By owning its hardware, using the PetaBox high-density architecture, avoiding air conditioning costs, and using open-source software, the Archive achieves a storage cost efficiency that is orders of magnitude better than commercial cloud rates.
But in making this point he uses a
flawed comparison:
Consider the cost of storing 100 petabytes on Amazon S3. At standard rates (~$0.021 per GB per month), the storage alone would cost over $2.1 million per month. The Internet Archive’s entire annual operating budget—for staff, buildings, legal defense, and hardware—is less than what it would cost to store their data on AWS for a year.
First, Li ignores bandwidth charges which, for one of the most-visited sites on the Web, would dwarf the storage charges. Second, Li compares S3, intended for and priced as instantly accessible, replicated data for the back-end of mission-critical online services, with the Archive's storage, intended for the long-term storage of data, most of which is accessed very rarely. Amazon has products aimed at this market. As an example, writing a petabyte to Amazon Glacier, storing it for a decade, and then reading it out at last year's prices would
cost just under $160K. This too is a misleading comparison; Glacier is intended for
archival data that has a much lower probability of access than the Archive's.
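The S3 side of that arithmetic is easy to reproduce; the same helper can be pointed at whatever Glacier or Deep Archive rate applies, bearing in mind that bandwidth, retrieval and request charges are all extra (the function below is illustrative, not an AWS pricing tool):

```python
# Illustrative storage-only cost arithmetic. The S3 rate is the ~$0.021/GB/month
# figure quoted above; real AWS pricing varies by region, class and usage, and
# bandwidth, retrieval and request charges are not included.
PB_IN_GB = 1_000_000

def monthly_storage_cost(petabytes, dollars_per_gb_month):
    return petabytes * PB_IN_GB * dollars_per_gb_month

s3 = monthly_storage_cost(100, 0.021)
print(f"100 PB on S3: ${s3/1e6:.1f}M/month, ${12*s3/1e6:.0f}M/year")   # ~$2.1M/month, ~$25M/year
```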
The key to making storage affordable is
tiering, moving data that hasn't been accessed recently down the storage hierarchy to cheaper media. Kestutis Patiejunas gave an excellent talk about this back in 2014. I discussed it in
More on Facebook's "Cold Storage". Facebook was in the happy position of having an extremely accurate model of the access patterns to each of the small number of data types they stored, so it could do a very good job of matching the data to the performance of the layer in their storage hierarchy that held it. Their expectation was that the primary reason data in the lowest layer would be accessed was that it had been subpoenaed.
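A minimal sketch of the idea, with tier names and idle-time thresholds invented purely for illustration (this is not Facebook's or the Archive's actual policy):

```python
from datetime import datetime, timedelta

# Toy tiering policy: the longer an object goes without being read, the further
# down the storage hierarchy (and the cheaper the media) it is allowed to live.
# Tier names and thresholds are invented for illustration only.
TIERS = [
    ("hot",  timedelta(days=30)),    # recently accessed: fast, expensive media
    ("warm", timedelta(days=365)),   # occasionally accessed
    ("cold", None),                  # rarely accessed: the cheapest media available
]

def tier_for(last_access, now=None):
    """Return the name of the tier an object belongs in, given its last access time."""
    idle = (now or datetime.now()) - last_access
    for name, max_idle in TIERS:
        if max_idle is None or idle <= max_idle:
            return name

now = datetime.now()
print(tier_for(now - timedelta(days=3)))      # hot
print(tier_for(now - timedelta(days=200)))    # warm
print(tier_for(now - timedelta(days=2000)))   # cold
```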