Thursday, March 25, 2021

Internet Archive Storage

The Internet Archive is a remarkable institution, which has become increasingly important during the pandemic. It has been in the world's top 300 Web sites for many years and is currently ranked #209, sustaining almost 60Gb/s of outbound bandwidth from its collection of almost half a trillion archived Web pages and much other content. It does this on a budget of under $20M/yr, yet maintains 99.98% availability.

Jonah Edwards, who runs the Core Infrastructure team, gave a presentation on the Internet Archive's storage infrastructure to the Archive's staff. Below the fold, some details and commentary.

Among the highlights:
  • 750 servers, some up to 9 years old
  • 1,300 VMs
  • 30K storage devices
  • >20K spinning disks (in paired storage), a mix of 4, 8, 12 and 16TB drives; about 40% of the bytes are on 16TB drives
  • almost 200PB of raw storage
  • growing the size of the archive by >25%/yr
  • adding 10-12PB of raw storage per quarter
  • with 16TB drives, the Archive would need 15 racks to hold one copy
  • currently running ~75 racks
  • currently serving about 55Gb/s, planning for ~80Gb/s soon
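
The capacity figures in the list above can be cross-checked with a little arithmetic. The following is a rough sketch, not anything from Edwards' talk: it assumes decimal units and, because the storage is paired, that one copy is about half of the raw capacity. The names `annual_raw_growth` and `drives_for_copy` are made up for illustration.

```python
# Back-of-envelope check of the capacity figures listed above.
# Decimal units throughout: 1 PB = 1000 TB.

RAW_PB = 200.0                       # "almost 200PB of raw storage"
ADDED_PB_PER_QUARTER = (10.0, 12.0)  # "adding 10-12PB of raw storage per quarter"
DRIVE_TB = 16.0
RACKS_PER_COPY = 15                  # "15 racks to hold a copy"

# ASSUMPTION (not from the talk): with paired storage, one copy is
# roughly half of the raw capacity.
COPY_PB = RAW_PB / 2


def annual_raw_growth(raw_pb: float, added_pb_per_quarter: float) -> float:
    """Fraction by which raw capacity grows in a year at a steady quarterly rate."""
    return 4 * added_pb_per_quarter / raw_pb


def drives_for_copy(copy_pb: float, drive_tb: float) -> float:
    """Number of drives of a given size needed to hold one copy."""
    return copy_pb * 1000 / drive_tb


if __name__ == "__main__":
    low, high = (annual_raw_growth(RAW_PB, q) for q in ADDED_PB_PER_QUARTER)
    print(f"raw growth: {low:.0%} to {high:.0%} per year")
    drives = drives_for_copy(COPY_PB, DRIVE_TB)
    print(f"drives per copy: {drives:.0f}, per rack: {drives / RACKS_PER_COPY:.0f}")
```

Under these assumptions, raw capacity grows 20-24% per year, roughly in line with the quoted growth of the archive, and a single copy on 16TB drives needs about 6,250 drives, or a little over 400 per rack.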
Edwards reports that the primary outage causes are:
  • fiber cuts
  • power quality issues
  • power outages
Going forward, Edwards is asking "whether paired storage [is] the right model?" The current constraints are:
  • Items in the archive are directories on disk
  • basic unit of storage is the disk
  • disks are replicated across datacenters
  • content is served from all (=both?) copies
The big issue with treating the disk as the unit of paired storage is that when a disk fails, a new member of the pair has to be created by reading the whole of the good member and writing the whole of the new member. This takes time, during which the good member is under high load and thus more likely to suffer a correlated failure. The new member is at the start of its life and so subject to infant mortality, although it is fair to say that drive manufacturers have paid a lot of attention to reducing infant mortality. Edwards reports that the more recent drives are sufficiently faster than the 8TB drives that the risk is manageable, but as drives get bigger an architectural change will be required to manage it.
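
To get a feel for how long the surviving member of a pair is exposed, here is a minimal sketch of the whole-disk copy time. The sustained throughput values are purely illustrative assumptions, not figures from Edwards' talk, and `rebuild_hours` is a hypothetical helper. The result is a lower bound, since in practice the good member is also serving reads during the copy.

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Lower bound on the hours needed to copy a whole drive.

    Assumes the copy runs at a constant sustained rate and uses decimal
    units (1 TB = 1_000_000 MB).
    """
    seconds = capacity_tb * 1_000_000 / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Drive sizes from the post; throughput values are HYPOTHETICAL.
    for capacity_tb in (4, 8, 12, 16):
        for throughput_mb_s in (100, 200):
            hours = rebuild_hours(capacity_tb, throughput_mb_s)
            print(f"{capacity_tb:>2}TB at {throughput_mb_s}MB/s: {hours:5.1f} hours")
```

The point of the sketch is the scaling: at a fixed transfer rate the exposure window grows linearly with capacity, so bigger drives are only as safe to re-pair as smaller ones if their throughput grows in proportion.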

Another issue is that the servers in the Archive's racks provide both the storage and the processing needed. The CPUs are getting faster, but not fast enough to keep up with the disks getting denser. More storage per server and per rack also increases the demand for per-rack bandwidth.

1 comment:

David. said...

Wendy Hanamura posted "Filecoin Foundation Grants 50,000 FIL to the Internet Archive" to the Internet Archive's blog:

"Today, the Filecoin Foundation announced a 50,000 FIL grant to the Internet Archive – the largest single donation in the digital library’s 25-year history."

As I write, 50,000 FIL has a notional value of $8,509,500, representing about 0.08% of the circulating supply. This is great news, but note this example:

"The excellent crypto critic Trolly McTrollface (not his real name, if you’re curious) pointed out on Twitter that on Saturday a sale of just 150 bitcoin resulted in a 10 per cent drop in the price."

Thus a sale of 0.0008% (2 orders of magnitude less) of the circulating supply of Bitcoin crashed the price 10%, so the idea that 50K FIL could actually be turned into $8.5M any time soon is an illusion. Even if headlines like Mike Masnick's "Filecoin Foundation Donates $10 Million Worth Of Filecoin To Internet Archive" were not arithmetically incorrect, they would be seriously misleading.
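
The arithmetic behind this comparison can be laid out as a short sketch. It uses only the figures quoted above; the circulating supplies are not stated directly, so the values computed here are merely what the quoted percentages imply.

```python
import math

# Figures quoted in the comment above.
fil_grant = 50_000           # FIL granted
fil_value_usd = 8_509_500    # notional value at the time of writing
fil_share = 0.08 / 100       # grant as a fraction of circulating FIL
btc_sold = 150               # BTC sold in the quoted example
btc_share = 0.0008 / 100     # that sale as a fraction of circulating BTC

# Derived values (implied by the figures above, not independently sourced).
price_per_fil = fil_value_usd / fil_grant
implied_fil_supply = fil_grant / fil_share
implied_btc_supply = btc_sold / btc_share
orders_of_magnitude = math.log10(fil_share / btc_share)

if __name__ == "__main__":
    print(f"notional price per FIL: ${price_per_fil:,.2f}")
    print(f"implied circulating FIL: {implied_fil_supply:,.0f}")
    print(f"implied circulating BTC: {implied_btc_supply:,.0f}")
    print(f"FIL share vs BTC share: {orders_of_magnitude:.1f} orders of magnitude")
```

The grant is a fraction of the FIL supply about 100 times larger than the fraction of the Bitcoin supply whose sale moved the price 10%, which is why the notional value says little about what a sale would realize.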