Jonah Edwards, who runs the Core Infrastructure team, gave a presentation on the Internet Archive's storage infrastructure to the Archive's staff. Below the fold, some details and commentary.
Among the highlights:
- 750 servers, some up to 9 years old
- 1,300 VMs
- 30K storage devices
- >20K spinning disks (in paired storage), a mix of 4, 8, 12, and 16TB drives; about 40% of the bytes are on 16TB drives
- almost 200PB of raw storage
- the archive is growing by >25%/yr
- adding 10-12PB of raw storage per quarter
- using only 16TB drives, a single copy would need 15 racks
- currently running ~75 racks
- currently serving about 55GB/s, planning for ~80GB/s soon
- Fiber cuts
- power quality issues
- power outages
- Items in the archive are directories on disk
- basic unit of storage is the disk
- disks are replicated across datacenters
- content is served from all (=both?) copies
Another issue is that the servers in the Archive's racks provide both the storage and the processing needed. The CPUs are getting faster, but not fast enough to keep up with the disks getting denser. More storage per server and per rack also increases the demand for per-rack bandwidth.
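To get a feel for the per-rack bandwidth point, here is a back-of-envelope sketch using only the figures quoted above. It rests on assumptions of mine, not statements from the talk: decimal units (1PB = 1000TB), that the "almost 200PB" of raw storage is exactly two copies, and that traffic is spread evenly across racks. The function names are purely illustrative.

```python
import math

# Figures quoted in the talk
RAW_PB = 200               # "almost 200PB of raw storage" (rounded up)
DRIVE_TB = 16              # largest drive size mentioned
RACKS_PER_COPY_16TB = 15   # racks needed for one copy on 16TB drives
CURRENT_RACKS = 75         # "~75 racks"
SERVE_GB_PER_S = {"current": 55, "planned": 80}

# Assumption, not from the talk: "paired storage" means exactly two copies
COPIES = 2


def drives_per_copy(raw_pb, copies, drive_tb):
    """Drives of size drive_tb needed to hold one copy (1 PB = 1000 TB)."""
    copy_tb = raw_pb * 1000 / copies
    return math.ceil(copy_tb / drive_tb)


def per_rack_gb_per_s(total_gb_per_s, racks):
    """Average serving load per rack, assuming an even spread of traffic."""
    return total_gb_per_s / racks


drives = drives_per_copy(RAW_PB, COPIES, DRIVE_TB)
print(f"16TB drives per copy: {drives}")
print(f"drives per rack:      {drives / RACKS_PER_COPY_16TB:.0f}")

dense_racks = COPIES * RACKS_PER_COPY_16TB  # both copies on 16TB drives
for label, load in SERVE_GB_PER_S.items():
    now = per_rack_gb_per_s(load, CURRENT_RACKS)
    dense = per_rack_gb_per_s(load, dense_racks)
    print(f"{label}: {now:.2f} GB/s per rack today, "
          f"{dense:.2f} GB/s per rack if consolidated")
```

Under these assumptions, consolidating the same data into 30 racks instead of ~75 would multiply the average load each rack has to serve by 2.5, before any growth in traffic. Real load is unlikely to be evenly spread, so treat the output as an order-of-magnitude illustration only.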