Friday, March 14, 2025

Archival Storage

I gave a talk at the Berkeley I-school's Information Access Seminar entitled Archival Storage. Below the fold is the text of the talk with links to the sources and the slides (with yellow background).

Don't, don't, don't, don't believe the hype!
Public Enemy

Introduction

I'm honored to appear in what I believe is the final series of these seminars. Most of my previous appearances have focused on debunking some conventional wisdom, and this one is no exception. My parting gift to you is to stop you wasting time and resources on yet another seductive but impractical idea — that the solution to storing archival data is quasi-immortal media. As usual, you don't have to take notes. The full text of my talk with the slides and links to the sources will go up on my blog shortly after the seminar.

Backups

Archival data is often confused with backup data. Everyone should back up their data. After nearly two decades working in digital preservation, here is how I back up my four important systems:
  • I run my own mail and Web server. It is on my DMZ network, exposed to the Internet. It is backed up to a Raspberry Pi, also on the DMZ network but not directly accessible from the Internet. There is a full backup once a week and an incremental backup daily. Every week the full and incremental backups for that week are written to two DVD-Rs.
  • My desktop PC creates a full backup on an external hard drive nightly. The drive is one of a cycle of three.
  • I back up my iPhone to my Mac Air laptop every day.
  • I create a Time Machine backup of my Mac Air laptop, which includes the most recent iPhone backup, every day on one of a cycle of three external SSDs.
Each week the DVD-Rs, the current SSD and the current hard drive are moved off-site. Why am I doing all this? In case of disasters such as fire or ransomware I want to be able to recover to a state as close as possible to the one before the disaster. In my case, the worst case is losing no more than one week of data.
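
For concreteness, here is a minimal sketch of a weekly-full-plus-daily-incremental scheme like this one, using GNU tar's --listed-incremental snapshots. It is an illustration, not my actual scripts; the paths and schedule are made up, and it omits the DVD-R burning and off-site rotation.

#!/usr/bin/env python3
"""Minimal sketch of a weekly-full / daily-incremental backup cycle.

Illustrative only: paths and schedule are hypothetical, and the real setup
described above also burns the week's backups to DVD-R and rotates media
off-site, which this sketch does not attempt to reproduce.
"""
import datetime
import pathlib
import subprocess

SOURCE = pathlib.Path("/srv/www")            # hypothetical data to protect
DEST = pathlib.Path("/backups")              # hypothetical backup volume
SNAPSHOT = DEST / "tar-snapshot.snar"        # GNU tar incremental state file

def run_backup(today=None):
    today = today or datetime.date.today()
    is_weekly_full = today.weekday() == 6    # Sunday starts a new cycle
    if is_weekly_full and SNAPSHOT.exists():
        SNAPSHOT.unlink()                    # forgetting old state forces a full backup
    kind = "full" if is_weekly_full else "incr"
    archive = DEST / f"{today.isoformat()}-{kind}.tar.gz"
    subprocess.run(
        ["tar", "--create", "--gzip",
         f"--listed-incremental={SNAPSHOT}", # tar records what changed here
         f"--file={archive}", str(SOURCE)],
        check=True,
    )
    return archive

if __name__ == "__main__":
    print("wrote", run_backup())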

Note the implication that the useful life of backup data is only the time that elapses between the last backup before a disaster and the recovery. Media life span is irrelevant to backup data; that is why backups and archiving are completely different problems.

The fact that the data encoded in magnetic grains on the platters of the three hard drives is good for a quarter-century is interesting but irrelevant to the backup task.

Month  Media  Good  Bad  Vendor
01/04  CD-R   5     0    GQ
05/04  CD-R   5     0    Memorex
02/06  CD-R   5     0    GQ
11/06  DVD-R  5     0    GQ
12/06  DVD-R  1     0    GQ
01/07  DVD-R  4     0    GQ
04/07  DVD-R  3     0    GQ
05/07  DVD-R  2     0    GQ
07/11  DVD-R  4     0    Verbatim
08/11  DVD-R  1     0    Verbatim
05/12  DVD+R  2     0    Verbatim
06/12  DVD+R  3     0    Verbatim
04/13  DVD+R  2     0    Optimum
05/13  DVD+R  3     0    Optimum
I have saved many hundreds of pairs of weekly DVD-Rs but the only ones that are ever accessed more than a few weeks after being written are the ones I use for my annual series of Optical Media Durability Update posts. It is interesting that:
with no special storage precautions, generic low-cost media, and consumer drives, I'm getting good data from CD-Rs more than 20 years old, and from DVD-Rs nearly 18 years old.
But the DVD-R media lifetime is not why I'm writing backups to them. The attribute I'm interested in is that DVD-Rs are write-once; the backup data could be destroyed but it can't be modified.

Note that the good data from 18-year-old DVD-Rs means that consumers have an affordable, effective archival technology. But the market for optical media and drives is dying, killed off by streaming, which suggests that consumers don't really care about archiving their data. Cathy Marshall's 2008 talk It's Like A Fire, You Just Have To Move On vividly describes this attitude. Her subtitle is "Rethinking personal digital archiving".

Archival Data

  • Over time, data falls down the storage hierarchy.
  • Data is archived when it can't earn its keep on near-line media.
  • Lower cost is purchased with longer access latency.
What is a useful definition of archival data? It is data that can no longer earn its keep on readily accessible storage. Thus the fundamental design goal for archival storage systems is to reduce costs by tolerating increased access latency. Data is archived, that is moved to an archival storage system, to save money. Archiving is an economic rather than a technical issue.

[Image: The Clock of the Long Now prototype. By Pkirlin, CC BY-SA 3.0. Source]
How long should the archived data last? The Long Now Foundation is building the Clock of the Long Now, intended to keep time for 10,000 years. They would like to accompany it with a 10,000-year archive. That is at least two orders of magnitude longer than I am talking about here. We are only just over 75 years from the first stored-program computer, so designing a digital archive for a century is a very ambitious goal.

Archival Media

The mainstream media occasionally comes out with an announcement like this from the Daily Mail in 2013. Note the extrapolation from "a 26 second excerpt" to "every film and TV program ever created in a teacup".

Six years later, this is a picture of, as far as I know, the only write-to-read DNA storage drive ever demonstrated. It is from the Microsoft/University of Washington team that has done much of the research in DNA storage. They published it in 2019's Demonstration of End-to-End Automation of DNA Data Storage. It cost about $10K and took 21 hours to write then read 5 bytes.

The technical press is equally guilty. The canonical article about some development in the lab starts with the famous IDC graph projecting the amount of data that will be generated in the future. It goes on to describe the amazing density some research team achieved by writing, say, a gigabyte into their favorite medium in the lab, and how this density could store all the world's data in a teacup for ever. This conveys five false impressions.

Market Size

First, that there is some possibility the researchers could scale their process up to a meaningful fraction of IDC's projected demand, or even to the microscopic fraction of the projected demand that makes sense to archive. There is no such possibility. Archival media is a much smaller market than regular media. In 2018's Archival Media: Not a Good Business I wrote:
Archival-only media such as steel tape, silica DVDs, 5D quartz DVDs, and now DNA face some fundamental business model problems because they function only at the very bottom of the storage hierarchy. The usual diagram of the storage hierarchy, like this one from the Microsoft/UW team researching DNA storage, makes it look like the size of the market increases downwards. But that's very far from the case.
IBM's Georg Lauhoff and Gary M Decad's slide shows that the size of the market in dollar terms decreases downwards. LTO tape is less than 1% of the media market in dollar terms and less than 5% in capacity terms. Archival media are a very small part of the storage market. It is noteworthy that in 2023 Optical Archival (OD-3), the most recent archive-only medium, was canceled for lack of a large enough market. It was a 1TB optical disk, an upgrade from Blu-Ray.

Timescales

Second, that the researcher's favorite medium could make it into the market in the timescale of IDC's projections. Because the reliability and performance requirements of storage media are so challenging, time scales in the storage market are much longer than the industry's marketeers like to suggest.

Take, for example, Seagate's development of the next generation of hard disk technology, HAMR, where research started twenty-six years ago. Nine years later in 2008 they published this graph, showing HAMR entering the market in 2009. Seventeen years later it is only now starting to be shipped to the hyper-scalers. Research on data in silica started fifteen years ago. Research on the DNA medium started thirty-six years ago. Neither is within five years of market entry.

Customers

Third, that even if the researcher's favorite medium did make it into the market, it would be a product that consumers could use. As Kestutis Patiejunas figured out at Facebook more than a decade ago, the systems that surround archival media, rather than the media themselves, are the major cost. Thus the only way to make the economics of archival storage work is to operate at data-center scale, but in warehouse space, harvesting the synergies that come from not needing data-center power, cooling, staffing, etc.

Storage has an analog of Moore's Law called Kryder's Law, which states that over time the density of bits on a storage medium increases exponentially. Given the need to reduce costs at data-center scale, Kryder's Law limits the service life of even quasi-immortal media. As we see with tape robots, where data is routinely migrated to newer, denser media long before its theoretical lifespan, what matters is the economic, not the technical lifespan of a medium.
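
Here is a toy illustration of that arithmetic; the 20% Kryder rate is an assumption for the sake of the example, not a measured figure:

# Illustrative Kryder's Law arithmetic: if density grows at a steady annual
# rate, media bought today soon holds much less per slot, per watt and per
# square foot than media bought later. The 20% rate is an assumed example.

def density_growth(kryder_rate, years):
    """Factor by which capacity per unit cost grows over `years`."""
    return (1.0 + kryder_rate) ** years

for years in (1, 5, 10, 25):
    factor = density_growth(0.20, years)
    print(f"after {years:2d} years: new media holds {factor:5.1f}x as much")
# After 5 years new media holds ~2.5x as much, which is why a working but
# low-density rack gets migrated long before the media itself wears out.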

Hard disks are replaced every five years although the magnetically encoded data on the platters is good for a quarter-century. They are engineered to have a five-year life because Kryder's Law implies that they will be replaced after five years even though they still work perfectly. Seagate actually built drives with 25-year life but found that no-one would pay the extra for the longer life.

The Cloud

Fourth, that anyone either cares or even knows what medium their archived data lives on. Only the hyper-scalers do. Consumers believe their data is safe in the cloud. Why bother backing it up, let alone archiving it, if it is safe anyway? If anyone really cares about archiving they use a service such as Glacier, in which case they definitely have no idea what medium is being used.

Threats

Fifth, that bit rot is the only threat that matters; the idea that with quasi-immortal media you don't need Lots Of Copies to Keep Stuff Safe.

No medium is perfect. They all have a specified Unrecoverable Bit Error Rate (UBER). For example, typical disk UBERs are 10^-15. A petabyte is 8×10^15 bits, so if the drive is within its specified performance you can expect up to 8 errors when reading a petabyte. The specified UBER is an upper limit; you will normally see far fewer. The UBER for LTO-9 tape is 10^-20, so unrecoverable errors on a new tape are very unlikely. But not impossible, and the rate goes up steeply with tape wear.
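
A quick check of that arithmetic:

# Back-of-the-envelope check of the UBER numbers above. The specified UBER
# is an upper bound on unrecoverable read errors per bit read.

PETABYTE_BITS = 8 * 10**15          # 10^15 bytes * 8 bits

def max_expected_errors(bits_read, uber):
    """Upper bound on unrecoverable read errors for a given UBER."""
    return bits_read * uber

print(max_expected_errors(PETABYTE_BITS, 1e-15))   # disk: up to 8 errors per PB read
print(max_expected_errors(PETABYTE_BITS, 1e-20))   # LTO-9: ~8e-5 errors per PB read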

The property that classifies a medium as quasi-immortal is not that its reliability is greater than regular media to start with, although as with tape it may be. It is rather that its reliability decays more slowly than that of regular media. Thus archival systems need to use erasure coding to mitigate both UBER data loss and media failures such as disk crashes and tape wear-out.
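
To make the erasure-coding idea concrete, here is a toy single-parity code of the kind RAID-5 uses: any one shard, data or parity, can be lost and rebuilt from the survivors. Real archival systems use stronger codes such as Reed-Solomon that survive several simultaneous failures.

# Toy erasure code: a single XOR parity shard across k data shards.

def encode(shards):
    """Return a parity shard; all shards must be the same length."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving):
    """Rebuild the one missing shard (data or parity) from the survivors."""
    return encode(surviving)      # XOR is its own inverse

data = [b"archival", b"storage ", b"is cheap"]   # three 8-byte data shards
parity = encode(data)
rebuilt = recover([data[0], data[2], parity])    # lose shard 1, rebuild it
assert rebuilt == data[1]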

Another reason for needing erasure codes is that media errors are not the only ones needing mitigation. What matters is the reliability the system delivers to the end user. Research has shown that the majority of end user errors come from layers of the system above the actual media.

The archive may contain personally identifiable or other sensitive data. If so, the data on the medium must be encrypted. This is a double-edged sword, because the encryption key becomes a single point of failure; its loss or corruption renders the entire archive inaccessible. So you need Lots Of Copies to keep the key safe. But the more copies the greater the risk of key compromise.
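
One standard mitigation for this dilemma, not discussed above but worth noting, is threshold secret sharing: split the key into n shares so that any k of them recover it but fewer than k reveal nothing. Here is a toy sketch of Shamir's scheme over a prime field:

# Toy Shamir secret sharing: split a key into n shares, any k recover it.
# A sketch of a standard technique, not something described in the talk.
import secrets

PRIME = 2**521 - 1          # a Mersenne prime comfortably larger than any key

def split(secret, k, n):
    """Return n points on a random degree-(k-1) polynomial with f(0)=secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x=0 recovers the secret from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = int.from_bytes(secrets.token_bytes(32), "big")   # e.g. an AES-256 key
shares = split(key, k=3, n=5)                          # any 3 of 5 suffice
assert combine(shares[1:4]) == key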

Media such as silica, DNA, quartz DVDs, steel tape and so on address bit rot, which is only one of the threats to which long-lived data is subject. Clearly a single copy on such media, even if erasure coded, is still subject to threats including fire, flood, earthquake, ransomware, and insider attacks. Thus even an archive needs to maintain multiple copies. This greatly increases the cost, bringing us back to the economic threat.

Archival Storage Systems

At Facebook Patiejunas built rack-scale systems, each holding 10,000 100GB optical disks for a Petabyte per rack. Writable Blu-Ray disks are about 80 cents each, so the media to fill the rack would cost about $8K. This is clearly much less than the cost of the robotics and the drives.

Let's drive this point home with another example. An IBM TS4300 LTO tape robot starts at $20K. Two 20-packs of tape cartridges to fill it cost about $4K, so the media is about 16% of the total system capex. The opex for the robot includes power, cooling, space, staff and an IBM maintenance contract. The opex for the tapes is essentially zero.

The media is an insignificant part of the total lifecycle cost of storing archival data on tape. What matters for the economic viability of an archival storage system is minimizing the total system cost, not the cost of the media. No-one is going to spend $24K on a rack-mount tape system from IBM to store 720TB for their home or small business. The economics only work at data-center scale.
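
Here is the capex arithmetic from these two examples in one place; the assumption that the cartridges are 18TB LTO-9 is mine, consistent with the 720TB figure above.

# The capex arithmetic from the two examples above, using the rough list
# prices quoted in the text rather than current quotes.

def media_fraction(media_cost, system_cost):
    return media_cost / (media_cost + system_cost)

# Facebook-style Blu-ray rack: 10,000 x 100GB disks at ~$0.80 each.
rack_media = 10_000 * 0.80                     # ~$8K of media per petabyte rack

# IBM TS4300 base unit: ~$20K robot plus two 20-packs of cartridges at ~$4K,
# assumed here to be 40 x 18TB LTO-9 for 720TB of raw capacity.
ts4300 = media_fraction(media_cost=4_000, system_cost=20_000)

print(f"Blu-ray media per PB rack: ${rack_media:,.0f}")
print(f"TS4300 media share of capex: {ts4300:.1%}   capacity: {40 * 18} TB")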

The reason why this focus on media is a distraction is that the fundamental problem of digital preservation is economic, not technical. No-one wants to pay for preserving data that isn't earning its keep, which is pretty much the definition of archived data. The cost per terabyte of the medium is irrelevant; what drives the economic threat is the capital and operational cost of the system. Take tape, for example. The media capital cost is low, but the much higher system capital cost includes the drives and the robotics. Then there are the operational costs of the data center space, power, cooling and staff. It is only by operating at data-center scale, and thus amortizing the capital and operational costs over very large amounts of data, that the system costs per terabyte can be made competitive.

Operating at data center scale, as Patiejunas discovered and Microsoft understands, means that one of the parameters that determines the system cost is write bandwidth. Each of Facebook's racks wrote 12 optical disks in parallel almost continuously. Filling the rack would take over 800 times as long as writing a single disk. At the 8x write speed it takes 22.5 minutes to fill a disk, so it would take around 18,750 minutes, or about two weeks, to fill the rack. It isn't clear how many racks Facebook needed writing simultaneously to keep up with the flow of user-generated content, but it was likely enough to fill a reasonable-size warehouse. Similarly, it would take about 8.5 days to fill the base model TS4300.
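
Here is the fill-time arithmetic. The TS4300 figure assumes the drives stream at roughly LTO-9's 1GB/s compressed rate; that is my assumption, not a vendor-quoted number.

# Fill-time arithmetic for the two systems above.

MIN_PER_DISK = 22.5                  # 100GB Blu-ray at 8x write speed
DISKS_PER_RACK = 10_000
PARALLEL_WRITERS = 12

rack_minutes = (DISKS_PER_RACK / PARALLEL_WRITERS) * MIN_PER_DISK
print(f"Blu-ray rack: {rack_minutes:,.0f} min = {rack_minutes / 60 / 24:.1f} days")
# -> ~18,750 minutes, about 13 days, i.e. roughly two weeks

TAPE_CAPACITY_TB = 720               # 40 x 18TB cartridges (assumed LTO-9)
TAPE_RATE_GB_S = 1.0                 # assumed aggregate write rate, GB/s
tape_seconds = TAPE_CAPACITY_TB * 1_000 / TAPE_RATE_GB_S
print(f"TS4300: {tape_seconds / 86_400:.1f} days")
# -> ~8.3 days, matching the "about 8.5 days" above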

Project Silica

I wrote about Microsoft's Project Silica a year ago, in Microsoft's Archival Storage Research. It uses femtosecond lasers to write data into platters of silica. Like Facebook's, the prototype Silica systems are data-center size:
A Silica library is a sequence of contiguous write, read, and storage racks interconnected by a platter delivery system. Along all racks there are parallel horizontal rails that span the entire library. We refer to a side of the library (spanning all racks) as a panel. A set of free roaming robots called shuttles are used to move platters between locations.
...
A read rack contains multiple read drives. Each read drive is independent and has slots into which platters are inserted and removed. The number of shuttles active on a panel is limited to twice the number of read drives in the panel. The write drive is full-rack-sized and writes multiple platters concurrently.
Their performance evaluation focuses on the ability to respond to read requests within 15 hours. Their cost evaluation, like Facebook's, focuses on the savings from using warehouse-type space to house the equipment, although it isn't clear that they have actually done so. The rest of their cost evaluation is somewhat hand-wavy, as is natural for a system that isn't yet in production:
The Silica read drives use polarization microscopy, which is a commoditized technique widely used in many applications and is low-cost. Currently, system cost in Silica is dominated by the write drives, as they use femtosecond lasers which are currently expensive and used in niche applications. ... As the Silica technology proliferates, it will drive up the demand for femtosecond lasers, commoditizing the technology.
I'm skeptical of "commoditizing the technology". Archival systems are a niche in the IT market, and one on which companies are loath to spend money. Realistically, there aren't going to be a vast number of Silica write heads. The only customers for systems like Silica are the large cloud providers, who will be reluctant to commit their archives to technology owned by a competitor. Unless a mass-market application for femtosecond lasers emerges, the scope for cost reduction is limited.

But the more I think about this technology, which is still in the lab, the more I think it probably has the best chance of impacting the market among all the rival archival storage technologies. Not great, but better than its competitors:
  • The media is very cheap and very dense, so the effect of Kryder's Law economics driving media replacement and thus its economic rather than technical lifetime is minimal.
  • The media is quasi-immortal and survives benign neglect, so opex once written is minimal.
  • The media is write-once, and the write and read heads are physically separate, so the data cannot be encrypted or erased by malware. The long read latency makes exfiltrating large amounts of data hard.
  • The robotics are simple and highly redundant. Any of the shuttles can reach any of the platters. They should be much less troublesome than tape library robotics because, unlike in a tape library, a robot failure renders only a small fraction of the library inaccessible and is easily repaired.
  • All the technologies needed are in the market now; the only breakthroughs needed are economic, not technological.
  • The team has worked on improving the write bandwidth which is a critical issue for archival storage at scale. They can currently write hundreds of megabytes a second.
  • Like Facebook's archival storage technologies, Project Silica enjoys the synergies of data center scale without needing full data center environmental and power resources.
  • Like Facebook's technologies, Project Silica has an in-house customer, Azure's archival storage, with a need for a product like this.
The expensive part of the system is the write head. It is an entire rack using femtosecond lasers, which start at around $50K. The eventual system's economics will depend upon the progress made in cost-reducing the lasers.

Retrieval

The Svalbard archipelago is where I spent the summer of 1969 doing a geological survey.

The most important part of an archiving strategy is knowing how you will get stuff out of the archive. Putting stuff in and keeping it safe are important and relatively easy, but if you can't get stuff out when you need it what's the point?

In some cases access is only needed to a small proportion of the archive. At Facebook, Patiejunas expected that the major reason for access would be to respond to a subpoena. In other cases, such as migrating to a new archival system, bulk data retrieval is required.

But if the reason for needing access is disaster recovery, it is important to have a vision of what resources are likely to be available after the disaster. Microsoft gained a lot of valuable PR by encoding much of the world's open source software in QR codes on film and storing the cans of film in an abandoned coal mine in Svalbard so it would "survive the apocalypse". In Seeds Or Code? I had a lot of fun imagining how the survivors of the apocalypse would be able to access the archive.

[Image: The voyage]
To make a long story short, after even a mild apocalypse, they wouldn't be able to. Let's just point out that the first steps after the apocalypse are getting to Svalbard. They won't be able to fly to LYR. As the crow flies, the voyage from Tromsø is 591 miles across very stormy seas. It takes several days, and getting to Tromsø won't be easy either.

Archival Storage Services

Because technologies have very strong economies of scale, the economics of most forms of IT work in favor of the hyper-scalers. These forces are especially strong for archival data, both because it is almost pure cost with no income, and because as I discussed earlier the economics of archival storage only work at data-center scale. It will be the rare institution that can avoid using cloud archival storage. I analyzed the way these economic forces operate in 2019's Cloud For Preservation:
Much of the attraction of cloud technology for organizations, especially public institutions funded through a government's annual budget process, is that they transfer costs from capital to operational expenditure. It is easy to believe that this increases financial flexibility. As regards ingest and dissemination, this may be true. Ingesting some items can be delayed to the next budget cycle, or the access rate limit lowered temporarily. But as regards preservation, it isn't true. It is unlikely that parts of the institution's collection can be de-accessioned in a budget crunch, only to be re-accessioned later when funds are adequate. Even were the content still available to be re-ingested, the cost of ingest is a significant fraction of the total life-cycle cost of preserving digital content.
Cloud services typically charge differently for ingest, storage and retrieval. The service's goal in designing their pricing structure is to create lock-in, by analogy with the drug-dealer's algorithm "the first one's free". In 2019 I used the published rates to compute the cost of ingesting in a month, storing for a year, and retrieving in a month a petabyte using the archive services of the three main cloud providers. Here is that table, with the costs adjusted for inflation to 2024 using the Bureau of Labor Statistics' calculator:
Archival Services
Service In Store Out Total Lock-in
AMZN Glacier $2,821 $60,182 $69,260 $132,263 13.8
GOOG Coldline $4,514 $105,319 $105,144 $214,977 12.0
MSFT Archive $7,962 $30,091 $20,387 $58,440 8.1
The "Lock-in" column is the approximate number of months of storage cost that getting the Petabyte out in a month represents. Note that:
  • In all cases getting data out is much more expensive than putting it in.
  • The lower cost of archival storage compared to the same service's near-line storage is purchased at the expense of a much stronger lock-in.
  • Since the whole point of archival storage is keeping data for the long term, the service will earn much more in storage charges over the longer life of archival data than the shorter life of near-line data.
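
For clarity, here is how the Lock-in column is computed from the table:

# The Lock-in column above: the cost of getting the petabyte out, expressed
# in months of storage charges.

def lock_in_months(out_cost, annual_storage):
    return out_cost / (annual_storage / 12)

services_2019 = {                    # (Store, Out) from the inflation-adjusted table
    "AMZN Glacier":  (60_182, 69_260),
    "GOOG Coldline": (105_319, 105_144),
    "MSFT Archive":  (30_091, 20_387),
}
for name, (store, out) in services_2019.items():
    print(f"{name:14s} lock-in = {lock_in_months(out, store):4.1f} months")
# -> 13.8, 12.0 and 8.1 months respectively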
There may well be reasons why retrieving data from archival storage is expensive. Most storage technologies have unified read/write heads, so retrieval competes with ingest which, as Patiejunas figured out, is the critical performance parameter for archival storage. This is because, to minimize cost, archival systems are designed assuming bulk retrieval is rare. When it happens, whether from a user request or to migrate data to new media, it is disruptive. For example, emptying a base model TS4300 occupies it for more than a week.

Six years later, things have changed significantly. Here is the current version of the archival services table:
Archival Services
Service In Store Out Total Lock-in
AMZN Glacier Deep Archive $500 $10,900 $49,550 $60,950 50.0
GOOG Archive $500 $13,200 $210,810 $224,510 175.6
MSFT Archive $100 $22,000 $40,100 $62,200 20.0
Points to note:
  • Glacier is the only one of the three that is significantly cheaper in real terms than it was 6 years ago.
  • Glacier can do this because Kryder's Law has made their storage a factor of about 6 cheaper in real terms in six years, or about a 35% Kryder rate. This is somewhat faster than the rate of areal density increase of tape, and much faster than that of disk. The guess is that Glacier Deep Archive is on tape.
  • Google's pricing indicates they aren't serious about the archival market.
  • Archive services now have differentiated tiers of service. This table uses S3 Deep Archive, Google Archive and Microsoft Archive.
  • Lock-in has increased from 13.8/12.0/8.1 to 50/175/20. It is also increased by additional charges for data lifetimes less than a threshold, 180/365/180 days. So my cost estimate for Google is too low, because the data would suffer these charges. But accounting for this would skew the comparison.
  • Bandwidth charges are a big factor in lock-in. For Amazon they are 77%, for Google they are 38%, for Microsoft they are 32%. Amazon's marketing is smart, hoping you won't notice the outbound bandwidth charges.
Looking at these numbers it is hard to see how anyone can justify any archive storage other than S3 Deep Archive. It is the only one delivering Kryder's Law to the customer, and as my economic model shows, delivering Kryder's Law is essential to affordable long-term storage. A petabyte for a decade costs under $120K before taking Kryder's Law into account and you can get it all out for under $50K.
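
To illustrate why delivering Kryder's Law matters, here is a sketch comparing a decade of Glacier Deep Archive storage at a flat price with the same decade if the price keeps falling at roughly the 35% Kryder rate above. The continued decline is an assumption for illustration, not a prediction.

# A decade of storing a petabyte: flat pricing versus a ~35%/year decline.

ANNUAL_STORE = 10_900        # $/PB-year, from the table above
KRYDER_RATE = 0.35           # assumed annual price decline

flat = sum(ANNUAL_STORE for _ in range(10))
falling = sum(ANNUAL_STORE / (1 + KRYDER_RATE) ** year for year in range(10))

print(f"flat pricing:   ${flat:,.0f}")      # ~$109K, the "under $120K" above
print(f"35%/yr decline: ${falling:,.0f}")   # roughly $40K over the decade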

LOCKSS

[Image: The original LOCKSS logo]
The fundamental idea behind LOCKSS was that, given a limited budget and a realistic range of threats, data would survive better in many cheap, unreliable, loosely-coupled replicas than in a single expensive, durable one.

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)
Brian Wilson, CTO of BackBlaze, pointed out eleven years ago that in their long-term storage environment "Double the reliability is only worth 1/10th of 1 percent cost increase". And thus that the moral of the story was "design for failure and buy the cheapest components you can".

In other words, don't put all your eggs in one basket.
