Don't, don't, don't, don't believe the hype!
Public Enemy
Introduction
I'm honored to appear in what I believe is the final series of these seminars. Most of my previous appearances have focused on debunking some conventional wisdom, and this one is no exception. My parting gift to you is to stop you wasting time and resources on yet another seductive but impractical idea: that the solution to storing archival data is quasi-immortal media. As usual, you don't have to take notes. The full text of my talk with the slides and links to the sources will go up on my blog shortly after the seminar.
Backups
Archival data is often confused with backup data. Everyone should back up their data. After nearly two decades working in digital preservation, here is how I back up my four important systems:
- I run my own mail and Web server. It is on my DMZ network, exposed to the Internet. It is backed up to a Raspberry Pi, also on the DMZ network but not directly accessible from the Internet. Once a week there is a full backup, and daily an incremental backup. Every week the full and incremental backups for the week are written to two DVD-Rs.
- My desktop PC creates a full backup on an external hard drive nightly. The drive is one of a cycle of three.
- I back up my iPhone to my Mac Air laptop every day.
- I create a Time Machine backup of my Mac Air laptop, which includes the most recent iPhone backup, every day on one of a cycle of three external SSDs.
Note the implication that the useful life of backup data is only the time that elapses between the last backup before a disaster and the recovery. Media life span is irrelevant to backup data; that is why backups and archiving are completely different problems.
The fact that the data encoded in magnetic grains on the platters of the three hard drives is good for a quarter-century is interesting but irrelevant to the backup task.
| Month | Media | Good | Bad | Vendor |
|---|---|---|---|---|
| 01/04 | CD-R | 5x | 0 | GQ | 
| 05/04 | CD-R | 5x | 0 | Memorex | 
| 02/06 | CD-R | 5x | 0 | GQ | 
| 11/06 | DVD-R | 5x | 0 | GQ | 
| 12/06 | DVD-R | 1x | 0 | GQ | 
| 01/07 | DVD-R | 4x | 0 | GQ | 
| 04/07 | DVD-R | 3x | 0 | GQ | 
| 05/07 | DVD-R | 2x | 0 | GQ | 
| 07/11 | DVD-R | 4x | 0 | Verbatim | 
| 08/11 | DVD-R | 1x | 0 | Verbatim | 
| 05/12 | DVD+R | 2x | 0 | Verbatim | 
| 06/12 | DVD+R | 3x | 0 | Verbatim | 
| 04/13 | DVD+R | 2x | 0 | Optimum | 
| 05/13 | DVD+R | 3x | 0 | Optimum | 
With no special storage precautions, generic low-cost media, and consumer drives, I'm getting good data from CD-Rs more than 20 years old, and from DVD-Rs nearly 18 years old.
But the DVD-R media lifetime is not why I'm writing backups to them. The attribute I'm interested in is that DVD-Rs are write-once; the backup data could be destroyed but it can't be modified.
Note that the good data from 18-year-old DVD-Rs means that consumers have an affordable, effective archival technology. But the market for optical media and drives is dying, killed off by streaming, which suggests that consumers don't really care about archiving their data. Cathy Marshall's 2008 talk It's Like A Fire, You Just Have To Move On vividly describes this attitude. Her subtitle is "Rethinking personal digital archiving".
Archival Data
- Over time, data falls down the storage hierarchy.
- Data is archived when it can't earn its keep on near-line media.
- Lower cost is purchased with longer access latency.
How long should the archived data last? The Long Now Foundation is building the Clock of the Long Now, intended to keep time for 10,000 years. They would like to accompany it with a 10,000-year archive. That is at least two orders of magnitude longer than I am talking about here. We are only just over 75 years from the first stored-program computer, so designing a digital archive for a century is a very ambitious goal.
Archival Media
The mainstream media occasionally comes out with an announcement like this from the Daily Mail in 2013. Note the extrapolation from "a 26 second excerpt" to "every film and TV program ever created in a teacup".
Six years later, this is a picture of, as far as I know, the only write-to-read DNA storage drive ever demonstrated. It is from the Microsoft/University of Washington team that has done much of the research in DNA storage. They published it in 2019's Demonstration of End-to-End Automation of DNA Data Storage. It cost about $10K and took 21 hours to write then read 5 bytes.
The technical press is equally guilty. The canonical article about some development in the lab starts with the famous IDC graph projecting the amount of data that will be generated in the future. It goes on to describe the amazing density some research team achieved by writing, say, a gigabyte into their favorite medium in the lab, and how this density could store all the world's data in a teacup forever. This conveys five false impressions.
Market Size
First, that there is some possibility the researchers could scale their process up to a meaningful fraction of IDC's projected demand, or even to the microscopic fraction of the projected demand that makes sense to archive. There is no such possibility. Archival media is a much smaller market than regular media. In 2018's Archival Media: Not a Good Business I wrote:
Archival-only media such as steel tape, silica DVDs, 5D quartz DVDs, and now DNA face some fundamental business model problems because they function only at the very bottom of the storage hierarchy. The usual diagram of the storage hierarchy, like this one from the Microsoft/UW team researching DNA storage, makes it look like the size of the market increases downwards. But that's very far from the case.
IBM's Georg Lauhoff and Gary M Decad's slide shows that the size of the market in dollar terms decreases downwards. LTO tape is less than 1% of the media market in dollar terms and less than 5% in capacity terms. Archival media are a very small part of the storage market. It is noteworthy that in 2023 Optical Archival (OD-3), the most recent archive-only medium, was canceled for lack of a large enough market. It was a 1TB optical disk, an upgrade from Blu-Ray.
Timescales
Second, that the researcher's favorite medium could make it into the market in the timescale of IDC's projections. Because the reliability and performance requirements of storage media are so challenging, time scales in the storage market are much longer than the industry's marketeers like to suggest.
Take, for example, Seagate's development of the next generation of hard disk technology, HAMR, where research started twenty-six years ago. Nine years later in 2008 they published this graph, showing HAMR entering the market in 2009. Seventeen years later it is only now starting to be shipped to the hyper-scalers. Research on data in silica started fifteen years ago. Research on the DNA medium started thirty-six years ago. Neither is within five years of market entry.
Customers
Third, that even if the researcher's favorite medium did make it into the market it would be a product that consumers could use. As Kestutis Patiejunas figured out at Facebook more than a decade ago, because the systems that surround archival media rather than the media themselves are the major cost, the only way to make the economics of archival storage work is to do it at data-center scale but in warehouse space and harvest the synergies that come from not needing data-center power, cooling, staffing, etc.
Storage has an analog of Moore's Law called Kryder's Law, which states that over time the density of bits on a storage medium increases exponentially. Given the need to reduce costs at data-center scale, Kryder's Law limits the service life of even quasi-immortal media. As we see with tape robots, where data is routinely migrated to newer, denser media long before its theoretical lifespan, what matters is the economic, not the technical lifespan of a medium.
Hard disks are replaced every five years although the magnetically encoded data on the platters is good for a quarter-century. They are engineered to have a five-year life because Kryder's Law implies that they will be replaced after five years even though they still work perfectly. Seagate actually built drives with 25-year life but found that no-one would pay the extra for the longer life.
The Cloud
Fourth, that anyone either cares or even knows what medium their archived data lives on. Only the hyper-scalers do. Consumers believe their data is safe in the cloud. Why bother backing it up, let alone archiving it, if it is safe anyway? If anyone really cares about archiving they use a service such as Glacier, when they definitely have no idea what medium is being used.
Threats
Fifth, that bit rot is the only threat that matters; the idea that with quasi-immortal media you don't need Lots Of Copies to Keep Stuff Safe.
No medium is perfect. They all have a specified Unrecoverable Bit Error Rate (UBER). For example, typical disk UBERs are 10⁻¹⁵. A petabyte is 8×10¹⁵ bits, so if the drive is within its specified performance you can expect up to 8 errors when reading a petabyte. The specified UBER is an upper limit; you will normally see far fewer. The UBER for LTO-9 tape is 10⁻²⁰, so unrecoverable errors on a new tape are very unlikely. But not impossible, and the rate goes up steeply with tape wear.
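The UBER arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes a decimal petabyte (10^15 bytes), treats the specified UBER as an upper bound on errors per bit read, and the function name is made up for illustration.

```python
# Upper bound on unrecoverable bit errors when reading a given amount of data,
# assuming the medium performs exactly at its specified UBER.
BITS_PER_BYTE = 8
PETABYTE_BYTES = 10**15  # decimal petabyte, matching the 8*10^15 bits in the text


def expected_errors(bytes_read: int, uber: float) -> float:
    """Bits read multiplied by the specified unrecoverable bit error rate."""
    return bytes_read * BITS_PER_BYTE * uber


disk_errors = expected_errors(PETABYTE_BYTES, 1e-15)  # typical disk UBER
tape_errors = expected_errors(PETABYTE_BYTES, 1e-20)  # LTO-9 UBER quoted above

print(f"disk: up to {disk_errors:.0f} errors per petabyte read")
print(f"tape: up to {tape_errors:.5f} errors per petabyte read")
```

The five orders of magnitude between the two figures describe new media only; neither number says anything about how fast reliability decays with age or wear.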
The property that classifies a medium as quasi-immortal is not that its reliability is greater than regular media to start with, although as with tape it may be. It is rather that its reliability decays more slowly than that of regular media. Thus archival systems need to use erasure coding to mitigate both UBER data loss and media failures such as disk crashes and tape wear-out.
Another reason for needing erasure codes is that media errors are not the only ones needing mitigation. What matters is the reliability the system delivers to the end user. Research has shown that the majority of end user errors come from layers of the system above the actual media.
The archive may contain personally identifiable or other sensitive data. If so, the data on the medium must be encrypted. This is a double-edged sword, because the encryption key becomes a single point of failure; its loss or corruption renders the entire archive inaccessible. So you need Lots Of Copies to keep the key safe. But the more copies the greater the risk of key compromise.
Media such as silica, DNA, quartz DVDs, steel tape and so on address bit rot, which is only one of the threats to which long-lived data is subject. Clearly a single copy on such media, even if erasure coded, is still subject to threats including fire, flood, earthquake, ransomware, and insider attacks. Thus even an archive needs to maintain multiple copies. This greatly increases the cost, bringing us back to the economic threat.
Archival Storage Systems
At Facebook Patiejunas built rack-scale systems, each holding 10,000 100GB optical disks for a Petabyte per rack. Writable Blu-Ray disks are about 80 cents each, so the media to fill the rack would cost about $8K. This is clearly much less than the cost of the robotics and the drives.
Let's drive this point home with another example. An IBM TS4300 LTO tape robot starts at $20K. Two 20-pack tape cartridges to fill it cost about $4K, so the media is about 17% of the total system capex. The opex for the robot includes power, cooling, space, staff and an IBM maintenance contract. The opex for the tapes is essentially zero.
The media is an insignificant part of the total lifecycle cost of storing archival data on tape. What matters for the economic viability of an archival storage system is minimizing the total system cost, not the cost of the media. No-one is going to spend $24K on a rack-mount tape system from IBM to store 720TB for their home or small business. The economics only work at data-center scale.
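To make the capex split concrete, here is a small sketch using only the figures quoted above ($20K robot, $4K of tape, 720TB, 10,000 disks at 80 cents). The helper name is hypothetical, and opex is deliberately left out; including it would shrink the media share further.

```python
# Media cost as a fraction of total capital cost, using the talk's figures.
def media_share(system_cost: float, media_cost: float) -> float:
    """Fraction of total capex (system plus media) spent on the media."""
    return media_cost / (system_cost + media_cost)


# IBM TS4300 example: $20K robot, about $4K of cartridges, 720TB capacity.
tape_share = media_share(system_cost=20_000, media_cost=4_000)
tape_capex_per_tb = (20_000 + 4_000) / 720

# Facebook optical rack: 10,000 Blu-Ray disks at about 80 cents each.
bluray_media_cost = 10_000 * 0.80

print(f"tape media share of capex: {tape_share:.1%}")
print(f"tape capex per TB: ${tape_capex_per_tb:.2f}")
print(f"Blu-Ray media to fill a rack: ${bluray_media_cost:,.0f}")
```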
The reason why this focus on media is a distraction is that the fundamental problem of digital preservation is economic, not technical. No-one wants to pay for preserving data that isn't earning its keep, pretty much the definition of archived data. The cost per terabyte of the medium is irrelevant; what drives the economic threat is the capital and operational cost of the system. Take tape for example. The media capital cost is low, but the much higher system capital cost includes the drives and the robotics. Then there are the operational costs of the data center space, power, cooling and staff. It is only by operating at data-center scale and thus amortizing the capital and operational costs over very large amounts of data that the system costs per terabyte can be made competitive.
Operating at data center scale, as Patiejunas discovered and Microsoft understands, means that one of the parameters that determines the system cost is write bandwidth. Each of Facebook's racks wrote 12 optical disks in parallel almost continuously. Filling the rack's 10,000 disks 12 at a time takes over 800 times as long as writing a single disk. At the 8x write speed it takes 22.5 minutes to fill a disk, so it would take around 18,750 minutes to fill the rack, or about two weeks. It isn't clear how many racks Facebook needed simultaneously doing this to keep up with the flow of user-generated content, but it was likely enough to fill a reasonable-size warehouse. Similarly, it would take about 8.5 days to fill the base model TS4300.
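The rack fill time follows directly from the numbers in the paragraph above. The sketch below assumes, as the text does, that the 12 drives write continuously with no time lost to robotics or disk swaps; the implied aggregate write bandwidth is derived from those same figures rather than taken from any Facebook source.

```python
# Time to fill one optical rack, using the figures quoted in the talk.
DISKS_PER_RACK = 10_000
PARALLEL_WRITERS = 12
MINUTES_PER_DISK = 22.5   # one 100GB disk at 8x write speed
DISK_BYTES = 100 * 10**9  # 100GB, decimal

write_rounds = DISKS_PER_RACK / PARALLEL_WRITERS  # just over 800
total_minutes = write_rounds * MINUTES_PER_DISK
total_days = total_minutes / (60 * 24)

# Aggregate write bandwidth implied by 12 disks finishing every 22.5 minutes.
rack_bytes_per_second = PARALLEL_WRITERS * DISK_BYTES / (MINUTES_PER_DISK * 60)

print(f"write rounds: {write_rounds:.0f}")
print(f"fill time: {total_minutes:,.0f} minutes, about {total_days:.1f} days")
print(f"implied write bandwidth: {rack_bytes_per_second / 1e6:.0f} MB/s")
```

Any real system would be slower than this, since the calculation ignores load and unload time, verification passes and drive failures.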
Project Silica
I wrote about Microsoft's Project Silica a year ago, in Microsoft's Archival Storage Research. It uses femtosecond lasers to write data into platters of silica. Like Facebook's, the prototype Silica systems are data-center size:
A Silica library is a sequence of contiguous write, read, and storage racks interconnected by a platter delivery system. Along all racks there are parallel horizontal rails that span the entire library. We refer to a side of the library (spanning all racks) as a panel. A set of free roaming robots called shuttles are used to move platters between locations.
...
A read rack contains multiple read drives. Each read drive is independent and has slots into which platters are inserted and removed. The number of shuttles active on a panel is limited to twice the number of read drives in the panel. The write drive is full-rack-sized and writes multiple platters concurrently.
The Silica read drives use polarization microscopy, which is a commoditized technique widely used in many applications and is low-cost. Currently, system cost in Silica is dominated by the write drives, as they use femtosecond lasers which are currently expensive and used in niche applications. ... As the Silica technology proliferates, it will drive up the demand for femtosecond lasers, commoditizing the technology.
But the more I think about this technology, which is still in the lab, the more I think it probably has the best chance of impacting the market among all the rival archival storage technologies. Not great, but better than its competitors:
- The media is very cheap and very dense, so the effect of Kryder's Law economics driving media replacement and thus its economic rather than technical lifetime is minimal.
- The media is quasi-immortal and survives benign neglect, so opex once written is minimal.
- The media is write-once, and the write and read heads are physically separate, so the data cannot be encrypted or erased by malware. The long read latency makes exfiltrating large amounts of data hard.
- The robotics are simple and highly redundant. Any of the shuttles can reach any of the platters. They should be much less troublesome than tape library robotics because, unlike tape, a robot failure only renders a small fraction of the library inaccessible and is easily repaired.
- All the technologies needed are in the market now, the only breakthroughs needed are economic, not technological.
- The team has worked on improving the write bandwidth which is a critical issue for archival storage at scale. They can currently write hundreds of megabytes a second.
- Like Facebook's archival storage technologies, Project Silica enjoys the synergies of data center scale without needing full data center environmental and power resources.
- Like Facebook's technologies, Project Silica has an in-house customer, Azure's archival storage, with a need for a product like this.
Retrieval
The Svalbard archipelago is where I spent the summer of 1969 doing a geological survey.
The most important part of an archiving strategy is knowing how you will get stuff out of the archive. Putting stuff in and keeping it safe are important and relatively easy, but if you can't get stuff out when you need it what's the point?
In some cases access is only needed to a small proportion of the archive. At Facebook, Patiejunas expected that the major reason for access would be to respond to a subpoena. In other cases, such as migrating to a new archival system, bulk data retrieval is required.
But if the reason for needing access is disaster recovery it is important to have a vision of what resources are likely to be available after the disaster. Microsoft gained a lot of valuable PR by encoding much of the world's open source software in QR codes on film and storing the cans of film in an abandoned coal mine in Svalbard so it would "survive the apocalypse". In Seeds Or Code? I had a lot of fun imagining how the survivors of the apocalypse would be able to access the archive.
To make a long story short, after even a mild apocalypse, they wouldn't be able to. Let's just point out that the first step after the apocalypse is getting to Svalbard. They won't be able to fly to LYR. As the crow flies, the voyage from Tromsø is 591 miles across very stormy seas. It takes several days, and getting to Tromsø won't be easy either.
Archival Storage Services
Because technologies have very strong economies of scale, the economics of most forms of IT work in favor of the hyper-scalers. These forces are especially strong for archival data, both because it is almost pure cost with no income, and because as I discussed earlier the economics of archival storage only work at data-center scale. It will be the rare institution that can avoid using cloud archival storage. I analyzed the way these economic forces operate in 2019's Cloud For Preservation:
Much of the attraction of cloud technology for organizations, especially public institutions funded through a government's annual budget process, is that they transfer costs from capital to operational expenditure. It is easy to believe that this increases financial flexibility. As regards ingest and dissemination, this may be true. Ingesting some items can be delayed to the next budget cycle, or the access rate limit lowered temporarily. But as regards preservation, it isn't true. It is unlikely that parts of the institution's collection can be de-accessioned in a budget crunch, only to be re-accessioned later when funds are adequate. Even were the content still available to be re-ingested, the cost of ingest is a significant fraction of the total life-cycle cost of preserving digital content.
Archival Services

| Service | In | Store | Out | Total | Lock-in |
|---|---|---|---|---|---|
| AMZN Glacier | $2,821 | $60,182 | $69,260 | $132,263 | 13.8 | 
| GOOG Coldline | $4,514 | $105,319 | $105,144 | $214,977 | 12.0 | 
| MSFT Archive | $7,962 | $30,091 | $20,387 | $58,440 | 8.1 | 
- In all cases getting data out is much more expensive than putting it in.
- The lower cost of archival storage compared to the same service's near-line storage is purchased at the expense of a much stronger lock-in.
- Since the whole point of archival storage is keeping data for the long term, the service will earn much more in storage charges over the longer life of archival data than the shorter life of near-line data.
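The text does not define the Lock-in column. One reading that reproduces the figures in the table above is the cost of getting the data out, expressed in months of storage charges, with the Store column taken to cover 12 months. Both the interpretation and the 12-month period are assumptions, not something the source states; the sketch below simply checks that reading against the table.

```python
# Assumption: Lock-in = egress cost / monthly storage cost, where the Store
# column is taken to cover 12 months. The source does not spell this out.
MONTHS_COVERED = 12  # hypothetical; chosen because it reproduces the table

# (In, Store, Out) in dollars, copied from the table above.
services = {
    "AMZN Glacier": (2_821, 60_182, 69_260),
    "GOOG Coldline": (4_514, 105_319, 105_144),
    "MSFT Archive": (7_962, 30_091, 20_387),
}


def total_cost(cost_in: int, store: int, out: int) -> int:
    """The Total column: ingest plus storage plus egress."""
    return cost_in + store + out


def lock_in_months(store: int, out: int, months: int = MONTHS_COVERED) -> float:
    """Egress cost expressed as months of storage charges."""
    return out / (store / months)


for name, (cost_in, store, out) in services.items():
    print(f"{name}: total ${total_cost(cost_in, store, out):,}, "
          f"lock-in {lock_in_months(store, out):.1f} months")
```

Under this reading, leaving Glacier costs about as much as keeping the data there for another 14 months.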
Six years later, things have changed significantly. Here is the current version of the archival services table:
Archival Services

| Service | In | Store | Out | Total | Lock-in |
|---|---|---|---|---|---|
| AMZN Glacier Deep Archive | $500 | $10,900 | $49,550 | $60,950 | 50.0 | 
| GOOG Archive | $500 | $13,200 | $210,810 | $224,510 | 175.6 | 
| MSFT Archive | $100 | $22,000 | $40,100 | $62,200 | 20.0 | 
- Glacier is the only one of the three that is significantly cheaper in real terms than it was 6 years ago.
- Glacier can do this because Kryder's Law has made their storage about a factor of 6 cheaper in real terms in six years, or about a 35% Kryder rate. This is somewhat faster than the rate of areal density increase of tape, and much faster than that of disk. The guess is that Glacier Deep Archive is on tape.
- Google's pricing indicates they aren't serious about the archival market.
- Archive services now have differentiated tiers of service. This table uses S3 Deep Archive, Google Archive and Microsoft Archive.
- Lock-in has increased from 13.8/12.0/8.1 to 50/175/20. It is also increased by additional charges for data lifetimes less than a threshold, 180/365/180 days. So my cost estimate for Google is too low, because the data would suffer these charges. But accounting for this would skew the comparison.
- Bandwidth charges are a big factor in lock-in. For Amazon they are 77%, for Google they are 38%, for Microsoft they are 32%. Amazon's marketing is smart, hoping you won't notice the outbound bandwidth charges.
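The Kryder rate quoted in the Glacier bullet above is the compound annual rate implied by a total cost ratio. A minimal sketch, taking the factor of 6 over six years as given (the function name is illustrative):

```python
# Compound annual rate implied by a total improvement factor over some years.
def annual_rate(factor: float, years: float) -> float:
    """Annual rate r such that (1 + r) ** years == factor."""
    return factor ** (1 / years) - 1


glacier_rate = annual_rate(factor=6, years=6)
print(f"implied Kryder rate: {glacier_rate:.1%} per year")

# Sanity check: compounding the annual rate recovers the original factor.
recovered_factor = (1 + glacier_rate) ** 6
print(f"compounded over six years: {recovered_factor:.2f}x")
```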
LOCKSS
The fundamental idea behind LOCKSS was that, given a limited budget and a realistic range of threats, data would survive better in many cheap, unreliable, loosely-coupled replicas than in a single expensive, durable one.
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
In other words, don't put all your eggs in one basket.
9 comments:
My show-and-tell for this talk was a 26-year-old 340MB IBM microdrive that, until I re-formatted it last week, still held the Linux system I wrote to it 26 years ago. Microdrives were hard disks in the Compact Flash 2 form factor. I recently found a bag with seven microdrives, in capacities ranging from 340MB to 4GB. Two of the 340MB drives had failed, the others still work fine.
At the current Library of Congress "Designing Storage Architectures" meeting there was a report on the nascent state of the Library's investigation of DNA storage. This sparked an interesting discussion of the business of quasi-immortal media. The Library is interested because of their long-term mandate and the on-going cost of media migration (every 7 years or so for tape) and hardware migration (20 years for tape robots). The difficulty is that these migrations make money for the storage industry. Quasi-immortal media disrupt this cash flow, so the existing players will suppress it. Because of the customers' time horizons and scale, they need large, credible vendors. But those vendors don't want to supply a technology that disrupts their existing business. And the startups aren't credible.
Library of Congress' effort to migrate their preservation storage to AWS is finding a roughly one-in-a-million rate of hash mismatches.
In his talk at DSA David Borland of Wasabi focused on a point that I should have made in the "Cloud" section of the talk. It is extraordinarily difficult to budget for cloud storage in general because everything you do is charged, not just storage. I mentioned egress fees and minimum object lifetimes, but there are API call fees, minimum chargeable object sizes, and many others. Many cloud storage users report blowing their budgets because they didn't predict the additional fees, which can easily exceed the raw storage cost. My crude estimates of archival storage costs are likely way too low for all services.
One takeaway from the DSA talks is that the longer-term growth projections for the supply side of storage and its technologies are implausible, not because the required densities cannot be achieved, but because neither the resources needed in terms of power, water, minerals and so on, nor the investments needed in manufacturing and data centers, will be available. Combine this with the rapidly decreasing ROI of investments in training LLMs that will reduce the demand side of the equation. Technologies progress on S-curves and the only one that still has a lot of runway on the steep part of the theoretical curve is tape. But much of the meeting featured reports from customers unhappy with the real-world performance of LTO-9 and skeptical of LTO-10.
Elizabeth Braw's We need to pay closer attention to Svalbard outlines another potential threat to the cans of film in the mine.
Piql/the Arctic World Archive could put a duplicate repository in Nunavut in order to mitigate geopolitical risk. However, Piql has also said that piqlFilm is rated for 1,000 years at room temperature, so really they could put a backup location anywhere. If the refrigeration fails, the estimated longevity will drop from 2,000 years to 1,000 years, but it might be more important to have a more accessible location anyway.
The Asianometry YouTube channel's The Incredible Femtosecond Laser is a fascinating review of how technology progressed to be able to emit such short light pulses.
Mark Tyson reports on yet another archival storage technology in Holographic ribbon aims to oust magnetic tape with 50-year life span and 200TB capacity per cartridge — HoloMem says optical ribbon-based carts work with some components of existing systems, reducing friction:
"According to the inventors of HoloMem, their new cold storage technology offers far greater capacity than magnetic tape, with a much longer shelf life, and “zero energy storage” costs. HoloMem carts can fit up to 200TB, which is more than 11x the capacity of LTO-10 magnetic tape. Also, the optical-based new tech’s touted 50-year life is 10x the life of magnetic tape."
And:
"The firm claims that a HoloDrive can be integrated into a legacy cold storage system “with minimal hardware and software disruption.” This allows potential customers to phase-in HoloMem use, reducing the chance of abrupt transition issues. Moreover, its LTO-sized cartridges can be transported by a storage library’s robot transporters with no change."