Thursday, January 9, 2020

Library of Congress Storage Architecture Meeting

The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop, which took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.

Robert Fontana & Gary Decad

As usual, Fontana and Decad provided their invaluable overview of the storage landscape. Their key points include:
  • [Slide 5] The total amount of storage manufactured each year continues its exponential growth at around 20%/yr. The vast majority (76%) of it is HDD, but the proportion of flash (20%) is increasing. Tape remains a very small proportion (4%).
  • [Slide 12] They contrast this 20% growth in supply with the traditionally ludicrous 40% growth in "demand". Their analysis assumes one byte of storage manufactured in a year represents one byte of data stored in that year, which is not the case (see my 2016 post Where Did All Those Bits Go? for a comprehensive debunking). So their supposed "storage gap" is actually a huge, if irrelevant, underestimate. But they hit the nail on the head with:
    Key Point: HDD 75% of bits and 30% of revenue, NAND 20% of bits and 70% of revenue.
  • [Slide 9] The Kryder rates for NAND Flash, HDD and Tape are comparable;
    $/GB decreases are competitive with all technologies.
    But, as I've been writing since at least 2012's Storage Will Be A Lot Less Free Than It Used To Be, the Kryder rate has decreased significantly from the good old days:
    $/GB decreases are in the 19%/yr range and not the classical Moore’s Law projection of 28%/yr associated with areal density doubling every 2 years
    As my economic model shows, this makes long-term data storage a significantly greater investment.
  • [Slide 11] In 2017 flash was 9.7 times as expensive as HDD; in 2018 the ratio was 9 times. Thus, despite recovering from 2017's supply shortages, flash has not made significant progress in eroding HDD's $/GB advantage. Extrapolating current trends, they project that by 2026 flash will ship more bytes than HDD, but that it will still be 6 times as expensive per byte. So they ask a good question:
    In 2026 is there demand for 7X more manufactured storage annually and is there sufficient value for this storage to spend $122B more annually (2.4X) for this storage?
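Their flash-versus-HDD projection implies a fairly gentle convergence. Here is a quick sanity check; the 9x and 6x ratios are from their slides, while the annualized figure is my own back-of-the-envelope arithmetic:

```python
# If flash is 9x HDD's $/GB in 2018 and projected at 6x in 2026, how
# fast must the gap narrow each year?  Ratios are from Fontana & Decad's
# slides; the per-year rate is my own arithmetic, not theirs.
start_ratio, end_ratio = 9.0, 6.0
years = 2026 - 2018
annual_narrowing = 1 - (end_ratio / start_ratio) ** (1 / years)
print(f"Flash must close the $/GB gap by {annual_narrowing:.1%}/yr")
```

At roughly 5%/yr of relative improvement, flash stays several times more expensive per byte well past 2026, which is why their closing question about who pays the extra $122B matters.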

Jon Trantham

Jon Trantham of Seagate confirmed that, as it has been for a decade, the date for volume shipments of HAMR drives is still slipping in real time; "Seagate is now shipping HAMR drives in limited quantities to lead customers".

His presentation is interesting in that he provides some details of the extraordinary challenges involved in manufacturing HAMR drives, with pictures showing how small everything is:
The height from the bottom of the slider to the top of the laser module is less than 500 µm

The slider will fly over the disk with an air-gap of only 1-2 nm
As usual, I will predict that the industry is far more likely to achieve the 15% CAGR in areal density line on the graph than the 30% line. Note the flatness of the "HDD Product" curve for the last five years or so.
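To see what is at stake between those two lines, note how quickly compound growth diverges. A minimal sketch; the 15% and 30% CAGRs are from the slide, the five-year horizon is my illustrative choice:

```python
# Areal density multiple after compounding the two roadmap growth rates
# on Trantham's graph.  The 5-year horizon is my choice, not Seagate's.
def growth(cagr: float, years: int) -> float:
    return (1 + cagr) ** years

print(f"15%/yr over 5 years: {growth(0.15, 5):.2f}x areal density")
print(f"30%/yr over 5 years: {growth(0.30, 5):.2f}x areal density")
```

The 30% line requires nearly doubling the density gain of the 15% line over just five years, which the flat "HDD Product" curve makes hard to credit.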

Tape

The topic of tape provided a point-counterpoint balance.

Gary Decad and Robert Fontana from IBM made the point that tape's roadmap is highly credible by showing that:
Tape, unlike HDD, has consistently achieved published capacity roadmaps
and that:
For the last 8 years, the ratio of manufactured EB of tape to manufactured EB of HDD has remained constant in the 5.5% range
and that:
Unlike HDD, tape magnetic physics is not the limiting issue since tape bit cells are 60X larger than HDD bit cells ... The projected tape areal density in 2025 (90 Gbit/in2) is 13x smaller than today’s HDD areal density and has already been demonstrated in laboratory environments.
Carl Watts' Issues in Tape Industry needed only a few bullets to make his counterpoint that the risk in tape is not technological:
  • IBM is the last of the hardware manufacturers:
    • IBM is the only builder of LTO8
    • IBM is the only vendor left with enterprise class tape drives
    • If you only have one manufacturer how do you mitigate risk?
  • These cloud archival solutions all use tape:
    • Amazon AWS Glacier and Glacier Deep ($1/TB/month)
    • Azure General Purpose v2 storage Archive ($2/TB/month)
    • Google GCP Coldline ($7/TB/month)
  • If it's all the same tape, how do we mitigate risk?
If, as Decad and Fontana claim:
Tape storage is strategic in public, hybrid, and private “Clouds”
then IBM has achieved a monopoly, which could have implications for tape's cost advantage.

Jon Trantham's presentation also described Seagate's work on robots, similar to tape robots and the Blu-Ray robots developed by Facebook, but containing hard disk cartridges descended from those we studied in 2008's Predicting the Archival Life of Removable Hard Disk Drives. We showed that the bits on the platters had similar life to bits on tape. Of course, tape has the advantage of being effectively a 3D medium where disk is effectively a 2D medium.

Cloud Storage

Amazon, Wasabi and Ceph gave useful marketing presentations. Julian Morley reported on Stanford's transition from in-house tape to cloud storage, with important cost data. I reported previously on the economic modeling Morley used to support this decision.

Cold storage, US$/month:
  • AWS: 1,000 GB $0.99; 10,000 write operations $0.05; 10,000 read operations $0.004; 1 GB retrieval $0.02; early deletion charge within 180 days
  • Azure: 1,000 GB $0.99; 10,000 write operations $0.10; 10,000 read operations $5.00; 1 GB retrieval $0.02; early deletion charge within 180 days
  • Google: 1,000 GB $1.20; 10,000 operations $0.50; 1 GB retrieval $0.12; early deletion charge within 365 days
At The Register, Tim Anderson's Archive storage comes to Google Cloud: Will it give AWS and Azure the cold shoulder? provides a handy comparison of the leading cloud providers' pricing options for archival storage, and concludes:
This table, note, is an over-simplification. The pricing is complex; operations are broken down more precisely than read and write; the exact features vary; and there may be discounts for reserved storage. Costs for data transfer within your cloud infrastructure may be less. The only way to get a true comparison is to specify your exact requirements (and whether the cloud provider can meet them), and work out the price for your particular case.
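To make that warning concrete, here is a toy calculation using the per-unit figures above. The 100 TB archive size and the single full retrieval are my illustrative assumptions; operation charges and early-deletion fees are ignored:

```python
# Toy cost comparison using the per-unit cold-storage figures above.
# Archive size and access pattern are illustrative assumptions only.
TB = 1000  # decimal GB per TB, as cloud providers bill

providers = {            # ($/GB/month storage, $/GB retrieval)
    "AWS":    (0.99 / 1000, 0.02),
    "Azure":  (0.99 / 1000, 0.02),
    "Google": (1.20 / 1000, 0.12),
}

archive_gb = 100 * TB
costs = {}
for name, (store_rate, retrieve_rate) in providers.items():
    costs[name] = (store_rate * archive_gb, retrieve_rate * archive_gb)
    print(f"{name}: ${costs[name][0]:,.2f}/month to store 100 TB, "
          f"${costs[name][1]:,.0f} for one full retrieval")
```

Even in this toy case the providers diverge sharply on retrieval, and a fixity-checking workload with many small objects would be dominated by Azure's $5 per 10,000 reads, exactly the kind of detail the quote warns about.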

DNA

I've been writing enthusiastically about the long-term potential, but skeptically about the medium-term future, of DNA as an archival storage medium for more than seven years. I've always been impressed by the work of the Microsoft/UW team in this field, and Karin Strauss and Luis Ceze's DNA data storage and computation is no exception. It includes details of their demonstration of a complete write-to-read automated system (see also video), and discussion of techniques for performing "big data" computations on data stored in DNA.

Anne Fischer reported on DARPA's research program in Molecular Informatics. One of its antecedents was a DARPA workshop in 2016. Her presentation stressed the diverse range of small molecules that can be used as storage media. I wrote about one non-DNA approach from Harvard last year.

In Cost-Reducing Writing DNA Data I wrote about Catalog's approach, assembling a strand from a library of short sequences of bases. It is a good idea, addressing one of the big deficiencies of DNA as a storage medium, its write bandwidth. But Devin Leake's slides are short on detail, more of an elevator pitch for investment. They start by repeating the ludicrous IDC projection of "bytes generated" and equating it to demand for storage, and in particular archival storage. If you're starting a company you need a much better idea than this of the market you're addressing.
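To make the library-assembly idea concrete, here is a toy illustration. The 2-bit library and the encoding below are my invention for illustration, not Catalog's actual chemistry or scheme:

```python
# Instead of synthesizing arbitrary bases one at a time, assemble each
# strand by joining premade short sequences from a small library.
# This toy library maps 2-bit symbols to 6-base parts (my invention).
LIBRARY = {0b00: "ACGTAC", 0b01: "TGCATG", 0b10: "GATCGA", 0b11: "CTAGCT"}

def encode(data: bytes) -> str:
    parts = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # high-order 2-bit symbols first
            parts.append(LIBRARY[(byte >> shift) & 0b11])
    return "".join(parts)

print(encode(b"\x1b"))  # one byte -> four premade library parts
```

Picking from a library of premade parts trades storage density for write speed, which is precisely the deficiency Catalog is attacking.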

Henry Newman

The good Dr. Pangloss loved Henry Newman's enthusiasm for 5G networking, but I'm a lot more skeptical. It is true that early 5G phones can demo nearly 2Gb/s in very restricted coverage areas in some US cities. But 5G phones are going to be more expensive to buy, more expensive to use, have less battery life, overheat, have less consistent bandwidth and almost non-existent coverage. In return, you get better peak bandwidth, which most people don't use. Customers are already discovering that their existing phone is "good enough". 5G is such a deal!

The reason the carriers are building out 5G networks isn't phones, it is because they see a goldmine in the Internet of Things. But combine 2Gb/s bandwidth with the IoT's notoriously non-existent security, and you have a disaster the carriers simply cannot allow to happen.

The IoT has proliferated for two reasons: the Things are very cheap, and connecting them to the Internet is unregulated, so ISPs cannot impose hassles. But connecting a Thing to the 5G Internet will require a data plan from the carrier, so carriers will be able to impose requirements, and thus costs. Among the requirements will have to be that the Things have UL certification, adequate security, and support including timely software updates for their presumably long connected life. It is precisely the lack of these expensive attributes that has made the IoT so ubiquitous and such a security dumpster-fire!

Fixity

Two presentations discussed fixity checks. Mark Cooper reported on an effort to validate both the inventory and the checksums of part of LC's digital collection. The conclusion was that the automated parts were reliable, the human parts not so much:
  • Content on storage is correct, inventory is not
  • Content custodians working around system limitations, resulting in broken inventory records
  • Content in the digital storage system needs to be understood as potentially dynamic, in particular for presentation and access
  • System needs to facilitate required actions in ways that are logged and versioned
Buzz Hayes from Google explained their recommended technique for performing fixity checks on data in Google's cloud. They provide scripts for the two traditional approaches:
  • Read the data back and hash it, which at scale gets expensive in access and bandwidth charges.
  • Hash the data in the cloud that stores it, which involves trusting the cloud to actually perform the hash rather than simply remember the hash computed at ingest.
I have yet to see a cloud API that implements the technique published by Mehul Shah et al twelve years ago, allowing the data owner to challenge the cloud provider with a nonce, thus forcing it to compute the hash of the nonce and the data at check time. See also my Auditing The Integrity Of Multiple Replicas.
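The technique is simple to sketch. In this minimal illustration the function names and structure are mine, not any cloud API; following Shah et al, the owner verifies against an independent replica:

```python
import hashlib
import os
import secrets

# The provider must actually read the stored data to answer, because a
# fresh nonce makes an ingest-time hash useless for replay.
def provider_response(nonce: bytes, data: bytes) -> str:
    return hashlib.sha256(nonce + data).hexdigest()

# The owner recomputes the same hash over a local replica and compares.
def owner_check(nonce: bytes, response: str, local_copy: bytes) -> bool:
    expected = hashlib.sha256(nonce + local_copy).hexdigest()
    return secrets.compare_digest(response, expected)

data = b"archived object"
nonce = os.urandom(32)  # fresh random challenge for every check
assert owner_check(nonce, provider_response(nonce, data), data)
assert not owner_check(nonce, provider_response(nonce, b"corrupted"), data)
```

The catch is that verification requires comparing against an independent copy, which is why this approach pairs naturally with multi-replica auditing.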

Blockchain

Sharmila Bhatia reported on an initiative by NARA to investigate the potential for blockchain to assist government records management, which concluded:
Authenticity and Integrity
  • Blockchain distributed ledger functionality presents a new way to ensure electronic systems provide electronic record authenticity / integrity.
  • May not help with preservation or long term access and may make these issues more complicated.
It is important to note that what NARA means by "government records" is quite different from what is typically meant by "records", and the legislative framework under which they operate may make applying blockchain technology tricky.

Ben Fino-Radin and Michelle Lee pitched Starling, a startup claiming:
Simplified & coordinated decentralized storage on the Filecoin network
Their slides describe how the technology works, but give no idea of how much it would cost to use. Just as with DNA and other exotic media, the real issue is economic not technical.

I wrote skeptically about the economics of the Filecoin network in The Four Most Expensive Words in the English Language and Triumph Of Greed Over Arithmetic, comparing its possible pricing to Amazon's S3 and S3 RRS. Of course, the numbers would have looked much worse for Filecoin had I compared it with Wasabi's pricing.

A Final Request To The Organizers

This is always a fascinating meeting. But, please, on the call for participation next year make it clear that anyone using projections for "data generated" in their slides as somehow relevant to "data storage" and archival data storage in particular will be hauled off stage by the hook.

2 comments:

David. said...

In 5G Security, Bruce Schneier points out that, even if the telcos were to enforce strict security for 5G-connected Things, we are still screwed:

"Security vulnerabilities in the standards ­the protocols and software for 5G ­ensure that vulnerabilities will remain, regardless of who provides the hardware and software. These insecurities are a result of market forces that prioritize costs over security and of governments, including the United States, that want to preserve the option of surveillance in 5G networks. If the United States is serious about tackling the national security threats related to an insecure 5G network, it needs to rethink the extent to which it values corporate profits and government espionage over security."

Go read the whole post and weep.

David. said...

Chris Mellor reports that Hard disk drive shipments fell 50% between 2012 and 2019 as SSD cannibalized everything except nearline. But note from Fontana's graph above that capacity per drive increased faster than unit shipments decreased, so total bytes shipped still increased.