Henry Newman's Industry Trends
Henry has been a fixture at these meetings since 2001 and this was to be his last. He started out explaining why "it takes longer than it takes" using the example of Jim Gray's 2006 presentation Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King. The whole presentation is in his backup slides and is well worth reading.
He used Q2 2021 data from Trendfocus to show that, in the enterprise space, HDD shipments were 247.61EB versus SSD shipments of 35.79EB. For SSDs to displace HDDs in the enterprise would require increasing production by a factor of nearly 7. There is no way to justify the enormous investment in flash fabs that would require.
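The arithmetic behind that factor can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the two Trendfocus figures quoted above; the variable names are mine, and the second ratio (SSD output needed to cover the combined capacity) is my own addition, not something from the talk.

```python
# Enterprise shipments in exabytes, Q2 2021, as quoted above from Trendfocus.
HDD_EB = 247.61
SSD_EB = 35.79

# How many times larger HDD shipments were than SSD shipments.
hdd_to_ssd_ratio = HDD_EB / SSD_EB

# Assumption (not from the talk): multiple of Q2 2021 SSD output needed
# to ship the combined HDD + SSD capacity entirely as flash.
total_to_ssd_ratio = (HDD_EB + SSD_EB) / SSD_EB

print(f"HDD/SSD shipment ratio: {hdd_to_ssd_ratio:.2f}")
print(f"(HDD+SSD)/SSD ratio:    {total_to_ssd_ratio:.2f}")
```

The first ratio comes out a little under 7, which matches the "factor of nearly 7" above.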
Georg Lauhoff's Recent Storage Landscape – Tape, HDD, NAND
presentation to the 2009 edition of this meeting, and this bias wasn't helped by Trantham using the numbers IDC pulls out of thin air for his "gee-whiz" 5th slide about the amount of data in 2025.
He did make some interesting points:
- Data continues to shift to the cloud; however, we are now seeing more data kept at the edge
- Content distribution and the cost & latency of networking are key drivers
- Most of cloud data is stored on large-capacity nearline hard disk drives
Seagate Announced the first NVMe HDD at OCP last November
CDUs will be available in Mid-2024 in Single and Dual-Port SKUs
Moving all storage to the NVMe interface is an interesting trend.
To do Seagate justice, their slides have much more meat to them than Paul Peck's HDD Storage in the Zettabyte Era from Western Digital, which is almost content-free.
Zef Delgadillo's Digital Preservation: Solutions in Google Cloud
Delgadillo discussed my long-time bête noire, namely how a digital preservation system using cloud storage can confirm the fixity of content, i.e. that the content is unchanged from when it was submitted.
the gsutil stat command provides a strongly consistent way to check for the existence (and read the metadata) of an object.
CRC32C and Installing crcmod says:
gsutil automatically performs integrity checks on all uploads and downloads. Additionally, you can use the gsutil hash command to calculate a CRC for any local file.
hash - Calculate file hashes says:
Calculate hashes on local files, which can be used to compare with gsutil ls -L output. If a specific hash option is not provided, this command calculates all gsutil-supported hashes for the files.
Note that gsutil automatically performs hash validation when uploading or downloading files, so this command is only needed if you want to write a script that separately checks the hash.
Note this command hashes the local files, not the files in the cloud. Uploading and downloading are performed by the cp command. cp - Copy files and objects says:
At the end of every upload or download, the gsutil cp command validates that the checksum it computes for the source file matches the checksum that the service computes. If the checksums do not match, gsutil deletes the corrupted object and prints a warning message.
If you know the MD5 of a file before uploading, you can specify it in the Content-MD5 header, which enables the cloud storage service to reject the upload if the MD5 doesn't match the value computed by the service.
The problem is that the application has no way of knowing when or even whether the hashes in the metadata were computed. The Content-MD5 header tells the storage service what the MD5 is; it could discard the content and respond correctly to the gsutil stat command by remembering only the MD5. The preservation application would discover this only when it tried to download the content. Or some background process in the cloud could have computed the MD5 and validated the content against the original MD5 some time ago, which doesn't validate the content now.
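For readers who have not met it, the Content-MD5 header discussed here carries the base64 encoding of the raw 16-byte MD5 digest (the form defined in RFC 1864), not the familiar hex string. A minimal sketch of computing that value locally follows; the function names are mine, and how the value is then attached to an upload is deliberately left out.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the base64-encoded MD5 digest of data (the Content-MD5 form)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def content_md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Same value for a file on disk, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


print(content_md5(b"hello, preservation"))
```

Note that this only proves what the application sent. It says nothing about what the service holds later, which is the point of the objection above.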
Ideally, the storage service API would have a command that supplied an application-generated random nonce to be prepended to the content, forcing the service to hash it at the time of the command. Anything less requires the application to either (a) trust the storage service, which is not appropriate for preservation audits, or (b) regularly download the entire stored content to hash it locally, which costs too much in bandwidth charges to be practical. Absent a nonce, the API should at least provide a timestamp at which it claims to have computed the hashes.
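To make the nonce idea concrete, here is a small simulation. Everything in it is hypothetical: no cloud service I know of offers a `challenge` call, the two service classes are stand-ins, and the scheme of precomputing nonce/digest pairs before upload is just one assumed way for the application to check answers without keeping the content.

```python
import hashlib
import secrets


def challenge_digest(nonce: bytes, content: bytes) -> str:
    """SHA-256 over the nonce prepended to the content."""
    return hashlib.sha256(nonce + content).hexdigest()


def precompute_challenges(content: bytes, count: int) -> list:
    """Before upload, bank (nonce, expected digest) pairs; each is used once."""
    pairs = []
    for _ in range(count):
        nonce = secrets.token_bytes(32)
        pairs.append((nonce, challenge_digest(nonce, content)))
    return pairs


class HonestService:
    """Hypothetical service that keeps the content and hashes it on demand."""

    def __init__(self, content: bytes):
        self._content = content

    def challenge(self, nonce: bytes) -> str:
        return challenge_digest(nonce, self._content)


class LazyService:
    """Hypothetical service that dropped the content and kept only a hash."""

    def __init__(self, content: bytes):
        self._stored = hashlib.sha256(content).hexdigest()

    def challenge(self, nonce: bytes) -> str:
        # Without the content it cannot fold in the nonce; it replays the old hash.
        return self._stored


def audit(service, pair) -> bool:
    """Spend one banked pair: the answer must match the expected digest."""
    nonce, expected = pair
    return service.challenge(nonce) == expected


content = b"bytes submitted for preservation"
bank = precompute_challenges(content, 2)
print(audit(HonestService(content), bank[0]))  # honest service passes
print(audit(LazyService(content), bank[1]))    # remembered hash fails
```

The nonce is what defeats the remembered-hash shortcut: a digest computed before the nonce existed cannot match.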
Storage Services
There followed presentations from a range of other preservation storage services, which were mostly vanilla marketing:
New – Additional Checksum Algorithms for Amazon S3, that you could now use SHA-1, SHA-256, CRC-32, and CRC-32C as well as MD5 as checksums and that GetObjectAttributes had been added to the API to get all the available attributes in one call. Note that this API, like gsutil, returns the checksums with no nonce or evidence of when they were validated.
DNA
The last session I want to discuss featured five presentations on storing data in molecules:
- Microsoft/UW. Karin Strauss pointed to their recent publications, some of which I've already covered.
- Catalog. In Part 1 I wrote:
Three years ago I reported on Catalog, who encode data not in individual bases, but in short strands of pre-synthesized DNA. The idea is to sacrifice ultimate density for write speed. David Turek reported that, by using conventional ink-jet heads to print successive strands on dots on a polymer tape, they have demonstrated writing at 1Mb/s.
- Datacule, which encodes data in fluorescent dyes. The combination of dyes encodes multiple bits (8) in a printed dot (press). The abstract of Storing and Reading Information in Mixtures of Fluorescent Molecules by Amit A. Nagarkar et al reads:
This work shows that digital data can be stored in mixtures of fluorescent dye molecules, which are deposited on a surface by inkjet printing, where an amide bond tethers the dye molecules to the surface. A microscope equipped with a multichannel fluorescence detector distinguishes individual dyes in the mixture. The presence or absence of these molecules in the mixture encodes binary information (i.e., “0” or “1”). The use of mixtures of molecules, instead of sequence-defined macromolecules, minimizes the time and difficulty of synthesis and eliminates the requirement of sequencing. We have written, stored, and read a total of approximately 400 kilobits (both text and images) with greater than 99% recovery of information, written at an average rate of 128 bits/s (16 bytes/s) and read at a rate of 469 bits/s (58.6 bytes/s).
The presentation claims 5Mbit/in² areal density.
- Twist Bioscience, who showed a very realistic "reality check" on their progress. They are working towards a goal of a 2D array on a chip that synthesizes 1TB of DNA at a time, which is then washed into a vial for storage; they are working on a 62.5GB chip. The vials are packed into the standard biochem trays, for 1,536TB for the largest tray.
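Some quick arithmetic puts the Datacule and Twist figures quoted above in perspective. It is a rough sketch only: it assumes decimal units (1 kilobit = 1000 bits, 1TB = 1000GB) and one 1TB synthesis run per vial, and the variable names are mine.

```python
# Datacule figures from the abstract quoted above.
TOTAL_BITS = 400_000       # "approximately 400 kilobits", assuming 1000 bits each
WRITE_BITS_PER_S = 128
READ_BITS_PER_S = 469

write_minutes = TOTAL_BITS / WRITE_BITS_PER_S / 60
read_minutes = TOTAL_BITS / READ_BITS_PER_S / 60

# Twist figures from the talk, assuming 1TB = 1000GB and one 1TB run per vial.
GOAL_CHIP_GB = 1000
WORKING_CHIP_GB = 62.5
LARGEST_TRAY_TB = 1536
TB_PER_VIAL = 1

chip_scale_up = GOAL_CHIP_GB / WORKING_CHIP_GB
vials_per_tray = LARGEST_TRAY_TB // TB_PER_VIAL

print(f"Datacule: about {write_minutes:.0f} min to write, {read_minutes:.0f} min to read")
print(f"Twist: goal chip is {chip_scale_up:.0f}x the 62.5GB chip; "
      f"{vials_per_tray} vials in the largest tray")
```

On these assumptions the whole Datacule demonstration took under an hour to write, and the Twist goal is a 16-fold step up from the chip they are working on.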