Thursday, December 21, 2017

Science Friday's "File Not Found"

Science Friday's Lauren Young has a three-part series on digital preservation:
  1. Ghosts In The Reels is about magnetic tape.
  2. The Librarians Saving The Internet is about Web archiving.
  3. Data Reawakening is about the search for a quasi-immortal medium.
Clearly, increasing public attention to the problem of preserving digital information is a good thing, but I have reservations about these posts. Below the fold, I lay them out.

Cory Doctorow's A deep dive into the race to preserve our digital heritage pointed me to the series. He is quoted in it, but even he has reservations:
I think that modern data is actually a lot simpler to preserve than older data, because of the growth in both cloud and local, online storage. The data I had to store on floppies (before I had a hard drive, and then after I got one but before I could afford a drive that was capacious enough to maintain all my data) is vulnerable because it's on media that is slowly decaying and whose reading equipment is getting harder and harder to source.

But once the floppies and cards and tapes and cartridges are read into the primary storage for computers in constant use, it gets a lot more robust. Backing up that data gets easier and easier (I maintain two encrypted hard drives with backups, only one of which is onsite, and which are rotated; as well as an encrypted cloud backup of key data), and running programs that can interpret the data has effectively ceased to be a problem because I can use virtual machines running obsolete operating systems and the original programs to see, copy and manipulate the data.
Precisely. The way you know something is online is that you can immediately copy it. Preservation depends on redundancy, i.e. making copies. Thus online information is inherently much easier to preserve than information locked up in an offline medium.

Ghosts In The Reels

This part focuses on tape, but it gets the problems of tape wrong because it takes a small-scale, individual tape view:
Archivists are “learning now that even if the tape will last 50 years, their tape drives won’t be around that long,” Koski says. This obsolescence “can occur unexpectedly, where data tapes that have been stored in archival vaults outlive the machines that can read or write them.”
The days when it made sense to write a tape and put it in a vault are long gone. Tapes nowadays live in slots in tape robots, and they don't last anything like 50 years before being discarded, as Young writes:
Archives often undergo a migration haul when newer, denser generations of magnetic tape are released, reducing the actual amount of time data remains on a cartridge. Lantz and his research team at IBM recommend that clients migrate their data every couple generations.
But Young gets the reason for the migration wrong:
The rapid turnover of tape technology causes more and more older generations of tape drives to die off and succumb to mechanical failure from extensive use. The data that may remain on the corresponding tape format then becomes lost—unreadable from technology obsolescence.
The reason isn't that the tape drives die off, it is that the slots in the robots are far more valuable than the tapes they hold, so the migration is driven by the economics of increasing the capacity of the robot system by increasing the amount of data each slot holds:
the required migration consumes a lot of bandwidth, meaning that in order to supply the bandwidth needed to ingest the incoming data you need a lot more drives. This reduces the tape/drive ratio, and thus decreases tape's apparent cost advantage. Not to mention that migrating data from tape to tape is far less automated and thus far more expensive than migrating between on-line media such as disk.
That costly, self-perpetuating cycle of data migration is why Dino Everett, film archivist for the University of Southern California, calls LTO “archive heroin—the first taste doesn’t cost much, but once you start, you can’t stop. And the habit is expensive.”
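The slot-economics argument above can be made concrete with a back-of-envelope sketch. The numbers here are illustrative assumptions, not real prices: a hypothetical 10,000-slot library holding LTO-7 (6TB/cartridge) versus LTO-8 (12TB/cartridge):

```python
# Illustrative sketch (assumed numbers, not real pricing): why robot
# slots, not tapes, drive migration. A 10,000-slot library, comparing
# LTO-7 (6 TB/cartridge) with LTO-8 (12 TB/cartridge).
SLOTS = 10_000
LTO7_TB, LTO8_TB = 6, 12

old_capacity = SLOTS * LTO7_TB   # 60,000 TB if every slot holds LTO-7
new_capacity = SLOTS * LTO8_TB   # 120,000 TB after migrating to LTO-8

print(f"Before migration: {old_capacity / 1000:.0f} PB")
print(f"After migration:  {new_capacity / 1000:.0f} PB")
# Doubling what each slot holds doubles the library's capacity without
# buying a bigger robot -- that, not drive mortality, is the economic
# driver of tape-to-tape migration.
```

The same robot footprint and the same slots deliver twice the capacity, which is why the slots are worth far more than the cartridges they hold.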
Tape's longevity, and efforts to maintain obsolete drives in working order, may be interesting but they are basically irrelevant to the big picture of digital preservation.


The Librarians Saving The Internet

This part does deal with the big picture of digital preservation, Web archiving. It focuses on the various data rescue efforts surrounding last year's End-of-Term Crawl:
As the 2016 presidential campaign heated up, digital preservationists as well as scientists and researchers, like [Prof. Debbie] Rabina, began to wonder what might happen to the thousands of datasets and articles published on government domains, particularly the information that did not align with the agenda of President Donald Trump’s administration.

“This administration made it very clear what their priorities are during the election, and they also made it very clear that they are going to proactively withdraw things out of the public domain and disengage from the public when it comes to information,” Rabina says.

Rabina, a former librarian, was compelled to act. After a conference at the New York Academy of Medicine, she and a few colleagues hosted one of the very first pop-up URL harvesting events for the End Of Term Web Archive, a project run by a consortium of universities and institutions that archives all kinds of government information at the end of every four-year presidential term.
It's not just government information:
Already, important keystones of the internet’s history have seemingly evaporated from existence. There are innumerable accounts of disappearing data (big and small), from award-winning web features to online forums to significant digital artifacts that tell the origins of the World Wide Web.
The link in the quoted paragraph is to:
Adrienne LaFrance's piece Raiders of the Lost Web at The Atlantic. It is based on an account of last month's resurrection of a 34-part, Pulitzer-winning newspaper investigation from 2007 of the aftermath of a 1961 railroad crossing accident in Colorado. It vanished from the Web when The Rocky Mountain News folded and survived only because Kevin Vaughan, the reporter, kept a copy on DVD-ROM.
So it might have been worth pointing out that the "award-winning web feature" was in fact rescued. Although, as I pointed out, because it was Flash-based:
Soon, accessing the "coolest part" of the resurrected site will require a virtual machine with a legacy browser.
Which is a quite different problem.

Because she focuses on targeted crawls, Young doesn't mention the rather poor odds that a random Web page will be archived, nor the reason for those odds: Web archiving is critically under-funded.

Data Reawakening

Young notes in the first part that IBM, maker of tape storage systems, expects data to stay on a given tape for only about 10% of the tape's rated lifetime. She devotes the third part to the search for a quasi-immortal storage medium without explaining why, if tape's 50-year life is irrelevant to the actual lifetime of data on tape, a medium with a much longer life would be preferable. She writes:
In order to prevent us from spiraling further into the informational black hole, researchers are on the hunt for ways to immortalize history—a system to eternalize data forever.
She correctly identifies that current media are approaching physical limits:
Currently, data that needs to be preserved for the long term is stored on dense mediums such as magnetic tape. However, there’s a physical limit to how much data can be packed on to today’s storage technologies. “We are reaching the point where the individual grains on our hard disks or the individual components in the flash memory are getting so small that it will have to fail very soon,” says Sander Otte, a fundamental physicist.
And focuses on the potential density of DNA storage:
“Everything that somebody with a computer would be able to access on the internet [without a password], that whole thing can be archived in something that’s a little bigger than a shoebox,” says Karin Strauss, a computer architecture researcher at Microsoft and affiliate professor at the University of Washington. “If you were to do it with any other technology, it would be at least 1,000, if not 10,000, times bigger.”
And, as every journalist writing about DNA storage does, floats off into the hype-sphere:
DNA might just be the key to preserving human history—a tiny molecule that will withstand time as long as there is life on the planet, Strauss says. Technology obsolescence is no longer a threat to data when the information lives on a medium that will always be read.

“Now that we know how to read DNA, it’ll be eternally relevant because we will always have readers to read it,” she says.

The molecule itself does have a shelf life of 6.8 million years if kept at ideal preservation conditions, and ceases to be readable at roughly 1.5 million years due to the decay of the strands. While it lasts much longer than any medium currently used on the market, DNA storage remains a costly option in comparison to other long-term storage, such as magnetic tape, which has a lifespan of 30 to 50 years.
First, it's worth pointing out that "we will always have readers to read it" is recursive. It assumes that the problem of preserving digital information has been solved. The reason we can sequence DNA is that we have very sophisticated equipment built around very sophisticated computer software. The series is about the risk of loss of digital information, such as the software upon which DNA sequencing rests.

Second, one of the traditional risks to digital preservation is format obsolescence. Using DNA as a storage medium involves choosing a format, a way of using the bases to represent bits, and a way to incorporate redundancy to allow error correction. DNA's formats are just as subject to obsolescence as the formats of other media.
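To make the format-obsolescence point concrete, here is a minimal sketch of one possible bits-to-bases "format". It is hypothetical, not any real DNA-storage scheme: 2 bits per base, plus naive triple-repetition redundancy with majority voting. A future reader who sequences the strand but doesn't know this mapping cannot recover the bits, which is exactly the obsolescence risk:

```python
# Hypothetical DNA-storage "format" (illustrative only): 2 bits per
# base, plus a trivial triple-repetition code for error correction.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {b: k for k, b in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    bases = "".join(BITS_TO_BASE[bits[i:i + 2]]
                    for i in range(0, len(bits), 2))
    return bases * 3  # naive redundancy: store three copies

def decode(strand: str) -> bytes:
    n = len(strand) // 3
    copies = [strand[i * n:(i + 1) * n] for i in range(3)]
    # majority vote per position across the three copies
    bases = "".join(max(set(col), key=col.count) for col in zip(*copies))
    bits = "".join(BASE_TO_BITS[b] for b in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

assert decode(encode(b"archive")) == b"archive"
```

Decoding depends entirely on knowing the bit-to-base table and the redundancy scheme; lose that documentation and the strand is just a sequence of bases.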

Third, the series has already established that media life, even one of only 30-50 years, would be relevant only if technological progress stopped. But the whole of this part of the series is a celebration of technological progress.

Fourth, "a costly option" is an understatement that even an Englishman like me can appreciate:
the fact remains that DNA data storage requires a reduction in relative synthesis cost of at least 6 orders of magnitude over the next decade to be competitive with conventional media, and that currently the relative write cost is increasing, not decreasing.
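The scale of that required reduction is easy to underestimate, so here is the arithmetic (my framing, not the quoted source's): six orders of magnitude over a decade means the relative write cost must fall by a constant factor of 10^(6/10), roughly 4x, every single year:

```python
# Back-of-envelope: a 6-order-of-magnitude cost reduction spread
# evenly over 10 years requires a constant annual reduction factor of
# 10 ** (6 / 10), i.e. about 3.98x per year, every year.
required_annual_factor = 10 ** (6 / 10)
print(f"Required annual cost reduction: {required_annual_factor:.2f}x")
# For comparison, disk's historical Kryder rate was roughly 1.4x/year
# (and has since slowed) -- which is why the gap is so daunting,
# especially while the relative write cost is currently increasing.
```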
Fifth, Young shows no understanding of the timescales required to bring a new storage technology into the mass market. Magnetic tape was first used for data storage more than 65 years ago. Disk technology is more than 60 years old. Flash will celebrate its 30th birthday next year. It is in many ways superior to both disk and tape, but it has yet to make a significant impact in the market for long-term bulk data storage. DNA and the other technologies discussed in this part will be lucky if they enter the mass market two decades from now.
