Thursday, March 23, 2017

Threats to stored data

Recently there's been a lively series of exchanges on the pasig-discuss mail list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank as to any additional risks posed by media such as disks that did encryption or compression. It morphed into discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

Tuesday, March 21, 2017

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that there is much Web content that can currently be ingested and preserved by public Web archives that, in the near future, will become inaccessible. It will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at ways, albeit unsatisfactory, to address these problems.

Friday, March 17, 2017

The Amnesiac Civilization: Part 4

Part 2 and Part 3 of this series covered the unsatisfactory current state of Web archiving. Part 1 of this series briefly outlined the way the W3C's Encrypted Media Extensions (EME) threaten to make this state far worse. Below the fold I expand on the details of this threat.

Wednesday, March 15, 2017

SHA1 is dead

On February 23rd a team from CWI Amsterdam (where I worked in 1982) and Google Research published The first collision for full SHA-1, marking the "death of SHA-1". Using about 6500 CPU-years and 110 GPU-years, they created two different PDF files with the same SHA-1 hash. SHA-1 is widely used in digital preservation, among many other areas, despite having been deprecated by NIST through a process starting in 2005 and becoming official by 2012.

There is an accessible report on this paper by Dan Goodin at Ars Technica. These collisions have already caused trouble for systems in the field, for example for Webkit's Subversion repository. Subversion and other systems use SHA-1 to deduplicate content; files with the same SHA-1 are assumed to be identical. Below the fold, I look at the implications for digital preservation.

Monday, March 13, 2017

The Amnesiac Civilization: Part 3

In Part 2 of this series I criticized Kalev Leetaru's Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web for failing to take into account the cost of doing a better job. Below the fold I ask whether, even with unlimited funds, it would actually be possible to satisfy Leetaru's reasonable-sounding requirements, and whether those requirements would actually solve the problems of Web archiving.

Friday, March 10, 2017

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

Wednesday, March 8, 2017

The Amnesiac Civilization: Part 2

Part 1 of The Amnesiac Civilization predicted that the state of Web archiving would soon get much worse. How bad it is right now and why? Follow me below the fold for Part 2 of the series. I'm planning at least three more parts:
  • Part 3 will assess how practical some suggested improvements might be.
  • Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
  • Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.

Friday, March 3, 2017

The Amnesiac Civilization: Part 1

Those who cannot remember the past are condemned to repeat it
George Santayana: Life of Reason, Reason in Common Sense (1905)
Who controls the past controls the future. Who controls the present controls the past.
George Orwell: Nineteen Eighty-Four (1949)
Santayana and Orwell correctly perceived that societies in which the past is obscure or malleable are very convenient for ruling elites and very unpleasant for the rest of us. It is at least arguable that the root cause of the recent inconveniences visited upon ruling elites in countries such as the US and the UK was inadequate history management. Too much of the population correctly remembered a time in which GDP, the stock market and bankers' salaries were lower, but their lives were less stressful and more enjoyable.

Two things have become evident over the past couple of decades:
  • The Web is the medium that records our civilization.
  • The Web is becoming increasingly difficult to collect and preserve in order that the future will remember its past correctly.
This is the first in a series of posts on this issue. I start by predicting that the problem is about to get much, much worse. Future posts will look at the technical and business aspects of current and future Web archiving. This post is shorter than usual to focus attention on what I believe is an important message

In a 2014 post entitled The Half-Empty Archive I wrote, almost as a throw-away:
The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.
The link was to a post by Cory Doctorow in which he wrote:
We are Huxleying ourselves into the full Orwell.
He clearly understood some aspects of the problem caused by DRM on the Web:
Everyone in the browser world is convinced that not supporting Netflix will lead to total marginalization, and Netflix demands that computers be designed to keep secrets from, and disobey, their owners (so that you can’t save streams to disk in the clear).
Two recent developments got me thinking about this more deeply, and I realized that neither I nor, I believe, Doctorow comprehended the scale of the looming disaster. It isn't just about video and the security of your browser, important as those are. Here it is in as small a nutshell as I can devise.

Almost all the Web content that encodes our history is supported by one or both of two business models: subscription, or advertising. Currently, neither model works well. Web DRM will be perceived as the answer to both. Subscription content, not just video but newspapers and academic journals, will be DRM-ed to force readers to subscribe. Advertisers will insist that the sites they support DRM their content to prevent readers running ad-blockers. DRM-ed content cannot be archived.

Imagine a world in which archives contain no subscription and no advertiser-supported content of any kind.

Notes from FAST17

As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Thursday, March 2, 2017

Injecting Faults in Distributed Storage

I'll record my reactions to some of the papers at the 2017 FAST conference in a subsequent post. But one of them has significant implications for digital preservation systems using distributed storage, and deserves a post to itself. Follow me below the fold as I try to draw out these implications.