Tuesday, August 22, 2017

Economic Model of Long-Term Storage

Cost vs. Kryder rate
As I wrote last month in Patting Myself On The Back, I started working on economic models of long-term storage six years ago. I got a small amount of funding from the Library of Congress; when that ran out I transferred the work to students at UC Santa Cruz's Storage Systems Research Center. This work was published here in 2012 and in later papers (see here).

What I wanted was a rough-and-ready Web page that would allow interested people to play "what if" games. What the students wanted was something academically respectable enough to get them credit. So the models accumulated lots of interesting details.

But the details weren't actually useful. The extra realism they provided was swamped by the uncertainty from the "known unknowns" of the future Kryder and interest rates. So I never got the rough-and-ready Web page. Below the fold, I bring the story up-to-date and point to a little Web site that may be useful.

Tuesday, August 8, 2017

Approaching The Physical Limits

As storage media technology gets closer and closer to the physical limits, progress on reducing the $/GB number slows down. Below the fold, a recap of some of these issues for both disk and flash.

Thursday, August 3, 2017

Preservation Is Not A Technical Problem

As I've always said, preserving the Web and other digital content for posterity is an economic problem. With an unlimited budget, collection and preservation aren't a problem. The reason we're collecting and preserving less than half the classic Web of quasi-static linked documents, and much less of "Web 2.0", is that no-one has the money to do much better.

The budgets of libraries and archives, the institutions tasked with acting as society's memory, have been under sustained attack for a long time. I'm working on a talk and I needed an example. So I drew this graph of the British Library's annual income in real terms (year 2000 pounds). It shows that the Library's income has declined by almost 45% in the last decade.

Memory institutions that can purchase only half what they could 10 years ago aren't likely to greatly increase funding for acquiring new stuff; it's going to be hard for them just to keep the stuff (and the staff) they already have.

Below the fold, the data for the graph and links to the sources.

Tuesday, August 1, 2017

Disk media market update

It's time for an update on the disk media market, based on reporting from The Register's Chris Mellor here, here and here.

Thursday, July 27, 2017

Decentralized Long-Term Preservation

Lambert Heller is correct to point out that:
name allocation using IPFS or a blockchain is not necessarily linked to the guarantee of permanent availability, the latter must be offered as a separate service.
Storage isn't free, and thus the "separate services" need to have a viable business model. I have demonstrated that increasing returns to scale mean that the "separate service" market will end up being dominated by a few large providers just as, for example, the Bitcoin mining market is. People who don't like this conclusion often argue that, at least for long-term preservation of scholarly resources, the service will be provided by a consortium of libraries, museums and archives. Below the fold I look into how this might work.

Tuesday, July 25, 2017

Initial Coin Offerings

The FT's Alphaville blog has started a new series, called ICOmedy, looking at the insanity surrounding Initial Coin Offerings (ICOs). The blockchain hype has created an even bigger opportunity to separate the fools from their money than the dot-com era did. To motivate you to follow the series, below the fold are some extracts and related links.

Thursday, July 20, 2017

Patting Myself On The Back

Cost vs. Kryder rate
I started working on economic models of long-term storage six years ago, and quickly discovered the effect shown in this graph. It plots the endowment, the money which, deposited with the data and invested at interest, pays for the data to be stored "forever", as a function of the Kryder rate, the rate at which $/GB drops with time. As the rate slows below about 20%, the endowment needed rises rapidly. Back in early 2011 it was widely believed that 30-40% Kryder rates were a law of nature; they had been that way for 30 years. Thus, if you could afford to store data for the next few years, you could afford to store it forever.
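To make the shape of that curve concrete, here is a minimal sketch of the endowment calculation in Python, with illustrative numbers of my own choosing rather than the model's actual parameters: the endowment is the present value of a stream of annual storage costs that decline at the Kryder rate and are discounted at the interest rate.

    # Minimal sketch of the endowment calculation. All numbers are
    # illustrative assumptions, not the published model's parameters.
    def endowment(annual_cost, kryder_rate, interest_rate, years=100):
        """Present value of `years` of storage costs (a proxy for "forever")."""
        total = 0.0
        for t in range(years):
            cost_t = annual_cost * (1.0 - kryder_rate) ** t    # media cost falls each year
            total += cost_t / (1.0 + interest_rate) ** t       # discount to present value
        return total

    # Storing a unit of data at an assumed $100/year initial cost, 2% real interest.
    for kryder in (0.40, 0.30, 0.20, 0.10, 0.05):
        print(f"Kryder rate {kryder:.0%}: endowment ${endowment(100, kryder, 0.02):,.0f}")

With these assumed numbers the endowment more than triples as the Kryder rate falls from 40% to 10%, and keeps climbing steeply below that, which is the effect the graph shows.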

2014 cost/byte projection
As it turned out, 2011 was a good time to work on this issue. That October, floods in Thailand destroyed 40% of the world's disk manufacturing capacity, and disk prices spiked. Preeti Gupta at UC Santa Cruz reviewed disk pricing in 2014 and we produced this graph. I wrote at the time:
The red lines are projections at the industry roadmap's 20% and a less optimistic 10%. [The graph] shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Backblaze average $/GB
Thanks to Backblaze's admirable transparency, we have 3 years more data. Their blog reports on their view of disk pricing as a bulk purchaser over many years. It is far more detailed than the data Preeti was able to work with. Eyeballing the graph, we see a 2013 price around 5c/GB and a 2017 price around half that. A 10% Kryder rate would have meant a 2017 price of 3.2c/GB, and a 20% rate would have meant 2c/GB, so the out-turn lies between the two red lines on our graph. It is difficult to make predictions, especially about the future. But Preeti and I nailed this one.
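For the record, the arithmetic is just four years of compound decline from the eyeballed 2013 price; a quick check, assuming that 5c/GB starting point:

    # Quick check of the projection arithmetic, assuming an eyeballed
    # 5c/GB price in 2013 and four years of compound decline to 2017.
    start_2013 = 5.0  # cents/GB
    for rate in (0.10, 0.20):
        price_2017 = start_2013 * (1.0 - rate) ** 4
        print(f"{rate:.0%} Kryder rate -> {price_2017:.2f}c/GB in 2017")

which matches the 3.2c/GB and 2c/GB figures above, give or take rounding.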

This is a big deal. As I've said many times:
Storage will be
Much less free
Than it used to be
The real cost of a commitment to store data for the long term is much greater than most people believe, and there is no realistic prospect of a technological discontinuity that would change this.

Tuesday, July 11, 2017

Is Decentralized Storage Sustainable?

There are many reasons to dislike centralized storage services. They include business risk, as we see in le petit musée des projets Google abandonnés, monoculture vulnerability and rent extraction. There is thus naturally a lot of enthusiasm for decentralized storage systems, such as MaidSafe, DAT and IPFS. In 2013 I wrote about one of their advantages in Moving vs. Copying. Among the enthusiasts is Lambert Heller. Since I posted Blockchain as the Infrastructure for Science, Heller and I have been talking past each other. Heller is talking technology; I have some problems with the technology but they aren't that important. My main problem is an economic one that applies to decentralized storage irrespective of the details of the technology.

Below the fold is an attempt to clarify my argument. It is a re-statement of part of the argument in my 2014 post Economies of Scale in Peer-to-Peer Networks, specifically in the context of decentralized storage networks.

Thursday, July 6, 2017

Archive vs. Ransomware

Archives perennially ask the question "how few copies can we get away with?"
This is a question I've blogged about in 2016 and 2011 and 2010, when I concluded:
  • The number of copies needed cannot be discussed except in the context of a specific threat model.
  • The important threats are not amenable to quantitative modeling.
  • Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".
I've also written before about the immensely profitable business of ransomware. Recent events, such as WannaCrypt, NotPetya and the details of NSA's ability to infect air-gapped computers should convince anyone that ransomware is a threat to which archives are exposed. Below the fold I look into how archives should be designed to resist this credible threat.

Thursday, June 29, 2017

"to promote the progress of useful Arts"

This is just a quick note to say that anyone who believes the current patent and copyright systems are working "to promote the progress of useful Arts" needs to watch Bunnie Huang's talk to the Stanford EE380 course, and read Bunnie's book The Hardware Hacker. Below the fold, a brief explanation.

Tuesday, June 27, 2017

Wall Street Journal vs. Google

After we worked together at Sun Microsystems, Chuck McManis worked at Google then built another search engine (Blekko). His contribution to the discussion on Dave Farber's IP list about the argument between the Wall Street Journal and Google is very informative. Chuck gave me permission to quote liberally from it in the discussion below the fold.

Thursday, June 22, 2017

WAC2017: Security Issues for Web Archives

Jack Cushman and Ilya Kreymer's Web Archiving Conference talk Thinking like a hacker: Security Considerations for High-Fidelity Web Archives is very important. They discuss 7 different security threats specific to Web archives:
  1. Archiving local server files
  2. Hacking the headless browser
  3. Stealing user secrets during capture
  4. Cross site scripting to steal archive logins
  5. Live web leakage on playback
  6. Show different page contents when archived
  7. Banner spoofing
Below the fold, a brief summary of each to encourage you to do two things:
  1. First, view the slides.
  2. Second, visit http://warc.games, which is a sandbox with a local version of Webrecorder that has not been patched to fix known exploits, and a number of challenges for you to learn how they might apply to web archives in general.

Tuesday, June 20, 2017

Analysis of Sci-Hub Downloads

Bastian Greshake has a post at the LSE's Impact of Social Sciences blog based on his F1000Research paper Looking into Pandora's Box. In them he reports on an analysis combining two datasets released by Alexandra Elbakyan:
  • A 2016 dataset of 28M downloads from Sci-Hub between September 2015 and February 2016.
  • A 2017 dataset of 62M DOIs to whose content Sci-Hub claims to be able to provide access.
Below the fold, some extracts and commentary.

Thursday, June 15, 2017

Emulation: Windows10 on ARM

At last December's WinHEC conference, Qualcomm and Microsoft made an announcement to which I should have paid more attention:
Qualcomm ... announced that they are collaborating with Microsoft Corp. to enable Windows 10 on mobile computing devices powered by next-generation Qualcomm® Snapdragon™ processors, enabling mobile, power efficient, always-connected cellular PC devices. Supporting full compatibility with the Windows 10 ecosystem, the Snapdragon processor is designed to enable Windows hardware developers to create next generation device form factors, providing mobility to cloud computing.
The part I didn't think about was:
New Windows 10 PCs powered by Snapdragon can be designed to support x86 Win32 and universal Windows apps, including Adobe Photoshop, Microsoft Office and Windows 10 gaming titles.
How do they do that? The answer is obvious: emulation! Below the fold, some thoughts.

Tuesday, June 13, 2017

Crowd-sourced Peer Review

At Ars Technica, Chris Lee's "Journal tries crowdsourcing peer reviews, sees excellent results" takes off from a column at Nature by a journal editor, Benjamin List, entitled "Crowd-based peer review can be good and fast". List and his assistant Denis Höfler have come up with a pre-publication peer-review process that, while retaining what they see as its advantages, has some of the attributes of post-publication review as practiced, for example, by Faculty of 1000. See also here. Below the fold, some commentary.

Thursday, June 8, 2017

Public Resource Audits Scholarly Literature

I (from personal experience), and others, have commented previously on the way journals paywall articles based on spurious claims that they own the copyright, even when there is clear evidence that they know that these claims are false. This is copyfraud, but:
While falsely claiming copyright is technically a criminal offense under the Act, prosecutions are extremely rare. These circumstances have produced fraud on an untold scale, with millions of works in the public domain deemed copyrighted, and countless dollars paid out every year in licensing fees to make copies that could be made for free.
The clearest case of journal copyfraud is when journals claim copyright on articles authored by US federal employees:
Work by officers and employees of the government as part of their official duties is "a work of the United States government" and, as such, is not entitled to domestic copyright protection under U.S. law. So, inside the US there is no copyright to transfer, and outside the US the copyright is owned by the US government, not by the employee. It is easy to find papers that apparently violate this, such as James Hansen et al's Global Temperature Change. It carries the statement "© 2006 by The National Academy of Sciences of the USA" and states Hansen's affiliation as "National Aeronautics and Space Administration Goddard Institute for Space Studies".
Perhaps the most compelling instance is the AMA falsely claiming to own the copyright on United States Health Care Reform: Progress to Date and Next Steps by one Barack Obama.

Now, Carl Malamud tweets:
Public Resource has been conducting an intensive audit of the scholarly literature. We have focused on works of the U.S. government. Our audit has determined that 1,264,429 journal articles authored by federal employees or officers are potentially void of copyright.
They extracted metadata from Sci-Hub and found:
Of the 1,264,429 government journal articles I have metadata for, I am now able to access 1,141,505 files (90.2%) for potential release.
This is already extremely valuable work. But in addition:
2,031,359 of the articles in my possession are dated 1923 or earlier. These 2 categories represent 4.92% of scihub. Additional categories to examine include lapsed copyright registrations, open access that is not, and author-retained copyrights.
It is long past time for action against the rampant copyfraud by academic journals.

Tip of the hat to James R. Jacobs.

Tuesday, May 30, 2017

Blockchain as the Infrastructure for Science? (updated)

Herbert Van de Sompel pointed me to Lambert Heller's How P2P and blockchains make it easier to work with scientific objects – three hypotheses as an example of the persistent enthusiasm for these technologies as a way of communicating and preserving research, among other things. Another link from Herbert, Chris H. J. Hartgerink's Re-envisioning a future in scholarly communication from this year's IFLA conference, proposes something similar:
Distributing and decentralizing the scholarly communications system is achievable with peer-to-peer (p2p) Internet protocols such as dat and ipfs. Simply put, such p2p networks securely send information across a network of peers but are resilient to nodes being removed or adjusted because they operate in a mesh network. For example, if 20 peers have file X, removing one peer does not affect the availability of the file X. Only if all 20 are removed from the network, file X will become unavailable. Vice versa, if more peers on the network have file X, it is less likely that file X will become unavailable. As such, this would include unlimited redistribution in the scholarly communication system by default, instead of limited redistribution due to copyright as it is now.
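The availability arithmetic behind that claim is straightforward; here is a minimal sketch, with an assumed (and generous) per-peer unavailability probability, of how replication across independent peers drives the chance of losing a file toward zero:

    # Sketch of the replication arithmetic in the quoted argument: if each of
    # n independent peers holding file X is unavailable with probability p,
    # the file is lost only when all n are unavailable at once.
    # The 10% per-peer figure is an illustrative assumption.
    def p_file_unavailable(p_peer, n_copies):
        return p_peer ** n_copies

    for n in (1, 5, 10, 20):
        print(f"{n:2d} copies -> P(file unavailable) = {p_file_unavailable(0.10, n):.1e}")

The catch, and the subject of the argument below, is not this arithmetic but the assumptions behind it: that peers fail independently, and that twenty peers keep choosing to store file X at all, which is what a sustainable business model has to pay for.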
I first expressed skepticism about this idea three years ago discussing a paper proposing a P2P storage infrastructure called Permacoin. It hasn't taken over the world. [Update: my fellow Sun Microsystems alum Radia Perlman has a broader skeptical look at blockchain technology. I've appended some details.]

I understand the theoretical advantages of peer-to-peer (P2P) technology. But after nearly two decades researching, designing, building, deploying and operating P2P systems I have learned a lot about how hard it is for these theoretical advantages actually to be obtained at scale, in the real world, for the long term. Below the fold, I try to apply these lessons.

Friday, May 26, 2017

I'm Doing It Wrong (personal)

Retirement, that is. Nearly six months into retirement from Stanford, I can see my initial visions of kicking back in a La-Z-Boy with time to read and think were unrealistic.

First, until you've been through it, you have no idea how much paperwork getting to be retired involves. I'm still working on it. Second, I'm still involved in the on-going evolution of the LOCKSS architecture, and I'm now working with the Internet Archive to re-think the economic model of long-term storage I built with students at UC Santa Cruz. Third, I have travel coming up, intermittently sick grandkids, and a lot of sysadmin debt built up over the years on our home network. I haven't even started on the mess in the garage.

This is just a series of feeble excuses for why, as Atrios likes to say, "extra sucky blogging" for the next month or so. Sorry about that.

Thursday, May 18, 2017

"Privacy is dead, get over it" [updated]

I believe it was in 1999 that Scott McNealy famously said "privacy is dead, get over it". It is a whole lot deader now than it was then. A month ago in Researcher Privacy I discussed Sam Kome's CNI talk about the surveillance abilities of institutional network technology such as central wireless and access proxies. There's so much more to report on privacy that below the fold there can't be more than some suggested recent readings, as an update to my 6-month old post Open Access and Surveillance. [See a major update at the end]

Tuesday, May 9, 2017

Another Class of Blockchain Vulnerabilities

For at least three years I've been pointing out a fundamental problem with blockchain systems, and indeed peer-to-peer (P2P) systems in general, which is that maintaining their decentralized nature in the face of economies of scale (network effects, Metcalfe's Law, ...) is pretty close to impossible. I wrote a detailed analysis of this issue in Economies of Scale in Peer-to-Peer Networks. Centralized P2P systems, in which a significant minority (or in the case of Bitcoin an actual majority) can act in coordination perhaps because they are conspiring together, are vulnerable to many attacks. This was a theme of our SOSP "Best Paper" winner in 2003.

Now, Catalin Cimpanu at Bleeping Computer reports on research showing yet another way in which P2P networks can become vulnerable through centralization driven by economies of scale. Below the fold, some details.

Thursday, May 4, 2017

Tape is "archive heroin"

I've been boring my blog readers for years with my skeptical take on quasi-immortal media. Among the many, many reasons why long media life, such as claimed for tape, is irrelevant to practical digital preservation is that investing in long media life is a bet against technological progress.

Now, at IEEE Spectrum, Marty Perlmutter's The Lost Picture Show: Hollywood Archivists Can’t Outpace Obsolescence is a great explanation of why tape's media longevity is irrelevant to long-term storage:
While LTO is not as long-lived as polyester film stock, which can last for a century or more in a cold, dry environment, it’s still pretty good.

The problem with LTO is obsolescence. Since the beginning, the technology has been on a Moore’s Law–like march that has resulted in a doubling in tape storage densities every 18 to 24 months. As each new generation of LTO comes to market, an older generation of LTO becomes obsolete. LTO manufacturers guarantee at most two generations of backward compatibility. What that means for film archivists with perhaps tens of thousands of LTO tapes on hand is that every few years they must invest millions of dollars in the latest format of tapes and drives and then migrate all the data on their older tapes—or risk losing access to the information altogether.

That costly, self-perpetuating cycle of data migration is why Dino Everett, film archivist for the University of Southern California, calls LTO “archive heroin—the first taste doesn’t cost much, but once you start, you can’t stop. And the habit is expensive.” As a result, Everett adds, a great deal of film and TV content that was “born digital,” even work that is only a few years old, now faces rapid extinction and, in the worst case, oblivion.
Note also that the required migration consumes a lot of bandwidth, meaning that in order to supply the bandwidth needed to ingest the incoming data you need a lot more drives. This reduces the tape/drive ratio, and thus decreases tape's apparent cost advantage. Not to mention that migrating data from tape to tape is far less automated and thus far more expensive than migrating between on-line media such as disk.
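A back-of-the-envelope sketch of that effect, with numbers that are purely illustrative assumptions (archive size, drive throughput, migration cycle and ingest drive count are all invented for the example):

    # Back-of-the-envelope sketch: migrating an LTO archive to a new generation
    # every few years consumes drive bandwidth that would otherwise serve
    # ingest, cutting the tape/drive ratio. All numbers are assumptions.
    ARCHIVE_PB = 100          # archive size, petabytes
    TAPE_TB = 6               # capacity per cartridge, TB
    DRIVE_MB_S = 300          # sustained drive throughput, MB/s
    DUTY_CYCLE = 0.5          # fraction of time a drive actually streams data
    MIGRATION_YEARS = 4       # full re-copy roughly every two LTO generations
    INGEST_DRIVES = 10        # drives dedicated to ingesting new content

    tapes = ARCHIVE_PB * 1000 / TAPE_TB
    seconds = MIGRATION_YEARS * 365 * 24 * 3600 * DUTY_CYCLE
    # Every byte is read from an old tape and written to a new one: two passes.
    migration_drives = 2 * ARCHIVE_PB * 10**15 / (DRIVE_MB_S * 10**6 * seconds)

    print(f"Cartridges: {tapes:,.0f}")
    print(f"Drives busy with migration alone: {migration_drives:.1f}")
    print(f"Tape/drive ratio, ingest only:    {tapes / INGEST_DRIVES:,.0f}")
    print(f"Tape/drive ratio, with migration: {tapes / (INGEST_DRIVES + migration_drives):,.0f}")

With these invented numbers the migration load alone keeps about ten drives busy and roughly halves the tape/drive ratio, which is the squeeze on tape's apparent cost advantage described above.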

Tuesday, May 2, 2017

Distill: Is This What Journals Should Look Like?

A month ago a post on the Y Combinator blog announced that they and Google have launched a new academic journal called Distill. Except this is no ordinary journal consisting of slightly enhanced PDFs; it is a big step towards the way academic communication should work in the Web era:
The web has been around for almost 30 years. But you wouldn’t know it if you looked at most academic journals. They’re stuck in the early 1900s. PDFs are not an exciting form.

Distill is taking the web seriously. A Distill article (at least in its ideal, aspirational form) isn’t just a paper. It’s an interactive medium that lets users – “readers” is no longer sufficient – work directly with machine learning models.
Below the fold, I take a close look at one of the early articles to assess how big a step this is.

Friday, April 21, 2017

A decade of blogging

A decade ago today I posted Mass-market scholarly communication to start this blog. Now, 459 posts later, I would like to thank everyone who has read it, and especially those who have commented.

Blogging is useful to me for several reasons:
  • It forces me to think through issues.
  • It prevents me forgetting what I thought when I thought through an issue.
  • It's a much more effective way to communicate with others in the same field than publishing papers.
  • Since I'm not climbing the academic ladder there's not much incentive for me to publish papers anyway, although I have published quite a few since I started LOCKSS.
  • I've given quite a few talks too. Since I started posting the text of a talk with links to the sources it has become clear that it is much more useful to readers than posting the slides.
  • I use the comments as a handy way to record relevant links, and why I thought they were relevant.
There weren't a lot of posts until, in 2011, I started to target one post a week. I thought it would be hard to come up with enough topics, but pretty soon afterwards half-completed or note-form drafts started accumulating. My posting rate has accelerated smoothly since, and most weeks now get two posts. Despite this, I have more drafts lying around than ever.

Wednesday, April 19, 2017

Emularity strikes again!

The Internet Archive's massive collection of software now includes an in-browser emulation in the Emularity framework of the original Mac with MacOS from 1984 to 1989, and a Mac Plus with MacOS 7.0.1 from 1991. Shaun Nichols at The Register reports that:
The emulator itself is powered by a version of Hampa Hug's PCE Apple emulator ported to run in browsers via JavaScript by James Friend. PCE and PCE.js have been around for a number of years; now that tech has been married to the Internet Archive's vault of software.
Congratulations to Jason Scott and the software archiving team!

Thursday, April 13, 2017

Bufferbloat

This is just a brief note to point out that, after a long hiatus, my friend Jim Gettys has returned to blogging with Home products that fix/mitigate bufferbloat, an invaluable guide to products that incorporate some of the very impressive work undertaken by the bufferbloat project, CeroWrt, and the LEDE WiFi driver. The queuing problems underlying bufferbloat, the "lag" that gamers complain about, and other performance issues at the edge of the Internet can make home Internet use really miserable. It has taken appallingly long for the home router industry to start shipping products with even the initial fixes released years ago. But a trickle of products is now available, and it is a great service for Jim to point at them.

Wednesday, April 12, 2017

Identifiers: A Double-Edged Sword

This is the last of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy, Research Access for the 21st Century, and The Orphans of Scholarship.

Geoff Bilder's Open Persistent Identifier Infrastructures: The Key to Scaling Mandate Auditing and Assessment Exercises was ostensibly a report on the need for and progress in bringing together the many disparate identifier systems for organizations in order to facilitate auditing and assessment processes. It was actually an insightful rant about how these processes were corrupting the research ecosystem. Below the fold, I summarize Geoff's argument (I hope Geoff will correct me if I misrepresent him) and rant back.

Tuesday, April 11, 2017

The Orphans of Scholarship

This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

Monday, April 10, 2017

Research Access for the 21st Century

This is the second of my posts from CNI's Spring 2017 Membership Meeting. The first is Researcher Privacy.

Resource Access for the 21st Century, RA21 Update: Pilots Advance to Improve Authentication and Authorization for Content by Elsevier's Chris Shillum and Ann Gabriel reported on the effort by the oligopoly publishers to replace IP address authorization with Shibboleth. Below the fold, some commentary.

Friday, April 7, 2017

Researcher Privacy

The blog post I was drafting about the sessions I found interesting at the CNI Spring 2017 Membership Meeting got too long, so I am dividing it into a post per interesting session. First up, below the fold, perhaps the most useful breakout session. Sam Kome's Protect Researcher Privacy in the Surveillance Era, an updated version of his talk at the 2016 ALA meeting, led to animated discussion.

Tuesday, March 28, 2017

EU report on Open Access

The EU's ambitious effort to provide immediate open access to scientific publications as the default by 2020 continues with the publication of Towards a competitive and sustainable open access publishing market in Europe, a report commissioned by the OpenAIRE 2020 project. It contains a lot of useful information and analysis, and concludes that:
Without intervention, immediate OA to just half of Europe's scientific publications will not be achieved until 2025 or later.
The report:
considers the economic factors contributing to the current state of the open access publishing market, and evaluates the potential for European policymakers to enhance market competition and sustainability in parallel to increasing access.
Below the fold, some quotes, comments, and an assessment.

Thursday, March 23, 2017

Threats to stored data

Recently there's been a lively series of exchanges on the pasig-discuss mail list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank as to any additional risks posed by media such as disks that did encryption or compression. It morphed into discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

Tuesday, March 21, 2017

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that there is much Web content that can currently be ingested and preserved by public Web archives that, in the near future, will become inaccessible. It will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at ways, albeit unsatisfactory, to address these problems.

Friday, March 17, 2017

The Amnesiac Civilization: Part 4

Part 2 and Part 3 of this series covered the unsatisfactory current state of Web archiving. Part 1 of this series briefly outlined the way the W3C's Encrypted Media Extensions (EME) threaten to make this state far worse. Below the fold I expand on the details of this threat.

Wednesday, March 15, 2017

SHA1 is dead

On February 23rd a team from CWI Amsterdam (where I worked in 1982) and Google Research published The first collision for full SHA-1, marking the "death of SHA-1". Using about 6500 CPU-years and 110 GPU-years, they created two different PDF files with the same SHA-1 hash. SHA-1 is widely used in digital preservation, among many other areas, despite having been deprecated by NIST through a process starting in 2005 and becoming official by 2012.

There is an accessible report on this paper by Dan Goodin at Ars Technica. These collisions have already caused trouble for systems in the field, for example for Webkit's Subversion repository. Subversion and other systems use SHA-1 to deduplicate content; files with the same SHA-1 are assumed to be identical. Below the fold, I look at the implications for digital preservation.
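To see concretely why the collision matters for such systems, here is a minimal sketch of hash-based deduplication (the content-addressed store is hypothetical, but the pattern is the one Subversion and many preservation systems rely on):

    # Minimal sketch of hash-based deduplication: content with the same SHA-1
    # is assumed identical, so only one copy is kept. A collision breaks that
    # assumption, and the second, different file is silently lost.
    import hashlib

    store = {}  # hypothetical content-addressed store: SHA-1 digest -> bytes

    def put(content):
        digest = hashlib.sha1(content).hexdigest()
        # If two *different* contents share a digest, the second is never
        # stored and can never be retrieved.
        store.setdefault(digest, content)
        return digest

    def get(digest):
        return store[digest]

    a = put(b"first document")
    b = put(b"second document")
    assert get(a) != get(b)   # fine: different content, different digests
    # With the two colliding PDFs from the CWI/Google work, put() would return
    # the same digest for both, and one of them would be unrecoverable.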

Monday, March 13, 2017

The Amnesiac Civilization: Part 3

In Part 2 of this series I criticized Kalev Leetaru's Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web for failing to take into account the cost of doing a better job. Below the fold I ask whether, even with unlimited funds, it would actually be possible to satisfy Leetaru's reasonable-sounding requirements, and whether those requirements would actually solve the problems of Web archiving.

Friday, March 10, 2017

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

Wednesday, March 8, 2017

The Amnesiac Civilization: Part 2

Part 1 of The Amnesiac Civilization predicted that the state of Web archiving would soon get much worse. How bad is it right now, and why? Follow me below the fold for Part 2 of the series. I'm planning at least three more parts:
  • Part 3 will assess how practical some suggested improvements might be.
  • Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
  • Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.

Friday, March 3, 2017

The Amnesiac Civilization: Part 1

Those who cannot remember the past are condemned to repeat it
George Santayana: Life of Reason, Reason in Common Sense (1905)
Who controls the past controls the future. Who controls the present controls the past.
George Orwell: Nineteen Eighty-Four (1949)
Santayana and Orwell correctly perceived that societies in which the past is obscure or malleable are very convenient for ruling elites and very unpleasant for the rest of us. It is at least arguable that the root cause of the recent inconveniences visited upon ruling elites in countries such as the US and the UK was inadequate history management. Too much of the population correctly remembered a time in which GDP, the stock market and bankers' salaries were lower, but their lives were less stressful and more enjoyable.

Two things have become evident over the past couple of decades:
  • The Web is the medium that records our civilization.
  • The Web is becoming increasingly difficult to collect and preserve in order that the future will remember its past correctly.
This is the first in a series of posts on this issue. I start by predicting that the problem is about to get much, much worse. Future posts will look at the technical and business aspects of current and future Web archiving. This post is shorter than usual to focus attention on what I believe is an important message.

In a 2014 post entitled The Half-Empty Archive I wrote, almost as a throw-away:
The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.
The link was to a post by Cory Doctorow in which he wrote:
We are Huxleying ourselves into the full Orwell.
He clearly understood some aspects of the problem caused by DRM on the Web:
Everyone in the browser world is convinced that not supporting Netflix will lead to total marginalization, and Netflix demands that computers be designed to keep secrets from, and disobey, their owners (so that you can’t save streams to disk in the clear).
Two recent developments got me thinking about this more deeply, and I realized that neither I nor, I believe, Doctorow comprehended the scale of the looming disaster. It isn't just about video and the security of your browser, important as those are. Here it is in as small a nutshell as I can devise.

Almost all the Web content that encodes our history is supported by one or both of two business models: subscription, or advertising. Currently, neither model works well. Web DRM will be perceived as the answer to both. Subscription content, not just video but newspapers and academic journals, will be DRM-ed to force readers to subscribe. Advertisers will insist that the sites they support DRM their content to prevent readers running ad-blockers. DRM-ed content cannot be archived.

Imagine a world in which archives contain no subscription and no advertiser-supported content of any kind.

Update: the succeeding posts in the series are Part 2, Part 3, Part 4 and Part 5.

Notes from FAST17

As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Thursday, March 2, 2017

Injecting Faults in Distributed Storage

I'll record my reactions to some of the papers at the 2017 FAST conference in a subsequent post. But one of them has significant implications for digital preservation systems using distributed storage, and deserves a post to itself. Follow me below the fold as I try to draw out these implications.

Tuesday, February 28, 2017

Bundled APCs Considered Harmful

In More From Mackie-Mason on Gold Open Access I wrote
The publishers ... are pushing bundled APCs to librarians as a way to retain the ability to extract monopoly rents. As the Library Loon perceptively points out:
The key aspect of Elsevier’s business model that it will do its level best to retain in any acquisitions or service launches is the disconnect between service users and service purchasers.
I just realized that there is another pernicious aspect of bundled APC (Article Processing Charges) deals such as the recent deal between the Gates Foundation and AAAS. It isn't just that the deal involves Gates paying over the odds. It is that AAAS gets the money without necessarily publishing any articles. This gives them a financial incentive to reject Gates-funded articles, which would take up space in the journal for which AAAS could otherwise charge an APC.

Thursday, February 23, 2017

Poynder on the Open Access mess

Do not be put off by the fact that it is 36 pages long. Richard Poynder's Copyright: the immoveable barrier that open access advocates underestimated is a must-read. Every one of the 36 pages is full of insight.

Briefly, Poynder is arguing that the mis-match of resources, expertise and motivation makes it futile to depend on a transaction between an author and a publisher to provide useful open access to scientific articles. As I have argued before, Poynder concludes that the only way out is for Universities to act:
As it happens, the much-lauded Harvard open access policy contains the seeds for such a development. This includes wording along the lines of: “each faculty member grants to the school a nonexclusive copyright for all of his/her scholarly articles.” A rational next step would be for schools to appropriate faculty copyright all together. This would be a way of preventing publishers from doing so, and it would have the added benefit of avoiding the legal uncertainty some see in the Harvard policies. Importantly, it would be a top-down diktat rather than a bottom-up approach. Since currently researchers can request a no-questions-asked opt-out, and publishers have learned that they can bully researchers into requesting that opt-out, the objective of the Harvard OA policies is in any case subverted.
Note the word "faculty" above. Poynder does not examine the issue that very few papers are published all of whose authors are faculty. Most authors are students, post-docs or staff. The copyright in a joint work is held by the authors jointly, or if some are employees working for hire, jointly by the faculty authors and the institution. I doubt very much that the copyright transfer agreements in these cases are actually valid, because they have been signed only by the primary author (most frequently not a faculty member), and/or have been signed by a worker-for-hire who does not in fact own the copyright.

Thursday, February 16, 2017

Postel's Law again

Eight years ago I wrote:
In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:
"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law.
Recently, discussion on a mailing list I'm on focused on the downsides of Postel's Law. Below the fold, I try to explain why most of these downsides don't apply to the "accept" side, which is the side that matters for digital preservation.

Tuesday, February 14, 2017

RFC 4810

A decade ago next month Wallace et al published RFC 4810 Long-Term Archive Service Requirements. Its abstract is:
There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.
Below the fold, a look at how it has stood the test of time.
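As a reminder of what such a service has to provide, here is a minimal sketch of the core primitive: a hash recorded with a timestamp that later demonstrates the data existed then and has not changed since. This is only an illustration of the requirement, not the RFC's protocol; a real service would sign its records and, crucially, keep them verifiable for decades.

    # Toy illustration of the RFC 4810 requirement: prove that data existed at
    # a point in time and has not changed since. A real long-term archive
    # service would sign these records and preserve them across decades.
    import hashlib, time

    records = {}  # hypothetical service state: digest -> time first seen

    def register(data):
        """Record that `data` existed now; return its digest as a receipt."""
        digest = hashlib.sha256(data).hexdigest()
        records.setdefault(digest, time.time())
        return digest

    def verify(data, digest):
        """Return when the data was registered, or None if it doesn't match."""
        if hashlib.sha256(data).hexdigest() == digest:
            return records.get(digest)
        return None

    receipt = register(b"signed contract, v1")
    assert verify(b"signed contract, v1", receipt) is not None  # existed, unchanged
    assert verify(b"signed contract, v2", receipt) is None      # altered data fails

The hard part, which this toy elides, is keeping those records, and the algorithms behind them, trustworthy over periods longer than any single hash or signature scheme remains secure, which connects directly to the death of SHA-1 discussed above.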

Tuesday, February 7, 2017

Coronal Mass Ejections (again)

Back in 2014 I blogged about one of digital preservation's less well-known risks, coronal mass ejections (CME). Additional information accumulated in the comments. Last October:
"President Barack Obama ... issued an Executive Order that defines what the nation’s response should be to a catastrophic space weather event that takes out large portions of the electrical power grid, resulting in cascading failures that would affect key services such as water supply, healthcare, and transportation."
Two recent studies brought the risk back into focus and convinced me that my 2014 post was too optimistic. Below the fold, more gloom and doom.

Tuesday, January 31, 2017

Preservable emulations

This post is an edited extract from my talk at last year's IIPC meeting. This part was the message I was trying to get across, but I buried the lede at the tail end. So I'm repeating it here to try and make the message clear.

Emulation technology will evolve through time. The way we expose emulations on the Web right now means that this evolution will break them. We're supposed to be preserving stuff, but the way we're doing it isn't preservable. We need to expose emulations to the Web in a future-proof way, a way whereby they can be collected, preserved and reanimated using future emulation technologies. Below the fold, I explain what is needed using the analogy of PDFs.

Wednesday, January 25, 2017

Rick Whitt on Digital Preservation

Google's Rick Whitt has published "Through A Glass, Darkly": Technical, Policy, and Financial Actions to Avert the Coming Digital Dark Ages (PDF), a very valuable 114-page review of digital preservation aimed at legal and policy audiences. Below the fold, some encomia and some quibbles (but much less than 114 pages of them).

Thursday, January 19, 2017

The long tail of non-English science

Ben Panko's English Is the Language of Science. That Isn't Always a Good Thing is based on Languages Are Still a Major Barrier to Global Science, a paper in PLOS Biology by Tatsuya Amano, Juan P. González-Varo and William J. Sutherland. Panko writes:
For the new study, Amano's team looked at the entire body of research available on Google Scholar about biodiversity and conservation, starting in the year 2014. Searching with keywords in 16 languages, the researchers found a total of more than 75,000 scientific papers. Of those papers, more than 35 percent were in languages other than English, with Spanish, Portuguese and Chinese topping the list.

Even for people who try not to ignore research published in non-English languages, Amano says, difficulties exist. More than half of the non-English papers observed in this study had no English title, abstract or keywords, making them all but invisible to most scientists doing database searches in English.
Below the fold, how this problem relates to work by the LOCKSS team.

Tuesday, January 10, 2017

Gresham's Law

Jeffrey Beall, who has done invaluable work identifying predatory publishers and garnered legal threats for his pains, reports that:
Hyderabad, India-based open-access publisher OMICS International is on a buying spree, snatching up legitimate scholarly journals and publishers, incorporating them into its mega-fleet of bogus, exploitative, and low-quality publications. ... OMICS International is on a mission to take over all of scholarly publishing. It is purchasing journals and publishers and incorporating them into its evil empire. Its strategy is to saturate scholarly publishing with its low-quality and poorly-managed journals, aiming to squeeze out and acquire legitimate publishers.
Below the fold, a look at how OMICS demonstrates the application of Gresham's Law to academic publishing.

Friday, January 6, 2017

Star Wars Storage Media

At Motherboard, Sarah Jeong's From Tape Drives to Memory Orbs, the Data Formats of Star Wars Suck is a must-read compendium of the ridiculous data storage technologies of the Empire and its enemies.

It's a shame that she uses "formats" when she means "media". But apart from serious questions like:
Why must the Death Star plans be stored on a data tape the size of four iPads stacked on top each other? Obi-Wan can carry a map of the entire galaxy in a glowing marble, and at the end of Episode II, Count Dooku absconds with a thumb drive or something that contains the Death Star plans.
absolutely the best thing about it is that it inspired Cory Doctorow to write Why are the data-formats in Star Wars such an awful mess? Because filmmakers make movies about filmmaking. Doctorow understands that attitudes to persistent data storage are largely hang-overs from the era of floppy disks and ZIP drives:
But we have a persistent myth of the fragility of data-formats: think of the oft-repeated saw that books are more reliable than computers because old floppy disks and Zip cartridges are crumbling and no one can find a drive to read them with anymore. It's true that media goes corrupt and also true that old hardware is hard to find and hard to rehabilitate, but the problem of old floppies and Zips is one of the awkward adolescence of storage: a moment at which hard-drives and the systems that managed them were growing more slowly than the rate at which we were acquiring data.
So:
the destiny of our data will be to move from live, self-healing media to live, self-healing media, without any time at rest in near-line or offline storage, the home of bitrot. 
Just go read the whole of both pieces.

Thursday, January 5, 2017

Transition (personal)

After eighteen and a quarter years I'm now officially retired from Stanford and the LOCKSS Program. It's been a privilege to work with the LOCKSS team all this time, and especially with Tom Lipkis, whose engineering skills were essential to the program's success.

I'm grateful to Michael Keller, Stanford's Librarian, who has consistently supported the program, to the National Science Foundation, Sun Microsystems, and the Andrew W. Mellon Foundation (especially to Don Waters) for funding the development of the system, and to the member institutions of the LOCKSS Alliance and the CLOCKSS Archive for supporting the system's operations.

I'm still helping with a couple of on-going projects, so I still have a stanford.edu e-mail address. And I have the generous Stanford retiree benefits. Apart from my duties as a grandparent, and long-delayed tasks such as dealing with the mess in the garage, I expect also to be doing what I can to help the Internet Archive, and continuing to write for my blog.

Wednesday, January 4, 2017

Error 400: Blogger is Bloggered

If you tried to post a comment and got the scary message:
Bad Request
Error 400
please read below the fold for an explanation and a work-around.

Tuesday, January 3, 2017

Travels with a Chromebook

Two years ago I wrote A Note Of Thanks as I switched my disposable travel laptop from an Asus Seashell to an Acer C720 Chromebook running Linux. Two years later I'm still traveling with a C720. Below the fold, an update on my experiences.