Friday, April 21, 2017

A decade of blogging

A decade ago today I posted Mass-market scholarly communication to start this blog. Now, 459 posts later, I would like to thank everyone who has read it, and especially those who have commented on it.

Blogging is useful to me for several reasons:
  • It forces me to think through issues.
  • It prevents me forgetting what I thought when I thought through an issue.
  • It's a much more effective way to communicate with others in the same field than publishing papers.
  • Since I'm not climbing the academic ladder there's not much incentive for me to publish papers anyway, although I have published quite a few since I started LOCKSS.
  • I've given quite a few talks too. Since I started posting the text of a talk with links to the sources it has become clear that it is much more useful to readers than posting the slides.
  • I use the comments as a handy way to record relevant links, and why I thought they were relevant.
There weren't a lot of posts until 2011, when I started to target one post a week. I thought it would be hard to come up with enough topics, but pretty soon afterwards half-completed or note-form drafts started accumulating. My posting rate has accelerated smoothly since, and most weeks now get two posts. Despite this, I have more drafts lying around than ever.

Wednesday, April 19, 2017

Emularity strikes again!

The Internet Archive's massive collection of software now includes an in-browser emulation in the Emularity framework of the original Mac with MacOS from 1984 to 1989, and a Mac Plus with MacOS 7.0.1 from 1991. Shaun Nichols at The Register reports that:
The emulator itself is powered by a version of Hampa Hug's PCE Apple emulator ported to run in browsers via JavaScript by James Friend. PCE and PCE.js have been around for a number of years; now that tech has been married to the Internet Archive's vault of software.
Congratulations to Jason Scott and the software archiving team!

Thursday, April 13, 2017


This is just a brief note to point out that, after a long hiatus, my friend Jim Gettys has returned to blogging with Home products that fix/mitigate bufferbloat, an invaluable guide to products that incorporate some of the very impressive work undertaken by the bufferbloat project, CeroWrt, and the LEDE WiFi driver. The queuing problems underlying bufferbloat, the "lag" that gamers complain about, and other performance issues at the edge of the Internet can make home Internet use really miserable. It has taken appallingly long for the home router industry to start shipping products with even the initial fixes released years ago. But a trickle of products is now available, and it is a great service for Jim to point at them.

Wednesday, April 12, 2017

Identifiers: A Double-Edged Sword

This is the last of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy, Research Access for the 21st Century, and The Orphans of Scholarship.

Geoff Bilder's Open Persistent Identifier Infrastructures: The Key to Scaling Mandate Auditing and Assessment Exercises was ostensibly a report on the need for and progress in bringing together the many disparate identifier systems for organizations in order to facilitate auditing and assessment processes. It was actually an insightful rant about how these processes were corrupting the research ecosystem. Below the fold, I summarize Geoff's argument (I hope Geoff will correct me if I misrepresent him) and rant back.

Tuesday, April 11, 2017

The Orphans of Scholarship

This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

Monday, April 10, 2017

Research Access for the 21st Century

This is the second of my posts from CNI's Spring 2017 Membership Meeting. The first is Researcher Privacy.

Resource Access for the 21st Century, RA21 Update: Pilots Advance to Improve Authentication and Authorization for Content by Elsevier's Chris Shillum and Ann Gabriel reported on the effort by the oligopoly publishers to replace IP address authorization with Shibboleth. Below the fold, some commentary.

Friday, April 7, 2017

Researcher Privacy

The blog post I was drafting about the sessions I found interesting at the CNI Spring 2017 Membership Meeting got too long, so I am dividing it into a post per interesting session. First up, below the fold, perhaps the most useful breakout session. Sam Kome's Protect Researcher Privacy in the Surveillance Era, an updated version of his talk at the 2016 ALA meeting, led to animated discussion.

Tuesday, March 28, 2017

EU report on Open Access

The EU's ambitious effort to provide immediate open access to scientific publications as the default by 2020 continues with the publication of Towards a competitive and sustainable open access publishing market in Europe, a report commissioned by the OpenAIRE 2020 project. It contains a lot of useful information and analysis, and concludes that:
Without intervention, immediate OA to just half of Europe's scientific publications will not be achieved until 2025 or later.
The report:
considers the economic factors contributing to the current state of the open access publishing market, and evaluates the potential for European policymakers to enhance market competition and sustainability in parallel to increasing access.
Below the fold, some quotes, comments, and an assessment.

Thursday, March 23, 2017

Threats to stored data

Recently there's been a lively series of exchanges on the pasig-discuss mail list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank as to any additional risks posed by media such as disks that did encryption or compression. It morphed into discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

Tuesday, March 21, 2017

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that there is much Web content that can currently be ingested and preserved by public Web archives that, in the near future, will become inaccessible. It will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at ways, albeit unsatisfactory, to address these problems.

Friday, March 17, 2017

The Amnesiac Civilization: Part 4

Part 2 and Part 3 of this series covered the unsatisfactory current state of Web archiving. Part 1 of this series briefly outlined the way the W3C's Encrypted Media Extensions (EME) threaten to make this state far worse. Below the fold I expand on the details of this threat.

Wednesday, March 15, 2017

SHA1 is dead

On February 23rd a team from CWI Amsterdam (where I worked in 1982) and Google Research published The first collision for full SHA-1, marking the "death of SHA-1". Using about 6500 CPU-years and 110 GPU-years, they created two different PDF files with the same SHA-1 hash. SHA-1 is widely used in digital preservation, among many other areas, despite having been deprecated by NIST in a process that started in 2005 and became official in 2012.

There is an accessible report on this paper by Dan Goodin at Ars Technica. These collisions have already caused trouble for systems in the field, for example for Webkit's Subversion repository. Subversion and other systems use SHA-1 to deduplicate content; files with the same SHA-1 are assumed to be identical. Below the fold, I look at the implications for digital preservation.
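The deduplication failure mode is simple to sketch. Below is a minimal, hypothetical content-addressed store keyed by SHA-1, illustrating the assumption Subversion-style deduplication makes; the function and data are invented for the example, but the logic is the point: a second, distinct blob that shared a stored blob's SHA-1 (as the shattered PDFs do) would silently vanish.

```python
import hashlib

def dedup_store(store, blob):
    """Store a blob under its SHA-1, Subversion-style: blobs with the
    same hash are assumed identical. A colliding blob (e.g. the second
    shattered.io PDF) is never stored; on retrieval it would silently
    be served as the first blob."""
    key = hashlib.sha1(blob).hexdigest()
    store.setdefault(key, blob)  # no-op if the key already exists
    return key

store = {}
k1 = dedup_store(store, b"version 1 of a document")
k2 = dedup_store(store, b"version 2 of a document")
assert k1 != k2 and len(store) == 2  # distinct hashes, both kept

# The obvious mitigation is to key on a hash with no known
# collisions, e.g. SHA-256:
safer_key = hashlib.sha256(b"version 1 of a document").hexdigest()
```

Note that the store itself cannot detect the problem: from its point of view a collision is indistinguishable from a legitimate duplicate, which is why the fix has to be a stronger hash function rather than a check at ingest time.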

Monday, March 13, 2017

The Amnesiac Civilization: Part 3

In Part 2 of this series I criticized Kalev Leetaru's Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web for failing to take into account the cost of doing a better job. Below the fold I ask whether, even with unlimited funds, it would actually be possible to satisfy Leetaru's reasonable-sounding requirements, and whether those requirements would actually solve the problems of Web archiving.

Friday, March 10, 2017

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

Wednesday, March 8, 2017

The Amnesiac Civilization: Part 2

Part 1 of The Amnesiac Civilization predicted that the state of Web archiving would soon get much worse. How bad is it right now, and why? Follow me below the fold for Part 2 of the series. I'm planning at least three more parts:
  • Part 3 will assess how practical some suggested improvements might be.
  • Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
  • Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.

Friday, March 3, 2017

The Amnesiac Civilization: Part 1

Those who cannot remember the past are condemned to repeat it
George Santayana: Life of Reason, Reason in Common Sense (1905)
Who controls the past controls the future. Who controls the present controls the past.
George Orwell: Nineteen Eighty-Four (1949)
Santayana and Orwell correctly perceived that societies in which the past is obscure or malleable are very convenient for ruling elites and very unpleasant for the rest of us. It is at least arguable that the root cause of the recent inconveniences visited upon ruling elites in countries such as the US and the UK was inadequate history management. Too much of the population correctly remembered a time in which GDP, the stock market and bankers' salaries were lower, but their lives were less stressful and more enjoyable.

Two things have become evident over the past couple of decades:
  • The Web is the medium that records our civilization.
  • The Web is becoming increasingly difficult to collect and preserve in order that the future will remember its past correctly.
This is the first in a series of posts on this issue. I start by predicting that the problem is about to get much, much worse. Future posts will look at the technical and business aspects of current and future Web archiving. This post is shorter than usual to focus attention on what I believe is an important message.

In a 2014 post entitled The Half-Empty Archive I wrote, almost as a throw-away:
The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.
The link was to a post by Cory Doctorow in which he wrote:
We are Huxleying ourselves into the full Orwell.
He clearly understood some aspects of the problem caused by DRM on the Web:
Everyone in the browser world is convinced that not supporting Netflix will lead to total marginalization, and Netflix demands that computers be designed to keep secrets from, and disobey, their owners (so that you can’t save streams to disk in the clear).
Two recent developments got me thinking about this more deeply, and I realized that neither I nor, I believe, Doctorow comprehended the scale of the looming disaster. It isn't just about video and the security of your browser, important as those are. Here it is in as small a nutshell as I can devise.

Almost all the Web content that encodes our history is supported by one or both of two business models: subscription, or advertising. Currently, neither model works well. Web DRM will be perceived as the answer to both. Subscription content, not just video but newspapers and academic journals, will be DRM-ed to force readers to subscribe. Advertisers will insist that the sites they support DRM their content to prevent readers running ad-blockers. DRM-ed content cannot be archived.

Imagine a world in which archives contain no subscription and no advertiser-supported content of any kind.

Update: the succeeding posts in the series are Part 2, Part 3, Part 4 and Part 5.

Notes from FAST17

As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Thursday, March 2, 2017

Injecting Faults in Distributed Storage

I'll record my reactions to some of the papers at the 2017 FAST conference in a subsequent post. But one of them has significant implications for digital preservation systems using distributed storage, and deserves a post to itself. Follow me below the fold as I try to draw out these implications.

Tuesday, February 28, 2017

Bundled APCs Considered Harmful

In More From Mackie-Mason on Gold Open Access I wrote
The publishers ... are pushing bundled APCs to librarians as a way to retain the ability to extract monopoly rents. As the Library Loon perceptively points out:
The key aspect of Elsevier’s business model that it will do its level best to retain in any acquisitions or service launches is the disconnect between service users and service purchasers.
I just realized that there is another pernicious aspect of bundled APC (Article Processing Charge) deals such as the recent deal between the Gates Foundation and AAAS. It isn't just that the deal involves Gates paying over the odds. It is that AAAS gets the money without necessarily publishing any articles. This gives them a financial incentive to reject Gates-funded articles, which would take up space in the journal for which AAAS could otherwise charge an APC.

Thursday, February 23, 2017

Poynder on the Open Access mess

Do not be put off by the fact that it is 36 pages long. Richard Poynder's Copyright: the immoveable barrier that open access advocates underestimated is a must-read. Every one of the 36 pages is full of insight.

Briefly, Poynder is arguing that the mis-match of resources, expertise and motivation makes it futile to depend on a transaction between an author and a publisher to provide useful open access to scientific articles. As I have argued before, Poynder concludes that the only way out is for Universities to act:
As it happens, the much-lauded Harvard open access policy contains the seeds for such a development. This includes wording along the lines of: “each faculty member grants to the school a nonexclusive copyright for all of his/her scholarly articles.” A rational next step would be for schools to appropriate faculty copyright all together. This would be a way of preventing publishers from doing so, and it would have the added benefit of avoiding the legal uncertainty some see in the Harvard policies. Importantly, it would be a top-down diktat rather than a bottom-up approach. Since currently researchers can request a no-questions-asked opt-out, and publishers have learned that they can bully researchers into requesting that opt-out, the objective of the Harvard OA policies is in any case subverted.
Note the word "faculty" above. Poynder does not examine the issue that very few papers are published all of whose authors are faculty. Most authors are students, post-docs or staff. The copyright in a joint work is held by the authors jointly, or if some are employees working for hire, jointly by the faculty authors and the institution. I doubt very much that the copyright transfer agreements in these cases are actually valid, because they have been signed only by the primary author (most frequently not a faculty member), and/or have been signed by a worker-for-hire who does not in fact own the copyright.

Thursday, February 16, 2017

Postel's Law again

Eight years ago I wrote:
In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:
"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law.
Recently, discussion on a mailing list I'm on focused on the downsides of Postel's Law. Below the fold, I try to explain why most of these downsides don't apply to the "accept" side, which is the side that matters for digital preservation.
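The "accept" side is where digital preservation lives: an archive must ingest whatever the wild Web actually emits, not what the spec says it should. A minimal, hypothetical sketch of what accept-side liberality looks like in code follows; the format list and function name are invented for illustration.

```python
from datetime import datetime

# Date formats actually observed in harvested metadata (illustrative
# list, not drawn from any particular crawler or archive).
FORMATS = ["%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%d %b %Y"]

def parse_date_liberally(text):
    """Return a datetime if any known format matches, else None.
    Crucially, an unparseable date never causes the item itself to be
    rejected; the raw string is kept alongside the content instead."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

assert parse_date_liberally("2017-02-16T12:00:00Z") is not None
assert parse_date_liberally("16 Feb 2017") is not None
assert parse_date_liberally("sometime in 2017") is None
```

The design choice to note is that being liberal here costs the archive nothing in correctness: a malformed date degrades one metadata field, whereas a strict parser would have silently lost the content.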

Tuesday, February 14, 2017

RFC 4810

A decade ago next month Wallace et al published RFC 4810 Long-Term Archive Service Requirements. Its abstract is:
There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.
Below the fold, a look at how it has stood the test of time.
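The requirements in the abstract can be made concrete with a toy sketch. A real RFC 4810 archive service would have a trusted timestamp authority sign the evidence record below, and would re-timestamp as algorithms age; the signing and renewal steps are elided here, and all names are illustrative.

```python
import hashlib
import time

def archive_record(data):
    """Create a minimal evidence record: a hash of the data plus the
    time of archiving. In a real long-term archive service this record
    would be signed by a timestamp authority, so that the archive
    itself need not be trusted about when it received the data."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "archived_at": int(time.time()),
    }

def verify(data, record):
    """Demonstrate integrity since archiving: the data hashes to the
    value recorded at (provably) that earlier time."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]

doc = b"grant agreement, final version"
rec = archive_record(doc)
assert verify(doc, rec)                  # unchanged since archiving
assert not verify(doc + b" altered", rec)  # any change is detectable
```

Even this toy exposes the hard part of the RFC: the record proves nothing unless the hash algorithm is still trusted decades later, which is why the requirements include periodic re-timestamping with stronger algorithms.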

Tuesday, February 7, 2017

Coronal Mass Ejections (again)

Back in 2014 I blogged about one of digital preservation's less well-known risks, coronal mass ejections (CMEs). Additional information accumulated in the comments. Last October:
"President Barack Obama ... issued an Executive Order that defines what the nation's response should be to a catastrophic space weather event that takes out large portions of the electrical power grid, resulting in cascading failures that would affect key services such as water supply, healthcare, and transportation."
Two recent studies brought the risk back into focus and convinced me that my 2014 post was too optimistic. Below the fold, more gloom and doom.

Tuesday, January 31, 2017

Preservable emulations

This post is an edited extract from my talk at last year's IIPC meeting. This part was the message I was trying to get across, but I buried the lede at the tail end. So I'm repeating it here to try to make the message clear.

Emulation technology will evolve through time. The way we expose emulations on the Web right now means that this evolution will break them. We're supposed to be preserving stuff, but the way we're doing it isn't preservable. We need to expose emulations to the Web in a future-proof way, a way whereby they can be collected, preserved and reanimated using future emulation technologies. Below the fold, I explain what is needed using the analogy of PDFs.

Wednesday, January 25, 2017

Rick Whitt on Digital Preservation

Google's Rick Whitt has published "Through A Glass, Darkly" Technical, Policy, and Financial Actions to Avert the Coming Digital Dark Ages (PDF), a very valuable 114-page review of digital preservation aimed at legal and policy audiences. Below the fold, some encomia and some quibbles (but much less than 114 pages of them).

Thursday, January 19, 2017

The long tail of non-English science

Ben Panko's English Is the Language of Science. That Isn't Always a Good Thing is based on Languages Are Still a Major Barrier to Global Science, a paper in PLOS Biology by Tatsuya Amano, Juan P. González-Varo and William J. Sutherland. Panko writes:
For the new study, Amano's team looked at the entire body of research available on Google Scholar about biodiversity and conservation, starting in the year 2014. Searching with keywords in 16 languages, the researchers found a total of more than 75,000 scientific papers. Of those papers, more than 35 percent were in languages other than English, with Spanish, Portuguese and Chinese topping the list.

Even for people who try not to ignore research published in non-English languages, Amano says, difficulties exist. More than half of the non-English papers observed in this study had no English title, abstract or keywords, making them all but invisible to most scientists doing database searches in English.
Below the fold, how this problem relates to work by the LOCKSS team.

Tuesday, January 10, 2017

Gresham's Law

Jeffrey Beall, who has done invaluable work identifying predatory publishers and garnered legal threats for his pains, reports that:
Hyderabad, India-based open-access publisher OMICS International is on a buying spree, snatching up legitimate scholarly journals and publishers, incorporating them into its mega-fleet of bogus, exploitative, and low-quality publications. ... OMICS International is on a mission to take over all of scholarly publishing. It is purchasing journals and publishers and incorporating them into its evil empire. Its strategy is to saturate scholarly publishing with its low-quality and poorly-managed journals, aiming to squeeze out and acquire legitimate publishers.
Below the fold, a look at how OMICS demonstrates the application of Gresham's Law to academic publishing.

Friday, January 6, 2017

Star Wars Storage Media

At Motherboard, Sarah Jeong's From Tape Drives to Memory Orbs, the Data Formats of Star Wars Suck is a must-read compendium of the ridiculous data storage technologies of the Empire and its enemies.

It's a shame that she uses "formats" when she means "media". But apart from serious questions like:
Why must the Death Star plans be stored on a data tape the size of four iPads stacked on top of each other? Obi-Wan can carry a map of the entire galaxy in a glowing marble, and at the end of Episode II, Count Dooku absconds with a thumb drive or something that contains the Death Star plans.
absolutely the best thing about it is that it inspired Cory Doctorow to write Why are the data-formats in Star Wars such an awful mess? Because filmmakers make movies about filmmaking. Doctorow understands that attitudes to persistent data storage are largely hang-overs from the era of floppy disks and ZIP drives:
But we have a persistent myth of the fragility of data-formats: think of the oft-repeated saw that books are more reliable than computers because old floppy disks and Zip cartridges are crumbling and no one can find a drive to read them with anymore. It's true that media goes corrupt and also true that old hardware is hard to find and hard to rehabilitate, but the problem of old floppies and Zips is one of the awkward adolescence of storage: a moment at which hard-drives and the systems that managed them were growing more slowly than the rate at which we were acquiring data.
the destiny of our data will be to move from live, self-healing media to live, self-healing media, without any time at rest in near-line or offline storage, the home of bitrot. 
Just go read the whole of both pieces.

Thursday, January 5, 2017

Transition (personal)

After eighteen and a quarter years I'm now officially retired from Stanford and the LOCKSS Program. It's been a privilege to work with the LOCKSS team all this time, and especially with Tom Lipkis, whose engineering skills were essential to the program's success.

I'm grateful to Michael Keller, Stanford's Librarian, who has consistently supported the program, to the National Science Foundation, Sun Microsystems, and the Andrew W. Mellon Foundation (especially to Don Waters) for funding the development of the system, and to the member institutions of the LOCKSS Alliance and the CLOCKSS Archive for supporting the system's operations.

I'm still helping with a couple of on-going projects, so I still have an e-mail address. And I have the generous Stanford retiree benefits. Apart from my duties as a grandparent, and long-delayed tasks such as dealing with the mess in the garage, I expect also to be doing what I can to help the Internet Archive, and continuing to write for my blog.

Wednesday, January 4, 2017

Error 400: Blogger is Bloggered

If you tried to post a comment and got the scary message:
Bad Request
Error 400
please read below the fold for an explanation and a work-around.

Tuesday, January 3, 2017

Travels with a Chromebook

Two years ago I wrote A Note Of Thanks as I switched my disposable travel laptop from an Asus Seashell to an Acer C720 Chromebook running Linux. Two years later I'm still traveling with a C720. Below the fold, an update on my experiences.