Thursday, September 5, 2013

Noteworthy papers at iPRES2013

Below the fold I discuss some papers from iPRES2013 that I found particularly interesting.

The team from Freiburg University working on delivering emulation as a service had one of my favorite papers at IDCC2013 last January. But the team easily topped that with Cloudy Emulation - Efficient and Scalable Emulation-based Services (PDF), presented by Isgandar Valizada.

They were able to demonstrate a number of emulations of antique hardware running preserved operating systems and interactive applications embedded in web pages using only HTML5, no special plugins. Screen-shots are available here. This is extraordinarily important work. As I've been pointing out since the second post to this blog six years ago, open source emulations of all interesting hardware architectures are available. The argument for migration has, in essence, been that it is too inconvenient to deliver these emulations running the preserved digital objects to future readers, so we have to migrate them instead. But what delivery method could be more convenient than embedding a live emulation in a Web page simply by pasting a link into it?

Read the paper to see the considerable amount of infrastructure needed to make this work, including dealing with the problems caused by the fact that much of the preserved software is encumbered by proprietary licenses. But this work need be done only once, and the Freiburg team have done it for us as open source.

Luis Faria presented Automatic Preservation Watch using Information Extraction on the Web (PDF) by an international team from the EU-funded SCAPE project. In their real use-case from the Dutch KB's e-Depot they extract, from the natural language Web, information about publishers and their journals, and compare this with the laboriously hand-generated information that forms the basis of services such as the Keepers Registry and the repositories whose information it aggregates. They conclude:
Comparing the information with the Keepers registry we find that more that 50% of the automatically fetched data is not on the registry and should be added, proving that this method is effective and can provide a much needed contribution for the automatic watch of the publisher community.
This is a particularly interesting result because it roughly confirms the estimates I made independently for my talk to the Preservation at Scale workshop, namely that preservation efforts currently reach less than half of the material that they should.
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" Their results are somewhat difficult to interpret, but for their two more random samples they report:
    URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
Actually, the SCAPE team's data suggest that the problem is worse. They extracted about 2,000 journals from about 500 publishers, so a fairly small sample of the space of journals. And the sample is biased towards the journals that are easiest to find. The journals in the long tail that are not preserved are precisely those least likely to be in the sample.
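
For readers who want to check the serials estimate above, here is a minimal sketch of the arithmetic. The inputs are the rounded figures quoted in the post; the variable and function names are purely illustrative, and the calculation assumes, as the estimate implicitly does, that the titles the Keepers Registry reports are among those a median research library receives.

```python
# Rounded figures quoted in the post; the names are illustrative only.
median_library_serials = 80_000  # ARL 2010: serials received by the median research library
keepers_preserved = 21_000       # Keepers Registry: "just over 21K" preserved
keepers_in_progress = 10_500     # Keepers Registry: "about 10.5K" in progress


def preservation_coverage(preserved, in_progress, total):
    """Return the fraction of `total` serials at any stage of preservation."""
    return (preserved + in_progress) / total


coverage = preservation_coverage(
    keepers_preserved, keepers_in_progress, median_library_serials
)

# Print the coverage as a percentage with one decimal place.
print(f"Serials at any stage of preservation: {coverage:.1%}")
```

With these rounded inputs the fraction comes out a little under 0.40, which is the "under 40%" figure. Because the Keepers numbers are approximate, the result should be read as a rough estimate.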

1 comment:

David. said...

The Wednesday keynote was Digital Information Storage in DNA by Paul Bertone, based on the same paper in Nature as Ewan Birney's keynote at IDCC last January. Paul repeated the paper's claim that DNA storage would be cheaper than tape in 10 years. I debunked this claim in this blog post. In the questions I pressed him on the validity of their model of tape costs, and he backed away from defending it.

I believe that DNA is likely to be a very important long-term storage medium, and that the work the EMBL team is doing is very important. But the claim that DNA will be cheaper than tape in a decade is not credible. It is off by 3-4 orders of magnitude, and they need to stop making it if they want people in the storage business to take them seriously.