Sunday, October 2, 2011

Preserving Linked Data

I attended the workshop on Semantic Digital Archives that was part of the Theory and Practice of Digital Libraries conference in Berlin. Only a few of the papers were interesting from the preservation viewpoint:
  • Kun Qian of the University of Magdeburg addressed the fact that the OAIS standard does not deal with security issues, proposing an interesting framework for doing so.
  • Manfred Thaller described work in the state of North Rhine-Westphalia to use open source software such as iRODS to implement a somewhat LOCKSS-like distributed preservation network for cultural heritage institutions using their existing storage infrastructure. Information in the network will be aggregated by a single distribution portal implemented with Fedora that will feed content to sites such as Europeana.
  • Felix Ostrowski of Humboldt University, who works on the LuKII project, discussed an innovative approach to handling metadata in the LOCKSS system using RDFa to include the metadata in the HTML files that LOCKSS boxes preserve. Unlike the normal environment in which LOCKSS boxes operate, where they simply have to put up with whatever the e-journal publisher decides to publish, LuKII has control over both the publisher and the LOCKSS boxes. They can therefore use RDFa to tightly bind metadata to the content it describes.
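To make the RDFa idea concrete, here is a minimal sketch of how metadata travels inside the preserved HTML itself. The page, vocabulary and values are illustrative, not taken from LuKII, and a real consumer would use a full RDFa parser rather than this stripped-down extractor:

```python
from html.parser import HTMLParser

# A preserved HTML page with Dublin Core metadata bound to the content
# via RDFa attributes (hypothetical example, not LuKII's actual markup).
PAGE = """
<html prefix="dc: http://purl.org/dc/terms/">
  <body>
    <div about="http://example.org/article/1">
      <h1 property="dc:title">On Preserving Linked Data</h1>
      <span property="dc:creator">A. Author</span>
      <span property="dc:date" content="2011-09-29">September 29, 2011</span>
    </div>
  </body>
</html>
"""

class RDFaExtractor(HTMLParser):
    """Collect (property, value) pairs from RDFa attributes.

    Uses the @content attribute when present, otherwise the element's
    text content. This only illustrates how the metadata stays bound
    to the content through any crawl-and-preserve cycle."""
    def __init__(self):
        super().__init__()
        self.triples = []
        self._pending = None  # property attribute awaiting text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            if "content" in a:
                self.triples.append((a["property"], a["content"]))
            else:
                self._pending = a["property"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((self._pending, data.strip()))
            self._pending = None

extractor = RDFaExtractor()
extractor.feed(PAGE)
for prop, value in extractor.triples:
    print(prop, "=", value)
```

Because the metadata rides along in the same file as the content, a LOCKSS box that preserves the HTML necessarily preserves the metadata too; nothing has to be kept in sync out-of-band.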
My take on the preservation issues of linked data is as follows.

Linked data uses URIs. Linked data can thus be collected for preservation by archives other than the original publisher using existing web crawling techniques such as the Internet Archive’s Heritrix. Enabling multiple archives to collect and preserve linked data will be essential; some of the publishers will inevitably fail for a variety of reasons. Motivating web archives to do this will be important, as will tools to measure the extent to which they succeed.

The various archives preserving linked data items can republish them, but only at URIs different from the originals, since they do not control the original publisher’s DNS entry. Links to the original will not resolve to the archive copies, removing them from the world of linked data. This problem is generic to web archiving. Solving it is enabled by the Memento technology, which is on track to become an IETF/W3C standard. It will be essential that both the archives preserving linked data and the tools accessing it implement Memento. There are some higher-level issues in the use of Memento, but as it gets wider use they are likely to be resolved before they become critical for linked data.

Collection using web crawlers and re-publishing using Memento provide archives with a technical basis for linked open data preservation, but they also need a legal basis. Over 80% of current data sources do not provide any license information; these sources will be problematic to archive. Even those data sources that do provide license information may be problematic: their licenses may not allow the operations that preservation requires. Open data licenses do not merely permit and encourage re-use of data; they permit and encourage its preservation.
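The Memento mechanism sketched above boils down to datetime negotiation: a client asks a TimeGate for the archived copy of an original URI closest to a desired datetime, via an Accept-Datetime header. A minimal sketch of the client side, with hypothetical URIs (the TimeGate address and URI-appending convention are assumptions, not any particular archive's API):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical original resource and TimeGate endpoint.
ORIGINAL_URI = "http://example.org/dataset/item/42"
TIMEGATE = "http://archive.example.net/timegate/"

def accept_datetime(dt: datetime) -> str:
    """Format an Accept-Datetime header value (RFC 1123 date, GMT),
    as the Memento draft requires."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

def timegate_request(original_uri: str, dt: datetime) -> dict:
    """Build the request a Memento client would send to a TimeGate,
    which then redirects to the memento nearest the requested time."""
    return {
        "url": TIMEGATE + original_uri,
        "headers": {"Accept-Datetime": accept_datetime(dt)},
    }

req = timegate_request(ORIGINAL_URI, datetime(2011, 10, 2, tzinfo=timezone.utc))
print(req["url"])
print(req["headers"]["Accept-Datetime"])
# Accept-Datetime: Sun, 02 Oct 2011 00:00:00 GMT
```

The point for linked data is that a Memento-aware client keeps following the original URIs it finds in triples; the negotiation transparently lands it on an archive's copy when the publisher is gone.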

1 comment:

Felix said...

Just a minor correction: LuKII (unfortunately) does not have full "control over both the publisher and the LOCKSS boxes". My paper was rather a hypothesis on how RDFa could be used in the context of LOCKSS.