Friday, March 28, 2014

PREMIS & LOCKSS

We were asked if the CLOCKSS Archive uses PREMIS metadata. The answer is no, and a detailed explanation is below the fold.

The CLOCKSS archive is implemented using LOCKSS technology. LOCKSS systems do not use PREMIS. As with OAIS, there are significant conceptual mismatches between the PREMIS model based on it and the reality of the content LOCKSS typically preserves. For example, the concept of "digital object" is hard to apply to preserving an artifact such as an e-journal that continually publishes new, compound objects with only a loose semantic structure. The view that e-journals consist of volumes that consist of issues that consist of articles only loosely corresponds to the real world.

As regards format metadata such as that generated by JHOVE, we are skeptical of its utility in the LOCKSS system. It is expensive to generate, unreliable, and of marginal relevance to content that is unlikely to suffer format obsolescence in the foreseeable future and that, if it does, may well be rendered via emulation rather than format migration.

Nevertheless, we integrated FITS into one version of the LOCKSS daemon and used it to generate format metadata for the content in the CLOCKSS Archive. We do not use this version of the daemon in production CLOCKSS boxes, for several reasons:
  • FITS is several times bigger than the production LOCKSS daemon.
  • We do not have the resources to audit it for risks it might pose to the preserved content.
  • The computational and I/O resources it consumes are significant.
  • Even if the metadata FITS generates were reliable, it would not be of operational significance in the CLOCKSS environment. 
As regards bibliographic metadata, to be affordable at the scale at which they operate, LOCKSS networks generally depend heavily on extracting metadata automatically, in whatever form it can be found in the content, and performing the minimal processing needed to support the needs of users, primarily for DOI and OpenURL resolution. Human intervention can be considered only at a very coarse level, a journal volume or above.

The CLOCKSS Archive uses bibliographic metadata for four purposes:
  • For billing, the number of articles received from each publisher must be counted. The article-level metadata needed is only the existence of an article.
  • For Keepers and KBART reports. These need volume-level metadata.
  • To locate content that is the subject of a board-approved trigger event in order to extract a copy from the archive. This typically needs volume-level metadata.
  • Once content has been triggered, to update DOI and OpenURL resolvers. This needs detailed article-level metadata.
Although we extract detailed article-level metadata, note that for the vast majority of the archive's content there is no operational need for it. It is needed only for the tiny fraction that is triggered, and only after the trigger event.

CLOCKSS is a dark archive. Until content is triggered, there are no readers to access it, so there are no readers demanding the kinds of access that PREMIS bibliographic metadata would support. If the CLOCKSS board were to decide that PREMIS metadata support was important enough to justify the rather significant development costs that would be involved, it would be possible to implement it because LOCKSS supports similar semantic units to those that PREMIS describes. Although they are internally factored differently from the PREMIS data model, it would be possible to externalize a data dictionary for our content or respond to a query in terms of the data model that it describes.

Information is stored in several places within the CLOCKSS network, including the LOCKSS repository (storage-level metadata), the title database (preservation-unit level metadata), and the metadata database (bibliographic-level metadata). This information is tied together internally using a preservation-unit level "archival unit" identifier (AUID). Traversing these databases would enable us to generate a PREMIS-compliant data dictionary. A query for any PREMIS-defined entity could be answered by mapping it to a range of AUIDs, and from there to information stored in these databases.

For example, a PREMIS Intellectual Entity (e.g. a journal article) is represented in the metadata database. It can be located using an Intellectual Entity key such as its DOI or an ISSN and other bibliographic information. Using the AUID associated with that article allows us to retrieve its preservation-level metadata such as the Archival Unit parameters and attributes that specify its Agents and Rights. It also enables us to retrieve the Object Entities and their physical characteristics, and the Events related to provenance from the associated storage-level metadata in the repository.
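
The lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not LOCKSS code: the three in-memory dictionaries merely stand in for the metadata database, the title database, and the repository, and every class, field, and function name, as well as every sample value (including the DOI and the AUID string), is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Bibliographic-level row (stand-in for the metadata database)."""
    doi: str
    issn: str
    title: str
    auid: str  # ties the article to its archival unit


@dataclass
class ArchivalUnit:
    """Preservation-unit-level entry (stand-in for the title database)."""
    auid: str
    params: dict
    agents: list
    rights: str


@dataclass
class StoredObject:
    """Storage-level record (stand-in for the repository)."""
    url: str
    size_bytes: int
    events: list


# Hypothetical sample data; none of these values come from a real archive.
METADATA_DB = {
    "10.1234/example.1": ArticleRecord(
        doi="10.1234/example.1",
        issn="0000-0000",
        title="An Example Article",
        auid="au-0001",
    ),
}
TITLE_DB = {
    "au-0001": ArchivalUnit(
        auid="au-0001",
        params={"base_url": "http://journal.example.org/", "volume": "12"},
        agents=["Example Publisher"],
        rights="illustrative rights statement",
    ),
}
REPOSITORY = {
    "au-0001": [
        StoredObject("http://journal.example.org/12/1.html", 20480, ["collected"]),
        StoredObject("http://journal.example.org/12/1.pdf", 512000, ["collected", "polled"]),
    ],
}


def describe_by_doi(doi: str) -> dict:
    """Answer a PREMIS-style query by going DOI -> AUID -> other stores."""
    article = METADATA_DB[doi]          # Intellectual Entity, found by its key
    au = TITLE_DB[article.auid]         # preservation-level metadata
    objects = REPOSITORY[article.auid]  # storage-level metadata
    return {
        "intellectual_entity": {"doi": article.doi, "issn": article.issn},
        "agents": au.agents,
        "rights": au.rights,
        "objects": [{"url": o.url, "size": o.size_bytes} for o in objects],
        "events": [e for o in objects for e in o.events],
    }
```

Calling `describe_by_doi("10.1234/example.1")` returns a dictionary that groups the article's bibliographic keys with the agents, rights, objects, and events reached through its AUID. For simplicity the sketch returns every stored object in the archival unit rather than only those belonging to the article.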

But note that this isn't an operation that would ever be performed in the CLOCKSS archive. It is a dark archive; no access to preserved content is permitted unless and until it is triggered. Triggering happens externally at a journal (or conceptually a volume) level and internally at an AUID level, not at an article level. It is a one-time process initiated by the board that hands content off under CC license to multiple re-publishing sites, from where readers can access it in the same way that they accessed it from the original publisher, via the Web. The CLOCKSS archive has no role in these reader accesses.

Philip Gust of the LOCKSS team provided some of the content above.

2 comments:

Angela Dappert said...

David, we have already exchanged thoughts, but since I am concerned about this blog potentially discouraging people from using the de-facto PREMIS standard based on misunderstandings, I wanted to clarify a few things here.


‘As with OAIS, there are significant conceptual mismatches’: You will be interested to know that in the coming version, PREMIS v3.0, we make an explicit effort to break with the OAIS tradition. We have analysed where the focus on OAIS has led to modelling inconsistencies and unnatural breaks in the life-cycle management and are eliminating those.

‘the concept of "digital object" is hard to apply to preserving an artifact’: I fully agree with this and had to make all sorts of adjustments and exceptions in data modelling to cope with this in the past.
In order to create metadata you have to have an underlying domain model of objects, agents, events, rights, environments, etc. so that you can attach metadata to the instances of these entities. There is no way around that – whether this is in XML, spreadsheets, relational databases etc.
But more importantly, when you process the digital assets the software that processes them has to have the same underlying data models. You cannot really avoid having to create it.
But - you can certainly avoid excessively detailed data models in both cases – you should! But PREMIS certainly does not encourage users to get overly complex.

‘As regards format metadata such as is generated by JHOVE, we are skeptical of its utility’: Again, any metadata that you capture, whether in PREMIS or any other metadata framework, needs to be based solely on your business requirements. The question is, what functions do I need to perform and what do I need to know in order to do that? The latter part determines which metadata you choose to collect. JHOVE offers you lots of metadata, but the institution that manages the digital assets needs to determine which of it to extract and store – rather than extract on demand or not use at all.
So this does not really have much to do with PREMIS, other than that PREMIS offers you the right sort of semantic units where you can store this information if you decide to do so.

‘As regards bibliographic metadata’: PREMIS does not actually have any semantic units for bibliographic metadata. It is assumed that there are other frameworks in which one can deal with those. PREMIS has so-called intellectual entities, which are at the moment simply pointers to bibliographic metadata that might be held elsewhere. We will make some changes to ‘intellectual entities’ in version 3 so that one can better link to rights or events that may be described in PREMIS, but the underlying principle remains the same: bibliographic metadata is out of scope.

Angela Dappert said...

And furthermore:

‘it would be possible to implement it because LOCKSS supports similar semantic units to those that PREMIS describes’: It is important to note that PREMIS is completely implementation independent. It is only a data dictionary – that is, a way of organising your domain model with applicable semantic units that describe the entities in the model. It helps you think about your domain and its requirements. It does not at all specify how you implement it, and almost everything is optional rather than mandatory – so that you can choose only what is needed by you.
I have identified what I have called 5 degrees of freedom, in the past. We have used them as the basis for the conformance statement in PREMIS.
1. A repository is free to implement its semantic units using names different from those defined in the PREMIS data dictionary.
2. A repository is free to implement its semantic units at higher or lower granularity than defined in the PREMIS data dictionary.
3. An implementation can extend extensible PREMIS semantic units with other semantic units (this is isomorphic to using greater granularity)
4. An implementation does not have to record mandatory metadata explicitly if it can generate it for exchange
5. Controlled vocabulary is recommended but not compulsory.

The important conclusion is that it does not matter in what form you store your preservation metadata as long as you can produce an export format that can be mapped to the data dictionary.

From what you say it sounds as if LOCKSS is actually PREMIS conformant.