The CLOCKSS archive is implemented using LOCKSS technology. LOCKSS systems do not use PREMIS. As with OAIS, there are significant conceptual mismatches between the PREMIS model, which is based on OAIS, and the reality of the content LOCKSS typically preserves. For example, the concept of "digital object" is hard to apply to preserving an artifact such as an e-journal that continually publishes new, compound objects with only a loose semantic structure. The view that e-journals consist of volumes that consist of issues that consist of articles only loosely corresponds to the real world.
As regards format metadata such as that generated by JHOVE, we are skeptical of its utility in the LOCKSS system. It is expensive to generate, unreliable, and of marginal relevance to content that is unlikely to suffer format obsolescence in the foreseeable future and that, if it does, may well be rendered via emulation rather than format migration.
Nevertheless, we integrated FITS into one version of the LOCKSS daemon and used it to generate format metadata for the content in the CLOCKSS Archive. We do not use this version of the daemon in production CLOCKSS boxes, for the following reasons:
- FITS is several times bigger than the production LOCKSS daemon.
- We do not have the resources to audit it for risks it might pose to the preserved content.
- The computational and I/O resources it consumes are significant.
- Even if the metadata FITS generates were reliable, it would not be of operational significance in the CLOCKSS environment.
The CLOCKSS Archive uses bibliographic metadata for four purposes:
- For billing, the number of articles received from each publisher must be counted. The article-level metadata needed is only the existence of an article.
- For Keepers and KBART reports. These need volume-level metadata.
- To locate content that is the subject of a board-approved trigger event in order to extract a copy from the archive. This typically needs volume-level metadata.
- Once content has been triggered, to update DOI and OpenURL resolvers. This needs detailed article-level metadata.
CLOCKSS is a dark archive. Until content is triggered, there are no readers to access it, and so no readers demanding the kinds of access that PREMIS bibliographic metadata would support. If the CLOCKSS board were to decide that PREMIS metadata support was important enough to justify the rather significant development costs that would be involved, it would be possible to implement it because LOCKSS supports similar semantic units to those that PREMIS describes. Although they are internally factored differently from the PREMIS data model, it would be possible to externalize a data dictionary for our content, or to respond to a query, in terms of the data model that PREMIS describes.
Information is stored in several places within the CLOCKSS network, including the LOCKSS repository (storage-level metadata), the title database (preservation-unit level metadata), and the metadata database (bibliographic-level metadata). This information is tied together internally using a preservation-unit level "archival unit" identifier (AUID). Traversing these databases would enable us to generate a PREMIS-compliant data dictionary. A query for any PREMIS-defined entity could be answered by mapping it to a range of AUIDs, and from there to information stored in these databases.
For example, a PREMIS Intellectual Entity (e.g. a journal article) is represented in the metadata database. It can be located using an Intellectual Entity key such as its DOI, or an ISSN and other bibliographic information. Using the AUID associated with that article allows us to retrieve its preservation-level metadata such as the Archival Unit parameters and attributes that specify its Agents and Rights. It also enables us to retrieve the Object Entities and their physical characteristics, and the Events related to provenance from the associated storage-level metadata in the repository.
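The traversal described above can be sketched roughly as follows. This is only an illustration, not LOCKSS code: the three dictionaries are hypothetical in-memory stand-ins for the metadata database, the title database, and the repository, and every class name, field name, DOI, AUID, and URL is an assumption made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Bibliographic-level record (stand-in for the metadata database)."""
    doi: str
    title: str
    auid: str  # links the article to its archival unit


@dataclass
class ArchivalUnit:
    """Preservation-unit level record (stand-in for the title database)."""
    auid: str
    params: dict  # AU parameters and attributes, e.g. agents and rights


@dataclass
class StoredObject:
    """Storage-level record (stand-in for the repository)."""
    url: str
    size_bytes: int
    events: list  # provenance events recorded for this object


# Hypothetical sample data; the AUID is a placeholder, not a real AUID format.
METADATA_DB = {
    "10.9999/example.1": ArticleRecord("10.9999/example.1", "An Example Article", "au-0001"),
}
TITLE_DB = {
    "au-0001": ArchivalUnit("au-0001", {"agents": ["Example Publisher"], "rights": "preserve-only"}),
}
REPOSITORY = {
    "au-0001": [
        StoredObject("http://example.org/vol1/art1.html", 20480, [{"type": "ingestion"}]),
    ],
}


def premis_view(doi: str) -> dict:
    """Assemble a PREMIS-style view of an article by following its AUID."""
    article = METADATA_DB[doi]               # Intellectual Entity, located by DOI
    au = TITLE_DB[article.auid]              # preservation-level metadata
    objects = REPOSITORY.get(article.auid, [])  # all objects stored under the AU
    return {
        "intellectualEntity": {"doi": article.doi, "title": article.title},
        "agents": au.params.get("agents", []),
        "rights": au.params.get("rights"),
        "objects": [{"url": o.url, "size": o.size_bytes} for o in objects],
        "events": [e for o in objects for e in o.events],
    }


view = premis_view("10.9999/example.1")
```

Note that the sketch returns every object stored under the archival unit; narrowing the result to the files belonging to a single article is omitted.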
But note that this isn't an operation that would ever be performed in the CLOCKSS archive. It is a dark archive; no access to preserved content is permitted unless and until it is triggered. Triggering happens externally at a journal level (or conceptually at a volume level) and internally at an AUID level, not at an article level. It is a one-time process initiated by the board that hands content off under a CC license to multiple re-publishing sites, from where readers can access it in the same way that they accessed it from the original publisher, via the Web. The CLOCKSS archive has no role in these reader accesses.
Philip Gust of the LOCKSS team provided some of the content above.