Tuesday, August 12, 2014

TRAC Audit: Lessons

This is the third in a series of posts about CRL's TRAC audit of the CLOCKSS Archive. Previous posts announced the release of the certification report, and recounted the audit process. Below the fold I look at the lessons we and others can learn from our experiences during the audit.

TRAC vs. ISO 16363

The audit process involves some confusion. The original TRAC criteria underwent an ISO standardization process that resulted in ISO 16363, but that process introduced differences in detail between the TRAC and ISO 16363 criteria. Following the Scholar's Portal, and in the interest of future re-audits, we decided to use the ISO 16363 criteria in our submission. But, as CRL says in the Certification Report:
The primary metrics used by CRL in its assessments are those specified in the Trustworthy Repositories Audit and Certification (TRAC) checklist. TRAC was developed by a joint task force formed by the Research Libraries Group (RLG) and the National Archives and Records Administration in 2003 to provide criteria for use in identifying digital repositories capable of reliably storing, migrating, and providing long-term access to digital collections. TRAC represents best current practice and thinking about the organizational and technical infrastructure required for a digital repository to be considered trustworthy and thus worthy of investment by the research and research library communities. The approved ISO standard for Trustworthy Digital Repositories (ISO 16363), was also used in this audit. Because there is currently no ISO-approved mechanism for accrediting certifying bodies for the TDR standard, CRL’s certification is to TRAC criteria.
We believe that future audits should use the ISO criteria, so setting up an "ISO-approved mechanism for accrediting certifying bodies" and getting bodies such as CRL accredited under it is important. Until this is done the full value of the work to standardize ISO 16363 cannot be realized.

The authors of ISO 16363 have set up a body they call "Primary Trustworthy Digital Repository Authorisation Body (ISO-PTAB)" and are running a (rather expensive) training course. They say:
The Primary Trustworthy Digital Repository Authorisation Body (ISO-PTAB) plays a major role in training auditors and repository managers.  There are three important ISO standards:
  • ISO 14721 (OAIS – a reference model for what is required for an archive to provide long-term preservation of digital information)
  • ISO 16363 (Audit and certification of trustworthy digital repositories – sets out comprehensive metrics for what an archive must do, based on OAIS)
  • ISO 16919 (soon to be published - Requirements for bodies providing audit and certification of candidate trustworthy digital repositories – specifies the competencies and requirements on auditing bodies)

OAIS vs. CLOCKSS

Note the "what an archive must do, based on OAIS" above. Writing the OAIS Conformance Documents made the mis-match between the theory of the OAIS reference model and the practice of digital preservation in the Web era, and in particular that of the CLOCKSS Archive, evident. The conceptual mis-matches between the OAIS Reference Architecture, upon which ISO 16363 is firmly based, and the CLOCKSS Archive's architecture fall into four broad areas:
  • CLOCKSS is a dark archive. Eventual readers of the archive's content are unknown, and have no influence over when, whether and how content is released from the archive. The OAIS concept of Designated Community is thus difficult to apply.
  • CLOCKSS ingests streams of content. Content ingested by crawling the Web, as much of the CLOCKSS Archive's content is, is not pushed from the content submitter to the archive but pulled by the archive from the publisher. The publishers of academic journals emit a continual stream of content; any division into units is imposed by the archive, not by the publisher. The OAIS concept of Submission Information Package (SIP), and the relationship it envisages between the submitter and the archive, is difficult to apply. The concept of Archival Information Package (AIP) also has some detailed mis-matches, since to collect a stream an AIP must be created before it contains any content, and subsequently accumulate content over time instead of, as OAIS envisages, being wrapped around a pre-existing collection of content at creation time.
  • CLOCKSS has a centralized organization but a distributed implementation. Efforts are under way to reconcile the completely centralized OAIS model with the reality of distributed digital preservation, as for example in collaborations such as the MetaArchive and between the Royal and University Library in Copenhagen and the library of the University of Aarhus. Although the organization of the CLOCKSS Archive is centralized, serious digital archives like CLOCKSS require a distributed implementation, if only to achieve geographic redundancy. The OAIS model fails to deal with distribution even at the implementation level, let alone at the organizational level.
  • The CLOCKSS Archive contracts-out its operations. The CLOCKSS Archive not-for-profit achieves its low cost of operations by contracting them all out under two contracts with Stanford University. This enables many costs to be shared with the other users of the LOCKSS technology, to the benefit of both. The OAIS model fails to deal with organizational divisions such as this.
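The stream-ingest mis-match in the list above can be made concrete with a minimal sketch. It is purely illustrative: the StreamingAIP class, its fields and the example URLs are hypothetical and are not taken from the LOCKSS software or from OAIS. It shows an archival unit that exists before it holds any content and then accumulates captures as the publisher emits them:

```python
from dataclasses import dataclass, field


@dataclass
class StreamingAIP:
    """Hypothetical archival unit created before any content has arrived."""
    title: str
    # Maps each URL to the list of versions captured so far.
    items: dict = field(default_factory=dict)

    def ingest(self, url, body):
        """Record one more capture of a URL pulled from the publisher."""
        self.items.setdefault(url, []).append(body)


# The unit is defined by the archive, and is empty at creation time.
aip = StreamingAIP("Journal of Examples, Volume 1")
empty_at_creation = len(aip.items) == 0

# Content accumulates later, as successive crawls find it.
aip.ingest("http://example.com/v1/article1", b"first capture")
aip.ingest("http://example.com/v1/article1", b"second capture")
aip.ingest("http://example.com/v1/article2", b"first capture")
```

An OAIS-style AIP would instead be constructed once, around a collection that already exists in full.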
Another mis-match between OAIS and web archiving would have been a problem had CLOCKSS not been a dark archive. Access to archived Web content, via Memento (RFC7089), direct link or text search, occurs at the level of an individual URL. The OAIS concept of Dissemination Information Package is difficult to apply to access of this kind; it says:
In response to a request, the OAIS provides all or a part of an AIP to a Consumer in the form of a Dissemination Information Package (DIP). The DIP may also include collections of AIPs, and it may or may not have complete PDI. The Packaging Information will necessarily be present in some form so that the Consumer can clearly distinguish the information that was requested. Depending on the dissemination media and Consumer requirements, the Packaging Information may take various forms.
Although there is obviously a lot of room for interpretation here, it does not appear to cover the case where the Consumer requests, and the archive delivers, a digital object (the headers and body of a URL) in exactly the form it was ingested with no Packaging Information. This is what Consumers of archived Web content want. It is true that, for example, Memento adds header information to its response, but that information serves to point to other archived digital objects, potentially in other archives, so it can't be considered Packaging Information for the requested DIP. Fortunately for us, the trigger process of the CLOCKSS Archive does deliver a package containing many URLs, so it more closely matches the OAIS DIP concept.
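To illustrate the point about Memento headers, here is a minimal sketch of the extra response headers that RFC 7089 defines for an archived URL. The memento_headers function and the example.com and archive.example.org URLs are assumptions made for illustration; only the Memento-Datetime header and the original, timegate and timemap link relations come from the RFC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(original_url, timegate_url, timemap_url, captured_at):
    """Build the headers RFC 7089 adds to the response for a memento."""
    links = [
        f'<{original_url}>; rel="original"',
        f'<{timegate_url}>; rel="timegate"',
        f'<{timemap_url}>; rel="timemap"; type="application/link-format"',
    ]
    return {
        # When the archived copy was captured, as an HTTP date in GMT.
        "Memento-Datetime": format_datetime(captured_at, usegmt=True),
        # Pointers to other resources, not a description of this one.
        "Link": ", ".join(links),
    }


headers = memento_headers(
    "http://example.com/article",
    "http://archive.example.org/timegate/http://example.com/article",
    "http://archive.example.org/timemap/http://example.com/article",
    datetime(2014, 8, 12, 9, 30, tzinfo=timezone.utc),
)
```

Every Link entry points away from the delivered object, to the original resource or to services that locate other captures. Nothing in these headers wraps or describes the body in the way OAIS Packaging Information would.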

The OAIS reference model has been rendered significantly obsolete by developments in digital content and the technology for preserving it. Requiring organizations being audited to shoe-horn the documentation of their practices and technologies into this outdated framework imposes considerable unnecessary costs. It may also result in archives being unfairly penalized for decisions that match the real world but not the outdated OAIS model. A revision of ISO16363 to make it less dependent on OAIS and more relevant to actual digital preservation practice is imperative.

Preparation

Preparing to be audited, at least for the first time, is a very time-consuming process. In particular, the load falls almost exclusively on senior management and technical staff; no-one more junior has the comprehensive big-picture knowledge to create and edit the necessary documents. Thus a decision to be audited implies a decision to accept significant delays in other activities that require attention from senior staff, and significant costs for senior staff time. We estimate that our audit consumed between two and three person-years of senior staff time.

Finding ways to reduce the resources consumed by certification is imperative. As I have shown, the major reason future readers will fail to access content that should have been preserved for them is economic. Diverting resources from actually preserving content into side issues such as certification is counter-productive. We got considerable value from our audit, but whether this value was worth the resources and disruption it took is debatable.

Transparency

The default policy for TRAC and ISO16363 audits should be that all documentation provided to the auditors should be made public at the end of the audit, absent a document-specific case being made for it to be withheld. There are two main reasons:
  • The audit process is expensive and time-consuming, diverting resources that could have preserved content. Reducing the burden of the process should be a major goal. The Scholar's Portal website was a considerable help in our audit, and we hope that the CLOCKSS documentation website will similarly assist future audits.
  • Archiving services such as CLOCKSS depend entirely on trust. For the same reason that we believe only open source software should be used in archiving, we believe that archives should provide the public with detailed information about how their processes use the software, their view of the risks to preservation their approach involves, and how these risks are mitigated.

Communication

We were very happy with our use of three Wikis for the audit: an internal one, a confidential one and a to-be-public one. But, in retrospect, a decision from the start to use the confidential and to-be-public Wikis for bi-directional communication between the team and the auditors would have been an improvement:
  • It would have eliminated the re-formatting needed to transform e-mail and documents from the auditors into Wiki format.
  • Allowing the auditors to edit their questions and comments into the documents would have placed them in context and allowed them to use links.
  • The edit history of the pages would be a record of the communication to and from the auditors.
The downside would be that it would need considerable discipline on both sides to maintain the necessary confidentiality.

Auditors' Visit

We weren't proactive enough in communicating with the auditors about the agenda for the site visit. We believe that auditors should request, and archives plan to provide, a presentation organized around the archive's workflow. From the archive's point of view this provides a coherent structure for the presentation. From the auditors' point of view, this acts as a checklist to make sure all areas are covered.

We strongly recommend archives do a full run-through of their presentation, including all live demos, shortly before the visit. I know this sounds obvious, but it wasn't until we did a complete run-through that we realized we needed an overview document and presentation.

Prior agreement between the auditors' delegation and the archive about some means of recording the discussions during the visit, such as audio or video, would assist both sides.

2 comments:

Hvdsomp said...

David, this totally resonates with me. Most of the discussion that I follow about collecting scholarly output for the purpose of archiving largely lacks any thinking in terms of the web and hence web archiving. That is despite the fact that all those scholarly materials are published on the web, evolve there, and potentially vanish from there. The typical perspective is what I would refer to as a "back-office" archival approach in which stuff is handed over "under the table" from a content owner to an archive. Very infrequently do I hear mention of a "front-door" approach in which materials are transferred, as is the case with web archiving, by means of interacting with their URI following either a pull (crawling) or push (on-demand) approach. It seems to me that this consideration relates to your musings about OAIS, which has a perspective that is rather more related to the former.

David. said...

The Trust & Certification parallel session at the 3rd EUDAT Conference featured an interesting discussion. There are three increasingly rigorous levels of certification available to research data archives: the Data Seal of Approval (DSA), NESTOR's DIN31644, and TRAC/ISO16363. A fairly large number of data archives have chosen DSA; 36 have been awarded and another 34 are in the process. These levels were described as bronze, silver and gold.

The DSA is awarded after a process of self-audit and peer-review based on 16 guidelines. This is a much lighter-weight process than TRAC/ISO16363, and typically takes 2-3 months wall time. Re-certification is required after 2 years and should (as with the other levels) be significantly less expensive. DANS estimated that their recent re-certification against DSA took 250 hours of staff time. Some of this may also be accounted against preparations to move to DIN31644. I believe that there is great value in having multiple levels of certification. Moving gradually up the levels, as DANS is doing, rather than going directly to the highest level, as CLOCKSS did, is clearly a less disruptive approach.

There was consensus on many of the benefits of undergoing even the bronze-level DSA certification process, which broadly match the benefits we realized from our TRAC audit. DANS expressed them thus:
  • Moving the archive to a higher level of professionalism.
  • Forcing the archive to make its work processes more explicit and decide unresolved issues.
  • Building trust from funders, users, etc.