Tuesday, August 5, 2014

TRAC Audit: Process

This is the second in a series of posts about CRL's audit of the CLOCKSS Archive. In the first, I announced the release of the certification report. In this one I recount the process of being audited and what we did during it. Follow me below the fold for a long story, but not as long as the audit process.

Update: the third post discussing the lessons to be drawn is here.

From the point of view of an Archive being audited, the TRAC process can be divided into phases:
  • The Archive negotiates a contract with an auditing organization, in our case CRL.
  • The Archive generates and submits to the Auditors documentation describing its organization, policies, operations, technology and so on in great detail.
  • The Auditors request further information, and evidence to support the claims in the documentation.
  • A delegation of Auditors visits the Archive to ask questions, receive demonstrations, examine equipment, and so on.
  • The Auditors prepare a draft Certification Report, which is reviewed by the Archive.
  • The Auditors release their report.

Submission Phase

Our contract was signed in July last year, but the LOCKSS team started work on the necessary documentation about six months earlier. I was assigned responsibility for the submission, and worked full-time from June through September editing previously written documents, writing new ones, and organizing the submission.

The example provided by the previous TRAC audit of Scholar's Portal was enormously useful in these early stages. In particular we observed that:
  • They used a Wiki to assemble the documents they submitted to their auditors.
  • The Wiki included a page for every one of the ISO16363 Criteria.
  • The Wiki was made public, promoting transparency and providing support for future audits.
While we decided to follow each of these examples, we also saw ways in which Scholar's Portal's approach could be improved:
  • The structure of the documentation, determined by the ISO16363 criteria, was extremely useful to the auditors but opaque to anyone not intimately familiar with OAIS and the audit criteria.
  • The pages for the criteria were in some respects repetitious, which we felt would cause difficulty in ensuring that the information was consistent.
  • Because the pages were implemented by a Wiki, the edit history of each page was visible to the auditors and the public, which could inhibit free discussion among the team as the pages were created.
We decided on a more complex, but more useful, structure for our document submission. First, we decided to submit two Wikis:
  • A Wiki (documents.clockss.org) that, following the Scholar's Portal example, would be made public at the end of the audit.
  • A Wiki (trac.clockss.org) to contain all the confidential information that the auditors would request, to be taken down at the end of the audit.
Second, we decided to use a third, internal Wiki, to which only the team would have access, to create and edit the content for these Wikis. Once content had passed review and final edit, it was copied from the internal Wiki to whichever of the other two Wikis was appropriate. Our goal was that the set of documents that would eventually be made public, those in documents.clockss.org, would serve three distinct audiences:
  • The auditors, who needed to access the content via the ISO16363 criteria.
  • Members of the LOCKSS team, so that as far as possible these documents would be detailed enough to replace our internal documentation.
  • Interested members of the public, who needed to access the content via some understandable, non-OAIS-related structure.
Thus we decided that documents.clockss.org would contain both:
  • A set of documents, each organized around a coherent theme, which we called "the documents".
  • A set of pages matching the hierarchical structure of the ISO16363 criteria, with one page for each criterion. This set of pages we called "the criteria". They would, as far as possible, serve only as a finding aid, with each page having minimal content but linking to the appropriate sections of "the documents".
The first step was to create, in the internal Wiki, a copy of the Scholar's Portal Wiki's structure, matching the hierarchical structure of the ISO16363 criteria, Sections 3, 4 and 5. The result was a tree of Wiki pages whose 101 blank leaf nodes represented the individual criteria, and whose internal nodes were index pages linking down to them.
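
Building that tree is the kind of step that is easy to script. The sketch below is a hypothetical illustration, not the tooling we actually used: given a flat list of criterion identifiers such as "4.2.3", it derives the internal index pages from the identifier prefixes and emits a stub page for each node, using MediaWiki-style link markup (an assumption) for the index entries.

    # Hypothetical sketch: generate a tree of stub Wiki pages from a flat
    # list of ISO16363 criterion identifiers. Internal nodes ("4", "4.2")
    # become index pages linking to their children; leaves become blank
    # criterion pages awaiting content.
    from collections import defaultdict

    criteria = ["3.1.1", "3.1.2", "4.2.3", "4.2.4", "5.1.1"]  # illustrative subset

    children = defaultdict(set)
    for criterion in criteria:
        parts = criterion.split(".")
        for i in range(1, len(parts)):
            children[".".join(parts[:i])].add(".".join(parts[:i + 1]))

    def page_name(node):
        return "ISO16363_" + node.replace(".", "_")

    for node, kids in sorted(children.items()):
        # Index page: a bulleted list of links to the child pages.
        links = "\n".join("* [[%s]]" % page_name(kid) for kid in sorted(kids))
        print("== %s ==\n%s\n" % (page_name(node), links))

    for leaf in criteria:
        # Leaf page: a blank criterion page with a placeholder heading.
        print("== %s ==\nRelevant Documents\n" % page_name(leaf))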

The next step was to provide the leaf nodes with relevant content: notes about what content the auditors would need to see to judge that criterion, and links to appropriate sections of as-yet non-existent pages in "the documents". A typical leaf node page was very sparse, for example:
    4.2.3 - The repository shall document the final disposition of all SIPs.
    Relevant Documents
The next step was to create "the documents" pointed to by "the criteria". Although not in a suitable form, a good deal of the necessary content already existed as CLOCKSS Board documents, published papers from the LOCKSS team, this blog, and the LOCKSS team's internal Wiki and bug tracking system. This material was reviewed and incorporated as appropriate in "the documents". Despite this "the documents" mostly had to be written from scratch, after extensive consultation with the relevant team members. Writing "the documents" in this way had several beneficial effects:
  • It revealed that in some cases different team members had different ideas about how the process worked, or were in fact executing a different process from the one documented in the internal Wiki.
  • The team came to distinguish for the first time between the team members as individuals, and the roles they played in the various processes. The documents were written to assign responsibilities to roles, not to individuals, and a page on the internal Wiki was created under the control of the LOCKSS Executive Director that mapped from these roles to individual team members.
  • These new documents were placed under a formal document change system. Each specifies the roles that must review, and the role that must approve, future changes to it. Because the documents are in a Wiki, conformance to this system is easy to establish through the edit comments; a check of that kind can even be scripted, as sketched below.
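
Since reviews and approvals are recorded as edit comments, verifying conformance amounts to reading a page's revision history. The sketch below is hypothetical: it assumes a MediaWiki-style wiki with the standard api.php endpoint, and the wiki URL, page title, and "Approved-by:" comment convention are all invented for illustration.

    # Hypothetical sketch: list the recent revisions of a document page
    # that carry an approval tag in their edit comment, assuming a
    # MediaWiki-style wiki and its standard api.php endpoint.
    import requests

    WIKI_API = "https://wiki.example.org/api.php"  # invented URL
    PAGE = "CLOCKSS: Ingest Pipeline"              # invented page title

    resp = requests.get(WIKI_API, params={
        "action": "query",
        "prop": "revisions",
        "titles": PAGE,
        "rvprop": "user|timestamp|comment",
        "rvlimit": 20,
        "format": "json",
    })
    resp.raise_for_status()

    for page in resp.json()["query"]["pages"].values():
        for rev in page.get("revisions", []):
            comment = rev.get("comment", "")
            # The "Approved-by: <role>" comment convention is an
            # assumption made for this sketch.
            if "Approved-by:" in comment:
                print("%s %s: %s" % (rev["timestamp"], rev["user"], comment))
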
At the end of September the team was satisfied that "the documents" and "the criteria" were ready to submit. Each page was copied from the internal Wiki to documents.clockss.org, and the auditors were given read-only accounts on it. In this way the details of the editing process that led to the final submission were kept private to the team.

Discussion Phase

About six weeks after the submission, the auditors responded with an e-mailed list of questions and requests for further documentation covering:
  • Statistics. The auditors asked for detailed statistics in three areas:
    • Counts of articles, journals, files, etc.
    • The rate of growth of the archive.
    • A list of file formats with a count of the instances of each.
  • Administrative Documents. The auditors asked for 15 categories of such documents, all of which were confidential.
  • Content Samples. The auditors asked for sample output from FITS, which could be provided, and sample content, which could not. Since CLOCKSS is a dark archive, absent a trigger event, access to content in the archive is not permitted.
  • Reports. The auditors asked for samples of three kinds of report.
  • Additional Requests. The auditors asked for four further responses, one based on one of "the documents", one based on the LOCKSS team's January 2005 D-Lib paper and two based on a 2007 analysis of LOCKSS by CRL.
FITS output and many of the requested statistics were not reports that the team compiled as part of normal operations of the Archive. Generating them, and collecting all the requested administrative documents, took some time.
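
The format list, for example, is the kind of one-off report a short script can produce. The sketch below is hypothetical and much cruder than what the auditors saw: it guesses formats from file extensions via Python's mimetypes module, whereas the real report was based on FITS identification, and the repository path is invented.

    # Hypothetical sketch: count instances of each file format in a
    # content tree. Formats are guessed from file extensions using the
    # standard mimetypes module; the real report used FITS, which
    # identifies formats from file contents and is far more reliable.
    import mimetypes
    import os
    from collections import Counter

    REPO_ROOT = "/path/to/preserved/content"  # invented path

    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(REPO_ROOT):
        for name in filenames:
            mime, _encoding = mimetypes.guess_type(name)
            counts[mime or "unknown"] += 1

    # One line per format, most common first.
    for mime, count in counts.most_common():
        print("%8d  %s" % (count, mime))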

Lists of URLs for two sample archival units (AUs) were provided, one collected by web harvesting and one supplied from the publisher via file transfer.  Keepers and KBART reports were already public and, in fact, already linked from the appropriate places in "the documents". A sample of the weekly internal report from the LOCKSS team to the CLOCKSS Executive Director was provided.

Once the requested information had been collected, the part that could be made public but was not already public was edited into "the documents". Pages containing the requested confidential information were added to the trac.clockss.org Wiki. A page in that Wiki was created to form the response to the auditors' request by taking the text of their e-mail, adding Wiki markup, and then adding the text of the response to each request in bold font, with links to the confidential information or to "the documents" as appropriate.

The auditors were notified of our response about five weeks after the request, just before Christmas 2013. They were given password-protected read-only accounts on trac.clockss.org that allowed them to read these pages.

Inquisition Phase

About eight weeks after our response to the auditors' first request we received a proposed schedule for the auditors' on-site visit, covering two days about five weeks later. It consisted of a list of requests for aspects of the Archive's functions to be demonstrated, and a set of questions similar to, but more detailed than, those in the first request. There were a total of 36 such questions covering:
  • Ingest
  • Storage and data management
  • Metadata
  • Integrity
  • Miscellaneous
  • Follow-up to previous statements
  • Understandability, Rendering Content, and Representation Information
  • Content examples
Some of the questions provided convincing evidence that the auditors used all available materials, and were looking for inconsistencies. They included:
  • A question based on our 9-year-old format migration paper, whose answer was in "the documents". The first request had asked for evidence for this answer, which the first response had provided.
  • Two questions based on our 14-year-old first paper on the LOCKSS prototype.
Among the 36 were two questions about cases of apparent failure to preserve completely, one in recently triggered content and one in the list of URLs provided as part of the first response. The second was a misunderstanding; the first was a real error on our part. It was diagnosed as a failure to follow the specified trigger process, and fixed. The root cause was confusion arising from the fact that the publisher in question was one whose content was normally collected via web harvest; exceptionally, the publisher had delivered this content via file transfer for immediate triggering.

Using the internal Wiki, we developed responses to the questions in the proposed schedule, and started planning suitable demonstrations. As soon as we started to consider which team members were most appropriate to handle each question and demonstration, it became obvious that the structure of the proposed schedule was inappropriate. Instead, we suggested a re-organization of the schedule to the auditors. We proposed to structure the presentations of demonstrations and answers to questions around the CLOCKSS Archive's workflow, thus:
  • Engaging. The work of the CLOCKSS Executive Director and the Director of Publisher Outreach in recruiting publishers and libraries.
  • Preparing. The work of the LOCKSS team in preparing the CLOCKSS system to ingest the content of newly recruited publishers.
  • Ingesting. The operations of the CLOCKSS system as it ingests the flow of content from established publishers, and the quality control and monitoring processes performed by the LOCKSS team as it does.
  • Preserving. The operations of the CLOCKSS system as it preserves the ingested content, and the monitoring processes performed by the LOCKSS team as it does.
  • Extracting. The processes that extract metadata from the preserved content, and the uses to which the metadata is put.
  • Triggering. The processes that occur when the CLOCKSS board declares a trigger event to extract preserved content from the CLOCKSS system and deliver it in usable form to re-publishing sites.
This allowed the team leads for each area to present their area of expertise as a coherent whole. We made an initial allocation of time slots to each of these areas and assigned the relevant team members to develop presentations for each slot. I created a page with the content of the auditors' proposed schedule, and inserted:
  • Detailed answers to each of their specific questions with links to the relevant sections of "the documents". In some cases this required detailed consultation with the relevant team member.
  • An outline of each of the requested demonstrations. In some cases the relevant team member chose the specific example to be demonstrated.
Then a page was created for each of the areas, with the auditors' questions, the detailed answers, and the requested demonstrations. Each presentation was to have two components:
  • An overview of the area, based on the content of "the documents".
  • A list of each of the questions relevant to that area with answers.
All team members were encouraged to read all the documentation that had been compiled, in case they faced questions during the auditors' visit.

The auditors' request for a demonstration of the LOCKSS: Polling and Repair Protocol in action, even via annotated logs, posed significant problems. The directory trees and files for production preserved content (AUs) are very large, and a poll on a typical AU takes a long time. The daemon runs many such polls simultaneously. If the daemon's logging mechanism were configured to generate enough detail to follow every step of one real poll, that level of detail would apply to every poll under way for the duration of that poll, so the volume of log data would be enormous.

Instead we gave a live demo of the file structure and polling process on a small AU of synthetic content in the LOCKSS team's STF testing framework. We used STF to create a network of five virtual LOCKSS boxes, each running the full LOCKSS daemon in its own process, on the laptop driving the projector used for the demos. Each box was specially configured to preserve just the synthetic AU. STF caused the first box (the poller) to call a poll on it; the remaining four boxes were voters in this poll. The results of the poll showed up in the poll status page of the poller with, as expected, 100% agreement, and in the vote status page of each voter. The daemons in all these boxes had logging configured to show details of this single poll, and these logs were shown to the auditors after they watched the poll proceed. The logs are linked from the documentation here.

Then we re-created the same network with the same content, but on the poller we damaged the current version of the content at one of the AU's URLs before calling the poll. This time the poller's poll status page showed that the damage had been detected and repaired, and the results again appeared in the vote status page of each voter. The auditors were then shown the poll and vote status pages of production CLOCKSS boxes, to demonstrate that similar processes were under way in the real CLOCKSS network.
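
The real LOCKSS Polling and Repair Protocol is far more elaborate than the demo could show (it involves sampled polls, nonced hashes, and proofs of effort, among other machinery). But the core idea the auditors watched can be reduced to a toy model, and the sketch below is exactly that: a toy, not the protocol. Each box holds a map from URL to content; the poller compares its content hashes against the voters' and repairs any URL on which it disagrees with the majority.

    # Toy model of the idea behind the demo, NOT the actual LOCKSS
    # Polling and Repair Protocol: the poller tallies the voters' content
    # hashes for each URL and repairs its copy when it disagrees with
    # the majority.
    import hashlib
    from collections import Counter

    def digest(content):
        return hashlib.sha1(content).hexdigest()

    def poll_and_repair(poller, voters):
        for url in poller:
            votes = Counter(digest(voter[url]) for voter in voters)
            winner, count = votes.most_common(1)[0]
            agreement = 100.0 * count / len(voters)
            if digest(poller[url]) == winner:
                print("%s: %.0f%% agreement" % (url, agreement))
            else:
                # Disagreement: fetch a repair from any voter in the
                # majority, overwriting the damaged local copy.
                for voter in voters:
                    if digest(voter[url]) == winner:
                        poller[url] = voter[url]
                        break
                print("%s: damage repaired" % url)

    au = {"http://example.com/art1": b"article one",
          "http://example.com/art2": b"article two"}
    voters = [dict(au) for _ in range(4)]        # four undamaged voters
    poller = dict(au)
    poller["http://example.com/art2"] = b"oops"  # simulate local damage
    poll_and_repair(poller, voters)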

About a week before the auditors' visit, they e-mailed a draft document setting out the view they had derived from "the documents" of the mapping between the workflow of the CLOCKSS Archive and the OAIS model of the flow from SIP to AIP to DIP (SIP-AIP-DIP). This highlighted some areas of uncertainty and revealed some significant misunderstandings. I made extensive edits to the auditors' draft clarifying the uncertainties and returned it to them.

The two days preceding the auditors' visit were given over to a complete run-through, with each presenter giving their talk and team members playing the role of auditors. A review after the run-through changed some of the time allocations and, more importantly, identified a missing presentation. Looking back on SIP-AIP-DIP and the run-through, it was clear that the set of documents lacked an introductory document: LOCKSS: Basic Concepts. A slot was created after the initial introduction to present an overview of seven basic concepts.
This supporting document was written and the presentation created overnight. Other presentations were edited to respond to feedback from the rehearsal audience, and collected in PDF form on the single laptop used to project them. This avoided time-wasting projector-swapping. The PDFs contained live hyperlinks to web pages and demonstrations.

The presentations and demonstrations generally went well, and the auditors expressed satisfaction at the end. The only significant problem we encountered was that, although a team member had been assigned to record the auditors' questions and our answers in the internal Wiki during this time, the discussion rapidly overwhelmed their typing, so our record of the discussion was inadequate.

Follow-Up

At the end of the visit the pages we had created in the internal Wiki underwent a final edit to reflect as far as possible the outcome of the discussions, and were then transferred to the confidential Wiki (trac.clockss.org) to provide a record of our answers. This part of the confidential Wiki was structured as an introduction, a page for each of the workflow stages described above, and a page with the text of the auditors' proposed schedule, linking each question to the appropriate location in the relevant workflow stage page with the answer.

About a month after the visit the auditors made one final request for information. As before, we put the text of the request and our answer into the confidential Wiki.

About two months after the visit the auditors sent us a draft of their certification report for review. We made 10 comments, ranging from trivial to significant, with suggested rewordings. About six weeks after these comments, the auditors released their certification report, which addressed all of them. Shortly after that, CLOCKSS Archive management put out a press release, the LOCKSS team made documents.clockss.org publicly accessible, and I put up a blog post.
