Tuesday, February 14, 2017

RFC 4810

A decade ago next month, Wallace et al. published RFC 4810, Long-Term Archive Service Requirements. Its abstract is:
There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.
Below the fold, a look at how it has stood the test of time.

The RFC's overview of the problem a long-term archive (LTA) must solve is still exemplary, especially in its stress on the limited lifetime of cryptographic techniques (Section 1):
Digital data durability is undermined by continual progress and change on a number of fronts. The useful lifetime of data may exceed the life span of formats and mechanisms used to store the data. The lifetime of digitally signed data may exceed the validity periods of public-key certificates used to verify signatures or the cryptanalysis period of the cryptographic algorithms used to generate the signatures, i.e., the time after which an algorithm no longer provides the intended security properties. Technical and operational means are required to mitigate these issues.
But note the vagueness of the very next sentence:
A solution must address issues such as storage media lifetime, disaster planning, advances in cryptanalysis or computational capabilities, changes in software technology, and legal issues.
There is no one-size-fits-all affordable digital preservation technology, something the RFC implicitly acknowledges. But it does not even mention the importance of basing decisions on an explicit threat model when selecting or designing an appropriate technology. More than 18 months before the RFC was published, the LOCKSS team made this point in Requirements for Digital Preservation Systems: A Bottom-Up Approach. Our explicit threat model was very useful in the documentation needed for the CLOCKSS Archive's TRAC certification.

How to mitigate the threats? Again, Section 1 is on point:
A long-term archive service aids in the preservation of data over long periods of time through a regimen of technical and procedural mechanisms designed to support claims regarding a data object. For example, it might periodically perform activities to preserve data integrity and the non-repudiability of data existence by a particular point in time or take actions to ensure the availability of data. Examples of periodic activities include refreshing time stamps or transferring data to a new storage medium.
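Time-stamp refresh is worth making concrete. A minimal sketch, loosely modeled on the renewal idea in the Evidence Record Syntax (RFC 4998); the `timestamp` function here is a stand-in for a real Time Stamping Authority, and all names are illustrative, not from the RFC:

```python
import hashlib
import time

def timestamp(data_hash: str) -> dict:
    """Stand-in for a TSA call; a real service would sign this attestation."""
    return {"hash": data_hash, "time": time.time()}

def refresh(old_ts: dict, data: bytes, algo: str = "sha512") -> dict:
    """Renew evidence before the old hash algorithm or TSA key weakens:
    hash the data together with the prior time stamp, then obtain a new
    time stamp over that digest. The resulting chain shows the data
    existed no later than the earliest stamp in the chain."""
    h = hashlib.new(algo)
    h.update(data)
    h.update(repr(old_ts).encode())
    return timestamp(h.hexdigest())

# Initial evidence: a time stamp over the object's hash.
obj = b"archived object"
first = timestamp(hashlib.sha256(obj).hexdigest())

# Years later, before SHA-256 is considered weak, renew with SHA-512.
renewed = refresh(first, obj)
```

Note that each refresh must happen while the previous algorithm is still trusted; once it is broken, the chain can no longer be extended credibly.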
Section 4.1.1 specifies a requirement that is still not implemented in any ingest pipeline I've encountered:
The LTA must provide an acknowledgement of the deposit that permits the submitter to confirm the correct data was accepted by the LTA.
It is normal for Submission Information Packages (SIPs) to include checksums of their components; BagIt is typical in this respect. The checksums give the archive increased confidence that the submission was not corrupted in transit. But they do nothing to satisfy RFC 4810's requirement that the submitter be reassured that the archive got the right data. Even if the archive reported the checksums back to the submitter, that would tell the submitter nothing useful: the archive could simply have copied the checksums from the submission without validating them.

SIPs should instead include a nonce. The archive should prepend the nonce to each checksummed item and report the resulting checksums back to the submitter, who can validate them, thus mitigating, among other threats, the possibility that the SIP was tampered with in transit. This is equivalent to the first iteration of Shah et al's audit technology.
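The scheme can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 as the checksum algorithm; the item names and helper functions are hypothetical, not part of any SIP standard:

```python
import hashlib
import secrets

def make_nonce() -> bytes:
    """Submitter generates a fresh random nonce for each SIP."""
    return secrets.token_bytes(32)

def keyed_digest(nonce: bytes, item: bytes) -> str:
    """Hash the nonce prepended to the item's content."""
    return hashlib.sha256(nonce + item).hexdigest()

# --- Submitter side: build a SIP containing the items and a nonce ---
items = {"article.pdf": b"...pdf bytes...", "metadata.xml": b"<dc/>"}
nonce = make_nonce()

# --- Archive side: after ingest, compute nonced digests over the bytes
# it actually stored, and return them in the acknowledgement ---
ack = {name: keyed_digest(nonce, data) for name, data in items.items()}

# --- Submitter side: recompute over the original bytes and compare.
# A match shows the archive read the actual content; it could not have
# copied these digests from checksums shipped inside the SIP, because
# the nonce was not known when any pre-computed checksums were made ---
for name, data in items.items():
    assert ack[name] == keyed_digest(nonce, data), f"mismatch: {name}"
```

Because the digests depend on a nonce chosen per deposit, the archive cannot answer correctly without hashing the stored bytes itself, which is exactly the reassurance Section 4.1.1 asks for.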

Note also how the RFC follows (without citing) the OAIS Reference Model in assuming a "push" model of ingest.

The RFC correctly points out that an LTA will rely on, and trust, services that must not be provided by the LTA itself, for example (Section 4.2.1):
Supporting non-repudiation of data existence, integrity, and origin is a primary purpose of a long-term archive service. Evidence may be generated, or otherwise obtained, by the service providing the evidence to a retriever. A long-term archive service need not be capable of providing all evidence necessary to produce a non-repudiation proof, and in some cases, should not be trusted to provide all necessary information. For example, trust anchors [RFC3280] and algorithm security policies should be provided by other services. An LTA that is trusted to provide trust anchors could forge an evidence record verified by using those trust anchors.
and (Section 2):
Time Stamp: An attestation generated by a Time Stamping Authority (TSA) that a data item existed at a certain time. For example, [RFC3161] specifies a structure for signed time stamp tokens as part of a protocol for communicating with a TSA.
But the RFC doesn't explore the problems that this reliance causes. Among these are:
  • Recursion. These services, which depend on encryption technologies that decay over time, must themselves rely on long-term archiving services to maintain, for example, a time-stamped history of public key validity. The RFC does not cite Petros Maniatis' 2003 Ph.D. thesis Historic integrity in distributed systems on precisely this problem.
  • Secrets. The encryption technologies depend on the ability to keep secrets for extended periods, even if not, as the RFC explains, for the entire archival period. Keeping secrets is difficult, and it is more difficult still to know whether, or when, they leaked. The damage to archival integrity that a leak enables may only be detected after the fact, when recovery may no longer be possible. Or it may never be detected, because the point at which the secret leaked may be assumed to be later than it actually was.
These problems were at the heart of the design of the LOCKSS technology. Avoiding reliance on external services or extended secrecy led to its peer-to-peer, consensus-based audit and repair technology.

Encryption poses many problems for digital preservation. Section 4.5.1 identifies another:
A long-term archive service must provide means to ensure confidentiality of archived data objects, including confidentiality between the submitter and the long-term archive service. An LTA must provide a means for accepting encrypted data such that future preservation activities apply to the original, unencrypted data. Encryption, or other methods of providing confidentiality, must not pose a risk to the associated evidence record.
Easier said than done. If the LTA accepts encrypted data without the decryption key, the best it can do is bit-level preservation. Future recovery of the data depends on the availability of the key which, being digital information, will itself need to have been stored in an LTA. Another instance of the recursive nature of long-term archiving.

"Mere bit preservation" is often unjustly denigrated as a "solved problem". The important archival functions are said to be "active", modifying the preserved data and therefore requiring access to the plaintext. Thus, on the other hand, the archive might have the key and, in effect, store the plaintext. The confidentiality of the archived data then depends on the archive's security remaining impenetrable over the entire archival period, something about which one might reasonably be skeptical.

Section 4.2.2 admits that "mere bit preservation" is the sine qua non of long-term archiving:
Demonstration that data has not been altered while in the care of a long-term archive service is a first step towards supporting non-repudiation of data.
and goes on to note that "active preservation" requires another service:
Certification services support cases in which data must be modified, e.g., translation or format migration. An LTA may provide certification services.
It isn't clear why the RFC thinks it is appropriate for an LTA to certify the success of its own operations. A third-party certification service would also need access to pre- and post-modification plaintext, increasing the plaintext's attack surface and adding another instance of the problems caused by LTAs relying on external services discussed above.

Overall, the RFC's authors did a pretty good job. Time has not revealed significant inadequacies beyond those knowable at the time of publication.
