In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:

"Be conservative in what you do; be liberal in what you accept from others."

Recently, discussion on a mailing list I'm on focused on the downsides of Postel's Law. Below the fold, I try to explain why most of these downsides don't apply to the "accept" side, which is the side that matters for digital preservation.

It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, a point I made in an earlier post.
Two years after my post, Eric Allman wrote The Robustness Principle Reconsidered, setting out the reasons why Postel's Law isn't an unqualified boon. He writes that Postel's goal was interoperability:
The intent of the Robustness Principle was to maximize interoperability between network service implementations, particularly in the face of ambiguous or incomplete specifications. If every implementation of some service that generates some piece of protocol did so using the most conservative interpretation of the specification and every implementation that accepted that piece of protocol interpreted it using the most generous interpretation, then the chance that the two services would be able to talk with each other would be maximized.

But:
In recent years, however, that principle has been challenged. This isn't because implementers have gotten more stupid, but rather because the world has become more hostile. Two general problem areas are impacted by the Robustness Principle: orderly interoperability and security.

Allman argues, based on his experience with SMTP and Kirk McKusick's with NFS, that interoperability arises in one of two ways, the "rough consensus and running code" that characterized NFS (and TCP), or from detailed specifications:
the specification may be ambiguous: two engineers build implementations that meet the spec, but those implementations still won't talk to each other. The spec may in fact be unambiguous but worded in a way that some people misinterpret. ... The specification may not have taken certain situations (e.g., hardware failures) into account, which can result in cases where making an implementation work in the real world actually requires violating the spec. ... the specification may make implicit assumptions about the environment (e.g., maximum size of network packets supported by the hardware or how a related protocol works), and those assumptions may be incorrect or the environment may change. Finally, and very commonly, some implementers may find a need to enhance the protocol to add new functionality that isn't defined by the spec.

His arguments here are very similar to those I made in Are format specifications important for preservation?:
I'm someone with actual experience of implementing a renderer for a format from its specification. Based on this, I'm sure that no matter how careful or voluminous the specification is, there will always be things that are missing or obscure. There is no possibility of specifying formats as complex as Microsoft Office's so comprehensively that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same product.

The "rough consensus and running code" approach isn't perfect either. As Allman relates, it takes a lot of work to achieve useful interoperability:
The original InterOp conference was intended to allow vendors with NFS (Network File System) implementations to test interoperability and ultimately demonstrate publicly that they could interoperate. The first 11 days were limited to a small number of engineers so they could get together in one room and actually make their stuff work together. When they walked into the room, the vendors worked mostly against only their own systems and possibly Sun's (since as the original developer of NFS, Sun had the reference implementation at the time). Long nights were devoted to battles over ambiguities in the specification. At the end of those 11 days the doors were thrown open to customers, at which point most (but not all) of the systems worked against every other system.

The primary reason it takes so much work is that even finding all the corner cases is difficult, and so is deciding for each whether the sender needs to be more conservative or the receiver needs to be more liberal.
The security downside of Postel's Law is even more fundamental. The law requires the receiver to accept, and do something sensible with, malformed input. Doing something sensible will almost certainly provide an attacker with the opportunity to make the receiver do something bad.
An example is the negotiation in encrypted protocols such as SSL/TLS, which typically provide for the initiator and the receiver to negotiate the specifics of the encryption to be used. A liberal receiver can be negotiated down to an obsolete algorithm (a downgrade attack), vitiating the security of the conversation. Allman writes:
Everything, even services that you may think you control, is suspect. It's not just user input that needs to be checked—attackers can potentially include arbitrary data in DNS (Domain Name System) results, database query results, HTTP reply codes, you name it. Everyone knows to check for buffer overflows, but checking incoming data goes far beyond that.

Security appears to demand receivers be extremely conservative, but that would kill off interoperability; Allman argues that a balance between these conflicting goals is needed.
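As an illustration of the conservative end of this trade-off, here is a minimal sketch using Python's standard-library ssl module: a receiving (server) context that refuses to negotiate below TLS 1.2 rather than liberally accepting whatever the initiator proposes. The certificate and key paths are placeholders, not part of any real deployment.

```python
# Minimal sketch: a "conservative receiver" for TLS negotiation.
# The context sets a floor on the protocol version so that a liberal
# initiator cannot negotiate the connection down to obsolete crypto.
import ssl

def make_strict_server_context(certfile: str, keyfile: str) -> ssl.SSLContext:
    """Build a server-side TLS context that refuses SSLv3/TLS 1.0/1.1."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # floor on the version
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context

# Hypothetical usage: wrap an accepted socket with this context, e.g.
#   context = make_strict_server_context("server.crt", "server.key")
#   tls_sock = context.wrap_socket(raw_sock, server_side=True)
```

The same floor can be applied on the initiating side with ssl.PROTOCOL_TLS_CLIENT; the point is that conservatism about negotiation is a deliberate narrowing of what the receiver will accept.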
Ingest and dissemination in digital preservation are more restricted cases of both interoperability and security. As regards interoperability:
- Ingest is concerned with interoperability between the archive and the real world. As digital archivists we may be unhappy that, for example, one of the consequences of Postel's Law is that in the real world almost none of the HTML conforms to the standard. But our mission requires that we observe Postel's Law and not act on this unhappiness. It would be counter-productive to go to websites and say "if you want to be archived you need to clean up your HTML" (see the sketch after this list).
- Dissemination is concerned with interoperability between the archive and an eventual reader's tools. Traditionally, format migration has been the answer to this problem, whether preemptive or on-access. More recently, emulation-based strategies such as Ilya Kreymer's oldweb.today avoid the problem of maintaining interoperability through time by reconstructing a contemporaneous environment.
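To make the "accept" side of ingest concrete, here is a minimal sketch, not any particular crawler's implementation, of liberal link extraction using Python's standard-library html.parser, which tolerates malformed markup instead of rejecting it:

```python
# Sketch: be liberal in what you accept at ingest.
# html.parser does not raise on most malformed HTML, so links can be
# harvested from the Web as it actually is, not as the standard wishes.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href/src attribute values; never rejects bad markup."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html_text: str) -> list:
    parser = LinkExtractor()
    parser.feed(html_text)  # tolerates unclosed tags and stray markup
    return parser.links

# Even badly formed HTML still yields its link:
print(extract_links('<p><a href="http://example.com">broken<p>'))
# ['http://example.com']
```

A validating parser would refuse much of the real Web; a tolerant one preserves what is actually there, which is the archive's job.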
As regards security:

- Ingest. In the good old days, when Web archives simply parsed the content they ingested to find the links, the risk to their ingest infrastructure was minimal. But now that the Web has evolved from inter-linked static documents into a programming environment, the risk to the ingest infrastructure from executing the content is significant. Precautions are needed, such as sandboxing the ingest systems.
- Dissemination. Many archives attempt to protect future readers by virus-scanning on ingest. But, as I argued in Scary Monsters Under The Bed, this is likely to be both ineffective and counter-productive. As digital archivists we may not like the fact that the real world contains malware, but our mission requires that we not deprive future scholars of the ability to study it. Optional malware removal on access is a suitable way to mitigate the risk to scholars not interested in malware (cf. the Internet Archive's Malware Museum); a sketch of this on-access approach follows the list.
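As an illustration only, here is a hypothetical sketch of optional screening at dissemination time: the archive preserves the bytes exactly as ingested, and filtering happens on access for readers who have not asked for the raw content. The looks_malicious parameter stands in for whatever scanner an archive might run; it is an assumed interface, not a specific product's API.

```python
# Hypothetical sketch: optional malware screening on access, not on ingest.
# The stored bytes are never altered; only what is handed to the reader
# is gated, and scholars who opt in get the content exactly as preserved.
from typing import Callable

def disseminate(stored_bytes: bytes,
                reader_wants_raw: bool,
                looks_malicious: Callable[[bytes], bool]) -> bytes:
    if reader_wants_raw:
        # Scholars studying malware get the bytes exactly as ingested.
        return stored_bytes
    if looks_malicious(stored_bytes):
        raise PermissionError(
            "content flagged by on-access scan; request raw access to study it")
    return stored_bytes
```

Keeping the scan at access time means a better scanner, or a scholar's explicit request, can always be applied later to the unmodified original.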