As I've repeatedly pointed out, the resources available for Web archiving are completely inadequate to the task. Jefferson Bailey shows that this is especially true of Web archiving programs in University libraries. Nelson's research program depends upon Web archiving programs for its material. His audience was composed of the foundations that fund Web archiving research, and the University librarians who should (but generally don't) fund Web archiving programs. Thus for him to present such an unremittingly negative view of (a part of) the field was, to say the least, politically inept.
What Can't be Done?
What does Nelson propose we do to address the risks of Web archives being used as a vector for disinformation? He isn't very specific, saying:
- We need new models for web archiving and verifying authenticity. The Heritrix/Wayback Machine technology stack has limited our thinking. (Slide 96)
- I suspect the core of the new model will have a lot in common with click farms. (Slide 98)
- Record what we saw at crawl time as a baseline. Then we need a distance measure for crawl time and replay time. (Slide 99)
My issue with this idea is that it simply isn't a practical solution to the problem of disinformation campaigns using Web archives. The only reason they can do so is that the Wayback Machine's two-decade history and its vast scale have led people to overestimate its credibility. As I write, archive.org is the 232nd most visited site on the Internet, far outranking national libraries such as the Library of Congress (#4,587) or the British Library (#13,278). archive.is is a well-established and well-known Web archive built on a different stack. It is ranked #7,066. oldweb.today is ranked #320,538. The Wayback Machine is in a whole different mindshare league from other Web archives. As Nelson's account of the Joy Reid blog firestorm shows, it was the fact that the Wayback Machine was the only Web archive people knew about that enabled the situation to be elucidated.
The only way I can see to displace the Wayback Machine's dominant mindshare involves:
- Creating a new, somehow attack-proof technology stack.
- Assembling significantly greater hardware resources than archive.org's.
- Somehow populating them with a reasonable facsimile of the Wayback Machine's two-decade history of the Web.
- Mounting a sustained campaign of attacking the Wayback Machine and publicizing the results of the attacks to destroy its credibility.
What Can Be Done?
Addressing the problem of excess credibility of the Wayback Machine by competing with it just isn't a sensible approach. What is needed is some way to adjust perceptions of the Wayback Machine, and other Web archives, closer to reality. Since we cannot have perfection in collection, preservation and replay, what we need is increased transparency, so that the limitations of our necessarily imperfect Web archives are immediately evident to users. Off the top of my head, I can think of a number of existing technologies whose wider adoption would serve this end.
Source Transparency
Current Web archive replay mechanisms treat all content as if they were equally certain of its source. The source of content obtained via HTTP is significantly suspect. The fact that content was obtained via HTTPS does not guarantee its source.
Web archives should crawl via HTTPS whenever possible, but this isn't enough to ensure that the preserved content reflects what the target Web site sent. Whenever the Web site supports it, the crawlers should use Certificate Transparency, auditing and storing the Signed Certificate Timestamps (SCTs) they receive along with the content. This would establish, in a verifiable way, that the content came from the target Web site.
To provide the necessary transparency, Web archive replay user interfaces should clearly differentiate between HTTP and HTTPS content, and for HTTPS+CT include the identities of the logs that issued the SCTs stored with the content. This would provide archive users high confidence that HTTPS+CT content:
- Came from the host in the URL.
- Was collected during the intersection of the validity of the certificates in the SCTs.
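The second point can be illustrated with a minimal sketch. It assumes the archive has already extracted a (not-before, not-after) validity window for each certificate associated with the stored SCTs; the function name `validity_intersection` and the sample dates are hypothetical, and the sketch does not verify SCT signatures or audit any log.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

# A certificate validity window as (not_before, not_after).
Window = Tuple[datetime, datetime]


def validity_intersection(windows: Iterable[Window]) -> Optional[Window]:
    """Return the period during which every certificate was valid, or None."""
    windows = list(windows)
    if not windows:
        return None
    # The intersection starts at the latest not_before
    # and ends at the earliest not_after.
    start = max(not_before for not_before, _ in windows)
    end = min(not_after for _, not_after in windows)
    return (start, end) if start <= end else None


def utc(year: int, month: int, day: int) -> datetime:
    """Build a timezone-aware UTC datetime at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical validity windows for two certificates seen during one crawl.
certs = [
    (utc(2018, 1, 1), utc(2018, 12, 31)),
    (utc(2018, 6, 1), utc(2019, 5, 31)),
]
window = validity_intersection(certs)
```

A replay UI could show this window next to the claimed capture date. An empty intersection, or a capture date outside the window, would be a reason to flag the content.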
It is true that, since I captured the images above in 2016, the Wayback Machine has added a tiny "About this capture" button at the bottom right of the timeline bar, but as you see in the image to the left it isn't very noticeable.
Some of this information should be provided to non-experts. They don't need to see the list of URLs; they do need to know the range of dates. It should be indicated by the default presentation of the UI, not hidden behind an obscure button.
It is easy to think of ways other than Kreymer's to provide such an indication; for example the Wayback Machine could grey out in the timeline the range of dates present in the replayed page, zooming in if the range was small.
If the range displayed is small, it is likely that the page is "temporally coherent" in the terminology of Nelson and his co-authors. They have done significant work in this area, for example Only One Out of Five Archived Web Pages Existed as Presented. A large range, as in the case of Nijinski and Pavlova's home page, does not necessarily indicate that the page is "temporally incoherent". The cats' home page is in fact correctly rendered. But it does suggest some caution in accepting the replayed page as evidence.
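As a rough illustration of how a replay UI might derive the range it displays, here is a minimal sketch. It assumes each resource in the replayed page carries a 14-digit YYYYMMDDhhmmss capture timestamp of the kind used in Wayback Machine URLs; the function names, the sample timestamps and the 30-day threshold are hypothetical choices, not anything the Wayback Machine implements.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def parse_capture(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def capture_range(timestamps: Iterable[str]) -> Tuple[datetime, datetime, timedelta]:
    """Return (earliest, latest, spread) over the resources of one replayed page."""
    captures = sorted(parse_capture(ts) for ts in timestamps)
    if not captures:
        raise ValueError("a replayed page has at least one capture")
    earliest, latest = captures[0], captures[-1]
    return earliest, latest, latest - earliest


# Hypothetical capture timestamps: the base page plus two embedded resources.
page = ["20160510120000", "20160510120003", "20140102030405"]
earliest, latest, spread = capture_range(page)

# The threshold is an arbitrary assumption. A large spread suggests caution;
# it does not prove that the page is temporally incoherent.
needs_caution = spread > timedelta(days=30)
```

The earliest and latest values are what the timeline could grey out, and the flag only decides whether to draw the user's attention to them.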
If the Web site supplied and the crawler recorded a Last-Modified header for the base page URL and for an embedded resource, these datetimes can be compared. In A Framework for Evaluation of Composite Memento Temporal Coherence, Nelson et al. show how to analyze the relationship between them. They demonstrate that in some cases it is likely, or even certain, that the embedded resource in the replayed page does not reflect its state when the base page was captured. This is a "temporal violation", and the replay mechanism should alert the user to it.
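The comparison can be sketched in a much simplified form. This is not the full framework from the paper: it uses only the base page's capture time plus the embedded resource's capture time and Last-Modified header, and the three labels are my own shorthand. The function names and sample values are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_capture(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def classify(base_capture: str, embedded_capture: str,
             embedded_last_modified: str) -> str:
    """Classify one embedded resource relative to its base page capture."""
    base = parse_capture(base_capture)
    embedded = parse_capture(embedded_capture)
    modified = parsedate_to_datetime(embedded_last_modified)
    if modified.tzinfo is None:
        # Treat a header without a usable zone as UTC (an assumption).
        modified = modified.replace(tzinfo=timezone.utc)

    if embedded >= base:
        if modified <= base:
            # Unchanged from before the base capture until its own capture,
            # so the archived copy is what the base page would have embedded.
            return "coherent"
        # Modified after the base page was captured: the archived copy
        # cannot be the state that existed at base capture time.
        return "violation"
    # Captured before the base page; it may have changed in between,
    # and this Last-Modified value cannot tell us.
    return "uncertain"
```

For example, with a base page captured at 20160510120000 and an embedded resource captured at 20160512000000, a Last-Modified of "Mon, 09 May 2016 08:00:00 GMT" yields "coherent", while "Wed, 11 May 2016 08:00:00 GMT" yields "violation".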
Fidelity Transparency 1
The Wayback Machine and similar replay technologies mislead the user in another way. There's no indication that the 2016 browser uses different fonts and background colors from those an original visitor would have seen.
Kreymer has shown how to replay pages using a nearly contemporaneous browser, in a way that could be transparent to the user and does not involve modifying the replayed content in any way. The difficulty is that doing so is expensive for the archive, an important consideration when it is serving around four million unique IPs each day on a very limited budget, as the Internet Archive does.
Nevertheless, it would be possible for the UI of replay technologies like the Wayback Machine, after displaying their rendering of the preserved page, to offer the option of seeing an oldweb.today style rendering. Since users would be unlikely to choose the option, the cost would probably be manageable.
Fidelity Transparency 2
The indexing and UI for these screenshots need more work; this capture of my blog doesn't appear in the regular Wayback Machine timeline. After viewing the Wayback Machine style replay, the user could be offered the choice of viewing the screenshot or the oldweb.today style replay. Again, I expect few users would take the screenshot offer.
Conclusion
It is clear that if Web archives are to fulfill their role as important parts of society's memory, they need credibility. Nelson is probably right that their credibility will come under increasing attack from disinformation merchants, including governments. Eliminating the inadequacies of current ingest techniques would be extremely expensive, and given current budgets would result in much less content being collected.
But to put a more positive spin than Nelson's on the situation, relatively affordable improvements could clearly be made to current replay technologies that would make malign or careless misrepresentation of their testimony as to the past much harder, and thus make degradation of their credibility less likely.