Thursday, June 20, 2019

Michael Nelson's CNI Keynote: Part 3

Here is the conclusion of my three-part "lengthy disquisition" on Michael Nelson's Spring CNI keynote Web Archives at the Nexus of Good Fakes and Flawed Originals (Nelson starts at 05:53 in the video, slides).

Part 1 and Part 2 addressed Nelson's description of the problems of the current state of the art. Below the fold I address the way forward.

As I've repeatedly pointed out, the resources available for Web archiving are completely inadequate to the task. Jefferson Bailey shows that this is especially true of Web archiving programs in University libraries. Nelson's research program depends upon Web archiving programs for its material. His audience was composed of the foundations that fund Web archiving research, and the University librarians who should (but generally don't) fund Web archiving programs. Thus for him to present such an unremittingly negative view of (a part of) the field was, to say the least, politically inept.

What Can't Be Done?

What does Nelson propose we do to address the risks of Web archives being used as a vector for disinformation? He isn't very specific, saying:
  • We need new models for web archiving and verifying authenticity. The Heritrix/Wayback Machine technology stack has limited our thinking. (Slide 96)
  • I suspect the core of the new model will have a lot in common with click farms. (Slide 98)
  • Record what we saw at crawl time as a baseline. Then we need a distance measure for crawl time and replay time. (Slide 99)
As I understand this, he is proposing we somehow build a stack more credible than the Wayback Machine. Given some specifics about what this stack would actually do, implementing it would not take huge resources. Kreymer has shown the feasibility of implementing an alternate stack.

My issue with this idea is that it simply isn't a practical solution to the problem of disinformation campaigns using Web archives. The only reason they can do so is that the Wayback Machine's two-decade history and its vast scale have led people to overestimate its credibility. As I write, archive.org is the 232nd most visited site on the Internet, far outranking national libraries such as the Library of Congress (#4,587) or the British Library (#13,278). archive.is is a well-established and well-known Web archive built on a different stack; it is ranked #7,066. oldweb.today is ranked #320,538. The Wayback Machine is in a whole different mindshare league from other Web archives. As Nelson's account of the Joy Reid blog firestorm shows, it was precisely because the Wayback Machine was the only Web archive people knew about that the situation could be elucidated.

The only way I can see to displace the Wayback Machine's dominant mindshare involves:
  • Creating a new, somehow attack-proof technology stack.
  • Assembling significantly greater hardware resources than archive.org's.
  • Somehow populating them with a reasonable facsimile of the Wayback Machine's two-decade history of the Web.
  • Mounting a sustained campaign of attacking the Wayback Machine and publicizing the results of the attacks to destroy its credibility.
It would be possible for a government to do this, but very hard to do in a deniable way. For anyone else to attempt it would be a profoundly destructive waste of time and resources, which would likely fail.

What Can Be Done?

Addressing the problem of excess credibility of the Wayback Machine by competing with it just isn't a sensible approach. What is needed is some way to adjust perceptions of the Wayback Machine, and other Web archives, closer to reality. Since we cannot have perfection in collection, preservation and replay, what we need is increased transparency, so that the limitations of our necessarily imperfect Web archives are immediately evident to users. Off the top of my head, I can think of a number of existing technologies whose wider adoption would serve this end.

Source Transparency

Current Web archive replay mechanisms treat all content as if they were equally certain of its source. The source of content obtained via HTTP is significantly suspect. The fact that content was obtained via HTTPS does not guarantee its source.

Web archives should crawl via HTTPS whenever possible, but this isn't enough to ensure that the preserved content reflects what the target Web site sent. Whenever the Web site supports it, the crawlers should use Certificate Transparency, auditing and storing the Signed Certificate Timestamps (SCTs) they receive along with the content. This would establish, in a verifiable way, that the content came from the target Web site.
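
To make this concrete, here is a minimal sketch of what recording SCTs at crawl time might look like, assuming Python and the cryptography library. It only handles SCTs embedded in the server's certificate (they can also arrive via a TLS extension or OCSP stapling), it does not audit them against the logs' public keys, and a production crawler such as Heritrix would need equivalent support in its own TLS stack:

# Sketch: fetch a page over HTTPS and record the SCTs embedded in the
# server's certificate alongside the content. Auditing the SCTs against
# the logs' public keys is a further step, not shown here.
import json
import socket
import ssl

from cryptography import x509
from cryptography.x509.oid import ExtensionOID


def fetch_with_scts(host, path="/"):
    context = ssl.create_default_context()
    with socket.create_connection((host, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)
            request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode())
            body = b""
            while chunk := tls.recv(65536):
                body += chunk

    cert = x509.load_der_x509_certificate(der_cert)
    try:
        scts = cert.extensions.get_extension_for_oid(
            ExtensionOID.PRECERT_SIGNED_CERTIFICATE_TIMESTAMPS
        ).value
        sct_records = [
            {"log_id": sct.log_id.hex(), "timestamp": sct.timestamp.isoformat()}
            for sct in scts
        ]
    except x509.ExtensionNotFound:
        sct_records = []  # the site does not participate in CT

    # A real crawler would write these as WARC metadata records alongside
    # the response record; here they are simply returned as JSON.
    return body, json.dumps(sct_records)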

To provide the necessary transparency, Web archive replay user interfaces should clearly differentiate between HTTP and HTTPS content, and for HTTPS+CT include the identities of the logs that issued the SCTs stored with the content. This would provide archive users high confidence that HTTPS+CT content:
  • Came from the host in the URL.
  • Was collected during the intersection of the validity periods of the certificates covered by the stored SCTs.

Temporal Transparency

Nelson is correct to point out that replayed Web pages frequently lack temporal integrity. Some current replay technologies, for example the Wayback Machine, obscure this lack. Examine the image from 2016 of the Wayback Machine replaying Nijinski and Pavlova's home page, and you would believe that you were seeing the page, which (credibly) claims to date from 11th January 1995, as captured on 1st December 1998. This is the date encoded in the Wayback Machine URL, and it is the highlighted one of the multiple "capture" dates in the timeline inserted at the top of the replayed page.

But we already have example technology showing that pretending that a replayed page has a single date isn't necessary. Ilya Kreymer's oldweb.today prominently displays the range of dates of the Mementos it assembles into a replayed page, as shown in this image of the same page replayed via oldweb.today. In this case, displaying the same page exclusively from the Internet Archive, the components' dates range from 1st December 1998 to 19th September 2000.

The Wayback Machine and similar UIs mislead the user into believing that the "page used to look like this" because in the default UI there's no indication that the replayed page wasn't collected in a single atomic operation on 1st December 1998.

It is true that, since I captured the images above in 2016, the Wayback Machine has added a tiny "About this capture" button at the bottom right of the timeline bar, but, as you see in the image to the left, it isn't very noticeable.

Expert users would know to click on the "About this capture" button to see the image at the right, with red text showing that the images were from 21 months later than the date shown in the timeline and encoded in the URL.

Some of this information should be provided to non-experts. They don't need to see the list of URLs; they do need to know the range of dates. It should be indicated by the default presentation of the UI, not hidden behind an obscure button.

It is easy to think of ways other than Kreymer's to provide such an indication; for example, the Wayback Machine could grey out in its timeline the range of dates present in the replayed page, zooming in if the range was small.

If the range displayed is small, it is likely that the page is "temporally coherent" in the terminology of Nelson and his co-authors. They have done significant work in this area, for example Only One Out of Five Archived Web Pages Existed as Presented. A large range, as in the case of Nijinski and Pavlova's home page, does not necessarily indicate that the page is "temporally incoherent". The cats home page is in fact correctly rendered. But it does suggest some caution in accepting the replayed page as evidence.

If the Web site supplied, and the crawler recorded, a Last-Modified header for the base page URL and for an embedded resource, these datetimes can be compared. In A Framework for Evaluation of Composite Memento Temporal Coherence, Nelson et al. show how to analyze the relationship between them. They demonstrate that in some cases it is likely, or even certain, that the embedded resource in the replayed page does not reflect its state when the base page was captured. This is a "temporal violation", and the replay mechanism should alert the user to it.
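
As an illustration, here is a greatly simplified sketch of such a check in Python. It is not the full decision table from the paper, just the core comparison, and it also reports the overall date range that a replay UI could display instead of a single capture date; the Component type and function names are mine, not the paper's:

# Sketch: flag embedded resources in a composite memento whose replayed
# version cannot reflect their state at the base page's capture time.
# A simplification of the analysis in "A Framework for Evaluation of
# Composite Memento Temporal Coherence", not its full decision table.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Component:
    url: str
    memento_datetime: datetime          # when the archive captured it
    last_modified: Optional[datetime]   # Last-Modified header, if recorded


def check_composite(base: Component,
                    embedded: List[Component]) -> Tuple[datetime, datetime, List[str]]:
    violations = []
    for comp in embedded:
        if comp.memento_datetime <= base.memento_datetime:
            continue  # captured no later than the base page
        if comp.last_modified and comp.last_modified > base.memento_datetime:
            # Captured after the base page AND modified after the base page's
            # capture time: this version cannot be the state the base page
            # originally embedded.
            violations.append(comp.url)

    all_dates = [base.memento_datetime] + [c.memento_datetime for c in embedded]
    # The range a replay UI could display (or grey out in its timeline)
    # instead of pretending the page has a single capture date.
    return min(all_dates), max(all_dates), violations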

Fidelity Transparency 1

The Wayback Machine and similar replay technologies mislead the user in another way. There's no indication that the 2016 browser uses different fonts and background colors from those an original visitor would have seen.

Kreymer has shown how to replay pages using a nearly contemporaneous browser, in a way that could be transparent to the user and does not involve modifying the replayed content in any way. The difficulty is that doing so is expensive for the archive, an important consideration when it is serving around four million unique IPs each day on a very limited budget, as the Internet Archive does.

Nevertheless, it would be possible for the UI of replay technologies like the Wayback Machine, after displaying their rendering of the preserved page, to offer the option of seeing an oldweb.today style rendering. Since users would be unlikely to choose the option, the cost would probably be manageable.

Fidelity Transparency 2

A second way to provide transparency for replay technologies is for the archive to create and preserve a screenshot of the rendered page at ingest time along with the bitstreams that it received. The Internet Archive started doing this on an experimental basis around 2012; the image to the right is a screenshot they captured of this blog from 2nd February 2013. Their advanced Brozzler crawler does this routinely for selected Web sites.

The indexing and UI for these screenshots need more work; this capture of my blog doesn't appear in the regular Wayback Machine timeline. After viewing the Wayback Machine style replay, the user could be offered the choice of viewing the screenshot or the oldweb.today style replay. Again, I expect few users would take the screenshot offer.

archive.is also captures screenshots, and has integrated them into their UI. The image on the left is a screenshot they captured of this blog from 12th July 2012. Note that archive.is only captured a screen's worth, whereas the Internet Archive captured the whole page, and that the top links, such as "Anmelden", are in German.

Capturing screenshots like this, especially complete page screenshots, is the obvious way to expose most of the confusion caused by executing JavaScript during replay of Mementos. Unfortunately, it significantly increases both the compute cost of ingest and the storage cost of preservation.
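
The mechanics of the capture itself are simple; the cost is in doing it at scale. As a minimal sketch (assuming a headless Chromium driven via Playwright, not the Internet Archive's actual Brozzler pipeline; the URL and function name are placeholders), an ingest process could do something like:

# Sketch: capture a full-page screenshot of a URL at ingest time, to be
# preserved alongside the WARC records for the same capture.
# Assumes Playwright is installed (pip install playwright; playwright install chromium).
from playwright.sync_api import sync_playwright


def screenshot_at_ingest(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for network activity to settle so lazy-loaded resources
        # appear in the screenshot; a real crawler would bound this.
        page.goto(url, wait_until="networkidle", timeout=60_000)
        page.screenshot(path=out_path, full_page=True)
        browser.close()


if __name__ == "__main__":
    screenshot_at_ingest("https://example.com/", "capture.png")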

Conclusion

It is clear that if Web archives are to fulfill their role as important parts of society's memory, they need credibility. Nelson is probably right that their credibility will come under increasing attack from disinformation merchants, including governments. Eliminating the inadequacies of current ingest techniques would be extremely expensive, and given current budgets would result in much less content being collected.

But to put a more positive spin than Nelson's on the situation, relatively affordable improvements could clearly be made to current replay technologies that would make malign or careless misrepresentation of their testimony as to the past much harder, and thus make the degradation of their credibility less likely.

Acknowledgments

Although this trilogy was greatly improved (and expanded) thanks to constructive comments on an earlier draft by Michael Nelson, Ilya Kreymer and Jefferson Bailey, you should not assume that they agree with any of it. Indeed, I hope that Michael Nelson will take up my offer of a guest post, in which he can make the extent of his disagreement clear. These are important and difficult issues; the more discussion the better.

1 comment:

IlyaK said...

Thanks for the mention of oldweb.today and my previous work. I agree with much of what you're saying regarding the need for transparency and in part have created a new system to further illustrate the point.

Another important way that archives can improve transparency is by making the raw, unaltered WARC data available, which contains hashes of resources. This is relatively inexpensive to do, but is generally not done for security reasons: it is useful to be able to exclude certain WARC records, so allowing downloads of the original WARC files is often not desirable. However, access to raw WARC data could still be provided through existing access control systems (robots check, exclusion rules, etc.)
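
Checking those hashes against the stored payloads is straightforward; a minimal sketch with the warcio library (assuming payloads are stored without transfer or content encoding, which is common for crawler output) might look like:

# Sketch: verify the WARC-Payload-Digest recorded for each response record
# in a WARC file, using the warcio library. Assumes payloads were stored
# without transfer/content encoding; a robust verifier must handle those too.
import base64
import hashlib

from warcio.archiveiterator import ArchiveIterator


def verify_payload_digests(warc_path: str):
    mismatches = []
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            claimed = record.rec_headers.get_header('WARC-Payload-Digest')
            if not claimed or not claimed.startswith('sha1:'):
                continue
            payload = record.content_stream().read()
            actual = base64.b32encode(hashlib.sha1(payload).digest()).decode()
            if actual != claimed[len('sha1:'):]:
                mismatches.append(record.rec_headers.get_header('WARC-Target-URI'))
    return mismatches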

With the latest browser technology, specifically Service Workers, it is now actually possible to move replay entirely to the browser, thus making web archive replay something that can be observed and verified locally.
I am excited to share a new prototype that demonstrates this:

https://wab.ac/ which supports WARC (and HAR) replay directly in the browser. (Source: https://github.com/webrecorder/wabac.js)

(Using Service Workers for replay has also been a subject of research at ODU as part of their Reconstructive project).

Since the replay system itself is indeed also a web page, it can be archived into an existing archive, and I've experimented with doing just that using Internet Archive's on-demand archiving:

https://web.archive.org/web/20190628004414/https://wab.ac/


Using this system, it is possible to provide an alternative to IA's existing replay mechanism, from inside its own web archive!

The alternative replay can be accessed directly, e.g.:

https://web.archive.org/web/20190628004414/https://wab.ac/#/20180131235341|https://www.cnn.com/

This alternative replay provides a banner that lists a temporal range instead of a single date, as another example to help with temporal transparency. The system could of course be expanded to provide other visual hints, or to limit temporal incoherence by restricting the range around the base page from which resources will be loaded.

Of course, someone could certainly create a 'malicious' replay system that then attempts to meddle with the data, but if it happens in the browser, at least it could be detected. If the replay system is fully open source, so much the better. It becomes even more important to be able to verify both the replay system itself and the source of the data, so perhaps Certificate Transparency and the Signed Exchanges proposals now being developed will become even more essential to web archiving.


Client side replay has other significant implications for web archives. By moving the replay to the client, the cost of running a web archive replay system could potentially be reduced to that of a simple file server that serves WARCs and indexes. A small enough web archive could be hosted entirely on a free hosting service.
For example, the replay rendered by following the link is entirely free (minus the custom domain), hosted courtesy of GitHub and replayed in the browser:
https://wab.ac/?coll_example=examples/netpreserve-twitter.warc&url=/example/https://netpreserveblog.wordpress.com/2019/05/29/warc-10th-anniversary/

I think an important goal is to make web archives as ubiquitous as, for example, PDFs, and I think this approach might help move things in that direction.