Thursday, May 24, 2018

How Far Is Far Enough?

When collecting an individual web site for preservation by crawling, it is necessary to decide where its edges are: which of the links encountered are "part of the site" and which lead off-site. Crawlers use "crawl rules" to make these decisions. A simple rule would say:
Collect all URLs starting https://www.nytimes.com/
[Image: NoScript's listing of the DNS names from which http://nytimes.com embeds resources]
If a complex "site" is to be properly preserved, the rules need to be a lot more complex. The image shows the start of the list of DNS names from which the New York Times home page embeds resources. Preserving this single page, let alone the "whole site", would need resources from at least 17 DNS names. Rules are needed for each of these names. How are all these more complex rules generated? Follow me below the fold for the answer, and news of an encouraging recent development.
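A minimal sketch, in Python, of how such prefix-based crawl rules might be expressed; the prefixes and the helper function are illustrative assumptions, not the rules of any real crawler:

# Illustrative prefix-based crawl rules; the prefixes are assumptions,
# not the actual rules used to collect the New York Times.
INCLUDE_PREFIXES = [
    "https://www.nytimes.com/",      # the site itself
    "https://static01.nyt.com/",     # an assumed static-asset host
]
EXCLUDE_PREFIXES = [
    "https://www.nytimes.com/ads/",  # an assumed off-limits area
]

def in_scope(url: str) -> bool:
    """Decide whether a discovered URL counts as 'part of the site'."""
    if any(url.startswith(prefix) for prefix in EXCLUDE_PREFIXES):
        return False
    return any(url.startswith(prefix) for prefix in INCLUDE_PREFIXES)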

Fundamentally, the decision as to what is included in "the site" and what is excluded is a subjective one to be taken by a curator. Are the ads part of the New York Times, or not? Are supplementary materials hosted at Amazon part of a journal article, or not? Human input to these decisions is necessary.

Using webrecorder.io, a curator can collect a one-time, high-fidelity snapshot of a complex Web site by clicking on all the links she thinks relevant. This is an effective but time-consuming process, and it preserves only a single site at a single time.

It isn't feasible for the LOCKSS and CLOCKSS systems to use manual ingest in this way. They preserve thousands of journal Web sites, each of which is accreting new content on a regular basis. Their crawlers revisit each site weekly or monthly, depending on the site's publication schedule, and check for new content. The crawl rules they use to do so are defined in a journal-specific "plugin". The process by which plugins are created is documented here; it is a manual process, but:
  • It happens once per journal, not once per visit (but see the caveat below).
  • Because most journals' Web sites are provided by one of the small number of major publishing platforms, plugins for most journals can inherit much of their information from a template for the publishing platform (sketched below).
Less specialized archives lack these advantages.
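To make that inheritance concrete, here is a hypothetical Python sketch of a platform template and the per-journal parameters that instantiate it; the names, URL patterns and structure are assumptions, not the actual LOCKSS plugin format:

# Hypothetical platform template plus per-journal instantiation;
# this is NOT the LOCKSS plugin format, just the underlying idea.
PLATFORM_TEMPLATE = {
    # Crawl-rule patterns shared by every journal on the platform;
    # {base_url} and {journal_id} are filled in per journal.
    "start_url": "{base_url}/{journal_id}/manifest",
    "include": ["{base_url}/{journal_id}/", "{base_url}/shared-assets/"],
    "exclude": ["{base_url}/search"],
}

def make_plugin(base_url: str, journal_id: str) -> dict:
    """Instantiate the platform template with journal-specific parameters."""
    def fill(pattern: str) -> str:
        return pattern.format(base_url=base_url, journal_id=journal_id)
    return {
        "start_url": fill(PLATFORM_TEMPLATE["start_url"]),
        "include": [fill(p) for p in PLATFORM_TEMPLATE["include"]],
        "exclude": [fill(p) for p in PLATFORM_TEMPLATE["exclude"]],
    }

# Two journals on the same platform differ only in their parameters.
plugin_a = make_plugin("https://www.example-publisher.com", "jphysics")
plugin_b = make_plugin("https://www.example-publisher.com", "jchemistry")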

Now, a new project from the Memento team holds out the promise of similar optimizations for more generic Web sites. The concept for Memento Tracer is to crowd-source a database of webrecorder.io-like crawls of complex Web sites in a form that can be analyzed to generate abstract templates similar to the platform templates on which LOCKSS plugins are mostly based. The project Web site says:
The Memento Tracer framework introduces a new collaborative approach to capture web publications for archival purposes. It is inspired by existing capture approaches yet aims for a new balance between the scale at which capturing can be conducted and the quality of the snapshots that result.
Like existing web crawler approaches, Memento Tracer uses server-side processes to capture web publications. As is the case with LOCKSS, these processes leverage the insight that web publications in any given portal are typically based on the same template and hence share features such as lay-out and interactive affordances. As is the case with webrecorder.io, human guidance helps to achieve high fidelity captures. But with Memento Tracer, heuristics that apply to an entire class of web publications are recorded, not individual web publications. These heuristics can collaboratively be created by curators and deposited in a shared community repository. When the server-side capture processes come across a web publication of a class for which heuristics are available, they can apply them and hence capture faithfull snapshots at scale. 
As the Memento team write, Tracer is in a nascent but promising state:
It is hard to say when Memento Tracer will be ready for a test ride, let alone for prime time. The components are currently experimental but we are making promising progress. The process of recording Traces and capturing web publications on the basis of these Traces has been demonstrated successfully for publications in a range of portals. But there also remain challenges that we are investigating, including:
  • User interface to support recording Traces for complex sequences of interactions.
  • Limitations of the browser event listener approach for recording Traces.
  • Language used to express Traces.
  • Organization of the shared repository for Traces.
  • Selection of a Trace for capturing a web publication in cases where different page layouts and interactive affordances are available for web publications that share a URI pattern.
The team have a set of demo screencasts, showing:
  • A somewhat higher level of capture automation than webrecorder.io
  • A very compact JSON representation of the template, albeit for very regularly structured sites such as FigShare, GitHub and SlideShare (a hypothetical illustration follows below).
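I have not seen the Trace format specified, so the following is a purely hypothetical illustration, as Python emitting JSON, of the kind of compact template a Trace might encode for a regularly structured portal; the URI pattern and selectors are assumptions:

import json

# Purely hypothetical illustration of a compact, JSON-like Trace for a
# regularly structured portal; this is NOT the actual Memento Tracer format.
hypothetical_trace = {
    "portal": "github.com",
    "applies_to": "https://github.com/{owner}/{repo}",  # URI pattern for the class
    "actions": [
        # Interactive affordances the capture process should exercise on
        # every publication matching the pattern; selectors are assumptions.
        {"click": "a[href$='/releases']"},
        {"click": "a[href$='/tags']"},
        {"capture": "article.markdown-body"},
    ],
}

print(json.dumps(hypothetical_trace, indent=2))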
I believe Tracer has the potential to become an important enhancement to Web archiving technology. But our experience with LOCKSS plugins leads me to issue the following caveat.

Caveat: It is unfortunate but inevitable that platforms, and the publications that use them, are continually being tweaked to "optimize the user experience"; in other, more realistic, words, to confuse both the regular reader and the crawler plugin that is supposed to collect the publication. Thus an important part of a LOCKSS plugin is a heuristic that estimates how much content a crawl is expected to collect. If a crawl fails to collect as much as expected, or collects a lot more, an alert is generated. The ingest team will assess the crawl in question to determine how the user experience has been optimized, and tweak the plugin to match.
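A minimal sketch of such an expected-size check; the thresholds and the alert wording are illustrative assumptions, not the actual LOCKSS heuristic:

from typing import Optional

# Illustrative expected-size check on a completed crawl; the 0.8/1.5
# thresholds and the notion of "expected" are assumptions, not LOCKSS code.
def check_crawl_size(collected: int, expected: int,
                     low: float = 0.8, high: float = 1.5) -> Optional[str]:
    """Return an alert message if a crawl is suspiciously small or large."""
    if collected < low * expected:
        return f"Collected {collected} URLs, expected ~{expected}: content may be missing"
    if collected > high * expected:
        return f"Collected {collected} URLs, expected ~{expected}: rules may be leaking off-site"
    return None

alert = check_crawl_size(collected=412, expected=1100)
if alert:
    print(alert)  # flag the crawl for the ingest team to assess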

Over time, Tracer heuristics will become obsolete and misleading. A mechanism similar to LOCKSS' will be needed to detect this obsolescence. When a heuristic becomes obsolete, there needs to be a way to persuade a curator to re-crawl the site(s) that generated it, and to replace the stored heuristic with one better suited to the newly optimized user experience.

3 comments:

  1. Thanks for this post, David. Regarding the caveat: Clearly there is that problem because, in essence, what happens is an abstract form of screen scraping. But, I am rather confident that the problem is tractable in the proposed approach:

    - A quality control mechanism at the end of the server's crawling process can detect that something is wrong with the Trace at hand: When an instruction in a Trace is no longer valid, the associated call to the headless browser API will result in an error message. This error can be intercepted and is an indication that the Trace has to be re-recorded.

    - One can imagine acting automatically upon such an error condition: There could be an alerting mechanism from the server-side crawler environment to the shared repository when a Trace doesn't work anymore. If the repository is - say - GitHub, this could be done by automatically posting an Issue.
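A minimal sketch of what that alerting might look like; the repository name, the Trace identifier and the token handling are assumptions:

import os
import requests

SHARED_TRACE_REPO = "example-org/shared-traces"  # assumed repository name

def report_broken_trace(trace_id: str, error: Exception) -> None:
    """File a GitHub Issue asking for a Trace to be re-recorded."""
    response = requests.post(
        f"https://api.github.com/repos/{SHARED_TRACE_REPO}/issues",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={
            "title": f"Trace {trace_id} no longer works",
            "body": f"Headless-browser replay failed with: {error!r}. "
                    "The Trace probably needs to be re-recorded.",
        },
        timeout=30,
    )
    response.raise_for_status()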

  2. I don't think it will be adequate to depend on getting an error from the headless browser to detect that a Trace is obsolete. In our experience, one of the big difficulties in detecting that a site has optimized the user's experience is soft-403 and soft-404 responses. Some mechanism similar to the one Rene Voorburg implemented for robustify.js will be needed in addition:

    "I just uploaded a new version of the robustify.js helper script (https://github.com/renevoorburg/robustify.js) that attempts to recognize soft-404s. It does so by forcing a '404' with a random request and comparing the results of that with the results of the original request (using fuzzy hashing). It seems to work very well but I am missing a good test set of soft 404's."

  3. I take your point about soft 403s/404s, but I think the headless browser would catch errors when the HTML template has changed: it will presumably throw an error when a particular div/class/id/etc. is no longer in the HTML and it can no longer perform some action it was supposed to, and thus never even issues an HTTP request.
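For example, a sketch using Selenium; the URL and selector are assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
try:
    driver.get("https://portal.example.com/publication/123")  # assumed URL
    # If the template has changed and the element is gone, this raises
    # NoSuchElementException before any further HTTP request is issued.
    driver.find_element(By.CSS_SELECTOR, "a.supplementary-download").click()
except NoSuchElementException as exc:
    print(f"Trace instruction failed, no request issued: {exc}")
finally:
    driver.quit()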
