Tuesday, August 23, 2016

Content negotiation and Memento

Back in March Ilya Kreymer summarized discussions he and I had had about a problem he'd encountered building oldweb.today thus:
a key problem with Memento is that, in its current form, an archive can return an arbitrarily transformed object and there is no way to determine what that transformation is. In practice, this makes interoperability quite difficult.
What Ilya was referring to was that, for a given Web page, some archives have preserved the HTML, the images, the CSS and so on, whereas some have preserved a PNG image of the page (transforming it by taking a screenshot). Herbert van de Sompel, Michael Nelson and others have come up with a creative solution. Details below the fold.


I suggested that what we were really talking about was yet another form of content negotiation; Memento (RFC7089) specifies content negotiation in the time dimension, HTTP specifies content negotiation in the format and language "dimensions", and what Ilya wanted was content negotiation in the "transform" dimension to allow a requestor to choose between transformed and untransformed versions of the page. Ilya's list of transforms was:
  • none - the URL content exactly as originally received.
  • screenshot - an image of the rendered page.
  • altered-dom - the DOM altered as, for example, by archive.is.
  • url-rewritten - URLs in the page rewritten to point to preserved pages in the archive.
  • banner-inserted - the page framed by archival metadata as, for example, by the Wayback Machine.
Ilya's and my idea was that a new HTTP header would be defined to support this form of content negotiation.
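
To make the existing "dimensions" concrete, here is a minimal Python sketch of the request headers a client would send to negotiate on format, language and time. Accept and Accept-Language come from HTTP and Accept-Datetime from RFC7089; the function name and its defaults are my own illustrative assumptions, and no header for the "transform" dimension is shown because none has been defined.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def negotiation_headers(when, media_type="text/html", language="en"):
    """Build headers for the three existing negotiation dimensions.

    `negotiation_headers` is a hypothetical helper, not part of any library.
    """
    if when.tzinfo is None:
        raise ValueError("when must be a timezone-aware datetime")
    return {
        # Format dimension (HTTP).
        "Accept": media_type,
        # Language dimension (HTTP).
        "Accept-Language": language,
        # Time dimension (Memento, RFC7089): an HTTP-date expressed in GMT.
        "Accept-Datetime": format_datetime(
            when.astimezone(timezone.utc), usegmt=True
        ),
    }


if __name__ == "__main__":
    capture_time = datetime(2016, 7, 21, 15, 25, 44, tzinfo=timezone.utc)
    for name, value in negotiation_headers(capture_time).items():
        print(f"{name}: {value}")
```

A client would send these headers to a TimeGate for the original URL; nothing in them lets it say whether it wants the transformed or the untransformed version, which is the gap Ilya identified.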

[Image: banner-inserted content outlined in red]
Shawn Jones, Herbert and Michael objected that defining new HTTP headers was hard, and wrote a detailed post which explained the scope of the problem:
In the case of our study, we needed to access the content as it had existed on the web at the time of capture. Research by Scott Ainsworth requires accurate replay of the headers as well. These captured mementos are also invaluable to the growing number of research studies that use web archives. Captured mementos are also used by projects like oldweb.today, that truly need to access the original content so it can be rendered in old browsers. It seeks consistent content from different archives to arrive at an accurate page recreation. Fortunately, some web archives store the captured memento, but there is no uniform, standard-based way to access them across various archive implementations.
Their proposal was to use two different Memento TimeGates, one for the transformed and one for the un-transformed content.

The elegance of Herbert et al.'s latest proposal comes from eliminating the need to define new HTTP headers or to use multiple TimeGates. Instead, they propose using the standard Prefer header from RFC7240. They write:
Consider a client that prefers a true, raw memento for http://www.cnn.com. Using the Prefer HTTP request header, this client can provide the following request headers when issuing an HTTP HEAD/GET to a memento.
    GET /web/20160721152544/http://www.cnn.com/ HTTP/1.1
    Host: web.archive.org
    Prefer: original-content, original-links, original-headers
    Connection: close
As we see above, the client specifies which level of raw-ness it prefers in the memento. In this case, the client prefers a memento with the following features:
  1. original-content - The client prefers that the memento returned contain the same HTML, JavaScript, CSS, and/or text that existed in the original resource at the time of capture.
  2. original-links - The client prefers that the memento returned contain the links that existed in the original resource at the time of capture.
  3. original-headers - The client prefers that the memento response uses X-Archive-Orig-* to express the values of the original HTTP response headers from the moment of capture.
The memento that is returned can carry the Preference-Applied HTTP response header, indicating which of the requested preferences have been applied to the returned content. This is closely analogous to the earlier suggestion of content negotiation but doesn't require either new headers or multiple TimeGates.
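
A minimal Python sketch of the client side of this exchange might look like the following. The three preference tokens are the ones in the proposal and the Prefer and Preference-Applied headers are from RFC7240, but the function names are hypothetical and the parser is deliberately simple: it assumes plain comma-separated tokens and would mishandle quoted values that contain commas. Whether any particular archive honors these preferences is not something this sketch assumes.

```python
# Preference tokens taken from the proposal quoted above.
RAW_PREFERENCES = ("original-content", "original-links", "original-headers")


def build_request(memento_path, host, preferences=RAW_PREFERENCES):
    """Return the text of an HTTP/1.1 GET asking for a raw memento."""
    lines = [
        f"GET {memento_path} HTTP/1.1",
        f"Host: {host}",
        "Prefer: " + ", ".join(preferences),
        "Connection: close",
        "",  # blank line that ends the header block
        "",
    ]
    return "\r\n".join(lines)


def parse_preference_applied(header_value):
    """Return the set of preference tokens named in Preference-Applied."""
    applied = set()
    for item in (header_value or "").split(","):
        # Drop any ";parameter" and "=value" parts, keeping only the token.
        token = item.split(";", 1)[0].split("=", 1)[0].strip().lower()
        if token:
            applied.add(token)
    return applied


def unmet_preferences(requested, header_value):
    """List requested preferences the archive did not report as applied."""
    applied = parse_preference_applied(header_value)
    return [pref for pref in requested if pref not in applied]


if __name__ == "__main__":
    print(build_request("/web/20160721152544/http://www.cnn.com/",
                        "web.archive.org"))
    # Example response header value, invented here for illustration only.
    print(unmet_preferences(RAW_PREFERENCES,
                            "original-content, original-links"))
```

Because a server is free to ignore a preference, the client has to check Preference-Applied rather than assume it received the raw memento; a missing header is treated here as "nothing applied".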

The details of their proposal are important; you should read it.
