Monday, May 7, 2012

Harvesting and Preserving the Future Web

Kris Carpenter Negulescu of the Internet Archive and I organized a half-day workshop on the problems of harvesting and preserving the future Web during the International Internet Preservation Consortium General Assembly 2012 at the Library of Congress. My involvement was spurred by my long-time interest in the evolution of the Web from a collection of linked documents whose primary language was HTML to a programming environment whose primary language is JavaScript.

In preparation for the workshop, Kris and I, with help from staff at the Internet Archive, put together a list of 13 areas already causing problems for Web preservation:
  1. Database driven features
  2. Complex/variable URI formats
  3. Dynamically generated URIs
  4. Rich, streamed media
  5. Incremental display mechanisms
  6. Form-filling
  7. Multi-sourced, embedded content
  8. Dynamic login, user-sensitive embeds
  9. User agent adaptation
  10. Exclusions (robots.txt, user-agent, ...)
  11. Exclusion by design
  12. Server-side scripts, RPCs
  13. HTML5
A forthcoming document will elaborate with examples on this list and the other issues identified at the workshop. Some partial solutions are already being worked on. For example, Google, the Institut national de l'audiovisuel in France, and the Internet Archive, among others, have active programs that execute the content they collect using "headless browsers" such as PhantomJS.

But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach. An interesting example of why this is so is described in an article on A/B testing in Wired. It explains how web sites run experiments on their users, continually presenting them with randomly selected combinations of small changes as part of a testing program:
Use of a technique called multivariate testing, in which myriad A/B tests essentially run simultaneously in as many combinations as possible, means that the percentage of users getting some kind of tweak may well approach 100 percent, making “the Google search experience” a sort of Platonic ideal: never encountered directly but glimpsed only through imperfect derivations and variations.
It isn't just that one user's experience differs from another's. The user can never step into the same river twice. Even if we can capture and replay the experience of stepping into it once, the next time will be different, and the differences may be meaningful or may be random perturbations. We need to re-think the whole idea of preservation.
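
To put rough numbers on the multivariate testing point, here is a minimal sketch. The experiment names, the variants, and the hash-based assignment rule are all hypothetical and chosen only for illustration; the Wired article does not describe how any real site assigns its variants.

```python
import hashlib
from itertools import product

# Hypothetical experiments: each maps a tweak to the variants it can take.
EXPERIMENTS = {
    "button_color": ["blue", "green", "red"],
    "result_count": [10, 20],
    "font_size": ["small", "medium", "large"],
    "logo": ["old", "new"],
}


def all_experiences(experiments):
    """Enumerate every combination of variants a visitor could be served."""
    names = sorted(experiments)
    combos = product(*(experiments[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def experience_for(visitor_id, experiments):
    """Pick one variant per experiment from a hash of the visitor id.

    Assumption: assignment is stable per visitor. A site that re-randomizes
    on every visit would be even harder to capture reproducibly.
    """
    chosen = {}
    for name in sorted(experiments):
        digest = hashlib.sha256(f"{visitor_id}:{name}".encode()).digest()
        variants = experiments[name]
        chosen[name] = variants[int.from_bytes(digest[:8], "big") % len(variants)]
    return chosen


if __name__ == "__main__":
    # 3 * 2 * 3 * 2 = 36 distinct "experiences" from only four small tweaks.
    print(len(all_experiences(EXPERIMENTS)))
    # A crawler is a single visitor, so it sees exactly one of them.
    print(experience_for("crawler-1", EXPERIMENTS))
```

Even in this toy setup, a harvest by one synthetic visitor captures one combination out of 36, and the count multiplies with every experiment added.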


  1. Perhaps it is a matter of defining and preserving some "representative" experiences of the web.
    I also suspect this problem applies to most types of objects, and in particular it arises when using emulation strategies to preserve them, i.e. it is very difficult to preserve "the" experience of a .doc file rendering. Many different user environment configurations could alter that experience in many different ways (e.g. inclusion of different fonts, different screen sizes, colour calibrations etc).
    I have been advocating preserving "representative" experiences of objects for a while because of this problem. As I see it the average user of a .doc file that was shared outside of any one system would only be expected to render it in a representative environment not the exact same environment that the object was originally intended/created to be rendered within.
    The same could apply to websites: we could identify combinations of environments, configurations and website files that provide an experience that is representative of the average experience (or a few experiences that represent different interesting perspectives) and preserve that (or those).

    Thanks for the thought-provoking post!

  2. Great point by Euan and I agree completely. Preserving "*the* user experience" was never a realistic goal either in the digital world or the analog one. Constraints of resources, technology, theory -- all will make the archive, by nature, incomplete. Formerly that meant limiting the volume of an accession and now it means limiting the extent of the emulation or the parameters of a technical representation.

    In many ways, the siren's song of preserving digital objects is that we often seem to have the technical tools to preserve expansive amounts of context and artifactuality but then crash upon the rocks when realizing how many of their characteristics are truly evanescent.

    Love the blog and look forward to this document, as many of the problems mentioned here are often unforeseen by smaller institutions undertaking web archiving projects.

  3. I disagree. In the early days of the Web the user experience was close enough to identical for every user, and with simple tools it could be harvested and replayed well enough, that everyone ignored the differences. Preserving the user experience was a realistic goal in those days, indeed a goal that was largely achieved.

    This set the paradigm in which we have been working ever since. Over time the differences grew; we fell shorter and shorter of the goal. But we always had a Platonic ideal of the user experience in view.

    What we're seeing now is a future in which the common parts of the user experience are unimportant to the user. What we can capture and replay is more about the user than about the site. And the synthetic users we can construct to do the harvesting just aren't that interesting or representative of a real user.

    The problems facing the teams trying to preserve multi-player games and virtual worlds are coming to the Web.

  4. Interesting points David.

    Correct me if I am wrong but it seems that you are suggesting that we should see the web as interactive software that is widely distributed. And the interactive and distributed aspects make it near impossible to preserve given current technology. For example, preserving a particular person's facebook experience is not the same as preserving the experience of any random person using facebook. In order to do the latter you would have to maintain the ability to sign up to a facebook account at a particular time and interact with relevant users (possibly from that time).

    When seen that way it is a very huge challenge and possibly impossible.

    I would add though that the platonic ideal that you refer to was probably never realistic for the majority of digital objects and the majority of web sites. It may have been the goal but it was probably never actually achieved or rarely achieved, or, more interestingly, never knowingly achieved.

    Perhaps we just need to set expectations lower and we could be aided in doing that by showing how difficult it is to achieve that ideal with even simple older digital objects (due to various complex dependencies not being able to be preserved such as variations in hardware used to render objects).

    Thanks again,

    Euan Cochrane

  5. I don't agree that, in the early days, preserving the user experience of the Web was an impossible goal. Because of the very simple nature of early web sites, say in 1998, both the Internet Archive and the early LOCKSS prototype were able to achieve a very high degree of fidelity when replaying the typical preserved site.

    Both had (then as now) problems with coverage, and with inter-site experiences, but in terms of setting initial expectations for what Web preservation should be able to achieve these problems were seen as less important than the fidelity they were able to achieve within a single site.

    Also, I'm arguing for more fundamental re-thinking than simply lowering expectations of what can be achieved with current techniques by demonstrating their flaws.

  6. david, thank you, first,
    for bringing up the issue.

    and second, for standing up
    to stress its importance
    when people tell you that
    your goal is "impossible",
    and probably always was...

    we're losing "commonality"
    _and_ letting history slip
    through our fingers like
    grains of sand at a beach.

    the problem with "lowering
    the expectation" of our
    preservationists is that
    nobody really knows how to
    keep it from hitting zero.

    because that's where it'll
    end up if we stay on this
    far-too-slippery slope...

    the web is our newspaper.
    it is our diary, and our
    history book. do we really
    want to write them in ink
    guaranteed to fade away?

    and _sooner_, not later?
    to the point that it is
    already fully invisible
    even when first written?

    as you said, david, we
    clearly need to rethink,
    and do it _fundamentally_.

    don't let people
    pooh-pooh you...


  7. Thanks for the clarification.

    My suggestion is that we have never had the ability to preserve every person's experience of the web with full fidelity (except in the very early days). The problem is just more pronounced now. The reason for this is the differences in configurations of software and hardware that users have used to interact with websites and the effect they have on the information presented to users when viewing those sites.
    I also would suggest that we could now preserve pretty well a 'static' version of one particular person's (or a few's) experience of, for example, their facebook account. Or even a week's worth of their facebook account changes. But would that be enough?

    That is not to say there hasn't been great work done. One or more experiences of a huge number of web sites have been preserved with full fidelity.

    I fully agree that it would be awful and a great loss if we didn't find a way to preserve a far richer experience of the modern/future web and I completely agree that lowering expectations seems like admitting defeat. I just think that when you put it in perspective the web has always been difficult to preserve because of its dynamic, distributed and dependent nature. Luckily the expectation was not set that we would preserve every person's experience of the web, or that all the inter-web links would continue to be preserved. And we could apply this same strategy to the future web.

  8. david-

    i get it.

    you're telling us we will
    soon be going off a cliff,
    and need to change course.

    this is very important...

    vital, crucial, imperative.

    you need to persist.

    do not be waylaid by the
    people who wanna tell you
    that we have always been
    on this particular path...

    maybe we have.

    or maybe we haven't.

    but it's beside the point.

    because we are about to go
    off the edge of a cliff...

    i get it. you must persist.


  9. Via Slashdot, we learn that Google is now executing the JavaScript it finds. Among other things, this may be the reason for:

    "Other people have reported strange things happening to shopping carts. One rumor is that large orders are being placed consisting of one of each item on sale, only to be aborted at the last minute."

  10. I was remiss in not noticing that the report from this workshop was posted to the IIPC's web site. It is a comprehensive overview of the issues raised.