DSHR's Blog: Harvesting and Preserving the Future Web

Monday, May 7, 2012

Harvesting and Preserving the Future Web

Kris Carpenter Negulescu of the Internet Archive and I organized a half-day workshop on the problems of harvesting and preserving the future Web during the International Internet Preservation Consortium (IIPC) General Assembly 2012 at the Library of Congress. My involvement was spurred by my long-time interest in the evolution of the Web from a collection of linked documents whose primary language was HTML to a programming environment whose primary language is JavaScript.

In preparation for the workshop Kris and I, with help from staff at the Internet Archive, put together a list of 13 areas that are already causing problems for Web preservation:

Database driven features
Complex/variable URI formats
Dynamically generated URIs
Rich, streamed media
Incremental display mechanisms
Form-filling
Multi-sourced, embedded content
Dynamic login, user-sensitive embeds
User agent adaptation
Exclusions (robots.txt, user-agent, ...)
Exclusion by design
Server-side scripts, RPCs
HTML5

A forthcoming document will elaborate on this list, with examples, and on the other issues identified at the workshop. Some partial solutions are already being worked on. For example, Google, the Institut national de l'audiovisuel in France, and the Internet Archive, among others, have active programs that execute the content they collect using "headless browsers" such as PhantomJS.
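To see why executing content matters, here is a minimal sketch of what a harvester that only parses HTML would find on a page whose links are partly built by a script. It is an illustration only, not how any of the programs above work: the page, the `LinkExtractor` class and the `static_links` helper are all hypothetical, and the sketch uses nothing beyond Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as a non-executing crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one static link, plus one link that exists only
# after the script has been executed by a browser.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<script>
  var id = 42;
  var a = document.createElement("a");
  a.href = "/item?id=" + id;
  document.body.appendChild(a);
</script>
</body></html>
"""


def static_links(html):
    """Return the links visible to a parser that never runs JavaScript."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(static_links(PAGE))
```

The parser reports only `/about.html`. The `/item?id=42` link never appears in the markup, so a harvester has to run the script, for instance in a headless browser, before it can even discover that the URI exists.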

But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach. An interesting example of why this is so is described in an article on A/B testing in Wired. It explains how web sites run experiments on their users, continually presenting them with randomly selected combinations of small changes as part of a testing program:

Use of a technique called multivariate testing, in which myriad A/B tests essentially run simultaneously in as many combinations as possible, means that the percentage of users getting some kind of tweak may well approach 100 percent, making “the Google search experience” a sort of Platonic ideal: never encountered directly but glimpsed only through imperfect derivations and variations.

It isn't just that one user's experience differs from another's. The user can never step into the same river twice. Even if we can capture and replay the experience of stepping into it once, the next time will be different, and the differences may be meaningful, or random perturbations. We need to re-think the whole idea of preservation.
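A rough sketch of the arithmetic behind this may help. It assumes a site running a handful of independent experiments and assigning each visitor a variant by hashing a visitor identifier; the experiment names, the variants and the hashing scheme are all made up for illustration and are not taken from the Wired article or from any real site.

```python
import hashlib

# Hypothetical experiments, each with a small set of variants.
EXPERIMENTS = {
    "button_color": ["blue", "green", "red"],
    "headline": ["A", "B"],
    "layout": ["one_column", "two_column"],
    "results_per_page": ["10", "20", "30"],
}


def assign_variants(visitor_id, experiments=EXPERIMENTS):
    """Pick one variant per experiment from a hash of visitor and experiment."""
    chosen = {}
    for name, variants in experiments.items():
        key = f"{visitor_id}:{name}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        index = int.from_bytes(digest[:4], "big") % len(variants)
        chosen[name] = variants[index]
    return chosen


def count_combinations(experiments=EXPERIMENTS):
    """Number of distinct versions of the page the experiments can produce."""
    total = 1
    for variants in experiments.values():
        total *= len(variants)
    return total


if __name__ == "__main__":
    print(count_combinations())          # 3 * 2 * 2 * 3 = 36
    print(assign_variants("crawler-1"))  # what one synthetic visitor is served
```

Even these four small experiments yield 36 possible versions of the page, and the count multiplies with every experiment added. A crawler that visits once captures exactly one of them, and it is the one served to the crawler rather than to any real user.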

10 comments:

euanc said...: Perhaps it is a matter of defining and preserving some "representative" experiences of the web.
I also suspect this problem applies to most types of objects, and in particular it applies when applying emulation strategies to preserve them, i.e. it is very difficult to preserve "the" experience of a .doc file rendering. Many different user environment configurations could alter that experience in many different ways (e.g. inclusion of different fonts, different screen sizes, colour calibrations etc).
I have been advocating preserving "representative" experiences of objects for a while because of this problem. As I see it the average user of a .doc file that was shared outside of any one system would only be expected to render it in a representative environment not the exact same environment that the object was originally intended/created to be rendered within.
The same could apply to websites: we could identify combinations of environments, configurations and website files that provide an experience that is representative of the average experience (or a few experiences that represent different interesting perspectives) and preserve that (or those).

Thanks for the thought-provoking post!; May 7, 2012 at 9:40 PM
jeffersonbailey said...: Great point by Euan and I agree completely. Preserving "*the* user experience" was never a realistic goal either in the digital world or the analog one. Constraints of resources, technology, theory -- all will make the archive, by nature, incomplete. Formerly that meant limiting the volume of an accession and now it means limiting the extent of the emulation or the parameters of a technical representation.

In many ways, the siren's song of preserving digital objects is that we often seem to have the technical tools to preserve expansive amounts of context and artifactuality but then crash upon the rocks when realizing how many of their characteristics are truly evanescent.

Love the blog and look forward to this document, as many of the problems mentioned here are often unforeseen by smaller institutions undertaking web archiving projects.; May 8, 2012 at 7:25 PM
David. said...: I disagree. In the early days of the Web the user experience was close enough to identical for every user, and with simple tools it could be harvested and replayed well enough, that everyone ignored the differences. Preserving the user experience was a realistic goal in those days, indeed a goal that was largely achieved.

This set the paradigm in which we have been working ever since. Over time the differences grew; we fell shorter and shorter of the goal. But we always had a Platonic ideal of the user experience in view.

What we're seeing now is a future in which the common parts of the user experience are unimportant to the user. What we can capture and replay is more about the user than about the site. And the synthetic users we can construct to do the harvesting just aren't that interesting or representative of a real user.

The problems facing the teams trying to preserve multi-player games and virtual worlds are coming to the Web.; May 9, 2012 at 6:14 AM
euanc said...: Interesting points David.

Correct me if I am wrong but it seems that you are suggesting that we should see the web as interactive software that is widely distributed. And the interactive and distributed aspects make it near impossible to preserve given current technology. For example, preserving a particular person's facebook experience is not the same as preserving the experience of any random person using facebook. In order to do the latter you would have to maintain the ability to sign up to a facebook account at a particular time and interact with relevant users (possibly from that time).

When seen that way it is a very huge challenge and possibly impossible.

I would add though that the platonic ideal that you refer to was probably never realistic for the majority of digital objects and the majority of web sites. It may have been the goal but it was probably never actually achieved or rarely achieved, or, more interestingly, never knowingly achieved.

Perhaps we just need to set expectations lower and we could be aided in doing that by showing how difficult it is to achieve that ideal with even simple older digital objects (due to various complex dependencies not being able to be preserved such as variations in hardware used to render objects).

Thanks again,

Euan Cochrane; May 10, 2012 at 4:26 PM
David. said...: I don't agree that, in the early days, preserving the user experience of the Web was an impossible goal. Because of the very simple nature of early web sites, say in 1998, both the Internet Archive and the early LOCKSS prototype were able to achieve a very high degree of fidelity when replaying the typical preserved site.

Both had (then as now) problems with coverage, and with inter-site experiences, but in terms of setting initial expectations for what Web preservation should be able to achieve these problems were seen as less important than the fidelity they were able to achieve within a single site.

Also, I'm arguing for more fundamental re-thinking than simply lowering expectations of what can be achieved with current techniques by demonstrating their flaws.; May 12, 2012 at 1:07 PM
bowerbird said...: david, thank you, first,
for bringing up the issue.

and second, for standing up
to stress its importance
when people tell you that
your goal is "impossible",
and probably always was...

we're losing "commonality"
_and_ letting history slip
through our fingers like
grains of sand at a beach.

the problem with "lowering
the expectation" of our
preservationists is that
nobody really knows how to
keep it from hitting zero.

because that's where it'll
end up if we stay on this
far-too-slippery slope...

the web is our newspaper.
it is our diary, and our
history book. do we really
want to write them in ink
guaranteed to fade away?

and _sooner_, not later?
to the point that it is
already fully invisible
even when first written?

as you said, david, we
clearly need to rethink,
and do it _fundamentally_.

don't let people
pooh-pooh you...

-bowerbird; May 14, 2012 at 12:00 PM
euanc said...: Thanks for the clarification.

My suggestion is that we have never had the ability to preserve every person's experience of the web with full fidelity (except in the very early days). The problem is just more pronounced now. The reason for this is the differences in configurations of software and hardware that users have used to interact with websites and the effect they have on the information presented to users when viewing those sites.
I also would suggest that we could now preserve pretty well a 'static' version of one particular person's (or a few's) experience of, for example, their facebook account. Or even a week's worth of their facebook account changes. But would that be enough?

That is not to say there hasn't been great work done. One or more experiences of a huge number of web sites have been preserved with full fidelity.

I fully agree that it would be awful and a great loss if we didn't find a way to preserve a far richer experience of the modern/future web and I completely agree that lowering expectations seems like admitting defeat. I just think that when you put it in perspective the web has always been difficult to preserve because of its dynamic, distributed and dependent nature. Luckily the expectation was not set that we would preserve every person's experience of the web, or that all the inter-web links would continue to be preserved. And we could apply this same strategy to the future web.; May 14, 2012 at 2:41 PM
bowerbird said...: david-

i get it.

you're telling us we will
soon be going off a cliff,
and need to change course.

this is very important...

vital, crucial, imperative.

you need to persist.

do not be waylaid by the
people who wanna tell you
that we have always been
on this particular path...

maybe we have.

or maybe we haven't.

but it's beside the point.

because we are about to go
off the edge of a cliff...

i get it. you must persist.

-bowerbird; May 16, 2012 at 12:45 PM
David. said...: Via Slashdot, we learn that Google is now executing the JavaScript it finds. Among other things, this may be the reason for:

"Other people have reported strange things happening to shopping carts. One rumor is that large orders are being placed consisting of one of each item on sale, only to be aborted at the last minute."; June 4, 2012 at 2:57 AM
David. said...: I was remiss in not noticing that the report from this workshop was posted to the IIPC's web site. It is a comprehensive overview of the issues raised.; February 19, 2013 at 11:25 AM