Thursday, April 4, 2013

Talk at Spring 2013 CNI

Kris Carpenter Negulescu and I gave talks at the Spring 2013 CNI meeting in a project briefing entitled "It's Not Your Grandfather's Web Any Longer". They were based on the workshop we ran at the 2012 IIPC meeting at the Library of Congress looking at the problems of harvesting and preserving the future Web. I talked about the problems the workshop identified and Kris talked about the solutions people are working on. Below the fold is an edited text of my part of the talk with links to the sources.

DEC 340 display
I am a grandfather and I'm old enough to remember the Web the way it was. Actually, I'm old enough to remember back long before there was a Web. About this time in 2015 will be the 50th anniversary of the first program I ever wrote. A couple of years later I started life as a Cambridge undergraduate and encountered computer graphics on a PDP-7 with a DEC 340 display. Just before that, one of the smartest people I ever ended up meeting co-authored a paper on computer graphics entitled On the Design of Display Processors (also here). In it, Ivan Sutherland introduced the Wheel of Reincarnation as applied to graphics hardware: the idea that the hardware design was cyclical, oscillating between a fixed-function I/O peripheral that over time grew into a programmable I/O processor that over time grew into a fully-functional computer driving a fixed-function display.

Virtua Fighter on NV1
GeForce GTX280
The wheel is still rolling along, if a bit slower than in Ivan's day. When we started NVIDIA 20 years ago we designed a fixed-function graphics chip, whereas almost all of our competitors designed fully programmable graphics chips. It was one of the key performance advantages that allowed us to get, for the first time ever, arcade games running at full frame rate on a PC. This is Sega's Virtua Fighter on NV1, preserved on YouTube. The bottleneck was the framebuffer memory; we used every available memory cycle to render graphics, whereas they had to waste some proportion of the memory cycles fetching instructions for their processor from the framebuffer. Now, of course, NVIDIA's chips are fully programmable GPUs. In 20 years they have gone half-way round the wheel, from fixed-function to programmable.

The wheel also applies to graphics and user interface software. The early window systems, such as SunView and the Andrew window system, were libraries of fixed functions, as was the X Window System. James Gosling and I tried to move a half-turn around the wheel with NeWS, which was a user interface environment fully programmable in PostScript, but we were premature.

12/1/98
What has this to do with the Web? The half-turn around the wheel that James & I couldn't manage has happened to the Web. The Web that Tim Berners-Lee invented was a practical implementation of Ted Nelson's utopian concept of Xanadu, a web of documents connected by hyperlinks, encoded in a fixed-function document description language. That Web was relatively easy to ingest for preservation, because a crawler could visit the page, find the links in it, and follow them to other pages it needed to ingest. And it was relatively easy to preserve once ingested, because the content of each document changed infrequently, so two visits in succession would probably obtain the same content. The phenomenal success of the Internet Archive was based on this model. Here, from the Wayback Machine, is the front page of BBC News from more than 15 years ago. The only really difficult aspect of the problem was dissemination, replaying the collected content so that the links resolved to their preserved targets, wherever they were available, rather than their current targets. It wasn't until the recent work of Herbert van de Sompel, Michael Nelson and the Memento team that this was resolved.
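To make the "visit the page, find the links, follow them" model concrete, here is a minimal sketch in Python. It is an illustration only, not the code of any real crawler: the three-page site, the `crawl` function and the `fetch` callback are hypothetical stand-ins so that the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    archive = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# Hypothetical three-page site standing in for the network.
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": "<p>No links here.</p>",
}

archive = crawl("http://example.org/", SITE.get)
print(sorted(archive))
```

Everything the crawler needs is in the text of the page. That is the property the rest of this talk is about losing: a link that only comes into existence when a script runs would never reach `LinkExtractor`.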

Visiting the New York Times
But, as we should have predicted based on this history, the Web we all use today is a half-turn around the wheel from that Web. That Web's primary language was HTML, a document description language. The browser downloaded documents and rendered the fixed set of primitives they contained. Browsers still do that, of course. Once protocols and formats like HTTP and HTML become widely used, they can't be changed in incompatible ways. But mostly what they do is download and run programs in the current Web's primary language, Javascript. Javascript is a programming language, not a document description language. Your browser is only incidentally a document rendering engine; its primary function is as a virtual machine. I use a Firefox plugin called NoScript; this screen grab shows that at least 11 sites wanted to run programs in my browser as I visited the front page of the New York Times a couple of weeks ago.

Kris and I organized a workshop last year to look at the problems the evolution of the Web poses for attempts to collect and preserve it. Here is the list of problem areas the workshop identified:
  • Database driven features & functions
  • Complex/variable URI formats & inconsistent/variable link implementations
  • Dynamically generated, ever changing, URIs
  • Rich Media
  • Scripted, incremental display & page loading mechanisms
  • Scripted, HTML forms
  • Multi-sourced, embedded material
  • Dynamic login/auth services: captchas, cross-site/social authentication, & user-sensitive embeds
  • Alternate display based on user agent or other parameters
  • Exclusions by convention
  • Exclusions by design
  • Server side scripts & remote procedure calls
  • HTML5 "web sockets"
  • Mobile publishing
Clearly, I don't have time to look at each of these in detail. For that, you should consult the document from the workshop (PDF). I will try to abstract from these individual problems a few big-picture problems.

Not an article from Graft
First, in order to preserve content from the Web you need to be able to access it. Increasingly, as news disappears behind slightly porous paywalls, and sites use social network identities to gate access, this is becoming more difficult. Equipping a crawler with a credit card to pay for access is not a practical approach. But the problem is compounded by another trend. Sites that forbid access, for example because the content is behind a paywall, and show you a login page no longer do so with an error code, such as a 403 Forbidden. They send a login page with a 200 OK code. An example of the result is this article from the defunct journal Graft as preserved at the Internet Archive.
An article from Graft
From the CLOCKSS archive (PDF) the article looks like this. A Web archive full of pages explaining that the actual content is inaccessible may accurately reflect the state of the Web at that time but it isn't a great way of preserving our digital heritage. Determining that some content that was obtained with a 200 OK code is actually not valid content to be preserved is hard. Because the LOCKSS system is intended to preserve subscription content this is an issue we deal with all the time. The custom per-site login page detectors we have to write are rather crufty.
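To give a flavor of why such detectors are crufty, here is a deliberately naive sketch in Python. It is not the LOCKSS implementation; the function name, the phrase list and the sample pages are all made up for illustration.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Note whether a page has a password field, and gather its text."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        # An attribute with no value arrives as None, hence the "or".
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.has_password_field = True

    def handle_data(self, data):
        self.text.append(data)


# Hypothetical phrases; a real detector would be tuned per site.
LOGIN_PHRASES = ("sign in", "log in", "subscribe to read", "purchase access")


def looks_like_login_page(status, html):
    """Guess whether a 200 OK response is really an access-denied page."""
    if status != 200:
        return False  # an honest error code needs no guessing
    scanner = FormScanner()
    scanner.feed(html)
    text = " ".join(scanner.text).lower()
    return scanner.has_password_field or any(p in text for p in LOGIN_PHRASES)


login_page = '<h1>Sign in</h1><form><input type="password" name="pw"></form>'
article_page = "<h1>An article</h1><p>Full text of the article.</p>"

print(looks_like_login_page(200, login_page))    # True
print(looks_like_login_page(200, article_page))  # False
```

A generic heuristic like this misfires as soon as a genuine article page carries a login box in its header, or a paywall page avoids the expected phrases, which is why the detectors end up being written site by site.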

When someone says HTML, you tend to think of a page like this, which looks like vanilla HTML but is actually an HTML5 geolocation demo. Most of what you actually get from Web servers now is programs, like the nearly 12KB of Javascript included by the three lines near the top. This is the shortest of the three files. A crawler collecting the page can't just scan the content to find the links to the next content it needs to request. It has to execute the program to find the links, which raises all sorts of issues.

The program may be malicious, so its execution needs to be carefully sandboxed. Even if it doesn't intend to be malicious, its execution will take an unpredictable amount of time, which can amount to a denial-of-service attack on the crawler. How many of you have encountered Web pages that froze your browser? Executing may not be slow enough to amount to an attack, but it will be a lot more expensive than simply scanning for links. Ingesting the content just got a lot more expensive in compute terms, which doesn't help the whole "economic sustainability is job #1" issue.

It is easy to say, execute the content. But the execution of the content depends on the inputs the program obtains, in this case the geolocation of the crawler and the weather there at the time of the crawl. In general these inputs depend on, for example, the set of cookies and the contents of HTML5's local storage in the "browser" the crawler is emulating, the state of all the external databases the program may call upon, and the user's inputs. So the crawler has not merely to run the program, it has to emulate a user mousing and clicking everywhere in the page in search of behaviors that trigger new links.

But we're not just finding links for its own sake, we want to preserve those links and the content they point to for future dissemination. If we preserve the programs for future re-execution we also have to preserve some of their inputs, such as the responses to database queries, and supply those responses at the appropriate times during the re-execution of the program. Other inputs, such as mouse movements and clicks, have to be left to the future reader to supply. This is very tricky, including as it does issues such as faking secure connections.

Re-executing the program in the future is a very fragile endeavour. This isn't because the Javascript virtual machine will have become obsolete. It is well-supported by an open source stack. It is because it is very difficult to be sure which are the significant inputs you need to capture, preserve, and re-supply. A trivial example is a Javascript program that displays the date. Is the correct preserved behavior to display the date when it was ingested, to preserve the original user experience? Or is it to display the date when it will be re-executed, to preserve the original functionality? There's no right answer.
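The date example, and the record-and-resupply problem behind it, can be sketched in a few lines of Python. This is a toy under stated assumptions: `page_program` stands in for a page's script, and `Recorder`, `Replayer`, the dates and the headline are all hypothetical names and data invented for the illustration.

```python
from datetime import date


def page_program(today, query):
    """Stand-in for a page's script: its output depends on two inputs."""
    return f"{today().isoformat()}: {query('headline')}"


class Recorder:
    """Wrap an input source and log every (arguments, response) pair."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def __call__(self, *args):
        result = self.source(*args)
        self.log.append((args, result))
        return result


class Replayer:
    """Supply logged responses back in the order they were recorded."""

    def __init__(self, log):
        self.pending = list(log)

    def __call__(self, *args):
        if not self.pending or self.pending[0][0] != args:
            raise LookupError(f"no recorded response for {args!r}")
        return self.pending.pop(0)[1]


# Ingest: hypothetical crawl date and database contents.
clock = Recorder(lambda: date(2013, 4, 4))
database = Recorder({"headline": "Example headline"}.get)
at_ingest = page_program(clock, database)

# Choice 1: preserve the original experience (recorded clock and data).
as_experienced = page_program(Replayer(clock.log), Replayer(database.log))

# Choice 2: preserve the functionality (reader's clock, recorded data).
as_functional = page_program(lambda: date(2033, 4, 4), Replayer(database.log))

print(at_ingest)
print(as_experienced)
print(as_functional)
```

The two replays differ only in which clock is supplied, and nothing in the program says which one is right. The `LookupError` is the fragility in miniature: the moment the re-executed program asks for an input that wasn't captured at ingest, there is nothing to give it.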

Among the projects exploring the problems of preserving executable objects are:
  • Olive at C-MU, which is preserving virtual machines containing the executable object but not, I believe, their inputs.
  • The EU-funded Workflow 4Ever project, which is trying to encapsulate scientific workflows and adequate metadata for their later re-use into Research Objects. The metadata includes sample datasets and the corresponding results, so that correct preservation can be demonstrated. Generating the metadata for re-execution of a significant workflow is a major effort (PDF).
Even for workflows, which are a simpler case than generic Javascript, Workflow 4Ever has to impose some restrictions in order to make it work. You can think of this as analogous to PDF/A; turning off all the hard-to-preserve aspects. For the broader world of Web preservation, the HTML/A approach isn't likely to be robust enough, even if we could persuade Web sites to publish two different versions, one for use and one for preservation.

An alternative that preserves the user experience but not the functionality is, in effect, to push the system one more half-turn around the wheel, reducing the content to fixed-function primitives. Not to try to re-execute the program but to try to re-render the result of execution. The YouTube video of Virtua Fighter is a simple example of this kind. It may be the best that can be done to cope with the special complexities of video games.

In the re-render approach, as the crawler executed the program it would record the display, and build a map of its sensitive areas with the results of activating each of them. You can think of this as a form of pre-emptive format migration, an approach that both Jeff Rothenberg and I have argued against for a long time. As with games, it may be that this, while flawed, is the best we can do with the programs on the Web we have.

A Prairie Home Companion
The Who Sell Out
What are all these programs doing in your browser? Mostly, what they do is capture information about you so it can be sold. I won't shed many tears if we fail to preserve this aspect of the Web! But some of the captured and sold information drives what you see in the page, such as advertisements. I've never understood why archivists think preserving spoof ads is important, whether they are selling fake products (A Prairie Home Companion) or real (The Who Sell Out) ones, but preserving real ads, such as those that dominate our political discourse, is not important.

The programs that run in your browser these days also ensure that every visit to a web page is a unique experience, full of up-to-the-second personalized content. The challenge of preserving the Web is like that of preserving theatre or dance. Every performance is a unique and unrepeatable interaction between the performers, in this case a vast collection of dynamically changing databases, and the audience. Actually, it is even worse. Preserving the Web is like preserving a dance performed billions of times, each time for an audience of one, who is also the director of their individual performance.

We need to ask what we're trying to achieve by preserving Web content. We haven't managed to preserve everything so far, and we can expect to preserve even less going forward. There is a range of different goals:
  • At one extreme, the Internet Archive's Wayback Machine tries to preserve samples of the whole Web. Although I was skeptical when Brewster explained what he wanted to do, I was wrong. A series of samples of the Web through time turns out to be an incredibly valuable resource. In the early days the unreliability of HTTP and the simplicity of the content meant that the sample was fairly random; the difficulties caused by the evolution of the Web mean a gradually increasing systematic bias.
  • The LOCKSS Program, at the other extreme, samples in another way. It tries to preserve everything about a carefully selected set of Web pages, mostly academic journals. The sample is created by librarians selecting journals; it is a sample because we don't have the resources to preserve every academic journal even if there were an agreed definition of that term. Again, the evolution of the Web is making this job gradually more and more difficult.
For both preserving a sample of the whole Web and preserving selected parts of it, the most important thing going forward will be to deploy a variety of approaches. Each approach, each type of crawler, will have a systematic bias. Because, at least in principle, Memento now allows a unified view across all Web archives, combining different approaches should "just work" to provide much more complete preservation of the Web than any individual archive, even the Internet Archive, can achieve on its own.


euanc said...

Hi David,

You seem to be suggesting that some sort of emulation-based solution (in conjunction with migration-based solutions) will be necessary for preserving some types of websites.

I'm not sure if you have heard about the work the bwFLa team have been doing on Emulation as a Service (EaaS) but you might be interested.
There are details here, here, and here.

Unknown said...

We do a great deal of work with, ranging from the simple (flat hypertext) to the complex (data streams interpreted by server-side processes) to the utterly insane (MP3s called by QuickTime videos as played in ShockWave objects as invoked in JavaScript windows).

Since we deal with a very narrow focus, I can afford to spend as much time (and work as many heroics) as necessary to document and archive these WWW-based objects. Speaking coarsely, what I've found is:

-Some sites are lost causes for client-side archiving. If you only care about generally acquiring content, this can sometimes be sidestepped (e.g. PHP-driven background carousels), but maintaining functionality is impossible in these cases. It's a tradeoff you have to decide on - get something quickly (maybe) or get everything slowly (usually contacting site owners, which isn't always realistic)

-Sites that can be fully archived client-side require a considerable investment of time and expertise. Artisanally archiving a site is not unlike picking a lock: it requires mentally constructing the interior from your few tools, along with a fair amount of intuition and luck. Doing it at scale, even with industry-standard tools, is out of the question. I documented an example of a non-machine-readable site here

The Internet Archive's standard of good-enough is about the best that can be done at scale. Combined with documentation (so that what we recall about the Web in the 00s, but wasn't captured, isn't lost), this seems to be the only reasonable way to preserve the Web at scale.

David. said...

Euan, that isn't at all what I am suggesting. I can't even see what it is that you think would need to be emulated. Javascript? There is no need to emulate Javascript. Whether the current Javascript implementations survive into the future, or whether for some bizarre reason we have to run them on an emulator, has no effect whatsoever on the problems Kris and I discussed in our talks.

<rant>This obsession with Rothenberg's emulation vs. migration dichotomy really has to stop. It is making rational discussion of the problems of digital preservation impossible. Emulation and migration are proposed solutions to the problem of format obsolescence. As far as the Web is concerned, there is no problem of format obsolescence.

Has anyone out there heard of this process called "science"? The way it works is that people come up with theories about how the world behaves. Based on these theories they make predictions about the future. Then we wait to see whether these predictions come true.

If one of the competing theories makes predictions that come true in the real world, and one makes predictions that don't come true, then the one with the bad predictions is discredited and we stop using it, because using it has led us into error.

Science is really a neat idea, and we have a good example of science in the case of format obsolescence. Starting around 1995, some people had a theory of format obsolescence that, based on history from the pre-Web era, format obsolescence was imminent. Starting around 1998, I had a theory, based on the history of network protocols and open source software, that format obsolescence in the Web environment would be an extremely slow process, if it happened at all. Both sides made public predictions, in my case on this blog repeatedly since 2007 and in other forums since at least 2005.

Last fall, an experiment was published by Matt Holden of INA that showed that my prediction was correct and the prediction of imminent format obsolescence, at least for Web formats, was not correct. Matt showed that Web formats, even for high-risk audiovisual content, had not gone obsolete in more than 15 years.

In the light of this experiment, we can be confident in saying that format obsolescence is not a significant risk in preserving the Web. Arguing about migration vs. emulation in the context of Web preservation is a distraction from addressing the significant problems. In the context of the Web, the theory of imminent format obsolescence is not useful because it is wrong. Proponents of this theory need to focus on fixing their theory so that it predicts no format obsolescence on the Web in more than 15 years.

Kris' and my talks were intended to draw attention to the significant problems with current Web preservation. These are that the direction in which the Web is evolving means we are less and less able to collect Web content for preservation, and less and less able to accurately reproduce it for future access. Neither of these problems is addressed by either migration (note that my reference to migration was an analogy - I don't really think that the YouTube video of Virtua Fighter is a migration of the game) or emulation.</rant>

David. said...

Alexander - I agree with most of your comment except the last paragraph. Clearly, at scale it isn't possible to do the detailed per-site work that you do, or that gets built in to the LOCKSS system's plugins for particular academic publishing platforms.

I don't know if this is what you intended, but your last paragraph implies that you think that going forward the Internet Archive and other at-scale archives preserving samples of the Web using similar technology will be able to continue doing about as good a job as they have in the past. Kris' and my talks were intended to convey the message that this isn't the case. The evolution of the Web means that the current technology is gradually losing effectiveness. Although efforts, such as the ones Kris described in her talk, are being made to develop new technologies the best it seems they can achieve is to slow the loss in effectiveness, not to reverse it.

euanc said...

Hi David,

I tried to reply again here but I ran over the 4000 character limit. I've posted another reply on my blog here.

David. said...

Euan's blog post in response to my rant makes it clear that I need to be more explicit.

Note that here we are discussing only Web content, not general digital content. There are two types of problem that the evolution of the Web poses for collecting and preserving Web content: collecting it and disseminating it.

The document from last year's workshop and Kris's talk were all about the problems the evolution of the Web poses for collecting the content. Unless the content can be collected it cannot in the future be disseminated. Emulation is simply irrelevant to the problems of collecting Web content. Migration is irrelevant to these problems too, in that in order to migrate content you need to be able to collect it first.

In my talk I also discussed some problems of future dissemination, mostly the fact that we have to decide whether we are preserving the functionality or the user experience of the content. This is a far deeper question than whether to use emulation or migration.

What triggered my rant was that it appeared from his first comment that Euan's thinking about digital preservation was so captured by the emulation vs. migration dichotomy that he couldn't see that, although I was discussing digital preservation, the dichotomy was irrelevant to my talk even if format obsolescence was imminent.

Yet we know that format obsolescence is not imminent. So the emulation issue was doubly irrelevant, since the problem it addresses isn't going to be significant in the foreseeable future, whereas the problems that Kris & I described are significant problems right now. The variety of approaches I commended in my conclusion refers not to doing both emulation and migration, as neither is necessary, but to a variety of approaches to collecting the content in order to mitigate the systematic sampling bias caused by almost everyone using the same crawler.

So here I am arguing about emulation vs. migration instead of about the significant problems facing the preservation of Web content right now. As my rant said "This obsession with Rothenberg's emulation vs. migration dichotomy really has to stop. It is making rational discussion of the problems of digital preservation impossible". At least in the case of Web content, I rest my case.

And, yes, I understand that much of Euan's post relates to Rothenberg Still Wrong not to this post. I'll address that part another time.

David. said...

CNI has posted the video of Kris' and my talks on YouTube and Vimeo.

And I'm sorry there's been a delay in responding to Euan's comments. It's obviously going to be a long post and I have been hard up against too many deadlines to give it the attention it deserves.