Monday, August 22, 2011

Moonalice plays Palo Alto

On July 23rd the band Moonalice played in Rinconada Park as part of Palo Alto's Twilight Concert series. The event was live-streamed over the Internet. Why is this interesting? Because Moonalice streams and archives their gigs in HTML5. Moonalice is Roger McNamee's band. Roger is a long-time successful VC (he invested in Facebook and Yelp, for example) and, based on his experience using HTML5 for the band, believes that HTML5 will transform the Web. On June 28th he spoke at the Paley Center for Media and explained why. An overview appeared at Business Insider, but you really need to watch the full video of the talk. Watch it, then follow me below the fold. I'll explain some implications for digital preservation.

I've been warning for some time that one of the fundamental problems facing digital preservation is the evolution of content from static to dynamic. What, exactly, does it mean to preserve something that is different every time you look at it?

When the dynamic elements in Web pages were things like advertisements or news headlines it was possible to argue that preserving these elements was not important. Indeed, some preservation services argued that their mission was limited to preserving the intellectual content (PDF).

Even if you don't believe that preserving context matters (if so, see the digression below), the evolution of the Web has meant that more and more of the intellectual content is delivered as dynamic elements. It started with streaming video; no-one can understand the 2008 US election without YouTube. The improved user interface capabilities of AJAX mean that it is now used even by academic journal publishers such as the Royal Society of Chemistry.

The evolution of user interface protocols from static to programmable has a history. Bob Sproull and Elaine Thomas published the first Network Graphics Protocol in 1974. It was developed at Xerox PARC in the early days of the ARPAnet, and was designed to adapt to various types of display hardware, including both storage tubes and high-performance vector graphics terminals. Nearly a decade later, James Gosling & I based the Andrew Window Manager on a similar fixed-function protocol between applications and the process managing the display. The X Window System continues this approach to this day.

Even as we built the Andrew system, the limitations of a fixed-function protocol were obvious. At Sun Microsystems, James & I built the opposite, a user interface environment completely implemented in and programmable in PostScript called NeWS (Network Extensible Window System). Although we initially thought about the protocol between the application and the user interface environment as being PostScript, the way typical applications used it was different. Simpler applications were written entirely in PostScript, and ran wholly inside the NeWS server. An analog today would be an application written wholly in Javascript running inside the browser. More complex applications ran elsewhere on the network, and started by downloading into the NeWS server a PostScript program that talked an application-specific protocol back to the application. This was a precursor of AJAX.

The key impact of HTML5 is that, in effect, it changes the language of the Web from HTML to Javascript, from a static document description language to a programming language. It is true that the Javascript in HTML5 is delivered in an HTML container, just as the video and audio content of a movie is delivered in a container such as MP4. But increasingly people will use HTML5 the way NeWS programmers used PostScript, as a way of getting part or maybe even all of their application running inside the browser. The communication between the browser and the application's back-end running in the server will be in some application-specific, probably proprietary, and possibly even encrypted format.

Things aren't as bad as they could be; the service has to deliver an interpreter for the proprietary, encrypted format before it can deliver the content in that format. But obtaining and decrypting the content requires running this interpreter in the browser or, more to the point for preservation, in the Web crawler. Although a crawler could collect the interpreter, then run it and collect any content it accessed, this does not guarantee that re-running the interpreter later with the preserved content will work. For example, the interpreter could be programmed to stop working a few minutes after it was delivered to the browser (or crawler).
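A minimal sketch may make the replay problem concrete. Everything in it is hypothetical: the Interpreter class, the one-byte XOR "format" and the five-minute lifetime are stand-ins invented for illustration, not a description of any real service. The point is only that a crawler can hold both the interpreter and the content and still be unable to replay them later.

```python
from dataclasses import dataclass


@dataclass
class Interpreter:
    """Hypothetical stand-in for decoding code delivered by a service."""
    key: int
    issued_at: float
    lifetime_s: float = 300.0  # assumption: stops working after five minutes

    def decode(self, payload: bytes, now: float) -> str:
        # Refuse to run once the delivery window has passed.
        if now - self.issued_at > self.lifetime_s:
            raise RuntimeError("interpreter expired")
        return bytes(b ^ self.key for b in payload).decode("utf-8")


def encode(text: str, key: int) -> bytes:
    """Toy 'proprietary format': XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in text.encode("utf-8"))


def crawl(now: float):
    """Collect the interpreter and the encoded payload, as a crawler might."""
    interp = Interpreter(key=0x5A, issued_at=now)
    payload = encode("the intellectual content", interp.key)
    return interp, payload


t0 = 1_000_000.0
interp, payload = crawl(t0)

# At collection time the crawler can run the interpreter and read the content.
at_crawl = interp.decode(payload, now=t0 + 10)

# Replaying the preserved pair ten years later fails, although no bits were lost.
try:
    interp.decode(payload, now=t0 + 10 * 365 * 24 * 3600)
    replay_ok = True
except RuntimeError:
    replay_ok = False
```

In this toy case an archive could patch the clock or the check, but only because the check is visible in a few lines of source. Real delivered code need not be so obliging.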

Another way of expressing the same thought is that HTML5 allows content owners to implement a semi-effective form of DRM for the Web. This is a major reason Roger McNamee is so excited about it. Content owners have been attracted by the walled garden app approach because, although it restricts the potential market and may require paying taxes to the platform, it makes it harder to steal the content. So, for example, Kindle was not just a hardware device, but also an app for iPhone, iPad and Android. But now that we have browsers that support enough HTML5 capabilities, we also have Kindle in a browser, which looks and feels just like an app but needs no installation. Amazon loves the HTML5 approach because if you buy a book in Apple's walled app garden, Apple takes 30% of the gross. If you buy a book in your browser, Apple gets 0%. Moonalice loves HTML5 because you can watch their shows on your phone without needing a Moonalice app. These advantages make up for DRM that is somewhat less effective than in an app.

First with AJAX and now with HTML5, the Web is evolving in a way that makes preserving Web content much harder. Traditional Web crawlers, such as the Internet Archive's widely used Heritrix, will become less and less effective at collecting. A major effort to develop techniques capable of collecting and preserving AJAX and HTML5 is urgently needed. Working with students at Carnegie Mellon's Silicon Valley campus, the LOCKSS team has started to investigate techniques for collecting AJAX content.
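To see why a crawler that only parses the HTML it is handed loses ground, here is a rough sketch. The link extractor is deliberately simple and is not how Heritrix works; the two pages and the /api/toc endpoint are invented for the example. The same table of contents is delivered once as static markup and once as a script that builds it in the browser.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags present in the delivered markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Static delivery: the links are in the markup itself.
STATIC_PAGE = """
<html><body>
  <a href="/issue/1">Issue 1</a>
  <a href="/issue/2">Issue 2</a>
</body></html>
"""

# Dynamic delivery: the links exist only after the script has run.
DYNAMIC_PAGE = """
<html><body>
  <div id="toc"></div>
  <script>
    fetch("/api/toc").then(r => r.json()).then(items => {
      for (const it of items) {
        const a = document.createElement("a");
        a.href = "/issue/" + it.id;
        document.getElementById("toc").appendChild(a);
      }
    });
  </script>
</body></html>
"""

static_links = extract_links(STATIC_PAGE)
dynamic_links = extract_links(DYNAMIC_PAGE)
```

The static page yields both issue links. The dynamic page yields none, because finding them means executing the script and following whatever requests it makes.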

A digression about preserving context.

In 2007, a page at the JSTOR website, since removed, said:

“If, in keeping with our mission to function as a trusted archive, JSTOR is to serve as a substitute for the journal volumes on the shelves, it must offer an electronic version that is a faithful replication of the original. An image-based approach ensures the integrity of the materials in the archive, while also retaining the appearance and “look and feel” of the journal in its original preservation. This is central to our mission and a key basis upon which JSTOR was founded.”

This agrees with the LOCKSS philosophy, but it contrasts with the "intellectual content" approach espoused by Portico's Eileen Fenton. The importance of context can be illustrated by Andrei Codrescu's Exquisite Corpse. These are three screen-grabs of the home page at different times:

Would preserving the words alone without the typography, layout and images be useful?


Chris Rusbridge said...

Would preserving all of Shakespeare except the First Folio be useful?

I think the answer depends on your point of view. For some, the "work" or the "story" (the intellectual content) is what is important. For some, the object, the experience of the original, is essential.

Preserving the work, if possible, can be (I think) much cheaper than preserving the original, the experience, the full context. You have often argued that if we take the expensive preservation option we preserve less.

So, I hope your new venture can indeed find a cheap way to preserve the experience with full context. If not, I would settle for more content and a bit less context, accepting that some content may be irreparably lost in the process!

David. said...

On the Web, it has typically been cheaper to preserve the context than to construct a complex ingest pipeline to extract the "intellectual content" and preserve it alone. Compare web archiving approaches such as the Internet Archive and LOCKSS with Portico's complex and expensive ingest processes.

I believe this will remain true through the transition to HTML5, although it will involve developing new techniques for Web preservation. Extracting and preserving the "intellectual content" in an HTML5 world will also require developing new techniques, which are likely to remain more complex, difficult and expensive.

Unknown said...

I'm happy to see more effort being put into capturing the dynamic web, but I don't think HTML5 is all bad. The new video tag makes archiving video content much easier - instead of the kind of complex incantations required to archive Flash video, we can get the payload directly over HTTP.
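As a rough illustration of the video tag point, a crawler can read the media URLs straight out of the markup and fetch them over HTTP. This is a sketch using only Python's standard library; the example page and the example.org URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VideoSourceFinder(HTMLParser):
    """Collect absolute URLs from <video src> and nested <source src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_video = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self.in_video = True
            if attrs.get("src"):
                self.sources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "source" and self.in_video and attrs.get("src"):
            self.sources.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


def find_video_sources(html, base_url):
    finder = VideoSourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.sources


# Hypothetical page offering one recording in two encodings.
PAGE = """
<video controls>
  <source src="/media/show.webm" type="video/webm">
  <source src="show.mp4" type="video/mp4">
</video>
"""

video_urls = find_video_sources(PAGE, "http://example.org/gigs/2011-07-23/")
```

Each URL found this way can be fetched with an ordinary HTTP GET. That only holds when the page names its media files in the markup rather than attaching them from script.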

Also, I think the case of hash-bang URIs is very informative, and illustrates how the need to ensure websites remain discoverable (indexable by Google) will push against the tendency to obscure the data (if at some cost in terms of increased crawler complexity).
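For the hash-bang case, the extra crawler complexity can be sketched in a few lines. This assumes the convention Google published for crawling AJAX sites, under which a crawler rewrites the "#!" state into an "_escaped_fragment_" query parameter and asks the server for a static snapshot. The comment above does not name that scheme, so treat the mapping as an assumption. The percent-escaping here is stricter than the scheme strictly requires, and the URLs are invented.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Rewrite a hash-bang URL into the form a crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URI, nothing to rewrite

    # Escape the application state so that it survives as one query value.
    state = quote(parts.fragment[1:], safe="=/")
    param = "_escaped_fragment_=" + state
    query = parts.query + "&" + param if parts.query else param

    # Drop the fragment: the state now travels to the server in the query.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


simple = escaped_fragment_url("http://example.org/#!/gigs/2011-07-23")
with_query = escaped_fragment_url(
    "http://example.org/band?lang=en#!page=2&sort=date"
)
untouched = escaped_fragment_url("http://example.org/about#team")
```

The rewrite itself is trivial. The cost lies on the other side, where the site has to be willing and able to serve a snapshot for every state it exposes this way.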

David. said...

Two well-taken points. Thanks for the links.

David. said...

Just one example of what becomes possible when the Web becomes a programming environment is that people can layer programming environments on top that make particular kinds of applications much easier to build. Opa is a recent case in point. It allows one to write a single program that gets compiled into both the server and client sides, with the language run-time handling details such as session management. The client side is compiled into Javascript. Opa may not change the world; there are some issues with Opa's open source license. But it is an excellent example of the effects of the transition to programmability.

David. said...

The Economist writes about the transition to HTML 5.