DSHR's Blog: Dangerous Complacency

The topic of web archiving has been absent from this blog for a while, but recently Sawood Alam alerted me to Cliff Lynch's post from January entitled The Dangerous Complacency of “Web Archiving” Rhetoric. Lynch's thesis is that using the term "web archiving" obscures the fact that we can only collect and preserve a fraction of "the Web". The topic is one I've written about many times, at least since my Spring 2009 CNI Plenary, so below the fold I return to it.

Lynch writes:

The World Wide Web turned 30 years old in 2021. During the past three decades, it has vastly evolved and changed in character. A huge number of information resources and services are accessible through the web, but not genuinely part of it; they share few of the characteristics typical of web sites in the 1990s. Indeed, “the web” has become a sloppy shorthand for a hugely diverse universe of digital content and services that happen to be accessible through a web browser (though more and more of these services are used through custom apps, particularly on mobile devices). We no longer genuinely understand the universe this shorthand signifies, much less what “archiving” it means or what purposes this can or cannot serve.

He acknowledges the importance and successes of web archiving institutions such as the Internet Archive then continues:

In fact, one of my concerns is that the success of these organizations in what we might think of as “traditional” web archiving has given rise to a good deal of complacency among much of the cultural memory sector. And much worse: the broader public has a sense that everything is being taken care of by these organizations; no crisis here! This is particularly troublesome because we will need the support of the broad public in changing norms and perhaps legal frameworks to permit more effective collecting from the full spectrum of participants in this new digital universe.

And later:

Think about this: if you are trying to collect or preserve Facebook, or even a news site, what are you trying to do? You could try to collect content that a specific individual or group of people posted or authored. You could try (though this is probably hopeless, given the intellectual property, terms of use, and privacy issues) to capture all the (public) content posted to the site. But if the point is to understand how the service actually appeared to the public at a given time, what you want to know is what material is shown to visitors most frequently. And, with personalization, to genuinely and deeply understand the impact of such a service on society, what you really need to capture is how the known or imputed attributes of visitors determined what material they were shown at a given point in time.

I don't have a problem with what Lynch wrote, rather with what he left out. The problem Lynch describes isn't new. The Spring 2009 Plenary was before I started posting the entire text of my talks, so my exact words were not preserved, but I did post the slides. Here are the two relevant to this topic:

"Preservation" implies static, isolated object
Web 0.9 is like reading a printed book
Web 1.0 dynamically inserts personalized adverts
No-one preserves the adverts, but they're important
With the Night Mail Rudyard Kipling (1905)
The Who Sell Out The Who (1967)
A Prairie Home Companion Garrison Keillor (1974-)
Web 2.0 is dynamic, interconnected
Each page view is unique, mash-ed up from services
Pages change as you watch them
What does it mean to preserve a unique, dynamic page?

And:

User Generated Content
To understand 2008 election you need to save blogs
To do that you need to save YouTube, photo sites, ...
So that the links to them keep working ...
Technical, legal, scale obstacles almost insuperable
Multi-player games & virtual worlds
Even if you could get the data and invest in the servers
They're dead without the community - Myst (1993)
Dynamic databases & links to them
e.g. Google Earth mash-ups - is Google Earth forever?

BBC News

A year later I had started posting the text, so I have my 2010 JCDL keynote. It was entitled Stepping Twice Into The Same River, and in it I wrote:

Pre-Web documents and early Web content were static, and thus relatively easy to preserve. This is what the early Web looked like - the BBC News front page of 1st December 1998 from the Internet Archive. The evolution of the Web was towards dynamic content. This started with advertisements. No-one preserves the ads. The reason is that they are dynamic, every visitor to the page sees a different set of ads. What would it mean to preserve the ads?

Now the Web has become a world of services, typically standing on other services, and so on ad infinitum, ... Everyone talks about data on the Web. But what is out there is not just data, it is data wrapped in services of varying complexity, from simple, deliver me this comma-separated list, to complex, such as Google Earth.

Even in those early days the BBC News website claimed to be "updated every minute of every day", so the title of my talk was apposite. Twelve years ago it was understood that there was no way to "archive" http://news.bbc.co.uk because even though it didn't contain ads, its content was continually changing.

Things got rapidly worse. In 2011 I wrote Moonalice plays Palo Alto:

On July 23rd the band Moonalice played in Rinconada Park as part of Palo Alto's Twilight Concert series. The event was live-streamed over the Internet. Why is this interesting? Because Moonalice streams and archives their gigs in HTML5. Moonalice is Roger McNamee's band. Roger is a long-time successful VC (he invested in Facebook and Yelp, for example) and, based on his experience using HTML5 for the band, believes that HTML5 will transform the Web.
...
The key impact of HTML5 is that, in effect, it changes the language of the Web from HTML to Javascript, from a static document description language to a programming language. It is true that the Javascript in HTML5 is delivered in an HTML container, just as the video and audio content of a movie is delivered in a container such as MP4. But increasingly people will use HTML5 the way NeWS programmers used PostScript, as a way of getting part or maybe even all of their application running inside the browser. The communication between the browser and the application's back-end running in the server will be in some application-specific, probably proprietary, and possibly even encrypted format.

Anyone active in Web archiving understood that there was a crisis. The next year:

Kris Carpenter Negulescu of the Internet Archive and I organized a half-day workshop on the problems of harvesting and preserving the future Web during the International Internet Preservation Consortium General Assembly 2012 at the Library of Congress.
...
In preparation for the workshop Kris & I, with help from staff at the Internet Archive, put together a list of 13 problem areas already causing problems for Web preservation.
...
But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach.

This was a decade ago. Already, efforts were underway to address some of the purely technical issues using headless browsers to mimic a simulated user's experience of visiting a Web site. In this way a sample user experience could be collected, although how representative it was would be unknown. This uncertainty is inherent to the Web; it is a best-efforts network, so no conclusions can be drawn from missing resources in an archive. The legal and philosophical problems Lynch cites were already well understood, but assessed as hopeless absent massive, unrealistic increases in funding.

So what has changed in the ensuing decade to spark Lynch's post? First, as Lynch stresses, the technical problems have continued to increase:

It’s interesting to note how many are now only secondarily accessible via web browser, with preference given to apps, including apps that live on “smart TVs,” “smart cars,” “smart phones,” and the like. It is also worth noting that there is a new generation of content services emerging that are increasingly less accessible via web browser, even on a secondary basis, and are even more sequestered gardens. These will present new challenges for collection and for preservation. And we don’t have authoritative data collection services drawing timely maps of this volatile landscape.

Second, the importance of preserving and studying the inaccessible parts of the web has, as Lynch writes, increased:

These challenges are at the core of current debates about the impact of social media on society, the effects of disinformation and misinformation and the failure of social media platforms to manage such attacks, and related questions. It is clear, though, that both laws and terms of service are aligned to prevent a great deal of effective and vital work here, both in preservation and in researching the impact of these services; many social media platforms seem actively hostile to the accountability that preserving their behavior might bring.

Lynch is right about this. For example, The Economist writes in The invasion of Ukraine is not the first social media war, but it is the most viral:

The preponderance of Ukraine-friendly messages on Western users’ news-feeds hardly means the information war is over. “The narrative that Ukraine has won the information war is complacent and not necessarily backed up by anything empirical,” Mr Miller argues. Independent researchers say that social networks’ unwillingness to share data makes it difficult to assess how information is spreading online. That is especially true of organic content, and of newer platforms, in particular TikTok. “There is no systemic, reliable way to look across these platforms and see what the information ecosystems look like,” laments Brandon Silverman, co-founder of CrowdTangle, a social analytics tool.

Finally we come to the elephant in the room, the continual increase in the mismatch between the cost of preserving even the parts of the Web that can be preserved, and the funding available to do the job. I have been writing about the economic constraints on Web archiving since the start of the LOCKSS Program, which was designed to provide libraries with insurance for their expensive subscriptions to academic journals:

Libraries have to trade off the cost of preserving access to old material against the cost of acquiring new material. They tend to favor acquiring new material. To be effective, subscription insurance must cost much less than the subscription itself.

As regards the costs of general Web archiving technology, examples include 2014's Talk "Costs: Why Do We Care?":

Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?

We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be "can't afford to preserve".

We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.

We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.

Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded.

And my critique of Kalev Leetaru's demand for additional metadata, How Much Of The Internet Does The Wayback Machine Really Archive?, in 2015's You get what you get and you don't get upset:

the fact is that metadata costs money. It costs money to generate, it costs money to document, it costs money to store. Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task, as I pointed out in The Half-Empty Archive. So better metadata means less data. It is all very well for researchers to lay down the law about the kind of metadata that is "absolutely imperative", "a necessity" or "more and more imperative" but unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don't get upset.

In 2017 I wrote a 5-part series around the mismatch, The Amnesiac Civilization, and followed up with Preservation Is Not A Technical Problem pointing out that even national libraries couldn't afford funding for Web archiving:

The budgets of libraries and archives, the institutions tasked with acting as society's memory, have been under sustained attack for a long time. I'm working on a talk and I needed an example. So I drew this graph of the British Library's annual income in real terms (year 2000 pounds). It shows that the Library's income has declined by almost 45% in the last decade.

Memory institutions that can purchase only half what they could 10 years ago aren't likely to greatly increase funding for acquiring new stuff; it's going to be hard for them just to keep the stuff (and the staff) they already have.

And the amount spent on Web archiving is minuscule (2017 figures):

Adding hardware and staff, the Internet Archive spends each year on Web archiving $3,400,000, or 21% of the total. This is about 2.4 times as much as all the ARL libraries combined.

Compared to anything else spent on the Web, or on library collections, the amounts spent on Web archiving are derisory. On their 2019 budget of nearly $37M, the Internet Archive maintains what is currently the 165th most-visited site on the Web. If the earlier proportion still stood, in 2019 they would have spent about $7.8M on Web archiving, less than half what Facebook spent lobbying Congress in 2019. Sustainably doubling the Internet Archive's Web archiving budget would need an endowment of, assuming a real return of 5%, over $150M. In terms of the wealth the Web has generated this is pocket change; it is less than 0.1% of Jeff Bezos' net worth alone. But realistically it is completely unachievable.
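
For readers who want to check the arithmetic behind these estimates, here is a minimal back-of-the-envelope sketch in Python. The inputs are the figures quoted above (a budget of roughly $37M, the 2017 proportion of 21%, a 5% real return). The variable names are purely illustrative, and the budget is rounded, so the results are approximate.

```python
# Back-of-the-envelope check of the estimates in the text.
# All inputs are taken from the text; the variable names are illustrative.

ia_budget_2019 = 37_000_000    # "nearly $37M" total budget (rounded)
web_archiving_share = 0.21     # 2017 proportion spent on Web archiving
real_return = 0.05             # assumed real return on an endowment

# Estimated 2019 Web archiving spend, if the 2017 proportion still held.
web_archiving_spend = ia_budget_2019 * web_archiving_share

# Sustainably doubling the budget means funding an extra amount equal to
# the current spend out of endowment income: endowment = spend / return.
endowment_needed = web_archiving_spend / real_return

print(f"Estimated Web archiving spend: ${web_archiving_spend / 1e6:.1f}M")
print(f"Endowment needed to double it: ${endowment_needed / 1e6:.0f}M")
```

With these rounded inputs the spend comes out at about $7.8M and the endowment at a little over $150M, consistent with the estimates in the text.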

Lynch advocates for a twofold response to the problems he identifies:

All of this content, all of these services, need to be carefully examined by collection developers rising to Carol Mandel’s challenge to thoughtfully and selectively collect from the web, and also by digital archivists, scholars, preservationists, and documentalists seeking to capture context, presentation, disparate impacts, and the many other telling aspects of the current digital universe.

And:

Part of the challenge will be to shift law and public policy to enable this work.

Even as Lynch was writing in January, the prospects were bleak for additional funding on the scale needed to:

vastly increase the expensive human resources devoted to collecting and preserving "the Web",
not merely defend against the intellectual property oligopoly's continual efforts to prevent collection and preservation, but lobby to reverse them,
and to defend against the publishers' attempt to destroy the Internet Archive by exploiting its well-meaning efforts to mitigate the effects of the pandemic on libraries.

Now, as we head into the inevitable recession and inflation caused by the current "special military operation", the prospects for funding are clearly much worse.

There is clearly value in identifying the parts of "the Web" that aren't being collected, preserved, and disseminated to scholars. But in an era when real resources are limited, and likely shrinking, proposals to address these deficiencies need to be realistic about what can be achieved with the available resources. They should be specific about what current tasks should be eliminated to free up resources for these additional efforts, or the sources of (sustainable, not one-off) additional funding for them.

Thursday, March 31, 2022
