Monday, March 13, 2017

The Amnesiac Civilization: Part 3

In Part 2 of this series I criticized Kalev Leetaru's "Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web" for failing to take into account the cost of doing a better job. Below the fold I ask whether, even with unlimited funds, it would actually be possible to satisfy Leetaru's reasonable-sounding requirements, and whether those requirements would actually solve the problems of Web archiving.

Leetaru is as concerned as I am that society retain an accurate record of citizens' information environment. He accurately identifies what, in an ideal world, should be archived:
Today the web is all about streaming video and audio. ... Multimedia is difficult to archive not only because of its size (its quite easy to accumulate a few petabytes of HD video without much difficulty), but also because most streaming video sites don’t make it easy to download the original source files. ... In our device-centric world in which we watch videos from large-format televisions, ultra resolution desktops, low resolution phones, etc it is also important to recognize that streaming sites typically offer multiple versions of a video in different resolutions and compression levels that can result in dramatically different viewing experiences. ... Some platforms also go to extended lengths to try and prevent unauthorized downloading of their content via special encodings, encryption and other protections.
So he wants multiple versions of petabytes of video. While from a technical perspective this might be "quite easy", from a funding perspective it isn't. The Internet Archive currently stores around 30PB and adds about 15TB/day, of which I believe the Web archive is about half. Using Amazon S3 pricing, adding 5PB of video would add about 10% to the Archive's budget in storage charges the first year alone, so it would be a big-ish deal. Not to mention the legal problems of dealing with "special encodings, encryption and other protections".

Leetaru also quite reasonably wants comprehensive collections of even the private parts of social media sites:
nearly all major social platforms are moving towards extensive privacy settings and default settings that encourage posts to be shared only with friends. ... This means that even if companies like Facebook decided to make available a commercial data stream of all public content across the entire platform, the stream would capture only a minuscule fraction of the daily life of the platform’s 2 billion users.
which he admits is hopeless:
From a web archival standpoint, the major social media platforms are largely inaccessible for archiving. ... Facebook ... continually adapts its technical countermeasures and has utilized legal threats in the past to discourage bulk downloading and distribution of user data. Shifting social norms around privacy mean that regardless of technological or legal countermeasures, users are increasingly walling off their data and making it unavailable for the public access needed to archive it. In short, as social media platforms wall off the Internet, their new private parallel Internets cannot be preserved, even as society is increasingly relying on those new walled gardens to carry out daily life.
He and I agree that the future is looking dim for the desktop PC, so he wants to archive all the many mobile versions of every page:
Over the last few years Internet users have increasingly turned to mobile devices from cellphones to tablets to access the Internet. From early mobile-optimized sites to today’s mobile-first world, the Internet of today is gradually leaving its desktop roots behind. Google has been a powerful force behind this transition, penalizing sites that do not offer mobile versions.
Adding mobile web support to web archives is fairly trivial, but it is remarkable how few archives have implemented complete robust mobile support. Even those that offer basic mobile crawling support rarely crawl all versions of a page to test for how differences in device and screen capabilities affect the returned content and the level of dynamic customization in use.
I think Leetaru is wrong to claim that mobile support is "fairly trivial", but even "fairly trivial" enhancements incur development, testing and maintenance costs. Not to mention the costs of finding, crawling and storing the many different mobile versions of a site.

Leetaru is expecting Web archives to do many times more crawling and storing than they currently do, with no additional resources. So not going to happen.

But even if it did, this doesn't even begin to address the real problem facing Web archives. Leetaru writes:
An increasing number of servers scan the user agent field and deny access to the mobile edition of a page unless the client is an actual mobile device, meaning an ordinary crawler requesting a mobile page, but using its standard desktop user agent tag will simply be redirected to the desktop version of the page. Some sites go even further, returning versions of the site tailored for tablets versus smartphones and even targeting specific devices for truly customized user experiences, requiring multiple device emulation to fully preserve a page in all its forms.
But he doesn't address the major sources of variability among the versions of web page content, which are personalization and geolocation. It used to be the case that society's basic information environment was mass media, and it was safe to assume that all consumers of each of those mediums saw the same content. This hasn't been the case for years; every visitor to a site with a significant audience sees different content. This started with the advertisements. Every visit to every page gets a different selection of ads, based on a real-time auction. Web archives responded by no longer collecting the ads.

A much more advanced form of targeting content has recently become controversial in politics:
In an article for Campaign magazine last February, he described how [Cambridge Analytica] had “helped supercharge Leave.EU’s social media campaign by ensuring the right messages are getting to the right voters online.”
There are doubts about Cambridge Analytica's claims, but it is clear that even outside social media sites, the capability to individually tailor the content, not just the ads, at a URI is increasingly likely to be used.

If Leetaru wants to archive every version of a site he needs a Web archive not merely to emulate every possible browser and device combination, but every possible user and location combination. After all, I definitely see a different version of many sites from my laptop when I'm at home from when I'm behind the Great Firewall of Cameron.

There are about 3.4*10^9 Internet users from about 200 countries, so there are about 6.8*10^11 possible versions of every Web page for each browser and device combination. Say there are 100 of these combinations, and the average Web page is about 2.3*10^6 bytes. So storing a single Web page could take up to about 1.6*10^20 bytes, or 160 exabytes.
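For anyone who wants to check the arithmetic, here is a minimal Python sketch of the storage estimate. The inputs are just the rough figures quoted above, and the variable names are mine, chosen for illustration only.

```python
# Back-of-envelope check of the worst-case storage estimate.
# All inputs are the rough figures from the text, not measurements.
internet_users = 3.4e9        # about 3.4*10^9 Internet users
countries = 200               # about 200 countries
device_combinations = 100     # assumed number of browser/device combinations
avg_page_bytes = 2.3e6        # assumed average Web page size, in bytes

# One possible version per (user, location) pair, per device combination.
versions_per_combination = internet_users * countries
total_versions = versions_per_combination * device_combinations
total_bytes = total_versions * avg_page_bytes

print(f"user/location versions: {versions_per_combination:.1e}")
print(f"worst-case bytes:       {total_bytes:.1e}")
print(f"worst-case exabytes:    {total_bytes / 1e18:.0f}")
```

The product comes to a little under 1.6*10^20 bytes, which I round to 160 exabytes. Changing any of the assumed inputs by a factor of two or three doesn't change the conclusion.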

But storage isn't actually the problem, since deduplication and compression would greatly reduce the storage needed. The problem is that in order to be sure the archive has found all the versions, it has to download them all before it can do the deduplication and compression.
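To illustrate why deduplication saves storage but not bandwidth, here is a small sketch of content-hash deduplication in Python. It is not how any particular archive works; the function name `archive_versions` and the sample page bodies are hypothetical. The point is only that a version has to be fully downloaded before it can be hashed and recognized as a duplicate.

```python
import hashlib


def archive_versions(fetched_bodies):
    """Deduplicate already-downloaded page versions by content hash.

    Returns (store, bytes_downloaded, bytes_stored).
    """
    store = {}               # SHA-256 hex digest -> one copy of the body
    bytes_downloaded = 0
    for body in fetched_bodies:
        # The body must be fully downloaded before it can be hashed,
        # so every version costs bandwidth even if it is a duplicate.
        bytes_downloaded += len(body)
        digest = hashlib.sha256(body).hexdigest()
        store.setdefault(digest, body)
    bytes_stored = sum(len(b) for b in store.values())
    return store, bytes_downloaded, bytes_stored


# Hypothetical example: four fetches of one URI, only two distinct versions.
versions = [b"<p>home</p>", b"<p>home</p>", b"<p>abroad</p>", b"<p>home</p>"]
store, downloaded, stored = archive_versions(versions)
print(len(store), "distinct versions;", downloaded, "bytes downloaded;",
      stored, "bytes stored")
```

The stored bytes shrink with the number of distinct versions, but the downloaded bytes grow with the number of fetches, however many of them turn out to be identical.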

I believe the Internet Archive's outbound bandwidth is around 2*10^9 bytes/s. Assuming the same inbound bandwidth to ingest all those versions of the page, it would take about 8*10^10 seconds, or about 2.5*10^3 years, to ingest a single page. And that assumes that the Web site being archived would be willing to devote 2GB/s of outbound bandwidth for two-and-a-half millennia to serving the archive rather than actual users.
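The same kind of quick check works for the ingest time. This sketch takes the worst-case size and my guess at the bandwidth as given; both are rough assumptions rather than measured values, and the variable names are illustrative.

```python
# Rough ingest-time estimate, using the figures from the text.
total_bytes = 1.6e20            # worst-case size of all versions of one page
bandwidth_bytes_per_s = 2e9     # guessed bandwidth, about 2GB/s
seconds_per_year = 365.25 * 24 * 3600

ingest_seconds = total_bytes / bandwidth_bytes_per_s
ingest_years = ingest_seconds / seconds_per_year

print(f"{ingest_seconds:.1e} seconds, about {ingest_years:.0f} years")
```

Even if the bandwidth guess is off by an order of magnitude in the archive's favor, a single page would still take centuries.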

The point here is to make it clear that, no matter how much resource is available, knowing that an archive has collected all, or even a representative sample, of the versions of a Web page is completely impractical. This isn't to say that trying to do a better job of collecting some versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave.


Chris Rusbridge said...

David, I agree the task as described here is hopeless, for the reasons you discuss and perhaps many others. I'm not sure though that this makes us an amnesiac civilisation. We and our ancestors have never collected everything, or even near it. Libraries effectively and collectively attempt to collect "published material", primarily books and more serious journals. Archives collect small parts of the un-published works of selected individuals and organisations. Masses of stuff gets left out of this collecting net, and always has done. Many genres are at best fitfully collected. Much has been written about the problem of acid-paper newspapers; microfilming was an attempt at solving the preservation parts of that, but the approach leaves out many extra and local editions. I'm not sure of the collecting status of "trashy" periodicals, but I doubt there's much redundancy or completeness. Very few of the millions of written but un-published books get collected and preserved. Video and audio content has such complex rights issues that collections are specialised and likely incomplete, and certainly don't incorporate every version. Some benefit from being sold in relatively stable and widely dispersed form (CDs, DVDs etc), but that's still nowhere near all. What proportion of home movies and videos are preserved from before the internet era? And so on. We tend to look at our successes and think, wow, look how well we have preserved all this paper stuff, and how badly this new ephemeral internet-based world compares. We forget how vulnerable large classes of paper-based and other material have been, and how much has been lost.

Yet civilisation has not fallen, and if it seems to be falling now it is not for those reasons.

Isn't the real question two-fold: what classes of material ARE of critical importance to the future of our civilisation? And how can we work together to preserve those classes of material?

David. said...

Chris, it was never previously the case that future scholars were deprived of entire categories of important content and that, for content that they did have access to, they could never be sure that what they saw was what contemporaneous readers saw. Which is where we will soon be with, for example, YouTube and online news sites. See the upcoming part 4 about Web DRM.