Tuesday, August 17, 2021

Zittrain On Internet Rot

I spent two decades working on the problem of preserving digital documents, especially those published on the Web, in the LOCKSS Program. So I'm in agreement with the overall argument of Jonathan Zittrain's The Internet Is Rotting, that digital information is evanescent and mutable, and that libraries are no longer fulfilling their mission to be society's memory institutions. He writes:
People tend to overlook the decay of the modern web, when in fact these numbers are extraordinary—they represent a comprehensive breakdown in the chain of custody for facts. Libraries exist, and they still have books in them, but they aren’t stewarding a huge percentage of the information that people are linking to, including within formal, legal documents. No one is. The flexibility of the web—the very feature that makes it work, that had it eclipse CompuServe and other centrally organized networks—diffuses responsibility for this core societal function.
And concludes:
Society can’t understand itself if it can’t be honest with itself, and it can’t be honest with itself if it can only live in the present moment. It’s long overdue to affirm and enact the policies and technologies that will let us see where we’ve been, including and especially where we’ve erred, so we might have a coherent sense of where we are and where we want to go.
In our first paper about LOCKSS, Vicky Reich and I wrote:
Librarians have a well-founded confidence in their ability to provide their readers with access to material published on paper, even if it is centuries old. Preservation is a by-product of the need to scatter copies around to provide access. Librarians have an equally well-founded skepticism about their ability to do the same for material published in electronic form. Preservation is totally at the whim of the publisher.

A subscription to a paper journal provides the library with an archival copy of the content. Subscribing to a Web journal rents access to the publisher’s copy. The publisher may promise "perpetual access", but there is no business model to support the promise. Recent events have demonstrated that major journals may vanish from the Web at a few months notice.
Although I agree with Zittrain's big picture, I have some problems with his details, which I explain below the fold.

History

Zittrain starts with an account of the origins of the Internet that is very convenient for the argument he proceeds to make, but is simply wrong. It is true that:
The internet’s distinct architecture arose from a distinct constraint and a distinct freedom: First, its academically minded designers didn’t have or expect to raise massive amounts of capital to build the network; and second, they didn’t want or expect to make money from their invention.
But neither of those facts had anything to do with the reasons why the basic protocols of the Internet are the way they are. Zittrain writes:
The internet’s framers thus had no money to simply roll out a uniform centralized network the way that, for example, FedEx metabolized a capital outlay of tens of millions of dollars to deploy liveried planes, trucks, people, and drop-off boxes, creating a single point-to-point delivery system. Instead, they settled on the equivalent of rules for how to bolt existing networks together.

Rather than a single centralized network modeled after the legacy telephone system, operated by a government or a few massive utilities, the internet was designed to allow any device anywhere to interoperate with any other device, allowing any provider able to bring whatever networking capacity it had to the growing party. And because the network’s creators did not mean to monetize, much less monopolize, any of it, the key was for desirable content to be provided naturally by the network’s users, some of whom would act as content producers or hosts, setting up watering holes for others to frequent.
The protocols were designed to facilitate interoperation among diverse devices because that is precisely what they needed to do. In those long-gone days computers were huge, expensive and very different from each other. There was no agreement on anything as basic as the number of bits in a byte or a word. The PDP-7 I used as an undergraduate had 18-bit words, the Titan I also used then had 48-bit words, the CDC-6000 series computers I did my Ph.D. on had 60-bit words, and the PDP-10 I did my post-doc on had 36-bit words. Attempting to enforce uniformity on the Cambrian explosion of computer architectures was obviously futile. Now, when 8-bit bytes and 32- or 64-bit words are universal, it is easy to forget that this wasn't always the case.

Zittrain also conveniently rewrites the history of HTTP and HTML:
So the internet was a recipe for mortar, with an invitation for anyone, and everyone, to bring their own bricks. Tim Berners-Lee took up the invite and invented the protocols for the World Wide Web, an application to run on the internet. If your computer spoke “web” by running a browser, then it could speak with servers that also spoke web, naturally enough known as websites. Pages on sites could contain links to all sorts of things that would, by definition, be but a click away, and might in practice be found at servers anywhere else in the world, hosted by people or organizations not only not affiliated with the linking webpage, but entirely unaware of its existence. And webpages themselves might be assembled from multiple sources before they displayed as a single unit, facilitating the rise of ad networks that could be called on by websites to insert surveillance beacons and ads on the fly, as pages were pulled together at the moment someone sought to view them.
The technical name for the idea that web pages can be "assembled from multiple sources" is transclusion, and it was introduced in programming languages by COBOL in 1960, and in digital documents by Ted Nelson in 1980 as part of his invention of hypertext. With all due respect to Sir Tim Berners-Lee, his 1989 development of the HyperText Transfer Protocol (HTTP) and the HyperText Markup Language (HTML) built on a long history of software. Transclusion had been a normal feature for around three decades, so the idea that there was anything special about transclusion in the Web is ridiculous.

And the idea that a Web in which Sir Tim Berners-Lee's startup, "World Wide Web, Inc.", billed for access would have succeeded is equally laughable:
And like the internet’s own designers, Berners-Lee gave away his protocols to the world for free—enabling a design that omitted any form of centralized management or control, since there was no usage to track by a World Wide Web, Inc., for the purposes of billing. The web, like the internet, is a collective hallucination, a set of independent efforts united by common technological protocols to appear as a seamless, magical whole.
The open, decentralized Web protocols were the reason everyone could join in and make it universal. Zittrain even elides the history of the "content moderation" problem:
This absence of central control, or even easy central monitoring, has long been celebrated as an instrument of grassroots democracy and freedom. It’s not trivial to censor a network as organic and decentralized as the internet. But more recently, these features have been understood to facilitate vectors for individual harassment and societal destabilization, with no easy gating points through which to remove or label malicious work not under the umbrellas of the major social-media platforms, or to quickly identify their sources. While both assessments have power to them, they each gloss over a key feature of the distributed web and internet: Their designs naturally create gaps of responsibility for maintaining valuable content that others rely on. Links work seamlessly until they don’t. And as tangible counterparts to online work fade, these gaps represent actual holes in humanity’s knowledge.
Zittrain is obviously correct about the problem. Mike Masnick recounts a wonderful recent exemplar in Social Network GETTR, Which Promised To Support 'Free Speech' Now Full Of Islamic State Jihadi Propaganda. But "more recently" is a stretch. The downsides of a social medium to which anyone can post without moderation are familiar to anyone who was online in the days of the Usenet:
Usenet is culturally and historically significant in the networked world, having given rise to, or popularized, many widely recognized concepts and terms such as "FAQ", "flame", sockpuppet, and "spam".
...
Likewise, many conflicts which later spread to the rest of the Internet, such as the ongoing difficulties over spamming, began on Usenet:
"Usenet is like a herd of performing elephants with diarrhea. Massive, difficult to redirect, awe-inspiring, entertaining, and a source of mind-boggling amounts of excrement when you least expect it."

— Gene Spafford, 1992

Links

Zittrain is rightly concerned with both causes for the evanescence of Web content:
  • link rot, i.e. a URL that at time T in the past resolved to content but now returns 404.
  • content drift, i.e. a URL that at time T in the past resolved to content C(T) but now resolves to different content C(now) (the sketch below distinguishes the two).
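To make the distinction concrete, here is a minimal sketch in Python; the `requests` library and a saved SHA-256 hash of the content as of time T are assumptions of this illustration, not anything Zittrain describes:

```python
import hashlib
import requests

def classify(url, sha256_at_time_t):
    """Classify a URL as 'rotted', 'drifted' or 'intact' relative to a
    SHA-256 hash of the content it served at some earlier time T."""
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException:
        return "rotted"          # unreachable: treat as link rot
    if resp.status_code == 404:
        return "rotted"          # link rot: the URL no longer resolves
    if hashlib.sha256(resp.content).hexdigest() != sha256_at_time_t:
        return "drifted"         # content drift: same URL, different bytes
    return "intact"
```

In practice a byte-for-byte comparison over-counts drift, because ads, timestamps and personalization change the bytes on every visit; that is the mutability problem discussed below.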
This concern leads him to replace nearly all the links in his article with URLs for the content of those links, as archived at Perma.cc at the time he was writing. Perma.cc is a service run by the Harvard Law School Library, similar to the Internet Archive's "Save Page Now":
When a user creates a Perma.cc link, Perma.cc archives the referenced content and generates a link to an archived record of the page. Regardless of what may happen to the original source, the archived record will always be available through the Perma.cc link.

Users go to the Perma.cc website and input a URL. Perma.cc downloads the material at that URL and gives back a new URL (a “Perma.cc link”) that can then be inserted in a paper, article, blog or whatever the author needs.
Unfortunately, Perma.cc suffers from a serious design flaw. Here, for example, is one of the Perma.cc URLs Zittrain uses — https://perma.cc/5Y7B-KNXU. It points to an archived copy of https://en.wikipedia.org/wiki/The_Library_of_Babel as it was on 29th June 2021 at 6:16pm. There is a minor and a major problem with this substitution.

The minor problem is that in this case the substitution is redundant. Wikipedia, like all wikis, provides access to the history of its pages. The version current as of 29th June 2021 at 6:16pm was edited by ClueBot NG on June 2nd at 8:38am and, so long as Wikipedia is on-line, is available at https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel&oldid=1026440022. Is Perma.cc guaranteed to survive longer than Wikipedia? Unless it is, the history link is better than the Perma.cc one.
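For the curious, the MediaWiki API can return the revision that was current at any given time, from which a permanent "oldid" link can be built. A minimal sketch, assuming Python and the `requests` library; the title and timestamp are those of the example above:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def permanent_link(title, iso_timestamp):
    """Return the oldid URL for the revision of `title` that was
    current at `iso_timestamp` (e.g. '2021-06-29T18:16:00Z')."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvdir": "older",        # newest revision at or before rvstart
        "rvstart": iso_timestamp,
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    revid = next(iter(pages.values()))["revisions"][0]["revid"]
    return f"https://en.wikipedia.org/w/index.php?title={title}&oldid={revid}"

print(permanent_link("The_Library_of_Babel", "2021-06-29T18:16:00Z"))
```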

The major problem is that in all cases the substitution creates a single point of failure. Suppose Perma.cc were to fail but The Atlantic and Wikipedia were to remain on-line. The URL https://perma.cc/5Y7B-KNXU would return 404, even though the precise content it used to point to was still available. All the Perma.cc links in the original would have rotted, even though the content in question might still be live on the Web, or available from less-doomed archives. Worse, the Perma.cc domain might have been sold to a porn site, so all the Perma.cc links would have drifted, resolving to porn. Zittrain would surely admit that this isn't ideal.

Wikipedia references

As Zittrain points out, link rot happens all the time. In particular, it happens all the time to Wikipedia. In 2016's Fixing broken links in Wikipedia I discussed a collaboration between the Internet Archive and the Wikipedia community that had already replaced over a million rotted links in Wikipedia with links to archived versions. Wikipedians are encouraged to add a link to an archived version to the footnoted reference, as I did with the late John Wharton's page. But, if they omit this precaution and the URL eventually returns 404, there is now a bot that will detect the 404 and replace the URL with one from the Internet Archive.

The bot can do this because it knows:
  • The URL that was live but is now dead.
  • The time that the now-dead URL was added to the Wikipedia page, via Wikipedia's history mechanism.
  • That using an Internet Archive URL as the substitute for the 404 URL is highly likely to work, because the Internet Archive's Wayback Machine routinely collects and preserves the content of all external links in Wikipedia pages.
  • How to construct an Internet Archive URL representing the archived content of the now-dead URL as close before the time it was added as possible (a sketch of this lookup follows the list).
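The same lookup is available to anyone via the Wayback Machine's public availability API, given the original URL and a timestamp. A minimal sketch in Python with `requests`; the dead URL and date here are purely illustrative:

```python
import requests

def closest_snapshot(url, timestamp):
    """Ask the Wayback Machine for the capture of `url` closest to
    `timestamp` (YYYYMMDDhhmmss). Returns the archived URL, or None.
    Note: the API returns the closest capture, which may fall slightly
    after, rather than before, the requested time."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Illustrative only: a rotted reference and the date it was added to the page.
print(closest_snapshot("http://example.com/some-dead-reference", "20160301000000"))
```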
Could The Atlantic create a similar bot that replaced the URLs in dead Perma.cc links with links to the live Web, or to the Internet Archive? No, because there is no way to know from a URL such as https://perma.cc/5Y7B-KNXU what the original link to the content was, and when it was created. That knowledge would go to the grave with Perma.cc.

The Internet Archive doesn't routinely collect Wikipedia pages, because their history is available from Wikipedia itself and there are more vulnerable sites upon which to spend the Internet Archive's limited resources. But I just created a version of https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel via "Save Page Now". The URL of this copy is https://web.archive.org/web/20210806200502/https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel. Unlike the Perma.cc URL, bots can easily and unambiguously parse this URL into the archive (web.archive.org), the date and time of capture (20210806200502) and the URL (https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel). Perma.cc's use of obfuscated URLs is a serious design flaw.
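A sketch of that parsing in Python, using nothing but the structure of the URL itself:

```python
import re

WAYBACK = re.compile(r"^https?://([^/]+)/web/(\d{14})(?:[a-z_]*)/(.+)$")

def parse_wayback_url(archived_url):
    """Split a Wayback-style URL into (archive host, capture time, original URL)."""
    m = WAYBACK.match(archived_url)
    return (m.group(1), m.group(2), m.group(3)) if m else None

print(parse_wayback_url(
    "https://web.archive.org/web/20210806200502/"
    "https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel"))
# ('web.archive.org', '20210806200502',
#  'https://en.wikipedia.org/w/index.php?title=The_Library_of_Babel')
```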

The fact that, from the Internet Archive URL, it is possible to determine the original URL and the date and time of its collection (technically known as the Memento) means that if the Internet Archive URL fails to resolve it is possible to look for a replacement in other Web archives. As I wrote in Fixing broken links in Wikipedia:
But wiring the Internet Archive in as the only source of archived Web pages, while expedient in the short term, is also a problem. It is true that the Wayback Machine is by far the largest repository of archived URLs, but research using Memento (RFC7089) has shown that significantly better reproduction of archived pages can be achieved by aggregating all the available Web archives.

Reinforcing the public perception that the Wayback Machine is the only usable Web archive reduces the motivation for other institutions, such as national libraries, to maintain their own Web archiving efforts. Given the positive effects of aggregating even relatively small Web archives, this impairs the quality of the reader's experience of the preserved Web, and thus Wikipedia.
RFC7089 specifies a system whereby Web archives can interoperate to provide an aggregated view of the total space of archived Web content, so that a system such as Ilya Kreymer's oldweb.today can access the Memento of each component of a Web page closest in time to the user's request. As I wrote in 2015:
BBC News via oldweb.today
Ilya Kreymer has used [Docker] to implement oldweb.today, a site through which you can view Web pages from nearly a dozen Web archives using a contemporary browser. Here, for example, is the front page of the BBC News site as of 07:53GMT on 13th October 1999 viewed with Internet Explorer 4.01 on Windows.
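In RFC7089 terms, a client asks a TimeGate for the Memento of a URL closest to a desired datetime via the Accept-Datetime header; an aggregator such as timetravel.mementoweb.org exposes the same interface across many archives. A minimal sketch in Python, assuming the Wayback Machine's TimeGate at https://web.archive.org/web/; the BBC URL and date mirror the example above and are illustrative:

```python
import requests

def find_memento(timegate, original_url, http_date):
    """Ask an RFC 7089 TimeGate for the Memento of `original_url`
    closest to `http_date` (an RFC 1123 HTTP date string)."""
    resp = requests.get(
        timegate + original_url,
        headers={"Accept-Datetime": http_date},
        allow_redirects=True,     # the TimeGate redirects to the chosen Memento
        timeout=30,
    )
    return resp.url, resp.headers.get("Memento-Datetime")

print(find_memento("https://web.archive.org/web/",
                   "http://news.bbc.co.uk/",
                   "Wed, 13 Oct 1999 07:53:00 GMT"))
```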

Perma.cc does export the information Memento needs to include its holdings in aggregations. This is undoubtedly a good thing, but it still means that the knowledge of which original URL is represented by https://perma.cc/5Y7B-KNXU and when it was collected would go to the grave with Perma.cc unless some other service collected it from Perma.cc and preserved it in a usable form. This wouldn't be necessary if Perma.cc used parseable URLs like the Internet Archive's.

Copyright

Zittrain correctly identifies two causes of Web evanescence that stem from copyright:
  • The one that we noted in our initial LOCKSS paper, that since libraries now rent access to content, it can be withdrawn or changed at the whim of the publisher. He writes:
    Libraries in these scenarios are no longer custodians for the ages of anything, whether tangible or intangible, but rather poolers of funding to pay for fleeting access to knowledge elsewhere.
  • The problem that Web sites hosting content under the DMCA's "safe harbor" provisions face takedown notices from claimed or real copyright owners and, being typically unable to verify the claims, have to remove access to the targeted material.
He discusses the system called Lumen which, on a voluntary basis, records the takedown requests received by many major sites. But Zittrain doesn't specifically note that Web archives are just a special case of Web sites, and are thus vulnerable to targeted attacks using real or spurious DMCA takedowns, impairing their mission of recording Web history.

Mutability

Zittrain doesn't mention the fundamental problem Web archives have. When content was published on paper, the technology required that every reader, every time they read, saw the same content. The Web enables every reader, every time they read, to see completely different content. I've written about this problem several times, for example:
Personalization, geolocation and adaptation to browsers and devices mean that each of the about 3.4×10⁹ Internet users may see different content from each of about 200 countries they may be in, and from each of the say 100 device and browser combinations they may use. Storing every possible version of a single average Web page could thus require downloading about 160 exabytes, 8000 times as much Web data as the Internet Archive holds.

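A back-of-the-envelope check of that arithmetic, assuming an average page weight of about 2.4MB (my assumption, chosen to be consistent with the quoted total):

```python
users = 3.4e9         # Internet users
countries = 200       # possible geolocations
combos = 100          # device and browser combinations
page_bytes = 2.4e6    # assumed average page weight (not from the quote)

versions = users * countries * combos   # ~6.8e13 possible versions of one page
total = versions * page_bytes           # ~1.6e20 bytes
print(f"{total / 1e18:.0f} exabytes")   # prints 163, i.e. about the quoted 160EB
```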
The situation is even worse. Ads are inserted by a real-time auction system, so even if the page content is the same on every visit, the ads differ. Future scholars, like current scholars studying Russian use of social media in the 2016 US election, will want to study the ads but they won't have been systematically collected, unlike political ads on TV.[21]

The point here is that, no matter how much resource is available, it is completely impractical to know that an archive has collected all, or even a representative sample, of the versions of a Web page. This isn't to say that trying to do a better job of collecting some versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave. And doing a better job on each page will be expensive.
Edward Snowden describes the effects of our personalized information ecosystem driven by the profit-seeking FAANGs in Apophenia:
The real cost to this recursive construction of reality from the ephemera of our preferences is that it tailors a separate world for each individual.
...
Shakespeare said that all the world’s a stage. But in this case it’s staged specifically for you, the audience who's also the star.
Even after a Web archive has collected content, it may be mutated. Because Web archives are simply a specialized kind of Web site, they are vulnerable both to the pervasive insecurity of the Internet and to a range of archive-specific attacks such as those discussed by Michael Nelson in his CNI plenary talk entitled Web Archives at the Nexus of Good Fakes and Flawed Originals. A naive trust in Web archives is just as unjustified as a naive trust in libraries was. Physical libraries were vulnerable to many threats, but the system of which they were part was robust:
Library collections of published content on paper formed a model distributed, fault-tolerant system. It had a high degree of replication on durable write-once media, with the replicas under independent administration. It was easy to locate a replica for access, but hard to locate all of them with the goal of suppressing or altering them. This system inspired the technology developed in the LOCKSS (Lots Of Copies Keep Stuff Safe) Program.

Zittrain is very complimentary about the Internet Archive, and stresses that its funding is massively inadequate to the scale of the problem:
That’s vital work, and it should be supported much more, whether with government subsidy or more foundation support. (The Internet Archive was a semifinalist for the MacArthur Foundation’s “100 and Change” initiative, which awards $100 million individually to worthy causes.)
But Zittrain doesn't mention two problems that the success of the Internet Archive has caused. First, it has benefited from the kinds of increasing returns to scale that W. Brian Arthur described in Increasing Returns and Path Dependence in the Economy. It is so much larger than any other Web archive that the public's (false) perception is that it is the only Web archive, and that because it is so big it has solved the problem of Web archiving.

Second, because it benefits from increasing returns to scale, even if funders understand that the public's perception is false, they understand that the Internet Archive's "bang for the buck" in terms of the amount of content preserved per dollar is so much larger than any other archive that directing funds elsewhere seems inefficient.

These two effects mean that Web archiving suffers from a single point of failure problem. The Internet Archive maintains two copies of its collection, one in San Francisco and one across the Bay in Richmond. But it hasn't managed to fund a copy outside US jurisdiction to buffer against legal attacks, much less fund a second implementation of its technical infrastructure to defend against monoculture risk. Alas, at the scale of the Internet Archive, LOCKSS is prohibitively expensive. And given the infeasibility of detailed curation at that scale, the collection is more like a sample. Thus it is arguable that noise in the sample can be more effectively reduced by a larger sample than by preventing loss or corruption in a smaller sample.

Conclusion

None of this should detract from Zittrain's basic message, that the transition from paper to the Web has placed society's memory at much greater risk. He makes sound proposals for changes to the legal framework for on-line publishing:
The law should hesitate before allowing the scope of remedies for claimed infringements of rights—whether economic ones such as copyright or more personal, dignitary ones such as defamation—to expand naturally as the ease of changing what’s already been published increases.

Compensation for harm, or the addition of corrective material, should be favored over quiet retroactive alteration. And publishers should establish clear and principled policies against undertaking such changes under public pressure that falls short of a legal finding of infringement. (And, in plenty of cases, publishers should stand up against legal pressure, too.)
But he doesn't face up to the fundamental problem. We know how to collect and preserve successive versions of content published on the Web. At scale, we don't know how to distinguish between content worthy of preservation and junk. This means we have to try to collect and preserve everything. No-one is willing to pay enough to do an adequate job of this. The fundamental problem isn't technical, or legal. It is economic.
