Monday, November 6, 2017

Keynote at Pacific Neighborhood Consortium

I was invited to deliver a keynote at the 2017 Pacific Neighborhood Consortium in Tainan, Taiwan. My talk, entitled The Amnesiac Civilization, was based on the series of posts earlier this year with the same title. The theme was "Data Informed Society", and my abstract was:
What is the data that informs a society? It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.

For nearly a thousand years this knowledge base has been stored on paper, an affordable, durable, write-once and somewhat tamper-evident medium. For more than five hundred years it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. LOCKSS is the name of the program at the Stanford Libraries that Vicky Reich and I started in 1998. We took a distributed approach, providing libraries with tools they could use to preserve knowledge in the Web world. They could work the way they were used to doing in the paper world, by collecting copies of published works, making them available to readers, and cooperating via inter-library loan. Two years earlier, Brewster Kahle had founded the Internet Archive, taking a centralized approach to the same problem.

Why are these programs needed? What have we learned in the last two decades about their effectiveness? How does the evolution of Web technologies place their future at risk?
Below the fold, the text of my talk.

Introduction

I'm honored to join the ranks of your keynote speakers, and grateful for the opportunity to visit beautiful Taiwan. You don't need to take notes, or photograph the slides, or even struggle to understand my English, because the whole text of my talk, with links to the sources and much additional material in footnotes, has been posted to my blog.

What is the data that informs a society? It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.

Qing dynasty print of Cai Lun
Ever since 105 AD, when Cai Lun (蔡伦) invented the process for making paper, civilizations have used it to record their history and its context in everyday life. Archives and libraries collected and preserved originals. Scribes labored to create copies, spreading the knowledge they contained. Bi Sheng's (毕昇) invention of movable type in the 1040s AD greatly increased the spread of copies and thus knowledge, as did Choe Yun-ui's (최윤의) 1234 invention of bronze movable type in Korea, Johannes Gutenberg's 1439 development of the metal type printing press in Germany, and Hua Sui's (华燧) 1490 introduction of bronze type in China.[1]

Thus for about two millennia civilizations have been able to store their knowledge base on this affordable, durable, write-once, and somewhat tamper-evident medium. For more than half a millennium it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. But for about two decades the knowledge base has been migrating off paper and on to the Web.

Lots Of Copies Keep Stuff Safe is the name of the program at the Stanford Libraries that Vicky Reich and I started 19 years ago last month. We took a distributed approach to preserving knowledge, providing libraries with tools they could use to continue, in the Web world, their paper-world role of collecting copies of published works and making them available to readers. Two years earlier, Brewster Kahle had founded the Internet Archive, taking a centralized approach to the same problem.

My talk will address three main questions:
  • Why are these programs needed?
  • What have we learned in the last two decades about their effectiveness?
  • How does the evolution of Web technology place their future at risk?

Why archive the Web?

Paper is a durable medium, but the Web is not. From its earliest days users have experienced "link rot", links to pages that once existed but have vanished. Even in 1997 they saw it as a major problem:
6% of the links on the Web are broken according to a recent survey by Terry Sullivan's All Things Web. Even worse, linkrot in May 1998 was double that found by a similar survey in August 1997.

Linkrot definitely reduces the usability of the Web, being cited as one of the biggest problems in using the Web by 60% of the users in the October 1997 GVU survey. This percentage was up from "only" 50% in the April 1997 survey.[4]
Figure 1b
Research at scale in 2001 by Lawrence et al validated this concern. They:
analyzed 270,977 computer science journal papers, conference papers, and technical reports ... From the 100,826 articles cited by another article in the database (thus providing us with the year of publication), we extracted 67,577 URLs. ... Figure 1b dramatically illustrates the lack of persistence of Internet resources. The percentage of invalid links in the articles we examined varied from 23 percent in 1999 to a peak of 53 percent in 1994.
The problem is worse than this. Martin Klein and co-authors point out that Web pages suffer two forms of decay or reference rot:
  • Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
  • Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
Similarity over time at arXiv
They examined scholarly literature on the Web and found:
one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
The problem gets worse through time:
even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015. This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC.
Thus, as the arXiv graph shows, they find that, after a few years, it is very unlikely that a reader clicking on a web-at-large link in an article will see what the author intended.

Source
This isn't just a problem for scholarly literature, it is even worse on the general Web. The British Library's Andy Jackson analyzed the UK Web Archive and:
was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%.[5]
It isn't just that Web servers can go away or their contents be rewritten. Access to Web pages is mediated by the Domain Name Service (DNS), and they can become inaccessible because the domain owner fails to pay the registrar, or their DNS service, or for political reasons:
In March 2010 every webpage with the domain address ending in .yu disappeared from the internet – the largest ever to be removed. This meant that the internet history of the former Yugoslavia was no longer available online. Dr Anat Ben-David, from the Open University in Israel, has managed to rebuild about half of the lost pages – pages that document the Kosovo Wars, which have been called "the first internet war”.[8]
The terms "link rot" and "content drift" suggest randomness but in many cases they hide deliberate suppression or falsification of information. More than a decade ago, in only my 6th blog post, I wrote:
Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception.
Source
Examples of Winston Smith's work are everywhere, such as:
George W. Bush’s “Mission Accomplished” press release. The first press release read that ‘combat operations in Iraq have ceased.’ After a couple weeks, that was changed to ‘major combat operations have ceased.’ And then, the whole press release disappeared off the White House’s website completely.[7]
Source
Britain has its Winston Smiths too:
One of the most enduring symbols of 2016's UK Brexit referendum was the huge red "battle bus" with its message, "We send the EU £350 million a week, let's fund our NHS instead. Vote Leave." ... Independent fact-checkers declared the £350 million figure to be a lie. Within hours of the Brexit vote, the Leave campaign scrubbed its website of all its promises, and Nigel Farage admitted that the £350 million was an imaginary figure and that the NHS would not see an extra penny after Brexit.[6]
Source
Data on the Web is equally at risk. Under the Harper administration, Canadian librarians fought a long, lonely struggle with their Winston Smiths:
Protecting Canadians’ access to data is why Sam-Chin Li, a government information librarian at the University of Toronto, worked late into the night with colleagues in February 2013, frantically trying to archive the federal Aboriginal Canada portal before it disappeared on Feb. 12. The decision to kill the site, which had thousands of links to resources for Aboriginal people, had been announced quietly weeks before; the librarians had only days to train with web-harvesting software.[11]
EOT Crawl Statistics
A year ago, a similar but much larger emergency data rescue effort swung into action in the US:
Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data. This includes over 100TB of public websites and over 100TB of public data from federal FTP file servers totaling, together, over 350 million URLs/files.
Partly this was the collaborative "End of Term" (EOT) crawl that is organized at each change of Presidential term, but this time there was added urgency:
Through the EOT project’s public nomination form and through our collaboration with DataRefuge, the Environmental Data and Governance Initiative (EDGI), and other efforts, over 100,000 webpages or government datasets were nominated by citizens and preservationists for archiving.
1437 nova remains
Note the emphasis on datasets. It is important to keep scientific data, especially observations that are not repeatable, for the long term. A recent example is Korean astronomers' records of a nova in 1437, which provide strong evidence that:
"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.[12]
By BabelStone, CC BY-SA 3.0
Source
580 years is peanuts. An example more than 5 times older is from China. In the Shang dynasty:
astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
Today, those eclipse records would be on the Web, not paper or bone. Will astronomers 3200 or even 580 years from now be able to use them?

What have we learned about archiving the Web?

I hope I've convinced you that a society whose knowledge base is on the Web is doomed to forget its past unless something is done to preserve it[18]. Preserving Web content happens in three stages:
  • Collection
  • Preservation
  • Access

What have we learned about collection?

In the wake of NASA's March 2013 takedown of their Technical Report Server James Jacobs, Stanford's Government Documents librarian, stressed the importance of collecting Web content:
pointing to web sites is much less valuable and much more fragile than acquiring copies of digital information and building digital collections that you control. The OAIS reference model for long term preservation makes this a requirement ... “Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.” Pointing to a web page or PDF at nasa.gov is not obtaining any control.
Memory institutions need to make their own copies of Web content. Who is doing this?

Internet Archive HQ
The Internet Archive is by far the largest and most used Web archive, having been trying to collect the whole of the Web for more than two decades. Its "crawlers" start from a large set of "seed" web pages and follow links from them to other pages, then follow those links, according to a set of "crawl rules". Well-linked-to pages will be well represented; they may be important or they may be "link farms". Two years ago Kalev Leetaru wrote:
of the top 15 websites with the most snapshots taken by the Archive thus far this year, one is an alleged former movie pirating site, one is a Hawaiian hotel, two are pornography sites and five are online shopping sites. The second-most snapshotted homepage is of a Russian autoparts website and the eighth-most-snapshotted site is a parts supplier for trampolines.
The Internet Archive's highly automated collection process may collect a lot of unimportant stuff, but it is the best we have at collecting the "Web at large"[25]. The Archive's recycled church in San Francisco and its second site nearby sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. Each stores over 3×10^11 Web pages, among much other content. The Archive has been for many years in the top 300 Web sites in the world. For comparison, the Library of Congress typically ranks between 4000 and 6000.
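To make the crawl loop concrete, here is a minimal sketch of the seed-and-follow process described above. It is illustrative only: a production crawler such as the Internet Archive's Heritrix adds politeness delays, robots.txt handling, de-duplication and WARC output, none of which appear here, and the names used are mine.

```typescript
// Minimal sketch of a seed-and-follow Web crawl (illustrative, not Heritrix).
// "allowed" stands in for the crawl rules that decide which links to follow.
async function crawl(
  seeds: string[],
  allowed: (url: string) => boolean,
  limit = 1000,
): Promise<Map<string, string>> {
  const queue = [...seeds];
  const seen = new Set<string>(seeds);
  const archive = new Map<string, string>();      // url -> captured HTML

  while (queue.length > 0 && archive.size < limit) {
    const url = queue.shift()!;
    try {
      const html = await (await fetch(url)).text();
      archive.set(url, html);                      // a real crawler writes WARC records
      // Naive link extraction; real crawlers parse the DOM properly.
      for (const m of html.matchAll(/href="([^"]+)"/g)) {
        const link = new URL(m[1], url).toString();
        if (!seen.has(link) && allowed(link)) {    // apply the crawl rules
          seen.add(link);
          queue.push(link);
        }
      }
    } catch {
      // Fetch or URL parse failed (link rot in action): record nothing, move on.
    }
  }
  return archive;
}
```

Even this toy version shows why well-linked-to pages dominate the result: the queue fills up with whatever the already-captured pages choose to link to.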

Network effects mean that technology markets in general and the Web in particular are winner-take-all markets. Just like Google in search, the Internet Archive is the winner in its market. Other institutions can't compete in archiving the whole Web, they must focus on curated collections.

UK Web Archive
The British Library, among other national libraries, has been collecting their "national Web presence" for more than a decade. One problem is defining "national Web presence". Clearly, it is more than the .uk domain, but how much more? The Portuguese Web Archive defines it as the .pt domain plus content embedded in or redirected from the .pt domain. That wouldn't work for many countries, where important content is in top-level domains such as .com.

Dr. Regan Murphy Kao of Stanford's East Asian Library described their approach to Web collecting:
Mao Kobayashi
we sought to archive a limited number of blogs of ground-breaking, influential figures – people whose writings were widely read and represented a new way of approaching a topic. One of the people we choose was Mao Kobayashi. ... Mao broke with tradition and openly described her experience with cancer in a blog that gripped Japan. She harnessed this new medium to define her life rather than allow cancer to define it.[3]
Curated collections have a problem. What made the Web transformational was the links (see Google's PageRank). Viewed in isolation, curated collections break the links and subtract value. But, viewed as an adjunct to broad Web archives they can add value in two ways:
UK Web Archive link analysis
  • By providing quality assurance, using greater per-site resources to ensure that important Web resources are fully collected.
  • By providing researchers better access to preserved important Web resources than the Internet Archive can. For example, better text search or data mining. The British Library has been a leader in this area.
Nearly one-third of a trillion Web pages at the Internet Archive is impressive, but in 2014 I reviewed the research into how much of the Web was then being collected[13] and concluded:
Somewhat less than half ... Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.
Costa et al ran surveys in 2010 and 2014 and concluded in 2016:
during the last years there was a significant growth in initiatives and countries hosting these initiatives, volume of data and number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort on preserving digital information, other results presented throughout the paper raise concerns such as the small amount of archived data in comparison with the amount of data that is being published online.
I revisited this topic earlier this year and concluded that we were losing ground rapidly. Why is this? The reason is that collecting the Web is expensive, whether it uses human curators or large-scale technology, and that Web archives are pathetically under-funded:
The Internet Archive's budget is in the region of $15M/yr, about half of which goes to Web archiving. The budgets of all the other public Web archives might add another $20M/yr. The total worldwide spend on archiving Web content is probably less than $30M/yr, for content that cost hundreds of billions to create[19].
My rule of thumb has been that collection takes about half the lifetime cost of digital preservation, preservation about a third, and access about a sixth. So the world may spend only about $15M/yr collecting the Web.

As an Englishman, I should also observe in this forum that, like the Web itself, archiving is biased towards English. For example, pages in Korean are less than half as likely to be collected as pages in English[20].

What have we learned about preservation?

Since Jeff Rothenberg's seminal 1995 Ensuring the Longevity of Digital Documents it has been commonly assumed that digital preservation revolved around the problem of formats becoming obsolete, and thus their content becoming inaccessible. But in the Web world formats go obsolete very slowly if at all. The real problem is simply storing enough bits reliably enough for the long term.[9]

Kryder's Law
Or rather, since we actually know how to store bits reliably, it is finding enough money to store enough bits reliably enough for the long term. This used to be a problem we could ignore. Hard disk, a 60-year-old technology, is the dominant medium for bulk data storage. It had a remarkable run of more than 30 years of 30-40%/year price declines, a trend known as Kryder's Law, the disk analog of Moore's Law. The rapid cost decrease meant that if you could afford to store data for a few years, you could afford to store it "forever".

Kryder's Law breakdown
Just as Moore's Law slowed dramatically as the technology approached its physical limits, so did Kryder's Law. The slowing started in 2010, and was followed by the 2011 floods in Thailand, which caused disk prices to double and not recover for 3 years. In 2014 we predicted Kryder rates going forward of 10-20%, the red lines on the graph, meaning that:
If the industry projections pan out ... by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Economic model output
So far, our prediction has proved correct, which is bad news. The graph shows the endowment, the money which, deposited with the data and invested at interest, will cover the cost of storage "forever". It increases strongly as the Kryder rate falls below 20%, which it has. Absent unexpected technological change, the cost of long-term data storage is far higher than most people realize.[10]
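As a much-simplified sketch of the idea behind the graph (my own toy version, not the full model cited in footnote 10: it assumes a constant Kryder rate and a constant real interest rate, and ignores everything except raw media cost), the endowment is just the net present value of a declining stream of annual storage costs:

```typescript
// Toy endowment calculation: the lump sum needed today to pay for storing one
// unit of data for `years` years, if the annual storage cost falls by
// `kryderRate` per year and invested funds earn `interestRate` per year.
function endowment(
  firstYearCost: number,   // e.g. $/TB for the first year
  kryderRate: number,      // 0.1 means costs fall 10%/year
  interestRate: number,    // 0.02 means the fund earns 2%/year in real terms
  years = 100,             // "forever", approximately
): number {
  let total = 0;
  for (let y = 0; y < years; y++) {
    const costInYearY = firstYearCost * Math.pow(1 - kryderRate, y);
    total += costInYearY / Math.pow(1 + interestRate, y);  // discount to present value
  }
  return total;
}

// The direction of the effect shown in the graph: a high Kryder rate keeps the
// endowment close to a few years' cost, a low one multiplies it.
console.log(endowment(100, 0.4, 0.02).toFixed(0));  // ≈ 243
console.log(endowment(100, 0.1, 0.02).toFixed(0));  // ≈ 850
```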

What have we learned about access?

There is clearly a demand for access to the Web's history. The Internet Archive's Wayback Machine provides well over 1M users/day with access to it.

1995 Web page
On Jan 11th 1995, the late Mark Weiser, CTO of Xerox PARC, created Nijinsky and Pavlova's Web page, perhaps the start of the Internet's obsession with cat pictures. You can view this important historical artifact by pointing your browser to the Internet Archive's Wayback Machine, which captured the page 39 times between Dec 1st 1998 and May 11th 2008. What you see using your modern browser is perfectly usable, but it is slightly different from what Mark saw when he finished the page over 22 years ago.


Via oldweb.today
Ilya Kreymer used two important recent developments in digital preservation to build oldweb.today, a Web site that allows you to view preserved Web content using the browser that its author would have. The first is that emulation & virtualization techniques have advanced to allow Ilya to create on-the-fly, behind the Web page, a virtual machine running, in this case, a 1998 version of Linux with a 1998 version of the Mosaic browser visiting the page. Note the different fonts and background. This is very close to what Mark would have seen in 1995.

Source
The Internet Archive is by far the biggest Web archive, for example holding around 40 times as much data as the UK Web Archive. But the smaller archives contain pages it lacks, and there is little overlap between them, showing the value of curation.

The second development is Memento (RFC 7089), a Web protocol that allows access facilities such as oldweb.today to treat the set of compliant Web archives as if it were one big archive. oldweb.today aggregates many Web archives, pulling each Web resource a page needs from the archive with the copy closest in time to the requested date.[14]
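As an illustration of what a Memento request looks like (my sketch, not oldweb.today's actual code; the aggregator TimeGate URL is an assumption based on the public Time Travel service), a client asks for the capture closest to a target datetime like this:

```typescript
// Ask a Memento TimeGate (RFC 7089) for the capture of a URL closest to a
// target datetime. The TimeGate answers an Accept-Datetime request by
// redirecting to the best-matching memento in the archives it fronts.
async function closestMemento(url: string, when: Date): Promise<string | null> {
  // Assumed endpoint: the Memento project's public aggregator TimeGate.
  const timegate = "http://timetravel.mementoweb.org/timegate/" + url;
  const res = await fetch(timegate, {
    headers: { "Accept-Datetime": when.toUTCString() },  // RFC 1123 date format
    redirect: "follow",
  });
  // After following the redirect, res.url is the URI of the chosen memento;
  // the Link header also advertises the TimeMap and first/last mementos.
  return res.ok ? res.url : null;
}

// Example (hypothetical use):
// closestMemento("http://example.com/", new Date("1998-12-01")).then(console.log);
```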

The future of Web preservation

There are two main threats to the future of Web preservation, one economic and the other a combination of technological and legal.

The economic threat

Preserving the Web and other digital content for posterity is primarily an economic problem. With an unlimited budget, collection and preservation aren't a problem. The reason we're collecting and preserving less than half the classic Web of quasi-static linked documents is that no-one has the money to do much better. The other half is more difficult and thus more expensive. Collecting and preserving the whole of the classic Web would need the current global Web archiving budget to be roughly tripled, perhaps an additional $50M/yr.

Then there are the much higher costs involved in preserving the much more than half of the dynamic "Web 2.0" we currently miss.

British Library real income
If we are to continue to preserve even as much of society's memory as we currently do, we face two very difficult choices: either find a lot more money, or radically reduce the cost per site of preservation.

It will be hard to find a lot more money in a world where libraries and archive budgets are decreasing. For example, the graph shows that the British Library's income has declined by 45% in real terms over the last decade.[2]

The Internet Archive is already big enough to reap economies of scale, and it already uses innovative engineering to minimize cost. But Leetaru and others criticize it for:
  • Inadequate metadata to support access and research.
  • Lack of quality assurance leading to incomplete collection of sites.
  • Failure to collect every version of a site.
Generating good metadata and doing good QA are hard to automate and thus the first two are expensive. But the third is simply impossible.

The technological/legal threat

The economic problems of archiving the Web stem from its two business models, advertising and subscription:
Note weather in GIF of reloads of CNN page from Wayback Machine[17]
  • To maximize the effectiveness of advertising, the Web now potentially delivers different content on every visit to a site. What does it mean to "archive" something that changes every time you look at it?
  • To maximize the effectiveness of subscriptions, the Web now potentially prevents copying content and thus archiving it. How can you "archive" something you can't copy?
Personalization, geolocation and adaptation to browsers and devices mean that each of the roughly 3.4×10^9 Internet users may see different content, different again in each of about 200 countries they may be in, and different again with each of, say, 100 device and browser combinations they may use. Storing every possible version of a single average Web page could thus require downloading about 160 exabytes, 8000 times as much Web data as the Internet Archive holds.
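The arithmetic behind that estimate works out as follows; the average page weight of roughly 2.4MB is not a number from the talk, it is the value implied by the 160 exabyte figure.

```typescript
// Back-of-the-envelope: bytes needed to hold every possible version of one
// "average" page, if every (user, country, device/browser) combination may
// see different content. The ~2.4MB page weight is the value implied by the
// 160 exabyte figure, not a measured number.
const users = 3.4e9;
const countries = 200;
const deviceBrowserCombos = 100;
const avgPageBytes = 2.4e6;

const versions = users * countries * deviceBrowserCombos;   // 6.8e13 versions
const totalBytes = versions * avgPageBytes;                  // ≈ 1.6e20 bytes

console.log((totalBytes / 1e18).toFixed(0), "exabytes");     // ≈ 163 exabytes
// 160 exabytes / 8,000 ≈ 20 petabytes, the implied size of the
// Internet Archive's Web collection at the time.
```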

Source[16]
The situation is even worse. Ads are inserted by a real-time auction system, so even if the page content is the same on every visit, the ads differ. Future scholars, like current scholars studying Russian use of social media in the 2016 US election, will want to study the ads but they won't have been systematically collected, unlike political ads on TV.[21]

The point here is that, no matter how much resource is available, knowing that an archive has collected all, or even a representative sample, of the versions of a Web page is completely impractical. This isn't to say that trying to do a better job of collecting some versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave. And doing a better job of each page will be expensive.

Although it is possible to collect some versions of today's dynamic Web pages, it is likely soon to become impossible to collect any version of most Web pages. Against unprecedented opposition, Netflix and other large content owners with subscription or pay-per-view business models have forced W3C, the standards body for the Web, to mandate that browsers support Encrypted Media Extensions (EME), that is, Digital Rights Management (DRM) for the Web.[15]

EME data flows
The W3C's diagram of the EME stack shows an example of how it works. An application, i.e. a Web page, requests the browser to render some encrypted content. It is delivered, in this case from a Content Distribution Network (CDN), to the browser. The browser needs a license to decrypt it, which it obtains from the application via the EME API by creating an appropriate session then using it to request the license. It hands the content and the license to a Content Decryption Module (CDM), which can decrypt the content using a key in the license and render it.
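In code, the application side of that flow looks roughly like the sketch below. It is a simplified illustration of the standard EME API; the key system string, license server URL and capability values are placeholders, and a real player adds error handling, key rotation and so on.

```typescript
// Simplified sketch of the EME flow described above: the page (application)
// shuttles opaque messages between the CDM and a license server, so that the
// CDM, not the browser, ends up holding the decryption key.
async function playEncrypted(video: HTMLVideoElement, licenseServerUrl: string) {
  // 1. Ask the browser for a CDM ("key system"); the string is a placeholder.
  const access = await navigator.requestMediaKeySystemAccess("com.example.drm", [{
    initDataTypes: ["cenc"],
    videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
  }]);
  const mediaKeys = await access.createMediaKeys();
  await video.setMediaKeys(mediaKeys);

  // 2. When the browser meets encrypted content (delivered, say, from a CDN),
  //    the CDM generates a license request for the page to forward.
  video.addEventListener("encrypted", async (event: MediaEncryptedEvent) => {
    const session = mediaKeys.createSession();
    session.addEventListener("message", async (msg: MediaKeyMessageEvent) => {
      const response = await fetch(licenseServerUrl, {
        method: "POST",
        body: msg.message,                  // opaque, CDM-generated request
      });
      // 3. The license (containing the key) goes straight back to the CDM;
      //    the page and the browser never see the key or the cleartext.
      await session.update(await response.arrayBuffer());
    });
    await session.generateRequest(event.initDataType!, event.initData!);
  });
}
```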

What is DRM trying to achieve? Ostensibly, it is trying to ensure that each time DRM-ed content is rendered, specific permission is obtained from the content owner. In order to ensure that, the CDM cannot trust the browser it is running in. For example, it must be sure that the browser can see neither the decrypted content nor the key. If the browser could see, and save for future use, either of them, it would defeat the purpose of DRM. The license server will not be available to the archive's future users, so preserving the encrypted content without the license is pointless.

Content owners are not stupid. They realized early on that the search for uncrackable DRM was a fool's errand[22]. So, to deter reverse engineering, they arranged for the 1998 Digital Millennium Copyright Act (DMCA) to make any attempt to circumvent protections on digital content a criminal offense. US trade negotiations mean that almost all countries (except Israel) have DMCA-like laws.[23]

Thus we see that the real goal of EME is to ensure that, absent special legislation such as a few national libraries have, anyone trying to either capture the decrypted content, or preserve the license for future use, would be committing a crime. Even though the British Library, among others, has the legal right to capture the decrypted content, it is doubtful that they have the technical or financial resources to do so at scale.

Scale is what libraries and archives would need. Clearly, EME will be rapidly adopted for streaming video sites, not just Netflix but YouTube, Vimeo and so on. Even a decade ago, to study US elections you needed YouTube video, but it will no longer be possible to preserve Web video.

But that's not the big impact that EME will have on society's memory. It is intended for video and audio, but it will be hard for W3C to argue that other valuable content doesn't deserve the same protection, for example academic journals. DRM will spread to other forms of content. The business models for Web content are of two kinds, and both are struggling:
  • Paywalled content. It turns out that, apart from movies and academic publishing, only a very few premium brands such as The Economist, the Wall Street Journal and the New York Times have viable subscription business models based on (mostly) paywalled content. Even excellent journalism such as The Guardian is reduced to free access, advertising and voluntary donations. Part of the reason is that Googling the headline of paywalled news stories often finds open access versions of the content. Clearly, newspapers and academic publishers would love to use Web DRM to ensure that their content could be accessed only from their site, not via Google or Sci-Hub.
  • Advertising-supported content. The market for Web advertising is so competitive and fraud-ridden that Web sites have been forced to let advertisers run ads that are obnoxious and often riddled with malware, and to load up their sites with trackers, to the point that many users have rebelled.[24] They use ad-blockers; these days it is pretty much essential to do so to keep yourself safe and to reduce bandwidth consumption. Not to mention that sites such as Showtime are so desperate for income that their ads mine cryptocurrency in your browser. Sites are very worried about the loss of income from blocked ads. Some, such as Forbes, refuse to supply content to browsers that block ads (which, in Forbes' case, turned out to be a public service; the ads carried malware). DRM-ing a site's content will prevent ads being blocked. Thus ad space on DRM-ed sites will be more profitable, and sell for higher prices, than space on sites where ads can be blocked. The pressure on advertising-supported sites, which include both free and subscription news sites, to DRM their content will be intense.
Thus the advertising-supported bulk of what we think of as the Web, and the paywalled resources such as news sites that future scholars will need, will become un-archivable.

Summary

I wish I could end this talk on an optimistic note, but I can't. The information the future will need about the world of today is on the Web. Our ability to collect and preserve it has been both inadequate and decreasing. This is primarily due to massive under-funding: a few tens of millions of dollars per year worldwide, set against the trillions of dollars per year of revenue the Web generates. There is no realistic prospect of massively increasing the funding for Web archiving. The funding for the world's memory institutions, whose job it is to remember the past, has been under sustained attack for many years.

The largest component of the cost of Web archiving is the initial collection. The evolution of the Web from a set of static, hyper-linked documents to a JavaScript programming environment has been steadily raising the difficulty and thus cost of collecting the typical Web page. The increasingly dynamic nature of the resulting Web content means that each individual visit is less and less representative of "the page". What does it mean to "preserve" something that is different every time you look at it?

And now, with the advent of Web DRM, our likely future is one in which it is not simply increasingly difficult, expensive and less useful to collect Web pages, but actually illegal to do so.

Call To Action

So I will end with a call to action. Please:
  • Use the Wayback Machine's Save Page Now facility to preserve pages you think are important (see the sketch after this list).
  • Support the work of the Internet Archive by donating money and materials.
  • Make sure your national library is preserving your nation's Web presence.
  • Push back against any attempt by W3C to extend Web DRM.
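As a small illustration of the first item, a page can also be nominated for capture programmatically via the Wayback Machine's save URL pattern. The sketch below assumes the public https://web.archive.org/save/<url> form and is not an official client; the service may rate-limit requests or change its interface.

```typescript
// Ask the Wayback Machine to capture a page now, using the public
// "Save Page Now" URL pattern (https://web.archive.org/save/<url>).
// Illustrative sketch only, not an official API client.
async function savePageNow(url: string): Promise<string | null> {
  const res = await fetch("https://web.archive.org/save/" + url, { redirect: "follow" });
  if (!res.ok) return null;
  // The final URL (or the Content-Location header, when present) points at the
  // new capture, e.g. https://web.archive.org/web/<timestamp>/<url>.
  return res.headers.get("content-location") ?? res.url;
}

// Example:
// savePageNow("https://example.com/").then(console.log);
```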

Footnotes

  1. For details, see volume 5 part 1 of Joseph Needham's Science and Civilisation in China.
  2. The nominal income data was obtained from the British Library's Annual Report series. The real income was computed from it using the Bank of England's official inflation calculator. More on this, including the data for the graph, is here.
  3. The BBC report on their nomination of Mao Kobayashi as one of their "100 Women 2016" includes:
    In Japan, people rarely talk about cancer. You usually only hear about someone's battle with the disease when they either beat it or die from it, but 34-year-old newsreader Mao Kobayashi decided to break the mould with a blog - now the most popular in the country - about her illness and how it has changed her perspective on life.
    It is noteworthy that, unlike the BBC, Dr. Kao doesn't link to Mao Kobayashi's blog, nor does she link to Stanford's preserved copy. Fortunately, the Internet Archive started collecting Kobayashi's blog in September 2016.
  4. Note that both links in this quote have rotted. I replaced them with links to the preserved copies in the Internet Archive's Wayback Machine.
  5. More on this and related research can be found here and here.
  6. Compare the Wayback Machine's capture of the Leave campaign's website  three days before the referendum with the day after (still featuring the battle bus with the £350 million claim) and nine weeks later (without the battle bus). This scrubbing of inconvenient history is a habit with UK Conservatives:
    The Conservatives have removed a decade of speeches from their website and from the main internet library – including one in which David Cameron claimed that being able to search the web would democratise politics by making "more information available to more people".

    The party has removed the archive from its public website, erasing records of speeches and press releases from 2000 until May 2010. The effect will be to remove any speeches and articles during the Tories' modernisation period, including its commitment to spend the same as a Labour government.
    ...
    In a remarkable step the party has also blocked access to the Internet Archive's Wayback Machine, a US-based library that captures webpages for future generations, using a software robot that directs search engines not to access the pages.
  7. Wikipedia has a comprehensive article on the "Mission Accomplished" speech. A quick-thinking reporter's copying of an on-line court docket revealed more history rewriting in the aftermath of the 2008 financial collapse. Another example of how the Wayback Machine exposed more of the Federal government's Winston Smith-ing is here.
  8. The quote is the abstract to a BBC World Service programme entitled Restoring a Lost Web Domain.
  9. To discuss the reliability requirements for long-term storage, I've been using "A Petabyte for a Century" as a theme on my blog since 2007. It led to a paper at iPRES 2008 and an article for ACM Queue entitled Keeping Bits Safe: How Hard Can It Be? which subsequently appeared in Communications of the ACM, triggering some interesting observations on copyright.
  10. I have been blogging about the costs of long-term storage with the theme "Storage Will Be Much Less Free Than It Used To Be" since at least 2010. The graph is from a simplified version of the economic model of long-term storage initially described in this 2012 paper. For a detailed discussion of technologies for long-term storage see The Medium-Term Prospects For Long-Term Storage Systems.
  11. The Maclean's article continues:
    The need for such efforts has taken on new urgency since 2014, says Li, when some 1,500 websites were centralized into one, with more than 60 per cent of content shed. Now that reporting has switched from print to digital only, government information can be altered or deleted without notice, she says. (One example: In October 2012, the word “environment” disappeared entirely from the section of the Transport Canada website discussing the Navigable Waters Protection Act.)
  12. Source
    The nova was recorded in the Joseonwangjosillok, the Annals of the Joseon Dynasty. Multiple copies were printed on paper, stored in multiple, carefully designed, geographically diverse archives, and faithfully tended according to a specific process of regular audit, the details of which were carefully recorded each time. Copies lost in war were re-created. I blogged about the truly impressive preservation techniques that have kept them legible for almost 6 centuries.
  13. Estimating what proportion of the Web is preserved is a hard problem. The numerator, the content of the world's Web archives, is fairly easy to measure. But the denominator, the size of the whole Web, is extremely hard to measure. I discuss some attempts here.
  14. Ilya Kreymer details the operation of oldweb.today in a guest post on my blog.
  15. For all the gory details on the problems EME poses for archives, the security of Web browsers, and many other areas, see The Amnesiac Civilization: Part 4.
  16. The image is a slide from a talk by the amazing Maciej Cegłowski entitled What Happens Next Will Amaze You, and it will. Cory Doctorow calls Cegłowski's talks "barn-burning", and he's right.
  17. This GIF is an animation of a series of reloads of a preserved version of CNN's home page from 27 July 2013. Nothing is coming from the live Web, all the different weather images are the result of a single collection by the Wayback Machine's crawler. GIF and information courtesy of Michael Nelson.
  18. See also this excellent brief video from Arquivo.pt, the Portuguese Web Archive, explaining Web archiving for the general public. It is in Portuguese but with English subtitles.
  19. "hundreds of billions" is guesswork on my part. I don't know any plausible estimates of the investment in Web content. In this video, Daniel Gomes claims that Arquivo.pt estimated that the value of the content it preserves is €216B, but the methodology used is not given. Their estimate seems high, Portugal's 2016 GDP was €185B.
  20. In Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages Lulwah Alkwai, Michael Nelson and Michelle Weigle report that in their sample of Web pages:
    English has a higher archiving rate than Arabic, with 72.04% archived. However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively.
    Their Table 31 also reveals some problems with the quality of Korean Web page archiving.
    Table XXXI. Top 10 Archived Korean URI-Rs
    Korean URI-Rs                                          Memento Count   Category
    doumi.hosting.bora.net/infomail/error/message04.html   54,339          Error page
    joins.com                                              17,096          News
    html.giantsoft.co.kr/404.html                          17,046          Error page
    errdoc.gabia.net/403.html                              16,414          Error page
    daum.net                                               14,305          Search engine
    img.kbs.co.kr/pageerror                                13,042          Error page
    hani.co.kr/oops.html                                   12,676          Error page
    chosun.com                                              9,839          Newspaper
    donga.com                                               9,587          News
    hankooki.com                                            7,762          Search engine
    5 of the top 10 most frequently archived Korean URIs in their sample are what are called "soft 404s", pages that should return "404 Not Found" but instead return "200 OK".
  21. Senators have introduced a bill:
    the Honest Ads Act, would require companies like Facebook and Google to keep copies of political ads and make them publicly available. Under the act, the companies would also be required to release information on who those ads were targeted to, as well as information on the buyer and the rates charged for the ads.
  22. Based on reporting by Kyle Orland at Ars Technica, Cory Doctorow writes:
    Denuvo is billed as the video game industry's "best in class" DRM, charging games publishers a premium to prevent people from playing their games without paying for them. In years gone by, Denuvo DRM would remain intact for as long as a month before cracks were widely disseminated.

    But the latest crop of Denuvo-restricted games were all publicly cracked within 24 hours.

    It's almost as though hiding secrets in code you give to your adversary was a fool's errand.
  23. Notably, Portugal's new law on DRM contains several useful features including a broad exemption from anti-circumvention for libraries and archives, and a ban on applying DRM to public-domain or government-financed documents. For details, see the EFF's Deeplinks blog.
  24. Some idea of the level of fraud in Web advertising can be gained from an experiment by Business Insider:
    a Business Insider advertiser thought they had purchased $40,000 worth of ad inventory through the open exchanges when in reality, the publication only saw $97, indicating the rest of the money went to fraud.

    "There was more people saying they were selling Business Insider inventory then we could ever possibly imagine," ... "We believe there were 10 to 30 million impressions of Business Insider, for sale, every 15 minutes."

    To put the numbers in perspective, Business Insider says it sees 10 million to 25 million impressions a day.
  25. Last Thursday's example was New York's Gothamist and DNAinfo local news sites:
    A week ago, reporters and editors in the combined newsroom of DNAinfo and Gothamist, two of New York City’s leading digital purveyors of local news, celebrated victory in their vote to join a union.

    On Thursday, they lost their jobs, as Joe Ricketts, the billionaire founder of TD Ameritrade who owned the sites, shut them down.
    Twitter was unhappy:
    The unannounced closure prompted wide backlash from other members of the media on Twitter, with many pointing out that neither writers whose work was published on the sites nor readers can access their articles any longer.
    The Internet Archive has collected Gothamist since 2003; it has many but not all of the articles. By Saturday, Joe Ricketts thought better of the takedown; the sites were back up. But for how long? See also here.

Acknowledgements

I'm grateful to Herbert van de Sompel and Michael Nelson for constructive comments on drafts, and to Cliff Lynch, Michael Buckland and the participants in UC Berkeley's Information Access Seminar, who provided useful feedback on a rehearsal of this talk. That isn't to say they agree with it.

14 comments:

David. said...

The Internet Archive's Dodging the Memory Hole 2017 event covered many of the themes from my talk:

"Kahle and others made it clear that today's political climate has added a sense of urgency to digital preservation efforts. Following the 2016 election, the Internet Archive and its community of concerned archivists worked to capture 100TB of information from government websites and databases out of concern it might vanish. It's a job with no end in sight."

The event website is here.

David. said...

In the context of the "battle bus" image above, I can't resist linking to this image from the Financial Times.

David. said...

In 2017, climate change vanished from a ridiculous number of government websites by Rebecca Leber and Megan Jula starts:

"Moments after President Donald Trump took the oath of office last January, nearly all references to climate change disappeared from the White House official website. A page detailing former President Barack Obama’s plans to build a clean energy economy, address climate change, and protect the environment became a broken link (archived here)."

and goes on to provide a massive list of deletions courtesy of the Environmental and Data Governance Initiative.

David. said...

Speaking of "What does it mean to archive something that changes every time you look at it?" - preserve this! Hat tip to David Pescovitz at Boing Boing.

David. said...

Tim Bray discovers:

"Google has stopped in­dex­ing the old­er parts of the We­b. ... Google’s not go­ing to be sell­ing many ads next to search re­sults that turn them up. So from a busi­ness point of view, it’s hard to make a case for Google in­dex­ing ev­ery­thing, no mat­ter how old and how ob­scure. ... When I have a ques­tion I want an­swered, I’ll prob­a­bly still go to Google. When I want to find a spe­cif­ic Web page and I think I know some of the words it con­tain­s, I won’t any more, I’ll pick Bing or Duck­Duck­Go."

David. said...

"Thousands of government papers detailing some of the most controversial episodes in 20th-century British history have vanished after civil servants removed them from the country’s National Archives and then reported them as lost." reports Ian Cobain at The Guardian.

These are paper documents, but, like the Web edits, the common thread linking the disappearances is that they are embarrassing for the government:

"Documents concerning the Falklands war, Northern Ireland’s Troubles and the infamous Zinoviev letter – in which MI6 officers plotted to bring about the downfall of the first Labour government - are all said to have been misplaced."

David. said...

"A webpage that focused on breast cancer was reportedly scrubbed from the website of the Department of Health and Human Services's (HHS) Office on Women's Health (OWH).

The changes on WomensHealth.gov — which include the removal of material on insurance for low-income people — were detailed in a new report from the Sunlight Foundation's Web Integrity Project and reported by ThinkProgress." reports Rebecca Savransky at The Hill.

David. said...

Justin Littman's Vulnerabilities in the U.S. Digital Registry, Twitter, and the Internet Archive is a must-read. It starts:

"The U.S. Digital Registry is the authoritative list of the official U.S. government social media accounts. One of the stated purposes of the U.S. Digital Registry is to “support cyber-security by deterring fake accounts that spread misinformation”."

Justin goes on to demonstrate, by tweeting as @USEmbassyRiyadh, that the registry, Twitter and the Internet Archive can all be attacked to allow extremely convincing fake official US government tweets.

Following the @USEmbassyRiyadh link shows that this particular instance has since been fixed.

David. said...

Oh, What a Fragile Web We Weave: Third-party Service Dependencies In Modern Webservices and Implications by Aqsa Kashaf et al from Carnegie Mellon examines a different kind of Web vulnerability, the way the Web's winner-take-all economics drives out redundancy of underlying services:

"Our key findings are: (1) 73.14% of the top 100,000 popular services are vulnerable to reduction in availability due to potential attacks on third-party DNS, CDN, CA services that they exclusively rely on; (2) the use of third-party services is concentrated, so that if the top-10 providers of CDN, DNS and OCSP services go down, they can potentially impact 25%-46% of the top 100K most popular web services; (3) transitive dependencies significantly increase the set of webservices that exclusively depend on popular CDN and DNS service providers, in some cases by ten times (4) targeting even less popular webservices can potentially cause significant collateral damage, affecting upto 20% of the top 100K webservices due to their shared dependencies. Based on our findings, we present a number of key implications and guidelines to guard against such Internet scale incidents in the future."

David. said...

Jefferson Bailey's talk at the Ethics of Archiving the Web conference, Lets Put Our Money Where Our Ethics Are, documents the incredibly small amount of money devoted to the immense problem of collecting and preserving the vast proportion of our cultural heritage that lives on the Web.

It is the first 18.5 minutes of this video and well worth watching.

David. said...

"The Chinese government is rewriting history in its own distorted self-image. It wants to distance itself from its unseemly past, so it's retconning history through selectively-edited educational material and blatant censorship. Sure, the Chinese government has never been shy about its desire to shut up those that don't agree with it, but a recent "heroes and martyrs" law forbids disparaging long dead political and military figures.
...
The court also decided the use of a poem in a spoof ad also harmed society in general, further cementing the ridiculousness of this censorial law and the government enforcing it."

From Chinese 'Rage Comic' Site First Victim Of Government's History-Rewriting 'Heroes And Martyrs' Law by Tim Cushing.

David. said...

Javier C. Hernández' Why China Silenced a Clickbait Queen in Its Battle for Information Control is a great example of why Web archives outside the jurisdiction of the site being archived are important:

"Since December, the authorities have closed more than 140,000 blogs and deleted more than 500,000 articles, according to the state-run news media, saying that they contained false information, distortions and obscenities."

The Epoch Times collected and republished the article.

David. said...

An Important Impeachment Memo Has Vanished From a Senate Web Server by Benjamin Wofford reports that:

"A public memo that outlined the Senate’s obligations during an impeachment trial is no longer accessible to the public on a Senate web server.

The memo, titled, “An Overview of the Impeachment Process,” dates to 2005, when it was produced by a legislative attorney working for the Congressional Research Service. It provides a legal interpretation of several impeachment-related powers and duties of the Senate, as well as the House of Representatives, based on a detailed review of the Congressional record dating back to 1868.
...
The original document can still be located, though, because it is preserved in a digital archive at the University of North Texas. (It’s also archived on the Internet Archive’s “Wayback Machine,” where it was last captured on September 30.)"

David. said...

Cory Doctorow reports that, as predicted, Three years after the W3C approved a DRM standard, it's no longer possible to make a functional indie browser. Browsers are now proprietary; the Web no longer runs on open standards.