I was invited to deliver a keynote at the
2017 Pacific Neighborhood Consortium in Tainan, Taiwan. My talk, entitled
The Amnesiac Civilization, was based on the
series of posts earlier this year with the same title. The theme was "Data Informed Society", and my abstract was:
What is the data that informs a society? It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.
For
nearly a thousand years this knowledge base has been stored on paper, an affordable, durable, write-once and somewhat tamper-evident medium. For more than five hundred years it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. LOCKSS is the name of the program at the Stanford Libraries that Vicky Reich and I started in 1998. We took a distributed approach, providing libraries with tools they could use to preserve knowledge in the Web world. They could work the way they were used to doing in the paper world, by collecting copies of published works, making them available to readers, and cooperating via inter-library loan. Two years earlier, Brewster Kahle had founded the Internet Archive, taking a centralized approach to the same problem.
Why
are these programs needed? What have we learned in the last two decades about their effectiveness? How does the evolution of Web technologies place their future at risk?
Below the fold, the text of my talk.
Introduction
I'm honored to join the ranks of your keynote speakers, and grateful for the opportunity to visit beautiful Taiwan. You don't need to take notes, or photograph the slides, or even struggle to understand my English, because the whole text of my talk, with links to the sources and much additional material in footnotes, has been posted to my blog.
What is the data that informs a society? It is easy to think that it is
just numbers, timely statistical information of the kind that drives
Google Maps real-time traffic display. But the rise of text-mining and
machine learning means that we must cast our net much wider. Historic
and textual data is equally important. It forms the knowledge base on
which civilization operates.
Ever since 105 AD, when
Cai Lun (蔡伦) invented the process for making
paper, civilizations have used it to record their history and its context in everyday life. Archives and libraries collected and preserved originals. Scribes labored to create copies, spreading the knowledge they contained.
Bi Sheng's (毕昇) invention of movable type in the 1040s AD greatly increased the spread of copies and thus knowledge, as did Choe Yun-ui's (최윤의)
1234 invention of bronze movable type in Korea,
Johannes Gutenberg's 1439 development of the metal type printing press in Germany, and Hua Sui's (华燧)
1490 introduction of bronze type in China.
[1]
Thus for about two millennia civilizations have been able to store their knowledge base on this affordable, durable, write-once, and somewhat tamper-evident medium. For more than half a millennium it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. But for about two decades the knowledge base has been migrating off paper and on to the Web.
Lots Of Copies Keep Stuff Safe is the name of the program at the Stanford Libraries that Vicky Reich and I started 19 years ago last month. We took a distributed approach to preserving knowledge, providing libraries with tools they could use to continue, in the Web world, their paper-world role of collecting copies of published works and making them available to readers.
Two years earlier,
Brewster Kahle had founded the
Internet Archive, taking a centralized approach to the same problem.
My talk will address three main questions:
- Why are these programs needed?
- What have we learned in the last two
decades about their effectiveness?
- How does the evolution of Web
technology place their future at risk?
Why archive the Web?
Paper is a durable medium, but the Web is not. From its earliest days users have experienced "
link rot", links to pages that once existed but have vanished. Even in 1997 they saw it as a
major problem:
6% of the links on the Web are broken according to a recent survey by Terry Sullivan's All Things Web. Even worse, linkrot in May 1998 was double that found by a similar survey in August 1997.
Linkrot definitely reduces the usability of the Web, being cited as one of the biggest problems in using the Web by 60% of the users in the October 1997 GVU survey. This percentage was up from "only" 50% in the April 1997 survey.[4]
Research at scale in 2001 by
Lawrence et al validated this concern. They:
analyzed 270,977 computer science journal papers, conference papers, and technical reports ... From the 100,826 articles cited by another article in the database
(thus providing us with the year of publication), we extracted 67,577 URLs. ... Figure 1b dramatically illustrates the lack of persistence of Internet resources. The percentage of invalid links in the articles we examined varied from 23 percent in 1999 to a peak of 53 percent in 1994.
The problem is worse than this. Martin Klein and co-authors point out that
Web pages suffer two forms of decay or reference rot:
- Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
- Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
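To make the distinction concrete, here is a minimal sketch, in Python, of how one might classify a cited URL; the helper and the stored hash are illustrative assumptions, not part of Klein et al's methodology. A request that fails or returns an error status signals link rot, while a changed content hash is a crude proxy for content drift.

```python
import hashlib
import requests

def check_reference(url, original_sha256=None, timeout=30):
    """Classify a referenced URL as OK, link rot, or possible content drift.

    original_sha256 is assumed to be a hash recorded when the resource was
    first cited. Hash comparison is only a crude proxy for drift, since pages
    change for many benign reasons (ads, timestamps, markup tweaks).
    """
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return "link rot"            # DNS failure, timeout, connection refused ...
    if response.status_code >= 400:
        return "link rot"            # 404, 410, 403, 500 ...
    if original_sha256 is None:
        return "ok (no baseline to detect drift)"
    current = hashlib.sha256(response.content).hexdigest()
    return "ok" if current == original_sha256 else "possible content drift"

if __name__ == "__main__":
    # Hypothetical example: a URL cited in a 2012 article, with the hash recorded then.
    print(check_reference("https://example.com/cited-resource",
                          original_sha256="0" * 64))
```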
They
examined scholarly literature on the Web and found:
one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
The problem gets worse through time:
even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015. This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC.
Thus, as the arXiv graph shows, they find that, after a few years, it is very unlikely that a reader clicking on a web-at-large link in an article will see what the author intended.
This isn't just a problem for scholarly literature, it is even worse on the general Web. The British Library's Andy Jackson
analyzed the UK Web Archive and:
was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%.[5]
It isn't just that Web servers can go away or their contents be rewritten. Access to Web pages is mediated by the Domain Name Service (DNS), and they can become inaccessible because the domain owner fails to pay the registrar, or their DNS service, or for
political reasons:
In March 2010 every webpage with the domain address ending in .yu disappeared from the internet – the largest ever to be removed. This meant that the internet history of the former Yugoslavia was no longer available online. Dr Anat Ben-David, from the Open University in Israel, has managed to rebuild about half of the lost pages – pages that document the Kosovo Wars, which have been called "the first internet war".[8]
The terms "link rot" and "content drift" suggest randomness but in many cases they hide deliberate suppression or falsification of information. More than a decade ago, in
only my 6th blog post, I wrote:
Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception.
Examples of Winston Smith's work are everywhere,
such as:
George W. Bush’s “Mission Accomplished” press release. The first press release read that ‘combat operations in Iraq have ceased.’ After a couple weeks, that was changed to ‘major combat operations have ceased.’ And then, the whole press release disappeared off the White House’s website completely.[7]
Britain has its Winston Smiths too:
One of the most enduring symbols of 2016's UK Brexit referendum was the huge red "battle bus" with its message, "We send the EU £350 million a week, let's fund our NHS instead. Vote Leave." ... Independent fact-checkers declared the £350 million figure to be a lie. Within hours of the Brexit vote, the Leave campaign scrubbed its website of all its promises, and Nigel Farage admitted that the £350 million was an imaginary figure and that the NHS would not see an extra penny after Brexit.[6]
Data on the Web is equally at risk. Under the Harper administration, Canadian librarians fought a
long, lonely struggle with their Winston Smiths:
Protecting Canadians’ access to data is why Sam-Chin Li, a government information librarian at the University of Toronto, worked late into the night with colleagues in February 2013, frantically trying to archive the federal Aboriginal Canada portal before it disappeared on Feb. 12. The decision to kill the site, which had thousands of links to resources for Aboriginal people, had been announced quietly weeks before; the librarians had only days to train with web-harvesting software.[11]
A year ago, a similar but much larger emergency data rescue effort
swung into action in the US:
Between Fall 2016 and Spring 2017, the Internet Archive archived over
200 terabytes of government websites and data. This includes over 100TB
of public websites and over 100TB of public data from federal FTP file
servers totaling, together, over 350 million URLs/files.
Partly this was the collaborative "End of Term" (EOT) crawl that is organized at each change of Presidential term, but this time there was
added urgency:
Through the EOT project’s public nomination form and through our collaboration with the DataRefuge, Environmental Data and Governance Initiative (EDGI), and other efforts, over 100,000 webpages or government datasets were nominated by citizens and preservationists for archiving.
Note the emphasis on datasets. It is important to keep scientific data, especially observations that are not repeatable, for the long term. A recent example is
Korean astronomers' records of a nova in 1437, which provide strong evidence that:
"cataclysmic
binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.[12]
580 years is peanuts. An example more than 5 times older is from China. In the
Shang dynasty,
astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this
they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
Today, those eclipse records would be on the Web, not paper or bone. Will astronomers 3200 or even 580 years from now be able to use them?
What have we learned about archiving the Web?
I hope I've convinced you that a society whose knowledge base is on the Web is doomed to forget its past unless something is done to preserve it
[18]. Preserving Web content happens in three stages:
- Collection
- Preservation
- Access
What have we learned about collection?
In the wake of NASA's
March 2013 takedown of their Technical Report Server James Jacobs, Stanford's Government Documents librarian,
stressed the importance of collecting Web content:
pointing to web sites is much less valuable and much more fragile than acquiring copies of digital information and building digital collections that you control. The OAIS reference model for long term preservation makes this a requirement ... “Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.” Pointing to a web page or PDF at nasa.gov is not obtaining any control.
Memory institutions need to make their own copies of Web content. Who is doing this?
The Internet Archive is by far the largest and most used Web archive, having been trying to collect the whole of the Web for more than two decades. Its "crawlers" start from a large set of "seed" web pages and follow links from them to other pages, then follow those links, according to a set of "crawl rules". Well-linked-to pages will be well represented; they may be important or they may be "link farms". Two years ago
Kalev Leetaru wrote:
of the top 15 websites with the most snapshots taken by the Archive thus far this year, one is an alleged former movie pirating site, one is a Hawaiian hotel, two are pornography sites and five are online shopping sites. The second-most snapshotted homepage is of a Russian autoparts website and the eighth-most-snapshotted site is a parts supplier for trampolines.
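That crawl process is conceptually simple. Here is a minimal sketch of a seed-and-follow crawler, assuming an illustrative seed list and crawl rules; it is nothing like the Archive's production Heritrix crawler, which among much else honors robots.txt, enforces politeness limits, and writes WARC files.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

SEEDS = ["https://example.com/"]          # illustrative seed list
ALLOWED_SCHEMES = {"http", "https"}       # a very simple "crawl rule"
MAX_PAGES = 100                           # a budget, standing in for real crawl scoping
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seeds):
    """Breadth-first crawl: fetch a page, keep its content, queue its outlinks."""
    queue, seen, archive = deque(seeds), set(seeds), {}
    while queue and len(archive) < MAX_PAGES:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=30)
        except requests.RequestException:
            continue                                  # broken links are simply skipped
        archive[url] = response.content               # a real crawler writes WARC records
        for href in LINK_RE.findall(response.text):
            link = urljoin(url, href)
            if urlparse(link).scheme in ALLOWED_SCHEMES and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive

if __name__ == "__main__":
    pages = crawl(SEEDS)
    print(f"archived {len(pages)} pages")
```

Because the queue fills with whatever the seeds link to, heavily linked-to pages get collected early and often, which is exactly why the most-snapshotted sites can look so arbitrary.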
The Internet Archive's highly automated collection process may collect a lot of unimportant stuff, but it is the best we have at collecting the "Web at large"
[25]. The Archive's recycled church in San Francisco and its second site nearby together sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. Each stores over 3×10^11 Web pages, among
much other content. The Archive has been for many years in the
top 300 Web sites in the world. For comparison, the Library of Congress typically ranks between 4000 and 6000.
Network effects mean that technology markets in general and the Web in particular are
winner-take-all markets. Just like Google in search, the Internet Archive is the winner in its market. Other institutions can't compete in archiving the whole Web, they must focus on curated collections.
The British Library, like other national libraries, has been collecting its "national Web presence"
for more than a decade. One problem is defining "national Web
presence". Clearly, it is more than the .uk domain, but how much more?
The
Portuguese Web Archive defines it as the .pt domain plus content
embedded in or redirected from the .pt domain. That wouldn't work for
many countries, where important content is in top-level domains such as
.com.
Dr. Regan Murphy Kao of Stanford's East Asian Library
described their approach to Web collecting:
we sought to archive a limited number of blogs of ground-breaking, influential figures – people whose writings were widely read and represented a new way of approaching a topic. One of the people we chose was Mao Kobayashi. ... Mao broke with tradition and openly described her experience with cancer in a blog that gripped Japan. She harnessed this new medium to define her life rather than allow cancer to define it.[3]
Curated collections have a problem. What made the Web
transformational was the links (see Google's
PageRank). Viewed in isolation, curated collections break the links and subtract value. But, viewed as an adjunct to broad Web
archives they can add value in two ways:
- By providing quality assurance, using greater per-site resources to ensure that important Web resources are fully collected.
- By providing researchers better access to preserved important Web resources than the Internet Archive can. For example, better text search or data mining. The British Library has been a leader in this area.
Nearly one-third of a trillion Web pages at the Internet Archive is impressive, but in 2014 I reviewed the research into
how much of the Web was then being collected[13] and concluded:
Somewhat less than half ... Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.
Costa
et al ran
surveys in 2010 and 2014 and concluded in 2016:
during the last years there was a significant growth in initiatives and countries hosting these initiatives, volume of data and number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort on preserving digital information, other results presented throughout the paper raise concerns such as the small amount of archived data in comparison with the amount of data that is being published online.
I
revisited this topic earlier this year and concluded that we were losing ground rapidly. Why is this? The reason is that collecting the Web is expensive, whether it uses human curators, or large-scale technology, and that
Web archives are pathetically under-funded:
The Internet Archive's budget
is in the region of $15M/yr, about half of which goes to Web archiving.
The budgets of all the other public Web archives might add another
$20M/yr. The total worldwide spend on archiving Web content is probably
less than $30M/yr, for content that cost hundreds of billions to create[19].
My rule of thumb has been that collection takes about half the lifetime cost of digital preservation, preservation about a third, and access about a sixth. So the world may spend only about $15M/yr collecting the Web.
As an Englishman, I should observe in this forum that, like the Web itself, archiving is biased towards English. For example, pages in Korean are less than half as likely to be collected as pages in English
[20].
What have we learned about preservation?
Since Jeff Rothenberg's seminal 1995
Ensuring the Longevity of Digital Documents it has been commonly assumed that digital preservation revolved around the problem of formats becoming obsolete, and thus their content becoming inaccessible. But in the Web world
formats go obsolete very slowly if at all. The real problem is simply
storing enough bits reliably enough for the long term.
[9]
[Figure: Kryder's Law]
Or rather, since we actually know how to store bits reliably, it is finding enough money to store enough bits reliably enough for the long term. This used to be a problem we could ignore. Hard disk, a 60-year-old technology, is the dominant medium for bulk data storage. It had a remarkable run of more than 30 years of 30-40%/year price declines: Kryder's Law, the disk analog of Moore's Law. The rapid cost decrease meant that if you could afford to store data for a few years, you could afford to store it "forever".
[Figure: Kryder's Law breakdown]
Just as Moore's Law slowed dramatically as the technology approached the physical limits, so did Kryder's Law. The slowing started in 2010, and was followed by the 2011 floods in Thailand, causing disk prices to double and not recover for 3 years. In 2014 we predicted Kryder rates going forward between 10% and 20%, the red lines on the graph,
meaning that:
If the industry projections pan out ... by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
So far, our prediction has proved correct, which is bad news. The graph shows the endowment, the money which, deposited with the data and invested at interest, will cover the cost of storage "forever". It increases strongly as the Kryder rate falls below 20%, which it has.
Absent unexpected technological change, the cost of long-term data storage is far higher than most people realize.
[10]
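To see why the endowment is so sensitive to the Kryder rate, here is a minimal back-of-the-envelope sketch; the interest rate, first-year cost and time horizon are illustrative assumptions, not the parameters of our full economic model.

```python
def endowment(first_year_cost, kryder_rate, real_interest_rate, years=100):
    """Money needed up front to pay for storage over `years` years.

    Each year's cost is the previous year's cost reduced by the Kryder rate,
    discounted back to the present at the real interest rate.
    """
    total = 0.0
    for t in range(years):
        yearly_cost = first_year_cost * (1 - kryder_rate) ** t
        total += yearly_cost / (1 + real_interest_rate) ** t
    return total

if __name__ == "__main__":
    # Illustrative: $1,000 to store a collection this year, 5% real return.
    for k in (0.40, 0.20, 0.10):
        print(f"Kryder rate {k:.0%}: endowment ${endowment(1000, k, 0.05):,.0f}")
    # Roughly: 40%/yr -> ~$2,300; 20%/yr -> ~$4,200; 10%/yr -> ~$7,000.
    # Halving the Kryder rate roughly doubles the endowment.
```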
What have we learned about access?
There is clearly a demand for access to the Web's history. The Internet Archive's Wayback Machine provides well over 1M users/day with access to it.
[Figure: 1995 Web page]
On Jan 11th, 1995, the late
Mark Weiser, CTO of Xerox PARC, created
Nijinksy and Pavlova's Web page, perhaps the start of the Internet's obsession with cat pictures. You can view this important historical artifact by pointing your browser to the Internet Archive's Wayback Machine, which captured the page 39 times between
Dec 1st 1998 and
May 11th 2008. What you see using your modern browser is perfectly usable, but it is slightly different from what Mark saw when he finished the page over 22 years ago.
Ilya Kreymer used two important recent developments in digital preservation
to build
oldweb.today, a Web site that allows you to view preserved Web content using the browser that its author would have. The first is that
emulation & virtualization techniques have advanced to allow Ilya to create on-the-fly behind the Web page a virtual machine running, in this case, a 1998 version of Linux with a 1998 version of the Mosaic browser visiting the page. Note the different fonts and background. This is very close to what Mark would have seen in 1995.
The Internet Archive is by far the biggest Web archive, for example holding around 40 times as much data as the UK Web Archive. But the smaller archives contain pages it lacks, and there is
little overlap between them, showing the value of curation.
The second development is Memento (
RFC7089), a Web protocol that allows access facilities such as
oldweb.today to treat the set of compliant Web archives as if it were one big archive. oldweb.today aggregates many Web archives, pulling each Web resource a page needs from the archive with the copy closest in time to the requested date.
[14]
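For those who want to experiment, here is a minimal sketch of Memento datetime negotiation against a TimeGate. I use the Wayback Machine's TimeGate as the example; the exact endpoint behavior is my understanding of the public service and may have changed, so treat it as illustrative.

```python
import requests

# A TimeGate accepts an Accept-Datetime header and redirects to the memento
# (archived copy) closest in time to the requested date. The Wayback Machine
# exposes a TimeGate at web.archive.org/web/<original URL>; Memento
# aggregators offer a similar interface spanning many archives at once.
TIMEGATE = "https://web.archive.org/web/"
ORIGINAL = "http://example.com/"            # illustrative original URL

response = requests.get(
    TIMEGATE + ORIGINAL,
    headers={"Accept-Datetime": "Thu, 01 Jan 2004 00:00:00 GMT"},
    allow_redirects=False,                  # show the negotiation rather than follow it
)

print(response.status_code)                 # typically a redirect status
print(response.headers.get("Location"))     # URL of the selected memento
print(response.headers.get("Link"))         # links to the TimeMap, first/last mementos
```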
The future of Web preservation
There are two main threats to the future of Web preservation, one economic and the other a combination of technological and legal.
The economic threat
Preserving the Web and other digital content for posterity is primarily an
economic problem. With an
unlimited budget, collection and preservation aren't a problem. The reason we're collecting and preserving
less than half the classic Web of quasi-static linked documents is that
no-one has the money to do much better. The other half is more difficult and thus more expensive. Collecting and preserving the whole of the classic Web would need the current global Web archiving budget to be roughly tripled, perhaps an additional $50M/yr.
Then there are the much higher costs involved in preserving the dynamic "Web 2.0", much more than half of which we currently miss.
[Figure: British Library real income]
If we are to continue to preserve even as much of society's memory as we currently do, we face two very difficult choices: either find a lot more money, or radically reduce the cost per site of preservation.
It will be hard to find a lot more money in a world where libraries and archive budgets are
decreasing. For example, the graph shows that the British Library's income has declined by 45% in real terms over the last decade.
[2]
The Internet Archive is already big enough to reap economies of scale, and it already uses
innovative engineering to minimize cost. But Leetaru and others criticize it for:
- Inadequate metadata to support access and research.
- Lack of quality assurance leading to incomplete collection of sites.
- Failure to collect every version of a site.
Generating good metadata and doing good QA are hard to automate and thus the first two are expensive. But the third is simply impossible.
The technological/legal threat
The economic problems of archiving the Web stem from its two business models, advertising and subscription:
- To maximize the effectiveness of advertising, the Web now potentially delivers different content on every visit to a site. What does it mean to "archive" something that changes every time you look at it?
- To maximize the effectiveness of subscriptions, the Web now potentially prevents copying content and thus archiving it. How can you "archive" something you can't copy?
Personalization, geolocation and adaptation to browsers and devices mean that each of the about
3.4×10^9 Internet users may see different content from each of about 200 countries they may be in, and from each of, say, 100 device and browser combinations they may use. Storing
every possible version of a single average Web page could thus require downloading about 160 exabytes, 8000 times as much Web data as the Internet Archive holds.
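The arithmetic behind that estimate is straightforward, as this sketch shows; the average page weight of roughly 2.4MB is my assumption, in line with page-weight surveys of the time and chosen to reproduce the figure above.

```python
users = 3.4e9            # Internet users
countries = 200          # geolocations a page may adapt to
devices = 100            # device and browser combinations
page_bytes = 2.4e6       # assumed average page weight, ~2.4MB

versions = users * countries * devices      # potential distinct versions of one page
total_bytes = versions * page_bytes

exabyte = 1e18
print(f"{total_bytes / exabyte:.0f} EB")    # roughly 160 EB

# For comparison: ~160 EB / 8000 is about 20 PB, the order of magnitude of
# the Web data the Internet Archive held at the time.
```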
The situation is even worse.
Ads are inserted by a real-time auction system, so even if the page content is the same on every visit, the ads differ. Future scholars, like current scholars studying
Russian use of
social media in the 2016 US election, will want to study the ads but they won't have been systematically collected, unlike
political ads on TV.
[21]
The point here is that, no matter how much resource is available,
knowing that an archive has collected all, or even a representative sample, of the versions of a Web page is completely impractical. This isn't to say that trying to do a better job of collecting
some versions of a page is pointless, but it is never going to provide future researchers with the certainty they crave. And doing a better job of each page will be expensive.
Although it is possible to collect
some versions of today's dynamic Web pages, it is likely soon to become impossible to collect
any version of most Web pages. Against unprecedented opposition, Netflix and other large content owners with subscription or pay-per-view business models have forced W3C, the standards body for the Web, to
mandate that browsers support Encrypted Media Extensions (EME), or Digital Rights Management (DRM) for the Web.
[15]
The W3C's diagram of the EME stack shows an example of how it works. An application, i.e. a Web page, requests the browser to render some encrypted content. It is delivered, in this case from a Content Distribution Network (CDN), to the browser. The browser needs a license to decrypt it, which it obtains from the application via the EME API by creating an appropriate session then using it to request the license. It hands the content and the license to a Content Decryption Module (CDM), which can decrypt the content using a key in the license and render it.
What is DRM trying to achieve? Ostensibly, it is trying to ensure that each time DRM-ed content is rendered, specific permission is obtained from the content owner. In order to ensure that, the CDM cannot trust the browser it is running in. For example, it must be sure that the browser can see neither the decrypted content nor the key. If it could see, and save for future use, either of them, it would defeat the purpose of DRM. The license server will not be available to the archive's future users, so preserving the encrypted content without the license is pointless.
Content owners are not stupid. They realized early on that the search
for uncrackable DRM was a fool's errand
[22]. So, to deter reverse
engineering, they arranged for the 1998 Digital Millennium Copyright Act (
DMCA) to make any attempt to circumvent protections on digital content a criminal offense. US trade negotiations mean that almost all countries (except Israel) have DMCA-like laws.
[23]
Thus we see that the real goal of EME is to ensure that, absent special legislation such as a few national libraries have, anyone trying to either capture the decrypted content, or preserve the license for future use, would be committing a crime. Even though the British Library, among others, has the legal right to capture the decrypted content, it is doubtful that they have the technical or financial resources to do so at scale.
Scale is what libraries and archives would need. Clearly, EME will be rapidly adopted for streaming video sites, not just Netflix but YouTube, Vimeo and so on. Even a decade ago, to
study US elections you needed YouTube video, but it will no longer be possible to preserve Web video.
But
that's not the big impact that EME will have on society's memory. It is intended for video and audio, but it will be hard for W3C to argue that other valuable content doesn't deserve the same protection, for example academic journals. DRM will spread to other forms of content. The
business models for Web content are of two kinds, and both are
struggling:
- Paywalled content. It turns out that, apart from movies and academic publishing, only a very few premium brands such as The Economist, the Wall Street Journal and the New York Times have viable subscription business models based on (mostly) paywalled content. Even excellent journalism such as The Guardian
is reduced to free access, advertising and voluntary donations. Part of
the reason is that Googling the headline of paywalled news stories
often finds open access versions of the content. Clearly, newspapers and
academic publishers would love to use Web DRM to ensure that their
content could be accessed only from their site, not via Google or Sci-Hub.
- Advertising-supported content. The market for Web advertising is so competitive and fraud-ridden that Web sites have been forced to let advertisers run ads so obnoxious, and indeed so riddled with malware, and to load up their sites with so many trackers, that many users have rebelled.[24] They use ad-blockers; these days it is pretty much essential to do so to keep yourself safe and to reduce bandwidth consumption.
Not to mention that sites such as Showtime are so desperate for income that their ads mine cryptocurrency in your browser. Sites are very worried about the loss of income from blocked ads. Some,
such as Forbes, refuse to supply content to browsers that block ads
(which, in Forbes case, turned out to be a public service; the ads carried malware).
DRM-ing a site's content will prevent ads being blocked. Thus ad space
on DRM-ed sites will be more profitable, and sell for higher prices,
than space on sites where ads can be blocked. The pressure on
advertising-supported sites, which include both free and subscription
news sites, to DRM their content will be intense.
Thus the advertising-supported bulk of what we think of as the Web, and
the paywalled resources such as news sites that future scholars will
need, will become un-archivable.
Summary
I wish I could end this talk on an optimistic note, but I can't. The information the future will need about the world of today is on the Web. Our ability to collect and preserve it has been both inadequate and decreasing. This is primarily due to massive under-funding: a few tens of millions of dollars per year worldwide, set against the trillions of dollars per year of revenue the Web generates. There is no realistic prospect of massively increasing the funding for Web archiving. The funding for the world's memory institutions, whose job it is to remember the past, has been under sustained attack for many years.
The largest component of the cost of Web archiving is the initial collection. The evolution of the Web from a set of static, hyper-linked documents to a JavaScript programming environment has been steadily raising the difficulty and thus cost of collecting the typical Web page. The increasingly dynamic nature of the resulting Web content means that each individual visit is less and less representative of "the page". What does it mean to "preserve" something that is different every time you look at it?
And now, with the advent of Web DRM, our likely future is one in which it is not simply increasingly difficult, expensive and less useful to collect Web pages, but actually illegal to do so.
Call To Action
So I will end with a call to action. Please:
- Use the Wayback Machine's Save Page Now facility to preserve pages you think are important (see the sketch after this list).
- Support the work of the Internet Archive by donating money and materials.
- Make sure your national library is preserving your nation's Web presence.
- Push back against any attempt by W3C to extend Web DRM.
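As a footnote to the first item, Save Page Now can also be driven programmatically by fetching the Wayback Machine's save URL. The sketch below reflects my understanding of the public endpoint, which may change; the header handling is a hedged assumption rather than a documented contract.

```python
import requests

def save_page_now(url):
    """Ask the Wayback Machine to capture `url` via the public Save Page Now URL."""
    response = requests.get("https://web.archive.org/save/" + url, timeout=120)
    response.raise_for_status()
    # On success the capture appears under /web/<timestamp>/<url>; the
    # Content-Location header, when present, points at the new snapshot.
    return response.headers.get("Content-Location", response.url)

if __name__ == "__main__":
    print(save_page_now("https://example.com/"))
```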
Footnotes
Acknowledgements
I'm grateful to Herbert van de Sompel and Michael Nelson for constructive comments on drafts, and to Cliff Lynch, Michael Buckland and the participants in
UC Berkeley's Information Access Seminar, who provided useful feedback on a rehearsal of this talk. That isn't to say they agree with it.
The Internet Archive's Dodging the Memory Hole 2017 event covered many of the themes from my talk:
ReplyDelete"Kahle and others made it clear that today's political climate has added a sense of urgency to digital preservation efforts. Following the 2016 election, the Internet Archive and its community of concerned archivists worked to capture 100TB of information from government websites and databases out of concern it might vanish. It's a job with no end in sight."
The event website is here.
In the context of the "battle bus" image above, I can't resist linking to this image from the Financial Times.
In 2017, climate change vanished from a ridiculous number of government websites by Rebecca Leber and Megan Jula starts:
ReplyDelete"Moments after President Donald Trump took the oath of office last January, nearly all references to climate change disappeared from the White House official website. A page detailing former President Barack Obama’s plans to build a clean energy economy, address climate change, and protect the environment became a broken link (archived here)."
and goes on to provide a massive list of deletions courtesy of the Environmental and Data Governance Initiative.
Speaking of "What does it mean to archive something that changes every time you look at it?" - preserve this!. Hat tip to David Pescovitz at Boing Boing.
ReplyDeleteTim Bray discovers:
ReplyDelete"Google has stopped indexing the older parts of the Web. ... Google’s not going to be selling many ads next to search results that turn them up. So from a business point of view, it’s hard to make a case for Google indexing everything, no matter how old and how obscure. ... When I have a question I want answered, I’ll probably still go to Google. When I want to find a specific Web page and I think I know some of the words it contains, I won’t any more, I’ll pick Bing or DuckDuckGo."
"Thousands of government papers detailing some of the most controversial episodes in 20th-century British history have vanished after civil servants removed them from the country’s National Archives and then reported them as lost." reports Ian Cobain at The Guardian.
These are paper documents but, like the Web edits, the common thread linking the disappearances is that they are embarrassing for the government:
"Documents concerning the Falklands war, Northern Ireland’s Troubles and the infamous Zinoviev letter – in which MI6 officers plotted to bring about the downfall of the first Labour government - are all said to have been misplaced."
"A webpage that focused on breast cancer was reportedly scrubbed from the website of the Department of Health and Human Services's (HHS) Office on Women's Health (OWH).
The changes on WomensHealth.gov — which include the removal of material on insurance for low-income people — were detailed in a new report from the Sunlight Foundation's Web Integrity Project and reported by ThinkProgress." reports Rebecca Savransky at The Hill.
Justin Littman's Vulnerabilities in the U.S. Digital Registry, Twitter, and the Internet Archive is a must-read. It starts:
ReplyDelete"The U.S. Digital Registry is the authoritative list of the official U.S. government social media accounts. One of the stated purposes of the U.S. Digital Registry is to “support cyber-security by deterring fake accounts that spread misinformation”."
Justin goes on to demonstrate, by tweeting as @USEmbassyRiyadh, that the registry, Twitter and the Internet Archive can all be attacked to allow extremely convincing fake official US government tweets.
Following the @USEmbassyRiyadh link shows that this particular instance has since been fixed.
Oh, What a Fragile Web We Weave: Third-party Service Dependencies In Modern Webservices and Implications by Aqsa Kashaf et al from Carnegie Mellon examines a different kind of Web vulnerability, the way the Web's winner-take-all economics drives out redundancy of underlying services:
ReplyDelete"Our key findings are: (1) 73.14% of the top 100,000 popular services are vulnerable to reduction in availability due to potential attacks on third-party DNS, CDN, CA services that they exclusively rely on; (2) the use of third-party services is concentrated, so that if the top-10 providers of CDN, DNS and OCSP services go down, they can potentially impact 25%-46% of the top 100K most popular web services; (3) transitive dependencies significantly increase the set of webservices that exclusively depend on popular CDN and DNS service providers, in some cases by ten times (4) targeting even less popular webservices can potentially cause significant collateral damage, affecting upto 20% of the top 100K webservices due to their shared dependencies. Based on our findings, we present a number of key implications and guidelines to guard against such Internet scale incidents in the future."
Jefferson Bailey's talk at the Ethics of Archiving the Web conference, Lets Put Our Money Where Our Ethics Are, documents the incredibly small amount of money devoted to the immense problem of collecting and preserving the vast proportion of our cultural heritage that lives on the Web.
It is the first 18.5 minutes of this video and well worth watching.
"The Chinese government is rewriting history in its own distorted self-image. It wants to distance itself from its unseemly past, so it's retconning history through selectively-edited educational material and blatant censorship. Sure, the Chinese government has never been shy about its desire to shut up those that don't agree with it, but a recent "heroes and martyrs" law forbids disparaging long dead political and military figures.
...
The court also decided the use of a poem in a spoof ad also harmed society in general, further cementing the ridiculousness of this censorial law and the government enforcing it."
From Chinese 'Rage Comic' Site First Victim Of Government's History-Rewriting 'Heroes And Martyrs' Law by Tim Cushing.
Javier C. Hernández' Why China Silenced a Clickbait Queen in Its Battle for Information Control is a great example of why Web archives outside the jurisdiction of the site being archived are important:
ReplyDelete"Since December, the authorities have closed more than 140,000 blogs and deleted more than 500,000 articles, according to the state-run news media, saying that they contained false information, distortions and obscenities."
The Epoch Times collected and republished the article.
An Important Impeachment Memo Has Vanished From a Senate Web Server by Benjamin Wofford reports that:
ReplyDelete"A public memo that outlined the Senate’s obligations during an impeachment trial is no longer accessible to the public on a Senate web server.
The memo, titled, “An Overview of the Impeachment Process,” dates to 2005, when it was produced by a legislative attorney working for the Congressional Research Service. It provides a legal interpretation of several impeachment-related powers and duties of the Senate, as well as the House of Representatives, based on a detailed review of the Congressional record dating back to 1868.
...
The original document can still be located, though, because it is preserved in a digital archive at the University of North Texas. (It’s also archived on the Internet Archive’s “Wayback Machine,” where it was last captured on September 30.)"
Cory Doctorow reports that, as predicted, Three years after the W3C approved a DRM standard, it's no longer possible to make a functional indie browser. Browsers are now proprietary; the Web no longer runs on open standards.