Tuesday, December 4, 2018

Selective Amnesia

Last year's series of posts and PNC keynote, entitled The Amnesiac Civilization, were about the threats to our cultural heritage from inadequate funding of Web archives, and the important content that, as a result, is never preserved. But content that Web archives do collect and preserve is also under a threat that can be described as selective amnesia. David Bixenspan's When the Internet Archive Forgets makes the important, but often overlooked, point that the Internet Archive isn't an elephant:
On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material?
...
Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.
Below the fold, some commentary on the vulnerability of Web history to censorship.

Bixenspan discusses, with examples, the two main techniques for censoring the Wayback Machine:
That archive sites are trusted to show the digital trail and origin of content is not just a must-use tool for journalists, but effective for just about anyone trying to track down vanishing web pages. With that in mind, that the Internet Archive doesn’t really fight takedown requests becomes a problem. That’s not the only recourse: When a site admin elects to block the Wayback crawler using a robots.txt file, the crawling doesn’t just stop. Instead, the Wayback Machine’s entire history of a given site is removed from public view.

Takedowns

The ability of anyone claiming to own the copyright on some content to issue a takedown notice under the DMCA in the US, and corresponding legislation elsewhere, is a problem for the Web in general, not just for archives. It is a problem that the copyright industries are constantly pushing to make worse. In almost no case is there a penalty for false claims of copyright ownership, which are frequently made by automated systems prone to false positives. The onus is on the recipient of the takedown to show that the claim is false. Given that in most cases copyright ownership is never registered, and that even when it is the registration may be fraudulent or out-of-date, this poses an impossible barrier to contesting claims:
if someone were to sue over non-compliance with a DMCA takedown request, even with a ready-made, valid defense in the Archive’s pocket, copyright litigation is still incredibly expensive. It doesn’t matter that the use is not really a violation by any metric. If a rightsholder makes the effort, you still have to defend the lawsuit.
The Internet Archive's policy with respect to takedowns was based on a 2001 meeting which resulted in the Oakland Archive Policy. It is being reviewed, at least partly because the Internet Archive's exposure to possible litigation is now so much greater.

The fundamental problem here is that, lacking both a registry of copyright ownership and any effective penalty for false claims of ownership, archives have to accept all but the most blatantly false claims, making it all too easy for their contents to be censored.

I haven't even mentioned the "right to be forgotten", the GDPR, the Australian effort to remove any judicial oversight from takedowns, or the EU's Article 13 effort to impose content filtering. All of these enable much greater abuse of takedown mechanisms.

Robots.txt

The Internet Archive's policy about robots.txt exclusions is evolving, as Mark Graham described last year in Robots.txt meant for search engines don’t work well for web archives:
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.
Robots.txt files affect two stages of Web archiving. As regards collection, archival crawlers should clearly respect entries specifically excluding them. But, as Graham points out, most robots.txt files are designed for Search Engine Optimization, and don't clearly express any preference about archiving.
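
To make the distinction concrete, here is a minimal sketch, in Python, of how an archival crawler might decide whether a robots.txt file expresses any preference about archiving at all. The user-agent token ia_archiver and the fall-through behaviour are illustrative assumptions, not a description of how the Internet Archive's crawlers actually work.

# A minimal sketch: honor robots.txt rules that address the archival crawler
# by name, and treat purely search-engine-oriented rules as expressing no
# preference about archiving. The token "ia_archiver" is an assumption.
import urllib.request
import urllib.robotparser

ARCHIVE_TOKEN = "ia_archiver"  # assumed archival-crawler user-agent token

def archive_may_fetch(site: str, path: str = "/") -> bool:
    robots_url = site.rstrip("/") + "/robots.txt"
    with urllib.request.urlopen(robots_url, timeout=10) as resp:
        raw = resp.read().decode("utf-8", "replace")

    # Does the file address the archival crawler by name, or only the generic
    # (search-engine oriented) user-agents?
    mentions_archive = any(
        line.lower().startswith("user-agent:") and ARCHIVE_TOKEN in line.lower()
        for line in raw.splitlines()
    )

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(raw.splitlines())

    if mentions_archive:
        # The site expressed an explicit preference about archiving: honor it.
        return parser.can_fetch(ARCHIVE_TOKEN, path)
    # Otherwise the rules were written for search engines and say nothing
    # about archiving; collecting anyway is a policy choice, not a given.
    return True

print(archive_may_fetch("https://example.com"))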

A particular concern regarding dissemination is the policy of automatically preventing access to past crawls of a website, made while its robots.txt permitted them, because its current robots.txt forbids them. The domain name may have changed hands, leaving the new owner with no ownership of the past content at all. This seems wrong: website owners wishing to exclude past, permitted crawls should have to request exclusion, and show ownership of the past content they are asking to have redacted.
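
To spell out the difference, here is a toy sketch in Python contrasting the current automatic-exclusion behaviour with the request-and-prove-ownership policy suggested above. The data model is hypothetical; it merely stands in for whatever a replay index actually records.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Snapshot:
    url: str
    crawled_at: datetime  # crawl happened while robots.txt permitted it

@dataclass
class ExclusionRequest:
    url: str
    ownership_verified: bool  # requester shown to own the *past* content

def visible_current_policy(snap: Snapshot, robots_forbids_now: bool) -> bool:
    # Current behaviour: a present-day robots.txt exclusion retroactively
    # hides every past snapshot of snap.url, however it was collected.
    return not robots_forbids_now

def visible_suggested_policy(snap: Snapshot, requests: list[ExclusionRequest]) -> bool:
    # Suggested behaviour: past, permitted crawls stay visible unless someone
    # who can demonstrate ownership of that past content asks for redaction.
    return not any(r.url == snap.url and r.ownership_verified for r in requests)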

LOCKSS

Clearly, one way to deal with the selective amnesia problem is LOCKSS, Lots Of Copies Keep Stuff Safe, especially when the copies are in diverse jurisdictions with different policies about takedowns and robots.txt. Nearly two decades ago, in the very first paper about the LOCKSS system, we wrote (my emphasis):
Librarians' technique for preserving access to material published on paper has been honed over the years since 415AD, when much of the world's literature was lost in the destruction of the Library of Alexandria. Their method may be summarized as:
Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other librarians need them.
But, alas, this doesn't work as well for Web archives:
  • A comprehensive Web archive is so large (the Internet Archive has around 20PB of Web history) that maintaining two copies in California is a major expense. The effort to establish a third copy in Canada awaits funding. Three isn't "lots".
  • The copyright industries have assiduously tried to align other countries' legislation with the DMCA, so the scope for different policies is limited.
  • The rise of OCLC's WorldCat, which aggregates library catalogs in the interests of informing readers of accessible copies, has made finding all of the paper copies in libraries much easier. Similarly, the advent of Memento (RFC 7089), which aggregates Web archive catalogs in the interests of informing browsers of accessible copies, has made finding all the copies in Web archives much easier (see the sketch below). In practice, Memento is much more of a double-edged sword than WorldCat, because the paper world has no effective equivalent of DMCA takedowns.
So, especially with the help of Memento, it is easy for malefactors to target all accessible copies of content in the small number of archives that will have them.
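
For readers unfamiliar with how the aggregation works: a Memento aggregator will, for any URL, return a TimeMap listing every snapshot it knows of across the participating archives. The sketch below, in Python, queries the public Time Travel aggregator; the endpoint URL and the one-link-per-line parsing are assumptions about that particular service and may not match its current behaviour.

import re
import urllib.request

# Assumed endpoint of the public Memento (RFC 7089) aggregator.
TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"

def list_mementos(url: str):
    """Return (snapshot_uri, datetime) pairs for every copy the aggregator knows of."""
    with urllib.request.urlopen(TIMEMAP + url, timeout=30) as resp:
        body = resp.read().decode("utf-8", "replace")
    mementos = []
    for line in body.splitlines():
        # A memento entry looks roughly like:
        # <https://web.archive.org/web/20150101000000/http://example.com/>;
        #   rel="memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",
        if not re.search(r'rel="[^"]*memento[^"]*"', line):
            continue
        uri = re.search(r"<([^>]+)>", line)
        dt = re.search(r'datetime="([^"]+)"', line)
        if uri and dt:
            mementos.append((uri.group(1), dt.group(1)))
    return mementos

for uri, dt in list_mementos("http://example.com/"):
    print(dt, uri)

The same single query that helps a reader find one accessible copy lets a determined censor enumerate them all.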

Libraries

Libraries have a special position under the DMCA, and the Internet Archive has always positioned itself as the Web equivalent of a public library in the paper world. This is, for example, the basis of its successful book-lending program. But:
"Under current copyright law, although there are special provisions that give certain rights to libraries, there is no definition of a library," explained Brandon Butler, the Director of Information Policy for the University of Virginia Library.
And in one case in point:
"The court didn’t really care that this place called itself a library; it didn’t really shield them from any infringement allegations."
So the Internet Archive and other Web archives may not in practice qualify as libraries in the legal sense.

It would be hard for a copyright owner to argue that a national library, such as the Library of Congress or the British Library, wasn't a library. National libraries typically have special rights supporting their copyright deposit programs, but these rights have to be extended via new legislation to cover Web archives, and these collections are typically inaccessible outside the national library's campus. In any case, many national libraries are under sustained budget pressure, and many of their programs would be difficult without cooperation from copyright holders, so they are in a weak negotiating position.