Thursday, August 25, 2016

Evanescent Web Archives

Below the fold, discussion of two articles from last week about archived Web content that vanished.

At Urban Milwaukee Michail Takach reports that Journal Sentinel Archive Disappears:
Google News Archive launched [in 2008] with ambitious plans to scan, archive and release the world’s newspapers in a single public access database. ... When the project abruptly ended three years later, the project had scanned over a million pages of news from over 2,000 newspapers. Although nobody is entirely sure why the project ended, Google News Archive delivered an incredible gift to Milwaukee: free digital access to more than a century’s worth of local newspapers.
But now:
on Tuesday, August 16, the Milwaukee Journal, Milwaukee Sentinel, and Milwaukee Journal Sentinel listings vanished from the Google News Archive home page. This change came without any advance warning and still has no official explanation.
The result for Takach is:
For years, I’ve bookmarked thousands of articles and images for further exploration at a later date. In one lightning bolt moment, all of my Google News Archive bookmarks went from treasure to trash.
To be fair, this doesn't appear to be another case of Google abruptly canceling a service:
“Google News Archive no longer has permission to display this content.”
According to the Milwaukee Journal Sentinel:
“We have contracted with a new vendor (Newsbank.) It is unclear when or if the public will have access to the full inventory that was formerly available on Google News Archive.”
The owner of the content arbitrarily decided to vanish it.

At U.S. News & World Report Steven Nelson's Wayback Machine Won’t Censor Archive for Taste, Director Says After Olympics Article Scrubbed is an excellent, detailed and even-handed look at the issues raised for the Internet Archive when the Daily Beast's:
straight reporter created a gay dating profile and reported the weights, athletic events and nationalities of Olympians who contacted him, including those from "notoriously homophobic" countries. As furor spread last week, the Daily Beast revised and then retracted the article, sending latecomers to the controversy to the Wayback Machine.
The Internet Archive has routine processes that make content they have collected inaccessible, for example in response to DMCA takedown notices. It isn't clear exactly what happened in this case. Mark Graham is quoted:
“The page we’re talking about here was removed from the Wayback Machine out of a concern for safety and that’s it.”... Graham was not immediately able to think of a similar safety-motivated removal and declined to say if the Internet Archive retains a non-public copy. In fact, he says he has no proof, just circumstantial evidence, the article ever was in the Wayback Machine.
I would endorse Chris Bourg's stance on this issue:
Chris Bourg, director of libraries at the Massachusetts Institute of Technology, says the matter is "a tricky situation where librarian/archivists values of privacy and openness come in to conflict" and says in an email the article simply could be stored in non-public form for as long as necessary.

"My personal opinion is that we should always look for answers that cause the least harm, which in this case would be to dark archive the article; and keep it archived for as long as needed to best protect the gay men who might otherwise be outed," she says. "That’s a difficult thing to do, and is no guarantee that the info won’t be released and available from other sources; but I think archivists/librarians have special responsibilities to the subjects in our collections to 'do no harm'."
These two stories bring up four points to consider:
  • The Internet Archive is the most-used, but it is only one among a number of Web archives, which will naturally have different policies. Portals to the archived Web that use Memento to aggregate content from multiple archives could well find, in other archives, content that the Wayback Machine had suppressed.
  • Copyright enables censorship. Anything on the public Web, or in public Web archives, can be rendered inaccessible without notice by the use or abuse of copyright processes, such as the DMCA takedown process.
  • Just because archived Web resources are in the custody of a major company, such as Google, or even what we may now thankfully call a major institution, the Internet Archive, does not guarantee them permanence.
  • Thus, scholars such as Takach are faced with a hard choice: either to risk losing access without notice to the resources on which their work is based, or to ignore the law and maintain a personal archive of all those resources on their own equipment.
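To make the first point concrete, here is a minimal sketch of how a client might ask a Memento aggregator for copies of a page held in any archive, not just the Wayback Machine. It assumes the Memento protocol's Accept-Datetime request header and Link response header; the TimeGate base URL and the function names are illustrative assumptions, not something the articles above describe. The request is built but not sent, so the sketch runs offline.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.request import Request

# Assumption: base URL of a Memento aggregator TimeGate. Substitute
# whichever aggregator you actually use.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(original_url, when, base=TIMEGATE_BASE):
    """Build (but do not send) a TimeGate request for original_url near `when`.

    `when` must be a timezone-aware datetime.
    """
    req = Request(base + original_url)
    # Memento datetimes use the HTTP date format, expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    req.add_header("Accept-Datetime", stamp)
    return req


def parse_mementos(link_header):
    """Return (datetime, uri) pairs for each memento in a Link header, oldest first.

    Simplified parsing: assumes URIs contain no '>' and that quoted
    parameter values contain no '<'.
    """
    mementos = []
    for match in re.finditer(r'<([^>]+)>([^<]*)', link_header):
        uri, raw_params = match.groups()
        params = dict(re.findall(r'(\w+)="([^"]*)"', raw_params))
        # rel may hold several values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)
```

Because an aggregator can return mementos from several archives, the host part of each returned URI shows which archive holds a copy. That is how content suppressed in one archive might still be found in another.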
While not specifically about Web archives, emptywheel's account of the removal of the Shadow Brokers files from GitHub, Reddit and Tumblr, and Roxane Gay's The Blog That Disappeared about Google's termination of Dennis Cooper's account, show that one cannot depend on what services such as these say in their Terms of Service.


David. said...

Michael Nelson points out that in this case, probably because of the short time during which the article was up, other Web archives don't appear to have captured it.

David. said...

Via Mike Masnick at Techdirt, more detail on the Milwaukee Journal-Sentinel archive vanishing from Google from Henry Grabar at Slate.