The Pew Research Center describe their methodology:

> we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.
>
> We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.

Their results are not surprising, but there are a number of surprising things about their report. Below the fold, I explain.
The Web is an evanescent medium. URLs are subject to two kinds of change:
- Content drift, when a URL resolves to different content than it did previously.
- Link rot, when a URL no longer resolves.
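The Pew breakdown quoted above (16% page-level vs. 9% domain-level failures) corresponds to a simple classification of link rot, while content drift can only be measured against an earlier snapshot. Here is a minimal sketch in Python; it is my illustration, not Pew's actual pipeline, and the User-Agent string, the timeout, and the crude diff-ratio drift score are all my own assumptions:

```python
import difflib
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse

def classify_rot(url: str, timeout: float = 10.0) -> str:
    """Classify a URL as 'ok', 'page_dead' (dead page on a live
    domain), or 'domain_dead' (the root domain no longer resolves)."""
    host = urlparse(url).hostname or ""
    try:
        socket.getaddrinfo(host, None)   # does the domain still resolve?
    except socket.gaierror:
        return "domain_dead"
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "linkrot-audit/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "ok" if resp.status < 400 else "page_dead"
    except urllib.error.HTTPError:
        return "page_dead"   # 4xx/5xx: page gone, domain still alive
    except OSError:
        return "page_dead"   # connection refused, timeout, TLS failure, ...

def drift_score(old_html: str, new_html: str) -> float:
    """Content drift needs a prior snapshot to compare against:
    0.0 means the two captures are identical, values near 1.0 mean
    the URL now serves something entirely different."""
    return 1.0 - difflib.SequenceMatcher(None, old_html, new_html).ratio()
```

A robust checker would retry with GET when a server rejects HEAD, and real drift studies use stronger similarity measures (shingling, simhash) than a raw diff ratio; this only illustrates the two failure modes.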
The report's headline findings:

> - A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.
> - For older content, this trend is even starker. Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.
For news sites, government sites and Wikipedia they report:

> - 23% of news webpages contain at least one broken link, as do 21% of webpages from government sites. News sites with a high level of site traffic and those with less are about equally likely to contain broken links. Local-level government webpages (those belonging to city governments) are especially likely to have broken links.
> - 54% of Wikipedia pages contain at least one link in their “References” section that points to a page that no longer exists.

There is a long history of research into both phenomena. Content drift is important to Web search engines; to keep their indexes up-to-date, they need to re-visit URLs frequently enough to capture changes (a toy sketch of the change-rate estimate such crawlers rely on follows the reference lists below). Thus studies of content drift started early in the history of the Web. Here are some examples from more than two decades ago:
- The Evolution of the Web and Implications for an Incremental Crawler, December 1999.
- Keeping up with the changing Web, May 2000.
- A large-scale study of the evolution of web pages, May 2003.
Link rot was studied early too, especially as it affects references in the academic literature. For example:
- Persistence of Web references in scientific research, February 2001.
- Web page change and persistence—A four-year longitudinal study, November 2001.
- The decay and failures of web references, January 2003.
- Going, Going, Gone: Lost Internet References, October 2003.
- 404 not found: the stability and persistence of URLs published in MEDLINE, January 2004.
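As promised above, here is a minimal sketch of the change-rate estimation behind incremental crawling. Under a Poisson change model, if a page was re-visited n times at equal intervals and found changed on x of them, the maximum-likelihood estimate of changes per interval is -ln(1 - x/n). The scheduling policy below is a toy of my own devising, not taken from any of the papers listed:

```python
import math

def change_rate(visits: int, changes: int) -> float:
    """MLE of a page's change rate (changes per re-visit interval)
    under a Poisson model, from `visits` equally spaced checks of
    which `changes` found the page modified."""
    if changes >= visits:       # changed every time we looked:
        return float("inf")     # the rate is unbounded from this data
    return -math.log(1.0 - changes / visits)

def next_visit_days(base_days: float, visits: int, changes: int) -> float:
    """Toy policy: re-visit fast-changing pages sooner, back off on
    static ones, never more often than daily."""
    rate = change_rate(visits, changes)
    if rate == 0.0:
        return base_days * 4.0  # never seen a change: back off
    return max(1.0, base_days / rate)
```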
I like to cite an example of really bad reviewing that appeared in AAAS Science in 2003: Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, Schilling LM, Going, Going, Gone: Lost Internet References, Science 2003, 302:787, a paper about the decay of Internet links. The authors failed to acknowledge that the paper repeated, with smaller samples and somewhat worse techniques, two earlier studies that had been published in Communications of the ACM 9 months before, and in IEEE Computer 32 months before. Neither of these is an obscure journal. It is particularly striking that neither the reviewers nor the editors bothered to feed the keywords from the article abstract into Google; had they done so they would have found both of these earlier papers at the top of the search results.

The first surprise is that the Pew report lacks any acknowledgement that the transience of Web content is a long-established problem; like Dellavalle et al, it reads as if this were a new revelation.
Even before published research had quantified it, link rot and content drift were well understood, and efforts to mitigate them were underway. In 1996 Brewster Kahle founded the Internet Archive, the first of several archives of the general Web. Two years later, the LOCKSS Program became the first effort to establish a specialized archive for the academic literature. Both were intended to deliver individual preserved pages to users. A decade later, Common Crawl was set up to deliver Web content in bulk to researchers such as the Pew team; it is not intended as a mitigation for link rot or content drift.
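Common Crawl's role as a bulk supplier is visible in its public index API, which is how one locates captures of a given URL in a crawl before fetching them from the corresponding WARC files. A minimal sketch follows; the collection name CC-MAIN-2023-50 is an assumption (the current list is published at index.commoncrawl.org):

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed collection; see https://index.commoncrawl.org/ for the real list.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def cc_captures(url: str):
    """Yield Common Crawl index records (one JSON object per capture)
    for `url` in one crawl collection."""
    query = urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        for line in resp:
            yield json.loads(line)

# Each record carries the WARC filename/offset/length needed to fetch
# the raw capture, e.g.:
for rec in cc_captures("example.com/"):
    print(rec["timestamp"], rec["status"], rec["filename"])
```

Note that the endpoint can itself return 404 when a URL has no captures, so production code would wrap the lookup in error handling.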
Although Common Crawl was a suitable resource for their research, the second surprise is that the Pew report describes and quantifies the problem of link rot, but acknowledges none of the multiple, decades-long efforts to mitigate it by archiving the Web and providing users with preserved copies of individual pages.
I note that the Pew team didn't report any effort to determine whether the content of the rotted URLs was available in any of the Web archives. Two thoughts:
- In the paper world we would not consider content lost just because the medium carrying it had been moved to a new location.
- There are technological mitigations for the problems of content drift and link rot, such as Memento (RFC7089) and browser plugins that use it and other techniques to recover from rotted URLs by redirecting to their archived versions.
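To make that concrete, here is a minimal sketch of the recovery step such a plugin performs. Memento proper works at the protocol level (an Accept-Datetime request header sent to an archive's TimeGate), but the Internet Archive also exposes a simpler availability endpoint; the default timestamp below is my own choice:

```python
import json
import urllib.request
from urllib.parse import quote

def closest_snapshot(url: str, timestamp: str = "20130101") -> str | None:
    """Ask the Internet Archive's availability API for the archived
    snapshot of `url` closest to `timestamp` (YYYYMMDD); return the
    snapshot's URL, or None if the archive holds no capture."""
    api = ("https://archive.org/wayback/available"
           f"?url={quote(url, safe='')}&timestamp={timestamp}")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# A plugin hitting a 404 could transparently redirect the user to, e.g.:
# closest_snapshot("http://example.com/rotted-page")
```

A Memento-aware client would instead query an aggregator so that any participating archive, not just the Internet Archive, can supply the missing page.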