Tuesday, November 1, 2016

Fixing broken links in Wikipedia

Mark Graham has a post at the Wikimedia Foundation's blog, "Wikipedia community and Internet Archive partner to fix one million broken links on Wikipedia":
The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than one million broken outbound web links on English Wikipedia. This has been done by the Internet Archive's monitoring for all new, and edited, outbound links from English Wikipedia for three years and archiving them soon after changes are made to articles. This combined with the other web archiving projects, means that as pages on the Web become inaccessible, links to archived versions in the Internet Archive's Wayback Machine can take their place. This has now been done for the English Wikipedia and more than one million links are now pointing to preserved copies of missing web content.
This is clearly a good thing, but follow me below the fold.

The lead developers of the bot that has been fixing up the links are Maximilian Doerr and Stephen Balbach:
As a result of their work, in close collaboration with the non-profit Internet Archive and the Wikimedia Foundation's Wikipedia Library program and Community Tech team, now more than one million broken links have been repaired. For example, footnote #85 from the article about Easter Island, now links to the Wayback Machine instead of a now-missing page.
Footnote 85 on the Easter Island page now looks like this:
However, Alfred Metraux pointed out that the rubble filled Rapanui walls were a fundamentally different design to those of the Inca, as these are trapezoidal in shape as opposed to the perfectly fitted rectangular stones of the Inca. See also this FAQ at the Wayback Machine (archived 11 October 2007)
The URL that has been updated to point to an archived copy is:
This shows that the Internet Archive collected the page:
on 11 October 2007. This is clearly a big improvement, as Mark Graham writes:
"What Max and Stephen have done in partnership with Mark Graham at the Internet Archive is nothing short of critical for Wikipedia's enduring value as a shared repository of knowledge. Without dependable and persistent links, our articles lose their backbone of reliable sources. It's amazing what a few people can do when they are motivated by sharing - and preserving -knowledge," said Jake Orlowitz, head of the Wikipedia Library. "Having the opportunity to contribute something big to the community with a fun task like this is why I am a Wikipedia volunteer and bot operator. It's also the reason why I continue to work on this never-ending project, and I'm proud to call myself its lead developer," said Maximilian, the primary developer and operator of InternetArchiveBot.
But wiring the Internet Archive in as the only source of archived Web pages, while expedient in the short term, is also a problem. It is true that the Wayback Machine is by far the largest repository of archived URLs, but research using Memento (RFC 7089) has shown that significantly better reproduction of archived pages can be achieved by aggregating all the available Web archives.

Reinforcing the public perception that the Wayback Machine is the only usable Web archive reduces the motivation for other institutions, such as national libraries, to maintain their own Web archiving efforts. Given the positive effects of aggregating even relatively small Web archives, this impairs the quality of the reader's experience of the preserved Web, and thus Wikipedia.

Perhaps at some point the InternetArchiveBot could be replaced by a MementoBot that inserted links to a Memento aggregator instead of directly to the Wayback Machine. The Wayback Machine would still be the source for most of the broken link replacements, but more links would resolve. Other Web archives would get credit for their efforts, in the cooperative spirit of Brewster Kahle's "Building Libraries Together".
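
To make the difference concrete, here is a minimal Python sketch of the two kinds of link such a bot could emit. It is not how InternetArchiveBot works. The aggregator endpoint (the Time Travel service's TimeGate prefix) and the example.com URL are assumptions chosen for illustration, and the function names are hypothetical. The Accept-Datetime header is the datetime negotiation mechanism defined by RFC 7089.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Direct link into a single archive (the Wayback Machine URL pattern).
WAYBACK_PREFIX = "https://web.archive.org/web/"
# Assumed TimeGate endpoint of a Memento aggregator; verify before relying on it.
AGGREGATOR_TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def wayback_link(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for the capture nearest to `when`."""
    return f"{WAYBACK_PREFIX}{when.strftime('%Y%m%d%H%M%S')}/{original_url}"


def timegate_request(original_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return the (url, headers) pair for RFC 7089 datetime negotiation.

    The TimeGate is expected to redirect to the best Memento it knows of,
    in whichever archive holds it.
    """
    utc_when = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(utc_when, usegmt=True)}
    return AGGREGATOR_TIMEGATE + original_url, headers


# Hypothetical original URL; the date matches the Easter Island footnote.
linked_on = datetime(2007, 10, 11, tzinfo=timezone.utc)
print(wayback_link("http://example.com/faq.html", linked_on))
print(timegate_request("http://example.com/faq.html", linked_on))
```

The first function hard-codes one archive into the article. The second leaves the choice of archive to whatever the aggregator knows about at the time the reader follows the link.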

[Update - Blogger appears to have bloggered the link to Mark Graham's post, so I fixed it. Sorry about that.]


Sawood Alam said...

In my opinion a "RobustLinksBot" (http://robustlinks.mementoweb.org/spec/) would be a better choice than a "MementoBot". It allows different approaches to decorating links, such as: 1) linking to the URI-R (original URI) and providing a specific URI-M (Memento URI) along with version datetime information to allow TimeGate negotiation on any other archive or Memento aggregator, or 2) linking to a specific URI-M and providing the URI-R and version datetime for alternate URI-M resolution. The first approach would require a client-side tool or JavaScript to make it work properly, while the latter approach would work in ordinary browsers.

Replacing the Internet Archive with a Memento aggregator would still leave a single point of failure, while applying the Robust Links approach would bring several benefits, including:

- the page markup still has the original URI in it, hence proper PageRank value is calculated and does not give unfair advantage to IA or any Memento aggregator
- if the URI-M resolution service endpoint changes, it only needs to be updated in the client and does not require mass-editing all the pages again, which would cause an unnecessary revision of almost every page in the wiki
- links remain accessible in countries where IA (or some Memento aggregator, for that matter) is banned
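
The two decoration approaches described above can be sketched in Python as follows. This is an illustration, not a reference implementation. The attribute names (data-originalurl, data-versionurl, data-versiondate) are given as I understand the Robust Links spec and should be checked against it. The function name, the example URLs, and the link text are hypothetical.

```python
from html import escape


def robust_link(original_url: str, memento_url: str, version_date: str,
                text: str, href_to: str = "original") -> str:
    """Render an <a> element decorated in the Robust Links style.

    href_to="original": href is the URI-R; the URI-M travels as data-versionurl.
    href_to="memento":  href is the URI-M; the URI-R travels as data-originalurl.
    """
    if href_to == "original":
        attrs = {"href": original_url, "data-versionurl": memento_url}
    elif href_to == "memento":
        attrs = {"href": memento_url, "data-originalurl": original_url}
    else:
        raise ValueError("href_to must be 'original' or 'memento'")
    attrs["data-versiondate"] = version_date
    rendered = " ".join(f'{name}="{escape(value, quote=True)}"'
                        for name, value in attrs.items())
    return f"<a {rendered}>{escape(text)}</a>"


# Hypothetical URLs, for illustration only.
uri_r = "http://example.com/faq.html"
uri_m = "https://web.archive.org/web/20071011000000/http://example.com/faq.html"
print(robust_link(uri_r, uri_m, "2007-10-11", "this FAQ", href_to="original"))
print(robust_link(uri_r, uri_m, "2007-10-11", "this FAQ", href_to="memento"))
```

In both variants the original URI, the archived copy, and the datetime all stay in the page markup, so a client can fall back to another archive without the page being edited again.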

Martin Klein said...

I would like to elaborate on Sawood's comment by stressing two points:

1) We have proposed the concept of "Robust Links" to address the issue of broken links. The idea is that by decorating links with additional information, we enable fallback options in case of link rot and content drift. In addition to the original URI, a link is enhanced with the datetime of linking plus the URI of an archived copy of the original. In case the original URI does not resolve anymore at some point in the future, the archived copy can be used. In case the archival copy is inaccessible, the original URI and the datetime can be used to look for alternative copies in other archives.
Motivating examples, a detailed description of the link decoration approach, and an example of Robust Links in action can be found in this D-Lib Magazine article.

2) Software to help create something like a MementoBot already exists. LANL's Prototyping Team has developed a Python-based Memento library, which has been utilized by PyWikiBot, Wikipedia's bot to check for dead links and find archived replacement copies (and more). However, to the best of our knowledge, neither the library nor PyWikiBot has been used in the effort described above.
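
The fallback order described in point 1) can be sketched as a small Python function. It is a sketch under stated assumptions, not the behavior of the Memento library or of PyWikiBot: the reachability check and the alternative-archive lookup are passed in as hypothetical callables, so no real API is implied. It also only covers link rot. Detecting content drift would need a comparison of the live page against the archived copy, which is out of scope here.

```python
from typing import Callable, Optional


def resolve_robust_link(
    original_url: str,
    version_url: str,
    version_date: str,
    is_reachable: Callable[[str], bool],
    find_alternative: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Pick the URL a reader should be sent to, in fallback order."""
    # 1. Prefer the original URI while it still resolves.
    if is_reachable(original_url):
        return original_url
    # 2. Otherwise use the archived copy recorded with the link.
    if is_reachable(version_url):
        return version_url
    # 3. Otherwise search other archives using the original URI and datetime.
    return find_alternative(original_url, version_date)


# Toy stand-ins for the two callables (hypothetical, for illustration only).
alive = {"https://archive.example.org/20071011/faq"}
chosen = resolve_robust_link(
    "http://example.com/faq.html",
    "https://archive.example.org/20071011/faq",
    "2007-10-11",
    is_reachable=lambda url: url in alive,
    find_alternative=lambda url, date: None,
)
print(chosen)
```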