Image: Gerd Badur, CC BY-SA 3.0, Source
47% of mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case though, it is disconcerting and counter-intuitive.
Kannada is an Indian language spoken by only about 38 million people. Below the fold, some commentary.
Barack Obama’s archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. You may notice that some of the information, such as "followers", "following", "log in", etc., is not displayed in English but is instead displayed in Urdu. A similar observation was expressed by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it.
Justin Littman's post is a must-read! They started digging and:
We downloaded the TimeMap of [Obama's] page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in their widgets) across five different web archives, including Internet Archive (IA), Archive-It (AIT), Library of Congress (LoC), UK Web Archive (UKWA), and Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. Of the remaining 47% mementos, 22% were archived in Kannada and 25% in 45 other languages combined.
They came up with a list of 5 possible explanations and methodically eliminated each of them in a process which is impossible to summarize briefly. Reading their description of the process illuminates a number of interesting issues in Web archiving, such as the lengths Twitter goes to in ensuring that their pages are not cached.
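The language tally quoted above can be approximated with a short script. This is only a sketch, not the authors' method: it assumes each downloaded memento declares its template language in the `lang` attribute of its `<html>` element, and the helper names (`template_language`, `language_shares`) and the toy pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # Assumption: the template language is declared on <html lang="...">.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def template_language(page_source):
    """Return the declared template language, or "unknown" if absent."""
    parser = HtmlLangParser()
    parser.feed(page_source)
    return parser.lang or "unknown"


def language_shares(pages):
    """Map each language code to its percentage share of the pages."""
    counts = Counter(template_language(page) for page in pages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.items()}


# Toy input standing in for downloaded mementos (not real data).
sample = [
    '<html lang="en"><body>Followers</body></html>',
    '<html lang="en"><body>Followers</body></html>',
    '<html lang="kn"><body>...</body></html>',
    '<html lang="ur"><body>...</body></html>',
]
shares = language_shares(sample)
```

Run over a real set of mementos, the same tally would show whether one non-English language dominates the rest.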
They finally tracked the problem to the way Heritrix's use of session cookies interacts with Twitter's language negotiation:
The page source of Twitter has a list of alternate links for each language they provide localization for (currently, 47 languages). This list can get added to the frontier queue of the crawler. Though, these links have a different URI (i.e., having a query parameter "?lang="), once any of these links are loaded, the session is set for that language until the language is explicitly changed or the session expires/cleared.
But why was Kannada so dominant?
The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives. While other language specific links overwrite the session set by their predecessor, the last one affects many more Twitter links in the frontier queue. Twitter started supporting Kannada along with three other Indian languages in July 2015 and placed it at the very end of language related alternate links. Since then, it has been captured more often in various archives than any other non-English language. Before these new languages were added, Bengali used to be the last link in the alternate language links for about a year. Our dataset shows dense archival activity for Bengali between July 2014 and July 2015, then Kannada took over. This confirms our hypothesis about the spatial placement of the last language related link sticking the session for a long time with that language. This affects all upcoming links in the crawlers' frontier queue from the same domain until another language specific link overwrites the session.
Disabling session cookies in the crawler isn't a good idea, because many site features depend on them. The best fix they were able to come up with is to expire cookies quickly.
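The sticky-session effect, and the effect of expiring cookies quickly, can be illustrated with a toy simulation. This is a deliberately simplified sketch and not Heritrix code: it assumes a single shared session per domain, a frontier processed strictly in order, and a hypothetical three-language subset of the alternate links with "kn" last. The function name `simulate_crawl` and the `expire_after` parameter are made up for illustration.

```python
from collections import Counter

DEFAULT_LANG = "en"  # assumption: pages fall back to English without a session


def simulate_crawl(frontier, expire_after=None):
    """Simulate a crawler that shares one session cookie across a domain.

    frontier: URLs in crawl order.
    expire_after: if set, the session language is dropped once it has been
        used for this many ordinary (non "?lang=") fetches.
    Returns a list of (url, language_captured) pairs.
    """
    session_lang = None
    fetches_since_set = 0
    captures = []
    for url in frontier:
        if "?lang=" in url:
            # A language-specific link overwrites whatever the session held.
            session_lang = url.split("?lang=", 1)[1]
            fetches_since_set = 0
        else:
            if (expire_after is not None and session_lang is not None
                    and fetches_since_set >= expire_after):
                session_lang = None  # cookie expired
            fetches_since_set += 1
        captures.append((url, session_lang or DEFAULT_LANG))
    return captures


# Hypothetical frontier: alternate-language links first ("kn" last, as in the
# quoted analysis), followed by ten ordinary pages from the same domain.
alt_langs = ["bn", "ur", "kn"]
frontier = [f"https://twitter.com/BarackObama?lang={c}" for c in alt_langs]
frontier += [f"https://twitter.com/user{i}" for i in range(10)]

sticky = Counter(lang for _, lang in simulate_crawl(frontier))
expiring = Counter(lang for _, lang in simulate_crawl(frontier, expire_after=2))
print("sticky session:", dict(sticky))
print("quick expiry:  ", dict(expiring))
```

In the sticky case every ordinary page after the last alternate link is captured in Kannada; with a short expiry only the first couple are, and the rest fall back to the default language.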
The problem of Twitter pages in Kannada is another aspect of the overall problem I discussed in The Amnesiac Civilization: Part 3:
But [Kalev Leetaru] doesn't address the major sources of variability among the versions of web page content, which are personalization and geolocation. It used to be the case that society's basic information environment was mass media, and it was safe to assume that all consumers of each of those mediums saw the same content. This hasn't been the case for years; every visitor to a site with a significant audience sees different content.
Presumably Kalev Leetaru's demand to archive every version of every Web page would include the language versions. This would add another factor of at least 47 to my estimate that:
storing a single Web page could take up to about 1.6×10^20 bytes, or 160 exabytes.
7.5 zettabytes is a lot of data for one page.
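For readers who want to check the arithmetic, the 7.5 zettabyte figure is just the 160 exabyte estimate multiplied by Twitter's 47 languages. A minimal sketch, using decimal (SI) units and variable names chosen here for illustration:

```python
EXABYTE = 10**18
ZETTABYTE = 10**21

bytes_per_page = 160 * EXABYTE   # the 1.6×10^20 byte estimate quoted above
twitter_languages = 47           # languages Twitter supported per the quoted study

total_bytes = bytes_per_page * twitter_languages
total_zb = total_bytes / ZETTABYTE
print(f"{total_zb:.2f} ZB")
```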
The problem here isn't that some archived Twitter pages have Kannada templates. After all, there are 38M potential Twitter users who speak Kannada and would have seen those pages with a Kannada template, if their browsers were configured correctly. Rather, it is that the archives aren't reflecting the Web at large, because Kannada templates are over-represented. On a population basis they should be maybe 0.6% of the total, not 22%. And if Twitter pages are representative of the language usage of the Web-at-large, they should be a much smaller fraction than 0.6%.
Given that we can only archive a fraction of the Web, we would like Web archives to be a representative sample of the Web. Getting that would require an unbiased random sample of Web pages. Detecting sources of bias in the sample we have and figuring out how to reduce them in future crawls, as Sawood Alam and Plinio Vargas have done, is important.