Thursday, March 28, 2019

The 47 Links Mystery

Nearly a year ago, in "All Your Tweets Are Belong To Kannada", I blogged about "Cookies Are Why Your Archived Twitter Page Is Not in English". It describes some fascinating research by Sawood Alam and Plinio Vargas into the effect of cookies on the archiving of multilingual websites.

Sawood Alam just followed up with "Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages", another fascinating exploration of these effects. Follow me below the fold for some commentary.

The starting point for Alam’s exploration was replaying from the Internet Archive a memento from the preserved Twitter timeline of Pratik Sinha.

Alam's Figure 1
As you can see, although the tweets were correctly rendered in English, parts of the frame around them were rendered in Portuguese, Urdu and English. These are among the 47 languages Twitter supports. Alam has highlighted these parts in different colors, and distinguished content in the initial HTTPS response (dotted border) from subsequent lazily loaded content (solid border). This page never existed in the real world; the result of the replay process is misleading.

From there Alam describes an intricate process of figuring out why this happened. You should read the whole post to follow along. Briefly, Twitter supports two ways for the reader's browser to specify the language:
In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=" query parameter in the URI (e.g., https://twitter.com/ibnesayeed?lang=ur for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" Cookie accordingly (overwriting any existing value) for all the subsequent requests until overwritten again explicitly.
Alam's Figure 5
If a memento request includes the "?lang=" query parameter, the sidebars are correctly rendered in the requested language, in this case Urdu. And because Urdu is written right-to-left, the sidebars correctly move to the left side.
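
The precedence rule Alam describes can be sketched in a few lines of Python. This is only an illustration of the behavior quoted above, not Twitter's actual server logic; the function name `negotiate_language` and the fallback to English when neither the parameter nor the cookie is present are my assumptions.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_LANG = "en"  # assumption: the fallback language is not stated in the post


def negotiate_language(uri, cookies):
    """Return (language, updated_cookies) for one request.

    Hypothetical model of the quoted rule: a "?lang=" query parameter
    wins over the "lang" cookie and also overwrites that cookie.
    """
    query = parse_qs(urlsplit(uri).query)
    updated = dict(cookies)  # copy, so the caller's jar is not mutated
    if "lang" in query:
        lang = query["lang"][-1]
        updated["lang"] = lang  # the parameter also sets the cookie
    else:
        lang = cookies.get("lang", DEFAULT_LANG)
    return lang, updated


# The query parameter overrides an existing Portuguese cookie.
lang, jar = negotiate_language("https://twitter.com/ibnesayeed?lang=ur", {"lang": "pt"})
```

After that call, a later request for the same page without the parameter would be served according to the cookie now held in `jar`, which is the side effect that matters for what follows.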

The mystery starts being revealed when Alam notes that:
The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=" style parameters). These alternate links will also be added in the frontier queue of the crawler in some order.
Because each of these links uses the query parameter to specify the language, each time the crawler follows one of the 47 links the "lang" cookie will be updated. The cookie affects the entire domain, so the new setting will control everything the crawler fetches from that domain until it follows the next one of the 47 links.
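
To make the effect concrete, here is a toy simulation of a crawler with a single cookie jar working through a frontier. It is a sketch under assumptions, not a model of any real crawler: the `crawl` function, the frontier order, the resource paths and the English default are all hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def crawl(frontier, default_lang="en"):
    """Visit URIs in order with one shared cookie jar.

    Returns (uri, language_served) pairs. A "?lang=" parameter
    overwrites the "lang" cookie, which then applies to every later
    request to the domain until it is overwritten again.
    """
    jar = {}
    captures = []
    for uri in frontier:
        query = parse_qs(urlsplit(uri).query)
        if "lang" in query:
            jar["lang"] = query["lang"][-1]
        captures.append((uri, jar.get("lang", default_lang)))
    return captures


# Hypothetical frontier: alternate links interleaved with page resources.
frontier = [
    "https://twitter.com/example",
    "https://twitter.com/example?lang=pt",
    "https://twitter.com/i/sidebar_a",  # hypothetical lazily loaded resource
    "https://twitter.com/example?lang=ur",
    "https://twitter.com/i/sidebar_b",  # hypothetical lazily loaded resource
]

for uri, lang in crawl(frontier):
    print(lang, uri)
```

In this sketch the two sidebar resources are captured in Portuguese and Urdu respectively, even though neither URI mentions a language. Replaying them together with the English page yields the kind of composite shown in Alam's Figure 1.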

The problem is exacerbated because:
When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not.
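
An archive or crawler could at least detect whether a response admits to depending on cookies. The following is a minimal sketch, assuming the response headers are available as a plain dictionary; the helper name `varies_on_cookie` is mine, not something from Alam's post.

```python
def varies_on_cookie(headers):
    """Return True if the Vary header lists Cookie (or the wildcard "*").

    `headers` is assumed to be a mapping of header name to value;
    header names and Vary field names are compared case-insensitively.
    """
    vary = ""
    for name, value in headers.items():
        if name.lower() == "vary":
            vary = value
    fields = {field.strip().lower() for field in vary.split(",") if field.strip()}
    return "cookie" in fields or "*" in fields


advertised = varies_on_cookie({"Vary": "Accept-Encoding, Cookie"})
silent = varies_on_cookie({"Vary": "Accept-Encoding"})
```

A response like the second one gives the archive no hint that the cookie influenced the content, which is the situation Alam describes for Twitter.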
Alam concludes:
Accommodating cookies at capture/crawl time, but not utilizing them at replay time has this consequence of cookie violations, resulting in defaced composite mementos.
Fixing this problem is really hard, as Alam goes on to describe.

I hope this brief summary motivates you to read the whole post. It is a fine piece of work.
