Sawood Alam just followed up with "Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages", another fascinating exploration of these effects. Follow me below the fold for some commentary.
The starting point for Alam’s exploration was replaying from the Internet Archive a memento from the preserved Twitter timeline of Pratik Sinha.
Alam's Figure 1
From there Alam describes an intricate process of figuring out why the memento replayed in several languages at once. You should read the whole post to follow along. Briefly, Twitter supports two ways for the reader's browser to specify the language:
In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=" query parameter in the URI (e.g., https://twitter.com/ibnesayeed?lang=ur for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" Cookie accordingly (overwriting any existing value) for all the subsequent requests until overwritten again explicitly.
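The precedence rule in that passage can be sketched as a small function. This is a toy model of the behavior Alam describes, not Twitter's actual code: the name `resolve_language`, the fallback value "en", and the plain dict standing in for a cookie jar are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def resolve_language(url, cookies, default="en"):
    """Return (language, updated_cookies) for one request.

    Toy model: a "?lang=" query parameter wins and overwrites the
    "lang" cookie; otherwise the existing cookie decides. The "en"
    fallback is an assumption, not something the post states.
    """
    query = parse_qs(urlsplit(url).query)
    updated = dict(cookies)  # copy so the caller's jar is not mutated
    if "lang" in query:
        updated["lang"] = query["lang"][0]
    return updated.get("lang", default), updated


# The query parameter overrides the cookie and rewrites it ...
lang, jar = resolve_language("https://twitter.com/ibnesayeed?lang=ur", {"lang": "en"})
# ... so a later request without the parameter still gets the new language.
later_lang, _ = resolve_language("https://twitter.com/ibnesayeed", jar)
```

The point to notice is the side effect: a single request with the parameter changes what every subsequent cookie-only request receives.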
Alam's Figure 5
The mystery starts being revealed when Alam notes that:
The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=<language-code>" style parameters). These alternate links will also be added in the frontier queue of the crawler in some order.
Because each of these links uses the query parameter to specify the language, each time the crawler follows one of the 47 links the "lang" cookie will be updated. The cookie affects the entire domain, so the new setting will control everything the crawler fetches from that domain until it follows the next one of the 47 links.
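How the alternate links and the domain-wide cookie interact can be seen in a toy crawl. The sketch below is only a simulation under stated assumptions: the `crawl` function, the example URLs, and the second language code are hypothetical, nothing is fetched from the network, and a real crawler's frontier ordering is far less tidy.

```python
from collections import deque
from urllib.parse import urlsplit, parse_qs


def crawl(frontier):
    """Visit URLs in frontier order with one shared cookie jar.

    Returns (url, language) pairs recording which language each
    capture would be served in under this toy model.
    """
    jar = {}  # a single jar for the whole domain
    captures = []
    queue = deque(frontier)
    while queue:
        url = queue.popleft()
        query = parse_qs(urlsplit(url).query)
        if "lang" in query:
            jar["lang"] = query["lang"][0]  # the alternate link overwrites the cookie
        captures.append((url, jar.get("lang", "default")))
    return captures


# Hypothetical frontier: alternate links interleaved with other resources.
frontier = [
    "https://twitter.com/example",             # timeline page
    "https://twitter.com/example?lang=ur",     # alternate link
    "https://twitter.com/i/example-resource",  # resource fetched afterwards
    "https://twitter.com/example?lang=pt",     # another alternate link
    "https://twitter.com/i/example-resource-2",
]

for url, lang in crawl(frontier):
    print(lang, url)
```

Each resource without a "?lang=" parameter inherits whatever language the most recently followed alternate link set, so captures of one page's pieces can end up in different languages depending purely on queue order.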
The problem is exacerbated because:
When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not.
Alam concludes:
Accommodating cookies at capture/crawl time, but not utilizing them at replay time has this consequence of cookie violations, resulting in defaced composite mementos.
Fixing this problem is really hard, as Alam goes on to describe.
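To make the earlier point about the "Vary" header concrete: an archive can only learn from the response itself that the content depends on cookies if the server says so. A minimal sketch of that check follows, assuming response headers arrive as a plain dict; the helper name `varies_on_cookie` and the sample header values are hypothetical and are not taken from Twitter's real responses.

```python
def varies_on_cookie(headers):
    """True if the response's Vary header lists Cookie (or "*")."""
    vary = ""
    for name, value in headers.items():
        if name.lower() == "vary":  # header names are case-insensitive
            vary = value
    fields = {field.strip().lower() for field in vary.split(",") if field.strip()}
    return "cookie" in fields or "*" in fields


# Hypothetical header sets for illustration.
advertised = {"Content-Type": "text/html", "Vary": "Accept-Encoding, Cookie"}
silent = {"Content-Type": "text/html", "Vary": "Accept-Encoding"}

print(varies_on_cookie(advertised))
print(varies_on_cookie(silent))
```

When the header is silent, as in the second case, a capture made under one "lang" cookie looks interchangeable with a capture made under another, which is the situation Alam describes.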
I hope this brief summary motivates you to read the whole post. It is a fine piece of work.