Tuesday, December 29, 2020

Michael Nelson's Group On Archiving Twitter

The rise and fall of the Trump administration has amply illustrated the importance of Twitter in the historical record. Alas, Twitter has no economic motivation to cater to the needs of historians. As they work to optimize Twitter's user experience, the engineers are likely completely unaware of the problems they are causing the Web archives trying to preserve history. Even if they were aware, they would be unable to justify the time and effort necessary to mitigate them.

Over the last six months Michael Nelson's group at Old Dominion University has continued its excellent work to evaluate exactly how much trouble future historians will have to contend with, in three new blog posts from Kritika Garg and Himarsha Jayanetti:
  • Twitter Was Already Difficult To Archive, Now It's Worse!
  • New Twitter UI: Replaying Archived Twitter Pages That Never Existed
  • Twitter Added Labels On Its Old User Interface
Below the fold, some commentary on each of them.

The incentives for Web sites are to "optimize the user experience" to maximize "engagement", to adapt to mobile devices, and to bombard the user with ads. This leads to much of the content they deliver being dynamic, so that each visitor sees personalized content that is different from what every other visitor sees, and that changes during each visit. This has rendered much of the Web extraordinarily unpleasant to view, as described in this memorable rant from Alistair Dabbs:
This leaves you chasing after buttons and tabs as they slide around, jump up and down, run about in circles and generally act like some demented terrier that has just dug up a stash of cocaine-laced Bonio.
Archiving this dynamic content poses both philosophical and practical difficulties:
  • Philosophically, no matter how hard the Web archive's crawler tries to emulate a generic visitor, the fact remains that the precise content it collects would never have been seen by any human. So how can it be said that it represents the historical record? Archives need to provide future historians with ways to estimate the range of differences between what they see and what a representative sample of human visitors would have seen at the time. This problem is particularly acute with social media and targeted ads; in both cases the whole point of the system is to show different visitors different content.
  • Practically, the prevalence of dynamic content means that the Web archive's crawler must execute the page using a headless browser rather than simply collect the content of the URIs it finds in the HTML. Even so, and even ignoring the philosophical issues, problems remain. Much of the page, as Garg & Jayanetti show for Twitter, may be the result of API calls. For example, many sites use API calls to implement an "infinite scroll". How does the headless browser know when to stop scrolling? These issues may require capturing a screen-grab of the page in addition to the content of the URIs found during rendering; a sketch of this approach follows the list.
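To make the practical difficulty concrete, here is a minimal sketch of a headless-browser capture. I have used Playwright purely for illustration; it is not what any particular archive's crawler uses. The URL, scroll budget, and delay are all assumptions, and the stopping rule, scroll until the page stops growing, is exactly the kind of ad hoc heuristic that "infinite scroll" forces on a crawler:

```python
# A minimal sketch, not any archive's actual crawler. Assumes
# Playwright is installed (pip install playwright; playwright install).
from playwright.sync_api import sync_playwright

URL = "https://twitter.com/jack"   # placeholder profile page
MAX_SCROLLS = 20                   # arbitrary budget: when to give up

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")

    # "Infinite scroll" heuristic: keep scrolling until the page
    # height stops growing or the budget runs out. There is no
    # principled stopping point -- that is the crawler's problem.
    last_height = 0
    for _ in range(MAX_SCROLLS):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # let API calls populate the DOM
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break
        last_height = height

    # A screen-grab records what was actually rendered, complementing
    # the archived URI contents.
    page.screenshot(path="capture.png", full_page=True)
    browser.close()
```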
Garg & Jayanetti's work illustrates the depth and complexity of the analysis needed to achieve even a flawed representation of a dynamic social media site at a point in time. Clearly, Twitter has become important enough for Web archives to devote significant effort to capturing. Despite this, Garg & Jayanetti show that they often fail, and the resources sucked up in their attempts divert attention from less prominent sites.

Twitter Was Already Difficult To Archive, Now It's Worse!

Garg & Jayanetti start their July blog post by explaining the cause of the increasing problems in archiving Twitter:
Twitter launched a major redesign to its website in July 2019, but users were still able to access the old version of twitter.com through some workarounds. Twitter shut down the legacy version on June 1st, 2020, necessitating all desktop users to use the new mobile-inspired layout. Starting May 8, 2020, Twitter began warning users still accessing the legacy version to switch to a supported browser (Figure 2). Due to this changeover, archiving services like Internet Archive (IA), Perma.cc, Conifer, etc. started failing to archive Twitter pages. The web archiving services which use Twitter's legacy interface are no longer able to access Twitter pages.
As they started analyzing the problem, they had an inspiration:
We noticed that unlike the old UI, the new UI issues calls to api.twitter.com to build out the pages (Figure 4). This made us think “well, how does Google still crawl tweets? Google is almost surely not using a headless browser to contact api.twitter.com”. This is what led us to try out the “Googlebot” user agent. We used Chrome's incognito mode with the user agent set as “Googlebot” (Figure 5). The incognito mode helped us avoid cookies and the “Googlebot” caused twitter.com to return the old UI in response to our requests.
In this way they were able to analyze the differences between the old and the new user interfaces. They went on to show that this change caused archiving failures across Web archives, and to show how using one of a set of "bot" user-agents could force Twitter to supply the old UI and remedy the problem. They also showed that Twitter's rate limits caused intermittent, and in some cases consistent, archiving failures.
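As an illustration of the user-agent trick, the sketch below fetches a tweet page while claiming to be Googlebot. The URL is a placeholder, and the final check is a crude assumption about the legacy markup; the point is that the old UI served to bots contained the tweets in the HTML itself, so a plain HTTP fetch sufficed:

```python
import requests

# Minimal sketch of the "Googlebot" user-agent workaround the
# authors describe. The tweet URL is a placeholder.
URL = "https://twitter.com/jack/status/20"
HEADERS = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}

resp = requests.get(URL, headers=HEADERS)
print(resp.status_code)

# Crude check, based on the observation that the legacy UI embedded
# tweet text in the HTML rather than building the page from
# api.twitter.com calls (markers like "tweet-text" are an assumption).
print("tweet-text" in resp.text)
```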

They also identified a significant soft-404 problem:
The consequences to web archives of Twitter changing their UI can be seen with the example of the recent “white power” video clip retweet by President Donald Trump, which he later decided to delete (Figure 18). Twitter sends a 200 OK response for that particular tweet even though it doesn't exist anymore, and even for a tweet that never existed, although it sends an actual 404 when accessed by certain bots. We analyzed the TimeMap for that individual tweet on July 8, 2020, using the Wayback CDX Server API. We noticed that there were 18 mementos which received 200 OK status code from Twitter. Among these 18 mementos, 14 are “sorry, the page does not exist” or “browser no longer supported” error captures with only four captures successfully preserved. Since these failed mementos received 200 OK responses, this makes it harder for the crawler to judge the validity of the memento. The interactive user will also have to click through a sea of these soft 404s to find one correctly archived memento.
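The TimeMap analysis they describe can be reproduced in outline with the Wayback CDX Server API. The sketch below counts the mementos of a tweet URL that were captured with a 200 status; as the quote explains, a 200 does not mean the capture is good, so each memento body would still have to be inspected for soft-404 text. The tweet URL is a placeholder, not the one from their post:

```python
import requests

# Sketch of a TimeMap query against the Wayback CDX Server API.
CDX = "https://web.archive.org/cdx/search/cdx"
TWEET = "https://twitter.com/SomeUser/status/1234567890"  # placeholder

rows = requests.get(CDX, params={
    "url": TWEET,
    "output": "json",
    "filter": "statuscode:200",
}).json()  # note: raises if the URL has no captures at all

captures = rows[1:]  # row 0 is the field-name header
print(f"{len(captures)} mementos recorded a 200 OK")

# Counting 200s is not enough: soft 404s ("sorry, the page does not
# exist", "browser no longer supported") also return 200, so telling
# the good captures from the bad ones means fetching and inspecting
# each memento's body.
```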
The post provided a comprehensive description of the problem, which helped archives adapt to the changes.

New Twitter UI: Replaying Archived Twitter Pages That Never Existed

In November Jayanetti & Garg returned with an analysis of one of the ways archived Twitter pages were now misleading:
Twitter's new UI can result in replaying Twitter profile pages that never existed on the live web. In our previous blog post, we talked about how difficult it is to archive Twitter's new UI, and in this blog post, we uncover how the new Twitter UI mementos in the Internet Archive are vulnerable to temporal violations.
Web "pages" these days are composed from the content of a number of URIs, and the content of any or all of these URIs varies through time. This is especially true of components that are the result of API queries such as those returning JSON from api.twitter.com. Web archives try to collect all the components of a "page" in a timely fashion, but they cannot guaurantee that the set they record corresponds to a set that a reader would have seen. And in some cases reassembling a page from the content of the URIs "closest" in time to that of the root page results in pages that could not have been viewed by any reader, a "temporal violation".

The old Twitter UI included the most recent 20 tweets in the HTML of the root page, so at least these tweets were likely to be temporally coherent. But the new UI's HTML has no tweets; it collects them via a query to api.twitter.com, one of five queries involved in rendering the page. Because crawling the new UI is so tricky, and because api.twitter.com imposes rate limits, each visit by a crawler has a good chance of failing to collect some of the query responses, so temporal violations on replay are probable. The old UI could also generate temporal violations, but they were less frequent. Jayanetti & Garg exhaustively analyze each of the five queries of the new UI and demonstrate temporal violations for each. They conclude:
Looking back at our examples we can conclude that pages built with JSON responses are subject to temporal violations. This has implications for the historical record: replayed pages with Twitter's new UI are likely to have significant temporal violations that will be all but impossible for regular users to detect. Perhaps web archives should try to detect such common issues and acknowledge them to provide necessary context.
This, as I said above, is the big lesson to take away from the efforts to archive dynamic Web content. Since replaying it won't replicate what any human saw at the time, it is important for Web archives to provide some indication of the range of differences between what they present as historical and what humans would have seen.

Twitter Added Labels On Its Old User Interface

Garg & Jayanetti returned in December with even more bad news for future historians. They identified a historically important difference between the new Twitter UI that humans were seeing and the old Twitter UI that archives were preserving:
To curb the spread of misinformation on their platform, Twitter warns users by labeling the tweets with misleading content. Twitter labels first came to light on May 26, 2020 when Twitter decided to assign one of their fact-check labels to a tweet from U.S. President Donald Trump. A few days later, on May 29, 2020 Twitter labeled another tweet of Trump for violating Twitter rules about glorifying violence. ... Initially, when we replayed these tweets from web archives in May 2020, we noticed that none of the mementos had the labels. This has been covered in our blog post “Twitter Was Already Difficult To Archive, Now It's Worse!”, which includes how different archiving platforms failed to replay Twitter's label. This is because Twitter did not add the label to the old UI at that time. However, as shown in Figure 2, we can see that Twitter has added the “Violated Twitter Rules” label to its old UI. However, the fact-check label still does not exist in the old UI.
After the same kind of detailed analysis as their previous two posts, they conclude:
Hence, there is a span of time between May 26, 2020 and August 26, 2020, where these labels did appear on the live web but not in archived copies with the old Twitter UI. There are real implications to this issue, since historians and researchers may draw inaccurate conclusions about what appeared on the live web.
Since the difference between the old and new UIs is real, and controlled by Twitter, the only way to avoid misleading future historians would be to archive the new UI, the one that humans see. But, as they have described, this is extraordinarily difficult, and does not eliminate all differences between what the archive preserves and what an individual human would have seen.

Michael Nelson summed up the situation in e-mail:
In short, historians are going to have a real mess on their hands when they go to replay stuff from the archive: the distance between the live web and the archived web is considerable, even for pages that are just minutes or hours in the past.

2 comments:

How do the security services (NSA, GCHQ, etc.) archive and index Twitter then? Perhaps they are paying for and getting a custom feed, as they get one for SWIFT.
