Tuesday, December 31, 2019

Web Packaging for Web Archiving

Supporting Web Archiving via Web Packaging by Sawood Alam, Michele C Weigle, Michael L Nelson, Martin Klein, and Herbert Van de Sompel is their position paper for the Internet Architecture Board's ESCAPE workshop (Exploring Synergy between Content Aggregation and the Publisher Ecosystem). It describes the considerable potential importance of Web Packaging, the topic of the workshop, for Web archiving, but also the problems it poses because, like the Web before Memento, it ignores the time dimension.

[Image omitted; source: Frederic Filloux]
Despite living in the heart of Silicon Valley, our home Internet connection is 3M/1Mbit DSL from Sonic; we love our ISP and I refuse to do business with AT&T or Comcast. As you can imagine, the speed with which Web pages load has been a topic of particular interest for this blog, for example here and here (which starts from a laugh-out-loud, must-read post from Maciej Cegłowski). Then, three years ago, Frederic Filloux's Bloated HTML, the best and the worse triggered my rant Fighting the Web Flab:
Filloux continues:
In due fairness, this cataract of code loads very fast on a normal connection.
His "normal" connection must be much faster than my home's 3Mbit/s DSL. But then the hope kicks in:
The Guardian technical team was also the first one to devise a solid implementation of Google's new Accelerated Mobile Page (AMP) format. In doing so, it eliminated more than 80% of the original code, making it blazingly fast on a mobile device.
Great, but AMP is still 20 bytes of crud for each byte of content. What's the word for 20 times faster than "blazingly"?
Web Packaging is a response to:
In recent years, a number of proprietary formats have been defined to enable aggregators of news and other articles to republish Web resources; for example, Google’s AMP, Facebook’s Instant Articles, Baidu’s MIP, and Apple’s News Format.
Below the fold I look into the history that got us to this point, and where we may be going.

According to Charles Herzfeld, ARPA Director (1965–1967), when design work on the ARPANET started in the mid-1960s:
the ARPANET came out of our frustration that there were only a limited number of large, powerful research computers in the country, and that many research investigators, who should have access to them, were geographically separated from them.
In other words, the goal was to provide access to diverse computational resources, i.e. host computers. Using it necessarily involved naming these hosts, and by 1985 the Domain Name System (DNS) was developed to provide a uniform way of translating between the numeric IP addresses for these hosts, and their textual names.

In 1989, Tim Berners-Lee's goal for the World-Wide Web was to provide a uniform way to name, and thus access, content, not computers. But he had to work within the structure of the Internet and its DNS, so the Uniform Resource Locator (URL) named content with a two-part structure:
  • the computer that held it, and
  • a name for the content within that computer's name space.
Naming content via the two-part URL greatly lowered the barrier to entry for the Web, and allowed rapid adoption. But it introduced performance issues, because each access meant fetching the content anew from the computer named in the URL.
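To make the two-part structure concrete, here is a minimal Python sketch, using a hypothetical URL, that splits a URL into the two names involved:

    # A minimal sketch of the URL's two-part naming: a host name that DNS
    # resolves to a computer, plus a name within that computer's name space.
    # The URL is hypothetical.
    from urllib.parse import urlparse

    url = "https://publisher.example/articles/web-packaging.html"
    parts = urlparse(url)
    print("computer holding the content:", parts.netloc)          # publisher.example
    print("name within that computer's name space:", parts.path)  # /articles/web-packaging.html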

So browsers implemented a cache, holding copies of recently accessed content. If needed, it could be re-rendered from the copy. In a sense, the browser was now lying about the source of the content, claiming that it came from the DNS name in the URL, when it actually came from the cache. It was a fairly harmless lie, because the browser actually did get the content from the original source, just not when the current access happened. It could still believe the content was authentic based on the source.

If the content were static, the lie about the source was harmless. But if it were dynamic, the lie could be harmful. The content at the source could have changed since the access that created the cached copy. This led to cache expiration times and the If-Modified-Since header on HTTP requests.
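As a concrete illustration, here is a minimal sketch of that revalidation mechanism in Python, using the third-party requests library and a hypothetical URL; a real cache would store the Last-Modified value alongside its copy:

    # A minimal sketch of cache revalidation with If-Modified-Since.
    # The URL is hypothetical; a server that sends no Last-Modified header
    # cannot be revalidated this way.
    import requests

    url = "https://publisher.example/static/logo.png"

    # First fetch: the origin returns the content plus a Last-Modified
    # timestamp, which the cache keeps alongside its copy.
    first = requests.get(url)
    last_modified = first.headers.get("Last-Modified")

    # Later, the cache asks whether its copy is still current. A 304 Not
    # Modified reply means the copy can be re-used; a 200 carries new content.
    second = requests.get(url, headers={"If-Modified-Since": last_modified})
    if second.status_code == 304:
        print("cached copy is still valid; re-render from cache")
    else:
        print("content changed; replace the cached copy")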

But the browser cache was the start of a slippery slope. The next step was for the infrastructure to implement caching, but that involved really lying to the browser about the source of the content. It came from the infrastructure cache, such as a Content Distribution Network (CDN) node. The browser had to trust the infrastructure cache not to misrepresent the putative source's content.

Obviously, the further in network terms the copy being accessed is from the browser making the request, the slower the access. As the Internet expanded, this effect increased. There are two ways to mitigate it:
  • Switch to location-independent naming.
  • Retain naming by location, but implement mechanisms for disguising the actual location as the named location.
In 2013's Moving vs. Copying I discussed efforts to replace the host-naming of TCP/IP with content naming, such as the work of Van Jacobson and Content-Centric Networking (CCN). These efforts envisage a network containing potentially many copies of given content, any of which can respond to a request naming that content. Since the fundamental communication paradigm implements caching, it "just works". Alas, as we see with the glacial deployment of IPv6, the infrastructure of the Internet is highly resistant to architectural change, so despite their many theoretical advantages, CCN and related systems haven't achieved scale.

Because it layered on top of the existing Web infrastructure rather than trying to replace it, Bram Cohen's BitTorrent did achieve scale. By implementing naming-by-content, it allowed clients to download in parallel from copies of the named content at multiple locations, thereby increasing performance. Of course, because it layered on the existing IP infrastructure, "under the hood" it used DNS and host naming, thus revealing the location of copies and enabling an industry of copyright trolling.
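The core idea is easy to sketch: derive the name from the content itself, so a copy obtained from any peer can be checked against the name without trusting the peer (BitTorrent does this with the SHA-1 "infohash" of a torrent's metadata). A minimal Python sketch of the general pattern:

    # A minimal sketch of naming-by-content: the name is a hash of the bytes,
    # so a copy fetched from any location can be verified against its name.
    import hashlib

    def content_name(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    published = b"some published content"
    name = content_name(published)

    # A copy retrieved from an arbitrary, untrusted peer is authentic
    # if and only if it hashes to the same name.
    copy_from_peer = b"some published content"
    assert content_name(copy_from_peer) == name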

In 2015 Ben Thompson published Aggregation Theory, explaining how the advent of the Internet's cost-free distribution gave huge advantages to aggregators:
no longer do distributors compete based upon exclusive supplier relationships, with consumers/users an afterthought. Instead, suppliers can be commoditized leaving consumers/users as a first order priority. By extension, this means that the most important factor determining success is the user experience: the best distributors/aggregators/market-makers win by providing the best experience, which earns them the most consumers/users, which attracts the most suppliers, which enhances the user experience in a virtuous cycle.

The result is the shift in value predicted by the Conservation of Attractive Profits. Previous incumbents, such as newspapers, book publishers, networks, taxi companies, and hoteliers, all of whom integrated backwards, lose value in favor of aggregators who aggregate modularized suppliers — which they often don’t pay for — to consumers/users with whom they have an exclusive relationship at scale.
Thompson followed in 2017 with Defining Aggregators, in which he identified the three key attributes of Internet aggregators:
  • Direct Relationship with Users
  • Zero Marginal Costs For Serving Users
  • Demand-driven Multi-sided Networks with Decreasing Acquisition Costs
Note two things:
  1. Web Packaging is a response to:
    In recent years, a number of proprietary formats have been defined to enable aggregators of news and other articles to republish Web resources; for example, Google’s AMP, Facebook’s Instant Articles, Baidu’s MIP, and Apple’s News Format.
    All these "proprietary formats" are the work of aggregators.

  2. For aggregators:
    the most important factor determining success is the user experience
Thus the motivation behind the push for Web Packaging and its predecessor "proprietary formats" is the critical need for aggregators to be in control of their user experience (UX). Two of the most important aspects of their UX are response time and, even more so, the long tail of response time. For example, in its early days Google had two big advantages over Alta Vista, the incumbent search engine. The first was Page Rank, which meant Google's results were noticeably better. I explained the second in Krste Asanović Keynote at FAST14:
Low tail latency is one major reason why Google beat Alta Vista to become the dominant search engine. Alta Vista's centralized architecture often delivered results faster than Google, but sometimes was much slower. Google's front-ends fanned the query out to their distributed architecture, waited a fixed time to collect results, and delivered whatever results they had at that predictable time. The proportion of Google's searches that took noticeably longer was insignificant. Alta Vista's architecture meant that a failure caused a delay, Google's architecture meant that a failure caused the search result to be slightly worse. Even a small proportion of noticeable delays is perceived as a much worse user interface.
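Here is a minimal sketch of that fan-out-with-deadline pattern using Python's asyncio, with simulated shard latencies standing in for real backend calls:

    # A minimal sketch of fan-out with a fixed deadline: query every shard,
    # wait a fixed time, and return whatever results have arrived. A slow or
    # failed shard makes the result slightly worse instead of slower.
    import asyncio
    import random

    async def query_shard(shard_id: int, query: str) -> str:
        await asyncio.sleep(random.choice([0.01, 0.02, 5.0]))  # occasional straggler
        return f"results for {query!r} from shard {shard_id}"

    async def search(query: str, shards: int = 20, deadline: float = 0.1) -> list:
        tasks = [asyncio.create_task(query_shard(i, query)) for i in range(shards)]
        done, pending = await asyncio.wait(tasks, timeout=deadline)
        for task in pending:
            task.cancel()
        return [task.result() for task in done]

    print(asyncio.run(search("web packaging")))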
The canonical paper on long tail response is The Tail At Scale by Dean and Barroso, which has the following key insights:
  • even rare performance hiccups affect a significant fraction of all requests in large-scale distributed systems.
  • eliminating all sources of latency variability in large-scale systems is impractical, especially in shared environments.
  • using an approach analogous to fault-tolerant computing, tail-tolerant software techniques form a predictable whole out of less-predictable parts.
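The first insight is just arithmetic: when a single user request fans out to many backends, even a rare hiccup on any one of them becomes a common event for the request as a whole. The paper's example, reproduced as a quick Python calculation:

    # If each backend is slow on 1% of requests and a user request touches
    # 100 backends, the chance that at least one backend is slow is
    # 1 - 0.99**100, roughly 63% (the example given in the paper).
    p_slow, fanout = 0.01, 100
    p_request_slow = 1 - (1 - p_slow) ** fanout
    print(f"{p_request_slow:.0%} of requests hit at least one slow backend")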
Consider an architecture in which the aggregator simply forwards the user's request to the originator of the content. The user experience is controlled not merely by the originator, but also by all the other resources (JavaScript libraries, advertisements, trackers, fonts, ...) that the originator chooses to include. Each individual host supplying a resource to the originator's page can cause both a fixed delay from the overhead of setting up a connection, and a variable delay caused, for example, by varying load on that host. The aggregator has very little control over the resulting UX.

Ideally, the aggregator would like the originator to deliver it a prepackaged bundle of the content and all the resources it needs. In that way the aggregator would be in complete control of the UX. Aggregators also want to lie about the source of the content, but in the opposite direction to CDNs. They want the reader to believe that the content comes from the aggregator, not from the source being aggregated.

This is the goal towards which Web Packaging and its predecessors are pointing. The client would issue a single request to a single server, thereby minimizing the overhead of setting up, maintaining and tearing down connections. The response it would receive would contain everything needed to render the page.
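A minimal sketch of what that single-request model looks like from the client side, in Python with the requests library; the publisher URL is hypothetical, and the Accept values assume the content types used by the Signed Exchange and Web Bundle drafts (application/signed-exchange;v=b3 and application/webbundle):

    # A minimal sketch, not the specified loading algorithm: ask a
    # (hypothetical) publisher for a packaged response instead of plain HTML.
    import requests

    url = "https://publisher.example/article.html"  # hypothetical
    accept = "application/webbundle,application/signed-exchange;v=b3,text/html;q=0.9"

    response = requests.get(url, headers={"Accept": accept})
    content_type = response.headers.get("Content-Type", "")

    if content_type.startswith(("application/webbundle", "application/signed-exchange")):
        print("got one packaged response:", len(response.content), "bytes")
    else:
        print("ordinary HTML; each sub-resource needs its own request")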

Were it to be achieved, Web archiving would be trivial. The archive would request the bundle from the aggregator, and could subsequently replay it in the knowledge that it contained everything needed for rendering. Supporting Web Archiving via Web Packaging by Sawood Alam et al. goes into the details:
Web Packaging is an emerging standard [30] that enables content aggregators and distributors to deliver related groups of resources from various origins to user-agents in the form of a package on behalf of publishers. It replaces a prior work called Packaging on the Web [23]. Currently, its specification is split in three different modular layers, namely Signing [29], Bundling [27], and Loading [28].
Alam et al. point out a number of inadequacies in the current proposal from the perspective of archiving, most importantly the lack of a Memento-like time negotiation framework. But the key fact is that its usefulness to Web archiving depends entirely on whether publishers use it, and if they do, the extent to which their bundles include all, as opposed to only some, of the resources needed to render their content.

Unfortunately, there are a number of downsides to Web Packaging that, in my view, will severely limit its adoption and, even were it fairly widely adopted, would greatly reduce its application to Web archiving:
  • While Web Packaging's bundling can greatly reduce the overhead of multiple connections, it does so at the cost of wasting bandwidth by redundantly sending resources that the bundle cannot know are already in the browser's cache, and/or by sending all variants of a resource that would otherwise have been the subject of content negotiation. Mozilla's position paper notes:
    Web packaging depends on a view of content that is relatively static. Ideally, a single page can be represented by a single static bundle of content. For a completely offline case, this is a hard requirement, the package has to be ready for use by all potential audiences.
    In a world where bandwidth was infinite and free, this would not matter, but most browsers don't live in such a world.
  • The whole point of Web Packaging is to avoid browsers requesting content directly from the originating publisher. Thus the publisher has to create a bundle containing content it signed before it knew anything about the requesting browser and its context. Mozilla's position paper outlines the restrictions this places upon publishers, including importantly:
    Sites routinely customize content to individuals. Many sites produce different content based on the browser used, the form factor of the device, the estimated location of the requester, and things like pixel density of the screen. It’s possible to build packages with all possible variations of content and to use browser-based mechanisms like CSS media queries, the <picture> element, or script to manage selecting the right variants of content. However, this could increase the size of the package, which might reduce or reverse any potential performance benefit.

    The extent to which a single package can support personalization will depend on the quantity and variety that can be bundled into the package without causing it to expand too much in size. If the goal is to minimize package size for performance reasons, then the degree of variation that a bundle can support will be limited. More expansive personalization might require the use of multiple bundles. However, using multiple bundles limits the scope of client-side content adaptation techniques in negative ways.
    So Web Packaging requires publishers to make difficult technical decisions whose benefit accrues to the aggregators. Mozilla's position paper has many more examples of such tradeoffs.
  • One major restriction that Web Packaging places on the resources that can be signed and included in bundles is that they cannot safely use private information such as cookies or user storage. This greatly impairs the typical publisher's user experience, and makes it unlikely that bundles will in practice contain all the resources needed for rendering.
  • While the Web's current model for authenticating content, by the HTTPS connection it arrived on, is imperfect, it has at least been battle-tested. Web Packaging is a completely different model. In theory, authentication by signing is significantly better, but it is also imperfect. Malfeasance by certificate authorities and the theft of keys are both common failure modes. Although the authors have been careful to consider potential attacks, the Web Packaging specification is full of requirements on both browsers and servers which, unless implemented precisely, pose significant risks.
However illusory its benefits may be, almost the entire business model of Web publishing depends upon exploiting detailed knowledge about a particular request to construct a personalized response to it, including elements such as feeds, advertisements, watermarks, DRM, and so on. The goal of Web Packaging and its aggregator-generated predecessors is to transfer the ability to do so from the original publisher to the aggregator. Experience, such as the fiasco of Facebook's "pivot to video", should show publishers that increasing their dependence on aggregators in this way is a recipe for disaster.

1 comment:

David. said...

Dean and Barroso's The Tail At Scale explained the importance of low latency, as opposed to high throughput, for the user experience. Jim Salter's Apple’s M1 is a fast CPU—but M1 Macs feel even faster due to QoS provides another example:

"Apple's M1 processor is a world-class desktop and laptop processor—but when it comes to general-purpose end-user systems, there's something even better than being fast. We're referring, of course, to feeling fast—which has more to do with a system meeting user expectations predictably and reliably than it does with raw speed.

Howard Oakley—author of several Mac-native utilities such as Cormorant, Spundle, and Stibium—did some digging to find out why his M1 Mac felt faster than Intel Macs did, and he concluded that the answer is QoS. If you're not familiar with the term, it's short for Quality of Service—and it's all about task scheduling."