Tuesday, June 9, 2015

Preserving the Ads?

Quinn Norton writes in The Hypocrisy of the Internet Journalist:
It’s been hard to make a living as a journalist in the 21st century, but it’s gotten easier over the last few years, as we’ve settled on the world’s newest and most lucrative business model: invasive surveillance. News site webpages track you on behalf of dozens of companies: ad firms, social media services, data resellers, analytics firms — we use, and are used by, them all.
...
I did not do this. Instead, over the years, I only enabled others to do it, as some small salve to my conscience. In fact, I made a career out of explaining surveillance and security, what the net was doing and how, but on platforms that were violating my readers as far as technically possible.
...
We can become wizards in our own right, a world of wizards, not subject to the old powers that control us now. But it’s going to take a lot of work. We’re all going to have to learn a lot — the journalists, the readers, the next generation. Then we’re going to have to push back on the people who watch us and try to control who we are.
Georgios Kontaxis and Monica Chew won "Best Paper" at the recent Web 2.0 Security and Privacy workshop for Tracking Protection in Firefox for Privacy and Performance (PDF). They demonstrated that Tracking Protection provided:
a 67.5% reduction in the number of HTTP cookies set during a crawl of the Alexa top 200 news sites. [and] a 44% median reduction in page load time and 39% reduction in data usage in the Alexa top 200 news sites.
Below the fold, some details and implications for preservation:

Firefox's Tracking Protection uses a blacklist of tracking-related sites and prevents the browser loading content from them:
Tracking Protection blocks at least one unsafe element on 99% of the sites tested. In addition, Tracking Protection blocks 11 tracking elements in 50% of the sites and, in an extreme case, 150 tracking elements.
This blocking has a big impact:
As an example, www.weather.com loads in 3.5 seconds with Tracking Protection versus 6.3 seconds without and results in data usage of 2.8 MB (98 HTTP requests) versus 4.3 MB (219 HTTP requests), respectively. Even though Tracking Protection prevents initial requests for only 4 HTML script elements, without Tracking Protection, an additional 45 domains are contacted. Of the additional resources downloaded without Tracking Protection enabled, 57% are JavaScript (as identified by the content-type HTTP header) and 27% are images. The largest elements appear to be JavaScript libraries with advertisement-related names, each on the order of 10 or 100 KB. Even though client-side caching can alleviate data usage, we observe high-entropy GET parameters that will cause the browser to fetch them each time.
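To make the mechanism concrete, here is a minimal sketch of the kind of blocklist-based request filtering Tracking Protection performs. It is not Mozilla's implementation; the blocklist entries and URLs below are hypothetical, and the real list Firefox uses (derived from Disconnect's) is far larger and matches registered domains with category metadata:

from urllib.parse import urlparse

# Hypothetical blocklist entries for illustration only.
TRACKING_BLOCKLIST = {
    "tracker.example",
    "metrics.example",
    "ads.example",
}

def is_blocked(url):
    """True if the URL's host, or any parent domain of it, is on the blocklist."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return any(".".join(parts[i:]) in TRACKING_BLOCKLIST for i in range(len(parts)))

def filter_requests(urls):
    """Split a page's resource URLs into those to fetch and those to block."""
    allowed = [u for u in urls if not is_blocked(u)]
    blocked = [u for u in urls if is_blocked(u)]
    return allowed, blocked

if __name__ == "__main__":
    resources = [
        "https://www.weather.com/style.css",
        "https://ads.example/pixel.gif?cb=8f3a9c",   # high-entropy cache-busting parameter
        "https://metrics.example/beacon.js",
    ]
    allowed, blocked = filter_requests(resources)
    print("fetching %d, blocking %d" % (len(allowed), len(blocked)))

Skipping requests to listed hosts before they are issued is what produces the savings the paper measures; the cb= parameter on the blocked pixel is the kind of high-entropy GET parameter the authors note defeats client-side caching.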
Chew was interviewed by Emil Protalinski at VentureBeat:
“I believe that Mozilla can make progress in privacy, but leadership needs to recognize that current advertising practices that enable ‘free’ content are in direct conflict with security, privacy, stability, and performance concerns — and that Firefox is first and foremost a user-agent, not an industry-agent.”

She further argues that advertising “does not make content free” but “merely externalizes the costs in a way that incentivizes malicious or incompetent players.” She cites unsafe plugins and malware examples that result in sites requiring more resources to load, which in turn translates to costs in bandwidth, power, and stability. “It will take a major force to disrupt this ecosystem and motivate alternative revenue models,” she added. “I hope that Mozilla can be that force.”
I've argued for a long time that just because a part of a page is an ad doesn't mean it isn't interesting to scholars. But web archives typically don't collect the ads, so scholars won't get to study them. There are good reasons for this. Because the ads are inserted dynamically by Javascript, crawlers need to execute the page to get them. Saving and replaying the Javascript is either not going to work, or not get ads that were current at the time of the page, so saving a screengrab is the only practical way to record what the reader saw.

But the Kontaxis and Chew paper adds yet more reasons why future scholars won't get to study the ads in Web archives:
  • Getting the ads is even more expensive than we thought, in both bandwidth (about 50% more data to fetch from weather.com) and time (80% more for weather.com). The cost of executing the page is already problematic, and numbers like these make it worse; the sketch after this list shows one way to measure this overhead from a crawl capture.
  • Most of this bandwidth and time is not actually about delivering an ad, it is about collecting data on the user. So it is irrelevant to what the crawler is trying to achieve; it is just a waste of resources.
  • More importantly, an archive crawler will not have the tracking history that a real user would, so the ads the crawler sees will be generic, not targeted. They won't in any sense be representative of what any real person would have seen at that time, since the real person would have a tracking history and so would see ads targeted at them specifically.
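As a rough illustration of the first two points, a capture of a page load (here a HAR file; a WARC would work similarly) can be audited to see how many requests and bytes went to hosts on a tracking blocklist. The blocklist is again a hypothetical sample and the file name is a placeholder:

import json
from urllib.parse import urlparse

TRACKING_BLOCKLIST = {"tracker.example", "metrics.example", "ads.example"}  # hypothetical

def tracking_overhead(har_path):
    """Tally requests and bytes that went to blocklisted hosts in a HAR capture."""
    with open(har_path, encoding="utf-8") as f:
        entries = json.load(f)["log"]["entries"]
    counts = {"tracking": [0, 0], "content": [0, 0]}   # [requests, bytes]
    for entry in entries:
        host = urlparse(entry["request"]["url"]).hostname or ""
        size = max(entry["response"].get("bodySize", 0), 0)
        kind = "tracking" if any(host == d or host.endswith("." + d)
                                 for d in TRACKING_BLOCKLIST) else "content"
        counts[kind][0] += 1
        counts[kind][1] += size
    return counts

if __name__ == "__main__":
    for kind, (reqs, size) in tracking_overhead("weather.com.har").items():
        print("%s: %d requests, %.1f MB" % (kind, reqs, size / 1e6))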
Of course, it isn't just the ads that the crawler sees that are different from what actual people see. In The Myth of a Borderless Internet Jillian C. York discusses the effect of geolocation on the content your browser sees:
In fact, while many treat online social spaces like the proverbial town square, they are actually more like shopping malls, privately owned and authorized to restrict content however they deem appropriate.
The country the crawler is in (or that the net thinks it is in) can affect what gets preserved in another way. National firewalls such as the Great Firewall of China and the Great Firewall of Cameron in the UK will either prevent content from being preserved or, if the crawler can opt out as it can in the UK, result in the preservation of content that the majority of the public could not see. And, in order to be preserved, content has to be found by the crawler following links to it. This can be difficult if, as we see right now with a site organizing resistance to "Fast Track" authority, all the major social network sites ban posting its URLs.
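One way an archive could at least detect this kind of variation is to fetch the same URL through exit points in different countries and compare what comes back. The sketch below is only a gesture in that direction: the proxy endpoints are placeholders, and a byte-for-byte comparison is naive because pages also vary from request to request for other reasons:

import hashlib
import requests

# Placeholder per-country exits; a real crawl would use actual proxies or VPN endpoints.
EXITS = {
    "US": "http://us-proxy.example:8080",
    "UK": "http://uk-proxy.example:8080",
    "CN": "http://cn-proxy.example:8080",
}

def fetch_digest(url, proxy):
    """Fetch the URL through the given proxy and return a hash of the body."""
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    return hashlib.sha256(resp.content).hexdigest()

def compare_views(url):
    digests = {country: fetch_digest(url, proxy) for country, proxy in EXITS.items()}
    if len(set(digests.values())) > 1:
        print(url, "differs by location:", digests)
    else:
        print(url, "looks the same from every exit")

if __name__ == "__main__":
    compare_views("https://example.com/")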

This is one more example of the increasing fundamental difficulty of preserving the Web, because as time goes by the Web I experience is less and less like the Web you experience. What does it mean to "preserve" something that is different for each person, and different each time each person looks at it?

3 comments:

David. said...

Cory Doctorow points to "another of [Maciej Cegłowski's] barn-burning speeches". It is entitled What Happens Next Will Amaze You and it is a must-read exploration of the ecosystem of the Web and its business model of pervasive surveillance:

"State surveillance is driven by fear.

And corporate surveillance is driven by money.

The two kinds of surveillance are as intimately connected as tango partners. They move in lockstep, eyes rapt, searching each other's souls. The information they collect is complementary. By defending its own worst practices, each side enables the other."

David. said...

Klint Finley at Wired writes I Turned Off JavaScript For A Week And It Was Glorious:

"Earlier this month I resolved to join their ranks, at least for one week, and see what life was like without JavaScript. By the end of the week, I dreaded going back to the messy modern web."

Go ye and do likewise.

David. said...

Via The web is Doom, which points to its author's prediction:

"In July 2015 I suggested that the average web page weight would equal that of the Doom install image in about 7 months time."

coming true slightly late. The average Web page is now 2.3MB, the size of DOOM for DOS. There is a small ray of hope: the top 10 Alexa sites are getting slightly smaller.