It’s been hard to make a living as a journalist in the 21st century, but it’s gotten easier over the last few years, as we’ve settled on the world’s newest and most lucrative business model: invasive surveillance. News site webpages track you on behalf of dozens of companies: ad firms, social media services, data resellers, analytics firms; we use, and are used by, them all.
Georgios Kontaxis and Monica Chew won "Best Paper" at the recent Web 2.0 Security and Privacy workshop for Tracking Protection in Firefox for Privacy and Performance (PDF). They demonstrated that Tracking Protection provided:
I did not do this. Instead, over the years, I only enabled others to do it, as some small salve to my conscience. In fact, I made a career out of explaining surveillance and security, what the net was doing and how, but on platforms that were violating my readers as far as technically possible.
We can become wizards in our own right, a world of wizards, not subject to the old powers that control us now. But it’s going to take a lot of work. We’re all going to have to learn a lot — the journalists, the readers, the next generation. Then we’re going to have to push back on the people who watch us and try to control who we are.
a 67.5% reduction in the number of HTTP cookies set during a crawl of the Alexa top 200 news sites. [and] a 44% median reduction in page load time and 39% reduction in data usage in the Alexa top 200 news sites.
Below the fold, some details and implications for preservation:
Firefox's Tracking Protection uses a blacklist of tracking-related sites and prevents the browser loading content from them:
Tracking Protection blocks at least one unsafe element on 99% of the sites tested. In addition, Tracking Protection blocks 11 tracking elements in 50% of the sites and, in an extreme case, 150 tracking elements.
This blocking has a big impact:
She further argues that advertising “does not make content free” but “merely externalizes the costs in a way that incentivizes malicious or incompetent players.” She cites unsafe plugins and malware examples that result in sites requiring more resources to load, which in turn translates to costs in bandwidth, power, and stability. “It will take a major force to disrupt this ecosystem and motivate alternative revenue models,” she added. “I hope that Mozilla can be that force.”
But the Kontaxis and Chew paper adds yet more reasons why future scholars won't get to study the ads in Web archives:
- Getting the ads is even more expensive than we thought, in both bandwidth (50% more data to fetch from weather.com) and time (30% more for weather.com). The cost of executing the page is already problematic, and numbers like these make it worse.
- Most of this bandwidth and time is not actually about delivering an ad, it is about collecting data on the user. So it is irrelevant to what the crawler is trying to achieve; it is just a waste of resources.
- More importantly, an archive crawler will not have the tracking history that a real user would, so the ads the crawler sees will be generic, not targeted. They won't in any sense be representative of what any real person would have seen at that time, since the real person would have a tracking history and so would see ads targeted at them specifically.
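The blacklist check described above for Tracking Protection can be sketched in a few lines of Python. This is only an illustration of the general idea, not Firefox's implementation: the domain names, `BLOCKED_DOMAINS`, `is_blocked` and `filter_requests` are all hypothetical, and a real list would be far larger. It assumes a request is blocked when its host is a listed domain or a subdomain of one.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries, for illustration only.
BLOCKED_DOMAINS = frozenset({"tracker.example", "ads.example"})


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # Test the full host, then each parent domain in turn:
    # cdn.tracker.example -> tracker.example -> example
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


def filter_requests(urls, blocked=BLOCKED_DOMAINS):
    """Split a page's resource URLs into those to fetch and those to skip."""
    fetch, skipped = [], []
    for url in urls:
        (skipped if is_blocked(url, blocked) else fetch).append(url)
    return fetch, skipped


if __name__ == "__main__":
    page_resources = [
        "https://news.example/story.html",
        "https://cdn.tracker.example/pixel.gif",
        "https://ads.example/banner.js",
    ]
    fetch, skipped = filter_requests(page_resources)
    print("fetch:", fetch)
    print("skipped:", skipped)
```

A crawler applying a filter like this would avoid the tracking traffic, but it would also, by construction, not collect whatever the blocked hosts serve.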
In fact, while many treat online social spaces like the proverbial town square, they are actually more like shopping malls, privately owned and authorized to restrict content however they deem appropriate.
The country the crawler is in (or that the net thinks it is in) can affect what gets preserved in another way. National firewalls such as the Great Firewall of China and the Great Firewall of Cameron in the UK will either prevent content from being preserved or, if the crawler can opt out as it can in the UK, result in the preservation of content that the majority of the public could not see. And, in order to be preserved, content has to be found by the crawler following links to it. This can be difficult if, as we see right now with a site organizing resistance to "Fast Track" authority, all the major social network sites ban posting its URLs.
This is one more example of the increasing fundamental difficulty of preserving the Web, because as time goes by the Web I experience is less and less like the Web you experience. What does it mean to "preserve" something that is different for each person, and different each time each person looks at it?