DSHR's Blog: Open Access and Surveillance

Tuesday, November 15, 2016

Open Access and Surveillance

Recent events have greatly increased concerns about privacy online. Spencer Ackerman and Ewen MacAskill report for The Guardian that during the campaign Donald Trump said:

“I wish I had that power,” ... while talking about the hack of Democratic National Committee emails. “Man, that would be power.”

and that Snowden's ACLU lawyer, Ben Wizner said:

“I think many Americans are waking up to the fact we have created a presidency that is too powerful.”

Below the fold, some thoughts on online surveillance and how it relates to the Open Access movement.

Governments Are Surveilling You Online

Glenn Greenwald's Three New Scandals Show How Pervasive and Dangerous Mass Surveillance is in the West, Vindicating Snowden underlines the dangers of government's surveillance of everything everyone does online by pointing out that:

Earlier this month, a special British court that rules on secret spying activities issued an emphatic denunciation of the nation's domestic mass surveillance programs. The court found that "British security agencies have secretly and unlawfully collected massive volumes of confidential personal data, including financial information, on citizens for more than a decade." Those agencies, the court found, "operated an illegal regime to collect vast amounts of communications data, tracking individual phone and web use and other confidential personal information, without adequate safeguards or supervision for 17 years."

and that:

On Thursday, an even more scathing condemnation of mass surveillance was issued by the Federal Court of Canada. The ruling "faulted Canada's domestic spy agency for unlawfully retaining data and for not being truthful with judges who authorize its intelligence programs." Most remarkable was that these domestic, mass surveillance activities were not only illegal, but completely unknown to virtually the entire population in Canadian democracy, even though their scope has indescribable implications for core liberties.

and that:

law enforcement officials in Montreal are now defending "a highly controversial decision to spy on a La Presse columnist [Patrick Lagacé] by tracking his cellphone calls and texts and monitoring his whereabouts as part of a necessary internal police investigation." The targeted journalist, Lagacé, had enraged police officials by investigating their abusive conduct, and they then used surveillance technology to track his calls and movements to unearth the identity of his sources. Just as that scandal was exploding, it went, in the words of the Montreal Gazette, "from bad to worse" as the ensuing scrutiny revealed that police had actually "tracked the calls and movements of six journalists that year after news reports based on leaks revealed Michel Arsenault, then president of Quebec's largest labour federation, had his phone tapped."

In the wake of Snowden's revelations everyone should assume that governments (not just their own), whether or not those governments have legal authority to do so, are tracking everything they do online.

Companies Are Surveilling You Online

Of course, everyone should also assume that, whether or not they gave permission, corporations (not just the ones to whose end-user license agreement they consented) are also tracking everything they do online. As usual, Maciej Cegłowski describes the situation aptly:

We're used to talking about the private and public sector in the real economy, but in the surveillance economy this boundary doesn't exist. Much of the day-to-day work of surveillance is done by telecommunications firms, which have a close relationship with government. The techniques and software of surveillance are freely shared between practitioners on both sides. All of the major players in the surveillance economy cooperate with their own country's intelligence agencies, and are spied on (very effectively) by all the others.

and:

Just like industrialized manufacturing changed the relationship between labor and capital, surveillance capitalism is changing the relationship between private citizens and the entities doing the tracking. Our old ideas about individual privacy and consent no longer hold in a world where personal data is harvested on an industrial scale.

Steven Englehardt and Arvind Narayanan's Online tracking: A 1-million-site measurement and analysis (also here) presents:

the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between different sites ("cookie syncing"). Our findings include multiple sophisticated fingerprinting techniques never before measured in the wild.

Englehardt and Narayanan's goal is to:

transform web privacy measurement into a widespread practice by creating a tool that is useful not just to our colleagues but also to regulators, self-regulators, the press, activists, and website operators, who are often in the dark about third-party tracking on their own domains. We also seek to lessen the burden of continual oversight of web tracking and privacy, by developing a robust and modular platform for repeated studies.


Note the comment about the website operator's ignorance of third-party tracking activities. Maciej Cegłowski's What Happens Next Will Amaze You illustrated the Byzantine complexity of the advertising and tracking ecosystem. Clearly, a web site owner cannot know every detail of the dynamically changing business relationships behind the ads and trackers on their site. Nor can they know where the data the ads and trackers are collecting flow through these relationships. Data, once collected by one participant in the ecosystem, can be sold, traded and exchanged with others with no notification to the web site owner. The web site owner has neither knowledge of nor control over the information being collected from their readers.

Although the ecosystem is very complex, it is subject to very strong increasing returns to scale. The benefit to a government or an advertiser of a panopticon tracking everyone is vastly greater than one tracking 1 in 10. The result, as Englehardt and Narayanan found, is consolidation:

Overall, our results show cause for concern, but also encouraging signs. In particular, several of our results suggest that while online tracking presents few barriers to entry, trackers in the tail of the distribution are found on very few sites and are far less likely to be encountered by the average user. Those at the head of the distribution, on the other hand, are owned by relatively few companies and are responsive to the scrutiny resulting from privacy studies.

In fact, the consolidation they found is astonishing:

Our large scale allows us to answer a rather basic question: how many third parties are there? In short, a lot: the total number of third parties present on at least two first parties is over 81,000.

What is more surprising is that the prevalence of third parties quickly drops off: only 123 of these 81,000 are present on more than 1% of sites. This suggests that the number of third parties that a regular user will encounter on a daily basis is relatively small. The effect is accentuated when we consider that different third parties may be owned by the same entity. All of the top 5 third parties, as well as 12 of the top 20, are Google-owned domains. In fact, Google, Facebook, Twitter and AdNexus are the only third-party entities present on more than 10% of sites.

So there are really only four primary commercial panopticons, but by cooperating with them many other, smaller panopticons can track effectively.
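To make the prevalence numbers concrete, here is a minimal sketch of how such a measurement could be computed from crawl output. The `crawl` mapping and all the `.example` domains are invented for illustration; this is not Englehardt and Narayanan's tooling, just the arithmetic behind statements like "present on more than 1% of sites".

```python
from collections import Counter

# Hypothetical crawl output: first-party site -> third-party domains it loaded.
crawl = {
    "news.example": {"tracker-a.example", "tracker-b.example"},
    "shop.example": {"tracker-a.example"},
    "blog.example": {"tracker-a.example", "tracker-c.example"},
    "wiki.example": set(),
}


def prevalence(crawl):
    """Return {third_party: fraction of first-party sites it appears on}."""
    counts = Counter(tp for third_parties in crawl.values() for tp in third_parties)
    return {tp: n / len(crawl) for tp, n in counts.items()}


def above_threshold(crawl, threshold):
    """Third parties present on more than `threshold` of sites, most prevalent first."""
    prev = prevalence(crawl)
    return sorted(
        (tp for tp, p in prev.items() if p > threshold),
        key=lambda tp: (-prev[tp], tp),
    )
```

With the toy data, `above_threshold(crawl, 0.5)` returns only `tracker-a.example`, which is the same long-tail shape the paper reports at much larger scale.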

How Are You Being Surveilled?

The four big panopticons, and the smaller ones, use two classes of instrumentation to track you: cookies and fingerprinting.

Cookies

The Same-Origin Policy for cookies is intended to prevent cookies from being read by domains that didn't set them:

A page can set a cookie for its own domain or any parent domain, as long as the parent domain is not a public suffix. ... The browser will make a cookie available to the given domain including any sub-domains, no matter which protocol (http/https) or port is used. ... When you read a cookie, you cannot see from where it was set.

The intent was that cookies not be shared between domains. Why would the panopticons want to share cookies between domains? The goal of a tracker is to have the browser tell it the identity of the reader whatever page is being read. If cookies could be shared across domains, the tracker would set a cookie the first time, and read it from each other page. But the Same-Origin Policy means other pages won't return the cookie set by the first page, hence the panopticons' need to work around the policy and provide a page-independent ID for the user.
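The scoping rule quoted above can be sketched in a few lines. This is a simplified illustration of domain matching only: it assumes the cookie was set with an explicit domain, and it ignores the public suffix check, paths, and the Secure flag. The function name and the example hosts are mine.

```python
def domain_matches(request_host, cookie_domain):
    """Simplified check: is a cookie scoped to `cookie_domain` sent to `request_host`?

    The cookie goes to the domain itself and to any of its sub-domains,
    but never to an unrelated domain.
    """
    request_host = request_host.lower().rstrip(".")
    cookie_domain = cookie_domain.lower().lstrip(".")
    return request_host == cookie_domain or request_host.endswith("." + cookie_domain)
```

So a cookie scoped to `example.com` is returned to `www.example.com` but not to `tracker.example.net`, which is exactly the barrier the trackers need to get around.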

The result is cookie syncing:

Cookie syncing, a workaround to the Same-Origin Policy, allows different trackers to share user identifiers with each other. Besides being hard to detect, cookie syncing enables back-end server-to-server data merges hidden from public view, which makes it a privacy concern.

Cookie syncing works in this way:

If tracker A wants to share its ID for a user with tracker B, it can do so in one of two ways: embedding the ID in the request URL to tracker B, or in the referer URL.
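A rough sketch of how one might spot this kind of sharing in a log of browser requests: look for a tracker's cookie value turning up in the URL or referer of a request to some other domain. The function name, the data layout and the minimum-length cut-off are assumptions for illustration, not the paper's detection method.

```python
from urllib.parse import urlsplit


def find_synced_ids(cookie_ids, requests, min_length=8):
    """Return (owner, recipient_host, channel) for each apparent ID leak.

    cookie_ids: {owner_domain: id_value} taken from the browser's cookie jar.
    requests:   list of (url, referer) pairs; referer may be None.
    Short values are skipped because they match by accident too easily.
    """
    leaks = []
    for owner, value in cookie_ids.items():
        if len(value) < min_length:
            continue
        for url, referer in requests:
            host = urlsplit(url).hostname or ""
            if host == owner or host.endswith("." + owner):
                continue  # The owner seeing its own ID is not syncing.
            if value in url:
                leaks.append((owner, host, "url"))
            elif referer and value in referer:
                leaks.append((owner, host, "referer"))
    return leaks


# Hypothetical example data.
cookie_ids = {"tracker-a.example": "u1d-9f8e7d6c5b4a"}
requests = [
    ("https://tracker-b.example/sync?partner_uid=u1d-9f8e7d6c5b4a", None),
    ("https://tracker-c.example/pixel.gif", "https://tracker-a.example/r?id=u1d-9f8e7d6c5b4a"),
    ("https://tracker-a.example/collect?id=u1d-9f8e7d6c5b4a", None),
    ("https://cdn.example/lib.js", "https://news.example/"),
]
```

Here `find_synced_ids(cookie_ids, requests)` reports one leak via the request URL (to tracker B) and one via the referer (to tracker C), matching the two channels the quote describes.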

Cookie syncing can be very effective at enabling surveillance:

From the Snowden leaks, we learnt that that NSA "piggybacks" on advertising cookies for surveillance and exploitation of targets. How effective can this technique be? We present one answer to this question. We consider a threat model where a surveillance agency has identified a target by a third-party cookie ... The adversary uses this identifier to coerce or compromise a third party into enabling surveillance or targeted exploitation.

We find that some cookies get synced over and over again to dozens of third parties; we call these promiscuous cookies. ... This means that if the adversary has identified a user by such a cookie, their ability to surveil or target malware to that user will be especially good. The most promiscuous cookie that we found belongs to the domain adverticum.net; it is synced or leaked to 82 other parties which are collectively present on 752 of the top 1,000 websites! In fact, each of the top 10 most promiscuous cookies is shared with enough third parties to cover 60% or more of the top 1,000 sites.
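The coverage figure in that quote (752 of the top 1,000 sites, or 75.2%) is a simple set computation. As a sketch under invented data: given the set of parties a cookie has been synced to, count the sites that embed at least one of them. All names below are hypothetical.

```python
def coverage(receiving_parties, site_third_parties):
    """Fraction of sites that embed at least one party holding the cookie's ID."""
    covered = [
        site
        for site, third_parties in site_third_parties.items()
        if third_parties & receiving_parties
    ]
    return len(covered) / len(site_third_parties)


# Hypothetical example data.
site_third_parties = {
    "news.example": {"tracker-b.example", "cdn.example"},
    "shop.example": {"tracker-c.example"},
    "blog.example": {"cdn.example"},
    "wiki.example": set(),
}
synced_to = {"tracker-b.example", "tracker-c.example"}
```

In this toy case `coverage(synced_to, site_third_parties)` is 0.5: an adversary who knows the ID can recognize the user on half the sites.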

Fingerprinting

The other way to provide a page-independent ID for the reader is fingerprinting. Narayanan, interviewed at fivethirtyeight.com, reports:

In the ad tech industry, cookies are gradually being shunted in favor of fingerprinting. The reason that fingerprinting is so effective is that even if you have a device that you think is identical to the device of the person sitting next to you, there are going to be a number of differences in the behavior of your browser. The set of fonts installed on your browser could be different. The precise version number of the browser could be different. Your battery status could be different from that of the person next to you, or anybody else in the world. And it turns out that if you put all of these pieces of information together, a unique or nearly unique picture of the behavior of your device emerges that's going to be relatively stable over time. And that enables your companies to recognize you when you come back.

Your device's fingerprint is necessarily the same for all pages that you view, no matter where they came from; no work-around is needed. Many Javascript APIs can be subverted to fingerprint Web browsers; among those that Englehardt and Narayanan found in use are WebRTC, the font APIs and AudioContext.
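The idea of "putting all of these pieces of information together" can be illustrated with a small sketch. Real fingerprinting scripts run as JavaScript in the browser; the Python below only shows the combining step, and the attribute names and values are made up.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of observed device attributes into one stable identifier.

    Keys are sorted so the same observations always give the same ID,
    regardless of the order in which they were collected.
    """
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical observations from two otherwise identical devices.
device_a = {
    "user_agent": "ExampleBrowser/50.0",
    "fonts": ["Arial", "Helvetica"],
    "screen": [1920, 1080],
    "timezone_offset": -480,
}
device_b = dict(device_a, fonts=["Arial", "Comic Sans MS", "Helvetica"])
```

One extra installed font is enough to give `device_b` a different identifier from `device_a`, and no cookie needs to be stored for either device to be recognized on its next visit.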

Who Gets Tracking Data?

Once an organization has tracking information about a user, it becomes an asset to be monetized. Google and Facebook use it to target ads, but they and the legions of less powerful trackers also sell the information to others, probably many times over. Cegłowski writes:

Surveillance capitalism has some of the features of a zero-sum game. The actual value of the data collected is not clear, but it is definitely an advantage to collect more than your rivals do. Because human beings develop an immune response to new forms of tracking and manipulation, the only way to stay successful is to keep finding novel ways to peer into people's private lives. And because much of the surveillance economy is funded by speculators, there is an incentive to try flashy things that will capture the speculators' imagination, and attract their money.

This creates a ratcheting effect where the behavior of ever more people is tracked ever more closely, and the collected information retained, in the hopes that further dollars can be squeezed out of it.

The scale of dissemination is revealed by Englehardt and Narayanan's results about cookie syncing:

The most prolific cookie-syncing third party is [Google's] doubleclick.net - it shares 108 different cookies with 118 other third parties ... More interestingly, we find that the vast majority of top third parties sync cookies with at least one other party: 45 of the top 50, 85 of the top 100, 157 of the top 200, and 460 of the top 1,000. This adds further evidence that cookie syncing is an underappreciated and under-researched privacy concern.

We also find that third parties are highly connected by synced cookies. Specifically, of the top 50 third parties that are involved in cookie syncing, the probability that a random pair will have at least one cookie in common is 85%. The corresponding probability for the top 100 is 66%.
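The "random pair" probability in that quote is the fraction of all pairs of third parties that hold at least one cookie in common. A minimal sketch, with invented party names and cookie labels:

```python
from itertools import combinations


def pair_share_probability(known_cookies):
    """Probability that a random pair of parties has at least one cookie in common.

    known_cookies: {party: set of cookie identifiers that party has seen}.
    """
    pairs = list(combinations(known_cookies, 2))
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if known_cookies[a] & known_cookies[b])
    return shared / len(pairs)


# Hypothetical example data: four parties, three distinct cookies.
known_cookies = {
    "party-a.example": {"c1", "c2"},
    "party-b.example": {"c2"},
    "party-c.example": {"c3"},
    "party-d.example": {"c1", "c3"},
}
```

Of the six possible pairs here, three share a cookie, so the function returns 0.5. The paper's 85% for the top 50 means almost any two of those parties can join their records on a user.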

Information about your online behavior is very widely disseminated.

What Could Be Done To Limit Tracking?

Georgios Kontaxis and Monica Chew won "Best Paper" at the 2015 Web 2.0 Security and Privacy workshop for Tracking Protection in Firefox for Privacy and Performance (PDF). They demonstrated that Tracking Protection provided:

a 67.5% reduction in the number of HTTP cookies set during a crawl of the Alexa top 200 news sites. [and] a 44% median reduction in page load time and 39% reduction in data usage in the Alexa top 200 news site.

Alas, Tracking Protection relies on human-curated lists of tracking domains to block, so it can't be completely effective.
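A much-simplified sketch of list-based blocking shows why curation is the weak point. This is not Firefox's implementation: the blocklist entries are invented, and the first-party test compares whole hostnames where a real browser would compare registrable domains.

```python
from urllib.parse import urlsplit

# Hypothetical curated blocklist of tracking domains.
BLOCKLIST = {"tracker-a.example", "ads.tracker-b.example"}


def parent_domains(host):
    """'a.b.example' -> ['a.b.example', 'b.example'] (the bare TLD is excluded)."""
    labels = host.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


def is_blocked(request_url, page_url, blocklist=BLOCKLIST):
    """Block a third-party request whose host, or a parent of it, is on the list."""
    request_host = urlsplit(request_url).hostname or ""
    page_host = urlsplit(page_url).hostname or ""
    if request_host == page_host:
        return False  # First-party loads are never blocked in this sketch.
    return any(domain in blocklist for domain in parent_domains(request_host))
```

A request from `news.example` to `cdn.tracker-a.example` is blocked, but one to a tracker nobody has yet added to the list, say `new-tracker.example`, goes straight through.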

Even if you aren't worried about governments and corporations tracking your every move online, your Web experience is still being impaired by trackers. Typically at least a third of all the data you receive while browsing has no visible effect and doesn't contribute to your user experience (but does track your behavior). Thus limiting tracking would definitely make your life better.

Unfortunately, the only effective way to limit tracking is to limit the ability of domains other than the one you intended to visit to inject content, and in particular Javascript, into the pages you view. This isn't feasible, if only because it would kill off Internet advertising, which is what provides the cornucopia of "free" content that makes the Web what it is.

What Could Be Done To Limit Dissemination?

Assuming that we can't eliminate tracking, perhaps the best that can be done is to limit the flow of tracking data through the ecosystem. Jack Balkin and Jonathan Zittrain's A Grand Bargain to Make Tech Companies Trustworthy suggests a legal framework to address the dissemination problem, which they characterize thus:

As we use these services, they learn more and more about us. They see who we are, but we are unable to see into their operations or understand how they use our data. As a result, we have to trust online services, but we have no real guarantees that they will not abuse our trust. Companies share information about us in any number of unexpected and regrettable ways, and the information and advice they provide can be inconspicuously warped by the companies' own ideologies or by their relationships with those who wish to influence us, whether people with money or governments with agendas.

They use the analogy of fiduciaries such as doctors, lawyers, and accountants:

Like older fiduciaries, these businesses have become virtually indispensable. Like older fiduciaries, these companies collect a lot of personal information that could be used to our detriment. And like older fiduciaries, these businesses enjoy a much greater ability to monitor our activities than we have to monitor theirs. As a result, many people who need these services often shrug their shoulders and decide to trust them. But the important question is whether these businesses, like older fiduciaries, have legal obligations to be trustworthy. The answer is that they should.

And the analogy with the bargain between copyright owners and users that underlies the Digital Millennium Copyright Act (DMCA):

Congress could respond with a "Digital Millennium Privacy Act" that offers a parallel trade-off to that of the DMCA: accept the federal government's rules of fair dealing and gain a safe harbor from uncertain legal liability, or stand pat with the status quo.

The DMPA would provide a predictable level of federal immunity for those companies willing to subscribe to the duties of an information fiduciary and accept a corresponding process to disclose and redress privacy and security violations. As with the DMCA, those companies unwilling to take the leap would be left no worse off than they are today - subject to the tender mercies of state and local governments. But those who accept the deal would gain the consistency and calculability of a single set of nationwide rules. Even without the public giving up on any hard-fought privacy rights recognized by a single state, a company could find that becoming an information fiduciary could be far less burdensome than having to respond to multiple and conflicting state and local obligations.

This might help prevent perhaps the most troubling aspect of this corporate surveillance, the way information collected by these primary panopticons is disseminated through the ecosystem. The concentration of tracking means that even if only the big 4 accepted a fiduciary responsibility, the diffusion of collected information would be greatly reduced.

Why Are The Trackers Tracking You?

Tim Wu has a new book, The Attention Merchants: The Epic Scramble to Get Inside Our Heads. His account of the evolution of the attention economy that is driving this corporate surveillance is well worth reading. He starts his historical survey in September 1833 with the launch of Benjamin Day's New York Sun:

rival papers could not at first fathom out how the Sun was able to charge less, provide more news, reach a larger audience, and still come out ahead. What Day had figured out was that newsstand earnings were trivial; advertising revenue could make it all happen.

[Graph: advertising dollars shifting from newspapers to Google and Facebook]

This graph shows the huge switch of advertising dollars from newspapers to Google and Facebook. Much of the impetus for this switch in the flow of funds comes from the improved ability of the two-way, online medium as opposed to the one-way print medium to provide advertisers feedback as to the effectiveness of their ads. I.e. it is a result of tracking.

The switch is a big problem for society; the newspapers used some of the money to report the actual news, but Google and Facebook feel no such social responsibility. As Joshua Benton writes:

I’m from a small town in south Louisiana. The day before the election, I looked at the Facebook page of the current mayor. Among the items he posted there in the final 48 hours of the campaign: Hillary Clinton Calling for Civil War If Trump Is Elected. Pope Francis Shocks World, Endorses Donald Trump for President. Barack Obama Admits He Was Born in Kenya. FBI Agent Who Was Suspected Of Leaking Hillary’s Corruption Is Dead.

These are not legit anti-Hillary stories. (There were plenty of those, to be sure, both on his page and in this election cycle.) These are imaginary, made up, frauds. And yet Facebook has built a platform for the active dispersal of these lies - in part because these lies travel really, really well. (The pope’s “endorsement” has over 868,000 Facebook shares. The Snopes piece noting the story is fake has but 33,000.)

But What About Academic Journals?

Englehardt and Narayanan found that:

sites on the low end of the [tracking prevalence] spectrum are mostly sites which belong to government organizations, universities, and non-profit entities.

Despite this, more than 18 months ago Eric Hellman, using less rigorous techniques than theirs, found that 16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers. Academic publishers are not immune to the lure of profit from selling their audience to the "attention economy". Why should researchers be concerned about being sold in this way?

There are the obvious privacy issues, and not just personal privacy. For example, Rick Luce at the Los Alamos National Labs set up a system that ingested and re-published academic journals for Federal researchers. The goal was to prevent non-Federal bodies observing the articles that Federal researchers working on classified projects were reading. From that information, clues could be gathered as to what the classified projects were working on. Pharma companies have similar concerns.

There is also the fact that ads are dangerous. Because ads are a highly effective way to get Javascript to run on lots of browsers, they are a highly effective way to get malware to run on lots of browsers. Recent examples of malvertising include:

AdGholas, which may have been running since 2013 attacking a million visits a day to:

113 domains, including some big names such as The New York Times, Le Figaro, The Verge, PCMag, IBTimes, ArsTechnica, Daily Mail, Telegraaf, La Gazetta dello Sport, CBS Sports, Top Gear, Urban Dictionary, Playboy, Answers.com, Sky.com, and more.

The demonstration of HEIST, which showed:

The HTTPS cryptographic scheme protecting millions of websites is vulnerable to a newly revived attack that exposes encrypted e-mail addresses, social security numbers, and other sensitive data even when attackers don't have the ability to monitor a targeted end user's Internet connection. ... an end user need only encounter an innocuous-looking JavaScript file hidden in an Web advertisement or hosted directly on a webpage.

VirtualDonna, which ran on 3000 top Japanese sites and hit 100K visits/day.

GooNky, a similar campaign that included abusing a certificate authority to allow encrypted traffic.

Although to be safe you need to run an ad-blocker, many sites now refuse to show content to browsers that block ads. There is no cost to the site if one of their advertisers damages your computer, because their Terms of Service explicitly disclaim liability.

But there is at least one more, less obvious issue. I have written before about the priority of the publishing oligopoly to ensure they control the only easily accessible copy even of open access content, mentioning the value in the Web world of page views. But I didn't sufficiently appreciate where the value came from. It isn't just the visible value of being able to show the reader ads. It is the invisible, but probably greater, value of being able to sell the ability to track their readers' visits to the surveillance companies. Readers of academic, especially STEM, journals are high-priority targets for both commercial and governmental reasons.

Thus it is likely that discussions of various open access models, and the role of institutional repositories (IRs) have misunderstood the business model they sought to disrupt. This insight explains, for example, why Elsevier is so determined that IRs contain only metadata, not actual content, and why buying SSRN was a sound investment. Both enhance Elsevier's value to the panopticons. It would be very interesting to know whether the number of trackers on SSRN has increased since it was purchased.

Conclusion

I'll leave the last word to Maciej Cegłowski, who was interviewed last Friday by Russell Brandom at The Verge:

Outside of intelligence services, police typically obtain company data with a court-ordered warrant, and most companies freely admit to filling lawful requests for data in that form. It also gives users some security: Trump or no, a judge has to sign off on probable cause before the warrant can be issued. As the prospect of another encryption battle looms, the warrant process may be the one part of the system that both sides can agree on.

But for Ceglowski, the presumption of the rule of law simply may not apply under Trump. “I don’t think the US is going to turn into a lawless state overnight, but look at the dynamic in places like Russia,” he said. “He’s not going to care enough to prevent such abuses at a lower level and certainly he’s going to protect anyone who’s taken to task for them.”

“I hate to sound fear-mongering but I’m from Poland,” he continued. “This fits a pattern that I recognize. It’s just that it hasn’t happened before in the United States.”

Back in June Cegłowski wrote:

the surveillance economy is way too dangerous. Even if you trust everyone spying on you right now, the data they're collecting will eventually be stolen or bought by people who scare you. We have no ability to secure large data collections over time.

The goal should be not to make the apparatus of surveillance politically accountable (though that is a great goal), but to dismantle it. Just like we don't let countries build reactors that produce plutonium, no matter how sincere their promises not to misuse it, we should not allow people to create and indefinitely store databases of personal information. The risks are too high.

I think a workable compromise will be to allow all kinds of surveillance, but limit what anyone is allowed to store or sell.

7 comments:

Anonymous said...: This is all a bit too late for the British. The Investigatory Powers Bill, now in the very last stages of becoming law, allows the security services to require all ISPs to do deep packet inspection of all their customers' traffic and assemble 'Internet Connection Records' stored for a year and made searchable. The data has to be kept for a year, as do all emails. Once they eventually get this system running any complex tracking methods become irrelevant; the agencies will have access to our reading history as well as our emails whatever we do. I think the only hope of this system not being copied elsewhere in the west is publicising it now; this has crept through with almost no reporting and no opposition, given Brexit, Trump etc happening at the same time. Publicity might at least stop it being adopted so quietly elsewhere.; November 15, 2016 at 1:55 PM
David. said...: Cory Doctorow points to good advice for panopticons from Erica Portnoy at EFF.; November 18, 2016 at 12:29 PM
David. said...: Via /., a fun illustration of web surveillance from "Dutch media company VPRO and Amsterdam based interactive design company Studio Moniker".

Turn on sound and visit https://clickclickclick.click/. Who knew there was a ".click" top-level domain?; November 22, 2016 at 6:58 AM
David. said...: Catalin Cimpanu at Bleeping Computer reports, with a superb diagram, on a new malvertising campaign that attacks home routers from inside your home network. DNSChangerEK:

"The way this entire operation works is by crooks buying ads on legitimate websites. The attackers insert malicious JavaScript in these ads, which use a WebRTC request to a Mozilla STUN server to determine the user's local IP address.

Based on this local IP address, the malicious code can determine if the user is on a local network managed by a small home router, and continue the attack. ... These users receive a tainted ad which redirects them to the DNSChanger EK home, where the actual exploitation begins.

The next step is for the attackers to send an image file to the user's browser, which contains an AES (encryption algorithm) key embedded inside the photo using the technique of steganography.

The malicious ad uses this AES key to decrypt further traffic it receives from the DNSChanger exploit kit. ... Because the attack is carried out via the user's browser, using strong router passwords or disabling the administration interface is not enough. ... This malvertising campaign has nothing to do with the exploit against Netgear routers that came to light over the weekend, or the malvertising campaign discovered by ESET last week, which embedded malicious code inside the pixels of banner ads."; December 16, 2016 at 7:02 AM
David. said...: If you've read Tim Wu's The Attention Merchants you'll appreciate Ben Thompson's The Great Unbundling.; January 19, 2017 at 7:01 AM
David. said...: Have you ever wondered what happens to all the data the government collects, either directly or by buying it from data vendors? Glyn Moody's Chinese Officials With Government Access To Every Kind Of Personal Data Are Selling It Online will reassure you that the data is being put to good use:

"officials within the government who have ready access to this personal information are happy to sell it to anyone for low prices, no questions asked. It's possible some of the databases have been hacked by outsiders, but it seems unlikely that online break-ins could make enough of them accessible, enough of the time. Corrupt officials with continuous access would be a more reliable source for these tracking services, of which there are hundreds."

It is based on, as Moody says:

"great journalism from Guangzhou's Southern Metropolis Daily, whose reporters documented their success in buying every kind of personal data about colleagues from "tracking" services advertised online:; January 19, 2017 at 7:07 AM
David. said...: Some Microsoft support scammers achieved the Holy Grail of malvertising by getting their fake Amazon ad to the top of Google's search results for Amazon:

" The ad appeared at the very top of the Google search result for anyone searching for "amazon," and it appeared above the legitimate search result for Amazon.com.

It's not known how many people may have seen the ad, let alone clicked on it. But according to Google's own most recent statistics, Amazon is the top search result as of the most searched for retail store on the search engine -- likely accounting for millions of searchers."; February 9, 2017 at 7:18 AM