Friday, April 7, 2017

Researcher Privacy

The blog post I was drafting about the sessions I found interesting at the CNI Spring 2017 Membership Meeting got too long, so I am dividing it into a post per interesting session. First up, below the fold, perhaps the most useful breakout session. Sam Kome's Protect Researcher Privacy in the Surveillance Era, an updated version of his talk at the 2016 ALA meeting, led to animated discussion.

Sam has access to logs from the controller for the Claremont Colleges wireless network and the proxy they use to provide off-campus access to resources. He's been studying them to see how much personally identifiable information they contain and how, consistent with the institution's legitimate data-gathering needs, it can be minimized.

Just as at Stanford, the Claremont wireless network knows the person associated with each device, via a network ID. Via signal strength and triangulation, the network knows pretty exactly where they are whenever they are on campus. It can match locations and times to know who they were interacting with. It knows the Web sites they were visiting, so it knows something of what they were discussing. The proxy knows much more: which journals, which articles and (thanks to the browsers now rendering PDFs internally) which pages they were accessing.

Thanks to the recently-approved change to Rule 41 of the Federal Rules of Criminal Procedure, law enforcement may obtain a warrant from essentially any judge to use whatever mechanisms they like to access remotely and without notification anything they think to be of interest, such as logs containing the kind of information Sam revealed that the Claremont network was collecting.

Clearly, it is important to both anonymize the logs, and delete them as soon as possible, retaining only aggregates coarse enough to make de-anonymization difficult. Sam went into some detail about how hard it is to do this, especially with hosted or third-party services.
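
As a rough illustration of what this kind of minimization might look like, here is a minimal Python sketch. The record layout (`user`, `time`, `resource`), the function names and the threshold of 5 are all assumptions made for the example, not anything taken from Sam's talk or the Claremont systems. Note that a keyed hash only pseudonymizes; it is the coarse, thresholded aggregate that would be retained once the raw logs and the key are deleted.

```python
import hashlib
import hmac
from collections import Counter
from datetime import datetime


def pseudonymize(user_id: str, key: bytes) -> str:
    """Replace a network ID with a keyed hash (HMAC-SHA256, truncated).

    This is pseudonymization, not anonymization: anyone holding the key
    can recompute the mapping, so the key must be destroyed with the logs.
    """
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def coarsen(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp to the calendar day."""
    return datetime.fromisoformat(timestamp).date().isoformat()


def aggregate(records, min_count: int = 5) -> dict:
    """Count requests per (day, resource), dropping cells below min_count.

    Small cells are suppressed because a count of 1 or 2 can point
    straight back at an individual.
    """
    counts = Counter((coarsen(r["time"]), r["resource"]) for r in records)
    return {cell: n for cell, n in counts.items() if n >= min_count}
```

Given five requests for one journal and a single request for another on the same day, `aggregate` would keep only the first cell; the single request is dropped rather than stored. How coarse the buckets and how high the threshold need to be depends on the size of the population, which is part of why this is hard to get right.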

One advantage of IP-address and proxy-based authentication is that it deprives publishers of one source they can use to track users. Alas, even if the proxy were to comprehensively sanitize the HTTP headers, the publishers have other ways to track users. Eric Hellman's 16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers is a must-read in this area, as is Steven Englehardt and Arvind Narayanan's Online tracking: A 1-million-site measurement and analysis (also here). I wrote on this topic in Open Access and Surveillance.
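
To make "sanitize the HTTP headers" concrete, here is a minimal Python sketch of the sort of filtering a proxy could apply to outbound requests. The deny-list, the function name and the generic User-Agent string are assumptions for illustration only; a real proxy would need per-publisher tuning, since stripping cookies can break sessions. As noted above, this does nothing about tracking by scripts and ad networks running in the reader's browser.

```python
# Hypothetical deny-list of request headers that can identify a user or device.
IDENTIFYING_HEADERS = {"cookie", "referer", "from", "x-forwarded-for", "forwarded", "via"}


def sanitize_headers(headers: dict, generic_user_agent: str = "Mozilla/5.0") -> dict:
    """Return a copy of the request headers with identifying ones removed.

    Header names are compared case-insensitively. The User-Agent is kept
    but replaced with a generic value so it cannot fingerprint a device.
    """
    clean = {
        name: value
        for name, value in headers.items()
        if name.lower() not in IDENTIFYING_HEADERS
    }
    for name in clean:
        if name.lower() == "user-agent":
            clean[name] = generic_user_agent
    return clean
```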

There is a real lack of understanding, even among students and researchers, as to the extent to which their on-line activities are tracked. Libraries could do much more to educate the campus community as to the importance of ad-blockers, VPNs, and tools such as Tor and Tails.

Two quick updates:
  • I was remiss in not pointing to Maciej Cegłowski on this topic:
    imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.
  • I managed to track down Stanford's privacy policy. It does not appear to require either anonymization or minimization of logs, nor does its list of the Information available to the University include much of the information Sam sees in his logs.

4 comments:

David. said...

Cliff Lynch's long-awaited report on reader privacy is in April's First Monday. Once I've had time to digest it I'll comment further.

David. said...

The Economist has a long-ish piece entitled Technology firms and the office of the future which includes details of the surveillance technologies tech firms are rolling out, for example:

"Jensen Huang, the chief executive of Nvidia, ... says his firm plans to introduce facial recognition for entry into its new headquarters, due to open later this year.

Nvidia will also install cameras to recognise what food people are taking from the cafeteria and charge them accordingly, eliminating the need for a queue and cashier."

Universities are anxious to ape the trends of the tech industry, so this is the future for researchers too.

David. said...

And later in the same issue of The Economist there is this:

"REMEMBER that racy film you probably should not have enjoyed on Netflix last weekend? Eran Tromer’s algorithms can tell what it was. Although videos streamed from services such as Netflix, Amazon and YouTube are encrypted in various ways to ensure privacy, all have one thing in common: they leak information. Dr Tromer, of Tel Aviv university, his colleague Roei Schuster and Vitaly Shmatikov of Cornell have worked out how those leaks can identify the film you are watching—even if they cannot directly observe the stream of bits delivering it, or obtain access to the device on which you are watching it."

David. said...

Yet again we see that Maciej Cegłowski was right:

"imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle."

Equifax, the credit rating agency, just leaked what it knows about 143M people, roughly one in every two Americans:

'Equifax said Thursday that 143 million people could be affected by a recent data breach in which cybercriminals stole information including names, Social Security numbers, birth dates, addresses, and the numbers of some driver's licenses.

Additionally, credit card numbers for about 209,000 people were exposed, as was "personal identifying information" on roughly 182,000 customers involved in credit report disputes.'

Why are companies allowed to store this data when it is inevitable that it will leak?