Friday, April 7, 2017

Researcher Privacy

The blog post I was drafting about the sessions I found interesting at the CNI Spring 2017 Membership Meeting got too long, so I am dividing it into a post per interesting session. First up, below the fold, perhaps the most useful breakout session. Sam Kome's Protect Researcher Privacy in the Surveillance Era, an updated version of his talk at the 2016 ALA meeting, led to animated discussion.

Sam has access to logs from the controller for the Claremont Colleges wireless network and the proxy they use to provide off-campus access to resources. He's been studying them to see how much personally identifiable information they contain and how, consistent with the institution's legitimate data-gathering needs, it can be minimized.

Just as at Stanford, the Claremont wireless network knows the person associated with each device, via a network ID. Via signal strength and triangulation, the network knows fairly precisely where they are whenever they are on campus. It can match locations and times to know who they were interacting with. It knows the Web sites they were visiting, so it knows something of what they were discussing. The proxy knows much more: which journals, which articles and (thanks to browsers now rendering PDFs internally) which pages they were accessing.

Thanks to the recently-approved change to Rule 41 of the Federal Rules of Criminal Procedure, law enforcement may obtain a warrant from essentially any judge to use whatever mechanisms they like to access remotely and without notification anything they think to be of interest, such as logs containing the kind of information Sam revealed that the Claremont network was collecting.

Clearly, it is important both to anonymize the logs and to delete them as soon as possible, retaining only aggregates coarse enough to make de-anonymization difficult. Sam went into some detail about how hard it is to do this, especially with hosted or third-party services.
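
To make the idea concrete, here is a minimal sketch of the two steps in Python. It is not Sam's method, and everything specific in it is my assumption: the record layout (timestamp, network ID, host), the function names, and the suppression threshold of 5. Network IDs are replaced by a keyed hash, so that throwing away the key later breaks the link back to a person, and the only thing kept long-term is a per-day, per-host count with small cells dropped.

```python
import hashlib
import hmac
from collections import Counter
from datetime import datetime


# Hypothetical record layout: (ISO-8601 timestamp, network ID, requested host).
def pseudonymize(network_id: str, key: bytes) -> str:
    """Replace a network ID with a truncated keyed hash (HMAC-SHA256).

    The same ID and key always give the same pseudonym, so short-term
    troubleshooting still works; once the key is discarded, the pseudonym
    can no longer be recomputed from a known network ID.
    """
    digest = hmac.new(key, network_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def coarse_aggregate(records, min_count=5):
    """Count requests per (day, host), ignoring who made them.

    Cells with fewer than min_count requests are suppressed, because
    very small counts are the easiest to tie back to an individual.
    The threshold of 5 is an arbitrary illustration, not a recommendation.
    """
    counts = Counter()
    for timestamp, _network_id, host in records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(day, host)] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}
```

Even a sketch this small hints at the difficulty Sam described: it only helps if the raw logs and the key are actually deleted, and neither step is under the library's control when the logs live with a hosted or third-party service.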

One advantage of IP-address and proxy-based authentication is that it deprives publishers of one source they can use to track users. Alas, even if the proxy were to comprehensively sanitize the HTTP headers, the publishers have other ways to track users. Eric Hellman's 16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers is a must-read in this area, as is Steven Englehardt and Arvind Narayanan's Online tracking: A 1-million-site measurement and analysis (also here). I wrote on this topic in Open Access and Surveillance.

There is a real lack of understanding, even among students and researchers, as to the extent to which their on-line activities are tracked. Libraries could do much more to educate the campus community as to the importance of ad-blockers, VPNs, and tools such as Tor and Tails.

Two quick updates:
  • I was remiss in not pointing to Maciej Cegłowski on this topic:
    imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.
  • I managed to track down Stanford's privacy policy. It does not appear to require either anonymization or minimization of logs, nor does its list of the Information available to the University include much of the information Sam sees in his logs.

1 comment:

David. said...

Cliff Lynch's long-awaited report on reader privacy is in April's First Monday. Once I've had time to digest it I'll comment further.