Sam has access to logs from the controller for the Claremont Colleges wireless network and the proxy they use to provide off-campus access to resources. He's been studying them to see how much personally identifiable information they contain and how, consistent with the institution's legitimate data-gathering needs, it can be minimized.
Just as at Stanford, the Claremont wireless network knows the person associated with each device, via a network ID. Via signal strength and triangulation, the network knows fairly precisely where they are whenever they are on campus. It can match locations and times to infer who they were interacting with. It knows the Web sites they were visiting, so it knows something of what they were discussing. The proxy knows much more: which journals, which articles and (thanks to browsers now rendering PDFs internally) which pages they were accessing.
Thanks to the recently-approved change to Rule 41 of the Federal Rules of Criminal Procedure, law enforcement may obtain a warrant from essentially any judge to use whatever mechanisms they like to access remotely and without notification anything they think to be of interest, such as logs containing the kind of information Sam revealed that the Claremont network was collecting.
Clearly, it is important both to anonymize the logs and to delete them as soon as possible, retaining only aggregates coarse enough to make de-anonymization difficult. Sam went into some detail about how hard it is to do this, especially with hosted or third-party services.
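To make the idea of "anonymize, aggregate, then delete" concrete, here is a minimal Python sketch. It is not Sam's method, and the record layout (timestamp, network ID, resource host), the names `pseudonymize` and `coarse_aggregate`, and the `min_count` threshold are all assumptions made for illustration. It replaces each network ID with a keyed hash, counts distinct users per day and host, and drops any cell with too few users to hide in.

```python
import hashlib
import hmac
import secrets
from datetime import datetime

# Hypothetical log records: (ISO timestamp, network ID, resource host).
SAMPLE_LOG = [
    ("2016-06-01T09:15:02", "alice", "journals.example.org"),
    ("2016-06-01T09:40:11", "alice", "journals.example.org"),
    ("2016-06-01T13:05:47", "bob", "journals.example.org"),
    ("2016-06-01T14:22:30", "carol", "rare-topic.example.net"),
]


def pseudonymize(user_id: str, key: bytes) -> str:
    """Replace a network ID with a keyed hash (HMAC-SHA256, truncated).

    Without the key the mapping cannot be rebuilt by hashing guessed IDs,
    so destroying the key along with the raw logs severs the link.
    """
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def coarse_aggregate(records, key: bytes, min_count: int = 2) -> dict:
    """Count distinct users per (day, host), suppressing small cells.

    Cells with fewer than `min_count` users are dropped, because a count
    of one is effectively a record about an individual.
    """
    users_per_cell = {}
    for timestamp, user_id, host in records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        cell = (day, host)
        users_per_cell.setdefault(cell, set()).add(pseudonymize(user_id, key))
    return {
        cell: len(users)
        for cell, users in users_per_cell.items()
        if len(users) >= min_count
    }


# The key lives only as long as the raw logs do (an assumed policy).
key = secrets.token_bytes(32)
aggregates = coarse_aggregate(SAMPLE_LOG, key)
print(aggregates)
```

Only `aggregates` would be retained; the raw records and the key would be deleted. Even this toy version hints at the difficulty Sam described: the right granularity and threshold depend on the population, and none of it helps with logs held by a hosted or third-party service.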
One advantage of IP-address and proxy-based authentication is that it deprives publishers of one source they can use to track users. Alas, even if the proxy were to comprehensively sanitize the HTTP headers, the publishers have other ways to track users. Eric Hellman's "16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers" is a must-read in this area, as is Steven Englehardt and Arvind Narayanan's "Online tracking: A 1-million-site measurement and analysis". I wrote on this topic in "Open Access and Surveillance".
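For readers wondering what sanitizing the HTTP headers might involve, the following Python sketch shows an allowlist approach. It is an assumption-laden illustration, not a description of how the Claremont proxy or any real proxy product works: the allowlist contents, the `sanitize_headers` function and the generic User-Agent string are all hypothetical.

```python
# Hypothetical allowlist: only headers needed to fetch the resource.
ALLOWED_HEADERS = {"host", "accept", "accept-encoding", "range"}

# Hypothetical generic value sent in place of the browser's own User-Agent.
GENERIC_USER_AGENT = "library-proxy/1.0"


def sanitize_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the request headers with identifying ones removed.

    Anything not on the allowlist (Cookie, Referer, the original
    User-Agent, and so on) is dropped, and a uniform User-Agent is added
    so that every patron looks the same at the header level.
    """
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() in ALLOWED_HEADERS
    }
    cleaned["User-Agent"] = GENERIC_USER_AGENT
    return cleaned


request_headers = {
    "Host": "journals.example.org",
    "Accept": "text/html",
    "Cookie": "session=abc123",
    "Referer": "https://library.example.edu/search?q=sensitive+topic",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
}
print(sanitize_headers(request_headers))
```

This only covers what the proxy itself forwards. Once the publisher's page loads in the browser, third-party scripts and ad networks of the kind Hellman documents run on the reader's machine, beyond the reach of any header filtering.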
There is a real lack of understanding, even among students and researchers, as to the extent to which their on-line activities are tracked. Libraries could do much more to educate the campus community as to the importance of ad-blockers, VPNs, and tools such as Tor and Tails.
Two quick updates:
- I was remiss in not pointing to Maciej Cegłowski on this topic:
imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle.