Monday, January 4, 2016

Why not store it all?

Back in October I wrote Two in Two Days, linking to two of Maciej Cegłowski's barn-burning speeches, What Happens Next Will Amaze You and Haunted By Data. I should have been paying more attention, because Cory Doctorow just pointed out that three weeks after I wrote it, Cegłowski gave another barn-burner, this one in Sydney, entitled The Website Obesity Crisis. It is a must-read, laugh-out-loud gem. Below the fold, I start from it to examine two answers to the question "why not store all of it?"

Cegłowski provides a long list of posts talking about the incredible bloat that is reducing page loads to a crawl, such as:
this 400-word-long Medium article on bloat, which includes the sentence:
"Teams that don’t understand who they’re building for, and why, are prone to make bloated products."
The Medium team has somehow made this nugget of thought require 1.2 megabytes. That's longer than Crime and Punishment, Dostoyevsky’s psychological thriller about an impoverished student who fills his head with thoughts of Napoleon and talks himself into murdering an elderly money lender.
Reading these examples I burst out laughing. But then I remembered something someone said about motes and beams, so I measured the size of my post Two in Two Days. It contains just over 5KB of words and my markup, and two of my images totalling about 44KB. Rendering it downloads about 1.5MB of data. I must fess up that I am responsible for the fact that two small images take 44KB. But Google's Blogger platform is responsible for the remaining 1.45MB, including at least 1.2MB of JavaScript and 80KB of CSS.
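The arithmetic behind those figures can be checked with a few lines of shell. This is only a sketch: the variable names are mine, the inputs are the rounded numbers quoted above, and 1.5MB is taken as 1500KB.

```shell
#!/bin/sh
# Rounded figures from the paragraph above, in KB (1.5MB taken as 1500KB).
words_kb=5
images_kb=44
total_kb=1500

# What the author contributed versus what the platform added.
content_kb=$((words_kb + images_kb))
overhead_kb=$((total_kb - content_kb))

# Integer percentage of the download that is actual content (rounded down).
content_pct=$((content_kb * 100 / total_kb))

echo "content: ${content_kb}KB, platform overhead: ${overhead_kb}KB, content share: ${content_pct}%"
```

On these rounded inputs the content comes to 49KB, roughly 3% of what the browser downloads.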

As an experiment, I saved the blog post and wrote a quick shell script that essentially did:
for A in *.js ; do cat /dev/null > "${A}" ; done
The resulting page is just over 300KB, still 6 times bigger than the content, but a compression factor of 5 over the original. The result is here, so you can see that it preserves the "look and feel" of the original almost perfectly.
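For anyone wanting to repeat the experiment on their own pages, here is a minimal sketch of the measurement. It assumes the page was saved from the browser into a directory of resources; the function names and the directory name in the usage comment are hypothetical, not what I actually ran.

```shell
#!/bin/sh
# Print the combined size, in bytes, of all files under directory $1
# whose names end in suffix $2. An empty suffix matches every file.
total_bytes() {
    find "$1" -type f -name "*$2" -exec cat {} + | wc -c | tr -d ' '
}

# Truncate every .js file under directory $1 to zero length.
# Same effect as the loop above, but it descends into subdirectories
# and copes with file names containing spaces.
blank_js() {
    find "$1" -type f -name '*.js' -exec sh -c 'for f; do : > "$f"; done' sh {} +
}

# Usage (the directory name is an assumption):
#   before=$(total_bytes saved_page_files "")
#   blank_js saved_page_files
#   after=$(total_bytes saved_page_files "")
#   echo "before: $before bytes, after: $after bytes"
```

Note that this only empties external script files. JavaScript embedded inline in the HTML is left alone, just as it was in the original loop.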

Admittedly, the "Simple" template my blog uses is indeed very simple. Doing the same thing to a more sophisticated JavaScript-ed page would not work so well. But my experiment reinforces Cegłowski's point that the information density of the Web content we are collecting and preserving is very low. This low information content is part of what is driving the rise of ad-blockers and technologies like Firefox's Tracking Protection.

We need to be skeptical of the data we're collecting. Most of this JavaScript won't work in the future because it implements APIs to remote services. And even if it did work, it isn't doing things we want to have happen to future readers (or current ones, for that matter). Adopting these technologies for Web archiving could greatly reduce storage costs. Of course, scholars wanting to study the ads would be out of luck, but they're pretty much out of luck already.
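To make the idea concrete, here is a rough sketch of what blocklist filtering at collection time might look like. Everything in it is an assumption for illustration: the function name, the one-hostname-per-line blocklist format, and the exact-match rule. Real ad-blockers and Tracking Protection use much richer rule sets.

```shell
#!/bin/sh
# Read URLs on stdin, one per line, and print only those whose host does
# NOT appear in the blocklist file $1 (one hostname per line).
# Exact host matching only; URLs with user@host syntax are not handled.
filter_urls() {
    awk -v list="$1" '
        BEGIN {
            # Load the blocklist into a lookup table.
            while ((getline h < list) > 0) if (h != "") blocked[h] = 1
        }
        {
            host = $0
            sub(/^[a-zA-Z][a-zA-Z0-9+.-]*:\/\//, "", host)  # drop the scheme
            sub(/[\/:?#].*$/, "", host)                      # drop port, path, query
            if (!(host in blocked)) print
        }'
}

# Usage (file names are hypothetical):
#   filter_urls blocklist.txt < candidate_urls.txt > urls_to_collect.txt
```

A crawler fed only the filtered list would never fetch, and so never store, resources from the listed hosts.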

The first reason not to store it all is therefore the cost of storage. The second is the risk of storage. Two in Two Days and the earlier The Panopticon Is Good For You agreed with the point that Cegłowski made in Haunted By Data. He used the analogy between collected data and nuclear waste:
A singular problem of nuclear power is that it generated deadly waste whose lifespan was far longer than the institutions we could build to guard it. Nuclear waste remains dangerous for many thousands of years. The data we're collecting about people has this same odd property.
Doctorow points to a Wall Street Journal interview with Lawrence Lessig, in which Lessig looks at the risk-reward ratio of keeping Big Data, and to a James Bridle essay that also uses the analogy between Big Data and nuclear threats.

Here is Lessig:
The average cost per user of a data breach is now $240 … think of businesses looking at that cost and saying “What if I can find a way to not hold that data, but the value of that data?” When we do that, our concept of privacy will be different. Our concept so far is that we should give people control over copies of data. In the future, we will not worry about copies of data, but using data. The paradigm of required use will develop once we have really simple ways to hold data. If I were king, I would say it’s too early. Let’s muddle through the next few years. The costs are costly, but the current model of privacy will not make sense going forward.
Doctorow comments:
I'm keenly interested in how this process could be accelerated. For example, what if insurers refused to offer policies to companies unless they used good, long random salts for their password hashes? Insurers aren't experts in infosec, but they're also not experts in fire-safety. Nevertheless, when it became apparent that they would lose money unless they imposed fire-safety practices on their customers, insurers came up with rules about how many sprinklers, exits, and alarms the businesses they wrote policies for needed to install.

If insurers actually attached pricetags to customer data -- assuming, say, that leaking name, social and date of birth would create $150 in liability from eventual class-action suits over losses -- and passed those costs onto companies, companies would start assuming that user-data was more like plutonium than oil.
Bridle comes at this from a different perspective, starting from the history of the Campaign for Nuclear Disarmament and visits to the Museum of Nuclear Science and History's collection of nuclear weapons:
That nausea is how I feel today – an existential dread not caused by the shadow of the bomb, but by the shadow of data. It’s easy to feel, looking back, that we spent the 20th Century living in a minefield, and I think we’re still living in a minefield now, one where critical public health infrastructure runs on insecure public phone networks, financial markets rely on vulnerable, decades-old computer systems, and everything from mortgage applications to lethal weapons systems are governed by inscrutable and unaccountable softwares. This structural and existential threat, which is both to our individual liberty and our collective society, is largely concealed from us by commercial and political interests, and nuclear history is a good primer in how that has been standard practice for quite some time.
And to Bletchley Park:
The one concession to the present at Bletchley is a small Intel-sponsored exhibition about cybersecurity, which is largely useless, but also unintentionally revealing. One of the talking heads it calls upon while advising visitors to always use a strong password when browsing online is Michael Hayden. That’s Michael Hayden, former director of NSA and CIA, who is famous in part for affirming that “we kill people with metadata” – an affirmation that data is a weapon in itself

This thing we call BIG DATA is The Bomb – a tool developed for wartime purposes which can destroy indiscriminately. I was struck hard by this realisation at Bletchley, and once seen, it can’t be unseen.
Bridle concludes:
But when that data is the names and addresses of all the children in the UK, or an HIV clinic’s medical records, or all of a cellular provider’s customer data, it’s a bit more concerning.
This data is toxic on contact, and it sticks around for a long time: it spills, it leaches into everything, it gets into the ground water of our social relationships and poisons them. And it will remain hazardous beyond our own lifetimes.
Experience shows that, given our current technology, any information maintained online will eventually leak. The only questions are how long it will take, and who will get their hands on it? Given this, I think Doctorow is on the right lines in looking to insurance companies to ask the question "what could possibly go wrong?" before quoting rates.

1 comment:

David. said...

Elizabeth Dwoskin at the WaPo reports that companies are starting to pay attention to Cegłowski and understand the risks that stored data poses.