DSHR's Blog: May 2012

Thursday, May 24, 2012

"Master Class" at Screening the Future II

Steven Abrams, Matthew Addis and I gave a "master class" on the economics of preservation at the Screening the Future II conference run by PrestoCentre and hosted by USC. Below the fold are the details.

Dr. Pangloss' Notes From Dinner

The renowned Dr. Pangloss greatly enjoyed last night's inaugural dinner of the Storage Valley Supper Club, networking with storage industry luminaries, discussing the storage technology roadmap, projecting the storage market, and appreciating the venue. He kindly agreed to share his notes, which I have taken the liberty of elaborating slightly.

The following argument was made. The average value to be obtained from a byte you don't keep is guaranteed to be zero. The average cost of not keeping the byte is guaranteed to be zero. Thus the net average value added by not keeping a byte is guaranteed to be zero. But the average value to be obtained from a byte you keep is guaranteed to be greater than zero. The average cost of keeping the byte is guaranteed to be zero. Thus the net average value added by keeping a byte is guaranteed to be greater than zero. So we should keep everything. Happy days for the industry!
The following numbers were quoted. The number of bytes to be stored is growing at 60%/yr. The cost of storing a byte is growing at -20%/yr. Thus the total cost of storage is growing at ~~60-20=40~~ 1.60*0.80=1.28 times the previous year's, i.e. 28%/yr. And IT budgets are growing 4% a year. Happy days for the industry!

Tip of the hat to Jim Handy for correcting my math.
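For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The growth rates are the ones quoted at dinner; the function names and the assumption that the rates stay constant and compound yearly are mine, for illustration only.

```python
def yearly_multiplier(data_growth: float, unit_cost_change: float) -> float:
    """Factor by which total storage cost changes in one year.

    Growth rates multiply rather than add: 60% more bytes at
    20% less per byte is 1.60 * 0.80, not 1 + 0.60 - 0.20.
    """
    return (1 + data_growth) * (1 + unit_cost_change)


def cost_to_budget_ratio(years: int,
                         data_growth: float = 0.60,
                         unit_cost_change: float = -0.20,
                         budget_growth: float = 0.04) -> float:
    """Storage cost relative to the IT budget after `years`, both starting at 1.

    Assumes (hypothetically) that all three rates hold constant.
    """
    cost = yearly_multiplier(data_growth, unit_cost_change) ** years
    budget = (1 + budget_growth) ** years
    return cost / budget


if __name__ == "__main__":
    print(f"yearly cost multiplier: {yearly_multiplier(0.60, -0.20):.2f}")
    for years in (1, 5, 10):
        print(f"after {years:2d} years, cost/budget = {cost_to_budget_ratio(years):.2f}")
```

The point of the second function is the gap: a cost growing at 28%/yr against a budget growing at 4%/yr diverges quickly, which is less happy for whoever pays than for the industry.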

Monday, May 14, 2012

Let's Just Keep Everything Forever In The Cloud

Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:

Dr Fader ... coins the terms "data fetish" and "data fetishist" to describe the belief that people and organisations need to capture and hold on to every scrap of data, just in case it might be important down the road. (I recently completed a Big Data survey in which a large proportion of respondents said they intend to keep their data “forever”. Great news for the tech industry, for sure.)

The full interview is worth reading, but I want to focus on one comment, which is similar to things I hear all the time:

But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."

Follow me below the fold for some thoughts on data hoarding.

Harvesting and Preserving the Future Web

Kris Carpenter Negulescu of the Internet Archive and I organized a half-day workshop on the problems of harvesting and preserving the future Web during the International Internet Preservation Consortium General Assembly 2012 at the Library of Congress. My involvement was spurred by my long-time interest in the evolution of the Web from a collection of linked documents whose primary language was HTML to a programming environment whose primary language is JavaScript.

In preparation for the workshop Kris & I, with help from staff at the Internet Archive, put together a list of 13 areas already causing problems for Web preservation:

Database driven features
Complex/variable URI formats
Dynamically generated URIs
Rich, streamed media
Incremental display mechanisms
Form-filling
Multi-sourced, embedded content
Dynamic login, user-sensitive embeds
User agent adaptation
Exclusions (robots.txt, user-agent, ...)
Exclusion by design
Server-side scripts, RPCs
HTML5

A forthcoming document will elaborate on this list, with examples, and on the other issues identified at the workshop. Some partial solutions are already being worked on. For example, Google, the Institut national de l'audiovisuel in France, and the Internet Archive, among others, have active programs that execute the content they collect using "headless browsers" such as PhantomJS.

But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach. An interesting example of why this is so is described in an article on A/B testing in Wired. It explains how web sites run experiments on their users, continually presenting them with randomly selected combinations of small changes as part of a testing program:

Use of a technique called multivariate testing, in which myriad A/B tests essentially run simultaneously in as many combinations as possible, means that the percentage of users getting some kind of tweak may well approach 100 percent, making “the Google search experience” a sort of Platonic ideal: never encountered directly but glimpsed only through imperfect derivations and variations.

It isn't just that one user's experience differs from another's. The user can never step into the same river twice. Even if we can capture and replay the experience of stepping into it once, the next time will be different, and the differences may be meaningful, or random perturbations. We need to re-think the whole idea of preservation.
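To see why the share of users getting some tweak heads toward 100 percent, and why capturing "the" experience becomes hopeless, here is a rough sketch. It assumes, purely for illustration, that each test is a two-way A/B test, that tests are assigned independently, and that each one shows its variant to the same share of users; none of these numbers come from the Wired article.

```python
def fraction_seeing_any_tweak(n_tests: int, share_per_test: float) -> float:
    """Share of users in the variant arm of at least one test.

    Under the independence assumption, a user escapes every test
    with probability (1 - share_per_test) ** n_tests.
    """
    return 1 - (1 - share_per_test) ** n_tests


def distinct_experiences(n_tests: int) -> int:
    """Number of possible combinations when each test has two arms."""
    return 2 ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 50):
        share = fraction_seeing_any_tweak(n, 0.10)
        print(f"{n:2d} tests: {share:.1%} of users see a tweak, "
              f"{distinct_experiences(n)} possible experiences")
```

Even with each test touching only a tenth of users, a few dozen concurrent tests leave almost nobody with the untweaked page, and the number of combinations a crawler would have to capture grows exponentially.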

Tuesday, May 1, 2012

Catching up

This time I'm not going to apologise for the gap in posting; I was on a long-delayed vacation. Below the fold are some links I noticed in the intervals between vacating.
