Sunday, May 26, 2013

Maureen Pennock's "Web Archiving" report

Under the auspices of the Digital Preservation Coalition, Maureen Pennock has written a very comprehensive overview of Web Archiving. It is an excellent introduction to the field, and has a lot of useful references.

Thursday, May 23, 2013

How dense can storage get?

James Pitt has an interesting if not terribly useful post at Quora comparing the Bekenstein Bound, the absolute limit that physics places on the density of information, with Harvard's DNA storage experiment. He concludes:
the best DNA storage can do with those dimensions [a gram of dry DNA] is 5.6*10^15 bits.

A Bekenstein-bound storage device with those dimensions would store about 1.6*10^38 bits.
So, there is about a factor of 3*10^22 in bits/gram of headroom beyond DNA. He also compares the Bekenstein limit with Stanford's electronic quantum holography, which stored 35 bits per electron. A Bekenstein-limit device the size of an electron would store 6.6*10^7 bits, so there's plenty of headroom there too. How reliable storage media this dense would be, and what I/O bandwidth they could sustain, are open questions, especially since the limit describes the information density of a black hole.
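As a rough check on these figures (my own back-of-the-envelope sketch, not James Pitt's calculation), the Bekenstein bound can be written as I ≤ 2πRE/(ħc ln 2) bits with E = mc². Assuming a gram of dry DNA at a density of about 1.7 g/cm^3, treated as a sphere of roughly 5 mm radius, this reproduces the ~10^38-bit figure and the ~10^22 headroom factor:

```python
# Back-of-the-envelope comparison of DNA storage density with the
# Bekenstein bound, I <= 2*pi*R*E / (hbar*c*ln 2) bits, where E = m*c^2.
# Assumptions (mine, not from the post): 1 gram of dry DNA at ~1.7 g/cm^3,
# packed into a sphere, giving a radius of roughly 5 mm.
import math

hbar = 1.054571817e-34   # J*s
c = 2.99792458e8         # m/s

def bekenstein_bits(mass_kg, radius_m):
    """Maximum information content of a sphere of given mass and radius."""
    energy = mass_kg * c**2
    return 2 * math.pi * radius_m * energy / (hbar * c * math.log(2))

mass = 1e-3                                      # 1 gram, in kg
volume = mass / 1.7e3                            # m^3, assuming 1.7 g/cm^3
radius = (3 * volume / (4 * math.pi)) ** (1/3)   # ~5.2 mm

limit = bekenstein_bits(mass, radius)            # ~1.3e38 bits
dna = 5.6e15                                     # bits/gram from the Harvard experiment
print(f"Bekenstein limit: {limit:.1e} bits")
print(f"Headroom over DNA: {limit / dna:.1e}x")
```

The exact numbers depend on the assumed radius, but the order of magnitude matches the Quora post.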

Thursday, May 16, 2013

A sidelight on "A Petabyte for a Century"

In my various posts over the last six years on A Petabyte For A Century I made the case that the amounts of data to be preserved, and the times for which they needed to be kept, had reached a scale at which the required reliability was infeasible to achieve. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience by Franck Cappello et al:
From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system.
Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties in the way beyond reliability, but the key slide is #35, which does mention reliability. This slide makes the same case as Cappello et al on much broader grounds, namely that getting more than an order of magnitude or so beyond our current HPC technology will take a complete re-think of the programming paradigm. Among the features required of the new paradigm is a recognition that errors and failures are inevitable and that there is no way for the hardware to cover them up. The same is true of storage.
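To see why checkpoint-restart breaks down, here is a crude model (my sketch, not taken from Cappello et al or Simon). Using the Young/Daly approximation, the near-optimal checkpoint interval is about sqrt(2*delta*MTBF) for a checkpoint cost delta; as the system MTBF shrinks toward the checkpoint time, the fraction of wall-clock time spent on useful work collapses:

```python
# Crude model (my assumptions) of why checkpoint-restart stops working as
# system MTBF shrinks. Uses the Young/Daly approximation for the checkpoint
# interval: tau ~ sqrt(2 * delta * mtbf), where delta is the checkpoint cost.
import math

def useful_fraction(delta, mtbf, restart=None):
    """Approximate fraction of wall-clock time spent on useful work."""
    restart = delta if restart is None else restart
    tau = math.sqrt(2 * delta * mtbf)        # Young/Daly checkpoint interval
    checkpoint_overhead = delta / tau        # fraction lost writing checkpoints
    # Expected lost work plus restart per failure, amortized over the MTBF.
    failure_overhead = (tau / 2 + restart) / mtbf
    return max(0.0, 1 - checkpoint_overhead - failure_overhead)

delta = 30 * 60                              # 30 minutes per checkpoint (assumed)
for mtbf_hours in (24, 4, 1, 0.5):
    f = useful_fraction(delta, mtbf_hours * 3600)
    print(f"MTBF {mtbf_hours:4.1f} h -> useful work {f:5.1%}")
```

With a 30-minute checkpoint, a 24-hour MTBF still leaves most of the machine doing useful work, but by the time the MTBF falls to an hour the overheads consume essentially everything, which is the failure mode the paper describes.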

Tuesday, May 14, 2013

The value that publishers add

Here is Paul Krugman pointing out how much better econoblogs are doing at connecting economics and policy than traditional publishing. He brings out several of the points I've been making since the start of this blog six years ago.

First, speed: 
The overall effect is that we’re having a conversation in which issues get hashed over with a cycle time of months or even weeks, not the years characteristic of conventional academic discourse.
Second, the corruption of the reviewing process:
In reality, while many referees do their best, many others have pet peeves and ideological biases that at best greatly delay the publication of important work and at worst make it almost impossible to publish in a refereed journal. ... anything bearing on the business cycle that has even a vaguely Keynesian feel can be counted on to encounter a very hostile reception; this creates some big problems of relevance for proper journal publication under current circumstances.
Third, reproducibility:
Look at one important recent case ... Alesina/Ardagna on expansionary austerity. Now, as it happens the original A/A paper was circulated through relatively “proper” channels: released as an NBER working paper, then published in a conference volume, which means that it was at least lightly refereed. ... And how did we find out that it was all wrong? First through critiques posted at the Roosevelt Institute, then through detailed analysis of cases by the IMF. The wonkosphere was a much better, much more reliable source of knowledge than the proper academic literature.
And here's yet another otherwise good review of the problems of scientific publishing that accepts Elsevier's claims as to the value they add, failing to point out the peer-reviewed research into peer review that conclusively refutes these claims. It does, however, include a rather nice piece of analysis from Deutsche Bank:
We believe [Elsevier] adds relatively little value to the publishing process. We are not attempting to dismiss what 7,000 people at [Elsevier] do for a living. We are simply observing that if the process really were as complex, costly and value-added as the publishers protest that it is, 40% margins wouldn’t be available.
As I pointed out using 2010 numbers:
The world's research and education budgets pay [Elsevier, Springer & Wiley] about $3.2B/yr for management, editorial and distribution services. Over and above that, the world's research and education budgets pay the shareholders of these three companies almost $1.5B for the privilege of reading the results of research (and writing and reviewing) that these budgets already paid for.
What this $4.7B/yr pays for is a system which encourages, and is riddled with, error and malfeasance. If these value-subtracted aspects were taken into account, it would be obvious that the self-interested claims of the publishers as to the value that they add were spurious.


Tuesday, May 7, 2013

Storing "all that stuff"

In two CNN interviews former FBI counter-terrorism specialist Tim Clemente attracted a lot of attention when he said:

"We certainly have ways in national security investigations to find out exactly what was said in that conversation. ... No, welcome to America. All of that stuff is being captured as we speak whether we know it or like it or not." and "all digital communications in the past" are recorded and stored
Many people assumed that the storage is in the Utah Data Center which, according to Wikipedia:
is a data storage facility for the United States Intelligence Community that is designed to be a primary storage resource capable of storing data on the scale of yottabytes
Whatever the wisdom of collecting everything, I'm a bit skeptical about the practicality of storing it. Follow me below the fold for a look at the numbers.
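As a preview of the scale involved (my own rough numbers and assumptions, not the analysis below the fold): a yottabyte is 10^24 bytes, and even the drive count needed to hold one, at 2013-era commodity capacities, strains credulity:

```python
# Rough sense of scale (my assumptions, not figures from the post): how many
# 2013-era commodity hard drives would a yottabyte of storage require?
YOTTABYTE = 1e24            # bytes
DRIVE_CAPACITY = 4e12       # 4 TB per drive (assumed typical for 2013)
DRIVE_COST = 150            # US dollars per drive (assumed)

drives = YOTTABYTE / DRIVE_CAPACITY
print(f"Drives needed: {drives:.1e}")                    # ~2.5e11 drives
print(f"Drive cost alone: ${drives * DRIVE_COST:.1e}")   # ~$3.8e13, tens of trillions
```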