Thursday, May 16, 2013

A sidelight on "A Petabyte for a Century"

In my various posts over the last six years on "A Petabyte For A Century" I made the case that the amounts of data, and the time for which they needed to be kept, had reached the scale at which the required reliability was infeasible. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience by Franck Cappello et al.:
From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system.
Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties other than reliability, but the key slide is #35, which does mention reliability. This slide makes the same case as Cappello et al. on much broader grounds, namely that getting more than an order of magnitude or so beyond our current HPC technology will take a complete rethink of the programming paradigm. Among the features required of the new programming paradigm is a recognition that errors and failures are inevitable and that there is no way for the hardware to cover them up. The same is true of storage.
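To see why checkpoint-restart stops working as the mean time to failure shrinks, here is a minimal sketch. It is not taken from Cappello et al. or from Simon's slides; it uses Young's well-known first-order approximation, in which the fraction of machine time lost is the checkpoint overhead plus the expected rework after a failure. The function names and the example numbers are hypothetical, restart time is ignored, and the approximation is only rough once the checkpoint time is no longer small compared with the MTTF.

```python
import math


def wasted_fraction(checkpoint_time, mttf, interval):
    """First-order estimate of the fraction of machine time lost.

    checkpoint_time / interval : share of time spent writing checkpoints
    interval / (2 * mttf)      : expected share of work redone after failures
    All three arguments must be in the same time unit (e.g. hours).
    """
    return checkpoint_time / interval + interval / (2.0 * mttf)


def young_interval(checkpoint_time, mttf):
    """Young's approximation for the checkpoint interval minimizing waste."""
    return math.sqrt(2.0 * checkpoint_time * mttf)


def best_case_waste(checkpoint_time, mttf):
    """Waste at Young's interval, capped at 1.0 (no useful work at all)."""
    tau = young_interval(checkpoint_time, mttf)
    return min(1.0, wasted_fraction(checkpoint_time, mttf, tau))


if __name__ == "__main__":
    # Hypothetical numbers: a 30-minute checkpoint, shrinking MTTF (hours).
    for mttf_hours in (24.0, 4.0, 1.0):
        print(mttf_hours, round(best_case_waste(0.5, mttf_hours), 3))
```

At the best interval the waste works out to sqrt(2 × checkpoint time / MTTF), so in this simple model the machine does no useful work at all once the MTTF falls to twice the checkpoint time, well before the checkpoint time actually exceeds the MTTF.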

Tuesday, May 14, 2013

The value that publishers add

Here is Paul Krugman pointing out how much better econoblogs are doing at connecting economics and policy than traditional publishing. He brings out several of the points I've been making since the start of this blog six years ago.

First, speed: 
The overall effect is that we’re having a conversation in which issues get hashed over with a cycle time of months or even weeks, not the years characteristic of conventional academic discourse.
Second, the corruption of the reviewing process:
In reality, while many referees do their best, many others have pet peeves and ideological biases that at best greatly delay the publication of important work and at worst make it almost impossible to publish in a refereed journal. ... anything bearing on the business cycle that has even a vaguely Keynesian feel can be counted on to encounter a very hostile reception; this creates some big problems of relevance for proper journal publication under current circumstances.
Third, reproducibility:
Look at one important recent case ... Alesina/Ardagna on expansionary austerity. Now, as it happens the original A/A paper was circulated through relatively “proper” channels: released as an NBER working paper, then published in a conference volume, which means that it was at least lightly refereed. ... And how did we find out that it was all wrong? First through critiques posted at the Roosevelt Institute, then through detailed analysis of cases by the IMF. The wonkosphere was a much better, much more reliable source of knowledge than the proper academic literature.
And here's yet another otherwise good review of the problems of scientific publishing that accepts Elsevier's claims as to the value they add, failing to point out the peer-reviewed research into peer review that conclusively refutes these claims. It does, however, include a rather nice piece of analysis from Deutsche Bank:
We believe the [Elsevier] adds relatively little value to the publishing process.  We are not attempting to dismiss what 7,000 people at [Elsevier] do for a living.  We are simply observing that if the process really were as complex, costly and value-added as the publishers protest that it is, 40% margins wouldn’t be available.
As I pointed out using 2010 numbers:
The world's research and education budgets pay [Elsevier, Springer & Wiley] about $3.2B/yr for management, editorial and distribution services. Over and above that, the worlds research and education budgets pay the shareholders of these three companies almost $1.5B for the privilege of reading the results of research (and writing and reviewing) that these budgets already paid for.
What this $4.7B/yr pays for is a system which encourages, and is riddled with, error and malfeasance. If these value-subtracted aspects were taken into account, it would be obvious that the self-interested claims of the publishers as to the value that they add were spurious.


Tuesday, May 7, 2013

Storing "all that stuff"

In two CNN interviews former FBI counter-terrorism specialist Tim Clemente attracted a lot of attention when he said:

"We certainly have ways in national security investigations to find out exactly what was said in that conversation. ... No, welcome to America. All of that stuff is being captured as we speak whether we know it or like it or not." and "all digital communications in the past" are recorded and stored.
Many people assumed that the storage is in the Utah Data Center which, according to Wikipedia:
is a data storage facility for the United States Intelligence Community that is designed to be a primary storage resource capable of storing data on the scale of yottabytes
Whatever the wisdom of collecting everything, I'm a bit skeptical about the practicality of storing it. Follow me below the fold for a look at the numbers.

Monday, April 29, 2013

Talk on LOCKSS Metadata Extraction at IIPC 2013

I gave a brief introduction to the way the LOCKSS daemon extracts metadata from the content it collects at the 2013 IIPC General Assembly. Below the fold is an edited text with links to the sources.

Sunday, April 28, 2013

Talk on Harvesting the Future Web at IIPC2013

I gave a talk to introduce the workshop "Future Web Capture: Replay, Data Mining and Analysis" at the 2013 IIPC General Assembly. It was based on my talk at the Spring CNI meeting. Below the fold is an edited text with links to the sources.

Saturday, April 27, 2013

Software obsolescence doesn't imply format obsolescence

Tim Anderson at The Register celebrates the 20th anniversary of Mosaic:
Using the DOSBox emulator (the Megabuild version which has network connectivity via an emulated NE2000 NIC) I ran up Windows 3.11 with Trumpet Winsock and got Mosaic 1.0 running.
This illustrates two important points:
  • Tim had no trouble resuscitating a 20-year-old software environment using off-the-shelf emulation.
  • The 20-year-old browser struggled to make sense of today's web. But today's browsers have no difficulty at all with vintage web pages.
The fact that the software that originally interpreted the content is obsolete (a) does not mean that there is significant difficulty in running it, and (b) does not mean that you need to use emulation to run it in order to interpret the content, because the obsolescence of the software does not imply the obsolescence of the format. Backwards compatibility is a feature of the Web, for reasons I have been pointing out for many years.

Thursday, April 25, 2013

Moore, Kryder vs. SAW

Ashish Sood et al.'s paper Predicting the Path of Technological Innovation: SAW vs. Moore, Bass, Gompertz, and Kryder is very interesting. They propose a discontinuous model in which technology evolves in steps separated by periods of stasis they call waits, leading them to dub the model SAW (Step And Wait). They show that it models the evolution of a wide range of technologies better than continuous models such as Moore's and Kryder's laws. Our work on the economics of long-term storage is based on Kryder's law, a continuous model. Below the fold I ask whether we need to change models.
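The difference between the two kinds of model can be illustrated with a toy sketch. This is not Sood et al.'s fitted SAW model, which estimates step sizes and wait times from data; it merely contrasts a smooth exponential with a staircase that has the same long-run growth. The function names, the 40%/yr rate and the three-year wait are all hypothetical values chosen for illustration.

```python
def continuous_capacity(years, annual_rate, start=1.0):
    """Kryder-style continuous model: smooth compound growth."""
    return start * (1.0 + annual_rate) ** years


def step_and_wait_capacity(years, step_factor, wait_years, start=1.0):
    """Toy step-and-wait model: capacity jumps by step_factor at the end
    of each wait period and stays flat in between."""
    completed_steps = int(years // wait_years)
    return start * step_factor ** completed_steps


if __name__ == "__main__":
    rate = 0.4                      # hypothetical 40%/yr continuous growth
    wait = 3                        # hypothetical wait between steps, years
    factor = (1.0 + rate) ** wait   # step size giving the same long-run growth
    for year in range(0, 10):
        print(year,
              round(continuous_capacity(year, rate), 2),
              round(step_and_wait_capacity(year, factor, wait), 2))
```

With these numbers the two curves agree at years 0, 3, 6 and 9, but during each wait the staircase falls behind the smooth curve, so a cost projection made partway through a wait comes out differently depending on which model is assumed.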