Sunday, September 16, 2007

Sorry for the gap in posting

It was caused by some urgent work for the CLOCKSS project, vacation and co-authoring a paper which has just been submitted. The paper is based on some interesting data, but I can't talk about it for now. I hope to have more time to blog in a week or two after some upcoming meetings.

In the meantime, I want to draw attention to some interesting discussion about silent corruption in large databases that relates to my "Petabyte for a Century" post. Here (pdf) are slides from a talk by Peter Kelemen of CERN describing an ongoing monitoring program at CERN using fsprobe(8). It randomly probes 4000 of CERN's file systems, writing a known pattern and then reading it back to look for corruption. They find a steady flow of 1-3 silent corruptions/day; that is, the data read back doesn't match what was written, and there is no error indication.
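To make the write-then-read-back idea concrete, here is a minimal sketch in Python. It is not fsprobe's actual implementation, which I have not reproduced here; the function names (make_pattern, compare, probe), the pattern, and the 1 MiB default size are all assumptions chosen for illustration.

```python
import os
import tempfile


def make_pattern(size, seed=0x5A):
    # Hypothetical pattern: a deterministic byte sequence that varies with
    # position, so a misplaced or zeroed block is unlikely to match by chance.
    return bytes((seed + i) % 256 for i in range(size))


def compare(expected, actual):
    # Return the offsets at which the two buffers differ. Any offset that
    # exists in only one buffer (a length mismatch) also counts as differing.
    common = min(len(expected), len(actual))
    diffs = [i for i in range(common) if expected[i] != actual[i]]
    diffs.extend(range(common, max(len(expected), len(actual))))
    return diffs


def probe(directory, size=1 << 20):
    # Write the known pattern to a temporary file in `directory`, force it
    # out with fsync, read it back, and report mismatched offsets.
    expected = make_pattern(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(expected)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            actual = f.read()
    finally:
        os.remove(path)
    return compare(expected, actual)


if __name__ == "__main__":
    bad = probe(tempfile.gettempdir(), size=4096)
    print("silent corruption detected" if bad else "pattern read back intact")
```

An empty list from probe() means the pattern came back intact; a non-empty list with no I/O error raised is exactly the "silent" case. Note one limitation of this sketch: an immediate read-back will usually be served from the operating system's page cache rather than from the disk, so a real monitor would need to bypass or outlast the cache to exercise the storage path.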

Peter sparked a discussion and a post at KernelTrap. The slides, the discussion and the post are well worth reading, especially if you are among the vast majority who believe that data written to storage will come back undamaged when you need it.

Also, in a development related to my "Mass-market Scholarly Communication" post, researchers at UC's Office of Scholarly Communication released a report that apparently contradicts some of the findings of the UC Berkeley study I referred to. I suspect, without having read the new study, that this might have something to do with the fact that they studied only "ladder-rank" faculty, where the Berkeley team studied a more diverse group.