Thursday, March 6, 2008

More bad news on storage reliability

Last year's FAST conference had depressing news for those who think that bits are safe in modern storage systems, with two papers showing that disks in use in large-scale storage facilities are much less reliable than the manufacturers claim, and a keynote (PDF) reporting that errors in file system code are endemic. This year's FAST had more sobering news. I'll return to these papers in more detail, but here are the take-away messages.

Jiang et al from UIUC and NetApp took a detailed look at the various subsystems in modern storage systems, showing that 45-75% of the apparent disk unreliability in last year's papers is probably due to the unreliability of other components in the storage system, and that the correlations between errors are even worse than last year's papers suggested.

Gunawi et al from Wisconsin analyzed one of the root causes of the incorrect response of file systems to errors in the underlying storage that was reported in earlier papers from Wisconsin (PDF) and Stanford (PDF), namely the way file systems propagate reported errors between functions and modules, showing that correct handling of these problems is so hard that implementors often throw up their hands.
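To make the propagation problem concrete, here is a minimal sketch of my own, not code from the paper or from any real file system. The function names and the dictionary standing in for a device are hypothetical. It mimics the C-style convention of returning a negative errno value, and shows how a single unchecked return value turns a reported I/O error into an apparent success.

```python
import errno


def read_block(device, block_no):
    """Hypothetical low-level read: returns (status, data).

    status is 0 on success or a negative errno value on failure,
    imitating the convention used in C kernel code.
    """
    if block_no not in device:
        return -errno.EIO, None
    return 0, device[block_no]


def load_block_broken(device, block_no):
    """Caller that drops the error: status is read but never checked."""
    status, data = read_block(device, block_no)
    return 0, data  # always reports success, even when the read failed


def load_block_checked(device, block_no):
    """Caller that propagates the error to its own caller."""
    status, data = read_block(device, block_no)
    if status != 0:
        return status, None
    return 0, data


if __name__ == "__main__":
    device = {0: b"superblock"}  # block 1 is "unreadable"
    print(load_block_broken(device, 1))
    print(load_block_checked(device, 1))
```

Every function on the call path has to repeat the check in `load_block_checked`; missing it once anywhere is enough to lose the error.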

Bairavasundaram et al from Wisconsin, NetApp and Toronto presented a massive study of silent data corruption in storage systems, reinforcing the earlier study (PDF) from CERN in showing an alarming incidence of both these errors and of correlations between them.

Krioukov et al from Wisconsin and NetApp analyzed the techniques RAID-based storage systems use to tolerate silent data corruption, showing that in various ways all current systems fall short of an adequate solution to this problem.

Greenan and Wylie (PDF) from HP Labs gave a work-in-progress presentation showing that the Markov models which are pretty much the exclusive technique for analyzing failures in storage systems give results that are systematically optimistic because they depend on assumptions that are known to be untrue.
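For readers who haven't seen one of these models, here is a sketch of the textbook Markov calculation for the simplest case, a two-disk mirror. This is my illustration, not taken from the presentation, and the function name and example numbers are made up. The model assumes that disk failures are independent and that both failures and repairs happen at constant rates (exponentially distributed times); assumptions of this kind are what make the arithmetic tractable.

```python
def mttdl_mirror(mttf_hours, mttr_hours):
    """Mean time to data loss for a 2-disk mirror under a simple Markov model.

    Assumptions (all part of the textbook model, not measured facts):
      - each disk fails independently at constant rate lam = 1 / MTTF
      - a failed disk is repaired at constant rate mu = 1 / MTTR
      - data is lost only if the second disk fails during a repair

    States: both disks up -> one disk up -> data loss (absorbing).
    Solving for the expected time to absorption gives
        MTTDL = (3 * lam + mu) / (2 * lam ** 2)
    """
    lam = 1.0 / mttf_hours
    mu = 1.0 / mttr_hours
    return (3 * lam + mu) / (2 * lam ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 1,000,000 hour MTTF, 24 hour repair time.
    hours = mttdl_mirror(1_000_000, 24)
    print(f"MTTDL: {hours / (24 * 365):.3g} years")
```

With inputs like these the model predicts a mean time to data loss of millions of years, which illustrates how strongly the answer depends on the independence and constant-rate assumptions being true.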

Considerable kudos is due to NetApp for their many contributions to both data and analysis and to the University of Wisconsin, which is building an impressive track record in this important area.