Thursday, June 4, 2009

Hard Disk Drives: The Good, the Bad and the Ugly

Jon Elerath just published a wonderful paper in the June 2009 issue of Communications of the ACM entitled "Hard Disk Drives: The Good, the Bad and the Ugly". Everyone, especially anyone who believes bit preservation is a solved problem, should read it. He clearly communicates the incredible complexity of the technology inside the familiar 3.5" drive form factor.

Elerath reviews the range of hard disk failure modes and shows how difficult it will be for disk manufacturers to keep drive reliability constant as disks get bigger. And even if they succeed in keeping drive reliability constant while the disk gets bigger, the bit reliability they deliver goes down. He says:
"Multi-terabyte capacity drives using perpendicular recording will be available soon, increasing the probability of both correctable and uncorrectable errors by virtue of the narrowed track widths, lower flying heads, and susceptibility to scratching by softer particle contaminants."
Thus, as I have been saying for a while, just as we are trying to preserve larger and larger numbers of bits, the technologies we use to make those bits reliable are not keeping pace. Elerath concludes:
"Only when these high-probability [failure] events are included in the optimization of the RAID operation will reliability improve. Failure to address them is a recipe for disaster."
I agree that RAID technology needs to adapt to the decreasing bit reliability and longer repair times of newer disk drives. But, as I argued in my iPRES2008 paper (pdf), even if we do a good job of adapting RAID to cope with these problems, we will still be many orders of magnitude below the reliability levels digital preservation needs.
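
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a fixed unrecoverable read error rate per bit and independent errors, and asks how likely it is to hit at least one such error while reading an entire drive, as a RAID rebuild must. The rate of 1 in 10^14 bits and the function name are illustrative assumptions of mine, not figures or methods from Elerath's paper, and real drives fail in more correlated ways than this simple model allows.

```python
import math

def p_read_error(capacity_tb, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error when reading
    a whole drive once, assuming independent errors at a fixed per-bit
    rate. The default rate is an illustrative assumption."""
    bits = capacity_tb * 1e12 * 8  # decimal terabytes -> bits
    # 1 - (1 - p)^n, computed via log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Same per-bit error rate, growing capacity (sizes chosen for illustration)
for tb in (0.5, 1, 2, 4):
    print(f"{tb:>4} TB: {p_read_error(tb):.1%} chance of an error on a full read")
```

Under these assumptions a full read of a 1 TB drive has roughly an 8% chance of hitting an unrecoverable error, and the chance keeps climbing as capacity grows even though the per-bit rate never changes.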