Thursday, May 16, 2013

A sidelight on "A Petabyte for a Century"

In my various posts over the last six years on A Petabyte For A Century I made the case that the amounts of data, and the time for which they needed to be kept, had reached the scale at which the necessary reliability was infeasible. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience by Franck Cappello et al:

"From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system."

Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties in the way other than reliability, but the key slide is #35, which does mention reliability. It makes the same case as Cappello et al on much broader grounds: getting more than an order of magnitude or so beyond current HPC technology will require a complete re-think of the programming paradigm. Among the features required of the new paradigm is a recognition that errors and failures are inevitable, and that there is no way for the hardware to cover them up. The same is true of storage.
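Cappello et al's point about checkpoint-restart can be checked with back-of-envelope arithmetic. Below is a minimal sketch using the standard Young/Daly first-order model, in which the optimal checkpoint interval is t_opt = sqrt(2CM) for checkpoint cost C and system MTBF M; the 30-minute checkpoint time and the range of MTBFs are illustrative assumptions of mine, not figures from the paper:

```python
import math

def useful_fraction(checkpoint_secs, mtbf_secs):
    """Fraction of wall-clock time doing useful work under periodic
    checkpointing, using the Young/Daly first-order approximation."""
    # Optimal checkpoint interval: t_opt = sqrt(2 * C * M).
    t_opt = math.sqrt(2 * checkpoint_secs * mtbf_secs)
    # Overhead per unit time: the checkpoints themselves, plus the
    # expected rework per failure (half an interval, plus a restart
    # assumed to cost the same as a checkpoint).
    overhead = checkpoint_secs / t_opt + (t_opt / 2 + checkpoint_secs) / mtbf_secs
    return max(0.0, 1.0 - overhead)

C = 30 * 60  # assumed time to write one full-system checkpoint: 30 minutes
for mtbf_hours in (24, 4, 1, 0.5):
    frac = useful_fraction(C, mtbf_hours * 3600)
    print(f"MTBF {mtbf_hours:>4} h -> useful work ~ {frac:.0%}")
```

With a 24-hour MTBF the machine still does useful work about three-quarters of the time, but once the MTBF falls to the same order as the checkpoint time the overhead consumes everything: there is no interval at which the system makes forward progress, which is exactly the failure mode the quote describes.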
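The storage side of the argument is also simple arithmetic. Here is a hypothetical sketch, assuming the formulation from the earlier posts: a petabyte is 8×10^15 bits, and if each bit decays independently with half-life H, keeping them all intact for a century with even a 50% chance of success requires H to be around 8×10^17 years:

```python
import math

BITS = 8e15          # one petabyte = 10**15 bytes = 8e15 bits
YEARS = 100.0        # required survival time: a century
P_SURVIVE = 0.5      # required probability that no bit flips at all

# If each bit independently survives t years with probability
# (1/2)**(t/H) for per-bit half-life H, then all BITS bits survive
# with probability ((1/2)**(YEARS/H))**BITS = P_SURVIVE.
# Solving for H:
H = YEARS * BITS * math.log(2) / (-math.log(P_SURVIVE))

AGE_OF_UNIVERSE = 1.38e10  # years, roughly
print(f"required bit half-life: {H:.1e} years")
print(f"  ~ {H / AGE_OF_UNIVERSE:.1e} times the age of the universe")
```

The required per-bit half-life works out to tens of millions of times the age of the universe, which is the sense in which the necessary reliability is infeasible for any real storage technology.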