Thursday, May 16, 2013

A sidelight on "A Petabyte for a Century"

In my various posts over the last six years on A Petabyte For A Century I made the case that the amounts of data, and the time for which they needed to be kept, had reached the scale at which the required reliability was infeasible. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience, by Franck Cappello et al:

"From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application-level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system."

Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties in the way other than reliability, but the key slide is #35, which does include a mention of reliability. It makes the same case as Cappello et al on much broader grounds, namely that getting more than an order of magnitude or so beyond current HPC technology will take a complete re-think of the programming paradigm. Among the features required of the new paradigm is a recognition that errors and failures are inevitable, and that there is no way for the hardware to cover them up. The same is true of storage.
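To put rough numbers on both arguments, here is a minimal back-of-the-envelope sketch (mine, not from the paper or from Simon's slides). It assumes a petabyte is 8e15 bits, a 50% target probability of surviving a century with no bit flips, and the Young/Daly first-order model of periodic checkpointing, in which checkpointing at the optimal interval sqrt(2*C*M) wastes roughly sqrt(2*C/M) of the machine, where C is the checkpoint time and M the mean time to failure.

import math

def required_bit_reliability(bits=8e15, p_survive=0.5):
    # Per-bit probability of flipping over the whole retention period
    # (here, a century) such that (1 - p_flip)**bits >= p_survive.
    # expm1/log keep precision for such tiny probabilities.
    return -math.expm1(math.log(p_survive) / bits)

def checkpoint_efficiency(checkpoint_time, mttf):
    # Fraction of time spent on useful work when checkpointing at the
    # Young/Daly first-order optimal interval sqrt(2*C*M); the wasted
    # fraction is then roughly sqrt(2*C/M). The approximation breaks
    # down as C approaches M, but the qualitative point survives.
    waste = math.sqrt(2 * checkpoint_time / mttf)
    return max(0.0, 1.0 - waste)

if __name__ == "__main__":
    p = required_bit_reliability()
    print(f"Tolerable per-bit flip probability per century: {p:.1e}")
    # Hypothetical exascale numbers: a 30-minute checkpoint against
    # mean times to failure from a day down to an hour.
    for mttf_hours in (24, 8, 2, 1):
        eff = checkpoint_efficiency(checkpoint_time=1800,
                                    mttf=mttf_hours * 3600)
        print(f"MTTF {mttf_hours:2d} h, 30 min checkpoint: "
              f"~{eff:.0%} useful work")

With these made-up numbers, a bit can be allowed to flip only about once per 10^16 bit-centuries, and useful work falls from roughly 80% of the machine at a one-day MTTF to essentially zero once the MTTF drops to an hour, which is the regime the quote anticipates.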
1 comment:
I was actually at the Exascale kick-off meeting in San Diego a few years back, and resilience was one of the topics that kept coming up as a concern, but with no clear path. We used the analogy that it would be equivalent to fixing an airplane while it's flying, if any sort of reasonable throughput is going to be realized for processes that can actually use the full machine.
A bit tangential (though more directly related to long-term digital preservation) was that they had *no* plan for data exit and curation. That in and of itself wasn't too disturbing; what was, was the complete lack of thought about it. It wasn't a matter of scope; no one had even considered the matter. I brought it up and got stunned silence in response, with a few tentative replies of "Yeah, we don't know..." or "I thought NASA was covering that...", and these were people who were also arguing from the standpoint that the data produced by an exascale machine would have long-term value and inform even national-level policies.