Architectures:
1. Single node
2. A single HCR, recovery within HCR
   - same configuration
   - different configuration
3. Between different HCRs
   - same configuration
   - different configuration
*HCR = Homogeneous Computing Resource

Application Scenarios
1. parametric-style ("pleasantly parallel")
   - no communication except initial/final
2. time-stepping
   - global synchronization
   - many-to-many, lock-step communication
3. loosely coupled application (e.g. adaptive mesh calculations)
   - no global synchronization
   - unidirectional communication
   - bidirectional communication
   - unstructured/complex work-flow

Failure Scenarios
1. Unrecoverable Application Error
   - undecidable?
   Which of these are beyond the scope of our work in this WG?
2. Computing resource failed **
3. Don't know
   - operator intervention req'd

Running out of resources?
- Exceeds resources requested - app error
- Computing resource runs out after job allocation

Failure attributes/types/themes/modes/???
- Transient failure (not readily reproducible)
- Permanent (e.g. machine decommissioned)
- Predictable (e.g. strict quota exceeded, notified preemption of job)
- Notifiable failure (e.g. explicit request to recover)
- Application support for checkpoint/recovery
- Self-imposed recovery (app notices a problem itself) vs. externally imposed recovery
- partial failure

Failure notification
- who/what needs to be notified?

Recovery Scenarios
1. Can I continue on the same node?

Who/What initiates checkpointing?
- app
- host/node/resource
- network monitor
- service
- resource management system
  - scheduler?
- tools (e.g. debuggers)

Application-scheduled (step), occasional
Periodic checkpoints
"Told": signalled/event-driven checkpointing

Homework to group - November 15th, 2002:
Send to the mailing list:
(1) your short list of applications that _you_ care about supporting with checkpoint/recovery
(2) What failures do you most care about with respect to these applications?

Expectations for our next meeting
There should be a research group concerned about broader issues
- fault tolerance
- event tolerance
Draft the scenarios
Expectations of the application API provide
Expectations of the services expected of the underlying
Recovery Schema paper from the 80's (ask Augusto Ciuffoletti)
Migration as an optimization issue - cost?
Fault tolerance as a survivability issue - cost?
Data recoverability - (Paul Borrill) clones and snapshots?
vulnerability window of checkpoint window
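
Illustration (not from the meeting): a minimal Python sketch of the three checkpoint triggers listed under "Who/What initiates checkpointing?" (application-scheduled by step, periodic by elapsed time, and "told" by a signal or event), applied to a toy time-stepping loop. The names Checkpointer and run, the state dictionary, and the use of pickle with a temp-file rename are all assumptions made for the example, not an API proposed by the group.

```python
import os
import pickle
import signal
import tempfile
import time


class Checkpointer:
    """Hypothetical helper: decides when a checkpoint is due and writes it atomically."""

    def __init__(self, path, every_steps=None, every_seconds=None, clock=time.monotonic):
        self.path = path
        self.every_steps = every_steps      # application-scheduled (step) trigger
        self.every_seconds = every_seconds  # periodic trigger
        self.clock = clock                  # injectable so the periodic trigger can be tested
        self.last_time = clock()
        self.requested = False              # "told" trigger

    def request(self, *_args):
        """Record an external request; the signature also fits a signal handler."""
        self.requested = True

    def listen(self, signum):
        """Install request() as the handler for signum (must be called from the main thread)."""
        signal.signal(signum, self.request)

    def due(self, step):
        """Return the name of the trigger that fired, or None."""
        if self.requested:
            return "told"
        if self.every_steps and step > 0 and step % self.every_steps == 0:
            return "step"
        if self.every_seconds is not None and self.clock() - self.last_time >= self.every_seconds:
            return "periodic"
        return None

    def save(self, state):
        # Write to a temp file in the same directory, then rename over the old
        # checkpoint, so a failure mid-write never leaves a partial file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.last_time = self.clock()
        self.requested = False

    def load(self, default):
        """Return the last saved state, or default if no checkpoint exists yet."""
        try:
            with open(self.path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return default


def run(ckpt, total_steps, fail_at=None):
    """Toy time-stepping loop that resumes from the last checkpoint if one exists."""
    state = ckpt.load({"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += state["step"]
        state["step"] += 1
        if ckpt.due(state["step"]):
            ckpt.save(state)
    return state
```

In this sketch, a failure at step 25 with every_steps=10 loses the work done since the checkpoint at step 20; those steps are recomputed on restart. That lost interval is one way to read the "vulnerability window" noted above, and a shorter interval narrows it at the cost of more frequent writes.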