Working Group Meeting, 16.10.2002:
-----------------------------------------

- Welcome, assignment of secretaries (Matthias Mueller, Nathan Stone)

- Discussion/Ratification of Working Group Charter
  1. Question: does the charter cover identification or definition of
     system-level services? Answer: identification.
  2. Meaning of the word "system"; proposal to replace it. It does not
     necessarily mean kernel or OS level.
  3. Proposal to include an explicit definition of the behavior required
     of the app in order to make the checkpoint possible/reliable.

- Presentation/Discussion of possible Impact
  - Security
  - App development and runtime issues
  - Scalability: will strongly depend on the environment

- Goals and Milestones

- Presentations (available on the web site)
  1. Massimo Sgaravatto: "Checkpointing in Datagrid"
     - API for storing and retrieving pairs
  2. Nathan Stone: "PSC Checkpointing"
     - Portable for C/C++/Perl. Currently recovery is only possible on a
       machine of the same flavor.
     - Designed for parallel jobs. Numbering scheme for parallel jobs
       and purging.
     - Redundancy (parity etc.), job state tracking, automatic
       resubmission at job failure.
     - Requirements for a Grid CPR:
       - CP file recognition
       - CP file migration
       - job state tracking
  3. Paul Stodghill, Cornell: "Compiler Assisted Checkpointing"
     - Source-to-source preprocessor to add application-level
       checkpointing for MPI apps
     - Fault models for an N-process app: predictable N-way, N-way, K-way
     - Complexity of parallel apps: early and late messages must be
       handled
     - System-level checkpointing is inefficient (e.g. the Alegra system
       of SNL only needs to checkpoint 5% of core size)
     - Recovery with a different number of CPUs requires saving the
       global view
     - Scalability: can we afford to add a barrier for checkpointing
       (scalability and deadlock issues)? Cornell avoids barriers.
     - Paper: "A Survey of Rollback-Recovery Protocols in
       Message-Passing Systems", Elnozahy et al., Sept 2002.
  4. Gabrielle Allen: "Checkpoint/Recovery in Cactus"
     - Cactus is a framework composed of "thorns"
     - Infrastructure thorns have complicated data structures
     - Application thorns typically have simple data structures
     - Thorns take care of I/O. CPR is available with HDF5, FlexIO.
  5. Derek Simmel: "CPR in Condor"
     - No source code modification required, but relinking is
     - Condor replaces the I/O calls
     - CPR is done with signal handling
     - Limits:
       - no fork(), no system()
       - no kernel threads
       - no SIGUSR2, SIGTSTP
       - no memory-mapped files
       - file locks are not retained between checkpoints
       - files must be either read-only or write-only
     - Condor makes use of a shared file system, but it is not a
       requirement
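
The idea of triggering a checkpoint from a signal handler, noted for Condor above, can be illustrated with a small application-level sketch. This is not Condor's mechanism, which is applied by relinking and without source changes. It is a hand-written analogue under stated assumptions: the class CheckpointedCounter, the JSON file format and the choice of SIGUSR1 are all hypothetical and chosen only for illustration. The handler merely sets a flag, and the state is written at the next safe point in the main loop.

```python
import json
import os
import signal
import tempfile


class CheckpointedCounter:
    """Toy application whose entire state is a step counter and a running sum."""

    def __init__(self, path):
        self.path = path                  # where the checkpoint file lives
        self.step = 0
        self.total = 0
        self.checkpoint_requested = False

    def request_checkpoint(self, signum=None, frame=None):
        # Usable as a signal handler: only set a flag here and let the
        # main loop write the state at a safe point.
        self.checkpoint_requested = True

    def save(self):
        # Write to a temporary file and rename it, so that a crash during
        # the write never leaves a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": self.step, "total": self.total}, f)
        os.replace(tmp, self.path)
        self.checkpoint_requested = False

    def restore(self):
        # Returns True if a checkpoint was found and loaded.
        if not os.path.exists(self.path):
            return False
        with open(self.path) as f:
            state = json.load(f)
        self.step = state["step"]
        self.total = state["total"]
        return True

    def run(self, until):
        while self.step < until:
            self.total += self.step
            self.step += 1
            # Safe point: the state is consistent between iterations.
            if self.checkpoint_requested:
                self.save()


def install_handler(app, signum=None):
    # Assumption: a Unix-like system where SIGUSR1 is available.
    if signum is None:
        signum = signal.SIGUSR1
    signal.signal(signum, app.request_checkpoint)
```

A possible use: create the object with a path such as os.path.join(tempfile.mkdtemp(), "counter.ckpt"), call install_handler(app), and send the process SIGUSR1 while run() is looping. After a failure, a fresh instance pointed at the same path can call restore() and then run() to continue from the saved step. Because the application decides what to save, the checkpoint holds only two integers rather than the whole process image, which is the efficiency argument made for application-level checkpointing in the Cornell presentation.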