Grid Checkpoint and Recovery Working Group (GridCPR-WG)
Double-Meeting at GGF10, Thursday, 11 March 2004, 4:00-5:30 pm, and 6:00-7:30 pm
Room 3088 at Humboldt University, Berlin, Germany

Session chaired by Derek Simmel
Notes taken by Thilo Kielmann

Agenda:
-------

Welcome, opening, IPR issues

Discussion and revision of GWD-I document:
"Use Cases for Grid Checkpoint Recovery"
  Short presentations about individual use cases
  - Stephen Pickles, Use of CPR in RealityGrid
  - Nathan Stone, DejaVu CPR system
  - Christine Morin, Hybrid Checkpointing for Parallel Applications
  Discussion, action items, writing assignments

Discussion and revision of GWD-I document:
"Architecture for Grid Checkpoint Recovery Services and a GridCPR API"
  Discussion, action items, writing assignments

Discussion/planning/outline of GridCPR API Specification document

Planning for GGF11 and interim objectives

Adjourn


4:00 pm

Derek Simmel opens the session and explains the IPR policy.

Besides the participants in the room, Paul Stodghill participates by phone.

Announcement by Thilo Kielmann:
Because he is overcommitted, he would like to step down as co-chair of the
group to be freed from the administrative work. (Thilo still wants to
contribute to the group's technical progress, though.)
Consequently, the group is looking for volunteers to become co-chair.


Discussion and revision of GWD-I document:
"Use Cases for Grid Checkpoint Recovery"
-------------------------------------------

The use case discussion begins with short presentations.

1. Stephen Pickles, Manchester University
   "CPR in RealityGrid"

   application-level CPR for
   - fault tolerance
   - migration
   - parameter space exploration with checkpoint trees
   'cloning'
   malleable checkpoints, restart on a different architecture
   computational steering for controlling simulations,
   pushing them in the right direction
   steering and CPR lead to checkpoint trees
   the same checkpoint can be used for multiple purposes, users, systems,
   administrative domains

2. Nathan Stone, PSC
   "The DejaVu CPR System"

   system-level checkpoint via DejaVu
   possibly use GridCPR to implement the actual functionality
   a very unanticipated use case, but could be done as well
   (spawns thinking about more non-obvious use cases)

   Paul comments (via phone) that this has aims similar to those of his
   system. He will add DejaVu to the document.

3. Christine Morin, IRISA
   "PARIS project CPR activities"

   grid-aware operating system for cluster federations
   hierarchical checkpointing protocol for code coupling applications
   transparent checkpointing

Paul Stodghill presents the use-case document
(via phone, Derek Simmel does the slides)

Rosa Badia: add workflow-type applications (Christine's example)
Rosa will send some descriptive text about her example via email.

Point from the discussion:
distinguish clearly between checkpoint data and metadata.

5:30 - 6:00 pm BREAK
===========================================================================

After the break, the group continued reviewing the use-case document.
---------------------------------------------------------------------

introduction: OK

terminology section:
  what is a job/task? -> w.r.t. workflow applications
  Whole task graph or only a single node? Paul wants to clarify this.
  What can be a QoS requirement for a workflow?
  Is fault tolerance (restart) already part of workflow engines?
  The inclusion of workflow applications requires further research into
  how much support needs to be built into a GridCPR system.
Should job scheduling (for restart) be included?
  The attendees agreed that this is out of scope for GridCPR.

checkpoint transport needs to be defined
  -> see architecture document

line 271: Nathan wants to remove the "when does transport happen"
  (was his own comment)


Discussion and revision of GWD-I document:
"Architecture for Grid Checkpoint Recovery Services and a GridCPR API"
-------------------------------------------------------------------------

Discussion
----------

Nathan Stone presents the current state of the document and leads the
discussion.

We need to agree on what is (and is not) part of the architecture.

essential components:
- applications
    library interfaces and their implementation
- client tools
    command line
- services
    transporting files...
    persistent processes or not? availability? (both, implementation)
- resources
    storage and metadata

applications
  library is to be the sole interface for applications

client tools
  triggering actions, querying, etc.
  binaries use the same or a different API??
  command line tool is a shell binding to an API (or a portal)

services
  state management
  checkpoint transfer service
  agent for finding details and invoking physical movement
  copy
  location/naming/urling service
  application might be migrated by another application (-> checkpoint tree)
  system event handling (service?) ok.

resources
  - persistent storage a.k.a. disk
  QoS issues
  metadata
  DBMS

action items, writing assignments
---------------------------------

Nathan is taking the input/agreement into the next iteration.


Discussion/planning/outline of GridCPR API Specification document
-----------------------------------------------------------------

This had to be skipped due to lack of time. It will follow on the mailing
list, building on progress with the first two documents.
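
Illustration only (not discussed or agreed at the meeting): the notes above mention checkpoint trees and ask for a clear separation between checkpoint data and metadata. A minimal Python sketch of that idea could look as follows. All names here (CheckpointMeta, CheckpointTree, data_location, the example ids and file paths) are hypothetical and are not part of any GridCPR document or API; the in-memory dictionary merely stands in for the metadata DBMS listed under resources.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CheckpointMeta:
    """Metadata record only; the checkpoint data itself is stored elsewhere."""
    checkpoint_id: str
    parent_id: Optional[str]      # None marks the root of a checkpoint tree
    data_location: str            # hypothetical path/URL of the checkpoint data
    children: List[str] = field(default_factory=list)


class CheckpointTree:
    """In-memory stand-in for a metadata store (assumption: a DBMS in practice)."""

    def __init__(self) -> None:
        self._records: Dict[str, CheckpointMeta] = {}

    def add(self, checkpoint_id: str, data_location: str,
            parent_id: Optional[str] = None) -> CheckpointMeta:
        """Register a checkpoint, optionally as a child of an existing one."""
        if checkpoint_id in self._records:
            raise ValueError(f"duplicate checkpoint id: {checkpoint_id}")
        if parent_id is not None and parent_id not in self._records:
            raise KeyError(f"unknown parent checkpoint: {parent_id}")
        record = CheckpointMeta(checkpoint_id, parent_id, data_location)
        self._records[checkpoint_id] = record
        if parent_id is not None:
            self._records[parent_id].children.append(checkpoint_id)
        return record

    def get(self, checkpoint_id: str) -> CheckpointMeta:
        """Look up the metadata record for one checkpoint."""
        return self._records[checkpoint_id]

    def lineage(self, checkpoint_id: str) -> List[str]:
        """Return the ids from the tree root down to the given checkpoint."""
        path: List[str] = []
        current: Optional[str] = checkpoint_id
        while current is not None:
            record = self._records[current]
            path.append(record.checkpoint_id)
            current = record.parent_id
        path.reverse()
        return path


def build_example_tree() -> CheckpointTree:
    """Two steering branches cloned from one base checkpoint (made-up ids)."""
    tree = CheckpointTree()
    tree.add("run0", "file:///scratch/ckpt/run0")
    tree.add("steer-a", "file:///scratch/ckpt/steer-a", parent_id="run0")
    tree.add("steer-b", "file:///scratch/ckpt/steer-b", parent_id="run0")
    tree.add("steer-b2", "file:///scratch/ckpt/steer-b2", parent_id="steer-b")
    return tree
```

In this sketch, restarting from "steer-b2" would only need its metadata record to locate the data, while lineage("steer-b2") recovers the branch of the parameter-space exploration it belongs to.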

Planning for GGF11 and interim objectives
-----------------------------------------

next use case iterations: Paul Stodghill
architecture document: Nathan Stone
both due by Mar 26 to the mailing list,
from then on weekly updates with open ends intended for discussion
cutoff (for GGF11): end of April

Specific planning for GGF11 will be done over the mailing list.
The group milestones still apply.

Adjourn
-------

7:30 pm