Meeting of the Grid Checkpoint and Recovery Working Group (GridCPR-WG)
at GGF7, Tokyo, Japan, March 6, 2003

Chaired by: Derek Simmel, Pittsburgh Supercomputing Center
Minutes:    Thilo Kielmann, Vrije Universiteit,
            CW Hobbs, VERITAS Software Corporation

Meeting agenda:

1. Opening, administrative updates
2. Discussion of first draft of GWD-I document
   "An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
3. Review/Discussion of milestones and deliverables
4. Making better progress between meetings
5. Planning of next steps (e.g., GGF8)
6. Closing


1. Opening, administrative updates
----------------------------------

The group is now officially approved by the GFSG.

mailing list: gridcpr-wg@gridforum.org
web site:     http://gridcpr.psc.edu/GGF (new contents in the making)


2. Discussion of first draft of GWD-I document
   "An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
---------------------------------------------------------------------------

The document draft had been posted before GGF7:
http://gridcpr.psc.edu/GGF/docs/GridCPR001.doc
http://gridcpr.psc.edu/GGF/docs/GridCPR001.pdf

Derek briefly reviewed the document. This spawned a lively discussion.
The following statements are intended for inclusion in further releases
of the document:

- purpose of checkpoints: for fault-tolerance, and portability
- API: write/read from stable storage and parameterize this, for efficiency
- incremental checkpoints
- what kind of data to checkpoint? system-dependent??? (better not)
- what if time differences make a difference after restart?
- again review papers/presentations from our GGF6 discussion to get
  jump started
- communication channels may be critical, likely factor them out of the
  design.
- the GridCPR API shall talk to 1 or more (OGSA) services
- look at/combine APIs from EDG and PSC approaches
- data format might be HDF5;
  actual data format is application specific, should be a parameter to
  the checkpointing
- we need both an API and an infrastructure for storing checkpoint data
- management of history of checkpoints of a job, naming checkpoints
  goes to a tree of checkpoints
- using checkpoints for application steering
- interaction with AAA
  - charging up to checkpoint only(?) if you do checkpoint
- we need to define which parts should be in the architecture
  - e.g., data storage, meta data
- getting a handle for the checkpoint data, goes to where??
  - a scheduler? a job manager?
- if OGSA, then service extensibility can help building more or less
  specific service interfaces
- look at Avaki: secure, global naming scheme (in a WG)
- SRB: naming, might be a starting point
- we need to check what other GGF WGs have produced for that already,
  candidates: DATA, GRAAP
- collect users' requirements


3. Review/Discussion of milestones and deliverables
---------------------------------------------------

The discussion led to the following agreed-upon list of milestones:

March 2003 - GGF7 Tokyo -
  Discuss and amend draft GFD-I document detailing scope of GridCPR API
  and services;
  Discuss and establish GridCPR Working Group development (virtual)
  meetings to be held on a regular schedule between GGF meetings.

(before next GGF)
  #1 Ratify Architecture for GridCPR Services & API

June 2003 - GGF8 Seattle -
  Discuss initial draft of GridCPR API Specification;
  Discuss corresponding Grid resource GridCPR service requirements,
  including interfaces to underlying scheduling and accounting systems;
  Delegate specification writing duties to selected authors.

Autumn 2003 - GGF9 Chicago -
  Discuss proof of concept implementation of draft Grid Checkpoint
  Recovery API/Service Specification.

Spring 2004 - GGF10 Frankfurt -
  Publish Grid Checkpoint Recovery API/Service Specification 1.0


4. Making better progress between meetings
------------------------------------------

We need to make (much) more progress between GGF meetings. While
discussing this, we agreed on first pursuing discussion on the mailing
list, and spawning off phone conferences on a "by need" basis.
We also investigated AG meetings (instead of phone conferences).
Tom Goodale proposed the use of a tool for using AG from a single laptop
(e.g., without expensive installations): http://www.vrvs.org


5. Planning of next steps (e.g., GGF8)
--------------------------------------

We identified the following immediate steps:

- Complete consensus diagram of GridCPR Services & API Architecture
- Identify and describe core components
- Interfaces and messaging within and to/from core
- Clearly draw scope boundaries of GridCPR-WG
- Complete GWD-I Architecture document draft & discuss within
  GridCPR-WG (via mailing list)
- Reconcile architecture with other existing GGF research/working
  groups' services & specs.
- Publish GWD-I GridCPR Architecture Doc.


6. Closing
----------

The meeting ended and everybody went off for lunch ;-)