GGF9 - Chicago, Illinois
Grid Checkpoint Recovery Working Group (GridCPR-WG)
Sheraton Hotel, Michigan Room B
Tuesday October 7, 2003, 2:00~5:30pm

Session Chaired by Derek Simmel
Session Notetakers: Thilo Kielmann & Shantenu Jha

Session formalities:
  Opened the meeting
  Displayed and reviewed GGF Intellectual Property Policy with attendees
  Circulated the attendee list

Reviewed the proposed Agenda:
  2:00 Meeting Bootstrap, Introduction and Agenda Review
  2:10 Review of GridCPR-WG after 1 year
  3:00 Discussion/Revision of GridCPR Objectives, Architecture, Diagram, Scope, Charter...
  3:30 Break (30 minutes)
  4:00 Structured Brainstorm to Generate Detailed Outline of the GridCPR API & Services Architecture GWD-I document
  5:00 Document Action Items and Writing Assignments
  5:10 Planning for GGF10 (Frankfurt) and GGF11 (Hawaii), Interim Objectives, and Charter Milestones Updates
  5:30 Adjourn

Reviewed GridCPR-WG Administrative Status:
  Website updates at http://gridcpr.psc.edu/GGF/
  Mailing and Archive addresses

Reviewed GGF GridForge:
  GridCPR Chairs (Derek Simmel and Thilo Kielmann) are administrators for the GridCPR project.
  All attendees are encouraged to create an account via http://forge.gridforum.org/
  Attendees should e-mail Derek Simmel to have their GridForge account added to the GridCPR project.

Reviewed 1st year of GridCPR-WG activities:
  GGF5 (Edinburgh)
    - Held a BoF session to gauge community interest in GridCPR
  GGF6 (Chicago)
    - Presentations
        "Grid Checkpointing in the European DataGrid project" - Massimo Sgaravatto
        "The PSC CPR System: Scope, Applicability, and Implementation (SAI)" - Nathan Stone
        "Application-level Checkpointing for Parallel Applications" - Paul Stodghill
        "Checkpoint Recovery in Cactus 4.0" - Gabrielle Allen
        "Checkpoint Recovery in Condor" - Derek Simmel
    - Initial Scoping Discussion
  GGF7 (Tokyo)
    - Working Group Charter approved by GFSG (November 2002)
    - Initial Draft of GridCPR Architecture Document reviewed
    - More requirements, interactions and scope discussion
  GGF8 (Seattle)
    - GridCPR Architecture Discussion
        Coordinated vs. Uncoordinated Checkpoints
        Named Checkpoints
        Checkpoint Data Management
        Job Run complexity (not just a single binary, multiple systems, ...)
        Resource broker perspectives
        Security for checkpoint files
    - Use Case discussions
        Examined the various views of GridCPR represented among attendees
          Mikel and Shantenu from RealityGrid
          Paul Stodghill from Cornell
          Tom Goodale and Thilo Kielmann from GridLab
          Karpjoo Jeong, Konkuk University
          Heon Yeom, Seoul National Univ
        "Do we still want to write these (and others we will gather) up in more detail and publish them as an informational document?"
          (e.g. Paul Stodghill's use-case summary sent to the list)
          (Rough consensus among attendees is to pursue development of a use-cases document)

Reviewed GridCPR "Rough Sketch" and Architecture Goals/Audience from GGF8
  GridCPR Goal:
    Applications developed correctly using the GridCPR API, which write periodic checkpoint data sets, and which are interrupted during execution, will be able to continue operations on a remote system within a Grid, starting at an interim state represented by a retrieved checkpoint data set recorded during the original execution.
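
The goal above can be illustrated with a minimal sketch in Python. This is only an illustration under stated assumptions and is not the GridCPR API, which the working group had yet to draft: the names write_checkpoint, read_checkpoint and run, the JSON file format, and the use of a local directory as the checkpoint store are all hypothetical. JSON is used here only because it is platform-neutral, in the spirit of checkpoint files that are reusable on a different system.

```python
import json
import os
import tempfile


def write_checkpoint(store_dir, name, state):
    """Atomically record `state` as a named checkpoint (hypothetical helper)."""
    path = os.path.join(store_dir, name + ".json")
    fd, tmp_path = tempfile.mkstemp(dir=store_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # Rename into place so a reader never sees a half-written checkpoint.
    os.replace(tmp_path, path)


def read_checkpoint(store_dir, name):
    """Return the recorded state, or None if no checkpoint exists yet."""
    path = os.path.join(store_dir, name + ".json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(store_dir, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` simulates an interruption at that step (illustration only).
    """
    # Resume from the last recorded checkpoint if there is one.
    state = read_checkpoint(store_dir, "job") or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            write_checkpoint(store_dir, "job", state)
    return state["acc"]
```

In this sketch a first call such as run(store, 10, 3, fail_at=7) is interrupted after a checkpoint has been recorded, and a second call run(store, 10, 3), which could be made on another machine that can reach the same checkpoint store, continues from the recorded interim state rather than from step 0.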
  GridCPR API Specification Audience:
    - Grid Application Developers
  GridCPR Service Specification Audience:
    - Grid Platform Developers and Vendors
    - Grid Resource Operators
  Other Stakeholders:
    - Grid Standards and Specifications Developers

"What is the GRID in GridCPR?"
  - User-level checkpointing
  - Heterogeneity of source and restart platforms
  - Ability to migrate a job from n nodes of a system to m nodes (likely on a different system)
  - Checkpoint API should be available everywhere
  - Data/Checkpoint files are reusable anywhere
  - Simple recompile for new platforms - no need to recode for a new platform's checkpoint service
  (Additional requirements)
  - the grid must not be visible in the API (opaque)
  - no dependencies on local names (e.g., volume names)

Action: vikas.deolaliker@sun.com to submit use case

Discussion ensued regarding existing APIs / uses of CPR
  Derek Simmel reviewed the European Data Grid CPR API from Massimo Sgaravatto's presentation.
  Nathan Stone reported on early experience with a portable CPR API at the Pittsburgh Supercomputing Center
    - user-level API implemented originally on Lemieux (Terascale Computing System)
    - file-oriented semantics
    - can also accommodate memory semantics
    - checkpoint database (metadata) must be secured against tampering
  Paul Stodghill (Cornell)
    - making checkpointing transparent with compiler (and middleware)
    - file-oriented APIs compatible
    - expect GridCPR API and Service to work as a drop-in replacement

(Break 3:20-3:45pm)

(Grid)CPR Use Cases Document (to become GWD-I document)
  Action: Paul Stodghill taking lead on pulling this informational document together
  Will generate rough template with points of comparison
    Problem Statement - What's difficult for this particular scenario w.r.t. GridCPR
    "How we would use GridCPR" for {this use case}
    Within scope defined by Charter statement
  Will send call for scenarios to mailing list
  No formatting or other requirements - Make it easy for people to send in content
  Due dates
    - Scenarios in to Paul by October 31, 2003
    - Paul to send out first draft November 30, 2003
    - Revise via mailing list discourse for complete draft by January 15, 2004
    - Send draft to be presented at GGF10 for final editing to GGF February 1, 2004
    - Complete final document review on/shortly after GGF10 -> GGF Doc editor
  Example use cases could include:
    - migrate from n to m nodes
    - migrate from platform a to platform b

GridCPR Architecture Document (revamp original GWD-I document)
  Action: Nathan Stone, Derek Simmel & Raghu Reddy (PSC) will draft the outline for the Architecture Document
    Headlines for major sections
    Unresolved issues
    Scope questions
  Due Dates
    - Draft out to mailing list by October 31, 2003.
    - Review period - via mailing list - November 30.
    - Including meat added by gridcpr-wg members as we go
    - Adding meat to bones... - Jan. 15th
    - Checkpoint of the Architecture Document Outline - Out for discussion (send in to GGF) Feb 1.

GridCPR World View Illustration
  Action: Thilo will send a diagram of "Thilo's view of GridCPR" (with cunningly inserted errors as an exercise to readers ;)
    Send to mailing list by October 20, 2003.
    Supplement & organizing view for architecture document
    It will be a piece of meat for the Architecture Doc
    Comments back to mailing list between Oct 20~Nov 30, 2003.

Documentation of Existing APIs
  Action: Those with some documentation of existing related APIs should send pointers to them to the mailing list, e.g.,
    - Nathan Stone, PSC CPR API
    - European Data Grid
    - Cactus
    - Paul Stodghill, Cornell
    - Others?
  Send your descriptions to the mailing list
  We will post them to the GridCPR website (later GridForge archive)

Action: Derek Simmel will review requirements for publishing a GGF Recommended API regarding number and form of required independent implementations.
Action: Stephen Pickles to ensure [RealityGrid] use case is contributed

Discussion of Charter:
  Original milestones were too ambitious given limited availability of resources
  Proposed Charter revisions to milestones:
    Spring 2004 - GGF10 Frankfurt
      Review Use Cases Document
      Review Architecture Document
      Initiate Draft API document
    Summer 2004 - GGF11 Hawaii
      Review Final Use Case & Architecture documents
      Submit finalized GWD-I documents to GGF Editor
      Review Draft API
    Fall 2004 - GGF12
      Review Existing Implementation Efforts
      Review Final API document & submit to GGF Editor
      Determine the nature and scope of required underlying GridCPR services
    Spring 2005 - GGF13
      Review Draft GridCPR Services Doc (if needed)
    Summer 2005 - GGF14
      Review Final GridCPR Services Doc & submit to GGF Editor

Meeting Adjourned at 5:30pm