Grid Checkpoint and Recovery Working Group (GridCPR-WG)
GGF15 - Boston, MA, US
Monday, October 3, 2005, 11:00am - 12:30pm
White Hill room (4th Floor Conference Center), Park Plaza Hotel

Session chaired by Derek Simmel
Session Notetakers: Paul Stodghill & Andrew Stubbings

Proposed Agenda:
---------------
11:00am  Opening formalities; Updates regarding group status, website and mailing list
11:10am  Meeting Objectives and Overview of GridCPR API & Services document plan
11:15am  Discussion of Draft GridCPR API Specification document
12:15pm  Action Items, Writing Assignments re. GridCPR API Specification document
12:25pm  Planning for GGF16 and interim objectives
12:30pm  Adjourn

Opening formalities: 11:15
  Derek Simmel opened the session and explained the Intellectual Property Policy
  Circulated the attendee list
  Other than the Chair and co-chair, Paul, there were 11 other participants
  Presentation of GridCPR-WG to be made to SAGA-RG (Charles River room) at 2:00pm

Updates:
  Two documents now in public comment (see www.ggf.org/ggf_docs_public.htm)
  Deadline for public comments is end of Oct
    GWD-I: Use Cases for Grid Checkpoint and Recovery
    GWD-I: An Architecture for Grid Checkpoint Recovery Services and a GridCPR API
  Comments can be put in on your behalf if you can't do yourself via GGF GridForge
  Comments must be made or the documents will not be accepted

API: Objectives
  Minimum necessary set of interfaces
  Policy governing GridCPR services behaviour defined externally
    Should not be embedded in a lot of calls
  Objectives within Grid security framework

GridCPR workflow
  Came out of 2nd public document
  Compute resource, application with GridCPR library
  Checkpoint transfer needs to be discussed
  State management

GridCPR lib interfaces in workflow
  Checkpoint state management
    CPRquery
    CPRupdate
    CPRfreemeta
  Checkpoint data management
    CPRread
    CPRwrite
    CPRfree
    CPRtransfer
  Event handling
    EHnotify - should application checkpoint now?

Example given: TCS CPR library user interface
  What do we need more/different from this?

GridCPR draft API
  Service initiation and termination
    int GridCPR_init()
    int GridCPR_term()
  Checkpoint state management
    int GridCPR_query()
      Brings back a list of checkpoints that are available
    int GridCPR_update()
    int GridCPR_free()
  Checkpoint data management
    int GridCPR_open()
    int GridCPR_read()
    int GridCPR_write()
    int GridCPR_close()
    Transfer to occur, who is responsible to coordinate the transfer?
  Event handling
    int GridCPR_restart_with()
      Non-empty means start from scratch
    int GridCPR_chkpt_when(out int seconds_left)
      seconds_left = 0 means "checkpoint now"
      Rather than polling to find out when to checkpoint
      Forecasting is not accurate
      Don't need an API call to return the time the last checkpoint took
        The application can take the time, checkpoint, take the time again and calculate the difference
      This call is similar to tcs_drainoperation(), so a checkpoint can be taken before the job is kicked off a machine
      Comment: GridCPR_chkpt_when is not a good name
      Allows an external event to signal a job that it needs to checkpoint
      CPR service is external to the application, so it knows about other checkpoints
      Need to implement some finite time to get checkpoint data
        Inform the service how to respond to the application, i.e. no response within a certain time, then error exit
      SAGA API has this but it has to be included in the application
      Do we want to put it in the API? - general consensus: yes
        But can't guarantee that the time to recover will be correct
      Have location information, i.e. URI
  * Keep transfer stuff outside of the application. Should be a pre-requisite on the application starting up
  Return from GridCPR_init is a list of checkpoints?
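
To make the event-handling discussion concrete, here is a minimal sketch of the polling variant of GridCPR_chkpt_when as described in the session, where seconds_left = 0 means "checkpoint now". The minutes give only the call names and an int return type, so everything else is an assumption made for illustration: the argument lists, the 0-on-success convention, the stub bodies, the CHKPT_INTERVAL countdown, and the run_job() helper are all hypothetical. It also shows the point that no API call is needed for the checkpoint duration, since the application can time the checkpoint itself.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* HYPOTHETICAL stubs: the draft names are from the minutes, but the
 * argument lists and return conventions below are assumptions. */
#define CHKPT_INTERVAL 3            /* assumed: polls between checkpoint requests */
static int countdown = CHKPT_INTERVAL;

int GridCPR_init(void) { countdown = CHKPT_INTERVAL; return 0; }
int GridCPR_term(void) { return 0; }

/* Draft semantic from the minutes: *seconds_left == 0 means "checkpoint now".
 * This stub fakes a service by counting down once per poll. */
int GridCPR_chkpt_when(int *seconds_left) {
    assert(seconds_left != NULL);
    if (countdown > 0) countdown--;
    *seconds_left = countdown;
    if (countdown == 0) countdown = CHKPT_INTERVAL;  /* re-arm after a request */
    return 0;
}

/* Stub: a real library would hand the buffer to the checkpoint service. */
int GridCPR_write(const void *buf, size_t len) {
    (void)buf; (void)len;
    return 0;
}

/* Hypothetical application loop: do one unit of work, poll, checkpoint on 0.
 * Returns the number of checkpoints written, or -1 if init fails. */
int run_job(int steps) {
    double state = 0.0;
    int written = 0;
    if (GridCPR_init() != 0) return -1;
    for (int i = 0; i < steps; i++) {
        state += 1.0;                               /* stand-in for real computation */
        int seconds_left = -1;
        if (GridCPR_chkpt_when(&seconds_left) != 0) break;
        if (seconds_left == 0) {
            /* Time the checkpoint in the application rather than via the API. */
            time_t t0 = time(NULL);
            if (GridCPR_write(&state, sizeof state) == 0) written++;
            time_t t1 = time(NULL);
            printf("checkpoint %d took %.0f s\n", written, difftime(t1, t0));
        }
    }
    GridCPR_term();
    return written;
}
```

Calling run_job(7) from a main() polls seven times and, with the assumed interval of 3, writes a checkpoint on the third and sixth polls. Later in the session the group leaned toward signals rather than polling, so this should be read as an illustration of the draft as presented, not of the agreed design.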

Assuming that there is a job scheduler, but should be able to run on your laptop on a single node, so a personal job scheduler is needed
State management keeps metadata
Job manager needs an ID in order to get state management for a job, not another user credential

* Questions
  When an application has completed successfully, whose responsibility should it be to delete/release checkpoints?
    application deletes/frees checkpoint explicitly?
    job manager?
    GridCPR service configuration/plan?

  thilo - SAGA is doing a local/remote oblivious API - think about sync or async API
  ruin - app may not have access to internet
  andrey - consistency is an issue - SAGA has task model
  can't really provide QoS
  thilo - transfer is prereq to application starting - application is agnostic to the transfer
  no one objected to these.
  thilo - why not signals instead of polling
  consensus - we need signals to do this
  SAGA is going to develop this, so we can leverage that API

[API discussion] tabled for revision
  delete/free policy is external to the application, not part of the API
  need to make sure that all of the copies of the docs on all of the sites are the most recent.

Charter updates - milestones:
  Oct/05 GGF15 Boston
  Nov/05 Revise GWD-I use cases and architecture documents
  Feb/06 GGF16 Athens
    Revise draft API
    Review proof of concept API and services implementation plans
    Initiate GridCPR services interface specification document
    Identify writing and review assignments and establish deadlines for drafts and reviews
  May/06 GGF17 Tokyo
    Demonstrate initial API and services implementations
    Finalize API specification and submit to GGF editor
    Review draft GridCPR services interface specification
  Fall/06 GGF18 Boston
    Review proof of concept GridCPR API and services implementation

Planning:
  GGF16 Athens, Greece, February 13-16, 2006
    Review draft API
    Present to SAGA-RG
  GGF17 Tokyo, Japan, May 9-12, 2006

12:15 Adjourn