Subject: Wednesday CPR group minutes
From: Greg Bronevetsky
Date: Wed, 16 Oct 2002 18:34:41 -0400 (EDT)
To: dsimmel@psc.edu

Here are the minutes for today's meeting.

Greg Bronevetsky

- What level of portability do we really need? Can we force the scheduler to restart on specific platforms via some meta-data?
  - Definitely the API should be portable. Complete resource portability is not necessary.
  - Perhaps we can use the original meta-data which specified which kinds of resources are necessary for the job to run to begin with. Have the meta-data travel with the job across the grid.
  - Are the initial job requirements enough? The program may use the features of the architecture it wound up running on (such as RAM size). Do we want to keep that info? Will that cause jobs to travel to the biggest processors on the grid over time?
- OK, let's look at the scenarios we want to deal with before we get into the details.
  - Checkpointing
    - single machine at any location in the grid
    - moving parallel jobs within the same homogeneous computing resource (HCR), same configurations (same number of processors, for example)
      - same configuration
      - different configurations
    - moving across HCR
      - same configuration
      - different configurations
  - Can we spell out specific scenarios vs. types of scenarios, or perhaps define some core scenarios?
    - pleasantly parallel/parametric
      - no communication (initial and final but that's it)
    - time stepping (work - barrier - work - barrier ....)
      - global synchronization, many-to-many communication, lock-step
    - loosely coupled (different physics interacting, adaptive mesh calculation)
      - nonuniform message passing, coordinated but not synchronized
      - unidirectional communication
      - bidirectional communication
    - unstructured/complex communication, AKA workflow-type jobs which have to be restarted in the middle
  - Failure Scenarios
    - Can we just restart on the same hardware, or have we determined that the hardware is too faulty?
      (hardware = same node or same cluster)
    - 2 divisions
      - Algorithm error: non-recoverable
      - Computing Resource error: recoverable by placing on new platform
      PSC example: take the application return code and resource return code, build a matrix and try to decide whether it's a recoverable resource error, a non-recoverable app error, or unknown.
    - Failure due to running out of resources
      - resource explicitly requested
        - system overcommitted resources, will probably want to move
        - program wants too much, unrecoverable
      - resource not explicitly requested, will probably want to move
    - Another taxonomy of failures
      - transient faults (non-reproducible)
      - permanent faults (resources are permanently revoked)
      - predictable fault (you're warned in advance that you have to move)
      - notifiable fault (the application tells our system that it failed and needs to be recovered)
      - partial failure (e.g. some nodes die, others don't)
    - Failure notification as opposed to failure detection
      - who do we tell about the failure, does the application really care?
    - who initiates checkpointing? (list on screen)
      - periodic (externally dictated) checkpointing vs. checkpointing right before being moved vs. checkpointing at convenient spots (internally dictated)
- Homework
  - Send everyone a description of your favorite target applications/scenarios
  - what are your common/expected failure scenarios?
  - Send the stuff in by the end of the week of November 15th.
- Goals
  - we're not really aiming to cover all the above stuff. A noticeable fraction is good enough. For example, a bunch of the above stuff strays into Byzantine faults, which are too hard.
  - Maybe we should make this a Research Group + a Working Group, since there's so much stuff to talk about
    - Expand to general fault tolerance?
    - Well, this is general task mobility, which can be done for various reasons: failure-oriented, economic, load-balancing, etc.
    - So let's keep this WG focused and kick off a fault-tolerance, event-tolerance, etc. RG.
- What documents should this group produce in the near future?
  - problem definition (target applications, failure scenarios)
  - API
  - expectations of underlying services
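
For reference, the PSC example mentioned under Failure Scenarios (application return code plus resource return code, combined in a matrix) might look roughly like the sketch below. This is only an illustration, not PSC's actual scheme: the convention that 0 means success, the use of None for a code that was never reported, the matrix entries, and all names (Verdict, bucket, FAILURE_MATRIX, classify) are assumptions made up for this example.

```python
from enum import Enum


class Verdict(Enum):
    NO_FAILURE = "no failure"
    RECOVERABLE_RESOURCE = "recoverable resource error"
    NON_RECOVERABLE_APP = "non-recoverable app error"
    UNKNOWN = "unknown"


def bucket(code):
    """Reduce a raw return code to a coarse status.

    Assumption: 0 means success, any other integer means failure,
    and None means the code was never reported.
    """
    if code is None:
        return "missing"
    return "ok" if code == 0 else "fail"


# Hypothetical decision matrix keyed by (application status, resource status).
# Any combination not listed here is treated as unknown.
FAILURE_MATRIX = {
    ("ok", "ok"): Verdict.NO_FAILURE,
    # Resource failed but the application itself reported no error:
    # retry on a new platform.
    ("ok", "fail"): Verdict.RECOVERABLE_RESOURCE,
    # Application failed on a healthy resource: likely an algorithm error.
    ("fail", "ok"): Verdict.NON_RECOVERABLE_APP,
    # Both failed: cannot tell which caused which (an assumption of this sketch).
    ("fail", "fail"): Verdict.UNKNOWN,
}


def classify(app_code, resource_code):
    """Look up the verdict for a pair of return codes."""
    key = (bucket(app_code), bucket(resource_code))
    return FAILURE_MATRIX.get(key, Verdict.UNKNOWN)


if __name__ == "__main__":
    for pair in [(0, 0), (0, 1), (3, 0), (3, 1), (None, 0)]:
        print(pair, "->", classify(*pair).value)
```

A real matrix would presumably distinguish individual return codes rather than just success and failure, and the "both failed" cell is a policy choice the group would still have to make.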