The MPI fault-tolerance testsuite

This testsuite evaluates certain aspects of current MPI-2 implementations with respect to fault tolerance. The main target is the dynamic process management section of MPI-2, which defines how two independent applications connect to each other and how an application can spawn new processes. The testsuite evaluates how each of the (independent) applications behaves if the other application (or one of the child processes) fails.

The testsuite furthermore provides two different versions of a Manager-Worker framework in which the manager can handle failed worker processes by re-spawning processes and redistributing the work of the failed processes. A presentation given at the SC 2004 conference, in the workshop on fault tolerance and MPI, presents the results achieved with the current libraries. The slides can be downloaded here. The source code of the testsuite is available upon request.
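
The recovery idea behind the Manager-Worker framework (detect a failed worker, start a replacement, hand the lost work out again) can be illustrated with a small sketch. This is not the testsuite's code and it does not use MPI: workers are simulated as plain Python objects, a failure is simulated by raising an exception, and all names (Worker, WorkerFailure, run_manager, fail_at_call) are hypothetical and chosen only for this illustration. In a real MPI-2 program the replacement worker would be created with the dynamic process management calls, and the failure would show up as an error reported by a communication call.

```python
from collections import deque


class WorkerFailure(Exception):
    """Stands in for the error a manager would see when a worker has died."""


class Worker:
    """Simulated worker; it 'dies' on its fail_at_call-th task, if that is set."""

    def __init__(self, worker_id, fail_at_call=None):
        self.worker_id = worker_id
        self.fail_at_call = fail_at_call
        self.calls = 0

    def run(self, task):
        self.calls += 1
        if self.calls == self.fail_at_call:
            raise WorkerFailure(f"worker {self.worker_id} failed on task {task}")
        return task * task  # placeholder computation


def run_manager(tasks, workers):
    """Hand out tasks round-robin; on failure, re-queue the task and respawn."""
    pending = deque(tasks)
    pool = deque(workers)
    next_id = max(w.worker_id for w in workers) + 1
    results = {}
    respawned = 0
    while pending:
        task = pending.popleft()
        worker = pool.popleft()
        try:
            results[task] = worker.run(task)
            pool.append(worker)  # healthy worker goes back into the pool
        except WorkerFailure:
            pending.append(task)  # redistribute the work of the failed worker
            pool.append(Worker(next_id))  # simulated re-spawn of a replacement
            next_id += 1
            respawned += 1
    return results, respawned


if __name__ == "__main__":
    workers = [Worker(0), Worker(1, fail_at_call=1), Worker(2)]
    results, respawned = run_manager(range(6), workers)
    print(sorted(results.items()), "respawned:", respawned)
```

In this sketch the manager owns the list of outstanding tasks, so a lost task is simply put back in the queue. All tasks complete even though one simulated worker fails on its first assignment.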