A Fault Resilient Approach to Non-collective Communication Creation in MPI
As HPC architectures grow in size, faults become increasingly frequent. This is especially relevant because MPI, the de facto standard for inter-process communication, lacks proper fault management functionality. Past efforts produced extensions to the MPI standard that enable fault management, the most important being ULFM (User-Level Failure Mitigation). In this paper, we introduce support for non-collective communication creation (MPI_Comm_create_group) in ULFM to improve its fault management capabilities. We integrate our solution into the Legio library and measure the overhead it introduces in the application. The proposed solution removes the possibility of the execution deadlocking after a fault and can serve as inspiration for further efforts to improve ULFM's repair capabilities.