Proactive fault tolerance for HPC with Xen virtualization

Proactive fault tolerance for HPC with Xen virtualization,10.1145/1274971.1274978,Arun Babu Nagarajan,Frank Mueller,Christian Engelmann,Stephen L. Sco

Proactive fault tolerance for HPC with Xen virtualization   (Citations: 87)
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, one that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
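The abstract names three daemon duties: health monitoring, load determination, and initiation of guest OS migration. The sketch below illustrates how such a decision step might fit together; it is not the authors' implementation. The `NodeStatus` fields, the temperature threshold, and the rule "move to the least-loaded healthy node without a guest" are assumptions made for illustration only. The sketch only builds an `xm migrate --live` command line (the live-migration command of Xen's `xm` toolstack) and does not execute it.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed threshold; the abstract does not specify which sensors or limits are used.
TEMP_THRESHOLD_C = 70.0


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one physical node as seen by the daemon."""
    name: str
    temperature_c: float  # assumed health indicator
    load: float           # assumed load metric (lower is better)
    has_guest: bool       # True if the node hosts a guest OS running an MPI task


def is_unhealthy(node: NodeStatus) -> bool:
    """Health monitoring step: flag a node whose reading crosses the threshold."""
    return node.temperature_c >= TEMP_THRESHOLD_C


def plan_migrations(nodes: List[NodeStatus],
                    guest_of: Dict[str, str]) -> List[List[str]]:
    """Return one live-migration command per guest on an unhealthy node.

    guest_of maps a node name to the (hypothetical) guest domain name on it.
    """
    # Load determination step: healthy nodes without a guest are candidates.
    spares = [n for n in nodes if not is_unhealthy(n) and not n.has_guest]
    commands = []
    for node in nodes:
        if node.has_guest and is_unhealthy(node):
            if not spares:
                break  # no healthy spare left; fall back to reactive FT
            target = min(spares, key=lambda n: n.load)
            spares.remove(target)  # one guest per spare node in this sketch
            # Migration initiation step: build the command, do not run it.
            commands.append(
                ["xm", "migrate", "--live", guest_of[node.name], target.name]
            )
    return commands


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 82.0, 1.0, True),   # deteriorating node with a guest
        NodeStatus("n2", 45.0, 1.0, True),   # healthy node with a guest
        NodeStatus("n3", 40.0, 0.6, False),  # healthy spare
        NodeStatus("n4", 41.0, 0.2, False),  # healthy spare, least loaded
    ]
    for cmd in plan_migrations(cluster, {"n1": "vm1", "n2": "vm2"}):
        print(" ".join(cmd))
```

Keeping the decision logic as a pure function that returns commands makes it possible to check the policy without a Xen installation; a real daemon would poll sensors periodically and run the resulting commands.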