A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models

A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models,10.1109/PDP.2011.72,Nawab Ali,Sriram Krishnamoorthy,Niranjan

Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to efficiently recover from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. Results from a computational chemistry application running at scale show that our techniques provide applications with a high degree of fault tolerance and low (2%-4%) overhead for 2048 processors.
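The idea of shadow data structures kept up to date by redundant remote writes can be illustrated with a toy, single-process sketch. This is not the paper's implementation: the class `ShadowedArray`, its methods, and the placement rule (the shadow of block `b` lives on rank `(b + 1) % num_ranks`) are all assumptions made for illustration, and Python dictionaries stand in for per-rank memory.

```python
class ShadowedArray:
    """Toy global array: one block per rank, plus a shadow copy on a neighbor.

    Hypothetical illustration only; names and placement rule are assumptions.
    """

    def __init__(self, num_ranks, block_size):
        if num_ranks < 2:
            raise ValueError("need at least two ranks to hold a shadow copy")
        self.num_ranks = num_ranks
        self.block_size = block_size
        # primary[b] is block b, stored on rank b.
        self.primary = {b: [0.0] * block_size for b in range(num_ranks)}
        # shadow[b] is the copy of block b, stored on rank shadow_host(b).
        self.shadow = {b: [0.0] * block_size for b in range(num_ranks)}
        self.failed = set()

    def shadow_host(self, block):
        # Assumed placement: the next rank in a ring holds the shadow.
        return (block + 1) % self.num_ranks

    def put(self, index, value):
        """Redundant write: update every surviving copy of the element."""
        block, offset = divmod(index, self.block_size)
        if block not in self.failed:
            self.primary[block][offset] = value
        if self.shadow_host(block) not in self.failed:
            self.shadow[block][offset] = value

    def get(self, index):
        """Read from the primary copy, falling back to the shadow."""
        block, offset = divmod(index, self.block_size)
        if block not in self.failed:
            return self.primary[block][offset]
        if self.shadow_host(block) in self.failed:
            raise RuntimeError("both copies of the block are lost")
        return self.shadow[block][offset]

    def fail(self, rank):
        """Simulate losing a rank: its primary block and hosted shadow vanish."""
        self.failed.add(rank)
        self.primary[rank] = None
        self.shadow[(rank - 1) % self.num_ranks] = None


arr = ShadowedArray(num_ranks=4, block_size=2)
arr.put(3, 7.5)      # element 3 lives in block 1; shadow copy sits on rank 2
arr.fail(1)          # rank 1 is lost
value = arr.get(3)   # served from the shadow copy, no checkpoint reload
```

In this sketch every write costs one extra copy, and a read after a failure needs no restart from disk; it tolerates the loss of any single rank but not of a rank together with the neighbor holding its shadow.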