Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software
This paper tests the hypothesis that generic recovery techniques, such as process pairs, can survive most application faults without using application-specific information. We examine in detail the faults that occur in three, large, open-source applications: the Apache web server, the GNOME desktop environment, and the MySQL database. Using information contained in the bug reports and source code, we classify faults based on how they depend on the operating environment. We find that 72-87% of the faults are independent of the operating environment and are hence deterministic (non-transient). Recovering from the failures caused by these faults requires the use of application-specific knowledge. Half of the remaining faults depend on a condition in the operating environment that is likely to persist on retry, and the failures caused by these faults are also likely to require application-specific recovery. Unfortunately, only 5-14% of the faults were triggered by transient conditions, such as timing and synchronization, that naturally fix themselves during recovery. Our results indicate that classical application-generic recovery techniques, such as process pairs, will not be sufficient to enable applications to survive most failures caused by application faults.
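The distinction the abstract draws can be illustrated with a minimal sketch. It is not the paper's methodology or any real process-pair implementation: `generic_retry`, `deterministic_fault`, and `transient_fault` are hypothetical names, and plain re-execution stands in for generic recovery. The sketch only shows why re-running the same operation helps when the triggering condition is transient and does not help when the fault is deterministic.

```python
def generic_retry(operation, max_attempts=3):
    """Re-run `operation` unchanged, using no application-specific knowledge.

    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(attempt), attempt
        except RuntimeError:
            continue  # generic recovery: simply try again
    return None, max_attempts


def deterministic_fault(attempt):
    # Hypothetical environment-independent bug: the same input
    # triggers the same failure on every attempt.
    raise RuntimeError("bad input handling")


def transient_fault(attempt):
    # Hypothetical timing-dependent bug: assumed here to manifest
    # only on the first run and to clear on retry.
    if attempt == 1:
        raise RuntimeError("lost a race")
    return "ok"


print(generic_retry(deterministic_fault))  # (None, 3)
print(generic_retry(transient_fault))      # ('ok', 2)
```

Under these assumptions, retrying recovers only the transient case, which corresponds to the small share of faults the study attributes to transient conditions.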