On the road to recovery: restoring data after disasters

DOI: 10.1145/1217935.1217958
Authors: Kimberly Keeton, Dirk Beyer, Ernesto Brau, Arif Merchant, Cipriano A. Sant

Citations: 23
Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure occurrence. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial. This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
Conference: EuroSys Conference (EUROSYS), pp. 235-248, 2006
    • ...References [1], [2], and [15] consider questions in the broader area of dependable storage system evaluation and design, including online and offline data protection techniques...
    • ...In the area of modeling dependable storage system behavior, Keeton and Merchant presented a framework for evaluating the recovery time and recent data loss for a single application protected by a combination of techniques [2]; more recent work by their group examines how to schedule recovery operations for multiple workloads [15]...

    Shravan Gaonkar et al. Designing Dependable Storage Solutions for Shared Application Environm...

    • ...Keeton et al. [29] details out the actual recovery process after a disaster has struck...

    Tapan Kumar Nayak et al. End-to-end disaster recovery planning: From art to science

    • ...For example, even a correct server may sometimes impose significant extra load because, in an asynchronous system, it may fall behind in processing requests and then need to ask other servers to send a checkpoint of the system’s state and recent requests. [18]...

    Allen Clement et al. Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults

    • ...We have applied Plato to the dynamic reconfiguration of an overlay network [2] for distributing data to a collection of remote data mirrors [11, 13]...

    Andres J. Ramirez et al. Applying genetic algorithms to decision making in autonomic computing ...

    • ...Financial systems are under huge competitive pressure to support enormous transaction rates, and as the clearing time for transactions continues to diminish towards immediate settlement, the amounts of money at risk from even a small loss of data will continue to rise [20]...
    • ...Since the network-sync option enhances remote mirroring protocols, we assume that a complete remote mirroring protocol will itself handle failover and recovery directly [19, 22, 20]...
    • ...In [20], for example, the authors propose a reactive way to solve the data recovery scheduling problem once the disaster has occurred...

    Hakim Weatherspoon et al. Smoke and Mirrors: Reflecting Files at a Geographically Remote Locatio...
