
Haifeng Yu
Researcher, Intel Research Pittsburgh
Adjunct Assistant Professor, Carnegie Mellon University
http://www.cs.cmu.edu/~yhf

Tuesday, March 7
3:10-4:00 p.m.
1065 Kemper - refreshments to follow in 1131 Kemper Hall


Robust Distributed Systems despite Correlated Failures

With rapid advances in systems research, performance is no longer the sole principal design goal in wide-area distributed systems (such as email, distributed file systems, and peer-to-peer systems). In particular, availability has gained much importance and is the focus of many recent research efforts. In wide-area systems, failures are often correlated, and such failure correlation is a major challenge in building highly available distributed systems.

In the context of a wide-area storage system called IrisStore, this talk thoroughly explores how to better tolerate correlated failures. I will first revisit previous fault-tolerant techniques using trace-driven simulation, model-driven simulation, mathematical analysis, and live deployment. Among the findings, I will show that majority voting, a widely used technique for high availability, suffers from a strong diminishing return effect under correlated failures. This effect can prevent the system from achieving good availability regardless of the amount of resources used.
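
As a rough illustration of the diminishing return effect, the sketch below compares the unavailability of majority voting under independent failures with its unavailability under a toy correlated-failure model. The model and every parameter in it (p_ind, q_event, p_event, and the function names) are assumptions made up for this example; they are not the failure model, traces, or results used in the talk.

```python
from math import comb


def majority_unavailability(n: int, p_fail: float) -> float:
    """Probability that fewer than a majority of n replicas are up,
    when each replica is down independently with probability p_fail."""
    min_failures = n - n // 2  # this many failures leave no majority
    return sum(
        comb(n, k) * p_fail**k * (1 - p_fail) ** (n - k)
        for k in range(min_failures, n + 1)
    )


def correlated_unavailability(
    n: int, p_ind: float, q_event: float, p_event: float
) -> float:
    """Toy mixture model (an assumption, not the talk's model):
    with probability q_event a correlated event is in progress and each
    replica is down with probability p_event; otherwise each replica is
    down with probability p_ind."""
    normal = majority_unavailability(n, p_ind)
    during_event = majority_unavailability(n, p_event)
    return (1 - q_event) * normal + q_event * during_event


if __name__ == "__main__":
    # Hypothetical parameters chosen only to make the effect visible.
    p_ind, q_event, p_event = 0.1, 0.01, 0.5
    for n in (3, 5, 11, 21, 51, 101):
        independent = majority_unavailability(n, p_ind)
        correlated = correlated_unavailability(n, p_ind, q_event, p_event)
        print(f"n={n:3d}  independent={independent:.2e}  correlated={correlated:.2e}")
```

In this toy model, adding replicas keeps shrinking unavailability when failures are independent, but under correlation the value levels off near q_event times the chance that a majority is down during an event, however many replicas are added.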

The second part of the talk will describe the novel design of signed quorum systems (SQS) that help to alleviate the damaging diminishing return effect. Compared to traditional quorum systems such as majority voting, SQS can have a constant quorum size and thus achieve orders of magnitude lower unavailability, at the cost of a small probability (e.g., below 10^{-5}) of returning stale data. The constant quorum size further helps the system to avoid the diminishing return effect in practice.

I will demonstrate how the above findings and techniques can be effectively used in IrisStore. I will also report on our experience running IrisStore on PlanetLab over an 8-month period, and demonstrate its ability to tolerate correlated failures.

Biography: Haifeng Yu is a Researcher at Intel Research Pittsburgh and an Adjunct Assistant Professor in the Department of Computer Science at Carnegie Mellon University. His research interests cover the general area of distributed systems, as well as related fields such as fault tolerance, large-scale peer-to-peer systems, distributed computing, operating systems, and database systems. Haifeng received his Ph.D. (2002) and M.S. (1999) from Duke University, and his B.E. (1997) from Shanghai Jiaotong University, China. More information about his research is available at http://www.cs.cmu.edu/~yhf.