Mai Zheng
Research Statement
Storage system failures are extremely damaging — if your browser crashes you sigh, but if your family photos disappear you cry. We therefore need highly reliable storage systems that can keep our data safe no matter what happens. Such a high standard of reliability is difficult to achieve, which makes the problem all the more fascinating. Thus, the primary focus of my research is the reliability of storage systems, defined broadly to cover everything from low-level storage devices to high-level information management systems such as databases.
Beyond data storage, another challenge in this era of Big Data is how to process huge volumes of data correctly and efficiently. From a systems perspective, the question boils down to how to exploit parallelism fully and achieve efficient parallel or distributed computation while maintaining reliability, which is much more challenging than traditional sequential computation.
Thus, my research centers on two critical aspects of Big Data: the reliability and efficiency of data storage and computation. In terms of system categories, I focus mainly on storage systems and parallel and distributed systems. In particular, my work on system reliability includes a record-and-replay framework
that helps test and diagnose the failure-recovery capability of modern databases [1], a fault-injection framework
that uncovers the failure patterns of flash-based solid state drives (SSDs) [2], and two low-overhead bug detectors
for parallel and concurrent programs [3][4]. My work on efficiency of systems includes a fine-grained profiler for
tuning the shared memory usage of parallel programs [5] and a low-latency data layout scheme for distributed
storage systems [6].
My future research will continue to focus on making computer systems more reliable and more efficient, and will expand into closely related areas such as security. Specifically, in the short term, I plan to build an enhanced reliability-analysis framework for storage systems along two dimensions: first, analyzing failure propagation among the different layers of the whole storage stack (e.g., file systems, logical volume managers, and software RAID) on a single machine; second, analyzing failure handling in cloud storage systems with multiple replicas. Another interesting opportunity opened up by this analysis is enhancing the reliability of data transactions based on the errors the framework exposes. Moreover, when a system becomes unreliable (e.g., a transaction is partially committed), it may also become more vulnerable to security attacks; thus, I will explore the combined effects that arise when reliability meets security. In the longer term, I will explore the reliability, security, and efficiency of emerging systems built on emerging technologies (e.g., mobile systems and phase-change memory, or PCM). My ultimate goal is to make computer systems better so that people benefit from them more. To this end, I will keep looking for cross-area and cross-disciplinary challenges and opportunities. Through collaboration with researchers from different domains, I can amplify my expertise and experience and thus maximize my contribution to society as a researcher.
The following sections elaborate on my existing and future research mentioned above.
1. Previous and ongoing research
Reliability analysis of modern databases [1]. People use databases when they want a high level of
reliability. Specifically, they want the sophisticated ACID (atomicity, consistency, isolation, and durability)
protection modern databases provide. However, the ACID properties are far from trivial to provide, particularly
when high performance must be achieved. This leads to complex and error-prone code—even at a low defect rate
of one bug per thousand lines, the millions of lines of code in a commercial OLTP database can harbor thousands
of bugs. In particular, checking for the ACID properties under failure is notoriously hard since a failure scenario
may not be conveniently reproducible.
My colleagues and I built a framework to expose and diagnose ACID violations in databases under failures. By decoupling the databases from the main framework through iSCSI, a protocol that allows one machine to access the block devices of another machine over the network, we can test databases on multiple operating systems with high fidelity. By running carefully designed workloads, we can stress different database functionalities and easily check each of the ACID properties. By using a record-and-replay technique, we can systematically simulate the effect of power outages or system crashes at every possible point during a workload and re-create the failure scenarios precisely. Moreover, we discovered five low-level I/O patterns that can be used to predict the critical operations of a workload that are most vulnerable to power faults. In addition, we developed a multi-layer tracer that covers everything from function-call-level semantics down to the lowest-level block operations and provides a complete picture for diagnosis.
Using our framework, we studied 8 widely used databases, ranging from open-source key-value stores to high-end commercial OLTP servers. Surprisingly, all 8 databases exhibited erroneous behavior. We are working with the developers of these databases to fix the defects exposed by our framework, and we have been contacted by several interested companies, including Microsoft and Facebook.
Reliability analysis of flash-based solid state drives (SSDs) [2]. High-level storage systems (e.g., databases) rely on low-level storage devices to provide basic data integrity and consistency guarantees. Although the interface has remained the same, the underlying technology has evolved substantially. Modern storage devices such as SSDs bring new reliability challenges to the already complicated storage stack. Among other things, their behavior under adverse conditions (e.g., power outages) is an important yet largely ignored issue.
My colleagues and I built a framework to expose reliability issues in block devices such as SSDs under power faults. The framework includes specially designed hardware to inject power faults directly into devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our framework, we tested fifteen commodity SSDs from five different vendors using more than three thousand fault-injection cycles in total. Our experimental results reveal that thirteen of the fifteen tested SSDs exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure. Our analysis has been reported by several media outlets. Moreover, we have been contacted by several major SSD manufacturers (e.g., Intel, sTec, and LSI), storage system builders (e.g., IBM, Savvis, and SolidFire), and consumers (e.g., Airbus, PepsiCo, and RBA Consulting).
Reliability improvements for parallel and concurrent programs [3][4]. Besides storage, another challenge in this era of Big Data is computation: how do we process huge volumes of data efficiently? The processor industry has answered this question by introducing ever more aggressive parallel architectures. Besides multi-core CPUs, many-core GPUs (i.e., graphics processing units) have emerged as an extremely cost-effective means for general-purpose computation. However, these parallel platforms are non-trivial to use correctly because of the additional thread interleaving and synchronization they require. To improve the reliability of programs running on these platforms, my colleagues and I combined static and dynamic analysis, made full use of architectural features, and built GRace, a low-overhead data race detector for GPU programs. For multi-threaded CPU programs, we developed a state-machine-based approach to catch a class of concurrency bugs involving shared objects.
Efficiency improvements for parallel and distributed systems [5][6]. On parallel GPU systems, efficiency depends heavily on the use of the memory hierarchy. In particular, how to make an optimal tradeoff between the large-but-slow device memory and the small-but-fast shared memory is a critical and non-trivial question for achieving high performance. To help answer this question, my colleagues and I built a fine-grained profiler for tuning shared memory usage through a combination of software analysis and architectural optimizations. Similarly, for hybrid storage systems containing both hard disks and SSDs, efficiency depends heavily on the use of the storage hierarchy. By carefully placing data with different access patterns on different devices, we improved the efficiency of a distributed storage system by up to 5x under typical workloads.
2. Future research
Based on my existing research, I see many opportunities in computer systems and related areas where my
expertise and experience may help advance the frontier in the future.
In the short term, I will enhance my current reliability-analysis framework. The first direction is analyzing failure propagation among the different layers of the whole storage stack on a single machine. Databases and SSDs are just two layers in a typical storage stack, which may contain multiple other layers (e.g., file systems, logical volume managers, and software RAID). Each of these additional layers has unique characteristics and thus may require unique workloads and checking logic. Given that we have observed erroneous behavior in many databases and SSDs, the other layers likely contain defects as well. I will first analyze each individual layer, which is fundamental for whole-system analysis. Then, I will analyze how the failure of a lower layer propagates to upper layers and interacts with potential errors inside those upper layers, which is crucial for improving the reliability of the system as a whole.
The second direction is analyzing failure handling in cloud storage systems with multiple replicas. Much of our data is now stored and managed in the cloud. If even the relatively mature single-machine storage systems can exhibit erroneous behaviors, it becomes both important and urgent to perform similar in-depth analysis on cloud storage systems, which add more layers on top of the local storage stack and are responsible for protecting far more data. The cloud environment introduces many new challenges, not only for the system under analysis but also for the analysis framework itself. For example, synchronizing timestamps, which is necessary for combining traces, is a new issue for the analysis. I will study state-of-the-art techniques for handling these classic distributed-systems problems, incorporate them into my analysis methodology, and design a distributed analysis framework tailored to cloud storage systems.
Besides enhancing the analysis framework, another interesting opportunity opened up by the analysis is enhancing the reliability of data transactions based on the errors exposed by our framework. For example, our database analysis shows a significant gap in understanding and assumptions between database developers and operating system developers regarding the behavior of the system call interface. By collaborating with researchers from these different communities, we can bridge the gaps and make storage transactions truly reliable under failures while achieving acceptable efficiency.
In addition, I will explore security, especially the combined effects that arise when reliability meets security. Many issues (e.g., buffer overflows) relate to both reliability and security. When a system becomes unreliable (e.g., a transaction is partially committed, as triggered by our framework), it may also become more vulnerable to security attacks, an increasingly critical issue in this Big Data era. I will collaborate with researchers from the security community to address these combined challenges.
In the longer term, I will explore the reliability, security, and efficiency of emerging systems built on emerging technologies. For example, mobile systems such as Android have become an increasingly important computing platform in daily life. These systems rely on flash memory for persistent storage and have special constraints (e.g., energy efficiency). My experience with flash-based SSDs, and with storage systems in general, may help in optimizing mobile systems as well. Also, just as SSDs have revolutionized the storage market, new technologies such as phase-change memory (PCM) will probably improve the performance of existing systems greatly while introducing new challenges. New design tradeoffs must be made to achieve an optimal balance among reliability, security, and efficiency. Moreover, as systems and technologies advance, combining the best of different systems will become possible. For example, parallel platforms such as GPUs are mainly used for high-performance computing; it will be interesting to see whether their parallel computing power can be harnessed for efficient storage without sacrificing reliability or security.
My ultimate goal is to make computer systems better in terms of reliability, security, efficiency, and other desirable properties so that people can benefit from them more. To this end, I will start from my existing domains and keep looking for cross-area and cross-disciplinary challenges and opportunities. I believe that through collaboration with researchers from different domains, I can amplify my expertise and experience and thus maximize my contribution to society as a researcher.
References
[1] M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge, E. Yang, B. Zhao, and S. Singh, “Torturing Databases for Fun and Profit”. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14), 2014.
[2] M. Zheng, J. Tucek, F. Qin, and M. Lillibridge, “Understanding the Robustness of SSDs under Power Fault”. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13), 2013.
[3] M. Zheng, V. Ravi, F. Qin, and G. Agrawal, “GRace: A Low-Overhead Mechanism for Detecting Data Races in GPU Programs”. In Proceedings of the 16th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP’11), 2011.
[4] Q. Gao, W. Zhang, Z. Chen, M. Zheng, and F. Qin, “2ndStrike: Towards Manifesting Hidden Concurrency Typestate Bugs”. In Proceedings of the 16th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’11), 2011.
[5] M. Zheng, V. T. Ravi, W. Ma, F. Qin, and G. Agrawal, “GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs”. In Proceedings of the 19th IEEE Annual International Conference on High Performance Computing (HiPC’12), 2012.
[6] D. Huang, X. Zhang, W. Shi, M. Zheng, S. Jiang, and F. Qin, “LiU: Hiding Disk Access Latency for HPC Applications with a New SSD-Enabled Data Layout”. In Proceedings of the 21st IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’13), 2013.