The Design of a General-Purpose Distributed Execution System
Scaling applications with distributed execution has become the norm. With the rise of big data and machine learning, more and more developers must build applications that involve complex and data-intensive distributed processing.
In this talk, I will discuss the design of a general-purpose distributed execution system that can serve as a common platform for such applications. Such a system offers two key benefits: (1) common system functionality such as distributed resource management can be shared across different application domains, and (2) by building on the same platform, applications across domains can easily interoperate.
First, I will introduce the distributed futures interface, a powerful and expressive distributed programming abstraction for remote execution and memory. Second, I will introduce ownership, an architecture for distributed futures systems that simultaneously provides horizontal scalability, low latency, and fault tolerance. Finally, I will present Exoshuffle, a large-scale shuffle system that builds on distributed futures and ownership to match the speed and reliability of specialized data processing frameworks while using an order of magnitude less code. These works have reached a broad audience through Ray, an open-source distributed futures system for Python that has more than 23,000 GitHub stars and has been used to train ChatGPT and to break the world record for CloudSort.
Stephanie Wang is a final-year PhD student at UC Berkeley, advised by Professor Ion Stoica. She is interested in distributed systems, with a current focus on problems in cloud computing and fault tolerance. She is a co-creator and committer of Ray, the popular open-source project for distributed Python. Stephanie has received the UC Berkeley Chancellor’s Fellowship and a Distinguished Artifact Award at SOSP’19, and was selected for Rising Stars in EECS in 2021.