Skip to main content

A great way to understand data-intensive systems deeply—including their challenges, the space of available solutions, and their trade-offs—is to build them. As part of this course students will build a version of a real distributed system — a small but realistic version of Google. The course follows a new approach to teaching such systems termed distribution-as-a-library: as part of this approach, students use a conventional (non-distributed) programming language to build a library of distributed services.

This is a hands-on implementation-heavy cours — students will incrementally build components of a large-scale distributed system. They will start with a high-performance web server implementing HTTP, then expand it towards an actor-style message-passing substrate, then build a scalable queryable key-value store, then build a MapReduce framework, and then use that to implement scalable web crawling, indexing, and retrieval algorithms. Combined, these components will result in the implementation of Google-style search engine — including distributed, scalable crawling; indexing with ranking; stream processing; and even PageRank on the students’ own MapReduce-style platform!

Project Milestones #

All project milestones except the last one are individual. The pace is about one project milestone per week—due on Tuesday—except for the last milestone, which is due in two weeks after the release of the milestone handout. Milestones build on each other, eventually culminating in a large-scale distributed system deployed on the cloud.

M0 Setup: The goal of this milestone is to set up the development environment, familiarize students with the steps required for each assignment (e.g., GitHub commits, Gradescope), and confirm everyone’s background meets the requirements of the course. It will also serve as a refresher on JavaScript/Typescript and Shell programming.

M1 Serialization: To implement any form of distributed computation, two or more nodes need to communicate — and to do that, they first need to be able to exchange messages. Serialization often comes built into the programming language or runtime system — but here you will be implementing the core serialization and deserialization functionality yourselves. The goal of this milestone is to build the necessary infrastructure for exchanging messages between nodes — including converting any value from an in-memory structure to an on-wire message and correctly back to an in-memory structure.

M2 Actors & Remote Procedure Calls (RPCs): The goal of this framework is to build a simple actor framework with associated runtime support. Actors can be thought of as event-driven processes that, in response to incoming messages, take a set of actions — e.g.,compute something locally, respond to a message, send messages to other actors, or interact with persistent local storage.

M3 Groups & Gossip: This milestone focuses on abstractions and systems support for addressing a set of nodes as a single system. Additional support includes scalable gossip protocols—a first version of a — for checking the health and status of all nodes in a node group, as well as for relaying messaging to node groups.

M4 Distributed Storage: The goal of this milestone is to implement a distributed and scalable storage subsystem over a set of nodes. The core of this milestone is a distributed key-value store centered around two classic techniques — consistent hashing and rendezvous hashing — to store, retrieve, update, and delete objects.

M5 Distributed Execution Engine: This milestone focuses on implementing scalable processing abstractions modeled after the higher-order map and reduce functions. These functions operate over the distributed storage system

M6 Distributed Processing: This milestone uses the map and reduce abstractions developed in M5 to build several distributed data-processing programs — including web crawling, document indexing and ranking, NLP-based ranking, and other functionality critical to the distributed search engine.

M7 Fault Tolerance: This milestone upgrades several components developed earlier with the ability to tolerate node failures of various forms. The core techniques underlying these features are replication and consensus.

M8 Cloud Deployment (Group milestone): The final milestone is about deploying the entire system on real distributed infrastructure such as Amazon Web Services (AWS). Students here will work in groups, combining best-of-breed milestone implementations, considering additional trade-offs, and tuning several subsystems appropriately to offer a final product that is deployed and runs on the cloud.

Project Report #

Project milestones are individual, but the last milestone and the project report is submitted in groups of, ideally, 4 students. The project report will describe the end-to-end system, and will have the form of a short research paper: an introduction section (including problem statements and techniques applied), a little example section, a few sections on the design and implementation of the system—including the core challenges and how they were solved), and a related-work section. It will also include a section evaluating and characterizing each system along several dimensions — including correctness, performance, scalability, and fault tolerance — and a discussion section that compares and contrasts the design and implementation decisions of different members in the team.

As part of the report, students are expected to collaborate effectively — including comparing and contrasting their approaches to different milestones, thinking about the strengths and weaknesses of their respective approaches, and reflecting on their learning.

Posters & Poster Session #

Students in this course will work in groups—the same groups they joined for the last milestone and the project report—to prepare a poster outlining their project. Beyond some basics about the project, students have significant creative freedom in terms of the content and presentation of the poster. Students will present their posters during a poster session during the last week of class. Attendees during the poster session will include an evaluation committee of distributed-systems experts from industry and academia, who will ask each team questions about the implementation of the project, its performance, and other characteristics of interest.