Latest Announcements

11/14: Slight change: switched Mars and Piccolo, see schedule.
11/13: List of papers updated to all but the last class!
11/08: Choose the papers you want to read. Deadline is Thursday, Nov 10.

See all announcements (RSS 2.0 feed)


Quick Summary

  • Different from previous editions: narrower focus on large-scale, data-intensive computing (e.g., MapReduce, Dryad, GraphLab)
  • Systems and application focus
  • Focus on research: reading, reviewing, and presenting papers; a few programming assignments; a final mini research project
  • Graduate students or advanced undergrads (with consent of instructor)
  • CS and applied areas (e.g., Engineering, Biology, Astronomy)
  • Not a traditional parallel computing course (not CS178 or CS176, not an MPI/OpenMP course)

Overview

This class is a graduate seminar that focuses on current research topics in networking, distributed systems, and operating systems. The focus this semester is on large-scale, data-intensive computing.

We are currently generating much more data than we can process, both in industry and in academia. Large web sites have hundreds of millions of users constantly interacting and generating content; companies need to track global transactions and inventory; DNA sequencing throughput has been improving about fivefold per year; the LHC is expected to generate 10-15 PB per year. In this course we will look into different ways to process datasets like these, focusing on two broad aspects: systems and applications.

Google pioneered and popularized the idea of processing large amounts of data on clusters of commodity machines with its MapReduce model of coarse-grained parallelism. Since then, the model has been applied to many problems beyond the initial information indexing and retrieval tasks. Several implementations of MapReduce have been produced independently, for architectures ranging from multi-core computers to GPUs to large clusters; several extensions to the model have been created to address performance and algorithmic deficiencies; and entirely new models have been proposed for similar problems.
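For readers new to the model, the following is a minimal single-process sketch of the MapReduce pattern, using word counting as the example. It is only an illustration, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(docs))
```

Because each call to map_phase depends only on its own record, and each call to reduce_phase only on its own key, both steps can in principle be distributed without coordination; this is the coarse-grained parallelism the model relies on.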

This course is suitable for graduate students and advanced undergraduates in Computer Science or in other disciplines that need large-scale computation, such as engineering, biology, and geology.

The course will consist of a mix of lectures by the instructor and guests, and presentations by the students, followed by discussions. We will focus on the systems and frameworks that enable large-scale data-intensive computation, as well as on algorithms and applications suitable for these frameworks. As an example, we will study MapReduce and Hadoop, as well as a number of algorithms implemented on MapReduce.

There will be a few individual programming assignments to get students up to speed with some sample problems.

The other major component of the course is a final research project on a topic related to the course. This project can focus on the systems aspect (e.g., a new framework, an improvement, or a comparison among existing frameworks), or on the applications aspect (e.g., casting a new algorithm or a domain-specific problem to a framework, with a comprehensive evaluation). There is considerable flexibility in the topic of the project, with a preference for topics related to the student's own research if applicable. The projects will preferably be done in groups of two, and the ideal group will pair an applications person with a systems person.

Prerequisites: some programming experience. Less is needed if you want to focus on a specific application.

  • Lecture time: Tu/Th 10:30-11:50
  • Location: CIT 506

Instructor

Rodrigo Fonseca

Office: CIT 329, OH TBD