The 42nd IPP Symposium

Combating Stragglers in MapReduce Networks

Srikanth Kandula, Microsoft

The phase structure of map-reduce jobs and resource sharing in cluster networks make task scheduling challenging. In particular, a few laggards can substantially prolong job completion. Previous work addresses the problem by duplicating tasks that are still running after the others in their cohort have finished. By analyzing logs from a large production cluster, we identify many causes of stragglers and find that duplication prolongs some stragglers rather than mitigating them. The causes include dynamic resource contention on machines and along network paths, imbalance in task workload, and wide differences in path bandwidth and disk loss rates.

This talk describes ClusterCull, an add-on to the job scheduler. ClusterCull culls stragglers based on their causes using resource-aware algorithms. From real-time progress reports, ClusterCull detects stragglers early in their lifetime and takes appropriate action based on their causes. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Both trace-driven simulations and deployment in a 12K-server cluster indicate that ClusterCull improves job completion time by an additional 3.1x over the existing state-of-the-art approach.
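
As a rough illustration of detecting stragglers from progress reports, the sketch below extrapolates each running task's remaining time from its reported progress and flags tasks that would take much longer than a typical finished task in the same cohort. This is not ClusterCull's actual algorithm, which the abstract does not specify; the names ProgressReport, estimate_remaining, find_stragglers, and the slack parameter are all hypothetical, and the constant-progress-rate assumption is a simplification.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ProgressReport:
    # Hypothetical report format; real schedulers may expose different fields.
    task_id: str
    elapsed_s: float      # seconds since the task started
    fraction_done: float  # reported progress in [0, 1]


def estimate_remaining(report: ProgressReport) -> float:
    """Extrapolate remaining seconds, assuming a constant progress rate."""
    if report.fraction_done <= 0:
        return float("inf")  # no progress yet, so no finite estimate
    return report.elapsed_s * (1.0 - report.fraction_done) / report.fraction_done


def find_stragglers(reports, finished_durations, slack=2.0):
    """Return ids of running tasks whose projected remaining time exceeds
    `slack` times the median duration of already finished cohort tasks.

    `slack` is an assumed tuning knob, not a value from the talk.
    """
    if not finished_durations:
        return []  # nothing to compare against yet
    typical = median(finished_durations)
    return [r.task_id for r in reports
            if estimate_remaining(r) > slack * typical]


if __name__ == "__main__":
    running = [
        ProgressReport("a", elapsed_s=100, fraction_done=0.9),
        ProgressReport("b", elapsed_s=100, fraction_done=0.1),
    ]
    print(find_stragglers(running, finished_durations=[100, 110, 120]))
```

Flagging a task this way only identifies a candidate; choosing what to do with it (for example, whether a duplicate would help at all) depends on the cause, which is the part the talk addresses.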

Srikanth Kandula is a Researcher at Microsoft Research. His research interests span all aspects of networked systems including datacenters, network management, applied statistical inference, and security. He has published over 15 papers in top-tier venues such as SIGCOMM, NSDI, and MobiSys. He is a winner of the NSDI best student paper award. He obtained his Ph.D. from the Massachusetts Institute of Technology (2008).