Tech Report CS-93-26

Adaptive Message Logging for Incremental Replay of Message-Passing Programs

Robert H. B. Netzer and Jian Xu

May 1993

Abstract:

Cyclic debugging is a common strategy where a program is executed over and over to track down bugs. However, for message-passing parallel programs, cyclic debugging is impossible without support of special tools. Variations in message latencies and process scheduling can cause different executions on the same input to produce different results. To provide repeatability, an execution must be traced so it can be replayed as many times as needed during the debugging cycle. Since parallel programs are long-running, it is impractical to always restart the replay from the execution's beginning and wait for it to reach the desired point. Providing fast response to debugging queries requires {\it incremental} replay, where reexecution can start from intermediate states. To support incremental replay, processes' states must be saved periodically along with the contents of some messages, but the cost of saving these messages can be prohibitive. This paper presents an {\it adaptive} message logging algorithm that keeps this cost low by logging only a fraction of the messages. Our algorithm tracks dependences among messages and checkpoints to decide at run-time which messages cause a {\it domino effect}, and traces only these messages. The domino effect forces a replay request to start arbitrarily far back in the execution, and domino-free replay ensures that any part of the execution can be quickly reexecuted. Experiments on an iPSC/860 hypercube indicate that our algorithm typically logs only 1\-10% of the messages, a 1 to 2 order of magnitude reduction over past schemes which log every message. In addition, our experiments show that the resulting logs (which allow domino-free replay) provide a small bound on the amount of reexecution needed to satisfy any replay request. Our new logging algorithm thus reduces the overhead of message logging while bounding the response time to replay requests.

(complete text in pdf or gzipped postscript)