Tech Report CS-93-27

Trace Size vs. Parallelism in Trace-and-Replay Debugging of Shared-Memory Programs

Robert H.B. Netzer

June 1993

Abstract:

Execution replay is a debugging strategy in which a program is run repeatedly on an input that manifests a bug. For explicitly parallel shared-memory programs, execution replay requires special tool support: because these programs can be nondeterministic, their executions can differ from run to run on the same input. Such executions must therefore be traced before they can be replayed for debugging. We present improvements over our past work on an adaptive tracing strategy that records only a fraction of the execution's shared-memory references. Our past approach makes run-time tracing decisions by detecting and tracing exactly the nontransitive dynamic data dependences among the execution's shared-memory accesses; tracing the nontransitive dependences provides sufficient information for replay. In this paper we show that tracing exactly these dependences is not necessary. Instead, we present two algorithms that introduce and trace artificial dependences among some events that are actually independent. These artificial dependences reduce trace size but introduce additional event orderings that can reduce the amount of parallelism achievable during replay. One algorithm adds only dependences guaranteed not to lie on the critical path, so replay is never slowed. The other adds as many dependences as possible, slowing replay but reducing trace size further. Experiments show that we can improve the already high trace reduction of our past technique by up to two more orders of magnitude without slowing replay. Our new techniques usually trace only 0.00025-0.2% of the shared-memory references, a reduction of 3 to 6 orders of magnitude over past techniques that trace every access.
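
To make the nontransitive-dependence idea concrete, here is a minimal, hypothetical sketch (in Python, with illustrative names such as Thread, access, and last_access; the paper's actual algorithm may differ) of how a run-time monitor using vector clocks could trace a dependence only when it is not already transitively implied by dependences traced earlier:

    # Hypothetical sketch of transitive-reduction tracing; not the
    # paper's actual algorithm.
    class Thread:
        def __init__(self, tid, nthreads):
            self.tid = tid
            # clock[u] = latest event of thread u known, via traced
            # dependences, to happen before this thread's next event
            self.clock = [0] * nthreads

    last_access = {}   # location -> (tid, clock snapshot) of last access
    trace = []         # dependences actually written to the trace

    def access(t, x):
        """Monitor one shared-memory access to location x by thread t."""
        t.clock[t.tid] += 1                  # new event on thread t
        prev = last_access.get(x)
        if prev is not None and prev[0] != t.tid:
            u_tid, u_snap = prev
            if t.clock[u_tid] < u_snap[u_tid]:
                # The dependence u -> t is not implied by earlier traced
                # dependences: record it ...
                trace.append(((u_tid, u_snap[u_tid]), (t.tid, t.clock[t.tid])))
                # ... and merge clocks so dependences it implies are skipped.
                t.clock = [max(a, b) for a, b in zip(t.clock, u_snap)]
        last_access[x] = (t.tid, list(t.clock))

    # Example: thread 0 touches y then x; thread 1 touches x (traced),
    # then y (transitively implied by the x dependence, so not traced).
    t0, t1 = Thread(0, 2), Thread(1, 2)
    access(t0, "y"); access(t0, "x")
    access(t1, "x"); access(t1, "y")
    assert len(trace) == 1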

(complete text in PDF or gzipped PostScript)