We believe the parallel runtime system is in a unique position to extract useful debugging and program analysis information. Because the runtime system manages all communication and directs control flow, it can present this information in a more useful form than a low-level sequential debugger such as gdb.
In this paper, we present a selection of parallel debugging techniques that overcome the shortcomings of applying existing sequential debugging schemes to a parallel program. Our goal is to provide an integrated debugging environment that allows the programmer to examine and understand the changing state of the parallel program during the course of its execution. We present little brand-new work here; instead, we present an integrated, orthogonal environment in which these well-known techniques can be put into practice.
The most widely used debugging method, especially in the primitive runtime environments common to parallel machines, is the insertion of output statements into the code to log specific variables and important events. This method's popularity comes from its simplicity and from the fact that it requires no additional software or training. Nevertheless, the programmer must decide in advance which variables to print and where to insert the output statements, and adding a new output statement means editing and recompiling the program. Finding the one piece of critical information hidden in a large output log can be painfully frustrating. Logging in parallel is even more difficult, because network and buffering delays can reorder log statements, resulting in bizarre logs where effects sometimes precede their causes.
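The reordering problem can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: the `BufferedLogger` class, the processor names `PE0` and `PE1`, and the message text are all hypothetical and are not part of any runtime system discussed here. Each simulated process buffers its log lines locally, and the shared log records lines in the order the buffers are flushed rather than the order the events occurred.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedLogger:
    """Hypothetical per-process logger that buffers lines until flushed."""
    name: str
    sink: list                      # shared log, modelled as a list of lines
    buffer: list = field(default_factory=list)

    def log(self, message):
        # The line is only queued locally; it is not yet visible in the sink.
        self.buffer.append(f"[{self.name}] {message}")

    def flush(self):
        # Buffered lines reach the shared log only at flush time.
        self.sink.extend(self.buffer)
        self.buffer.clear()


def simulate():
    shared_log = []
    sender = BufferedLogger("PE0", shared_log)
    receiver = BufferedLogger("PE1", shared_log)

    sender.log("send msg 7")      # cause: happens first
    receiver.log("recv msg 7")    # effect: happens second
    receiver.flush()              # the receiver's buffer is written first
    sender.flush()                # the sender's buffer arrives late
    return shared_log


if __name__ == "__main__":
    for line in simulate():
        print(line)
```

In the resulting log, the receive line appears before the send line, even though the send happened first. Real systems add network delay and nondeterministic scheduling on top of this, so the observed order can also vary from run to run.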
Traditional sequential debuggers can deal quite well with a single flow of control using the usual array of step commands, breakpoints, and data structure displays. Sequential debuggers and other sequential debugging tools are still helpful in debugging the individual processes in a parallel program, but their single-process view of the program ignores concurrent accesses, so debugging message-passing or concurrency-related bugs is quite difficult.
There are many research parallel debuggers of varying quality and a handful of commercial debuggers, of which TotalView is a well-known example. Hooks are available that let TotalView directly examine the message queues of many MPI implementations[2], but little additional runtime support is available for this debugger. In addition, the price of the debugger, being nonzero, is beyond the software budget of many small clusters.
Finally, CHARM++ already had a parallel debugger[15], but because of various shortcomings that we will describe, it was difficult to use on real applications.
January 23, 2004