5 Debugging

5.0.1 How can I debug Charm++ programs?

There are many ways to debug programs written in Charm++:

print
By using CkPrintf, values can be printed at critical points in the program.

gdb
This can be used both on a single processor and in parallel runs. In the latter case, each processor gets its own terminal window with a gdb attached.

charmdebug
This is the most sophisticated method for debugging parallel programs in Charm++. It is tailored to Charm++ and can display and inspect chare objects as well as messages in the system. Individual gdb sessions can be attached to specific processors on demand.

5.0.2 How do I use charmdebug?

Currently, charmdebug has been tested only with net- versions of Charm++; testing with other versions is pending. The executable is located at java/bin/charmdebug in the Charm++ distribution. To start it, simply substitute ``charmdebug'' for ``charmrun'':

shell> <path>/charmdebug ./myprogram

5.0.3 Can I use TotalView?

Yes, on mpi- versions of Charm++. In this case, the program is a regular MPI application, so any tool available for MPI programs can be used. Note that some of the internal data structures (such as queued messages) might be difficult to find.

5.0.4 How do I use gdb with Charm++ programs?

It depends on the machine. On the net- versions of Charm++, like net-linux, you can just run the serial debugger:

shell> gdb myprogram

If the problem only shows up in parallel, and you're running on an X terminal, you can use the ++debug or ++debug-no-pause options of charmrun to get a separate window for each process:

shell> export DISPLAY="myterminal:0"
shell> ./charmrun ./myprogram +p2 ++debug

5.0.5 When I try to use the ++debug option I get: remote host not responding... connection closed

First, make sure the program at least starts properly without ++debug (i.e., charmrun is working and there are no problems in the program's startup phase). Then check that gdb or dbx, and xterm, are installed on all the machines you are using, not just the one that runs charmrun.

If you are working on remote machines from Linux, run ``xhost +'' locally to give the remote machines permission to display an xterm on your desktop. If you are working from a Windows machine, you need an X server application such as Exceed, configured to grant the right permissions for X windows.

The DISPLAY environment variable on the remote machine must point to your local machine. ssh and PuTTY are recommended, because they take care of the DISPLAY environment automatically, and ssh can be set up to use tunnels so that this works even from a private subnet (e.g. 192.168.0.8). Since the xterm is launched from the node machines, they must also have the correct DISPLAY set. Again, setting up ssh in the nodelist file to spawn node programs should take care of that. If you are using rsh, you need to set DISPLAY in ~/.charmrunrc, which is read at startup by each node program.

5.0.6 My debugging printouts seem to be out of order. How can I prevent this?

Printouts from different processors do not normally stay ordered. Consider the code:

...somewhere... {
  CkPrintf("cause\n");
  proxy.effect();
}
void effect(void) {
  CkPrintf("effect\n");
}

Though you might expect this code to always print ``cause, effect'', you may get ``effect, cause''. This can happen only when cause and effect execute on different processors and the output from cause is delayed.

If you pass the extra command-line parameter +syncprint, then CkPrintf actually blocks until the output is queued, so your printouts should at least happen in causal order. Note that this does dramatically slow down output.

5.0.7 Is there a way to flush the print buffers in Charm++ (like fflush())?

Charm++ automatically flushes the print buffers at every newline and at program exit. There is no way to flush the buffers manually at any other point.

5.0.8 My Charm++ program is causing a seg fault, and the debugger shows that it's crashing inside malloc or printf or fopen!

This isn't a bug in the C library; it's a bug in your program: you're corrupting the heap. Relink your program with -memory paranoid and run it again in the debugger. The -memory paranoid option checks the heap and detects buffer overruns and underruns, double deletes, deletes of garbage pointers, and other common mistakes that trash the heap.
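
To give a feel for how a checking allocator can catch an overrun, here is a minimal sketch in plain C++. It is not the Charm++ -memory paranoid implementation; the names guardedAlloc, guardsIntact, guardedFree, kGuardBytes and kGuardFill are hypothetical and exist only for this illustration. The idea is to surround each block with a known byte pattern and test later whether the pattern is still intact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Illustration only: hypothetical names, not Charm++ code.
constexpr std::size_t kGuardBytes = 8;        // size of each guard region
constexpr unsigned char kGuardFill = 0xAB;    // pattern written into guards

// Allocate `size` usable bytes with a guard region before and after them.
// Returns a pointer to the usable region, or nullptr on failure.
unsigned char* guardedAlloc(std::size_t size) {
  unsigned char* raw =
      static_cast<unsigned char*>(std::malloc(size + 2 * kGuardBytes));
  if (raw == nullptr) return nullptr;
  std::memset(raw, kGuardFill, kGuardBytes);                       // front guard
  std::memset(raw + kGuardBytes + size, kGuardFill, kGuardBytes);  // back guard
  return raw + kGuardBytes;
}

// Return true if both guard regions still hold the fill pattern,
// i.e. nothing wrote just before or just after the usable region.
bool guardsIntact(const unsigned char* user, std::size_t size) {
  assert(user != nullptr);
  const unsigned char* front = user - kGuardBytes;
  const unsigned char* back = user + size;
  for (std::size_t i = 0; i < kGuardBytes; ++i) {
    if (front[i] != kGuardFill || back[i] != kGuardFill) return false;
  }
  return true;
}

// Release a block obtained from guardedAlloc.
void guardedFree(unsigned char* user) {
  if (user != nullptr) std::free(user - kGuardBytes);
}
```

Writing one byte past the end of a 16-byte block from guardedAlloc(16) lands in the back guard, so guardsIntact then returns false. This shows why such a check can point at the real culprit long before malloc or printf crashes.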

5.0.9 Everything works fine on one processor, but when I run on multiple processors it crashes!

It's very convenient to do your testing on one processor (i.e., with +p1); but there are several things that only happen on multiple processors.

A single processor has just one set of global variables, but when you run on multiple processors, each processor has its own copy. This means that on one processor you can set a global variable and it stays set ``everywhere'' (i.e., right here!), while on two processors the global variable never gets initialized on the other processor. If you must use globals, either set them on every processor or make them into readonly globals.

A single processor has just one address space, so you actually can pass pointers around between chares. When running on multiple processors, the pointers dangle. This can cause incredibly weird behavior - reading from uninitialized data, corrupting the heap, etc. The solution is to never, ever send pointers in messages - you need to send the data the pointer points to, not the pointer.
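
The "send the data, not the pointer" rule can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: packDoubles and unpackDoubles are hypothetical helpers, not part of the Charm++ API, and a real Charm++ program would normally let the runtime's own marshalling do this copying. The sketch assumes sender and receiver share the same size_t and double representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustration only: hypothetical helpers, not Charm++ API.

// Copy the values themselves (count first, then the doubles) into a flat
// byte buffer. The buffer contains no addresses, so it stays meaningful
// in another address space.
std::vector<unsigned char> packDoubles(const std::vector<double>& values) {
  const std::size_t count = values.size();
  std::vector<unsigned char> bytes(sizeof(count) + count * sizeof(double));
  std::memcpy(bytes.data(), &count, sizeof(count));
  if (count > 0) {
    std::memcpy(bytes.data() + sizeof(count), values.data(),
                count * sizeof(double));
  }
  return bytes;
}

// Rebuild an independent vector from the flat buffer on the receiving side.
std::vector<double> unpackDoubles(const std::vector<unsigned char>& bytes) {
  assert(bytes.size() >= sizeof(std::size_t));
  std::size_t count = 0;
  std::memcpy(&count, bytes.data(), sizeof(count));
  assert(bytes.size() == sizeof(count) + count * sizeof(double));
  std::vector<double> values(count);
  if (count > 0) {
    std::memcpy(values.data(), bytes.data() + sizeof(count),
                count * sizeof(double));
  }
  return values;
}
```

Because the receiver rebuilds its own copy from the bytes, the result stays valid even after the sender's original vector is destroyed, which is exactly what a raw pointer cannot guarantee across processors.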

5.0.10 I get the error: ``Group ID is zero- invalid!''. What does this mean?

The group it is referring to is the chare group. This error is often due to using an uninitialized proxy or handle, but it may also indicate severe corruption. Run with ++debug and check whether you just sent a message via an uninitialized proxy.

5.0.11 I get the error: Null-Method Called. Program may have Unregistered Module!! What does this mean?

You are trying to use code from a module that has not been properly initialized.

To fix this, add an ``extern module'' declaration to the .ci file for your mainmodule:

mainmodule whatever {
  extern module someModule;
  ...
}

5.0.12 When I run my program, it gives this error:

Charmrun: error on request socket-
Socket closed before recv.

This means that the node program died without informing charmrun about it, which typically means a segmentation fault while in the interrupt handler or other critical communications code. This indicates severe corruption in Charm++'s data structures, which is likely the result of a heap corruption bug in your program. Re-linking with -memory paranoid may clarify the true problem.

5.0.13 When I run my program, sometimes I get a Hangup, and sometimes Bus Error. What do these messages indicate?

Bus Error and Hangup both indicate that your program is terminating abnormally, i.e. with an uncaught signal (SIGSEGV or SIGBUS). You should definitely run the program under gdb, or use ++debug.

February 12, 2012