There are many ways to debug programs written in Charm++:
Currently charmdebug is tested to work only under net- versions; testing with other versions is pending. The executable is located at java/bin/charmdebug in the Charm++ distribution. To start it, substitute ``charmdebug'' for ``charmrun'' on your usual command line.
Yes, on mpi- versions of Charm++. In this case, the program is a regular MPI application, and as such any tool available for MPI programs can be used. Note that some of the internal data structures (such as queued messages) might be difficult to find.
It depends on the machine. On the net- versions of Charm++, such as net-linux, you can just run the program under a serial debugger such as gdb.
If the problem only shows up in parallel, and you're running on an X terminal, you can use the ++debug or ++debug-no-pause options of charmrun to get a separate debugger window for each process.
First, make sure the program at least starts to run properly without ++debug (i.e., charmrun is working and there are no problems in the program's startup phase). You need to make sure that gdb or dbx, and xterm, are installed on all the machines you are using (that is, the machines running the node programs, not the one running charmrun).
If you are working on remote machines from Linux, you need to run ``xhost +'' locally to give the remote machines permission to display an xterm on your desktop. If you are working from a Windows machine, you need an X server application such as Exceed, set up to give the right permissions for X windows.
You also need to make sure the DISPLAY environment variable on the remote machine is set correctly to point to your local machine. I recommend ssh and PuTTY, because they take care of the DISPLAY environment automatically, and you can set up ssh to use tunnels so that it even works from a private subnet (e.g., 192.168.0.8).
Since the xterm is displayed from the node machines, you have to make sure they have the correct DISPLAY set as well. Again, setting up ssh in the nodelist file to spawn node programs should take care of that. If you are using rsh, you need to set DISPLAY in ~/.charmrunrc, which is read at startup by each node program.
Printouts from different processors do not normally stay ordered. Consider code that prints ``cause'' with CkPrintf and then triggers other code that prints ``effect''.
Though you might expect this code to always print ``cause, effect'', you may get ``effect, cause''. This can only happen when the cause and effect execute on different processors, so cause's output is delayed.
If you pass the extra command-line parameter +syncprint, then CkPrintf actually blocks until the output is queued, so your printouts should at least happen in causal order. Note that this does dramatically slow down output.
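The reordering described above can be illustrated with a toy model in plain C++. This is only a sketch: it is not Charm++ code and makes no claim about how CkPrintf is implemented. The names Console, Processor, print, syncPrint, and deliver are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: this is NOT Charm++ code and does not reflect how
// CkPrintf is implemented. All names here are hypothetical.

// The terminal that finally shows the output.
struct Console {
    std::vector<std::string> lines;
};

// One processor with its own outgoing print buffer.
struct Processor {
    std::vector<std::string> pending;

    // Default behaviour in this model: the text is buffered locally and
    // reaches the console only when deliver() is called later.
    void print(const std::string& text) { pending.push_back(text); }

    // Model of +syncprint: the call does not return until the text has
    // been queued at the console.
    void syncPrint(Console& console, const std::string& text) {
        console.lines.push_back(text);
    }

    // The buffered output finally arrives at the console.
    void deliver(Console& console) {
        for (const std::string& text : pending) console.lines.push_back(text);
        pending.clear();
    }
};

// Processor 0 prints "cause" and then triggers processor 1, which prints
// "effect". Processor 0's output is delivered last, i.e. it is delayed.
std::vector<std::string> runBuffered() {
    Console console;
    Processor pe0, pe1;
    pe0.print("cause");    // buffered on processor 0
    pe1.print("effect");   // executed afterwards, on processor 1
    pe1.deliver(console);  // processor 1's output happens to arrive first
    pe0.deliver(console);
    return console.lines;
}

// Same sequence, but every print blocks until the console has the text.
std::vector<std::string> runSynchronous() {
    Console console;
    Processor pe0, pe1;
    pe0.syncPrint(console, "cause");
    pe1.syncPrint(console, "effect");
    return console.lines;
}
```

In this model runBuffered() yields the lines in the order effect, cause, while runSynchronous() keeps them in causal order, at the cost of making every print wait for the console.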
Charm++ automatically flushes the print buffers at every newline and at program exit. There is no way to manually flush the buffers at another point.
This isn't a bug in the C library; it's a bug in your program: you're corrupting the heap. Link your program again with -memory paranoid and run it again in the debugger. -memory paranoid will check the heap and detect buffer over- and under-run errors, double-deletes, delete-garbage, and other common mistakes that trash the heap.
It's very convenient to do your testing on one processor (i.e., with +p1), but there are several things that only happen on multiple processors.
A single processor has just one set of global variables, but with multiple processors each one has its own separate copy of every global variable. This means that on one processor, you can set a global variable and it stays set ``everywhere'' (i.e., right here!), while on two processors the global variable never gets initialized on the other processor. If you must use globals, either set them on every processor or make them into readonly globals.
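The difference can be sketched with a toy model in plain C++. This is an illustration only, not Charm++ code: ProcessorGlobals, gridSize, setOnOne, and setOnAll are hypothetical names, and a vector of structs merely stands in for the separate per-processor copies of a global variable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: each "processor" has its own copy of every global variable.
// This struct stands in for that per-processor copy.
struct ProcessorGlobals {
    int gridSize = 0;  // hypothetical global; 0 means "never initialized"
};

// Setting the global on one processor only: the others never see the value.
void setOnOne(std::vector<ProcessorGlobals>& pes, std::size_t pe, int value) {
    pes[pe].gridSize = value;
}

// Setting it on every processor, which is what the advice above
// ("set them on every processor") amounts to.
void setOnAll(std::vector<ProcessorGlobals>& pes, int value) {
    for (ProcessorGlobals& g : pes) g.gridSize = value;
}
```

With a single element in the vector, setOnOne appears to work everywhere, which is exactly why the bug stays hidden under +p1; with two elements, the second copy keeps its initial value until setOnAll is used.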
A single processor has just one address space, so you actually can pass pointers around between chares. When running on multiple processors, the pointers dangle. This can cause incredibly weird behavior, such as reading from uninitialized data or corrupting the heap. The solution is to never, ever send pointers in messages: send the data the pointer points to, not the pointer.
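As a minimal sketch of "send the data, not the pointer", the plain C++ below copies a payload into a flat byte buffer and rebuilds it on the receiving side. It does not use any Charm++ API; Samples, pack, and unpack are hypothetical names for this illustration, and in a real Charm++ program the message or parameter-marshalling machinery would play this role.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical payload used only for this illustration.
struct Samples {
    std::vector<double> values;
};

// WRONG on multiple processors: the receiver would get an address that is
// only meaningful in the sender's address space.
// struct BadMsg { double* values; std::size_t n; };

// Copy the data itself (element count, then elements) into a byte buffer
// that could be shipped to another address space.
std::vector<unsigned char> pack(const Samples& s) {
    const std::size_t n = s.values.size();
    std::vector<unsigned char> buf(sizeof n + n * sizeof(double));
    std::memcpy(buf.data(), &n, sizeof n);
    if (n > 0) {
        std::memcpy(buf.data() + sizeof n, s.values.data(), n * sizeof(double));
    }
    return buf;
}

// Rebuild an independent copy of the payload from the byte buffer.
Samples unpack(const std::vector<unsigned char>& buf) {
    assert(buf.size() >= sizeof(std::size_t));
    std::size_t n = 0;
    std::memcpy(&n, buf.data(), sizeof n);
    assert(buf.size() == sizeof n + n * sizeof(double));
    Samples s;
    s.values.resize(n);
    if (n > 0) {
        std::memcpy(s.values.data(), buf.data() + sizeof n, n * sizeof(double));
    }
    return s;
}
```

The receiver ends up with its own storage holding the same values, so nothing it touches depends on memory owned by the sender.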
The group it is referring to is the chare group. This error is often due to using an uninitialized proxy or handle, but it's possible this indicates severe corruption. Run with ++debug and check if you just sent a message via an uninitialized proxy.
You are trying to use code from a module that has not been properly initialized.
So, in the .ci file for your mainmodule, you should add an ``extern module'' declaration for the module you are using.
This means that the node program died without informing charmrun about it, which typically means a segmentation fault while in the interrupt handler or other critical communications code. This indicates severe corruption in Charm++'s data structures, which is likely the result of a heap corruption bug in your program. Re-linking with -memory paranoid may clarify the true problem.
Bus Error and Hangup are both indications that your program is terminating abnormally, i.e., with an uncaught signal (SIGSEGV or SIGBUS). You should definitely run the program with gdb, or use ++debug.
February 12, 2012