Our parallel debugger GUI, written in Java, connects to the parallel program across the network using a protocol called Converse Client/Server (CCS), described in Section 4.2. To extract program state, it calls a debugging CCS handler, described in Section 4.3, which traverses and sends runtime and user objects using the PUP framework described next.
The PUP framework is a method to describe the in-memory layout of an object, and was originally designed to support object migration in CHARM++. To copy a complicated object from one processor to another, we must pack the object into a network message, ship the message to another processor, and finally unpack the message into an object on the other side. This is an extremely common operation, used in Java RMI serialization/deserialization, parameter marshalling and unmarshalling for CORBA communication, and even MPI derived datatypes. PUP stands for Pack/UnPack, and is a compact, efficient, and flexible method to perform this packing and unpacking of user objects for C++.
Because of the type safety and introspection capabilities of the Java language and virtual machine, Java can pack and unpack arbitrary objects automatically, without any further effort. CORBA requires the user to describe the format of each communicated object in a CORBA IDL file, which is preprocessed to generate pack and unpack code. MPI requires the user to build a ``derived datatype'' at runtime, using type construction library calls that list each field of each communicated object; because this is complicated, users often write explicit packing and unpacking code to ship complicated objects.
CHARM++ originally required users to write an explicit pack and unpack routine for each object, as well as a size routine to determine the outgoing message size before packing. The motivation for PUP is that the code used to size the message, pack an object into a message, and unpack an object from a message must match up exactly--everything that is packed must be unpacked, and vice versa. Writing three interrelated routines for every object is tedious, error-prone, and contributes to the burden of parallel programming.
In the PUP framework, sizing, packing, and unpacking are all controlled by a single user-written subroutine called a pup routine. The pup routine simply calls a virtual method on each of the object's fields, which are then sized, packed, or unpacked as appropriate.
Consider a very simple C++ class with three fields:
class foo {
int A;
float B;
long C;
public:
...
};
We define an abstract class named ``PUP::er'' with one virtual method named ``bytes'', which takes the address of an object field and a description of the type of data in the field. A pup routine for foo would then just pass each of the foo object's fields into the PUP::er.
void foo::pup(PUP::er &p) {
p.bytes(&A,MPI_INT);
p.bytes(&B,MPI_FLOAT);
p.bytes(&C,MPI_LONG);
}
Because the PUP::er is given the address and data type of each of the object's fields, it can perform arbitrary manipulations of those fields, including copying data into the fields, copying data out of the fields, or even building an MPI derived datatype using the field offsets.
// Compute total size of object fields
class SIZING_PUP_er : public PUP::er
{
public:
int totalsize; // size of object
SIZING_PUP_er() {totalsize=0;}
virtual void
bytes(void *field,int datatype) {
totalsize+=size(datatype);
}
};
// Copy data out of object fields
... define destbuf as PUP::er field ...
void PACKING_PUP_er::
bytes(void *field,int datatype) {
memcpy(destbuf,field,size(datatype));
destbuf+=size(datatype);
}
// Copy data into object fields
... define srcbuf as PUP::er field ...
void UNPACKING_PUP_er::
bytes(void *field,int datatype) {
memcpy(field,srcbuf,size(datatype));
srcbuf+=size(datatype);
}
// Build an MPI derived datatype
void MPI_DATATYPE::
bytes(void *field,int datatype) {
// byte offset of this field from the start of the object
displacements[n]=(char *)field-(char *)objbase;
datatypes[n]=datatype;
n++;
}
It should be clear that the very simple technique of calling a virtual method for each field of an object is quite powerful. CHARM++ actually uses the first three PUP::ers above to size network messages and copy data into and out of objects as they are sent across the network. The overhead for using the very general pup method to do the copy is exactly one virtual function call per field, which on many machines is faster than the memory copy itself. Other PUP::ers, not shown here, can read and write objects to and from disk, or even convert binary data formats between different machine architectures.
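The fragments above leave out the buffer fields and the size lookup. As a rough illustration of how the three copying PUP::ers fit together, here is a minimal self-contained sketch. It is not the CHARM++ implementation: the namespace `sketch`, the class names `sizer`, `packer`, and `unpacker`, and the helper `roundTrip` are all hypothetical, and the field size in bytes is passed directly instead of an MPI datatype constant so that the example compiles without MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

namespace sketch {  // hypothetical names, not the CHARM++ API

// Abstract interface: exactly one virtual call per field.
class er {
public:
    virtual ~er() {}
    virtual void bytes(void *field, std::size_t n) = 0;
};

// Adds up the field sizes so the message buffer can be allocated.
class sizer : public er {
public:
    std::size_t total = 0;
    void bytes(void *, std::size_t n) override { total += n; }
};

// Copies each field out of the object into the message buffer.
class packer : public er {
    char *dest;
public:
    explicit packer(char *buf) : dest(buf) {}
    void bytes(void *field, std::size_t n) override {
        std::memcpy(dest, field, n);
        dest += n;
    }
};

// Copies each field from the message buffer back into the object.
class unpacker : public er {
    const char *src;
public:
    explicit unpacker(const char *buf) : src(buf) {}
    void bytes(void *field, std::size_t n) override {
        std::memcpy(field, src, n);
        src += n;
    }
};

struct foo {
    int A;
    float B;
    long C;
    // The single routine that drives sizing, packing, and unpacking.
    void pup(er &p) {
        p.bytes(&A, sizeof A);
        p.bytes(&B, sizeof B);
        p.bytes(&C, sizeof C);
    }
};

// Size, pack, and unpack into a fresh object, as a migration would.
inline foo roundTrip(foo in) {
    sizer s;
    in.pup(s);
    assert(s.total == sizeof(int) + sizeof(float) + sizeof(long));
    std::vector<char> msg(s.total);
    packer pk(msg.data());
    in.pup(pk);
    foo out = {0, 0.0f, 0L};
    unpacker up(msg.data());
    out.pup(up);
    return out;
}

}  // namespace sketch
```

Calling `sketch::roundTrip` on a `foo` should return an object with the same three field values, because the same `pup` routine fixes the field order for all three passes.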
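For convenience, the type-specific bytes calls can be wrapped in overloads of operator|, one per builtin datatype, so that a pup routine no longer has to name each field's datatype:
void operator|(PUP::er &p,int &x)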
{ p.bytes(&x,MPI_INT); }
void operator|(PUP::er &p,float &x)
{ p.bytes(&x,MPI_FLOAT); }
... and so on for other datatypes ...
void foo::pup(PUP::er &p) {
p|A; // calls p.bytes
p|B;
p|C;
}
Users can treat operator| as a builtin operator, analogous to the << and >> C++ iostream operators. This operator overloading also provides a surprising benefit: we can now use the same syntax to pup user-defined classes that we use for builtin types like ``int''.
void operator|(PUP::er &p,foo &x)
{ x.pup(p); }
class bar {
int I;
foo F;
...
};
void bar::pup(PUP::er &p) {
p|I; // calls p.bytes
p|F; // calls foo::pup
}
Notice how C++'s operator overloading selects the appropriate
way to pup the two fields I and F, even though the operator
call looks identical.
Operator overloading can also be applied to pup templated classes, with the template type determined by C++ type resolution. For example, we can easily define a pup operator for the standard C++ class std::vector. The elements of the vector are pup'd using their own pup operator, so they can be of any type.
template<class T>
void operator|(PUP::er &p,
std::vector<T> &v)
{
int length=v.size();
p|length;
v.resize(length);
for (int i=0;i<length;i++)
p|v[i];
}
This std::vector pup operator shows some of the strange beauty of using a single routine for both packing and unpacking. While packing, the length of v is known, the ``p|length'' call stores the length, and the ``resize'' call specifies the current size and hence does nothing. While unpacking, the length is initially zero, ``p|length'' extracts the true length, and the ``resize'' call actually allocates space in the vector for the new elements.
Because operator overloading follows the type system, we can now pup an array of ints, std::vector<int>, or a 2D array of foo objects, std::vector<std::vector<foo> >, using the same ``p|x'' syntax used to pup plain ints. CHARM++ includes builtin pup operators for std::vector, std::list, std::string, std::map, and std::multimap, templated over any object with a pup operator. PUP thus uses C++'s sophisticated type and template overloading system to approach the true type introspection ability of Java.
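To make the pack/unpack symmetry of the std::vector operator concrete, the following sketch round-trips a nested vector through a byte buffer. It is an illustration under stated assumptions rather than CHARM++ code: the namespace `sketch`, the three PUP::er classes, and the `roundTrip` helper are hypothetical, and field sizes are given in bytes instead of MPI datatype constants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

namespace sketch {  // hypothetical names, not the CHARM++ API

class er {
public:
    virtual ~er() {}
    virtual void bytes(void *field, std::size_t n) = 0;
};

class sizer : public er {
public:
    std::size_t total = 0;
    void bytes(void *, std::size_t n) override { total += n; }
};

class packer : public er {
    char *dest;
public:
    explicit packer(char *buf) : dest(buf) {}
    void bytes(void *field, std::size_t n) override {
        std::memcpy(dest, field, n);
        dest += n;
    }
};

class unpacker : public er {
    const char *src;
    std::size_t left;  // bytes not yet consumed
public:
    unpacker(const char *buf, std::size_t n) : src(buf), left(n) {}
    void bytes(void *field, std::size_t n) override {
        assert(n <= left);  // never read past the end of the message
        std::memcpy(field, src, n);
        src += n;
        left -= n;
    }
};

inline void operator|(er &p, int &x) { p.bytes(&x, sizeof x); }
inline void operator|(er &p, float &x) { p.bytes(&x, sizeof x); }

// Same routine for sizing, packing, and unpacking a vector.
template <class T>
void operator|(er &p, std::vector<T> &v) {
    int length = static_cast<int>(v.size());
    p | length;        // stores the length, or overwrites it when unpacking
    v.resize(length);  // no-op when packing, allocates when unpacking
    for (int i = 0; i < length; i++) p | v[i];
}

// Size, pack, and unpack any type that has a pup operator.
template <class T>
T roundTrip(T in) {
    sizer s;
    s | in;
    std::vector<char> msg(s.total);
    packer pk(msg.data());
    pk | in;
    T out = T();
    unpacker up(msg.data(), msg.size());
    up | out;
    return out;
}

}  // namespace sketch
```

Because the element type is resolved by overloading, `sketch::roundTrip` accepts a `std::vector<int>` and a `std::vector<std::vector<int> >` alike, with the inner vectors handled by a recursive instantiation of the same template.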
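Field names can also be passed to the PUP::er. A small macro stringizes each field's name and hands it to the PUP::er just before the field itself is pup'd:
#define PUP(field) do { \
p.fieldName(#field); \
p|field; \
} while (0)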
void foo::pup(PUP::er &p) {
PUP(A); // calls p.fieldName("A")
PUP(B); // then p.bytes
PUP(C);
}
Most PUP::ers ignore the field names, but CHARM++ has several PUP::ers that use the field names to read and write objects from keyword/value ASCII files. Finally, a debugging PUP::er can send the annotated object data off to the parallel debugger for display. For parallel objects, our debugging support by default calls the same pup routine as is used for migration, but also provides a special ``ckDebugPup'' pup routine that can be used to make debugging-specific data available via pup.
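As an illustration of how a PUP::er might use the field names, the sketch below writes an object as keyword/value text. The details are assumptions made for the example: the `textWriter` class, the `datatype` enum (standing in for the MPI datatype constants), the `SKETCH_PUP` macro, the `describe` helper, and the "name = value" line format are hypothetical and do not describe the CHARM++ file format.

```cpp
#include <cassert>
#include <sstream>
#include <string>

namespace sketch {  // hypothetical names, not the CHARM++ API

enum datatype { T_INT, T_FLOAT, T_LONG };

class er {
public:
    virtual ~er() {}
    // Default does nothing: most PUP::ers ignore field names.
    virtual void fieldName(const char *) {}
    virtual void bytes(void *field, datatype t) = 0;
};

// Writes one "name = value" line per field.
class textWriter : public er {
    std::ostringstream out;
    std::string pending;  // name announced for the next field
public:
    void fieldName(const char *name) override { pending = name; }
    void bytes(void *field, datatype t) override {
        assert(!pending.empty());  // every field must be named first
        out << pending << " = ";
        switch (t) {
        case T_INT:   out << *static_cast<int *>(field);   break;
        case T_FLOAT: out << *static_cast<float *>(field); break;
        case T_LONG:  out << *static_cast<long *>(field);  break;
        }
        out << '\n';
        pending.clear();
    }
    std::string str() const { return out.str(); }
};

inline void operator|(er &p, int &x)   { p.bytes(&x, T_INT); }
inline void operator|(er &p, float &x) { p.bytes(&x, T_FLOAT); }
inline void operator|(er &p, long &x)  { p.bytes(&x, T_LONG); }

// Announce the field's name, then pup the field itself.
#define SKETCH_PUP(field) do { p.fieldName(#field); p | field; } while (0)

struct foo {
    int A;
    float B;
    long C;
    void pup(er &p) {
        SKETCH_PUP(A);
        SKETCH_PUP(B);
        SKETCH_PUP(C);
    }
};

// Render an object as keyword/value text using its ordinary pup routine.
inline std::string describe(foo f) {
    textWriter w;
    f.pup(w);
    return w.str();
}

}  // namespace sketch
```

The same `pup` routine would still work unchanged with a packing or sizing PUP::er, since those would simply inherit the empty `fieldName`.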
Other PUP::er features allow for dynamically allocated data (which must be allocated during the unpack phase), the ability to easily pup a pointer-to-subclass, and pup routines written in C or Fortran. See the CHARM++ manual[13] for details.
Because the support for PUP is built into the runtime system and, for networking, always built into the application, there is no need to compile the application with `-g' (unless also using a sequential debugger). This means our parallel debugger can be used to examine the internal state of the optimized, production version of an application.
The Converse Client-Server (CCS) network interface [7] enables Converse (and hence CHARM++) programs to act as parallel servers, responding to requests from the network. The server side of this interface is built into every CHARM++ program, and the client side is provided as a library for C and Java.
A CCS client, in this case the parallel debugger, connects to the server via a TCP connection and sends it a request, which consists of a string handler name and a block of binary request data. The CHARM++ runtime uses the handler name to look up and call the appropriate handler function from an extensible table. For example, when the parallel debugger sends the request name ``ccs_set_break_point'', the runtime executes a handler that installs a breakpoint. After the server has processed the request, it responds with a block of binary response data. This simple request/response protocol allows information to be injected into and extracted from a running parallel program.
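The name-based dispatch described above can be pictured as a table from handler names to functions. The sketch below models only that lookup step and is not the CCS API: the `server` class, its `registerHandler` and `dispatch` methods, and the use of `std::string` for the binary request and response blocks are hypothetical simplifications, and no networking is involved.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

namespace sketch {  // hypothetical names, not the CCS API

// A handler maps a block of request bytes to a block of response bytes.
typedef std::function<std::string(const std::string &)> handler;

class server {
    std::map<std::string, handler> table;  // extensible handler table
public:
    void registerHandler(const std::string &name, handler h) {
        assert(h != nullptr);  // an empty handler could never be called
        table[name] = h;
    }

    // Look up the handler by name and run it on the request data.
    // Returns false, leaving response untouched, if the name is unknown.
    bool dispatch(const std::string &name, const std::string &request,
                  std::string &response) const {
        std::map<std::string, handler>::const_iterator it = table.find(name);
        if (it == table.end()) return false;
        response = it->second(request);
        return true;
    }
};

}  // namespace sketch
```

In this model, a runtime component would register a function under a name such as "ccs_set_break_point", and each incoming request would be routed with a single `dispatch` call.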
Because the client opens the TCP connection for a CCS request, CCS can be used by clients behind firewalls or NAT routers. When CCS is running over the unsecured internet, it can be run in a secure authentication mode[14], which uses a SHA-1 hash of the request, a nonce, and a shared secret key for authentication. Authentication prevents arbitrary users from injecting messages, but because of export regulations we do not provide network encryption. If secrecy is also important, users can add their own encryption.
The CHARM++ runtime provides a special CCS handler to extract formatted information about the entities in the parallel program. The CCS handler allows lists of objects to be registered, and provides a way to call the objects' pup routines and extract formatted information about the object structure. Various parts of the runtime system register the different classes of objects, including application parallel objects and network messages, with this single CCS handler. This allows the debugger to access these different objects in a uniform manner. CHARM++ applications or libraries can also register more detailed information, which can then be presented by the debugger.
Because this method uses the PUP framework, which CHARM++ applications already support for migration, zero additional code must be written to use an application in the debugger. This is both easier to use and more powerful than our previous debugger[15], which required a special ``debugging display routine'' in each object and even then could only display flat ASCII text.
January 23, 2004