ICS'02, June 22-26, 2002, New York, New York, USA. 2002 1-58113-483-5/02/0006
2
Orion Sky Lawlor
Dept. of Computer Science
University of Illinois at Urbana-Champaign
Laxmikant V. Kalé
Dept. of Computer Science
University of Illinois at Urbana-Champaign
I.6.mSimulation and ModelingMiscellaneous[Collision Detection]
Algorithms
When using CAD/CAM to design a machine, we must ensure the simulated machine parts never pass though one another. If the simulated parts collide, we must correct the design.
In computer graphics and animation, we often want to ensure that objects behave in a physically plausible way. This means checking if the simulated objects penetrate one another- if they do, the modeling tool or animator will want to know.
In motion planning for robotics, we must check if a robot arm will collide with anything as it executes a proposed command. The command will have to be modified if a collision is possible.
In simulating a car crash or tearing metal, at each timestep we must check if any objects intersect. If they do, we must deform or displace the objects.
|
Collision detection, also known as contact or interference detection, is the problem of determining whether a given set of objects overlap. If the objects overlap, in some situations we need to find exactly which parts of the objects overlap and the depth of the overlap region. The problem can be formulated in 2D, 3D, or higher dimensions; can vary with time; and, as shown above and in Figure 1, is relevant to many domains.
In this work, we present a scalable high-level parallel solution to a large subclass of collision detection problems. Our approach is to divide space into a sparse grid of regular axis-aligned voxels distributed across the parallel machine. Objects are then sent to all the voxels they intersect. Once all the objects have arrived, each voxel becomes a self-contained subproblem, which is then solved using standard serial collision detection approaches. This voxel-based approach efficiently and naturally separates many objects that cannot ever collide, by placing them in separate voxels. Simultaneously, voxels bring together adjacent objects that may intersect.
In theory, the basic voxel algorithm works for any object type or serial collision detection method. In practice, the method works best when there is not too much scale disparity- objects much larger than a voxel will be sent to several voxels, wasting space and time; while structures much smaller than a voxel must be dealt with entirely via the serial approach. Further, the serial collision detection method used must be able to work with only a subset of the problem's objects- the subset that lies within the voxel.
For problems which conform to these limitations, however, the serial and parallel performance of the voxel scheme is excellent.
Finally, although the two problems are distinct,
the common and literature usage of the term ``collision detection''
can refer to work in either
-body or pairwise detection.
The voxel method, whose presentation forms the bulk of this work,
addresses the
-body problem.
The basic problem of pairwise collision detection
is to determine whether two given objects collide.
Farouki et al [6]
show how to do pairwise intersections for quadrics.
Cameron [4] defines ``S-Bounds'' to perform
pairwise intersections for constructive solid geometry.
The famous papers by Lin and Canny [22], and
Gilbert, Johnson, and Keerthi [9], use convex
optimization techniques to intersect a pair of convex polytopes
in time logarithmic in the number of vertices.
This last result is sometimes
misinterpreted to mean the
-body collision detection problem
can be solved in logarithmic time, which is not the case.
In fact, any correct
-body collision detection algorithm has a
worst-case time in
. This is because it is
possible to arrange
objects (see Figure 2)
so that every object collides with every other object, for
separate collisions.
However, because physical objects cannot pass though one another,
nearly any plausible physical situation actually has at
most
collisions, as in Figure 3(a).
Many situations found in practice have as few as
collisions,
as shown in Figure 3(b).
Since most problems have only a few collisions, it is quite often
possible to do collision detection in faster than
time.
There are several ways to do this, each with their own advantages.
Hubbard [12] points out that most collision
detection research ignores the time dependence
of collision problems. Although several methods for fully
four dimensional ``spacetime'' collision detection
are known, and described in Jiménez's survey [13],
a standard approach is to sweep out or ignore time dependence and
treat each timestep or frame of the simulation as
a conceptually separate problem.
For rigid objects with bounded velocities and accelerations,
the collision scheduling method of [21] can be
effective.
Most other
-body approaches use some sort of
bounding volume. Bounding volumes, although shown quite
effective in practice, are subject to rather unlikely worst cases,
such as the nested `L' shapes shown in Figure 4.
Suri, Hubbard, and Hughes [24] show that if the
object aspect ratio and ``scale factor''
are bounded, then bounding boxes introduce no more than a
linear amount of overhead. Zhou and Suri [26] extend this
result to include bounded average aspect ratio and
scale factor. These theoretical results confirm the practical
effectiveness of bounding boxes.
|
There are two major approaches to
-body collision detection. In object subdivision,
objects are divided into parts that cannot intersect.
Hubbard [12] divided objects using bounding spheres;
Gottschalk et al. [10] examine
oriented bounding boxes; Klosowski et al. [17]
use discrete-orientation
polytopes (
-DOPs); and Krishnan et al. [18] use
a higher-order bounding volume. By contrast, with spatial subdivision,
space itself is divided up to separate objects. The voxel
method is the simplest spatial subdivision scheme, although
there are others.
Brown, Attaway, Plimpton, and Hendrickson [2]
use a parallel recursive coordinate bisection (RCB)
spatial subdivision scheme to perform collision detection
in a transient dynamics application. The parallel
implementation of RCB requires at least
synchronized communication steps, however, so their approach
has a rather high synchronization overhead. However, theirs is
scalable, production-level work, and should be
considered a proven alternative to the voxel method.
Several other spatial subdivision schemes have been
described, such as the sweep-line method, binary space partition
[7], octrees [1] and methods derived from ray
tracing. By reduction to sorting, however, any subdivision scheme based
on comparing object locations will require
time;
with parallel implementations that require
multiple synchronization steps.
The voxel method, described next, in some situations achieves
performance with excellent parallel efficiency.
The
-body, spatial subdivision approach we examine in the
remainder of this work is the voxel method.
A regular, axis-aligned grid of voxels is imposed over the
problem domain. Each object is added to all the voxels it touches,
then each voxel's list of objects is collided separately.
Any collision detection subscheme may be used to detect
collisions within a voxel,
from the simple all-pairs test to a recursive application
of the voxel method.
We are deliberately vague about exactly what an ``object to be collided'' means in this context, because the exact type of object is completely irrelevant. The voxel method is exactly the same and works almost equally well whether colliding boxes, triangles, tetrahedra, hexahedral finite elements, polyhedra, spheres, NURBS surface patches, or even complicated composite objects. Of course, any implementation has to pick some representation; two fairly popular, general choices are bounding boxes and triangles.
The voxel method was first described by Turk [25] for molecules, but his approach hashed the grid locations into buckets, ignoring hash collisions, and used the naive all-pairs algorithm on each bucket. Zyda et al. [27] describe the NPSNET system, which uses a parallel 2D dense grid structure quite similar to the voxel method, but Zyda's implementation was limited to simple vehicle/vehicle and vehicle/terrain interactions, and also used the naive all-pairs approach within each grid cell.
When the objects to be collided are of nearly uniform size, the voxel method is ideal and gives excellent results. For objects much smaller than a voxel, the division into voxels does not help much, and the voxel method degenerates to the serial scheme used within a voxel.
Objects much larger than a voxel will be duplicated across many voxels, as shown in Figure 5. In this case, a careful implementation can avoid duplicating collision tests across voxels by using a convention to decide which voxel is responsible for testing these shared objects. For example, only the voxel that contains the midpoint between the object centers need test those objects for collision--see the dots in Figure 5. Even with this optimization, however, objects much larger than a voxel still waste both space and time.
|
Voxels should be implemented as a sparse grid, via a hash table in most cases. A sparse implementation is not required, but can be a good deal more space and time efficient if the problem domain is not naturally bounded or if objects are distributed nonuniformly. A significant advantage of a sparse grid is the fact that empty voxels never get created, and hence cost nothing.
Although normally implemented in 3D, gridding works well in any number of dimensions, although high-dimensional problems may consume large quantities of memory. The common timestep ``sweeping'' approach for spacetime collision detection, illustrated in Figure 6, can be seen as a kind of gridding along the time axis of spacetime.
The voxel approach is easiest to implement via the sloppy strategy of adding each object to all the voxels touched by the object bounding box, which may be many more voxels than are touched by the actual object. This sloppiness can be tolerated because any irrelevant objects in a voxel will be ignored by the collision detection subscheme. For elongated objects, whose bounding box does not fit well, this can be quite wasteful; for long and skinny or flat objects, the 3D digital differential analyzer approach of [1] could improve performance.
However, since the voxel size is normally chosen so that most objects fit in a single voxel, gridding only the bounding box normally works well. The fast gridding method described in Appendix A is ideal for this case.
Unlike many other object- or space-division collision
detection schemes, the voxel method is naturally
parallel.
The description of a parallel implementation of the voxel
algorithm is given below.
A parallel application in CHARM++ consists of a number of relatively small, self-contained parallel objects. These objects communicate via remote method invocation, an RPC protocol similar to CORBA or Java RMI, but asynchronous. Since each processor houses several objects, while one object is waiting for data, other local objects can use the CPU. This data-driven approach thus automatically and transparently overlaps communication and computation, leading to better performance and easier application development. For unstructured, dynamic problems like collision detection, this approach is much more useful than the lower-level primitives provided by MPI.
Parallel objects are implemented as ordinary C++ classes. They communicate with other parallel objects using a local communication proxy object, another regular C++ class. Proxies are automatically generated from an interface file that lists each classes' remote methods.
In this sense, CHARM++ is quite similar to CORBA. However, CHARM++ targets tightly coupled parallel machines and aims for extremely efficient communication between thousands of in-process objects. The overhead imposed by CHARM++ is just a few microseconds, rather than the millisecond-scale cost of CORBA.
The CHARM++ array framework, presented by the author in [20], supports parallel objects called array elements that can be dynamically created and destroyed, participate in reductions and broadcasts, and migrate from one processor to another. Array elements are identified across the parallel machine by an array index, which need not be a contiguous range of integers- it can be a sparse user-defined data structure such as a multidimensional location, bitvector, or string.
Since array elements can migrate, CHARM++ can improve load balance by occasionally migrating some objects from heavily loaded processors to less loaded processors. Migration-based load balancing can automatically compensate for application-induced load imbalance, minimize communication volume, handle variation among machines, or accommodate background load. The load balancing system is described in [3].
CHARM++ is an excellent foundation for parallel program design, research and implementation. Several major, highly scalable parallel applications have been developed using CHARM++, including [15].
Triangles are an extremely popular lowest common denominator surface representation, and can be efficiently represented and processed. Details on the specific input format and some triangle-specific optimizations are presented by the author in another work [19].
Bounding boxes are more general than triangles in the sense that even a complex object can be collided using only its bounding box, at least at the broad phase. Bounding boxes can also be constructed to contain multiple objects, providing an additional degree of freedom for the user. Of course, bounding boxes may intersect when the actual objects do not, so the bounding box implementation is only the first step. A subsequent ``narrow phase'' is still needed to check the few possible colliding pairs of objects for any actual intersections.
The input to either implementation of the voxel algorithm is a set of objects--bounding boxes or triangles--presented on each processor. The output is a list of objects that collide; different pieces of this list are returned on every processor.
module parCollide {
message objList;
array [3D] Voxel {
entry Voxel();
entry [createhere] void add(objList *);
entry void startCollision(void);
};
group Manager : syncReductionMgr {
entry Manager(CkArrayID voxels);
entry void voxelMessageRecvd(void);
};
};
Voxel is a grid of voxels- a sparse three-dimensional CHARM++ parallel array of objects. This allows voxels to be created on any processor, receive messages from any processor, and migrate from processor to processor.
Voxel::add is called to add a list of objects to the voxel. Because add is declared createhere, if an add message is sent to a non-existent grid location, CHARM++ will create a new Voxel object on the sending processor to handle the request. objList, the ``message'' parameter to add, contains a list of objects to be collided by the voxel.
Voxel::startCollision runs the serial collision detection subscheme on all the voxel's accumulated objects. We use a spatial recursive coordinate bisection scheme; although any method can be used. An interesting approach might be to recursively apply the voxel method, especially if many objects lie in a single voxel.
Manager is a CHARM++ group, a special collection of parallel objects with exactly one object on every processor. The Manager coordinates and synchronizes the collision detection computation.
As usual in CHARM++, because this library's clients are almost always parallel objects themselves, there can be several clients per processor, or perhaps none at all on a particular processor.
In step 2, the local Manager actually accumulates the objects to be sent off, then finally sends all the objects for a voxel in a single message. Step 2 is an all-to-all communication step, so this simple ``accumulate and send'' communication optimization is just one of a variety of optimizations that can be applied. Such communication optimizations are the subject of ongoing research.
The message sends in step 2 actually create the voxels, if they haven't already been created. By default, the creation happens on the sending processor- this makes all future communication between this Manager and the voxel efficient. If the objects given on a single processor are clustered well, most will never be sent over the network.
The synchronization in step 3 is needed because voxels may receive objects from several processors. Since any processor can send an object to any voxel, ``Have all objects been delivered?'' is a global property, obtainable only via a global operation. Recall that in CHARM++, however, other local objects will automatically use the CPU during this single synchronization.
After the synchronization, each voxel runs its separate serial collision detection algorithm and reports the results.
Consider, then, step 2 of our parallel algorithm.
A list of objects destined for some Voxel with index
can
be given to CHARM++ on some processor. Somehow, the processor must
determine if that Voxel already exists
somewhere on the parallel machine. If so, the objects
need to be sent to the Voxel. If not, a new Voxel must
be created, ideally on the same processor.
Further, objects destined for the same, nonexistent Voxel may be given to two different processors at the same time. CHARM++ must resolve this race condition--exactly one of the processors must create a new Voxel; and the others' objects must be sent there.
It's easy to imagine using a centralized Voxel registry to map grid locations to processors, and to organize Voxel creation. However, a centralized registry is is not scalable and would quickly become a serial bottleneck. The natural solution, then, is to use a distributed registry.
This is the approach used by CHARM++.
To deliver a message to an unknown Voxel location,
it computes a ``home'' processor
for that location via a simple hash function. This home processor
will either inform the sender of the Voxel's location, or
authorize the sender to create a new local Voxel.
Home processors thus act as a distributed registry to keep
track of which Voxels exist, and where.
Processors cache the location of recently referenced
Voxels, which means the home processor need not be consulted
before every message send.
Home processors also provide an efficient, distributed means to support migrating array elements. The array framework also properly supports broadcasts to all existing Voxels, used in step 4 above; as well as reductions across all Voxels, used in step 5. The details of how to support these operations even with ongoing migrations are presented by the author in another work [20]. Of course, CHARM++'s migratable parallel objects are not a voxel method-specific construct--they are useful, and are used, in many other contexts.
to determine where each object
should be migrated for optimal performance. Performing load balancing
automatically at run-time enables an application to react quickly
to irregularity in the problem structure.
In fact, since collision detection is often only a small part of a much larger parallel program, perfect load balance for the collision library is actually not necessary. As long as the program is structured to allow the execution of different libraries to interleave (as is always the case in CHARM++), different processors may have different amounts of collision work yet all have the same amount of work overall.
Consider the work done by a parallel machine during an unsynchronized
period- that is, between two global synchronizations.
In symbols, let
and
represent the amount of work needed
by two libraries
and
on processor
. Assume further that
and
are independent, so they do not depend on one another--a
situation that CHARM++supports quite well. Then for load balance
we require that each processor have the same amount of work overall:
For the voxel method, the CHARM++ automatic load balancer is free to migrate both the clients, which do work before the synchronization; and the voxels, which do work afterwards. Hence the load balancer can balance the collision algorithm in the context of the larger program.
If, further, the intra-voxel serial scheme runs in linear time, then the voxel method is also linear time overall. If the serial scheme is not linear; but the number of objects in each voxel is bounded, the serial scheme's input size is constant and hence runs in constant time, so the voxel method is again linear time.
In the parallel implementation, after a global
synchronization step each voxel forms an
independent serial subproblem. Thus assuming enough
voxels that good load balance is possible
the time required is
. For large enough
, the first term dominates and the time required
is
.
To reiterate, for a variety of common cases, the
voxel method runs in
serial time, and
parallel time.
These performance results always use only one model, and hence the implementation is always able to skip the serial collision detection step. This nicely factors out the contribution of the serial collision detection scheme from the parallel scheme. The implementation, and the remainder of the computation, from object distribution to the final collection of results, is otherwise unmodified.
The 871,414 triangle
dragon model of Figure 7
presented in [5] took 3.132 seconds to collision detect
on a
single processor of an Origin 2000.
The remaining tests use a much simpler model--a tessellated plane perturbed by small harmonic waves. This simple model is easy to scale to any desired number of triangles, while retaining much of the character of an actual surface.
Figure 8 shows the wall-clock time
per complete collision detection for a range of different
numbers of triangles on a slow Linux PC.
The smallest figure shown is 1,024 triangles, which take
1.4 milliseconds or 1.4 microseconds per triangle.
The largest figure is 1,048,576 triangles,
which take 2.1 seconds or 2.1 microseconds per triangle.
This much larger dataset cannot fit entirely in cache
and is hence slightly slower.
This serial result is competitive with the figure of
2 microseconds per triangle
given for the serial
algorithm of Gottschalk et al. in [10].
Further, the observed performance of the voxel algorithm
is indeed linear in the number of triangles.
The smallest run shown, 65,536 triangles on a single processor, takes 0.44 seconds per step. The largest run shown, 65,536 triangles on each of 1,500 processors or 98.3 million triangles, takes 0.73 seconds per step. This is a scaled speedup of 915, or a parallel efficiency of 60 percent.
The observed parallel performance is indeed excellent. It is also comparable with the result of 40 milliseconds per timestep (approximately 20-fold faster) for 1875 objects per processor (approximately 35-fold fewer) for the parallel RCB scheme described in [2]. However, that work was part of a production code and included extra computation and communication associated with mechanics.
|
The parallel implementation also scales down for smaller models and fast response time, such as for interactive applications. 32 processors of a 195 MHz Origin2000 system can handle 300,000 triangles at the good interactive rate of 30 milliseconds per step.
We have demonstrated a parallel collision detection algorithm based on regular space subdivision. We have implemented the algorithm in CHARM++ and extensively analyzed its performance. The results are theoretically and practically quite competitive with published work.
A much simpler goal is automatic determination of the ideal voxel size. Voxels that are too small have too much overhead; voxels that are too large do not create enough parallelism and lump together too many objects. The tradeoff should be better analyzed and, if possible, automatic.
Voxels need not form a regular grid at all. There are several other regular and irregular ways to tile 3D space. A less angular grid cell, with a higher volume to surface area than a cube, would be less likely to bring together objects that will never collide. The implementation could also make much better use of temporal and spatial coherence, as suggested in [22] and [21].
The implementation currently requires one barrier operation per collision, which could become a bottleneck for small problems on large machines. Some sort of global operation is needed to ensure all objects have been delivered to the voxels; but a systolic approach, continually summing the current send and receive counts, might be more proactive.
We currently only determine which objects intersect. This is sufficient for some applications, such as motion planning, clearance checking, and many graphics applications. However, a more complete solution would go on to determine the entire set of penetrating points, the penetration distance, and the penetration direction for each point. Kawachi and Suzuki [16] use a ``discrete closest feature list'' to efficiently determine these quantities via an algorithm similar to the voxel method.
Finally, we are implementing a real mechanics application that uses this collision detection implementation. We hope to quantitatively demonstrate the benefits of our dynamic load balancing capabilities using this real application.
In certain situations, a novel but simple technique [11] can be used to dramatically speed up floating-point to integer conversions. In particular, given a double-precision source and integer destination array, we may use either the first or the second two C statements below:
/*Normal way:*/ dest[i] = (int)floor(src[i]); /*Bizarre but fast way:*/ float f = (float)(src[i] + convert); dest[i] = *(int32 *)&f;
The reason is that an IEEE single precision floating-point number's mantissa field overlaps the low bits of an integer, as shown in Figure 11. If the grid size is a power of two, we can convert a coordinate into a grid location by shifting and extracting out the appropriate bits of the coordinate.
Floating-point numbers are always stored in the ``normalized''
form
, so we can shift
the mantissa of a floating point number by simply
adding a constant. For example, given a
floating-point coordinate with binary representation
, after adding
, we get
By adding a constant, we have shifted
the grid location
into the low bits, and by rounding
to single precision we have removed the excess bits
that represent the sub-gridcell location.
This floating-point number can be compared,
incremented, and decremented as if an integer.
The IEEE exponent field corrupts the high bits of this
integer, but these can easily
be subtracted off. Since the origin of the grid coordinates
is arbitrary anyway, we leave the exponent.
The mantissa is only 23 bits, but this is still enough to
represent 8 million grid cells along each axis, or
cells total. Converting a double-precision
floating point number to a 64-bit integer in the same way
would yield a 52-bit grid index.
By adding a ``borrow protection'' bit to the constant, this method can accept both positive and negative input coordinates. By subtracting the coordinate origin, we can accomplish a shifting as well as scaling. We can also scale the constant to extract out a different set of bits, simulating any power-of-two grid size; this is less flexible but much more efficient than scaling the input values to change the grid size. Finally, we must compensate for IEEE rounding, which rounds to the nearest bitpattern rather than truncating. The ideal constant for this method thus has the form:
Thus given a power-of-two grid size and origin, the optimized C code to compute the conversion constant and then convert a floating point value src to an integer grid cell index dest is as follows.
/* gridsize must be a power of two*/
double convert = (1.5*(1<<23)-0.5)
*gridsize-origin;
float f = (float)(src + convert);
int dest = *(int32 *)&f;
Of course, the same method can be applied in any language that allows us to quickly interpret the bits of one data type as bits of another data type.
Table 1 shows the time per floating point to integer conversion for both the conventional and optimized approaches. The performance difference between the two is dramatic- a factor of 2 to 40, depending on the architecture. Because mapping one bounding box onto a grid requires six such conversions, the total savings by using the optimized method is a microsecond or two per object- a significant speedup.
This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.42)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -white -antialias -local_icons paper.tex
The translation was initiated by Orion Lawlor on 2002-05-09