Threads that block waiting for messages transfer control to other runnable threads on the same processor (yielding the processor) via the Converse scheduler, causing a thread context-switch. Converse threads are user-level, which already have very low context-switch overhead compared to alternatives such as kernel-level threads. However, the percentage overhead introduced by the use of threads depends on the grainsize of a component. If the grainsize of a component (in this context, the average time spent by the component between successive coordination steps or blocking communication calls) is large, one can indeed tolerate the overhead introduced by threads. On today's processors, the typical context switching time for a user-level thread is less than half a microsecond. For a scientific application with near neighbor communication, each partition would typically have six message-receives, out of which, three will block. Thus, in each timestep, there will be on an average three context-switches for each partition. If the computation time for each partition is large, say a few milliseconds, the context switching overhead would be less than 0.1%, which can be tolerated especially since it gives us vital capabilities of dynamic load balancing. In addition, ``overdecomposing'' a message-passing component (i.e. having many more ``chunks'' of a parallel component than the number of available physical processors) often allows the component to compensate for the thread overhead since the threaded message-passing component is more latency tolerant. When a chunk waits for data, the runtime system can schedule other ready chunks for execution, thus effectively overlapping communication with computation. Multi-partitioning can also exhibit better cache utilization in application components that are not optimized for caches.
To evaluate the overhead introduced by threaded components with ``overdecomposition'', we carried out an experiment using a Finite Element Method application that does structural simulation on an FEM mesh with 300K elements. We ran this application on 8 processor Origin2000 (250 MHz MIPS R10000) with different number of partitions of the same mesh mapped to each processor. Results are presented in figure 5.1. It shows that increasing number of chunks is beneficial up to 16 chunks per physical processor. This increase in performance is caused by better cache behavior of smaller partitions, and overlap of computation and communication (latency tolerance). Though these numbers may vary depending on the application, we often see similar behavior for many applications that deal with large data sets and have near-neighbor communication.
As another test for evaluating the efficiency of the overdecomposition, we
studied a Conjugate Gradient Solver5.1. This component is a partial differential equation solver which
uses a sparse, matrix-free form of the conjugate gradient method to solve the
Poisson problem on a regular 2D grid. The mesh size for this problem is
1000
1000; equivalent to a million-row matrix. Each thread is
responsible for computing the solution on a rectangular region of the mesh.
Since the solution residual for a grid point depends on the solutions for its
nearest four neighbors, each processor maintains a one-element-thick ghost
region. In each step, messages are exchanged to fill these ghost regions, and
there are two short global reductions. Like many scientific codes, this
application is normally memory bandwidth bound. Figure 5.2
shows the time per step of the solver on a single physical processor, while
varying the number of virtual processors between 1 and 4096. Because AMPI's
virtual processors are implemented as user-level threads, there is very little
overhead in managing the threads. On our Pentium IV system, with a relatively
small cache but very fast RDRAM memory, simulating 100 virtual processors led
to only a slight (10%) slowdown. However, for the Athlon and Pentium III
Xenon, with their large caches and slower memory systems, simulating 100
virtual processors was actually slightly faster than using the single
physical processor normally. Thus the single-processor virtualization
efficiency is very high.