We presented a study of four flow-of-control mechanisms that are widely used in parallel programming: processes, kernel threads, user-level threads, and event-driven objects.
Through experiments, we illustrated the practical limitations of these mechanisms on a variety of platforms. We further analyzed their performance while varying parameters such as the number of threads and the amount of memory used by each thread. We demonstrated that context-switching overhead varies widely across these flow-of-control mechanisms. In general, however, user-level threads provide both flexible implementations and scalable performance. This makes user-level threads an attractive approach for programming parallel applications with a large number of flows of control.
We also examined approaches to supporting thread migration, which is particularly useful for load balancing large-scale parallel applications. We described several methods for implementing migratable threads and have implemented three of them (stack copying, isomalloc, and memory aliasing stacks) in the Charm++/Adaptive MPI runtime system [16]. We have also shown that these techniques can be used on a wide variety of platforms by a variety of real parallel applications.