next up previous contents
Next: Conversion to AMPI Up: Migration Path: Adaptive MPI Previous: Overhead of Threaded Components   Contents

Obstacles to Migration

While it is clear that threads simplify coordination of message-driven parallel components, they present obstacles to important across-the-module services offered by our common language runtime system, Converse. One particularly important service is migration-based dynamic load balancing. Converse load balancing framework keeps track of computational loads and communication patterns of each component, and periodically migrates components from overloaded processors to underloaded ones. Threads impede migration because of references to thread stacks. Local variables in subroutines are stored on the thread stack, and other variables could contain references to the stack variables. Memory for thread stacks is allocated dynamically, and may occupy different virtual addresses when migrated to another processor. Therefore, references to thread stacks may become invalid when a thread migrates from one address space to another. This situation is illustrated in figure 5.3.

Figure 5.3: Threads impede migration, since references to stack variables may become invalid upon migration if stacks are dynamically allocated.
\includegraphics[width=4in]{figures/stackref}

{CodeOne}
void foo(void)
{
  int iarray[3];
  // initialize iarray
  bar(iarray);
}

void bar(int *iptr)
{
  ... // use *iptr
  migrate();
  ... // use *iptr
}

Figure 5.4: An example of references to stack variables
\begin{figure}\centering \fbox{\BUseVerbatim{CodeOne}} \end{figure}

The code running inside a thread (see figure 5.4) calls a function and passes a local integer array (iarray) as a parameter to that function. When the thread migrates while still inside a called function, the reference to the local array becomes invalid upon migration. In the absence of compiler support, which may detect such stack references, and change them appropriately when a thread migrates, migrating threads is difficult. For thread migration to work, one has to make sure that the address spaces occupied by the thread stack remains unchanged on any processor where a thread can migrate.

Our preliminary implementation of migratable threads was based on a stack-copy mechanism, where contents of the thread-stack were copied into and out of the process stack at every context-switch between two threads. Thus, all threads execute with the same stack and refer to valid addresses even after migration (assuming that the main process stack on all processors begins at the same virtual address which is true for most parallel machines when using the same compiler and operating environment.) This has two drawbacks. First, it is inefficient because of the copy overhead on every context-switch (figure 5.5), and second, one thread cannot access another thread's local variables, thus making it mandatory to copy it to some shared location. The context-switching time of stack-copying threads varies with the amount of stack space in use at the time of context switch. At context-switching time, the system determines the size of the buffer required to set aside the filled stack, allocates memory, copies the stack out to the buffer, and copies the buffered stack of another thread in. In order to ensure efficiency with this mechanism, it is recommended to keep the stack size as low as possible at the time of a context-switch.

Figure 5.5: Comparison of context-switching times of stack-copying and isomalloc-based migrating threads with non-migrating threads. This experiment was performed on NCSA Origin2000, with 250 MHz MIPS R10000 processor.
\includegraphics[width=4in]{figures/context}

Our current implementation of migratable threads uses a new scalable variant of the isomalloc functionality of $\textrm{PM}^2$ [3]. In this implementation, each thread's stack is allocated such that it spans the same reserved virtual addresses across all the processors. This is achieved by splitting the unused virtual address space among physical processors. Figure 5.6 shows a typical layout of the address space of a Unix process. The heap and the stack expand dynamically. However, by limiting both to an upper bound, one gets a large amount of unused virtual address space. The boundaries of the unused virtual address space are marked by making the boundary pages inaccessible to the process. When a thread is created, its stack is allocated from a portion of the virtual address space assigned to the creating processor. This ensures that no thread encroaches upon addresses spanned by others' stacks on any processor. Allocation and deallocation within the assigned portion of virtual address space is done using the mmap and munmap functionality of Unix. Thus, the size of the process page table is still determined by the amount of data actually in use. Since we use isomalloc for fixed size thread stacks only, we can eliminate several overheads associated with $PM^2$ implementation of isomalloc. This results in low thread-creation overheads. Context-switching overheads for isomalloc-based threads are as low as non-migrating threads, irrespective of the stack-size. However, it is still more efficient to keep the stack size down at the time of migration to reduce the thread migration overhead.

Figure 5.6: Address space layout of a typical Unix process.
\includegraphics[width=4in]{figures/addrspace}


next up previous contents
Next: Conversion to AMPI Up: Migration Path: Adaptive MPI Previous: Overhead of Threaded Components   Contents
Milind Bhandarkar 2002-06-12