While it is clear that threads simplify coordination of message-driven parallel components, they present obstacles to important across-the-module services offered by our common language runtime system, Converse. One particularly important service is migration-based dynamic load balancing. Converse load balancing framework keeps track of computational loads and communication patterns of each component, and periodically migrates components from overloaded processors to underloaded ones. Threads impede migration because of references to thread stacks. Local variables in subroutines are stored on the thread stack, and other variables could contain references to the stack variables. Memory for thread stacks is allocated dynamically, and may occupy different virtual addresses when migrated to another processor. Therefore, references to thread stacks may become invalid when a thread migrates from one address space to another. This situation is illustrated in figure 5.3.
|
{CodeOne}
void foo(void)
{
int iarray[3];
// initialize iarray
bar(iarray);
}
void bar(int *iptr)
{
... // use *iptr
migrate();
... // use *iptr
}
The code running inside a thread (see figure 5.4) calls a function and passes a local integer array (iarray) as a parameter to that function. When the thread migrates while still inside a called function, the reference to the local array becomes invalid upon migration. In the absence of compiler support, which may detect such stack references, and change them appropriately when a thread migrates, migrating threads is difficult. For thread migration to work, one has to make sure that the address spaces occupied by the thread stack remains unchanged on any processor where a thread can migrate.
Our preliminary implementation of migratable threads was based on a stack-copy mechanism, where contents of the thread-stack were copied into and out of the process stack at every context-switch between two threads. Thus, all threads execute with the same stack and refer to valid addresses even after migration (assuming that the main process stack on all processors begins at the same virtual address which is true for most parallel machines when using the same compiler and operating environment.) This has two drawbacks. First, it is inefficient because of the copy overhead on every context-switch (figure 5.5), and second, one thread cannot access another thread's local variables, thus making it mandatory to copy it to some shared location. The context-switching time of stack-copying threads varies with the amount of stack space in use at the time of context switch. At context-switching time, the system determines the size of the buffer required to set aside the filled stack, allocates memory, copies the stack out to the buffer, and copies the buffered stack of another thread in. In order to ensure efficiency with this mechanism, it is recommended to keep the stack size as low as possible at the time of a context-switch.
|
Our current implementation of migratable threads uses a new scalable variant of
the isomalloc functionality of
[3]. In this
implementation, each thread's stack is allocated such that it spans the same
reserved virtual addresses across all the processors. This is achieved by
splitting the unused virtual address space among physical processors.
Figure 5.6 shows a typical layout of the address space of a
Unix process. The heap and the stack expand dynamically. However, by limiting
both to an upper bound, one gets a large amount of unused virtual address
space. The boundaries of the unused virtual address space are marked by making
the boundary pages inaccessible to the process. When a thread is created, its
stack is allocated from a portion of the virtual address space assigned to the
creating processor. This ensures that no thread encroaches upon addresses
spanned by others' stacks on any processor. Allocation and deallocation within
the assigned portion of virtual address space is done using the mmap and
munmap functionality of Unix. Thus, the size of the process page table
is still determined by the amount of data actually in use. Since we use
isomalloc for fixed size thread stacks only, we can eliminate several overheads
associated with
implementation of isomalloc. This results in low
thread-creation overheads. Context-switching overheads for isomalloc-based
threads are as low as non-migrating threads, irrespective of the stack-size.
However, it is still more efficient to keep the stack size down at the time of
migration to reduce the thread migration overhead.