The PM2 implementation of ``isomalloc'' [4] overcomes these disadvantages by allocating a globally unique address range for each thread stack. To avoid conflicts between threads, stack addresses must be unique across the entire parallel machine, which makes allocating stacks somewhat complicated. As illustrated in Figure 2, the PM2 approach is to divide up the entire unused virtual memory address space, or ``isomalloc region'', into per-processor slots which can subsequently be allocated in parallel. This isomalloc region must be agreed upon by all processors at startup -- normally the largest space available lies between the process stack and the heap. A processor can then grant any local thread a new globally reserved range of virtual addresses from within that processor's region of the shared address space. The threads can then be confident that they can migrate to any other processor, and their addresses will be free for use on that processor.
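The slot scheme can be sketched in a few lines. This is only an illustrative model, not PM2's actual code: the names `IsoRegion` and `SlotAllocator`, the equal-sized slots, and the simple bump allocation are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IsoRegion:
    """Hypothetical model of the isomalloc region agreed upon at startup."""
    start: int      # first address of the region
    size: int       # total bytes in the region
    num_procs: int  # number of processors sharing the region

    def slot_bounds(self, proc):
        """Return (lo, hi), the half-open address range owned by `proc`."""
        slot_size = self.size // self.num_procs
        lo = self.start + proc * slot_size
        return lo, lo + slot_size


class SlotAllocator:
    """Hands out address ranges from one processor's slot (assumed bump allocation)."""

    def __init__(self, region, proc):
        self.lo, self.hi = region.slot_bounds(proc)
        self.next_free = self.lo

    def reserve(self, nbytes):
        """Reserve `nbytes` of globally unique addresses for a local thread."""
        if self.next_free + nbytes > self.hi:
            raise MemoryError("processor slot exhausted")
        addr = self.next_free
        self.next_free += nbytes
        return addr
```

Because each processor allocates only from its own slot, no communication is needed at allocation time, and a range reserved on one processor cannot collide with a range reserved on any other.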
This approach can be seen as a sort of distributed shared memory system, in that each thread is using globally unique virtual memory addresses. However, because threads never directly share data, we can eliminate the possibility of DSM page faults by sending all a thread's data along with the thread at migration time.
Clearly, with $t$ threads per processor, $b$ bytes per thread, and $p$ processors, the isomalloc approach uses at least $t \cdot b \cdot p$ bytes of address space.
For 10 threads per processor, 1MB of data per thread, and 1000 processors,
this amounts to 10 gigabytes of address space! But luckily we can use the
virtual memory hardware to avoid using such a large amount of physical memory.
The system call mmap allows individual pages of program virtual address
space to be assigned to physical memory, so on each processor we assign
physical memory only to the addresses in use by local threads. Addresses used
by all remote threads are claimed only in principle, but never actually
allocated physical memory unless that remote thread migrates in.
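The gap between address space claimed and physical memory used can be checked with a short calculation. The function names below are hypothetical, and the model assumes every thread uses its full allotment; results are in the same unit as the per-thread size.

```python
def address_space_needed(threads_per_proc, size_per_thread, procs):
    """Address space claimed on EVERY processor: one range per thread in the machine."""
    return threads_per_proc * size_per_thread * procs


def physical_memory_needed(threads_per_proc, size_per_thread):
    """Physical memory backed on ONE processor: only its local threads."""
    return threads_per_proc * size_per_thread


# The example from the text: 10 threads per processor, 1 MB each, 1000 processors.
claimed_mb = address_space_needed(10, 1, 1000)  # in MB
backed_mb = physical_memory_needed(10, 1)       # in MB
```

Under these assumptions, each processor claims 10,000 MB (about 10 GB) of address space but backs only 10 MB of it with physical memory.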
The original PM2 required applications to be modified to call the special memory allocation routines isomalloc and isofree, which was a burden on developers and was not feasible when linking with a third-party library. In our runtime system, we extended this approach by overriding the system malloc/free routines to use isomalloc/isofree when they are called within a thread. Of course, malloc/free called from outside the threading context (e.g. by the communication layer of the runtime system) is still directed to the normal system version of malloc/free. This approach thus allows unmodified applications to use migratable thread memory for their heap data.
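The dispatch logic of the overridden allocator can be modeled roughly as follows. This is a toy simulation rather than the runtime's real interposition mechanism: the flag `in_migratable_thread`, the helper `run_as_migratable_thread`, and the stand-in allocators that merely return a tag are all assumptions for illustration.

```python
# Hypothetical flag set by the thread scheduler while a migratable thread runs.
in_migratable_thread = False


def iso_malloc(nbytes):
    """Stand-in for isomalloc: would allocate from the thread's isomalloc range."""
    return ("iso", nbytes)


def system_malloc(nbytes):
    """Stand-in for the normal system malloc."""
    return ("system", nbytes)


def malloc(nbytes):
    """Overridden malloc: route by calling context, so callers need no changes."""
    if in_migratable_thread:
        return iso_malloc(nbytes)
    return system_malloc(nbytes)


def run_as_migratable_thread(fn):
    """Run `fn` as if inside a migratable thread, restoring the flag afterwards."""
    global in_migratable_thread
    previous = in_migratable_thread
    in_migratable_thread = True
    try:
        return fn()
    finally:
        in_migratable_thread = previous
```

Application code calls `malloc` in both cases; only the context in which the call is made decides which allocator serves it.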
Since one thread's data is always allocated at the same addresses inside the isomalloc region, a thread can be migrated simply by copying all its data to the new processor -- pointers within and between the thread's stack and heap need not be modified. Because we only allocate physical memory to local threads, the physical memory usage on each processor is modest. Unlike stack-copy threads, no data needs to be moved when switching threads, and multiple threads can run simultaneously, which allows the straightforward exploitation of SMP machines.
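A small simulation illustrates why no pointer fix-up is needed. Memory is modeled here as a dictionary from address to contents; the `Processor` class and `migrate` function are hypothetical names, and a real implementation would copy pages over the network rather than move Python objects.

```python
class Processor:
    """Toy model of a processor: tracks which addresses have physical memory."""

    def __init__(self, name):
        self.name = name
        self.backed = {}  # address -> contents stored at that address


def migrate(thread_addrs, src, dst):
    """Move a thread's blocks from src to dst, keeping every address unchanged."""
    for addr in thread_addrs:
        if addr in dst.backed:
            # Cannot happen if addresses are globally unique.
            raise RuntimeError("address conflict at %#x" % addr)
        dst.backed[addr] = src.backed.pop(addr)


# A thread whose stack (at 0x50000000) holds a pointer into its heap (at 0x50100000).
src, dst = Processor("A"), Processor("B")
src.backed[0x50000000] = {"heap_ptr": 0x50100000}
src.backed[0x50100000] = b"payload"

migrate([0x50000000, 0x50100000], src, dst)

# The stored pointer is still valid on the destination without modification.
ptr = dst.backed[0x50000000]["heap_ptr"]
data = dst.backed[ptr]
```

After migration the source processor no longer backs those addresses, and the pointer stored in the stack still resolves to the same heap data on the destination.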
However, isomalloc stacks have the disadvantage that they consume virtual address space on each processor proportional to the total number of threads on all processors. 32-bit machines only have 4 GB of virtual address space, of which a substantial fraction is already occupied by the operating system, shared libraries, and other machinery. Even if the entire 32-bit address space were available for thread stacks, if each thread uses 1 megabyte, there would only be room for 4,096 threads. 64-bit machines, by contrast, normally have terabytes of virtual memory space available, and so never suffer from this problem.
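The 32-bit limit follows directly from dividing the address space by the per-thread allotment. The sketch below uses binary units and a hypothetical helper name; it optimistically assumes the whole address space is usable for threads, as in the text.

```python
GiB = 1 << 30
MiB = 1 << 20


def max_total_threads(address_space_bytes, bytes_per_thread):
    """Upper bound on threads in the whole machine under isomalloc."""
    return address_space_bytes // bytes_per_thread


total = max_total_threads(4 * GiB, 1 * MiB)  # 4096 threads machine-wide
per_proc_on_1000 = total // 1000             # 4 threads per processor
```

Since the bound applies to the machine as a whole, a 1000-processor run with 1 MB per thread would be limited to about four threads per processor, even in this best case.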
Unfortunately, machines continue to be constructed with high processor counts using the simplest, cheapest parts available, which today are 32-bit microprocessors. Large 32-bit x86 Linux clusters are commonplace, and the Blue Gene/L machine built by IBM for LLNL consists of 128K processors, each of which is a 32-bit PowerPC derivative. On this kind of machine, virtual address space usage quickly becomes a significant impediment to the use of isomalloc-based migratable threads. This motivated us to design a new approach that reduces the usage of virtual address space, which is described next.