Next: Application Parallel Up: Performance Previous: Stack Size

Minimal Context Switching

We have determined a lower bound on the number of instructions required to explicitly context switch between two user-level threads, where the switch is initiated by a subroutine call to a library's switch routine. Because the switch routine is a subroutine, only the callee-saved registers defined by the architecture's subroutine calling convention actually need to be saved and restored -- the compiler already saves and restores any live scratch registers around the call, just as it does at any other subroutine call.

This observation makes it possible to write extremely efficient user-level thread switch routines, particularly for today's popular x86 and x86-64 CPUs.[4] Figure 10 shows the minimum correct thread swap routines for 32-bit and 64-bit x86 CPUs. Note that on x86, the floating-point registers must be empty before making a subroutine call, so the compiler will already save and restore floating-point registers when needed.

Figure 10: Minimal user-level thread context switching routines for 32-bit (a) and 64-bit (b) x86 CPUs. AT&T/GNU assembly syntax shown.
[Image: fig/codefig]

The subroutines in Figure 10 can swap user-level threads in 16 ns (32-bit mode) and 18 ns (64-bit mode) on a 2.2 GHz Athlon64. Of course, a real thread library also requires a scheduling component to decide which thread to swap in when another thread suspends, but for many applications thread scheduling can be very simple -- for example, a circular linked list of runnable threads.
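
As an illustration of how little scheduling such a library needs, the following is a minimal sketch of a round-robin scheduler built on a circular linked list of runnable threads. It is not the code from this work: every name in it (uthread, ut_spawn, ut_yield, ut_exit, ut_run, the trace buffer, STACK_BYTES) is hypothetical, and it uses the portable ucontext calls as a stand-in for the switch routine, which is considerably heavier than the routines of Figure 10. Only the scheduling logic is the point here.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_BYTES (64 * 1024)    /* illustrative per-thread stack size */

/* One node of the circular run list (hypothetical layout). */
typedef struct uthread {
    ucontext_t ctx;                /* saved context (stand-in for a bare stack pointer) */
    struct uthread *next, *prev;   /* neighbours in the ring of runnable threads */
    void *stack;
    int id, steps;
} uthread;

static ucontext_t main_ctx;        /* context of the code that called ut_run */
static uthread *current;           /* running thread; NULL when the ring is empty */
static uthread *dead;              /* finished threads, freed later from the main stack */
static char trace[64];             /* records the order in which threads ran */
static size_t trace_len;

/* Suspend the running thread and resume its successor in the ring. */
static void ut_yield(void) {
    uthread *self = current;
    current = self->next;
    if (current != self)
        swapcontext(&self->ctx, &current->ctx);
}

/* Unlink the running thread from the ring and switch away for good. */
static void ut_exit(void) {
    uthread *self = current;
    uthread *next = (self->next == self) ? NULL : self->next;
    self->prev->next = self->next;
    self->next->prev = self->prev;
    self->next = dead;             /* cannot free the stack we are running on */
    dead = self;
    current = next;
    setcontext(next ? &next->ctx : &main_ctx);
    abort();                       /* reached only if setcontext fails */
}

/* Thread body: log our letter, yield, repeat `steps` times. */
static void body(void) {
    uthread *self = current;
    for (int i = 0; i < self->steps; i++) {
        if (trace_len < sizeof trace - 1)
            trace[trace_len++] = (char)('A' + self->id);
        ut_yield();
    }
    ut_exit();
}

/* Create a thread and append it at the tail of the ring. */
static uthread *ut_spawn(int id, int steps) {
    uthread *t = calloc(1, sizeof *t);
    assert(t != NULL);
    t->stack = malloc(STACK_BYTES);
    assert(t->stack != NULL);
    t->id = id;
    t->steps = steps;
    getcontext(&t->ctx);
    t->ctx.uc_stack.ss_sp = t->stack;
    t->ctx.uc_stack.ss_size = STACK_BYTES;
    t->ctx.uc_link = &main_ctx;    /* safety net should body ever return */
    makecontext(&t->ctx, body, 0);
    if (current == NULL) {
        t->next = t->prev = t;
        current = t;
    } else {                       /* insert just before the head, i.e. at the tail */
        t->next = current;
        t->prev = current->prev;
        current->prev->next = t;
        current->prev = t;
    }
    return t;
}

/* Run until every thread has exited; return the recorded run order. */
static const char *ut_run(void) {
    if (current != NULL)
        swapcontext(&main_ctx, &current->ctx);
    while (dead != NULL) {         /* now safe to release finished threads */
        uthread *t = dead;
        dead = t->next;
        free(t->stack);
        free(t);
    }
    trace[trace_len] = '\0';
    return trace;
}
```

Spawning three threads with ut_spawn(0, 2), ut_spawn(1, 3) and ut_spawn(2, 1) and then calling ut_run() visits them in ring order until each has finished. The scheduling decision in ut_yield is a single pointer load, so in a design like this the cost of a yield is dominated by the switch routine itself.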

Most user-level thread packages provide far worse performance than this for two reasons. First, real systems often include multiple layers of scheduling and prioritization, each of which adds function-call overhead. Second, many user-level thread implementations save and restore far more state than is necessary, either through fear or ignorance. In particular, popular implementations of swapcontext and setjmp/longjmp (often used to implement user-level threading) save and restore all registers, including scratch registers. Worse, they often include system calls to save and restore the signal mask, even though very few scientific applications manipulate signals at runtime. If a user-level thread context switch involves even one system call, most of the speed advantage of user-level threads is lost. This is because a system call involves saving application registers when entering kernel space and restoring application registers when leaving kernel space, so the kernel could just as quickly perform a process switch by simply saving the registers of one process and restoring the registers of a different process!


Gengbin Zheng 2006-03-18