We measured the context-switching performance of four different implementations of flows of control.
- Processes, created using fork() and yielding using sched_yield().
This is an imperfect benchmark, because some operating systems
seem to ignore sched_yield() when called repeatedly, resulting in
an artificially low measurement of the context switching time.
- Pthreads, created using pthread_create() and yielding using sched_yield().
- Cth (Converse Threads) [23], our implementation of
user-level threads, created using CthCreate() and scheduled using CthYield().
We used the non-migratable version of these threads.
- AMPI (Adaptive MPI) [16,15], user-level threads created
by the AMPI runtime and scheduled using the AMPI routine MPI_Yield().
These are migratable threads, implemented using the isomalloc stack allocation
approach based on the Cth threads, although no migrations actually occur.
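The pthread variant of this measurement can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark code used for the figures: the names `flow_body`, `now_ns`, and `pthread_yield_cost_ns` are hypothetical, and the result is simply the wall-clock time divided by the total number of `sched_yield()` calls, which only approximates the time per flow per context switch (on an SMP node, or when the OS ignores repeated yields, the number will come out too low).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* All names here are hypothetical; this is not the original benchmark code. */
static atomic_int go;         /* set to 1 to release every flow at once */
static long yields_per_flow;  /* written before go is set, read only after */

/* Body of one flow of control: wait for the start signal, then yield. */
static void *flow_body(void *arg)
{
    (void)arg;
    while (!atomic_load(&go))
        sched_yield();
    for (long i = 0; i < yields_per_flow; i++)
        sched_yield();
    return NULL;
}

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Returns elapsed nanoseconds per flow per yield, or -1.0 on failure. */
double pthread_yield_cost_ns(int nflows, long nyields)
{
    assert(nflows > 0 && nyields > 0);
    pthread_t *tid = malloc(sizeof *tid * (size_t)nflows);
    if (tid == NULL)
        return -1.0;

    atomic_store(&go, 0);
    yields_per_flow = nyields;

    int created = 0;
    while (created < nflows &&
           pthread_create(&tid[created], NULL, flow_body, NULL) == 0)
        created++;
    if (created < nflows)
        yields_per_flow = 0;  /* hit a thread limit: let the flows exit at once */

    double start = now_ns();
    atomic_store(&go, 1);
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    double elapsed = now_ns() - start;

    free(tid);
    if (created < nflows)
        return -1.0;
    return elapsed / ((double)nflows * (double)nyields);
}
```

Calling `pthread_yield_cost_ns(n, 10000)` from a `main()` for increasing `n` would produce one curve of time against number of flows; the process variant would replace `pthread_create()` with `fork()` and the join with `waitpid()`.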
We ran our experiments on a variety of machines. We report context switch
times as the time per flow of control per context switch.
- Linux, on a typical x86 laptop, with a 1.6 GHz Pentium M running Linux 2.4.25/glibc 2.3.3 (Red Hat 9). Context switch time is shown in Figure 4.
- Mac OS X, on the Turing cluster at the University of Illinois, where each node has 2 GHz
G5 processors and 4 GB of RAM. Context switch time is shown in Figure 5.
- Sun Solaris, with a 700 MHz SunBlade 1000 workstation running Solaris 9. Context switch time is shown in Figure 6.
- IBM SP, on the production machine cu.ncsa.uiuc.edu, with one 1.3 GHz Power4 "Regatta" node running AIX 5.1. Context switch time is shown in Figure 7.
- HP/Compaq Alpha, on the production machine lemieux.psc.edu, with one 1 GHz ES45 AlphaServer node running Tru64 Unix. Context switch time is shown in Figure 8.
Our experiments show wide variation in the limitations and
performance of these methods across machines.
In general, the user-level threads (Cth) have the fastest context switch
time on most of these machines, the exceptions being the IBM SP and Alpha.
On all machines except the IBM SP, the context switch time of the user-level
threads tends to increase slowly as the number of flows increases.
Table 2: Approximate practical limitations (on stock systems) for various methods to implement flow of control.
| Flow of control | Limiting Factor | Linux | Sun | IBM SP | Alpha | Mac OS | IA-64 |
| Process | ulimit/kernel | 8000 | 25000 | 100 | 1000 | 500 | 50000+ |
| Kernel Threads | kernel | 250 | 3000 | 2000 | 90000+ | 7000 | 30000+ |
| User-level Threads | memory | 90000+ | 90000+ | 15000 | 90000+ | 90000+ | 50000+ |
Figure 4: Context switching time vs. number of flows on an x86 Linux machine.
Figure 5: Context switching time vs. number of flows on an Apple G5 (Mac OS X) machine.
Figure 6: Context switching time vs. number of flows on a Sun Solaris machine.
Figure 7: Context switching time vs. number of flows on an IBM SP machine. This is a 16-way SMP node. We believe the low times for processes and threads are due to the OS ignoring our repeated sched_yield() calls.
Figure 8: Context switching time vs. number of flows on an Alpha machine. This is a 4-way SMP node. Again, process and thread switching numbers may be unrealistically low.
Table 2 lists approximate practical limitations on
stock systems. It shows the approximate maximum number of processes a user
can create on a processor and the maximum number of threads a user can
create in a process.
As we can see, an unmodified Red Hat 9 Linux machine can spawn fewer than 256
pthreads in one process, while the per-user process limit on our IBM SP was
only 100 processes. Both of these limits
can be raised with only a small amount of system administrator effort, but
this effort is likely beyond the reach of a typical parallel user.
In general, processes and kernel threads were limited to a few thousand;
only the Alpha allowed more than 5000 threads at a time,
and IA-64 showed no such limitation.
By contrast, we could create tens of thousands of user-level threads easily
on all platforms.
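A kernel-thread limit of the kind reported in Table 2 could be probed along the following lines. This is a hedged sketch, not the procedure used to fill in the table: `park` and `count_creatable_threads` are hypothetical names, and the `cap` argument is an assumption added so that the probe stops at a chosen ceiling instead of exhausting the machine.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names; this is an illustration, not the original test code. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

/* Each thread parks on the gate so that it stays alive during the count. */
static void *park(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);   /* blocks until the prober opens the gate */
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Creates threads until pthread_create() fails or cap is reached.
 * Returns how many threads were alive at once, or -1 if out of memory. */
int count_creatable_threads(int cap)
{
    if (cap <= 0)
        return 0;
    pthread_t *tid = malloc(sizeof *tid * (size_t)cap);
    if (tid == NULL)
        return -1;

    int created = 0;
    pthread_mutex_lock(&gate);   /* hold the gate while counting */
    while (created < cap &&
           pthread_create(&tid[created], NULL, park, NULL) == 0)
        created++;
    pthread_mutex_unlock(&gate); /* release every parked thread */

    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    free(tid);
    return created;
}
```

A return value below `cap` indicates that the system refused further threads at that count; a value equal to `cap` corresponds to the "+" entries in the table, where no limit was reached within the range tried.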
Gengbin Zheng
2006-03-18