Subnormal Floating-Point Degradation in Parallel Applications

This page discusses the interference caused by the quiet generation of subnormal (also called denormalized) floating-point values in parallel applications. It was created as a supplement to the OSIHPA 2005 paper entitled Performance Degradation in the Presence of Subnormal Floating-Point Values. A PDF is currently available from the OSIHPA webpage. It is generally believed that subnormal floating-point values are rare, and thus that handling them quickly in hardware is not essential. We argue that their occurrence is significant, and furthermore that the performance degradation is amplified in a parallel application, making the phenomenon much more significant than commonly believed. The existing cost models for these floating-point operations need to be revisited for parallel programs, since the cost depends on the number of processors used. The paper provides a short discussion of these topics. Below we list some performance tests on a wide range of modern processors.

Performance analysis on different processors:

We provide versions of our Gauss-Seidel Subnormal Micro-benchmark in C and Java. Both measure the slowdown that occurs when subnormal floating-point values are present in a computation. There is also a bash script that downloads the source code, compiles it, and runs the micro-benchmark with a variety of compilers and options. The results from running this script on a wide range of systems are available below. For more information, click on a processor link to view the entire output. To run it yourself, download the bash script into a new empty directory and run it (its filename extension is .txt so that your browser and this server don't get confused).

     mkdir newdir
     cd newdir
     wget http://charm.cs.uiuc.edu/subnormal/files/runtests.txt
     bash runtests.txt

Various compilers were used, including xlc, gcc, and icc. Although the results vary between processors, the slowdowns are not caused by a particularly bad compiler. Please review the result files to see how a particular compiler affected the results on any of the architectures. I do not have access to a large number of different compilers, so I used the ones available on the machines I commonly use. The important number in the results is the slow/fast ratio. If you have access to other types of architectures or compilers, please run the sample program and send us the results.

Processor                                       OS            Worst-Case Slowdown  More Runs  Notes
AMD K6                                          Linux         1.4                             Best
IBM PowerPC G4 (in an iBook)                    Darwin / OSX  1.4
IBM PowerPC G4 (PowerMac)                       Linux         1.6
IBM PowerPC G4 (PowerMac)                       Darwin / OSX  1.7
IBM Power4                                      AIX 5.1       2.1
IBM PowerPC 970 (Apple G5)                      Darwin / OSX  2.2                  1
AMD Athlon                                      Linux         6.0
AMD AthlonXP                                    Windows XP    7.1                             Courtesy of Alexei Samoylov
Intel Pentium 3 Xeon                            Linux         14.5                            Courtesy of Greg Koenig
Intel Pentium 3                                 Linux         15.8
Alpha EV67                                      Linux         20.5                            Courtesy of Andrew Lenharth
AMD Athlon64                                    Linux         21.4
AMD Opteron64                                   Linux         23.8
Intel Quad-Core Xeon E5345 / 2.33 GHz           Linux         26.0
Intel Quad-Core Xeon (8-core MacPro) / 2.8 GHz  Darwin / OSX  28.8
Alpha PCA56                                     Linux         31.9                            Courtesy of Andrew Lenharth
Intel Core Duo (MacBook Pro)                    Darwin / OSX  44.2
Intel Core Duo (MacBook)                        Linux         55.4                            Courtesy of Eric Bengtson
Intel Pentium 4                                 Linux         92.2
Alpha Alpha EV6.8CB                             Tru64 Unix    95.1
Intel P4 Xeon                                   Linux         98.2
Intel Pentium 4                                 Linux         129.8
Intel Itanium 2 (IA-64)                         Linux         183.2                1
Sun UltraSPARC IV 64-bit                        Solaris       520.0                           Worst

These tests were performed on machines at: TeraGrid, NCSA, TACC, PSC, UIUC, PPL, as well as on personal computers. If you have the opportunity to run this test program on other types of machines, please send us the results to be added here.

Other information about subnormals:

IEEE 754R Meeting Minutes describes the trends in processor designs and the assumptions held by chip designers. It also describes why platforms with fused multiply-add functionality can handle underflow much more quickly with minimal additional logic. This may explain why the Power processors perform significantly better on the sample application above.

Intel Performance Primitives describes methods for zeroing small values and denormals. The table on page 5 lists how many cycles various operations take.

This page maintained by Isaac Dooley.