Subnormal Floating-Point Degradation in Parallel Applications
This page discusses the interference caused by the quiet generation of subnormal (also called denormalized) floating-point values in parallel applications. This site was created as a supplement to the OSIHPA 2005 paper entitled Performance Degradation in the Presence of Subnormal Floating-Point Values. A PDF is currently available from the OSIHPA webpage.
It is generally believed that subnormal floating-point values are rare, and thus that handling them quickly in hardware is not essential. We argue that their occurrence is significant and that, furthermore, the resulting performance degradation is amplified in a parallel application, making the phenomenon much more significant than commonly believed. The existing cost models for these floating-point operations need to be addressed for parallel programs, since the cost will depend on the number of processors used. In the paper we provide a short discussion of these topics. Below we list some performance tests on a wide range of modern processors.
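For readers unfamiliar with the term, the following is a minimal C sketch of what "subnormal" means in practice. It is not taken from the paper or the micro-benchmark; the function names is_subnormal and count_subnormal_halvings are hypothetical, and the sketch assumes IEEE 754 double precision with default rounding and no flush-to-zero mode enabled.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if x is a subnormal (denormalized) double, 0 otherwise.
 * fpclassify and FP_SUBNORMAL are standard C99 (<math.h>). */
int is_subnormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Hypothetical helper: halves a value repeatedly, starting from 1.0,
 * and counts how many of the results are subnormal before the value
 * underflows to zero. The volatile qualifier keeps the compiler from
 * folding the loop away at compile time. */
int count_subnormal_halvings(void)
{
    volatile double x = 1.0;
    int count = 0;

    while (x != 0.0) {
        x = x / 2.0;
        if (is_subnormal(x))
            count++;
    }
    return count;
}
```

No exception or warning is raised when x drops below DBL_MIN; the values simply become subnormal. That is the quiet generation referred to above. Under the stated assumptions the loop should pass through DBL_MANT_DIG - 1 subnormal values before reaching zero.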
Performance analysis on different processors:
We provide versions of our Gauss-Seidel Subnormal Micro-benchmark in C and Java. Both test the slowdown that occurs when subnormal floating-point values are present in a computation. There is also a bash script that downloads the source code, compiles it, and runs the micro-benchmark with a variety of compilers and options. The results from running this script on a wide range of systems are available below. For more information, click on the processor link to view the entire output. To run it yourself, download the bash script into a new empty directory and run it (its filename extension is .txt so that your browser and this server don't get confused):
mkdir newdir
cd newdir
wget http://charm.cs.uiuc.edu/subnormal/files/runtests.txt
bash runtests.txt
Various compilers were used, including xlc, gcc, and icc. Although the results vary between processors, the slowdowns are not caused by a particularly bad compiler. Please review the result files to determine how a particular compiler affected the results on any of the architectures. I do not have access to a large number of different compilers, so I just used the ones available on some of the machines I commonly use. The important number in the results is the slow/fast ratio. If you have access to other types of architectures or compilers, please run the sample program and send us the results.
These tests were performed on machines at: TeraGrid, NCSA, TACC, PSC, UIUC, PPL, as well as on personal computers. If you have the opportunity to run this test program on other types of machines, please send us the results to be added here.
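To make the slow/fast ratio concrete, here is a minimal C sketch of how such a ratio could be measured. It is not the Gauss-Seidel Subnormal Micro-benchmark distributed on this page. The function names (fill, relax, time_relax, slow_fast_ratio), the 1D grid, and the starting value DBL_MIN / 64 are all assumptions chosen for illustration. The sketch runs the same in-place averaging sweeps twice, once on normal values and once on values that stay subnormal throughout, and divides the two timings.

```c
#include <float.h>
#include <stdlib.h>
#include <time.h>

/* Fill u[0..n-1] with small multiples of base (between base and 7*base). */
void fill(double *u, int n, double base)
{
    for (int i = 0; i < n; i++)
        u[i] = base * (double)(1 + i % 7);
}

/* Perform in-place Gauss-Seidel-style sweeps on a 1D grid: each interior
 * point becomes the average of its two neighbors. The end points are
 * fixed, so every value stays between the smallest and largest initial
 * value. Returns the middle element so the work is not optimized away. */
double relax(double *u, int n, int sweeps)
{
    for (int s = 0; s < sweeps; s++)
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    return u[n / 2];
}

/* Time the sweeps for a grid initialized from base.
 * Returns CPU seconds, or -1.0 if the allocation fails. */
double time_relax(double base, int n, int sweeps)
{
    double *u = malloc((size_t)n * sizeof *u);
    if (u == NULL)
        return -1.0;
    fill(u, n, base);

    clock_t start = clock();
    volatile double sink = relax(u, n, sweeps);
    clock_t stop = clock();

    (void)sink;
    free(u);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Ratio of the subnormal ("slow") run to the normal ("fast") run.
 * 7 * DBL_MIN / 64 is still below DBL_MIN, so the slow run operates on
 * subnormal values only. Returns -1.0 if the timer was too coarse to
 * measure the fast run or an allocation failed. */
double slow_fast_ratio(int n, int sweeps)
{
    double fast = time_relax(1.0, n, sweeps);
    double slow = time_relax(DBL_MIN / 64.0, n, sweeps);
    if (fast <= 0.0 || slow < 0.0)
        return -1.0;
    return slow / fast;
}
```

A call such as slow_fast_ratio(100000, 200) performs identical arithmetic in both runs, so any ratio well above 1 would be attributable to subnormal handling. The measured value depends on the processor, the compiler, and flags such as those that enable flush-to-zero, so no particular number should be expected from this sketch.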
Other information about subnormals:
IEEE 754R Meeting Minutes describes the trends in processor designs and the assumptions held by chip designers. It also describes why platforms with fused-multiply-add functionality can handle underflow much more quickly with minimal additional logic. This may explain why the Power processors perform significantly better on the sample application above.
Intel Performance Primitives describes methods for zeroing small values and denormals. The table on page 5 lists how many cycles various operations take.