Subnormal Floating-Point Degradation in Parallel Applications

This page discusses the interference caused by the quiet generation of subnormal (denormalized) floating-point values in parallel applications. It was created as a supplement to the OSIHPA 2005 paper entitled Performance Degradation in the Presence of Subnormal Floating-Point Values. A PDF is currently available from the OSIHPA webpage.

It is generally believed that subnormal, also called denormalized, floating-point values are rare, and thus that handling them quickly in hardware is not essential. We argue that their occurrence is significant and, furthermore, that the resulting performance degradation is amplified in a parallel application, making the phenomenon much more significant than commonly believed. The existing cost models for these floating-point operations need to be revisited for parallel programs, since the cost will depend on the number of processors used. In the paper we provide a short discussion of these topics. Below we list some performance tests on a wide range of modern processors.

Performance analysis on different processors:

We provide versions of our Gauss-Seidel Subnormal Micro-benchmark in C and Java. Both test the slowdown that occurs when subnormal floating-point values are present in a computation. There is also a bash script that downloads the source code, then compiles and runs the micro-benchmark with a variety of compilers and options. The results from running this script on a wide range of systems are available below; click on a processor link to view the entire output. To run it yourself, download the bash script into a new, empty directory and run it (its filename extension is .txt so that your browser and this server don't get confused).

     mkdir newdir
     cd newdir
     wget http://charm.cs.uiuc.edu/subnormal/files/runtests.txt
     bash runtests.txt

Various compilers were used, including xlc, gcc, and icc. Although the results vary between processors, the slowdowns are not caused by a particularly bad compiler. Please review the result files to see how a particular compiler affected the results on any of the architectures. I do not have access to a large number of different compilers, so I used the ones available on the machines I commonly use. The important number in the results is the slow/fast ratio. If you have access to other types of architectures or compilers, please run the sample program and send us the results.

Vendor	Processor	OS	Worst Case Slowdown	More Runs	Notes
AMD	K6	Linux	1.4		Best
IBM	PowerPC G4(in an iBook)	Darwin / OSX	1.4
IBM	PowerPC G4(PowerMac)	Linux	1.6
IBM	PowerPC G4(PowerMac)	Darwin / OSX	1.7
IBM	Power4	AIX 5.1	2.1
IBM	PowerPC 970 (Apple G5)	Darwin / OSX	2.2	1
AMD	Athlon	Linux	6.0
AMD	AthlonXP	Windows XP	7.1		Courtesy of Alexei Samoylov
Intel	Pentium 3 Xeon	Linux	14.5		Courtesy of Greg Koenig
Intel	Pentium 3	Linux	15.8
Alpha	EV67	Linux	20.5		Courtesy of Andrew Lenharth
AMD	Athlon64	Linux	21.4
AMD	Opteron64	Linux	23.8
Intel	Quad-Core Xeon E5345 / 2.33 GHz	Linux	26.0
Intel	Quad-Core Xeon (8-core MacPro) / 2.8 GHz	Darwin / OSX	28.8
Alpha	PCA56	Linux	31.9		Courtesy of Andrew Lenharth
Intel	Core Duo (MacBook Pro)	Darwin / OSX	44.2
Intel	Core Duo (MacBook)	Linux	55.4		Courtesy of Eric Bengtson
Intel	Pentium 4	Linux	92.2
Alpha	Alpha EV6.8CB	Tru64 Unix	95.1
Intel	P4 Xeon	Linux	98.2
Intel	Pentium 4	Linux	129.8
Intel	Itanium 2 (IA-64)	Linux	183.2	1
Sun	UltraSPARC IV 64-bit	Solaris	520.0		Worst

These tests were performed on machines at TeraGrid, NCSA, TACC, PSC, UIUC, and PPL, as well as on personal computers. If you have the opportunity to run this test program on other types of machines, please send us the results so they can be added here.

Other information about subnormals:

The IEEE 754R Meeting Minutes describe trends in processor design and assumptions held by chip designers. They also describe why platforms with fused multiply-add functionality can handle underflow much more quickly with minimal additional logic. This may explain why the Power processors perform significantly better on the sample application above.

Intel Performance Primitives describes methods for zeroing small values and denormals. The table on page 5 lists how many cycles various operations take.

This page is maintained by Isaac Dooley.