Speed Testing for LiveViz 2D: 2003/7/18 CVS Version

AMD AthlonXP, 1250MHz, 1GB Ram, RedHat 9.0

Charm built with 
	> cd ~/charm
	> make OPTS="-O -pg -DCMK_OPTIMIZE=1"

	> cd ~/charm/pgms/charm++/ccs/liveViz/speedtest
	> make test
...
./charmrun ./app +p3 ++server ++server-port 1234
Charm++: standalone mode (not using charmrun)
1 deposited 1000 x 1000 image in 0.084340 s
2 deposited 1000 x 1000 image in 0.077340 s
0 deposited 1000 x 1000 image in 0.082073 s
Client: total request time for 1000 1000 image (1000000 bytes): 0.773196 s

This is with three processes all on the same machine, so 
roughly 770 milliseconds to assemble a 1 MB local image is far too long.
I suspect an excessive amount of copying happening inside liveViz:

Tracing out from liveVizDeposit,
user's data
-> XSortedImageList's "list" image (a)
-> packedData
-> CkReductionMsg

Similarly, reduction functions work like:
Reduction messages
-> XSortedImageList images (unpacked)
-> XSortedImageList (combined)
-> packedData
-> CkReductionMsg

Finally, vizReductionHandler does:
Reduction message
-> XSortedImageList images (unpacked)
-> Image (combined)
-> CkAllocImage
-> CCS

Every arrow is an allocate and a full copy, which is ridiculous.

Optimizations applied so far:
1.) The most cursory possible pass, to minimize allocation and copying.
2.) An early exit for the combine operation.

--------------------------------------------------------------------------
This is still slower than I'd like, and gprof doesn't appear useful.
After running with +p3 and 500 client requests for 1000x1000 images:

30653 pts/1    S      0:05 ./charmrun ./app +p3 ++server ++server-port 1234
30671 ?        S      0:09 app
30678 ?        S      0:09 app
30682 ?        S      0:46 app (PE 0?)

A typical client run looks like:
Charm++: standalone mode (not using charmrun)
0 deposited 1000 x 1000 image in 0.006973 s
1 deposited 1000 x 1000 image in 0.007306 s
2 deposited 1000 x 1000 image in 0.007255 s
LiveViz: sending ccs back 0.128951 s
Client: request sent: 0.141697 s
LiveViz: request took 0.141342 s
Client: total request time for 1000 1000 image (1000000 bytes): 0.152821 s
Program finished.

Dividing each process's CPU time by the 500 requests, the breakdown per request must then be:
charmrun: 10ms
pe 1, pe 2: 20ms (about right, considering the deposit time is 7ms)
pe 0: 92ms (?)
kernel/TCP delays: 30ms

The biggest factor is something on PE 0, likely the 
actual image combine step.  Extra instrumentation shows
the image combine takes about 60 milliseconds (!).

Simplifying the index computations has little 
benefit -- only down to 55 ms.

Adding a clipping array actually slows things down, to 68 ms.

Incrementalizing the per-pixel computations drops it to 19 ms.
Adding a greyscale special case drops it again to 15 ms for
three images (5 ms per 1-Mpixel image -> 200 MB/sec).
The total time is still around 100 ms, I'm guessing mostly
because of various TCP delays.

This still isn't quite as fast as it might be (MMX?), but
it's faster than most networks, so it's probably fast enough.

