----------------------- StarSlice perf. tests -------------------
old and new meshes are named ".noboite" (tetmesh format)
in
	~olawlor/csar/meshing/starslice/transfer/
so you've got to do
	ln -s ~olawlor/csar/meshing/starslice/transfer/*.noboite .

Transfer from old to new, with fabricated solution data
Read input file old.noboite: 18232 nodes, 86906 tets
Read input file new.noboite: 23274 nodes, 117787 tets

-- Serial tests on thrift4 (AthlonXP2000, 1GB ram) --
  (-memory os is *much* better than our builtin allocator--
   the builtin version doesn't seem to release metis' huge
   buffers when they get freed (sbrk issue?))

Everything is built with -O -g

 +p1 +vp1
Finding bounding boxes took 0.154904 s (+0.000193 s imbalance)
Colliding bounding boxes took 13.764658 s (+0.000318 s imbalance)
Extracting collision list took 0.434714 s (+0.000228 s imbalance)
Rank 0: 5597657 collisions, 
Finding communication size took 0.120343 s (+0.000155 s imbalance)
Creating outgoing messages took 0.116760 s (+0.000185 s imbalance)
Isend took 0.000211 s (+0.000060 s imbalance)
Recv took 0.000079 s (+0.000053 s imbalance)
Wait took 0.000077 s (+0.000051 s imbalance)
Transferring solution took 48.980991 s (+0.000217 s imbalance)
  (lots of warnings about little boundary elements that 
   are getting clipped off)
Normalizing transfer took 0.057111 s (+0.000184 s imbalance)
[64.517 s] Transferred.

 +p1 +vp2 (peak memory is close to 500MB during collide (!))
Similar overall, collide is slightly slower (16s now).
Send/recv lists are pretty dang big-- 5.5m collisions for
only  100K elements....
[78.622 s] Transferred, counting 11s in metis.

Varying numbers of processes on single-PE thrift4:
Input/partition:
log_thrift4_1:[0.826 s] Input files read
log_thrift4_2:[11.004 s] Input files read
log_thrift4_4:[15.652 s] Input files read
log_thrift4_8:[11.917 s] Input files read
log_thrift4_16:[12.692 s] Input files read
  -> This is surpisingly slow.  Metis isn't such a problem,
     but node adjacency spits out too many links.
     Changing the adjacency to face-based both speeds up
     the solution and cuts the memory dramatically (to like 3s!).

Collide:
log_thrift4_1:Colliding bounding boxes took 13.621231 s (+0.000344 s imbalance)
log_thrift4_2:Colliding bounding boxes took 21.385291 s (+0.064712 s imbalance)
log_thrift4_4:Colliding bounding boxes took 22.907772 s (+1.533167 s imbalance)
log_thrift4_8:Colliding bounding boxes took 20.893643 s (+3.414591 s imba
log_thrift4_16:Colliding bounding boxes took 17.842731 s (+12.608363 s imbalance)

Total time:
log_thrift4_1:[63.437 s] Transferred.
log_thrift4_2:[83.291 s] Transferred.
log_thrift4_4:[91.333 s] Transferred.
log_thrift4_8:[87.812 s] Transferred.
log_thrift4_16:[96.161 s] Transferred.

 -> This is excellent overall scaling. Be interesting to 
    see how the real parallel performance is.

------------------ Magic Intersection -------------------
To avoid numerical difficulties in my intersection routines,
switching to Magic Software's intersection routines,
which are volume-based and hence less likely to get
confused by roundoff.

They're initially almost exactly the same speed under -O:
Transferring solution took 47.655317 s (+0.000120 s imbalance)

A -pg run shows (with a *4x* slowdown from -pg...):
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 15.91     14.54    14.54 42604238     0.00     0.00  SplitAndDecompose(Mgc::Tetrahedron, Mgc::Plane const &, vector<Mgc:
:Tetrahedron, allocator<Mgc::Tetrahedron> > &)
 11.62     25.16    10.62 793922392     0.00     0.00  Mgc::Vector3::Vector3(Mgc::Vector3 const &)
  6.65     31.24     6.08 40391633     0.00     0.00  vector<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> >::_M_insert_a
ux(Mgc::Tetrahedron *, Mgc::Tetrahedron const &)
  6.22     36.93     5.69 80783266     0.00     0.00  Mgc::Tetrahedron * __uninitialized_copy_aux<Mgc::Tetrahedron *, Mgc
::Tetrahedron *>(Mgc::Tetrahedron *, Mgc::Tetrahedron *, Mgc::Tetrahedron *, __false_type)
  6.15     42.55     5.62  5597657     0.00     0.01  Mgc::FindIntersection(Mgc::Tetrahedron const &, Mgc::Tetrahedron co
nst &, vector<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> > &)
  5.67     47.73     5.18 376371632     0.00     0.00  Mgc::Vector3::operator=(Mgc::Vector3 const &)
  5.02     52.32     4.59  5597657     0.00     0.01  getSharedVolumeMgc(TetMesh const &, TetMesh const &, int const *, i
nt const *)
  4.57     56.50     4.18 198405237     0.00     0.00  Mgc::Vector3::Dot(Mgc::Vector3 const &) const
  4.47     60.59     4.09 258061106     0.00     0.00  Mgc::Vector3::Vector3(double, double, double)
  3.85     64.11     3.52                             CollideOctant::divide(int)
  3.22     67.05     2.94                             simpleFindCollisions(CollideOctant const &, CollisionList &)
  3.11     69.89     2.84 108600622     0.00     0.00  Mgc::operator*(double, Mgc::Vector3 const &)
  2.78     72.43     2.54  5833145     0.00     0.00  Mgc::Tetrahedron * __uninitialized_copy_aux<Mgc::Tetrahedron const 
*, Mgc::Tetrahedron *>(Mgc::Tetrahedron const *, Mgc::Tetrahedron const *, Mgc::Tetrahedron *, __false_type)
  2.63     74.83     2.40 22390628     0.00     0.00  vector<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> >::operator=(v


Original:
Transferring solution took 47.655317 s (+0.000120 s imbalance)

inline Vector3 constructor:
Transferring solution took 35.354978 s (+0.000114 s imbalance)

early-exit main intersection loop:
Transferring solution took 34.663323 s (+0.000121 s imbalance)

parameter-passing early-exit optimization on tet parameter:
Transferring solution took 34.083170 s (+0.000121 s imbalance)

vector-copy optimization:
Transferring solution took 27.907812 s (+0.000120 s imbalance)

fewer copies in caller code:
Transferring solution took 26.393417 s (+0.000120 s imbalance)

Vector3::Dot inlining: (I'm sure glad CkVector is almost all inline!)
Transferring solution took 22.342098 s (+0.000116 s imbalance)

Early-exit test for nonintersecting tets (*halves* the SplitAndDecompose calls!)
Transferring solution took 19.053882 s (+0.000121 s imbalance)

Final -pg run:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.82      7.60     7.60 25265028     0.00     0.00  SplitAndDecompose(Mgc::Tetrahedron const &, Mgc::Plane const &, vec
tor<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> > &)
 11.08     11.29     3.69                             CollideOctant::divide(int)
  8.80     14.22     2.93  5597657     0.00     0.00  getSharedVolumeMgc(TetMesh const &, TetMesh const &, int const *, i
nt const *)
  7.90     16.85     2.63                             simpleFindCollisions(CollideOctant const &, CollisionList &)
  5.68     18.74     1.89 11195314     0.00     0.00  Mgc::Tetrahedron::GetPlanes(Mgc::Plane *) const
  4.83     20.35     1.61 14480365     0.00     0.00  vector<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> >::_M_insert_a
ux(Mgc::Tetrahedron *, Mgc::Tetrahedron const &)
  4.77     21.94     1.59 28960730     0.00     0.00  Mgc::Tetrahedron * __uninitialized_copy_aux<Mgc::Tetrahedron *, Mgc
::Tetrahedron *>(Mgc::Tetrahedron *, Mgc::Tetrahedron *, Mgc::Tetrahedron *, __false_type)
  4.41     23.41     1.47  5597657     0.00     0.00  Mgc::FindIntersection(Mgc::Tetrahedron const &, Mgc::Tetrahedron co
nst &, vector<Mgc::Tetrahedron, allocator<Mgc::Tetrahedron> > &)
  3.81     24.68     1.27  8419254     0.00     0.00  allOutside(Mgc::Plane const *, int, Mgc::Tetrahedron const &)