OpenAtom
Version 1.5a
Ortho is decomposed by orthoGrainSize.
#include "debug_flags.h"
#include "orthoConfig.h"
#include "ortho.decl.h"
#include "pcSectionManager.h"
#include "CLA_Matrix.h"
#include "ckcallback-ccs.h"
Classes | |
class | initCookieMsg |
class | orthoMtrigger |
class | Ortho |
For definition of CkDataMsg. | |
Macros | |
#define | INVSQR_TOLERANCE 1.0e-15 |
#define | INVSQR_MAX_ITER 10 |
#define | myabs std::abs |
Variables | |
bool | fakeTorus |
int | numPes |
Ortho is decomposed by orthoGrainSize.
We restrict orthoGrainSize to be a factor of sGrainSize so that there are no section overlap issues. This leaves us with ortho sections that need only a simple tiling split of the sgrain sections, mirrored by a stitching of the submatrix inputs on the backward path.
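The tiling split can be illustrated with a minimal sketch. The `Tile` struct and `tileSGrain` function are hypothetical names introduced here for illustration and are not part of ortho.h. The sketch assumes square sGrainSize x sGrainSize blocks and only enumerates tile offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of OpenAtom.
struct Tile {
    int rowOffset;  // first row of the tile within the sgrain block
    int colOffset;  // first column of the tile within the sgrain block
};

// Enumerate the orthograin tiles that cover one sGrainSize x sGrainSize block.
// The divisibility check mirrors the restriction stated in the text.
std::vector<Tile> tileSGrain(int sGrainSize, int orthoGrainSize) {
    assert(orthoGrainSize > 0 && sGrainSize % orthoGrainSize == 0);
    const int tilesPerSide = sGrainSize / orthoGrainSize;
    std::vector<Tile> tiles;
    tiles.reserve(static_cast<std::size_t>(tilesPerSide) * tilesPerSide);
    for (int r = 0; r < tilesPerSide; ++r) {
        for (int c = 0; c < tilesPerSide; ++c) {
            tiles.push_back({r * orthoGrainSize, c * orthoGrainSize});
        }
    }
    return tiles;
}
```

Because orthoGrainSize divides sGrainSize, the tiles cover the block exactly and no tile straddles two sgrain sections.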
This can be accomplished manually within the current codebase, with some waste in data replication and computation replication to handle the splitting/stitching operations.
A more efficient implementation would adopt the multicast manager group model of building a tree of participants for these operations. The reduction side from the PC would be broken up into multiple reductions, one for each orthograin within the sgrain, with a separate contribution for each orthograin. The multicast requires us to stitch together the input matrices into one per sgrain section. This might be accomplished in two stages: one in which the stitching is done, and a second in which the stitched sGrainSize matrices are multicast. The alternative is to multicast the orthograin submatrices where needed and have each scalc do its own strided-copy stitching. As stitching is not computationally intensive, this may be the simplest and fastest solution. The second approach also lets the reductions and multicasts be mirror uses of the same tree: each little ortho can run as soon as it gets its input, while the scalcs have to assemble their inputs from multiple multicasts.
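The strided-copy stitching on the scalc side can be sketched as follows. This is an illustration under stated assumptions, not the OpenAtom implementation: `stitchTile` is a hypothetical name, matrices are assumed to be dense, row-major arrays of `double`, and message handling is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy one orthoGrainSize x orthoGrainSize tile into its place inside a
// row-major sGrainSize x sGrainSize matrix. tileRow and tileCol are tile
// indices (not element offsets) within the sgrain block.
void stitchTile(std::vector<double>& sGrainMatrix, int sGrainSize,
                const std::vector<double>& tile, int orthoGrainSize,
                int tileRow, int tileCol) {
    assert(orthoGrainSize > 0 && sGrainSize % orthoGrainSize == 0);
    assert(sGrainMatrix.size() == static_cast<std::size_t>(sGrainSize) * sGrainSize);
    assert(tile.size() == static_cast<std::size_t>(orthoGrainSize) * orthoGrainSize);
    for (int r = 0; r < orthoGrainSize; ++r) {
        const double* src = tile.data() + static_cast<std::size_t>(r) * orthoGrainSize;
        // Destination rows are sGrainSize apart, hence the strided copy.
        double* dst = sGrainMatrix.data()
                    + static_cast<std::size_t>(tileRow * orthoGrainSize + r) * sGrainSize
                    + tileCol * orthoGrainSize;
        std::copy(src, src + orthoGrainSize, dst);
    }
}
```

Each incoming multicast would trigger one such copy, and the scalc can proceed once all tiles of its sgrain block have arrived. The cost is one pass over the data, which is consistent with the remark that stitching is not computationally intensive.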
Implementing this requires that each ortho object participate in a section which has a section multicast client directed to the sGrainSize PC section. Conversely, the PC sGrainSize elements will have an array of section cookies, one for each of the subsections for all orthograin elements within the sgrain. The forward path of the PC will contribute its orthograin tile (via a strided contribute), which will end up at the correct ortho object.
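The forward-path tile extraction and the per-tile cookie lookup might look like the sketch below. Both `cookieSlot` and `extractTile` are hypothetical helpers. The row-major cookie layout is an assumption, and the actual section reduction call is deliberately omitted because the source does not show it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: flat slot of the section cookie for tile (tileRow, tileCol),
// assuming one cookie per orthograin tile stored in row-major order.
int cookieSlot(int tileRow, int tileCol, int tilesPerSide) {
    assert(tileRow >= 0 && tileRow < tilesPerSide);
    assert(tileCol >= 0 && tileCol < tilesPerSide);
    return tileRow * tilesPerSide + tileCol;
}

// Gather one orthograin tile out of a row-major sGrainSize x sGrainSize
// matrix into a contiguous buffer suitable for a separate contribution.
std::vector<double> extractTile(const std::vector<double>& sGrainMatrix,
                                int sGrainSize, int orthoGrainSize,
                                int tileRow, int tileCol) {
    assert(orthoGrainSize > 0 && sGrainSize % orthoGrainSize == 0);
    assert(sGrainMatrix.size() == static_cast<std::size_t>(sGrainSize) * sGrainSize);
    std::vector<double> tile(static_cast<std::size_t>(orthoGrainSize) * orthoGrainSize);
    for (int r = 0; r < orthoGrainSize; ++r) {
        const double* src = sGrainMatrix.data()
                          + static_cast<std::size_t>(tileRow * orthoGrainSize + r) * sGrainSize
                          + tileCol * orthoGrainSize;
        std::copy(src, src + orthoGrainSize,
                  tile.data() + static_cast<std::size_t>(r) * orthoGrainSize);
    }
    return tile;
}
```

In this sketch a PC element would loop over its tiles, extract each one, and contribute it using the cookie found at `cookieSlot(...)`, so that every ortho object receives exactly the reduction it needs.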
Note: these PC sections must include all 4th dim blocks.
OrthoHelper can be used to perform the 2nd of the multiplies in the 3 step S->T process in parallel with the 3rd multiply. If used, the results of multiply 1 are sent from ortho[x,y] to orthoHelper[x,y]. The results are then returned to ortho[x,y]. Whichever of step 2 or step 3 finishes last will then trigger step 4. Due to the copy and communication overhead, this is only worth doing if the number of processors is greater than twice the number of ortho chares.
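The two rules in this paragraph, the processor-count threshold and the "last arrival triggers step 4" join, can be written down as a small sketch. `orthoHelperWorthwhile` and `StepJoin` are hypothetical names for illustration. How the real code counts ortho chares or tracks completion is not shown in this file summary.

```cpp
#include <cassert>

// True when the helper is expected to pay off according to the rule of
// thumb in the text: more than twice as many processors as ortho chares.
bool orthoHelperWorthwhile(int numPes, int numOrthoChares) {
    assert(numPes > 0 && numOrthoChares > 0);
    return numPes > 2 * numOrthoChares;
}

// Minimal join for two concurrent steps: arrive() returns true only for
// the second arrival, which is the one that should trigger step 4.
struct StepJoin {
    int arrived = 0;
    bool arrive() {
        ++arrived;
        assert(arrived <= 2);
        return arrived == 2;
    }
};
```

For example, with 16 ortho chares the helper would be enabled on 64 processors but not on 32, since 32 is not strictly greater than 2 * 16.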
Allowing sGrainSize choices for which nstates % sGrainSize != 0 forces us to handle remainder logic. To avoid overlap/straddle issues between ortho and PC, we still enforce sGrainSize % orthoGrainSize == 0. Complexity cost here comes in two forms.
The total multiply itself will still of course be nstates X nstates.
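The remainder case can be made concrete with a short sketch. It assumes, purely for illustration, that the leftover states are folded into the last block. The text does not say which scheme OpenAtom uses, and `blockSizes` and `totalStates` are hypothetical names.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Block sizes along one dimension when nstates need not be a multiple of
// grainSize. ASSUMPTION: the remainder is absorbed by the last block.
std::vector<int> blockSizes(int nstates, int grainSize) {
    assert(grainSize > 0 && nstates >= grainSize);
    const int numBlocks = nstates / grainSize;
    const int remainder = nstates % grainSize;
    std::vector<int> sizes(numBlocks, grainSize);
    sizes.back() += remainder;
    return sizes;
}

// Sum of the block sizes; this always equals nstates, which is why the
// total multiply is still nstates x nstates.
int totalStates(const std::vector<int>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), 0);
}
```

With nstates = 10 and a grain size of 4 this yields blocks of 4 and 6, so the blocks are no longer uniform but still cover all 10 states.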
Definition in file ortho.h.