Standard lattice layout:
 4 dimensions
 Node remapping: TRIVIAL (no effort made to reorder)

 Sites on node: 32 x 16 x 16 x 16
 Processor layout: 1 x 2 x 2 x 2
Matrix * Matrix: 3.84766 ms
Vector * Matrix: 1.33789 ms
Vector square sum: 0.307617 ms
Dirac 4 dirs: 14.375 ms
Dirac: 14.0625 ms
CG: 22.3438 ms / iteration
 COMMS from node 0: 516 done, 3312(86.5204%) optimized away
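
The figures above can be cross-checked with a little arithmetic. The sketch below is not part of the benchmark. It assumes that the global lattice extent is the per-node extent multiplied by the processor layout in each dimension, and that the reported percentage is the optimized-away count divided by the total of done plus optimized-away communications. All variable names are hypothetical.

```python
from math import prod

# Values copied from the log above.
sites_on_node = (32, 16, 16, 16)
processor_layout = (1, 2, 2, 2)
comms_done = 516
comms_optimized = 3312

# Assumption: global extent = per-node extent * nodes along that dimension.
global_lattice = tuple(s * p for s, p in zip(sites_on_node, processor_layout))
sites_per_node = prod(sites_on_node)
node_count = prod(processor_layout)

# Assumption: percentage = optimized away / (done + optimized away).
optimized_pct = 100.0 * comms_optimized / (comms_done + comms_optimized)

print(global_lattice, sites_per_node, node_count)  # (32, 32, 32, 32) 131072 8
print(f"{optimized_pct:.4f}%")                     # 86.5204%
```

Under these assumptions the run corresponds to a 32 x 32 x 32 x 32 lattice spread over 8 nodes, and the computed fraction matches the 86.5204% printed in the COMMS line.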
