Standard lattice layout:
 4 dimensions
 Node remapping: TRIVIAL (no effort made to reorder)

 Sites on node: 16 x 16 x 16 x 8
 Processor layout: 2 x 2 x 2 x 4
Matrix * Matrix: 1.67969 ms
Vector * Matrix: 0.595703 ms
Vector square sum: 0.0610352 ms
Dirac 4 dirs: 4.6875 ms
Dirac: 6.875 ms
CG: 8.04688 ms / iteration
 COMMS from node 0: 1028 done, 6640(86.5936%) optimized away
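
The figures above can be cross-checked with a little arithmetic. The sketch below is not part of the benchmark program. It assumes that the global lattice is the per-node site count multiplied by the processor layout in each direction, and that the reported percentage is the number of communications optimized away divided by the total of those done plus those optimized away. All variable names are made up for illustration.

```python
from math import prod

# Values copied from the log above.
sites_on_node = (16, 16, 16, 8)
processor_layout = (2, 2, 2, 4)
comms_done = 1028
comms_skipped = 6640  # "optimized away"

# Assumption: global extent = local extent * number of nodes, per direction.
global_lattice = tuple(s * p for s, p in zip(sites_on_node, processor_layout))
n_nodes = prod(processor_layout)
global_volume = prod(global_lattice)

# Assumption: percentage = skipped / (done + skipped).
skipped_pct = 100.0 * comms_skipped / (comms_done + comms_skipped)

print("Global lattice:", " x ".join(str(n) for n in global_lattice))  # 32 x 32 x 32 x 32
print("Nodes:", n_nodes)                                              # 32
print("Global volume:", global_volume)                                # 1048576
print(f"Optimized away: {skipped_pct:.4f}%")
```

Under these assumptions the layout corresponds to a 32^4 lattice spread over 32 nodes, and the computed fraction agrees with the percentage printed in the COMMS line.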
