
Initial Investigation (Novemeber 2016)

kevin@login:/work/knl-users/kevin/sc16> module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.5                 14) lustre-utils/2.3.4-6.74
  2) alps/6.1.6-20.1                  15) Base-opts/2.1.3-2.16
  3) nodestat/2.2-2.40                16) intel/17.0.0.098
  4) sdb/2.2.1-3.119                  17) craype-mic-knl
  5) udreg/2.3.2-4.6                  18) craype-network-aries
  6) ugni/6.0.12-2.1                  19) craype/2.5.7
  7) gni-headers/5.0.7-3.1            20) cray-mpich/7.4.4
  8) dmapp/7.1.0-12.37                21) pbs/default
  9) xpmem/0.1-4.5                    22) cray-libsci/16.09.1
 10) llm/20.2.4-3.18                  23) pmi/5.0.10-1.0000.11050.0.0.ari
 11) nodehealth/5.2.0-5.46            24) atp/2.0.3
 12) system-config/2.2.18-3.38        25) PrgEnv-intel/6.0.3
 13) sysadm/2.2.2-3.39

All KNL quad_0 mode using 64 or 128 threads with 1 MPI task
numactl -m 1
(preliminary investigations show little or no improvement at 256 threads)

* -DNDEBUG -O2 -qopenmp -no-vec  => c. --- and 149 sec
* -DNDEBUG -O2 -qopenmp          => c. --- and 152 sec
* -DNDEBUG -O2 -qopenmp -DVVL=2  => c. 182 and 148 sec
* -DNDEBUG -O2 -qopenmp -DVVL=4  => c. 129 and 101 sec
* -DNDEBUG -O2 -qopenmp -DVVL=8  => c. 114 and  69 sec
* -DNDEBUG -O2 -qopenmp -DVVL=16 => c. 119 and  72 sec

* -DNDEBUG -O3 -qopenmp -DVVL=8    => c. 115 and  69 sec
* -DNDEBUG -fast -qopenmp -DVVL=8  => c. 108 and  58 sec


1, 2, 4, 8, 16, 32, 64, and 128 threads
* -DNDEBUG -fast -qopenmp -DVVL=8  =>
                  c. 3260, 1656, 837, 424, 220, 129, 107, 58  seconds
                  c. 3253, 1657, 835, 424, 219, 129, 108, 105 sec
                  c. 3257, 1658, 837, 424, 220, 129, 108, 110 sec

Using quad_100

1, 2, 4, 8, 16, 32, 64, and 128 threads
* -DNDEBUG -fast -qopenmp -DVVL=8  =>
                  c. 3356, 1708, 857, 434, 221, 116, 66.6, 56.4 
                  c. 3355, 1708, 857, 434, 223, 116, 68.1, 55.9
                  c. 3352, 1707, 857, 433, 222, 116, 66.7, 57.2


Rerun the above comparison between quad_0 and quad_100 with svn 3018
(with updated propagation). The odd quad_0 may be related to numactrl
(or lack of it). All the following have no numactrl in the command
line.

quad_0
1, 2, 4, 8, 16, 32, 64, 128, and 256 threads
* -DNDEBUG -fast -qopenmp -DVVL=8  =>
                  c. 2878, 1465, 738, 375, 195, 123, 110, 108, 120
                  c. 2878, 1467, 738, 375, 196, 123, 112, 109, 120
                  c. 2879, 1463, 738, 375, 195, 123, 110, 112, 119

quad_1000
1, 2, 4, 8, 16, 32, 64, 128, and 256 threads
                  c. 2962, 1505, 756, 382, 195, 101, 59.2, 51.8, 53.9
                  c. 2962, 1505, 756, 382, 195, 102, 59.6, 51.4, 54.5
                  c. 2962, 1508, 757, 382, 194, 101, 58.4, 53.4, 54.3 


quad_100
Mixed mode

-n 2 MPI tasks -d 32 or 64 threads
                  c. 58.3, 51.4
                  c. 58.4, 49.0
                  c. 58.4, 52.2

-n 4 MPI tasks -d 16 or 32 threads
                  c. 59.9, 51.6
                  c. 61.1, 52.7
                  c. 60.0, 51.5

-n 8 MPI tasks and 8 or 16 threads
                  c. 61.3, 50.8
                  c. 61.4, 51.8
                  c. 61.1, 53.4

-n 16 MPI tasks and 4 or 8 threads
                  c. 62.0, 51.1
                  c. 61.8, 50.5
                  c. 61.9, 50.7

-n 32 MPI tasks and 2 or 4 threads
                  c. 65.2, 55.3
                  c. 65.1, 55.6
                  c. 65.7, 55.8

-n 64 MPI tasks and 1 or 2 threads
                  c. 68.7, 54.3
                  c. 69.0, 54.0
                  c. 68.9, 54.3

TWO x quad_100

-n 2 -N 1 and 128 threads
                  c. 31.7 c. 30.9 c. 30.0

-n 4 -N 2 and 64 threads
                  c. 30.4 c. 28.9 c. 28.9

-n 8 -N 4 and 32 threads
                  c. 29.6 c. 29.4 c. 29.5

FOUR x quad_100

-n 16 -N 4 and 32 threads
                  c. 17.6 c. 17.6 c. 17.5

-n 8 -N 2 and 64 threads
                  c. 17.6 c. 17.9 c. 17.3

-n 4 -N 1 and 128 threads
                  c. 18.4 c. 17.6 c 17.9