This is the benchmark used in the submission of "Performance Portability"
manuscript.

Currently Loaded Modulefiles:
  1) modules/3.2.10.2
  2) eswrap/1.3.1-1.020200.1272.0
  3) switch/1.0-1.0502.57058.1.58.ari
  4) craype-network-aries
  5) craype/2.4.2
  6) pbs/12.2.401.141761
  7) craype-ivybridge
  8) cray-mpich/7.2.6
  9) packages-archer
 10) bolt/0.6
 11) nano/2.2.6
 12) leave_time/1.0.0
 13) quickstart/1.0
 14) ack/2.14
 15) epcc-tools/6.0
 16) intel/15.0.2.164
 17) cray-libsci/13.2.0
 18) udreg/2.3.2-1.0502.9889.2.20.ari
 19) ugni/6.0-1.0502.10245.9.9.ari
 20) pmi/5.0.7-1.0000.10678.155.25.ari
 21) dmapp/7.0.1-1.0502.10246.8.47.ari
 22) gni-headers/4.0-1.0502.10317.9.2.ari
 23) xpmem/0.1-2.0502.57015.1.15.ari
 24) dvs/2.5_0.9.0-1.0502.1958.2.55.ari
 25) alps/5.2.3-2.0502.9295.14.14.ari
 26) rca/1.0.0-2.0502.57212.2.56.ari
 27) atp/1.8.3
 28) PrgEnv-intel/5.2.56


Trunk version following Alan's optimsation at SVN 2911.

Using

MPICC=cc -openmp
CFLAGS=-O2 -DNDEBUG -DVVL=4 -DKEEPFIELDONTARGET -DKEEPHYDROONTARGET

which is standard data layout (unlike what it says in the MS!). Running
with

export KMP_AFFINITY=disabled
export OMP_NUM_THREADS=12

aprun -n 1 hostname
aprun -n 1 -d 12 -cc numa_node ./Ludwig.exe

we get indicative times of


      Time step loop:      0.267      0.338    269.581   0.269581 (1000 calls)
         Propagation:      0.035      0.040     35.612   0.035612 (1000 calls)
    Propagtn (krnl) :      0.035      0.040     35.611   0.035611 (1000 calls)
           Collision:      0.026      0.034     26.524   0.026524 (1000 calls)
   Collision (krnl) :      0.026      0.034     26.133   0.026133 (1000 calls)



Refactored version kevin-1397 at same SVN gives, with

MPICC=cc -openmp
CFLAGS=-O2 -DNDEBUG -DVVL=4

and same run-time options indicative times of

      Time step loop:      0.258      0.335    261.362   0.261362 (1000 calls)
         Propagation:      0.037      0.043     37.626   0.037626 (1000 calls)
    Propagtn (krnl) :      0.037      0.043     37.622   0.037622 (1000 calls)
           Collision:      0.027      0.034     26.943   0.026943 (1000 calls)
   Collision (krnl) :      0.027      0.034     26.937   0.026937 (1000 calls)

This includes a new propagation kernel via the kernel context functions,
and removal of memory copies in favour of aliasing. These changes result
in slightly slower propagation, but overall faster timestep loop.



Examples of standard output should appear in this directory representing
what is basically Alan's trunk version ("best"), and a version "ecse"
representing the same result of the feature branch at Sept 2016 being
the end of the current ECSE project.

Note at SVN 2946 there are a couple of manual alterations required to
patch in vectorised routines in phi_force_stress and beris edwards.
These should become the default in future.
