



Environment variables

export MIC_PREFIX=MIC
export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY=balanced
export MIC_OMP_STACKSIZE=2M      



offload

The sequence of events when a statement marked for offload is encountered:
(this omits details for asynchronous cases)

     1. If there is no IF clause, go to step 3.
     2. On the host CPU, evaluate the IF expression. 
        If it evaluates to true, go to step 3. 
        Otherwise, execute the region on the host CPU and go to step 14.
     3. Attempt to acquire the target. If successful, go to step 4. 
        Otherwise, execute the region on the host CPU and go to step 14.
     4. On the host CPU, compute all ALLOC_IF, FREE_IF, 
        and element-count-expr expressions used in IN and OUT clauses.
     5. On the host CPU, gather all variable values that are inputs to the offload.
     6. Send the input values from the host CPU to the target.
     7. On the target, allocate memory for variable-length OUT variables.
     8. On the target, copy input values into corresponding target variables.
     9. On the target, execute the offloaded region.
    10. On the target, compute all element-count-expr expressions used in OUT clauses.
    11. On the target, gather all variable values that are outputs of the offload.
    12. Send output values back from the target to the host CPU.
    13. On the host CPU, copy values received into corresponding host CPU variables.
    14. Continue processing the program on the host CPU.




!dir$ offload <specifier> [[,] <specifier>...]

   <specifier> :=
      target( MIC [:dev#] )            << use dev# if have more than one MIC
      signal( tag )                    << tag is integer variable unique for this offload
      wait( tag [, tag, ...] )         << wait until all tags are set to finished   
      if( expr )                       << only offload if expr evaluates to .true.
      mandatory                        << execution on the target is mandatory
      optional                         << execution on the target is optional
      <data_specifier>    
      
   if no signal specifier, HOST waits until MIC finished.
   else HOST keeps going while MIC computes; use "wait" to resynchronize.
   
   
   <data_specifier> :=
      (in | out | inout | nocopy) ( identifier [[,] identifier...] [: modifier [[,] modifier...]] )
      
   
      identifier is name of variable, array, bitwise copyable structure (i.e., no pointers)
      modifier := 
         alloc_if( expr ) 
         free_if( expr ) 
         length( expr )       << number of elements to transfer (used in the pointer examples below)
         align( expr )        << value of expression should be a power of two
         into( into-identifier )   
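
a minimal sketch of a complete offload block built from the specifiers above.
the names n, a, total and the threshold 100 are made up for illustration.
without an offload-capable compiler the !dir$ lines are plain comments,
so the loop simply runs on the host.

```fortran
program offload_if_demo
   implicit none
   integer, parameter :: n = 1000      ! hypothetical problem size
   real :: a(n), total
   integer :: i

   a = 1.0
   total = 0.0

   ! offload only if the problem is large enough to be worth the transfer;
   ! a is copied to the target, total is copied in and back out
   !dir$ offload begin target(mic) if(n > 100) in(a) inout(total)
   do i = 1, n
      total = total + a(i)
   end do
   !dir$ end offload

   print *, 'total =', total
end program offload_if_demo
```

if the IF expression is false or the target cannot be acquired, the same
block runs on the host, so the result should not depend on where it ran.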
         
         

code can check if running on HOST or MIC

#ifdef __MIC__
         foo2 = 1 ! Code is running on MIC
#else
         foo2 = 0 ! Code is running on host
#endif
   result = foo()


ATTRIBUTES OFFLOAD

only procedures that include the ATTRIBUTES OFFLOAD:MIC directive are available to be called by offloaded code (i.e., code immediately following an OFFLOAD directive), and only these procedures can be called on the coprocessor.

need attributes offload directive at the place of definition
and also at the places of use in code block following an OFFLOAD directive.



!dir$ attributes offload : mic :: <function/subroutine name or variable name>


!dir$ options /offload_attribute_target=mic
! The target(mic) attribute is set for all following functions or variables
....
!dir$ end options

or as compiler option, set attribute on all functions and global variables in file
   -offload-attribute-target=mic
to save space on MIC, don't use this on files that are only for host

 

   module support
      implicit none
      !dir$ attributes offload: mic :: global
      integer :: global = 0    
      contains
      !dir$ attributes offload: mic :: foo
      integer function foo()
         !dir$ if defined  (__MIC__)
            global = global + 1  ! Code is running on coprocessor
         !dir$ else
            global = global - 1  ! Code is running on host
         !dir$ endif
         foo = global
      end function foo   
   end module support

   program main 
      use support 
      implicit none
      integer :: i 
      !dir$ attributes offload:mic :: foo
      !          << is this really necessary? or get attributes from use?
      !dir$ offload target(mic) inout(global) 
      i = foo() 
      print *, "global:i=",global,i,"(both=1)"
   end program main



About Creating Offload Libraries with xiar

   xiar -qoffload-build rcs libsample.a obj1.o obj2.o

   -qoffload-build tells xiar to also build lib for MIC.
   libsample.a contains the CPU object files obj1.o and obj2.o. 
   libsampleMIC.a contains the coprocessor object files obj1MIC.o and obj2MIC.o

   When linking a static archive that contains offload code, 
   use the linker options -Lpath and -llibname. 
   The compiler driver automatically incorporates the corresponding 
   coprocessor library, lib<libname>MIC.a (here libsampleMIC.a), into the linking phase.

   >>> TEST THIS WITH PREVIOUS EXAMPLE -- make lib from module support.
   
   


remote execution of OpenMP
!DIR$ OMP OFFLOAD TARGET(MIC)
!$omp parallel
...
!$omp end parallel
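
a small self-contained sketch of the same pattern with a parallel do.
the array x and its size are assumptions for illustration; the inout clause
is added so the result comes back to the host.

```fortran
program omp_offload_demo
   implicit none
   integer, parameter :: n = 1000      ! hypothetical array size
   real :: x(n)
   integer :: i

   x = 1.0

   ! the OpenMP construct that follows is run on the coprocessor
   !dir$ omp offload target(mic) inout(x)
   !$omp parallel do
   do i = 1, n
      x(i) = 2.0*x(i)
   end do
   !$omp end parallel do

   print *, 'sum(x) =', sum(x)
end program omp_offload_demo
```

compiled without OpenMP or offload support, both directives are comments
and the loop runs serially on the host.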



SPLIT THE COMPUTATION ACROSS THE CPU AND MIC

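   !dir$ attributes offload:mic :: work 
   subroutine work(knt, ns, ne, a) 
      implicit none
      integer, intent(in) :: knt, ns, ne   ! knt is passed through but not used here
      integer, intent(inout) :: a(*)
      integer :: i
      do i = ns, ne 
         a(i) = a(i) + 1
      end do 
   end subroutine work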

   program main 
   !dir$ attributes offload : mic :: work
   integer, parameter :: N=100 
   integer :: i, knt=1, a(N),NS,NE, sig1=1      ! ??? check if really need to initialize sig1
   do i = 1,N; a(i) = i; end do
   do while (knt .lt. 10) 
      NS=1; NE=N/2 
      !dir$ offload target(mic) signal(sig1) inout(a(1:N/2)) 
      call work(knt,NS,NE,a)
      NS=N/2+1; NE=N 
      call work(knt,NS,NE,a)
      !dir$ offload_wait target(mic) wait(sig1)
      knt=knt+1
   end do
   do i = 1,N 
      print*, i, a(i)
   end do 
   end program



POINTERS AND PERSISTENT DATA


real, dimension(:), pointer :: p
!DIR$ ATTRIBUTES OFFLOAD:mic :: p

on CPU, associate p (e.g., p => target_array or allocate(p(n)))

>>>  For optimal data transfer performance, by default, the target memory address for a transfer through a pointer is made to match the offset within 64 bytes of the CPU data. That is, if the CPU source address is 16 bytes past a 64 byte boundary, the target data address will also be 16 bytes past a 64 byte boundary.  The align modifier overrides this default and aligns the target memory at the requested alignment. To get the benefits of fast data transfer and the necessary alignment on the target, ensure that the CPU data is aligned on the same boundary as the alignment needed on the target.

to get 64 byte alignment for p on the host, use this:
!DEC$ ATTRIBUTES ALIGN : 64 :: p

>>> offload uses the CPU address of pointer p to identify the allocated target data.

allocate target storage, no initial values, retain target storage for future use.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) NOCOPY (p : alloc_if(.true.) free_if(.false.))
      
if p is not aligned on host, but want to make it aligned on MIC, then 
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) &
      NOCOPY (p : alloc_if(.true.) align(64))
this makes the copy slower, but may make execution faster (e.g., vectorization)

copy values p(1:n) and q(1:n) from host to target.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) &
      IN (p, q : length(n) alloc_if(.false.) free_if(.false.))

copy values p(1:n) back to host from target.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) &
      OUT (p : length(n) alloc_if(.false.) free_if(.false.))

asynchronous copy values p(1:n) from host to target.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) SIGNAL(sig) &
      IN (p : length(n) alloc_if(.false.) free_if(.false.))
   ...host computation while copy from host to target...
   ...when host computation done, wait for copy to finish...
   !DIR$ OFFLOAD_WAIT TARGET(mic) WAIT(sig)

asynchronous copy values p(1:n) to host from target.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) SIGNAL(sig) &
      OUT (p : length(n) alloc_if(.false.) free_if(.false.))
   ...host computation while copy from target to host...
   ...when host computation done, wait for copy to finish...
   !DIR$ OFFLOAD_WAIT TARGET(mic) WAIT(sig)

reuse target storage, copy new values (1:n).
   !DIR$ OFFLOAD TARGET(mic) &
      IN (p : length(n) alloc_if(.false.) free_if(.false.))
   call f(p) ! on target, call fcn f with target version of pointer p

reuse target storage, use previous values.
   !DIR$ OFFLOAD TARGET(mic) NOCOPY (p)
   call f(p) ! on target, call fcn f with target version of pointer p

asynchronous copy and computation, reuse target storage, use previous values.
   !DIR$ OFFLOAD TARGET(mic) SIGNAL(sig) NOCOPY (p)
   call f(p) ! on target, call fcn f with target version of pointer p
   ...host computation while target is working...
   ...when host computation done, wait for target to finish...
   !DIR$ OFFLOAD_WAIT TARGET(mic) WAIT(sig)

free target storage.
   !DIR$ OFFLOAD_TRANSFER target(mic:0) NOCOPY (p : alloc_if(.false.) free_if(.true.))


Performing File I/O on the Coprocessor

if on host,
export MIC_PROXY_FS_ROOT=/home/bpaxton/
then proxied host directory is /home/bpaxton/proxyfs/
from the offloaded code you can read or write the special directory ./proxyfs/
offloaded code can read or write to what appears to it to be a local file, 
and that file I/O is automatically proxied over to the host

MIC output to stdout may not be immediately visible on the CPU. 
To ensure output is visible you must execute a FLUSH 6 after each WRITE to unit 6
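
a minimal sketch of the write-then-flush pattern (the message text is made up).
on a host-only build the directives are comments and this prints directly.

```fortran
program flush_demo
   implicit none

   !dir$ offload begin target(mic)
   write(6,*) 'hello from the offload region'
   flush(6)      ! push the buffered line out so it shows up on the host
   !dir$ end offload

end program flush_demo
```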

   program main
   integer :: f, r, d

   !dir$ offload begin target(mic) nocopy(f,r)
   open(newunit=f, FILE='./proxyfs/myfile.txt', IOSTAT=r)   ! newunit assigns f; it was otherwise undefined
   if (r .ne. 0) then
      print *, 'Failed to open myfile.txt for write'
      stop
   end if
   write(f,*) 55
   close(f)
   !dir$ end offload

   !dir$ offload begin target(mic) nocopy(f,r) out(d)
   open(newunit=f, FILE='./proxyfs/myfile.txt', IOSTAT=r, STATUS='OLD')
   if (r .ne. 0) then
      print *, 'Failed to open myfile.txt for read'
      stop
   end if
   read(f, *) d
   close(f)
   !dir$ end offload
   
   if (d .ne. 55) then
      print *, 'File incorrectly read back on coproc'
      stop
   end if

   print *, d

   end program main
        
        
        
        
        
for diagnostics
export OFFLOAD_REPORT=1


in mesa, surround mic related things with ifdef use_mic
including all !dir$  

#ifdef use_mic
#endif
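
a sketch of what the guard looks like around a real offload block.
assumes the file is run through the preprocessor (e.g., a .F90 file or -fpp)
and that use_mic is defined with -Duse_mic; the array a is made up.

```fortran
program guard_demo
   implicit none
   integer :: a(4)

   a = 0

#ifdef use_mic
   !dir$ offload begin target(mic) inout(a)
#endif
   a = a + 1        ! runs on the coprocessor only when use_mic is defined
#ifdef use_mic
   !dir$ end offload
#endif

   print *, a
end program guard_demo
```

without -Duse_mic the directives are removed entirely and the code is plain host Fortran.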



---------------------------------------------------------------------------------



PROBLEM -- seems that when call to PHI returns with free_if(.true.), it destroys ALL allocated data on target, not just the array that was created to hold the argument data.

WORKAROUND -- NEVER use free_if(.true.)

all data transfers use preallocated buffers that are retained for entire run.


allocate target storage, no initial values, retain target storage for future use.
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) NOCOPY(buffer : alloc_if(.true.) free_if(.false.))


to send data from host to target

1st, on host, move the data to the buffer

2nd, on host, copy data from host buffer(1:n) to target buffer(1:n)
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) IN(buffer(1:n) : alloc_if(.false.) free_if(.false.))

3rd, on host, call routine on coprocessor to copy data from buffer to final destination
   do NOT pass vector data to this routine -- unless free_if(.false.)
   
   
to get data from target back to host

1st, on coprocessor, move the data to the buffer

2nd, on host, copy data from target buffer(1:n) to host buffer(1:n)
   !DIR$ OFFLOAD_TRANSFER TARGET(mic) OUT(buffer(1:n) : alloc_if(.false.) free_if(.false.))

3rd, on host, copy data from buffer to final destination
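
a sketch of the whole round trip through a retained buffer, assuming an
allocatable buffer of hypothetical size nmax and a stand-in computation
(doubling the values) on the coprocessor. every clause keeps free_if(.false.).

```fortran
program buffer_demo
   implicit none
   integer, parameter :: nmax = 8       ! hypothetical buffer size
   !dir$ attributes offload : mic :: buffer
   real, allocatable :: buffer(:)
   real :: result(nmax)
   integer :: n, i

   n = 4
   allocate(buffer(nmax))

   ! allocate target storage once and keep it for the entire run
   !dir$ offload_transfer target(mic) nocopy(buffer : alloc_if(.true.) free_if(.false.))

   ! host -> target: fill the host buffer, then send buffer(1:n)
   buffer(1:n) = [(real(i), i = 1, n)]
   !dir$ offload_transfer target(mic) in(buffer(1:n) : alloc_if(.false.) free_if(.false.))

   ! work on the target using the data already there
   !dir$ offload begin target(mic) in(n) nocopy(buffer : alloc_if(.false.) free_if(.false.))
   buffer(1:n) = 2.0*buffer(1:n)
   !dir$ end offload

   ! target -> host: fetch buffer(1:n), then copy to the final destination
   !dir$ offload_transfer target(mic) out(buffer(1:n) : alloc_if(.false.) free_if(.false.))
   result(1:n) = buffer(1:n)

   print *, result(1:n)
end program buffer_demo
```

on a host-only build the directives are comments and the same buffer is used
in place, so the printed values should match the offloaded run.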






