Published March 21, 2019 | Version v1
Working paper (Open Access)

Architectural Scalability of Neural Network Inference Using Task-based Programming

  • 1. High Performance Computing section, IT Dept, NTNU
  • 2. Edinburgh Parallel Computing Centre (EPCC)

Description

The internal structure of interactions in a hidden network can be inferred using a maximum likelihood estimate based on
a record of its external behavior, within the framework of the kinetic Ising model. Beyond its origins in statistical physics,
solutions to this problem can model the internal structure of a hidden neural network based on activity recordings from a
laboratory setting, or the training process of an artificial neural network in the context of machine learning. The primary
obstacle to its practical application is that the amount of computational work required grows rapidly with the dimensions
of the represented network, but the vast majority of the operations can be independently evaluated in parallel. In this
paper, we investigate the performance characteristics of a proxy application that models this growth, with the purpose
of examining its suitability as a candidate application for future exascale platforms. While the application implies an
abundant amount of parallelizable computation, the practical scalability of a particular implementation depends on the
distribution of its underlying data structure in memory, and the resulting interactions with the memory system of the
target architecture. We investigate three different programming strategies that cover different trade-offs in terms of
memory access, from a process-based implementation that partitions the global workload into parallel parts that are
strictly sequenced internally, through a combination of thread parallelism and statically scheduled iterations, to a task-
based implementation that exposes all the work in terms of potentially parallel work units and schedules their sequencing
at run-time. We find that this trade-off leads to implementations that can utilize computing platforms of growing size
comparably well, displaying near-linear speedup on our test system, which makes the application a promising candidate
for extreme scale computations. For the present test systems, however, scheduling the computation at run-time comes
with an overhead that is not amortized by the gains from additional scheduling flexibility, suggesting that the
process-based implementation provides the most favorable scalability on present architectures.
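The contrast between static partitioning and run-time task scheduling described above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; `work_unit` is a hypothetical stand-in for one independently evaluable operation, such as updating one entry of the coupling estimate.

```python
from concurrent.futures import ThreadPoolExecutor


def work_unit(i):
    # Hypothetical stand-in for one independently evaluable operation.
    return i * i


def static_partition(n, n_workers):
    # Static strategy: the global workload is split into fixed parts up
    # front, and each part is processed in a strictly sequenced order.
    chunks = [range(k, n, n_workers) for k in range(n_workers)]
    results = [0] * n
    for chunk in chunks:
        for i in chunk:
            results[i] = work_unit(i)
    return results


def task_based(n, n_workers):
    # Task-based strategy: every work unit is exposed as a potentially
    # parallel task, and the runtime decides the sequencing.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work_unit, range(n)))
```

Both strategies produce the same results; the difference lies in who decides the sequencing (the programmer, ahead of time, versus the runtime scheduler), which is the source of the scheduling overhead discussed above.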

Files

WP281_final_version.pdf (294.0 kB)
md5:4dd3998e7e22088361f1162c425eadf6

Additional details

Funding

European Commission
PRACE-5IP (PRACE 5th Implementation Phase Project), grant 730913