Architectural Scalability of Neural Network Inference Using Task-based Programming
Authors/Creators
- 1. High Performance Computing Section, IT Department, NTNU
- 2. Edinburgh Parallel Computing Centre (EPCC)
Description
The internal structure of interactions in a hidden network can be inferred, within the framework of the kinetic Ising model, using a maximum likelihood estimate based on a record of the network's external behavior. Beyond its origins in statistical physics, solutions to this problem can model the internal structure of a hidden neural network based on activity recordings from a laboratory setting, or the training process of an artificial neural network in the context of machine learning. The primary obstacle to its practical application is that the amount of computational work required grows rapidly with the dimensions of the represented network, but the vast majority of the operations can be evaluated independently in parallel. In this paper, we investigate the performance characteristics of a proxy application that models this growth, with the purpose of examining its suitability as a candidate application for future exascale platforms. While the application exposes an abundance of parallelizable computation, the practical scalability of a particular implementation depends on the distribution of its underlying data structure in memory, and on the resulting interactions with the memory system of the target architecture. We investigate three different programming strategies that cover different trade-offs in terms of memory access: a process-based implementation that partitions the global workload into parallel parts that are strictly sequenced internally; a combination of thread parallelism and statically scheduled iterations; and a task-based implementation that exposes all the work as potentially parallel work units and schedules their sequencing at run-time. We find that this trade-off leads to implementations that can utilize computing platforms of growing size comparably well, displaying near-linear speedup on our test system, which makes the application a promising candidate for extreme-scale computations. For the present test systems, however, scheduling the computation at run-time comes with an overhead that is not amortized by the gains from additional scheduling flexibility, suggesting that the process-based implementation provides the most favorable scalability on present architectures.
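To make the inference problem concrete, the following is a minimal sketch of maximum likelihood estimation for a kinetic Ising model, not the authors' implementation. It assumes the standard formulation: spins s_i(t) ∈ {−1, +1}, couplings J_ij, local fields H_i(t) = Σ_j J_ij s_j(t), and transition probability P(s_i(t+1) | s(t)) = exp(s_i(t+1) H_i(t)) / (2 cosh H_i(t)). The gradient rows are mutually independent, which is the source of the abundant parallelism discussed above.

```python
import numpy as np

def log_likelihood(J, S):
    """Kinetic Ising log-likelihood of a spin history.

    J : (N, N) coupling matrix, S : (T, N) array of +/-1 spins.
    """
    H = S[:-1] @ J.T  # H[t, i] = sum_j J_ij s_j(t)
    return np.sum(S[1:] * H - np.log(2.0 * np.cosh(H)))

def mle_step(J, S, lr=0.05):
    """One gradient-ascent step on the log-likelihood.

    dL/dJ_ij = sum_t (s_i(t+1) - tanh(H_i(t))) s_j(t).
    Each row i of the gradient depends only on row i of J,
    so the rows can be computed independently in parallel.
    """
    H = S[:-1] @ J.T
    grad = (S[1:] - np.tanh(H)).T @ S[:-1]
    return J + lr * grad / (S.shape[0] - 1)

# Hypothetical usage: infer couplings from a recorded spin history.
rng = np.random.default_rng(0)
T, N = 500, 8
S = rng.choice([-1.0, 1.0], size=(T, N))
J = np.zeros((N, N))
for _ in range(20):
    J = mle_step(J, S)
```

Because the log-likelihood is concave in J, repeated small steps of this form converge toward the maximum likelihood estimate; the per-spin cost of each step grows with N², which illustrates how the work grows rapidly with the dimensions of the represented network.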
Files
- WP281_final_version.pdf (294.0 kB, md5:4dd3998e7e22088361f1162c425eadf6)