Published October 5, 2021 | Version v1
Report Open

Task-Based Performance Portability in HPC

  • 1. Inria
  • 2. BSC
  • 3. University of Vienna

Description

As HPC hardware continues to evolve and diversify and workloads become more dynamic and complex, applications need to be expressed in a way that facilitates high performance across a range of hardware and situations. The main application code should be platform-independent, malleable and asynchronous, with an open, clean, stable and dependable interface between the higher levels of the application, library or programming model and the kernels and software layers tuned for the machine. The platform-independent part should avoid direct references to specific resources and their availability, and instead provide the information needed to optimise behaviour.

This paper explains why task abstraction, which first appeared in the 1990s and is already mainstream in HPC, should be the basis for a composable and dynamic performance-portable interface. It outlines the innovations that are required in the programming model and runtime layers, and highlights the need for application developers to place greater trust in the ability of the underlying software layers to extract full performance. These steps will help realise the vision for performance portability across current and future architectures and problems.

Notes

This work was supported by the Spanish Government (contract PID2019-107255GB), Generalitat de Catalunya (contract 2014-SGR-1051), and the European Union's Horizon 2020 research and innovation programme under grant agreements No 955606 (DEEP-SEA) and No 754337 (EuroEXA). Paul Carpenter holds the Ramón y Cajal fellowship under contract RYC2018-025628-I of the Ministry of Economy and Competitiveness of Spain. This work was supported by the French Government (contract ANR-19-CE46-0009), Région Nouvelle Aquitaine (contract 018-1R50119) and the European Union's Horizon 2020 research and innovation programme under grant agreements No 671602 (INTERTWinE) and No 801015 (EXA2PRO). This work was supported by the Austrian Science Fund grant P29783.

Files

ETP4HPC_WP_Task-based-PP_FINAL.pdf (4.8 MB)
md5:21ef9ac844d9d6479d047cf095cef572

Additional details

Funding

  • INTERTWINE – Programming Model INTERoperability ToWards Exascale (INTERTWinE), grant 671602, European Commission
  • EXA2PRO – Enhancing Programmability and boosting Performance Portability for Exascale Computing Systems, grant 801015, European Commission
  • EuroEXA – Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon, grant 754337, European Commission
  • DEEP-SEA – DEEP – SOFTWARE FOR EXASCALE ARCHITECTURES, grant 955606, European Commission
