Report Open Access
Shaffer, Tim; Blomer, Jakob; Ganis, Gerardo
High-performance computing (HPC) contributes a significant and growing share of resources to high-energy physics (HEP). Individual supercomputers such as Edison or Titan in the U.S. or SuperMUC in Europe deliver a raw performance of the same order of magnitude as the Worldwide LHC Computing Grid. As we have seen with codes from ALICE and ATLAS, it is notoriously difficult to deploy high-energy physics applications on supercomputers, even though they often run a standard Linux on Intel x86_64 CPUs.
The three main problems are:
1. Limited or no Internet access;
2. The lack of privileged local system rights;
3. The concept of cluster submission or whole-node submission of jobs in contrast to single CPU slot submission in HEP.
Generally, applications are delivered to hardware resources in high-energy physics by CernVM-FS. CernVM-FS is optimized for high-throughput resources. Nevertheless, some successful results on HPC resources were achieved using the Parrot system, which allows CernVM-FS to be used without special privileges. Building on these results, the project aims to prototype a toolkit for application delivery that seamlessly integrates with HEP experiments' job submission systems, for instance ALICE AliEn or ATLAS PanDA. The task includes a performance study of the Parrot-induced overhead, which will be used to guide performance tuning for both CernVM-FS and Parrot on typical supercomputers. The project should further deliver a lightweight scheduling shim that translates HEP's job slot allocation to a whole-node or cluster-based allocation. Finally, in order to speed up the evaluation of new supercomputers, a set of "canary jobs" should be collected that validate HEP codes on new resources.
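To illustrate what such a scheduling shim might do at its core, here is a minimal sketch in Python. It is not the project's implementation: the names `NodeAllocation` and `pack_jobs` are hypothetical, and the sketch assumes one single-slot HEP job per core and identical nodes, ignoring memory limits, job runtimes, and any interface to AliEn or PanDA.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class NodeAllocation:
    """One whole-node allocation holding several single-slot jobs (hypothetical type)."""
    cores: int
    jobs: List[str] = field(default_factory=list)

    @property
    def free(self) -> int:
        # Assumption: each job occupies exactly one core.
        return self.cores - len(self.jobs)


def pack_jobs(job_ids: Iterable[str], cores_per_node: int) -> List[NodeAllocation]:
    """Group single-slot jobs into whole-node allocations, filling nodes in order."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    nodes: List[NodeAllocation] = []
    for job in job_ids:
        # Open a new node when there is none yet or the last one is full.
        if not nodes or nodes[-1].free == 0:
            nodes.append(NodeAllocation(cores=cores_per_node))
        nodes[-1].jobs.append(job)
    return nodes


if __name__ == "__main__":
    for node in pack_jobs([f"job{i}" for i in range(5)], cores_per_node=2):
        print(node.jobs, "free cores:", node.free)
```

A real shim would also have to decide how long to wait before submitting a partially filled node, which is a policy question the sketch leaves open.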
On high-performance computing (HPC) resources, users have less control over worker nodes than in the grid. Using HPC resources for high-energy physics applications becomes more complicated because individual nodes often lack Internet connectivity or a filesystem that can be used as a local cache. The current solution in CVMFS preloads the cache from a gateway node onto the shared cluster file system. This approach works but does not scale well to large production environments. In this project, we develop an in-memory cache for CVMFS and assess approaches to running jobs without special privileges on the worker nodes. We propose using Parrot and CVMFS with a RAM cache as a viable approach to HEP application delivery on HPC resources.
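The idea of an in-memory cache can be sketched as follows. This is a simplified Python illustration, not the CVMFS cache code: the class name `RamCache`, the fixed byte budget, and the least-recently-used eviction policy are all assumptions made for the example. Objects are keyed by a hash of their content, in the spirit of a content-addressed store.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class RamCache:
    """Content-addressed object cache held entirely in memory (hypothetical sketch)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        # Insertion order doubles as recency order: oldest entry first.
        self._objects: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, data: bytes) -> str:
        """Store an object and return its content hash."""
        key = hashlib.sha1(data).hexdigest()
        if key in self._objects:
            self._objects.move_to_end(key)  # already cached, mark as recent
            return key
        if len(data) > self.capacity:
            raise ValueError("object is larger than the whole cache")
        # Assumed policy: evict least recently used objects until the new one fits.
        while self.used + len(data) > self.capacity:
            _, evicted = self._objects.popitem(last=False)
            self.used -= len(evicted)
        self._objects[key] = data
        self.used += len(data)
        return key

    def get(self, key: str) -> Optional[bytes]:
        """Return the cached object, or None on a cache miss."""
        data = self._objects.get(key)
        if data is not None:
            self._objects.move_to_end(key)
        return data
```

On a miss, a real client would fetch the object from the shared file system or a gateway node and then call `put`; that retrieval path is outside the scope of this sketch.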