Intelligent HTC for Committor Analysis

Committor analysis is a powerful, but computationally expensive, tool to study reaction mechanisms in complex systems. The committor can also be used to generate initial trajectories for transition path sampling, a less expensive technique to study reaction mechanisms. The main goal of the project was to facilitate an implementation of committor analysis in the software application OpenPathSampling ( http://openpathsampling.org/ ) that is performance portable across a range of HPC hardware and hosting sites. We do this through the use of hardware-enabled MD engines in OpenPathSampling coupled with a custom library extension to the data analytics framework Dask ( https://dask.org/ ) that allows for the execution of MPI-enabled tasks in a steerable High Throughput Computing workflow. The software developed here is being used to generate initial trajectories to study a conformational change in the main protease of the SARS-CoV-2 virus, which causes COVID-19. This conformational change may regulate the accessibility of the active site of the main protease, and a better understanding of its mechanism could aid drug design.


Introduction
The main goal of this project is to facilitate an implementation of committor analysis in the software application OpenPathSampling (OPS) [1,2] that is performance portable across a range of HPC hardware and hosting sites.
Committor analysis is essentially an ensemble calculation that maps straightforwardly to a High Throughput Computing (HTC) workflow, where typical individual tasks have moderate scalability and indefinite duration. Since this workflow requires dynamic and resilient scalability within the HTC framework, OPS is coupled to a custom HTC library [3] that leverages the Dask [4,5] data analytics framework and implements support for the management of MPI-aware tasks.
The committor can also be used to generate initial trajectories for transition path sampling, a less expensive technique to study reaction mechanisms. The software developed here is being used to generate initial trajectories to study a conformational change in the main protease of the SARS-CoV-2 virus, which causes COVID-19. This conformational change may regulate the accessibility of the active site of the main protease, and a better understanding of its mechanism could aid drug design.
The project targets porting the custom HTC library to a wider variety of HPC platforms (specifically, additional resource managers and MPI runtimes) and stress testing the framework for very large task counts. Both OPS and jobqueue features are Python-based, and we have developed more sophisticated support for Python-based MPI-enabled tasks, including direct access to the memory space of executed tasks (which allows us to avoid using the filesystem for information transfer between tasks).
We have also investigated the use of the UCX protocol for communication to reduce the overall overhead of the HTC framework.

Preparing OpenPathSampling for the HTC framework
Leveraging the Dask framework involved significant software development within OPS. Two main problems had to be addressed: OPS had no mechanism to gather results reported from parallel workers into a single result set, and tasks based on OPS could not be serialised by Dask.
Both of these problems were solved by building atop an ongoing overhaul of the OPS storage and serialisation subsystem. These have been combined with a new implementation of the committor simulation to make a usable parallel committor simulation for OPS.
The overall approach to interfacing the HTC framework with OPS was to allow the HTC framework to effectively act as a drop-in replacement for objects from the Dask scheduler. This required significant effort to prepare OPS to support Dask in general, but then little additional effort to support the HTC framework. In particular, OPS could not trivially support Dask because:
• Some objects in OPS could not be serialised by Cloudpickle (the default serialisation in Dask).
• OPS caches results of some calculations (functions referred to as "collective variables") in memory and also stores them to disk. Maintaining this behaviour in a parallel scheme required new objects to replace the existing collective variables.
A new storage subsystem was already in development for OPS for reasons of performance, extensibility, and maintainability. We completed the core functionality of that storage subsystem and extended it to address the issues with Dask compatibility. The incompatibility with Cloudpickle was overcome by using the new storage subsystem's serialisation capabilities to serialise the data into a memory-based database, which could then itself be serialised by Cloudpickle. Deserialising the memory-based database then becomes part of the task.
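The memory-database trick can be illustrated with a small, self-contained sketch. All names here (CollectiveVariable, CV_FUNCS, to_memory_db, from_memory_db) are hypothetical stand-ins for the OPS storage machinery, and a simple registry lookup replaces the real serialisation of the function itself; the point is only that an object with a non-picklable attribute is reduced to picklable primitives, which Cloudpickle (or, here, plain pickle) can then handle, and that rebuilding the object becomes part of the task:

```python
import pickle

# Registry of known CV functions; workers import the same module, so a
# string key is enough to rebuild the callable on the other side.
CV_FUNCS = {"phi": lambda snap: snap["phi"]}

class CollectiveVariable:
    """Toy stand-in for an OPS object that plain pickle cannot handle
    (the lambda stored in self.func defeats it)."""
    def __init__(self, name, func_key):
        self.name = name
        self.func_key = func_key
        self.func = CV_FUNCS[func_key]

def to_memory_db(cv):
    """Reduce the object to picklable primitives (the 'memory database')."""
    return {"name": cv.name, "func_key": cv.func_key}

def from_memory_db(db):
    """Deserialising the memory database is part of the task on the worker."""
    return CollectiveVariable(db["name"], db["func_key"])

# The memory database, unlike the object itself, serialises cleanly ...
payload = pickle.dumps(to_memory_db(CollectiveVariable("phi", "phi")))
# ... and the worker-side task rebuilds a working object from it.
cv = from_memory_db(pickle.loads(payload))
print(cv.func({"phi": 1.5}))  # 1.5
```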
Developing new collective variables was a planned part of the new storage subsystem, and we added functionality to support gathering results into a canonical copy of the collective variable as part of the process of storing results from a task to disk.
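The gathering behaviour can be sketched minimally as follows. The names (CachedCV, gather) are hypothetical and stand in for the new OPS collective variable objects; the sketch shows only the pattern of per-worker caches being merged into a canonical copy when results are stored:

```python
class CachedCV:
    """Toy collective variable with an in-memory result cache."""
    def __init__(self, func):
        self.func = func
        self.cache = {}  # snapshot id -> computed value

    def __call__(self, sid, snapshot):
        # Compute only on a cache miss, as the OPS CVs do.
        if sid not in self.cache:
            self.cache[sid] = self.func(snapshot)
        return self.cache[sid]

def gather(canonical, worker_caches):
    """Merge per-worker partial caches into the canonical copy, as would
    happen when task results are written to storage."""
    for partial in worker_caches:
        canonical.cache.update(partial)
    return canonical

canonical = CachedCV(len)
gather(canonical, [{0: 3, 1: 5}, {2: 4}])   # results from two workers
print(sorted(canonical.cache.items()))      # [(0, 3), (1, 5), (2, 4)]
```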
In addition, we added support for the OPS GROMACS engine, which we use as the scalable, hardware-portable MD engine that drives the OPS tasks.

Development and optimization of the jobqueue features HTC library
Building upon [3], we expanded the capabilities of jobqueue features [6] to include the following:
• Support for MPI-enabled tasks (i.e., tasks which can leverage MPI4PY for parallelisation). The framework has the ability to access the memory space of the root MPI process of each task.
• Support for the Portable Batch System (PBS) resource scheduler.
• Continuous integration support for SLURM and PBS. This means that the library features are fully tested (automatically) on both of these schedulers through the use of a set of custom Docker containers.
• Support for the OpenMPI, Intel MPI and MPICH MPI runtimes (including runtime process distribution).
• A Docker-based tutorial infrastructure which includes a SLURM scheduler. The infrastructure includes two computation nodes and one head node. On the head node a JupyterLab instance is installed, which allows the user to interact with the tutorial infrastructure (and explore our package) directly through their browser. We intend to create some tutorial Jupyter notebooks to facilitate use of this infrastructure.
• Support for Dask and Dask.distributed 2.x (tested up to version 2.21, released in July 2020).
• Configuration of the HTC library for the PRACE resources JUWELS and Irene. Irene uses both a custom scheduler (based on SLURM) and a custom MPI runtime; however, we were able to override the more generic classes of jobqueue features to support these (see Appendix B for details).
MPI-enabled task support is now tested in a continuous integration setup, and this feature was the main test case used for our scalability studies. For each task, the framework reuses the MPI environment created when the worker was initialised on the node. We have tested the framework with PyLAMMPS to show that each task does indeed have access to, and uses, all available resources (this information is something that LAMMPS reports in its output). A minimal example of what the implementation of this looks like in code is provided in Listing 1.

        ...
        return "Potential energy: %s" % L.eval("pe")

    @on_cluster(cluster=lammps_cluster)
    def my_lammps_job(input_file="in.melt", run_steps=100):
        t1 = lammps_task(input_file, run_steps=run_steps)
        print(t1.result())  # blocking call (t1 is a lazy future object)

Listing 1: Example of decorator usage to parallelise computation

The main actions for optimisation/improvement were:
• Porting of the HTC framework to Dask 2.x.
• Porting of the framework to Irene and JUWELS.
• Implementing support for the PBS resource manager and additional MPI runtimes.
• Continuous integration development to cover all library features.
• Investigation of UCX instead of IPoIB as a communication protocol.
There was no particular bottleneck to these developments; much of the time was spent resolving diverse issues related to the rapid development of the underlying Dask infrastructure. The HPC network infrastructure is one key potential weakness, since we currently use IPoIB and require a network connection between the scheduler and the workers. Since the interface used by the scheduler and those of the workers are now independently configurable, we have not (as yet) encountered a situation where issues related to this were not surmountable.
Another issue is that the Dask scheduler is intended to be a (relatively) long-running task (with low/moderate resource usage). This is because when the Dask scheduler submits jobs to the resource manager, it cannot know when the jobs will execute nor how long the entire workflow will take. The scheduler would therefore normally run on a login node. When using JupyterLab on the Juelich Supercomputing Centre (JSC) systems (here JUWELS in particular), a Dask scheduler has the potential to run for up to 3 days. On Irene, we also did not encounter limitations to running the scheduler on the login nodes, but we are aware this is likely to be the case on some sites. The scheduler can always be run within a resource manager job context.
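As a sketch, running the scheduler within a resource manager job context might look like the following SLURM batch script. The script name, partition, and resource values are placeholders for illustration, not part of jobqueue features; the driver process (which hosts the Dask scheduler and submits worker jobs) simply runs inside its own long, low-resource allocation instead of on a login node:

```shell
#!/bin/bash
#SBATCH --job-name=dask-scheduler
#SBATCH --partition=batch        # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00          # long wall time, low resource usage

# The driver hosts the Dask scheduler; worker jobs are submitted to the
# resource manager from within this allocation by the HTC framework.
python my_htc_workflow.py        # hypothetical workflow driver script
```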

Scalability of jobqueue features
In an HTC use case, strong and weak scalability are open to definition. To give an initial definition, one could say that strong scaling would correspond to the ability of the individual tasks to scale to larger resource usage. Such a measure is entirely dependent on the scalability of the applications used within the tasks themselves rather than on the HTC framework. In the case of OPS, scalability comes from an underlying community application engine (GROMACS for the use case described here, but this is configurable). We have tested the framework with both GROMACS (as used within OPS) and LAMMPS (in particular PyLAMMPS). Neither of these applications is directly under the remit of this project, and both are known to scale well on HPC resources.
Another argument that could be made is that strong scaling could be defined as the number of simultaneous workers that the framework can utilise. Since each worker is an individual job, the limitation here does not come from the framework; it is the number of simultaneous jobs that can be allocated by the resource manager for a single user (usually of O(100), but this varies from site to site). Since job submissions are ultimately handled by the library dependency dask jobqueue [7], we must wait for support of job arrays [8] to be able to reliably perform any such analysis (and this was not in the remit of this project).
Our interest in this project lies in the scalability of the HTC framework itself with respect to task counts. We would argue that this is somewhat equivalent to a weak scaling analysis when considering an HTC workload: a higher task count is equivalent to a larger workload and triggers the dynamic provisioning of additional workers by the HTC framework. jobqueue features had not previously been stress-tested for the number of tasks it was capable of supporting. Previous efforts had looked at up to 2000 tasks; in this case we scaled out to 1M tasks on all available architectures, with each individual task using 2 nodes' worth of resources. Each set of 2 nodes forms a worker, and the workers are reused by queued tasks. The number of workers that the framework can use is entirely dependent on the maximum number of simultaneous jobs allowed by a specific site. The package can simultaneously use workers of different resource types (CPU, KNL, GPU, ...). As we can see from Figure 1 (and Table 1), we were able to easily scale our task workload out to 1M tasks. The overhead of the framework is negligible at 1 ms per task for the Skylake (SKL) architecture and 10 ms per task for the Knights Landing (KNL) architecture (and is completely independent of the resource usage of the workers themselves). Such overheads are negligible even for short task durations. To put this in context, the node configuration time on JUWELS is about 10 s; on Irene it is about 6 s for SKL and 29 s for KNL (see Table 2). CPU time savings for short task durations are, therefore, potentially substantial when using the framework.
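The potential saving can be illustrated with back-of-envelope arithmetic using the numbers quoted above, under the (illustrative) assumption that, without worker reuse, every task would pay the node configuration cost of a fresh job:

```python
# Numbers taken from the text: JUWELS node configuration ~10 s per job,
# framework overhead ~1 ms per task on SKL.
node_config_s = 10.0      # cost paid once per worker job
overhead_s = 0.001        # framework overhead paid once per task
n_tasks = 1_000_000       # the largest workload tested

# One fresh job per task would pay node configuration every time;
# reusing configured workers pays only the per-task framework overhead.
fresh_jobs = n_tasks * node_config_s
reused = n_tasks * overhead_s
print(f"setup time saved by worker reuse: {fresh_jobs - reused:.0f} s")
```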

We note that maintaining a functional software environment is non-trivial for the use cases we address here; many complications are possible due to the software requirements of the framework and of the tasks themselves. This is in addition to the complexity arising from the potential to simulate simultaneously on a variety of hardware (such as we have done for the CPU, GPU and KNL partitions of JURECA).

UCX as a communication protocol
To test the efficiency of the UCX protocol in comparison to TCP (using IPoIB in our case), we used several test cases with different kinds of computation, from basic hand-written functions to those based on functions provided by Dask. The most significant differences between Dask-related operations and hand-written tests can be seen in two specific tests: an increment mapped onto a Dask bag, and a regular tree reduction. This can be seen in Tables 7-10. As we can see, when working only on a Dask bag, UCX gives slightly better results than TCP. However, when also working on multiple tasks, TCP lost less time on communication. In Tables 11-14, we can see that for the case of a fully hand-written tree reduction, we have better performance for TCP for both a single task and multiple tasks.
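For reference, a hand-written regular tree reduction of the kind used in these tests might look like the following minimal sketch (an illustrative reimplementation, not the actual benchmark code): neighbouring elements are combined pairwise, level by level, so the reduction completes in log2(n) rounds rather than a sequential fold.

```python
def tree_reduce(func, items):
    """Regular (pairwise) tree reduction of `items` with binary `func`."""
    level = list(items)
    while len(level) > 1:
        # Combine adjacent pairs; each round halves the number of elements.
        nxt = [func(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # carry an odd leftover up a level
        level = nxt
    return level[0]

print(tree_reduce(lambda a, b: a + b, range(10)))  # 45
```

In a distributed setting each `func` call at a given level can run on a different worker, which is why the communication pattern (and hence the choice of TCP versus UCX) matters for this test.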

On most of these tests, UCX performance is not very different from TCP performance (where we are utilising IPoIB). In general, UCX gives slightly better results for tasks based on Dask functions, and TCP performs better for hand-written tasks (most likely because Dask can perform optimisations when it has control over the data objects). Critically, UCX has been seen to lead to some errors related to the fact that it is not yet a fully supported technology within Dask. In particular, during tests there were problems when restarting the workers while UCX was in use, which means that we potentially lose resiliency in this scenario. Given the very limited potential for performance improvement and the significant impact of sacrificing resiliency, we would not recommend the use of UCX with jobqueue features (or even Dask) at this time.

Conclusions
Our HTC library jobqueue features is resilient and scales extremely well. OPS was expanded to be able to use the Dask framework, and integration with the jobqueue features library now requires just 3 lines of code (see Appendix A for details). We have shown that supporting other PRACE machines is a relatively straightforward process. This means that OPS can now almost trivially transition from use on a personal laptop to some of the largest HPC sites in Europe.
UCX support was found to be somewhat unstable (currently) and beneficial only in specific scenarios that make heavy use of Dask objects. Several times on closing a cluster we received errors about UCX losing the connection, although results were correctly returned. Problems appear when a cluster attempts to restart: the connection cannot then be re-established. For this reason we say that resiliency is currently compromised when using UCX, and as such we would not recommend its use until this issue is resolved.
The parallelised committor simulation capability that the integration between OPS and the HTC library provides makes it possible to address committor analysis in a scalable way far more easily. Ongoing work will use the tools developed here for a committor simulation of the SARS-CoV-2 main protease. Initial analysis of the stable states is based on a long trajectory provided by D. E. Shaw Research [9]. That trajectory indicates that a loop region of the protein may act as a gate to the active site. Initial frames will be taken from that trajectory to be used as configurations in the committor simulation.
For a given molecular system, the committor simulation can be scaled up in two ways: by taking more initial configurations, which ensures a broader exploration of configuration space, and by running more trajectories per configuration, which leads to a more accurate calculation of the committor value. The results of this simulation will provide insight into the dynamics of the loop region and the mechanism of its gate-like activity. Furthermore, trajectories generated by the committor simulation can be used as initial conditions for further studies using methods such as transition path sampling.
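These two scaling axes can be sketched in a toy calculation in which a seeded random number generator stands in for the MD engine that OPS would drive (all names here are hypothetical; a real committor run launches short trajectories from each configuration and counts which stable state each one reaches):

```python
import random

def committor_estimate(config_seed, n_trajectories):
    """Toy committor estimate for one initial configuration: launch
    n 'trajectories' and return the fraction that commit to state B.
    A real simulation would run an MD engine (e.g. GROMACS via OPS)."""
    rng = random.Random(config_seed)  # stands in for trajectory dynamics
    reached_B = sum(rng.random() < 0.3 for _ in range(n_trajectories))
    return reached_B / n_trajectories

# Scaling axis 1: more initial configurations (broader exploration).
# Scaling axis 2: more trajectories per configuration (better accuracy).
estimates = [committor_estimate(seed, 1000) for seed in range(5)]
print(estimates)  # one committor value per initial configuration
```

Each (configuration, trajectory) pair is an independent task, which is exactly why the calculation maps so naturally onto the HTC framework described above.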

B Integration with the PRACE Irene Joliot-Curie system
The PRACE Irene Joliot-Curie system uses a fully customised workload manager as a wrapper around several workload managers, such as SLURM or Moab. The jobqueue features library is configured for standard SLURM and PBS workload managers, and so some integration steps were required. The syntax of the custom workload manager wrapper is far from common but, for the most part, corresponds to an underlying SLURM syntax. This general correspondence allowed for a relatively straightforward customisation of the jobqueue features library to allow for interaction with the Irene system.
The required steps to achieve this integration were the identification of the specialised commands and syntax of Irene, and the mapping of those onto the classes already available within jobqueue features. The customisations of the jobqueue features classes are shown in Listing 3.

Figure 1 :
Total overhead of the framework per task in seconds for various numbers of tasks and on various architectures.

Table 2 :
Node configuration times on the different systems.

Table 1 :
Overhead per task (seconds, excluding node configuration) on different clusters. A maximum of 5 workers were available for each task count.

Table 3 :
Data set creation time per task for different protocols in relation to data size

Table 4 :
Data set creation computation time per set of tasks for different protocols in relation to data size

Table 5 :
Data set creation time per task for different protocols and numbers of tasks

Table 6 :
Data set creation computation time per set of tasks for different protocols and numbers of tasks

Crucially, each test needs data structures created for it to use. That is why we test the impact of that creation for both the TCP and UCX protocols, shown in Tables 3-6. As we can see, the times of single operations do not show much difference, which was expected. For the whole set of tasks, where most of the communication takes place, TCP shows better results.

Table 7 :
Bag increment computation time per task for different protocols in relation to data size

Table 8 :
Bag increment computation time per set of tasks for different protocols in relation to data size

Table 10 :
Bag increment computation time per set of tasks for different protocols and number of tasks

Table 14 :
Tree reduction computation time per set of tasks for different protocols and numbers of tasks