Molecular modeling and simulation: model development, thermodynamic properties, scaling behavior and data management

We outline our most recent findings, covering: 1) A comparison of micro- and macroscopic solutions of a two-phase Riemann problem obtained from molecular dynamics simulations and finite volume schemes; 2) A novel equation of state for the bulk viscosity of liquid noble gases based on a multi-mode relaxation ansatz; 3) A detailed analysis of the evaporation process of simple fluids; 4) Diffusion coefficients of quaternary liquid mixtures obtained by the Green-Kubo formalism; 5) An analysis of the solid/fluid phase transition for the face centered cubic (fcc) lattice; 6) The relative permittivity of mixtures of water and acetone; 7) An assessment of the reliability and reproducibility of molecular simulation results; 8) Techniques for data management in simulation workflows, including annotation of simulation outcomes with appropriate metadata standardized by an ontology.

Two-phase Riemann problem
Large-scale molecular dynamics (MD) simulations of a two-phase shock tube scenario were conducted. As in Ref. [30], these simulations were intended to serve as a benchmark for macroscopic solutions obtained from computational fluid dynamics simulations employing finite volume (FV) schemes. Two macroscopic approaches were considered: the homogeneous equilibrium method (HEM) and the sharp interface method. Both are implemented in the discontinuous Galerkin spectral element method (DGSEM) framework FLEXI.
In contrast to the scenario in Ref. [30], which consisted of two supercritical states and is known as the classical Riemann problem, the present scenario consisted of a liquid and a vapor phase connected through a planar interface. The thermodynamic states of the liquid and vapor phases were specified such that they were out of equilibrium, so that phase transition occurred. This scenario is known as the two-phase Riemann problem. While the classical Riemann problem is fully understood, the two-phase Riemann problem has implications for the system of equations that prevent a straightforward solution.
As in Ref. [30], the Lennard-Jones Truncated and Shifted (LJTS) fluid was used for two major reasons: it is computationally cheap in MD simulations and an accurate equation of state (EOS) [27] is available for the macroscopic solution.
The initial configurations of the MD simulations were prepared in two steps, cf. Fig. 1. First, a vapor-liquid equilibrium (VLE) was equilibrated at a temperature of T = 0.9. The liquid phase was then extracted and brought into contact with a vapor phase at a lower temperature T = 0.8 and a lower density in a symmetric setup. Three cases were considered, varying the density of the vapor phase ρ_v, i.e. specifying 50%, 70% and 90% of the saturated density at a temperature of T = 0.8. The system dimensions, specified in particle diameters σ, were defined as follows: The extent of the vapor phase L_v = 1500 was chosen wide enough that the shock wave emanating from the liquid phase, which is a consequence of the global non-equilibrium, could be observed for a sufficiently long time period before it reached the periodic boundary. The specified width of the liquid phase L_l = 200 was a compromise between being sufficiently wide that the opposite vapor-liquid interfaces, which undergo rapid state changes due to evaporation, do not interfere with each other, and being small enough to keep the computational cost at an acceptable level. To check whether the specified width L_l is appropriate, an exemplary simulation was repeated with a doubled width L_l = 400, from which almost identical results were obtained.
The cross-sectional area had an extent of 10^6 σ^2, so that the sampling of the rapidly changing profiles, i.e. temperature, density and hydrodynamic velocity, employing a classical binning scheme with a high spatial and temporal resolution, yielded an excellent statistical quality. Because of the large number of up to N ≈ 3 · 10^8 particles, all simulations were carried out with the massively parallel code ls1 mardyn [52]. The code is continuously being improved and was recently optimized with respect to its node-level performance and parallel efficiency [69].
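To illustrate the binning scheme, the following minimal Python sketch (not taken from ls1 mardyn; the array layouts, reduced units and number of bins are assumptions) averages per-particle data into slabs along the tube axis to obtain density, hydrodynamic velocity and temperature profiles.

```python
import numpy as np

def profiles(z, v, box, n_bins=1500, m=1.0, kB=1.0):
    """Sample density, hydrodynamic velocity and temperature profiles
    by binning particles into slabs normal to the z axis.
    z   : (N,) particle positions along the tube axis
    v   : (N, 3) particle velocities
    box : (Lx, Ly, Lz) dimensions of the simulation volume
    Reduced Lennard-Jones units are assumed (m = kB = 1)."""
    Lx, Ly, Lz = box
    edges = np.linspace(0.0, Lz, n_bins + 1)
    idx = np.clip(np.digitize(z, edges) - 1, 0, n_bins - 1)
    slab_volume = Lx * Ly * Lz / n_bins

    rho = np.zeros(n_bins)       # number density per slab
    u = np.zeros((n_bins, 3))    # hydrodynamic velocity per slab
    T = np.zeros(n_bins)         # kinetic temperature per slab
    for b in range(n_bins):
        sel = idx == b
        n = np.count_nonzero(sel)
        rho[b] = n / slab_volume
        if n == 0:
            continue
        u[b] = v[sel].mean(axis=0)
        # temperature from peculiar velocities, i.e. after subtracting
        # the hydrodynamic velocity of the slab
        dv = v[sel] - u[b]
        T[b] = m * np.sum(dv * dv) / (3.0 * n * kB)
    return rho, u, T
```

In the actual simulations, such instantaneous profiles are additionally averaged over short time windows to further improve the statistics.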

Bulk viscosity of liquid noble gases
Stokes' hypothesis postulates that form-invariant changes of a fluid's local volume, i.e. compression or dilatation, are not associated with the dissipation of linear momentum, which is synonymous with a vanishing bulk viscosity µ_b = 0. Despite theoretical and experimental evidence to the contrary, the hypothesis is still widely applied throughout all branches of fluid mechanics. The bulk viscosity of liquid noble gases was studied here on the basis of a multi-mode relaxation ansatz, and an equation of state is proposed [11]. The relaxation ansatz is based on a large data set that was generated by dedicated atomistic simulations, resulting in an EOS that expresses the bulk viscosity as a two-parametric power function of density, with the parameters being functions of temperature. The atomistic description of the noble gases rests on the Lennard-Jones (LJ) potential.
The Green-Kubo formalism relates the bulk viscosity to time-autocorrelation functions (ACF) that were sampled in the microcanonical (NVE) ensemble, utilizing the fully open source program ms2 [55]. To reduce finite size effects and to gain better statistics, ensembles containing N = 4096 particles were placed in cubic volumes with periodic boundary conditions. To resolve the small-scale pressure fluctuations, a small integrator time step was used and each autocorrelation function was sampled over a substantial time period. Relatively large ensemble sizes in combination with long simulation times are computationally demanding and thus require modern HPC architectures.

The bulk viscosity can be determined microscopically from the ACF of local small-scale, transient pressure fluctuations that are intrinsic to any fluid in equilibrium. These pressure fluctuations have been observed to relax in different modes, each of which decays over time following a Kohlrausch-Williams-Watts (KWW) function. For liquid noble gases, three superimposed relaxation modes were found to be present, leading to the relaxation model

B_R(t) = C_f exp[−(t/δ_f)^β_f] + C_m exp[−(t/δ_m)^β_m] + C_s exp[−(t/δ_s)^β_s].

The first term describes the fast mode, and the subsequent terms the intermediate and slow modes, respectively. The weighting factors are constrained by C_f + C_m + C_s = 1, and the Kohlrausch parameters δ_i, β_i are a measure of the relaxation time scale and of the distortion from the exponential function. The model's eight independent parameters C_f, C_m, δ_f, δ_m, δ_s, β_f, β_m, β_s were determined by fitting the relaxation model B_R to the data sampled by MD. Each mode's average relaxation time τ_i is defined as the integral mean value of its respective relaxation model contribution B_{R,i}. As originally proposed by Maxwell, the bulk viscosity µ_b is proportional to the cumulative averaged relaxation time, with the proportionality constant K_r being the fluid's relaxation modulus.

The sampled ACF partitions into three segments, each dominated by a different mode, cf. Fig. 3. In contrast to the sampled ACF, which is plagued by noise, the time integral of the relaxation model properly converges to a definite value, thus allowing µ_b to be determined unambiguously, cf. Fig. 4. Applying the relaxation ansatz to all sampled state points generates a large dataset from which the EOS emerges as a two-parametric power function, with both parameters showing a conspicuous saturation behavior over temperature. After passing a temperature threshold, the bulk viscosity is found to vary significantly over density, a behavior that resembles the frequency response of a one-pole low-pass filter.
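To make the procedure concrete, the following Python sketch illustrates the two steps described above: sampling the Green-Kubo integrand from pressure fluctuations and fitting the three-mode KWW model. It is a minimal sketch; the normalization, initial guesses and integration range are assumptions, not the exact choices of Ref. [11].

```python
import numpy as np
from scipy.optimize import curve_fit

def pressure_acf(p, dt, n_max):
    """Time-autocorrelation function <dp(0) dp(t)> of the pressure
    fluctuations dp = p - <p> from an equilibrium time series p."""
    dp = p - p.mean()
    acf = np.array([np.mean(dp[:dp.size - k] * dp[k:]) for k in range(n_max)])
    t = np.arange(n_max) * dt
    return t, acf

def kww3(t, Cf, Cm, df, dm, ds, bf, bm, bs):
    """Three-mode Kohlrausch-Williams-Watts model B_R(t);
    the constraint Cf + Cm + Cs = 1 eliminates Cs."""
    Cs = 1.0 - Cf - Cm
    return (Cf * np.exp(-(t / df) ** bf)
            + Cm * np.exp(-(t / dm) ** bm)
            + Cs * np.exp(-(t / ds) ** bs))

def bulk_viscosity(p, dt, V, T, n_max, p0, kB=1.0):
    """Green-Kubo bulk viscosity mu_b = V/(kB T) * integral of the
    pressure-fluctuation ACF; the noisy sampled ACF is replaced by
    the fitted relaxation model, whose time integral converges."""
    t, acf = pressure_acf(p, dt, n_max)
    params, _ = curve_fit(kww3, t, acf / acf[0], p0=p0, maxfev=20000)
    t_dense = np.linspace(0.0, 50.0 * t[-1], 200000)  # model, not noise
    return V / (kB * T) * acf[0] * np.trapz(kww3(t_dense, *params), t_dense)
```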

Diffusion in quaternary liquid mixtures
Most mass transfer processes occurring in nature and in technical applications involve liquid solutions with more than three components. Rate-based methods employed for the modeling, design and control of separation unit operations in chemical engineering, such as distillation, rely on mass and energy transfer models that require reliable diffusion coefficient data for the regarded mixtures [39]. Therefore, there is significant interest in the improvement of experimental methodologies and in the development of reliable methods for the prediction of mutual diffusion coefficients of liquid multicomponent mixtures.
Fick's law for diffusion in a quaternary mixture requires nine different diffusion coefficients that depend on temperature, pressure, composition and the regarded frame of reference. The presence of six cross diffusion coefficients makes interpretation and data processing in experimental work a challenging task, which often leads to large experimental uncertainties [62]. Despite the continuous improvement and development of experimental techniques during the last decades, the availability of diffusion coefficients for mixtures containing four components is still very poor. Thus, the growing need for accurate diffusion data for basic research and engineering applications cannot be satisfied by experimental measurements alone.
Most predictive equations for multicomponent diffusion of liquids rely on extensions of the Darken relation [3,45] and are therefore only truly valid for ideal mixtures. The underlying physical phenomena in non-ideal mixtures are still not well understood and the lack of data impedes the development and verification of new predictive equations. In this context, MD offers an alternative path not only to assess multicomponent diffusion coefficients, but also to gain insight into the underlying microscopic behavior.
Recently, the ability of MD to predict the Fick diffusion coefficient matrix of a quaternary mixture was demonstrated for the liquid mixture water + methanol + ethanol + 2-propanol [25]. However, because of the lack of experimental data, only consistency tests could be performed for the predicted diffusion data. Here, the Fick diffusion coefficient matrix of the mixture cyclohexane + toluene + acetone + methanol, which was recently studied with Raman spectroscopy [54], was successfully predicted solely with molecular simulation techniques.
In the framework of the generalized form of Fick's law, the molar flux of component i in a mixture of four components is written as a linear combination of the concentration gradients ∇c_j [12]

J_i = − ∑_{j=1}^{3} D_ij ∇c_j ,   i = 1, 2, 3,

where D_ii are the main diffusion coefficients that relate the molar flux of component i to its own concentration gradient and D_ij are the cross diffusion coefficients that relate the molar flux of component i to the concentration gradient of component j [9]. The Fick approach involves three independent diffusion fluxes and a 3 × 3 diffusion coefficient matrix, which is generally not symmetric, i.e. D_ij ≠ D_ji. Further, the numerical values of D_ij depend both on the reference frame for the velocity (molar-, mass- or volume-averaged) and on the order of the components.

The main shortcoming of Fick's law is the fact that concentration gradients are not the true thermodynamic driving forces for diffusion, which are rather given by chemical potential gradients. Maxwell-Stefan theory follows this path, assuming that the chemical potential gradients ∇µ_i are balanced by friction forces between the components that are proportional to their mutual velocity [68]. The Maxwell-Stefan diffusion coefficient Ð_ij plays the role of an inverse friction coefficient between components i and j [68] and its matrix is symmetric, so that it has only six independent elements. Maxwell-Stefan diffusion coefficients are associated with chemical potential gradients and thus cannot be measured directly in the laboratory. However, they are accessible with equilibrium MD techniques, i.e. the Green-Kubo formalism [23,47] or the Einstein approach.
This work employs the Green-Kubo formalism based on the net velocity autocorrelation function to obtain the n × n phenomenological coefficients [45]

L_ij = 1/(3N) ∫_0^∞ ⟨ ( ∑_{k=1}^{N_i} v_{i,k}(t) ) · ( ∑_{l=1}^{N_j} v_{j,l}(0) ) ⟩ dt

in a mixture of n components. Here, N is the total number of molecules, N_i is the number of molecules of component i and v_{i,k}(t) denotes the center of mass velocity vector of the k-th molecule of component i at time t.
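A minimal Python sketch of how this expression can be evaluated numerically is given below; the data layout and the truncation of the time integral at n_max steps are assumptions for illustration.

```python
import numpy as np

def onsager_coefficients(J, dt, N, n_max):
    """Phenomenological coefficients L_ij from net velocity
    correlation functions, following the Green-Kubo expression
    quoted above.
    J : (n_comp, n_steps, 3) net velocities J_i(t) = sum_k v_{i,k}(t)
    N : total number of molecules; dt : sampling interval."""
    n_comp, n_steps = J.shape[0], J.shape[1]
    L = np.zeros((n_comp, n_comp))
    for i in range(n_comp):
        for j in range(n_comp):
            # cross-correlation <J_i(t) . J_j(0)>, averaged over time origins
            c = np.array([
                np.mean(np.sum(J[i, k:] * J[j, :n_steps - k], axis=-1))
                for k in range(n_max)])
            L[i, j] = np.trapz(c, dx=dt) / (3.0 * N)
    return L
```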
Starting from the phenomenological coefficients L_ij, the elements of an (n − 1) × (n − 1) matrix ∆ can be defined as given in Ref. [45], where x_i is the molar fraction of component i.
On the other hand, experimental methods yield the Fick diffusion coefficients. Thus, to compare the predictions by molecular simulation with experimental values, a relation between the Fick and Maxwell-Stefan diffusion coefficients is required [68], given by the matrix product

D = ∆ Γ ,

in which all three symbols represent 3 × 3 matrices and the elements of ∆ are defined as described above. Γ is the thermodynamic factor matrix

Γ_ij = δ_ij + x_i (∂ ln γ_i / ∂x_j)_{T,p} ,

where δ_ij is the Kronecker delta and γ_i the activity coefficient of component i. Here, the thermodynamic factor matrix was estimated from information on the microscopic structure given by the radial distribution functions g_ij(r), based on Kirkwood-Buff theory. In the grand canonical (µVT) ensemble, the Kirkwood-Buff integrals G_ij are defined by [41]

G_ij = 4π ∫_0^∞ [g_ij(r) − 1] r^2 dr .

Because the canonical (NVT) ensemble was employed, possible convergence issues [50] were corrected with the method by Krüger et al. [46]. Moreover, corrections of the radial distribution functions are required. Therefore, the Kirkwood-Buff integrals were calculated based on the methodology proposed by Ganguly and van der Vegt [20], which was found to be the most adequate in previous work [16]. Extrapolation to the thermodynamic limit was not necessary because of the rather large ensemble size N = 8000.

Predictive equilibrium MD simulations of the diffusion coefficients and the thermodynamic factor of the quaternary mixture cyclohexane (1) + toluene (2) + acetone (3) + methanol (4) were carried out at 298.15 K and 0.1 MPa for one composition that was studied experimentally [54], i.e. x_1 = x_2 = x_3 = 0.05 mol mol^-1. A cubic simulation volume containing 8000 molecules with a cut-off radius of 24.5 Å was employed for this purpose. The resulting phenomenological coefficients L_ij were averaged over more than 10^5 correlation functions with a length of 20 ps. These coefficients were employed together with the thermodynamic factor matrix to calculate the Fick diffusion coefficient matrix in the molar frame of reference. In order to compare the simulation results with the available experimental data, the Fick diffusion coefficient matrix was transformed into the volume-averaged frame [68].
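The following sketch strings these steps together in Python; it uses the plain µVT-ensemble definition of the Kirkwood-Buff integral and the matrix relation quoted above, while the finite-size corrections of Krüger et al. [46] and the RDF correction of Ganguly and van der Vegt [20] are deliberately omitted for brevity.

```python
import numpy as np

def kirkwood_buff_integral(r, g):
    """Uncorrected Kirkwood-Buff integral
    G_ij = 4 pi * int_0^inf (g_ij(r) - 1) r^2 dr
    from a tabulated radial distribution function g_ij(r).
    In practice, the truncation at the cut-off radius and the
    NVT-ensemble artifacts must be corrected (see text)."""
    return 4.0 * np.pi * np.trapz((g - 1.0) * r**2, r)

def fick_matrix(Delta, Gamma):
    """Fick diffusion coefficient matrix in the molar frame from the
    matrix product D = Delta * Gamma (3 x 3 for a quaternary mixture)."""
    return np.asarray(Delta) @ np.asarray(Gamma)
```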
A comparison between present simulation results and experimental data is given in Fig. 6. The simulation results for all elements of the diffusion matrix agree with the experimental data within the reported uncertainties.

Solid/fluid phase transition and strong scaling of ms2
In a recent study [53], the solid/fluid (S/F) phase transition of the face centered cubic (fcc) lattice was analyzed utilizing ms2 [13,21,55]. In MD simulations of the LJ potential, the solid was heated at constant volume up to its phase transition.
The Z method [53,5,6] was applied to determine the limit of superheating (LS) and the melting point (MP). For this purpose, the total energy u (potential + kinetic energy) and the temperature T were evaluated, cf. Fig. 7. First, the fcc lattice remains in the metastable solid state over a specific total energy range, which is limited by u_LS, where the solid melts. Beyond this range, i.e. when u reaches values slightly above u_LS, the temperature drops to the melting temperature T_m, since kinetic energy supplies the internal energy of fusion. At the S/F transition, the total energy satisfies u_solid(v, T_LS) = u_fluid(v, T_m), with the maximum temperature T_LS and the minimum temperature T_m when constant volume is maintained [53]. The melting process strongly depends on the system's structure and dynamics, particularly when a perfect fcc lattice without defects is considered. Thus, finite size effects were hypothesized, which was supported by a recent study [53]. Fig. 7 clearly shows a substantial system size dependence of T_LS and T_m. Moreover, this behavior demonstrates how well molecular simulations are able to tackle physical phenomena from a theoretical point of view where experiments are challenging or even impossible because of extreme conditions.
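A minimal sketch of the Z-method bookkeeping described above is given below; it assumes that a series of (u, T) state points along the isochore is already available, one per NVE heating run, and simply reads off the characteristic temperatures from the Z-shaped curve.

```python
import numpy as np

def z_method_points(u, T):
    """Estimate the limit of superheating (u_LS, T_LS) and the melting
    temperature T_m from (u, T) state points sampled along an isochore.
    Following the Z-method phenomenology described above: T_LS is the
    highest temperature reached on the metastable solid branch, and for
    total energies slightly above u_LS the temperature drops to T_m."""
    u, T = np.asarray(u), np.asarray(T)
    order = np.argsort(u)
    u, T = u[order], T[order]
    k = int(np.argmax(T))            # end of the metastable solid branch
    u_LS, T_LS = u[k], T[k]
    T_m = float(T[k + 1:].min())     # coldest molten state beyond u_LS
    return u_LS, T_LS, T_m
```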
In this context, the strong scaling efficiency of ms2 was analyzed for its hybrid MPI + OpenMP parallelization. By combining MPI and OpenMP, the memory demand of ms2 was optimized such that simulations with a larger particle number N can be carried out. In Fig. 8, the vertical axis shows the computing power (nodes) times the computing time per computing intensity (problem size); thus, horizontal lines would indicate a strong scaling efficiency of 100%. Fig. 8 shows that ms2 is close to optimal strong scaling. The computing intensity of traversing the particle matrix is proportional to N^2, but intermolecular interactions are calculated only for particles within the cutoff sphere; as a result, the scaling of ms2 should lie between N and N^2. Fig. 8 indicates that almost doubling the number of particles (120000/64000 = 1.875) leads to an increase of the computational cost by a factor of around 1.91, if the number of nodes is chosen appropriately so that the overhead is small (comparison of red and blue symbols for 800 nodes).

Relative permittivity of mixtures
The relative permittivity ε of a fluid, also known as the dielectric constant, indicates how strongly that fluid weakens an external electric field compared to vacuum. While experimental data on the relative permittivity are available for many pure fluids (at least under ambient conditions), measurements of the relative permittivity of mixtures have rarely been reported. However, such information is important for chemical engineering, e.g. for electrolyte solutions with mixed solvents [61] or solutions of weak electrolytes [43]. On the molecular scale, the relative permittivity is directly related to the mutual orientation of the molecular dipoles via Kirkwood's theory [40]. Thus, the relative permittivity can be sampled straightforwardly with molecular simulations in the canonical (NVT) ensemble via the fluctuation of the total dipole moment,

ε = 1 + (⟨M^2⟩ − ⟨M⟩^2) / (3 ε_0 V k_B T) ,

where k_B is Boltzmann's constant, T the temperature, V the volume, and M the total dipole moment of the simulation volume, which is obtained by summing up all molecular dipole moment vectors, M = ∑_i µ_i. Hence, molecular simulations are an ideal tool to study the relative permittivity of mixtures. It has recently been shown that with existing molecular models for mixtures of molecular fluids [42] and electrolyte solutions [56], at least qualitative agreement with experimental data can be obtained. To further demonstrate the power of this predictive approach also in a quantitative manner, MD simulations of the relative permittivity of the mixture acetone + water were carried out with the molecular simulation tool ms2 [55], using the TIP4P/ε water model [19] and the acetone model by Windmann et al. [71]. These models are known to reproduce the pure component permittivities excellently. The Lorentz-Berthelot combining rules were applied, so that the simulation results for the mixture are strictly predictive. First, the mixture density was obtained from isothermal-isobaric (NpT) runs and then the relative permittivity was sampled with NVT simulations. The results in Fig. 9 demonstrate that a good prediction of the mixture permittivity was obtained.

Fig. 9 Relative permittivity of mixtures of water and acetone as a function of the acetone mass fraction at two different temperatures and 1 bar. Crosses show the experimental data by Åkerlöf [1], open circles denote present simulation results. Simulation uncertainties are within symbol size, dotted lines are guides to the eye.
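As a sketch, the fluctuation formula above can be evaluated from a time series of the total dipole moment as follows; SI-style units with the vacuum permittivity ε_0 and conducting (tin-foil) Ewald boundary conditions are assumed here.

```python
import numpy as np

def relative_permittivity(M, V, T, eps0=1.0, kB=1.0):
    """Relative permittivity from total dipole moment fluctuations in
    the NVT ensemble, eps = 1 + (<M^2> - <M>^2) / (3 eps0 V kB T);
    this form assumes conducting (tin-foil) Ewald boundary conditions.
    M : (n_samples, 3) total dipole moment of the simulation volume,
        i.e. the sum over all molecular dipole moment vectors."""
    M = np.asarray(M)
    fluct = np.mean(np.sum(M * M, axis=1)) - np.sum(M.mean(axis=0) ** 2)
    return 1.0 + fluct / (3.0 * eps0 * V * kB * T)
```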

Reliability and reproducibility of simulation data
Molecular simulations have become a well established alternative to laboratory experiments for predicting thermophysical properties of fluids [66,64,63]. Evidently, the reliability and reproducibility of such predictions is of fundamental importance.
To sample thermophysical properties of a given molecular model, computer experiments can be carried out. In general, the simulation result for a given observable x_sim will not agree with the true model value x_mod [26]. As in laboratory experiments, errors can occur in computer experiments [57,51] that cause deviations between the true value x_mod and the value x_sim observed in the simulation [26,57]. Both stochastic and systematic errors may in general occur in computer experiments. While techniques to assess statistical errors are well established for computer simulations [2,18,17], it is more difficult to deal with systematic errors, which have a significant influence on the reliability of the results. Systematic errors may be a consequence of erroneous algorithms, user errors, differences between simulation methods, finite size effects, erroneous evaluation of long-range interactions, insufficient equilibration or production periods, compilers, parallelization, hardware architecture etc. [57]. As in laboratory experiments, round robin studies, in which the same simulation task is carried out by different groups with different programs, can be conducted to quantify systematic errors.
The detection and assessment of outliers in large datasets is a standard task in the field of data science [32], but has to the best of our knowledge not yet been applied to thermophysical property data obtained by molecular simulation. The assessment of experimental thermophysical property data is a well-established field in chemical engineering [14], especially for phase equilibrium data [37,38].
The accuracy with which properties of a simple (Lennard-Jones) model fluid can currently be determined by molecular simulation was assessed. The Lennard-Jones potential is often used as a starting point for the development of force fields for complex molecules [65,48]. It is also often taken as a benchmark for the validation of simulation codes and the testing of new simulation techniques. Accordingly, a large number of computer experiment data are available for this fluid. Molecular simulations were performed both for homogeneous state points and for the vapor-liquid equilibrium to complement the data in regions that were only sparsely investigated in the literature. This database (cf. Fig. 10) allows for a systematic data evaluation and determination of outliers. In total, about 35,000 data points were evaluated [67]. The investigated VLE properties were the vapor pressure, the saturated densities, the enthalpy of vaporization and the surface tension; for homogeneous state points, the investigated properties were the pressure, thermal expansion coefficient, isothermal compressibility, thermal pressure coefficient, internal energy, isochoric heat capacity, isobaric heat capacity, Grüneisen parameter, Joule-Thomson coefficient, speed of sound, Helmholtz energy and chemical potential.
Different consistency tests were applied to assess the accuracy and precision, and thereby the reliability, of the data. The data on homogeneous states were evaluated point-wise using data from their respective vicinity and an EOS. Approximately 10% of all homogeneous bulk data were identified as gross outliers. VLE data were assessed by tests based on the compressibility factor, the Clausius-Clapeyron equation and an outlier test. First, consistency tests were used to identify unreliable datasets. In a subsequent step, the mutual agreement of the remaining datasets was evaluated. Seven particularly reliable VLE datasets were identified. The mutual agreement of these datasets, excluding the extended critical region, is approximately ±1% for the vapor pressure, ±0.2% for the saturated liquid density, ±1% for the saturated vapor density and ±0.75% for the enthalpy of vaporization. In most cases, the results from different datasets were found to differ by more than the combined statistical uncertainty of the individual data. Hence, the magnitude of systematic errors often exceeds that of stochastic errors.
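As a simple illustration of such a point-wise consistency test, the following sketch flags gross outliers against a reference EOS; the threshold k = 3 and the use of the reported uncertainty as the deviation scale are hypothetical choices, not the exact test of Ref. [67].

```python
import numpy as np

def flag_outliers(values, eos_values, uncertainties, k=3.0):
    """Point-wise consistency test: flag a simulated data point as a
    gross outlier if it deviates from the EOS reference value by more
    than k times its reported uncertainty (k = 3 is an assumption)."""
    dev = np.abs(np.asarray(values) - np.asarray(eos_values))
    return dev > k * np.asarray(uncertainties)
```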

Data management
While it is always advisable to follow good practices of data management when dealing with research data, this becomes even more expedient in cases where the data have been obtained by accessing dedicated facilities, as is the case in scientific high-performance computing: Simulation results without annotation become dark data, making their meaning and purpose unintelligible to others, in particular to automated processing on repositories and computing environments [58]. HPC and other facilities are not employed adequately if they are used to generate dark data. Conversely, annotating the simulation outcome with appropriate metadata enhances its value to the community and ensures that the data become and remain FAIR, i.e. findable, accessible, interoperable, and reusable, permitting their preservation far beyond the immediate circumstances that originally motivated their creation [7,8,59].
Data infrastructures, such as repositories, digital marketplaces, or modelling and simulation environments, often follow a multi-tier design with an explicit logical or semantic layer, as illustrated by Fig. 11; in these cases, the underlying semantic technology, which may include non-relational databases, mechanisms for checking constraints, or the handling of digital objects, requires mechanisms for knowledge representation. This technical requirement is the main underlying cause of the increasing pressure on scientific communities to develop standardized metadata schemas or ontologies [33]. Such metadata standards are known as semantic assets; an agreement on semantic assets establishes semantic interoperability.

Fig. 11 Role of semantic technology within interoperable data infrastructures, illustrated for the case of a typical multi-tier architecture [33].
In materials modelling, understood here as roughly comprising the fields of computational molecular engineering (CME) for soft matter [36] and integrated computational materials engineering (ICME) for solid materials [60], a major community-governed effort towards metadata standardization is conducted by the European Materials Modelling Council (EMMC), specifically by the EMMC focus areas on digitalization and interoperability, supported by a series of projects funded from the Horizon 2020 research and innovation programme, including VIMMP [33,35]. This approach, which is implemented by the present data management concept, is based on a system of ontologies that permits the characterization of CME/ICME data provenance at multiple levels of abstraction: To facilitate technical-level reproducibility, metadata documenting all boundary conditions, technical parameters of the employed software, and circumstances related to workflow execution need to be provided, including details on the hardware architecture and the mode of parallelization. Semantic assets that can be used for this purpose include the VIMMP ontologies MACRO, OSMO, VISO, and VOV [35] in combination with the PaaSPort ontology [4]. Documenting a workflow manually in this way, at full detail, is not recommended except for very straightforward scenarios; the full viability of such an approach would require automated annotation through an integration of ontologies with workflow management systems [36].
In a logical representation of a simulation workflow, details of the technical implementation are left out of consideration; instead, the workflow is described in terms of the involved use cases, models, solvers, and processors, which are defined by their function rather than by their practical realization. This approach was introduced by the EMMC through the development of the MODA (Model Data) workflow description standard [10]; this is an adequate level of annotation for most purposes, as it permits documenting the provenance of simulation results as well as the intended use cases in a similar way as is usually done in a scientific journal article. However, metadata provided in this way are machine-processable and standardized by an ontology, in the present case by the Ontology for Simulation, Modelling, and Optimization (OSMO), i.e. the ontology version of MODA [36]. Logical data transfer (LDT) notation is a graph-based visualization of such workflow descriptions, which was developed on the basis of a similar notation from MODA. The LDT graph addressing the Riemann problem use case from Section 1 is shown in Fig. 12; this graph corresponds to a workflow description in terms of OSMO and to a collection of digital objects that can be ingested into (and extracted from) an interoperable data infrastructure, e.g. in JSON or JSON-LD format, cf. Fig. 11. The LDT notation is presented in detail in previous work [36].
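As an illustration of what such a machine-processable workflow annotation may look like, the following Python snippet emits a JSON-LD-style record; the context IRI, keys and values are illustrative placeholders only, not actual OSMO/VIMMP vocabulary terms.

```python
import json

# Hypothetical minimal metadata record for a simulation workflow;
# all identifiers below are illustrative placeholders, not actual
# OSMO/VIMMP vocabulary terms.
record = {
    "@context": {"osmo": "https://example.org/osmo#"},
    "@id": "urn:example:workflow:riemann-two-phase",
    "@type": "osmo:workflow",
    "osmo:has_use_case": "two-phase Riemann problem, LJTS fluid",
    "osmo:has_model": "Lennard-Jones truncated and shifted potential",
    "osmo:has_solver": {"name": "ls1 mardyn", "citation": "[52]"},
    "osmo:produced_output": "density, temperature, velocity profiles",
}
print(json.dumps(record, indent=2))
```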
For a high-level representation of CME/ICME scenarios, a conceptualization of modelling and simulation workflows as semioses is being developed on the basis of the European Materials and Modelling Ontology (EMMO) [15,22]. As a top-level ontology, the main purpose of the EMMO is to establish the foundations for a coherent architecture of semantic assets at the highest possible degree of abstraction. Due to the nature of this work, technical and philosophical requirements need to be reconciled; the present stage of these developments is discussed in a recent report [34].