Markov Chain Monte Carlo Sampling (insoluble monolayer example)

Markov Chain Monte Carlo (MCMC) sampling is a well-established method of using Bayes' theorem to estimate the uncertainty in model input parameters. MultilayerPy employs the emcee Python package, which is widely used for this purpose (https://doi.org/10.1086/670067). The full up-to-date documentation for emcee can be found here: https://emcee.readthedocs.io/en/stable/index.html.

An initial example will demonstrate how to set up the system and will run into a common problem. An improvement to the system is then presented, along with a summary of what we can learn from an MCMC analysis.

Summary of MCMC

MCMC sampling essentially involves initiating an ensemble of "walkers" (parameter sets), also referred to as chains, in a pre-defined parameter space. At each iteration of the algorithm a new position is proposed for each walker. The probability that this proposed position is accepted depends on how its posterior probability (i.e. the goodness of fit) compares with that of the walker's current position. Each iteration builds up a chain of parameter values which eventually tends towards the region of maximum likelihood. As a result, after a period of "burn-in" where the chains find and converge around the region of maximum likelihood, the chains reach an equilibrium and walk around this region. A histogram can then be plotted for each model parameter, revealing that parameter's probability distribution (for that model-experimental data system) - this is not always Gaussian.
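The acceptance rule described above can be sketched with a minimal random-walk Metropolis sampler. Note this is a simplified stand-in for the affine-invariant ensemble moves emcee actually uses, and the toy log_prob, step size and function names below are illustrative assumptions, not MultilayerPy code:

```python
import numpy as np

def log_prob(theta):
    # toy log-posterior: a standard normal in 2D
    return -0.5 * np.sum(theta ** 2)

def metropolis(n_steps, theta0, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    chain = np.empty((n_steps + 1, theta0.size))
    chain[0] = theta0
    lp = log_prob(theta0)
    for i in range(n_steps):
        proposal = chain[i] + rng.normal(0.0, step, size=theta0.size)
        lp_new = log_prob(proposal)
        # accept with probability min(1, p_new / p_old), done in log space
        if np.log(rng.random()) < lp_new - lp:
            chain[i + 1], lp = proposal, lp_new
        else:
            # rejected: the walker repeats its current position
            chain[i + 1] = chain[i]
    return chain

chain = metropolis(5000, np.zeros(2))
```

After discarding an initial burn-in portion, the remaining chain samples approximate the target distribution, which is exactly what the parameter histograms later in this notebook are built from.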

The user is encouraged to find out more in the detailed documentation and paper referred to above.

The model system we will use here is identical to the system studied in the "insoluble monolayers" tutorial notebook. In practice MCMC works best when initialising the ensemble of "walkers" around what is most likely the global minimum (hence the initial global optimisation). The focus will therefore be on the MCMC sampling procedure carried out after the initial global optimisation.

The dataset used is from Woden et al. (2021) (https://doi.org/10.5194/acp-21-1325-2021).

This dataset is from a monolayer of deuterated oleic acid on an aqueous subphase. We will assume that none of the products evaporate from the surface and keep them lumped together as a "products" component for simplicity.

The reaction scheme is consistent with other MultilayerPy tutorials: oleic acid + ozone --> products

In [1]:
import os
# this stops numpy parallelising some processes
# when it comes to parallel MCMC sampling we don't want to parallelise sub-processes (it slows down otherwise)
os.environ["OMP_NUM_THREADS"] = "1"

# importing the necessaries
import numpy as np
import multilayerpy.build as build 
import multilayerpy.simulate as simulate
import multilayerpy.optimize as optimize
In [2]:
# import the ModelType class
from multilayerpy.build import ModelType

# import the ReactionScheme class
from multilayerpy.build import ReactionScheme

# define the model type (KM-SUB in this case) and geometry (film)
mod_type = ModelType('km-sub','film')

# build the reaction tuple list, in this case only 1 tuple in the list (for 1 reaction)
# component 1 (oleic acid) reacts with component 2 (ozone)
reaction_tuple_list = [(1,2)]

# build the product tuple list, only component 3 (products) is a product
# a tuple with a single value inside is defined (value,)
product_tuple_list = [(3,)]

# now construct the reaction scheme
# we can give it a name and define the number of components as below
reaction_scheme = ReactionScheme(mod_type,name='Oleic acid ozonolysis',
                                                   reactants=reaction_tuple_list,
                                                products=product_tuple_list)

# let's print out a representation of the reaction scheme
reaction_scheme.display()
#########################################################
Reaction scheme: Oleic acid ozonolysis
Model type: km-sub
** No stoichiometry shown **
R1: 1 + 2 -----> 3 
#########################################################
In [3]:
# import ModelComponent class
from multilayerpy.build import ModelComponent

# making model components

# oleic acid
OA = ModelComponent(1,reaction_scheme,name='Oleic acid')

# ozone, declare that it is a gas
O3 = ModelComponent(2,reaction_scheme,gas=True,name='Ozone') 

# products
prod = ModelComponent(3,reaction_scheme, name='Reaction products')

# collect into a dictionary
model_components_dict = {'1':OA,
                        '2':O3,
                        '3':prod}
In [4]:
# import DiffusionRegime class
from multilayerpy.build import DiffusionRegime

# making the diffusion dictionary
diff_dict = {'1' : None,
             '2': None,
             '3':None}  

# make diffusion regime
diff_regime = DiffusionRegime(mod_type,model_components_dict,diff_dict=diff_dict)

# call it to build diffusion code ready for the builder
diff_regime()
In [5]:
# import ModelBuilder class
from multilayerpy.build import ModelBuilder

# create the model object, ignore [1,2,3] etc at the end
model = ModelBuilder(reaction_scheme,model_components_dict,diff_regime)

# build the model. Will save a file, don't include the date in the model filename
model.build(date=False)

# print out the parameters required for the model to run
print(model.req_params)
{'alpha_s_0_2', 'T', 'delta_3', 'w_2', 'Xgs_2', 'k_1_2_surf', 'Db_1', 'Db_2', 'Td_2', 'H_2', 'delta_1', 'delta_2', 'Db_3', 'k_1_2'}

Making an insoluble monolayer

An insoluble monolayer can be made by setting the bulk diffusion coefficient of a component to 0.0 cm2 s-1, meaning that the component does not diffuse at all in the bulk. We will nominally create 5 model bulk layers and a bulk film thickness of 2 µm to satisfy the model building process; however, there is essentially no exchange of material between the bulk and surface layers. Henry's law coefficient is also set to 0.0.

The surface reaction rate constant determined by Woden et al. (2021) is 2.2e-10 cm2 s-1. The ozone concentration was 323 ppb (323 x 2.46e10 cm-3). We will vary alpha_s_0_2 and Td_2 in the model, holding k_1_2_surf at its literature value.
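As a rough sanity check before running the model (not part of MultilayerPy), a steady-state balance of ozone adsorption and desorption gives an estimate of the surface ozone concentration and hence the pseudo-first-order lifetime of the monolayer. This back-of-envelope sketch ignores surface coverage effects and reactive loss of ozone, and simply uses the parameter values quoted above:

```python
# rough steady-state estimate: adsorption flux (alpha_s * w/4 * Xgs) balanced
# against desorption (1/Td); surface coverage and reactive loss are ignored
alpha_s = 2e-3          # surface accommodation coefficient (initial guess)
w = 3.6e4               # mean thermal velocity of ozone / cm s-1
Xgs = 323.0 * 2.46e10   # gas-phase ozone concentration / cm-3
Td = 1e-7               # desorption lifetime / s
k_surf = 2.2e-10        # surface reaction rate constant / cm2 s-1

O3_surf = alpha_s * (w / 4.0) * Xgs * Td   # surface ozone / molecules cm-2
k_pseudo = k_surf * O3_surf                # pseudo-first-order loss / s-1
lifetime = 1.0 / k_pseudo                  # e-folding lifetime / s
```

This gives a monolayer e-folding lifetime of a few hundred seconds, consistent with the 0-800 s time span of the experimental decay we are about to fit.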

In [6]:
# import the Simulate class
from multilayerpy.simulate import Simulate

# import the Parameter class
from multilayerpy.build import Parameter

# make the parameter dictionary
# SETTING BULK DIFFUSION AND HENRY'S LAW PARAMETERS TO 0.0
param_dict = {'delta_3':Parameter(1e-7),
              'alpha_s_0_2':Parameter(2e-3,vary=True,bounds=(1e-4,1e-1)),
              'delta_2':Parameter(0.4e-7),
              'Db_2':Parameter(0.0),
              'delta_1':Parameter(0.8e-7),
              'Db_1':Parameter(0.0),
              'Db_3':Parameter(0.0),
              'k_1_2':Parameter(0.0),
              'H_2':Parameter(0.0),
              'Xgs_2': Parameter(323.0 * 2.46e10),
              'Td_2': Parameter(1e-7,vary=True,bounds=(1e-9,1e-6)),
              'w_2':Parameter(3.6e4),
              'T':Parameter(294.0),
              'k_1_2_surf':Parameter(2.2e-10)}

# import the data
from multilayerpy.simulate import Data

raw_data = np.genfromtxt('woden_etal_acp_data_uncert.csv',delimiter=',')

data = Data(raw_data)

# make the simulate object with the model and parameter dictionary
sim = Simulate(model,param_dict,data=data)

# define required parameters
n_layers = 5
rp = 2e-4 # film thickness in cm
time_span = [0,800] # in s
n_time = 999 # number of timepoints to save to output

# layer V and A for the film geometry
# use simulate.make_layers function
V, A, layer_thick = simulate.make_layers(mod_type,n_layers,rp)

# initial conc. of everything
bulk_conc_dict = {'1':0.0,'2':0.0,'3':0.0} # key=model component number, value=bulk conc

# initial surf conc of oleic acid calculated from neutron reflectometry model fit
surf_conc_dict = {'1':262571428571428.56,'2':0.0,'3':0.0} # key=model component number, value=surf conc

y0 = simulate.initial_concentrations(mod_type,bulk_conc_dict,surf_conc_dict,n_layers) 
    
# now run the model
output = sim.run(n_layers,rp,time_span,n_time,V,A,layer_thick,Y0=y0)

%matplotlib inline
# plot the model
sim.plot(norm=True)

# There may be some runtime warnings because we are forcing some divisions by 0
In [7]:
# now optimise the model

# import the optimize module and Optimizer object
import multilayerpy.optimize
from multilayerpy.optimize import Optimizer

fitter = Optimizer(sim)

res = fitter.fit(method='least_squares',weighted=False);


sim.plot(norm=True)
Optimising using least_squares Nelder-Mead algorithm...

Optimization terminated successfully.
         Current function value: 0.005293
         Iterations: 2
         Function evaluations: 5
optimised params:

{'alpha_s_0_2': 0.002025, 'Td_2': 1.0125e-07}

Success=: True ,termination message: Optimization terminated successfully.
number of iters: 2
final cost function value = 
0.005293324465658205

MCMC sampling

Now we can set up MCMC sampling. We need to define the number of walkers (chains) we want to use - the more the merrier. This is normally a multiple of the number of CPUs available in your system; use os.cpu_count() to find this out if you don't know already. My machine has 12 cores available.

Here, I have selected 120 walkers and 1000 samples. We will decide how many of the initial steps to discard (or burn) after the sampling procedure.

In order to parallelise the algorithm, the multiprocessing package is used and the Pool object is imported. Using the with Pool() as pool: syntax will ensure parallelisation happens without unnecessary problems.

The sampler defined below is an emcee.EnsembleSampler object and can be manipulated as described in the emcee documentation (https://emcee.readthedocs.io/en/stable/user/sampler/).

NOTE if you are carrying this out in an IDE such as Spyder on a Windows system, you may need to invoke the if __name__ == "__main__": syntax before typing the code outlined below.
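The guard matters because on Windows (and with the "spawn" start method generally) each worker process re-imports the script; without the guard the sampling code would run again in every worker. A minimal sketch of the pattern is below - square and main are placeholder names for illustration, not MultilayerPy functions:

```python
from multiprocessing import Pool

def square(x):
    # stand-in for the expensive per-walker model evaluation
    return x * x

def main():
    # the Pool must be created under the __main__ guard so that
    # worker processes re-importing this module do not spawn pools of their own
    with Pool(2) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    print(main())
```

In this tutorial's notebook environment the guard is not strictly needed, which is why the cells below omit it.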

In [8]:
# MCMC sampling
walkers = 120
samples = 1000

# set the numpy random seed so that the analysis is reproducible
np.random.seed(1)

from multiprocessing import Pool

with Pool() as pool:

    sampler = fitter.sample(n_walkers=walkers,samples=samples,pool=pool)
Running MCMC production run...
100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [09:10<00:00,  1.82it/s]

Now we want to look at the chains that have been generated and plot what they look like.

In [9]:
# get the chains (see emcee documentation for details)
chains = sampler.get_chain()

# printing the shape of the chains
print('chains = (number of samples, number of walkers , number of dimensions)',chains.shape)

# we can plot the chains with the Optimizer object (fitter in this case)
fitter.plot_chains()
chains = (number of samples, number of walkers , number of dimensions) (1000, 120, 2)
In [10]:
# let's decide how many initial steps to burn
# the chains seem to have expanded to their equilibrium distributions after ~ 800 steps

burn_no = 800
thin = 1

# redefine chains now discarding the burn-in number of samples
chains = sampler.get_chain(discard=burn_no,thin=thin)
print(chains.shape)

# plot the chains again 
fitter.plot_chains(discard=burn_no)
(200, 120, 2)

Running the sampler for more steps after the burn-in stage will allow us to get a more accurate estimation of the parameter probability distribution.

The ideas of autocorrelation and thinning will not be discussed here, but MCMC sampling studies should ideally be accompanied by an analysis of autocorrelation (see references at the start of this tutorial). At a minimum, the number of samples, walkers, burn-in steps and parameter bounds should be reported so that someone can reproduce the analysis. Better yet, a Jupyter notebook going through the analysis!
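To sketch what such an autocorrelation analysis involves, the helper below estimates the integrated autocorrelation time of a single 1-D chain using an FFT-based autocorrelation function and Sokal-style automatic windowing. This is a simplified version of what emcee's sampler.get_autocorr_time() does for you; the function names here are my own, and the AR(1) demo chain is synthetic:

```python
import numpy as np

def autocorr_func(x):
    # normalised autocorrelation of a 1-D chain, via zero-padded FFT
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    f = np.fft.fft(x, n=2 * n)
    acf = np.fft.ifft(f * np.conjugate(f))[:n].real
    return acf / acf[0]

def integrated_time(x, c=5):
    # integrated autocorrelation time with Sokal's automatic windowing:
    # take the smallest window M satisfying M >= c * tau(M)
    rho = autocorr_func(x)
    taus = 2.0 * np.cumsum(rho) - 1.0
    for m in range(1, len(taus)):
        if m >= c * taus[m]:
            return taus[m]
    return taus[-1]

# demo: AR(1) chain with known integrated time (1 + phi) / (1 - phi) = 19
rng = np.random.default_rng(0)
eps = rng.normal(size=50000)
x = np.empty(eps.size)
x[0] = 0.0
for i in range(1, eps.size):
    x[i] = 0.9 * x[i - 1] + eps[i]
tau = integrated_time(x)
```

A common rule of thumb (from the emcee documentation) is to run for at least ~50 integrated autocorrelation times and to thin by roughly tau when quoting independent samples.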

Below is one method of presenting the distributions as a corner plot using the corner module. If you do not already have this module, $ pip install corner will install the package into your python environment.

In [11]:
# get the flattened chains (chains from all walkers joined together)
flat_chains = sampler.get_chain(flat=True,discard=burn_no,thin=thin)

import corner

# order of labels is the same as the order of varying parameters printed during MCMC sampling
labels = ['alpha_s_0_2', 'Td_2']

# corner.corner is the best method for this 
# we can plot 25th and 75th quantiles
fig = corner.corner(
    flat_chains, quantiles=[0.75,0.25],
    bins=20,
    labels=labels
    
);

What could this mean?

Ideally there would be a nice Gaussian blob in the middle of this corner plot. But there isn't: these two parameters are highly correlated (with an interesting banana shape). We could still quote the mean of these parameter probability distributions, but it would be wise to present this plot along with any analysis. Quoted uncertainties are often assumed to be the standard deviation of some Gaussian distribution; the distributions here are clearly not Gaussian, and revealing this is an advantage of this kind of analysis.
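To put a number on that correlation, one can compute the Pearson correlation coefficient between the two columns of the flattened chains. The snippet below uses synthetic correlated samples as a stand-in for the real flat_chains array (the helper name is my own):

```python
import numpy as np

def sample_correlation(flat_chains):
    # Pearson correlation coefficient between two sampled parameters
    a, b = flat_chains[:, 0], flat_chains[:, 1]
    return np.corrcoef(a, b)[0, 1]

# synthetic strongly correlated samples standing in for (alpha_s_0_2, Td_2)
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
samples = np.column_stack((x, 0.9 * x + 0.1 * rng.normal(size=5000)))
r = sample_correlation(samples)
```

A |r| close to 1 is a quantitative flag that the two parameters are nearly degenerate and that one of them should be constrained, which is exactly the conclusion drawn below.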

Take-home message from this:

  1. Either alpha_s_0_2 or Td_2 should be held constant and/or experimentally constrained in some way. They are both associated with the adsorption and desorption of ozone to the surface.
  2. Could alpha_s_0_2 and/or Td_2 be dependent on the composition of the monolayer? See the "composition-dependent surface adsorption" tutorial for how to construct such a model.

NOTE as the number of model components and layers increases, so does the time taken for a single model run. This significantly increases the time taken to optimise and sample the model-data system. Parallelising the global optimisation and sampling procedure is recommended (possible with MultilayerPy). At some point, however, bulky models would require a lot of CPUs, and a high-performance computing resource (such as a computer cluster) would be recommended.

Plotting the outcome

We can take a random sub-sample of n_samples samples from the MCMC run to plot. Because the optimiser didn't save each model output as the MCMC sampling procedure progressed (see note below), we first need to re-run the model with this sub-sample of model parameters.

MultilayerPy has the utility to do this. Let's grab a random set of samples from the MCMC sampling run, then run, save and plot those model outputs in 3 lines.

Note the optimiser did not save each model output after each iteration due to memory concerns. For example, running 120 walkers for 1000 steps produces 120,000 samples; the amount of data to store quickly ramps up. Therefore only the runs we want for presentation are stored.

In [12]:
n_samples_plot = 200
fitter.get_chain_outputs(n_samples=n_samples_plot,n_burn=burn_no,override_stop_run=True)
fitter.plot_chain_outputs()
Getting 200 sampled model outputs from MCMC chain with 800 discarded steps and 1 thinning steps...
Running sampled model simulations in series...

This is not a good outcome. The MCMC chains have not converged fully and more sampling would be required; you can see that some of the sampled runs are nowhere near the data. An analysis of the autocorrelation time would also help. This is not the focus of this notebook, but interested readers can consult the references at the start. Holding one of these highly correlated parameters constant may be a good way to proceed...

An improvement

Let's hold Td_2 constant at 1e-7 s, which is a reasonable guess for ozone at room temperature, and only allow alpha_s_0_2 to vary. We can do the same optimisation + MCMC sampling procedure as before (to save time, we'll take fewer samples).

In [13]:
# make the parameter dictionary
# SETTING BULK DIFFUSION AND HENRY'S LAW PARAMETERS TO 0.0
param_dict = {'delta_3':Parameter(1e-7),
              'alpha_s_0_2':Parameter(1e-3,vary=True,bounds=(1e-4,1e-2)),
              'delta_2':Parameter(0.4e-7),
              'Db_2':Parameter(0.0),
              'delta_1':Parameter(0.8e-7),
              'Db_1':Parameter(0.0),
              'Db_3':Parameter(0.0),
              'k_1_2':Parameter(0.0),
              'H_2':Parameter(0.0),
              'Xgs_2': Parameter(323.0 * 2.46e10),
              'Td_2': Parameter(1e-7),
              'w_2':Parameter(3.6e4),
              'T':Parameter(294.0),
              'k_1_2_surf':Parameter(2.2e-10)}


# import the data
from multilayerpy.simulate import Data
raw_data = np.genfromtxt('woden_etal_acp.csv',delimiter=',')
raw_errors = np.genfromtxt('woden_etal_errors.csv',delimiter=',')

actual_errors = (raw_errors[:,1] - raw_data[:,1]) * 2

collected_data = np.column_stack((raw_data,actual_errors))

data = Data(collected_data[:6,:])

# make the simulate object with the model and parameter dictionary
sim = Simulate(model,param_dict,data=data)

# define required parameters
n_layers = 5
rp = 2e-4 # film thickness in cm
time_span = [0,800] # in s
n_time = 999 # number of timepoints to save to output

# layer V and A for the film geometry
# use simulate.make_layers function
V, A, layer_thick = simulate.make_layers(mod_type,n_layers,rp)

# initial conc. of everything
bulk_conc_dict = {'1':0.0,'2':0.0,'3':0.0} # key=model component number, value=bulk conc

# initial surf conc of oleic acid calculated from neutron reflectometry model fit
surf_conc_dict = {'1':262571428571428.56,'2':0.0,'3':0.0} # key=model component number, value=surf conc

y0 = simulate.initial_concentrations(mod_type,bulk_conc_dict,surf_conc_dict,n_layers) 
    
# now run the model
output = sim.run(n_layers,rp,time_span,n_time,V,A,layer_thick,Y0=y0)

%matplotlib inline
# plot the model
sim.plot(norm=True)
In [14]:
# now optimise the model

# import the optimize module and Optimizer object
import multilayerpy.optimize
from multilayerpy.optimize import Optimizer

fitter = Optimizer(sim)

res = fitter.fit(method='differential_evolution',popsize=20);


sim.plot(norm=True)
Optimising using differential_evolution algorithm...

differential_evolution step 1: f(x)= 1.46756
differential_evolution step 2: f(x)= 1.46756
differential_evolution step 3: f(x)= 1.41011
differential_evolution step 4: f(x)= 1.41011
differential_evolution step 5: f(x)= 1.41011
differential_evolution step 6: f(x)= 1.40907
optimised params:

{'alpha_s_0_2': 0.002335663907749916}

Success=: True ,termination message: Optimization terminated successfully.
number of iters: 6
final cost function value = 
1.4088794597760537
In [15]:
# MCMC sampling
walkers = 120
samples = 200

# set the numpy random seed so that the analysis is reproducible
np.random.seed(2)

from multiprocessing import Pool

with Pool() as pool:

    sampler = fitter.sample(n_walkers=walkers,samples=samples,pool=pool)
Running MCMC production run...
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [02:12<00:00,  1.51it/s]
In [16]:
burn_no = 100
thin = 1

# redefine chains now discarding the burn-in number of samples
chains = sampler.get_chain(discard=burn_no,thin=thin)
print(chains.shape)

# plot the chains again 
fitter.plot_chains(discard=burn_no)
(100, 120, 1)
In [17]:
# get the flattened chains (chains from all walkers joined together)
flat_chains = sampler.get_chain(flat=True,discard=burn_no,thin=thin)

import corner

# order of labels is the same as the order of varying parameters printed during MCMC sampling
labels = ['alpha_s_0_2']

# corner.corner is the best method for this 
# using quantiles allows you to define a line for a confidence interval for your data
fig = corner.corner(
    flat_chains, quantiles=[0.975,0.025],
    bins=20,
    labels=labels)
    
In [18]:
n_samples_plot = 50
chain_outputs = fitter.get_chain_outputs(n_samples=n_samples_plot,n_burn=burn_no)
fitter.plot_chain_outputs()

# let's print the 2.5 and 97.5 % quantiles for this data (an uncertainty range to quote)
upper, lower = np.quantile(flat_chains,[0.975,0.025])
mean = np.mean(flat_chains)
std = np.std(flat_chains)
print(f'mean alpha_s_0: {mean}')
print(f'std alpha_s_0: {std}')
print(f'2.5% quantile: {lower}')
print(f'97.5% quantile: {upper}')

# let's save the optimised model parameters to a .csv file - a summary of what we have done here
sim.save_params_csv()
Getting 50 sampled model outputs from MCMC chain with 100 discarded steps and 1 thinning steps...
Running sampled model simulations in series...
mean alpha_s_0: 0.002352413419271761
std alpha_s_0: 0.000142032461476851
2.5% quantile: 0.0020911677980411125
97.5% quantile: 0.002643678022742569

Ah, much better! The histogram for alpha_s_0_2 is looking nice and Gaussian, and the sampled model outputs agree well with the data.

What can we learn from this?

  • MCMC sampling can be a powerful tool in inferring parameters from your model-data system.
  • Care must be taken when selecting the parameters to vary. Some can be highly correlated (as we saw here). Can you constrain one of them?
  • The number of walkers, samples, burn-in steps and thinning steps matter. There is a large body of information available about the best way to carry out an MCMC sampling study. MultilayerPy uses the well-established emcee package to facilitate this but caution is needed when drawing conclusions from such an analysis. Quote what you did!
In [19]:
# inter-quartile range
print(np.quantile(flat_chains,[0.25,0.75]))
[0.0022536  0.00244388]