###############################################################################

Title:       Machine learning-accelerated small-angle X-ray scattering
             analysis of disordered two- and three-phase materials

Authors:     Magnus Röding, Piotr Tomaszewski, Shun Yu, Markus Borg,
             Jerk Rönnols

Description: Dataset and code used in M Röding et al., "Machine
learning-accelerated small-angle X-ray scattering analysis of disordered two-
and three-phase materials", published in Frontiers in Materials. In this
work, we develop a machine learning-based framework for predicting material
parameters from small-angle X-ray scattering (SAXS) data. The method is
trained on data from a Gaussian random field-based model for the electron
density of the material, combined with a very fast Fourier transform-based
numerical method for simulating realistic SAXS measurements. Prediction is
performed using regression with XGBoost. This repository supplies the Matlab
and Python/XGBoost code needed to investigate the prediction models and
reproduce the results of the paper, together with the datasets and the
trained XGBoost models.

###############################################################################

Begin by unzipping all compressed files.

-------------------------------------------------------------------------------
Requirements
-------------------------------------------------------------------------------

The Matlab code is tested in Matlab R2021b and uses GPU functionality through
the Parallel Computing Toolbox. The Python/XGBoost code is tested in Python
3.10.0 (Anaconda distribution) with XGBoost 1.5.2. NumPy is also required.

-------------------------------------------------------------------------------
Data
-------------------------------------------------------------------------------

The data is generated using Matlab. The code for generating virtual material
structures (electron density distributions) and simulating SAXS data is found
in the 'data' folder.

By running 'run_gpu.m', a batch of data is generated for a randomly selected
dataset (structure type). The results are stored in MAT format in e.g.
'two_phase_low_porosity/data' for the two-phase, low-porosity dataset, and
analogously for the three other datasets. The outcome of this step is random.

Once a sufficient amount of data has been generated, it can be consolidated
into a single dataset per structure type by running 'generate_datasets.m'.
The data is shuffled, simulated measurement noise is added, and the data is
rescaled and split into training, validation, and test sets, stored in MAT
format in e.g. 'two_phase_low_porosity'. Unlike the previous step, the
outcome of this step is deterministic, because a different, fixed random seed
is used for each of the four datasets.

The MAT format data is converted to BIN (raw binary, single precision) format
by running 'mat2bin.m', and the result is stored in e.g.
'two_phase_low_porosity'. In this repository, the datasets used in the
article are supplied in both MAT and BIN formats.

-------------------------------------------------------------------------------
Prediction
-------------------------------------------------------------------------------

Training and prediction are performed in Python/XGBoost, with some helper
scripts in Matlab. The code for training prediction models is in e.g.
'prediction/two_phase_low_porosity' and analogously for the three other
datasets.

For each dataset, training of an XGBoost model is performed in 'train.py';
a sketch of the data import step is shown below.
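Since the BIN files are raw binary in single precision, they can be read
directly with NumPy. A minimal sketch of the import step, assuming a
hypothetical file name and a hypothetical number of values per sample (the
actual file names and layout are defined in 'mat2bin.m' and 'train.py'):

    import numpy as np

    n_features = 256  # assumed number of values (q-bins) per SAXS curve

    # Raw binary, single precision, as written by 'mat2bin.m';
    # 'inputs_training.bin' is a hypothetical file name.
    raw = np.fromfile('two_phase_low_porosity/inputs_training.bin',
                      dtype=np.float32)
    inputs = raw.reshape(-1, n_features)  # one row per simulated measurement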
In 'train.py', the BIN format data is imported, preprocessed, and rescaled.
Training is then performed for a scalar output parameter, randomly selected
from the 2 or 3 output parameters. Once training is finished, the trained
model (.bin) and its parameters/metadata (.dat) are stored in
'training_results'. The supplied code performs training for the fixed set of
hyperparameters used for final training in the paper.

The best trained models are identified by running 'consolidate_training.m'.
The best model and its parameters/metadata are copied to 'trained_models'.

Predictions are performed on the training, validation, and test datasets, and
for all output parameters, by running 'predict.py'. The results are stored in
'predictions'.

The performance of the predictions is assessed by running
'extract_results.m', which computes MSE (mean squared error) and MAPE (mean
absolute percentage error) losses for all individual parameters. The
prediction results are stored in 'results.mat'.

In this repository, the prediction models, the predictions, and the
assessment of the results used in the article are supplied; a sketch of the
overall workflow is given below.
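To illustrate the full loop (training in 'train.py', prediction in
'predict.py', and loss computation in 'extract_results.m'), here is a
minimal, self-contained Python sketch using synthetic stand-in data. The
hyperparameters below are placeholders, not the fixed set used for final
training in the paper:

    import numpy as np
    import xgboost as xgb

    # Synthetic stand-in data; in the repository, the inputs are simulated
    # SAXS curves and the outputs are material parameters read from BIN files.
    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((1000, 50)).astype(np.float32)
    y_train = rng.uniform(0.1, 1.0, 1000).astype(np.float32)
    X_test = rng.standard_normal((200, 50)).astype(np.float32)
    y_test = rng.uniform(0.1, 1.0, 200).astype(np.float32)

    # Placeholder hyperparameters; the values used in the paper are set in
    # 'train.py'.
    model = xgb.XGBRegressor(n_estimators=500, max_depth=6,
                             learning_rate=0.05)
    model.fit(X_train, y_train)
    model.save_model('model.bin')  # trained models are stored as .bin

    y_pred = model.predict(X_test)

    # MSE and MAPE losses, as computed in 'extract_results.m'
    mse = np.mean((y_pred - y_test) ** 2)
    mape = 100.0 * np.mean(np.abs((y_pred - y_test) / y_test))
    print(f'MSE: {mse:.4g}, MAPE: {mape:.4g} %')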