A Multimodal Machine Learning Approach to Omics-based Risk Stratification in Coronary Artery Disease

This study aims to develop a personalized model for coronary artery disease (CAD) risk stratification based on machine learning modelling of non-imaging data, i


Introduction
Coronary artery disease (CAD) is a multi-factorial disease characterized by the accumulation of lipids in the arterial wall and the subsequent inflammatory response [1,2]. The phenotype of disease progression is affected by several factors, including clinical risk factors (e.g. gender, smoking, hyperlipidaemia, hypertension, diabetes) as well as molecular, biohumoral and biomechanical factors (e.g. low endothelial shear stress). CAD diagnosis is validated through invasive coronary angiography (CA); however, different invasive [e.g. intravascular ultrasound (IVUS), optical coherence tomography (OCT)] and non-invasive imaging modalities [e.g. computed tomography angiography (CTA), magnetic resonance imaging (MRI)] are nowadays available to visualize the vessel wall, quantify the plaque burden and characterize the type of the atherosclerotic plaque.
Predicting the risk of CAD constitutes a widely studied problem from the perspective of statistical modelling. The majority of existing risk models, such as the Framingham risk score (FRS) [3], the Systematic COronary Risk Evaluation (SCORE) [4] and the QRISK [5], postulate a Cox proportional hazard regression or logistic regression model of relatively few traditional predictors of the disease, focusing on CAD or cardiovascular disease (CVD). In spite of the reported good discrimination ability of parametric linear regression models, a recent systematic review demonstrated the paucity of external validation and head-to-head comparisons, the poor reporting of their technical characteristics, and the variability in outcome variables, predictors and prediction horizons, all of which limit their applicability in evidence-based decision making in healthcare [6]. Precision medicine suggests individualized dynamic predictive modelling approaches that are not hypothesis-driven [7][8][9]. Moreover, the increasing availability of electronic health records (EHRs), personal health records (PHRs) and omics big data gives rise to multiscale multi-parametric predictive big data analytics in personalized medicine in cardiovascular research and clinical practice [10][11][12].
The purpose of this study is to design and develop a machine learning-based model effectively integrating multiple categories of biological data towards precise risk stratification in coronary artery disease. Herein, we outline the formulation of the problem, present the main components of the model architecture, and investigate the predictive power of the currently available feature set.

Problem Formulation
CAD risk stratification is formulated as a multiclass classification problem, representing the severity of the disease as a nonlinear parametric function of a confined set of features. The utilized feature set is provided in Table 1. Three dominant classes $C_i$, $i = 1, \ldots, k$ (with $k = 3$), have been defined, namely "No CAD", "Non Obstructive CAD", and "Obstructive CAD"; a ≥50% diameter stenosis in at least one main coronary artery vessel, as assessed by computed tomography coronary angiography (CTCA), characterizes patients with obstructive CAD.
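The class definition above can be illustrated with a minimal Python sketch. Only the 50% threshold for obstructive CAD comes from the text. The boundary between "No CAD" and "Non Obstructive CAD" (any measurable stenosis below 50%) is an assumption made here for illustration, and the names `CadClass` and `label_patient` are hypothetical.

```python
from enum import Enum


class CadClass(Enum):
    NO_CAD = "No CAD"
    NON_OBSTRUCTIVE = "Non Obstructive CAD"
    OBSTRUCTIVE = "Obstructive CAD"


# Percent diameter stenosis that characterizes obstructive CAD (from the text).
OBSTRUCTIVE_THRESHOLD = 50.0


def label_patient(max_stenosis_percent: float) -> CadClass:
    """Map the worst stenosis over the main coronary vessels to a class."""
    if not 0.0 <= max_stenosis_percent <= 100.0:
        raise ValueError("stenosis must be a percentage in [0, 100]")
    if max_stenosis_percent >= OBSTRUCTIVE_THRESHOLD:
        return CadClass.OBSTRUCTIVE
    # ASSUMPTION (not stated in the source): any measurable stenosis
    # below the threshold is treated as non-obstructive disease.
    if max_stenosis_percent > 0.0:
        return CadClass.NON_OBSTRUCTIVE
    return CadClass.NO_CAD


if __name__ == "__main__":
    for stenosis in (0.0, 30.0, 50.0, 75.0):
        print(stenosis, label_patient(stenosis).value)
```

Taking the maximum stenosis over vessels mirrors the "at least one main coronary artery vessel" criterion: one vessel at or above the threshold is enough to assign the obstructive class.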

Multimodal Machine-Learning CAD Stratification Model
A multimodal architecture was specified relying on two processing layers, which are defined according to late or intermediate data integration strategies [13]. First, the following feature classes (or views) were defined: (View 1) demographics, (View 2) clinical data, risk factors, symptoms, (View 3) molecular variables (i.e. biohumoral, inflammatory markers and lipids profile), (View 4) gene expression data, (View 5) exposome, and (View 6) monocytes. As shown in Fig. 1, late data integration consists in the construction of: (i) an ensemble of decision tree-based prediction models (i.e. random forests, boosted decision trees) for each data view, whose individual decisions are merged using simple mechanisms (e.g. weighted voting), or (ii) a multimodal deep neural network comprising an appropriate deep learning subnetwork for each separate data view and unifying their outputs in higher network layers. Intermediate data integration is based on multiple kernel learning (Fig. 2). A kernel matrix is computed for each data view, and the view-specific matrices are then combined through a parametric linear function to generate the final kernel matrix. Kernel-based classification (i.e. support vector machine, relevance vector machine) is subsequently applied to predict CAD risk stratification. The skeleton and individual modules of the integrative model (i.e. merging mechanisms, machine learning algorithms, metric learning, regularization, and feature extraction) are implemented in R.
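The two merging mechanisms described above can be sketched in a few lines. The following is a minimal Python illustration using only the standard library; it is not the R implementation mentioned in the text. The function names, the Gaussian (RBF) kernel choice, the non-negativity constraint on the kernel weights, and all toy values are assumptions made here for illustration.

```python
import math
from collections import defaultdict


def weighted_vote(view_predictions, view_weights):
    """Late integration: merge one predicted label per view by weighted voting."""
    scores = defaultdict(float)
    for view, label in view_predictions.items():
        scores[label] += view_weights[view]
    # The label with the largest accumulated weight wins.
    return max(scores, key=scores.get)


def rbf_kernel(rows, gamma):
    """Gaussian (RBF) kernel matrix for one data view (one feature row per patient)."""
    def k(a, b):
        squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-gamma * squared_distance)
    return [[k(a, b) for b in rows] for a in rows]


def combine_kernels(kernels, weights):
    """Intermediate integration: K = sum_m w_m * K_m.

    ASSUMPTION: weights are non-negative, so that the combination of
    valid (positive semi-definite) kernels is again a valid kernel.
    """
    if len(kernels) != len(weights) or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per kernel matrix")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical per-view decisions and weights for a single patient.
    votes = {"demographics": "No CAD",
             "clinical": "Obstructive CAD",
             "molecular": "Obstructive CAD"}
    weights = {"demographics": 0.5, "clinical": 0.3, "molecular": 0.3}
    print(weighted_vote(votes, weights))

    # Two toy views describing the same two patients.
    K_view1 = rbf_kernel([[0.0], [1.0]], gamma=1.0)
    K_view2 = rbf_kernel([[0.0, 0.0], [0.0, 0.0]], gamma=1.0)
    print(combine_kernels([K_view1, K_view2], [0.5, 0.5]))
```

In a multiple kernel learning setting the weights would be learned rather than fixed, and the combined matrix would be passed to a kernel classifier that accepts a precomputed kernel.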

Results
Currently, the dataset is confined to demographics, risk factors, biohumoral markers and symptoms, which led us to concatenate all features into a single vector. In particular, three machine learning [...] The gradual improvement of accuracy with the enhancement of the input space is apparent, with proper customization of the input via feature ranking ($d = 20$) better balancing sensitivity and specificity. The support vector machine (SVM) outperforms the feed-forward neural network (FFNN) and the random forest (RF), reaching an overall accuracy of 85.1% and a nearly perfect sensitivity (98.7%), whereas specificity remains low (44.0%), presumably due to the class imbalance in the dataset. The confusion matrices corresponding to the SVM output in Case 3 and Case 4 are reported in Table 3 and Table 4, respectively.
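To make the relationship between the reported metrics concrete, the following minimal Python sketch computes sensitivity, specificity and accuracy from a binary confusion matrix. The counts are hypothetical and are not the study's data; they only show how a dataset dominated by the positive class can yield high accuracy alongside low specificity. The binary (disease vs. no disease) framing and the function name `binary_metrics` are assumptions made for illustration.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and balanced accuracy."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy, balanced_accuracy


if __name__ == "__main__":
    # HYPOTHETICAL counts: 80 positive and 20 negative patients.
    sens, spec, acc, bal = binary_metrics(tp=76, fn=4, tn=8, fp=12)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
          f"accuracy={acc:.3f} balanced_accuracy={bal:.3f}")
```

Because accuracy weights each class by its prevalence, the majority class dominates it; balanced accuracy averages the two rates and is therefore less sensitive to class imbalance.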

Discussion & Conclusions
CAD diagnosis is currently performed according to well-known screening strategies (i.e. CA, IVUS, OCT, CTA, MRI), whereas CVD risk can be assessed by linear regression models of clinical, laboratory and anthropometric features, assuming linearity as well as time-invariance of the underlying input-output relationships. Non-linearity is addressed by black-box parameterizations (neural networks and kernel-based models), more transparent architectures (decision trees, dynamic Bayesian networks) or ensembles of classification models (random forests), whose feature space, however, resembles that of linear approaches (i.e. established risk factors). The generalization capability of the existing machine learning models for the diagnosis of CAD or the estimation of eventful or asymptomatic CAD progression is promising; however, new knowledge coming from big data sources (e.g. molecular, cellular, inflammatory and omics data) requires more integrative machine learning solutions.
In this study, a new machine-learning approach to CAD risk stratification has been proposed relying on multimodal data integration. Its deployment and evaluation are ongoing and involve: (i) integrating new features concerning the lipid profile, the exome and mRNA sequencing, the exposome, and inflammatory and monocyte markers, and (ii) selecting the most effective multimodal predictive modelling scheme. Moreover, the multiclass classification problem will be refined by considering established risk scores of coronary atherosclerosis that combine markers of stenosis severity, plaque location and composition, as assessed by computed tomography angiography.