Generation of virtual patient data for in-silico cardiomyopathies drug development using tree ensembles: a comparative study

In-silico clinical platforms have recently emerged as a promising path for virtual patient (VP) generation and for downstream analyses such as drug development. Advanced individualized models have been developed to enhance the flexibility and reliability of virtual patient cohorts. This study implements and compares three methodologies for generating virtual data for in-silico clinical trials: (i) the multivariate log-normal distribution (log-MVND), (ii) supervised tree ensembles, and (iii) unsupervised tree ensembles. The methods are evaluated on their ability to generate high-quality virtual data, using the goodness of fit (gof) and the dataset correlation matrix as performance measures. Our results reveal the dominance of the tree ensembles in generating virtual data with similar distributions (gof values less than 0.2) and correlation patterns (average difference less than 0.03).


I. INTRODUCTION
Virtual population (VP) development is a popular and emerging aspect of healthcare technology, and recent computational advances have highlighted the impact of this rapidly evolving biomedical field. According to the literature, many studies use VP models for the 3D visualization of human patients, which could potentially be used to train clinicians. Other applications focus on the design of geometrical models using VP models, which can be reconstructed from the virtual data to extract valuable clinical information and to represent large, varied cohorts depending on the disease under investigation.
In this study, we refer to virtual population generation as a method for generating virtual patient data for clinical trials, where mathematical models are used to describe and predict a patient's progression based on an existing therapy. Such virtual patient generation models are used to enhance the population size of clinical trials for numerous applications, including effective drug development and robust decision making in clinical trials in terms of clinical trial simulations (CTS). Knowledge mining from these models leads to the reduction, refinement and partial substitution of animal and human experimentation for drug development. The definition of a CTS, though, assumes the reproduction of the clinical trial design, the drug efficacy and the disease progression. A difficult issue in this field is the quality of the provided data. To this end, data curation methods can be used to resolve errors that are present within the data (e.g., outliers, data inconsistencies) and thus yield high-quality data for more accurate virtual population generation.
*This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 777204. This paper reflects only the author's view and the Commission is not responsible for any use that may be made of the information it contains.
The major baseline approaches for VP generation can be grouped into two categories: parametric approaches, which are used to resample the real data towards the generation of synthetic data, and non-parametric approaches, where virtual patients are produced by randomly selecting patients from a real clinical dataset [1]. The multivariate normal distribution (MVND) method is a parametric approach proposed by Tanenbaum [1] for generating a target group of virtual patients according to realistic covariates. Teutonico et al. [2] evaluated and compared the MVND approach using resampling. Some other studies that used different techniques are also worth mentioning due to their increased performance. Regnier et al. [3] compared three different approaches to generate VPs that match the real distributions from clinical cohorts. Moreover, Allen et al. [4] introduced a novel technique that generates VPs efficiently without the need for weighting. Apart from the statistical methods, more straightforward methods such as tree ensembles have recently been proposed by Robnik-Šikonja [5] for generating robust semi-artificial data, however without any reported application to in-silico clinical trials.
In this work, we deploy three computational methods to generate virtual patient data from real clinical data. We extend a previous study [6], in which the MVND method was used to generate 300 virtual patients, by comparing the multivariate log-normal distribution with the tree ensembles for the generation of 1000 virtual patients for in-silico clinical trials targeting drug development for familial cardiomyopathies (FCM). These methods have been integrated into the SILICOFCM cloud-based in-silico clinical trial platform [7], which incorporates straightforward simulation tools for FCM drug development. Our results confirm the dominance of the tree ensembles as a prominent method for virtual population generation, yielding an increased level of agreement between the real and the virtual data, with goodness of fit (gof) values less than 0.2 (empirical threshold) and correlation patterns highly similar to those of the real patient data (2.71% difference in the average correlation values).

A. Data sharing
Anonymized data were obtained from 1227 patients at two timepoints (2454 records in total) under the SILICOFCM project [7]. The dataset included 29 features (17 discrete, 12 continuous) covering demographics (e.g., age, gender), laboratory measures (e.g., diastolic pressure), and gene-related information (e.g., ACTC1, CSRP3). All the data came from the same centre, the Cardiomyopathies Unit at Careggi Hospital, Florence, and were collected by a very limited number of clinicians over more than 20 years.

B. Data quality evaluation
A data quality control pipeline presented in previous studies [8,9] was applied to the clinical data to deal with outliers, incompatible fields, duplicated features and missing values. The interquartile range (IQR) method was used to detect outliers and the Jaro distance [10] to detect lexical similarities. Features with more than 50% missing values, outliers, or inconsistent fields were excluded from the analysis.
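The two numeric checks described above can be sketched as follows. This is a minimal illustration, not the pipeline of [8,9]: the function names, the toy `lvef` column, and the conventional fence multiplier `k = 1.5` are assumptions, since the source does not state which multiplier was used.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional choice and an assumption here.
    Missing values (NaN) are ignored when computing the quartiles
    and are never flagged as outliers.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

def missing_fraction(values):
    """Fraction of entries that are missing (NaN)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(np.isnan(values)))

# Hypothetical feature column with one missing and one implausible value.
lvef = np.array([55.0, 60.0, 58.0, 62.0, np.nan, 57.0, 150.0])
mask = iqr_outlier_mask(lvef)

# A feature would be dropped if more than half of its values are missing.
drop_feature = missing_fraction(lvef) > 0.5
```

In this toy column only the value 150.0 falls outside the fences, and the feature is kept because about 14% of its values are missing.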

C. Methods for generating virtual data
1) Multivariate log-normal distribution
The multivariate normal distribution (MVND) [1,2] was adopted as a standard approach for generating virtual patient data. Assuming a set of $n$ features, say $x = \{x_1, x_2, \dots, x_n\}$, the MVND is defined as:
$$f(x) = \frac{1}{\sqrt{(2\pi)^{n} |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right), \tag{1}$$
where $n$ is the dimension, $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and $\Sigma^{-1}$ is the pseudoinverse of the covariance matrix, which is estimated through singular value decomposition (SVD). The goal of the MVND is to construct a multi-dimensional normal distribution given the mean vector, $\mu$, and the covariance matrix, $\Sigma$. To make the normality assumption more plausible for the real data, we generated samples that follow a log-normal distribution, where the exponential of (1) was used to generate virtual data so that:
$$y = \exp(x), \quad x \sim \mathcal{N}(\mu, \Sigma). \tag{2}$$
The generated values were finally log-normally transformed and compared with the original ones.
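A minimal sketch of log-normal sampling is given below. It assumes that the mean vector and covariance matrix are estimated on the log-transformed real data, which the text does not state explicitly, and it uses synthetic toy data in place of the clinical dataset. The function name `generate_log_mvnd` and all variable names are illustrative.

```python
import numpy as np

def generate_log_mvnd(real_data, n_virtual, seed=0):
    """Sample virtual records from a multivariate log-normal fit.

    real_data: array of shape (n_patients, n_features) with strictly
    positive values. Assumption: parameters are fitted in log space.
    """
    rng = np.random.default_rng(seed)
    log_data = np.log(real_data)
    mu = log_data.mean(axis=0)              # mean vector
    sigma = np.cov(log_data, rowvar=False)  # covariance matrix
    # method="svd" factorises the covariance matrix via singular value
    # decomposition, which also tolerates a singular covariance.
    normal_samples = rng.multivariate_normal(
        mu, sigma, size=n_virtual, method="svd"
    )
    # Exponentiating normal samples yields log-normal samples.
    return np.exp(normal_samples)

# Toy stand-in for two positive clinical features (not real data).
toy_rng = np.random.default_rng(42)
toy_real = np.exp(toy_rng.normal(loc=[4.0, 2.0], scale=[0.2, 0.3], size=(200, 2)))
virtual = generate_log_mvnd(toy_real, n_virtual=1000)
```

Because the samples are exponentiated, every generated value is strictly positive, which suits features such as wall thickness or age better than an untransformed normal model.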

2) Supervised tree ensembles
Another straightforward approach for virtual population generation is to train tree ensembles [11] on a set of training features along with a target feature. The virtual population generation problem can then be treated as a supervised learning problem, where the tree ensembles are trained on 50% of the real data and tested on the remaining subset. The classifier can then be turned into a virtual data generator that produces semi-artificial data following the same distribution as the original data [11]. During the training process, the Gini impurity index [12] is used to measure the probability of a sample being assigned to the wrong class:
$$G = \sum_{i=1}^{C} p_i (1 - p_i),$$
where $p_i$ is the probability of a sample falling in class $i \in \{1, 2, \dots, C\}$ and $C$ is the total number of classes ($C = 2$ for a binary target feature). The semiArtificial package in R [11] was used to create the tree ensemble.
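As a small worked example of the Gini impurity, the sketch below computes it for a list of class labels using the equivalent form $1 - \sum_i p_i^2$. The function name and the toy label lists are illustrative and are not taken from the semiArtificial package.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels.

    Computed as 1 - sum(p_i ** 2), which equals sum(p_i * (1 - p_i)).
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Binary target: an evenly mixed node is maximally impure (0.5),
# a pure node has zero impurity.
mixed = gini_impurity([1, 1, 0, 0])
pure = gini_impurity([1, 1, 1, 1])
skewed = gini_impurity([1, 1, 1, 0])
```

For a binary target the impurity ranges from 0 (all samples in one class) to 0.5 (an even split), so a tree prefers splits whose child nodes have lower impurity.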

3) "Unsupervised" tree ensembles
A similar and rather straightforward approach is to treat virtual population generation as a regression problem using tree ensembles [11]. The regressor can then be turned into a virtual data generator that produces semi-artificial data following the same distribution as the original data [11]. During the training process, the Mean Square Error (MSE) [11] is used to measure the difference between the real and the virtual distributions. For two variables (features), say $x$ and $y$, in the real and virtual distributions, respectively, the MSE is given as:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2,$$
where $N$ is the number of samples. The semiArtificial package in R [11] was used to construct a tree ensemble, which was used as the virtual data generator.

D. Evaluation measures
1) Goodness of fit (gof)
The Kolmogorov-Smirnov goodness of fit (gof) statistical test [13] is widely used in VP studies to measure the level of agreement between the real and the virtual distributions. The gof test measures whether the virtual data instances and the real data instances come from the same distribution. More specifically, the gof test statistic, say $D$, is given as:
$$D = \sup_{x} \left| F_{r}(x) - F_{v}(x) \right|,$$
where $F_{r}(x)$ and $F_{v}(x)$ are the empirical distribution functions of the original and virtual data, respectively. Large gof values denote distributions with a large vertical distance between them, whereas small gof values denote distributions with a small vertical distance, which are thus similar.
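The gof statistic can be sketched directly from its definition as the largest vertical gap between two empirical distribution functions. The function name and the toy samples below are illustrative; the 0.2 threshold is the empirical one quoted in this paper.

```python
import numpy as np

def ks_statistic(real, virtual):
    """Two-sample Kolmogorov-Smirnov statistic.

    Evaluates both empirical distribution functions at every observed
    value and returns the largest absolute difference.
    """
    real = np.sort(np.asarray(real, dtype=float))
    virtual = np.sort(np.asarray(virtual, dtype=float))
    grid = np.concatenate([real, virtual])
    # searchsorted(..., side="right") counts samples <= each grid point.
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_virtual = np.searchsorted(virtual, grid, side="right") / virtual.size
    return float(np.max(np.abs(cdf_real - cdf_virtual)))

# Toy samples: identical, partially overlapping, and disjoint.
d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])

# A feature would pass the empirical criterion if its gof is below 0.2.
passes = d_same < 0.2
```

The same statistic is also returned by `scipy.stats.ks_2samp`, which additionally reports a p-value; the hand-written version is shown only to make the definition concrete.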

2) Correlation
To further evaluate the similarity of the real and virtual data, we computed Pearson's correlation coefficient between each pair of features in the virtual data and in the real data. For two variables (features), say $x$ and $y$, in the real and virtual distributions, respectively, Pearson's correlation coefficient [14], say $r$, is given as:
$$r = \frac{E\left[(x - \mu_x)(y - \mu_y)\right]}{\sigma_x \sigma_y},$$
where $\mu_x$, $\mu_y$ are the mean values and $\sigma_x$, $\sigma_y$ are the standard deviations of $x$ and $y$, respectively, and $E[\cdot]$ is the expected value. An $r$ value of 1 denotes a perfect positive correlation between $x$ and $y$, whereas an $r$ value of 0 denotes no correlation.

A. Data quality assessment
The overall proportion of missing values within the data was 10.12% (13 features had less than 50% missing values and 16 features had no missing values at all). No outliers, duplicated features or inconsistent fields were detected in the anonymized data.
The performance evaluation results for each virtual population generation method are presented in Table 1. The gof values were the smallest for the features "Max_LVT" and "Age" in the log-MVND, for the features "sep_Eprime", "LA", "PW", and "Age" in the supervised tree ensembles, and for the remaining features in the unsupervised tree ensembles. The distribution of the gof values for each method is depicted in Figure 1 for the set of features presented in Table 1. The correlation matrix of the real data is depicted in Figure 2, whereas the correlation matrix of the virtual data derived by the unsupervised tree ensembles (the method that achieved the highest number of "optimal" gof values) is depicted in Figure 3, showing similar association patterns within the data. High correlation values are depicted in deep blue, whereas low correlation values are depicted in yellow. Since this is a qualitative way to view the correlation between the features in the real and the virtual populations, we also computed, for quantitative purposes, the absolute value of the difference between the average correlation values from the upper (or lower) triangular part of the matrices. The difference in the average correlation values was 4.45% for the log-MVND, 8.28% for the supervised tree ensembles and 2.71% for the unsupervised tree ensembles.
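The quantitative correlation comparison described above can be sketched as follows. This is one plausible reading of the measure: it averages the signed Pearson coefficients over the strict upper triangle of each matrix, whereas the text does not specify whether signed or absolute coefficients were averaged. The function names and toy arrays are illustrative.

```python
import numpy as np

def mean_upper_correlation(data):
    """Average Pearson correlation over the strict upper triangle.

    data: array of shape (n_samples, n_features).
    """
    corr = np.corrcoef(data, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)  # k=1 skips the diagonal
    return float(corr[upper].mean())

def correlation_gap(real, virtual):
    """Absolute difference between the two average correlations."""
    return abs(mean_upper_correlation(real) - mean_upper_correlation(virtual))

# Toy data: two perfectly correlated features versus two perfectly
# anti-correlated features (not clinical data).
toy_real = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
toy_virtual = np.array([[1.0, 6.0], [2.0, 4.0], [3.0, 2.0]])

gap_identical = correlation_gap(toy_real, toy_real)
gap_opposite = correlation_gap(toy_real, toy_virtual)
```

A gap of zero means the average association strength is preserved; note that a single averaged value can hide pairwise differences that cancel out, which is why the matrices are also inspected visually.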

IV. CONCLUSIONS
In this work, we compared three computational methods for generating high-quality virtual data for 1000 patients for in-silico clinical trials in cardiomyopathies drug development, using the goodness of fit (gof) and the correlation matrix for performance evaluation. Our results suggest the dominance of the tree ensembles for virtual population generation, yielding virtual patient data with an increased level of agreement (distributions with gof values less than 0.2) while maintaining the correlation patterns (associations) among the features in the real clinical data.
More specifically, the "unsupervised" tree ensembles achieved the lowest goodness-of-fit values for five out of ten features according to Table 1 and Figure 1 (i.e., the clinical features "LVOTO_Rest", "Evel", "lat_Eprime", "LVEF", and "NYHA"), the supervised tree ensembles for four out of ten features (i.e., "sep_Eprime", "LA", "PW", and "Age") using "NYHA" as the target feature, and the log-MVND for only two out of ten features (i.e., "Max_LVT" and "Age"). The correlation matrix generated by the "unsupervised" tree ensembles was close to the original one, which supports the level of agreement between the virtual and the real data. For example, the strong association (more than 75%) between the lateral e′ wave ("lat_Eprime") and the septal e′ wave ("sep_Eprime") (Figure 2) is clearly preserved in the virtual population (Figure 3).
The proposed methods could potentially provide significant insight into the field of virtual population generation and help re-adjust the perspective of Clinical Trials (CTs) in other domains. As future work, we also plan to deploy artificial neural networks (ANNs) that use radial basis functions (RBFs) as activation functions for the generation of even more robust clinical data for in-silico clinical trials.