A 2-step Deep Learning method with Domain Adaptation for Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Magnetic Resonance Segmentation

Segmentation of anatomical structures from Cardiac Magnetic Resonance (CMR) is central to the non-invasive quantitative assessment of cardiac function and structure. Anatomical variability, imaging heterogeneity and cardiac dynamics challenge the automation of this task. Deep learning (DL) approaches have taken over the field of automatic segmentation in recent years; however, they are limited by data availability and by the additional variability introduced by differences in scanners and protocols. In this work, we propose a 2-step fully automated pipeline to segment CMR images, based on DL encoder-decoder frameworks, and we explore two domain adaptation techniques, domain adversarial training and iterative domain unlearning, to overcome the limitations of imaging heterogeneity. We evaluate our methods on the MICCAI 2020 Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge training and validation datasets. The results show the improvement in performance produced by the domain adaptation models, especially among the seen vendors. Finally, we build an ensemble of baseline and domain-adapted networks that reported state-of-the-art mean Dice scores of 0.912, 0.857 and 0.861 for the left ventricle (LV) cavity, LV myocardium and right ventricle cavity, respectively, on the externally validated Challenge dataset, which includes several unseen vendors.


Introduction
Cardiac magnetic resonance (CMR) imaging allows for an accurate non-invasive quantification of cardiac function and structure [1][2][3]. The valuable information that it provides in cardiovascular disease management has been repeatedly shown [4][5][6][7][8][9]. Nevertheless, the anatomical variability and the intrinsic complexity of cardiac dynamics and geometry represent a challenge, and CMR analysis remains manual in clinical practice [3,[10][11][12]. Deep learning (DL) has revolutionized medical image analysis in recent years, progressing towards automatic segmentation. However, these approaches are challenged by (1) the limited data availability due to technical, ethical and financial constraints along with confidentiality issues, especially for specific pathologies, and (2) the imaging heterogeneity introduced by anatomical variability and the use of different scanners and protocols [13][14][15][16]. Various techniques have been proposed for improved and robust performance of DL models under limited training data. The most common is data augmentation, including affine and non-affine transformations to populate the space of shape variability [17][18][19][20]. Other techniques incorporate modifications in the network architecture, such as reducing the number of network parameters to avoid overfitting or using a 2-step segmentation pipeline that improves class balance by zooming into the anatomical region of interest (ROI) [21,22]. Scanner-induced variation in datasets causes 'domain shift', whereby the performance of models trained on data from one scanner (the source domain) degrades when applied to another (the target domain). Techniques such as transfer learning [23] have proven successful against this, but fine-tuning for every unseen domain is still required.
Solutions based on domain adaptation (DA) techniques, which aim to create a single feature representation for all domains that is invariant to the domain but discriminative for the task of interest [24], have been proposed to overcome this limitation. One successful DA approach is domain adversarial training of neural networks (DANN) [25]. DANN assumes that predictions must be based on domain-invariant features, and jointly optimizes the underlying features to simultaneously minimize the loss of a label predictor and maximize the loss of a domain predictor. An alternative approach was proposed using an iterative framework [26] of domain unlearning (DU) for adversarial adaptation, creating a classifier which is more uniformly uninformative across domains [27]. This framework has been successfully applied to the harmonization of MRI for brain tissue segmentation [28].
The MICCAI 2020 Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms 2020), which includes a dataset with a wide variety of centers and scanner vendors and grants external validation, provides a benchmark to assess the generalizability of segmentation algorithms [29]. We propose a fully automated pipeline to segment CMR short-axis (SAx) stacks, based on the 2-step DL framework that was awarded 1st prize in the LVQuan19 Challenge [22,30]. We evaluate the segmentation performance of the proposed method and explore two DA techniques, DANN and DU, for multi-vendor and multi-center applications on the M&Ms 2020 Challenge. The results show that the DA techniques contribute to performance improvement and suggest that further experimentation is needed to improve performance on unseen vendors. A final model ensemble was built, achieving state-of-the-art performance for both seen and unseen vendors.

Materials and Methods
In essence, the proposed pipeline first locates the heart (1st neural network, NN) and then focuses on that ROI to produce a fine segmentation (2nd NN). Three alternative implementations are proposed for this latter step. The pipeline is coupled with pre-processing and post-processing stages, as illustrated in fig. 1.

Data
We deployed and evaluated our methods on the publicly available M&Ms 2020 Challenge dataset [29], which involves 4 acquisition centers and consists of 150 annotated SAx images from two different MRI vendors, A and B (75 each), and 25 unannotated images from a third vendor, C. The annotated images are segmented only at end-diastole (ED) and end-systole (ES), including left ventricular (LV) cavity, LV myocardium and right ventricle (RV) cavity masks. The image resolution varies from 192x240 to 384x384 pixels, and the pixel spacing from 0.977 to 1.625 mm. An additional testing set of 200 cases from 6 acquisition centers, including vendors A, B, C and a fourth, unseen vendor, D, in equal proportions, is held by the organizers for external validation (20% of the set, 40 cases) and final challenge results (80% of the set). Details on the acquisition and annotation protocols can be found in [29].

1st NN: Pre-processing, heart detection and transformation.
The 3 most central ES and ED slices of each patient are normalized in intensity and resolution (pre-processing, fig. 1, step 1-2) and fed to the 1st NN, which segments the LV epicardium and RV endocardium (heart detection, fig. 1, step 2-3) so that a ROI can be defined (transformation, fig. 1, step 3-4). The pre-processing consists of a linear interpolation to a 2D pre-defined template (256x256 pixels with symmetrical 1.12 mm pixel spacing) centered in the image, followed by an intensity clipping (10th and 96th percentiles, selected empirically to avoid clipping intensities in important structures) and a min-max normalization to 0-255 (uint8). The 2D mass center coordinates of the LV and RV are derived from the 6 prediction outputs of the 1st NN, and from these a transformation is calculated to align the images to a smaller pre-defined template (144x144 pixels, 1.32 mm pixel spacing, a balance between a small size and a high resolution while keeping the image large enough to cover the heart) centered in the heart. This same transformation is applied to each slice of the ES and ED SAx stacks. Details of the architecture and implementation of the 1st NN, based on a standard U-Net, can be found in [22].
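As a concrete illustration, the pre-processing of a single slice could be sketched as follows. This is a minimal NumPy/SciPy sketch; the function name, the zero-padding strategy and the parameter defaults are our own assumptions, not the exact implementation of [22].

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_slice(img, spacing, template_shape=(256, 256),
                     template_spacing=1.12, p_low=10, p_high=96):
    """Resample a 2D slice to the fixed template, clip intensities at the
    10th/96th percentiles and min-max normalize to uint8 (0-255).
    Hypothetical helper illustrating the pre-processing described above."""
    # Linear interpolation to the template pixel spacing.
    factors = (spacing[0] / template_spacing, spacing[1] / template_spacing)
    img = zoom(img.astype(np.float32), factors, order=1)

    # Center-crop or zero-pad to the 256x256 template (assumed strategy).
    out = np.zeros(template_shape, dtype=np.float32)
    h = min(img.shape[0], template_shape[0])
    w = min(img.shape[1], template_shape[1])
    oy, ox = (template_shape[0] - h) // 2, (template_shape[1] - w) // 2
    iy, ix = (img.shape[0] - h) // 2, (img.shape[1] - w) // 2
    out[oy:oy + h, ox:ox + w] = img[iy:iy + h, ix:ix + w]

    # Percentile clipping followed by min-max normalization to 0-255.
    lo, hi = np.percentile(out, [p_low, p_high])
    out = np.clip(out, lo, hi)
    out = (out - lo) / max(hi - lo, 1e-8) * 255.0
    return out.astype(np.uint8)
```

In the actual pipeline the same resampling parameters would be stored so that predictions can be mapped back to the native image space.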

2nd NN: Fine segmentation and post-processing
The transformed images, linearly interpolated to the smaller template and normalized in intensity (5th and 93rd percentile clipping, with cutoffs empirically adjusted to the intensities of the structures in the ROI space), are fed to the 2nd NN for a fine segmentation of the LV endo- and epicardium and the RV endocardium (fig. 1, step 4-5). The LV myocardium is computed as the region between the LV endocardium and epicardium, and the predictions are interpolated back to the original resolution and rearranged into the original 3D setting in a post-processing step (fig. 1, step 5-6). This final step also enhances segmentation quality (binarizing predictions, filling holes and removing stray clusters). Three architectures are proposed for the 2nd NN, as described below:
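The quality-enhancement step and the myocardium computation can be sketched with standard SciPy morphology operations (hypothetical helpers; the actual implementation may differ in details such as connectivity):

```python
import numpy as np
from scipy import ndimage

def clean_mask(prob, threshold=0.5):
    """Binarize a soft prediction, fill holes and keep only the largest
    connected component (removing stray clusters), as described above."""
    mask = prob >= threshold                       # binarize
    mask = ndimage.binary_fill_holes(mask)         # fill holes
    labels, n = ndimage.label(mask)                # connected components
    if n > 1:
        sizes = ndimage.sum(mask, labels, range(1, n + 1))
        mask = labels == (np.argmax(sizes) + 1)    # keep largest cluster
    return mask.astype(np.uint8)

def myocardium(endo, epi):
    """LV myocardium as the region between the endo- and epicardium."""
    return (epi.astype(bool) & ~endo.astype(bool)).astype(np.uint8)
```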

U-Net. Baseline method.
A U-Net fine-tuned for cardiac segmentation is applied for baseline comparisons. Details are available in [19].

Domain adversarial training of neural networks (DANN)
The DANN model, proposed by [25], consists of a feature extractor network followed by a domain predictor and a label predictor. The gradient-reversal layer, placed between the feature extractor and the domain predictor, reverses the gradient direction during backpropagation and maximizes the domain prediction loss, thus minimizing the shift between the domains while keeping the model discriminative for the main task of label prediction.
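In modern autograd frameworks the gradient-reversal layer reduces to a few lines. The following PyTorch sketch (our own illustration, not the authors' code) is the identity in the forward pass and multiplies the incoming gradient by -lambda in the backward pass:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient-reversal layer: identity forward, -lambda * grad backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing to the feature extractor;
        # None for the non-tensor lambda argument.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

Inserting `grad_reverse` between the feature extractor and the domain predictor lets a single backward pass minimize the label loss while maximizing the domain loss with respect to the shared features.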

Domain unlearning (DU)
The DU model [28] is based on the iterative unlearning framework [26] for adversarial adaptation, which, rather than using a gradient-reversal layer, optimizes two opposing loss functions sequentially: one maximizes the performance of the label predictor given the fixed feature representation, and the other updates the feature representation in order to maximally confuse the domain classifier.

Implementation details for DANN and DU
For the label predictor of the DANN and DU models, we used the Adam optimizer with the same parameters as in the baseline U-Net method. For the domain predictor, we trained with the Momentum optimizer (momentum value of 0.9) in DANN and with the Adam optimizer in DU. We used a batch size of 16, with learning rates of 10⁻³ and 10⁻⁵ for DANN and DU respectively. In addition, in DU we used a beta value of 1000 (a factor used for weighting the domain confusion loss). These training hyperparameters were chosen empirically, and a criterion based on a patience value of 25 epochs (the number of epochs to wait for progress on the validation set) was used to determine model convergence (early stopping). Both models were run on an NVIDIA Tesla V100 GPU. The DANN and DU models took around 50 and 45 seconds per epoch, respectively, for training.
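The patience-based convergence criterion mentioned above can be expressed as a small helper (an illustrative sketch, not the authors' code):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs (patience = 25 in our experiments)."""
    def __init__(self, patience=25):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```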

Data augmentation
Standard data augmentation based on random rotations (0 to 360°), translations (1st NN: ±20 mm; 2nd NN: ±8 mm) and flipping was applied during the training of all architectures. In each epoch, all the original training images are randomly transformed as described above before being fed to the NN. Thus, the originals are never used directly, and each instance fed to the NN is seen only once.
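A minimal sketch of this augmentation for the 2nd NN is given below (hypothetical helper; the defaults assume the 144x144, 1.32 mm ROI template, and labels are resampled with nearest-neighbour interpolation so that mask values are preserved):

```python
import numpy as np
from scipy import ndimage

def augment(img, mask, max_shift_mm=8.0, spacing=1.32, rng=None):
    """Random rotation (0-360°), translation (±8 mm for the 2nd NN) and
    flipping, applied identically to an image and its mask."""
    if rng is None:
        rng = np.random.default_rng()
    angle = rng.uniform(0, 360)
    shift = rng.uniform(-max_shift_mm, max_shift_mm, size=2) / spacing  # mm -> px
    img = ndimage.rotate(img, angle, reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, reshape=False, order=0)  # nearest for labels
    img = ndimage.shift(img, shift, order=1)
    mask = ndimage.shift(mask, shift, order=0)
    if rng.random() < 0.5:
        img, mask = np.fliplr(img), np.fliplr(mask)
    return img, mask
```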

Performance evaluation
Volumetric Dice scores between predicted and original segmentations in the native image space are calculated to assess the fine segmentation performance of the three proposed approaches.
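For reference, the volumetric Dice score between two binary 3D masks can be computed as follows (a minimal sketch; the evaluation is performed in the native image space):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Volumetric Dice score: 2|P ∩ G| / (|P| + |G|) over binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```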

Experiments and Results
The 1st NN was trained on the 150 labeled images (vendors A and B) following a 5-fold cross-validation strategy with a training-validation-testing split ratio of 107-13-30 images. The 5 resulting models were used in combination (majority voting) for the prediction of unseen test subjects.
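The majority-voting combination of the per-fold models can be sketched as follows (the integer label encoding is our own assumption for illustration):

```python
import numpy as np

def majority_vote(predictions):
    """Combine per-model label maps by per-pixel majority voting.
    `predictions` is a list of integer label maps of identical shape
    (e.g. 0=background, 1=LV cavity, 2=LV myocardium, 3=RV cavity)."""
    stack = np.stack(predictions)                 # (n_models, ...)
    n_labels = int(stack.max()) + 1
    votes = np.stack([(stack == l).sum(axis=0) for l in range(n_labels)])
    return np.argmax(votes, axis=0)               # most-voted label per pixel
```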
To assess the quantitative performance of the 3 proposed methods on the 2nd NN fine segmentation, for seen and unseen vendors, we carried out 2 experiments:
- Training on the 150 labeled images from vendors A and B (A+B) and cross-validation evaluation on the same set, A+B (seen vendors assessment).
- Training on vendors A and C (A+C) and evaluation on vendor B (unseen vendor assessment). The baseline method was trained only on A, since, unlike the other 2 methods, it cannot incorporate unlabeled data (vendor C) into training.
In both cases, and for all 3 methods, we followed a 5-fold cross-validation approach with the split ratio described above. The results are shown in table 1. While both DA techniques significantly improved the segmentation performance for seen vendors, the improvement decreased for unseen vendors, and only DANN reported a significantly better average performance (t-test) than the baseline for any structure. Indeed, the baseline was superior in median RV Dice for unseen vendors.
Finally, the 3 models were trained on A+B+C (only A+B for the baseline), following the 5-fold training and split ratio described above, and submitted to the M&Ms 2020 Challenge for external validation on unseen samples from vendors A, B, C and D. In addition, an ensemble combining by majority voting the 15 resulting baseline, DANN and DU models (5 folds x 3 methods) was submitted. The external results, shown in table 2, are consistent with our experiments and confirm the improvement in performance brought by DA, especially for unseen vendors. The differences were not significant due to the size of the sets (only 10 patients per vendor). The best average performance was obtained with the ensemble model for all structures.
Table 3. Volumetric Dice results from the final testing set, stratified by vendor (median and inter-quartile range) and aggregated (average). Significant differences (p < 0.05) of mean values with respect to the ensemble external validation results are marked with an asterisk (*).

Discussion and conclusions
Our contributions in this work are as follows. (1) The results show that the implemented DA methods significantly improve segmentation performance on seen labelled vendors. However, the 2 proposed DA methods only achieved comparable or slightly superior performance to the baseline on unseen vendors, so the potential improvement that these 2 methods offer for multi-vendor applications remains inconclusive.
The DA model improvements for seen vendors, especially with DU, illustrated by the A+B experiment in Table 1, can be explained by the fact that both DA models learn a generic feature representation (by unlearning domain-specific information), enabling information from both domains to be incorporated. This essentially increases the size and variability of the training dataset, improving the label prediction. We believe that the DU model provided better results than DANN because the min-max optimization (with gradient reversal) can occasionally become unstable and get trapped in local maxima. In DU models, on the other hand, the feature representations are updated to unlearn domain-specific information at each iteration. Thus, the label predictor is consistently improved using these feature representations, making the DU model comparatively more stable.
However, despite training DANN and DU models with data from vendor C, there was a lack of improvement on that vendor when compared to baseline results, according to the external validation experiments. A plausible explanation is the additional variability introduced by unseen pathological cases and centers within the same vendor domain. Further investigation is required, after the test data is released, to understand this behavior and to propose changes accordingly via domain shift analysis and hyperparameter fine-tuning.
Comparing the A+B and A+C experiment results, the relatively small drop in performance of the baseline models on the prediction of vendor B suggests the robustness of the baseline method for unseen domain prediction. This is further confirmed by the external validation experiments (see predictions on vendors C and D, table 2). Credit for this robustness should be given to the 2-step implementation, which normalizes for orientation, resolution and appearance and crops the original images to the ROI, alleviating the label imbalance (background vs structure) and thereby enhancing segmentation. In an ensemble model, different models are trained with different initializations and each of them learns different aspects of the data. For instance, while the baseline model learns the salient features for label prediction, DANN and DU learn domain-invariant features while still retaining task-specific features. Hence, by combining the predictions in an ensemble model we achieved better overall results in the external validation dataset than with any of the individual models. The Dice scores obtained by this model ensemble on the validation dataset (see table 2) are comparable to those reported in [21], trained on 4875 theoretically healthy subjects, whereas we used only 150 labelled subjects for training and evaluated on data from multiple centers and vendors, including pathological cases.
While, according to the M&Ms 2020 Challenge description, the external validation set (40 cases) and the final testing set (160 cases) come from the same test set, including samples from different vendors, centers and pathologies in the same proportions, significant differences in performance are obtained for the myocardium structure and for vendor B. A plausible explanation is that the size of the validation set might not be large enough to cover the variability of the test set. We believe that the discrepancies between the testing and validation sets across participants are worth exploring to check this hypothesis and to improve the representativeness of the sets in future editions of the Challenge.
To sum up, we have evaluated our method and explored various DA techniques on heterogeneous data from multiple vendors and centers, achieving the best performance with the ensemble of DA and baseline models. We have made our methods publicly available as Singularity containers that may serve as an independent testing tool for the community. Future directions include further exploration of DA models to improve model generalizability.