[Datasets] Evaluación de la capacidad predictiva de modelos de aprendizaje supervisado para la clasificación de pacientes con cáncer colorrectal

Pablo Roman-Naranjo

doi:10.5281/zenodo.8061669

Published June 20, 2023 | Version v1

Dataset Open

Pablo Roman-Naranjo¹

1. Division of Otolaryngology, Department of Surgery, Instituto de Investigación Biosanitaria, ibs.GRANADA, Universidad de Granada, 18071 Granada, Spain

DATASETS INFO

Dataset on colorectal cancer and hydroxymethylation levels ready to be used in machine learning algorithms. This dataset was generated using data from Walker NJ, Rashid M, Yu S, et al. [Dataset] Hydroxymethylation profile of cell free DNA is a biomarker for early colorectal cancer. Accessed May 17, 2023. https://zenodo.org/record/5170265#.ZGSpgHZBxD-.

ABSTRACT
Colorectal cancer (CRC) is the second most common cause of cancer death, accounting for 9.5% of all cancer deaths. In addition to patient age, other potential risk factors should be considered to correctly identify the target population for CRC screening programmes. The identification of these risk factors would allow a personalised and accurate approach for each patient that would help improve the survival rate. Thus, the main objective of this study was to identify useful risk biomarkers for the early detection of CRC using machine learning algorithms.

For this purpose, we compared the predictive ability of several supervised machine learning models, including gradient boosting, support vector machines (SVM) and random forest, using a public dataset of hydroxymethylation levels in enhancer regions from CRC patients, AAR and controls. We also evaluated the suitability of K-means for identifying CRC patient subgroups in this dataset.

The results suggest that the best supervised model for differentiating CRC patients from controls using hydroxymethylation data was an SVM with a linear kernel, which reached a sensitivity of 58% with the specificity set to 95%, improving on the model presented in the article from which the dataset was extracted. In addition, enhancers that regulate the expression of genes such as MYSM1 or SP1, or genes encoding proteins involved in pathways such as the TGF-β and integrin pathways, were identified as the most relevant enhancers for classifying samples as CRC or control. K-means, in turn, identified 6 clusters among the samples in the hydroxymethylation dataset. Two of these clusters were composed mainly of CRC samples; however, they were not associated with a specific stage of development, and the clusters were very close to one another, so the separation between them was not clear.
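The "sensitivity at a fixed specificity" figure quoted above can be illustrated with a minimal sketch. It is not the author's code: the function name, the decision rule (a sample is called CRC when its score is strictly above the threshold) and the toy scores are all assumptions made for illustration. The threshold is chosen on the control scores so that at least 95% of controls fall at or below it, and sensitivity is then the fraction of CRC scores above it.

```python
def sensitivity_at_specificity(control_scores, case_scores, target_specificity=0.95):
    """Return (threshold, achieved_specificity, sensitivity).

    Hypothetical helper: a sample is predicted positive (CRC) when its
    classifier score is strictly greater than the threshold.
    """
    if not control_scores or not case_scores:
        raise ValueError("both groups need at least one score")
    if not 0.0 < target_specificity <= 1.0:
        raise ValueError("target_specificity must be in (0, 1]")

    # Try observed control scores in ascending order and stop at the
    # lowest threshold that classifies enough controls as negative.
    for threshold in sorted(set(control_scores)):
        true_negatives = sum(1 for s in control_scores if s <= threshold)
        specificity = true_negatives / len(control_scores)
        if specificity >= target_specificity:
            break

    # Sensitivity: fraction of cases scoring above the chosen threshold.
    true_positives = sum(1 for s in case_scores if s > threshold)
    sensitivity = true_positives / len(case_scores)
    return threshold, specificity, sensitivity


# Toy scores (invented, not taken from the dataset).
controls = list(range(1, 21))   # 20 control scores: 1..20
cases = [5, 15, 25, 35, 45]     # 5 CRC scores
print(sensitivity_at_specificity(controls, cases))
```

With real data the scores would typically come from the classifier's decision function on held-out samples, so that the threshold is not tuned on the same samples used to report sensitivity.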

Thus, we can conclude that hydroxymethylation data were useful for the identification of CRC biomarkers, with supervised machine learning approaches giving promising results. However, these results should be interpreted as preliminary: they require validation in an external cohort and molecular analysis of the biomarkers identified.

Files (361.4 MB)

Name	Size	Checksum
dataset_enhancer_crc_aa_c_ml.csv	361.4 MB	md5:e3e9831141434a4cd83d8b7524e65335
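Because the file is large, it may be worth checking a download against the published MD5 checksum before use. The following is a minimal sketch using only the Python standard library; the helper names and the chunk size are assumptions, and the local path passed in is whatever location the file was saved to.

```python
import hashlib

# Checksum as published in the file listing of this record.
EXPECTED_MD5 = "e3e9831141434a4cd83d8b7524e65335"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so the whole file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 equals the expected digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_published_checksum("dataset_enhancer_crc_aa_c_ml.csv")` on the downloaded file should return `True` if the download is complete and unmodified.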

	All versions	This version
Views	78	77
Downloads	49	48
Data volume	32.5 GB	32.2 GB
