BRAINTEASER ALS and MS Datasets
Creators
- Faggioli, Guglielmo1
- Marchesin, Stefano1
- Menotti, Laura1
- Aidos, Helena2
- Bergamaschi, Roberto3
- Birolo, Giovanni4
- Bosoni, Pietro5
- Cavalla, Paola6
- Chiò, Adriano4
- Dagliati, Arianna3
- de Carvalho, Mamede2
- Di Nunzio, Giorgio Maria1
- Fariselli, Piero4
- García Dominguez, Jose Manuel7
- Gromicho, Marta2
- Guazzo, Alessandro1
- Longato, Enrico1
- Madeira, Sara C.2
- Manera, Umberto4
- Silvello, Gianmaria1
- Tavazzi, Eleonora5
- Tavazzi, Erica1
- Trescato, Isotta1
- Vettoretti, Martina1
- Di Camillo, Barbara1
- Ferro, Nicola1
Description
BRAINTEASER (Bringing Artificial Intelligence home for a better care of amyotrophic lateral sclerosis and multiple sclerosis) is a data science project that seeks to exploit the value of big data, including those related to health, lifestyle habits, and environment, to support patients with Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) and their clinicians. Taking advantage of cost-efficient sensors and apps, BRAINTEASER will integrate large, clinical datasets that host both patient-generated and environmental data.
As part of its activities, BRAINTEASER organized three open evaluation challenges on Intelligent Disease Progression Prediction (iDPP): iDPP@CLEF 2022, iDPP@CLEF 2023, and iDPP@CLEF 2024, all co-located with the Conference and Labs of the Evaluation Forum (CLEF).
The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to:
- better describe disease mechanisms;
- stratify patients according to their phenotype, assessed throughout the disease evolution;
- predict disease progression in a probabilistic, time-dependent fashion.
The iDPP@CLEF challenges relied on retrospective and prospective ALS and MS patient data made available by the clinical partners of the BRAINTEASER consortium.
Retrospective Dataset
We release three retrospective datasets, one for ALS and two for MS. Of the two retrospective MS datasets, one consists of clinical data only and the other of clinical data together with environmental/pollution data.
The retrospective datasets contain data about 2,204 ALS patients (static variables, ALSFRS-R questionnaires, spirometry tests, environmental/pollution data) and 1,792 MS patients (static variables, EDSS scores, evoked potentials, relapses, MRIs). A subset of 280 MS patients contains environmental and pollution data.
In more detail, the BRAINTEASER retrospective datasets were derived by merging existing datasets collected by the clinical centers involved in the BRAINTEASER project.
- The ALS dataset was obtained by merging and homogenising the Piemonte and Valle d’Aosta Registry for Amyotrophic Lateral Sclerosis (PARALS, Chiò et al., 2017) and the Lisbon ALS clinic (CENTRO ACADÉMICO DE MEDICINA DE LISBOA, Centro Hospitalar Universitário de Lisboa-Norte, Hospital de Santa Maria, Lisbon, Portugal) dataset. Both datasets were initiated in 1995 and are currently maintained by researchers of the ALS Regional Expert Centre (CRESLA), University of Turin, and of the CENTRO ACADÉMICO DE MEDICINA DE LISBOA-Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa. They include demographic and clinical data, comprising both static and dynamic variables.
- The MS dataset was obtained from two sources: the Pavia MS clinical dataset, which was started in 1990 and contains demographic and clinical information that is continuously updated by the researchers of the Institute, and the Turin MS clinic dataset (Department of Neurosciences and Mental Health, Neurology Unit 1, Città della Salute e della Scienza di Torino).
- Retrospective environmental data are available at the individual subject level and were retrieved at two scales:
  - macroscale air pollution data come from public monitoring stations that cover the whole territory of the involved countries, accessed through the European Air Quality Portal;
  - local data come from a network of air quality sensors (PurpleAir - Outdoor Air Quality Monitor / PurpleAir PA-II) installed at different points in the city of Pavia (Italy).
  In both cases, the environmental data were already publicly available. To merge environmental data with individual subject locations, we used postcodes (the postcode of the pollutant monitoring station and the postcode of the subject's address). Data were merged following an anonymization procedure based on hash keys. Environmental exposure trajectories were pre-processed and aggregated to avoid fine temporal and spatial granularity, so that individual exposure information cannot disclose personal addresses.
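The postcode-based linkage described above can be illustrated with a minimal sketch. It is not the procedure actually used by the consortium: the salted SHA-256 hash, the monthly aggregation window, the function names, and all sample values are assumptions made purely for illustration.

```python
import hashlib
from collections import defaultdict
from statistics import mean

SALT = "example-salt"  # hypothetical secret; the real hashing scheme is not documented here


def hash_key(postcode: str, salt: str = SALT) -> str:
    """Turn a postcode into an opaque join key (assumed scheme: salted SHA-256)."""
    normalized = postcode.strip().upper()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def monthly_exposure(readings):
    """Aggregate (postcode, 'YYYY-MM-DD', value) readings into monthly means.

    Returns {(hashed_postcode, 'YYYY-MM'): mean_value}. Monthly granularity is
    an assumption standing in for the coarser aggregation the text mentions.
    """
    buckets = defaultdict(list)
    for postcode, day, value in readings:
        buckets[(hash_key(postcode), day[:7])].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def link_subjects(subjects, exposure):
    """Attach exposure series to subjects by matching hashed postcodes only."""
    linked = {}
    for subject_id, postcode in subjects:
        key = hash_key(postcode)
        linked[subject_id] = {
            month: value for (k, month), value in exposure.items() if k == key
        }
    return linked


# Made-up station readings and subjects, for demonstration only.
readings = [
    ("27100", "2020-01-03", 30.0),
    ("27100", "2020-01-17", 40.0),
    ("27100", "2020-02-02", 20.0),
    ("10126", "2020-01-05", 50.0),
]
subjects = [("S1", "27100"), ("S2", "10126")]

exposure = monthly_exposure(readings)
linked = link_subjects(subjects, exposure)
print(linked)
```

In this sketch the linked output carries only subject identifiers and aggregated monthly values; the raw postcodes are used solely to compute the join key and never appear in the result.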
The retrospective datasets are shared in two formats:
- RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO);
- CSV, as shared during the iDPP@CLEF 2022 and 2023 challenges, split into training and test.
Each format corresponds to a specific folder in the datasets, where a dedicated README file provides further details on the datasets. Note that the ALS dataset is split into multiple ZIP files due to the size of the environmental data.
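Because a dataset may be distributed as several ZIP files, a consumer typically has to collect the CSV rows across all parts. The sketch below shows one way to do this with the Python standard library; the member path `train/static.csv`, the column names, and the helper names are hypothetical, and the README file in each folder remains the authoritative description of the actual layout.

```python
import csv
import io
import zipfile


def read_csv_rows(archives, member_suffix):
    """Collect rows from every CSV member whose name ends with member_suffix.

    `archives` may contain file paths or binary file-like objects.
    """
    rows = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            for name in sorted(zf.namelist()):
                if name.endswith(member_suffix):
                    with zf.open(name) as fh:
                        text = io.TextIOWrapper(fh, encoding="utf-8", newline="")
                        rows.extend(csv.DictReader(text))
    return rows


def make_demo_zip(member_name, content):
    """Build an in-memory ZIP so the example runs without the real data."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(member_name, content)
    buf.seek(0)
    return buf


# Two fake archive parts with a hypothetical layout and made-up rows.
part1 = make_demo_zip("train/static.csv", "patient_id,sex\nP1,F\n")
part2 = make_demo_zip("train/static.csv", "patient_id,sex\nP2,M\n")

rows = read_csv_rows([part1, part2], "static.csv")
print(rows)
```

With the real data, the in-memory demo archives would be replaced by the paths of the downloaded ZIP files.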
Prospective Dataset
For the iDPP@CLEF 2024 challenge, the datasets contain prospective data about 86 ALS patients (static variables, ALSFRS-R questionnaires compiled by clinicians or by patients using the BRAINTEASER mobile application, sensor data).
The prospective datasets are shared in two formats:
- RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO);
- CSV, as shared during the iDPP@CLEF 2024 challenge, split into training and test.
Each format corresponds to a specific folder in the datasets, where a dedicated README file provides further details on the datasets. Note that the MS dataset is split into multiple ZIP files due to the size of the environmental data.
The BRAINTEASER Data Sharing Policy section below reports the details for requesting access to the datasets.
References
- Chiò A, Mora G, Moglia C, Manera U, Canosa A, Cammarosano S, Ilardi A, Bertuzzo D, Bersano E, Cugnasco P, Grassano M, Pisano F, Mazzini L, Calvo A (2017). Piemonte and Valle d'Aosta Register for ALS (PARALS). Secular Trends of Amyotrophic Lateral Sclerosis: The Piemonte and Valle d'Aosta Register. JAMA Neurol., 74(9):1097-1104. doi: 10.1001/jamaneurol.2017.1387
- Bergamaschi R, Monti MC, Trivelli L, Mallucci G, Gerosa L, Pisoni E, Montomoli C. (2021). PM2.5 exposure as a risk factor for multiple sclerosis. An ecological study with a Bayesian mapping approach. Environ Sci Pollut Res Int., 28(3):2804-2809, doi: 10.1007/s11356-020-10595-5
- Bergamaschi R, Monti MC, Trivelli L, Introcaso VP, Mallucci G, Borrelli P, Gerosa L, Montomoli C. (2020). Increased prevalence of multiple sclerosis and clusters of different disease risk in Northern Italy. Neurol Sci., 41(5):1089-1095, doi: 10.1007/s10072-019-04205-7
- Guazzo, A., Trescato, I., Longato, E., Hazizaj, E., Dosso, D., Faggioli, G., Di Nunzio, G. M., Silvello, G., Vettoretti, M., Tavazzi, E., Roversi, C., Fariselli, P., Madeira, S. C., de Carvalho, M., Gromicho, M., Chiò, A., Manera, U., Dagliati, A., Birolo, G., Aidos, H., Di Camillo, B., and Ferro, N. (2022). Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2022. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), pages 395–422. Lecture Notes in Computer Science (LNCS) 13390, Springer, Heidelberg, Germany. doi: 10.1007/978-3-031-13643-6_25
- Faggioli, G., Guazzo, A., Marchesin, S., Menotti, L., Trescato, I., Aidos, H., Bergamaschi, R., Birolo, G., Cavalla, P., Chiò, A., Dagliati, A., de Carvalho, M., Di Nunzio, G. M., Fariselli, P., García Dominguez, J. M., Gromicho, M., Longato, E., Madeira, S. C., Manera, U., Silvello, G., Tavazzi, E., Tavazzi, E., Vettoretti, M., Di Camillo, B., and Ferro, N. (2023). Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2023. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), pages 343-369. Lecture Notes in Computer Science (LNCS) 14163, Springer, Heidelberg, Germany. doi: 10.1007/978-3-031-42448-9_24
- Birolo, G., Bosoni, P., Faggioli, G., Aidos, H., Bergamaschi, R., Cavalla, P., Chiò, A., Dagliati, A., de Carvalho, M., Di Nunzio, G. M., Fariselli, P., Garcia Dominguez, J. M., Gromicho, M., Guazzo, A., Longato, E., Madeira, S., Manera, U., Marchesin, S., Menotti, L., Silvello, G., Tavazzi, E., Tavazzi, E., Trescato, I., Vettoretti, M., Di Camillo, B., and Ferro, N. (2024). Intelligent Disease Progression Prediction: Overview of iDPP@CLEF 2024. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024) – Part II, pages 118-139. Lecture Notes in Computer Science (LNCS) 14959, Springer, Heidelberg, Germany. doi: 10.1007/978-3-031-71908-0_6
- Faggioli, G., Menotti, L., Marchesin, S., Chiò, A., Dagliati, A., de Carvalho, M., Gromicho, M., Manera, U., Tavazzi, E., Di Nunzio, G. M., Silvello, G., and Ferro, N. (2024). An extensible and unifying approach to retrospective clinical data modeling: the BrainTeaser Ontology. Journal of Biomedical Semantics, 15:16:1-16:28. doi: 10.1186/s13326-024-00317-y
- Alves, I., Gromicho, M., Oliveira Santos, M., Pinto, S., Pronto-Laborinho, A., Swash, M., and de Carvalho, M. (2023). Demographic changes in a large motor neuron disease cohort in Portugal: a 27 year experience. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, 24(7–8), 614–624. doi: 10.1080/21678421.2023.2220747