Dataset Open Access

# Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests

Cabitza, Federico; Campagner, Andrea; Ferrari, Davide; Di Resta, Chiara; Ceriotti, Daniele; Sabetta, Eleonora; Colombini, Alessandra; De Vecchi, Elena; Banfi, Giuseppe; Locatelli, Massimo; Carobene, Anna

### Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:creator>Cabitza, Federico</dc:creator>
<dc:creator>Campagner, Andrea</dc:creator>
<dc:creator>Ferrari, Davide</dc:creator>
<dc:creator>Di Resta, Chiara</dc:creator>
<dc:creator>Ceriotti, Daniele</dc:creator>
<dc:creator>Sabetta, Eleonora</dc:creator>
<dc:creator>Colombini, Alessandra</dc:creator>
<dc:creator>De Vecchi, Elena</dc:creator>
<dc:creator>Banfi, Giuseppe</dc:creator>
<dc:creator>Locatelli, Massimo</dc:creator>
<dc:creator>Carobene, Anna</dc:creator>
<dc:date>2020-10-12</dc:date>
<dc:description>The .xlsx dataset includes all patients used for training, internal-external validation, and external validation. The three groups can be distinguished by the ID in the first column of the dataset: IDs in the format Axxxx-&lt;Date&gt; are the data used for training, IDs in the format 20xx are the data used for internal-external validation, and all remaining IDs are the data used for external validation.

The features are encoded as follows: for the Target feature, the value 1 stands for "Positive to COVID-19" and the value 0 for "Negative to COVID-19"; for the Sex feature, the value 1 stands for "Male" and the value 0 for "Female".

The full article is available at: https://www.degruyter.com/view/journals/cclm/ahead-of-print/article-10.1515-cclm-2020-1294/article-10.1515-cclm-2020-1294.xml.

A preprint version of the article is also available on medRxiv: https://www.medrxiv.org/content/10.1101/2020.10.02.20205070v1

ABSTRACT

Background The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19), presents with known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative rates around 15–20%, and expensive equipment. The hematochemical values of routine blood exams could represent a faster and less expensive alternative.

Methods Three different training data sets of hematochemical values from 1,624 patients (52% COVID-19 positive), admitted at San Raphael Hospital (OSR) from February to May 2020, were used for developing machine learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical, coagulation, hemogasanalysis and CO-Oxymetry values, age, sex and specific symptoms at triage) and two sub-datasets (the COVID-specific and CBC datasets, with 32 and 21 features respectively). 58 cases (50% COVID-19 positive) from another hospital, and 54 negative patients collected in 2018 at OSR, were used for internal-external and external validation.

Results We developed five ML models: for the complete OSR dataset, the area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: respectively, AUC from 0.75 to 0.78; and specificity from 0.92 to 0.96.

Conclusions ML can be applied to blood tests as both an adjunct and alternative method to rRT-PCR for the fast and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries, or in countries facing an increase in contagions.</dc:description>
<dc:identifier>https://zenodo.org/record/4081318</dc:identifier>
<dc:identifier>10.1515/cclm-2020-1294</dc:identifier>
<dc:identifier>oai:zenodo.org:4081318</dc:identifier>
<dc:relation>url:https://zenodo.org/communities/covid-19</dc:relation>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
<dc:title>Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests</dc:title>
<dc:type>info:eu-repo/semantics/other</dc:type>
<dc:type>dataset</dc:type>
</oai_dc:dc>
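
The ID convention and value encodings given in the record's description can be sketched in Python as below. This is only an illustration, not code shipped with the dataset: the record gives just the shapes "Axxxx-<Date>" and "20xx", so the regular expressions (digit counts, date layout) are assumptions, and the names `split_for_id`, `TARGET_LABELS`, and `SEX_LABELS` are hypothetical.

```python
import re

# Assumed patterns: the description only states the shapes "Axxxx-<Date>" and
# "20xx"; the exact number of digits and the date layout are guesses.
TRAINING_ID = re.compile(r"^A\d+-.+$")         # e.g. "A0001-<Date>"
INTERNAL_EXTERNAL_ID = re.compile(r"^20\d+$")  # e.g. an ID beginning with "20"

# Encodings stated in the record's description.
TARGET_LABELS = {1: "Positive to COVID-19", 0: "Negative to COVID-19"}
SEX_LABELS = {1: "Male", 0: "Female"}


def split_for_id(patient_id):
    """Return the data split implied by an ID from the first column."""
    text = str(patient_id).strip()
    if TRAINING_ID.match(text):
        return "training"
    if INTERNAL_EXTERNAL_ID.match(text):
        return "internal-external validation"
    # Per the description, every remaining ID is external validation data.
    return "external validation"
```

Before relying on these patterns, it would be worth checking them against the actual first column of the .xlsx file, since spreadsheet readers may load numeric IDs as numbers rather than strings.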

 Views 1,066 Downloads 474 Data volume 144.0 MB Unique views 962 Unique downloads 394