Dataset Open Access

Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests

Cabitza, Federico; Campagner, Andrea; Ferrari, Davide; Di Resta, Chiara; Ceriotti, Daniele; Sabetta, Eleonora; Colombini, Alessandra; De Vecchi, Elena; Banfi, Giuseppe; Locatelli, Massimo; Carobene, Anna


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/c9e39fc7-b83c-4f8c-b5eb-2173b1f9a2d9/all_training.xlsx"
      }, 
      "checksum": "md5:a0a7a04a687c019a17330c6b9f97fc66", 
      "bucket": "c9e39fc7-b83c-4f8c-b5eb-2173b1f9a2d9", 
      "key": "all_training.xlsx", 
      "type": "xlsx", 
      "size": 303721
    }
  ], 
  "owners": [
    106636
  ], 
  "doi": "10.1515/cclm-2020-1294", 
  "stats": {
    "version_unique_downloads": 394.0, 
    "unique_views": 962.0, 
    "views": 1066.0, 
    "version_views": 1066.0, 
    "unique_downloads": 394.0, 
    "version_unique_views": 962.0, 
    "volume": 143963754.0, 
    "version_downloads": 474.0, 
    "downloads": 474.0, 
    "version_volume": 143963754.0
  }, 
  "links": {
    "doi": "https://doi.org/10.1515/cclm-2020-1294", 
    "latest_html": "https://zenodo.org/record/4081318", 
    "bucket": "https://zenodo.org/api/files/c9e39fc7-b83c-4f8c-b5eb-2173b1f9a2d9", 
    "badge": "https://zenodo.org/badge/doi/10.1515/cclm-2020-1294.svg", 
    "html": "https://zenodo.org/record/4081318", 
    "latest": "https://zenodo.org/api/records/4081318"
  }, 
  "created": "2020-10-12T13:14:13.172108+00:00", 
  "updated": "2021-04-14T10:16:11.722777+00:00", 
  "conceptrecid": "4081317", 
  "revision": 7, 
  "id": 4081318, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.1515/cclm-2020-1294", 
    "description": "<p>The .xlsx dataset includes all patients used for training, internal-external and external validation: these can be distinguished by looking at the ID (first column) in the dataset: those in format Axxxx-&lt;Date&gt; are the data used for the training, those in the format 20xx are the data used for the internal-external validation, while the remaining data were used for external validation.</p>\n\n<p>As regards the features: for the Target feature the value 1 stands for &quot;Positive to COVID-19&quot; while the value 0 stands for &quot;Negative to COVID-19&quot;; while for the Sex feature the value 1 stands for &quot;Male&quot; while the value 0 stands for &quot;Female&quot;.</p>\n\n<p>The full article is available at: https://www.degruyter.com/view/journals/cclm/ahead-of-print/article-10.1515-cclm-2020-1294/article-10.1515-cclm-2020-1294.xml.</p>\n\n<p>A pre-print version of the article is also available on MedrXiv:&nbsp;https://www.medrxiv.org/content/10.1101/2020.10.02.20205070v1</p>\n\n<p><strong>ABSTRACT</strong></p>\n\n<p><strong>Background</strong> The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19),&nbsp;presents with known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative&nbsp;rates around 15&ndash;20%, and expensive equipment. 
The hematochemical values of routine blood exams could&nbsp;represent a faster and less expensive alternative.&nbsp;</p>\n\n<p><strong>Methods</strong> Three different training data set of hematochemical values from 1,624 patients (52% COVID-19&nbsp;positive), admitted at San Raphael Hospital (OSR) from February to May 2020, were used for developing machine&nbsp;learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical,&nbsp;coagulation, hemogasanalysis and CO-Oxymetry values, age, sex and specific symptoms at triage) and two sub&nbsp;datasets (COVID-specific and CBC dataset, 32 and 21 features respectively). 58 cases (50% COVID-19 positive)&nbsp;from another hospital, and 54 negative patients collected in 2018 at OSR, were used for internal-external and external validation.</p>\n\n<p><strong>Results</strong> We developed five ML models: for the complete OSR dataset, the area under the receiver operating&nbsp;characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 15 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: respectively, AUC 16 from 0.75 to 0.78; and specificity from 0.92 to 0.96.&nbsp;</p>\n\n<p><strong>Conclusions</strong> ML can be applied to blood tests as both an adjunct and alternative method to rRT-PCR for the fast&nbsp;and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries,&nbsp;or in countries facing an increase in contagions.</p>", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "title": "Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "4081317"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "4081318"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "covid-19"
      }
    ], 
    "publication_date": "2020-10-12", 
    "creators": [
      {
        "affiliation": "DISCo, Universit\u00e0 degli Studi di Milano-Bicocca, Viale Sarca 336, Milano, 20126, Italy", 
        "name": "Cabitza, Federico"
      }, 
      {
        "affiliation": "IRCCS Istituto Ortopedico Galeazzi, Orthopaedic Biotechnology Lab, Via Riccardo Galeazzi, 4, 20161, Milano, Italy", 
        "name": "Campagner, Andrea"
      }, 
      {
        "affiliation": "SCVSA Department, University of Parma, Parco Area delle Science 11/a, 43124, Parma, Italy", 
        "name": "Ferrari, Davide"
      }, 
      {
        "affiliation": "Vita-Salute San Raffaele University; Unit of Genomics for Human Disease Diagnosis, Division of Genetics and Cell Biology., Via Olgettina 58, 20132, Milan, Italy", 
        "name": "Di Resta, Chiara"
      }, 
      {
        "affiliation": "Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Via Olgettina 60, 20132, Milan, Italy", 
        "name": "Ceriotti, Daniele"
      }, 
      {
        "affiliation": "Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Via Olgettina 60, 20132, Milan, Italy", 
        "name": "Sabetta, Eleonora"
      }, 
      {
        "affiliation": "IRCCS Istituto Ortopedico Galeazzi, Orthopaedic Biotechnology Lab, Via Riccardo Galeazzi, 4, 20161, Milano, Italy", 
        "name": "Colombini, Alessandra"
      }, 
      {
        "affiliation": "IRCCS Istituto Ortopedico Galeazzi, Orthopaedic Biotechnology Lab, Via Riccardo Galeazzi, 4, 20161, Milano, Italy", 
        "name": "De Vecchi, Elena"
      }, 
      {
        "affiliation": "IRCCS Istituto Ortopedico Galeazzi, Orthopaedic Biotechnology Lab, Via Riccardo Galeazzi, 4, 20161, Milano, Italy", 
        "name": "Banfi, Giuseppe"
      }, 
      {
        "affiliation": "Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Via Olgettina 60, 20132, Milan, Italy", 
        "name": "Locatelli, Massimo"
      }, 
      {
        "affiliation": "Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Via Olgettina 60, 20132, Milan, Italy", 
        "name": "Carobene, Anna"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }
  }
}
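
The record description says that the three cohorts are told apart by the ID in the first column. A minimal Python sketch of that rule follows. The regular expressions are assumptions inferred from the wording "Axxxx-<Date>" and "20xx" only (the "x" characters are taken to be digits); the function name `cohort_from_id` and the sample IDs are hypothetical and should be checked against the actual spreadsheet.

```python
import re

# Assumed patterns, inferred from the record description only.
TRAINING_ID = re.compile(r"^A\d+-")    # "Axxxx-<Date>", x assumed to be a digit
INT_EXT_ID = re.compile(r"^20\d{2}$")  # "20xx", x assumed to be a digit


def cohort_from_id(patient_id) -> str:
    """Map an ID from the first column to the cohort named in the description."""
    # Spreadsheet readers may return purely numeric IDs as floats (e.g. 2018.0).
    if isinstance(patient_id, float) and patient_id.is_integer():
        patient_id = int(patient_id)
    text = str(patient_id).strip()
    if TRAINING_ID.match(text):
        return "training"
    if INT_EXT_ID.match(text):
        return "internal-external validation"
    # Per the description, everything else was used for external validation.
    return "external validation"


if __name__ == "__main__":
    # Made-up IDs for illustration; they are not taken from the dataset.
    for sample in ["A0001-2020-03-01", 2018, "X17"]:
        print(sample, "->", cohort_from_id(sample))
```

If the file is loaded into a table, the same function can be applied to the first column to add a cohort label before any model is fitted.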
Views 1,066
Downloads 474
Data volume 144.0 MB
Unique views 962
Unique downloads 394
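
The `files` entry in the JSON export lists an MD5 checksum for `all_training.xlsx`. A downloaded copy can be checked against it with a short Python sketch like the one below. The helper names `md5_of_file` and `matches_record` and the local file path are assumptions for illustration; only the checksum string comes from the record.

```python
import hashlib

# Checksum string copied from the "files" entry of the record.
EXPECTED = "md5:a0a7a04a687c019a17330c6b9f97fc66"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED):
    """Compare a local file against a checksum of the form 'md5:<hex>'."""
    algorithm, _, hex_digest = expected.partition(":")
    if algorithm != "md5":
        raise ValueError(f"unsupported checksum algorithm: {algorithm!r}")
    return md5_of_file(path) == hex_digest.lower()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python verify.py all_training.xlsx
    if len(sys.argv) > 1:
        print("checksum matches:", matches_record(sys.argv[1]))
```

A mismatch usually means an incomplete download or a different version of the file than the one this record describes.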
