Published June 9, 2021 | Version 1.0.0
Dataset Open

CODE-15%: a large scale annotated dataset of 12-lead ECGs

Description

A dataset of 12-lead ECGs with annotations. The dataset contains 345 779 exams from 233 770 patients. It was obtained through stratified sampling from the CODE dataset (15% of the patients). The data was collected by the Telehealth Network of Minas Gerais between 2010 and 2016.

This repository contains the file `exams.csv` and the files `exams_part{i}.zip` for i = 0, 1, 2, ..., 17.

  • "exams.csv": a comma-separated values (csv) file containing the columns:
    • "exam_id": id used for identifying the exam;
    • "age": patient age in years at the moment of the exam;
    • "is_male": true if the patient is male;
    • "nn_predicted_age": age predicted for the patient by a neural network, as described in the paper "Deep neural network estimated electrocardiographic-age as a mortality predictor" below;
    • "1dAVb": Whether or not the patient has 1st degree AV block;
    • "RBBB": Whether or not the patient has right bundle branch block;
    • "LBBB": Whether or not the patient has left bundle branch block;
    • "SB": Whether or not the patient has sinus bradycardia;
    • "AF": Whether or not the patient has atrial fibrillation;
    • "ST": Whether or not the patient has sinus tachycardia;
    • "patient_id": id used for identifying the patient;
    • "normal_ecg": true if the patient has a normal ECG;
    • "death": true if the patient died during the follow-up time. This field is available only for the patient's first exam; it is empty for the other exams;
    • "timey": the time to death if the patient died, otherwise the follow-up time. This field is available only for the patient's first exam; it is empty for the other exams;
    • "trace_file": identifies the hdf5 file in which the tracing corresponding to this exam is located.
  • "exams_part{i}.hdf5": an HDF5 file containing two datasets, one named `tracings` and the other named `exam_id`. The `exam_id` dataset is a tensor of dimension `(N,)` containing the exam ids (the same as in the csv file), and the `tracings` dataset is a `(N, 4096, 12)` tensor containing the ECG tracings in the same order. The first dimension corresponds to the different exams; the second dimension corresponds to the 4096 signal samples; the third dimension corresponds to the 12 leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). To make them all the same size (4096 samples), we pad them with zeros on both sides. For instance, for a 7-second ECG signal with 2800 samples, we include 648 zero samples at the beginning and 648 zero samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset.

    In Python, one can read this file using h5py.
    ```python
    import h5py
    import numpy as np

    f = h5py.File(path_to_file, 'r')
    # Exam ids, stored in the same order as the tracings
    traces_ids = np.array(f['exam_id'])
    # Handle to the on-disk dataset of shape (N, 4096, 12); no data is loaded yet
    x = f['tracings']
    ```
    The `tracings` dataset is too large to fit in memory, so don't convert it to a numpy array all at once.
    It is possible to access a chunk of it using: ``x[start:end, :, :]``.
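
The zero-padding arithmetic described above can be illustrated with a small sketch. It is not the authors' preprocessing code: the helper names `pad_symmetric` and `strip_padding` are hypothetical, the input is synthetic random data rather than a real tracing, and it assumes the original signal length (2800 or 4000 samples) is known to the caller.

```python
import numpy as np

TARGET_LEN = 4096  # samples per tracing, as stored in the hdf5 files


def pad_symmetric(signal, target_len=TARGET_LEN):
    """Zero-pad a (n_samples, n_leads) array equally on both sides (hypothetical helper)."""
    total = target_len - signal.shape[0]
    if total < 0:
        raise ValueError("signal is longer than the target length")
    before = total // 2
    after = total - before  # equals `before` for the 2800- and 4000-sample cases
    return np.pad(signal, ((before, after), (0, 0)), mode="constant")


def strip_padding(padded, original_len):
    """Return the central `original_len` samples of a symmetrically padded array."""
    before = (padded.shape[0] - original_len) // 2
    return padded[before:before + original_len, :]


# Synthetic 7-second, 12-lead signal at 400 Hz (random values, not real ECG data).
rng = np.random.default_rng(0)
short = rng.standard_normal((7 * 400, 12))

padded = pad_symmetric(short)                      # 648 zeros on each side
recovered = strip_padding(padded, short.shape[0])  # back to 2800 samples
```

For a 10-second signal the same arithmetic gives (4096 - 4000) / 2 = 48 zero samples on each side.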

The CODE dataset was collected by the Telehealth Network of Minas Gerais (TNMG) in the period between 2010 and 2016. TNMG is a public telehealth system assisting 811 out of the 853 municipalities in the state of Minas Gerais, Brazil. The dataset is described in:

Ribeiro, Antônio H., Manoel Horta Ribeiro, Gabriela M. M. Paixão, Derick M. Oliveira, Paulo R. Gomes, Jéssica A. Canazart, Milton P. S. Ferreira, et al. “Automatic Diagnosis of the 12-Lead ECG Using a Deep Neural Network.” Nature Communications 11, no. 1 (2020): 1760. https://doi.org/10.1038/s41467-020-15432-4

The CODE-15% dataset is obtained by stratified sampling from the CODE dataset. This subset of the CODE dataset is described, and used for assessing model performance, in:
"Deep neural network estimated electrocardiographic-age as a mortality predictor"
Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, Sandhi M Barreto, Wagner Meira Jr, Thomas B Schön, Antonio Luiz P Ribeiro. MedRXiv (2021) https://www.doi.org/10.1101/2021.02.19.21251232

The companion code for reproducing the experiments in the two papers described above can be found, respectively, in:
- https://github.com/antonior92/automatic-ecg-diagnosis; and
- https://github.com/antonior92/ecg-age-prediction.

Note about authorship: Antônio H. Ribeiro, Emilly M. Lima and Gabriela M.M. Paixão contributed equally to this work.

Files (46.3 GB)

md5:0107516d3f63864498fb77d15799cc95 (35.5 MB)
md5:2bed0dc753d16beef8c2f7627e2b6ea4 (2.7 GB)
md5:b32446cdb93247d07550509a204a061d (2.7 GB)
md5:26bf9e387289dabbb140c0453853872b (2.7 GB)
md5:dd99137b6c199c9558bc2c2b6ae0e4dc (2.7 GB)
md5:c25b38260fb46edff089fa56eb442ddb (2.7 GB)
md5:5472ba3186e39bd03e888e486143bf7a (2.7 GB)
md5:0a0458779f5e795df20cb07db6e50682 (2.7 GB)
md5:cf99b9c54cf7c15b9b683511cd5e6d5a (2.7 GB)
md5:64c210e19cf1c0abac3e643f88708ebd (2.7 GB)
md5:6e9ac2e36197c4d301df91d7dc6877c0 (782.6 MB)
md5:e2862a75eeb6245b148c6c520245c0e0 (2.7 GB)
md5:623e073c1e1323cb69a219e5f9bedaf3 (2.7 GB)
md5:e5f958b7c31bd82a6bc76fcf6aed5713 (2.7 GB)
md5:2730cb080d03d61681ded6609c18c9d8 (2.7 GB)
md5:4cd7537330f0a62ffcc6cc40c63774eb (2.7 GB)
md5:aec8d97e8dde9ac15f5c671c168fdfbe (2.7 GB)
md5:96adb16ee032e4e41528c04fd7b582ee (2.7 GB)
md5:8d74d7032396298304b51e447b73d40f (2.7 GB)
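
Since the files are large, it may be worth checking a download against the md5 values in the file listing above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and which checksum belongs to which file has to be taken from the repository page, since the listing here does not pair them with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:0107...' (prefix optional)."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use small even for the multi-gigabyte archives.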