CodiEsp-abstracts: Abstracts from Lilacs and Ibecs with ICD10 codes
- 1. Barcelona Supercomputing Center
Description
JSON file with abstracts from Lilacs and Ibecs, each with its associated ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish).
Please cite us:
Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. In: CLEF (Working Notes) (2020)
@inproceedings{miranda2020overview,
title={Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020},
author={Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estap{\'e}, Jordi and Krallinger, Martin},
booktitle={Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings},
year={2020} }
The Lilacs and Ibecs databases provide MeSH terms describing some of their documents. Using the UMLS Metathesaurus, those MeSH terms have been translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.
In addition, each MeSH code given by the databases (Lilacs and Ibecs) comes with a "word" describing it. These "words" have been used to add further ICD10 codes: we applied strict string matching to check whether each "word" was a descriptor of any ICD10 code (in the Spanish version, CIE10).
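The strict string matching step can be pictured with the following minimal Python sketch. It is not the code used to build the dataset: the function names, the toy descriptor table, and the choice of plain equality without any case or accent normalization are all assumptions made for illustration.

```python
def build_descriptor_index(cie10_descriptors):
    """Map each descriptor string to the list of CIE10 codes that carry it."""
    index = {}
    for code, descriptor in cie10_descriptors.items():
        index.setdefault(descriptor, []).append(code)
    return index


def match_word(word, index):
    """Return the CIE10 codes whose descriptor equals `word` exactly.

    Assumption: "strict" is read here as plain string equality, with no
    lowercasing, accent stripping, or partial matching.
    """
    return index.get(word, [])


# Toy descriptor table, invented for illustration (not taken from the dataset).
toy_descriptors = {"A00": "Cólera", "X99": "Descriptor de ejemplo"}
toy_index = build_descriptor_index(toy_descriptors)

print(match_word("Cólera", toy_index))  # exact match
print(match_word("cólera", toy_index))  # differs in case, so no match
```

Under this reading, a "word" that differs from a descriptor only in capitalization or accents would not add a code.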
The format of the JSON file is the following:
{'articles': [{'title': 'title', 'pmid': 'pmid', 'abstractText': 'abstract (in Spanish)', 'Mesh': [{'Code': 'MeSHCode', 'Word': 'reference', 'CIE': [CIE10_1, CIE10_2, ...]}, ...] }, ...] }
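A minimal Python sketch of how this structure might be consumed follows. The helper names `load_abstracts` and `iter_article_codes` are ours, the sample record is invented, and the name of the JSON file inside the archive is not stated here, so the path is left to the caller. The sketch also assumes the file on disk is standard JSON (double-quoted), even though the schema above is shown with single quotes.

```python
import json


def load_abstracts(path):
    """Load the JSON file; `path` is whatever the extracted file is called."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_article_codes(data):
    """Yield (pmid, sorted unique CIE10 codes) for every article."""
    for article in data.get("articles", []):
        codes = set()
        for mesh in article.get("Mesh", []):
            codes.update(mesh.get("CIE", []))
        yield article["pmid"], sorted(codes)


# Invented record that follows the documented keys (values are placeholders).
sample = {
    "articles": [
        {
            "title": "Título de ejemplo",
            "pmid": "0001",
            "abstractText": "Resumen de ejemplo.",
            "Mesh": [
                {"Code": "M1", "Word": "palabra", "CIE": ["b01", "a00"]},
                {"Code": "M2", "Word": "otra", "CIE": ["a00"]},
            ],
        }
    ]
}

for pmid, codes in iter_article_codes(sample):
    print(pmid, codes)
```

Collecting the codes in a set removes duplicates when several MeSH entries of the same article point to the same CIE10 code.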
Additionally, the compressed file includes a folder with all the abstracts extracted in individual UTF-8 encoded text files and a tab-separated file with 4 fields:
pmid label cie10-code word
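As a hedged sketch, the tab-separated file could be read in Python as shown below. The function name `codes_by_pmid` is ours, the sample rows are invented, and whether the real file starts with a header row is not stated here, so that is exposed as a parameter.

```python
import csv
import io
from collections import defaultdict

# Field order as documented above.
FIELDS = ["pmid", "label", "cie10-code", "word"]


def codes_by_pmid(handle, has_header=False):
    """Group CIE10 codes by pmid from an open tab-separated text stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    grouped = defaultdict(set)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # ignore blank or malformed lines
        record = dict(zip(FIELDS, row))
        grouped[record["pmid"]].add(record["cie10-code"])
    return dict(grouped)


# Invented rows for illustration only.
sample_tsv = (
    "0001\tL1\ta00\tpalabra\n"
    "0001\tL2\tb01\totra\n"
    "0002\tL1\ta00\tpalabra\n"
)
print(codes_by_pmid(io.StringIO(sample_tsv)))
```

For the real file, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `codes_by_pmid`.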
Summary statistics:
- number of abstracts: 355 840
- number of abstracts with at least one ICD10 code: 176 294
- percentage of MeSH codes mapped to ICD10: 10.6% (of 2 526 772 MeSH codes, 266 949 were mapped to ICD10)
- average number of MeSH codes per article: 7.1
- average number of ICD10 codes per article: 2.5
- number of ICD10 codes that have an associated MeSH code in UMLS: 3293
- number of ICD10 codes that have an associated MeSH code in UMLS and appear in this dataset: 3082
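The derived figures in the list above can be checked against the raw counts with a few lines of Python. The variable names are ours, and reading the average number of MeSH codes per article as total MeSH codes divided by the total number of abstracts is an assumption, although it agrees with the reported value.

```python
# Raw counts copied from the summary statistics above.
n_abstracts = 355_840
n_mesh_codes = 2_526_772
n_mesh_mapped = 266_949

# Share of MeSH codes that were mapped to ICD10, in percent.
pct_mapped = round(100 * n_mesh_mapped / n_mesh_codes, 1)

# Assumed definition: total MeSH codes divided by total abstracts.
mesh_per_article = round(n_mesh_codes / n_abstracts, 1)

print(pct_mapped)        # 10.6
print(mesh_per_article)  # 7.1
```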
Files
abstractsWithCIE10_v2.zip (185.3 MB)
md5:2c03ec12e609389842aad60cc915308d