10.5281/zenodo.3606626
https://zenodo.org/records/3606626
oai:zenodo.org:3606626
Antonio Miranda
Barcelona Supercomputing Center
Ankush Rana
Barcelona Supercomputing Center
Martin Krallinger
Barcelona Supercomputing Center
Abstracts from Lilacs and Ibecs with ICD10 codes
Zenodo
2020
2020-01-13
10.5281/zenodo.3606625
https://zenodo.org/communities/medicalnlp
1.0
Creative Commons Attribution 4.0 International
JSON file with abstracts from Lilacs and Ibecs, each annotated with its associated ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish).
These databases describe some of their documents with MeSH terms. Using the UMLS Metathesaurus, those MeSH terms have been translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.
In addition, each MeSH code given by the databases (Lilacs and Ibecs) has a "word" describing it. These "words" have been used to add further ICD10 codes: strict string matching was applied to check whether each "word" is a descriptor of any ICD10 code (in the Spanish version, CIE10).
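The strict matching step could look roughly like the sketch below. It is an illustration only, not the authors' code: the helper names `build_descriptor_index` and `match_word` are hypothetical, the descriptor table is placeholder data, and "strict" is assumed here to mean exact string equality after trimming surrounding whitespace (the record does not say whether case or accents were normalized).

```python
def build_descriptor_index(cie10_descriptors):
    """Invert {code: descriptor} into {descriptor: [codes]} for exact lookup.

    Hypothetical helper: the input is assumed to map each CIE10 code
    to its Spanish descriptor string.
    """
    index = {}
    for code, descriptor in cie10_descriptors.items():
        index.setdefault(descriptor.strip(), []).append(code)
    return index


def match_word(word, index):
    """Return the codes whose descriptor equals `word` exactly, else []."""
    return index.get(word.strip(), [])


# Placeholder descriptors, not real CIE10 entries.
toy_descriptors = {
    "CIE10_1": "descriptor a",
    "CIE10_2": "descriptor b",
    "CIE10_3": "descriptor a",
}
toy_index = build_descriptor_index(toy_descriptors)
exact_hits = match_word("descriptor a", toy_index)
partial_hits = match_word("descriptor", toy_index)
```

Under this assumption a partial or differently cased "word" yields no codes, so the step can only add codes when the MeSH "word" and the CIE10 descriptor are identical.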
The format of the JSON file is the following:
{'articles':
    [{'title': 'title',
      'pmid': 'pmid',
      'abstractText': 'abstract (in Spanish)',
      'Mesh':
          [{'Code': 'MeSHCode',
            'Word': 'reference',
            'CIE': [CIE10_1, CIE10_2, ...]},
           ...]
     },
     ...]
}
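As a minimal sketch of how this layout can be read, the snippet below parses a small inline sample and gathers the CIE10 codes of each article. It assumes the distributed file is standard JSON (double-quoted strings) with the keys shown above; the sample values are placeholders and `codes_per_article` is a hypothetical helper name.

```python
import json

# Placeholder sample that follows the documented layout.
sample = """
{"articles": [
  {"title": "title",
   "pmid": "pmid",
   "abstractText": "abstract (in Spanish)",
   "Mesh": [
     {"Code": "MeSHCode", "Word": "reference",
      "CIE": ["CIE10_1", "CIE10_2"]},
     {"Code": "MeSHCode2", "Word": "reference2",
      "CIE": ["CIE10_2", "CIE10_3"]}
   ]}
]}
"""


def codes_per_article(corpus):
    """Map each article's pmid to the sorted, de-duplicated CIE10 codes
    found across all of its MeSH entries."""
    result = {}
    for article in corpus["articles"]:
        codes = set()
        for mesh in article.get("Mesh", []):
            codes.update(mesh.get("CIE", []))
        result[article["pmid"]] = sorted(codes)
    return result


corpus = json.loads(sample)
article_codes = codes_per_article(corpus)
```

For the real file, replacing `json.loads(sample)` with `json.load` on an opened file handle should be enough, provided the file matches the layout described here.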
Corpus statistics:
There are 176 294 abstracts.
On average, every abstract has 2.5 associated ICD10 codes.
There are 3103 unique ICD10 codes (ICD10-CM and ICD10-PCS).
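Figures of this kind can be recomputed from a loaded corpus along the lines of the sketch below. It assumes the JSON layout described above and counts distinct codes per abstract; how the authors counted codes repeated across MeSH entries is not stated, so the average may differ slightly under another convention. `corpus_statistics` is a hypothetical helper and the toy corpus is placeholder data.

```python
def corpus_statistics(corpus):
    """Count abstracts, mean distinct CIE10 codes per abstract,
    and distinct CIE10 codes overall."""
    articles = corpus["articles"]
    # One set of distinct codes per article.
    per_article = [
        {code for mesh in a.get("Mesh", []) for code in mesh.get("CIE", [])}
        for a in articles
    ]
    unique_codes = set().union(*per_article)
    n = len(articles)
    mean = sum(len(codes) for codes in per_article) / n if n else 0.0
    return {
        "abstracts": n,
        "mean_codes_per_abstract": mean,
        "unique_codes": len(unique_codes),
    }


# Placeholder corpus: two articles with 2 and 3 distinct codes.
toy_corpus = {
    "articles": [
        {"pmid": "1", "Mesh": [{"Code": "M1", "Word": "w1", "CIE": ["A", "B"]}]},
        {"pmid": "2", "Mesh": [{"Code": "M2", "Word": "w2", "CIE": ["B", "C"]},
                               {"Code": "M3", "Word": "w3", "CIE": ["C", "D"]}]},
    ]
}
stats = corpus_statistics(toy_corpus)
```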
Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).