Published January 13, 2020 | Version 2.0
Dataset Open

CodiEsp-abstracs: Abstracts from Lilacs and Ibecs with ICD10 codes

  • 1. Barcelona Supercomputing Center

Description

JSON file with abstracts from Lilacs and Ibecs and the ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish) associated with them.

 

Please cite us:

Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: CLEF (Working Notes) (2020)

@inproceedings{miranda2020overview, 
title={Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020}, 
author={Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estap{\'e}, Jordi and Krallinger, Martin}, 
booktitle={Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings}, 
year={2020} }

 

The Lilacs and Ibecs databases provide MeSH terms describing some of their documents. Using the UMLS Metathesaurus, those MeSH terms were translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.

In addition, each MeSH code given by the databases (Lilacs and Ibecs) has a "word" describing it. These "words" were used to add further ICD10 codes: we applied strict string matching to check whether each "word" is a descriptor of any ICD10 code (in the Spanish version, CIE10).
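The strict string matching step can be pictured with the minimal sketch below. It is an illustration only, not the code used to build the dataset: the `descriptor_to_codes` dictionary, the `match_word` helper, the sample entries, and the choice to compare strings exactly with no case or accent normalization are all assumptions.

```python
from typing import Dict, List

# Hypothetical lookup from a CIE10 (Spanish ICD10) descriptor to its codes.
# The entries are illustrative placeholders, not taken from the dataset.
descriptor_to_codes: Dict[str, List[str]] = {
    "asma": ["j45"],
    "cólera": ["a00"],
}


def match_word(word: str, lookup: Dict[str, List[str]]) -> List[str]:
    """Return the codes whose descriptor is exactly equal to `word`.

    "Strict" is interpreted here as exact equality: no lowercasing,
    stemming, or partial matching is applied (an assumption).
    """
    return list(lookup.get(word, []))


exact_hit = match_word("asma", descriptor_to_codes)
partial_miss = match_word("asma bronquial", descriptor_to_codes)
print(exact_hit)
print(partial_miss)
```

Under this reading, a "word" that only contains a descriptor, or differs from it in case, adds no code.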

The format of the JSON file is the following:

{'articles':
	[{'title': 'title',
	'pmid': 'pmid',
	'abstractText': 'abstract (in Spanish)',
	'Mesh':
		[{'Code': 'MeSHCode',
		'Word': 'reference',
		'CIE': [CIE10_1, CIE10_2, ...]},
		...]
	},
	...]
}
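As a minimal sketch of how this layout might be traversed, the snippet below collects the unique CIE10 codes of each article. It assumes the file parses as standard JSON with the keys shown above; the `iter_article_codes` helper and the tiny `sample` document are invented for illustration and are not part of the dataset.

```python
import json
from typing import Dict, Iterator, List, Tuple


def iter_article_codes(data: Dict) -> Iterator[Tuple[str, List[str]]]:
    """Yield (pmid, sorted unique CIE10 codes) for each article."""
    for article in data["articles"]:
        codes = set()
        # Each MeSH entry carries its own list of CIE10 codes.
        for mesh in article.get("Mesh", []):
            codes.update(mesh.get("CIE", []))
        yield article["pmid"], sorted(codes)


# Invented document that follows the layout described above.
sample = json.loads("""
{"articles": [
  {"title": "t1", "pmid": "p1", "abstractText": "texto",
   "Mesh": [{"Code": "D000001", "Word": "w1", "CIE": ["b02", "a01"]},
            {"Code": "D000002", "Word": "w2", "CIE": ["a01"]}]}
]}
""")

pairs = list(iter_article_codes(sample))
for pmid, codes in pairs:
    print(pmid, codes)
```

For the real file, `sample` would be replaced by the result of `json.load` on the JSON file extracted from the archive, opened with `encoding="utf-8"`.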

 

Additionally, the compressed file includes a folder with all the abstracts extracted in individual UTF-8 encoded text files and a tab-separated file with 4 fields:

pmid    label    cie10-code    word
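The tab-separated file could be read along the following lines. This is a sketch under assumptions: it supposes the file has no header row (skip the first line if it does), it treats the four fields purely by position, and the `codes_by_pmid` helper and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Field names in the order listed above.
FIELDS = ["pmid", "label", "cie10-code", "word"]


def codes_by_pmid(handle):
    """Group CIE10 codes by pmid from a tab-separated stream."""
    reader = csv.DictReader(
        handle,
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["pmid"]].append(row["cie10-code"])
    return dict(grouped)


# Invented rows standing in for the real file.
sample = io.StringIO("p1\tx\ta01\tw1\np1\tx\tb02\tw2\np2\ty\tc03\tw3\n")
result = codes_by_pmid(sample)
print(result)
```

With the real data, the `io.StringIO` object would be replaced by the file opened with `encoding="utf-8"` and `newline=""`.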

Summary statistics:

  • number of abstracts: 355 840
  • number of abstracts with at least one ICD10 code: 176 294
  • percentage of MeSH codes mapped to ICD10: 10.6% (266 949 of the 2 526 772 MeSH codes were mapped to ICD10)
  • average number of MeSH codes per article: 7.1
  • average number of ICD10 codes per article: 2.5
  • number of ICD10 codes that have an associated MeSH code in UMLS: 3293
  • number of ICD10 codes that have an associated MeSH code in UMLS and appear in this dataset: 3082
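The percentages implied by these counts can be rechecked with a few lines of arithmetic. The variable names are chosen here for illustration; the counts are the ones listed above.

```python
# Counts copied from the summary statistics.
abstracts_total = 355_840
abstracts_with_code = 176_294
mesh_total = 2_526_772
mesh_mapped = 266_949

# Share of MeSH codes that were mapped to ICD10, in percent.
mapped_pct = round(100 * mesh_mapped / mesh_total, 1)

# Share of abstracts carrying at least one ICD10 code, in percent.
coded_pct = round(100 * abstracts_with_code / abstracts_total, 1)

print(mapped_pct)  # 10.6
print(coded_pct)   # 49.5
```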

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

abstractsWithCIE10_v2.zip

Files (185.3 MB)

Name: abstractsWithCIE10_v2.zip
Size: 185.3 MB
md5:2c03ec12e609389842aad60cc915308d