There is a newer version of the record available.

Published January 13, 2020 | Version 1.0
Dataset Open

Abstracts from Lilacs and Ibecs with ICD10 codes

  • 1. Barcelona Supercomputing Center


JSON file with abstracts from Lilacs and Ibecs and the ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish) associated with them.

These databases have MeSH terms describing some of their documents. Using the UMLS Metathesaurus, those MeSH terms have been translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.

In addition, each MeSH code given by the databases (Lilacs and Ibecs) has a "word" describing it. These "words" have been used to add further ICD10 codes: we used strict string matching to check whether each "word" was a descriptor of any ICD10 code (in the Spanish version, CIE10).
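The strict string matching step could look roughly like the sketch below. It is only an illustration, not the code used to build the dataset: the function names, the descriptor table, and the codes in it are hypothetical placeholders, and a match is assumed to mean exact equality between the "word" and a descriptor.

```python
from collections import defaultdict


def build_descriptor_index(descriptors):
    """Map each descriptor string to the list of codes that carry it."""
    index = defaultdict(list)
    for code, descriptor in descriptors.items():
        index[descriptor].append(code)
    return index


def match_word(word, index):
    """Return the codes whose descriptor is exactly equal to `word`."""
    # .get() avoids creating empty entries in the defaultdict on a miss.
    return index.get(word, [])


# Hypothetical descriptor table: placeholder codes and text, not dataset content.
descriptors = {
    "CODE_A": "descriptor uno",
    "CODE_B": "descriptor dos",
    "CODE_C": "descriptor uno",
}
index = build_descriptor_index(descriptors)

matched = match_word("descriptor uno", index)   # exact match on two codes
unmatched = match_word("descriptor", index)     # a substring is not a match
```

Because the comparison is exact, a "word" that is only part of a descriptor adds no code, which keeps the extra codes conservative.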

The format of the JSON file is the following:

	[{'title': 'title',
	'pmid': 'pmid',
	'abstractText': 'abstract (in Spanish)',
		[{'Code': 'MeSHCode',
		'Word': 'reference',
		'CIE': [CIE10_1, CIE10_2, ...]},
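A minimal sketch of reading one record in this format follows. The format above does not show the name of the key that holds the nested list, so the helper looks for any list-valued field whose items carry a 'CIE' entry; the key "annotations" in the sample is a hypothetical placeholder, and the sample assumes the file uses standard double-quoted JSON.

```python
import json


def extract_codes(record):
    """Collect ICD10 codes from any list-valued field whose items have a 'CIE' key."""
    codes = []
    for value in record.values():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and "CIE" in item:
                    codes.extend(item["CIE"])
    return codes


# Hypothetical record mirroring the format above; "annotations" is a
# placeholder key name, not one confirmed by the dataset description.
sample = json.loads("""
[{"title": "title",
  "pmid": "pmid",
  "abstractText": "abstract (in Spanish)",
  "annotations": [{"Code": "MeSHCode",
                   "Word": "reference",
                   "CIE": ["CIE10_1", "CIE10_2"]}]}]
""")

sample_codes = extract_codes(sample[0])
```

For the real file, `json.load` on the opened file would replace the inline string.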


Corpus statistics:

  • There are 176,294 abstracts.
  • On average, every abstract has 2.5 associated ICD10 codes.
  • There are 3,103 unique ICD10 codes (ICD10-CM and ICD10-PCS).
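The figures above could be recomputed along the lines of the sketch below. It assumes the codes of each abstract have already been gathered into one list per abstract, and that the average counts every listed code, including any repeated within an abstract; the function name and the toy data are hypothetical.

```python
def corpus_stats(codes_per_abstract):
    """Return abstract count, mean codes per abstract, and number of unique codes."""
    n_abstracts = len(codes_per_abstract)
    total_codes = sum(len(codes) for codes in codes_per_abstract)
    unique_codes = set().union(*codes_per_abstract)
    return {
        "abstracts": n_abstracts,
        "mean_codes": total_codes / n_abstracts if n_abstracts else 0.0,
        "unique_codes": len(unique_codes),
    }


# Toy input with placeholder codes, one inner list per abstract.
toy = [["C1", "C2"], ["C2"], ["C1", "C3", "C4"]]
stats = corpus_stats(toy)
```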


Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).


Files (83.9 MB)
