There is a newer version of this record available.

Dataset Open Access

Abstracts from Lilacs and Ibecs with ICD10 codes

Antonio Miranda; Ankush Rana; Martin Krallinger

JSON file containing abstracts from Lilacs and Ibecs, each with its associated ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish).

Both databases annotate some of their documents with MeSH terms. Using the UMLS Metathesaurus, those MeSH terms were translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.
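
One way to picture the translation step is as a join on UMLS concept identifiers (CUIs): a MeSH code and an ICD10 code are linked when the Metathesaurus assigns both to the same concept. The sketch below assumes such a CUI-based join, which the description does not spell out. The rows are placeholder values invented for illustration, and `mesh_to_icd10` is a hypothetical helper name. Only the source abbreviations (`MSH`, `ICD10CM`, `ICD10PCS`) follow UMLS naming.

```python
from collections import defaultdict

# Placeholder (CUI, source vocabulary, code) rows, invented for illustration.
# In practice such rows would be read from a UMLS Metathesaurus release.
ROWS = [
    ("CUI_1", "MSH", "MESH_A"),
    ("CUI_1", "ICD10CM", "ICD_CM_1"),
    ("CUI_2", "MSH", "MESH_B"),
    ("CUI_2", "ICD10PCS", "ICD_PCS_1"),
    ("CUI_3", "MSH", "MESH_C"),  # concept with no ICD10 counterpart
]

ICD_SOURCES = {"ICD10CM", "ICD10PCS"}


def mesh_to_icd10(rows):
    """Map each MeSH code to the ICD10 codes that share a CUI with it."""
    mesh_by_cui = defaultdict(set)
    icd_by_cui = defaultdict(set)
    for cui, source, code in rows:
        if source == "MSH":
            mesh_by_cui[cui].add(code)
        elif source in ICD_SOURCES:
            icd_by_cui[cui].add(code)

    mapping = defaultdict(set)
    for cui, mesh_codes in mesh_by_cui.items():
        for mesh in mesh_codes:
            mapping[mesh] |= icd_by_cui.get(cui, set())

    # Drop MeSH codes that have no ICD10 translation at all.
    return {mesh: sorted(codes) for mesh, codes in mapping.items() if codes}


mapping = mesh_to_icd10(ROWS)
```

In this toy input, `MESH_C` has no ICD10 code on its concept, so it is left out of the mapping.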

In addition, each MeSH code provided by the databases (Lilacs and Ibecs) comes with a "word" describing it. These "words" were used to add further ICD10 codes: we applied strict string matching to check whether each "word" was a descriptor of any ICD10 code (in the Spanish version, CIE10).
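
A minimal sketch of what strict string matching could look like, assuming "strict" means exact equality with no case folding or accent normalization (the description does not say which normalization, if any, was applied). The descriptor table, the codes, and the function name `match_words_to_codes` are placeholders invented for illustration.

```python
def match_words_to_codes(words, descriptors):
    """Return {word: [codes]} for words that exactly equal a code descriptor.

    `descriptors` maps a code to its descriptor text. Matching is exact:
    no lowercasing, stripping, or accent removal is performed.
    """
    codes_by_text = {}
    for code, text in descriptors.items():
        codes_by_text.setdefault(text, []).append(code)
    return {word: sorted(codes_by_text[word]) for word in words if word in codes_by_text}


# Placeholder descriptor table (not real CIE10 codes).
DESCRIPTORS = {
    "CIE_A": "asma",
    "CIE_B": "fiebre",
}

matches = match_words_to_codes(["asma", "Asma", "tos"], DESCRIPTORS)
```

Under this reading, `"Asma"` does not match the descriptor `"asma"` because the comparison is case-sensitive, and `"tos"` has no descriptor at all.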

The format of the JSON file is the following:

	[{'title': 'title',
	'pmid': 'pmid',
	'abstractText': 'abstract (in Spanish)',
		[{'Code': 'MeSHCode',
		'Word': 'reference',
		'CIE': [CIE10_1, CIE10_2, ...]},
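
Records in this layout could be read along the following lines. This is a sketch under assumptions: the file is taken to be valid JSON holding a list of records, and because the key that holds the nested annotation list is not shown in the format above, the code looks for any list of objects that carry a `CIE` field instead of relying on a key name. The key `annotations` in the example record, the placeholder values, and the function names are all hypothetical.

```python
import json


def load_records(path):
    """Load the list of abstract records from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_annotations(record):
    """Yield every nested object in the record that has a 'CIE' field."""
    for value in record.values():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and "CIE" in item:
                    yield item


def codes_for_record(record):
    """Return the sorted set of CIE10 codes attached to one record."""
    codes = set()
    for annotation in iter_annotations(record):
        codes.update(annotation["CIE"])
    return sorted(codes)


# Hypothetical record with placeholder values, mirroring the format above.
example = {
    "title": "title",
    "pmid": "pmid",
    "abstractText": "abstract (in Spanish)",
    "annotations": [  # hypothetical key name
        {"Code": "MeSHCode_1", "Word": "reference", "CIE": ["CIE_B", "CIE_A"]},
        {"Code": "MeSHCode_2", "Word": "reference", "CIE": ["CIE_A"]},
    ],
}

example_codes = codes_for_record(example)
```

With a real file, `load_records` would be called with the path of the downloaded JSON and `codes_for_record` applied to each element of the returned list.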


Corpus statistics:

  • There are 176,294 abstracts.
  • On average, every abstract has 2.5 associated ICD10 codes.
  • There are 3103 unique ICD10 codes (ICD10-CM and ICD10-PCS).
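
Figures of this kind could be recomputed from per-abstract code lists roughly as follows. This is a sketch, assuming the average counts distinct codes per abstract (the description does not say whether repeated codes are counted once). The function name `corpus_stats` and the toy input are hypothetical.

```python
def corpus_stats(code_lists):
    """Return (number of abstracts, mean codes per abstract, unique codes).

    `code_lists` holds one list of ICD10 codes per abstract. Codes repeated
    within one abstract are counted once.
    """
    n_abstracts = 0
    total_codes = 0
    unique_codes = set()
    for codes in code_lists:
        distinct = set(codes)
        n_abstracts += 1
        total_codes += len(distinct)
        unique_codes |= distinct
    mean = total_codes / n_abstracts if n_abstracts else 0.0
    return n_abstracts, mean, len(unique_codes)


# Toy input with placeholder codes: three abstracts.
toy = [["A", "B"], ["B"], ["A", "B", "C"]]
n_abstracts, mean_codes, n_unique = corpus_stats(toy)
```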

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).