There is a newer version of this record available.

Dataset Open Access

Abstracts from Lilacs and Ibecs with ICD10 codes

Antonio Miranda; Ankush Rana; Martin Krallinger

JSON file with abstracts from Lilacs and Ibecs, together with the ICD10 codes (ICD10-CM and ICD10-PCS; CIE10 in Spanish) associated with them.

These databases have MeSH terms describing some of their documents. Using the UMLS Metathesaurus, those MeSH terms have been translated into ICD10 codes (ICD10-CM and ICD10-PCS). Every abstract has at least one ICD10 code.

In addition, the MeSH codes given by the databases (Lilacs and Ibecs) have a "word" describing them. These "words" have been used to add further ICD10 codes: we applied strict string matching to check whether each "word" is a descriptor of any ICD10 code (in the Spanish version, CIE10).
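The strict matching step can be pictured with a minimal sketch. It assumes "strict" means exact string equality with no case folding or accent normalization, which the record does not spell out. The descriptor table, the function name `match_word`, and all values are hypothetical placeholders, not data from the corpus.

```python
# Hypothetical descriptor table: CIE10 descriptor text -> CIE10 codes.
# All keys and values are placeholders, not real descriptors or codes.
descriptors = {
    "descriptor one": ["CIE10_1"],
    "descriptor two": ["CIE10_2"],
}


def match_word(word, table):
    """Return the codes whose descriptor equals `word` exactly, else []."""
    # Assumption: no lowercasing, trimming, or accent stripping is applied.
    return list(table.get(word, []))


print(match_word("descriptor one", descriptors))
print(match_word("Descriptor one", descriptors))  # differs in case, so no match
```

Under this assumption a "word" that differs from a descriptor only in capitalization or whitespace yields no extra code.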

The format of the JSON file is the following:

{'articles':
	[{'title': 'title',
	'pmid': 'pmid',
	'abstractText': 'abstract (in Spanish)',
	'Mesh':
		[{'Code': 'MeSHCode',
		'Word': 'reference',
		'CIE': [CIE10_1, CIE10_2, ...]},
		...]
	},
	...]
}
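As a rough illustration of how this layout can be traversed, the sketch below parses a tiny in-memory sample and collects the CIE10 codes of each article. The sample values are placeholders taken from the schema above, the function name `codes_per_article` is hypothetical, and the sketch assumes the distributed file is valid JSON with double-quoted keys.

```python
import json

# Tiny in-memory sample mimicking the documented layout (placeholder values).
sample = json.loads("""
{"articles": [
  {"title": "title",
   "pmid": "pmid",
   "abstractText": "abstract (in Spanish)",
   "Mesh": [
     {"Code": "MeSHCode", "Word": "reference", "CIE": ["CIE10_1", "CIE10_2"]},
     {"Code": "MeSHCode", "Word": "reference", "CIE": ["CIE10_2"]}
   ]}
]}
""")


def codes_per_article(corpus):
    """Map each article's pmid to the sorted list of its distinct CIE10 codes."""
    result = {}
    for article in corpus["articles"]:
        codes = set()
        for mesh in article.get("Mesh", []):
            codes.update(mesh.get("CIE", []))
        result[article["pmid"]] = sorted(codes)
    return result


print(codes_per_article(sample))
```

For the real file, `json.loads` on the sample string would be replaced by `json.load` on the file extracted from the archive; the name of that file is not stated in this record.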

 

Corpus statistics:

  • There are 176 294 abstracts.
  • On average, every abstract has 2.5 associated ICD10 codes.
  • There are 3103 unique ICD10 codes (ICD10-CM and ICD10-PCS).
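Figures of this kind could be recomputed from the parsed file along the following lines. This is a sketch under assumptions: the function name `corpus_stats` is hypothetical, and it counts distinct codes per abstract, which may or may not be how the average above was obtained.

```python
def corpus_stats(corpus):
    """Return (number of abstracts, mean distinct codes per abstract, unique codes)."""
    per_article = []
    for article in corpus["articles"]:
        codes = set()
        for mesh in article.get("Mesh", []):
            codes.update(mesh.get("CIE", []))
        per_article.append(codes)

    n_abstracts = len(per_article)
    # Guard against an empty corpus to avoid dividing by zero.
    mean_codes = sum(len(c) for c in per_article) / n_abstracts if n_abstracts else 0.0
    unique_codes = set().union(*per_article)
    return n_abstracts, mean_codes, len(unique_codes)


# Placeholder corpus following the documented layout (values are not real codes).
toy = {"articles": [
    {"pmid": "1", "Mesh": [{"Code": "M1", "Word": "w", "CIE": ["CIE10_1", "CIE10_2"]}]},
    {"pmid": "2", "Mesh": [{"Code": "M2", "Word": "w", "CIE": ["CIE10_2"]}]},
]}
print(corpus_stats(toy))
```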

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
Files (83.9 MB)

  • Name: abstractsWithCIE10.zip
  • Size: 83.9 MB
  • md5: e96ecc06b88b582d7201acc092e9e9a9