Published November 29, 2019 | Version 1.1
Dataset Open

MeSCCon - Medical Spanish Chemical compound, drug and medication Name Lexicon (unfiltered version)

Description

The MeSCCon (Medical Spanish Chemical compound, drug and medication Name Lexicon) is a list, or gazetteer, of candidate names of chemicals, drugs, and medications mentioned in Spanish clinical texts. MeSCCon thus serves as a lexical resource or dictionary for the automatic detection of chemical/drug mentions, as well as for indexing or classifying medical texts by such concept types.

This collection was generated in a five-step procedure:

  1. Automatic detection of mentions of chemicals and drugs in biomedical texts in English (including mapping/normalization to MeSH terms or ChEBI identifiers).
  2. Generation of a unique name list from the detected concept mentions.
  3. Basic filtering of non-chemical names and highly ambiguous mentions or abbreviations, using simple characteristics such as name morphology and length.
  4. Automatic translation of name lists from English to Spanish using a medical machine translation system (see Soares, F. and Krallinger, M. BSC Participation in the WMT Translation of Biomedical Abstracts. In Proceedings of the Fourth Conference on Machine Translation, Volume 3: Shared Task Papers, pp. 175-178, 2019; https://zenodo.org/record/3346802).
  5. Automatic mention lookup of translated names in a collection of 20 million Spanish clinical notes (primary care and pediatrics).
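As an illustration of steps 2 and 3, the following is a minimal sketch of how a unique name list might be built and then filtered by simple morphology and length criteria. The record does not state the actual criteria, so the lowercasing, the whitespace normalization, the `min_length` threshold and the "at least one letter" rule are all assumptions, and the function names `unique_names` and `looks_like_chemical_name` are hypothetical.

```python
def unique_names(mentions):
    """Deduplicate mentions after normalizing whitespace and case (assumed normalization)."""
    seen = set()
    names = []
    for mention in mentions:
        key = " ".join(mention.split()).lower()
        if key and key not in seen:
            seen.add(key)
            names.append(key)
    return names


def looks_like_chemical_name(name, min_length=4):
    """Hypothetical filter: drop very short strings and strings without any letter."""
    if len(name) < min_length:
        return False
    return any(ch.isalpha() for ch in name)


# Toy English mentions, invented for this example.
mentions = ["Morphine hydrochloride", "morphine  hydrochloride", "Na", "aspirin", "1234"]
candidates = [n for n in unique_names(mentions) if looks_like_chemical_name(n)]
print(candidates)  # ['morphine hydrochloride', 'aspirin']
```

A length threshold of this kind removes short, ambiguous abbreviations such as "Na", at the cost of also dropping some legitimate short names.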

Every term in MeSCCon is identified by a text span (in Spanish), a target terminology namespace to which it was automatically mapped (MeSH or ChEBI) and the corresponding concept identifier in that terminology.

Moreover, for every text span we provide the absolute term frequency, i.e. the number of matches in the corpus of 20 million clinical notes, and the document frequency, i.e. the number of notes in which it was found.
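The difference between the two counts can be shown with a small sketch. It assumes a case-insensitive, whole-word match, which may differ from the lookup actually used to build MeSCCon. The function name `count_term` is hypothetical and the notes are invented toy examples.

```python
import re


def count_term(term, notes):
    """Return (term_frequency, document_frequency) for one term over a list of notes."""
    # Whole-word, case-insensitive matching is an assumption for this sketch.
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    term_frequency = 0
    document_frequency = 0
    for note in notes:
        matches = len(pattern.findall(note))
        term_frequency += matches      # every match counts
        if matches:
            document_frequency += 1    # each note counts at most once
    return term_frequency, document_frequency


notes = [
    "Se pauta paracetamol 1 g. Si fiebre, repetir paracetamol.",
    "Alergia a paracetamol.",
    "Sin tratamiento farmacológico.",
]
print(count_term("paracetamol", notes))  # (3, 2)
```

The term occurs three times overall but in only two notes, so the term frequency is 3 and the document frequency is 2.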

Important note: no manual filtering of the MeSCCon was carried out, so some entries may contain errors, due either to the initial name recognition and concept mapping in English or to wrong automatic translations into Spanish.

The MeSCCon resource is provided in two formats:

  • TSV. Data is separated by tabs (\t). Every row of the file has the following fields:
terminology	identifier	translatedTerm	termCount	documentCount
  • JSON. Records are stored as a list of JSON objects. They have the following fields:
{
	"terminology":"MESH",
	"identifier":"D009020",
	"translatedTerm":"clorhidrato de morfina",
	"termFrequency":1,
	"documentFrequency":1
}
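Both formats can be read into a common structure. The sketch below is hedged: it assumes the TSV columns appear in the order listed above, skips a header row if one is present, and maps the differently named count fields of the two formats (`termCount`/`documentCount` in TSV, `termFrequency`/`documentFrequency` in JSON) onto the same attributes. The `Entry` type and the parser names are hypothetical, and the file names inside MeSCCon.zip are not assumed, so the example works on in-memory strings.

```python
import csv
import io
import json
from typing import NamedTuple


class Entry(NamedTuple):
    terminology: str
    identifier: str
    term: str
    term_count: int
    document_count: int


def parse_tsv(text):
    """Parse tab-separated rows; a header row starting with 'terminology' is skipped."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if not row or row[0] == "terminology":
            continue
        terminology, identifier, term, term_count, document_count = row
        entries.append(
            Entry(terminology, identifier, term, int(term_count), int(document_count))
        )
    return entries


def parse_json(text):
    """Parse a JSON list of objects with the field names shown in the record."""
    return [
        Entry(
            r["terminology"],
            r["identifier"],
            r["translatedTerm"],
            int(r["termFrequency"]),
            int(r["documentFrequency"]),
        )
        for r in json.loads(text)
    ]


sample_tsv = "MESH\tD009020\tclorhidrato de morfina\t1\t1\n"
sample_json = (
    '[{"terminology":"MESH","identifier":"D009020",'
    '"translatedTerm":"clorhidrato de morfina",'
    '"termFrequency":1,"documentFrequency":1}]'
)
print(parse_tsv(sample_tsv) == parse_json(sample_json))  # True
```

To read the real files, the same functions could be applied to the text of each file extracted from the archive, opened with UTF-8 encoding.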

 

Copyright (c) 2019 Secretaría de Estado para el Avance Digital

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

MeSCCon.zip (12.7 MB)
md5:30fdc87531a70aed87be0f7a04cf692a

Additional details

References

  • Soares, F. and Krallinger, M. BSC Participation in the WMT Translation of Biomedical Abstracts. In Proceedings of the Fourth Conference on Machine Translation, Volume 3: Shared Task Papers, pp. 175-178, 2019.
  • Santamaría J, Krallinger M. Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. Procesamiento del Lenguaje Natural. 2018 Sep 1;61.