MeSpEn_Parallel-Corpora

Marta Villegas; Ander Intxaurrondo; Aitor Gonzalez-Agirre; Martin Krallinger

doi:10.5281/zenodo.3562536

Published December 4, 2019 | Version 2019-12-04

Dataset Open

MeSpEn is a collection of heterogeneous health-related documents in Spanish and English. It is useful for building parallel corpora to train and evaluate Spanish <-> English medical machine translation systems, for generating multilingual automatic term extraction tools, and for developing other Spanish medical NLP components. MeSpEn combines and harmonizes various bibliographic datasets of biomedical and clinical literature from Spain and Latin America, along with web content from trusted information sources about diseases, conditions, and wellness issues for patients.

MeSpEn was used to automatically generate bilingual health-related glossaries through automatic term detection and named entity recognition in English, combined with candidate target term extraction in Spanish via sentence alignment. This approach could potentially be used to generate Silver Standard annotated health texts in Spanish (see Villegas et al., "The MeSpEN resource for English-Spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations." Proc. LREC 2018 Workshop MultilingualBIO: Multilingual Biomedical Text Processing).

The MeSpEn resource aggregates several datasets, drawn mainly from four sources: IBECS, SciELO, PubMed, and MedlinePlus.

IBECS (Spanish Bibliographical Index in Health Sciences) is a bibliographical database that collects scientific journals covering multiple fields in health sciences. It is maintained by the Spanish National Health Sciences Library (BNCS), at the Carlos III Health Institute.

This corpus contains titles and abstracts from 168,198 records in English and Spanish. The metadata of each record is written in Dublin Core format. The original XML file supplied by IBECS for each record is included as well.
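
As an illustration of how Dublin Core metadata of this kind might be read, the following is a minimal sketch in Python. The sample record, its wrapper element, and the use of xml:lang attributes to mark the language are assumptions made for the example and are not taken from the corpus files; the actual layout is described in the README files.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace and the xml:lang attribute name.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record: the wrapper element, identifier, and texts are
# illustrative only and are not copied from the corpus.
SAMPLE_RECORD = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>example-0001</dc:identifier>
  <dc:title xml:lang="es">Título de ejemplo</dc:title>
  <dc:title xml:lang="en">Example title</dc:title>
  <dc:description xml:lang="es">Resumen de ejemplo.</dc:description>
  <dc:description xml:lang="en">Example abstract.</dc:description>
</metadata>"""


def texts_by_language(root, tag):
    """Map each xml:lang code to the text of the dc:<tag> element carrying it."""
    result = {}
    for element in root.iterfind(f".//dc:{tag}", DC_NS):
        lang = element.get(XML_LANG, "und")  # "und" marks an undeclared language
        result[lang] = (element.text or "").strip()
    return result


def parallel_pairs(xml_text):
    """Return (Spanish, English) pairs for title and description when both exist."""
    root = ET.fromstring(xml_text)
    pairs = []
    for tag in ("title", "description"):
        texts = texts_by_language(root, tag)
        if "es" in texts and "en" in texts:
            pairs.append((texts["es"], texts["en"]))
    return pairs


pairs = parallel_pairs(SAMPLE_RECORD)
```

Each pair holds a Spanish text and its English counterpart, which is the shape needed as input for sentence alignment or translation training.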

For more information about the IBECS parallel corpora, see the IBECS_README file.

SciELO (Scientific Electronic Library Online) gathers electronic publications of full-text articles from scientific journals of Latin America, South Africa, and Spain. It is currently present in 15 countries and is supported by the Sao Paulo Research Foundation (FAPESP) and the Brazilian National Council for Scientific and Technological Development (CNPq), in partnership with the Latin American and Caribbean Center on Health Sciences Information (BIREME).

This corpus contains titles and abstracts from 161,710 records in English and Spanish. Users can find the metadata of each record written in Dublin Core format.

For more information about the SciELO parallel corpora, see the Scielo_README file.

PubMed is a free search engine used to access the MEDLINE database, maintained by the U.S. National Library of Medicine (NLM).

This corpus contains titles and abstracts from 127,619 records. The metadata of each record is written in Dublin Core format. The original XML file supplied by PubMed for each record is included as well.

For more information about the PubMed parallel corpora, see the Pubmed_README file.

Users can access all Spanish articles in PubMed by clicking here. Follow these steps to download the metadata of all articles in XML format:
- Click on Send to.
- Select File under Choose destination.
- Select XML under Format.
- And finally click on Create File.
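
The steps above use the PubMed web interface. A programmatic alternative, not described in this record, is the NCBI E-utilities service. The sketch below only builds the request URLs and makes no network calls. The search term spanish[Language] is an assumption about what the linked search contains, and the PMIDs passed to efetch_url are placeholders.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term="spanish[Language]", retmax=100):
    """Build an ESearch URL that lists PubMed IDs matching `term`.

    The default term is an assumption: it restricts results to records
    whose language field is Spanish.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL that returns the given records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search_url = esearch_url()
# Placeholder identifiers, used only to show the URL shape.
fetch_url = efetch_url(["1", "2"])
```

The resulting URLs can be requested with any HTTP client. For large result sets, NCBI's usage guidelines on request rates and API keys apply.
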

MedlinePlus is an online information service provided by the U.S. National Library of Medicine (NLM) that offers free health information in both English and Spanish. MedlinePlus covers health topics, drugs and supplements, laboratory test information, and a medical encyclopedia.

There are 2 corpora available for download:
- Health topics metadata in Dublin Core format: the source code of the site stores metadata about each topic, and the DC files were created from these metadata. This collection contains a total of 1,063 articles in English and Spanish. For more information, see MedlinePlus-health-topics_README.
- Complete MedlinePlus in TEI format: clean raw text and XML files of each article, structured by sections and paragraphs. This collection contains a total of 7,033 articles in English and Spanish. For more information about it, see MedlinePlus-articles_README.
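
To show what "structured by sections and paragraphs" can look like in practice, here is a minimal sketch that reads a TEI document with Python's standard library. The sample document is hypothetical and assumes the common TEI layout of div elements containing a head and p elements; the actual files may differ, so the MedlinePlus-articles_README remains the reference.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical article: section titles and paragraph texts are invented.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Summary</head>
        <p>First example paragraph.</p>
        <p>Second example paragraph.</p>
      </div>
      <div>
        <head>Symptoms</head>
        <p>Third example paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""


def sections(xml_text):
    """Return a list of (section title, [paragraph texts]) from a TEI body."""
    root = ET.fromstring(xml_text)
    result = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = (head.text or "").strip() if head is not None else ""
        # itertext() also collects text nested inside inline markup.
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.iterfind("tei:p", TEI_NS)
        ]
        result.append((title, paragraphs))
    return result


article_sections = sections(SAMPLE_TEI)
```

Running the same function over the English and Spanish versions of an article gives section-level units that can then be paired before finer sentence alignment.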

These corpora are also available at http://temu.bsc.es/mespen/

In addition, forty-six bilingual medical glossaries for various language pairs are available at https://zenodo.org/record/2205690#.XefkzdEo9hF

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
