Dataset Open Access
Marta Villegas; Ander Intxaurrondo; Aitor Gonzalez-Agirre; Martin Krallinger
MeSpEn is a collection of heterogeneous health-related documents in Spanish and English. It is useful for building parallel corpora to train and evaluate Spanish <-> English medical machine translation systems, for generating multilingual automatic term extraction tools, and for developing other Spanish medical NLP components. MeSpEn combines and harmonizes several bibliographic datasets of biomedical and clinical literature from Spain and Latin America, as well as web content from trusted information sources about diseases, conditions, and wellness issues for patients.
MeSpEn was used to automatically generate bilingual health-related glossaries through automatic term detection and named entity recognition in English, followed by extraction of candidate target terms in Spanish using sentence alignment approaches. This potentially enables the generation of Silver Standard annotated health texts in Spanish (see Villegas, et al. "The MeSpEN resource for English-Spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations." Proc. LREC 2018 Workshop MultilingualBIO: Multilingual Biomedical Text Processing).
The MeSpEn resource aggregates several datasets, drawn mainly from four sources: IBECS, SciELO, PubMed and MedlinePlus.
IBECS (Spanish Bibliographical Index in Health Sciences) is a bibliographical database that collects scientific journals covering multiple fields in health sciences. It is maintained by the Spanish National Health Sciences Library (BNCS), at the Carlos III Health Institute.
This corpus contains titles and abstracts from 168,198 records in English and Spanish. Users can find the metadata of each record written in Dublin Core format. The original XML file supplied by IBECS for each record is also included.
For more information about the IBECS parallel corpora, see the IBECS_README file.
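As an illustration of how Dublin Core metadata like that described above might be read, the following minimal sketch extracts language-tagged titles from a record. The sample record, its `record` root element, and the use of `xml:lang` attributes are assumptions made for this example; the actual files in the corpus may be structured differently, so check the README files before relying on it. Only the `dc` namespace URI is the standard Dublin Core one.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record, invented for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title xml:lang="es">Título de ejemplo</dc:title>
  <dc:title xml:lang="en">Example title</dc:title>
  <dc:description xml:lang="es">Resumen de ejemplo.</dc:description>
  <dc:description xml:lang="en">Example abstract.</dc:description>
</record>"""


def fields_by_language(xml_text, field):
    """Return {language code: text} for one Dublin Core field, e.g. 'title'."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter(f"{{{DC_NS}}}{field}"):
        # Elements without a language tag are stored under "und" (undetermined).
        lang = element.get(XML_LANG, "und")
        result[lang] = (element.text or "").strip()
    return result


titles = fields_by_language(SAMPLE, "title")
abstracts = fields_by_language(SAMPLE, "description")
print(titles["es"], "<->", titles["en"])
```

Pairing the Spanish and English values of the same field in this way is one simple route to title-level or abstract-level parallel text, assuming both languages are present in a record.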
SciELO (Scientific Electronic Library Online) gathers electronic publications of full-text articles from scientific journals of Latin America, South Africa and Spain. It is currently present in 15 countries and is supported by the Sao Paulo Research Foundation (FAPESP) and the Brazilian National Council for Scientific and Technological Development (CNPq), in partnership with the Latin American and Caribbean Center on Health Sciences Information (BIREME).
This corpus contains titles and abstracts from 161,710 records in English and Spanish. Users can find the metadata of each record written in Dublin Core format.
For more information about the SciELO parallel corpora, see the Scielo_README file.
PubMed is a free search engine used to access the MEDLINE database, which is maintained by the U.S. National Library of Medicine (NLM).
This corpus contains titles and abstracts from 127,619 records. Users can find the metadata of each record written in Dublin Core format. The original XML file supplied by PubMed for each record is also included.
For more information about the PubMed parallel corpora, see the Pubmed_README file.
Users can access all Spanish articles in PubMed by clicking here. Follow these steps to download the metadata of all articles in XML format:
MedlinePlus is an online information service provided by the U.S. National Library of Medicine (NLM) that offers free health information in both English and Spanish. MedlinePlus covers the following areas: health topics, drugs and supplements, laboratory test information, and a medical encyclopedia.
There are 2 corpora available for download:
These corpora are also available at http://temu.bsc.es/mespen/
In addition, forty-six bilingual medical glossaries for various language pairs are available at https://zenodo.org/record/2205690#.XefkzdEo9hF
Copyright (c) 2019 Secretaría de Estado para el Avance Digital
Name | MD5 | Size |
---|---|---|
MedlinePlus-articles_README | 3d1af50170b2f8743e0f2b04f80a33fb | 1.1 kB |
MedlinePlus-health-topics_README | d72afaa5a030b3cb31075ccce70b04a6 | 776 Bytes |
MedlinePlus-health_topics-dublin_core-Sp-En.tar.bz2 | c122aea703fd9168aa91e8e9770ad389 | 726.8 kB |
MedlinePlus-TEI-Sp-En.tar.bz2 | cc5123bdc80b253aa4d8b921f4192578 | 41.6 MB |
MeSpEn_Parallel-Corpora.zip | 2662ec36379c8c7740ba9518c5f04477 | 616.1 MB |
Pubmed-dublin_core-Sp-En.tar.bz2 | e6b92973b638d8fecd98918ea5db1d10 | 84.1 MB |
Scielo_README | 27271e41b6fdd9397c1c071d955bd4da | 1.5 kB |
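The MD5 values listed with the files above can be used to check that a download completed without corruption. The sketch below is one possible way to do this with the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical and not part of the dataset, and the local file path is assumed to be wherever the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for the larger archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call, assuming Scielo_README was saved in the current directory:
# verify("Scielo_README", "27271e41b6fdd9397c1c071d955bd4da")
```

MD5 is used here only because it is the checksum published with the files; it detects accidental corruption but is not suitable as a security guarantee.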
Villegas M, Intxaurrondo A, Gonzalez-Agirre A, Marimon M, Krallinger M. The MeSpEN resource for English-Spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations. In Proceedings of the LREC 2018 Workshop "MultilingualBIO: Multilingual Biomedical Text Processing", Paris, France. European Language Resources Association (ELRA) 2018 May 8.