Published August 1, 2024 | Version v2
Dataset Open

Datasets for "Exploring the potential of neural machine translation for cross-language clinical NLP resource generation through annotation projection"

Description

This repository contains the data and additional resources used for the paper:

"Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection". Rodriguez Miret et al. Information (2024).

There are four different datasets included, namely:

  • The (1) DisTEMIST, (2) DrugTEMIST and (3) MEDDOPROF Spanish corpora and corresponding versions in 10 different languages, created through machine translation and annotation projection techniques. The Catalan annotations, used in the paper's experiments, were validated by bilingual expert annotators, who also provided alternative translations for annotated terms that had been translated incorrectly. Thus, for Catalan we provide two versions of the data: (i) the raw output of the annotation projection process, without any further validation, and (ii) the validated version. For the remaining languages (with the exception of the DrugTEMIST English and Italian data, used for the MultiCardioNER shared task), only an unvalidated version is provided.
  • The (4) Catalan Clinical Case Corpus (CataCCC), a collection of 200 clinical case reports originally written in Catalan, covering a variety of clinical specialties. This corpus includes manually validated annotations for diseases, medications and professions created by the experts who annotated the corpora mentioned above, using the same guidelines and annotation criteria. It can thus be considered the first clinical Gold Standard corpus for diseases, medications and professions in Catalan.

It is noteworthy that the MEDDOPROF-related data includes annotations for two labels, PROFESION and SITUACION_LABORAL, but only the former was used for training and evaluation in the paper.
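
To reproduce the paper's setup, the SITUACION_LABORAL annotations would need to be dropped before training. The following is a minimal sketch of that filtering step. The `Mention` type, the `keep_used_labels` helper and the sample mentions are hypothetical and only illustrate the idea; the actual annotation file format in the archive is not described here and may require a different loader.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical in-memory representation of one annotated mention."""
    label: str
    text: str


# Only PROFESION was used for training and evaluation in the paper.
USED_LABELS = {"PROFESION"}


def keep_used_labels(mentions: List[Mention]) -> List[Mention]:
    """Return only the mentions whose label is in USED_LABELS."""
    return [m for m in mentions if m.label in USED_LABELS]


# Illustrative mentions only; these are NOT taken from the corpus.
sample = [
    Mention("PROFESION", "enfermera"),
    Mention("SITUACION_LABORAL", "jubilado"),
]
filtered = keep_used_labels(sample)
```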

These are the 10 languages included in the repository (along with their language codes):

  • Spanish (`es-gs`, with `gs` standing for Gold Standard)
  • Catalan (`cat`)
  • English (`en`)
  • French (`fr`)
  • Italian (`it`)
  • Dutch (`nl`)
  • Portuguese (`pt`)
  • Romanian (`ro`)
  • Swedish (`sv`)
  • Czech (`cz`)
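
When working with the data programmatically, it may be convenient to keep the codes above in a single lookup table. The sketch below does this; the names `LANGUAGE_CODES` and `language_name` are assumptions made for illustration and are not part of the repository.

```python
# Language codes exactly as listed in this record.
LANGUAGE_CODES = {
    "es-gs": "Spanish",  # "gs" stands for Gold Standard
    "cat": "Catalan",
    "en": "English",
    "fr": "French",
    "it": "Italian",
    "nl": "Dutch",
    "pt": "Portuguese",
    "ro": "Romanian",
    "sv": "Swedish",
    "cz": "Czech",
}


def language_name(code: str) -> str:
    """Return the language name for a repository code (case-insensitive)."""
    key = code.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"Unknown language code: {code!r}")
    return LANGUAGE_CODES[key]
```

Note that `cat` and `cz` are the codes used in this repository and differ from the ISO 639-1 codes for Catalan (`ca`) and Czech (`cs`), so a lookup table like this can help when combining the data with tools that expect ISO codes.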

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Files

cataccc_distemist_drugtemist_meddoprof_zenodo_v2.zip

Files (68.6 MB)