MESINESP: Post-workshop datasets. Silver Standard and annotator records

Martin Krallinger; Carlos Rodríguez-Penagos; Aitor Gonzalez-Agirre; Alejandro Asensio

doi:10.5281/zenodo.3946558

Published July 15, 2020 | Version 1.0

Dataset Open

1. Barcelona Supercomputing Center

Note: please use the MESINESP2 corpus (the second edition of the shared task) instead of this one, since it is more thoroughly curated, of higher quality, and organized by document type (scientific articles, patents and clinical trials).

The MESINESP Challenge (the Spanish BioASQ track, see https://temu.bsc.es/mesinesp) was held in May-June 2020. As a result of the strong participation and the manual annotation of an evaluation dataset, two additional datasets are now released:

1) "all_annotations_withIDsv3.tsv" is a tab-separated file with all manual annotations (both validated and non-validated) of the evaluation dataset prepared for the competition. It contains the following fields:

annotatorName: The human annotator's ID
documentId: The document ID in the source database
decsCode: A DeCS code that the annotator added to the document or validated
timestamp: When the code was added
validated: Whether the code had been validated by another annotator at that point (true) or not yet (false)
SpanishTerm: The Spanish descriptor corresponding to the DeCS code
mesinespId: The internal document ID in the distributed evaluation file
dataset: Whether the document is part of the evaluation or the test set
source: The database the document was taken from

Example:

annotatorName   documentId   decsCode   timestamp   validated   SpanishTerm   mesinespId   dataset   source
A7   biblio-1001069   6893   2020-01-17T11:27:07.000Z   false   caballos   mesinesp-dev-671   dev   LILACS
A7   biblio-1001069   4345   2020-01-17T11:27:12.000Z   false   perros   mesinesp-dev-671   dev   LILACS
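As an illustration of how the fields above can be used, the following minimal sketch groups DeCS codes by document. It assumes the file is tab-separated with the header row shown in the example; the two sample rows are copied from that example, and the function name `codes_by_document` is hypothetical, not part of any MESINESP tooling.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the example above. The real file is assumed to be
# tab-separated and to start with this header row.
SAMPLE = (
    "annotatorName\tdocumentId\tdecsCode\ttimestamp\tvalidated\t"
    "SpanishTerm\tmesinespId\tdataset\tsource\n"
    "A7\tbiblio-1001069\t6893\t2020-01-17T11:27:07.000Z\tfalse\t"
    "caballos\tmesinesp-dev-671\tdev\tLILACS\n"
    "A7\tbiblio-1001069\t4345\t2020-01-17T11:27:12.000Z\tfalse\t"
    "perros\tmesinesp-dev-671\tdev\tLILACS\n"
)


def codes_by_document(handle, only_validated=False):
    """Map each mesinespId to the set of DeCS codes annotated for it.

    With only_validated=True, rows whose 'validated' field is not 'true'
    are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    result = defaultdict(set)
    for row in reader:
        is_validated = row["validated"].strip().lower() == "true"
        if only_validated and not is_validated:
            continue
        result[row["mesinespId"]].add(row["decsCode"])
    return dict(result)


codes = codes_by_document(io.StringIO(SAMPLE))
print(codes)
```

To read the downloaded file instead of the sample, pass an open file handle, for example `open("all_annotations_withIDsv3.tsv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption.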

2) A "Silver Standard" created from the 24 system runs submitted by 6 participating teams. It contains each of the DeCS codes submitted for each document in the test set, as well as other information that can help ascertain reliability and source for anyone who wants to use this dataset to enrich their training data. It contains more than 5.8 million data points and is structured as follows:

SubmissionName: Alias of the team that submitted the run
REALdocumentId: The real id of the document
mesinespId: The mesinesp assigned id in the evaluation dataset
docSource: The source database
decsCode: The DeCS code assigned to the document by the team's system
SpanishTerm: The Spanish descriptor of the DeCS code
MiF: The micro-F1 score achieved by that system's run
MiR: The micro-recall achieved by that system's run
MiP: The micro-precision achieved by that system's run
Acc: The accuracy achieved by that system's run
consensus: The number of runs in which that DeCS code was assigned to this document by the participating teams (maximum 24)

Example:

SubmissionName   REALdocumentId   mesinespId   docSource   decsCode   SpanishTerm   MiF   MiR   MiP   Acc   consensus
AN   ibc-177565   mesinesp-evaluation-00001   IBECS   28567   riesgo   0.2054   0.1930   0.2196   0.1198   4
AN   ibc-177565   mesinesp-evaluation-00001   IBECS   15335   trabajo   0.2054   0.1930   0.2196   0.1198   4
AN   ibc-177565   mesinesp-evaluation-00001   IBECS   33182   conocimiento   0.2054   0.1930   0.2196   0.1198   7
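One way to use the consensus field when enriching training data is to keep only codes that enough runs agreed on. The sketch below is a minimal illustration under stated assumptions: it assumes the file inside the zip archive is tab-separated with the header row shown above, the sample rows are copied from the example, and the function name `high_consensus_labels` and the threshold values are hypothetical choices, not recommendations from the dataset authors.

```python
import csv
import io

# Sample rows copied from the example above. The real file is assumed to be
# tab-separated and to start with this header row.
SAMPLE = (
    "SubmissionName\tREALdocumentId\tmesinespId\tdocSource\tdecsCode\t"
    "SpanishTerm\tMiF\tMiR\tMiP\tAcc\tconsensus\n"
    "AN\tibc-177565\tmesinesp-evaluation-00001\tIBECS\t28567\triesgo\t"
    "0.2054\t0.1930\t0.2196\t0.1198\t4\n"
    "AN\tibc-177565\tmesinesp-evaluation-00001\tIBECS\t15335\ttrabajo\t"
    "0.2054\t0.1930\t0.2196\t0.1198\t4\n"
    "AN\tibc-177565\tmesinesp-evaluation-00001\tIBECS\t33182\tconocimiento\t"
    "0.2054\t0.1930\t0.2196\t0.1198\t7\n"
)


def high_consensus_labels(handle, min_consensus):
    """Return {mesinespId: set of DeCS codes} with consensus >= min_consensus.

    The same (document, code) pair appears once per run that predicted it,
    so codes are collected in a set to remove those repeats.
    """
    labels = {}
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if int(row["consensus"]) >= min_consensus:
            labels.setdefault(row["mesinespId"], set()).add(row["decsCode"])
    return labels


strict = high_consensus_labels(io.StringIO(SAMPLE), min_consensus=5)
loose = high_consensus_labels(io.StringIO(SAMPLE), min_consensus=4)
print(strict)
print(loose)
```

A higher threshold trades coverage for reliability: fewer codes survive, but each was proposed by more of the 24 runs. The per-run scores (MiF, MiR, MiP, Acc) could be used as an additional filter in the same loop.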

For a detailed description of the Challenge, and when using these datasets, please cite:
Anastasios Nentidis, Anastasia Krithara, Konstantinos Bougiatiotis, Martin Krallinger, Carlos Rodriguez-Penagos, Marta Villegas and Georgios Paliouras. Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering (2020). Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). Thessaloniki, Greece, September 22-25.

Citation

@inproceedings{durusan2019overview,
title={Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering},
author={Nentidis, Anastasios and Krithara, Anastasia and Bougiatiotis, Konstantinos and Krallinger, Martin and Rodriguez-Penagos, Carlos and Villegas, Marta and Paliouras, Georgios},
booktitle={Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), Thessaloniki, Greece, September 22--25, 2020},
volume={12260},
year={2020},
organization={Springer}
}

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files (70.9 MB)

Name	MD5	Size
all_annotations_withIDsv3.tsv	415b0dd7a193160da73ea938e67c4fee	6.8 MB
mesinesp_silver_standard.zip	891a0483d212199f339751dc51300a37	64.1 MB
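The MD5 checksums listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `verify_download` are hypothetical, and the local paths passed to them are up to the user.

```python
import hashlib

# Checksums as published in the file listing above.
EXPECTED_MD5 = {
    "all_annotations_withIDsv3.tsv": "415b0dd7a193160da73ea938e67c4fee",
    "mesinesp_silver_standard.zip": "891a0483d212199f339751dc51300a37",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at 'path' matches the checksum listed for 'name'."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("mesinesp_silver_standard.zip", "mesinesp_silver_standard.zip")` returns `True` only if the local copy matches the published checksum.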

