MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish
Description
Annotated corpora for the MESINESP2 shared task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled for September in Bucharest, Romania): http://clef2021.clef-initiative.eu/
Introduction:
These corpora contain the data for each of the subtracks of the MESINESP2 shared task:
- [Subtrack 1] MESINESP-L – Scientific Literature:
- Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish; records with empty or non-Spanish abstracts were filtered out. The training dataset was built from data crawled on 01/29/2021, so it is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after a record is first included in the database. We distribute two different datasets:
- Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
- Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
- Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators. After analyzing the Inter-Annotator Agreement among their annotations, we decided to select the 3 best ones, considering their annotations the valid ones to build the test set. From those 1065 records:
- 213 articles were annotated by more than one annotator. For these, we have taken the union of the annotations.
- 852 articles were annotated by only one of the three selected annotators with better performance.
- Test set: We provide a test set containing 10179 abstracts without DeCS codes (not annotated) from LILACS and IBECS. Participants will have to predict the DeCS codes for each of the abstracts in the entire dataset. However, the systems will only be evaluated on the set of 500 expert-annotated abstracts that will be published as Gold Standard after the evaluation period ends.
- [Subtrack 2] MESINESP-T – Clinical Trials:
- Training set: The training dataset contains records from Registro Español de Estudios Clínicos (REEC). REEC does not provide documents with the title/abstract structure needed in BioASQ, so we have built artificial abstracts from the content available in the data crawled using the REEC API. Because clinical trials are not indexed with DeCS terminology, we have used as training data a set of 3560 clinical trials that were automatically annotated in the first edition of MESINESP and published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF (micro-averaged F-measure) higher than 0.41, which corresponds with the submission of the best team.
- Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
- Test set: The test dataset contains a collection of 8919 items. Of these, 461 are clinical trials coming from REEC and 8458 are clinical trials artificially constructed from drug datasheets that have a similar structure to REEC documents. The systems will be evaluated on a set of 250 items annotated by DeCS experts following the same protocol as in subtrack 1. Similarly, these items will be published as Gold Standard after completion of the task.
- [Subtrack 3] MESINESP-P – Patents:
- Development set: We provide a development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents that have the IPC codes “A61P” and “A61K31”. We have selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
- Test set: We provide a test set containing 68404 records that correspond to the total number of patents published in Spanish with the IPC codes “A61P” and “A61K31”. From this set, 150 records will be selected and indexed by DeCS experts under the protocol defined in subtrack 1, and these will be used to evaluate the quality of the developed systems. As with the development set, we selected these 150 records based on semantic similarity to the MESINESP-L training set.
- Additional data:
- We provide additional data to the participants in the “Additional Data” folder. For each training, development, and test set there is an additional JSON file with the structure shown here. Each file contains entities related to medications, diseases, symptoms, and medical procedures extracted with the BSC NERs.
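The union rule used for development-set articles with more than one annotator, and the micro-averaged F-measure (MiF) mentioned for subtrack 2, can be sketched as follows. This is a minimal illustration, not the official evaluation script: the function names, document ids, and DeCS codes are hypothetical, and documents are represented simply as sets of codes. Predictions for documents that are not in the gold set are ignored, mirroring the fact that only the expert-annotated subset is evaluated.

```python
from typing import Dict, Iterable, Set


def merge_annotations(per_annotator: Iterable[Iterable[str]]) -> Set[str]:
    """Union of the DeCS codes assigned by each annotator to one document."""
    merged: Set[str] = set()
    for codes in per_annotator:
        merged.update(codes)
    return merged


def micro_f(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F-measure over all (document, code) pairs in `gold`."""
    tp = fp = fn = 0
    for doc_id, gold_codes in gold.items():
        pred_codes = pred.get(doc_id, set())
        tp += len(gold_codes & pred_codes)   # codes predicted and correct
        fp += len(pred_codes - gold_codes)   # codes predicted but wrong
        fn += len(gold_codes - pred_codes)   # codes missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ids and codes, for illustration only.
gold = {
    "doc1": merge_annotations([["X001", "X002"], ["X002", "X003"]]),
    "doc2": merge_annotations([["X004"]]),
}
pred = {"doc1": {"X001", "X003", "X009"}, "doc2": {"X004"}}
score = micro_f(gold, pred)
```

Counts are pooled over all documents before precision and recall are computed, which is what distinguishes the micro average from a per-document (macro) average.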
File structure:
Subtrack1-Scientific_Literature.zip contains the corpora generated for subtrack 1. Content:
- Subtrack1:
- Train
- training_set_track1_all.json: Full training set for subtrack 1.
- training_set_track1_only_articles.json: Articles training set for subtrack 1.
- Development
- development_set_subtrack1.json: Manually annotated development set for subtrack 1.
- Test
- test_set_subtrack1.json: Test set for subtrack 1.
Subtrack2-Clinical_Trials.zip contains the corpora generated for subtrack 2. Content:
- Subtrack2:
- Train
- training_set_subtrack2.json: Training set for subtrack 2.
- Development
- development_set_subtrack2.json: Manually annotated development set for subtrack 2.
- Test
- test_set_subtrack2.json: Test set for subtrack 2.
Subtrack3-Patents.zip contains the corpora generated for subtrack 3. Content:
- Subtrack3:
- Development
- development_set_subtrack3.json: Manually annotated development set for subtrack 3.
- Test
- test_set_subtrack3.json: Test set for subtrack 3.
Additional data.zip contains the corpora with additional data for each subtrack of MESINESP2.
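As a rough sketch of how the archives described above could be read without unpacking them, the snippet below lists the JSON members of a zip file and loads one of them. The helper names are hypothetical, the exact member paths inside each archive are an assumption based on the folder listing, and no assumption is made about the JSON schema. The demonstration builds a small in-memory stand-in archive instead of using the real files.

```python
import io
import json
import zipfile
from typing import Any, List


def list_json_members(archive) -> List[str]:
    """Return the names of the .json files inside a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if name.endswith(".json")]


def load_json_member(archive, member: str) -> Any:
    """Parse one JSON member of the archive, decoding it as UTF-8."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            return json.load(io.TextIOWrapper(fh, encoding="utf-8"))


# Stand-in archive with placeholder content; the real member paths and
# JSON structure may differ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "Subtrack3/Development/development_set_subtrack3.json",
        json.dumps({"placeholder": True}),
    )

members = list_json_members(buf)
content = load_json_member(buf, members[0])
```

With the downloaded files, `archive` would be a path such as `"Subtrack3-Patents.zip"` rather than an in-memory buffer.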
DeCS2020.tsv contains a DeCS table with the following structure:
- DeCS code
- Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
- List of synonyms (the descriptors and synonyms from the Latin Spanish DeCS 2020 set, separated by pipes)
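The three-column layout above can be read with a few lines of Python. This is a sketch under assumptions: it supposes the columns are tab-separated in the order listed, that the file is UTF-8, and that there is no header row (if there is one, it would need to be skipped). The `Descriptor` type, the function name, and the sample row are hypothetical.

```python
import csv
import io
from typing import Dict, List, NamedTuple, TextIO


class Descriptor(NamedTuple):
    code: str
    preferred: str
    synonyms: List[str]


def read_decs_table(handle: TextIO) -> Dict[str, Descriptor]:
    """Map each DeCS code to its preferred label and synonym list."""
    table: Dict[str, Descriptor] = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        raw_synonyms = row[2] if len(row) > 2 else ""
        # Synonyms are separated by pipes within the third column.
        synonyms = [s.strip() for s in raw_synonyms.split("|") if s.strip()]
        table[row[0]] = Descriptor(row[0], row[1], synonyms)
    return table


# Made-up row standing in for a line of DeCS2020.tsv.
sample = io.StringIO("X0001\tdescriptor de ejemplo\tdescriptor de ejemplo|sinónimo de ejemplo\n")
decs = read_decs_table(sample)
```

For the real file, the handle would come from something like `open("DeCS2020.tsv", encoding="utf-8", newline="")`.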
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
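One way to use the hierarchy is to collect the ancestors of a descriptor. The sketch below is a minimal line-based reader for the OBO flat-file format, not a full parser: it assumes the hierarchical relationships are encoded as `is_a:` tags inside `[Term]` stanzas, which is the usual OBO convention but has not been checked against DeCS2020.obo. Function names and the sample terms are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_is_a(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each term id to the ids of its direct parents (is_a targets)."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    current = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line[len("id:"):].strip()
        elif in_term and current and line.startswith("is_a:"):
            # Drop the optional trailing "! comment" part of the tag value.
            target = line[len("is_a:"):].split("!", 1)[0].strip()
            parents[current].add(target)
    return dict(parents)


def ancestors(code: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All terms reachable from `code` by following is_a links upward."""
    seen: Set[str] = set()
    stack = [code]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Made-up stanzas standing in for the contents of DeCS2020.obo.
sample = """[Term]
id: X1
name: raíz

[Term]
id: X2
name: hijo
is_a: X1 ! raíz

[Term]
id: X3
name: nieto
is_a: X2
""".splitlines()

hierarchy = read_is_a(sample)
x3_ancestors = ancestors("X3", hierarchy)
```

Expanding predicted or gold codes with their ancestors in this way is a common step when a hierarchy-aware comparison is wanted.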
*Note: The obo and tsv files with DeCS2020 descriptors contain some additional COVID19 descriptors that will be included in future versions of DeCS. These items were provided by the Pan American Health Organization (PAHO), which has kindly shared this content to improve the results of the task by taking these descriptors into account.
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at lgasco@bsc.es
Files (354.9 MB)
MD5 | Size |
---|---|
md5:d78b0bfe07dcce33e3a9452fee417032 | 33.8 MB |
md5:7d2e94715515a27564322d5fb3f09b74 | 21.4 MB |
md5:8c25bd99a5323bbc5678337eb551e4b8 | 8.1 MB |
md5:5f837e4bf5abfd4034089f741a115a6d | 243.2 MB |
md5:adc3a9b3ad637cab3d0b776540d10fa1 | 29.3 MB |
md5:60da5c64d7ced7e49b14383aa9b289a8 | 19.3 MB |