LivingNER corpus: recognition and normalization of species

Miranda-Escalada, Antonio; Farré-Maduell, Eulàlia; González Gacio, Gloria; Krallinger, Martin

doi:10.5281/zenodo.6376663

Published March 22, 2022 | Version 1.0

Journal article Open


1. Barcelona Supercomputing Center
2. Bitac

LivingNER corpus - training set

The LivingNER corpus is a collection of 2000 clinical cases from over 10 different medical areas, annotated with SPECIES mentions that are mapped to NCBI Taxonomy. It is used for the LivingNER Shared Task on species mention detection and normalization in Spanish medical documents, which will be held as part of IberLEF 2022.

The training set is composed of 1000 clinical cases from four different specialties: COVID, oncology, infectious diseases and tropical medicine. The files are distributed as follows:

- For subtask 1 (LivingNER - NER), annotations are distributed in a tab-separated values (TSV) file with the following columns:

filename: document name
mark: mention identifier
label: mention type (SPECIES or HUMAN)
off0: starting position of the mention in the document
off1: ending position of the mention in the document
span: textual span
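
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, assuming the TSV file starts with a header row that uses exactly the column names listed; the Mention class, the read_mentions helper and the sample rows are hypothetical and not part of the corpus.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Mention:
    filename: str
    mark: str
    label: str
    off0: int
    off1: int
    span: str


def read_mentions(handle):
    """Parse a subtask 1 TSV stream (header row assumed) into Mention objects."""
    # QUOTE_NONE keeps quote characters inside spans as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        Mention(
            filename=row["filename"],
            mark=row["mark"],
            label=row["label"],
            off0=int(row["off0"]),
            off1=int(row["off1"]),
            span=row["span"],
        )
        for row in reader
    ]


# Invented rows for illustration only; they are not taken from the corpus.
sample = (
    "filename\tmark\tlabel\toff0\toff1\tspan\n"
    "doc1\tT1\tSPECIES\t10\t17\tE. coli\n"
    "doc1\tT2\tHUMAN\t0\t8\tpaciente\n"
)
mentions = read_mentions(io.StringIO(sample))
```

For a real file, the same function can be called on open(path, encoding="utf-8", newline="").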

- For subtask 2 (LivingNER - NORM), annotations are distributed in a TSV file with the same columns as the previous one, plus:

isH: whether the span is narrower than the assigned NCBITax code
isN: whether the mention corresponds to a nosocomial infection
iscomplex: whether the span is assigned a combination of NCBITax codes
NCBITax: mention code in the NCBI Taxonomy
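
The extra columns could be loaded along the same lines. This sketch rests on two assumptions that the description does not confirm: that the flag columns hold the text True or False, and that a combination of codes is written as a single field with a "|" separator. The helper names and the sample rows are hypothetical, and the codes shown are only illustrative.

```python
import csv
import io


def parse_flag(value):
    """Interpret a flag column; ASSUMES the text 'True'/'False' (any case)."""
    return value.strip().lower() == "true"


def read_norm(handle, code_sep="|"):
    """Parse a subtask 2 TSV stream; code_sep is an ASSUMED separator."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            "filename": row["filename"],
            "mark": row["mark"],
            "span": row["span"],
            "isH": parse_flag(row["isH"]),
            "isN": parse_flag(row["isN"]),
            "iscomplex": parse_flag(row["iscomplex"]),
            # A complex mention yields several codes, a simple one yields one.
            "codes": row["NCBITax"].split(code_sep),
        })
    return rows


# Invented rows for illustration only; they are not taken from the corpus.
sample = (
    "filename\tmark\tlabel\toff0\toff1\tspan\tisH\tisN\tiscomplex\tNCBITax\n"
    "doc1\tT1\tSPECIES\t10\t17\tE. coli\tFalse\tFalse\tFalse\t562\n"
    "doc1\tT2\tSPECIES\t30\t45\tinfluenza A y B\tFalse\tFalse\tTrue\t11320|11520\n"
)
rows = read_norm(io.StringIO(sample))
```

If the real files encode flags or combined codes differently, only parse_flag and code_sep need to change.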

- For subtask 3 (LivingNER - Application), annotations are distributed in a TSV file. In this version of the sample set, the data for this subtask is pending.

All documents are distributed as plain-text files encoded in UTF-8.
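
Because the documents are UTF-8, it matters whether off0 and off1 are applied to decoded text or to raw bytes. The sketch below assumes they are character offsets into the decoded text, with off1 exclusive; this is not confirmed by the description, and the span_matches helper and the sample sentence are hypothetical.

```python
import tempfile
from pathlib import Path


def span_matches(text, off0, off1, span):
    """Check an annotation against the text (ASSUMES character offsets, off1 exclusive)."""
    return text[off0:off1] == span


# Write and re-read an invented document to mimic loading a corpus text file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "doc1.txt"
    path.write_text("Paciente con infección por E. coli.", encoding="utf-8")
    text = path.read_text(encoding="utf-8")

ok = span_matches(text, 27, 34, "E. coli")

# The same offsets applied to raw bytes drift after the accented character.
raw_slice = text.encode("utf-8")[27:34]
```

Running a check like this over every annotation is a quick way to confirm the offset convention before training a model on the data.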

For further information, please visit https://temu.bsc.es/livingner/ or email us at encargo-pln-life@bsc.es

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Name	Size
training.zip md5:3cfb430233ed6ffef1210c1a13585778	2.5 MB
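
The published checksum can be used to confirm that a download is intact. A minimal sketch with Python's hashlib follows; the function names are hypothetical, and the local path to training.zip is whatever location the archive was saved to.

```python
import hashlib

# Checksum published in the file listing for training.zip.
EXPECTED_MD5 = "3cfb430233ed6ffef1210c1a13585778"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """True if the file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected
```

For example, verify_archive("training.zip") should return True for an uncorrupted copy of the archive.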

