There is a newer version of the record available.

Published March 15, 2022 | Version 1.3.2
Dataset Open

SocialDisNER corpus: gold standard annotations for detection of disease mentions in Spanish tweets

  • 1. Barcelona Supercomputing Center

Description

Gold Standard annotations for SocialDisNER (SMM4H 2022 – Task 10) shared task.

Introduction:
The SocialDisNER corpus of SMM4H 2022 – Task 10 focuses on the recognition of disease mentions in tweets written in Spanish. Tweets were selected primarily for first-hand experience of diseases and other health-relevant content (from patient associations, professional healthcare institutions, and followers of patient association accounts covering a diversity of pathologies, including rare diseases, mental health, cancer, etc.).

The corpus was manually annotated by medical experts following the SMM4H-SocialDisNER guidelines. These guidelines were adapted from previous efforts to annotate patient clinical records and medical literature. They cover rules for annotating mentions of diseases in health-related tweets in Spanish.

The training set consists of 5000 tweets written in Spanish and the validation set consists of 2500 tweets written in Spanish. Both sets have been manually annotated by healthcare professionals. The unannotated test set will be published shortly.

The additional data contain disease mentions automatically extracted from a set of roughly 85000 tweets. A co-occurrence matrix of the extracted disease mentions is also shared.

File structure:

The structure of the zip file is: 

  • socialdisner.zip:
    • training-validation-data folder
      • train-valid-txt-files: Folder with training and validation text files. There is one text file per tweet; the file name corresponds to the tweet id. There is one sub-directory per corpus split (train and valid). The files named ids_dev_set.txt and ids_train_set.txt contain the list of file identifiers for each of the data splits (validation and train, respectively).
      • mentions.tsv: This file contains the manually annotated disease mentions. The file has the following fields:
        • tweets_id: The id of the tweet. You can use the Twitter API to query the content of the tweet.
        • Begin: The position in the tweet where the annotation starts.
        • End: The position of the last character of the annotation in the tweet.
        • Type: This is the type of entity found, in our case "ENFERMEDAD".
        • Extraction: This is the literal extraction, in other words, the fragment of text which refers to the annotation. 
    • additional-large_scale_data:
      • socialdisner_diseases.zip:
        • tweets_txt: Folder with the large-scale tweet database. There is one text file per tweet; the file name corresponds to the tweet id.
        • socialdiser_disease_mentions.tsv: This file contains the automatically annotated disease mentions from the large-scale SocialDisNER corpus (Silver Standard). The structure is the same as that of the Gold Standard annotations.
        • socialdiser_disease_net.tsv: This tsv file contains the co-mentions of the socialdisner-disease large-scale corpus, separated by ";". This file can be loaded into NetworkX to perform disease co-morbidity analysis on the socialdisner-disease large-scale data.
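
As an illustration of how the co-mention file could be turned into a weighted edge list, here is a minimal Python sketch. It assumes that each line of socialdiser_disease_net.tsv lists the disease mentions found together in one tweet, separated by ";". That layout is an assumption and should be checked against the actual file. The function name co_mention_edges and the toy lines are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_mention_edges(lines, sep=";"):
    """Count how often each unordered pair of mentions shares a line.

    Assumption: one line = one group of co-mentioned diseases.
    """
    edges = Counter()
    for line in lines:
        # Normalise case and drop empty fragments and duplicates (a design choice).
        mentions = sorted({m.strip().lower() for m in line.split(sep) if m.strip()})
        for a, b in combinations(mentions, 2):
            edges[(a, b)] += 1
    return edges


# Toy input standing in for the real file contents.
toy_lines = [
    "diabetes;hipertensión",
    "diabetes;hipertensión;obesidad",
    "cáncer",
]
edges = co_mention_edges(toy_lines)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

If NetworkX is installed, the resulting counts could be passed on as (node, node, weight) triples, for example through Graph.add_weighted_edges_from.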

Note: In previous versions of the dataset, the columns of the mentions.tsv file were not in the correct order. From this version onwards, the order is correct and matches the format required for submitting task predictions.
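
The following Python sketch shows one way to read a mentions file, check its column order, and compare the offsets against the tweet text. It assumes that the header row uses the field names in the order listed in the file structure (tweets_id, Begin, End, Type, Extraction); the exact spelling in the released file may differ. Because the field descriptions do not say whether End is inclusive or exclusive, the hypothetical helper span_convention tests both readings. The sample row and tweet text are invented for illustration.

```python
import csv
import io

# Assumed column order and spelling, taken from the field list in this record.
EXPECTED_COLUMNS = ["tweets_id", "Begin", "End", "Type", "Extraction"]


def read_mentions(handle):
    """Parse a tab-separated mentions file into dicts with integer offsets."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    mentions = []
    for row in reader:
        row["Begin"] = int(row["Begin"])
        row["End"] = int(row["End"])
        mentions.append(row)
    return mentions


def span_convention(text, begin, end, extraction):
    """Report which offset reading reproduces the extraction, if any."""
    if text[begin:end] == extraction:
        return "exclusive"  # End points one past the last character
    if text[begin:end + 1] == extraction:
        return "inclusive"  # End points at the last character
    return None


# Invented example data, not taken from the corpus.
sample_tsv = (
    "tweets_id\tBegin\tEnd\tType\tExtraction\n"
    "123\t6\t14\tENFERMEDAD\tdiabetes\n"
)
sample_text = "Tengo diabetes tipo 1"

mentions = read_mentions(io.StringIO(sample_tsv))
first = mentions[0]
result = span_convention(sample_text, first["Begin"], first["End"], first["Extraction"])
print(first["tweets_id"], result)
```

With the real data, the text would be read from the file named after the tweet id in the corresponding split folder.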

 

For further information, please visit https://temu.bsc.es/socialdisner/

Summary statistics:

Manually annotated data

                       Training set    Development set
# tweets               5000            2500
# characters           1253431         516768
# tokens               211555          84478
Avg. char / tweet      250.69          206.71
Avg. tok. / tweet      42.31           33.79
# mentions             15173           4252
# unique mentions      4407            1413

 

Large-scale annotated data (Silver Standard)

                       Socialdisner-diseases
# tweets               85077
# characters           19920670
# tokens               3236411
Avg. char / tweet      234.15
Avg. tok. / tweet      38.04
# mentions             116260
# unique mentions      16034
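
The per-tweet averages above are the totals divided by the number of tweets (for example, 1253431 / 5000 ≈ 250.69). The short Python sketch below computes the same kind of statistics for any list of tweet texts. It assumes whitespace tokenization, which may not match the tokenizer used to produce the published token counts. The function name corpus_stats and the toy texts are hypothetical.

```python
def corpus_stats(texts):
    """Return tweet, character and token counts plus per-tweet averages.

    Assumption: tokens are whitespace-separated chunks.
    """
    n_tweets = len(texts)
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(t.split()) for t in texts)
    return {
        "tweets": n_tweets,
        "characters": n_chars,
        "tokens": n_tokens,
        "avg_char_per_tweet": round(n_chars / n_tweets, 2) if n_tweets else 0.0,
        "avg_tok_per_tweet": round(n_tokens / n_tweets, 2) if n_tweets else 0.0,
    }


# Toy texts standing in for the per-tweet text files.
toy_texts = ["Tengo gripe", "Dolor de cabeza fuerte"]
stats = corpus_stats(toy_texts)
print(stats)
```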

 

 

Do not share the data with other individuals or teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.

 

 

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Files (37.9 MB)

Name: socialdisner.zip
Size: 37.9 MB
md5: 65ddbbefe5a1521f127785973278c8f5