Published October 18, 2024 | Version V2.1
Publication Open

LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations

  • 1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • 2. TurkuNLP Group, Department of Computing, University of Turku, Finland

Description

In this study, we introduce LSD600, the first corpus specifically focused on relations between lifestyle factors (LSFs) and diseases. LSD600 consists of 600 abstracts annotated with LSF-disease relations, comprising 1897 relations of eight different types.

The LSF entities in these relations are pre-annotated using a comprehensive LSF classification (Nourani et al., 2024) and cover a wide spectrum of lifestyle factors belonging to nine categories.

We have used LSD600 to train a transformer-based model on the multi-label LSF-disease relation extraction (RE) task.

Included Files:

  1. LSD600 corpus in BRAT format
  2. LSD600 corpus annotation guidelines
  3. LSD600 Metadata 
  4. BRAT config files: Configuration files for loading and visualizing the LSD600 corpus in the BRAT annotation tool.
  5. Best-performing trained model for LSF-disease relation extraction
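
Since the corpus is distributed in BRAT format, a reader may want to load the annotations outside the BRAT tool. The following is a minimal sketch of a parser for BRAT standoff `.ann` content, assuming the standard tab-separated layout of text-bound annotations (`T` lines) and binary relations (`R` lines). The sample sentence, the entity labels `Lifestyle_factor` and `Disease`, and the relation label `Example_Relation` are hypothetical placeholders, not the labels actually used in LSD600.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    brat_id: str
    label: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    brat_id: str
    label: str
    arg1: str
    arg2: str


def parse_ann(ann_text):
    """Parse BRAT standoff annotations into entities and binary relations."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        brat_id = fields[0]
        if brat_id.startswith("T"):
            # Text-bound annotation: T1<TAB>Label start end<TAB>covered text
            label, offsets = fields[1].split(" ", 1)
            # Discontinuous spans separate fragments with ';'.
            # This sketch keeps only the outer bounds of the span.
            numbers = [int(n) for frag in offsets.split(";") for n in frag.split()]
            entities[brat_id] = Entity(brat_id, label, min(numbers), max(numbers), fields[2])
        elif brat_id.startswith("R"):
            # Binary relation: R1<TAB>Label Arg1:T1 Arg2:T2
            label, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(brat_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
        # Other line types (attributes, notes, events) are ignored here.
    return entities, relations


# Hypothetical sample; labels are placeholders, not LSD600's actual types.
sample_text = "Smoking increases the risk of lung cancer."
sample_ann = (
    "T1\tLifestyle_factor 0 7\tSmoking\n"
    "T2\tDisease 30 41\tlung cancer\n"
    "R1\tExample_Relation Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(sample_ann)
for rel in relations:
    print(rel.label, entities[rel.arg1].text, "->", entities[rel.arg2].text)
```

Keeping entities in a dictionary keyed by BRAT identifier makes it straightforward to resolve the two arguments of each relation back to their text spans.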


LSD600 Metadata

  1. Columns in the Consolidated Relations Dataset:

    • Disease_Entity: The disease-related term annotated in the text.
    • LSF_Entity: The lifestyle factor (LSF) entity annotated in the text.
    • Relationship_Type: The type of relationship between the Disease and LSF entities.
    • Publication_ID: The PubMed ID of the abstract.
    • Data_Set: The corpus split the abstract belongs to (Train/Dev/Test).
    • Disease_BRAT_ID: The BRAT tool’s unique identifier for the Disease entity in the dataset.
    • LSF_BRAT_ID: The BRAT tool’s unique identifier for the LSF entity in the dataset.
    • Disease_Entity_Span: The character span (start and end positions) of the Disease entity in the text.
    • LSF_Entity_Span: The character span (start and end positions) of the LSF entity in the text.
    • Abstract_Text: The full abstract text from which the relationship was extracted.

  2. Columns in the LSD600 metadata:

    • Publication_ID: The PubMed ID.
    • Article_Title
    • Journal_Name
    • Publication_Year
    • Data_Set: The corpus split the abstract belongs to (Train/Dev/Test).
    • Lifestyle_Factor_Count: The number of lifestyle factors (LSFs) identified in the abstract.
    • Disease_Count: The number of disease entities identified in the abstract.
    • Relations_Count: The total number of relationships between lifestyle factors and diseases annotated in the abstract.
    • Abstract_Text
    • Unique_Lifestyle_Factors: The list of unique lifestyle factors annotated in the abstract.
    • Unique_Diseases: The list of unique diseases annotated in the abstract.
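
The two tables described above share the Publication_ID column, so they can be cross-checked against each other. The sketch below assumes both tables are available as comma-separated text with the column names listed above; the actual file format and file names are not stated here, and the rows shown are toy placeholders rather than corpus data. It also assumes that Relations_Count equals the number of rows per abstract in the Consolidated Relations Dataset, which may not hold exactly.

```python
import csv
import io
from collections import Counter

# Toy placeholder rows (NOT real corpus data), using a subset of the columns.
relations_csv = """Publication_ID,Disease_Entity,LSF_Entity,Relationship_Type,Data_Set
0000001,disease A,factor X,TYPE_1,Train
0000001,disease A,factor Y,TYPE_2,Train
0000002,disease B,factor X,TYPE_1,Test
"""

metadata_csv = """Publication_ID,Data_Set,Relations_Count
0000001,Train,2
0000002,Test,1
"""


def count_relations(relation_rows):
    """Count relation rows per Publication_ID."""
    return Counter(row["Publication_ID"] for row in relation_rows)


def find_mismatches(relation_rows, metadata_rows):
    """Return Publication_IDs whose Relations_Count differs from the row count."""
    counts = count_relations(relation_rows)
    return [
        row["Publication_ID"]
        for row in metadata_rows
        if counts.get(row["Publication_ID"], 0) != int(row["Relations_Count"])
    ]


relation_rows = list(csv.DictReader(io.StringIO(relations_csv)))
metadata_rows = list(csv.DictReader(io.StringIO(metadata_csv)))

mismatches = find_mismatches(relation_rows, metadata_rows)
print("Abstracts with inconsistent counts:", mismatches)
```

To run this on the released files, the in-memory strings would be replaced by opened file handles, with the delimiter adjusted if the files are not comma-separated.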

Files

RE_Annotation_Guidelines.pdf

Files (1.3 GB)

  • 1.3 GB, md5:ed1dbd383b3a63c87074a4aa0da34b08
  • 5.2 kB, md5:6ac9dff6056fae5499d17b31536d37e9
  • 4.0 MB, md5:39697f2a5cedb9b18a4f84df5f90a8a7
  • 610.9 kB, md5:3d4560b656d9d276a340fcd9b6c7c7ad
  • 1.2 MB, md5:9f81498f2e82430518a6cb98616064f1
  • 208.7 kB, md5:7177bbbfb4ef2c937c3297e37a97fcba
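
The MD5 checksums listed above can be used to confirm that a download completed intact, which is most useful for the 1.3 GB file. A minimal sketch using Python's standard `hashlib` module follows; the function names are illustrative, and the path passed in would be wherever the file was saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a large file into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:ed1d...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(local_path, "md5:ed1dbd383b3a63c87074a4aa0da34b08")` returns `True` only when the local file's digest matches the listed value.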

Additional details

Dates

Created
2024
