Published October 18, 2024 | Version V2.1
Publication
Open
LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations
Creators
- 1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
- 2. TurkuNLP Group, Department of Computing, University of Turku, Finland
Description
In this study, we introduce LSD600, the first corpus specifically focused on relations between lifestyle factors (LSFs) and diseases. LSD600 consists of 600 abstracts annotated with LSF-disease relations, encompassing 1897 relations covering eight different relation types.
The LSF entities in these relations are pre-annotated using a comprehensive LSF classification (Nourani et al., 2024) and cover a wide spectrum of lifestyle factors belonging to nine categories.
We have used LSD600 to train a transformer-based model on the multi-label LSF-disease RE task.
Included Files:
- LSD600 corpus in BRAT format
- LSD600 corpus annotation guidelines
- LSD600 Metadata
- BRAT config files: Configuration files for loading and visualizing the LSD600 corpus in the BRAT annotation tool.
- Best trained model for LSF-disease relation extraction
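Because the corpus is distributed in BRAT format, the annotations can be read with a few lines of code. The following is a minimal sketch of a parser for BRAT standoff (`.ann`) text, assuming the usual layout of tab-separated text-bound (`T`) and relation (`R`) lines. The sample text and the labels `Lifestyle_factor`, `Disease`, and `ExampleRelation` are placeholders chosen for illustration, not the corpus's actual types, which are defined in the BRAT config files.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    brat_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


@dataclass
class Relation:
    brat_id: str
    label: str
    arg1: str
    arg2: str


def parse_ann(ann_text):
    """Parse BRAT standoff text into entities (by ID) and binary relations."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        brat_id = fields[0]
        if brat_id.startswith("T"):
            # Text-bound line: "T1<TAB>Label 0 5;10 15<TAB>surface text"
            # (";" separates the fragments of a discontinuous span).
            label, offsets = fields[1].split(" ", 1)
            spans = [tuple(int(n) for n in part.split()) for part in offsets.split(";")]
            entities[brat_id] = Entity(brat_id, label, spans, fields[2])
        elif brat_id.startswith("R"):
            # Relation line: "R1<TAB>Label Arg1:T1 Arg2:T2"
            label, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(brat_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Made-up document and annotations, for illustration only.
sample_text = "Smoking increases the risk of lung cancer."
sample_ann = (
    "T1\tLifestyle_factor 0 7\tSmoking\n"
    "T2\tDisease 30 41\tlung cancer\n"
    "R1\tExampleRelation Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_ann(sample_ann)
```

Other BRAT line types (events, attributes, notes) are skipped by this sketch; whether the corpus uses them should be checked against the annotation guidelines.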
LSD600 Metadata:
Columns in the Consolidated Relations Dataset:
- Disease_Entity: The disease-related term annotated in the text.
- LSF_Entity: The lifestyle factor (LSF) entity annotated in the text.
- Relationship_Type: The type of relationship between the Disease and LSF entities.
- Publication_ID: The PubMed ID.
- Data_Set: The corpus split (Train/Dev/Test).
- Disease_BRAT_ID: The BRAT tool’s unique identifier for the Disease entity in the dataset.
- LSF_BRAT_ID: The BRAT tool’s unique identifier for the LSF entity in the dataset.
- Disease_Entity_Span: The character span (start and end positions) of the Disease entity in the text.
- LSF_Entity_Span: The character span (start and end positions) of the LSF entity in the text.
- Abstract_Text: The full abstract text from which the relationship was extracted.
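Since the relation extraction task is multi-label, one entity pair may appear in several rows with different relationship types. A minimal sketch of collecting those rows into one label set per pair is shown below. It assumes the dataset can be read as CSV with the column names listed above; the file format, the sample rows, and the labels `TypeA` and `TypeB` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a subset of the documented columns; values are placeholders.
SAMPLE_CSV = """Publication_ID,Data_Set,Disease_BRAT_ID,LSF_BRAT_ID,Relationship_Type
111,Train,T2,T1,TypeA
111,Train,T2,T1,TypeB
222,Test,T5,T3,TypeA
"""


def group_labels(rows, data_set=None):
    """Map (Publication_ID, Disease_BRAT_ID, LSF_BRAT_ID) to its set of relation types.

    If data_set is given, only rows from that split are kept.
    """
    labels = defaultdict(set)
    for row in rows:
        if data_set is not None and row["Data_Set"] != data_set:
            continue
        key = (row["Publication_ID"], row["Disease_BRAT_ID"], row["LSF_BRAT_ID"])
        labels[key].add(row["Relationship_Type"])
    return dict(labels)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
pair_labels = group_labels(rows)
train_labels = group_labels(rows, data_set="Train")
```

Keying on the BRAT identifiers rather than on the entity strings keeps two mentions of the same term in one abstract apart.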
Columns in the LSD600 metadata:
- Publication_ID: The PubMed ID.
- Article_Title
- Journal_Name
- Publication_Year
- Data_Set: The corpus split (Train/Dev/Test).
- Lifestyle_Factor_Count: The number of lifestyle factors (LSFs) identified in the abstract.
- Disease_Count: The number of disease entities identified in the abstract.
- Relations_Count: The total number of relationships between lifestyle factors and diseases annotated in the abstract.
- Abstract_Text
- Unique_Lifestyle_Factors: The list of unique lifestyle factors annotated in the abstract.
- Unique_Diseases: The list of unique diseases annotated in the abstract.
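The per-abstract counts in the metadata lend themselves to a quick summary per split. The sketch below tallies abstracts and relations for each `Data_Set` value, again assuming a CSV file with the column names above; the sample rows are invented. If `Relations_Count` is recorded per abstract as described, the totals over the real file would be expected to match the 600 abstracts and 1897 relations stated in the description, but that is an assumption to verify rather than a guarantee.

```python
import csv
import io
from collections import Counter

# Made-up metadata rows; only the columns needed for the summary are included.
SAMPLE_METADATA = """Publication_ID,Data_Set,Relations_Count
111,Train,3
222,Train,0
333,Test,2
"""


def summarize_splits(rows):
    """Return (abstracts per split, relations per split) as two Counters."""
    abstracts, relations = Counter(), Counter()
    for row in rows:
        abstracts[row["Data_Set"]] += 1
        relations[row["Data_Set"]] += int(row["Relations_Count"])
    return abstracts, relations


metadata_rows = list(csv.DictReader(io.StringIO(SAMPLE_METADATA)))
abstracts_per_split, relations_per_split = summarize_splits(metadata_rows)
```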
Files
RE_Annotation_Guidelines.pdf
Total size: 1.3 GB
Checksum | Size |
---|---|
md5:ed1dbd383b3a63c87074a4aa0da34b08 | 1.3 GB |
md5:6ac9dff6056fae5499d17b31536d37e9 | 5.2 kB |
md5:39697f2a5cedb9b18a4f84df5f90a8a7 | 4.0 MB |
md5:3d4560b656d9d276a340fcd9b6c7c7ad | 610.9 kB |
md5:9f81498f2e82430518a6cb98616064f1 | 1.2 MB |
md5:7177bbbfb4ef2c937c3297e37a97fcba | 208.7 kB |
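The MD5 checksums listed above can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` follows; the helper names `md5_of_file` and `matches_listing` are illustrative, and the local path of a downloaded file is whatever the reader chose when saving it.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for the 1.3 GB file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:ed1dbd383b3a63c87074a4aa0da34b08")` would check a local copy of the largest file, where `local_path` is a placeholder for its location on disk.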
Additional details
Dates
- Created: 2024
Software
- Repository URL: https://github.com/EsmaeilNourani/LSF_Disease_RE/