Variantscape datasets
Description
Variantscape dataset
LLM-based extraction of genetic variants and biomedical entities from titles and abstracts of biomedical publications. These datasets support the analysis of literature-derived co-associations between genetic variants, cancer types, and treatments, enabling downstream network analysis, hypothesis generation, and discovery in precision oncology.
1. Dataset: Cleaned literature dataset for biomedical entity extraction (2014–2024)
"cleaned_OpenAlex.csv"
A pre-processed, cleaned, and structured dataset of cancer-related biomedical publications (2014–2024) retrieved from OpenAlex, containing titles, abstracts, and metadata curated for downstream NLP and LLM-based biomedical entity extraction.
2. Dataset: Binary entity matrix for co-association and network analysis
"dataset_for_analysis.csv"
Final binary matrix dataset derived from NLP- and LLM-based entity extraction on cancer-related literature. Entities include genetic variants, cancer types, and treatments, enabling co-occurrence and network analysis, and the investigation of literature-derived co-associations.
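As a rough illustration of how a binary entity matrix can feed co-occurrence analysis, the sketch below counts, for every pair of columns, the number of rows in which both entities are marked 1. The column names (variant_A, cancer_X, treatment_T) and the toy rows are invented for illustration and are not taken from dataset_for_analysis.csv; the function name cooccurrence_counts is likewise hypothetical and does not come from the Variantscape repository.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(rows, columns):
    """Count, for each pair of columns, the rows where both values are 1."""
    counts = Counter()
    for row in rows:
        # Entities flagged as present in this row (publication)
        present = sorted(c for c in columns if row.get(c) == 1)
        # Every unordered pair of present entities co-occurs once in this row
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Toy data: invented column names standing in for real entity columns
toy_columns = ["cancer_X", "treatment_T", "variant_A"]
toy_rows = [
    {"variant_A": 1, "cancer_X": 1, "treatment_T": 1},
    {"variant_A": 1, "cancer_X": 1, "treatment_T": 0},
    {"variant_A": 0, "cancer_X": 1, "treatment_T": 0},
]

pair_counts = cooccurrence_counts(toy_rows, toy_columns)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

The resulting pair counts can serve as edge weights when building a co-association network. For the full matrix, a vectorized approach (for example a matrix product of the binary matrix with its transpose) would likely be more efficient than this row-by-row loop.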
3. Dataset: LLM-based classification of variant-treatment co-associations
"variant_treatment_relationship_consensus.csv"
Dataset capturing LLM-based classification and consensus on co-associations between genetic variants and treatments.
4. Dataset: Metadata mapping for entity extraction and analysis
"metadata_mapping_transposed.csv"
Transposed, row-indexed metadata mapping file that identifies each column as a variant, cancer type, treatment, study design element, or publication-derived metadata.
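A mapping of this kind can be used to select groups of columns by entity type before analysis. The sketch below assumes, purely for illustration, that each row of the mapping holds a column name and its category under the headers column_name and category; the actual headers and layout of metadata_mapping_transposed.csv may differ, and the toy content is invented.

```python
import csv
import io
from collections import defaultdict

# Invented stand-in for the mapping file; real headers and values may differ
toy_mapping_csv = """column_name,category
variant_A,variant
cancer_X,cancer_type
treatment_T,treatment
publication_year,metadata
"""


def group_columns_by_category(handle, name_field="column_name",
                              category_field="category"):
    """Return a dict mapping each category to the list of column names in it."""
    groups = defaultdict(list)
    for record in csv.DictReader(handle):
        groups[record[category_field]].append(record[name_field])
    return dict(groups)


groups = group_columns_by_category(io.StringIO(toy_mapping_csv))
print(groups["variant"])    # columns assumed to hold genetic variants
print(groups["treatment"])  # columns assumed to hold treatments
```

To apply this to the real file, the io.StringIO object would be replaced by an open file handle, and the two field names adjusted to match the actual headers.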
Files
cleaned_OpenAlex.csv
Additional details
Dates
- Created
- 2025-04-23
Software
- Repository URL
- https://github.com/hastingslab-org/Variantscape
- Programming language
- Python
- Development Status
- Active