Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports
General Information
- Title: Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports
- Version: 1.0.0
- Date of export: 2026-02-27
- Created by: Silene Systems (AI-powered clinical data extraction pipeline)
- Disease: Fabry Disease (OMIM #301500)
- Gene: GLA (Xq22.1)
- Inheritance: X-linked
Description
This dataset contains structured clinical data for 154 patients with Fabry disease, automatically extracted from 93 peer-reviewed case reports and case series indexed in PubMed Central. Data was extracted using a multi-agent AI pipeline with automated validation, clinical plausibility checks, and human review.
Each patient record includes demographics, genetic variants, clinical symptoms with temporal information, laboratory values, and links to source publications. Symptoms are provided in two formats: original AI-extracted terms and HPO (Human Phenotype Ontology)-normalized mappings with confidence scores.
The dataset is intended for rare disease research, phenotype-genotype correlation studies, natural history analysis, and machine learning applications in clinical genetics.
Dataset Summary
| Metric | Value |
|---|---|
| Patients | 154 |
| Publications | 93 |
| Symptom observations (original) | 2,039 |
| Symptom observations (HPO-normalized) | 2,017 |
| Unique symptom names | 803 |
| Lab value measurements | 1,264 |
| Patients with genetic variant | 103 |
| Unique genetic variants | 60 |
Demographics
| Sex | Count |
|---|---|
| Male | 73 |
| Female | 73 |
| Unknown | 8 |

| Age metric (years) | Patients with data | Min | Max | Median |
|---|---|---|---|---|
| Age at symptom onset | 54 | 0 | 74 | 36 |
| Age at diagnosis | 124 | 4 | 75 | 48 |
Phenotype Distribution
| Phenotype | Count |
|---|---|
| Classic | 66 |
| Later-onset | 67 |
| Asymptomatic carrier | 1 |
File Descriptions
patients.csv (154 rows)
One row per patient. Contains demographics, genetics, and clinical summary.
| Column | Type | Description |
|---|---|---|
| case_id | string | Unique patient identifier (e.g., FAB-001) |
| sex | string | Patient sex (male/female) |
| age_at_symptom_onset | numeric | Age in years at first symptom onset |
| age_at_diagnosis | numeric | Age in years at Fabry disease diagnosis |
| diagnostic_delay_years | numeric | Age at diagnosis minus age at symptom onset, in years |
| genetic_variant | string | GLA gene variant in HGVS notation (e.g., c.644A>G) |
| protein_change | string | Protein-level change (e.g., p.Asn215Ser) |
| zygosity | string | Hemizygous, heterozygous, or homozygous |
| phenotype | string | Clinical phenotype: classic, later-onset, or asymptomatic carrier |
| disease_stage | string | Disease stage if reported |
| diagnosis_summary | string | AI-generated clinical summary |
| extraction_confidence | numeric | AI extraction confidence score (0-1) |
| data_completeness | numeric | Proportion of fields populated (0-1) |
| publication_pmcid | string | PubMed Central ID of source publication |
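Because diagnostic_delay_years is described as derived from the two age columns, a reader may want to confirm that relationship before using it. The sketch below is one way to do that with the Python standard library. It assumes that missing values appear as empty cells, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io


def parse_float(text):
    """Return a float, or None for an empty or non-numeric cell."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def delay_mismatches(rows, tolerance=0.01):
    """Return case_ids whose stored delay differs from diagnosis minus onset."""
    mismatched = []
    for row in rows:
        onset = parse_float(row.get("age_at_symptom_onset"))
        diagnosis = parse_float(row.get("age_at_diagnosis"))
        stored = parse_float(row.get("diagnostic_delay_years"))
        if None in (onset, diagnosis, stored):
            continue  # cannot check rows with a missing age or delay
        if abs((diagnosis - onset) - stored) > tolerance:
            mismatched.append(row["case_id"])
    return mismatched


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "case_id,age_at_symptom_onset,age_at_diagnosis,diagnostic_delay_years\n"
    "FAB-901,10,25,15\n"
    "FAB-902,,48,\n"
    "FAB-903,30,42,10\n"
)
rows = list(csv.DictReader(sample))
print(delay_mismatches(rows))  # ['FAB-903']
```

To run the same check on the real file, replace the sample with an opened patients.csv, for example `open("patients.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.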
symptoms.csv (2,039 rows)
One row per symptom observation, sourced from the original AI extraction. Preserves the exact clinical terminology used in source publications.
| Column | Type | Description |
|---|---|---|
| case_id | string | Patient identifier |
| symptom_name | string | Original symptom name as extracted from the publication |
| present | boolean | Whether the symptom is present (True) or explicitly absent (False) |
| details | string | Additional clinical details or context |
| timepoint_raw | string | Raw temporal reference from the publication |
| timepoint_category | string | Normalized category: onset, diagnosis, follow_up, etc. |
| age_at_event | numeric | Patient age at the time of this observation |
| offset_months | numeric | Months relative to a reference event |
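Since the present column distinguishes reported findings from explicitly negated ones, it is worth separating the two before counting symptoms. The following is a minimal sketch, assuming the flag is stored as the text True or False. The helper names and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_present(text):
    """Map the CSV strings 'True'/'False' to booleans; anything else to None."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    return None


def split_by_presence(rows):
    """Group symptom names per patient into present and explicitly absent sets."""
    result = defaultdict(lambda: {"present": set(), "absent": set()})
    for row in rows:
        flag = parse_present(row["present"])
        if flag is None:
            continue  # skip rows whose flag cannot be interpreted
        key = "present" if flag else "absent"
        result[row["case_id"]][key].add(row["symptom_name"])
    return dict(result)


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "case_id,symptom_name,present\n"
    "FAB-901,acroparesthesia,True\n"
    "FAB-901,angiokeratoma,False\n"
    "FAB-902,proteinuria,True\n"
)
by_patient = split_by_presence(csv.DictReader(sample))
print(by_patient["FAB-901"]["absent"])  # {'angiokeratoma'}
```

Keeping the absent set separate avoids counting a negated finding as an observed symptom.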
symptoms_hpo.csv (2,017 rows)
One row per symptom mapped to the Human Phenotype Ontology (HPO). Enables standardized phenotype analysis and cross-dataset comparisons. HPO mappings have been validated and corrected using LLM-based quality review.
| Column | Type | Description |
|---|---|---|
| case_id | string | Patient identifier |
| original_name | string | Original symptom name from the publication |
| canonical_name | string | HPO-normalized term name |
| hpo_id | string | HPO identifier (e.g., HP:0001639) |
| present | boolean | Whether the symptom is present (True) or absent (False) |
| details | string | Additional clinical details |
| timepoint | string | Temporal reference |
| context | string | Clinical context |
| confidence | numeric | Mapping confidence score (0-1) |
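For phenotype frequency analysis, one plausible approach is to count distinct patients per HPO term while dropping negated findings and low-confidence mappings. The sketch below assumes that a missing HPO ID appears as an empty cell and uses a cutoff of 0.8, which is an arbitrary illustration and not a threshold recommended by the dataset. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def patients_per_hpo_term(rows, min_confidence=0.8):
    """Count distinct patients per HPO ID; also count present rows without an ID."""
    patients = defaultdict(set)
    unmapped = 0
    for row in rows:
        if row["present"].strip().lower() != "true":
            continue  # skip explicitly absent findings
        hpo_id = row["hpo_id"].strip()
        if not hpo_id:
            unmapped += 1  # canonical name present but no HPO ID
            continue
        try:
            confidence = float(row["confidence"])
        except ValueError:
            continue  # skip rows without a usable confidence score
        if confidence >= min_confidence:
            patients[hpo_id].add(row["case_id"])
    counts = {hpo_id: len(ids) for hpo_id, ids in patients.items()}
    return counts, unmapped


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "case_id,canonical_name,hpo_id,present,confidence\n"
    "FAB-901,Hypertrophic cardiomyopathy,HP:0001639,True,0.95\n"
    "FAB-902,Hypertrophic cardiomyopathy,HP:0001639,True,0.90\n"
    "FAB-902,Proteinuria,HP:0000093,True,0.60\n"
    "FAB-903,Proteinuria,HP:0000093,False,0.99\n"
    "FAB-903,Some unmapped term,,True,0.70\n"
)
counts, unmapped = patients_per_hpo_term(csv.DictReader(sample))
print(counts, unmapped)  # {'HP:0001639': 2} 1
```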
lab_values.csv (1,264 rows)
One row per laboratory measurement or diagnostic test result.
| Column | Type | Description |
|---|---|---|
| case_id | string | Patient identifier |
| lab_name | string | Name of the laboratory test or biomarker |
| value | string | Measured value (numeric or descriptive) |
| unit | string | Unit of measurement |
| context | string | Clinical context (e.g., baseline, post-treatment) |
| timepoint_raw | string | Raw temporal reference |
| timepoint_category | string | Normalized temporal category |
| age_at_event | numeric | Patient age at measurement |
| offset_months | numeric | Months relative to a reference event |
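Because the value column is a string that may hold either a number or a description, numeric analysis needs a parsing step first. The sketch below shows one simple way to separate the two cases. It treats anything that `float()` cannot parse, including inequality forms such as "<0.5", as descriptive. The lab names, units, and values in the sample are made up.

```python
import csv
import io
from collections import defaultdict


def split_lab_value(raw):
    """Return (number, None) for numeric text, otherwise (None, text or None)."""
    text = raw.strip()
    try:
        return float(text), None
    except ValueError:
        return None, text or None


def numeric_values_by_lab(rows):
    """Collect numeric values keyed by (lab_name, unit); descriptive values are skipped."""
    grouped = defaultdict(list)
    for row in rows:
        number, _ = split_lab_value(row["value"])
        if number is not None:
            grouped[(row["lab_name"], row["unit"])].append(number)
    return dict(grouped)


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "case_id,lab_name,value,unit\n"
    "FAB-901,alpha-galactosidase A activity,1.2,nmol/h/mL\n"
    "FAB-902,alpha-galactosidase A activity,0.4,nmol/h/mL\n"
    "FAB-902,urinalysis,proteinuria present,\n"
    "FAB-903,lyso-Gb3,<0.5,ng/mL\n"
)
grouped = numeric_values_by_lab(csv.DictReader(sample))
print(grouped)
```

Grouping by both lab name and unit keeps measurements reported in different units apart, since the dataset does not state that units are harmonized.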
publications.csv (93 rows)
One row per source publication.
| Column | Type | Description |
|---|---|---|
| pmcid | string | PubMed Central ID |
| pmid | string | PubMed ID |
| doi | string | Digital Object Identifier |
| title | string | Publication title |
| authors | string | Author list (semicolon-separated) |
| journal | string | Journal name |
| publication_date | date | Publication date (YYYY-MM-DD) |
| license | string | Publication license |
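Patient records link to their source through publication_pmcid in patients.csv and pmcid in publications.csv. A minimal sketch of that join, plus splitting the semicolon-separated author list, might look like the following. The function names, identifiers, titles, and author names are hypothetical.

```python
import csv
import io
from collections import Counter


def split_authors(text):
    """Split a semicolon-separated author string into a clean list of names."""
    return [name.strip() for name in text.split(";") if name.strip()]


def patients_per_publication(patient_rows, publication_rows):
    """Return {pmcid: (title, patient_count)} for every PMCID seen in the patient rows."""
    titles = {row["pmcid"]: row["title"] for row in publication_rows}
    counts = Counter(row["publication_pmcid"] for row in patient_rows)
    return {
        pmcid: (titles.get(pmcid, "(not in publications.csv)"), n)
        for pmcid, n in counts.items()
    }


# Made-up rows for illustration only; not taken from the dataset.
patients = list(csv.DictReader(io.StringIO(
    "case_id,publication_pmcid\n"
    "FAB-901,PMC0000001\n"
    "FAB-902,PMC0000001\n"
    "FAB-903,PMC0000002\n"
)))
publications = list(csv.DictReader(io.StringIO(
    "pmcid,title,authors\n"
    "PMC0000001,Hypothetical case series,Doe J; Roe R\n"
    "PMC0000002,Hypothetical case report,Poe E\n"
)))
linked = patients_per_publication(patients, publications)
print(linked["PMC0000001"])                       # ('Hypothetical case series', 2)
print(split_authors(publications[0]["authors"]))  # ['Doe J', 'Roe R']
```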
dataset_summary.json
Machine-readable summary with aggregate statistics including patient counts, demographic distributions, genetic variant counts, and phenotype breakdown.
Methodology
Data Extraction Pipeline
- Publication Discovery — Automated PubMed search for Fabry disease case reports (publications from 2015 onwards)
- Relevance Screening — AI-based abstract screening to filter relevant case reports
- Case Extraction — Structured patient data extraction from full-text articles using Claude Opus 4.6 (Claude, Anthropic)
- Clinical Validation — Automated checks for temporal logic, lab value plausibility, and genetic consistency against disease-specific profiles
- HPO Normalization — Mapping of extracted symptom terms to Human Phenotype Ontology, with LLM-based validation and correction of mappings
- Timepoint Normalization — Structuring of free-text temporal references into standardized categories with chronological ordering
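The pipeline's actual validation rules are not published here. As an illustration only, a temporal-logic check of the general kind named under Clinical Validation could be sketched as follows. The function name, the 0-120 year bound, and the rule that onset should not follow diagnosis are all assumptions, not the pipeline's implementation.

```python
def temporal_issues(onset, diagnosis, max_age=120):
    """Return human-readable flags for implausible age combinations.

    Ages are in years; None means the age was not reported.
    The rules and the max_age bound are illustrative assumptions.
    """
    issues = []
    for label, age in (("onset", onset), ("diagnosis", diagnosis)):
        if age is not None and not 0 <= age <= max_age:
            issues.append(f"{label} age {age} outside 0-{max_age}")
    if onset is not None and diagnosis is not None and onset > diagnosis:
        issues.append("symptom onset after diagnosis")
    return issues


print(temporal_issues(10, 25))     # []
print(temporal_issues(40, 30))     # ['symptom onset after diagnosis']
print(temporal_issues(None, 130))  # ['diagnosis age 130 outside 0-120']
```

Flags from a check like this are best treated as prompts for manual review rather than errors. A patient diagnosed through family screening, for example, can legitimately develop symptoms after diagnosis.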
Quality Assurance
- Extraction confidence and data completeness scores per patient
- Automated clinical plausibility validation
- HPO mapping quality validated by LLM review
- Negated findings (symptoms explicitly absent) are captured with present=False
- Manual audit of random samples against source publications confirmed >90% symptom completeness and near-complete laboratory value extraction
Limitations
- Data quality depends on the detail and clarity of source publications
- Some clinical findings may be missed if not explicitly stated in the text
- About 246 symptoms have a canonical name but no HPO ID, because the term was not found in the reference ontology
- Temporal information is not always available or precisely reported in source publications
- Dataset is limited to publications indexed in PubMed Central with available full text
License
The structured data in this dataset is derived from published scientific literature. Individual source publications retain their original licenses (see publications.csv). This compiled dataset is made available under CC BY 4.0.
Files (731.7 kB)
| MD5 checksum | Size |
|---|---|
| 78095cf31a7fb08b57f5723f6e52dc3f | 785 Bytes |
| ce7e7f49950b954a8dd392a331c4ca24 | 97.3 kB |
| b65e35b7d24367d28ae2f9434594a7c1 | 69.1 kB |
| 114ce5e5e4dc191202d26c3eb7780442 | 51.3 kB |
| 98c699776759df43e82277039ad319e8 | 8.7 kB |
| dddc0ef07f58757ddfb3b0e1e56a1ae6 | 231.3 kB |
| 2102b0f7023fc91719a16f24b8da8676 | 273.2 kB |