Published February 27, 2026 | Version v1
Dataset Open

Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports

Description

General Information

  • Title: Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports
  • Version: 1.0.0
  • Date of export: 2026-02-27
  • Created by: Silene Systems (AI-powered clinical data extraction pipeline)
  • Disease: Fabry Disease (OMIM #301500)
  • Gene: GLA (Xq22.1)
  • Inheritance: X-linked

Description

This dataset contains structured clinical data for 154 patients with Fabry disease, automatically extracted from 93 peer-reviewed case reports and case series indexed in PubMed Central. Data was extracted using a multi-agent AI pipeline with automated validation, clinical plausibility checks, and human review.

Each patient record includes demographics, genetic variants, clinical symptoms with temporal information, laboratory values, and links to source publications. Symptoms are provided in two formats: original AI-extracted terms and HPO (Human Phenotype Ontology)-normalized mappings with confidence scores.

The dataset is intended for rare disease research, phenotype-genotype correlation studies, natural history analysis, and machine learning applications in clinical genetics.

Dataset Summary

Metric | Value
Patients | 154
Publications | 93
Symptom observations (original) | 2,039
Symptom observations (HPO-normalized) | 2,017
Unique symptom names | 803
Lab value measurements | 1,264
Patients with genetic variant | 103
Unique genetic variants | 60

Demographics

Sex | Count
Male | 73
Female | 73
Unknown sex | 8

Age metric (years) | Available (n) | Min | Max | Median
Age at symptom onset | 54 | 0 | 74 | 36
Age at diagnosis | 124 | 4 | 75 | 48

Phenotype Distribution

Phenotype | Count
Classic | 66
Later-onset | 67
Asymptomatic carrier | 1

File Descriptions

patients.csv (154 rows)

One row per patient. Contains demographics, genetics, and clinical summary.

Column | Type | Description
case_id | string | Unique patient identifier (e.g., FAB-001)
sex | string | Patient sex (male/female)
age_at_symptom_onset | numeric | Age in years at first symptom onset
age_at_diagnosis | numeric | Age in years at Fabry disease diagnosis
diagnostic_delay_years | numeric | Difference between diagnosis and onset age
genetic_variant | string | GLA gene variant in HGVS notation (e.g., c.644A>G)
protein_change | string | Protein-level change (e.g., p.Asn215Ser)
zygosity | string | Hemizygous, heterozygous, or homozygous
phenotype | string | Clinical phenotype: classic, later-onset, or asymptomatic carrier
disease_stage | string | Disease stage if reported
diagnosis_summary | string | AI-generated clinical summary
extraction_confidence | numeric | AI extraction confidence score (0-1)
data_completeness | numeric | Proportion of fields populated (0-1)
publication_pmcid | string | PubMed Central ID of source publication
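
As a rough illustration of how the age columns relate, the sketch below recomputes the diagnostic delay from the two age fields. It assumes patients.csv is a comma-separated file with a header row, that missing ages are empty cells, and that the delay is age at diagnosis minus age at symptom onset; none of these details are confirmed by this description. The helper names `parse_age` and `diagnostic_delays` and the sample rows are hypothetical.

```python
import csv
import io


def parse_age(text):
    """Return the age as a float, or None when the cell is empty or non-numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def diagnostic_delays(csv_file):
    """Map case_id -> (age_at_diagnosis - age_at_symptom_onset) where both ages are known."""
    delays = {}
    for row in csv.DictReader(csv_file):
        onset = parse_age(row.get("age_at_symptom_onset"))
        diagnosis = parse_age(row.get("age_at_diagnosis"))
        if onset is not None and diagnosis is not None:
            delays[row["case_id"]] = diagnosis - onset
    return delays


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "case_id,age_at_symptom_onset,age_at_diagnosis\n"
    "DEMO-1,10,25\n"
    "DEMO-2,,40\n"
)
print(diagnostic_delays(sample))
```

With the real file, the same function could be given an open file handle, for example one obtained from `open("patients.csv", newline="", encoding="utf-8")`; the encoding is an assumption.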

symptoms.csv (2,039 rows)

One row per symptom observation, sourced from the original AI extraction. Preserves the exact clinical terminology used in source publications.

Column | Type | Description
case_id | string | Patient identifier
symptom_name | string | Original symptom name as extracted from the publication
present | boolean | Whether the symptom is present (True) or explicitly absent (False)
details | string | Additional clinical details or context
timepoint_raw | string | Raw temporal reference from the publication
timepoint_category | string | Normalized category: onset, diagnosis, follow_up, etc.
age_at_event | numeric | Patient age at the time of this observation
offset_months | numeric | Months relative to a reference event

symptoms_hpo.csv (2,017 rows)

One row per symptom mapped to the Human Phenotype Ontology (HPO). Enables standardized phenotype analysis and cross-dataset comparisons. HPO mappings have been validated and corrected using LLM-based quality review.

Column | Type | Description
case_id | string | Patient identifier
original_name | string | Original symptom name from the publication
canonical_name | string | HPO-normalized term name
hpo_id | string | HPO identifier (e.g., HP:0001639)
present | boolean | Whether the symptom is present (True) or absent (False)
details | string | Additional clinical details
timepoint | string | Temporal reference
context | string | Clinical context
confidence | numeric | Mapping confidence score (0-1)
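
For cross-dataset comparisons, one common step is to reduce each patient to a set of HPO identifiers. The sketch below does this under several assumptions that are not stated in this description: rows without an HPO ID have an empty `hpo_id` cell, `present` is serialized as `True`/`False`, and a confidence cut-off of 0.8 is used purely as an example. The function name and the sample rows (including the placeholder IDs `HP:9999998` and `HP:9999999`) are hypothetical.

```python
import csv
import io
from collections import defaultdict


def hpo_profiles(csv_file, min_confidence=0.8):
    """Map case_id -> set of HPO IDs for present symptoms at or above the cut-off."""
    profiles = defaultdict(set)
    for row in csv.DictReader(csv_file):
        hpo_id = (row.get("hpo_id") or "").strip()
        if not hpo_id.startswith("HP:"):
            continue  # no usable HPO identifier on this row
        if (row.get("present") or "").strip().lower() != "true":
            continue  # keep only findings reported as present
        try:
            confidence = float(row.get("confidence") or "")
        except ValueError:
            continue  # confidence missing or not numeric
        if confidence >= min_confidence:
            profiles[row["case_id"]].add(hpo_id)
    return dict(profiles)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "case_id,hpo_id,present,confidence\n"
    "DEMO-1,HP:0001639,True,0.95\n"
    "DEMO-1,HP:9999998,True,0.40\n"
    "DEMO-1,,True,0.90\n"
)
print(hpo_profiles(sample))
```

The cut-off is a free parameter; a stricter or looser value may suit a given analysis better.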

lab_values.csv (1,264 rows)

One row per laboratory measurement or diagnostic test result.

Column | Type | Description
case_id | string | Patient identifier
lab_name | string | Name of the laboratory test or biomarker
value | string | Measured value (numeric or descriptive)
unit | string | Unit of measurement
context | string | Clinical context (e.g., baseline, post-treatment)
timepoint_raw | string | Raw temporal reference
timepoint_category | string | Normalized temporal category
age_at_event | numeric | Patient age at measurement
offset_months | numeric | Months relative to a reference event
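
Since `value` is a string that may hold either a number or a description, numeric analyses need a parsing step. The sketch below is one possible approach and not the dataset's own logic: it accepts plain decimal numbers, optionally preceded by a comparator such as `<` or `>=` (an assumption about how censored results might be written), and treats everything else as descriptive. The name `parse_lab_value` is hypothetical.

```python
import re

# Optional comparator, then a decimal number, with nothing else in the cell.
_NUMBER = re.compile(r"^\s*([<>]=?)?\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_lab_value(text):
    """Return (comparator, number) for numeric cells, or None for descriptive ones."""
    match = _NUMBER.match(text or "")
    if match is None:
        return None
    comparator = match.group(1) or "="
    return (comparator, float(match.group(2)))


# Made-up values for illustration only; they are not taken from the dataset.
for cell in ["3.2", "<0.5", "elevated"]:
    print(repr(cell), "->", parse_lab_value(cell))
```

Values that return `None` can be kept as text and reviewed by hand, and the `unit` column should be checked before pooling measurements of the same `lab_name`.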

publications.csv (93 rows)

One row per source publication.

Column | Type | Description
pmcid | string | PubMed Central ID
pmid | string | PubMed ID
doi | string | Digital Object Identifier
title | string | Publication title
authors | string | Author list (semicolon-separated)
journal | string | Journal name
publication_date | date | Publication date (YYYY-MM-DD)
license | string | Publication license

dataset_summary.json

Machine-readable summary with aggregate statistics including patient counts, demographic distributions, genetic variant counts, and phenotype breakdown.
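
The key names inside dataset_summary.json are not listed in this description, so the sketch below does not read that file. Instead it shows one way to check a download against the row counts stated in the file headings above. It assumes the CSV files are UTF-8 encoded with a single header row; the function names are hypothetical.

```python
import csv
import io
import os

# Row counts as stated in the file descriptions of this record.
EXPECTED_ROWS = {
    "patients.csv": 154,
    "symptoms.csv": 2039,
    "symptoms_hpo.csv": 2017,
    "lab_values.csv": 1264,
    "publications.csv": 93,
}


def count_data_rows(csv_file):
    """Count records after the header; quoted multi-line fields count as one record."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def check_row_counts(directory):
    """Return {filename: (expected, actual)} for every file whose count differs."""
    mismatches = {}
    for name, expected in EXPECTED_ROWS.items():
        path = os.path.join(directory, name)
        with open(path, newline="", encoding="utf-8") as handle:
            actual = count_data_rows(handle)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Made-up content for illustration only: two records, one with a line break in a field.
sample = io.StringIO('pmcid,title\nPMC0,"line one\nline two"\nPMC1,short\n')
print(count_data_rows(sample))
```

Counting parsed records instead of physical lines matters when text fields such as summaries contain line breaks.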

Methodology

Data Extraction Pipeline

  1. Publication Discovery: Automated PubMed search for Fabry disease case reports (publications from 2015 onwards)
  2. Relevance Screening: AI-based abstract screening to filter relevant case reports
  3. Case Extraction: Structured patient data extraction from full-text articles using Claude Opus 4.6 (Anthropic)
  4. Clinical Validation: Automated checks for temporal logic, lab value plausibility, and genetic consistency against disease-specific profiles
  5. HPO Normalization: Mapping of extracted symptom terms to Human Phenotype Ontology, with LLM-based validation and correction of mappings
  6. Timepoint Normalization: Structuring of free-text temporal references into standardized categories with chronological ordering
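
The pipeline's validation rules are not published in this description. To give a sense of what a temporal-logic check in step 4 might look like, here is a small sketch under stated assumptions: ages must fall between 0 and a hypothetical upper bound of 120 years, and a reported diagnostic delay should equal age at diagnosis minus age at onset. Diagnosis before symptom onset is deliberately not flagged, since family screening can identify patients before symptoms appear. All names are hypothetical.

```python
MAX_PLAUSIBLE_AGE = 120  # assumed upper bound in years, chosen for illustration


def temporal_issues(record):
    """Return a list of human-readable problems found in one patient record.

    `record` maps field names to numbers, with None for missing values.
    """
    issues = []
    onset = record.get("age_at_symptom_onset")
    diagnosis = record.get("age_at_diagnosis")
    delay = record.get("diagnostic_delay_years")

    for name, age in (("age_at_symptom_onset", onset), ("age_at_diagnosis", diagnosis)):
        if age is not None and not 0 <= age <= MAX_PLAUSIBLE_AGE:
            issues.append(f"{name} out of range: {age}")

    if None not in (onset, diagnosis, delay):
        if abs((diagnosis - onset) - delay) > 1e-9:
            issues.append("diagnostic_delay_years does not match the two ages")
    return issues


# Made-up record for illustration only; it is not taken from the dataset.
print(temporal_issues({
    "age_at_symptom_onset": 10,
    "age_at_diagnosis": 25,
    "diagnostic_delay_years": 20,
}))
```

Returning a list of messages, instead of raising on the first problem, lets a reviewer see every inconsistency in a record at once.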

Quality Assurance

  • Extraction confidence and data completeness scores per patient
  • Automated clinical plausibility validation
  • HPO mapping quality validated by LLM review
  • Negated findings (symptoms explicitly absent) are captured with present=False
  • Manual audit of random samples against source publications confirmed >90% symptom completeness and near-complete laboratory value extraction
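
The `data_completeness` score is described as the proportion of fields populated, but the set of fields it covers is not stated here. The sketch below shows the general idea for a caller-chosen field list; it is an illustration and may not reproduce the published scores. The function names and the sample record are hypothetical.

```python
def is_populated(value):
    """Treat None and blank or whitespace-only text as missing."""
    return value is not None and str(value).strip() != ""


def completeness(record, fields):
    """Fraction of the named fields that hold a non-empty value in `record`."""
    if not fields:
        raise ValueError("fields must not be empty")
    filled = sum(1 for name in fields if is_populated(record.get(name)))
    return filled / len(fields)


# Made-up record for illustration only; it is not taken from the dataset.
record = {"sex": "female", "age_at_diagnosis": "48", "genetic_variant": "", "zygosity": " "}
print(completeness(record, ["sex", "age_at_diagnosis", "genetic_variant", "zygosity"]))
```

A score computed this way depends entirely on which fields are passed in, so comparisons are only meaningful for a fixed field list.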

Limitations

  • Data quality depends on the detail and clarity of source publications
  • Some clinical findings may be missed if not explicitly stated in the text
  • About 246 HPO-normalized symptoms have no HPO ID: a canonical name is provided, but the term was not found in the reference ontology
  • Temporal information is not always available or precisely reported in source publications
  • Dataset is limited to publications indexed in PubMed Central with available full text

License

The structured data in this dataset is derived from published scientific literature. Individual source publications retain their original licenses (see publications.csv). This compiled dataset is made available under CC BY 4.0.

Files

Files (731.7 kB)

MD5 checksum | Size
md5:78095cf31a7fb08b57f5723f6e52dc3f | 785 Bytes
md5:ce7e7f49950b954a8dd392a331c4ca24 | 97.3 kB
md5:b65e35b7d24367d28ae2f9434594a7c1 | 69.1 kB
md5:114ce5e5e4dc191202d26c3eb7780442 | 51.3 kB
md5:98c699776759df43e82277039ad319e8 | 8.7 kB
md5:dddc0ef07f58757ddfb3b0e1e56a1ae6 | 231.3 kB
md5:2102b0f7023fc91719a16f24b8da8676 | 273.2 kB
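
The MD5 values listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses Python's standard `hashlib`; the listing does not show which checksum belongs to which file name, so the caller has to supply that pairing. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:78095cf3...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example call, with the file/checksum pairing left to the reader:
# matches_listing("path/to/downloaded_file", "md5:<checksum from the listing>")
```

MD5 is adequate for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.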