Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports

Michalski, Adrian

doi:10.5281/zenodo.18799012

Published February 27, 2026 | Version v1

General Information

Title: Fabry Disease: AI-Extracted Clinical Dataset from Published Case Reports
Version: 1.0.0
Date of export: 2026-02-27
Created by: Silene Systems (AI-powered clinical data extraction pipeline)
Disease: Fabry Disease (OMIM #301500)
Gene: GLA (Xq22.1)
Inheritance: X-linked

Description

This dataset contains structured clinical data for 154 patients with Fabry disease, automatically extracted from 93 peer-reviewed case reports and case series indexed in PubMed Central. Data was extracted using a multi-agent AI pipeline with automated validation, clinical plausibility checks, and human review.

Each patient record includes demographics, genetic variants, clinical symptoms with temporal information, laboratory values, and links to source publications. Symptoms are provided in two formats: original AI-extracted terms and HPO (Human Phenotype Ontology)-normalized mappings with confidence scores.

The dataset is intended for rare disease research, phenotype-genotype correlation studies, natural history analysis, and machine learning applications in clinical genetics.

Dataset Summary

Metric	Value
Patients	154
Publications	93
Symptom observations (original)	2,039
Symptom observations (HPO-normalized)	2,017
Unique symptom names	803
Lab value measurements	1,264
Patients with genetic variant	103
Unique genetic variants	60

Demographics

Sex	Count
Male	73
Female	73
Unknown sex	8

Age metric	Available (n)	Min (years)	Max (years)	Median (years)
Age at symptom onset	54	0	74	36
Age at diagnosis	124	4	75	48

Phenotype Distribution

Phenotype	Count
Classic	66
Later-onset	67
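Asymptomatic carrier	1

The three categories account for 134 of the 154 patients; the remaining 20 have no phenotype listed here.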

File Descriptions

patients.csv (154 rows)

One row per patient. Contains demographics, genetics, and clinical summary.

Column	Type	Description
`case_id`	string	Unique patient identifier (e.g., FAB-001)
`sex`	string	Patient sex (male/female; not reported for 8 patients)
`age_at_symptom_onset`	numeric	Age in years at first symptom onset
`age_at_diagnosis`	numeric	Age in years at Fabry disease diagnosis
`diagnostic_delay_years`	numeric	Years between symptom onset and diagnosis (`age_at_diagnosis` minus `age_at_symptom_onset`)
`genetic_variant`	string	GLA gene variant in HGVS notation (e.g., c.644A>G)
`protein_change`	string	Protein-level change (e.g., p.Asn215Ser)
`zygosity`	string	Hemizygous, heterozygous, or homozygous
`phenotype`	string	Clinical phenotype: classic, later-onset, or asymptomatic carrier
`disease_stage`	string	Disease stage if reported
`diagnosis_summary`	string	AI-generated clinical summary
`extraction_confidence`	numeric	AI extraction confidence score (0-1)
`data_completeness`	numeric	Proportion of fields populated (0-1)
`publication_pmcid`	string	PubMed Central ID of source publication
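
As an illustration of how the age columns relate, the following minimal sketch recomputes the diagnostic delay from the two age fields. It assumes that missing values appear as blank cells and that the delay is diagnosis age minus onset age; neither is confirmed beyond the column descriptions above. The helper names and the `DEMO-` rows are hypothetical and are not taken from the dataset.

```python
import csv
import io


def to_float(text):
    """Return float(text), or None for blank or non-numeric cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def diagnostic_delay(row):
    """Recompute the delay as age_at_diagnosis minus age_at_symptom_onset."""
    onset = to_float(row.get("age_at_symptom_onset"))
    diagnosis = to_float(row.get("age_at_diagnosis"))
    if onset is None or diagnosis is None:
        return None  # delay is undefined when either age is missing
    return diagnosis - onset


# Made-up rows for illustration only; replace with open("patients.csv", newline="").
SAMPLE = io.StringIO(
    "case_id,age_at_symptom_onset,age_at_diagnosis,diagnostic_delay_years\n"
    "DEMO-001,10,25,15\n"
    "DEMO-002,,48,\n"
)

delays = {row["case_id"]: diagnostic_delay(row) for row in csv.DictReader(SAMPLE)}
print(delays)
```

Comparing the recomputed value with `diagnostic_delay_years` is one way to spot rows where the stored delay and the ages disagree.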

symptoms.csv (2,039 rows)

One row per symptom observation, sourced from the original AI extraction. Preserves the exact clinical terminology used in source publications.

Column	Type	Description
`case_id`	string	Patient identifier
`symptom_name`	string	Original symptom name as extracted from the publication
`present`	boolean	Whether the symptom is present (True) or explicitly absent (False)
`details`	string	Additional clinical details or context
`timepoint_raw`	string	Raw temporal reference from the publication
`timepoint_category`	string	Normalized category: onset, diagnosis, follow_up, etc.
`age_at_event`	numeric	Patient age at the time of this observation
`offset_months`	numeric	Months relative to a reference event
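
Because `present` distinguishes reported findings from explicitly negated ones, analyses usually need to separate the two. The sketch below groups symptom names per patient, assuming the flag is stored as the text `True` or `False` (as the column description suggests). The function names and sample rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from collections import defaultdict


def parse_present(text):
    """Map the textual flag to a bool; anything unrecognised becomes None."""
    value = (text or "").strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    return None


def split_findings(rows):
    """Group symptom names per case into present and explicitly absent lists."""
    findings = defaultdict(lambda: {"present": [], "absent": []})
    for row in rows:
        flag = parse_present(row.get("present"))
        if flag is None:
            continue  # skip rows whose flag cannot be interpreted
        key = "present" if flag else "absent"
        findings[row["case_id"]][key].append(row["symptom_name"])
    return dict(findings)


# Made-up rows for illustration only; replace with open("symptoms.csv", newline="").
SAMPLE = io.StringIO(
    "case_id,symptom_name,present\n"
    "DEMO-001,acroparesthesia,True\n"
    "DEMO-001,proteinuria,False\n"
)

findings = split_findings(csv.DictReader(SAMPLE))
print(findings)
```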

symptoms_hpo.csv (2,017 rows)

One row per symptom mapped to the Human Phenotype Ontology (HPO). Enables standardized phenotype analysis and cross-dataset comparisons. HPO mappings have been validated and corrected using LLM-based quality review.

Column	Type	Description
`case_id`	string	Patient identifier
`original_name`	string	Original symptom name from the publication
`canonical_name`	string	HPO-normalized term name
`hpo_id`	string	HPO identifier (e.g., HP:0001639)
`present`	boolean	Whether the symptom is present (True) or absent (False)
`details`	string	Additional clinical details
`timepoint`	string	Temporal reference
`context`	string	Clinical context
`confidence`	numeric	Mapping confidence score (0-1)
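
One possible use of the `confidence` and `hpo_id` columns is to count how many distinct patients carry each HPO term while setting aside low-confidence mappings and rows without an ID. The sketch below is only an example: the 0.8 threshold is an arbitrary assumption, blank cells are assumed to mark missing IDs, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def patients_per_hpo_term(rows, min_confidence=0.8):
    """Count distinct patients per HPO ID among present, confidently mapped rows.

    Returns (counts, rows_without_id).
    """
    patients = defaultdict(set)
    rows_without_id = 0
    for row in rows:
        hpo_id = (row.get("hpo_id") or "").strip()
        if not hpo_id:
            rows_without_id += 1  # canonical name may exist, but no ID to group by
            continue
        if (row.get("present") or "").strip().lower() != "true":
            continue  # ignore explicitly absent findings
        try:
            confidence = float(row.get("confidence") or "")
        except ValueError:
            continue  # no usable confidence score
        if confidence >= min_confidence:
            patients[hpo_id].add(row["case_id"])
    counts = {hpo_id: len(ids) for hpo_id, ids in patients.items()}
    return counts, rows_without_id


# Made-up rows for illustration only; replace with open("symptoms_hpo.csv", newline="").
SAMPLE = io.StringIO(
    "case_id,hpo_id,present,confidence\n"
    "DEMO-001,HP:0001639,True,0.95\n"
    "DEMO-002,HP:0001639,True,0.90\n"
    "DEMO-002,HP:0001639,True,0.40\n"
    "DEMO-003,,True,0.70\n"
)

counts, rows_without_id = patients_per_hpo_term(csv.DictReader(SAMPLE))
print(counts, rows_without_id)
```

Counting patients rather than rows avoids inflating a term that is recorded at several timepoints for the same person.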

lab_values.csv (1,264 rows)

One row per laboratory measurement or diagnostic test result.

Column	Type	Description
`case_id`	string	Patient identifier
`lab_name`	string	Name of the laboratory test or biomarker
`value`	string	Measured value (numeric or descriptive)
`unit`	string	Unit of measurement
`context`	string	Clinical context (e.g., baseline, post-treatment)
`timepoint_raw`	string	Raw temporal reference
`timepoint_category`	string	Normalized temporal category
`age_at_event`	numeric	Patient age at measurement
`offset_months`	numeric	Months relative to a reference event
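
Since `value` is a string that may hold either a number or a description, numeric analysis needs a tolerant parser. The following sketch shows one way to do this; the accepted formats (plain numbers with an optional `<`, `>`, `<=` or `>=` qualifier) are an assumption about what the column contains, and `parse_lab_value` is a hypothetical helper.

```python
import re

# Optional comparison qualifier followed by a plain decimal number.
NUMBER = re.compile(r"^\s*([<>]=?)?\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_lab_value(text):
    """Return (qualifier, number) for numeric values, else (None, None)."""
    match = NUMBER.match(text or "")
    if not match:
        return None, None  # descriptive result such as "elevated"
    return match.group(1) or "", float(match.group(2))


for raw in ["3.2", "<0.5", "elevated"]:
    print(raw, parse_lab_value(raw))
```

Values that do not parse are best kept as text alongside `unit` and `context` rather than discarded, since descriptive results can still be clinically informative.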

publications.csv (93 rows; 95 with header)

One row per source publication.

Column	Type	Description
`pmcid`	string	PubMed Central ID
`pmid`	string	PubMed ID
`doi`	string	Digital Object Identifier
`title`	string	Publication title
`authors`	string	Author list (semicolon-separated)
`journal`	string	Journal name
`publication_date`	date	Publication date (YYYY-MM-DD)
`license`	string	Publication license
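
Patient records can be traced back to their source article by matching `publication_pmcid` in patients.csv against `pmcid` in publications.csv. The sketch below shows this lookup with the standard library and splits the semicolon-separated author list. The function names, placeholder PMCIDs, and sample rows are hypothetical.

```python
import csv
import io


def index_publications(rows):
    """Build a pmcid -> publication row lookup."""
    return {row["pmcid"]: row for row in rows}


def with_source(patients, publications):
    """Yield one summary dict per patient with its source title and author list."""
    for patient in patients:
        source = publications.get(patient.get("publication_pmcid"))
        authors = source["authors"].split(";") if source else []
        yield {
            "case_id": patient["case_id"],
            "title": source["title"] if source else None,
            "authors": [name.strip() for name in authors if name.strip()],
        }


# Made-up rows for illustration only; replace with the real CSV files.
PUBLICATIONS = io.StringIO(
    "pmcid,title,authors\n"
    "PMC0000000,Placeholder title,Author A; Author B\n"
)
PATIENTS = io.StringIO(
    "case_id,publication_pmcid\n"
    "DEMO-001,PMC0000000\n"
    "DEMO-002,PMC9999999\n"
)

lookup = index_publications(csv.DictReader(PUBLICATIONS))
linked = list(with_source(csv.DictReader(PATIENTS), lookup))
print(linked)
```

The same lookup gives access to the `license` column, which matters when redistributing text derived from an individual source publication.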

dataset_summary.json

Machine-readable summary with aggregate statistics including patient counts, demographic distributions, genetic variant counts, and phenotype breakdown.

Methodology

Data Extraction Pipeline

Publication Discovery: Automated PubMed search for Fabry disease case reports published from 2015 onwards
Relevance Screening: AI-based abstract screening to filter relevant case reports
Case Extraction: Structured patient data extraction from full-text articles using Claude Opus 4.6 (Anthropic)
Clinical Validation: Automated checks for temporal logic, lab value plausibility, and genetic consistency against disease-specific profiles
HPO Normalization: Mapping of extracted symptom terms to the Human Phenotype Ontology, with LLM-based validation and correction of mappings
Timepoint Normalization: Structuring of free-text temporal references into standardized categories with chronological ordering
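
To give a sense of what temporal-logic and genetic-consistency checks can look like, here is a minimal sketch. It is not the pipeline's actual implementation, which is not described in detail here; the rules, the 0-120 year range, and the `check_patient` helper are all assumptions chosen for illustration.

```python
def check_patient(record):
    """Return a list of human-readable warnings for one patient record.

    Expects numeric ages (or None) and lowercase-insensitive sex/zygosity strings.
    """
    warnings = []
    onset = record.get("age_at_symptom_onset")
    diagnosis = record.get("age_at_diagnosis")

    # Age plausibility: the 0-120 year range is an assumed bound.
    for name, age in (("age_at_symptom_onset", onset), ("age_at_diagnosis", diagnosis)):
        if age is not None and not 0 <= age <= 120:
            warnings.append(f"{name} outside 0-120 years: {age}")

    # Temporal logic: flagged for review, not rejected, because family
    # screening can lead to a diagnosis before any symptom appears.
    if onset is not None and diagnosis is not None and onset > diagnosis:
        warnings.append("symptom onset is later than diagnosis")

    # Genetic consistency for an X-linked gene such as GLA.
    sex = (record.get("sex") or "").lower()
    zygosity = (record.get("zygosity") or "").lower()
    if sex == "male" and zygosity in {"heterozygous", "homozygous"}:
        warnings.append(f"male patient labelled {zygosity} for an X-linked gene")
    if sex == "female" and zygosity == "hemizygous":
        warnings.append("female patient labelled hemizygous for an X-linked gene")
    return warnings


print(check_patient({
    "age_at_symptom_onset": 30,
    "age_at_diagnosis": 25,
    "sex": "male",
    "zygosity": "heterozygous",
}))
```

Returning warnings instead of raising errors keeps unusual but genuine cases in the data while still routing them to human review.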

Quality Assurance

Extraction confidence and data completeness scores per patient
Automated clinical plausibility validation
HPO mapping quality validated by LLM review
Negated findings (symptoms explicitly absent) are captured with present=False
Manual audit of random samples against source publications confirmed >90% symptom completeness and near-complete laboratory value extraction

Limitations

Data quality depends on the detail and clarity of source publications
Some clinical findings may be missed if not explicitly stated in the text
Approximately 246 rows in symptoms_hpo.csv lack an HPO ID: a canonical name is provided, but no matching ID was found in the reference ontology
Temporal information is not always available or precisely reported in source publications
Dataset is limited to publications indexed in PubMed Central with available full text

License

The structured data in this dataset is derived from published scientific literature. Individual source publications retain their original licenses (see publications.csv). This compiled dataset is made available under CC BY 4.0.

Files

Files (731.7 kB)

Name	MD5	Size
dataset_summary.json	78095cf31a7fb08b57f5723f6e52dc3f	785 Bytes
lab_values.csv	ce7e7f49950b954a8dd392a331c4ca24	97.3 kB
patients.csv	b65e35b7d24367d28ae2f9434594a7c1	69.1 kB
publications.csv	114ce5e5e4dc191202d26c3eb7780442	51.3 kB
README.md	98c699776759df43e82277039ad319e8	8.7 kB
symptoms.csv	dddc0ef07f58757ddfb3b0e1e56a1ae6	231.3 kB
symptoms_hpo.csv	2102b0f7023fc91719a16f24b8da8676	273.2 kB
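
The MD5 checksums listed above can be used to confirm that downloaded files are intact. The sketch below does this with Python's `hashlib`; the function names and the assumption that all files sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as listed for version v1 of this record.
EXPECTED_MD5 = {
    "dataset_summary.json": "78095cf31a7fb08b57f5723f6e52dc3f",
    "lab_values.csv": "ce7e7f49950b954a8dd392a331c4ca24",
    "patients.csv": "b65e35b7d24367d28ae2f9434594a7c1",
    "publications.csv": "114ce5e5e4dc191202d26c3eb7780442",
    "README.md": "98c699776759df43e82277039ad319e8",
    "symptoms.csv": "dddc0ef07f58757ddfb3b0e1e56a1ae6",
    "symptoms_hpo.csv": "2102b0f7023fc91719a16f24b8da8676",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory="."):
    """Return {name: True/False/None}; None means the file was not found."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.exists() else None
    return results

# Usage, assuming the files were downloaded into the current directory:
# print(verify_files("."))
```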
