Synthetic DGP Atlas - Benchmark Dataset for Time-Series Causal Discovery

Ruiz Rueda, Marco A.; Arana-Catania, Miguel; Ardila, David; Ventura, Rodrigo

doi:10.48550/arXiv.2604.02488

Published April 8, 2026 | Version 2.0

Dataset Open


1. IST Lisbon, PhD Student in Aerospace Engineering
2. University of Oxford
3. California Institute of Technology
4. Johns Hopkins University
5. The Aerospace Corporation
6. Jet Propulsion Laboratory
7. Instituto Superior Técnico

Synthetic DGP Atlas: Benchmark Dataset for Time-Series Causal Discovery

A collection of 500 synthetic multivariate time-series datasets with controlled assumption violations and known ground-truth causal graphs, designed for benchmarking causal discovery methods and pre-analysis tools.

All datasets share a first-order vector autoregressive (VAR(1)) base process, X_t = A X_{t-1} + ε_t, with family-specific modifications that introduce one or more assumption violations at controlled severity levels. Each dataset is accompanied by a metadata file containing the true causal graph, ground-truth risk labels, and the exact generation parameters needed for reproduction.

Motivation

Time-series causal discovery methods (PCMCI+, Granger causality, transfer entropy, etc.) rely on assumptions such as stationarity, regular sampling, and bounded temporal dependence. Existing benchmarks (e.g., TimeGraph, CausalTime) provide evaluation targets but do not systematically vary assumption violations across controlled severity gradations. This atlas fills that gap: it provides 10 families of 50 datasets each, where each family targets a specific violation type with continuous severity variation, enabling calibration and evaluation of assumption-checking tools and method-selection pipelines.

Dataset structure

synthetic_dgp_atlas_v02/
├── F1_clean_var/                    # Family 1: clean baseline
│   ├── F1_clean_var_000_data.csv    # Time-series data (DatetimeIndex + N columns)
│   ├── F1_clean_var_000_metadata.json  # Ground truth + generation params
│   ├── F1_clean_var_001_data.csv
│   ├── F1_clean_var_001_metadata.json
│   └── ...                          # 50 DGPs per family (000–049)
├── F2_structural_breaks/
├── F3_irregular_sampling/
├── F4_high_persistence/
├── F5_latent_confounders/
├── F6_seasonality/
├── F7_nonlinear/
├── F8_non_gaussian/
├── F9_mixed_violations/
└── F10_extreme_cases/
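
The layout above can be walked programmatically. The following is a minimal sketch, assuming every family follows the same file-naming pattern shown for F1 (only the F1 file names are listed explicitly, so the pattern for F2 to F10 is an assumption). The helper name dgp_paths is hypothetical and not part of the released script.

```python
from pathlib import Path

# Family directory names, copied from the tree above.
FAMILIES = [
    "F1_clean_var", "F2_structural_breaks", "F3_irregular_sampling",
    "F4_high_persistence", "F5_latent_confounders", "F6_seasonality",
    "F7_nonlinear", "F8_non_gaussian", "F9_mixed_violations",
    "F10_extreme_cases",
]


def dgp_paths(root, n_per_family=50):
    """Yield (dgp_id, data_path, metadata_path) for every DGP.

    Assumes the <family>_<index>_data.csv / _metadata.json pattern
    shown for F1 also holds for the other families.
    """
    root = Path(root)
    for family in FAMILIES:
        for i in range(n_per_family):
            dgp_id = f"{family}_{i:03d}"
            yield (
                dgp_id,
                root / family / f"{dgp_id}_data.csv",
                root / family / f"{dgp_id}_metadata.json",
            )


entries = list(dgp_paths("synthetic_dgp_atlas_v02"))
print(len(entries))   # 500
print(entries[0][0])  # F1_clean_var_000
```

The paths are only constructed, not checked for existence, so the sketch runs without the archive being unpacked.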

Each DGP consists of two files:

*_data.csv: a CSV file with a daily datetime index (starting 2020-01-01) and N numeric columns named X0, X1, ..., X{N-1}. Missing values are encoded as empty cells (NaN).
*_metadata.json: a JSON file with the fields described below.
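
As an illustration of reading one DGP, here is a minimal standard-library sketch. It assumes the first CSV column holds the datetime index in ISO format and that empty cells mark missing values; the function name load_dgp and the tiny demo files are hypothetical and exist only to make the example runnable.

```python
import csv
import json
import math
import tempfile
from datetime import datetime
from pathlib import Path


def load_dgp(data_path, metadata_path):
    """Return (timestamps, column_names, rows, metadata) for one DGP.

    Assumes the first CSV column is the datetime index and that an
    empty cell means a missing value (stored here as NaN).
    """
    with open(data_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = header[1:]
        timestamps, rows = [], []
        for record in reader:
            timestamps.append(datetime.fromisoformat(record[0]))
            rows.append(
                [float(v) if v != "" else math.nan for v in record[1:]]
            )
    with open(metadata_path) as f:
        metadata = json.load(f)
    return timestamps, columns, rows, metadata


# Made-up demo files, not taken from the atlas.
tmp = Path(tempfile.mkdtemp())
(tmp / "demo_data.csv").write_text(
    ",X0,X1\n2020-01-01,0.1,-0.2\n2020-01-02,,0.3\n"
)
(tmp / "demo_metadata.json").write_text(
    json.dumps({"dgp_id": "demo", "n_variables": 2, "n_samples": 2})
)

timestamps, columns, rows, metadata = load_dgp(
    tmp / "demo_data.csv", tmp / "demo_metadata.json"
)
n_missing = sum(math.isnan(v) for row in rows for v in row)
print(columns, len(rows), n_missing)  # ['X0', 'X1'] 2 1
```

A dataframe library would do the same in fewer lines; the standard library is used here only to keep the sketch dependency-free.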

Metadata schema

{
  "dgp_id": "F1_clean_var_000",
  "family": "F1_clean_var",
  "description": "Clean VAR, no violations (TRUE NEGATIVES)",
  "n_variables": 7,
  "n_samples": 500,
  "true_graph": { "X0": ["X1", "X3"], "X1": ["X0", "X2"], ... },
  "nonstationarity_risk": 0.05,
  "irregularity_risk": 0.05,
  "persistence_risk": 0.25,
  "confounding_risk": 0.05,
  "generation_params": { "n_vars": 7, "n_samples": 500, "max_eigenvalue": 0.7, "noise_std": 0.1 },
  "random_seed": 121958,
  "generation_timestamp": "2025-12-01T15:50:50.559026"
}

The true_graph field encodes the causal adjacency list: true_graph[Xj] lists the parents of Xj (i.e., Xi → Xj for each Xi in the list). Risk labels are ground-truth values assigned during generation based on the known process parameters, not estimated from the data.
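
To make the parent-list convention concrete, the sketch below converts a true_graph dictionary into a directed edge list and an adjacency matrix. The function names and the three-variable example graph are hypothetical. The sketch treats edges as plain directed links and ignores lags, since the schema shown here carries no lag information.

```python
def parents_to_edges(true_graph):
    """Turn {child: [parents]} into a sorted list of (parent, child)."""
    return sorted(
        (parent, child)
        for child, parents in true_graph.items()
        for parent in parents
    )


def parents_to_matrix(true_graph, variables):
    """Adjacency matrix M with M[i][j] = 1 when variables[i] -> variables[j]."""
    index = {name: k for k, name in enumerate(variables)}
    matrix = [[0] * len(variables) for _ in variables]
    for child, parents in true_graph.items():
        for parent in parents:
            matrix[index[parent]][index[child]] = 1
    return matrix


# Made-up three-variable graph, not taken from the atlas.
example = {"X0": ["X1"], "X1": ["X0", "X2"], "X2": []}

print(parents_to_edges(example))
# [('X0', 'X1'), ('X1', 'X0'), ('X2', 'X1')]
print(parents_to_matrix(example, ["X0", "X1", "X2"]))
# [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
```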

Family catalog

Family	Name	n	N	T	Violation mechanism	Severity
F1	Clean baseline	50	5–8	500–1000	None	ρ(A) ≤ 0.7
F2	Structural breaks	50	5–8	500–1000	1–3 regime changes in VAR coefficients	Continuous
F3	Irregular sampling	50	5–8	500–1000	MCAR / MAR / seasonal gaps	15–35% missing
F4	High persistence	50	5–8	500–1000	Near-unit-root spectral radius	ρ(A) ∈ [0.92, 0.98]
F5	Latent confounders	50	5–8	500–1000	L ∈ {1, 2} hidden common causes	σ_conf ∈ {0.3, 0.6, 0.9}
F6	Seasonality	50	5–8	500–1000	Additive harmonic components	P ∈ {12, 24, 52}
F7	Nonlinear	50	5–8	500–1000	tanh / sin / ReLU transforms	Moderate
F8	Non-Gaussian	50	5–8	500–1000	Student-t or Laplace noise	ν ∈ {3, 5, 10}
F9	Mixed violations	50	5–8	500–1000	2–3 families combined	Multiple high
F10	Boundary cases (directory: F10_extreme_cases)	50	3–12	200–2000	Short series / high-dim / sparse / near-unit-root	Stress test

n = number of datasets, N = number of observed variables, T = number of time steps.

Families F1–F9 use N ∈ {5, 6, 7, 8} and T ∈ {500, 750, 1000}, approximately balanced across datasets. Family F10 uses wider ranges (N from 3 to 12, T from 200 to 2000) to probe boundary conditions.

Intended uses

Calibrating risk models for assumption-violation detection in causal discovery pipelines.
Benchmarking method-selection tools that choose among causal discovery algorithms based on data characteristics.
Evaluating robustness of causal discovery methods (PCMCI+, Granger, transfer entropy, LPCMCI, etc.) to specific violation types.
Training and validating classifiers that predict whether a dataset satisfies the assumptions of a given method.
Ablation studies comparing diagnostic statistics across violation families.
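
For the robustness-evaluation use case, one common way to score an estimated graph against true_graph is directed-edge precision, recall, and F1. The sketch below is only an illustration under that assumption; it is not stated to be the metric used by the authors, and the function name and example graphs are hypothetical.

```python
def edge_scores(true_graph, estimated_graph):
    """Precision, recall and F1 over directed (parent, child) edges.

    Both graphs use the {child: [parents]} convention of true_graph.
    """
    def edges(graph):
        return {(p, c) for c, parents in graph.items() for p in parents}

    true_edges = edges(true_graph)
    est_edges = edges(estimated_graph)
    tp = len(true_edges & est_edges)  # correctly recovered edges
    precision = tp / len(est_edges) if est_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up graphs: two of three true edges recovered, one false edge.
truth = {"X0": ["X1"], "X1": ["X0", "X2"], "X2": []}
estimate = {"X0": ["X1"], "X1": ["X2"], "X2": ["X0"]}

precision, recall, f1 = edge_scores(truth, estimate)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```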

Generation

All datasets were generated with the script synthetic_atlas_extended_v02.py (included in the causal-audit repository) using global random seed 42. Each DGP receives its own seed, derived from the global seed and recorded in the random_seed field of its metadata, so generation is deterministic and reproducible.

python synthetic_atlas_extended_v02.py --output_dir synthetic_dgp_atlas_v02 --seed 42 --n_per_family 50

Technical details

Total datasets: 500 (10 families × 50)
Total files: 1000 (500 CSV + 500 JSON)
Total size: ~140 MB
Generation date: 2025-12-01
Global random seed: 42
Base process: VAR(1), X_t = A X_{t-1} + ε_t
Coefficient matrix: A is generated with controlled spectral radius ρ(A) and stabilized to ensure stationarity (except where violations are intended)
Noise: Gaussian N(0, σ²) unless otherwise specified (F8: Student-t or Laplace)
Index: daily DatetimeIndex starting 2020-01-01
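
The base process can be illustrated with a small dependency-free simulation. This is a sketch under stated assumptions, not the procedure in synthetic_atlas_extended_v02.py: instead of computing eigenvalues, it rescales A so that its maximum absolute row sum equals a chosen bound, which guarantees ρ(A) ≤ bound because the spectral radius never exceeds that norm. The function names, the density parameter, and the zero initial state without burn-in are all hypothetical choices; the values 0.7 and 0.1 mirror the F1 example metadata.

```python
import random


def random_stable_matrix(n, bound, density=0.3, rng=None):
    """Sparse random A whose max absolute row sum equals `bound`.

    Since rho(A) <= max absolute row sum, this ensures rho(A) <= bound.
    A[i][j] != 0 means X_j at time t-1 influences X_i at time t.
    """
    rng = rng or random.Random()
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                A[i][j] = rng.uniform(0.2, 1.0)   # autoregressive term
            elif rng.random() < density:
                A[i][j] = rng.uniform(-1.0, 1.0)  # cross-variable link
    norm = max(sum(abs(a) for a in row) for row in A)
    return [[a * bound / norm for a in row] for row in A]


def simulate_var1(A, n_samples, noise_std, rng=None):
    """Simulate X_t = A X_{t-1} + eps_t with Gaussian noise, X_0 = 0."""
    rng = rng or random.Random()
    n = len(A)
    x = [0.0] * n
    series = []
    for _ in range(n_samples):
        x = [
            sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std)
            for i in range(n)
        ]
        series.append(x)
    return series


rng = random.Random(42)
A = random_stable_matrix(n=5, bound=0.7, rng=rng)
X = simulate_var1(A, n_samples=500, noise_std=0.1, rng=rng)
print(len(X), len(X[0]))  # 500 5
```

The row-sum bound is conservative: the actual spectral radius of the resulting matrix is usually below the bound, so a generator that targets an exact ρ(A) would need an eigenvalue computation instead.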

Files

Name	Size	Checksum
synthetic_atlas_extended_v02.py	33.0 kB	md5:75f3285160cfc759ae2284bd2ba3d8bb
synthetic_dgp_atlas.zip	23.2 MB	md5:15ca94e9bae3bb87e344d717c4ee5606

Total: 23.2 MB
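
The published MD5 checksums can be used to confirm that a download is intact. The following is a minimal sketch using Python's hashlib; the helper name md5_of_file is hypothetical, and the check assumes the two files sit in the current working directory (files that are absent are simply skipped).

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing of this record.
EXPECTED = {
    "synthetic_atlas_extended_v02.py": "75f3285160cfc759ae2284bd2ba3d8bb",
    "synthetic_dgp_atlas.zip": "15ca94e9bae3bb87e344d717c4ee5606",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


for name, expected in EXPECTED.items():
    if Path(name).exists():
        status = "OK" if md5_of_file(name) == expected else "MISMATCH"
        print(name, status)
```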

Additional details

Alternative title: Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery

Is documented by: Preprint: arXiv:2604.02488 (arXiv)

Funding: Fundação para a Ciência e Tecnologia, LARSyS - Laboratory of Robotics and Engineering Systems (LA/P/0083/2020)

Repository URL: https://github.com/marcoruizrueda/causal-audit
Programming language: Python
Development Status: Active

