Published April 8, 2026 | Version 2.0
Dataset | Open

Synthetic DGP Atlas - Benchmark Dataset for Time-Series Causal Discovery

  • 1. IST Lisbon, PhD Student in Aerospace Engineering
  • 2. University of Oxford
  • 3. California Institute of Technology
  • 4. Johns Hopkins University
  • 5. The Aerospace Corporation
  • 6. Jet Propulsion Laboratory
  • 7. Instituto Superior Técnico

Description

A collection of 500 synthetic multivariate time-series datasets with controlled assumption violations and known ground-truth causal graphs, designed for benchmarking causal discovery methods and pre-analysis tools.

All datasets share a first-order vector autoregressive (VAR(1)) base process, X_t = A X_{t-1} + ε_t, with family-specific modifications that introduce one or more assumption violations at controlled severity levels. Each dataset is accompanied by a metadata file containing the true causal graph, ground-truth risk labels, and the exact generation parameters needed for reproduction.

Motivation

Time-series causal discovery methods (PCMCI+, Granger causality, transfer entropy, etc.) rely on assumptions such as stationarity, regular sampling, and bounded temporal dependence. Existing benchmarks (e.g., TimeGraph, CausalTime) provide evaluation targets but do not systematically vary assumption violations across controlled severity gradations. This atlas fills that gap: it provides 10 families of 50 datasets each, where each family targets a specific violation type with continuous severity variation, enabling calibration and evaluation of assumption-checking tools and method-selection pipelines.

Dataset structure

synthetic_dgp_atlas_v02/
├── F1_clean_var/                    # Family 1: clean baseline
│   ├── F1_clean_var_000_data.csv    # Time-series data (DatetimeIndex + N columns)
│   ├── F1_clean_var_000_metadata.json  # Ground truth + generation params
│   ├── F1_clean_var_001_data.csv
│   ├── F1_clean_var_001_metadata.json
│   └── ...                          # 50 DGPs per family (000–049)
├── F2_structural_breaks/
├── F3_irregular_sampling/
├── F4_high_persistence/
├── F5_latent_confounders/
├── F6_seasonality/
├── F7_nonlinear/
├── F8_non_gaussian/
├── F9_mixed_violations/
└── F10_extreme_cases/

Each DGP consists of two files:

  • *_data.csv: a CSV with a datetime index (daily, starting 2020-01-01) and N numeric columns named X0, X1, ..., up to X followed by N-1 (e.g., X6 when N = 7). Missing values are encoded as empty cells (NaN).
  • *_metadata.json: a JSON file with the fields described below.
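A minimal loading sketch for one DGP, assuming pandas is available and that the first CSV column holds the datetime index. The helper names load_dgp and missing_fraction are illustrative and are not part of the dataset or the causal-audit repository.

```python
import json
from pathlib import Path

import pandas as pd


def load_dgp(family_dir, dgp_id):
    """Load one DGP as (DataFrame, metadata dict).

    Assumes the file naming described above: <dgp_id>_data.csv and
    <dgp_id>_metadata.json inside the family directory.
    """
    family_dir = Path(family_dir)
    # Assumption: the first CSV column is the datetime index.
    data = pd.read_csv(
        family_dir / f"{dgp_id}_data.csv", index_col=0, parse_dates=True
    )
    with open(family_dir / f"{dgp_id}_metadata.json", encoding="utf-8") as fh:
        metadata = json.load(fh)
    return data, metadata


def missing_fraction(data):
    """Fraction of empty cells (NaN) in each column."""
    return data.isna().mean()
```

For example, load_dgp("synthetic_dgp_atlas_v02/F1_clean_var", "F1_clean_var_000") would return the data frame and metadata for the first clean-baseline DGP, provided the archive was extracted to that path.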

Metadata schema

{
  "dgp_id": "F1_clean_var_000",
  "family": "F1_clean_var",
  "description": "Clean VAR, no violations (TRUE NEGATIVES)",
  "n_variables": 7,
  "n_samples": 500,
  "true_graph": { "X0": ["X1", "X3"], "X1": ["X0", "X2"], ... },
  "nonstationarity_risk": 0.05,
  "irregularity_risk": 0.05,
  "persistence_risk": 0.25,
  "confounding_risk": 0.05,
  "generation_params": { "n_vars": 7, "n_samples": 500, "max_eigenvalue": 0.7, "noise_std": 0.1 },
  "random_seed": 121958,
  "generation_timestamp": "2025-12-01T15:50:50.559026"
}

The true_graph field encodes the causal graph as an adjacency list keyed by child: true_graph[Xj] lists the parents of Xj (i.e., Xi → Xj for each Xi in the list). In the schema example, "X0": ["X1", "X3"] means X1 → X0 and X3 → X0. Risk labels are ground-truth values assigned during generation based on the known process parameters, not estimated from the data.
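Because the parents-of convention is easy to invert by accident, the following sketch converts a true_graph dictionary into a NumPy adjacency matrix with adj[i, j] = 1 when variable i causes variable j. The function name parents_to_adjacency and the toy graph are illustrative assumptions; the toy graph is not an entry from the atlas.

```python
import numpy as np


def parents_to_adjacency(true_graph, variables):
    """Return adj with adj[i, j] = 1 when variables[i] -> variables[j]."""
    index = {name: k for k, name in enumerate(variables)}
    adj = np.zeros((len(variables), len(variables)), dtype=int)
    for child, parents in true_graph.items():
        for parent in parents:
            # true_graph[child] lists parents, so the edge is parent -> child.
            adj[index[parent], index[child]] = 1
    return adj


# Toy graph in the same format as the schema example (hypothetical values).
toy_graph = {"X0": ["X1"], "X1": [], "X2": ["X0", "X1"]}
adj = parents_to_adjacency(toy_graph, ["X0", "X1", "X2"])
```

Here rows index causes and columns index effects; transpose the matrix if a downstream tool expects the opposite orientation.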

Family catalog

Family  Name                n   N     T         Violation mechanism                                Severity
F1      Clean baseline      50  5–8   500–1000  None                                               ρ(A) ≤ 0.7
F2      Structural breaks   50  5–8   500–1000  1–3 regime changes in VAR coefficients             Continuous
F3      Irregular sampling  50  5–8   500–1000  MCAR / MAR / seasonal gaps                         15–35% missing
F4      High persistence    50  5–8   500–1000  Near-unit-root spectral radius                     ρ(A) ∈ [0.92, 0.98]
F5      Latent confounders  50  5–8   500–1000  L ∈ {1, 2} hidden common causes                    σ_conf ∈ {0.3, 0.6, 0.9}
F6      Seasonality         50  5–8   500–1000  Additive harmonic components                       P ∈ {12, 24, 52}
F7      Nonlinear           50  5–8   500–1000  tanh / sin / ReLU transforms                       Moderate
F8      Non-Gaussian        50  5–8   500–1000  Student-t or Laplace noise                         ν ∈ {3, 5, 10}
F9      Mixed violations    50  5–8   500–1000  2–3 families combined                              Multiple high
F10     Boundary cases      50  3–12  200–2000  Short series / high-dim / sparse / near-unit-root  Stress test

n = number of datasets, N = number of observed variables, T = number of time steps.

Families F1–F9 use N ∈ {5, 6, 7, 8} and T ∈ {500, 750, 1000}, approximately balanced across datasets. Family F10 uses wider ranges (N from 3 to 12, T from 200 to 2000) to probe boundary conditions.

Intended uses

  • Calibrating risk models for assumption-violation detection in causal discovery pipelines.
  • Benchmarking method-selection tools that choose among causal discovery algorithms based on data characteristics.
  • Evaluating robustness of causal discovery methods (PCMCI+, Granger, transfer entropy, LPCMCI, etc.) to specific violation types.
  • Training and validating classifiers that predict whether a dataset satisfies the assumptions of a given method.
  • Ablation studies comparing diagnostic statistics across violation families.
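For the robustness-evaluation use case, one simple option is to score a discovered graph against true_graph with edge-level precision, recall, and F1. The sketch below assumes the predicted graph uses the same parents-of dictionary format as the metadata; the function names edge_set and edge_scores are illustrative, and other metrics (e.g., structural Hamming distance) may be more appropriate depending on the study.

```python
def edge_set(graph):
    """Directed edges (parent, child) from a parents-of adjacency list."""
    return {(p, c) for c, parents in graph.items() for p in parents}


def edge_scores(true_graph, predicted_graph):
    """Edge-level precision, recall, and F1 of a predicted graph."""
    true_edges = edge_set(true_graph)
    pred_edges = edge_set(predicted_graph)
    tp = len(true_edges & pred_edges)  # correctly recovered directed edges
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Edges are compared as ordered pairs, so a correctly detected link with reversed direction counts as both a false positive and a false negative.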

Generation

All datasets were generated with the script synthetic_atlas_extended_v02.py (included in the causal-audit repository) using the global random seed 42. Each DGP receives its own seed, derived from the global seed and recorded in its metadata file (random_seed). Generation is deterministic, so the following command reproduces the atlas:

python synthetic_atlas_extended_v02.py --output_dir synthetic_dgp_atlas_v02 --seed 42 --n_per_family 50

Technical details

  • Total datasets: 500 (10 families × 50)
  • Total files: 1000 (500 CSV + 500 JSON)
  • Total size: ~140 MB
  • Generation date: 2025-12-01
  • Global random seed: 42
  • Base process: VAR(1), X_t = A X_{t-1} + ε_t
  • Coefficient matrix: A is generated with controlled spectral radius ρ(A) and stabilized to ensure stationarity (except where violations are intended)
  • Noise: Gaussian N(0, σ²) unless otherwise specified (F8: Student-t or Laplace)
  • Index: daily DatetimeIndex starting 2020-01-01
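The base process and the spectral-radius control listed above can be illustrated with a minimal sketch. This is not the atlas generator: the sparsity pattern (density), the burn-in length, and the helper names random_stable_matrix and simulate_var1 are assumptions made for illustration. Only the parameter names n_vars, n_samples, max_eigenvalue, and noise_std mirror the generation_params metadata field.

```python
import numpy as np


def random_stable_matrix(n_vars, max_eigenvalue, rng, density=0.3):
    """Sparse random matrix rescaled so its spectral radius is max_eigenvalue."""
    mask = rng.random((n_vars, n_vars)) < density
    np.fill_diagonal(mask, True)  # keep autoregressive terms (assumption)
    a = rng.normal(size=(n_vars, n_vars)) * mask
    radius = np.max(np.abs(np.linalg.eigvals(a)))
    return a * (max_eigenvalue / radius)


def simulate_var1(a, n_samples, noise_std, rng, burn_in=100):
    """Simulate X_t = A X_{t-1} + eps_t with Gaussian noise."""
    n_vars = a.shape[0]
    x = np.zeros((n_samples + burn_in, n_vars))
    for t in range(1, n_samples + burn_in):
        x[t] = a @ x[t - 1] + rng.normal(scale=noise_std, size=n_vars)
    return x[burn_in:]  # drop the transient from the zero initial state


rng = np.random.default_rng(0)
A = random_stable_matrix(n_vars=7, max_eigenvalue=0.7, rng=rng)
X = simulate_var1(A, n_samples=500, noise_std=0.1, rng=rng)
```

A VAR(1) process is stationary when ρ(A) < 1, which is why rescaling A to a target spectral radius gives direct control over persistence (e.g., ρ(A) ≤ 0.7 for F1 versus ρ(A) ∈ [0.92, 0.98] for F4).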

Files

synthetic_dgp_atlas.zip

Files (23.2 MB)

Size     Checksum
33.0 kB  md5:75f3285160cfc759ae2284bd2ba3d8bb
23.2 MB  md5:15ca94e9bae3bb87e344d717c4ee5606
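The md5 values listed above can be used to check a download for corruption. The sketch below uses only Python's standard hashlib module (Python 3.9+ for str.removeprefix); the helper names md5_of_file and matches_md5 are illustrative, and the local file path is whatever name the archive was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.removeprefix("md5:")
    return md5_of_file(path) == expected.lower()
```

For example, matches_md5("synthetic_dgp_atlas.zip", "md5:15ca94e9bae3bb87e344d717c4ee5606") should return True for an intact copy, assuming that checksum belongs to the 23.2 MB archive.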

Additional details

Additional titles

Alternative title
Causal-Audit: A Framework for Risk Assessment of Assumption Violations in Time-Series Causal Discovery

Related works

Is documented by
Preprint: arXiv:2604.02488 (arXiv)

Funding

Fundação para a Ciência e Tecnologia
LARSyS - Laboratory of Robotics and Engineering Systems LA/P/0083/2020

Software

Repository URL
https://github.com/marcoruizrueda/causal-audit
Programming language
Python
Development Status
Active