Published May 5, 2026 | Version 1.0.0
Preprint Open

Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning

Authors/Creators

Description

Traceprop is an open-source Python library providing the first unified system for end-to-end
data provenance in machine learning pipelines, connecting raw source files through preprocessing
and model training to individual predictions. Existing data attribution methods [Koh and
Liang, 2017; Park et al., 2023; Engstrom et al., 2024] identify which training samples influenced
a prediction but operate in isolation from the data pipeline. Existing computation lineage
tools (MLflow, DVC, TensorFlow MLMD) track artifact-level provenance but do not descend
into the computation graph or connect to gradient-level attribution. Traceprop fills this gap
by introducing a computation-level lineage layer that integrates natively with gradient-based
attribution. A single Traceprop query answers: “This model made prediction X: which rows in
which source files, through which preprocessing steps, most influenced that prediction, and can
we reduce that influence without retraining?” We demonstrate: (1) sub-1% lineage overhead in
production op-mode at 10^6+ array elements (1.007× on macOS, 0.979× on Linux); (2) Traceprop-LL
achieving a linear datamodeling score (LDS) of 0.622 ± 0.180 on tabular data (UCI Adult Income,
logistic regression) in 0.22 s on CPU, and an LDS of 0.0168 on CIFAR-2/ResNet-9 vs. TRAK’s 0.0290
at 266× lower wall-clock cost (2.6 s CPU vs. 691 s GPU); (3) provenance-guided approximate
unlearning exceeding the retrain-from-scratch gold standard (forget-set loss 0.425 vs. gold 0.401,
compared with 14% of the gap closed by random unlearning) with a test accuracy drop of only 0.5
percentage points (0.915 vs. 0.920). Traceprop directly addresses EU AI Act Article 26 audit trail
obligations for high-risk AI systems, whose backstop enforcement date is 2 December 2027. The
library is available at https://pypi.org/project/traceprop/
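
For readers unfamiliar with the LDS figures quoted in the description, the following is a minimal sketch of how a linear datamodeling score can be computed for a single test example, following the definition in Park et al. (2023): the Spearman rank correlation between attribution-predicted outputs and the outputs of models actually retrained on random training subsets. The function names (`spearman`, `lds`) and array layouts are illustrative assumptions and are not Traceprop's API; ties are broken by position, which is adequate for continuous scores.

```python
import numpy as np


def _ranks(x):
    """0-based ranks of x; ties are broken by position (assumes near-continuous values)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra = _ranks(np.asarray(a, dtype=float))
    rb = _ranks(np.asarray(b, dtype=float))
    return float(np.corrcoef(ra, rb)[0, 1])


def lds(attributions, subset_masks, subset_outputs):
    """Linear datamodeling score for one test example (hypothetical helper).

    attributions   : (n_train,) attribution score of each training sample
    subset_masks   : (m, n_train) boolean; row j marks the samples in subset j
    subset_outputs : (m,) model output on the test example after retraining
                     on subset j
    """
    scores = np.asarray(attributions, dtype=float)
    masks = np.asarray(subset_masks, dtype=float)
    # Additive prediction: sum the attribution scores of the samples in each subset.
    predicted = masks @ scores
    return spearman(predicted, subset_outputs)
```

In practice the score is averaged over many test examples, and the retrained-model outputs in `subset_outputs` dominate the cost of the evaluation.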

Files

Traceprop_arxiv.pdf

Files (364.1 kB)

md5:7817cc77840ad1e27c5d21df6f4e9f1e
364.1 kB

Additional details

Related works

Is supplemented by
Software: 10.5281/zenodo.20035922 (DOI)

Dates

Created
2026-05-05
Uploaded the preprint

Software

Repository URL
https://github.com/AmitoVrito/Traceprop
Programming language
Python
Development Status
Active