Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning
Authors/Creators
Description
Traceprop is an open-source Python library providing the first unified system for end-to-end
data provenance in machine learning pipelines, connecting raw source files through preprocessing
and model training to individual predictions. Existing data attribution methods [Koh and
Liang, 2017; Park et al., 2023; Engstrom et al., 2024] identify which training samples influenced
a prediction but operate in isolation from the data pipeline. Existing computation lineage
tools (MLflow, DVC, TensorFlow MLMD) track artifact-level provenance but do not descend
into the computation graph or connect to gradient-level attribution. Traceprop fills this gap
by introducing a computation-level lineage layer that integrates natively with gradient-based
attribution. A single Traceprop query answers: “This model made prediction X: which rows in
which source files, through which preprocessing steps, most influenced that prediction, and can
we reduce that influence without retraining?” We demonstrate: (1) sub-1% lineage overhead in
production op-mode at 10^6+ array elements (1.007× on macOS, 0.979× on Linux); (2) Traceprop-LL
achieving LDS 0.622 ± 0.180 on tabular data (UCI Adult Income, logistic regression) in 0.22 s
on CPU, and LDS 0.0168 on CIFAR-2/ResNet-9 vs. TRAK’s 0.0290
at 266× lower wall-clock cost (2.6 s CPU vs. 691 s GPU); (3) provenance-guided approximate
unlearning exceeding the retrain-from-scratch gold standard (forget-set loss 0.425 vs. gold 0.401,
vs. 14% gap closed for random unlearning) with a test accuracy drop of only 0.5 percentage
points (0.915 vs. 0.920). Traceprop directly addresses EU AI Act Article 26 audit trail obligations
for high-risk AI systems, whose backstop enforcement date is 2 December 2027. The library is
available at https://pypi.org/project/traceprop/
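For readers unfamiliar with the LDS figures quoted above, the following is a minimal sketch of how a linear datamodeling score can be computed for one test example, following the definition used in the cited attribution literature (Park et al., 2023): the Spearman rank correlation between attribution-summed predictions and the outputs of models retrained on random training subsets. It is not Traceprop's API. The names `lds`, `average_ranks`, `pearson`, and the toy data are hypothetical, and the "retrained model outputs" are simulated by an exactly additive stand-in rather than by real retraining.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lds(attributions, subsets, actual_outputs):
    """Spearman correlation between attribution-based predictions and actual outputs.

    attributions:   one score per training sample, for a single test example
    subsets:        list of training-index lists (the random subsets)
    actual_outputs: model output on the test example after training on each subset
    """
    predicted = [sum(attributions[i] for i in subset) for subset in subsets]
    return pearson(average_ranks(predicted), average_ranks(actual_outputs))


# Toy setup (assumption): outputs are exactly additive in a hidden per-sample effect,
# which stands in for retraining a model on each subset.
rng = random.Random(0)
n_train, n_subsets = 30, 20
true_effect = [rng.gauss(0, 1) for _ in range(n_train)]
subsets = [rng.sample(range(n_train), n_train // 2) for _ in range(n_subsets)]
actual = [sum(true_effect[i] for i in s) for s in subsets]

score_exact = lds(true_effect, subsets, actual)                   # ideal attributions
score_flipped = lds([-t for t in true_effect], subsets, actual)   # sign-inverted attributions
```

In this toy setting, ideal attributions give a score close to 1 and sign-inverted ones a score close to -1. Real models are far from additive, which is one reason reported LDS values on deep networks tend to be much smaller than those on linear models.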
Files
Traceprop_arxiv.pdf (364.1 kB)
md5:7817cc77840ad1e27c5d21df6f4e9f1e
Additional details
Related works
- Is supplemented by
- Software: 10.5281/zenodo.20035922 (DOI)
Dates
- Created
- 2026-05-05: Uploaded the preprint
Software
- Repository URL
- https://github.com/AmitoVrito/Traceprop
- Programming language
- Python, Python traceback
- Development Status
- Active