Published March 8, 2026 | Version v0.1
Preprint Open

Ixnos: A Deterministic Fragment Identifier Protocol for AI Training Data

Description

Report series: Ixnos Research Reports

Report number: IRR-002

This paper introduces Ixnos, a protocol for assigning deterministic, content-addressable identifiers to sub-document fragments of machine learning training data. Current training data documentation practices operate at the corpus or dataset level, making it impossible to detect fine-grained overlap between training sets and evaluation benchmarks, reproduce exact dataset compositions, or support audit workflows related to data governance regulations such as the EU AI Act.

Ixnos addresses these limitations by defining a fragment identification primitive — the Ixnos Fragment Identifier (IFI) — based on canonicalized content hashing with explicit segmentation profiles. Dataset Recipes and Provenance Manifests are defined as composable protocol layers enabling deterministic corpus fingerprinting, overlap detection, and reproducible dataset composition.
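
As an illustration of what canonicalized content hashing with an explicit segmentation profile might look like, here is a minimal Python sketch. It is not taken from the paper: the canonicalization steps (Unicode NFC plus whitespace collapsing), the blank-line paragraph segmentation, the profile name `paragraph-v0`, the `ifi:` identifier layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata

# Hypothetical profile name; the actual Ixnos profile registry is not described here.
DEFAULT_PROFILE = "paragraph-v0"


def canonicalize(text: str) -> str:
    """Assumed canonical form: Unicode NFC, whitespace runs collapsed to one space."""
    normalized = unicodedata.normalize("NFC", text)
    return " ".join(normalized.split())


def segment_paragraphs(document: str) -> list[str]:
    """Assumed segmentation profile: fragments are blank-line-separated paragraphs."""
    parts = re.split(r"\n\s*\n", document)
    return [part for part in parts if part.strip()]


def fragment_identifier(fragment: str, profile: str = DEFAULT_PROFILE) -> str:
    """Hash the profile name together with the canonical text (illustrative layout)."""
    canonical = canonicalize(fragment)
    payload = f"{profile}\n{canonical}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"ifi:{profile}:{digest}"


def identify_fragments(document: str, profile: str = DEFAULT_PROFILE) -> list[str]:
    """Return one identifier per fragment, in document order."""
    return [fragment_identifier(p, profile) for p in segment_paragraphs(document)]
```

Because the profile name is part of the hashed payload in this sketch, the same text segmented under two different profiles yields different identifiers, which is one way to keep identifiers from incompatible segmentations from colliding.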

A minimal overlap-detection experiment and a larger-scale feasibility test demonstrate that IFI-based indexing correctly detects contamination at sub-document granularity and supports constant-time lookup against indexed corpora. Ixnos is proposed as a narrow infrastructure primitive for training data traceability rather than as a causal attribution or semantic similarity system.
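
The overlap-detection idea can be sketched as an inverted index from fragment hash to the documents that contain it. The Python below is a self-contained illustration under stated assumptions, not the paper's experiment: fragments are blank-line-separated paragraphs, canonicalization is whitespace collapsing only, the hash is SHA-256, and the names `fragment_ids`, `build_index`, and `find_overlap` are hypothetical.

```python
import hashlib


def fragment_ids(document: str) -> set[str]:
    """Hash each paragraph after collapsing whitespace (assumed canonical form)."""
    ids = set()
    for paragraph in document.split("\n\n"):
        canonical = " ".join(paragraph.split())
        if canonical:
            ids.add(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return ids


def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map every fragment hash to the set of corpus document ids containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in corpus.items():
        for fid in fragment_ids(text):
            index.setdefault(fid, set()).add(doc_id)
    return index


def find_overlap(index: dict[str, set[str]], item: str) -> dict[str, set[str]]:
    """Return the item's fragment hashes that also occur in the indexed corpus."""
    return {fid: index[fid] for fid in fragment_ids(item) if fid in index}


# Toy data invented for the example.
training_corpus = {
    "doc-a": "Intro paragraph.\n\nWhat is the capital of France?  Paris.",
    "doc-b": "An unrelated paragraph.",
}
benchmark_item = "What is the capital of France? Paris."

index = build_index(training_corpus)
hits = find_overlap(index, benchmark_item)
```

Each membership check is a dictionary lookup, so it takes constant time on average regardless of corpus size. Matching is exact after canonicalization, so paraphrased or lightly edited text is not detected, consistent with the stated scope of the protocol as something other than a semantic similarity system.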

Files

IRR-002_Ixnos_Protocol.pdf

Size: 300.5 kB
Checksum: md5:086c541490462b08c8e8f3d2e9bf859b

Additional details

Dates

Submitted
2026

Software

Repository URL
https://github.com/Andr0meda/ixnos-research
Programming language
Python
Development Status
Active