Published April 1, 2026 | Version 1.0.0
Dataset Open

Dataset to "Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints"

Description

Artifact: Hidden Secrets in the arXiv

Publication

Jan Pennekamp, Johannes Lohmöller, David Schütte, Joscha Loos, and Martin Henze. 2026. Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints. In Proceedings of the 47th IEEE Symposium on Security and Privacy (SP '26). IEEE.

Contents

  • paper_issues.csv.gz
    Aggregated per-paper results of the identified unintentional information disclosures. Paper identifiers are anonymized with HMAC-SHA256, as described under Anonymization Scheme. Version 0 denotes S3-provided data, which carries timestamps but no exact version information.
  • survey.md
    User study instrument as deployed via SoSci Survey, including question texts and response options.
  • classification_prompt.txt
    Prompt used with Qwen2.5-72B to classify identified comments into disclosure categories.
  • example.tex / example.bib
    Synthetic LaTeX and BibTeX files containing sanitization test cases that cover representative disclosure patterns.
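
The column layout of `paper_issues.csv.gz` is not documented on this page, so a first look at the file is easiest without assuming any column names. The following is a minimal sketch using only the Python standard library; the helper name `peek_csv_gz`, the UTF-8 encoding, and the comma delimiter are assumptions, not part of the artifact.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n_rows=5):
    """Return (header, first n_rows data rows) of a gzip-compressed CSV file.

    Hypothetical helper: assumes UTF-8 text and a comma-delimited layout
    whose first row is a header.
    """
    # Text mode with newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])            # empty list if the file is empty
        rows = list(islice(reader, n_rows))  # read only the requested rows
    return header, rows
```

Calling `peek_csv_gz("paper_issues.csv.gz")` on the downloaded file should then return the header row and the first five records without decompressing the whole archive into memory.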

Anonymization Scheme

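Paper identifiers in `paper_issues.csv.gz` are anonymized using HMAC-SHA256 with a secret key.
The scheme preserves the arXiv identifier structure (category prefix and numeric suffix): the prefix before the first `/` (or, if there is none, before the first `.`) is kept, and the remaining suffix is replaced with the first 6 hexadecimal characters of its HMAC digest.
Identifiers that contain neither separator are replaced entirely by the 6-character digest.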

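import hashlib
import hmac

# `secret` is the private HMAC key (bytes). It is not part of this artifact
# and must be defined before calling the function.

def anonymize_arxiv_id(arxiv_id):
    # No separator: replace the whole identifier with the truncated digest.
    if '.' not in arxiv_id and '/' not in arxiv_id:
        hmac_hash = hmac.new(secret, arxiv_id.encode('utf-8'), hashlib.sha256).hexdigest()
        return hmac_hash[:6]
    # Split off the prefix, preferring '/' over '.' when both are present.
    if '/' in arxiv_id:
        prefix, suffix = arxiv_id.split('/', 1)
    else:
        prefix, suffix = arxiv_id.split('.', 1)
    # Keep the prefix; replace the suffix with the first 6 hex characters of its HMAC.
    hmac_hash = hmac.new(secret, suffix.encode('utf-8'), hashlib.sha256).hexdigest()
    return f"{prefix}/{hmac_hash[:6]}"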
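
To illustrate the shape of the anonymized identifiers, the sketch below repeats the same logic with the key passed as an explicit argument. The function name `anonymize_with_key`, the key `DEMO_KEY`, and the sample identifiers are hypothetical and chosen only for illustration; because the real key is not published, the digests produced here will not match those in the dataset.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; the real secret is not published.
DEMO_KEY = b"demo-key-not-the-real-one"


def anonymize_with_key(arxiv_id, key):
    """Same branching as anonymize_arxiv_id, with the key as a parameter."""

    def digest6(text):
        # First 6 hex characters of the HMAC-SHA256 digest of `text`.
        return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()[:6]

    if "/" in arxiv_id:
        prefix, suffix = arxiv_id.split("/", 1)
    elif "." in arxiv_id:
        prefix, suffix = arxiv_id.split(".", 1)
    else:
        # No separator: the whole identifier is replaced.
        return digest6(arxiv_id)
    return f"{prefix}/{digest6(suffix)}"


# Sample identifiers in the two common arXiv formats (illustrative values).
for example_id in ("hep-th/9901001", "2301.12345"):
    print(example_id, "->", anonymize_with_key(example_id, DEMO_KEY))
```

Both outputs consist of the original prefix, a `/`, and 6 hex characters. Note that identifiers split on `.` are also emitted with `/` as the separator, and that the same input and key always yield the same anonymized identifier.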

Sanitizer

The sanitizer tool (ALC-NG) used to prevent unintentional disclosure in LaTeX source files is available at: https://github.com/COMSYS/ALC-NG

Files

Files (107.3 MB)

Checksum                                Size
md5:3fd6b47184768cd0523b8da1e6186af8    4.0 kB
md5:1336e948b50d879a09e29cc89500c302    1.2 kB
md5:8d02cf656f684a3f0d15677481c8c475    1.8 kB
md5:919096ac8d3b1c2c647f67591731bb93    107.3 MB
md5:db2804de8d5a8d53ebbda939b01d147b    7.8 kB
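
The MD5 values listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file-to-checksum pairing must be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, the 107.3 MB entry can only correspond to `paper_issues.csv.gz`, so `matches_checksum("paper_issues.csv.gz", "md5:919096ac8d3b1c2c647f67591731bb93")` should return `True` for an uncorrupted download. MD5 is adequate for detecting transfer errors but is not a defense against deliberate tampering.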

Additional details

Software