Published April 1, 2026 | Version 1.0.0 | Dataset | Open
Dataset to "Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints"
Description
Artifact: Hidden Secrets in the arXiv
Publication
Jan Pennekamp, Johannes Lohmöller, David Schütte, Joscha Loos, and Martin Henze. 2026. Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints. In Proceedings of the 47th IEEE Symposium on Security and Privacy (SP '26). IEEE.
Contents
- paper_issues.csv.gz
  Aggregated per-paper results of identified unintentional information disclosures. Paper identifiers are anonymized using HMAC-SHA256 as described below. Version 0 refers to S3-provided data that carries timestamps but no exact version information.
- survey.md
  User study instrument as deployed via SoSci Survey, including question texts and response options.
- classification_prompt.txt
  Prompt used with Qwen2.5-72B to classify identified comments into disclosure categories.
- example.tex / example.bib
  Synthetic LaTeX and BibTeX files containing sanitization test cases covering representative disclosure patterns.
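For a first look at the aggregated results, the following is a minimal sketch that reads the header and a few rows of the compressed CSV. It assumes only that `paper_issues.csv.gz` is a regular gzip-compressed, UTF-8 CSV file with a header row; the column names are not listed in this description, so the sketch does not rely on any. The helper name `peek_csv_gz` is made up for illustration.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n_rows=5):
    """Return (header, first n_rows data rows) of a gzip-compressed CSV file.

    Hypothetical helper: it assumes a header row exists, but makes no
    assumption about which columns the file contains.
    """
    # "rt" decompresses on the fly and yields text; newline="" is what the
    # csv module expects so that it can handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))  # stop early, never load the full file
    return header, rows


# Example call (the path is wherever the download was saved):
# header, rows = peek_csv_gz("paper_issues.csv.gz")
```

Reading only the first rows keeps memory use small, which may matter given the size of the file.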
Anonymization Scheme
Paper identifiers in `paper_issues.csv` are anonymized using HMAC-SHA256 with a secret key. The scheme preserves the arXiv identifier structure (category prefix and numeric suffix) while replacing the numeric part with a 6-character hex digest.
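
    import hmac
    import hashlib

    # `secret` is the secret HMAC key (bytes) mentioned above; it is not defined in this snippet.

    def anonymize_arxiv_id(arxiv_id):
        # Identifiers without any separator are hashed as a whole.
        if '.' not in arxiv_id and '/' not in arxiv_id:
            hmac_hash = hmac.new(secret, arxiv_id.encode('utf-8'), hashlib.sha256).hexdigest()
            return f"{hmac_hash[:6]}"

        # Split at the first '/' if present, otherwise at the first '.'.
        if '/' in arxiv_id:
            prefix, suffix = arxiv_id.split('/', 1)
        else:
            prefix, suffix = arxiv_id.split('.', 1)

        # Only the suffix is hashed; the prefix is kept as-is.
        hmac_hash = hmac.new(secret, suffix.encode('utf-8'), hashlib.sha256).hexdigest()
        return f"{prefix}/{hmac_hash[:6]}"
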
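To make the behavior of the scheme above easier to see, here is a small sketch that applies the same logic with the key passed in explicitly. The key `DEMO_KEY`, the function name `anonymize_with_key`, and the two identifiers are placeholders chosen for illustration; the key used for the published dataset is secret, so the digests produced here will not match those in the dataset.

```python
import hashlib
import hmac

# Placeholder key for illustration only (assumption, not the dataset key).
DEMO_KEY = b"not-the-real-key"


def anonymize_with_key(arxiv_id, key):
    """Same steps as the scheme above, but with an explicit key argument."""

    def short_digest(text):
        # HMAC-SHA256 over the UTF-8 bytes, truncated to 6 hex characters.
        return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()[:6]

    if "." not in arxiv_id and "/" not in arxiv_id:
        return short_digest(arxiv_id)
    separator = "/" if "/" in arxiv_id else "."
    prefix, suffix = arxiv_id.split(separator, 1)
    # The prefix stays readable; the output always uses "/" as separator.
    return f"{prefix}/{short_digest(suffix)}"


# Arbitrary example identifiers in the two common arXiv formats.
new_style = anonymize_with_key("2301.01234", DEMO_KEY)
old_style = anonymize_with_key("hep-th/9901001", DEMO_KEY)
print(new_style)  # "2301/" followed by 6 hex characters
print(old_style)  # "hep-th/" followed by 6 hex characters
```

Two properties follow directly from the code: the mapping is deterministic for a fixed key, and because only the suffix is hashed, identifiers that share a suffix but differ in prefix receive the same 6-character digest after their respective prefixes.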
Sanitizer
The sanitizer tool (ALC-NG) used to prevent unintentional disclosure in LaTeX source files is available at: https://github.com/COMSYS/ALC-NG
Files (107.3 MB)

| Checksum | Size |
|---|---|
| md5:3fd6b47184768cd0523b8da1e6186af8 | 4.0 kB |
| md5:1336e948b50d879a09e29cc89500c302 | 1.2 kB |
| md5:8d02cf656f684a3f0d15677481c8c475 | 1.8 kB |
| md5:919096ac8d3b1c2c647f67591731bb93 | 107.3 MB |
| md5:db2804de8d5a8d53ebbda939b01d147b | 7.8 kB |
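The MD5 checksums in the file listing can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are made up for illustration, and the expected value is assumed to be given in the `md5:<hex>` form used in the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum such as 'md5:3fd6...' or a bare hex string."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example call (the path is wherever the download was saved):
# matches_checksum("paper_issues.csv.gz", "md5:919096ac8d3b1c2c647f67591731bb93")
```

Chunked reading avoids loading the largest file into memory at once. MD5 is suitable here for detecting transfer errors, not for protecting against deliberate tampering.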
Additional details
Software
- Repository URL: https://github.com/COMSYS/ALC-NG