MLCerts Datasets and Language Models (ICSE 2026)
Description
Auxiliary material, up-to-date documentation, and issue tracking are available at: https://github.com/rub-softsec/MLCerts
Docker images for reproducing artifacts are available at: https://zenodo.org/records/17850372
Datasets
Raw PEM certificates used in differential testing:
- v3-chain.tar.bz2: 12 synthetic certificate datasets.
- v3-experiments-extra.tar.bz2: MLCerts 1M dataset.
- frankencerts-v1-8M.tar.bz2: Frankencerts 8M dataset.
- seeds30k.tar.bz2: Transcert 30K dataset.
The CA information is available in the customCA/ directory.
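As a minimal sketch of how the raw PEM data might be consumed after extraction, the snippet below splits a text blob into individual certificate blocks and decodes each one to DER bytes. It assumes standard `-----BEGIN CERTIFICATE-----` / `-----END CERTIFICATE-----` armor; the internal file layout of the archives is not described here, and the helper names `split_pem_certificates` and `pem_to_der` are hypothetical.

```python
import base64
import re

# Matches one PEM-armored certificate, including its BEGIN/END lines.
# Assumption: the datasets use the standard "CERTIFICATE" label.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----\r?\n.*?\r?\n-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_certificates(text: str) -> list:
    """Return every PEM certificate block found in text, in order."""
    return PEM_CERT_RE.findall(text)


def pem_to_der(pem_block: str) -> bytes:
    """Strip the armor lines and base64-decode the body to DER bytes."""
    lines = pem_block.strip().splitlines()
    body = "".join(line.strip() for line in lines[1:-1])
    return base64.b64decode(body)
```

The DER bytes can then be handed to whichever TLS library is under test. Because the certificates are synthetic, parsing failures are expected for some of them and should be handled by the caller.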
Language Models (llm-code-MLcerts-EXPORT.zip)
One of the model architectures below is used to generate synthetic ASN.1 instances (with BEGIN/END tags). asn1_to_pem.py then converts them into PEM format, with CA information copied from the customCA/ directory.
RNN models
Code for the RNN models, based on Char-RNN-Python, is available in the Char-RNN-PyTorch directory. charRNN-custom.py is used for training, and generate.py for generating synthetic certificate instances.
python3 generate.py saved_model hidden_size layers temperature original_cert_dataset extra_run_name
Saved models available are:
- 2022-scanned-1024-3-0.0002lr-0.1dropout-epoch3-step300000
- 2022-scanned-256-3-0.0002lr-0.1dropout-epoch3-step300000
- balanced-versions-1024-3-0.0002lr-0.1dropout-epoch3-step300000
- balanced-versions-256-3-0.0002lr-0.1dropout-epoch3-step300000
- zmap-data-256-3-0.0002lr-0.1dropout-epoch3-step300000
- zmap-data-1024-3-0.0002lr-0.1dropout-epoch3-step300000
To generate certificates with the final model used for the paper's results (IPv4/RNN-Medium with temperature = 1.5), use:
python3 generate.py zmap-data-1024-3-0.0002lr-0.1dropout-epoch3-step300000 1024 3 1.5 zmap-data testZmap1M
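The saved RNN model names appear to encode the dataset, hidden size, and layer count that generate.py expects as separate arguments. The sketch below derives those arguments from a model name. The naming pattern is inferred from the list above, and using the dataset prefix as original_cert_dataset is an assumption based on the single example command; `rnn_generate_command` is a hypothetical helper, not part of the released code.

```python
import re

# Assumed pattern, inferred from the saved model names:
#   <dataset>-<hidden_size>-<layers>-<lr>lr-<dropout>dropout-epoch<N>-step<N>
MODEL_NAME_RE = re.compile(
    r"^(?P<dataset>.+)-(?P<hidden>\d+)-(?P<layers>\d+)"
    r"-[\d.]+lr-[\d.]+dropout-epoch\d+-step\d+$"
)


def rnn_generate_command(saved_model: str, temperature: float, run_name: str) -> list:
    """Build the argument list for the RNN generate.py from a saved model name."""
    match = MODEL_NAME_RE.match(saved_model)
    if match is None:
        raise ValueError(f"unrecognized model name: {saved_model!r}")
    return [
        "python3", "generate.py", saved_model,
        match["hidden"], match["layers"], str(temperature),
        match["dataset"],  # assumption: dataset prefix doubles as original_cert_dataset
        run_name,
    ]
```

The returned list can be passed to `subprocess.run` from inside the Char-RNN-PyTorch directory. Check the result against the documented command before launching a long generation run.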
GPT Models
Code for the GPT models, based on GPT-Neo-125, is available in the Transformers directory. train_script.py is used for fine-tuning (train_script_scratch.py for training from scratch), and generate.py for generating synthetic certificate instances.
python3 generate.py saved_model checkpoint_num training_type temperature
training_type can be 'finetune' or 'custom'. For instance:
python3 generate.py 2022-scanned-custom checkpoint-284400 custom 1.0
Saved models available are:
- 2022-scanned
- 2022-scanned-custom
- balanced-versions
- balanced-versions-custom
- zmap-data-custom
- zmap-data
The custom versions are the ones trained from scratch.
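Both generate.py scripts take a temperature argument. As a rough illustration of what that parameter usually controls in character-level or token-level sampling, the sketch below scales logits by the temperature before applying softmax. This is the textbook formulation, not code taken from the released scripts, whose sampling logic may differ; the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Draw one index according to the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Temperatures above 1 flatten the distribution and produce more varied output, while values below 1 concentrate probability on the most likely symbols.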
conda-env.yml can be consulted for environment dependencies.
BibTeX
Please cite our paper if you rely on the datasets for your work.
@inproceedings{icse2026-hallucinating-certificates,
title = {{Hallucinating Certificates: Differential Testing of TLS Certificate Validation Using Generative Language Models}},
author = {Paracha, Talha and Posluns, Kyle and Borgolte, Kevin and Lindorfer, Martina and Choffnes, David},
booktitle = {Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE)},
date = {2026-04},
edition = {48},
editor = {Mezini, Mira and Zimmermann, Thomas},
location = {Rio de Janeiro, Brazil},
publisher = {Association for Computing Machinery (ACM)/Institute of Electrical and Electronics Engineers (IEEE)}
}
Files (39.0 GB)
| MD5 checksum | Size |
|---|---|
| 2b5a49b5109615c225c21f8220457dd2 | 15.5 GB |
| 2c4b448fef20ff333ed1ab04b6af75a1 | 22.1 GB |
| 624da75b1c4c6909a516b0530f50a200 | 29.4 MB |
| 46f49a64d60c1618d77b7033041e489a | 833.1 MB |
| 4bfaf361c30353447c01fe968b924a83 | 504.2 MB |
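Given the size of the archives, it may be worth verifying a download against the MD5 checksums listed above before extracting it. The following is a minimal sketch using Python's standard `hashlib`. It reads the file in chunks so that multi-gigabyte archives do not need to fit in memory; the helper names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 matches expected (an optional 'md5:' prefix is ignored)."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_md5(path_to_archive, "md5:<checksum from the list>")` should return True for an intact download.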