Published July 16, 2025 | Version v1
Dataset Open

MLCerts Datasets and Language Models (ICSE 2026)

Description

Auxiliary material, up-to-date documentation, and issue tracking are available at: https://github.com/rub-softsec/MLCerts

Docker images for reproducing artifacts are available at: https://zenodo.org/records/17850372

Datasets

Raw PEM certificates used in differential testing:

  • v3-chain.tar.bz2: 12 synthetic certificate datasets.
  • v3-experiments-extra.tar.bz2: MLCerts 1M dataset.
  • frankencerts-v1-8M.tar.bz2: Frankencerts 8M dataset.
  • seeds30k.tar.bz2: Transcert 30K dataset.

The CA information is available in the customCA/ directory.
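
As a quick sanity check after downloading, the archives listed above can be scanned without unpacking them. The following is a minimal sketch, assuming each archive is a bzip2-compressed tar whose regular files hold PEM text; the internal directory layout is not documented here, and count_pem_certificates is a hypothetical helper, not part of the artifact.

```python
import tarfile

PEM_BEGIN = b"-----BEGIN CERTIFICATE-----"


def count_pem_certificates(archive_path):
    """Map each archive member to the number of PEM certificate headers it contains."""
    counts = {}
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Stream line by line so large members are never held in memory at once.
            found = sum(1 for line in handle if line.lstrip().startswith(PEM_BEGIN))
            if found:
                counts[member.name] = found
    return counts
```

For example, count_pem_certificates("seeds30k.tar.bz2") returns a dictionary of per-file certificate counts; members without any PEM header are omitted.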

Language Models (llm-code-MLcerts-EXPORT.zip)

One of the model architectures below is used to generate synthetic ASN.1 instances (with BEGIN/END tags). asn1_to_pem.py is then used to convert them into PEM format, with CA information copied from the customCA/ directory.
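
For readers unfamiliar with the target format, the sketch below illustrates only the PEM container itself: base64-encoded DER bytes, wrapped at 64 characters, between BEGIN/END lines. It does not reproduce asn1_to_pem.py, whose input format and CA handling are not described here; der_to_pem and pem_to_der are hypothetical helper names.

```python
import base64


def der_to_pem(der_bytes, label="CERTIFICATE"):
    """Wrap DER bytes in PEM armor with 64-character base64 lines."""
    b64 = base64.b64encode(der_bytes).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    body = "\n".join(lines)
    return f"-----BEGIN {label}-----\n{body}\n-----END {label}-----\n"


def pem_to_der(pem_text, label="CERTIFICATE"):
    """Recover the DER bytes from the first PEM block carrying the given label."""
    begin = f"-----BEGIN {label}-----"
    end = f"-----END {label}-----"
    start = pem_text.index(begin) + len(begin)
    stop = pem_text.index(end, start)
    # Drop all whitespace (line breaks included) before decoding.
    return base64.b64decode("".join(pem_text[start:stop].split()))
```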

RNN models

Code for RNN models, based on Char-RNN-Python, is available in the Char-RNN-PyTorch directory. charRNN-custom.py is used for training, and generate.py for generating synthetic certificate instances.

python3 generate.py saved_model hidden_size layers temperature original_cert_dataset extra_run_name

Saved models available are:

  • 2022-scanned-1024-3-0.0002lr-0.1dropout-epoch3-step300000
  • 2022-scanned-256-3-0.0002lr-0.1dropout-epoch3-step300000
  • balanced-versions-1024-3-0.0002lr-0.1dropout-epoch3-step300000
  • balanced-versions-256-3-0.0002lr-0.1dropout-epoch3-step300000
  • zmap-data-256-3-0.0002lr-0.1dropout-epoch3-step300000
  • zmap-data-1024-3-0.0002lr-0.1dropout-epoch3-step300000

To generate certificates with the final model used for the paper's results (IPv4/RNN-Medium with Temperature = 1.5), use:

python3 generate.py zmap-data-1024-3-0.0002lr-0.1dropout-epoch3-step300000 1024 3 1.5 zmap-data testZmap1M
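
The saved-model names above appear to encode the hidden size and layer count that generate.py also takes as arguments. The following is a minimal sketch, assuming this naming pattern holds for all six models and that original_cert_dataset equals the dataset prefix of the name (true for the command above, unverified for the others). parse_rnn_model_name and build_rnn_generate_command are hypothetical helpers, not part of the artifact.

```python
import re

# Assumed pattern: <dataset>-<hidden>-<layers>-<lr>lr-<dropout>dropout-epoch<N>-step<N>
MODEL_NAME = re.compile(
    r"^(?P<dataset>.+)-(?P<hidden_size>\d+)-(?P<layers>\d+)"
    r"-(?P<lr>[\d.]+)lr-(?P<dropout>[\d.]+)dropout"
    r"-epoch(?P<epoch>\d+)-step(?P<step>\d+)$"
)


def parse_rnn_model_name(saved_model):
    """Split a saved-model name into its (assumed) hyperparameter fields."""
    match = MODEL_NAME.match(saved_model)
    if match is None:
        raise ValueError(f"unrecognised model name: {saved_model!r}")
    return match.groupdict()


def build_rnn_generate_command(saved_model, temperature, run_name):
    """Build the argument list in the order shown for the RNN generate.py."""
    fields = parse_rnn_model_name(saved_model)
    return [
        "python3", "generate.py", saved_model,
        fields["hidden_size"], fields["layers"], str(temperature),
        fields["dataset"],  # assumption: dataset argument equals the name prefix
        run_name,
    ]
```

The resulting list can be passed to subprocess.run from inside the Char-RNN-PyTorch directory, which avoids mismatches between the model name and the hidden_size and layers arguments.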

GPT Models

Code for GPT models, based on GPT-Neo-125, is available in the Transformers directory. train_script.py is used for training (train_script_scratch.py for training from scratch), and generate.py for generating synthetic certificate instances.

python3 generate.py saved_model checkpoint_num training_type temperature

training_type is either 'finetune' or 'custom'. For instance:

python3 generate.py 2022-scanned-custom checkpoint-284400 custom 1.0

Saved models available are:

  • 2022-scanned
  • 2022-scanned-custom
  • balanced-versions
  • balanced-versions-custom
  • zmap-data-custom
  • zmap-data

The custom versions are the ones trained from scratch.
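
The pairing of model name and training_type can be sketched as follows. This assumes that models with the -custom suffix take training_type 'custom' and all others take 'finetune'; the source shows this only for 2022-scanned-custom. build_gpt_generate_command is a hypothetical helper, and the checkpoint name must be one that actually exists in the chosen model's directory.

```python
GPT_MODELS = {
    "2022-scanned", "2022-scanned-custom",
    "balanced-versions", "balanced-versions-custom",
    "zmap-data", "zmap-data-custom",
}


def build_gpt_generate_command(saved_model, checkpoint, temperature):
    """Build the argument list in the order shown for the GPT generate.py."""
    if saved_model not in GPT_MODELS:
        raise ValueError(f"unknown saved model: {saved_model!r}")
    # Assumption: the "-custom" suffix marks models trained from scratch.
    training_type = "custom" if saved_model.endswith("-custom") else "finetune"
    return ["python3", "generate.py", saved_model, checkpoint,
            training_type, str(temperature)]
```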

See conda-env.yml for environment dependencies.

BibTeX

Please cite our paper if you rely on the datasets for your work. 

@inproceedings{icse2026-hallucinating-certificates,
  title     = {{Hallucinating Certificates: Differential Testing of TLS Certificate Validation Using Generative Language Models}},
  author    = {Paracha, Talha and Posluns, Kyle and Borgolte, Kevin and Lindorfer, Martina and Choffnes, David},
  booktitle = {Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE)},
  date      = {2026-04},
  edition   = {48},
  editor    = {Mezini, Mira and Zimmermann, Thomas},
  location  = {Rio de Janeiro, Brazil},
  publisher = {Association for Computing Machinery (ACM)/Institute of Electrical and Electronics Engineers (IEEE)}
}

Files

Files (39.0 GB)

Checksum (size):

  • md5:2b5a49b5109615c225c21f8220457dd2 (15.5 GB)
  • md5:2c4b448fef20ff333ed1ab04b6af75a1 (22.1 GB)
  • md5:624da75b1c4c6909a516b0530f50a200 (29.4 MB)
  • md5:46f49a64d60c1618d77b7033041e489a (833.1 MB)
  • md5:4bfaf361c30353447c01fe968b924a83 (504.2 MB)
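
Given the size of the downloads, it is worth checking each file against the MD5 checksums listed above. The following is a minimal sketch using only the standard library; md5_of_file and matches_checksum are hypothetical helper names, and the file-to-checksum mapping should be taken from the Zenodo record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte archives.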