Published June 3, 2026 | Version v1
Dataset Open

OLTA-TR: Open-access Labeled Turkish Anti-phishing Email Dataset

Description

This dataset contains 18,305 synthetically generated Turkish email texts designed for phishing and legitimate email classification. The dataset consists of 11,082 phishing and 7,223 legitimate email samples generated under controlled conditions using Large Language Models (LLMs).

The dataset includes structured variations in sector context, communication tone, grammar complexity, message length, persuasion strategies, and credibility indicators. Phishing samples are intentionally designed to reflect common social engineering patterns observed in real-world phishing campaigns, while legitimate samples simulate routine institutional and informational communication.

All email contents in this dataset are synthetically generated and do not contain real personal or sensitive data. This dataset was created exclusively for research and educational purposes.

Categorical Distribution

Sector               Legitimate  Phishing   Total
Education                 1,444     2,124   3,568
Finance                   1,329     1,988   3,317
Public Institutions       1,180     1,475   2,655
Retail                      718       974   1,692
Defense Industry          1,374     3,053   4,427
Technology                1,178     1,468   2,646
Total                     7,223    11,082  18,305
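
The sector table can be checked and summarized with a few lines of Python. The sketch below copies the counts from the table and derives the class totals and the phishing share per sector; the names `SECTOR_COUNTS` and `summarize` are illustrative and not part of the dataset.

```python
# Per-sector counts copied from the table: (legitimate, phishing).
SECTOR_COUNTS = {
    "Education": (1444, 2124),
    "Finance": (1329, 1988),
    "Public Institutions": (1180, 1475),
    "Retail": (718, 974),
    "Defense Industry": (1374, 3053),
    "Technology": (1178, 1468),
}


def summarize(counts):
    """Return total legitimate, total phishing, and per-sector phishing share."""
    legit_total = sum(legit for legit, _ in counts.values())
    phish_total = sum(phish for _, phish in counts.values())
    phishing_share = {
        sector: phish / (legit + phish)
        for sector, (legit, phish) in counts.items()
    }
    return legit_total, phish_total, phishing_share


legit_total, phish_total, phishing_share = summarize(SECTOR_COUNTS)
print(legit_total, phish_total, legit_total + phish_total)
for sector, share in phishing_share.items():
    print(f"{sector}: {share:.1%} phishing")
```

The totals reproduce the 7,223 / 11,082 / 18,305 figures. Defense Industry is the most imbalanced sector (about 69% phishing), which may be worth accounting for when splitting the data by sector.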

Data Generation Methodology

The dataset was generated using a structured prompt-based synthetic data generation pipeline built on multiple LLMs. Multiple models were intentionally used to increase linguistic diversity and reduce model-specific bias.

Three models were used to create the dataset: "DeepSeek-V3.1 (671B)", "gpt-oss:120b-cloud", and "DeepSeek-R1-Distill-Llama-8B".

Email samples were generated by systematically varying predefined contextual attributes:

  • Sector domain
  • Email category
  • Communication tone
  • Grammar complexity level
  • Message length level
  • Language characteristics
  • URL inclusion patterns

For phishing samples, additional behavioral attributes were included to simulate social engineering strategies:

  • Persuasion strategies based on Cialdini's principles
  • Credibility indicators
  • Urgency and authority cues
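
One way to vary attributes systematically is to enumerate every combination of attribute values and build one prompt configuration per combination. The sketch below is only an illustration of that idea, not the authors' pipeline: the attribute names follow the dataset's column names, but most value sets (tone, grammar, length, credibility) are hypothetical placeholders, and `build_configs` is an assumed helper name.

```python
from itertools import product

# Mostly hypothetical value sets; the record does not list the real ones.
ATTRIBUTES = {
    "sector": ["Education", "Finance"],
    "category": ["request", "notification"],
    "tone": ["formal", "informal"],
    "grammar_level": ["low", "high"],
    "length_level": ["short", "long"],
}
# Extra attributes applied to phishing configurations only.
PHISHING_ATTRIBUTES = {
    "cialdini": ["authority", "scarcity"],
    "credibility": ["logo_mention", "signature_block"],  # placeholder values
}


def build_configs(attributes):
    """Yield one dict per combination of attribute values."""
    names = list(attributes)
    for values in product(*(attributes[name] for name in names)):
        yield dict(zip(names, values))


legit_configs = list(build_configs(ATTRIBUTES))
phishing_configs = list(build_configs({**ATTRIBUTES, **PHISHING_ATTRIBUTES}))
print(len(legit_configs), len(phishing_configs))
```

Each configuration dict could then be rendered into a generation prompt. A full Cartesian product grows quickly, so a real pipeline might sample from the grid rather than exhaust it.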

A subset of phishing emails (n = 1,458; about 13% of the 11,082 phishing samples) includes real malicious URLs sourced from URLhaus via stratified sampling. All remaining URLs are synthetically generated.
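
The record does not say which variable defined the strata, so the following is only a generic sketch of stratified sampling. It assumes each feed record carries some grouping tag and draws the same fraction from every group; the record layout, the tag names, and `stratified_sample` are all hypothetical, and the URLs are placeholders rather than URLhaus data.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Sample roughly `fraction` of the items from each stratum given by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for _, members in sorted(strata.items()):
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical feed records: (url, tag). Placeholder URLs only.
feed = [
    (f"http://example.invalid/{i}", "tag_a" if i % 4 else "tag_b")
    for i in range(200)
]
picked = stratified_sample(feed, key=lambda rec: rec[1], fraction=0.1)
print(len(picked))
```

Sampling per stratum keeps rare groups represented in the subset, which a single uniform draw over the whole feed would not guarantee.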

Files Included

phishing.csv: Contains 11,082 synthetically generated phishing email samples.

legitimate.csv: Contains 7,223 synthetically generated legitimate email samples.

Common columns (both files)

Column             Description
sector             Sector or domain context
category           Functional category (e.g., request, notification)
tone               Communication tone
grammar_level      Grammar complexity level
length_level       Message length category
email_with_url     Full email text, including the URL
email_without_url  Email text with the URL removed
url                URL included in the email

Additional columns (phishing dataset only)

Column       Description
cialdini     Persuasion principle applied (e.g., authority, scarcity)
credibility  Credibility indicator type
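
Because the two files share most columns but only the phishing file has `cialdini` and `credibility`, combining them for classification needs a label column and a fill value for the missing fields. A minimal sketch using only the standard library follows; the `label` column, the empty-string fill, and the helper names are assumptions, and the in-memory rows are made-up stand-ins for the real files.

```python
import csv
import io


def load_labeled(handle, label):
    """Read one CSV file object and attach a class label to each row."""
    return [{**row, "label": label} for row in csv.DictReader(handle)]


def combine(phishing_rows, legitimate_rows):
    """Merge both classes; phishing-only columns become '' for legitimate rows."""
    rows = []
    for row in phishing_rows + legitimate_rows:
        row = dict(row)
        row.setdefault("cialdini", "")
        row.setdefault("credibility", "")
        rows.append(row)
    return rows


# Tiny in-memory stand-ins for phishing.csv and legitimate.csv (made-up rows).
phishing_csv = io.StringIO(
    "sector,email_with_url,url,cialdini,credibility\n"
    "Finance,Hesabinizi dogrulayin: http://example.invalid/a,"
    "http://example.invalid/a,authority,logo\n"
)
legitimate_csv = io.StringIO(
    "sector,email_with_url,url\n"
    "Education,Ders programi: http://example.invalid/b,http://example.invalid/b\n"
)
rows = combine(load_labeled(phishing_csv, 1), load_labeled(legitimate_csv, 0))
print([(r["sector"], r["label"], r["cialdini"]) for r in rows])
```

With the downloaded files, the `StringIO` objects would be replaced by something like `open("phishing.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the record does not state it.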

Intended Research Areas

This dataset is intended to support research in:

  • Phishing email detection
  • Malicious content classification
  • Synthetic data generation
  • Social engineering analysis
  • Natural Language Processing (NLP)
  • Deep Learning-based text classification
  • LLM evaluation and benchmarking
  • Machine learning-based cybersecurity systems

Notes

This research was funded by the Information Technology for European Advancement (ITEA) cluster under the VESTA project (#21011).

Files

Files (21.8 MB)

Checksum                              Size
md5:7f8ae5b70a8cdaaa06838d39bdd9c663  8.6 MB
md5:422006159092ef69d129d1f3c9ec0032  13.2 MB
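
The listed MD5 checksums can be used to confirm that a download is intact. The sketch below computes a file's MD5 with `hashlib` and compares it with a listed value in the `md5:<hex>` form shown above; `md5_of_file` and `matches_listing` are illustrative helper names, and the demo runs on a throwaway temporary file rather than the dataset files.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed checksum such as 'md5:7f8a...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a throwaway file containing b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = matches_listing(tmp.name, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(tmp.name)
print(demo_ok)
```

For the real files, each downloaded CSV would be passed to `matches_listing` together with the checksum shown next to it on the record page.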