OLTA-TR: Open-access Labeled Turkish Anti-phishing Email Dataset
Authors/Creators
- BİTES Defence and Aerospace
Description
This dataset contains 18,305 synthetically generated Turkish email texts designed for phishing and legitimate email classification. The dataset consists of 11,082 phishing and 7,223 legitimate email samples generated under controlled conditions using Large Language Models (LLMs).
The dataset includes structured variations in sector context, communication tone, grammar complexity, message length, persuasion strategies, and credibility indicators. Phishing samples are intentionally designed to reflect common social engineering patterns observed in real-world phishing campaigns, while legitimate samples simulate routine institutional and informational communication.
All email contents in this dataset are synthetically generated and do not contain real personal or sensitive data. This dataset was created exclusively for research and educational purposes.
Categorical Distribution
| Sector | Legitimate | Phishing | Total |
|---|---|---|---|
| Education | 1,444 | 2,124 | 3,568 |
| Finance | 1,329 | 1,988 | 3,317 |
| Public Institutions | 1,180 | 1,475 | 2,655 |
| Retail | 718 | 974 | 1,692 |
| Defense Industry | 1,374 | 3,053 | 4,427 |
| Technology | 1,178 | 1,468 | 2,646 |
| Total | 7,223 | 11,082 | 18,305 |
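The counts in the table imply a moderate class imbalance that varies by sector. The sketch below recomputes the totals, the phishing share per sector, and a set of inverse-frequency class weights. The weighting formula (`n_samples / (n_classes * class_count)`) is a common heuristic chosen here for illustration; it is an assumption, not something the dataset authors prescribe.

```python
# Counts copied from the Categorical Distribution table: (legitimate, phishing).
SECTOR_COUNTS = {
    "Education": (1444, 2124),
    "Finance": (1329, 1988),
    "Public Institutions": (1180, 1475),
    "Retail": (718, 974),
    "Defense Industry": (1374, 3053),
    "Technology": (1178, 1468),
}


def totals(counts):
    """Return (legitimate_total, phishing_total) summed over all sectors."""
    legit = sum(l for l, _ in counts.values())
    phish = sum(p for _, p in counts.values())
    return legit, phish


def phishing_share(counts):
    """Return the fraction of phishing samples within each sector."""
    return {sector: p / (l + p) for sector, (l, p) in counts.items()}


def balanced_class_weights(legit, phish):
    """Inverse-frequency weights (an illustrative heuristic, not the authors')."""
    n = legit + phish
    return {"legitimate": n / (2 * legit), "phishing": n / (2 * phish)}


legit_total, phish_total = totals(SECTOR_COUNTS)
shares = phishing_share(SECTOR_COUNTS)
weights = balanced_class_weights(legit_total, phish_total)

# Sectors ordered from most to least phishing-heavy.
for sector, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{sector:<20} {share:.1%} phishing")
print(f"Overall: {phish_total / (legit_total + phish_total):.1%} phishing")
print(weights)
```

Because the phishing share differs across sectors, a train/test split stratified on both label and sector may be worth considering.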
Data Generation Methodology
The dataset was generated using a structured prompt-based synthetic data generation pipeline built on multiple LLMs. Multiple models were intentionally used to increase linguistic diversity and reduce model-specific bias.
The "DeepSeek-V3.1 (671B)", "gpt-oss:120b-cloud", and "DeepSeek-R1-Distill-Llama-8B" models were used to create the dataset.
Email samples were generated by systematically varying predefined contextual attributes:
- Sector domain
- Email category
- Communication tone
- Grammar complexity level
- Message length level
- Language characteristics
- URL inclusion patterns
For phishing samples, additional behavioral attributes were included to simulate social engineering strategies:
- Persuasion strategies based on Cialdini's principles
- Credibility indicators
- Urgency and authority cues
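Systematic variation of attributes like those listed above can be pictured as a Cartesian product over attribute values, with each combination turned into a generation prompt. The following is a minimal sketch under stated assumptions: the sector names, the categories "request" and "notification", and the principles "authority" and "scarcity" appear in this description, while every other value, the prompt wording, and the helper names `attribute_grid` and `build_prompt` are hypothetical. It covers only a subset of the attributes and is not the authors' pipeline.

```python
import itertools
import random

# Values marked "hypothetical" are placeholders, not the authors' vocabulary.
COMMON_ATTRIBUTES = {
    "sector": ["Education", "Finance", "Retail"],
    "category": ["request", "notification"],
    "tone": ["formal", "informal"],        # hypothetical
    "grammar_level": ["low", "high"],      # hypothetical
    "length_level": ["short", "long"],     # hypothetical
}
PHISHING_ATTRIBUTES = {
    "cialdini": ["authority", "scarcity"],
    "credibility": ["logo_mention", "signature_block"],  # hypothetical
}


def attribute_grid(attributes):
    """Yield one dict per combination of attribute values."""
    keys = list(attributes)
    for values in itertools.product(*(attributes[k] for k in keys)):
        yield dict(zip(keys, values))


def build_prompt(spec, phishing):
    """Render a combination as a prompt string (hypothetical wording)."""
    kind = "phishing" if phishing else "legitimate"
    details = ", ".join(f"{k}={v}" for k, v in spec.items())
    return f"Write a synthetic Turkish {kind} email with: {details}."


legit_specs = list(attribute_grid(COMMON_ATTRIBUTES))
phish_specs = list(attribute_grid({**COMMON_ATTRIBUTES, **PHISHING_ATTRIBUTES}))

rng = random.Random(0)  # fixed seed so the printed example is repeatable
print(len(legit_specs), len(phish_specs))
print(build_prompt(rng.choice(phish_specs), phishing=True))
```

Phishing specifications extend the common grid with the behavioral attributes, which is why that grid is larger than the legitimate one.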
A subset of phishing emails (n=1,458, ~13%) includes real malicious URLs sourced from URLhaus via stratified sampling. All remaining URLs are synthetically generated.
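The description does not say which strata were used when sampling from URLhaus. As an illustration of the general technique only, the sketch below performs proportional stratified sampling with largest-remainder quota allocation. The `tag` field, the `example.invalid` URLs, and the function name `stratified_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, n, seed=0):
    """Draw n items with per-stratum quotas proportional to stratum size."""
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)
    total = len(items)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and len(items)")

    # Largest-remainder allocation so the quotas sum exactly to n.
    exact = {key: n * len(group) / total for key, group in groups.items()}
    quota = {key: int(value) for key, value in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(groups, key=lambda key: exact[key] - quota[key], reverse=True)
    for key in by_remainder[:leftover]:
        quota[key] += 1

    # Sample without replacement inside each stratum.
    rng = random.Random(seed)
    sample = []
    for key, group in groups.items():
        sample.extend(rng.sample(group, quota[key]))
    return sample


# Hypothetical URL records: 60 tagged "a", 30 tagged "b", 10 tagged "c".
urls = [
    {"url": f"http://example.invalid/{i}", "tag": "a" if i < 60 else "b" if i < 90 else "c"}
    for i in range(100)
]
picked = stratified_sample(urls, lambda record: record["tag"], n=13)
print(Counter(record["tag"] for record in picked))
```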
Files Included
phishing.csv: Contains 11,082 synthetically generated phishing email samples.
legitimate.csv: Contains 7,223 synthetically generated legitimate email samples.
Common columns (both files)
| Column | Description |
|---|---|
| sector | Sector or domain context |
| category | Functional category (e.g., request, notification) |
| tone | Communication tone |
| grammar_level | Grammar complexity level |
| length_level | Message length category |
| email_with_url | Full email text content |
| email_without_url | Email text content without URL |
| url | URL included in the email |
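Given the common columns above, one way to read either file and attach a class label is sketched here with the standard-library `csv` module. It assumes the files are comma-delimited with a header row, which this description does not confirm. The `label` field, the function name `read_labeled`, and the sample row (including the values "formal", "high", and "short") are made up for illustration.

```python
import csv
import io

COMMON_COLUMNS = [
    "sector", "category", "tone", "grammar_level", "length_level",
    "email_with_url", "email_without_url", "url",
]


def read_labeled(handle, label):
    """Read CSV rows from an open text handle and attach a class label."""
    reader = csv.DictReader(handle)
    missing = set(COMMON_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{**row, "label": label} for row in reader]


# Tiny in-memory stand-in for legitimate.csv; the row content is made up.
sample = io.StringIO(
    "sector,category,tone,grammar_level,length_level,"
    "email_with_url,email_without_url,url\n"
    'Finance,notification,formal,high,short,'
    '"Merhaba, ekstreniz hazır: https://example.invalid/e",'
    '"Merhaba, ekstreniz hazır:",https://example.invalid/e\n'
)
rows = read_labeled(sample, label=0)
print(rows[0]["sector"], rows[0]["label"])
```

With the real files, the same function could be called on `open("legitimate.csv", newline="", encoding="utf-8")` and `open("phishing.csv", newline="", encoding="utf-8")`, assuming UTF-8 encoding, and the two row lists concatenated.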
Additional columns (phishing dataset only)
| Column | Description |
|---|---|
| cialdini | Persuasion principle applied (e.g., authority, scarcity) |
| credibility | Credibility indicator type |
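The phishing-only columns make it possible to examine how persuasion principles are distributed across sectors. A minimal sketch follows, assuming each row carries a single value in `cialdini` (not confirmed by this description). The rows are made-up stand-ins for records from phishing.csv, and `principle_counts` is a hypothetical helper.

```python
from collections import Counter


def principle_counts(rows):
    """Count phishing rows per (sector, cialdini) pair."""
    return Counter((row["sector"], row["cialdini"]) for row in rows)


# Made-up rows shaped like phishing.csv records (only the two columns used here).
rows = [
    {"sector": "Finance", "cialdini": "authority"},
    {"sector": "Finance", "cialdini": "scarcity"},
    {"sector": "Finance", "cialdini": "authority"},
    {"sector": "Retail", "cialdini": "scarcity"},
]
counts = principle_counts(rows)
for (sector, principle), n in sorted(counts.items()):
    print(sector, principle, n)
```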
Intended Research Areas
This dataset is intended to support research in:
- Phishing email detection
- Malicious content classification
- Synthetic data generation
- Social engineering analysis
- Natural Language Processing (NLP)
- Deep Learning-based text classification
- LLM evaluation and benchmarking
- Machine learning-based cybersecurity systems
Files (21.8 MB)
| Checksum | Size |
|---|---|
| md5:7f8ae5b70a8cdaaa06838d39bdd9c663 | 8.6 MB |
| md5:422006159092ef69d129d1f3c9ec0032 | 13.2 MB |
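The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below hashes a local file in chunks and checks whether the digest is one of the two published values. Because the listing does not show which checksum belongs to which file, the check is membership only. The helper names `md5_of` and `matches_published` are hypothetical.

```python
import hashlib

# Checksums as listed on the record page. The listing does not say which
# checksum belongs to which file, so this only checks membership.
PUBLISHED_MD5 = {
    "7f8ae5b70a8cdaaa06838d39bdd9c663",
    "422006159092ef69d129d1f3c9ec0032",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path):
    """True if the file's MD5 equals one of the published checksums."""
    return md5_of(path) in PUBLISHED_MD5
```

For example, `matches_published("phishing.csv")` should return `True` for an unmodified download, assuming the file was saved under that name in the working directory.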