Superviz25-SQL: SQL Injection Detection Dataset
Dataset Description
Superviz25-SQL is an SQL Injection detection dataset designed to evaluate unsupervised SQL Injection detection techniques. It allows comparing how well mechanisms trained on the benign workload of a specific database deployment detect SQL Injection attacks targeting that deployment.
We provide a train/test split: the train set consists of 335,306 benign queries, and the test set comprises 3,017,390 normal samples and 336,281 malicious samples (a 90:10 ratio). The dataset also contains queries typical of insider attacks (n=1089). Beyond the SQL query and its label, extended metadata is provided to facilitate the comparison of detection approaches. For instance, the attack technique field allows comparing the effectiveness of detection mechanisms across existing SQL Injection techniques, and the user input field allows comparing approaches that base their detection on full queries with those that only use the user inputs.
The code used for the generation of this dataset is available at gquetel/sqlia-dataset. If you use this dataset, please acknowledge it by citing the original paper:
Grégor Quetel, Eric Alata, Pierre-François Gimenez, Thomas Robert, Laurent Pautet. Superviz25-SQL: high-quality dataset to empower unsupervised SQL injection detection systems. 1st International Workshop on Assessment with New methodologies, Unified Benchmarks, and environments, of Intrusion detection and response Systems (ANUBIS), Sep 2025, Toulouse, France. ⟨hal-05314211⟩
This work has been partially supported by the French National Research Agency under the France 2030 label (Superviz ANR-22-PECY-0008). The views reflected herein do not necessarily reflect the opinion of the French government.
Usage
The proposed training and testing sets can easily be loaded with Python and pandas as follows:
```python
import pandas as pd

df = pd.read_csv(
    "dataset.csv",
    dtype={
        "full_query": str,
        "label": int,
        "user_inputs": str,
        "attack_stage": str,
        "tamper_method": str,
        "attack_status": str,
        "statement_type": str,
        "query_template_id": str,
        "attack_id": str,
        "attack_technique": str,
        "split": str,
    },
)

df_train = df[df["split"] == "train"]
df_test = df[df["split"] == "test"]
```
Series information
Each sample is characterized by the following columns:
- `full_query`: The full SQL statement.
- `label`: 0 for normal samples, 1 for attacks.
- `user_inputs`: The user input without the query template.
- `attack_stage`: Empty for normal samples; either "recon" or "exploit" for attacks.
- `tamper_method`: Empty for normal samples; for attacks, the randomly selected tamper script used by sqlmap to mutate this sample.
- `attack_status`: Empty for normal samples; otherwise indicates whether the attack campaign from which this sample was generated succeeded ("success") or not ("failure").
- `statement_type`: Either "select", "delete", "execute", "modify", "admin" or "internal".
- `query_template_id`: The query template ID associated with this sample.
- `attack_id`: Empty for normal samples; otherwise the attack campaign identifier.
- `attack_technique`: Empty for normal samples; otherwise the sqlmap technique used for this attack: "boolean", "error", "inline", "stacked", "time", "union" or "insider".
- `split`: Proposed split for this dataset: either "train" or "test".
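The `attack_technique` column supports per-technique breakdowns of detector performance. The sketch below scores a toy test split with a placeholder keyword rule (a hypothetical stand-in, not a method from the paper) and reports recall per technique:

```python
import pandas as pd

# Toy test split; columns mirror the dataset's schema.
df_test = pd.DataFrame({
    "full_query": [
        "SELECT name FROM airports WHERE ident = 'LFBO'",
        "SELECT name FROM airports WHERE ident = '' OR 1=1 -- ",
        "SELECT name FROM airports WHERE ident = '' UNION SELECT x FROM y -- ",
    ],
    "label": [0, 1, 1],
    "attack_technique": ["", "boolean", "union"],
})

def naive_detector(query: str) -> int:
    # Hypothetical placeholder rule; replace with a real unsupervised model.
    return int("--" in query or " UNION " in query.upper())

df_test["pred"] = df_test["full_query"].map(naive_detector)
attacks = df_test[df_test["label"] == 1]
recall_per_technique = attacks.groupby("attack_technique")["pred"].mean()
print(recall_per_technique.to_dict())
```

The same `groupby` pattern works on any of the other metadata columns, e.g. `statement_type` or `attack_stage`.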
Methods
Generation Process
During the design of the dataset, an effort was made to foster the realism and diversity of both normal and attack samples. Generation relies on query templates, filled with legitimate values for normal samples and with sqlmap-generated payloads for malicious ones. The underlying database schema comes from the OurAirports project, which catalogs airport infrastructure around the world and was selected for its real-world usage, well-documented database schema, and publicly accessible data dumps. 62 SQL templates were manually defined to create the samples, covering different statement types: 23 SELECT, 10 UPDATE, 10 INSERT, 8 DELETE, and 11 administrative queries involving user and privilege management.
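The template mechanism can be illustrated as follows; the template text and fill values here are invented for illustration and are not taken from the actual 62 templates:

```python
# Hypothetical query template over the OurAirports schema.
template = "SELECT name, municipality FROM airports WHERE ident = '{ident}'"

# Normal sample: placeholder filled with a legitimate value from the data dump.
benign = template.format(ident="LFBO")

# Attack sample: the same placeholder filled with an injection payload
# (in the dataset, payloads come from sqlmap rather than a fixed string).
malicious = template.format(ident="' OR 1=1 -- ")

print(benign)
print(malicious)
```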
Normal samples
Normal samples were generated using the data dump provided by the OurAirports project. Their syntactic validity was verified by submitting them to a MySQL server.
Attack samples
SQL Injection attack samples were generated using sqlmap, an SQL Injection detection and exploitation tool. Attack samples were generated for a random subset of 45 query templates (mimicking an application where not all endpoints are exposed to external entities). For each of them, we created simulated vulnerable HTTP endpoints against which sqlmap attack campaigns were launched. Each campaign consists of two steps:
- Reconnaissance, where sqlmap is iteratively invoked on the different HTTP parameters to identify a working payload.
- Exploitation, which only took place if the reconnaissance step succeeded for an HTTP parameter; sqlmap was then asked to extract database information using the identified payload.
At each invocation of sqlmap, a random tamper script was selected amongst those compatible with MySQL, and random values were provided for the HTTP parameters not under test, to foster the diversity and realism of attacks.
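The randomization described above can be sketched in a few lines. The tamper-script names below are real scripts shipped with sqlmap, but the parameter pool and selection logic are illustrative assumptions, not the dataset's actual generation code:

```python
import random

# A few MySQL-compatible tamper scripts shipped with sqlmap.
MYSQL_TAMPERS = ["space2comment", "between", "randomcase", "charencode"]

# Hypothetical HTTP parameters of a simulated vulnerable endpoint,
# each with candidate legitimate values.
PARAMS = {"ident": ["LFBO", "KJFK", "EDDF"], "country": ["FR", "US", "DE"]}

def build_invocation(target_param: str, rng: random.Random) -> dict:
    """Pick one tamper script and random values for the non-tested parameters."""
    tamper = rng.choice(MYSQL_TAMPERS)
    values = {p: rng.choice(v) for p, v in PARAMS.items() if p != target_param}
    return {"tamper": tamper, "tested_param": target_param, "fixed_values": values}

rng = random.Random(0)
print(build_invocation("ident", rng))
```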
SQL queries representative of insider attacks are also provided. Since they are rarely present in existing datasets, we include a few examples, also generated with sqlmap: 4 campaigns with different exfiltration objectives were launched using sqlmap's direct connection mode to generate these samples.
Files
- dataset.csv (1.1 GB), md5: 7fe0fd69a3cde9b57bc4f1e2d19260f8
Additional details
Additional titles
- Other
- Presented at ANUBIS'2025
Related works
- Is described by
- https://hal.science/hal-05314211 (URL)
Software
- Repository URL
- https://github.com/gquetel/sqlia-dataset