Published May 8, 2020 | Version v2
Dataset Open

Dataset used for detecting DNS over HTTPS by Machine Learning.

  • 1. FIT CTU
  • 2. FIT CTU/CESNET z.s.p.o.
  • 3. CESNET z.s.p.o.

Description

The dataset consists of three different data sources:

  1. DoH-enabled Firefox
  2. DoH-enabled Google Chrome
  3. Cloudflared DoH proxy

The web browser data was captured using the Selenium framework, which simulated typical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 by Google Chrome.

The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV with flow data, and a CSV file with extracted features.

The CSV with extracted features has the following data fields:

- Label (1 - DoH, 0 - regular HTTPS)
- Data source
- Duration
- Minimal Inter-Packet Delay
- Maximal Inter-Packet Delay
- Average Inter-Packet Delay
- Variance of Incoming Packet Sizes
- Variance of Outgoing Packet Sizes
- Ratio of the number of Incoming and Outgoing bytes
- Ratio of the number of Incoming and Outgoing packets
- Average of Incoming Packet sizes
- Average of Outgoing Packet sizes
- The median value of Incoming Packet sizes
- The median value of outgoing Packet sizes
- The ratio of bursts and pauses
- Number of bursts
- Number of pauses
- Autocorrelation
- Transmission symmetry in the 1st third of connection
- Transmission symmetry in the 2nd third of connection
- Transmission symmetry in the last third of connection
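To make the simpler fields in the list above concrete, here is a minimal sketch of how the timing and packet-size statistics could be computed for a single flow. It is an illustration only, not the extraction code used to build the dataset: the `(timestamp, size, direction)` packet representation, the function name `flow_features`, the dictionary keys, and the choice of population variance are all assumptions. The burst, pause, autocorrelation, and symmetry features are left out because their exact definitions are not given here.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute a few per-flow statistics.

    `packets` is a hypothetical representation: a list of
    (timestamp_seconds, size_bytes, direction) tuples, where direction
    is "in" or "out". The flow must contain at least two packets and at
    least one packet in each direction.
    """
    packets = sorted(packets, key=lambda p: p[0])  # order by timestamp
    times = [t for t, _, _ in packets]
    # Inter-packet delays between consecutive packets, regardless of direction.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, d in packets if d == "in"]
    outgoing = [size for _, size, d in packets if d == "out"]

    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; the dataset may use another estimator.
        "var_in_size": pvariance(incoming),
        "var_out_size": pvariance(outgoing),
        "bytes_in_out_ratio": sum(incoming) / sum(outgoing),
        "packets_in_out_ratio": len(incoming) / len(outgoing),
        "avg_in_size": mean(incoming),
        "avg_out_size": mean(outgoing),
        "median_in_size": median(incoming),
        "median_out_size": median(outgoing),
    }


if __name__ == "__main__":
    # A made-up four-packet flow, for illustration only.
    example = [(0.0, 100, "out"), (1.0, 300, "in"), (3.0, 200, "out"), (6.0, 500, "in")]
    for name, value in flow_features(example).items():
        print(f"{name}: {value}")
```

The values in the provided CSV remain the reference; a sketch like this is mainly useful for sanity-checking one's own reading of a field against the flow data.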

The observed network traffic does not contain privacy-sensitive information. 

The zip file structure is:

|-- data
|   |-- extracted-features...extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows...............................................exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps....................................................raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md
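Given the layout above, the files can be indexed by data type and data source with a short script. This is a minimal sketch, assuming the archive has been extracted to a local directory; the function name `index_dataset` is hypothetical, and because the file names inside each subdirectory are not listed here, the sketch simply collects whatever files it finds.

```python
from pathlib import Path

# Directory names taken from the archive layout.
KINDS = ("extracted-features", "flows", "pcaps")
SOURCES = ("chrome", "cloudflared", "firefox")


def index_dataset(root):
    """Map (kind, source) to the sorted list of files under root/data/kind/source.

    `root` is the directory the archive was extracted into (an assumption).
    Missing subdirectories map to an empty list.
    """
    root = Path(root)
    index = {}
    for kind in KINDS:
        for source in SOURCES:
            folder = root / "data" / kind / source
            if folder.is_dir():
                files = sorted(p for p in folder.rglob("*") if p.is_file())
            else:
                files = []
            index[(kind, source)] = files
    return index
```

For example, `index_dataset("doh-dataset")[("extracted-features", "firefox")]` would return the feature files for the Firefox captures, if the archive was extracted into a directory named `doh-dataset`.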


When using this dataset, please cite the original work as follows:

@inproceedings{vekshin2020,
author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
year = {2020},
isbn = {9781450388337},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3407023.3409192},
doi = {10.1145/3407023.3409192},
booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
articleno = {87},
numpages = {8},
keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
location = {Virtual Event, Ireland},
series = {ARES '20}
}

 

Notes

This work was supported by the European Union's Horizon 2020 research and innovation program under grant agreement No. 833418 and also by the Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/20 funded by the MEYS of the Czech Republic.

Files

doh-dataset.zip (34.2 GB)
md5:2cb76bd268ddb91807138b1f762c8b17
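Because the archive is large, it is worth checking the download against the MD5 checksum listed for doh-dataset.zip. A minimal sketch using Python's standard `hashlib` module follows; the function name `md5_of_file` and the local file path are assumptions, and the file is read in chunks so that the 34.2 GB archive does not have to fit in memory.

```python
import hashlib

# Checksum as listed for doh-dataset.zip on this record.
EXPECTED_MD5 = "2cb76bd268ddb91807138b1f762c8b17"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved in the current directory:
#   ok = md5_of_file("doh-dataset.zip") == EXPECTED_MD5
```

A mismatch most likely means an incomplete or corrupted download, in which case the archive should be fetched again.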

Additional details

Funding

SAPPAN – Sharing and Automation for Privacy Preserving Attack Neutralization 833418
European Commission