Dataset Open Access

Dataset used for detecting DNS over HTTPS by Machine Learning.

Vekshin, Dmitrii; Hynek, Karel; Cejka, Tomas

The dataset consists of three different data sources:

  1. DoH-enabled Firefox
  2. DoH-enabled Google Chrome
  3. Cloudflared DoH proxy

The web browser data was captured using the Selenium framework, which simulated typical user browsing. The browsers received commands to visit domains taken from Alexa's list of the top 10K most visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 by Google Chrome.

The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV file with flow data, and a CSV file with extracted features.
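The two counts above imply a strongly imbalanced label distribution, which matters when choosing evaluation metrics. The short sketch below only redoes that arithmetic; the DoH count is the approximate figure quoted above, not an exact value, and the variable names are illustrative.

```python
# Flow counts as stated in the dataset description.
TOTAL_FLOWS = 1_128_904
DOH_FLOWS_APPROX = 33_000  # "around 33,000", so an approximation

# Fraction of flows labeled as DoH (the positive class).
doh_share = DOH_FLOWS_APPROX / TOTAL_FLOWS

print(f"DoH share of all flows: {doh_share:.1%}")
print(f"About 1 DoH flow per {TOTAL_FLOWS / DOH_FLOWS_APPROX:.0f} flows")
```

With roughly 3% positives, plain accuracy is a weak indicator on its own; precision and recall for the DoH class are usually more informative.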

The CSV with extracted features has the following data fields:

- Label (1 - DoH, 0 - regular HTTPS)
- Data source
- Duration
- Minimal Inter-Packet Delay
- Maximal Inter-Packet Delay
- Average Inter-Packet Delay
- Variance of Incoming Packet Sizes
- Variance of Outgoing Packet Sizes
- Ratio of the number of Incoming and Outgoing bytes
- Ratio of the number of Incoming and Outgoing packets
- Average of Incoming Packet Sizes
- Average of Outgoing Packet Sizes
- Median of Incoming Packet Sizes
- Median of Outgoing Packet Sizes
- Ratio of bursts and pauses
- Number of bursts
- Number of pauses
- Autocorrelation
- Transmission symmetry in the 1st third of connection
- Transmission symmetry in the 2nd third of connection
- Transmission symmetry in the last third of connection
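To make the simpler fields in the list above concrete, here is a minimal sketch of how such per-flow statistics could be computed from a packet list. It is not the authors' extraction code: the input layout (`(timestamp, size, direction)` tuples), the function name `flow_features`, the dictionary keys, the use of population variance, and the incoming/outgoing orientation of the ratios are all assumptions. The burst, pause, autocorrelation, and symmetry features are left out because the description does not define them.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute basic per-flow statistics.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    sorted by timestamp, where direction is "in" or "out" (assumed layout).
    Needs at least two packets and at least one packet in each direction.
    """
    times = [t for t, _, _ in packets]
    # Inter-packet delays between consecutive packets, regardless of direction.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, d in packets if d == "in"]
    outgoing = [size for _, size, d in packets if d == "out"]

    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; the source does not say which.
        "var_in": pvariance(incoming),
        "var_out": pvariance(outgoing),
        # Ratios are taken as incoming / outgoing (assumed orientation).
        "bytes_ratio": sum(incoming) / sum(outgoing),
        "pkts_ratio": len(incoming) / len(outgoing),
        "avg_in": mean(incoming),
        "avg_out": mean(outgoing),
        "median_in": median(incoming),
        "median_out": median(outgoing),
    }


if __name__ == "__main__":
    example = [(0.0, 100, "out"), (0.5, 300, "in"),
               (1.0, 200, "out"), (2.0, 500, "in")]
    for name, value in flow_features(example).items():
        print(f"{name}: {value}")
```

Values computed this way may differ from those in the provided CSV wherever the original definitions differ from the assumptions above.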

The observed network traffic does not contain privacy-sensitive information. 

The zip file structure is:

|-- data
|   |-- extracted-features...extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows...............................................exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps....................................................raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md
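Given the layout above, the files for one data type can be gathered per source with a few lines of Python. This is a sketch under assumptions: `root` is wherever the archive was extracted, the helper name `list_source_files` is made up for illustration, and nothing is assumed about file names or extensions inside the leaf directories.

```python
from pathlib import Path

# Directory names taken from the tree shown above.
SOURCES = ("chrome", "cloudflared", "firefox")
KINDS = ("extracted-features", "flows", "pcaps")


def list_source_files(root, kind):
    """Return {source: sorted list of files} for one data kind.

    root: directory containing the extracted "data" folder.
    kind: one of KINDS.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind!r}, expected one of {KINDS}")
    base = Path(root) / "data" / kind
    result = {}
    for source in SOURCES:
        directory = base / source
        # A missing directory yields an empty list instead of an error.
        if directory.is_dir():
            result[source] = sorted(p for p in directory.rglob("*") if p.is_file())
        else:
            result[source] = []
    return result


if __name__ == "__main__":
    # "." is a placeholder for the extraction directory.
    for source, files in list_source_files(".", "extracted-features").items():
        print(f"{source}: {len(files)} file(s)")
```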


When using this dataset, please cite the original work as follows:

@inproceedings{vekshin2020,
author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
year = {2020},
isbn = {9781450388337},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3407023.3409192},
doi = {10.1145/3407023.3409192},
booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
articleno = {87},
numpages = {8},
keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
location = {Virtual Event, Ireland},
series = {ARES '20}
}

 

This work was supported by the European Union's Horizon 2020 research and innovation program under grant agreement No. 833418 and also by the Grant Agency of the CTU in Prague, grant No. SGS20/210/OHK3/3T/20 funded by the MEYS of the Czech Republic.
Files (34.2 GB)

Name: doh-dataset.zip
Size: 34.2 GB
md5: 2cb76bd268ddb91807138b1f762c8b17
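Because the archive is large, it is worth checking the download against the MD5 checksum listed above before extracting it. The sketch below reads the file in chunks so it does not have to fit in memory; the helper name `md5_of_file` and the local path in the usage comment are assumptions.

```python
import hashlib

# Checksum as published with the file listing.
EXPECTED_MD5 = "2cb76bd268ddb91807138b1f762c8b17"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved in the current directory:
#   if md5_of_file("doh-dataset.zip") != EXPECTED_MD5:
#       raise SystemExit("checksum mismatch, download may be corrupted")
```

MD5 is used here only because it is the checksum the record publishes; it detects accidental corruption, not deliberate tampering.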