Published January 6, 2023 | Version v.1
Dataset Open

Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment

Description

Introduction

The 802.11 standard includes several management features and corresponding frame types. One of these is the Probe Request (PR), which a mobile device in an unassociated state sends to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of the mobile device, such as supported data rates.

This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

It can be used for various purposes, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, or analyzing trends in population movement (streets, shopping malls, etc.) in different time periods.
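As a starting point for the MAC randomization use case, the sketch below checks the locally administered (U/L) bit of a MAC address, which randomized addresses typically set. It assumes the 'MAC' values are colon- or hyphen-separated hexadecimal strings; the function name is hypothetical, and the check is a heuristic rather than proof that an address was randomized.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered (U/L) bit of the MAC is set.

    Assumption: `mac` looks like 'da:a1:19:00:00:01' (or uses '-' separators).
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    # Bit 0x02 of the first octet marks a locally administered address.
    return bool(first_octet & 0x02)
```

Addresses for which this returns False carry a globally unique (vendor-assigned) prefix.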

  Related dataset

The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.


Measurement setup  

The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle that captures WiFi traffic in monitor mode (gateway device).
Passive PR monitoring is performed by listening to 802.11 traffic on a single WiFi channel and keeping only the PR frames.

The following information about each received PR is collected:
 - MAC address
 - Supported data rates
 - Extended supported rates
 - HT capabilities
 - Extended capabilities
 - Data under extended tag and vendor specific tag
 - Interworking
 - VHT capabilities
 - RSSI
 - SSID
 - Timestamp when PR was received.

The collected data was forwarded to a remote database via a secure VPN connection.
A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.


Data preprocessing


The gateway collects PRs in successive scan intervals of predefined length (10 seconds). Within each interval, the data is preprocessed before being transmitted to the database.
For each PR detected in the scan interval, the IE fields are saved in the following JSON structure:

PR_IE_data =
{
    'DATA_RTS': {'SUPP': DATA_supp , 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
    'VENDOR_SPEC': {VENDOR_1:{
                                'ID_1': DATA_1_vendor1,
                                'ID_2': DATA_2_vendor1
                                ...},
                    VENDOR_2:{
                                'ID_1': DATA_1_vendor2,
                                'ID_2': DATA_2_vendor2
                                ...}
                    ...}
}


Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The remaining IE data is represented in hexadecimal format. The vendor-specific tag is structured differently from the other IEs: it can contain multiple vendor IDs, each with multiple data IDs and their corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
IE fields missing from the captured PR are not included in PR_IE_data.
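Because absent IEs are simply left out, code that reads PR_IE_data should not assume any key exists. The following is a minimal sketch of defensive access; the helper names and the sample values are hypothetical placeholders, not taken from the dataset.

```python
def present_ies(pr_ie_data):
    """Return the names of the IEs that were captured for this PR."""
    return sorted(pr_ie_data)


def vendor_ids(pr_ie_data):
    """Return the vendor IDs in 'VENDOR_SPEC', or [] if that IE is absent."""
    return sorted(pr_ie_data.get("VENDOR_SPEC", {}))


# Hypothetical PR with only two IEs present; the values are placeholders.
sample = {
    "DATA_RTS": {"SUPP": [2, 4, 11, 22], "EXT": [12, 18, 24, 36]},
    "HT_CAP": "2d001bff",
}
```

Using `.get()` with a default avoids a `KeyError` for PRs that carry no vendor-specific tag.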

When a new MAC address is detected in the current scan interval, the data from the PR is stored in the following structure:

{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

where PR_data is structured as follows:

{
    'TIME': [ DATA_time ],
    'RSSI': [ DATA_rssi ],
    'DATA': PR_IE_data
}.

 

This data structure makes it possible to store only 'TIME' and 'RSSI' for all PRs that originate from the same MAC address and contain the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored.
The data of each newly detected PR is compared with the data already stored for the same MAC address in the current scan interval.
If identical IE data from the same MAC address is already stored, only the 'TIME' and 'RSSI' values are appended.
If identical IE data from the same MAC address has not yet been received, the PR_data structure of the new PR is appended to the 'PROBE_REQs' key for that MAC address.
The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
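The merging rule described above can be sketched roughly as follows. This is an illustration, not the authors' script: the function name, the use of a dict keyed by MAC address as the per-interval store, and the choice to store each SSID only once are all assumptions.

```python
def add_probe_request(store, mac, ssid, time, rssi, ie_data):
    """Merge one PR into the per-interval store (assumed: dict keyed by MAC)."""
    entry = store.setdefault(mac, {"MAC": mac, "SSIDs": [], "PROBE_REQs": []})

    # Assumption: each distinct SSID is kept once per MAC address.
    if ssid not in entry["SSIDs"]:
        entry["SSIDs"].append(ssid)

    # Identical IE data already stored: append only TIME and RSSI.
    for pr_data in entry["PROBE_REQs"]:
        if pr_data["DATA"] == ie_data:
            pr_data["TIME"].append(time)
            pr_data["RSSI"].append(rssi)
            return

    # New IE data for this MAC: append a new PR_data structure.
    entry["PROBE_REQs"].append({"TIME": [time], "RSSI": [rssi], "DATA": ie_data})
```

At the end of a scan interval, `list(store.values())` would give the per-MAC records in the layout shown above.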

At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
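When reading the data back, it is often convenient to expand a compact per-MAC record into one row per received PR. The sketch below relies only on the 'MAC', 'PROBE_REQs', 'TIME', 'RSSI' and 'DATA' keys described above; the function name is hypothetical, and how the per-MAC records are nested inside each file should be checked against Single_PR_capture_example.json.

```python
def iter_observations(record):
    """Yield (mac, time, rssi, ie_data) for every PR in one per-MAC record."""
    for pr_data in record["PROBE_REQs"]:
        # TIME and RSSI are parallel lists: one entry per received PR.
        for time, rssi in zip(pr_data["TIME"], pr_data["RSSI"]):
            yield record["MAC"], time, rssi, pr_data["DATA"]
```

For example, `len(list(iter_observations(record)))` gives the number of PRs received from that MAC address in the scan interval.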


  Folder structure

For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period.
Each folder contains four files, one per gateway device, each holding the samples from that device.

The folders are named after the start and end time (in UTC).
For example, the folder [2022-09-22T22-00-00_2022-09-23T22-00-00](2022-09-22T22-00-00_2022-09-23T22-00-00) contains samples collected between 00:00 local time (UTC+2) on 23 September 2022 and 00:00 local time on 24 September 2022.
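A folder name can be turned into UTC start and end times with the standard library, as in this minimal sketch. The function name is hypothetical, and the fixed UTC+2 offset for local time is inferred from the example above (22:00 UTC corresponding to 00:00 local time), not stated separately in the dataset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+2, as implied by the folder example.
LOCAL_TZ = timezone(timedelta(hours=2))
FOLDER_TIME_FORMAT = "%Y-%m-%dT%H-%M-%S"


def parse_folder_name(name):
    """Return (start, end) as timezone-aware UTC datetimes."""
    start_text, end_text = name.split("_")
    start = datetime.strptime(start_text, FOLDER_TIME_FORMAT)
    end = datetime.strptime(end_text, FOLDER_TIME_FORMAT)
    return start.replace(tzinfo=timezone.utc), end.replace(tzinfo=timezone.utc)
```

Calling `start.astimezone(LOCAL_TZ)` on the returned value converts it to local time.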

Each file name maps to a gateway location:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
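Given this layout, the capture files can be enumerated as sketched below. It assumes the archive has been extracted so that the seven day folders sit directly under `dataset_root`; that argument and the function name are hypothetical. Files that do not exist on disk are skipped.

```python
from pathlib import Path

LOCATIONS = (1, 2, 3, 4)  # 1.json -> location 1, ..., 4.json -> location 4


def list_capture_files(dataset_root):
    """Return (folder_name, location, path) for each capture file on disk."""
    found = []
    for folder in sorted(Path(dataset_root).iterdir()):
        if not folder.is_dir():
            continue
        for location in LOCATIONS:
            path = folder / f"{location}.json"
            if path.exists():
                found.append((folder.name, location, path))
    return found
```

Each returned path can then be opened with `json.load`; the top-level structure of the files is not assumed here.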

Environments description  

The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo.
The gateway devices (RPis with WiFi dongles) were set up and already gathering data before the start time of this dataset.
As of September 23, 2022, the devices were in their final configuration and had been checked in person for correct installation and for the data status of the entire data collection system.
Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania

Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access).
Under ideal circumstances, the coverage areas of the devices would span both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

  Known dataset shortcomings

Due to technical and physical limitations, the dataset contains some identified deficiencies.

PRs are collected and transmitted in 10-second chunks.
Due to the limited capabilities of the recording devices, some time (in the range of seconds) may be unaccounted for between chunks if the transmission of the previous chunk took too long or an unexpected error occurred.

Every 20 minutes the service is restarted on the recording device.
This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding.
As a result, up to 20 seconds of data is not recorded in each 20-minute period.

The devices had a scheduled reboot at 4:00 each day, which shows up as up to a few minutes of missing data.
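Recording gaps such as these can be located by looking for unusually long pauses between consecutive PRs from one gateway. The sketch below assumes the 'TIME' values have been converted to numeric seconds (their exact format is not specified here); the function name and the 20-second default threshold are illustrative choices. A long pause may also simply mean that no device was probing nearby.

```python
def find_gaps(timestamps, min_gap_s=20.0):
    """Return (gap_start, gap_end) pairs where consecutive PRs are far apart.

    Assumption: `timestamps` are numeric seconds from a single gateway.
    """
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > min_gap_s
    ]
```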

     Location 1 - Piazza del Duomo - Chierici

 The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period.
 Its location was constant and undisturbed, and the dataset seems to have complete coverage.

     Location 2 - Via Etnea - Piazza del Duomo

 The device is located inside the building.
 During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed.
 As the device was moved back and forth, power outages and internet connection issues occurred.
 The last three days in the record contain no PRs from this location.

     Location 3 - Via Etnea - Piazza Università

 Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building.
 Similar behavior is also observed, e.g., the device is placed on the windowsill and moved inside a thick wall when no people are present.
 This device appears to have been collecting data throughout the whole dataset period.
 
     Location 4 - Piazza Università

 This location is wirelessly connected to the access point.
 The device was placed statically on a windowsill overlooking the square.
 Due to physical limitations, the device lost power several times during the deployment.
 The internet connection was also interrupted sporadically.

Recognitions

The data was collected within the scope of the RESILOC project with the help of the City of Catania and project partners.

Files

Dataset.zip (293.8 MB)
md5:23f9f14bc2ee075dfa5c629427a40510

Additional details

Funding

RESILOC – Resilient Europe and Societies by Innovating Local Communities 833671
European Commission