Published January 4, 2023 | Version v.1
Dataset Open

Labeled dataset of IEEE 802.11 probe requests

Description

Introduction  
 
The 802.11 standard includes several management features and corresponding frame types. One of them are probe requests (PR). They are sent by mobile devices in the unassociated state to search the nearby area for existing wireless networks. The frame part of PRs consists of variable length fields called information elements (IE). IE fields represent the capabilities of a mobile device, such as data rates.  
The dataset includes PRs collected in a controlled rural environment and in a semi-controlled indoor environment under different measurement scenarios.  
It can be used for various use cases, e.g., analysing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analysing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

 
Measurement setup  
 
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture Wi-Fi signal traffic in monitoring mode. Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each PR received is collected: MAC address, Supported data rates, extended supported rates, HT capabilities, extended capabilities, data under extended tag and vendor specific tag, interworking, VHT capabilities, RSSI, SSID and timestamp when PR was received.
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package for data collection, preprocessing and transmission.


Data preprocessing

The gateway collects PRs for each consecutive predefined scan interval (10 seconds). During this time interval, the data are preprocessed before being transmitted to the database.
For each detected PR in the scan interval, IEs fields are saved in the following JSON structure:
PR_IE_data =
{
    'DATA_RTS': {'SUPP': DATA_supp , 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
    'VENDOR_SPEC': {VENDOR_1:{
                                'ID_1': DATA_1_vendor1,
                                'ID_2': DATA_2_vendor1
                                ...},
                    VENDOR_2:{
                                'ID_1': DATA_1_vendor2,
                                'ID_2': DATA_2_vendor2
                                ...}
                    ...}
}

 
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.  
Missing IE fields in the captured PR are not included in PR_IE_DATA.

When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

where PR_data is structured as follows:
{
    'TIME': [ DATA_time ],
    'RSSI': [ DATA_rssi ],
    'DATA': PR_IE_data
}.

This data structure allows storing only TOA and RSSI for all PRs originating from the same MAC address and containing the same PR_IE_data. All SSIDs from the same MAC address are also stored.  
The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval.  
If identical PR's IE data from the same MAC address is already stored, then only data for the keys TIME and RSSI are appended.
If no identical PR's IE data has yet been received from the same MAC address, then PR_data structure of the new PR for that MAC address is appended to PROBE_REQs key.  
The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png  
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data e.g. wireless gateway serial number and scan start and end timestamps. For an example of a single PR captured, see the ./Single_PR_capture_example.json file.


Environments description  
 
We performed measurements in a controlled rural outdoor environment and in a semi-controlled indoor environment of the Jozef Stefan Institute.
See the Excel spreadsheet Measurement_informations.xlsx for a list of mobile devices tested.  
 
   Indoor environment  
 
We used 3 RPi's for the acquisition of PRs in the Jozef Stefan Institute. They were placed indoors in the hallways as shown in the ./Figures/RPi_locations_JSI.png. Measurements were performed on weekend to minimize additional uncontrolled traffic from users' mobile devices. While there is some overlap in WiFi coverage between the devices at the location 2 and 3, the device at location 1 has no overlap with the other two devices.

   Rural environment outdoors  
 
The three RPi's used to collect PRs were placed at three different locations with non-overlapping WiFi coverage, as shown in ./Figures/RPi_locations_rural_env.png. Before starting the measurement campaign, all measured devices were turned off and the environment was checked for active WiFi devices. We did not detect any unknown active devices sending WiFi packets in the RPi's coverage area, so the deployment can be considered fully controlled.
All known WiFi enabled devices that were used to collect and send data to the database used a global MAC address, so they can be easily excluded in the preprocessing phase. MAC addresses of these devices can be found in the ./Measurement_informations.xlsx spreadsheet.
Note: The Huawei P20 device with ID 4.3 was not included in the test in this environment.


Scenarios description  
 
We performed three different scenarios of measurements.  
 
   Individual device measurements
 
For each device, we collected PRs for one minute with the screen on, followed by PRs collected for one minute with the screen off. In the indoor environment the WiFi interfaces of the other devices not being tested were disabled. In rural environment other devices were turned off. Start and end timestamps of the recorded data for each device can be found in the ./Measurement_informations.xlsx spreadsheet under the Indoor environment of Jozef Stefan Institute sheet and the Rural environment sheet.

   Three groups test

In this measurement scenario, the devices were divided into three groups. The first group contained devices from different manufacturers. The second group contained devices from only one manufacturer (Samsung). Half of the third group consisted of devices from the same manufacturer (Huawei), and the other half of devices from different manufacturers. The distribution of devices among the groups can be found in the ./Measurement_informations.xlsx spreadsheet.  
 
The same data collection procedure was used for all three groups. Data for each group were collected in both environments at three different RPis locations, as shown in ./Figures/RPi_locations_JSI.png and ./Figures/RPi_locations_rural_env.png.  
At each location, PRs were collected from each group for 10 minutes with the screen on. Then all three groups switched locations and the process was repeated. Thus, the dataset contains measurements from all three RPi locations of all three groups of devices in both measurement environments. The group movements and the timestamps for the start and end of the collection of PRs at each loacation can be found in spreadsheet ./Measurement_informations.xlsx.

   One group test  
 
In the last measurement scenario, all devices were grouped together. In rural evironement we first collected PRs for 10 minutes while the screen was on, and then for another 10 minutes while the screen was off. In indoor environment data were collected at first location with screens on for 10 minutes. Then all devices were moved to the location of the next RPi and PRs were collected for 5 minutes with the screen on and then for another 5 minutes with the screen off.

Folder structure  
 
The root directory contains two files in JSON format for each of the environments where the measurements took place (Data_indoor_environment.json and Data_rural_environment.json). Both files contain collected PRs for the entire day that the measurements were taken (12:00 AM to 12:00 PM) to get a sense of the behaviour of the unknown devices in each environment. The spreadsheet ./Measurement_informations.xlsx. contains three sheets. Devices description contains general information about the tested devices, RPis, and the assigned group for each device. The sheets Indoor environment of Jozef Stefan Institute and Rural environment contain the corresponding timestamps for the start and end of each measurement scenario. For the scenario where the devices were divided into groups, additional information about the movements between locations is included. The location names are based on the RPi gateway ID and may differ from those on the figures showing the locations of the RPIs for each environment.
The ./Figures folder contains the figures already mentioned above.

Notes

This research was partly funded by the Slovenian Research Agency (ARRS) grant no. P2-0016, J2-3048, J2-2507, and P2-0016 - "COVID extension" .

Files

Dataset.zip

Files (50.0 MB)

Name Size Download all
md5:3eeab562d6140adc0891aa122e829b8b
50.0 MB Preview Download

Additional details

Funding

RESILOC – Resilient Europe and Societies by Innovating Local Communities 833671
European Commission