VTT SCOTT IAQ Dataset
=====================
Created: 29.04.2020

This dataset is about indoor air quality (IAQ). This README consists of four
main sections:

1. Dataset Description
2. Re-producing Data Pre-processing and Analysis
3. Licence
4. Citation

Section 1 provides a high-level summary of the dataset. Section 2 describes,
for results re-production purposes, how the data has been pre-processed and
analysed at the Technical Research Centre of Finland (VTT). Section 3
describes the licensing scheme applied to the dataset and to the example
source code for data pre-processing and analysis. Section 4 describes how to
refer to this dataset and to the associated journal article published by the
authors of the dataset.

The dataset and the attached instructions are provided "as is" and have been
valid at the date of writing. The dataset is European GDPR compliant, as it
does not refer to any identifiable person or location.

1. Dataset Description
======================

The dataset contains real-life, long-term data captured over the year 2019,
covering all four seasons in the Nordic climate. It comprises about 22.6
million samples acquired from a total of 62 sensor nodes in 13 rooms at VTT
premises in Northern Finland.

1.1 IAQ Parameters
------------------

Measured IAQ parameters include

o Temperature (T, in Celsius),
o Relative humidity (H, in RH-%),
o Air pressure (P, in hPa),
o Carbon dioxide concentration (CO2, in ppm),
o Activity level (PIR, an integer in the range 0..12, limits inclusive).

All IAQ parameters have been measured using commercially available sensor HW.
For the IAQ parameter 'activity level', a passive infra-red (PIR) sensor has
been used. The sensor detects activity continuously, while the controller
records the activity status in five-second spans (0 = no motion during the
span; 1 = motion detected during the span) and reports an aggregated sum of
detected activity over a period of one minute. Accordingly, the activity
level parameter has a value range from 0 to 12.

Data samples have been obtained once per minute for all sensors. However,
there are missing data samples due to failures in communications and in the
controller HW unit reading the sensors. The dataset is provided as seen by
the data storage service recording the sensor data for long-term storage.
That is, no data sample imputation has been applied to the published dataset.

1.2 Measured Rooms
------------------

The dataset has been acquired from a total of 13 rooms, including 11 office
cubicles for 2 - 3 persons and two 12-person meeting rooms. Every room has
been equipped with a combo-sensor node of type THP-CO2. In addition, there
has been one THP-PIR combo-sensor node per seat in office cubicles and one
per three seats in meeting rooms. The table below summarises the area (in
square meters), volume (in cubic meters) and maximum person capacity of each
room in the dataset.

Rooms    Area   Volume   Person Capacity
         [m2]   [m3]     [persons]
------   ----   ------   ---------------
room00   28.0   89.6     12
room01   14.2   38.3      2
room02   20.2   54.5      2
room03   14.4   38.9      2
room04   31.0   96.1     12
room05   22.5   69.8      3
room06   19.8   61.4      2
room07   19.6   60.8      2
room08   22.2   68.8      3
room09   23.9   74.1      3
room10   22.5   69.8      2
room11   22.6   70.1      2
room12   35.0   94.5      4

1.3 Sensor Positioning
----------------------

The dataset has been collected by equipping each room with multiple sensor
nodes as described in section 1.2 above. All the sensor nodes were positioned
at about 1.1 meters above floor level, following the national legislative
recommendations in Finland [1].
The PIR sensors have been placed close (1 - 2 meters) to the seating
positions in a room. In order to capture the typical CO2 concentration and
its variations in a room, the CO2 sensor node was placed closer to the
ventilation outlet than the inlets, based on three justifying grounds:

1) All the rooms had a balanced mechanical ventilation system for both inlet
   and outlet, with flow adjusted and calibrated to zero pressure difference;
2) Inlet and outlet valves were located at ceiling level at the center and
   the edges of the room, respectively;
3) The inlet valves were of a diffuser type, mixing the air efficiently in
   the room.

References:

[1] Finnish Ministry of Social Affairs and Health: Decree of the Ministry of
    Social Affairs and Health on Health-related Conditions of Housing and
    Other Residential Buildings and Qualification Requirements for
    Third-party Experts (545/2015).
    https://www.finlex.fi/en/laki/kaannokset/2015/en20150545.pdf
    (accessed 29.04.2020)

1.4 Dataset Format
------------------

The acquired data is in comma-separated values (CSV) format. There is one
file per sensor node, and the dataset has a folder for each room containing
the associated sensor node data files. Each sensor node file contains a
header line referring to the measured IAQ parameters. The table below
summarises the column names used.

Column Name       Description
-----------       -----------
co2               CO2 concentration in ppm.
devicetimestamp   Timestamp of the data sample as EPOCH time in seconds
                  since Jan 1, 1970.
humidity          Relative humidity in RH-%.
nodeid            Identification of the sensor node (an integer).
pir_cnt           Activity level (based on the PIR sensor) in the range 0..12.
pressure          Air pressure in hPa.
temperature       Temperature in Celsius.

An example start of a CSV data file for a sensor node:

nodeid,temperature,humidity,pressure,co2,devicetimestamp
924,22.23,18.7,989.3,598,1549884237
924,22.24,18.7,989.3,597,1549884297
...
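As an illustration only, data in this format can be loaded and the EPOCH
timestamps converted with pandas as sketched below (a minimal example using
inline sample rows; in practice you would pass a per-node file path to
`read_csv`):

```python
import io

import pandas as pd

# Example rows in the dataset's CSV format (normally read from a
# per-sensor-node file inside a room folder).
csv_text = (
    "nodeid,temperature,humidity,pressure,co2,devicetimestamp\n"
    "924,22.23,18.7,989.3,598,1549884237\n"
    "924,22.24,18.7,989.3,597,1549884297\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# devicetimestamp is EPOCH seconds since Jan 1, 1970; convert it to a
# datetime column for time-based indexing and resampling.
df["timestamp"] = pd.to_datetime(df["devicetimestamp"], unit="s")
print(df[["timestamp", "nodeid", "co2"]])
```

The resulting `timestamp` column makes it straightforward to reindex the
samples onto a regular one-minute grid later in the pre-processing.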
2. Re-producing Data Pre-processing and Analysis
================================================

2.1 Indoor air quality analysis
-------------------------------

The goal of the analysis is future value estimation of CO2. The data is
located in the zipped folder `data_cleaning.zip`. Unzip it; within you will
find the folder `dirty_data`, which contains room-wise datasets. If you just
want to use the data and write your own code, you can skip to the licence
statement found at the bottom. If you want to use the Python codebase
provided, see the instructions below.

2.2 Installation
----------------

The instructions were tested with (X)Ubuntu 18.04 LTS. Other distributions,
including Microsoft Windows, may or may not work.

First make sure that the system is up to date:

```
sudo apt-get update
sudo apt-get dist-upgrade
```

Have at least Python version 3.6.9 (other versions may or may not work):

```
python3 --version
Python 3.6.9
```

Then make a virtual environment:

```
sudo apt install python3-venv
python3 -m venv myenv
source myenv/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools
```

Install the required libraries with `pip install -r requirements.txt`.

2.2.2 Directory structure
-------------------------

Within your root directory, the codebase is inside a folder called `codes`.
Place the dirty data under `./data_cleaning/dirty_data/`. Cleaned data will
be stored under `./data_cleaning/cleaned_data/`.

2.3 Codebase walkthrough
------------------------

2.3.1 Data preprocessing
------------------------

Generate the files with continuous sections of at least 1 hour with
`python ./codes/prepare_data.py`. The resulting files will be stored under
`./data_cleaning/cleaned_data/`. The steps to create the files:

1. For each room, combine the data of the different sensors into a
   dataframe. This will be saved as
   `./data_cleaning/dirty_data/room/room_dirty_data.pickle`, where `room`
   is the name of the room.
2. Aggregate data for each room.
   There are multiple sensors in each room; their data is aggregated to a
   1-minute interval by taking the median. If a single minute of CO2 data is
   missing while the previous and the next values are available, that minute
   is filled with mean interpolation; longer gaps are not interpolated. The
   data is reindexed to contain every minute of the year 2019. The results
   are stored as
   `./data_cleaning/dirty_data/room/room_aggregate_data.pickle`.
3. Find continuous sections. The data is split into continuous sections;
   currently the minimum length is 1 hour. At the same time, obvious
   outliers are removed. The results are stored as
   `./data_cleaning/cleaned_data/continuous_section_xx.pickle`, where `xx`
   is the length of the continuous sections.
4. Isolate a test set that consists of especially difficult sections. This
   is implemented in the function `find_hard_sections` in
   `./codes/prepare_data.py`. It goes through all the continuous sections,
   calculates the standard deviation (std) and the maximum absolute rate of
   change (maxdiff) of CO2 for each section, and then selects the 30% of
   sections whose std or maxdiff is larger than the 80th quantile. The test
   set is stored as
   `./data_cleaning/cleaned_data/continuous_section_xx_test.pickle` and the
   corresponding training set as
   `./data_cleaning/cleaned_data/continuous_section_xx_train.pickle`.

2.4 Analysis
------------

The code for the analysis is in the file `./codes/protocol.py`. The goal of
the analysis is to predict future CO2 values, using varying lengths of
history and different input variables. The current version uses a history of
{2, 5, 10} minutes and predicts {1, 5, 10, 15} minutes into the future. Four
feature sets are considered: CO2 alone, CO2 + PIR, CO2 + PIR + temperature +
humidity, and PIR + temperature + humidity. The models considered are Ridge,
Decision Tree, Random Forest and Multilayer Perceptron. The regularization
parameter for Ridge is optimized each time.
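The prediction setup above can be illustrated with a minimal sketch: a model
sees the last `history` minutes of CO2 and predicts the value `horizon`
minutes ahead. This is not the code in `./codes/protocol.py`; it uses a
synthetic CO2-like signal and scikit-learn's `RidgeCV` to stand in for the
per-fit regularization optimization described above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def make_supervised(series, history, horizon):
    """Build (X, y) pairs: each row of X holds `history` consecutive
    values; y is the value `horizon` steps after the last input value."""
    X, y = [], []
    for i in range(len(series) - history - horizon + 1):
        X.append(series[i:i + history])
        y.append(series[i + history + horizon - 1])
    return np.array(X), np.array(y)

# Synthetic per-minute CO2-like signal (ppm), standing in for one
# continuous section of the real data.
rng = np.random.default_rng(0)
t = np.arange(600)
co2 = 600 + 150 * np.sin(2 * np.pi * t / 300) + rng.normal(0, 5, t.size)

# History of 10 minutes, prediction 5 minutes ahead; RidgeCV selects the
# regularization strength from a grid of candidate alphas.
X, y = make_supervised(co2, history=10, horizon=5)
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[:400], y[:400])
print("test R^2:", round(model.score(X[400:], y[400:]), 3))
```

The same windowing generalizes to the other feature sets by stacking the
lagged PIR, temperature and humidity columns alongside the CO2 lags.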
The hyperparameters for the DT and RF models must first be optimized; this
is implemented within the class `ScottAnalyzer`. The hyperparameters used in
the adjoining publication are in `./results/dt_scores.pickle` and
`./results/rf_scores.pickle`; a human-readable version is in
`./results/tree_hyperparameters.csv`. A fixed structure of one hidden layer
with 32 neurons was used for the MLP.

Running the whole analysis with 5-fold cross-validation takes days, so you
may want to consider a less demanding validation scheme or fewer models. The
MLP in particular takes time; the other models can be run in a more
reasonable time.

3. Licence
==========

The dataset has been published under the Creative Commons Attribution 4.0
International (CC BY 4.0) license, available at
https://creativecommons.org/licenses/by/4.0/.

A human-readable summary excerpt of (and not a substitute for) the license:

"
You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license
terms.
"

This license lets you distribute, remix, adapt, and build upon the original
work, even commercially, as long as you credit the authors for the original
creation.

4. Citation
===========

If you use the data or the adjoining source code in your study, please cite
the following article:

J. Kallio, J. Tervonen, P. Räsänen, R. Mäkynen, J. Koivusaari, J. Peltola,
Forecasting office indoor CO2 concentration using machine learning with a
one-year dataset, Build. Environ., 187 (2021), Article 107409,
https://doi.org/10.1016/j.buildenv.2020.107409