VTT SCOTT IAQ Dataset
=====================
Created: 29.04.2020

This dataset is about indoor air quality (IAQ). This README consists of four
main sections:

1. Dataset Description
2. Re-producing Data Pre-processing and Analysis
3. Licence
4. Citation

Section 1 provides a high-level summary of the dataset. Section 2 describes,
for results re-production purposes, how the data has been pre-processed and
analysed at the Technical Research Centre of Finland (VTT). Section 3
describes the licensing scheme applied to the dataset and to the example
source code for data pre-processing and analysis. Section 4 describes how to
refer to this dataset and to the associated journal article published by the
authors of the dataset.

The dataset and the attached instructions are provided "as is" and have been
valid at the date of writing. The dataset is European GDPR compliant, as it
does not refer to any identifiable person or location.

1. Dataset Description
======================

The dataset contains real-life, long-term data captured over the year 2019,
covering all four seasons in the Nordic climate. It comprises about 22.6
million samples acquired from a total of 62 sensor nodes in 13 rooms at VTT
premises in Northern Finland.

1.1 IAQ Parameters
------------------

Measured IAQ parameters include

o Temperature (T, in Celsius),
o Relative humidity (H, in RH-%),
o Air pressure (P, in hPa),
o Carbon dioxide concentration (CO2, in ppm),
o Activity level (PIR, an integer in the range 0..12, limits inclusive).

All IAQ parameters have been measured using commercially available sensor HW.
For the IAQ parameter 'activity level', a passive infra-red (PIR) sensor has
been used. The sensor detects activity continuously, while the controller
records the activity status in five-second spans (0 = no motion during the
span; 1 = motion detected during the span) and reports an aggregated sum of
detected activity over a period of one minute. Accordingly, the activity
level parameter has a value range from 0 to 12.

Data samples have been obtained once per minute for all sensors. However,
there are missing data samples due to failures in communications and in the
controller HW unit reading the sensors. The dataset is provided as seen by
the data storage service recording the sensor data for long-term storage.
That is, no data sample imputation has been applied to the published dataset.

1.2 Measured Rooms
------------------

The dataset has been acquired from a total of 13 rooms, including 11 office
cubicles for 2 - 3 persons and two 12-person meeting rooms. Every room has
been equipped with a combo-sensor node of type THP-CO2. In addition, there
has been one THP-PIR combo-sensor node per seat in office cubicles and one
per three seats in meeting rooms. The table below summarises the area (in
square meters), volume (in cubic meters) and maximum person capacity of each
room in the dataset.

Rooms    Area   Volume   Person Capacity
         [m2]   [m3]     [persons]
------   ----   ------   ---------------
room00   28.0   89.6     12
room01   14.2   38.3      2
room02   20.2   54.5      2
room03   14.4   38.9      2
room04   31.0   96.1     12
room05   22.5   69.8      3
room06   19.8   61.4      2
room07   19.6   60.8      2
room08   22.2   68.8      3
room09   23.9   74.1      3
room10   22.5   69.8      2
room11   22.6   70.1      2
room12   35.0   94.5      4

1.3 Sensor Positioning
----------------------

The dataset has been collected by equipping each room with multiple sensor
nodes as described in section 1.2 above. All the sensor nodes were positioned
at about 1.1 meters above floor level, following the national legislative
recommendations in Finland [1].
The PIR sensors have been placed close (1 - 2 meters) to the seating
positions in a room. In order to capture the typical CO2 concentration and
its variations in a room, the CO2 sensor node was placed closer to the
ventilation outlet than the inlets, based on three justifying grounds:

1) All the rooms had a balanced mechanical ventilation system for both inlet
   and outlet, with flow adjusted and calibrated to zero pressure difference;
2) Inlet and outlet valves were located at ceiling level at the center and
   the edges of the room, respectively;
3) The inlet valves were of a diffuser type, mixing the air efficiently in
   the room.

References:

[1] Finnish Ministry of Social Affairs and Health: Decree of the Ministry of
    Social Affairs and Health on Health-related Conditions of Housing and
    Other Residential Buildings and Qualification Requirements for
    Third-party Experts (545/2015).
    https://www.finlex.fi/en/laki/kaannokset/2015/en20150545.pdf
    (accessed 29.04.2020)

1.4 Dataset Format
------------------

The acquired data is in comma-separated values (CSV) format. There is one
file per sensor node, and the dataset has a folder for each room containing
the associated sensor node data files. Each sensor node file contains a
header line referring to the measured IAQ parameters. The table below
summarises the column names used.

Column Name       Description
-----------       -----------
co2               CO2 concentration in ppm.
devicetimestamp   Timestamp of the data sample as EPOCH time in seconds
                  since Jan 1, 1970.
humidity          Relative humidity in RH-%.
nodeid            Identification of the sensor node (an integer).
pir_cnt           Activity level (based on the PIR sensor) in the range 0..12.
pressure          Air pressure in hPa.
temperature       Temperature in Celsius.

An example start of a CSV data file for a sensor node:

nodeid,temperature,humidity,pressure,co2,devicetimestamp
924,22.23,18.7,989.3,598,1549884237
924,22.24,18.7,989.3,597,1549884297
...
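As an illustration only, data in this format can be loaded and the EPOCH
timestamps converted with pandas as sketched below (a minimal example using
inline sample rows; in practice you would pass a per-node file path to
`read_csv`):

```python
import io

import pandas as pd

# Example rows in the dataset's CSV format (normally read from a
# per-sensor-node file inside a room folder).
csv_text = (
    "nodeid,temperature,humidity,pressure,co2,devicetimestamp\n"
    "924,22.23,18.7,989.3,598,1549884237\n"
    "924,22.24,18.7,989.3,597,1549884297\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# devicetimestamp is EPOCH seconds since Jan 1, 1970; convert it to a
# datetime column for time-based indexing and resampling.
df["timestamp"] = pd.to_datetime(df["devicetimestamp"], unit="s")
print(df[["timestamp", "nodeid", "co2"]])
```

The resulting `timestamp` column makes it straightforward to reindex the
samples onto a regular one-minute grid later in the pre-processing.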
2. Re-producing Data Pre-processing and Analysis
================================================

2.1 Indoor air quality analysis
-------------------------------

The goal of the analysis is future value estimation of CO2. The data is
located in the zipped folder `data_cleaning.zip`. Unzip it; within you will
find the folder `dirty_data`, which contains room-wise datasets. If you just
want to use the data and write your own code, you can skip to the licence
statement found at the bottom. If you want to use the Python codebase
provided, see the instructions below.

2.2 Installation
----------------

The instructions were tested with (X)Ubuntu 18.04 LTS. Other distributions,
including Microsoft Windows, may or may not work.

First make sure that the system is up to date:

```
sudo apt-get update
sudo apt-get dist-upgrade
```

Have at least Python version 3.6.9 (other versions may or may not work):

```
python3 --version
Python 3.6.9
```

Then make a virtual environment:

```
sudo apt install python3-venv
python3 -m venv myenv
source myenv/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools
```

Install the required libraries with `pip install -r requirements.txt`.

2.2.2 Directory structure
-------------------------

Within your root directory, the codebase is inside a folder called `codes`.
Place the dirty data under `./data_cleaning/dirty_data/`. Cleaned data will
be stored under `./data_cleaning/cleaned_data/`.

2.3 Codebase walkthrough
------------------------

2.3.1 Data preprocessing
------------------------

Generate the files with continuous sections of at least 1 hour with
`python ./codes/prepare_data.py`. The resulting files will be stored under
`./data_cleaning/cleaned_data/`. The steps to create the files:

1. For each room, combine the data of the different sensors into a
   dataframe. This will be saved as
   `./data_cleaning/dirty_data/room/room_dirty_data.pickle`, where `room`
   is the name of the room.
2. Aggregate data for each room.
   There are multiple sensors in each room; their data is aggregated to a
   1-minute interval by taking the median. If a single minute of CO2 data is
   missing while the previous and the next values are available, that minute
   is filled with mean interpolation; longer gaps are not interpolated. The
   data is reindexed to contain every minute of the year 2019. The results
   are stored as
   `./data_cleaning/dirty_data/room/room_aggregate_data.pickle`.
3. Find continuous sections. The data is split into continuous sections;
   currently the minimum length is 1 hour. At the same time, obvious
   outliers are removed. The results are stored as
   `./data_cleaning/cleaned_data/continuous_section_xx.pickle`, where `xx`
   is the length of the continuous sections.
4. Isolate a test set that consists of especially difficult sections. This
   is implemented in the function `find_hard_sections` in
   `./codes/prepare_data.py`. It goes through all the continuous sections,
   calculates the standard deviation (std) and the maximum absolute rate of
   change (maxdiff) of CO2 for each section, and then selects the 30% of
   sections whose std or maxdiff is larger than the 80th quantile. The test
   set is stored as
   `./data_cleaning/cleaned_data/continuous_section_xx_test.pickle` and the
   corresponding training set as
   `./data_cleaning/cleaned_data/continuous_section_xx_train.pickle`.

2.4 Analysis
------------

The code for the analysis is in the file `./codes/protocol.py`. The goal of
the analysis is to predict future CO2 values, using varying lengths of
history and different input variables. The current version uses a history of
{2, 5, 10} minutes and predicts {1, 5, 10, 15} minutes into the future. Four
feature sets are considered: CO2 alone, CO2 + PIR, CO2 + PIR + temperature +
humidity, and PIR + temperature + humidity. The models considered are Ridge,
Decision Tree, Random Forest and Multilayer Perceptron. The regularization
parameter for Ridge is optimized each time.
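The prediction setup above can be illustrated with a minimal sketch: a model
sees the last `history` minutes of CO2 and predicts the value `horizon`
minutes ahead. This is not the code in `./codes/protocol.py`; it uses a
synthetic CO2-like signal and scikit-learn's `RidgeCV` to stand in for the
per-fit regularization optimization described above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def make_supervised(series, history, horizon):
    """Build (X, y) pairs: each row of X holds `history` consecutive
    values; y is the value `horizon` steps after the last input value."""
    X, y = [], []
    for i in range(len(series) - history - horizon + 1):
        X.append(series[i:i + history])
        y.append(series[i + history + horizon - 1])
    return np.array(X), np.array(y)

# Synthetic per-minute CO2-like signal (ppm), standing in for one
# continuous section of the real data.
rng = np.random.default_rng(0)
t = np.arange(600)
co2 = 600 + 150 * np.sin(2 * np.pi * t / 300) + rng.normal(0, 5, t.size)

# History of 10 minutes, prediction 5 minutes ahead; RidgeCV selects the
# regularization strength from a grid of candidate alphas.
X, y = make_supervised(co2, history=10, horizon=5)
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[:400], y[:400])
print("test R^2:", round(model.score(X[400:], y[400:]), 3))
```

The same windowing generalizes to the other feature sets by stacking the
lagged PIR, temperature and humidity columns alongside the CO2 lags.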
The hyperparameters for the DT and RF models must first be optimized; this
is implemented within the class `ScottAnalyzer`. The hyperparameters used in
the adjoining publication are in `./results/dt_scores.pickle` and
`./results/rf_scores.pickle`; a human-readable version is in
`./results/tree_hyperparameters.csv`. A fixed structure of one hidden layer
with 32 neurons was used for the MLP.

Running the whole analysis with 5-fold cross-validation takes days, so you
may want to consider a less demanding validation scheme or fewer models. The
MLP in particular takes time; the other models can be run in a more
reasonable time.

3. Licence
==========

The dataset has been published under the Creative Commons Attribution 4.0
International (CC BY 4.0) license, available at
https://creativecommons.org/licenses/by/4.0/.

A human-readable summary excerpt of (and not a substitute for) the license:

"
You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license
terms.
"

This license lets you distribute, remix, adapt, and build upon the original
work, even commercially, as long as you credit the authors for the original
creation.

4. Citation
===========

If you use the data or the adjoining source code in your study, please cite
the following article:

J. Kallio, J. Tervonen, P. Räsänen, R. Mäkynen, J. Koivusaari, J. Peltola,
Forecasting office indoor CO2 concentration using machine learning with a
one-year dataset, Build. Environ., 187 (2021), Article 107409,
https://doi.org/10.1016/j.buildenv.2020.107409