{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Companion Notebook to _Assessing Plant Photosynthesis and Environmental Measurements Using a Hyperspectral Camera_\n",
    "\n",
    "This Jupyter notebook showcases some of the data analysis we did on the averaged dataset, including adding vegetation indices and model training. First, we will load the basic dataset without the vegegation indices and have a look at the environmental data. Second, vegegation indices are added. Third and finally, a model is fit to the data and the predictions are visualised. \n",
    "\n",
    "## Loading the Data\n",
    "\n",
    "The data is stored in a CSV-file for easy access in any programming language. We'll use `pandas` to load the data into memory and manipulate it. Each CSV-file contains all environmental and gas-exchange data that is synchronised to the image data. In this illustartion, we'll load the plant data from the first experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "fn = os.path.join(\"data\", \"exp-1-plant.csv\")\n",
    "df = pd.read_csv(fn)\n",
    "df[\"HHMMSS\"] = pd.to_datetime(df[\"HHMMSS\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first few data entries are missing. This is due to the synchronisation offset. In the analysis for the paper, the first and last 300 samples were removed to remove any missing values or deviating measurements due to human interference.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "rm_idx = np.concatenate((np.arange(0,300, dtype=np.int), np.arange(len(df)-300, len(df))))\n",
    "\n",
    "df = df.drop(index=rm_idx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column names are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first part (`HHMMSS` through `Status`) refers to data captured using the LI6400XT, for an explanation, we refer to _Using the LI‐6400 Portable Photosynthesis System_, Li-Cor Inc., 2012 which is freely available online. `xTair` and `xRH` are the air temperature and relative humidity respectively. These were measured using an external Vaisala 50Y probe connected to the LI6400XT. Some of the parameters have been renamed in the papar to make the names more meaningful. The following table provides an overview:\n",
    "\n",
    "| original parameter | new parameter|\n",
    "|:------------------:|:------------:|\n",
    "| xTair              | Tair         |\n",
    "| xRH                | RH           |\n",
    "| Trmmol             | Transp       |\n",
    "| PARo               | PAR          |\n",
    "\n",
    "The second part of the dataset is the hyperspectral data. They can be related to as follows to the camera:\n",
    "`c-{camera number}-c-{channel numnber}` where camera number 0 refers to H1 (the NIR camera) and 1 to H2 (the VIS camera). Do not that the channel numbers here do _not_ correspond to the channel numbers from imec. The index relationship is the following to convert to the imec channel numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "camera_0_reindex = [20, 21, 22, 23, 24, 15, 16, 17, 18, 19, 10, 11, 12, 13, 14, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]\n",
    "camera_1_reindex = [12, 13, 14, 15, 8,  9,  10, 11, 4, 5, 6, 7, 0, 1, 2, 3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we did not use the imec calibration procedure, this is just provided for reference. \n",
    "\n",
    "## Visualising the Environmental Data\n",
    "\n",
    "First, it's important to have an idea of the environmental data. We will generate simple time series plots and the 2D density plots for the three main external drivers: relative humidity, temperature and indicent light (PAR). \n",
    "\n",
    "We increase the figure size to have a better view of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "matplotlib.rcParams['figure.figsize'] = [15, 6]\n",
    "matplotlib.rcParams['font.size'] = 15"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that part of the data is missing. This is intentional. For details, we refer to the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(3, 1, sharex=True)\n",
    "fig.subplots_adjust(hspace=0)\n",
    "\n",
    "for param, ax in zip([\"xTair\", \"xRH\", \"PARo\"], axs):\n",
    "    ax.plot(df[\"HHMMSS\"], df[param])\n",
    "    ax.set_ylabel(param)\n",
    "axs[-1].set_xlabel(\"time\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The density plots for all tuples possible with the plots above. Contrary to the paper, these plots are nor normalised."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for a, b in [(\"PARo\", \"xTair\"), (\"PARo\", \"xRH\"), (\"xRH\", \"xTair\")]:\n",
    "    plt.figure()\n",
    "    plt.hist2d(df[a], df[b], (25, 25))\n",
    "    plt.xlabel(a)\n",
    "    plt.ylabel(b)\n",
    "    plt.colorbar()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating vegetation indices\n",
    "\n",
    "Vegetation indices of two types are added:\n",
    "\n",
    "$$ VI_{ij}^F = \\frac{C_i}{C_j}\\qquad i\\in\\{0,1,\\ldots,40\\}, j \\in\\{0,1,\\ldots,i-1,i+1,\\ldots,40\\}$$\n",
    "$$ VI_{ij}^{NF} = \\frac{C_i-C_j}{C_i+C_j}\\qquad i\\in\\{0,1,\\ldots,39\\}, j \\in\\{i+1,i+2,\\ldots,40\\}$$\n",
    "\n",
    "To this end, we've writting two functions and import then to get the desired data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import get_vi_f, get_vi_nf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generation of VIs can take a while, because they are only added if the correlation is below 0.95. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_vi_f = get_vi_f(df, c_offset=2, verbose=False)\n",
    "df = pd.concat([df, df_vi_f], axis=1, sort=False)\n",
    "df_vi_nf = get_vi_nf(df, c_offset=2+41, verbose=False)\n",
    "df = pd.concat([df, df_vi_nf], axis=1, sort=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the Model\n",
    "\n",
    "Here, we illustrate how one can train a model like in the paper. We use `scikit-learn` models and preprocessors. In the paper the hyperparameter $\\alpha$ was evaluated from $10^{-10}$ to $10^{10}$ on a logaritmic scale. The train and validation sets were swapped and the best performing set was selected after averaging the two NMSE scores. \n",
    "\n",
    "Because we are using a `StandardScaler`, the `normalize` parameter of `Ridge` should not be used. \n",
    "\n",
    "The model parameters are listed below. Normally $10^{-5}$ should yield good results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_params = {\n",
    "    \"alpha\": 1e-5,\n",
    "    \"normalize\": False,\n",
    "    \"random_state\": 25,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we can start training the model, the relevant target variable and data have to be extracted. The x-data is extracted using a convenience function `get_x_data`, defined in the helper file. The groups object is not used be can be useful to see the original channel and camera. \n",
    "\n",
    "We will use this model to estimate the "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import get_x_data\n",
    "\n",
    "target = \"Tleaf\"\n",
    "\n",
    "X, groups = get_x_data(df)\n",
    "y = df[target].values\n",
    "t = df[\"HHMMSS\"].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use batch sizes of 1500 samples, which correspond to a length of 4500 seconds. No time offset is used. It is important to pass the correct experiment index (`0` for the first and `1` for the second to get the correct masks in case of an offset. Since the mask generation has to keep the "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import get_masks\n",
    "\n",
    "data_length = len(X)\n",
    "batch_size = 1500\n",
    "offset = 0\n",
    "exp_idx = 0\n",
    "\n",
    "mask_x, mask_y = get_masks(data_length=data_length, batch_size=batch_size, mask_type=\"train\", offset=offset,\n",
    "                           exp_idx=exp_idx, t=t)\n",
    "x_train, y_train = X[mask_x], y[mask_y]\n",
    "\n",
    "mask_x, mask_y = get_masks(data_length=data_length, batch_size=batch_size, mask_type=\"val\", offset=offset,\n",
    "                           exp_idx=exp_idx, t=t)\n",
    "x_val, y_val = X[mask_x], y[mask_y]\n",
    "\n",
    "mask_x, mask_y = get_masks(data_length=data_length, batch_size=batch_size, mask_type=\"test\", offset=offset,\n",
    "                           exp_idx=exp_idx, t=t)\n",
    "x_test, y_test = X[mask_x], y[mask_y]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it is time to train the model. The convenience function `get_performance` is used to get the NMSE and MSE values of the different splits. All results are stored in the `results` dictionary for easy retreival. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import Ridge\n",
    "from helpers import get_performance\n",
    "\n",
    "results = dict()\n",
    "\n",
    "x_scaler = StandardScaler()\n",
    "xt_tf = x_scaler.fit_transform(x_train)\n",
    "results[\"x_scaler\"] = x_scaler\n",
    "\n",
    "model = Ridge(**model_params)\n",
    "\n",
    "model.fit(xt_tf, y_train)\n",
    "results[\"model\"] = model\n",
    "\n",
    "# compute train data performance\n",
    "results = get_performance(x_train, y_train, results, x_scaler, model, \"train\")\n",
    "\n",
    "# compute validation data performance\n",
    "results = get_performance(x_val, y_val, results, x_scaler, model, \"val\")\n",
    "\n",
    "# compute test data performance\n",
    "results = get_performance(x_test, y_test, results, x_scaler, model, \"test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the data performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for k in results:\n",
    "    if k.startswith(\"NMSE\"):\n",
    "        print(\"{}: {}\".format(k, results[k]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, the training data has better performance than the validation and test sets. The difference between validation and test sets is low, as expected. \n",
    "\n",
    "A visualisation of the models is also useful to have a better idea of the model performance. We will plot the entire data trace. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_tf = results[\"x_scaler\"].transform(X)\n",
    "\n",
    "yhat = model.predict(x_tf)\n",
    "\n",
    "plt.figure()\n",
    "plt.plot(t, yhat, label=\"estimated\")\n",
    "plt.plot(t, y, label=\"measured\")\n",
    "plt.legend(loc=\"upper right\")\n",
    "plt.grid(which=\"both\")\n",
    "plt.xlabel(\"time\")\n",
    "plt.ylabel(\"leaf temperature [C]\")\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}