
|	Data for the publication "Understanding cirrus clouds using explainable machine learning"|
| 	Jeggle et al., 2023								|
|	Author: Kai Jeggle								|
| 	Date: 30 01 2023								|

-------------------------------------------------
A. Basic description:
-------------------------------------------------

Data to train machine learning models and conduct the analysis in:

Authors: Kai Jeggle, David Neubauer, Gustau Camps-Valls and Ulrike Lohmann
Title: Understanding cirrus clouds using explainable machine learning
Journal: Climate Informatics Conference / Environmental Data Science (Submitted)
Date: 2023

The code is available in a public GitHub repository: https://github.com/tabularaza27/explaining_cirrus

Feel free to contact me if you have any questions (kai.jeggle@env.ethz.ch).

-------------------------------------------------
B. Directory contents:
-------------------------------------------------

├── instantaneous_data.csv
├── temporal_data.csv
├── README.txt

-------------------------------------------------
C. Descriptions:
-------------------------------------------------

Here, we provide only the co-located data sets presented in the publication. The original data sets are freely available online:

DARDAR-Nice (DARNI_L2_PRO.v1.10) : https://doi.org/10.25326/09
ERA5: https://doi.org/10.24381/cds.bd0915c6
MERRA2: https://doi.org/10.5067/LTVB4GPCOTK2

The Python code for training and explaining the machine learning models can be found at: https://github.com/tabularaza27/explaining_cirrus/

**Preprocessing steps conducted to create the co-located data sets.**

DARDAR-Nice:
* Aggregate the in-cloud values of all observations that lie in a grid cell (0.25° x 0.25° x 300 m x 1 h)
* Calculate cloud cover for each grid cell at each altitude level (# observations with clouds / # all observations)
* Filter out observations with bad quality flag
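
The cloud cover step above can be sketched as follows. This is a minimal illustration, not the code used for the publication: the input table and its boolean `has_cloud` column are hypothetical, and the sample values are made up. It returns a fraction between 0 and 1; whether the published `cloud_cover` column is stored as a fraction or a percentage should be checked against the data.

```python
import pandas as pd

GRID_KEYS = ["lat", "lon", "lev", "time"]

def gridcell_cloud_cover(obs: pd.DataFrame) -> pd.DataFrame:
    """Return cloudy observations / all observations for every grid cell.

    `obs` is a hypothetical table with one row per satellite observation,
    already assigned to a grid cell, plus a boolean `has_cloud` column.
    """
    return (
        obs.groupby(GRID_KEYS)["has_cloud"]
        .mean()  # mean of booleans = share of cloudy observations
        .rename("cloud_cover")
        .reset_index()
    )

# Made-up example: three observations in one grid cell, one in its neighbour.
obs = pd.DataFrame({
    "lat": [10.0, 10.0, 10.0, 10.25],
    "lon": [20.0, 20.0, 20.0, 20.0],
    "lev": [9000, 9000, 9000, 9000],
    "time": pd.to_datetime(["2008-01-01 00:00"] * 4),
    "has_cloud": [True, False, True, True],
})
cover = gridcell_cloud_cover(obs)
```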

ERA5:
* Interpolate to 0.25x0.25 grid (conservative for relative humidity, linear for other variables)
* Calculate pressure and height levels from model levels
* Interpolate to 300m vertical levels
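
The vertical regridding step can be illustrated for a single profile. The sketch below assumes linear interpolation in height, which this README does not state, and the function name and sample profile are hypothetical.

```python
import numpy as np

def to_fixed_levels(height_m, values, dz=300.0):
    """Linearly interpolate one profile onto levels spaced `dz` metres apart.

    Assumption: linear interpolation in height; the actual method used for
    the data set may differ.
    """
    height_m = np.asarray(height_m, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(height_m)  # np.interp requires increasing x values
    height_m, values = height_m[order], values[order]
    first = np.ceil(height_m[0] / dz) * dz  # lowest target level inside the profile
    target = np.arange(first, height_m[-1] + 1e-9, dz)
    return target, np.interp(target, height_m, values)

# Made-up temperature profile given on three model-level heights.
levels, temp = to_fixed_levels([10000.0, 9400.0, 8800.0], [220.0, 224.0, 228.0])
```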

MERRA2:
* Horizontal Interpolation 0.5x0.625 → 0.25x0.25 (conservative for all variables)
* Calculate volume mixing ratio from mass mixing ratio and air density
* Calculate pressure and height levels from model levels
* Interpolate to 300m vertical levels
* Temporal Upsampling: Nearest neighbor upsampling from 3-hourly to hourly
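
Two of the MERRA2 steps above are easy to illustrate with pandas. This is a sketch under assumptions: the function names are hypothetical and the numbers are made up. The unit conversion multiplies the mass mixing ratio by air density, and the temporal step fills each hourly slot with the nearest 3-hourly value.

```python
import pandas as pd

def mixing_ratio_times_density(mmr_mg_per_kg, air_density_kg_per_m3):
    """Convert mass mixing ratio [mg kg⁻¹] to [mg m⁻³] using air density."""
    return mmr_mg_per_kg * air_density_kg_per_m3

def upsample_to_hourly(series: pd.Series) -> pd.Series:
    """Nearest-neighbor upsampling of a time-indexed series to hourly steps."""
    return series.resample(pd.Timedelta(hours=1)).nearest()

# Made-up 3-hourly series.
three_hourly = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2008-01-01 00:00", periods=3, freq=pd.Timedelta(hours=3)),
)
hourly = upsample_to_hourly(three_hourly)
concentration = mixing_ratio_times_density(2.0, 0.5)
```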

Data fusion:
* Co-locate all data on 0.25x0.25x300mx1h grid
* Create a dataframe (table format) in which each row represents one grid cell (lat, lon, lev, time) with a cirrus observation
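
Once all sources share the same grid, the fusion can be thought of as a join on the four grid coordinates. The following sketch uses hypothetical input tables with made-up values; it is not the pipeline from the repository.

```python
import pandas as pd

GRID_KEYS = ["lat", "lon", "lev", "time"]

def colocate(cirrus: pd.DataFrame, reanalysis: pd.DataFrame) -> pd.DataFrame:
    """Keep every grid cell with a cirrus observation and attach reanalysis values."""
    return cirrus.merge(reanalysis, on=GRID_KEYS, how="left")

t0 = pd.Timestamp("2008-01-01 00:00")
# Made-up satellite table: one grid cell with a cirrus observation.
cirrus = pd.DataFrame(
    {"lat": [10.0], "lon": [20.0], "lev": [9000], "time": [t0], "iwc": [3.2]}
)
# Made-up reanalysis table: covers that cell and a cloud-free neighbour.
reanalysis = pd.DataFrame({
    "lat": [10.0, 10.25],
    "lon": [20.0, 20.0],
    "lev": [9000, 9000],
    "time": [t0, t0],
    "t": [221.0, 222.5],
})
fused = colocate(cirrus, reanalysis)
```

A left join keeps only grid cells that contain a cirrus observation, which matches the table layout described above.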

Temporal Dataset:
For the temporally resolved dataset, 48 h of Lagrangian backtrajectories are calculated for each observation, and meteorological and aerosol variables are traced along each trajectory.

**Data Variables**

Variables from DARDAR-Nice:

lev [m]: altitude of cloud layer. Layers are 300 m thick
time: time & date of observations aggregated on hourly resolution
lat: latitude on 0.25°x0.25° grid
lon: longitude on 0.25°x0.25° grid
season: season of observations, possible values ['DJF', 'MAM', 'JJA', 'SON']
lat_region: 10° bins of latitude, used as predictor in ML models
instrument_flag: Active instrument(s) used in retrievals
land_water_mask: Surface type at laser footprint, from CALIPSO files [-]
nightday_flag: Flag indicating day/night conditions
iwc [mg m⁻³]: mass concentration of frozen water in air
icnc_5um [cm⁻³]: number concentration of ice crystals larger than 5um in air
reffcli [um]: effective radius of cloud ice particles
dz_top_v2 [m]: distance from cloud top for a given layer (calculated based on clm_v2 in DARDAR-Nice)
cloud_thickness_v2 [m]: vertical extent of cloud (based on dz_top_v2)
cloud_cover: percentage of DARDAR-Nice observations in 0.25°x0.25°x300mx1h gridbox containing cirrus cloud

Variables from ERA5:

t [K] - air temperature
w [Pa s⁻¹] - Vertical velocity
u [m s⁻¹] - U component of wind
v [m s⁻¹] - V component of wind
rh - relative humidity wrt water
rh_ice - relative humidity wrt ice
wind_speed [m s⁻¹] 
surface_height [m] 

Variables from MERRA2:

SO4 [mg kg⁻¹] - Mass mixing ratio of Sulphate aerosols
DU [mg kg⁻¹] - Mass mixing ratio of all mineral dust aerosols
DU_sub [mg kg⁻¹] - Mass mixing ratio of mineral dust aerosols < 1 um
DU_sup [mg kg⁻¹] - Mass mixing ratio of mineral dust aerosols > 1 um

Additional variables in time-resolved data set:

timestep [int]: cirrus observations are denoted by timestep=0. Backtrajectories are calculated hourly for each observation; earlier points are denoted by timestep=-t, with t being the number of hours before the observation
trajectory_id [str]: unique identifier for each backtrajectory

Note that only reanalysis data are available along the backtrajectories
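
As an illustration of how `timestep` and `trajectory_id` fit together, the sketch below parses a small made-up CSV excerpt in place of temporal_data.csv. The column subset and values are assumptions for demonstration only; the real file contains many more columns.

```python
import io
import pandas as pd

# Made-up excerpt standing in for temporal_data.csv.
sample = io.StringIO(
    "trajectory_id,timestep,t\n"
    "a,0,215.0\n"
    "a,-1,217.5\n"
    "a,-2,220.0\n"
    "b,0,210.0\n"
    "b,-1,209.0\n"
)
temporal = pd.read_csv(sample, dtype={"trajectory_id": str})

# Order each trajectory from the earliest point to the observation (timestep=0).
temporal = temporal.sort_values(["trajectory_id", "timestep"])

# Rows at the time of the cirrus observation.
at_observation = temporal[temporal["timestep"] == 0]

# Temperature change along each trajectory: value at observation minus earliest value.
t_change = temporal.groupby("trajectory_id")["t"].agg(lambda s: s.iloc[-1] - s.iloc[0])
```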

**Categorical Variable Meanings**

land_water_mask

0: shallow ocean
1: land
2: coastlines
3: shallow inland water
4: intermittent water
5: deep inland water
6: continental ocean

night_day_flag

0: day
1: night

instrument_flag

0: none
1: CALIOP_only
2: CPR_only
3: CALIOP_and_CPR
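
For analysis or plotting, the integer codes listed above can be mapped to their labels. A minimal sketch with pandas follows; the example table is made up and the dictionary names are hypothetical.

```python
import pandas as pd

# Code-to-label tables copied from the lists in this README.
LAND_WATER_MASK = {
    0: "shallow ocean",
    1: "land",
    2: "coastlines",
    3: "shallow inland water",
    4: "intermittent water",
    5: "deep inland water",
    6: "continental ocean",
}
INSTRUMENT_FLAG = {0: "none", 1: "CALIOP_only", 2: "CPR_only", 3: "CALIOP_and_CPR"}
NIGHT_DAY = {0: "day", 1: "night"}

# Made-up rows standing in for the published data.
df = pd.DataFrame({"land_water_mask": [1, 6, 0], "instrument_flag": [3, 1, 2]})
df["land_water_label"] = df["land_water_mask"].map(LAND_WATER_MASK)
df["instrument_label"] = df["instrument_flag"].map(INSTRUMENT_FLAG)
```

Codes that are not in a dictionary are mapped to NaN, which makes unexpected values easy to spot.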

**Misc Notes**

* The aerosol variables are given in volume mixing ratio [mg m⁻³] in the time-resolved dataset.

-------------------------------------------------



