Presence-Absence Points for Tree Species Distribution Modelling for Europe
The dataset is a collection of presence and absence points for forest tree species in Europe. Each unique combination of longitude, latitude and year was considered an independent sample. Presence data were obtained from the harmonized tree species occurrence dataset by Heisig and Hengl (2020) and absence data from the LUCAS (in-situ source) dataset.
A set of 50 forest tree species was selected from the harmonized tree species dataset. Points lacking a year of observation were overlaid with yearly forest masks derived from the land cover maps produced by Parente et al. (2021), using the probability maps for the following classes:
- 311: Broad-leaved forest,
- 312: Coniferous forest,
- 313: Mixed forest,
- 323: Sclerophyllous vegetation,
- 324: Transitional woodland-shrub,
- 333: Sparsely vegetated area.
Points were included in the dataset only if the probability value extracted for at least one of the above classes was ≥ 50% for all the years considered. An additional quality flag distinguishes points dated through this overlay from points whose year of observation comes from the source datasets.
The final dataset contains 4,359,999 observations and a total of 630 columns.
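The inclusion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `passes_forest_mask`, the input layout (class code mapped to year mapped to probability in percent) and the toy values are all hypothetical. The rule is read here as "in every year, at least one class reaches 50%"; a stricter reading (one single class reaching 50% in every year) is also possible.

```python
# Land cover class codes listed above.
FOREST_CLASSES = (311, 312, 313, 323, 324, 333)


def passes_forest_mask(probs, years, threshold=50.0):
    """Return True if, in every year, at least one class reaches the threshold.

    probs: {class_code: {year: probability in percent}} (hypothetical layout).
    A missing class or year is treated as probability 0 (an assumption).
    """
    for year in years:
        best = max(probs.get(code, {}).get(year, 0.0) for code in FOREST_CLASSES)
        if best < threshold:
            return False
    return True


# Toy values, invented for illustration only.
point_a = {311: {2000: 72.0, 2001: 65.0}, 312: {2000: 10.0, 2001: 12.0}}
point_b = {311: {2000: 72.0, 2001: 30.0}, 324: {2000: 5.0, 2001: 41.0}}

print(passes_forest_mask(point_a, [2000, 2001]))  # kept: class 311 is >= 50% in both years
print(passes_forest_mask(point_b, [2000, 2001]))  # dropped: no class reaches 50% in 2001
```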
The first 8 columns of the dataset contain metadata information used to uniquely identify the points:
- id: unique point identifier,
- year: year of observation,
- postprocess: quality flag to identify if the temporal reference of an observation comes from the original dataset or is the result of spatiotemporal overlay with forest masks,
- Tile_ID: contains the tile id from the eu_tiling_system (30 km grid),
- easting: x coordinate (in meters) in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035),
- northing: y coordinate (in meters) in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035),
- Atlas_class: name of the tree species according to the European Atlas of Forest Tree Species or NULL in case of absence point,
- lc1: contains original LUCAS land cover class or NULL if it's a presence point.
The remaining columns contain the extracted values of a series of predictor variables (temperature, precipitation, elevation, topographical information, spectral reflectance) useful for species distribution modeling applications. These points were used to model the potential and realized distribution of 16 target species for the period 2000-2020. Three ML models (Random Forest, XGBoost and GLM) were trained to predict the probability of presence; their predictions served as input to a linear meta-model (a logistic regression classifier), which produces the final probability of presence for each species.
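The two-level setup described above is a form of stacked generalization. The sketch below illustrates only the second level: a logistic regression fitted on the probabilities produced by three base models. It is not the authors' implementation. The base-model probabilities are invented toy values standing in for Random Forest, XGBoost and GLM outputs, the function names are hypothetical, and a plain gradient-descent fit is used so the example needs only the standard library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_model(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on base-model probabilities by gradient descent."""
    n_features = len(base_probs[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_presence(weights, bias, x):
    """Final probability of presence from the three base-model probabilities."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Each row: (p_model_1, p_model_2, p_model_3) for one point; toy values only.
base_probs = [
    (0.9, 0.8, 0.7),
    (0.8, 0.9, 0.6),
    (0.7, 0.6, 0.8),
    (0.2, 0.3, 0.4),
    (0.1, 0.2, 0.3),
    (0.3, 0.1, 0.2),
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = presence, 0 = absence

weights, bias = fit_meta_model(base_probs, labels)
p_high = predict_presence(weights, bias, (0.85, 0.80, 0.75))
p_low = predict_presence(weights, bias, (0.15, 0.20, 0.25))
print(p_high, p_low)
```

In stacking, the meta-model is commonly trained on out-of-fold predictions of the base models to limit overfitting; the toy example skips that step for brevity.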
The RDS file is created from a data.table object and is suitable for fast reading in the R programming environment. The CSV.GZ file contains the same records as a table, with easting and northing in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035), and can be loaded into a GIS after being decompressed.
As an example, we provide RDS files for one 30 km tile, containing raster stacks at 30 m resolution of all the covariates included in the regression matrix. You can find the geographical location of the tile in Europe using the attached GeoPackage ("eu_tiling_system_30km"): open it in QGIS and filter by "ID".
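Outside R or a GIS, the CSV.GZ file can also be read directly without unzipping it first. The sketch below uses only the Python standard library and splits records into presences and absences based on the Atlas_class column. The function name, the tokens treated as NULL and the sample rows are assumptions made for illustration; check how missing values are actually encoded in the file before relying on this.

```python
import csv
import gzip
import io


def split_presence_absence(source, null_tokens=("", "NULL", "NA")):
    """Split records into presences and absences.

    source: path to the .csv.gz file or a binary file object.
    null_tokens: strings treated as NULL in Atlas_class (an assumption).
    """
    presences, absences = [], []
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            row["easting"] = float(row["easting"])
            row["northing"] = float(row["northing"])
            if row["Atlas_class"] in null_tokens:
                absences.append(row)
            else:
                presences.append(row)
    return presences, absences


# Two invented rows using the metadata column names listed above.
sample = (
    "id,year,postprocess,Tile_ID,easting,northing,Atlas_class,lc1\n"
    "1,2006,0,101,4321000.0,3210000.0,Fagus sylvatica,NULL\n"
    "2,2009,0,101,4322000.0,3211000.0,NULL,B11\n"
)
payload = gzip.compress(sample.encode("utf-8"))

presences, absences = split_presence_absence(io.BytesIO(payload))
print(len(presences), len(absences))
```

For the real file, pass its path instead of the in-memory sample. With 4,359,999 rows and 630 columns, reading row by row as above keeps memory use low as long as the rows are processed rather than all stored.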
We considered both static and dynamic covariates. Dynamic covariates are calculated as averages over a 4-year time window (for example, 2004 contains averages from 2002 to 2006). To get predictions for a specific year, the covariates in the static RDS file need to be bound with the dynamic covariates of the respective year.
Our predictions (probabilities and uncertainties) for the target species are available through:
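The windowed averaging and the binding of static with dynamic covariates can be sketched as follows. This is only an illustration under stated assumptions: the window bounds follow the 2002 to 2006 example given above (two years on each side of the target year), and the function names, covariate names and values are hypothetical.

```python
def window_mean(values_by_year, center_year, half_width=2):
    """Average the yearly values available in the window around center_year."""
    years = range(center_year - half_width, center_year + half_width + 1)
    available = [values_by_year[y] for y in years if y in values_by_year]
    if not available:
        raise ValueError(f"no values in the window around {center_year}")
    return sum(available) / len(available)


def bind_covariates(static, dynamic_by_year, year):
    """Combine static covariates with the dynamic covariates of one year."""
    row = dict(static)                 # copy so the static input stays unchanged
    row.update(dynamic_by_year[year])  # add the year-specific values
    return row


# Invented yearly values of one dynamic covariate.
ndvi = {2002: 0.50, 2003: 0.54, 2004: 0.58, 2005: 0.62, 2006: 0.66}

static = {"elevation": 412.0}
dynamic_by_year = {2004: {"ndvi_mean": window_mean(ndvi, 2004)}}

print(bind_covariates(static, dynamic_by_year, 2004))
```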
- Open Data Science Europe viewer: https://maps.opendatascience.eu
- Check the Related identifiers section of this repository to access each species individually
To learn more about the creation of this dataset and the modeling:
- watch the talk at Open Data Science Workshop 2021 (TIB AV-PORTAL)
- access the repository with our R/Python scripts and follow the instructions (GitLab)
A publication describing in detail all processing steps, the accuracy assessment and a general analysis of the species distribution maps is available on PeerJ. To suggest improvements or fixes, use https://gitlab.com/geoharmonizer_inea/spatial-layers/-/issues.