This is supporting information for the Methods in Ecology and Evolution article entitled, Integrated SDM database: enhancing the relevance and utility of species distribution models in conservation management, by Veronica F. Frans*, Amélie A. Augé, Jim Fyfe, Yuqian Zhang, Nathan McNally, Hendrik Edelhoff, Niko Balkenhol, & Jan O. Engler. Please contact the corresponding author (*) for any inquiries.
To predict New Zealand sea lion (Phocarctos hookeri; NZSL) terrestrial breeding sites across mainland New Zealand, we used multi-state species distribution model (SDM) training data for this species from Frans et al. (2018). In the multi-state SDM framework, behavioural states are modelled as separate SDMs and then combined based on minimum habitat patch sizes for each behavioural state and for a minimum number of individuals, as each state habitat type is essential to define the overall potential distribution of the focal species. NZSL females exhibit three states of inland movement during the breeding season, where their habitat preferences shift from sandy beaches up to 100 m inland (breeding state, S1), to high sward up to 1000 m inland (transition state, S2), and forests up to 2000 m inland (dispersion state, S3) (Augé et al., 2009; Augé et al., 2012). Traditional SDMs do not account for multiple life-cycle stages or behaviours, nor can they define habitat patches of a minimum area size.
The multi-state SDM approach from Frans et al. (2018) simultaneously allows us to account for the unique breeding habitat preferences of the NZSL and to simplify SDM results into polygons of breeding sites, for use in the integrated SDM database (iSDMdb). Note that the iSDMdb is compatible with other SDM algorithms and/or other methods to create polygons of suitable sites.
Here, we demonstrate how the multi-state SDM training step from Frans et al. (2018) was conducted. We used female NZSL occurrences from Sandy Bay, Auckland Islands (collected from Augé et al. 2009; data can be downloaded here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.14mt7) for model training against eight environmental variables at a 25-metre resolution (cliff edges, land cover, slope, and distances from inland water bodies, grass, forest, rock, and the coastline). Table 1 of this current publication (Frans et al. 2021) explains the reasoning and assumptions behind these variables’ incorporation in the SDMs, based on expert opinion and previous studies on NZSL terrestrial breeding habitat preferences.
The final products from this multi-state SDM training step that we will use in the subsequent steps for model prediction and building other components of the iSDMdb are as follows:
Maxent model training background information data (.lambda and sampleAverages.csv files) for use in the model prediction to the mainland (see Appendix S2) and assessments of model uncertainty (see Appendix S3).
Maxent results files (maxentResults.csv) for extracting threshold information to define suitability and make polygons of suitable sites (see Appendix S2); and
model evaluation materials (AUC, TSS and Soerensen’s Similarity scores; species response curves).
The script presented here was done using R (version 4.0.5; R Core Team 2021) and its packages.
library("raster") # raster data
library("rgdal") # input/output; projections; reading ASCII files
library("maptools") # mapping
library("RColorBrewer") # colours for graphics
library("igraph") # clumping contiguous pixels
library("rgeos") # polygon and point conversions
library("tmap") # data visualisation
library("tmaptools") # data visualisation
library("dismo") # species distribution model package
library("plyr") # data manipulation
library("ggplot2") # data visualisation
Change raster options to maximise memory.
# view default raster options
rasterOptions()
## format : raster
## datatype : FLT4S
## overwrite : FALSE
## progress : none
## timer : FALSE
## chunksize : 1e+08
## maxmemory : 5e+09
## memfrac : 0.6
## tmpdir : G:/R/temp\RtmpWYszEA/raster/
## tmptime : 168
## setfileext : TRUE
## tolerance : 0.1
## standardnames : TRUE
## warn depracat.: TRUE
## header : none
# modify max memory from 10gb to 30gb (this is a 32GB RAM computer)
rasterOptions(maxmemory = 3e10)
We create and use multiple folders to organise our inputs and outputs. Similar directories can be made anywhere by only changing the root directory object, dir
.
# Root directory
dir <- c('G:\\R\\NZSL_MSSDM_MCA_2019_NZ\\')
# Data directory
dat.dir <- paste0(dir,'data\\')
# Maxent.jar file directory
maxent.dir <- paste0(dir,'maxent')
# Maxent results directory
maxout.dir <- paste0(dir,'m_output\\')
# Maxent cache files (training environmental variables)
train.cache <- c(paste0(dat.dir,'maxent.cache\\training\\'))
# Final layers
dir.create(paste0(dir,'layers'),recursive=TRUE)
lay.dir <- paste0(dir,'layers\\')
# Final tables
dir.create(paste0(dir,'tables'),recursive=TRUE)
tab.dir <- paste0(dir,'tables\\')
# Final figures
dir.create(paste0(dir,'figures'),recursive=TRUE)
fig.dir <- paste0(dir,'figures\\')
Load rasters and project. The default projection is set to NZDG2000/New Zealand Transverse Mercator 2000 (EPSG: 2193).
# load raster and project
get.ras <- function(r_name,
# defaults
ras_dir=dat.dir, prj=c("+init=epsg:2193")){
# read raster
r <- raster(paste0(ras_dir,"\\",r_name))
# project and return raster
crs(r) <- CRS(paste0(prj))
return(r)
}
Load points or polygons and project. The default projection is set to NZDG2000/New Zealand Transverse Mercator 2000 (EPSG: 2193).
# load point/polygon and project
get.shp <- function(p_name,
# defaults
polydir=dat.dir, prj=c("+init=epsg:2193")){
# read shapefile
pol <- readOGR(paste0(polydir,p_name))
# project and return point/polygon
pol <- spTransform(pol, CRS(paste0(prj)))
return(pol)
}
Calculate true skill statistic (TSS). Original function obtained from Johanes Soerensen.
# tss function
calc_tss <-function( samples, background, threshold ){
# calculate sensitivity and specificity
sensitivity <- sum( samples$predict >= threshold ) / nrow( samples )
specificity <- sum( background$predict < threshold ) / nrow( background )
# calculate tss
tss <- sensitivity + specificity - 1
return(tss)
}
Calculate Soerensen’s similarity index (2TP/(FN + 2TP + FP)). See Leroy et al. 2018 for more information on this metric.
# tss function
calc_ssi <-function( samples, background, threshold ){
# extract
tp <- sum(samples$predict >= threshold) # true positive
fn <- sum(samples$predict < threshold) # false negative
tn <- sum(background$predict < threshold) # true negative
fp <- sum(background$predict >= threshold) # false positive
# calculate ssi
ssi <- (2*tp)/(fn + (2*tp) + fp)
return(ssi)
}
# occurrence data
all_states <- read.csv(paste0(dat.dir,"NZSL_states.csv"),header=T)
xy <- all_states[,c("LONGITUDE","LATITUDE")]
all_states_pts <- SpatialPointsDataFrame(coords=xy,
data=all_states,
proj4string=CRS("+init=epsg:2193"))
# environmental variables
var_stack <- stack(list.files(path=paste0(dat.dir,"//train"),
pattern=paste0(".asc"),full.names=TRUE))
crs(var_stack) <- CRS("+init=epsg:2193")
# Auckland Islands polygon
AI_polygon <- get.shp("AI_Polygon.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "G:\R\NZSL_MSSDM_MCA_2019_NZ\data\AI_Polygon.shp", layer: "AI_Polygon"
## with 1 features
## It has 2 fields
Plot all occurrences for each state, focussing on Sandy Bay, Auckland Islands.
# get extent of occurrences for map focus
extent(all_states_pts)
## class : Extent
## xmin : 1123540
## xmax : 1124709
## ymin : 4383964
## ymax : 4384806
Create a background polygon with that extent.
# create background polygon
bkg <- as(raster::extent(c(1123540-200,1124709+200,4383964-200,4384806+200)),
"SpatialPolygons")
proj4string(bkg) <- paste0("+init=epsg:2193 +proj=tmerc +lat_0=0 +lon_0=173 +k=0.9996",
" +x_0=1600000 +y_0=10000000 +ellps=GRS80 ",
"+towgs84=0,0,0,0,0,0,0 +units=m +no_defs")
bkg <- spTransform(bkg, CRS("+init=epsg:2193"))
Make a base map with the background polygon as the extent, and Auckland Islands polygon as an overlay.
# set mode
tmap_mode("plot")
# base map
# background polygon
bkg_map <- tm_shape(bkg) +
tm_borders('white', alpha = 0) +
tm_layout(frame=FALSE, bg.color = NA)
# AI polygon
AI_map <- tm_shape(AI_polygon) +
tm_polygons(col='white', border.alpha = 1) +
tm_legend(show=FALSE)
# combine
base_map <- bkg_map + AI_map
Plot.
# ensure factor
all_states_pts$STATE <- as.factor(all_states_pts$STATE)
levels(all_states_pts$STATE) <- c("S1","S2","S3")
# plot
base_map +
tm_shape(var_stack$landcover) +
tm_raster(style = "cat", n=7,
labels = c('rock','scrub','forest','grass','water','sand'),
palette = get_brewer_pal("BrBG", plot=FALSE),
title = 'land cover', legend.show = TRUE) +
tm_shape(all_states_pts) +
tm_dots("STATE", palette = c('#969696','#636363','#252525'), size = .1,
title='state') +
# layout
tm_scale_bar(breaks = c(0,.5,1), size = .75) +
tm_layout(legend.show = TRUE,legend.outside = TRUE,
frame=FALSE,
bg.color = NA)
We used Maxent v.3.3.3k (Phillips & Dudik 2008) to train distribution models of all three behavioural states against the eight Auckland Islands environmental variables. The model was trained for 100 iterations (runs) per state, making a total of 300 models. For each iteration, the training data were stored in folders, sorted by corresponding state and iteration. Evaluation statistics such as AUC, TSS, Soerensen’s similarity index, and response curves were also extracted. In future applications, this script also works for Maxent v.3.4.1.
# Empty vectors
train_state <- list() # for mean training prediction across state iterations
AUC_state <- list() # for mean AUC across state iterations
thresh_state <- list() # for mean selected threshold across state iterations
thresh_runs_state <- list() # for table of selected threshold across state iterations
resp_state <- list() # for variable responses across state iterations
back_state <- list() # for background prediction data
samp_train_state <- list() # for sample prediction data (train)
samp_test_state <- list() # for sample prediction data (test)
tss_state <- list() # for storing TSS across state iterations
ssi_state <- list() # for storing Soerensen's similarity index across states
# State for-loop
for(state in 1:3){
# create directory for each state
dir.create(paste0(dir,'m_output\\',state),recursive=TRUE)
# output path
maxent_out <- paste0(dir,'m_output\\',state)
# select occurrences by state
occ_state <- subset(all_states, grepl(paste0("*_",state), all_states$STATE))
# Maxent training for 100 iterations
for(run in 1:100){
# Create a directory for each iteration's output
dir.create(paste0(maxent_out,"\\",run),recursive=TRUE)
runs_out <- paste0(maxent_out,"\\",run)
# set random seed to select the points
set.seed(run)
# create a subset function for each run to select 133 points
samp <- function(dataf){dataf[sample(1:nrow(dataf)[1],
size=133,replace=FALSE),]}
# Randomly select 133 points per run
sample_data <- ddply(occ_state, .(STATE), samp)
# Convert into spatial points data frame
sample_coords <- data.frame(x=sample_data$LONGITUDE,y=sample_data$LATITUDE)
occ <- SpatialPointsDataFrame(sample_coords, sample_data,
proj4string=CRS("+init=epsg:2193"))
# Maxent
train_mod <- maxent(
# raster stack
x=var_stack,
# species coordinates
p=occ[,c('LONGITUDE','LATITUDE')],
# maxent directory
path=runs_out,
# arguments
args=c(
# logistic output
'outputformat=logistic',
# autofeatures changed to FALSE
'autofeature=false',
# using hinge-only
'linear=false',
'quadratic=false',
'product=false',
'threshold=false',
'hinge=true',
# extract output data
'responsecurves=true',
'jackknife=true',
'writeplotdata=true',
# duplicated already removed
'removeduplicates=false',
# other settings
'askoverwrite=false',
'writeclampgrid=true',
'writemess=false',
'randomseed=false',
# select 25% (33 points) for testing
'randomtestpoints=25',
# for model prediction
'writebackgroundpredictions=true'
),
# indicate categorical variables
factors=c('landcover','cliff_edges'))
# Predict within training area and save
train_pred <- predict(
object=train_mod,
x=var_stack,
filename=paste0(runs_out,"\\train_S",
state,"_",run,".asc"),
overwrite=TRUE,
progress='text'
)
} # end of training for 100 iterations
# Calculate training area mean predictions per state
# Create an empty vector
maxent_rasters <- list()
# For-loop
for (run in 1:100){
# Directory for each iteration's output
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Search for ".asc" file and stack
filenames <- list.files(path=paste0(runs_out),
pattern=".asc",full.names=TRUE)
# Add to list
maxent_rasters[[run]] <- filenames
}
# Raster stack
maxent_100_rasters <- do.call(stack, maxent_rasters)
# Calculate mean
mean_train <- calc(x=maxent_100_rasters, fun=mean)
# Add to list
train_state[[state]] <- mean_train
# Save raster
writeRaster(mean_train,paste0(maxent_out,"\\mean_train_S",state,".asc"),
overwrite=TRUE)
# Extract and calculate mean AUC values
# Create an empty vector
AUC_vals <- list()
# For-loop
for(run in 1:100){
# Output Path
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Read maxent result table and extract threshold values
eval_table <- read.csv(paste0(runs_out,"maxentResults.csv"),
header=T)
AUC_df <- eval_table[,c(grep(".AUC",names(eval_table)))]
# Add a column to table to differentiate among states and runs
AUC_df <- data.frame(state, AUC_df, run)
AUC_df$run <- paste0(run)
AUC_df$state <- paste0(state)
# Add to List
AUC_vals[[run]] <- AUC_df
}
# Combine all 100 runs
AUC <- do.call(rbind, AUC_vals)
# Calculate mean and standard deviation
AUC_means <- ddply(AUC, .(state), summarize,
mean_train=mean(Training.AUC),sd_train=sd(Training.AUC),
mean_test=mean(Test.AUC), sd_test=sd(Test.AUC)
)
# Add to list
AUC_state[[state]] <- AUC_means
# Extract and calculate mean threshold values
# Only one threshold (MaxSSS) is extracted for later use
# Create an empty vector
thresh_vals <- list()
# For-loop
for(run in 1:100){
# Output Path
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Read maxent result table and extract threshold values
res_df <- read.csv(paste0(runs_out,"maxentResults.csv"),
header=T)
# Select the MaxSSS threshold (both testing and training)
thresh_df <- res_df[,c(
"Maximum.training.sensitivity.plus.specificity.Logistic.threshold",
"Maximum.test.sensitivity.plus.specificity.Logistic.threshold")]
colnames(thresh_df) <- c('maxSSS_train','maxSSS_test')
# Add a column to table to differentiate among states and runs
thresh_df <- data.frame(state, thresh_df, run)
thresh_df$run <- paste0(run)
thresh_df$state <- paste0(state)
# Add to List
thresh_vals[[run]] <- thresh_df
}
# Combine all 100 runs
thresholds <- do.call(rbind, thresh_vals)
# Calculate mean and standard deviation of selected threshold
thresh_means <- ddply(thresholds, .(state), summarize,
maxSSS_train_mean=mean(maxSSS_train),
maxSSS_train_sd=sd(maxSSS_train),
maxSSS_test_mean=mean(maxSSS_test),
maxSSS_test_sd=sd(maxSSS_test))
# Add to list
thresh_state[[state]] <- thresh_means
thresh_runs_state[[state]] <- thresholds
# Calculate TSS and Soerensen's similarity index
# Extract background prediction files (to calculate TSS/Soerensen's)
# Create an empty vector
s_back_df <- list() # list of species background data across runs
# For-loop
for(run in 1:100){
# Output Path
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Read CSV file of maxent background info (generated by maxent above)
res_df <- read.csv(paste0(runs_out,
"\\species_backgroundPredictions.csv"), header=TRUE)
# Select only logistic prediction values
# name here matches name in tss_calc function
predict <- res_df[,c(grep("Logistic",names(res_df)))]
# Add a column to indicate corresponding run within the table
back_df <- data.frame(run,predict)
back_df$state <- paste0("S",state)
# Add to List
s_back_df[[run]] <- back_df
}
# Combine all runs for state
back_all <- do.call(rbind, s_back_df)
# Add to list
back_state[[state]] <- back_all
# Extract sample prediction files (to calculate TSS/Sorensen's Similarity)
# Create an empty vector
s_samp_train_df <- list() # list of species training data across runs
s_samp_test_df <- list() # list of species test data across runs
# For-loop
for(run in 1:100){
# Output Path
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Read CSV file of maxent background info (generated by maxent above)
res_df <- read.csv(paste0(runs_out,
"\\species_samplePredictions.csv"), header=TRUE)
# Select only logistic prediction values
res_df_train <- res_df[(res_df$Test.or.train=="train"),]
res_df_test <- res_df[(res_df$Test.or.train=="test"),]
# Select only logistic threshold values and invert the table
predict_train <- res_df_train[,c(grep("Logistic.prediction",
names(res_df_train)))]
predict_test <- res_df_test[,c(grep("Logistic.prediction",
names(res_df_test)))]
# Add a column to indicate corresponding run within the table
samp_df_train <- data.frame(run, predict=predict_train)
samp_df_test <- data.frame(run, predict=predict_test)
samp_df_train$state <- paste0("S",state)
samp_df_test$state <- paste0("S",state)
# Add to List
s_samp_train_df[[run]] <- samp_df_train
s_samp_test_df[[run]] <- samp_df_test
}
# Combine all runs for state
samp_all_train <- do.call(rbind, s_samp_train_df)
samp_all_test <- do.call(rbind, s_samp_test_df)
# Add to list
samp_train_state[[state]] <- samp_all_train
samp_test_state[[state]] <- samp_all_test
# TSS for-loop
# Create an empty vector
tss_train_df <- list() # list of TSS train scores across runs
tss_test_df <- list() # list of TSS test scores across runs
# For-loop
for(run in 1:100){
# extract threshold by number in row
thresh <- thresholds[run,]
# extract sample and background predictions by number of run
samples_train <- samp_all_train[(samp_all_train$run==run),]
samples_test <- samp_all_test[(samp_all_test$run==run),]
background <- back_all[(back_all$run==paste0(run)),]
# run tss function
tss_run_train <- calc_tss(samples_train,background,thresh$maxSSS_train)
tss_run_test <- calc_tss(samples_test,background,thresh$maxSSS_test)
# add to list
tss_train_df[[run]] <- tss_run_train
tss_test_df[[run]] <- tss_run_test
}
# Combine
tss_train_vals <- do.call(rbind,tss_train_df)
tss_test_vals <- do.call(rbind,tss_test_df)
# Calculate mean and standard deviation across runs
tss_mean_train <- mean(tss_train_vals)
tss_sd_train <- sd(tss_train_vals)
tss_mean_test <- mean(tss_test_vals)
tss_sd_test <- sd(tss_test_vals)
# Combine into dataframe
tss_df <- data.frame(state=paste0(state),
mean_train=tss_mean_train,
sd_train=tss_sd_train,
mean_test=tss_mean_test,
sd_test=tss_sd_test)
# Add to list
tss_state[[state]] <- tss_df
# Soerensen's for-loop
# Create an empty vector
ssi_train_df <- list() # list of ssi train scores across runs
ssi_test_df <- list() # list of ssi test scores across runs
# For-loop
for(run in 1:100){
# extract threshold by number in row
thresh <- thresholds[run,]
# extract sample and background predictions by number of run
samples_train <- samp_all_train[(samp_all_train$run==run),]
samples_test <- samp_all_test[(samp_all_test$run==run),]
background <- back_all[(back_all$run==paste0(run)),]
# run ssi function
ssi_run_train <- calc_ssi(samples_train,background,thresh$maxSSS_train)
ssi_run_test <- calc_ssi(samples_test,background,thresh$maxSSS_test)
# add to list
ssi_train_df[[run]] <- ssi_run_train
ssi_test_df[[run]] <- ssi_run_test
}
# Combine
ssi_train_vals <- do.call(rbind,ssi_train_df)
ssi_test_vals <- do.call(rbind,ssi_test_df)
# Calculate mean and standard deviation across runs
ssi_mean_train <- mean(ssi_train_vals)
ssi_sd_train <- sd(ssi_train_vals)
ssi_mean_test <- mean(ssi_test_vals)
ssi_sd_test <- sd(ssi_test_vals)
# Combine into dataframe
ssi_df <- data.frame(state=paste0(state),
mean_train=ssi_mean_train,
sd_train=ssi_sd_train,
mean_test=ssi_mean_test,
sd_test=ssi_sd_test)
# Add to list
ssi_state[[state]] <- ssi_df
# Extract variable responses
# Extracting responses for when all variables were run together
# Create an empty vector
resp_list <- list() #for response when all variables are run in Maxent
# For-loop
for(run in 1:100){
# Output Path
runs_out <- paste0(maxent_out,"\\",run,"\\")
# Select all files of .dat extension (generated by Maxent earlier)
filenames <- list.files(paste0(runs_out,"\\plots"),pattern="*.dat",
full.names=TRUE)
# Select the responses where Maxent ran each variable together
ran_together.files <- filenames[-grep("_only.dat",filenames, fixed=T)]
# Make combined sets of dat files for this state
resp_dat <- ldply(ran_together.files,read.table,sep=",",header=TRUE)
# Create new dataframes with new columns
resp_df <- data.frame(state,resp_dat)
resp_df$type <- "together" # add to table
resp_df$state <- paste0("S",state) # add to table
# Add to list
resp_list[[state]] <- resp_df
}
# Combine all states into one table
resp_all <- do.call(rbind,resp_list)
# Add to list
resp_state[[state]] <- resp_all
} # end of state for-loop
# Stack mean training prediction maps
train_s123 <- do.call(stack,train_state)
# Gather calculated mean AUC values into one table and save
AUC <- do.call(rbind, AUC_state)
write.csv(AUC, paste0(dir,"AUC.csv"), row.names=FALSE)
# Gather calculated mean threshold values (MaxSSS only) and save
thresholds <- do.call(rbind, thresh_state)
write.csv(thresholds, paste0(dir,"thresholds.csv"), row.names=FALSE)
# Gather per-run threshold values and save (maxSSS only)
thresh_runs <- do.call(rbind, thresh_runs_state)
write.csv(thresh_runs, paste0(dir,"maxSSS.csv"), row.names=FALSE)
# background and sample data across state runs
sample_data_train <- do.call(rbind, samp_train_state)
sample_data_test <- do.call(rbind, samp_test_state)
background <- do.call(rbind, back_state)
# TSS and Soerensen's across state runs
TSS <- do.call(rbind, tss_state)
soerensens <- do.call(rbind, ssi_state)
write.csv(TSS,paste0(dir,"TSS.csv"), row.names=FALSE)
write.csv(soerensens,paste0(dir,"soerensens.csv"), row.names=FALSE)
# Gather variable response tables
responses <- do.call(rbind, resp_state)
write.csv(responses,paste0(dir,"var_responses.csv"), row.names=FALSE)
The model training procedure has also been evaluated in Frans et al. (2018). Here, we demonstrate some of the evaluation criteria.
Area under the Receiver Operating Characteristic (ROC) curve (AUC) is a measure of model accuracy, and is typically used in SDM assessment procedures. AUC ranges from 0 to 1, where an AUC from 0 to 0.5 indicates that the SDM’s ability to predict is worse than random, and AUC >0.5 is better than random (Phillips et al. 2006). Typically, a model with an AUC >0.75 has good predictive power (Elith et al. 2006). The mean and SD AUC train and test scores across the state model iterations are as follows, and indicate excellent performance:
# show table
knitr::kable(AUC)
state | mean_train | sd_train | mean_test | sd_test |
---|---|---|---|---|
1 | 0.999180 | 0.0001128 | 0.9995234 | 0.0018583 |
2 | 0.995271 | 0.0003138 | 0.9986124 | 0.0004151 |
3 | 0.995856 | 0.0003255 | 0.9986974 | 0.0005302 |
True Skill Statistic is another measure of model accuracy. It accounts for sensitivity and specificity and ranges from -1 to 1. Values from -1 to 0 indicate that the SDM’s prediction was worse than random, and values greater than zero to 1 are better than random predictions, with 1 being perfect agreement. Compared to another SDM accuracy measure like kappa, prevalence does not affect TSS. Here, we also wanted to measure TSS because the AUC score appeared to show an overfit model, which can be an artifact of the NZSL’s spatial distribution (restricted range in the Auckland islands) and the small number of occurrences available for each behavioural state. The mean and SD TSS scores across the state model iterations are as follows, and indicate excellent performance:
# show table
knitr::kable(TSS)
state | mean_train | sd_train | mean_test | sd_test |
---|---|---|---|---|
1 | 0.9606579 | 0.0390955 | 0.9170001 | 0.1175453 |
2 | 0.9813887 | 0.0082280 | 0.9734165 | 0.0224609 |
3 | 0.9825027 | 0.0094086 | 0.9698385 | 0.0253797 |
Another model accuracy measure is Soerensen’s similarity index. Soerensen’s measures the similarity between predictions and observations. It avoids inflation from true negatives by focussing on true positives, false positives, and false negatives (Leroy 2018). Soerensen’s values range from 0 to 1, where a value of 0 indicates that none of the predictions match the observations, and a value of 1 indicates that predictions match observations perfectly (i.e. no false positives or false negatives). The mean and SD Soerensen’s scores across the state model iterations are as follows, and indicate that the model performed well:
# show table
knitr::kable(soerensens)
state | mean_train | sd_train | mean_test | sd_test |
---|---|---|---|---|
1 | 0.8701534 | 0.1368870 | 0.8159698 | 0.1288954 |
2 | 0.7649749 | 0.0942729 | 0.5794007 | 0.0939203 |
3 | 0.7483930 | 0.1032794 | 0.5875510 | 0.0815262 |
Variable response curves were also extracted from the model training iterations. Here, we display the NZSL’s response to environmental variables when all the variables were run together.
First, we change the categories to factors.
# change to factors
responses$state <- as.factor(responses$state)
responses$variable <- as.factor(responses$variable)
Then, we visualise the variables. The first set to inspect are the distance variables. Although the Auckland Islands training area and Enderby Island, Sandy Bay are small study areas when compared to the New Zealand mainland (our prediction area for the current study and iSDMdb demonstration), we see here that the model training procedure was able to capture different habitat preferences across the different behavioural states.
# Distance variables
# Subset dataset
dist_resp <- responses[!(responses$variable=="cliff_edges"),]
dist_resp <- dist_resp[!(dist_resp$variable=="slope"),]
dist_resp <- dist_resp[!(dist_resp$variable=="landcover"),]
# facet grid
ggplot(dist_resp,
aes(x=x,y=y,group=variable,fill=variable)) +
geom_smooth(aes(color=variable, se=TRUE)) +
scale_fill_brewer(palette="Set1") +
facet_grid(state~variable, scales="fixed") +
ylim(-.05,1) + # probability of presence
xlim(0,2000)+ # only focus on <2km distance (max. NZSL dispersal)
xlab("Distance (m)") + ylab("Probability of presence") +
theme_bw() +
theme(strip.background = element_blank(), strip.placement = "outside") +
theme(legend.position='none')
Next, we inspect the response to slope. We also include a vertical line for the known slope limit for breeding female NZSL inland movement (20 degrees), based on a field study on habitat preferences from Augé et al. (2012). A decrease in probability of presence after this threshold is expected, and helps us to evaluate the robustness of the model training procedure.
# Extract slope
# Subset dataset
slp_resp <- responses[(responses$variable=="slope"),]
# facet grid
ggplot(slp_resp,
aes(x=x,y=y,group=variable,fill=variable)) +
geom_smooth(aes(color=variable, se=TRUE)) +
scale_fill_brewer(palette="Set1") +
facet_grid(state~variable, scales="fixed") +
ylim(-.05,1) + # probability of presence
geom_vline(aes(xintercept = 20), colour = 'grey', lty = 2) + #indicate slope limit
xlab("Slope (degrees)") + ylab("Probability of presence") +
theme_bw() +
theme(strip.background = element_blank(), strip.placement = "outside") +
theme(legend.position='none')
We inspect the response to land cover. As categorical data, mean responses have to be calculated and then the plots will be bar plots. In terms of evaluation, we noticed an increase in probability of presence for sand in S3. This is expected, as female NZSLs return to the beach during all three states, going to sea to forage while leaving pups protected further inland (Augé et al. 2009; Augé et al. 2012).
# Extract slope
# Subset dataset
lc_resp <- responses[(responses$variable=="landcover"),]
# replace names
lc_resp$x[lc_resp$x==6] <- "sand"
lc_resp$x[lc_resp$x==5] <- "water"
lc_resp$x[lc_resp$x==4] <- "grass"
lc_resp$x[lc_resp$x==3] <- "forest"
lc_resp$x[lc_resp$x==2] <- "scrub"
lc_resp$x[lc_resp$x==1] <- "rock"
# facet grid
ggplot(lc_resp, aes(x=x,y=y, fill=x)) +
geom_bar(aes(fill=x), stat="identity",alpha=0.8) +
scale_fill_brewer(palette="Set1") +
facet_grid(state~variable, scales="fixed") +
ylim(-.05,1) +
xlab("Land cover type") + ylab("Probability of presence") +
theme_bw() +
theme(strip.background = element_blank(), strip.placement = "outside") +
theme(legend.position='none')
Finally, we plot the response to cliff edges.
# Extract slope
# Subset dataset
cliff_resp <- responses[(responses$variable=="cliff_edges"),]
# replace names
cliff_resp$x[cliff_resp$x==0] <- "no cliff"
cliff_resp$x[cliff_resp$x==1] <- "cliff"
# facet grid
ggplot(cliff_resp, aes(x=x,y=y, fill=x)) +
geom_bar(aes(fill=x), stat="identity",alpha=0.8) +
scale_fill_brewer(palette="Set1") +
facet_grid(state~variable, scales="fixed") +
ylim(-.05,1) +
xlab("Land cover type") + ylab("Probability of presence") +
theme_bw() +
theme(strip.background = element_blank(), strip.placement = "outside") +
theme(legend.position='none')
We then map the training predictions for the northern portion of the Auckland Islands (training area). Mapping and reviewing the mean predictions for the three states in this area is advantageous, since multiple states are required for successful breeding colonies. In this northern area of the Auckland Islands, there are two breeding colony locations: Enderby Island (Sandy Bay) and Dundas Island. If the two breeding colonies can be predicted across the behavioural states, then that gives us confidence to use the multi-state SDM and our selected threshold (here, maximum sum of sensitivity and specificity) on the mainland.
# rename raster stack layers
names(train_s123) <- c("S1","S2","S3")
# Set up plots (S1-3)
png(width = 1800, height = 2400,
paste0(fig.dir,"/train_S1-3.png"),res=300)
sdmplot <- spplot(train_s123,
main=c("Training (S1-S3)"),
layout=c(1,3), #(ncols, nrows)
maxpixels=500000,
col.regions=c(rev(brewer.pal(11,"Spectral"))),
cuts=10,
colorkey=list(labels=list(at=seq(0,1,by=0.1)), #set up labels
space="right"),
par.settings = list(panel.background=list(col="black"))
)
sdmplot
dev.off()
# Display figure
sdmplot
A second map, zooming in on Sandy Bay.
# Set up plots (S1-3)
png(width = 2400, height = 900,
paste0(fig.dir,"/train_S1-3_sandy.png"),res=300)
sbplot <- spplot(train_s123,
main=c("Sandy Bay training (S1-S3)"),
layout=c(3,1), #(ncols, nrows)
maxpixels=500000,
col.regions=c(rev(brewer.pal(11,"Spectral"))),
cuts=10,
colorkey=list(labels=list(at=seq(0,1,by=0.1)), #set up labels
space="right"),
par.settings = list(panel.background=list(col="black")),
xlim=c(1123540-200,1124709+200),
ylim=c(4383964-200,4384806+200)
)
sbplot
dev.off()
# Display figure
sbplot
A final map, zooming in on Dundas Island.
# Set up plots (S1-3)
png(width = 2400, height = 900,
paste0(fig.dir,"/train_S1-3_dundas.png"),res=300)
dunplot <- spplot(train_s123,
main=c("Dundas Island training (S1-S3)"),
layout=c(3,1), #(ncols, nrows)
maxpixels=500000,
col.regions=c(rev(brewer.pal(11,"Spectral"))),
cuts=10,
colorkey=list(labels=list(at=seq(0,1,by=0.1)), #set up labels
space="right"),
par.settings = list(panel.background=list(col="black")),
xlim=c(1107432+15000,1127989),
ylim=c(4373022+200,4386119-8000)
)
dunplot
dev.off()
# Display figure
dunplot
Information on other model evaluation metrics can be found in Frans et al. (2018).
# save workspace to load later if needed
save.image("G:/R/mssdm_AI.RData")
This concludes the multi-state SDM training step.
Next, we use the training data for SDM predictions onto the mainland and identification of suitable sites. See Appendix S2.
Augé, A. A., Chilvers, B. L., Moore, A., Mathieu, R., & Robertson, B. C. (2009). Aggregation and dispersion of female New Zealand sea lions at the Sandy Bay breeding colony, Auckland Islands: how unusual is their spatial behaviour? Behaviour, 146 (9), 1287–1311. https://doi.org/10.1163/15683909X427687
Augé, A. A., Chilvers, B. L., Mathieu, R., & Moore, A. B. (2012). On-land habitat preferences of female New Zealand sea lions at Sandy Bay, Auckland Islands. Marine Mammal Science, 28 (3), 620–637. https://doi.org/10.1111/j.1748-7692.2011.00515.x
Elith, J., et al. 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29, 129-151. https://doi.org/10.1111/j.2006.0906-7590.04596.x
Frans, V.F., Augé, A.A., Edelhoff, H., Erasmi, S., Balkenhol, N., & Engler, J.O. (2018) Quantifying apart what belongs together: A multi-state species distribution modelling framework for species using distinct habitats. Methods in Ecology and Evolution, 9 (1), 98-108. https://doi.org/10.1111/2041-210X.12847
Leroy, B., Delsol, R., Hugueny, B., Meynard, C. N., Barhoumi, C., Barbet-Massin, M., & Bellard, C. (2018). Without quality presence-absence data, discrimination metrics such as TSS can be misleading measures of model performance. Journal of Biogeography, 45(9), 1994–2002. https://doi.org/10.1111/jbi.13402
Phillips, S.J., Anderson, R.P., & Schapire, R.E.. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190, 231–259. https://doi.org/10.1016/j.ecolmodel.2005.03.026
Phillips, S.J. & Dudik, M. (2008) Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31 (2), 161-175. https://doi.org/10.1111/j.0906-7590.2008.5203.x