This is supporting information for the Methods in Ecology and Evolution article entitled, Integrated SDM database: enhancing the relevance and utility of species distribution models in conservation management, by Veronica F. Frans*, Amélie A. Augé, Jim Fyfe, Yuqian Zhang, Nathan McNally, Hendrik Edelhoff, Niko Balkenhol, & Jan O. Engler. Please contact the corresponding author (*) for any inquiries.

1. Methods summary

To predict New Zealand sea lion (Phocarctos hookeri; NZSL) terrestrial breeding sites across mainland New Zealand, we used multi-state species distribution model (SDM) training data for this species from Frans et al. (2018). In the multi-state SDM framework, behavioural states are modelled as separate SDMs and then combined based on minimum habitat patch sizes for each behavioural state and a minimum number of individuals, as the habitat of each state is essential to define the overall potential distribution of the focal species. NZSL females exhibit three states of inland movement during the breeding season, where their habitat preferences shift from sandy beaches up to 100 m inland (breeding state, S1), to high sward up to 1000 m inland (transition state, S2), and to forests up to 2000 m inland (dispersion state, S3) (Augé et al., 2009; Augé et al., 2012). Traditional SDMs do not account for multiple life-cycle stages or behaviours, nor can they define habitat patches of a minimum area size. A minimal sketch of the patch-size idea is shown below.
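
To illustrate the minimum patch-size rule, below is a minimal sketch (not the full procedure, which is detailed in Frans et al. 2018 and applied in Appendix S2). It assumes three hypothetical binary suitability rasters (s1, s2, s3) and placeholder minimum patch sizes in pixels, and uses raster::clump() (which requires the igraph package; both are loaded in Section 2.1) to label contiguous suitable pixels.

# sketch: drop suitable patches smaller than a state's minimum size
  keep_large_patches <- function(r, min_pixels){
    # label contiguous suitable pixels (8-neighbour rule; 0/NA = background)
      clumps <- clump(r, directions=8)
    # count pixels per patch and flag patches below the minimum size
      sizes <- table(values(clumps))
      small <- as.numeric(names(sizes)[sizes < min_pixels])
    # set undersized patches to unsuitable and return
      r[clumps %in% small] <- 0
      return(r)
  }

# hypothetical use: filter each state, then keep sites where all states remain
  # site <- keep_large_patches(s1, 10) *
  #         keep_large_patches(s2, 50) *
  #         keep_large_patches(s3, 100)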

The multi-state SDM approach from Frans et al. (2018) simultaneously allows us to account for the unique breeding habitat preferences of the NZSL and to simplify SDM results into polygons of breeding sites, for use in the integrated SDM database (iSDMdb). Note that the iSDMdb is compatible with other SDM algorithms and/or other methods to create polygons of suitable sites.

Here, we demonstrate how the multi-state SDM training step from Frans et al. (2018) was conducted. We used female NZSL occurrences from Sandy Bay, Auckland Islands (collected by Augé et al. 2009; data can be downloaded here: https://datadryad.org/stash/dataset/doi:10.5061/dryad.14mt7) for model training against eight environmental variables at a 25-metre resolution (cliff edges, land cover, slope, and distances from inland water bodies, grass, forest, rock, and the coastline). Table 1 of the current publication (Frans et al. 2021) explains the reasoning and assumptions behind the inclusion of these variables in the SDMs, based on expert opinion and previous studies on NZSL terrestrial breeding habitat preferences.

The final products from this multi-state SDM training step that we will use in the subsequent steps for model prediction and building other components of the iSDMdb are as follows:

  1. Maxent model training background information data (.lambda and sampleAverages.csv files), for use in the model prediction to the mainland (see Appendix S2) and in the assessments of model uncertainty (see Appendix S3);

  2. Maxent results files (maxentResults.csv), for extracting threshold information to define suitability and make polygons of suitable sites (see Appendix S2); and

  3. model evaluation materials (AUC, TSS, and Soerensen’s similarity scores; species response curves).

2. R Setup

The script presented here was written in R (version 4.0.5; R Core Team 2021), using the packages listed below.

2.1 Libraries

  library("raster")       # raster data
  library("rgdal")        # input/output; projections; reading ASCII files
  library("maptools")     # mapping
  library("RColorBrewer") # colours for graphics
  library("igraph")       # clumping contiguous pixels
  library("rgeos")        # polygon and point conversions
  library("tmap")         # data visualisation
  library("tmaptools")    # data visualisation
  library("dismo")        # species distribution model package
  library("plyr")         # data manipulation
  library("ggplot2")      # data visualisation

2.1.1 Options

Change raster options to maximise memory.

# view default raster options
  rasterOptions()
## format        : raster 
## datatype      : FLT4S 
## overwrite     : FALSE 
## progress      : none 
## timer         : FALSE 
## chunksize     : 1e+08 
## maxmemory     : 5e+09 
## memfrac       : 0.6 
## tmpdir        : G:/R/temp\RtmpWYszEA/raster/ 
## tmptime       : 168 
## setfileext    : TRUE 
## tolerance     : 0.1 
## standardnames : TRUE 
## warn depracat.: TRUE 
## header        : none
# modify max memory from the 5 GB default to 30 GB (this is a 32 GB RAM computer)
  rasterOptions(maxmemory = 3e10)
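
Temporary raster files can also become large when predicting over a large extent. If needed, the raster temporary directory (tmpdir above) can be redirected to a disk with sufficient space; the path below is a placeholder.

# optionally redirect raster temporary files (placeholder path)
  rasterOptions(tmpdir='G:\\R\\temp\\')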

2.2 Directories

We create and use multiple folders to organise our inputs and outputs. Similar directories can be recreated anywhere by changing only the root directory object, dir.

# Root directory
  dir <- c('G:\\R\\NZSL_MSSDM_MCA_2019_NZ\\')

# Data directory
  dat.dir <- paste0(dir,'data\\')

# Maxent.jar file directory
  maxent.dir <- paste0(dir,'maxent')
  
# Maxent results directory
  maxout.dir <- paste0(dir,'m_output\\')
  
# Maxent cache files (training environmental variables)
  train.cache <- c(paste0(dat.dir,'maxent.cache\\training\\'))
  
# Final layers
  dir.create(paste0(dir,'layers'),recursive=TRUE)
  lay.dir <- paste0(dir,'layers\\')
  
# Final tables
  dir.create(paste0(dir,'tables'),recursive=TRUE)
  tab.dir <- paste0(dir,'tables\\')
  
# Final figures
  dir.create(paste0(dir,'figures'),recursive=TRUE)
  fig.dir <- paste0(dir,'figures\\')

2.3 Custom functions

2.3.1 Load raster and ensure projection

Load rasters and project them. The default projection is set to NZGD2000/New Zealand Transverse Mercator 2000 (EPSG: 2193).

# load raster and project
  get.ras <- function(r_name,
                      # defaults
                      ras_dir=dat.dir, prj=c("+init=epsg:2193")){
      # read raster
        r <- raster(paste0(ras_dir,"\\",r_name))
      
      # project and return raster
        crs(r) <- CRS(paste0(prj))
        return(r)
  }
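
For example, one of the training variables could be loaded as follows (hypothetical file name):

# example usage (hypothetical file name)
  slope <- get.ras("slope.asc")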

2.3.2 Load points or polygons and project

Load points or polygons and project them. The default projection is set to NZGD2000/New Zealand Transverse Mercator 2000 (EPSG: 2193).

# load point/polygon and project
  get.shp <- function(p_name,
                      # defaults
                      polydir=dat.dir, prj=c("+init=epsg:2193")){
    # read shapefile
      pol <- readOGR(paste0(polydir,p_name))

    # project and return point/polygon
      pol <- spTransform(pol, CRS(paste0(prj)))
      return(pol)
  }

2.3.3 True skill statistic

Calculate true skill statistic (TSS). Original function obtained from Johanes Soerensen.

# tss function
  calc_tss <-function( samples, background, threshold ){
    # calculate sensitivity and specificity 
      sensitivity <- sum( samples$predict >= threshold ) / nrow( samples )
      specificity <- sum( background$predict < threshold ) / nrow( background )
      
    # calculate tss
      tss <- sensitivity + specificity - 1
      return(tss)
  }  

2.3.4 Soerensen’s similarity index

Calculate Soerensen’s similarity index (2TP/(FN + 2TP + FP)). See Leroy et al. 2018 for more information on this metric.

# ssi function
  calc_ssi <-function( samples, background, threshold ){
    
    # extract 
      tp <- sum(samples$predict >= threshold)     # true positive
      fn <- sum(samples$predict < threshold)      # false negative
      tn <- sum(background$predict < threshold)   # true negative
      fp <- sum(background$predict >= threshold)  # false positive
    
    # calculate ssi
      ssi <- (2*tp)/(fn + (2*tp) + fp)
      return(ssi)
  }  

3. Load data

# occurrence data
  all_states <- read.csv(paste0(dat.dir,"NZSL_states.csv"),header=T)
  xy <- all_states[,c("LONGITUDE","LATITUDE")]
  all_states_pts <- SpatialPointsDataFrame(coords=xy,
                                           data=all_states,
                                           proj4string=CRS("+init=epsg:2193"))

  
# environmental variables
  var_stack <- stack(list.files(path=paste0(dat.dir,"train"),
                                pattern="\\.asc$",full.names=TRUE))
  crs(var_stack) <- CRS("+init=epsg:2193")

# Auckland Islands polygon
  AI_polygon <- get.shp("AI_Polygon.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "G:\R\NZSL_MSSDM_MCA_2019_NZ\data\AI_Polygon.shp", layer: "AI_Polygon"
## with 1 features
## It has 2 fields

3.1 Inspect data

Plot all occurrences for each state, focussing on Sandy Bay, Auckland Islands.

# get extent of occurrences for map focus
  extent(all_states_pts)
## class      : Extent 
## xmin       : 1123540 
## xmax       : 1124709 
## ymin       : 4383964 
## ymax       : 4384806

Create a background polygon with that extent.

# create background polygon
  bkg <- as(raster::extent(c(1123540-200,1124709+200,4383964-200,4384806+200)),
            "SpatialPolygons")
  proj4string(bkg) <- paste0("+init=epsg:2193 +proj=tmerc +lat_0=0 +lon_0=173 +k=0.9996",
                             " +x_0=1600000 +y_0=10000000 +ellps=GRS80 ",
                             "+towgs84=0,0,0,0,0,0,0 +units=m +no_defs")
  bkg <- spTransform(bkg, CRS("+init=epsg:2193"))

Make a base map with the background polygon as the extent, and Auckland Islands polygon as an overlay.

# set mode
  tmap_mode("plot")

# base map
  # background polygon
    bkg_map <- tm_shape(bkg) +
               tm_borders('white', alpha = 0) +
               tm_layout(frame=FALSE, bg.color = NA)
  
  # AI polygon
    AI_map <- tm_shape(AI_polygon) + 
              tm_polygons(col='white', border.alpha = 1) +
              tm_legend(show=FALSE)
    
  # combine
    base_map <- bkg_map + AI_map

Plot.

# ensure factor
  all_states_pts$STATE <- as.factor(all_states_pts$STATE)
  levels(all_states_pts$STATE) <- c("S1","S2","S3")
  
# plot
  base_map + 
  tm_shape(var_stack$landcover) +
      tm_raster(style = "cat", n=6,
                labels = c('rock','scrub','forest','grass','water','sand'),
                palette = get_brewer_pal("BrBG", plot=FALSE),
                title = 'land cover', legend.show = TRUE) +
  tm_shape(all_states_pts) +
  tm_dots("STATE", palette = c('#969696','#636363','#252525'), size = .1,
          title='state') +
# layout
  tm_scale_bar(breaks = c(0,.5,1), size = .75) +
  tm_layout(legend.show = TRUE,legend.outside = TRUE,
           frame=FALSE,
           bg.color = NA)

4. Multi-state SDM training

We used Maxent v.3.3.3k (Phillips & Dudik 2008) to train distribution models of all three behavioural states against the eight Auckland Islands environmental variables. Each state's model was trained for 100 iterations (runs), making a total of 300 models. For each iteration, the training data were stored in folders, sorted by corresponding state and iteration. Evaluation statistics (AUC, TSS, and Soerensen's similarity index) and variable response curves were also extracted. This script also works with Maxent v.3.4.1 for future applications.
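
Note that dismo::maxent() looks for maxent.jar inside dismo's own java folder. If it is not already installed there, it can be copied over first, as in the sketch below (assuming maxent.jar sits in the maxent.dir folder defined in Section 2.2).

# dismo expects maxent.jar in its java folder; copy it there if missing
  jar <- paste0(system.file(package="dismo"), "/java/maxent.jar")
  if(!file.exists(jar)){file.copy(paste0(maxent.dir,'\\maxent.jar'), jar)}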

# Empty lists
  train_state <- list()        # for mean training prediction across state iterations
  AUC_state <- list()          # for mean AUC across state iterations
  thresh_state <- list()       # for mean selected threshold across state iterations
  thresh_runs_state <- list()  # for table of selected threshold across state iterations
  resp_state <- list()         # for variable responses across state iterations
  back_state <- list()         # for background prediction data
  samp_train_state <- list()   # for sample prediction data (train)
  samp_test_state <- list()    # for sample prediction data (test)
  tss_state  <- list()         # for storing TSS across state iterations
  ssi_state  <- list()         # for storing Soerensen's similarity index across states
  
# State for-loop
  for(state in 1:3){
    
      # create directory for each state
        dir.create(paste0(dir,'m_output\\',state),recursive=TRUE)
    
      # output path
        maxent_out <- paste0(dir,'m_output\\',state)

      # select occurrences by state
        occ_state <- subset(all_states, grepl(paste0("_",state), all_states$STATE))

      # Maxent training for 100 iterations
        for(run in 1:100){

            # Create a directory for each iteration's output
              dir.create(paste0(maxent_out,"\\",run),recursive=TRUE)
              runs_out <- paste0(maxent_out,"\\",run)

            # set random seed to select the points
              set.seed(run)

            # create a subset function for each run to select 133 points
              samp <- function(dataf){dataf[sample(nrow(dataf),
                                                   size=133,replace=FALSE),]}

            # Randomly select 133 points per run
              sample_data <- ddply(occ_state, .(STATE), samp)

            # Convert into spatial points data frame
              sample_coords <- data.frame(x=sample_data$LONGITUDE,y=sample_data$LATITUDE)
              occ <- SpatialPointsDataFrame(sample_coords, sample_data,
                                            proj4string=CRS("+init=epsg:2193"))

            # Maxent
              train_mod <- maxent(
                                  # raster stack
                                    x=var_stack,

                                  # species coordinates
                                    p=occ[,c('LONGITUDE','LATITUDE')],

                                  # maxent directory
                                    path=runs_out,

                                  # arguments
                                    args=c(
                                          # logistic output
                                            'outputformat=logistic',

                                          # autofeatures changed to FALSE
                                            'autofeature=false',

                                          # using hinge-only
                                            'linear=false',
                                            'quadratic=false',
                                            'product=false',
                                            'threshold=false',
                                            'hinge=true',

                                          # extract output data
                                            'responsecurves=true',
                                            'jackknife=true',
                                            'writeplotdata=true',

                                           # duplicates already removed
                                            'removeduplicates=false',

                                           # other settings
                                            'askoverwrite=false',
                                            'writeclampgrid=true',
                                            'writemess=false',
                                            'randomseed=false',

                                          # select 25% (33 points) for testing
                                            'randomtestpoints=25',

                                          # for model prediction
                                            'writebackgroundpredictions=true'
                                          ),
                                  # indicate categorical variables
                                    factors=c('landcover','cliff_edges'))

            # Predict within training area and save
              train_pred <- predict(
                                    object=train_mod,
                                    x=var_stack,
                                    filename=paste0(runs_out,"\\train_S",
                                                    state,"_",run,".asc"),
                                    overwrite=TRUE,
                                    progress='text'
                                    )

            }  # end of training for 100 iterations

      # Calculate training area mean predictions per state

            # Create an empty list
              maxent_rasters <- list()

            # For-loop
              for (run in 1:100){
                   # Directory for each iteration's output
                     runs_out <- paste0(maxent_out,"\\",run,"\\")

                   # Search for ".asc" file and stack
                     filenames <- list.files(path=paste0(runs_out),
                                             pattern=".asc",full.names=TRUE)

                   # Add to list
                     maxent_rasters[[run]] <- filenames
                   }

            # Raster stack
              maxent_100_rasters <- do.call(stack, maxent_rasters)

            # Calculate mean
              mean_train <- calc(x=maxent_100_rasters, fun=mean)

            # Add to list
              train_state[[state]] <- mean_train

            # Save raster
              writeRaster(mean_train,paste0(maxent_out,"\\mean_train_S",state,".asc"),
                          overwrite=TRUE)

      # Extract and calculate mean AUC values

            # Create an empty list
              AUC_vals <- list()

            # For-loop
              for(run in 1:100){
                  # Output Path
                    runs_out <- paste0(maxent_out,"\\",run,"\\")

                  # Read maxent result table and extract threshold values
                    eval_table <- read.csv(paste0(runs_out,"maxentResults.csv"),
                                           header=T)

                    AUC_df <- eval_table[,c(grep(".AUC",names(eval_table)))]

                  # Add a column to table to differentiate among states and runs
                    AUC_df <- data.frame(state, AUC_df, run)
                    AUC_df$run <- paste0(run)
                    AUC_df$state <- paste0(state)

                  # Add to List
                    AUC_vals[[run]] <- AUC_df
                  }

              # Combine all 100 runs
                AUC <- do.call(rbind, AUC_vals)

              # Calculate mean and standard deviation
                AUC_means <- ddply(AUC, .(state), summarize,
                                  mean_train=mean(Training.AUC),sd_train=sd(Training.AUC),
                                  mean_test=mean(Test.AUC), sd_test=sd(Test.AUC)
                                  )

              # Add to list
                AUC_state[[state]] <- AUC_means

      # Extract and calculate mean threshold values
      # Only one threshold (MaxSSS) is extracted for later use

            # Create an empty list
              thresh_vals <- list()

            # For-loop
              for(run in 1:100){
                  # Output Path
                    runs_out <- paste0(maxent_out,"\\",run,"\\")

                  # Read maxent result table and extract threshold values
                    res_df <- read.csv(paste0(runs_out,"maxentResults.csv"),
                                       header=T)

                  # Select the MaxSSS threshold (both testing and training)
                    thresh_df <- res_df[,c(
                    "Maximum.training.sensitivity.plus.specificity.Logistic.threshold",
                    "Maximum.test.sensitivity.plus.specificity.Logistic.threshold")]
                    colnames(thresh_df) <- c('maxSSS_train','maxSSS_test')

                  # Add a column to table to differentiate among states and runs
                    thresh_df <- data.frame(state, thresh_df, run)
                    thresh_df$run <- paste0(run)
                    thresh_df$state <- paste0(state)

                  # Add to List
                    thresh_vals[[run]] <- thresh_df
                  }

            # Combine all 100 runs
              thresholds <- do.call(rbind, thresh_vals)

            # Calculate mean and standard deviation of selected threshold
              thresh_means <- ddply(thresholds, .(state), summarize,
                                    maxSSS_train_mean=mean(maxSSS_train),
                                    maxSSS_train_sd=sd(maxSSS_train),
                                    maxSSS_test_mean=mean(maxSSS_test),
                                    maxSSS_test_sd=sd(maxSSS_test))

            # Add to list
              thresh_state[[state]] <- thresh_means
              thresh_runs_state[[state]] <- thresholds
              
  
      # Calculate TSS and Soerensen's similarity index
        # Extract background prediction files (to calculate TSS/Soerensen's)
                            
              # Create an empty list
                s_back_df <- list()  # list of species background data across runs
  
              # For-loop
                for(run in 1:100){
                    # Output Path
                      runs_out <- paste0(maxent_out,"\\",run,"\\")
              
                    # Read CSV file of maxent background info (generated by maxent above)
                      res_df <- read.csv(paste0(runs_out,
                                      "\\species_backgroundPredictions.csv"), header=TRUE)
                    
                    # Select only logistic prediction values
                    # the column name 'predict' matches the name used in calc_tss()
                      predict <- res_df[,c(grep("Logistic",names(res_df)))]
              
                    # Add a column to indicate corresponding run within the table
                      back_df <- data.frame(run,predict)
                      back_df$state <- paste0("S",state)
              
                    # Add to List
                      s_back_df[[run]] <- back_df
                   }
          
              # Combine all runs for state
                back_all <- do.call(rbind, s_back_df)
                
              # Add to list
                back_state[[state]] <- back_all
                
          # Extract sample prediction files (to calculate TSS/Soerensen's similarity)
                            
              # Create empty lists
                s_samp_train_df <- list()  # list of species training data across runs
                s_samp_test_df <- list()   # list of species test  data across runs
  
              # For-loop
                for(run in 1:100){
                    # Output Path
                      runs_out <- paste0(maxent_out,"\\",run,"\\")
              
                    # Read CSV file of maxent background info (generated by maxent above)
                      res_df <- read.csv(paste0(runs_out,
                                      "\\species_samplePredictions.csv"), header=TRUE)
                    
                    # Split samples into training and test sets
                      res_df_train <- res_df[(res_df$Test.or.train=="train"),]
                      res_df_test <- res_df[(res_df$Test.or.train=="test"),]
                      
                    # Select only the logistic prediction columns
                      predict_train <- res_df_train[,c(grep("Logistic.prediction",
                                                      names(res_df_train)))]
                      predict_test <- res_df_test[,c(grep("Logistic.prediction",
                                                      names(res_df_test)))]
                                            
                    # Add a column to indicate corresponding run within the table
                      samp_df_train <- data.frame(run, predict=predict_train)
                      samp_df_test <- data.frame(run, predict=predict_test)
                      samp_df_train$state <- paste0("S",state)
                      samp_df_test$state <- paste0("S",state)
              
                    # Add to List
                      s_samp_train_df[[run]] <- samp_df_train
                      s_samp_test_df[[run]] <- samp_df_test
                   }
          
              # Combine all runs for state
                samp_all_train <- do.call(rbind, s_samp_train_df)
                samp_all_test <- do.call(rbind, s_samp_test_df)
                
              # Add to list
                samp_train_state[[state]] <- samp_all_train
                samp_test_state[[state]] <- samp_all_test
                
          # TSS for-loop
                            
              # Create empty lists
                tss_train_df <- list()  # list of TSS train scores across runs
                tss_test_df <- list()   # list of TSS test scores across runs
  
              # For-loop
                for(run in 1:100){
                    # extract this run's thresholds (row number = run)
                      thresh <- thresholds[run,]
                      
                    # extract sample and background predictions by number of run
                      samples_train <- samp_all_train[(samp_all_train$run==run),]
                      samples_test <- samp_all_test[(samp_all_test$run==run),]
                      background <- back_all[(back_all$run==paste0(run)),]
                      
                    # run tss function
                      tss_run_train <- calc_tss(samples_train,background,thresh$maxSSS_train)
                      tss_run_test <- calc_tss(samples_test,background,thresh$maxSSS_test)
                      
                    # add to list
                      tss_train_df[[run]] <- tss_run_train
                      tss_test_df[[run]] <- tss_run_test
                }
                
              # Combine
                tss_train_vals <- do.call(rbind,tss_train_df)
                tss_test_vals <- do.call(rbind,tss_test_df)
                  
              # Calculate mean and standard deviation across runs
                tss_mean_train <- mean(tss_train_vals)
                tss_sd_train <- sd(tss_train_vals)
                tss_mean_test <- mean(tss_test_vals)
                tss_sd_test <- sd(tss_test_vals)
                
              # Combine into dataframe
                tss_df <- data.frame(state=paste0(state),
                                     mean_train=tss_mean_train,
                                     sd_train=tss_sd_train,
                                     mean_test=tss_mean_test,
                                     sd_test=tss_sd_test)
                
              # Add to list
                tss_state[[state]] <- tss_df
                
          # Soerensen's for-loop
                
              # Create empty lists
                ssi_train_df <- list()  # list of ssi train scores across runs
                ssi_test_df <- list()   # list of ssi test scores across runs
  
              # For-loop
                for(run in 1:100){
                    # extract this run's thresholds (row number = run)
                      thresh <- thresholds[run,]
                      
                    # extract sample and background predictions by number of run
                      samples_train <- samp_all_train[(samp_all_train$run==run),]
                      samples_test <- samp_all_test[(samp_all_test$run==run),]
                      background <- back_all[(back_all$run==paste0(run)),]
                      
                    # run ssi function
                      ssi_run_train <- calc_ssi(samples_train,background,thresh$maxSSS_train)
                      ssi_run_test <- calc_ssi(samples_test,background,thresh$maxSSS_test)
                      
                    # add to list
                      ssi_train_df[[run]] <- ssi_run_train
                      ssi_test_df[[run]] <- ssi_run_test
                }
                
              # Combine
                ssi_train_vals <- do.call(rbind,ssi_train_df)
                ssi_test_vals <- do.call(rbind,ssi_test_df)
                  
              # Calculate mean and standard deviation across runs
                ssi_mean_train <- mean(ssi_train_vals)
                ssi_sd_train <- sd(ssi_train_vals)
                ssi_mean_test <- mean(ssi_test_vals)
                ssi_sd_test <- sd(ssi_test_vals)
                
              # Combine into dataframe
                ssi_df <- data.frame(state=paste0(state),
                                     mean_train=ssi_mean_train,
                                     sd_train=ssi_sd_train,
                                     mean_test=ssi_mean_test,
                                     sd_test=ssi_sd_test)
                
              # Add to list
                ssi_state[[state]] <- ssi_df  
                

# Extract variable responses
# Extracting responses for when all variables were run together

# Create an empty list
  resp_list <- list()  # for responses when all variables are run in Maxent

# For-loop
  for(run in 1:100){
      # Output Path
        runs_out <- paste0(maxent_out,"\\",run,"\\")

      # Select all files of .dat extension (generated by Maxent earlier)
        filenames <- list.files(paste0(runs_out,"\\plots"),pattern="*.dat",
                                full.names=TRUE)

      # Select the responses where Maxent ran each variable together
        ran_together.files <- filenames[-grep("_only.dat",filenames, fixed=T)]

      # Make combined sets of dat files for this state
        resp_dat <- ldply(ran_together.files,read.table,sep=",",header=TRUE)

      # Create new dataframes with new columns
        resp_df <- data.frame(state,resp_dat)
        resp_df$type <- "together"          # add to table
        resp_df$state <- paste0("S",state)  # add to table

      # Add to list (indexed by run)
        resp_list[[run]] <- resp_df
     }

# Combine all runs for this state into one table
  resp_all <- do.call(rbind,resp_list)

# Add to list
  resp_state[[state]] <- resp_all
  
                        
     } # end of state for-loop

# Stack mean training prediction maps
  train_s123 <- do.call(stack,train_state)

# Gather calculated mean AUC values into one table and save
  AUC <- do.call(rbind, AUC_state)
  write.csv(AUC, paste0(dir,"AUC.csv"), row.names=FALSE)

# Gather calculated mean threshold values (MaxSSS only) and save
  thresholds <- do.call(rbind, thresh_state)
  write.csv(thresholds, paste0(dir,"thresholds.csv"), row.names=FALSE)

# Gather per-run threshold values and save (maxSSS only)
  thresh_runs <- do.call(rbind, thresh_runs_state)
  write.csv(thresh_runs, paste0(dir,"maxSSS.csv"), row.names=FALSE)

# background and sample data across state runs
  sample_data_train <- do.call(rbind, samp_train_state)
  sample_data_test <- do.call(rbind, samp_test_state)
  background <- do.call(rbind, back_state)

# TSS and Soerensen's across state runs
  TSS <- do.call(rbind, tss_state)
  soerensens <- do.call(rbind, ssi_state)
  write.csv(TSS,paste0(dir,"TSS.csv"), row.names=FALSE)
  write.csv(soerensens,paste0(dir,"soerensens.csv"), row.names=FALSE)
  
# Gather variable response tables
  responses <- do.call(rbind, resp_state)
  write.csv(responses,paste0(dir,"var_responses.csv"), row.names=FALSE)

5. Model evaluation

The model training procedure has also been evaluated in Frans et al. (2018). Here, we demonstrate some of the evaluation criteria.

5.1 AUC

Area under the Receiver Operating Characteristic (ROC) curve (AUC) is a measure of model accuracy, and is typically used in SDM assessment procedures. AUC ranges from 0 to 1: an AUC of 0.5 indicates predictions no better than random, values below 0.5 are worse than random, and values above 0.5 are better than random (Phillips et al. 2006). Typically, a model with an AUC >0.75 has good predictive power (Elith et al. 2006). The mean and SD of the AUC train and test scores across the state model iterations are as follows, and indicate excellent performance:

# show table
  knitr::kable(AUC)
  state   mean_train   sd_train    mean_test   sd_test
  1       0.999180     0.0001128   0.9995234   0.0018583
  2       0.995271     0.0003138   0.9986124   0.0004151
  3       0.995856     0.0003255   0.9986974   0.0005302
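
As an optional cross-check (a sketch, not part of the original workflow), the AUC of any single run can be recomputed from the saved sample and background predictions with dismo's evaluate() function; here for run 1 of state 1:

# optional AUC cross-check with dismo::evaluate() (sketch)
  e <- evaluate(p=sample_data_train$predict[sample_data_train$run==1 &
                                            sample_data_train$state=="S1"],
                a=background$predict[background$run==1 &
                                     background$state=="S1"])
  e@auc  # training AUC for state 1, run 1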

5.2 TSS

The true skill statistic (TSS) is another measure of model accuracy. It combines sensitivity and specificity (TSS = sensitivity + specificity - 1) and ranges from -1 to 1: values of 0 or below indicate predictions no better than random, values between 0 and 1 are better than random, and 1 is perfect agreement. Unlike other accuracy measures such as kappa, TSS is unaffected by prevalence. We also measured TSS here because the AUC scores suggested an overfit model, which can be an artifact of the NZSL's restricted spatial distribution in the Auckland Islands and the small number of occurrences available for each behavioural state. The mean and SD TSS scores across the state model iterations are as follows, and indicate excellent performance:

# show table
  knitr::kable(TSS)
  state   mean_train   sd_train    mean_test   sd_test
  1       0.9606579    0.0390955   0.9170001   0.1175453
  2       0.9813887    0.0082280   0.9734165   0.0224609
  3       0.9825027    0.0094086   0.9698385   0.0253797

5.3 Soerensen’s similarity index

Another model accuracy measure is Soerensen's similarity index, which measures the similarity between predictions and observations. It avoids inflation from true negatives by focussing on true positives, false positives, and false negatives (Leroy et al. 2018). Soerensen's values range from 0 to 1, where 0 indicates that none of the predictions match the observations, and 1 indicates that predictions match observations perfectly (i.e. no false positives or false negatives). The mean and SD Soerensen's scores across the state model iterations are as follows, and indicate that the model performed well:

# show table
  knitr::kable(soerensens)
  state   mean_train   sd_train    mean_test   sd_test
  1       0.8701534    0.1368870   0.8159698   0.1288954
  2       0.7649749    0.0942729   0.5794007   0.0939203
  3       0.7483930    0.1032794   0.5875510   0.0815262
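
To make the behaviour of these two metrics concrete, here is a toy check of the calc_tss() and calc_ssi() functions from Section 2.3, using hypothetical values (not NZSL data). With a threshold of 0.5, four of five presences fall above it and eight of ten background points fall below it.

# toy check of calc_tss() and calc_ssi() (hypothetical values)
  toy_samples    <- data.frame(predict=c(0.9, 0.8, 0.7, 0.6, 0.4))
  toy_background <- data.frame(predict=c(0.1, 0.2, 0.3, 0.2, 0.1,
                                         0.4, 0.3, 0.6, 0.7, 0.2))
  calc_tss(toy_samples, toy_background, 0.5)  # (4/5) + (8/10) - 1 = 0.6
  calc_ssi(toy_samples, toy_background, 0.5)  # (2*4)/(1 + 2*4 + 2) = 0.727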

5.4 Variable responses

Variable response curves were also extracted from the model training iterations. Here, we display the NZSL’s response to environmental variables when all the variables were run together.

First, we change the categories to factors.

# change to factors
  responses$state <- as.factor(responses$state)
  responses$variable <- as.factor(responses$variable)

Then, we visualise the variables. The first set to inspect is the distance variables. Although the Auckland Islands training area and Sandy Bay, Enderby Island are small study areas compared to the New Zealand mainland (our prediction area for the current study and iSDMdb demonstration), we see here that the model training procedure was able to capture different habitat preferences across the behavioural states.

# Distance variables

  # Subset dataset
    dist_resp <- responses[!(responses$variable=="cliff_edges"),]
    dist_resp <- dist_resp[!(dist_resp$variable=="slope"),]
    dist_resp <- dist_resp[!(dist_resp$variable=="landcover"),]
  
  # facet grid
    ggplot(dist_resp,
           aes(x=x,y=y,group=variable,fill=variable)) +
           geom_smooth(aes(color=variable), se=TRUE) +
           scale_fill_brewer(palette="Set1") + 
           facet_grid(state~variable, scales="fixed") +
           ylim(-.05,1) + # probability of presence
           xlim(0,2000)+  # only focus on <2km distance (max. NZSL dispersal)
           xlab("Distance (m)") + ylab("Probability of presence") +
           theme_bw() +
           theme(strip.background = element_blank(), strip.placement = "outside") +
           theme(legend.position='none')

Next, we inspect the response to slope. We also include a vertical line for the known slope limit for breeding female NZSL inland movement (20 degrees), based on a field study on habitat preferences from Augé et al. (2012). A decrease in probability of presence after this threshold is expected, and helps us to evaluate the robustness of the model training procedure.

# Extract slope

  # Subset dataset
    slp_resp <- responses[(responses$variable=="slope"),]
  
  # facet grid
    ggplot(slp_resp,
           aes(x=x,y=y,group=variable,fill=variable)) +
           geom_smooth(aes(color=variable), se=TRUE) +
           scale_fill_brewer(palette="Set1") + 
           facet_grid(state~variable, scales="fixed") +
           ylim(-.05,1) + # probability of presence
           geom_vline(aes(xintercept = 20), colour = 'grey', lty = 2) + #indicate slope limit
           xlab("Slope (degrees)") + ylab("Probability of presence") +
           theme_bw() +
           theme(strip.background = element_blank(), strip.placement = "outside") +
           theme(legend.position='none')

We inspect the response to land cover. Because land cover is categorical, mean responses are calculated per class and plotted as bar plots. In terms of evaluation, we noticed an increase in probability of presence for sand in S3. This is expected, as female NZSLs return to the beach during all three states, going to sea to forage while leaving pups protected further inland (Augé et al. 2009; Augé et al. 2012).

# Extract land cover

  # Subset dataset
    lc_resp <- responses[(responses$variable=="landcover"),]
    
  # replace names
    lc_resp$x[lc_resp$x==6] <- "sand"
    lc_resp$x[lc_resp$x==5] <- "water"
    lc_resp$x[lc_resp$x==4] <- "grass"
    lc_resp$x[lc_resp$x==3] <- "forest"
    lc_resp$x[lc_resp$x==2] <- "scrub"
    lc_resp$x[lc_resp$x==1] <- "rock"
    
  # facet grid
    ggplot(lc_resp, aes(x=x,y=y, fill=x)) +
           geom_bar(aes(fill=x), stat="summary", fun=mean, alpha=0.8) +
           scale_fill_brewer(palette="Set1") + 
           facet_grid(state~variable, scales="fixed") +
           ylim(-.05,1) +
           xlab("Land cover type") + ylab("Probability of presence") +
           theme_bw() +
           theme(strip.background = element_blank(), strip.placement = "outside") +
           theme(legend.position='none')

Finally, we plot the response to cliff edges.

# Extract cliff edges

  # Subset dataset
    cliff_resp <- responses[(responses$variable=="cliff_edges"),]
    
  # replace names
    cliff_resp$x[cliff_resp$x==0] <- "no cliff"
    cliff_resp$x[cliff_resp$x==1] <- "cliff"
    
  # facet grid
    ggplot(cliff_resp, aes(x=x,y=y, fill=x)) +
           geom_bar(aes(fill=x), stat="summary", fun=mean, alpha=0.8) +
           scale_fill_brewer(palette="Set1") + 
           facet_grid(state~variable, scales="fixed") +
           ylim(-.05,1) +
           xlab("Land cover type") + ylab("Probability of presence") +
           theme_bw() +
           theme(strip.background = element_blank(), strip.placement = "outside") +
           theme(legend.position='none')

5.5 Training area prediction

We then map the training predictions for the northern portion of the Auckland Islands (training area). Mapping and reviewing the mean predictions for the three states in this area is advantageous, since multiple states are required for successful breeding colonies. In this northern area of the Auckland Islands, there are two breeding colony locations: Enderby Island (Sandy Bay) and Dundas Island. If both breeding colonies can be predicted across the behavioural states, this gives us confidence to use the multi-state SDM and our selected threshold (here, maximum sum of sensitivity and specificity) on the mainland.

# rename raster stack layers
  names(train_s123) <- c("S1","S2","S3")

# Set up plots (S1-3)
  png(width = 1800, height = 2400,
      paste0(fig.dir,"/train_S1-3.png"),res=300)
      sdmplot <- spplot(train_s123,
                 main=c("Training (S1-S3)"),
                 layout=c(1,3), #(ncols, nrows)
                 maxpixels=500000,
                 col.regions=c(rev(brewer.pal(11,"Spectral"))),
                 cuts=10,
                 colorkey=list(labels=list(at=seq(0,1,by=0.1)),  #set up labels
                               space="right"),
                 par.settings = list(panel.background=list(col="black"))
                 )
  sdmplot
  dev.off()
# Display figure
  sdmplot

A second map, zooming in on Sandy Bay.

# Set up plots (S1-3)
  png(width = 2400, height = 900,
      paste0(fig.dir,"/train_S1-3_sandy.png"),res=300)
      sbplot <- spplot(train_s123,
                       main=c("Sandy Bay training (S1-S3)"),
                       layout=c(3,1), #(ncols, nrows)
                       maxpixels=500000,
                       col.regions=c(rev(brewer.pal(11,"Spectral"))),
                       cuts=10,
                       colorkey=list(labels=list(at=seq(0,1,by=0.1)), #set up labels
                                     space="right"),
                       par.settings = list(panel.background=list(col="black")),
                       xlim=c(1123540-200,1124709+200),
                       ylim=c(4383964-200,4384806+200)  
                       )
  sbplot
  dev.off()
# Display figure
  sbplot

A final map, zooming in on Dundas Island.

# Set up plots (S1-3)
  png(width = 2400, height = 900,
      paste0(fig.dir,"/train_S1-3_dundas.png"),res=300)
      dunplot <- spplot(train_s123,
                        main=c("Dundas Island training (S1-S3)"),
                        layout=c(3,1), #(ncols, nrows)
                        maxpixels=500000,
                        col.regions=c(rev(brewer.pal(11,"Spectral"))),
                        cuts=10,
                        colorkey=list(labels=list(at=seq(0,1,by=0.1)),   #set up labels
                                      space="right"),
                        par.settings = list(panel.background=list(col="black")),
                        xlim=c(1107432+15000,1127989),
                        ylim=c(4373022+200,4386119-8000)  
                        )
  dunplot
  dev.off()
# Display figure
  dunplot

Information on other model evaluation metrics can be found in Frans et al. (2018).

6. Save workspace

# save workspace to load later if needed
  save.image("G:/R/mssdm_AI.RData")
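
The saved workspace can be restored later (e.g. before continuing with Appendix S2):

# reload the saved workspace if needed
  load("G:/R/mssdm_AI.RData")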

 

This concludes the multi-state SDM training step.

Next, we use the training data for SDM predictions onto the mainland and identification of suitable sites. See Appendix S2.

7. References

Augé, A. A., Chilvers, B. L., Moore, A., Mathieu, R., & Robertson, B. C. (2009). Aggregation and dispersion of female New Zealand sea lions at the Sandy Bay breeding colony, Auckland Islands: how unusual is their spatial behaviour? Behaviour, 146 (9), 1287–1311. https://doi.org/10.1163/15683909X427687

Augé, A. A., Chilvers, B. L., Mathieu, R., & Moore, A. B. (2012). On-land habitat preferences of female New Zealand sea lions at Sandy Bay, Auckland Islands. Marine Mammal Science, 28 (3), 620–637. https://doi.org/10.1111/j.1748-7692.2011.00515.x

Elith, J., et al. (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29, 129-151. https://doi.org/10.1111/j.2006.0906-7590.04596.x

Frans, V.F., Augé, A.A., Edelhoff, H., Erasmi, S., Balkenhol, N., & Engler, J.O. (2018) Quantifying apart what belongs together: A multi-state species distribution modelling framework for species using distinct habitats. Methods in Ecology and Evolution, 9 (1), 98-108. https://doi.org/10.1111/2041-210X.12847

Leroy, B., Delsol, R., Hugueny, B., Meynard, C. N., Barhoumi, C., Barbet-Massin, M., & Bellard, C. (2018). Without quality presence-absence data, discrimination metrics such as TSS can be misleading measures of model performance. Journal of Biogeography, 45(9), 1994–2002. https://doi.org/10.1111/jbi.13402

Phillips, S.J., Anderson, R.P., & Schapire, R.E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190, 231-259. https://doi.org/10.1016/j.ecolmodel.2005.03.026

Phillips, S.J. & Dudik, M. (2008) Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31 (2), 161-175. https://doi.org/10.1111/j.0906-7590.2008.5203.x