Published February 3, 2023 | Version v1
Dataset Open

Codes in R for spatial statistics analysis, ecological response models and spatial distribution models

  • 1. Facultad de Ciencias, Universidad Autónoma de San Luis Potosí. San Luis Potosí, S.L.P. México.
  • 2. Campus San Luis, Colegio de Postgraduados. Salinas de Hidalgo, S.L.P. México.
  • 1. Campus San Luis, Colegio de Postgraduados. Salinas de Hidalgo, S.L.P. México.
  • 2. Facultad de Ciencias, Universidad Autónoma de San Luis Potosí. San Luis Potosí, S.L.P. México.

Description

In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in Rstudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario, we selected the Generalized Linear Model (GLM) and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of its adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization for the codes immersed in the GLM and DOMAIN running:

In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered important some factors such as the extent of the area and the type of study species for the correct selection of the number of points (Pers. Obs.).  Then, we extracted the values of predictor variables (e.g., bioclimatic, topographic, demographic, habitat) in function of presence and background points (e.g., Hijmans and Elith, 2017).

Subsequently, we subdivide both the presence and background point groups into 75% training data and 25% test data, each group, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For a training control, the 10-fold (cross-validation) method is selected, where the response variable presence is assigned as a factor. In case that some other variable would be important for the study species, it should also be assigned as a factor (Kim, 2009).

After that, we ran the code for the GBM method (Gradient Boost Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), where we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and cross iteration of 5,000 repetitions (e.g., Friedman, 2002; kim, 2009; Hijmans and Elith, 2017). In addition, we considered selecting a validation interval of 4 random training points (Personal test). The obtained plots were the partial dependence blocks, in function of each predictor variable.

Subsequently, the correlation of the variables is run by Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to consider a bivariate correlation ± 0.70 to discard highly correlated variables (e.g., Awan et al., 2021).

Once the above codes were run, we uploaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-significance value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree to obtain linear and quadratic response (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots included the probability of occurrence and values for continuous variables or categories for discrete variables. The points of the presence and background training group are also included.

On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix, including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we select an arbitrary boundary of 0.5 to obtain better modeling performance and avoid high percentage of bias in type I (omission) or II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                                    

Validation set

Model

True

False

Presence

A

B

Background

C

D

We then calculated the Overall and True Skill Statistics (TSS) metrics. The first is used to assess the proportion of correctly predicted cases, while the second metric assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). This metric also gives equal importance to the prevalence of presence prediction as to the random performance correction (Fielding and Bell, 1997; Allouche et al., 2006).

The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.

Regarding the model evaluation and estimation, we selected the following estimators:

1) partial ROC, which evaluates the approach between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. As farther apart these curves are, the model has a better prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).

Files

Files (28.9 kB)

Name Size Download all
md5:e02e5835bcc320fdd5b9882c7b2ff5b2
2.2 kB Download
md5:e3b9f39400913077cd4f10b77caa24db
3.4 kB Download
md5:99494ae3882dfbafeb7653fc2280c0b7
1.8 kB Download
md5:9b9e7193e1b1d0d518a0305531669022
1.7 kB Download
md5:21c798fa1f6173bbd89c7f37de55b772
3.4 kB Download
md5:3b65cae73e3385627df807d45940a042
1.1 kB Download
md5:5212973ed6b85e246e6fe3698ee372f3
1.7 kB Download
md5:6894e4fe1ab9fce4f5ed9237873fd6bf
10.0 kB Download
md5:7ae92a83fc94a3c5ae6c6631ac6694d9
3.7 kB Download

Additional details

References

  • Allouche, O., Tsoar, A., y Kadmon, R. (2006). Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of applied ecology, 43(6), 1223-1232. https://doi.org/10.1111/j.1365-2664.2006.01214.x.
  • Awan, M. N., Saqib, Z., Buner, F., Lee, D. C., & Pervez, A. (2021). Using ensemble modeling to predict breeding habitat of the red-listed Western Tragopan (Tragopan melanocephalus) in the Western Himalayas of Pakistan. Global Ecology and Conservation, 31, e01864. https://doi.org/10.1016/j.gecco.2021.e01864.
  • Carpenter, G., Gillison, A. N., y Winter, J. (1993). DOMAIN: a flexible modelling procedure for mapping potential distributions of plants and animals. Biodiversity & Conservation, 2(6), 667-680.
  • DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 837-845.
  • Den Burg, M. P., Van Belleghem, S. M., y Villanueva, C. N. D. J. (2020). The continuing march of Common Green Iguanas: arrival on mainland Asia. Journal for Nature Conservation, 57, 125-888. https://doi.org/10.1016/j.jnc.2020.125888.
  • Eddelbuettel, D., & Balamuta, J. J. (2018). Extending R with C++: a brief introduction to Rcpp. The American Statistician, 72(1), 28-36.
  • Fielding, A. H., y Bell, J. F. (1997). A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental conservation, 38-49.
  • Freund, Y., y Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139. https://doi.org/10.1006/jcss.1997.1504.
  • Friedman, J. H. (2002). Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38(4), 367-378. https://doi.org/10.1016/S0167-9473(01)00065-2.
  • Gladstone, A. (2022). Building an R Package. In C++ Software Interoperability for Windows Programmers: Connecting to C#, R, and Python Clients (pp. 91-112). Berkeley, CA: Apress.
  • Guisan, A., & Hofer, U. (2003). Predicting reptile distributions at the mesoscale: relation to climate and topography. Journal of Biogeography, 30(8), 1233-1243. https://doi.org/10.1046/j.1365-2699.2003.00914.x.
  • Hijmans, R. J., & Elith, J. (2017). Species distribution modeling with R. R CRAN Project.
  • Hvitfeldt, E., & Silge, J. (2021). Supervised machine learning for text analysis in R. Chapman and Hall/CRC.
  • Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., ... y Kenkel, B. (2020). Package 'caret': Classification and Regression Training, R-package version 6.0-p. Misc functions for training and plotting classification and regression models.
  • Kim, J. H. (2009). Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational statistics & data analysis, 53(11), 3735-3745.
  • Musa, K. I., & Mansor, W. N. A. W. (2021). Exploring Data Using R. Penerbit USM.
  • Naimi, B., y Araujo, M.B. (2016) sdm: a reproducible and extensible R platform for species distribution modelling. Ecography, 39:368-375, doi: 10.1111/ecog.01881.
  • Naoki, K., Gómez, M. I., López, R. P., Meneses, R. I., y Vargas, J. (2006). Comparación de modelos de distribución de especies para predecir la distribución potencial de vida silvestre en Bolivia. Ecología en Bolivia, 41(1), 65-78. ISSN 2075-5023.
  • Neuwirth, E. (2014). RColorBrewer: ColorBrewer palettes. R package version 1.1-2. The R Foundation.
  • Olden, J. D., y Jackson, D. A. (2002). A comparison of statistical approaches for modelling fish species distributions. Freshwater biology, 47(10), 1976-1995. https://doi.org/10.1046/j.1365-2427.2002.00945.x.
  • Pebesma, E. (2020). Map overlay and spatial aggregation in sp.
  • Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J., C., y Müller, M. (2011). "pROC: an open-source package for R and S+ to analyze and compare ROC curves". BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77.
  • Sarkar, D. (2008) Lattice: Multivariate Data Visualization with R, Springer. ISBN: 978-0-387-75968-5 http://lmdvr.r-forge.r-project.org/.
  • Soberón, J., y Nakamura, M. (2009). Niches and distributional areas: concepts, methods, and assumptions. Proceedings of the National Academy of Sciences, 106 (Supplement 2), 19644-19650. https://doi.org/10.1073/pnas.0901637106.
  • Verzani, J. (2011). Getting started with RStudio. " O'Reilly Media, Inc.".
  • Wei, T., Simko, V., Levy, M., Xie, Y., Jin, Y., & Zemla, J. (2017). Package 'corrplot'. Statistician, 56(316), e24.
  • Wickham, H. (2015). R packages: organize, test, document, and share your code. " O'Reilly Media, Inc."
  • Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., y Woo, K. (2019). ggplot2: create elegant data visualisations using the grammar of graphics. 2018. URL https://CRAN. R-project. org/package= ggplot2. R package version, 2(1), 2.
  • Wickham, H., & Wickham, M. H. (2020). Package 'plyr'. Obtenido Httpscran Rproject Orgwebpackagesdplyrdplyr Pdf.
  • Zainuddin, A. A., Rahim, A., Ramadany, S., Dharmayani, H., Kuswanto, H., Kadir, R. R. A., ... & Rasyid, H. (2023). Geospatial analysis of type 2 diabetes mellitus and hypertension in South Sulawesi, Indonesia. Scientific Reports, 13(1), 838.