
You are reading the work-in-progress Spatial Sampling and Resampling for Machine Learning. This chapter is currently a draft version; a peer-reviewed publication is pending. You can find the polished first edition at https://opengeohub.github.io/spatial-sampling-ml/.

3.1 Case study: Cookfarm dataset

We next look at the Cookfarm dataset, which is available via the landmap package and described in detail in C. K. Gasch et al. (2015):

library(landmap)
#?landmap::cookfarm
data("cookfarm")

This dataset contains spatio-temporal (3D+T) measurements of three soil properties and a number of spatial and temporal regression covariates. In this example we fit a spatiotemporal model to predict soil moisture, soil temperature and electrical conductivity in 3D+T (soil depth adding the extra dimension).

We can load the prediction locations and regression-matrix from:

library(rgdal)
library(ranger)
cookfarm.rm = readRDS('extdata/cookfarm_st.rds')
cookfarm.grid = readRDS('extdata/cookfarm_grid10m.rds')

We are interested in modeling soil moisture (VW) as a function of soil depth (altitude), elevation (DEM), Topographic Wetness Index (TWI), mean Normalized Difference Red Edge Index (NDRE.M), standard deviation of the Normalized Difference Red Edge Index (NDRE.Sd), cumulative precipitation in mm (Precip_cum), maximum measured temperature (MaxT_wrcc), minimum measured temperature (MinT_wrcc) and the transformed cumulative day (cdayt):

fm <- VW ~ altitude+DEM+TWI+NDRE.M+NDRE.Sd+Precip_cum+MaxT_wrcc+MinT_wrcc+cdayt

We can use the ranger package to fit a random forest model:

m.vw = ranger(fm, cookfarm.rm, num.trees = 100)
m.vw
#> Ranger result
#> 
#> Call:
#>  ranger(fm, cookfarm.rm, num.trees = 100) 
#> 
#> Type:                             Regression 
#> Number of trees:                  100 
#> Sample size:                      107851 
#> Number of independent variables:  9 
#> Mtry:                             3 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       0.0009826038 
#> R squared (OOB):                  0.8479968

which shows that an accurate model can be fitted using these data, with an R-square above 0.80. This model, however, as shown in C. K. Gasch et al. (2015), ignores the fact that many VW measurements share exactly the same location (each monitoring station measures at four depths); hence ranger over-fits the data and reports an unrealistically high R-square.
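
One way to see this over-fitting is to compare the OOB error above with a leave-station-out cross-validation, in which all rows of a station are held out together. The following is a minimal sketch of such a check (not part of the original analysis; object names are illustrative):

## hypothetical check: 5-fold CV with whole stations held out together
set.seed(1)
st.id = as.factor(cookfarm.rm$SOURCEID)
st.folds = sample(1:5, nlevels(st.id), replace = TRUE)[as.integer(st.id)]
cv.rmse = sapply(1:5, function(k){
  m = ranger(fm, cookfarm.rm[st.folds != k, ], num.trees = 100)
  sqrt(mean((predict(m, cookfarm.rm[st.folds == k, ])$predictions -
             cookfarm.rm$VW[st.folds == k])^2))
})
mean(cv.rmse)  ## expected to be clearly higher than the OOB RMSE above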

We can now fit an Ensemble ML model, but this time we will also use a blocking parameter that should protect against over-fitting: the unique code of the monitoring station (SOURCEID). This means that complete stations will be used either for training or for validation. This satisfies the requirement of Roberts et al. (2017) for predicting onto new data or a new predictor space.

We use the same procedure in mlr as in the previous example:

library(mlr)
SL.lst = c("regr.ranger", "regr.gamboost", "regr.cvglmnet")
lrns.st <- lapply(SL.lst, mlr::makeLearner)
## subset to 5% to speed up computing
subs <- runif(nrow(cookfarm.rm))<.05
tsk.st <- mlr::makeRegrTask(data = cookfarm.rm[subs, all.vars(fm)], target = "VW", 
                            blocking = as.factor(cookfarm.rm$SOURCEID)[subs])
tsk.st
#> Supervised task: cookfarm.rm[subs, all.vars(fm)]
#> Type: regr
#> Target: VW
#> Observations: 5321
#> Features:
#>    numerics     factors     ordered functionals 
#>           9           0           0           0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: TRUE
#> Has coordinates: FALSE
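
The chunk that fits the stacked model is not echoed in this draft. A minimal sketch of how it could look with mlr and parallelMap is given below; the object name eml.VW, the meta-learner and the resampling settings are assumptions inferred from the output that follows:

## hypothetical reconstruction of the unechoed training chunk
library(parallelMap)
parallelStartSocket(parallel::detectCores())
SL0 <- mlr::makeStackedLearner(lrns.st, method = "stack.cv",
            super.learner = "regr.lm",
            resampling = mlr::makeResampleDesc("CV", blocking.cv = TRUE))
eml.VW <- mlr::train(SL0, tsk.st)
parallelStop()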

The resulting model again used a simple linear regression for stacking the various learners:

#> Starting parallelization in mode=socket with cpus=32.
#> Exporting objects to slaves for mode socket: .mlr.slave.options
#> Mapping in parallel: mode = socket; level = mlr.resample; cpus = 32; elements = 10.
#> Exporting objects to slaves for mode socket: .mlr.slave.options
#> Mapping in parallel: mode = socket; level = mlr.resample; cpus = 32; elements = 10.
#> Loading required package: mboost
#> Loading required package: parallel
#> Loading required package: stabs
#> 
#> Attaching package: 'stabs'
#> The following object is masked from 'package:mlr':
#> 
#>     subsample
#> The following object is masked from 'package:spatstat.core':
#> 
#>     parameters
#> This is mboost 2.9-2. See 'package?mboost' and 'news(package  = "mboost")'
#> for a complete list of changes.
#> 
#> Attaching package: 'mboost'
#> The following object is masked from 'package:glmnet':
#> 
#>     Cindex
#> The following object is masked from 'package:spatstat.core':
#> 
#>     Poisson
#> The following objects are masked from 'package:raster':
#> 
#>     cv, extract
#> Exporting objects to slaves for mode socket: .mlr.slave.options
#> Mapping in parallel: mode = socket; level = mlr.resample; cpus = 32; elements = 10.
#> Stopped parallelization. All cleaned up.

Note that here we use full parallelisation via the parallelMap package to speed up the computing. The resulting EML model now shows a more realistic R-square / RMSE:

summary(eml.VW$learner.model$super.model$learner.model)
#> 
#> Call:
#> stats::lm(formula = f, data = d)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.188277 -0.044502  0.003824  0.044429  0.177609 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    0.044985   0.009034   4.979 6.58e-07 ***
#> regr.ranger    1.002421   0.029440  34.050  < 2e-16 ***
#> regr.gamboost -0.664149   0.071813  -9.248  < 2e-16 ***
#> regr.cvglmnet  0.510689   0.052930   9.648  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06262 on 5317 degrees of freedom
#> Multiple R-squared:  0.3914, Adjusted R-squared:  0.3911 
#> F-statistic:  1140 on 3 and 5317 DF,  p-value: < 2.2e-16

This is now a 3D+T model of VW, which means that we can use it to predict values of VW at any new x, y, depth, time location. To make predictions for a specific time slice (here 2012-07-30 at a depth of -0.3 m) we can use code along the following lines.
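
The prediction chunk itself is not echoed in this draft. The following is a minimal sketch; the object names new.st and pr.df are assumptions (pr.df matches the plotting code below), and the dynamic covariates are assumed to be joined from the regression matrix by date:

## hypothetical reconstruction: static covariates from the 10 m grid,
## dynamic covariates for the target date from the regression matrix
new.st = cookfarm.grid@data
new.st$altitude = -0.3
new.st$Date = as.Date("2012-07-30")
new.st = plyr::join(new.st, cookfarm.rm[!duplicated(cookfarm.rm$Date),
              c("Date","Precip_cum","MaxT_wrcc","MinT_wrcc","cdayt")])
pr.df = predict(eml.VW, newdata = new.st[, all.vars(fm)[-1]])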

#> Joining by: Date
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.

This warning (repeated several times in the original output) comes from the gamboost base learner: some prediction locations have covariate values outside the range covered by the training data, so the spline terms are extrapolated linearly.

To plot the predictions together with the locations of the training points we can use:

cookfarm.grid$pr.VW = pr.df$data$response
plot(raster::raster(cookfarm.grid["pr.VW"]), col=rev(bpy.colors()),
     main="Predicted VW for 2012-07-30 and depth -0.3 m", axes=FALSE, box=FALSE)
points(cookfarm$profiles[,c("Easting","Northing")], pch="+")

Figure 3.1: Predicted soil water content based on spatiotemporal EML.

Because this is a space-time dataset, we could also predict values of VW for longer periods (e.g. 100 days) and then visualize the changes using e.g. the animation package or similar, as sketched below.
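
A minimal sketch of such an animation is given below; the helper function, date range and output file name are all assumptions, not part of the original text:

library(animation)
## hypothetical helper: predict VW for one date at -0.3 m depth,
## reusing the slice-prediction sketch above
pred_slice = function(date){
  new.st = cookfarm.grid@data
  new.st$altitude = -0.3
  new.st$Date = date
  new.st = plyr::join(new.st, cookfarm.rm[!duplicated(cookfarm.rm$Date),
                c("Date","Precip_cum","MaxT_wrcc","MinT_wrcc","cdayt")])
  predict(eml.VW, newdata = new.st[, all.vars(fm)[-1]])$data$response
}
dates = seq(as.Date("2012-05-01"), by = 1, length.out = 100)
saveGIF({
  for(i in seq_along(dates)){
    cookfarm.grid$pr.VW = pred_slice(dates[i])
    plot(raster::raster(cookfarm.grid["pr.VW"]), col = rev(bpy.colors()),
         main = paste("Predicted VW for", dates[i]))
  }
}, movie.name = "cookfarm_vw.gif")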