Key Features
-
Spatial Cross-Validation: k-fold Nearest Neighbour Distance Matching (KNNDM) builds validation folds whose nearest-neighbour distances match the distances between prediction locations and training samples
-
Advanced Feature Selection: Automated forward feature selection (FFS) identifies optimal covariate subsets
-
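Dual Uncertainty Quantification: Model-based prediction intervals from Quantile Regression Forests (QRF) plus distance-based uncertainty from the Area of Applicability (AOA) and Dissimilarity Index (DI)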
-
Tile-Based Processing: Automatic parallelized tiling for large raster stacks (>5M cells)
-
Comprehensive Quality Assessment: ISO 19157-compliant data quality metadata and fitness-for-purpose documentation
Methodological Framework
Processing Workflow (14 Steps)
-
Package Verification: R dependency management with automatic installation
-
Custom Function Definition: User-defined plotting and extraction utilities
-
Data Loading & Validation: Spatial data import with CRS compatibility checks
-
Covariate Extraction: 3×3 median-filtered neighborhood extraction at sample points
-
Spatial Partitioning: CreateSpacetimeFolds with k-fold spatial blocking
-
KNNDM Fold Setup: Geodistance matching for CV fold creation
-
Forward Feature Selection: FFS with Random Forest (ntree=100) via CAST package
-
Model Training: Random Forest (ntree=500) with optimized hyperparameters
-
External Validation: Independent test set performance assessment
-
Full-Dataset Retraining: Maximizes training information for production predictions
-
Wall-to-Wall Prediction: Random Forest spatial prediction across study area
-
Quantile Predictions: QRF (Q05, Q50, Q95) with 90% prediction intervals
-
Distance-Based Uncertainty: AOA/DI computation with automatic tiling
-
Output Generation: Raster, CSV, and visualization deliverables
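Step 4 above (3×3 median-filtered extraction) can be illustrated in base R on a plain matrix. This is a minimal sketch under stated assumptions, not the code in SOCastR.R: the helper name extract_median_3x3 is hypothetical, and a real run would operate on a raster stack (for example via terra's focal and extract functions) rather than a toy matrix.

```r
# Hypothetical helper: median of the 3x3 window centred on (row, col).
# Cells outside the grid and NA cells are ignored.
extract_median_3x3 <- function(grid, row, col) {
  rows <- max(1, row - 1):min(nrow(grid), row + 1)
  cols <- max(1, col - 1):min(ncol(grid), col + 1)
  median(grid[rows, cols], na.rm = TRUE)
}

# Toy covariate layer with one outlier cell (99) at the sample location
elev <- matrix(c(10, 11, 12,
                 11, 99, 13,
                 12, 13, 14), nrow = 3, byrow = TRUE)

center_value <- elev[2, 2]                      # raw cell value: 99
window_value <- extract_median_3x3(elev, 2, 2)  # 3x3 median: 12
```

The median ignores the single outlier cell, which is why a neighborhood median is less sensitive to small positional errors in the sample coordinates than a single-cell lookup.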
Key Algorithms & Packages
| Package | Version | Function |
|---|---|---|
| CAST | 0.7.0 | Spatial cross-validation (KNNDM) and Area of Applicability |
| caret | 6.0 | Machine learning framework and hyperparameter tuning |
| randomForest | 4.7 | Random Forest implementation |
| quantregForest | 1.3 | Quantile regression for prediction intervals |
| terra | 1.7 | Raster data operations and spatial analysis |
| sf | 1.0 | Vector data handling |
| doParallel | 1.0 | Parallelization (PSOCK on Windows, fork on Unix/Linux) |
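Step 1 of the workflow (package verification) could be approximated as follows. This is a sketch rather than the code in SOCastR.R: the helper name packages_to_install is hypothetical, and treating the versions listed above as minimum versions is an assumption.

```r
# Versions from the package table, interpreted here as minimums (assumption)
required <- c(CAST = "0.7.0", caret = "6.0", randomForest = "4.7",
              quantregForest = "1.3", terra = "1.7", sf = "1.0",
              doParallel = "1.0")

# Hypothetical helper: names of packages that are missing or too old
packages_to_install <- function(required) {
  ok <- vapply(names(required), function(pkg) {
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
  names(required)[!ok]
}

todo <- packages_to_install(required)
if (length(todo) > 0) {
  message("Missing or outdated: ", paste(todo, collapse = ", "))
  # install.packages(todo)  # uncomment to install from CRAN
}
```

Installation is left commented out so that the check itself has no side effects beyond loading namespaces.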
Input Data Requirements
Soil Samples (ESRI Shapefile)
-
Format: .shp, .shx, .dbf, .prj files
-
Geometry: Point features
-
CRS: Any projected coordinate system (e.g., EPSG:25832 for UTM Zone 32N)
-
Required Attributes: SOC column (numeric, concentration in g/kg or %)
-
Quality Criteria:
-
Minimum 100 samples (200+ recommended)
-
No duplicate coordinates
-
Valid coordinates within study area
-
SOC values within plausible range (0-30 typical)
-
Sample density ≥0.5 samples/km² minimum
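Most of the quality criteria above can be checked before running the workflow. The following base R sketch assumes the samples have already been read into a data.frame with projected coordinates x and y (in metres) and an SOC column; the function name check_samples and its arguments are hypothetical, and the check that coordinates fall inside the study area is omitted because it needs the study-area geometry.

```r
# Hypothetical helper: returns a named logical vector of pass/fail checks.
# `pts` has columns x, y (projected, metres) and SOC; `area_km2` is the
# study-area size in square kilometres.
check_samples <- function(pts, area_km2, soc_range = c(0, 30)) {
  c(
    enough_samples = nrow(pts) >= 100,
    no_duplicates  = !any(duplicated(pts[, c("x", "y")])),
    soc_plausible  = all(pts$SOC >= soc_range[1] & pts$SOC <= soc_range[2]),
    density_ok     = nrow(pts) / area_km2 >= 0.5
  )
}

set.seed(42)
pts <- data.frame(x   = runif(120, 0, 10000),  # 10 km x 10 km toy area
                  y   = runif(120, 0, 10000),
                  SOC = runif(120, 5, 25))
qc <- check_samples(pts, area_km2 = 100)
print(qc)
```

Returning one flag per criterion, rather than a single TRUE/FALSE, makes it easier to report which requirement a dataset fails.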
Environmental Covariates (Multi-band GeoTIFF)
-
Format: GeoTIFF with multiple bands
-
Bands: 5-50 layers (typically 10-25 recommended)
-
CRS: Compatible with soil samples (automatic reprojection if needed)
-
Resolution: Uniform across all bands (10m, 30m, or 100m typical)
-
Covariate Types:
-
Terrain: slope, aspect, curvature, elevation, relief
-
Climate: precipitation, temperature, solar radiation
-
Remote sensing: NDVI, SAVI, SAR backscatter
-
Auxiliary: parent material, lithology, land use
Data Quality Requirements:
-
Complete spatial coverage over study area (no gaps/holes)
-
No zero-variance layers (automatically removed)
-
Physically plausible values across study domain
-
Covariates chosen so that the main covariate-SOC relationships are represented
Output Deliverables
Raster Outputs (GeoTIFF format, LZW compression)
-
FinalPredictionSocRaster.tif - Point predictions from Random Forest
-
FinalPredictionSocQuantileLayers.tif - Multi-band: Q05, Q50, Q95, PIW (Prediction Interval Width)
-
FinalPredictionDissimilarityIndex.tif - Multivariate distance to training feature space
-
FinalPredictionAOAmask.tif - Binary AOA mask (1=reliable, 0=extrapolation)
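The relationship between the Dissimilarity Index and the binary AOA mask can be sketched in base R. This is a simplified illustration, not the CAST implementation: it uses unweighted standardized predictors, ignores cross-validation folds, and takes the upper boxplot whisker of the training DI as the threshold, all of which are assumptions made for brevity. The function name dissimilarity_index is hypothetical.

```r
# Simplified Dissimilarity Index: unweighted predictors, no CV folds (assumption)
dissimilarity_index <- function(train, new) {
  mu <- colMeans(train)
  s  <- apply(train, 2, sd)
  tr <- scale(train, center = mu, scale = s)
  nw <- scale(new,   center = mu, scale = s)

  # Mean pairwise distance among training points, used for normalization
  d_train   <- as.matrix(dist(tr))
  mean_dist <- mean(d_train[upper.tri(d_train)])

  # Distance from each new point to its nearest training point
  nearest <- function(p) sqrt(min(rowSums(sweep(tr, 2, p)^2)))
  di_new  <- apply(nw, 1, nearest) / mean_dist

  # Training DI: distance to the nearest *other* training point
  diag(d_train) <- Inf
  di_train <- apply(d_train, 1, min) / mean_dist

  # Threshold: upper boxplot whisker of the training DI (assumption)
  threshold <- grDevices::boxplot.stats(di_train)$stats[5]
  list(di = di_new, threshold = threshold,
       aoa = as.integer(di_new <= threshold))  # 1 = inside AOA, 0 = outside
}

set.seed(42)
train <- cbind(slope = runif(50, 0, 20), ndvi = runif(50, 0.2, 0.8))
new   <- rbind(inside  = c(10, 0.5),   # within the training feature space
               outside = c(60, 2.0))   # far outside it
res <- dissimilarity_index(train, new)
print(res)
```

The second point lies well beyond the sampled slope and NDVI ranges, so its DI exceeds the threshold and its mask value is 0.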
Statistical Tables (CSV format)
-
ExtractValuesSampleDataSummary.csv - Input data completeness and descriptive statistics
-
FinalModelAccuracy.csv - Spatial cross-validation performance metrics
-
ValidationPerformanceComparison.csv - Independent test set evaluation (RMSE, MAE, R², Bias)
-
ForwardFeatureSelectionSelectedVariables.csv - Ranked selected covariates
-
FinalModelVariableImportance.csv - Variable importance scores
-
QuantilePredictionUncertaintySummary.csv - PI statistics (mean, median, 95th percentile width)
-
FinalPredictionAoaSummaryStatistics.csv - AOA coverage and DI distribution
Visualization Outputs (PNG, 300 DPI)
-
SpatialDataPartitionMapTrainTestSplit.png - Training/test sample distribution
-
SpatialCrossValidationGeodistEcdf.png - Fold quality assessment
-
ValidationScatterPlot.png - Observed vs. predicted values with R²
-
ValidationResidualPlot.png - Residual analysis for bias assessment
-
FinalPredictionMapSocRF.png - Predicted SOC concentration map
-
FinalPredictionMap[Q05/Q50/Q95]Percentile.png - Uncertainty bounds
-
FinalPredictionMapPIW90.png - Prediction interval width spatial pattern
-
FinalPredictionMapDI.png - Dissimilarity Index visualization
Model Objects (RDS format)
Fitness-for-Purpose Assessment
Intended Uses & Suitability
Highly Suitable (R ≥0.60, AOA ≥80%):
-
Regional soil carbon accounting and inventories (1:150,000 - 1:250,000)
-
Climate change mitigation monitoring
-
Landscape-scale land management planning
-
Environmental impact assessments
Moderately Suitable (R ≥0.50, AOA 60-80%):
-
Field-scale precision agriculture (with validation recommended)
-
Soil quality assessments with acknowledged uncertainties
-
Policy support and conservation planning
Limited Suitability (R <0.50 or AOA <50%):
-
High-precision operational applications
-
Critical decision-making without independent validation
-
Extrapolation beyond training geographic or environmental space
Model Limitations
-
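Cannot Extrapolate: Random Forest predictions are bounded by the range of the training SOC values; in high-DI zones they flatten toward typical training values instead of following real trends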
-
Underpredicts Extremes: Random Forest tendency to average, limiting tail predictions
-
Spatial Stationarity Assumption: Assumes constant SOC-covariate relationships across study area
-
Covariate Dependency: Prediction quality limited by covariate relevance and resolution
-
Temporal Transferability: Historical model may not extrapolate to future conditions without recalibration (recommend refreshing every 10 years)
-
Computational Intensity: Large rasters (>50M cells) require substantial RAM and processing time
Data Quality & ISO 19157 Compliance
This workflow implements ISO 19157 data quality standards across six key elements:
| Element | Implementation | Metrics |
|---|---|---|
| Completeness | NA counting and removal statistics | % samples retained, missing covariates |
| Positional Accuracy | 3×3 neighborhood median extraction reduces GPS error | Inherits from input sample accuracy (typically ±10m) |
| Thematic Accuracy | Spatial CV and independent test validation | RMSE, MAE, R², Bias |
| Logical Consistency | Domain value checks, CRS validation | Pass/fail automated checks |
| Temporal Quality | Sampling period documentation | Date range, temporal coverage ≥5 years recommended |
| Usability | Fitness-for-purpose assessment | AOA coverage, uncertainty range |
Quality Threshold Targets
-
R² (test set): ≥0.30 minimum, ≥0.50 preferred
-
Bias: Within ±10% of mean SOC, ideally close to 0
-
AOA Coverage: ≥70% of study area desirable (≥50% minimum)
-
RMSE (%): ≤30% acceptable, ≤20% good
-
PI Coverage: 90% of test observations within predicted 90% interval
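The threshold targets above can be evaluated from test-set predictions with a few lines of base R. This is a sketch under assumptions: the function name validation_metrics is hypothetical, R² is taken as the squared Pearson correlation, and relative RMSE and bias are expressed as a percentage of the mean observed SOC. The workflow itself may define these differently.

```r
# Hypothetical helper computing the test-set metrics named above
validation_metrics <- function(obs, pred, q05, q95) {
  err  <- pred - obs
  rmse <- sqrt(mean(err^2))
  c(
    RMSE        = rmse,
    RMSE_pct    = 100 * rmse / mean(obs),        # relative to mean observed SOC
    MAE         = mean(abs(err)),
    R2          = cor(obs, pred)^2,              # squared Pearson correlation
    Bias        = mean(err),
    Bias_pct    = 100 * mean(err) / mean(obs),
    PI_coverage = 100 * mean(obs >= q05 & obs <= q95)
  )
}

# Toy test set with a symmetric +/- 3 prediction interval
obs  <- c(10, 12, 15, 18, 20, 25, 30, 14, 16, 22)
pred <- c(11, 12, 14, 17, 21, 23, 27, 15, 16, 21)
m <- validation_metrics(obs, pred, q05 = pred - 3, q95 = pred + 3)

# Compare against the minimum targets listed above
checks <- c(
  r2_ok   = m[["R2"]] >= 0.30,
  bias_ok = abs(m[["Bias_pct"]]) <= 10,
  rmse_ok = m[["RMSE_pct"]] <= 30
)
print(round(m, 2))
print(checks)
```

For the toy data the bias is -0.5 and the MAE is 1.1. Empirical interval coverage is best judged on a reasonably large test set, since with few observations it can deviate from the nominal 90% by chance.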
FAIR Principles Implementation
Findable: DOI via GitHub-Zenodo integration, GitHub indexed by Google/Bing, Zenodo indexed by OpenAIRE, BASE, CERN
Accessible: HTTPS access to code and data without authentication, MIT permissive license
Interoperable:
-
Standard R/tidyverse conventions
-
ISO 19157 data quality vocabulary
-
ISO 19115 geospatial metadata elements
-
W3C PROV-DM provenance modeling
-
EPSG coordinate reference systems
Reusable:
-
Comprehensive FAIR documentation with use cases
-
Detailed provenance tracking with W3C PROV-O ontology
-
Complete parameter documentation
-
Domain-relevant standards (ISO 19157, ISO 19115-3, OGC)
System Requirements & Performance
Minimum Requirements
-
R Version: 4.0.0+
-
RAM: 8 GB minimum (16 GB recommended)
-
CPU: Multi-core processor (4+ cores recommended)
-
Operating Systems: Windows, Linux/Unix, macOS
-
Storage: ~1 GB for typical workflow
Performance Estimates
Reference: 1,000 samples, 20 covariates, 1M pixel raster, 4 cores
| Step | Time | Memory | Notes |
|---|---|---|---|
| Covariate extraction | 2-5 min | 2-3 GB | Depends on sample count |
| Forward feature selection | 15-45 min | 3-5 GB | Iterative training, slowest |
| Model training + validation | 6-17 min | 3-5 GB | 5 CV folds + test set |
| QRF uncertainty | 20-90 min | 4-8 GB | Tile-based with parallelization |
| AOA/DI computation | 10-45 min | 3-6 GB | Distance calculations parallelized |
| Total | 1-3 hours | 8-16 GB peak | Highly variable by dataset |
Parallelization
-
Linux/Unix: Fork-based (efficient memory sharing)
-
Windows: PSOCK cluster (explicit variable export)
-
Core Allocation: detectCores() - 1 (reserves 1 core for system)
-
Tiling Threshold: Automatic for rasters >5M cells
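The platform-dependent backend choice described above might look like the following, using only the base parallel package. It is a sketch, not the workflow's own setup code: the variable names, the per-tile function, and the cap of two workers are assumptions for demonstration.

```r
library(parallel)

# Reserve one core for the system; fall back to 1 if detection fails
n_cores <- max(1, detectCores() - 1, na.rm = TRUE)
n_cores <- min(n_cores, 2)  # capped for this demo only

# Fork on Unix-alikes, PSOCK on Windows
cl_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
cl <- makeCluster(n_cores, type = cl_type)

offset <- 100
if (cl_type == "PSOCK") {
  # PSOCK workers start with an empty workspace, so variables used by
  # the worker function must be exported explicitly
  clusterExport(cl, "offset", envir = environment())
}

# Placeholder per-tile task: a real run would predict one raster tile
tile_ids <- 1:16
res <- parLapply(cl, tile_ids, function(i) i + offset)
stopCluster(cl)

# Tiling rule from the list above: tile rasters with more than 5M cells
needs_tiling <- function(n_cells, threshold = 5e6) n_cells > threshold
```

Forked workers inherit the parent session's memory, which is why no export step is needed on Linux and other Unix-alikes.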
Citation & Reproducibility
Recommended Citation Format:
Möller, M. (2025). SOCastR: Soil Organic Carbon Prediction Workflow with Uncertainty Quantification (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.17503422
CITATION.cff: Included in repository root (CFF v1.2.0 compliant)
Reproducibility Features:
-
Fixed random seed (42) for spatial fold creation and model initialization
-
Complete parameter documentation with defaults
-
Version-controlled code via Git
-
DOI assigned to each release
-
All outputs include execution metadata and processing logs
Known Limitations & Caveats
-
Random Forest Constraints: Predictions cannot leave the range of the training SOC values and flatten toward typical training values in extrapolation zones (high DI)
-
Residual Spatial Autocorrelation: KNNDM mitigates but does not eliminate the optimistic bias that spatial dependence introduces into cross-validation error estimates
-
Quantile Interval Coverage: QRF prediction intervals may not achieve nominal 90% coverage in all regions
-
AOA Threshold Sensitivity: Threshold is training-data dependent; may be overly conservative or permissive
-
Temporal Stationarity: Assumes SOC-covariate relationships constant over training period; recalibration recommended for projections
-
Covariate Quality Critical: Predictions only as reliable as input covariates; missing key drivers reduces accuracy
-
Computational Trade-offs: FFS time scales quadratically with covariate count; tiling granularity affects memory/IO efficiency
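The quadratic scaling of FFS can be made concrete by counting model fits. The sketch below assumes the selection starts by fitting every two-covariate model and then adds one covariate per round without stopping early, which is the worst case; the function name ffs_max_models is hypothetical.

```r
# Worst-case number of candidate models for forward feature selection
# with n covariates: all pairs first, then one added variable per round
ffs_max_models <- function(n) {
  stopifnot(n >= 2)
  pairs <- choose(n, 2)
  later <- if (n > 2) sum(n - 2:(n - 1)) else 0
  pairs + later
}

ffs_max_models(20)                               # 361, i.e. (20 - 1)^2
model_counts <- sapply(c(10, 20, 40), ffs_max_models)
print(model_counts)                              # 81 361 1521
```

Under these assumptions the count equals (n - 1)^2, so doubling the number of covariates roughly quadruples the work. Each candidate model is additionally refit once per cross-validation fold, and FFS usually stops earlier in practice once no remaining covariate improves the error.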
Installation & Usage Quick Start
Installation
# Clone repository
# git clone https://github.com/JKI-GDM/SOCastR.git
# OR download ZIP and extract
# Set working directory and source main script
setwd("SOCastR")
source("SOCastR.R")
# Verify installation
exists("SOCastR") # Should return TRUE
Basic Usage
# Prepare data:
# - input/SAMPLES_EPSG25832.shp (soil samples with SOC column)
# - input/COVARIATES_EPSG25832.tif (multi-band environmental raster)
# Run workflow with defaults
SOCastR(
  workingdir = getwd(),
  inputdir = "input",
  outputdir = "output",
  samples = "SAMPLES_EPSG25832.shp",
  covariates = "COVARIATES_EPSG25832.tif",
  soccolumn = "SOC",
  n.tile = 4,                  # 4x4 tile grid = 16 tiles
  modeluncertainty = TRUE,     # QRF prediction intervals
  distanceuncertainty = TRUE   # AOA/DI computation
)
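To illustrate what n.tile = 4 implies, the following base R sketch splits a raster's row and column indices into a 4x4 grid. It shows the idea only; the helper make_tiles and the raster dimensions are hypothetical, and the workflow's internal tiling may differ.

```r
# Hypothetical helper: row/column index ranges for an n.tile x n.tile grid
make_tiles <- function(n_rows, n_cols, n.tile) {
  rb  <- round(seq(0, n_rows, length.out = n.tile + 1))  # row breaks
  cb  <- round(seq(0, n_cols, length.out = n.tile + 1))  # column breaks
  idx <- expand.grid(r = seq_len(n.tile), c = seq_len(n.tile))
  data.frame(
    tile      = seq_len(nrow(idx)),
    row_start = rb[idx$r] + 1, row_end = rb[idx$r + 1],
    col_start = cb[idx$c] + 1, col_end = cb[idx$c + 1]
  )
}

# 3000 x 2500 = 7.5M cells, above the 5M-cell tiling threshold
tiles <- make_tiles(n_rows = 3000, n_cols = 2500, n.tile = 4)
cells_per_tile <- with(tiles,
  (row_end - row_start + 1) * (col_end - col_start + 1))

nrow(tiles)          # 16
sum(cells_per_tile)  # 7500000: tiles cover every cell exactly once
```

A larger n.tile lowers peak memory per tile but adds read/write overhead, which is the memory/IO trade-off noted under Known Limitations.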
Supporting Documentation
-
README.md: Quick start, installation, basic usage
-
CITATION.cff: Standardized software citation metadata (CFF v1.2.0)
-
LICENSE: MIT open-source license
-
SOCastR.R: Fully documented R source code with inline comments
-
SOCastR-FAIR-Docs.md: Comprehensive FAIR documentation (this document)
References & Key Literature
-
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
-
Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983-999.
-
Meyer, H., & Pebesma, E. (2022). Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications, 13, 2208.
-
ISO 19157:2013. Geographic information — Data quality
-
ISO 19115-3:2016. Geographic information — Metadata — Part 3: XML schema implementation for fundamental concepts
Workflow Version: 1.0.0
Last Updated: 2025-11-01
Author Contact: markus.moeller@julius-kuehn.de
Institution: Julius Kühn Institute (JKI), Federal Research Centre for Cultivated Plants, Germany