Published November 1, 2025 | Version 1.0
Software Open

SOCastR: Soil Organic Carbon Prediction Workflow with Uncertainty Quantification

  • 1. Julius Kühn-Institut

Description

SOCastR is an R-based workflow for digital soil mapping (DSM) that predicts soil organic carbon (SOC) content using Random Forest and Quantile Regression Forest models. It combines spatial cross-validation (KNNDM), forward feature selection (FFS), and two complementary uncertainty measures: prediction intervals for model uncertainty and the Area of Applicability for extrapolation risk. The result is a set of spatially explicit SOC predictions with accompanying uncertainty layers.

Notes

Key Features

  • Spatial Cross-Validation: k-fold Nearest Neighbour Distance Matching (KNNDM) builds CV folds whose nearest-neighbour distances to the training data resemble those between prediction locations and the samples

  • Advanced Feature Selection: Automated forward feature selection (FFS) identifies optimal covariate subsets

  • Dual Uncertainty Quantification:

    • Quantile Regression Forest (QRF) for prediction interval estimation

    • Area of Applicability (AOA) with Dissimilarity Index (DI) for extrapolation risk assessment

  • Tile-Based Processing: Automatic parallelized tiling for large raster stacks (>5M cells)

  • Comprehensive Quality Assessment: ISO 19157-compliant data quality metadata and fitness-for-purpose documentation
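
To make the Dissimilarity Index more concrete, the following base R sketch computes a simplified, unweighted DI: the distance from each new location to its nearest training sample in standardized covariate space, divided by the mean pairwise distance among training samples. This is an illustration under simplifying assumptions, not the CAST implementation, which additionally weights covariates by variable importance and derives the AOA threshold from cross-validated training DI values. The function name `dissimilarity_index` and the toy data are hypothetical.

```r
# Simplified, unweighted dissimilarity index (DI); illustrative only.
# train, new: numeric matrices with one column per covariate.
dissimilarity_index <- function(train, new) {
  # Standardize both sets with the training means and standard deviations
  mu <- colMeans(train)
  sigma <- apply(train, 2, sd)
  train_s <- sweep(sweep(train, 2, mu, "-"), 2, sigma, "/")
  new_s <- sweep(sweep(new, 2, mu, "-"), 2, sigma, "/")

  # Mean pairwise distance between training points (normalizing constant)
  d_bar <- mean(dist(train_s))

  # Distance from each new point to its nearest training point
  nearest <- apply(new_s, 1, function(p) {
    min(sqrt(rowSums(sweep(train_s, 2, p, "-")^2)))
  })
  nearest / d_bar
}

set.seed(42)
train <- matrix(rnorm(200), ncol = 2)      # 100 samples, 2 covariates
new <- rbind(colMeans(train), c(10, 10))   # one typical and one extreme location
di <- dissimilarity_index(train, new)
```

The location far outside the training cloud receives a much larger DI than the one at the training mean, which is the behaviour an extrapolation-risk layer relies on.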

Methodological Framework

Processing Workflow (14 Steps)

  1. Package Verification: R dependency management with automatic installation

  2. Custom Function Definition: User-defined plotting and extraction utilities

  3. Data Loading & Validation: Spatial data import with CRS compatibility checks

  4. Covariate Extraction: 3×3 median-filtered neighborhood extraction at sample points

  5. Spatial Partitioning: CreateSpacetimeFolds with k-fold spatial blocking

  6. KNNDM Fold Setup: Geodistance matching for CV fold creation

  7. Forward Feature Selection: FFS with Random Forest (ntree=100) via CAST package

  8. Model Training: Random Forest (ntree=500) with optimized hyperparameters

  9. External Validation: Independent test set performance assessment

  10. Full-Dataset Retraining: Maximizes training information for production predictions

  11. Wall-to-Wall Prediction: Random Forest spatial prediction across study area

  12. Quantile Predictions: QRF (Q05, Q50, Q95) with 90% prediction intervals

  13. Distance-Based Uncertainty: AOA/DI computation with automatic tiling

  14. Output Generation: Raster, CSV, and visualization deliverables
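
Step 4 extracts covariate values as the median of a 3×3 cell neighborhood around each sample point, which dampens the effect of a single noisy cell or a small positional error. A minimal base R sketch of that idea on a plain matrix is shown here; the function name `median_3x3` and the toy grid are hypothetical, and the actual workflow operates on raster layers, not matrices.

```r
# Median of the 3x3 window centred on cell (row, col) of a single-layer grid.
# Cells outside the grid are skipped and NA cells are ignored.
median_3x3 <- function(grid, row, col) {
  rows <- max(1, row - 1):min(nrow(grid), row + 1)
  cols <- max(1, col - 1):min(ncol(grid), col + 1)
  median(grid[rows, cols], na.rm = TRUE)
}

grid <- matrix(1:25, nrow = 5)   # toy covariate layer
grid[3, 3] <- 999                # a single outlier cell at the sample location

centre_value <- median_3x3(grid, 3, 3)   # outlier is suppressed by the median
corner_value <- median_3x3(grid, 1, 1)   # window is clipped at the grid edge
```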

Key Algorithms & Packages

| Package | Version | Function |
| --- | --- | --- |
| CAST | 0.7.0 | Spatial cross-validation (KNNDM) and Area of Applicability |
| caret | 6.0 | Machine learning framework and hyperparameter tuning |
| randomForest | 4.7 | Random Forest implementation |
| quantregForest | 1.3 | Quantile regression for prediction intervals |
| terra | 1.7 | Raster data operations and spatial analysis |
| sf | 1.0 | Vector data handling |
| doParallel | 1.0 | Parallelization (PSOCK on Windows, fork on Unix/Linux) |

Input Data Requirements

Soil Samples (ESRI Shapefile)

  • Format: .shp, .shx, .dbf, .prj files

  • Geometry: Point features

  • CRS: Any projected coordinate system (e.g., EPSG:25832 for UTM Zone 32N)

  • Required Attributes: SOC column (numeric, concentration in g/kg or %)

  • Quality Criteria:

    • Minimum 100 samples (200+ recommended)

    • No duplicate coordinates

    • Valid coordinates within study area

    • SOC values within plausible range (0-30 typical)

    • Sample density ≥0.5 samples/km² minimum
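
The quality criteria above can be checked before running the workflow. The sketch below is one possible pre-flight check in base R, assuming the samples have already been read into a data frame with coordinate columns `x` and `y`; the function name `check_samples`, its arguments, and the column names are hypothetical and not part of SOCastR.

```r
# Pre-flight checks for a sample table; thresholds follow the criteria listed above.
# samples: data.frame with columns x, y (projected coordinates) and an SOC column.
# area_km2: size of the study area in square kilometres.
check_samples <- function(samples, area_km2, soc_col = "SOC",
                          min_n = 100, soc_range = c(0, 30), min_density = 0.5) {
  soc <- samples[[soc_col]]
  c(
    enough_samples = nrow(samples) >= min_n,
    no_duplicates  = !any(duplicated(samples[, c("x", "y")])),
    soc_numeric    = is.numeric(soc),
    soc_plausible  = is.numeric(soc) &&
      all(soc >= soc_range[1] & soc <= soc_range[2], na.rm = TRUE),
    density_ok     = nrow(samples) / area_km2 >= min_density
  )
}

set.seed(42)
samples <- data.frame(
  x = runif(120, 0, 10000),
  y = runif(120, 0, 10000),
  SOC = runif(120, 5, 25)
)
checks <- check_samples(samples, area_km2 = 100)

# A duplicated record should trip the duplicate-coordinate check
bad <- rbind(samples, samples[1, ])
bad_checks <- check_samples(bad, area_km2 = 100)
```

Because the 0-30 range is described as typical, not mandatory, a failed `soc_plausible` check is better treated as a prompt to inspect the data than as a hard stop.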

Environmental Covariates (Multi-band GeoTIFF)

  • Format: GeoTIFF with multiple bands

  • Bands: 5-50 layers (typically 10-25 recommended)

  • CRS: Compatible with soil samples (automatic reprojection if needed)

  • Resolution: Uniform across all bands (10m, 30m, or 100m typical)

  • Covariate Types:

    • Terrain: slope, aspect, curvature, elevation, relief

    • Climate: precipitation, temperature, solar radiation

    • Remote sensing: NDVI, SAVI, SAR backscatter

    • Auxiliary: parent material, lithology, land use

Data Quality Requirements:

  • Complete spatial coverage over study area (no gaps/holes)

  • No zero-variance layers (automatically removed)

  • Physically plausible values across study domain

  • Selected covariates capture the relevant covariate-SOC relationships
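
As an illustration of the zero-variance rule, the following base R sketch drops constant columns from a table of extracted covariate values. It is a simplified stand-in that works on a data frame; the function name `drop_zero_variance` and the example columns are hypothetical, and the workflow itself applies the check to raster layers.

```r
# Remove covariates that take a single value (ignoring NA) across all samples.
drop_zero_variance <- function(covariates) {
  keep <- vapply(covariates, function(v) {
    v <- v[!is.na(v)]
    length(unique(v)) > 1
  }, logical(1))
  covariates[, keep, drop = FALSE]
}

covs <- data.frame(
  slope = c(1.2, 3.4, 2.2),
  mask  = c(1, 1, 1),        # constant layer, carries no information
  ndvi  = c(0.2, 0.5, NA)
)
kept <- names(drop_zero_variance(covs))
```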

Output Deliverables

Raster Outputs (GeoTIFF format, LZW compression)

  1. FinalPredictionSocRaster.tif - Point predictions from Random Forest

  2. FinalPredictionSocQuantileLayers.tif - Multi-band: Q05, Q50, Q95, PIW (Prediction Interval Width)

  3. FinalPredictionDissimilarityIndex.tif - Multivariate distance to training feature space

  4. FinalPredictionAOAmask.tif - Binary AOA mask (1=reliable, 0=extrapolation)
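
The derived layers relate to each other in a simple way: the prediction interval width is the difference between the upper and lower quantile, and the AOA mask flags cells whose DI does not exceed the threshold. A minimal sketch with plain vectors in place of raster bands follows; the values and the threshold of 0.60 are made up for illustration.

```r
# Toy cell values; in the workflow these would be raster bands.
q05 <- c(8.0, 10.5, 12.0)    # lower quantile (Q05)
q95 <- c(15.0, 22.5, 30.0)   # upper quantile (Q95)
di  <- c(0.10, 0.45, 1.30)   # dissimilarity index
di_threshold <- 0.60         # hypothetical threshold, normally derived from training data

piw <- q95 - q05                         # 90% prediction interval width
aoa <- as.integer(di <= di_threshold)    # 1 = inside AOA, 0 = extrapolation
aoa_coverage <- 100 * mean(aoa)          # share of cells inside the AOA, in percent
```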

Statistical Tables (CSV format)

  • ExtractValuesSampleDataSummary.csv - Input data completeness and descriptive statistics

  • FinalModelAccuracy.csv - Spatial cross-validation performance metrics

  • ValidationPerformanceComparison.csv - Independent test set evaluation (RMSE, MAE, R², Bias)

  • ForwardFeatureSelectionSelectedVariables.csv - Ranked selected covariates

  • FinalModelVariableImportance.csv - Variable importance scores

  • QuantilePredictionUncertaintySummary.csv - PI statistics (mean, median, 95th percentile width)

  • FinalPredictionAoaSummaryStatistics.csv - AOA coverage and DI distribution

Visualization Outputs (PNG, 300 DPI)

  • SpatialDataPartitionMapTrainTestSplit.png - Training/test sample distribution

  • SpatialCrossValidationGeodistEcdf.png - Fold quality assessment

  • ValidationScatterPlot.png - Observed vs. predicted values with R²

  • ValidationResidualPlot.png - Residual analysis for bias assessment

  • FinalPredictionMapSocRF.png - Predicted SOC concentration map

  • FinalPredictionMap[Q05/Q50/Q95]Percentile.png - Uncertainty bounds

  • FinalPredictionMapPIW90.png - Prediction interval width spatial pattern

  • FinalPredictionMapDI.png - Dissimilarity Index visualization

Model Objects (RDS format)

  • trainDIobject.rds - Trained DI threshold object for AOA recomputation

Fitness-for-Purpose Assessment

Intended Uses & Suitability

Highly Suitable (R ≥0.60, AOA ≥80%):

  • Regional soil carbon accounting and inventories (1:150,000 - 1:250,000)

  • Climate change mitigation monitoring

  • Landscape-scale land management planning

  • Environmental impact assessments

Moderately Suitable (R ≥0.50, AOA 60-80%):

  • Field-scale precision agriculture (with validation recommended)

  • Soil quality assessments with acknowledged uncertainties

  • Policy support and conservation planning

Limited Suitability (R <0.50 or AOA <50%):

  • High-precision operational applications

  • Critical decision-making without independent validation

  • Extrapolation beyond training geographic or environmental space

Model Limitations

  1. Cannot Extrapolate: Predictions in high-DI zones revert to training data mean

  2. Underpredicts Extremes: Random Forest tendency to average, limiting tail predictions

  3. Spatial Stationarity Assumption: Assumes constant SOC-covariate relationships across study area

  4. Covariate Dependency: Prediction quality limited by covariate relevance and resolution

  5. Temporal Transferability: Historical model may not extrapolate to future conditions without recalibration (recommend refreshing every 10 years)

  6. Computational Intensity: Large rasters (>50M cells) require substantial RAM and processing time

Data Quality & ISO 19157 Compliance

This workflow implements ISO 19157 data quality standards across six key elements:

| Element | Implementation | Metrics |
| --- | --- | --- |
| Completeness | NA counting and removal statistics | % samples retained, missing covariates |
| Positional Accuracy | 3×3 neighborhood median extraction reduces GPS error | Inherits from input sample accuracy (typically ±10 m) |
| Thematic Accuracy | Spatial CV and independent test validation | RMSE, MAE, R², Bias |
| Logical Consistency | Domain value checks, CRS validation | Pass/fail automated checks |
| Temporal Quality | Sampling period documentation | Date range, temporal coverage ≥5 years recommended |
| Usability | Fitness-for-purpose assessment | AOA coverage, uncertainty range |

Quality Threshold Targets

  • R² (test set): ≥0.30 minimum, ≥0.50 preferred

  • Bias: Within ±10% of mean SOC (close to 0)

  • AOA Coverage: ≥70% of study area desirable (≥50% minimum)

  • RMSE (%): ≤30% acceptable, ≤20% good

  • PI Coverage: 90% of test observations within predicted 90% interval
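
The targets above can be evaluated from test-set observations, predictions, and interval bounds. The sketch below shows one way to compute them in base R. Two choices are assumptions, since the document does not define them: R² is computed as 1 − SSE/SST (caret's default instead reports the squared correlation), and the relative RMSE and bias are expressed as a percentage of mean observed SOC. The function name `validation_metrics` is hypothetical.

```r
# Test-set metrics from observed values, point predictions and interval bounds.
validation_metrics <- function(obs, pred, lower, upper) {
  err <- pred - obs
  rmse <- sqrt(mean(err^2))
  c(
    RMSE = rmse,
    MAE = mean(abs(err)),
    R2 = 1 - sum(err^2) / sum((obs - mean(obs))^2),
    Bias = mean(err),
    Bias_pct = 100 * mean(err) / mean(obs),
    RMSE_pct = 100 * rmse / mean(obs),
    PI_coverage = 100 * mean(obs >= lower & obs <= upper)
  )
}

obs   <- c(10, 20, 30, 40)
pred  <- c(12, 18, 33, 37)
lower <- pred - 2.5   # toy interval bounds
upper <- pred + 2.5
m <- validation_metrics(obs, pred, lower, upper)
```

In this toy case only half of the observations fall inside the interval, so the 90% coverage target would not be met even though R² is high.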

FAIR Principles Implementation

Findable: DOI via GitHub-Zenodo integration, GitHub indexed by Google/Bing, Zenodo indexed by OpenAIRE, BASE, CERN

Accessible: HTTPS access to code and data without authentication, MIT permissive license

Interoperable:

  • Standard R/tidyverse conventions

  • ISO 19157 data quality vocabulary

  • ISO 19115 geospatial metadata elements

  • W3C PROV-DM provenance modeling

  • EPSG coordinate reference systems

Reusable:

  • Comprehensive FAIR documentation with use cases

  • Detailed provenance tracking with W3C PROV-O ontology

  • Complete parameter documentation

  • Domain-relevant standards (ISO 19157, ISO 19115-3, OGC)

System Requirements & Performance

Minimum Requirements

  • R Version: 4.0.0+

  • RAM: 8 GB minimum (16 GB recommended)

  • CPU: Multi-core processor (4+ cores recommended)

  • Operating Systems: Windows, Linux/Unix, macOS

  • Storage: ~1 GB for typical workflow

Performance Estimates

Reference: 1,000 samples, 20 covariates, 1M pixel raster, 4 cores

| Step | Time | Memory | Notes |
| --- | --- | --- | --- |
| Covariate extraction | 2-5 min | 2-3 GB | Depends on sample count |
| Forward feature selection | 15-45 min | 3-5 GB | Iterative training, slowest |
| Model training + validation | 6-17 min | 3-5 GB | 5 CV folds + test set |
| QRF uncertainty | 20-90 min | 4-8 GB | Tile-based with parallelization |
| AOA/DI computation | 10-45 min | 3-6 GB | Distance calculations parallelized |
| Total | 1-3 hours | 8-16 GB peak | Highly variable by dataset |

Parallelization

  • Linux/Unix: Fork-based (efficient memory sharing)

  • Windows: PSOCK cluster (explicit variable export)

  • Core Allocation: detectCores() - 1 (reserves 1 core for system)

  • Tiling Threshold: Automatic for rasters >5M cells
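
The rules in this list can be expressed as a small helper that decides the worker count, cluster type, and whether to tile. This is a sketch of the decision logic only, assuming the thresholds stated above; the function name `parallel_plan` and its arguments are hypothetical and do not mirror the internals of SOCastR.

```r
library(parallel)

# Decide worker count, cluster type and tiling from the rules listed above.
parallel_plan <- function(n_cells,
                          os_type = .Platform$OS.type,
                          cores = detectCores(),
                          tile_threshold = 5e6) {
  if (is.na(cores)) cores <- 1L   # detectCores() can return NA
  list(
    workers = max(1L, as.integer(cores) - 1L),   # keep one core free
    cluster_type = if (os_type == "windows") "PSOCK" else "FORK",
    use_tiles = n_cells > tile_threshold
  )
}

plan <- parallel_plan(n_cells = 12e6, os_type = "unix", cores = 8)

# The plan would then be used roughly like this (not run here):
# cl <- makeCluster(plan$workers, type = plan$cluster_type)
# ... parallel work ...
# stopCluster(cl)
```

With a PSOCK cluster, objects and packages have to be exported to the workers explicitly (for example with `clusterExport()`), whereas forked workers inherit the parent session.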

Citation & Reproducibility

Recommended Citation Format:

Möller, M. (2025). SOCastR: Soil Organic Carbon Prediction Workflow with Uncertainty Quantification (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.17503422

CITATION.cff: Included in repository root (CFF v1.2.0 compliant)

Reproducibility Features:

  • Fixed random seed (42) for spatial fold creation and model initialization

  • Complete parameter documentation with defaults

  • Version-controlled code via Git

  • DOI assigned to each release

  • All outputs include execution metadata and processing logs

Known Limitations & Caveats

  1. Random Forest Constraints: Predictions plateau at training data mean in extrapolation zones (high DI)

  2. Residual Spatial Autocorrelation Bias: KNNDM mitigates but does not eliminate the optimistic bias that spatial dependence introduces into performance estimates

  3. Quantile Interval Coverage: QRF prediction intervals may not achieve nominal 90% coverage in all regions

  4. AOA Threshold Sensitivity: Threshold is training-data dependent; may be overly conservative or permissive

  5. Temporal Stationarity: Assumes SOC-covariate relationships constant over training period; recalibration recommended for projections

  6. Covariate Quality Critical: Predictions only as reliable as input covariates; missing key drivers reduces accuracy

  7. Computational Trade-offs: FFS time scales quadratically with covariate count; tiling granularity affects memory/IO efficiency
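
The quadratic scaling of FFS can be made explicit with a worst-case count. Assuming the usual forward-selection scheme (fit every pair of covariates, keep the best pair, then add one covariate per round until none remain), the number of candidate models for n covariates is n(n−1)/2 for the pairs plus (n−2) + (n−3) + … + 1 for the later rounds, which simplifies to (n−1)². The helper below is a hypothetical illustration; in practice FFS stops early once no candidate improves the score, and each candidate is itself fitted once per CV fold.

```r
# Upper bound on the number of candidate models evaluated by forward feature selection.
ffs_max_models <- function(n_covariates) {
  stopifnot(n_covariates >= 2)
  pairs <- choose(n_covariates, 2)
  # each later round tests every remaining covariate once
  additions <- if (n_covariates > 2) sum(seq_len(n_covariates - 2)) else 0
  pairs + additions
}

counts <- sapply(c(5, 10, 20, 50), ffs_max_models)
```

For the 20-covariate reference case this gives at most 361 candidate models, compared with 81 for 10 covariates, which is why trimming the covariate stack before FFS pays off.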

Installation & Usage Quick Start

Installation

 
```r
# Clone repository
# git clone https://github.com/JKI-GDM/SOCastR.git
# OR download ZIP and extract

# Set working directory and source main script
setwd("SOCastR")
source("SOCastR.R")

# Verify installation
exists("SOCastR")  # Should return TRUE
```

Basic Usage

 
```r
# Prepare data:
# - input/SAMPLES_EPSG25832.shp (soil samples with SOC column)
# - input/COVARIATES_EPSG25832.tif (multi-band environmental raster)

# Run workflow with defaults
SOCastR(
  workingdir = getwd(),
  inputdir = "input",
  outputdir = "output",
  samples = "SAMPLES_EPSG25832.shp",
  covariates = "COVARIATES_EPSG25832.tif",
  soccolumn = "SOC",
  n.tile = 4,  # 4x4 tile grid = 16 tiles
  modeluncertainty = TRUE,
  distanceuncertainty = TRUE
)
```
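
The `n.tile` argument defines an n × n grid of tiles, so `n.tile = 4` yields 16 tiles. How SOCastR places the tile boundaries is not described here; the sketch below shows one plausible scheme in base R that splits row and column indices into near-equal blocks. The function names `split_range` and `tile_grid` are hypothetical.

```r
# Split the indices 1..n into `parts` contiguous, near-equal blocks.
split_range <- function(n, parts) {
  breaks <- (n * (0:parts)) %/% parts
  data.frame(start = breaks[-(parts + 1)] + 1, end = breaks[-1])
}

# Row/column index ranges for an n.tile x n.tile grid of tiles.
tile_grid <- function(n_rows, n_cols, n.tile) {
  rr <- split_range(n_rows, n.tile)
  cc <- split_range(n_cols, n.tile)
  idx <- expand.grid(row_block = seq_len(n.tile), col_block = seq_len(n.tile))
  data.frame(
    row_start = rr$start[idx$row_block], row_end = rr$end[idx$row_block],
    col_start = cc$start[idx$col_block], col_end = cc$end[idx$col_block]
  )
}

tiles <- tile_grid(n_rows = 1000, n_cols = 1200, n.tile = 4)
cells_per_tile <- (tiles$row_end - tiles$row_start + 1) *
  (tiles$col_end - tiles$col_start + 1)
```

A larger `n.tile` lowers peak memory per worker at the cost of more read/write operations, which is the trade-off noted under the known limitations.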

Supporting Documentation

  • README.md: Quick start, installation, basic usage

  • CITATION.cff: Standardized software citation metadata (CFF v1.2.0)

  • LICENSE: MIT open-source license

  • SOCastR.R: Fully documented R source code with inline comments

  • SOCastR-FAIR-Docs.md: Comprehensive FAIR documentation (this document)

References & Key Literature

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

  • Meinshausen, N. (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983-999.

  • Meyer, H., & Pebesma, E. (2022). Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications, 13, 2208.

  • ISO 19157:2013. Geographic information — Data quality

  • ISO 19115-3:2016. Geographic information — Metadata — Part 3: XML schema implementation for fundamental concepts

Workflow Version: 1.0.0
Last Updated: 2025-11-01
Author Contact: markus.moeller@julius-kuehn.de
Institution: Julius Kühn Institute (JKI), Federal Research Centre for Cultivated Plants, Germany

Files

SOCastR-main.zip (60.5 kB)
md5:6aff4e65a5b67a6a47e9e358b424a053

Additional details

Related works

Is supplement to
Dataset: 10.5281/zenodo.17479134 (DOI)
Report: 10.3220/253-2025-220 (DOI)

Funding

Deutsche Forschungsgemeinschaft
FAIRe Dateninfrastruktur für die Agrosystemforschung 501899475

Software

Repository URL
https://gitea.julius-kuehn.de/markus.moeller/SOCastR/
Programming language
R
Development Status
Active