Benchmark dataset for agricultural KGML model development with PyKGML

Yang, Yufeng; Liu, Licheng

doi:10.5281/zenodo.17137916

Published September 16, 2025 | Version v5

Dataset Open

1. University of Minnesota Twin Cities
2. University of Minnesota

This benchmark dataset serves as demonstration data for testing PyKGML, a Python library for the efficient development of knowledge-guided machine learning (KGML) models.

The dataset was developed using agroecosystem data from two KGML studies:

1. "KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N₂O Emission using Data from Mesocosm Experiments".

Licheng Liu, Shaoming Xu, Zhenong Jin*, Jinyun Tang, Kaiyu Guan, Timothy J. Griffis, Matt D. Erickson, Alexander L. Frie, Xiaowei Jia, Taegon Kim, Lee T. Miller, Bin Peng, Shaowei Wu, Yufeng Yang, Wang Zhou, Vipin Kumar.

2. "Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems".

Licheng Liu, Wang Zhou, Kaiyu Guan, Bin Peng, Shaoming Xu, Jinyun Tang, Qing Zhu, Jessica Till, Xiaowei Jia, Chongya Jiang, Sheng Wang, Ziqi Qin, Hui Kong, Robert Grant, Symon Mezbahuddin, Vipin Kumar, Zhenong Jin.

All the files belong to the corresponding author, Dr. Licheng Liu, at University of Minnesota (lichengl@umn.edu).

This dataset has two parts: the CO₂ data from study 2 and the N₂O data from study 1. Each part contains a pre-training subset and a fine-tuning subset. Data descriptions are as follows:

1. CO₂ dataset:

co2_pretrain_data:
- 100 samples (100 sites) of synthetic data generated by ecosys.
- Each sample is a daily sequence of length 6570, covering 18 years (2001-2018).
- 19 input_features and 3 output_features.
- Data split: the first 16 years for training, and the last two years for testing.
Input features (19):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), (max-min) air T (TDIF_AIR), max air humidity (HMAX_AIR), (max-min) air humidity (HDIF_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (9): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), field capacity (TFC), wilting point (TWP), saturated hydraulic conductivity (TKSat), soil organic carbon concentration (TSOC), pH (TPH), cation exchange capacity (TCEC)
- Other (3): year (Year), crop type (Crop_Type), gross primary productivity (GPP)
Output features (3):
- Autotrophic respiration (Ra), heterotrophic respiration (Rh), net ecosystem exchange (NEE).
co2_finetune_data:
- One sample of field observations (11 sites were concatenated into one sequence due to their varied sequence lengths).
- A daily sequence totaling 124 site-years (45260 in length).
- 19 input_features and 2 output_features.
- Data split: the last two years of each site were combined as the testing data, and the rest were included in the training data.
Input features (19):
- The same as co2_pretrain_data.
Output features (2):
- Ecosystem respiration (Reco, Reco = Ra + Rh), net ecosystem exchange (NEE).
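The sequence lengths quoted for the CO₂ data are consistent with 365-day years (6570 = 18 x 365 and 45260 = 124 x 365). The sketch below reproduces that arithmetic and shows where a 16-year / 2-year split would fall along the time axis of a pre-training sample. The constant and function names are illustrative, and the assumption that leap days are not included is inferred from the numbers rather than stated by the authors.

```python
DAYS_PER_YEAR = 365  # assumption: leap days are not included


def sequence_length(n_years: int) -> int:
    """Number of daily time steps covering n_years."""
    return n_years * DAYS_PER_YEAR


pretrain_len = sequence_length(18)    # 2001-2018
finetune_len = sequence_length(124)   # 124 site-years, concatenated
train_len = sequence_length(16)       # first 16 years of a pre-training sample
test_len = pretrain_len - train_len   # last 2 years of a pre-training sample

print(pretrain_len, finetune_len, train_len, test_len)
# 6570 45260 5840 730
```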

2. N₂O dataset:

n2o_pretrain_data:
- 1980 simulations at 99 counties x 20 N-fertilizer rates in the 3I states (Illinois, Iowa, Indiana); synthetic data generated by ecosys.
- Daily sequences over 18 years (2001-2018).
- Data split: the first 16 years for training, and the last two years for testing.
Input variables (16):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC)
- Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
Output variables (5):
- N₂O fluxes (N2O_FLUX), soil CO₂ fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
n2o_finetune_augment_data:
- Observations of 6 chambers in a mesocosm environment.
- Daily sequences of 122 days x 3 years (2016-2018).
- 1000 augmented daily sequences generated from the hourly data of each chamber (6000 x 122 x 3 in total length).
- Data split: 5 chambers as the training data, and the other one as the testing data.
Input variables (16):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC)
- Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
Output variables (5):
- N₂O fluxes (N2O_FLUX), soil CO₂ fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
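As a cross-check of the N₂O description, the sketch below collects the variable names from the bullets above and recomputes the counts: 16 input names, the five output names that are listed, 99 x 20 = 1980 simulations, and 6 x 1000 = 6000 augmented samples of 122 x 3 = 366 days each. The grouping dictionary and variable names are illustrative. The column order stored in the files is an assumption here and is best read from the `input_features` and `output_features` entries of each file.

```python
# Names copied from the description; the stored column order is an assumption.
N2O_INPUTS = {
    "meteorological": ["RADN", "TMAX_AIR", "TMIN_AIR", "HMAX_AIR",
                       "HMIN_AIR", "WIND", "PRECN"],
    "soil": ["TBKDS", "TSAND", "TSILT", "TPH", "TCEC", "TSOC"],
    "management": ["FERTZR_N", "PDOY", "PLANTT"],
}
N2O_OUTPUTS = ["N2O_FLUX", "CO2_FLUX", "WTR_3", "NH4_3", "NO3_3"]

# Flatten the groups into one list of input names.
input_names = [name for group in N2O_INPUTS.values() for name in group]

n_pretrain_sims = 99 * 20    # counties x N-fertilizer rates
n_augmented = 6 * 1000       # chambers x augmentations per chamber
days_per_sample = 122 * 3    # 122 days in each of 3 years

print(len(input_names), len(N2O_OUTPUTS))
# 16 5
print(n_pretrain_sims, n_augmented, days_per_sample)
# 1980 6000 366
```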

Each file is a serialized Python dictionary containing the following keys and values:

data = {'X_train': X_train,
        'X_test': X_test,
        'Y_train': Y_train,
        'Y_test': Y_test,
        'x_scaler': x_scaler,
        'y_scaler': y_scaler,
        'input_features': input_features,
        'output_features': output_features}

X_train, X_test: Feature matrices for training and testing. 3 dimensions [samples, sequences, input_features].
Y_train, Y_test: Target values for training and testing. 3 dimensions [samples, sequences, output_features].
x_scaler: The scaler (mean, std) used for normalizing input features. 2 dimensions [[mean, std], input_features].
y_scaler: The scaler (mean, std) used for normalizing output features. 2 dimensions [[mean, std], output_features].
input_features: A list of input feature names.
output_features: A list of output feature names.
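Model outputs on the normalized scale usually need to be mapped back to physical units before evaluation. The following is a minimal sketch of that step, assuming z-score normalization (value = (raw - mean) / std) and a scaler indexed as `scaler[0][j]` for the mean and `scaler[1][j]` for the standard deviation of feature `j`, which is one reading of the layout described above. The helper `denormalize` and the toy numbers are hypothetical and are not taken from the dataset.

```python
def denormalize(values, scaler, feature_index):
    """Map z-scored values of one feature back to physical units.

    Assumed layout: scaler[0][j] is the mean and scaler[1][j] is the
    standard deviation of feature j.
    """
    mean = scaler[0][feature_index]
    std = scaler[1][feature_index]
    return [v * std + mean for v in values]


# Toy scaler for two output features (made-up numbers, not dataset values).
output_features = ["Reco", "NEE"]
y_scaler = [[2.0, -1.0],   # means
            [0.5, 4.0]]    # standard deviations

j = output_features.index("NEE")   # look up the column by feature name
nee_physical = denormalize([0.0, 1.0, -0.5], y_scaler, j)
print(nee_physical)
# [-1.0, 3.0, -3.0]
```

With NumPy arrays or tensors the same arithmetic applies elementwise, so `values * std + mean` can replace the list comprehension.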

The PyTorch function torch.load() can be used to load data:

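import torch
# data_path: directory holding the downloaded .sav files, ending with a path separator
co2_finetune_file = data_path + 'co2_finetune_data.sav'
# weights_only=False permits full unpickling; use it only with files from a trusted source
data = torch.load(co2_finetune_file, weights_only=False)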
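After loading, a quick structural check can confirm that the dictionary matches the layout described earlier. The helper below is a hypothetical convenience function, not part of PyKGML: it reports the shape of any value that exposes a `.shape` attribute (as arrays and tensors do) and the length of anything else. The demonstration uses a stand-in object so that the sketch runs without the data files.

```python
from types import SimpleNamespace


def summarize(data):
    """Return {key: shape or length} for a loaded dataset dictionary."""
    summary = {}
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        # Arrays and tensors expose .shape; lists of names only have a length.
        summary[key] = tuple(shape) if shape is not None else len(value)
    return summary


# Stand-in for a loaded file; real entries would be arrays or tensors.
demo = {
    "X_train": SimpleNamespace(shape=(2, 4, 19)),
    "input_features": ["RADN", "WIND"],
}
print(summarize(demo))
# {'X_train': (2, 4, 19), 'input_features': 2}
```

Applied to a real file, the last dimension of `X_train` would be expected to equal the length of `input_features`.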

Please download and use the latest version of this dataset, as it contains important updates.

Contact: Dr. Licheng Liu (lichengl@umn.edu), Dr. Yufeng Yang (yang6956@umn.edu)

Files

Files (1.4 GB)

Name	MD5	Size
co2_finetune_data.sav	1192d9a50723f5c45ab76fb7e2546a9b	3.8 MB
co2_pretrain_data.sav	33d24959a000869fb6c4c508a2e17f88	57.8 MB
n2o_finetune_augment_data.sav	0e9b33870543113cc3ee17f744a46abe	228.4 MB
n2o_pretrain_data.sav	0b2057e03c516a21c5b7a6d240094fc6	1.1 GB
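Because the largest file exceeds 1 GB, it can be worth verifying a download against the MD5 checksums listed above before loading it. The sketch below uses Python's standard `hashlib` module and reads each file in chunks to keep memory use low. The function names `md5sum` and `verify` are illustrative, and the local path passed to `verify` is whatever location the file was saved to.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "co2_finetune_data.sav": "1192d9a50723f5c45ab76fb7e2546a9b",
    "co2_pretrain_data.sav": "33d24959a000869fb6c4c508a2e17f88",
    "n2o_finetune_augment_data.sav": "0e9b33870543113cc3ee17f744a46abe",
    "n2o_pretrain_data.sav": "0b2057e03c516a21c5b7a6d240094fc6",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at path matches the listed checksum."""
    return md5sum(path) == EXPECTED_MD5[filename]
```

For example, `verify('co2_finetune_data.sav', 'co2_finetune_data.sav')` would return True for an intact download saved in the working directory.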

Additional details

Created: 2025-09-16

Updated scalers
