Benchmark dataset for agricultural KGML model development with PyKGML
Authors/Creators
Description
This benchmark dataset serves as demonstration data for testing PyKGML, a Python library for the efficient development of knowledge-guided machine learning (KGML) models.
The dataset was developed using agroecosystem data from two KGML studies:
1. "KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N2O Emission using Data from Mesocosm Experiments".
Licheng Liu, Shaoming Xu, Zhenong Jin*, Jinyun Tang, Kaiyu Guan, Timothy J. Griffis, Matt D. Erickson, Alexander L. Frie, Xiaowei Jia, Taegon Kim, Lee T. Miller, Bin Peng, Shaowei Wu, Yufeng Yang, Wang Zhou, Vipin Kumar.
2. "Knowledge-guided machine learning can improve carbon cycle quantification in agroecosystems".
Licheng Liu, Wang Zhou, Kaiyu Guan, Bin Peng, Shaoming Xu, Jinyun Tang, Qing Zhu, Jessica Till, Xiaowei Jia, Chongya Jiang, Sheng Wang, Ziqi Qin, Hui Kong, Robert Grant, Symon Mezbahuddin, Vipin Kumar, Zhenong Jin.
1. CO2 dataset:
co2_pretrain_data:
- 100 samples (100 sites) of synthetic data generated by ecosys.
- Each sample is a daily sequence of 6570 time steps over 18 years (2001-2018).
- 19 input_features and 3 output_features.
- Data split: the first 16 years for training, and the last two years for testing.
Input features (19):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), (max-min) air T (TDIF_AIR), max air humidity (HMAX_AIR), (max-min) air humidity (HDIF_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (9): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), field capacity (TFC), wilting point (TWP), saturated hydraulic conductivity (TKSat), soil organic carbon concentration (TSOC), pH (TPH), cation exchange capacity (TCEC).
- Other (3): year (Year), crop type (Crop_Type), gross primary productivity (GPP).
Output features (3):
- Autotrophic respiration (Ra), heterotrophic respiration (Rh), net ecosystem exchange (NEE).
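The shapes and the year-based split described above can be checked with a little arithmetic. The sketch below uses a synthetic zero-filled array, not the real data, and assumes 365-day years (6570 / 18 = 365). It only illustrates where the 16-year/2-year boundary falls in the [samples, sequences, features] layout.

```python
import numpy as np

DAYS_PER_YEAR = 365              # assumption: 6570 / 18 = 365, so no leap days
N_YEARS, N_TRAIN_YEARS = 18, 16

# Synthetic stand-in with the documented layout [samples, sequences, features].
n_sites, n_inputs, n_outputs = 100, 19, 3
X = np.zeros((n_sites, N_YEARS * DAYS_PER_YEAR, n_inputs), dtype=np.float32)
Y = np.zeros((n_sites, N_YEARS * DAYS_PER_YEAR, n_outputs), dtype=np.float32)

# Day index at which the two testing years begin.
split = N_TRAIN_YEARS * DAYS_PER_YEAR

X_train, X_test = X[:, :split], X[:, split:]
Y_train, Y_test = Y[:, :split], Y[:, split:]

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```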
co2_finetune_data:
- One sample of field observations (11 sites were concatenated into one sequence due to their varied sequence lengths).
- A daily sequence totaling 124 site-years (45260 days in length).
- 19 input_features and 2 output_features.
- Data split: the last two years of each site were combined as the testing data, and the rest were included in the training data.
Input features (19):
- The same as co2_pretrain_data.
Output features (2):
- Ecosystem respiration (Reco, Reco = Ra + Rh), net ecosystem exchange (NEE).
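The relation Reco = Ra + Rh links the three pretraining outputs to the two fine-tuning outputs. A minimal sketch of that mapping is given below. The column order (Ra, Rh, NEE) is an assumption taken from the listing order and should be checked against output_features in the actual files; to_finetune_targets is a hypothetical helper, not part of PyKGML.

```python
import numpy as np

# Assumed column order, following the listing for co2_pretrain_data.
pretrain_outputs = ["Ra", "Rh", "NEE"]

def to_finetune_targets(y):
    """Map [..., (Ra, Rh, NEE)] to [..., (Reco, NEE)] using Reco = Ra + Rh."""
    ra = y[..., pretrain_outputs.index("Ra")]
    rh = y[..., pretrain_outputs.index("Rh")]
    nee = y[..., pretrain_outputs.index("NEE")]
    return np.stack([ra + rh, nee], axis=-1)

# Tiny synthetic example: 1 sample, 2 days, 3 outputs.
y_demo = np.array([[[1.0, 2.0, -0.5],
                    [0.5, 1.5, 0.25]]])
y_ft = to_finetune_targets(y_demo)
print(y_ft.shape)
```

The sum is only meaningful in physical units, so values that have been normalized with a scaler would need to be converted back before applying it.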
2. N2O dataset:
n2o_pretrain_data:
- 1980 simulations at 99 counties x 20 N-fertilizer rates in the 3I states (Illinois, Iowa, Indiana); synthetic data generated by ecosys.
- Daily sequences over 18 years (2001-2018).
- Data split: the first 16 years for training, and the last two years for testing.
Input variables (16):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC).
- Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
Output variables (5):
- N2O fluxes (N2O_FLUX), soil CO2 fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
n2o_finetune_augment_data:
- Observations of 6 chambers in a mesocosm environment.
- Daily sequences of 122 days x 3 years (2016-2018).
- 1000 augmentations from hourly data at each chamber (6000 x 122 x 3 in total length).
- Data split: 5 chambers as the training data, and the other one as the testing data.
Input variables (16):
- Meteorological (7): solar radiation (RADN), max air T (TMAX_AIR), min air T (TMIN_AIR), max air humidity (HMAX_AIR), min air humidity (HMIN_AIR), wind speed (WIND), precipitation (PRECN).
- Soil properties (6): bulk density (TBKDS), sand content (TSAND), silt content (TSILT), pH (TPH), cation exchange capacity (TCEC), soil organic carbon concentration (TSOC).
- Management (3): N-fertilizer rate (FERTZR_N), planting day of year (PDOY), crop type (PLANTT).
Output variables (5):
- N2O fluxes (N2O_FLUX), soil CO2 fluxes (CO2_FLUX), soil water content at 10 cm (WTR_3), soil ammonium concentration at 10 cm (NH4_3), soil nitrate concentration at 10 cm (NO3_3).
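The chamber-level split for n2o_finetune_augment_data (5 chambers for training, 1 for testing, 1000 augmentations each) can be sketched as follows. The chamber_ids labelling and the choice of held-out chamber are hypothetical and used only for illustration; the published files already contain the split arrays, and their actual sample ordering is not documented here.

```python
import numpy as np

N_CHAMBERS, N_AUG = 6, 1000

# Hypothetical labelling: sample i is assumed to belong to chamber i // N_AUG.
chamber_ids = np.repeat(np.arange(N_CHAMBERS), N_AUG)

test_chamber = 5  # arbitrary choice for this demo
train_idx = np.flatnonzero(chamber_ids != test_chamber)
test_idx = np.flatnonzero(chamber_ids == test_chamber)

print(len(train_idx), len(test_idx))
```

Holding out a whole chamber, rather than random samples, keeps augmentations of the same chamber from appearing on both sides of the split.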
Each data file contains:
- X_train, X_test: Feature matrices for training and testing. 3 dimensions [samples, sequences, input_features].
- Y_train, Y_test: Target values for training and testing. 3 dimensions [samples, sequences, output_features].
- x_scaler: The scaler (mean, std) used for normalizing input features. 2 dimensions [[mean, std], input_features].
- y_scaler: The scaler (mean, std) used for normalizing output features. 2 dimensions [[mean, std], output_features].
- input_features: A list of input feature names.
- output_features: A list of output feature names.
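A scaler of shape [[mean, std], features] can be applied by broadcasting over the last axis. The sketch below assumes that row 0 holds the means and row 1 the standard deviations, as the description suggests, and that standard z-score normalization is used. The helper names normalize and denormalize are illustrative, not PyKGML functions, and the numbers are synthetic.

```python
import numpy as np

def normalize(x, scaler):
    """Z-score x along its last axis; scaler[0] = mean, scaler[1] = std (assumed)."""
    mean, std = scaler[0], scaler[1]
    return (x - mean) / std

def denormalize(z, scaler):
    """Invert normalize(), returning values in physical units."""
    mean, std = scaler[0], scaler[1]
    return z * std + mean

# Synthetic example with 2 features.
x_scaler = np.array([[10.0, 0.5],    # means
                     [2.0, 0.1]])    # standard deviations
x = np.array([[[12.0, 0.6],
               [8.0, 0.4]]])         # [1 sample, 2 days, 2 features]

z = normalize(x, x_scaler)
x_back = denormalize(z, x_scaler)
print(z)
```

The same pair of functions would apply to y_scaler, for example to convert model predictions back to physical units before computing error metrics.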
The PyTorch function torch.load() can be used to load data:
import torch
# data_path: directory holding the downloaded files, ending with a path separator
co2_finetune_file = data_path + 'co2_finetune_data.sav'
data = torch.load(co2_finetune_file, weights_only=False)
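After loading, it can help to confirm that the object has the expected entries and shapes. The sketch below assumes the loaded object is dict-like and keyed by the names listed in this description (X_train, x_scaler, input_features, and so on), which should be verified on the real file. summarize is a hypothetical helper, and the demo dictionary is a synthetic stand-in rather than loaded data.

```python
import numpy as np

def summarize(data):
    """Return {key: shape, length, or type name} for a dict-like object."""
    summary = {}
    for key, value in data.items():
        if hasattr(value, "shape"):        # arrays and tensors
            summary[key] = tuple(value.shape)
        elif hasattr(value, "__len__"):    # lists of feature names
            summary[key] = len(value)
        else:
            summary[key] = type(value).__name__
    return summary

# Synthetic stand-in for a loaded file (assumed structure).
demo = {
    "X_train": np.zeros((1, 10, 19)),
    "Y_train": np.zeros((1, 10, 2)),
    "x_scaler": np.zeros((2, 19)),
    "input_features": ["RADN", "TMAX_AIR"],
}
info = summarize(demo)
for name, desc in info.items():
    print(name, desc)
```

Because the helper only relies on a shape attribute, it should work equally for NumPy arrays and PyTorch tensors.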
Please download and use the latest version of this dataset, as it contains important updates.
Contact: Dr. Licheng Liu (lichengl@umn.edu), Dr. Yufeng Yang (yang6956@umn.edu)
Files (1.4 GB)
| MD5 checksum | Size |
|---|---|
| 1192d9a50723f5c45ab76fb7e2546a9b | 3.8 MB |
| 33d24959a000869fb6c4c508a2e17f88 | 57.8 MB |
| 0e9b33870543113cc3ee17f744a46abe | 228.4 MB |
| 0b2057e03c516a21c5b7a6d240094fc6 | 1.1 GB |
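The MD5 checksums listed for the files can be used to verify a download. A minimal sketch using only the Python standard library follows. The function names are illustrative, and which checksum belongs to which file name has to be taken from the record page itself.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_md5):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected

# Usage (the path is a placeholder for a downloaded file):
# ok = matches_checksum("path/to/downloaded_file.sav", "md5:<checksum from the table>")
```

Reading in chunks keeps memory use low, which matters for the largest file (1.1 GB).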
Additional details
Dates
- Created: 2025-09-16 (Updated scalers)