Published October 6, 2025 | Version 1.10
Dataset Open

CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting

  • 1. Wageningen University & Research
  • 2. Technical University of Munich
  • 3. Purdue University West Lafayette
  • 4. Ankara University
  • 5. University of Maryland, College Park
  • 6. NASA GISS
  • 7. Université Mohammed VI Polytechnique
  • 8. Vrije Universiteit Amsterdam
  • 9. Potsdam Institute for Climate Impact Research
  • 10. University of Manitoba
  • 11. Universitat de València
  • 12. Seidor Consulting
  • 13. International Crops Research Institute for the Semi-Arid Tropics
  • 14. International Institute of Tropical Agriculture
  • 15. CSIRO
  • 16. Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement
  • 17. Leibniz Centre for Agricultural Landscape Research
  • 18. Texas A&M University - College Station
  • 19. Helmholtz Centre for Environmental Research
  • 20. Joint Research Centre

Description


Overview

CY-Bench is a dataset and benchmark for subnational crop yield forecasting, with coverage of major crop growing countries of the world for maize and wheat. By subnational, we mean the administrative level where yield statistics are published. When statistics are available for multiple levels, we pick the highest resolution. The dataset combines sub-national yield statistics with relevant predictors, such as growing-season weather indicators, remote sensing indicators, evapotranspiration, soil moisture indicators, and static soil properties. CY-Bench has been designed and curated by agricultural experts, climate scientists, and machine learning researchers from the AgML Community, with the aim of facilitating model intercomparison across the diverse agricultural systems around the globe in conditions as close as possible to real-world operationalization. Ultimately, by lowering the barrier to entry for ML researchers in this crucial application area, CY-Bench will facilitate the development of improved crop forecasting tools that can be used to support decision-makers in food security planning worldwide.

* Crops : Wheat & Maize
* Spatial Coverage : Wheat (29 countries), Maize (38 countries).
  See CY-Bench Summary for the list of countries.
* Temporal Coverage : Varies. See CY-Bench Summary.

Data 

Data format


The benchmark data is organized as a collection of CSV files (with the exception of location information, see below), with each file representing a specific category of variable for a particular country. Each CSV file is named according to the category and the country it pertains to, facilitating easy identification and retrieval. The data within each CSV file is structured in tabular format, where rows represent observations and columns represent different predictors related to a category of variable.

Data content

All data files are provided as .csv.

| Data | Description | Variables (units) | Temporal resolution | Data source (reference) |
|---|---|---|---|---|
| crop_calendar | start and end of growing season | sos (day of the year), eos (day of the year) | static | WorldCereal (Franch et al., 2022) |
| crop_mask | crop area fraction | crop_area (km2), crop_area_percentage (%) | static | WorldCereal (Van Tricht et al., 2023; EC-JRC, 2024) |
| fpar | fraction of absorbed photosynthetically active radiation | fpar (%) | dekadal (3 times a month; 1-10, 11-20, 21-31) | European Commission's Joint Research Centre (EC-JRC, 2024) |
| ndvi | normalized difference vegetation index | - | approximately weekly | MOD09CMG (Vermote, 2015) |
| meteo | temperature, precipitation (prec), radiation, potential evapotranspiration (et0), climatic water balance (= prec - et0) | tmin (°C), tmax (°C), tavg (°C), prec (mm), et0 (mm), cwb (mm), rad (J m-2 day-1) | daily | AgERA5 (Boogaard et al., 2022) |
| soil_moisture | surface soil moisture, rootzone soil moisture | ssm (kg m-2), rsm (kg m-2) | daily | GLDAS (Rodell et al., 2004) |
| soil | available water capacity, bulk density, drainage class | awc (cm m-1), bulk_density (kg dm-3), drainage_class (category) | static | WISE Soil database (Batjes, 2016) |
| location | centroid | latitude, longitude, region_area (km2) | static | |
| yield | end-of-season yield | yield (t ha-1) | yearly | Various country or region specific sources (see crop_statistics_... in https://github.com/WUR-AI/AgML-CY-Bench/tree/main/data_preparation) |
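
The dekadal convention used for fpar (days 1-10, 11-20, and 21 to the end of the month) and the climatic water balance definition (cwb = prec - et0) can be sketched in a few lines of Python. This is a minimal illustration and not part of the CY-Bench code base: the function names `dekad_of` and `dekadal_cwb` are hypothetical, and the daily values in the sample are made up.

```python
from collections import defaultdict
from datetime import date


def dekad_of(day: date) -> int:
    """Return the dekad of the month (1, 2, or 3) for a date.

    Days 1-10 fall in dekad 1, days 11-20 in dekad 2, and day 21
    through the end of the month in dekad 3.
    """
    return min((day.day - 1) // 10 + 1, 3)


def dekadal_cwb(daily):
    """Sum the climatic water balance (prec - et0, in mm) per dekad.

    `daily` is an iterable of (date, prec_mm, et0_mm) tuples.
    Returns a dict keyed by (year, month, dekad).
    """
    totals = defaultdict(float)
    for day, prec, et0 in daily:
        totals[(day.year, day.month, dekad_of(day))] += prec - et0
    return dict(totals)


# Made-up daily values (date, prec in mm, et0 in mm), for illustration only.
sample = [
    (date(2020, 1, 10), 5.0, 2.0),
    (date(2020, 1, 11), 0.0, 3.0),
    (date(2020, 1, 31), 4.0, 1.0),
]
totals = dekadal_cwb(sample)
```

Aggregating daily meteo values to dekads in this way is one possible route to aligning them with the dekadal fpar series; other aggregation periods may suit a given model better.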

 

Folder structure

  1. cybench-data: The CY-Bench dataset is structured first by crop type and then by country. For each country, the folder name is the ISO 3166-1 alpha-2 two-character code. A separate .csv file is provided for the crop calendar and for each predictor, as shown below. The CSV files are named after the variable, crop type, and country, e.g. **variable_croptype_country.csv**.
    ```
    CY-Bench

    └─── maize
    │   │
    │   └─── AO
    │   │   -- crop_calendar_maize_AO.csv
    │   │   -- crop_mask_maize_AO.csv
    │   │   -- fpar_maize_AO.csv
    │   │   -- location_maize_AO.csv
    │   │   -- meteo_maize_AO.csv
    │   │   -- ndvi_maize_AO.csv
    │   │   -- soil_maize_AO.csv
    │   │   -- soil_moisture_maize_AO.csv
    │   │   -- yield_maize_AO.csv
    │   │ 
    │   └─── AR
    │       -- crop_calendar_maize_AR.csv
    │       -- crop_mask_maize_AR.csv
    │       -- fpar_maize_AR.csv
    │       -- ...
    │   
    └─── wheat
    │   │
    │   └─── AR
    │   │   -- crop_calendar_wheat_AR.csv
    │   │   -- crop_mask_wheat_AR.csv
    │   │   -- fpar_wheat_AR.csv
    │   │   ...
    ```

    Example : CSV data content for maize in country X

    ```
    X
    └─── crop_calendar_maize_X.csv
    │   -- crop_name (name of the crop)
    │   -- adm_id (unique identifier for a subnational unit)
    │   -- sos (start of crop season)
    │   -- eos (end of crop season)

    └─── crop_mask_maize_X.csv
    │   -- crop_name
    │   -- adm_id 
    │   -- crop_area
    │   -- crop_area_percentage
    │   
    └─── fpar_maize_X.csv
    │   -- crop_name
    │   -- adm_id 
    │   -- date (in the format YYYYMMdd)
    │   -- fpar

    └─── location_maize_X.csv
    │   -- crop_name
    │   -- adm_id 
    │   -- latitude
    │   -- longitude
    │   -- region_area

    └─── meteo_maize_X.csv
    │   -- crop_name
    │   -- adm_id 
    │   -- date (in the format YYYYMMdd)
    │   -- tmin (minimum temperature)
    │   -- tmax (maximum temperature)
    │   -- prec (precipitation)
    │   -- rad (radiation)
    │   -- tavg (average temperature)
    │   -- et0 (potential evapotranspiration)
    │   -- vpd (vapor pressure deficit)
    │   -- cwb (climatic water balance)
    │ 
    └─── ndvi_maize_X.csv
    │   -- crop_name
    │   -- adm_id
    │   -- date (in the format YYYYMMdd)
    │   -- ndvi  
    │   
    └─── soil_maize_X.csv
    │   -- crop_name
    │   -- adm_id
    │   -- awc (available water capacity)
    │   -- bulk_density
    │   -- drainage_class
    │   
    └─── soil_moisture_maize_X.csv
    │   -- crop_name
    │   -- adm_id
    │   -- date (in the format YYYYMMdd)
    │   -- ssm (surface soil moisture)
    │   -- rsm (rootzone soil moisture)
    │   
    └─── yield_maize_X.csv
    │   -- crop_name
    │   -- country_code
    │   -- adm_id
    │   -- harvest_year
    │   -- yield
    │   -- harvest_area
    │   -- production
    ```

  2. centroids.zip and polygons.zip contain the geometries of the administrative regions as centroids (x and y coordinates) and polygons (multipolygons), respectively. They are organized as follows:

    centroids

    │   └─── AO
    │   │   -- AO.cpg
    │   │   -- AO.dbf
    │   │   -- AO.prj
    │   │   -- AO.shp
    │   │   -- AO.shx
    │   └─── AR
    │   │   -- AR.cpg
    │   │   -- AR.dbf
    │   │   -- AR.prj
    │   │   -- AR.shp
    │   │   -- AR.shx

    ...

    polygons

    │   └─── AO
    │   │   -- AO.cpg
    │   │   -- AO.dbf
    │   │   -- AO.prj
    │   │   -- AO.shp
    │   │   -- AO.shx
    │   └─── AR
    │   │   -- AR.cpg
    │   │   -- AR.dbf
    │   │   -- AR.prj
    │   │   -- AR.shp
    │   │   -- AR.shx

    ...
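
As a rough sketch of how the naming convention and column layout described above might be used, the Python snippet below builds the path to a file from its variable, crop, and country code, parses `YYYYMMdd` dates, and keeps only the rows that fall inside the growing season. The helper names (`cybench_csv_path`, `parse_date`, `in_season`), the root folder passed in, and the treatment of seasons that cross a year boundary (`sos` greater than `eos`) are assumptions made for illustration; they are not taken from the CY-Bench code base. The sample rows and the `sos`/`eos` values are made up.

```python
import csv
import io
from datetime import date, datetime
from pathlib import Path


def cybench_csv_path(root, variable: str, crop: str, country: str) -> Path:
    """Build <root>/<crop>/<country>/<variable>_<crop>_<country>.csv."""
    return Path(root) / crop / country / f"{variable}_{crop}_{country}.csv"


def parse_date(text: str) -> date:
    """Parse a date written as YYYYMMdd, e.g. '20200315'."""
    return datetime.strptime(text, "%Y%m%d").date()


def in_season(day: date, sos: int, eos: int) -> bool:
    """Check whether a date falls between sos and eos (days of the year).

    Assumption: when sos > eos, the season wraps around the new year.
    """
    doy = day.timetuple().tm_yday
    if sos <= eos:
        return sos <= doy <= eos
    return doy >= sos or doy <= eos


# Made-up rows in the layout of fpar_maize_X.csv, for illustration only.
sample_csv = io.StringIO(
    "crop_name,adm_id,date,fpar\n"
    "maize,X-01,20200101,10.0\n"
    "maize,X-01,20200601,55.0\n"
)
rows = [
    row
    for row in csv.DictReader(sample_csv)
    if in_season(parse_date(row["date"]), sos=100, eos=250)
]
```

With the extracted dataset on disk, the in-memory sample would be replaced by a file handle, for example `open(cybench_csv_path("CY-Bench", "fpar", "maize", "AO"), newline="")`, with `sos` and `eos` read per `adm_id` from the matching crop calendar file.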

Data access

The full dataset can be downloaded directly from Zenodo or with the `zenodo_get` library.
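
Once an archive has been downloaded, it can be checked against the md5 checksum shown for it in the Files section of this record. The sketch below uses only Python's standard `hashlib` module; the function name `md5_of` and the path and checksum in the usage comment are placeholders, not names taken from the record.

```python
import hashlib


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (placeholder path and checksum):
# assert md5_of("path/to/archive.zip") == "<md5 listed on the record page>"
```

Reading in chunks keeps memory use flat regardless of archive size, which matters for the largest file in this record.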


License and citation


We kindly ask all users of CY-Bench to properly respect licensing and citation conditions of the datasets included.

 

Version Notes

1.0 is the dataset submitted to NeurIPS Datasets and Benchmarks Track. The paper and discussions are here: https://openreview.net/forum?id=jkJDNG468g#discussion

1.1 and 1.2 fix some issues with column names and mismatches in adm_id between yield data and input data.

1.3 includes location information in the form of centroids and polygons of admin regions.

1.4 updates the fpar data for 2023. fpar data was incomplete for 2023 in earlier versions (due to unavailability in the data source itself).

1.5 fixes an issue in the crop calendar.

1.6 fixes an issue in the ndvi time series.

1.7 updates storage precision to 3 decimal places to reduce data size.

1.8 filters out invalid yield values.

1.9 adds vpd and location. ET0 is now obtained from AgERA5 (was AQUASTAT-FAO), and AgERA5 2.0 is used (was AgERA5 1.1).

1.10 adds region_area to location_*.csv and adds crop_mask_*.csv. It also fixes an error in the yield data for Australia.

Files

Files (6.3 GB)

  • 329.3 kB (md5:3610ce8ec638641fa60e42d1021b989e)
  • 6.2 GB (md5:8ec510982f396469eca25a4f0fa8632e)
  • 105.0 MB (md5:815d0e94f6746f15febb99b627142a04)

Additional details

References