CY-Bench: A comprehensive benchmark dataset for subnational crop yield forecasting
Creators
- Paudel, Dilli (Project leader)1
- Kallenberg, Michiel (Contact person)1
- Ofori-Ampofo, Stella (Contact person)2
- Baja, Hilmy (Contact person)1
- van Bree, Ron (Contact person)1
- Potze, Aike (Contact person)1
- Poudel, Pratishtha (Contact person)3
- Saleh, Abdelrahman (Researcher)4
- Anderson, Weston (Researcher)5
- von Bloh, Malte (Researcher)2
- Castellano, Andres (Researcher)6
- Ennaji, Oumnia (Researcher)7
- Hamed, Raed (Researcher)8
- Laudien, Rahel (Researcher)9
- Lee, Donghoon (Researcher)10
- Luna, Inti (Researcher)11
- Masiliūnas, Dainius (Researcher)1
- Meroni, Michele (Researcher)12
- Mutuku, Janet Mumo (Researcher)13
- Mkuhlani, Siyabusa (Researcher)14
- Richetti, Jonathan (Researcher)15
- Ruane, Alex C. (Researcher)6
- Sahajpal, Ritvik (Researcher)5
- Shuai, Guanyuan (Researcher)5
- Sitokonstantinou, Vasileios (Researcher)11
- de Souza Noia Junior, Rogerio (Researcher)16
- Srivastava, Amit Kumar (Researcher)17
- Strong, Robert (Researcher)18
- Sweet, Lily-belle (Researcher)19
- Vojnović, Petar (Researcher)20
- de Wit, Allard (Researcher)1
- Zachow, Maximilian (Researcher)2
- Athanasiadis, Ioannis N. (Supervisor)1
- 1. Wageningen University & Research
- 2. Technical University of Munich
- 3. Purdue University West Lafayette
- 4. Ankara University
- 5. University of Maryland, College Park
- 6. NASA GISS
- 7. Université Mohammed VI Polytechnique
- 8. Vrije Universiteit Amsterdam
- 9. Potsdam Institute for Climate Impact Research
- 10. University of Manitoba
- 11. Universitat de València
- 12. Seidor Consulting
- 13. International Crops Research Institute for the Semi-Arid Tropics
- 14. International Institute of Tropical Agriculture
- 15. CSIRO
- 16. Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement
- 17. Leibniz Centre for Agricultural Landscape Research
- 18. Texas A&M University - College Station
- 19. Helmholtz Centre for Environmental Research
- 20. Joint Research Centre
Description
CY-Bench: A comprehensive benchmark dataset for sub-national crop yield forecasting
Overview
CY-Bench is a dataset and benchmark for subnational crop yield forecasting, with coverage of major crop growing countries of the world for maize and wheat. By subnational, we mean the administrative level where yield statistics are published. When statistics are available for multiple levels, we pick the highest resolution. The dataset combines sub-national yield statistics with relevant predictors, such as growing-season weather indicators, remote sensing indicators, evapotranspiration, soil moisture indicators, and static soil properties. CY-Bench has been designed and curated by agricultural experts, climate scientists, and machine learning researchers from the AgML Community, with the aim of facilitating model intercomparison across the diverse agricultural systems around the globe in conditions as close as possible to real-world operationalization. Ultimately, by lowering the barrier to entry for ML researchers in this crucial application area, CY-Bench will facilitate the development of improved crop forecasting tools that can be used to support decision-makers in food security planning worldwide.
* Crops : Wheat & Maize
* Spatial Coverage : Wheat (29 countries), Maize (38 countries). See CY-Bench Summary for the list of countries.
* Temporal Coverage : Varies. See CY-Bench Summary.
Data
Data format
The benchmark data is organized as a collection of CSV files (with the exception of location information, see below), with each file representing a specific category of variable for a particular country. Each CSV file is named according to the category and the country it pertains to, facilitating easy identification and retrieval. The data within each CSV file is structured in tabular format, where rows represent observations and columns represent different predictors related to a category of variable.
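As a minimal sketch of this naming scheme, the helper below builds the expected path of one file. It assumes the `variable_croptype_country.csv` pattern and the crop/country folder layout described under Folder structure. The function name `cybench_csv_path` and the root folder `cybench-data` are illustrative assumptions, not part of the CY-Bench tooling.

```python
from pathlib import PurePosixPath

# Variable categories listed in the Data content table.
VARIABLES = {
    "crop_calendar", "crop_mask", "fpar", "ndvi", "meteo",
    "soil_moisture", "soil", "location", "yield",
}
CROPS = {"maize", "wheat"}


def cybench_csv_path(root, variable, crop, country):
    """Return <root>/<crop>/<country>/<variable>_<crop>_<country>.csv.

    Hypothetical helper: `root` is wherever the data was unpacked.
    """
    if variable not in VARIABLES:
        raise ValueError(f"unknown variable category: {variable!r}")
    if crop not in CROPS:
        raise ValueError(f"unknown crop: {crop!r}")
    country = country.upper()  # folders use ISO 3166-1 alpha-2 codes
    if len(country) != 2 or not country.isalpha():
        raise ValueError(f"expected a two-letter country code: {country!r}")
    return PurePosixPath(root) / crop / country / f"{variable}_{crop}_{country}.csv"


print(cybench_csv_path("cybench-data", "meteo", "maize", "AO"))
```

Validating the category and crop up front turns a typo into an immediate error rather than a missing-file failure later on.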
Data content
All data files are provided as .csv.
Data | Description | Variables (units) | Temporal Resolution | Data Source (Reference) |
crop_calendar | start and end of growing season | sos (day of the year), eos (day of the year) | static | WorldCereal (Franch et al., 2022) |
crop_mask | crop area fraction | crop_area (km2), crop_area_percentage (%) | static | WorldCereal (Van Tricht et al., 2023; EC-JRC, 2024) |
fpar | fraction of absorbed photosynthetically active radiation | fpar (%) | dekadal (3 times a month; 1-10, 11-20, 21-31) | European Commission's Joint Research Centre (EC-JRC, 2024) |
ndvi | normalized difference vegetation index | - | approximately weekly | MOD09CMG (Vermote, 2015) |
meteo | temperature, precipitation (prec), radiation, potential evapotranspiration (et0), climatic water balance (= prec - et0) | tmin (C), tmax (C), tavg (C), prec (mm), et0 (mm), cwb (mm), rad (J m-2 day-1) | daily | AgERA5 (Boogaard et al., 2022) |
soil_moisture | surface soil moisture, rootzone soil moisture | ssm (kg m-2), rsm (kg m-2) | daily | GLDAS (Rodell et al., 2004) |
soil | available water capacity, bulk density, drainage class | awc (cm m-1), bulk_density (kg dm-3), drainage_class (category) | static | WISE Soil database (Batjes, 2016) |
location | centroid | latitude, longitude, region_area (km2) | static | |
yield | end-of-season yield | yield (t ha-1) | yearly | Various country or region specific sources (see crop_statistics_... in https://github.com/WUR-AI/AgML-CY-Bench/tree/main/data_preparation) |
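Two definitions from the table can be made concrete with a short sketch: the dekad boundaries (days 1-10, 11-20, 21 to month end) and the climatic water balance (prec - et0). The function names are illustrative assumptions. The table does not state which day of a dekad the fpar `date` stamp refers to, so the sketch only maps a calendar date to its dekad.

```python
from datetime import date


def dekad_of_month(day):
    """Map a day of month to its dekad: 1 (days 1-10), 2 (11-20), 3 (21-end)."""
    if not 1 <= day <= 31:
        raise ValueError(f"day of month out of range: {day}")
    return min((day - 1) // 10 + 1, 3)


def dekad_of_year(d):
    """Return the dekad index within the year, from 1 to 36."""
    return (d.month - 1) * 3 + dekad_of_month(d.day)


def climatic_water_balance(prec_mm, et0_mm):
    """cwb = prec - et0, both in mm, as defined in the meteo row."""
    return prec_mm - et0_mm


print(dekad_of_year(date(2020, 1, 31)))   # day 31 belongs to the third dekad
print(climatic_water_balance(12.5, 4.0))
```

Capping the result at 3 is what folds day 31 into the third dekad, so every month has exactly three dekads regardless of its length.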
Folder structure
- cybench-data: The CY-Bench dataset is structured first by crop type and then by country. Each country folder is named with its ISO 3166-1 alpha-2 two-character code. Within it, a separate .csv file is available for each predictor, the crop calendar, and the yield data, as shown below. The file names reflect the variable, crop type, and country, e.g. **variable_croptype_country.csv**.
```
CY-Bench
│
└─── maize
│ │
│ └─── AO
│ │ -- crop_calendar_maize_AO.csv
│ │ -- crop_mask_maize_AO.csv
│ │ -- fpar_maize_AO.csv
│ │ -- location_maize_AO.csv
│ │ -- meteo_maize_AO.csv
│ │ -- ndvi_maize_AO.csv
│ │ -- soil_maize_AO.csv
│ │ -- soil_moisture_maize_AO.csv
│ │ -- yield_maize_AO.csv
│ │
│ └─── AR
│ -- crop_calendar_maize_AR.csv
│ -- crop_mask_maize_AR.csv
│ -- fpar_maize_AR.csv
│ -- ...
│
└─── wheat
│ │
│ └─── AR
│ │ -- crop_calendar_wheat_AR.csv
│ │ -- crop_mask_wheat_AR.csv
│ │ -- fpar_wheat_AR.csv
│ │ ...
```
Example: CSV data content for maize in country X
```
X
└─── crop_calendar_maize_X.csv
│ -- crop_name (name of the crop)
│ -- adm_id (unique identifier for a subnational unit)
│ -- sos (start of crop season)
│ -- eos (end of crop season)
│
└─── crop_mask_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- crop_area
│ -- crop_area_percentage
│
└─── fpar_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- fpar
│
└─── location_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- latitude
│ -- longitude
│ -- region_area
│
└─── meteo_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- tmin (minimum temperature)
│ -- tmax (maximum temperature)
│ -- prec (precipitation)
│ -- rad (radiation)
│ -- tavg (average temperature)
│ -- et0 (potential evapotranspiration)
│ -- vpd (vapor pressure deficit)
│ -- cwb (climatic water balance)
│
└─── ndvi_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- ndvi
│
└─── soil_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- awc (available water capacity)
│ -- bulk_density
│ -- drainage_class
│
└─── soil_moisture_maize_X.csv
│ -- crop_name
│ -- adm_id
│ -- date (in the format YYYYMMdd)
│ -- ssm (surface soil moisture)
│ -- rsm (rootzone soil moisture)
│
└─── yield_maize_X.csv
│ -- crop_name
│ -- country_code
│ -- adm_id
│ -- harvest_year
│ -- yield
│ -- harvest_area
│ -- production
```
- centroids.zip and polygons.zip include shapes or geometries of administrative regions as centroids (x and y coordinates) and polygons (multipolygons), respectively. They are organized as follows:
```
centroids
│ └─── AO
...
│ │ -- AO.cpg
│ │ -- AO.dbf
│ │ -- AO.prj
│ │ -- AO.shp
│ │ -- AO.shx
│ └─── AR
│ │ -- AR.cpg
│ │ -- AR.dbf
│ │ -- AR.prj
│ │ -- AR.shp
│ │ -- AR.shx
polygons
│ └─── AO
...
│ │ -- AO.cpg
│ │ -- AO.dbf
│ │ -- AO.prj
│ │ -- AO.shp
│ │ -- AO.shx
│ └─── AR
│ │ -- AR.cpg
│ │ -- AR.dbf
│ │ -- AR.prj
│ │ -- AR.shp
│ │ -- AR.shx
```
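A common first step with these files is to keep only the predictor rows that fall inside the growing season. The sketch below assumes that `sos` and `eos` are days of the year, that `date` uses the YYYYMMdd format, and that a season with `sos` greater than `eos` wraps past 31 December. The sample rows, the `adm_id` value `X-01`, and the function names are made up for illustration. For brevity the sketch sums over all years instead of grouping by harvest year.

```python
import csv
import io
from datetime import datetime

# Tiny made-up samples that mimic the column layout of the CSV files.
CALENDAR_CSV = """crop_name,adm_id,sos,eos
maize,X-01,300,60
"""
METEO_CSV = """crop_name,adm_id,date,prec
maize,X-01,20201001,5.0
maize,X-01,20201101,7.5
maize,X-01,20210115,2.5
maize,X-01,20210401,9.0
"""


def in_season(doy, sos, eos):
    """True if day-of-year `doy` lies in [sos, eos]; handles wrap-around seasons."""
    if sos <= eos:
        return sos <= doy <= eos
    return doy >= sos or doy <= eos


def season_precipitation(calendar_text, meteo_text):
    """Sum `prec` per adm_id over rows dated inside the crop calendar window."""
    calendar = {
        # int(float(...)) tolerates values stored with decimals, e.g. "300.0"
        row["adm_id"]: (int(float(row["sos"])), int(float(row["eos"])))
        for row in csv.DictReader(io.StringIO(calendar_text))
    }
    totals = {}
    for row in csv.DictReader(io.StringIO(meteo_text)):
        window = calendar.get(row["adm_id"])
        if window is None:
            continue  # no calendar entry for this region
        doy = datetime.strptime(row["date"], "%Y%m%d").timetuple().tm_yday
        if in_season(doy, *window):
            totals[row["adm_id"]] = totals.get(row["adm_id"], 0.0) + float(row["prec"])
    return totals


print(season_precipitation(CALENDAR_CSV, METEO_CSV))
```

With real data, the two strings would be replaced by the contents of `crop_calendar_maize_X.csv` and `meteo_maize_X.csv`, and the same `adm_id` key links every other file of the same crop and country.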
Data access
The full dataset can be downloaded directly from Zenodo or with the `zenodo_get` library.
License and citation
We kindly ask all users of CY-Bench to properly respect licensing and citation conditions of the datasets included.
Version Notes
1.0 is the dataset submitted to NeurIPS Datasets and Benchmarks Track. The paper and discussions are here: https://openreview.net/forum?id=jkJDNG468g#discussion
1.1 and 1.2 fix some issues with column names and mismatches in adm_id between yield data and input data.
1.3 includes location information in the form of centroids and polygons of admin regions.
1.4 updates the fpar data for 2023. fpar data was incomplete for 2023 in earlier versions (due to unavailability in the data source itself).
1.5 fixes an issue in the crop calendar.
1.6 fixes an issue in the ndvi time series.
1.7 updates storage precision to 3 decimal places to reduce data size.
1.8 filters out invalid yield values.
1.9 adds vpd and location. ET0 is now obtained from AgERA5 (was AQUASTAT-FAO). AgERA5 2.0 is used (was AgERA5 1.1).
1.10 adds region_area to location*.csv and adds crop_mask_*.csv. Fixes an error in the yield data for Australia.
Additional details
Software
- Repository URL: https://github.com/WUR-AI/AgML-CY-Bench
References
- Ministerio de Agrícultura, Ganaderia y Pesca. (2023), "Estimaciones Agrícolas", available at: https://datosestimaciones.magyp.gob.ar/reportes.php?reporte=Estimaciones (accessed 9 February 2024).
- ABARES (2024) Australian Bureau of Agricultural and Resource Economics and Sciences Farm Data Portal. Accessed on 2024/03/05. https://www.agriculture.gov.au/abares/data/farm-data-portal#data-download
- IBGE SIDRA. (2022), "Tabela 1612: Área plantada, área colhida, quantidade produzida, rendimento médio e valor da produção das lavouras temporárias", available at: https://sidra.ibge.gov.br/tabela/1612 (accessed 6 February 2024).
- National Bureau of Statistics of China, 2024. National Data Portal. https://data.stats.gov.cn, Last accessed: Feb 18, 2024.
- Duden, C., Nacke, C. & Offermann, F. Crop yields and area in Germany from 1979 to 2021 at a harmonized district-level. OpenAgrar https://doi.org/10.3220/DATA20231117103252-0 (2023).
- Duden, C., Nacke, C. & Offermann, F. German yield and area data for 11 crops from 1979 to 2021 at a harmonized spatial resolution of 397 districts. Sci Data 11, 95 (2024). https://doi.org/10.1038/s41597-024-02951-8
- Ronchetti, Giulia; Nisini-Scacchiafichi, Luigi; Seguini, Lorenzo; Cerrani, Iacopo; van der Velde, Marijn (2023): Harmonized European Union subnational crop statistics. European Commission, Joint Research Centre (JRC) [Dataset]. doi: 10.2905/685949ff-56de-4646-a8df-844b5bb5f835 PID: http://data.europa.eu/89h/685949ff-56de-4646-a8df-844b5bb5f835.
- EC-JRC, 2024. JRC Agri4Cast Data Portal. https://agri4cast.jrc.ec.europa.eu/DataPortal/, Last accessed: Feb 22, 2024.
- Ronchetti, G., Nisini Scacchiafichi, L., Seguini, L., Cerrani, I., and van der Velde, M.: Harmonized European Union subnational crop statistics can reveal climate impacts and crop cultivation shifts, Earth Syst. Sci. Data, 16, 1623–1649, https://doi.org/10.5194/essd-16-1623-2024, 2024.
- ICRISAT, 2024a. District Level Database Portal. http://data.icrisat.org/dld/src/crops.html, Last accessed: Feb 9, 2024.
- ICRISAT, 2024b. International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Bamako, Mali.
- INEGI. Agricultural Census and Survey Data. [Census Project 2022, Census Map Project 2022, Survey Data for 2019,2017 and 2014]. Available at: https://www.inegi.org.mx/programas/ca/2022, https://www.inegi.org.mx/app/biblioteca/ficha.html?upc=794551067284, https://www.inegi.org.mx/programas/ena/2019, https://www.inegi.org.mx/programas/ena/2017, https://www.inegi.org.mx/programas/ena/2014. Accessed on 2024-04-10.
- USDA-NASS, 2023. The Yield Forecasting Program of NASS. Technical Report. United States Department of Agriculture (USDA). https://www.nass.usda.gov/Publications/Methodology_and_Data_Quality/Advanced_Topics/Yield%20Forecasting%20Program%20of%20NASS_2023.pdf, Last accessed: Feb 23, 2024.
- Potter NA (2019). "rnassqs: An 'R' package to access agricultural data via the USDA National Agricultural Statistics Service (USDA-NASS) 'Quick Stats' API." The Journal of Open Source Software.
- Potter N (2022). rnassqs: Access the NASS 'Quick Stats' API. R package version 0.6.1, https://CRAN.R-project.org/package=rnassqs.
- Boogaard, H., Schubert, J., De Wit, A., Lazebnik, J., Hutjes, R., Van der Grijn, G., (2020): Agrometeorological indicators from 1979 to present derived from reanalysis. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.6c68c9bb (Accessed on 06-06-2024)
- Vermote, E. MOD09CMG MODIS/Terra Surface Reflectance Daily L3 Global 0.05Deg CMG V006. 2015, distributed by NASA EOSDIS Land Processes Distributed Active Archive Center, https://doi.org/10.5067/MODIS/MOD09CMG.006. Accessed 2024-04-19.
- FAO-AQUASTAT. Reference evapotranspiration - AgERA5 derived (Global - Daily - 10km). https://data.apps.fao.org/catalog//iso/f22813e9-679e-4864-bd92-d48f5dfc436c, 2021. Accessed: 2024-06-05
- Van Tricht, K. and Degerickx, J. and Gilliams, S. and Zanaga, D. and Battude, M. and Grosu, A. and Brombacher, J. and Lesiv, M. and Bayas, J. C. L. and Karanam, S. and Fritz, S. and Becker-Reshef, I. and Franch, B. and Mollà-Bononad, B. and Boogaard, H. and Pratihast, A. K. and Koetz, B. and Szantoi, Z. (2023). WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping, Earth System Science Data, 15, 5491–5515, DOI: 10.5194/essd-15-5491-2023.
- L. Seguini, A. Klish, M. Meroni, et al. Global near real-time filtered 500 m 10-day fraction of photosynthetically active radiation absorbed by vegetation (FPAR) from MODIS and VIIRS instruments suited for operational agriculture monitoring and crop yield forecasting systems. https://agricultural-production-hotspots.ec.europa.eu/data/indicators_fpar/, 2024. Under preparation.
- Franch, B., Cintas, J., Becker-Reshef, I., Sanchez-Torres, M.J., Roger, J., Skakun, S., Sobrino, J.A., Van Tricht, K., Degerickx, J., Gilliams, S. and Koetz, B., 2022. Global crop calendars of maize and wheat in the framework of the WorldCereal project. GIScience & Remote Sensing, 59(1), pp.885-913.
- Batjes NH 2016. Harmonised soil property values for broad-scale modelling (WISE30sec) with estimates of global soil carbon stocks. Geoderma 2016(269), 61-68 ( http://dx.doi.org/10.1016/j.geoderma.2016.01.034 )
- Rodell, M., P.R. Houser, U. Jambor, J. Gottschalck, K. Mitchell, C.-J. Meng, K. Arsenault, B. Cosgrove, J. Radakovich, M. Bosilovich, J.K. Entin, J.P. Walker, D. Lohmann, and D. Toll, The Global Land Data Assimilation System, Bull. Amer. Meteor. Soc., 85(3), 381-394, 2004.
- OCHA. (2020), "Argentina - Subnational Administrative Boundaries - Humanitarian Data Exchange", available at: https://data.humdata.org/dataset/cod-ab-arg (accessed 6 February 2024).
- Eurostat-GISCO, 2024. Eurostat - Geographical Information and Maps. Administrative Units/Statistical Units. https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units, Last accessed: Feb 22, 2024.
- INEGI. Mapa. Mapas-Marco Geoestadístico, Censo Agropecuario 2022. https://www.inegi.org.mx/app/biblioteca/ficha.html?upc=794551067284 (accessed 2024-04-18).
- Walker K (2024). tigris: Load Census TIGER/Line Shapefiles. R package version 2.1, https://CRAN.R-project.org/package=tigris