Published January 4, 2021 | Version 2.0
Dataset Open

Sample data for "Machine learning for large-scale crop yield forecasting"


This dataset includes sample data for the Netherlands to run the machine learning baseline described in the paper "Machine learning for large-scale crop yield forecasting" (see the journal article listed under Related works). The software implementation of the machine learning baseline is available at:


The NUTS classification (Nomenclature of territorial units for statistics) is a hierarchical system for dividing up the economic territory of the EU and the UK (see Eurostat, 2016, for more details).
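Because the classification is hierarchical, the parent regions of a NUTS code can be read off the code itself. The following is a minimal sketch, assuming the standard coding scheme (a two-letter country code for NUTS0, plus one extra character per level down to NUTS3). The function name `nuts_parents` is hypothetical and is not part of the published software.

```python
def nuts_parents(nuts_id: str) -> dict:
    """Return the codes of a NUTS region and its ancestors, keyed by level.

    Assumes the standard coding: a 2-letter country code (NUTS0)
    followed by one extra character per level down to NUTS3.
    """
    if not 2 <= len(nuts_id) <= 5:
        raise ValueError(f"unexpected NUTS code length: {nuts_id!r}")
    level = len(nuts_id) - 2  # 0 for "NL", 2 for "NL33", ...
    # Truncate the code to obtain each ancestor, from NUTS0 up to the region itself.
    return {f"NUTS{lvl}": nuts_id[: lvl + 2] for lvl in range(level + 1)}


# Example: a NUTS2 region in the Netherlands.
print(nuts_parents("NL33"))
```

Truncating a NUTS2 code to its first two characters is therefore enough to find the country (NUTS0) that a regional record belongs to.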


The dataset consists of 11 CSV files. They are formatted to work as sample inputs to the machine learning baseline.

  1. Crop Area Fractions (NUTS2, NUTS1): We aggregated the predictions of the machine learning baseline from NUTS2 to national (NUTS0) level by weighting them by the modeled crop area. Cerrani and López Lozano (2017) describe in detail the algorithm used to model crop areas for different NUTS levels. The data comes from the MARS Crop Yield Forecasting System (MCYFS) of the European Commission's Joint Research Centre (JRC) (see Lecerf et al., 2019).
  2. Centroids (NUTS2): Data includes latitude, longitude and distance to coast of the centroids of NUTS2 regions.
  3. Meteo Daily Data and Meteo Dekadal Data (NUTS2): The data comes from MCYFS (see EC-JRC, 2020). By default, the implementation uses daily data.
  4. Remote Sensing Data (NUTS2, see Copernicus Global Land Service, 2020): Data includes fraction of absorbed photosynthetically active radiation (FAPAR) aggregated to NUTS2.
  5. Soil Data: Data includes soil moisture information that can be used to calculate soil water holding capacity. The data comes from MCYFS (see Lecerf et al., 2019).
  6. WOFOST data (NUTS2): The World Food Studies (WOFOST) crop model (van Diepen et al., 1989; Supit et al., 1994; de Wit et al., 2019) is a simulation model for the quantitative analysis of the growth and production of annual field crops. It is a mechanistic, dynamic model that explains daily crop growth on the basis of the underlying processes, such as photosynthesis and respiration, and how these processes are influenced by environmental conditions. The crop simulation is fed by weather, soil and crop data. Observed meteorological data is interpolated onto a regular 25 km grid using a method based on the distance, altitude and climatic region similarity between the centers of grid cells and weather stations (see Van der Goot, 1998). WOFOST runs on the intersection between the 25 km meteorological grid and soil units based on the European soil map. To aggregate the output data to administrative regions such as countries or provinces, simulation units are further intersected with the boundaries of these regions. The outputs at soil unit (STU) level are aggregated to grid level in an area-weighted manner. Gridded simulations are aggregated to NUTS3, the lowest NUTS level, considering the arable land area of each grid cell, derived from GLOBCOVER and CORINE Land Cover (Cerrani and López Lozano, 2017). From NUTS3 to higher levels, crop area fractions for the current year, retrieved from Eurostat, are used to weight and aggregate the output (Cerrani and López Lozano, 2017).
  7. GAES data: GAES data includes agro-climatic features of regions, such as elevation and slope (from USGS-EROS, 2021), field size (from Lesiv et al., 2019), irrigated (crop) areas (from EC-JRC, 2020) and crop areas (from EC-JRC, 2020).
  8. National yield statistics (NUTS0): These are the official Eurostat national yield statistics (Eurostat, 2020a). We used these yield statistics as reference to compare the machine learning predictions aggregated to NUTS0 and the actual MCYFS forecasts (see van der Velde and Nisini, 2019).
  9. Regional yield statistics (NUTS2): We used NUTS2 yield statistics as labels to train and evaluate machine learning algorithms. We obtained the NUTS2 yield statistics from the Central Bureau of Statistics (CBS) of the Netherlands (NL-CBS, 2020).
  10. Past MCYFS Yield Forecasts (NUTS0): These are actual forecasts made by MCYFS in the past (see van der Velde and Nisini, 2019). We used the official Eurostat national yield statistics (see point 8 above) as the reference to compare the machine learning predictions aggregated to NUTS0 and MCYFS forecasts.
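The area-weighted aggregation of NUTS2 predictions to national level mentioned in point 1 can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `aggregate_to_nuts0`, the dictionary layout and all numeric values are assumptions made for the example, and the country is taken to be the first two characters of the NUTS2 code.

```python
from collections import defaultdict


def aggregate_to_nuts0(predictions, crop_areas):
    """Area-weighted mean of NUTS2 yield predictions per (country, year).

    predictions: {(nuts2_id, year): predicted yield}
    crop_areas:  {(nuts2_id, year): modeled crop area}
    """
    weighted_sum = defaultdict(float)
    area_sum = defaultdict(float)
    for (nuts2_id, year), yield_pred in predictions.items():
        area = crop_areas.get((nuts2_id, year))
        if not area:  # skip regions with a missing or zero modeled area
            continue
        key = (nuts2_id[:2], year)  # country code = first two characters
        weighted_sum[key] += yield_pred * area
        area_sum[key] += area
    return {key: weighted_sum[key] / area_sum[key] for key in area_sum}


# Illustrative values only (not taken from the dataset).
predictions = {("NL11", 2018): 8.0, ("NL12", 2018): 9.0, ("NL13", 2018): 10.0}
crop_areas = {("NL11", 2018): 100.0, ("NL12", 2018): 300.0, ("NL13", 2018): 600.0}

national = aggregate_to_nuts0(predictions, crop_areas)
print(national)  # {('NL', 2018): 9.5}
```

Regions with larger modeled crop areas pull the national value toward their own prediction, which is why the result here (9.5) is above the unweighted mean (9.0).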

Crop ID and name mapping

2 : grain maize

6 : sugar beets

7 : potatoes

90 : soft wheat

93 : sunflower

95 : spring barley
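For scripts that read the CSV files, the mapping above can be kept as a small lookup table. This is a minimal sketch: the IDs and names are those listed above, while the names `CROP_NAMES`, `CROP_IDS` and `crop_name` are hypothetical and not taken from the published software.

```python
# Crop ID to crop name, as listed in the dataset description.
CROP_NAMES = {
    2: "grain maize",
    6: "sugar beets",
    7: "potatoes",
    90: "soft wheat",
    93: "sunflower",
    95: "spring barley",
}

# Reverse lookup: crop name to crop ID.
CROP_IDS = {name: crop_id for crop_id, name in CROP_NAMES.items()}


def crop_name(crop_id: int) -> str:
    """Return the crop name for an ID, failing loudly on unknown IDs."""
    try:
        return CROP_NAMES[crop_id]
    except KeyError:
        raise ValueError(f"unknown crop ID: {crop_id}") from None


print(crop_name(90))        # soft wheat
print(CROP_IDS["potatoes"])  # 7
```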



We would like to thank S. Niemeyer from the European Commission’s Joint Research Centre (JRC) for the permission to provide open access to the Netherlands data. Similarly, we would like to thank M. van der Velde, L. Nisini and I. Cerrani from JRC for sharing with us past MCYFS forecasts and Eurostat national yield statistics.



Files (23.5 MB)


Additional details

Related works

Is supplement to
Journal article: 10.1016/j.agsy.2020.103016 (DOI)


European Commission


  • Cerrani, I., Lopez Lozano, R., 2017. Algorithm for the disaggregation of crop area statistics in the MARS crop yield forecasting system. Last accessed: Oct 8, 2020.
  • Copernicus Global Land Service, 2020. Fraction of Absorbed Photosynthetically Active Radiation. Last accessed: Oct 19, 2020.
  • De Wit, A., Boogaard, H., Fumagalli, D., Janssen, S., Knapen, R., van Kraalingen, D., Supit, I., van der Wijngaart, R., van Diepen, K., 2019. 25 years of the WOFOST cropping systems model. Agricultural Systems 168, 154–167. doi:10.1016/j.agsy.2018.06.018.
  • EC-JRC, 2020. JRC Agri4Cast Data Portal. Last accessed: May 11, 2020.
  • Eurostat, 2016. Nomenclature of territorial units for statistics. Last accessed: May 11, 2020.
  • Eurostat, 2020a. Eurostat - agricultural production - crops. Agricultural_production_-_crops, Last accessed: May 11, 2020.
  • Lecerf, R., Ceglar, A., Lopez-Lozano, R., Van Der Velde, M., Baruth, B., 2019. Assessing the information in crop model and meteorological indicators to forecast crop yield over Europe. Agricultural Systems 168, 191–202. doi:10.1016/j.agsy.2018.03.002.
  • Lesiv, M., Laso Bayas, J.C., See, L., Duerauer, M., Dahlia, D., Durando, N., Hazarika, R., Kumar Sahariah, P., Vakolyuk, M., Blyshchyk, V., et al., 2019. Estimating the global distribution of field size using crowdsourcing. Global Change Biology 25, 174–186. doi:10.1111/gcb.14492.
  • NL-CBS, 2020. CBS open data portal. #/CBS/nl/?fromstatweb, Last accessed: May 11, 2020.
  • Supit, I., Hooijer, A., Van Diepen, C., 1994. System description of the WOFOST 6.0 crop simulation model implemented in CGMS. vol. 1. theory and algorithms, in: EUR Publication No. 15959 EN, Office for Official Publications of the European Communities, Luxembourg. p. 146.
  • USGS-EROS, 2021. USGS EROS Archive - Digital Elevation - Global 30 Arc-Second Elevation (GTOPO30). Last accessed: May 11, 2021.
  • Van der Goot, E., 1998. Spatial interpolation of daily meteorological data for the Crop Growth Monitoring System (CGMS), in: Proceedings of Seminar on Data Spatial Distribution in Meteorology and Climatology 28 September - 3 October 1997, Office for Official Publications of the EU, Luxembourg. pp. 141–153.
  • Van der Velde, M., Nisini, L., 2019. Performance of the MARS-crop yield forecasting system for the European Union: Assessing accuracy, in-season, and year-to-year improvements from 1993 to 2015. Agricultural Systems 168, 203–212. doi:10.1016/j.agsy.2018.06.009.
  • Van Diepen, C., Wolf, J., Van Keulen, H., Rappoldt, C., 1989. WOFOST: a simulation model of crop production. Soil Use and Management 5, 16–24. doi:10.1111/j.1475-2743.1989.tb00755.x.