Dataset Open Access

Sample data for "Machine learning for large-scale forecasting"

Dilli Paudel; Hendrik Boogaard; Allard de Wit; Sander Janssen; Sjoukje Osinga; Christos Pylianidis; Ioannis Athanasiadis

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="" xmlns:oai_dc="" xmlns:xsi="" xsi:schemaLocation="">
  <dc:creator>Dilli Paudel</dc:creator>
  <dc:creator>Hendrik Boogaard</dc:creator>
  <dc:creator>Allard de Wit</dc:creator>
  <dc:creator>Sander Janssen</dc:creator>
  <dc:creator>Sjoukje Osinga</dc:creator>
  <dc:creator>Christos Pylianidis</dc:creator>
  <dc:creator>Ioannis Athanasiadis</dc:creator>
  <dc:description>This dataset includes sample data for the Netherlands to run the machine learning baseline as described in the paper titled Machine learning for large-scale crop yield forecasting, accessible at The software implementation of the machine learning baseline is available at:


The NUTS classification (Nomenclature of territorial units for statistics) is a hierarchical system for dividing up the economic territory of the EU and the UK (see Eurostat, 2016, for more details).


The dataset consists of 11 CSV files. They are formatted to work as sample inputs to the machine learning baseline.

	Crop Area Fractions (NUTS2, NUTS1): We aggregated the predictions of the machine learning baseline from NUTS2 to national (NUTS0) level by weighting them by the modeled crop area. Cerrani and López Lozano (2017) describe in detail the algorithm used to model crop areas for different NUTS levels. The data comes from the MARS Crop Yield Forecasting System (MCYFS) of the European Commission's Joint Research Centre (JRC) (see Lecerf et al., 2019).
	Centroids (NUTS2): Data includes the latitude, longitude and distance to coast of the centroids of NUTS2 regions.
	Meteo Daily Data and Meteo Dekadal Data (NUTS2): The data comes from MCYFS (see EC-JRC, 2020). By default, the implementation uses daily data.
	Remote Sensing Data (NUTS2, see Copernicus Global Land Service, 2020): Data includes the fraction of absorbed photosynthetically active radiation (FAPAR) aggregated to NUTS2.
	Soil Data: Data includes soil moisture information that can be used to calculate soil water holding capacity. The data comes from MCYFS (see Lecerf et al., 2019).
	WOFOST data (NUTS2): The World Food Studies (WOFOST) crop model (van Diepen et al., 1989; Supit et al., 1994; de Wit et al., 2019) is a simulation model for the quantitative analysis of the growth and production of annual field crops. It is a mechanistic, dynamic model that explains daily crop growth on the basis of the underlying processes, such as photosynthesis and respiration, and how these processes are influenced by environmental conditions. The crop simulation is fed by weather, soil and crop data. Observed meteorological data is interpolated on a regular 25 km grid using a method based on the distance, altitude and climatic region similarity between the center of grid cells and weather stations (see Van der Goot, 1998). WOFOST runs on the intersection between the 25 km meteorological grid and soil units based on the European soil map. To aggregate the output data to administrative regions such as countries or provinces, simulation units are further intersected with the boundaries of these regions. The outputs at soil unit (STU) level are aggregated to grid level in an area-weighted manner. Gridded simulations are aggregated to the lowest NUTS level (NUTS3) considering the arable land area of each grid cell, derived from GLOBCOVER and CORINE Land Cover (Cerrani and López Lozano, 2017). From NUTS3 to higher levels, crop area fractions for the current year, retrieved from Eurostat, are used to weight and aggregate the output (Cerrani and López Lozano, 2017).
	National yield statistics (NUTS0): These are the official Eurostat national yield statistics (Eurostat, 2020a). We used these yield statistics as the reference against which we compared the machine learning predictions aggregated to NUTS0 and the actual MCYFS forecasts (see van der Velde and Nisini, 2019).
	Regional yield statistics (NUTS2): We used NUTS2 yield statistics as labels to train and evaluate machine learning algorithms. We obtained the NUTS2 yield statistics from the Central Bureau of Statistics (CBS) of the Netherlands (NL-CBS, 2020).
	Past MCYFS Yield Forecasts (NUTS0): These are actual forecasts made by MCYFS in the past (see van der Velde and Nisini, 2019). We used the official Eurostat national yield statistics (see "National yield statistics" above) as the reference against which we compared the machine learning predictions aggregated to NUTS0 and the MCYFS forecasts.
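
The area-weighted aggregation from NUTS2 to NUTS0 mentioned under Crop Area Fractions can be sketched as follows. This is a minimal illustration, not the baseline's actual implementation: the column names (nuts2, year, yield_pred, area_fraction), the function name and the sample values are hypothetical and do not necessarily match the CSV files in this dataset. The only fact relied on is that a NUTS code starts with the two-letter country (NUTS0) code.

```python
import pandas as pd


def aggregate_to_national(predictions, area_fractions):
    """Area-weighted mean of NUTS2 yield predictions per country and year.

    Column names are assumptions made for this sketch only.
    """
    merged = predictions.merge(area_fractions, on=["nuts2", "year"], how="inner")
    # The first two characters of a NUTS code identify the country (NUTS0).
    merged["country"] = merged["nuts2"].str[:2]
    merged["weighted"] = merged["yield_pred"] * merged["area_fraction"]
    sums = merged.groupby(["country", "year"])[["weighted", "area_fraction"]].sum()
    # Dividing by the summed weights normalises fractions that do not add up to 1.
    national = (sums["weighted"] / sums["area_fraction"]).rename("yield_pred")
    return national.reset_index()


# Made-up values, purely to show the call.
predictions = pd.DataFrame({
    "nuts2": ["NL11", "NL12", "NL13"],
    "year": [2018, 2018, 2018],
    "yield_pred": [40.0, 50.0, 60.0],
})
area_fractions = pd.DataFrame({
    "nuts2": ["NL11", "NL12", "NL13"],
    "year": [2018, 2018, 2018],
    "area_fraction": [0.1, 0.3, 0.6],
})

national = aggregate_to_national(predictions, area_fractions)
print(national)
```

Regions with a larger modeled crop area contribute more to the national value, so the result can be compared directly with a national statistic.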

Crop ID and name mapping

2 : grain maize

6 : sugar beets

7 : potatoes

90 : soft wheat

93 : sunflower

95 : spring barley
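
For scripts that read the CSV files, the mapping above could be kept as a small lookup table. The sketch below only restates the IDs and names listed here; the names CROP_NAMES and crop_name are hypothetical and are not taken from the software implementation.

```python
# Crop ID to crop name, copied from the mapping listed above.
CROP_NAMES = {
    2: "grain maize",
    6: "sugar beets",
    7: "potatoes",
    90: "soft wheat",
    93: "sunflower",
    95: "spring barley",
}


def crop_name(crop_id):
    """Return the crop name for an ID; raise KeyError for an unknown ID."""
    try:
        return CROP_NAMES[crop_id]
    except KeyError:
        raise KeyError(f"unknown crop ID: {crop_id}") from None


print(crop_name(90))
```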



We would like to thank S. Niemeyer from the European Commission’s Joint Research Centre (JRC) for permission to provide open access to the Netherlands data. We would also like to thank M. van der Velde, L. Nisini and I. Cerrani from JRC for sharing past MCYFS forecasts and Eurostat national yield statistics with us.</dc:description>
  <dc:source>Agricultural Systems 187</dc:source>
  <dc:subject>Crop yield prediction; Machine learning; Modularity; Reusability; Large-scale crop yield forecasting.</dc:subject>
  <dc:title>Sample data for "Machine learning for large-scale forecasting"</dc:title>