Published April 15, 2026 | Version v2
Dataset Open

SCARFACE: a harmonized spatio-temporal dataset integrating socio-economic, environmental, and agricultural indicators for the Po Valley (Italy), 2011--2024

  • 1. University of Milano-Bicocca
  • 2. Fondazione Eni Enrico Mattei
  • 3. University of Glasgow
  • 4. Consiglio per la ricerca in agricoltura e l'analisi dell'economia agraria (CREA)
  • 5. University of Milan Bicocca, Department of Earth and Environmental Sciences

Description

The SCARFACE research initiative

SCARFACE (Sequestering CARbon through Forests, AgriCulture, and land usE - https://www.paolomaranzano.net/scarface) is a research initiative funded by the University of Milano-Bicocca (UniMiB), Italy. The project blends complementary and interdisciplinary research experiences from the statistical, data science and environmental and atmospheric chemistry backgrounds at UniMiB. Along with researchers from UniMiB, the project involves researchers from the Italian Council for Agricultural Research and Economics - Research Centre for Agricultural Policies and Bioeconomy (CREA-PB), Italy, and the School of Mathematics and Statistics of the University of Glasgow (Scotland, UK).

The SCARFACE dataset

The project assembles a harmonized spatio-temporal dataset that integrates several domains, such as climate, air quality, pollutant emissions, land cover, soil properties, agro-industry dynamics and socio-economic indicators. Its purpose is to jointly investigate the interconnected processes linking agricultural systems, atmospheric dynamics, emissions, and socio-economic conditions in the Po Valley (Northern Italy), an area characterized by strong interactions among agricultural systems, environmental processes, and human activities.

The spatial reference unit adopted in SCARFACE is the Agrarian Sub-Region (ASR), a territorial classification defined by the Italian National Statistics Office (ISTAT). ASRs represent groups of contiguous municipalities that are considered relatively homogeneous with respect to natural conditions, agronomic characteristics, and agricultural production systems. The Po Valley can be partitioned into m=256 ASRs with different sizes and shapes.

The SCARFACE dataset integrates information for the period from 2011 to 2024 (i.e., T=14 annual time stamps), with the initial and final temporal coverage depending on the availability of the individual data sources. Therefore, the final database adopts an annual panel structure defined over ASR spatial units and composed of a total number of spatio-temporal observations equal to N = m × T = 256 × 14 = 3584 for each variable.
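As a quick illustration of the panel arithmetic above, the following minimal sketch builds the skeleton of a balanced annual panel with one row per (ASR, Year) pair. The ASR labels are hypothetical placeholders; the real identifiers are those stored in the 'ASR' key of the dataset.

```python
import itertools

import pandas as pd

M_ASR = 256                  # number of ASRs stated in the description
YEARS = range(2011, 2025)    # 2011..2024 inclusive, i.e. T = 14

# Hypothetical placeholder identifiers, not the real ASR codes.
asr_ids = [f"ASR_{i:03d}" for i in range(1, M_ASR + 1)]

# One row per (ASR, Year) pair: the skeleton of a balanced annual panel.
panel = pd.DataFrame(
    list(itertools.product(asr_ids, YEARS)), columns=["ASR", "Year"]
)

n_obs = len(panel)
print(n_obs)  # 3584 = 256 * 14
```

A skeleton of this kind can be left-joined with any single source to make gaps in its spatial or temporal coverage explicit.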

Overall, SCARFACE comprises a set of p=2748 variables (plus three unique identifiers, that is, year, ASR and geometry) that include administrative records, gridded environmental products, satellite-derived land information, and survey-based socio-economic indicators sourced from national and international public institutions, covering a wide range of thematic domains. Farm activity and agro-economic indicators are derived from the Farm Accountancy Data Network (FADN) survey coordinated by the Italian Council for Agricultural Research and Economics (CREA). Emissions data are obtained from the EDGAR inventories developed by the European Commission, while air quality information is sourced from both the European Environment Agency (EEA) and the Copernicus Atmosphere Monitoring Service (CAMS). Meteorological variables are retrieved from the ERA5-Land reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), and extreme weather indicators are provided by the European Drought Observatory (EDO). Land cover information is based on the CORINE Land Cover dataset from Copernicus and the Global Dynamic Land Cover (GDLC) dataset. Livestock data are obtained from the Italian National Livestock Registry (BDN) managed by the Italian Ministry of Health, while socio-economic indicators are produced by ISTAT. Finally, geographical features and administrative metadata are derived from a combination of Amazon Web Services (AWS), ISTAT and Eurostat.

The dataset is designed as a versatile resource supporting both methodological and applied developments, as well as policy-relevant analyses, including:

  1. Panel data analyses at moderate spatial and temporal resolutions
  2. Advanced spatio-temporal modeling in the presence of heterogeneous covariates and high-dimensional settings
  3. Spatial and spatio-temporal clustering exercises, facilitating the identification of regional typologies and underlying patterns in agricultural and environmental systems.
  4. Reproducible, cross-domain policy-oriented analyses, particularly in relation to agricultural transitions, air quality management, and climate variability in one of Europe’s most critical environmental hotspots.

The building process of the dataset is detailed in the companion paper.

Table of contents (English)

This repository contains the following files:

  • SCARFACE_DatasetSingleObjects_April2026.xlsx: this is an Excel (.xlsx) file containing the 14 source-specific datasets that constitute the SCARFACE framework. Data are organized into 14 distinct sheets that can be matched using the primary keys 'ASR' (unique geographical/spatial ID) and 'Year' (unique temporal ID). Geometries can be added by matching the shapefile contained in ASRs_Geometries.zip (with primary key 'ASR');
  • SCARFACE_DatasetSingleObjects_April2026.RData: this is an RData file containing the 14 source-specific datasets that constitute the SCARFACE framework. Data are organized into 14 distinct data frames that can be matched using the primary keys 'ASR' (unique geographical/spatial ID) and 'Year' (unique temporal ID). The object ASRsPoValley_sf is of class 'sf' and contains the geometry of each polygon;
  • SCARFACE_DatasetExtended_April2026.xlsx: this is an Excel (.xlsx) file containing the whole SCARFACE dataset in a single sheet. Individual datasets were matched through a full join on the primary keys 'ASR' (unique geographical/spatial ID) and 'Year' (unique temporal ID). Geometries can be added by matching the shapefile contained in ASRs_Geometries.zip (with primary key 'ASR');
  • SCARFACE_DatasetExtended_April2026.RData: this is an RData file containing the whole SCARFACE dataset in a data frame of class sf, that is, a spatial object. Individual datasets were matched through a full join on the primary keys 'ASR' (unique geographical/spatial ID) and 'Year' (unique temporal ID). Geometries can be added by matching the shapefile contained in ASRs_Geometries.zip (with primary key 'ASR');
  • ASRs_Geometries.zip: this is a zip file containing the shapefile with the geometries of the m=256 ASR polygons;
  • SCARFACE_MissingAnalysis_April2026.xlsx: this is an Excel (.xlsx) file containing post-merging information about missing values for each individual dataset. Missing values are described for each year, variable and ASR. Information is reported in a separate sheet for each dataset;
  • SCARFACE_StructureAnalysis_April2026.xlsx: this is an Excel (.xlsx) file containing information about the structure of each individual dataset. In particular, the number of rows, the number of ASRs with valid values, the temporal range (coverage), and the number of variables are reported for each dataset;
  • SCARFACE - Methodological note.pdf: this is a PDF file containing a note on the statistical methodologies used to spatially align gridded datasets via spatial block kriging (Section S1), the post-stratification procedure adopted to generate the spatio-temporal weighting system (Section S2) and the Generalized Variance Function (GVF) methodology adopted to regularize direct estimates of the variance for FADN survey data (Section S3);
  • SCARFACE - Tables and list of available indicators.pdf: this is a PDF file containing tables describing the available information (e.g., reclassification, aggregation, etc.) for all the data sources included in the SCARFACE dataset;
  • SCARFACE - Data and replication code.zip: this is a zip file containing R and Python code to reproduce the final merged data frame. The zip file contains
    • A separate folder for each source-specific dataset (i.e., Animals, CAMSgrid, EDGARgrid, EDOgrid, EEAconc, EnvironmentalVars, LandCover, ISTATSocioEconomicData and CREA)
    • A folder with auxiliary functions to apply the spatial block kriging algorithm used to upscale gridded data (i.e., AuxFuns_Kriging)
    • A folder containing data extracted from the 7th Italian Agricultural Census 2020 and provided by ISTAT used to check the validity of several data sources (i.e., ISTAT_AgroCensus2020)
    • A folder containing data and code to generate the geographical metadata at the ASR level (i.e., Match_ASR_Munic). Among others, the folder contains the matching table "Match_LAUs_RegAgrarie_PoValley" (CSV and RData) that reports the complete municipality--ASR correspondence to facilitate reproducibility of the spatial aggregation procedures;
    • A folder containing code to merge the individual datasets and to check the data quality (i.e., Merge dataset).
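The full join on the primary keys 'ASR' and 'Year' described for the extended files can be sketched as follows. This is a minimal illustration on toy tables with invented column names and values (nh3, income, cattle); it is not the project's own merge script, which is written in R and Python and shipped in the replication archive.

```python
from functools import reduce

import pandas as pd

KEYS = ["ASR", "Year"]  # primary keys named in the file descriptions

# Toy stand-ins for three source-specific sheets (all values are made up).
emissions = pd.DataFrame(
    {"ASR": ["A1", "A1", "A2"], "Year": [2011, 2014, 2014], "nh3": [1.0, 1.2, 0.8]}
)
socio = pd.DataFrame(
    {"ASR": ["A1", "A2", "A2"], "Year": [2014, 2014, 2015], "income": [20.5, 19.0, 19.4]}
)
livestock = pd.DataFrame({"ASR": ["A2"], "Year": [2014], "cattle": [120]})


def full_join(left, right):
    """Outer-join two tables on the keys, failing if a key pair is duplicated."""
    return left.merge(right, on=KEYS, how="outer", validate="one_to_one")


# Chain the joins over all tables, then order rows by ASR and Year.
merged = reduce(full_join, [emissions, socio, livestock])
merged = merged.sort_values(KEYS).reset_index(drop=True)
print(merged)
```

With the real workbook, the list of tables could plausibly be obtained with `pd.read_excel(path, sheet_name=None)`, which returns one data frame per sheet. An outer join keeps every (ASR, Year) pair found in any source, so pairs missing from a source appear as missing values in that source's columns.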

Notes (English)

Technical note on the harmonization workflow

The SCARFACE - Methodological note.pdf file contains a technical note on the statistical methodologies used to spatially align gridded datasets via spatial block kriging (Section S1), the post-stratification procedure adopted to generate the spatio-temporal weighting system for FADN survey data (Section S2) and the Generalized Variance Function (GVF) methodology adopted to regularize direct estimates of the variance for FADN survey data (Section S3).

Update

This dataset was initially compiled for the "Sequestering CARbon through Forests, AgriCulture, and land usE (SCARFACE)" research initiative funded by the University of Milano-Bicocca and will be systematically updated whenever new information, for instance new releases of the underlying raw datasets, becomes available.

Note about data and replication code

The file SCARFACE - Data and replication code.zip contains the following:

  • A separate folder for each source-specific dataset (i.e., Animals, CAMSgrid, EDGARgrid, EDOgrid, EEAconc, EnvironmentalVars, LandCover, ISTATSocioEconomicData and CREA)
  • A folder with auxiliary functions to apply the spatial block kriging algorithm used to upscale gridded data (i.e., AuxFuns_Kriging)
  • Other files

Each folder contains:

  • R and Python code to generate (potential) intermediate output (e.g., spatially-aligned data not yet aggregated at the areal level) and the final dataset associated with each thematic domain (e.g., yearly georeferenced panel data with area-level values)
  • Intermediate output(s)
  • Pre-processed final dataset associated with each thematic domain

Notice the following:

  1. The folders DO NOT contain the original data for EDGAR emissions inventories, CAMS concentrations inventories, EEA concentration inventories, EDO extreme events variables and ERA5 environmental variables (oversized files). Data can be retrieved through the corresponding scripts. Intermediate datasets (i.e., temporally and spatially interpolated data) are provided;
  2. The folders CONTAIN the original data for ISTAT socio-economic variables, BDN livestock, land cover, ISTAT agricultural census 2020 and 2010. Data can also be retrieved through the corresponding scripts. Intermediate datasets are provided;
  3. Before running the R and Python scripts, it is necessary to change the working directory (notice that the code is set up to work inside the same folder where the original data are located).
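For the Python scripts, one way to handle the working-directory requirement in item 3 is a small context manager that switches into a source folder and restores the previous directory afterwards. This is only a sketch: the helper name `working_directory` is hypothetical and not part of the replication code.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(folder):
    """Temporarily switch the working directory, restoring it afterwards."""
    previous = Path.cwd()
    os.chdir(folder)
    try:
        yield Path.cwd()
    finally:
        # Always return to the starting directory, even if the body fails.
        os.chdir(previous)
```

Usage would look like `with working_directory("path/to/EDGARgrid"): ...`, where the path is whatever location the archive was extracted to on the user's machine.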

Note on the availability of FADN data

While all datasets included in SCARFACE are provided by national and international public institutions and are, in principle, publicly available, farm activity and agro-economic indicators derived from the FADN represent the only exception. In fact, they are accessed under a research data agreement with the data provider, that is, the Italian Council for Agricultural Research and Economics -- Research Centre for Agricultural Policies and Bioeconomy (CREA-PB). Specifically, the analysis relies on farm-level microdata that cannot be publicly disclosed without appropriate anonymization and spatial aggregation. The methodology adopted to ensure confidentiality is described in detail in Section Methods of the companion paper.

Usage notes

The dataset is ready to be used as is by any user interested in studying the complex interactions among agricultural systems, air quality management, environmental processes, and human activities in a critical area of Europe. However, users should take into account some potential caveats.

Potential missing values

Due to the structural heterogeneity of the data sources (i.e., different spatial and temporal coverages), several missing values can arise, especially across the temporal domain (e.g., socio-economic data cover the period 2014--2023, while emissions data cover the whole period 2011--2024). Note that no missing-data treatment was applied while building the dataset. For transparency purposes, the file SCARFACE_MissingAnalysis_April2026.xlsx provides a descriptive analysis of the missing values detected for each individual dataset, year, variable and ASR.
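A per-year summary of missing values similar in spirit to the one described above can be sketched as follows. The toy panel and its column names (emis, income) are invented for illustration; the layout of the actual missing-analysis workbook may differ.

```python
import numpy as np
import pandas as pd

# Toy merged panel: income is unavailable in 2011, emissions are complete.
panel = pd.DataFrame(
    {
        "ASR": ["A1", "A1", "A2", "A2"],
        "Year": [2011, 2014, 2011, 2014],
        "emis": [1.0, 1.1, 0.9, 1.0],
        "income": [np.nan, 20.5, np.nan, 19.0],
    }
)

value_cols = [c for c in panel.columns if c not in ("ASR", "Year")]

# Share of ASRs with a missing value, for each year and variable.
missing_by_year = panel[value_cols].isna().groupby(panel["Year"]).mean()
print(missing_by_year)
```

Grouping by 'ASR' instead of 'Year' gives the complementary view, that is, the share of years missing for each spatial unit.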

Uncertainty quantification and interpolation diagnostics

In addition to the main data products, the SCARFACE framework provides a comprehensive set of uncertainty-related information, including metrics associated with temporal aggregation, spatial interpolation, and the tuning of interpolation hyperparameters.

First, temporal aggregation uncertainty arises from the harmonization of input datasets originally available at monthly or higher temporal resolution (e.g., meteorological and drought indicators). Prior to spatial alignment via block kriging (BK), these data were summarized at the grid-cell level using annual and seasonal statistics (mean, minimum, maximum, and standard deviation). The standard deviation captures intra-annual variability and provides an indirect measure of temporal aggregation uncertainty.
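The annual summary step can be illustrated with a minimal sketch on toy grid-cell data. The cell labels, dates and temperature values are invented, and the use of the sample standard deviation (pandas default, ddof=1) is an assumption, since the description does not state which estimator was used.

```python
import pandas as pd

# Toy sub-annual series for two grid cells (values are made up).
monthly = pd.DataFrame(
    {
        "cell": ["c1"] * 4 + ["c2"] * 4,
        "date": pd.to_datetime(
            ["2011-01-15", "2011-07-15", "2012-01-15", "2012-07-15"] * 2
        ),
        "temp": [2.0, 24.0, 3.0, 25.0, 0.0, 20.0, 1.0, 23.0],
    }
)
monthly["Year"] = monthly["date"].dt.year

# Annual mean, minimum, maximum and standard deviation per grid cell.
# "std" is the sample standard deviation (ddof=1), an assumption here.
annual = (
    monthly.groupby(["cell", "Year"])["temp"]
    .agg(["mean", "min", "max", "std"])
    .reset_index()
)
print(annual)
```

The resulting 'std' column is the quantity that the text interprets as an indirect measure of temporal aggregation uncertainty.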

Second, for gridded data interpolated using local spatial block kriging we provide, in addition to the harmonized dataset, the estimated parameters, the covariance function used for the interpolation and the corresponding cross-validation results (see the objects '*_BKinterpolated.RData' and '* _BK_output.RData' inside each 'InterpolationResults' folder and the corresponding scripts '* - Application of Xval spatial block kriging.R' and '* - Stacking block kriging estimates'). Similarly, for FADN survey data we provide both the original non-smoothed variances and the GVF-smoothed variances (as well as the empirical smoothing model), allowing a fair comparison of the pre-regularization and post-regularization distributions (see folder CREA/GVFDiagPlots).

Third, for every variable spatially interpolated using local spatial block kriging, estimates of the ASR-level predicted kriging average (i.e., '*_BKm') and the corresponding ASR-level kriging prediction variance (i.e., '*_BKmv') are provided. Likewise, for FADN variables aggregated at the ASR level using the Horvitz-Thompson (HT) expansion estimator, estimates of the average value (HTm) and the total value (HTt), as well as the corresponding variances (HTmv and HTtv), are provided. In both cases, the estimated area-level variances quantify the interpolation uncertainty and are intended to support downstream analyses in which estimation precision is critical, such as weighted regression models, uncertainty propagation, sensitivity analyses, and the identification of domains with lower inferential reliability. In this regard, for modeling purposes, such uncertainty should be carefully taken into account through specific model-based adjustments, such as multi-stage bootstrap inference, in order to correct estimates and inference for the prior knowledge of the underlying data interpolation process.
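As one simple example of the weighted regression use case mentioned above, the sketch below fits a straight line by inverse-variance weighted least squares, treating the response as an area-level estimate (e.g., a '*_BKm' or HTm column) with a known variance (e.g., '*_BKmv' or HTmv). This is plain weighted least squares, not the multi-stage bootstrap adjustment the text recommends; the function name and the toy numbers are hypothetical.

```python
import numpy as np


def inverse_variance_wls(x, y, variances):
    """Fit y = b0 + b1 * x with weights 1 / variance; returns [b0, b1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)

    design = np.column_stack([np.ones_like(x), x])
    # Scaling rows by sqrt(weight) turns WLS into an ordinary least squares fit.
    root_w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return beta


# Toy data: an exactly linear response with unequal estimation variances.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
var = np.array([0.1, 0.2, 0.4, 0.8])

beta = inverse_variance_wls(x, y, var)
print(beta)
```

Units with a larger prediction variance contribute less to the fit, which is the basic mechanism by which the published variances can enter a downstream model.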

Other (English)

Acknowledgements

This work is part of the "Sequestering CARbon through Forests, AgriCulture, and land usE (SCARFACE)" research project, funded by the University of Milano-Bicocca, under grant number 2024-ATEQC-0048. Further information about the project can be found at the link https://www.paolomaranzano.net/scarface.

We acknowledge the Italian Council for Agricultural Research and Economics -- Research Centre for Agricultural Policies and Bioeconomy (CREA-PB) for providing the research team with access to the RICA-FADN database within the AgroGeoStat research agreement.

We also acknowledge the GEMMA center in the framework of project MUR "Dipartimenti di eccellenza 2023-2027".

We also acknowledge researchers from Associazione Economia e Sostenibilità (EStà) and Terra! for the feedback provided within the joint research projects Allevamenti intensivi e sistemi alimentari sostenibili and Per il lavoro dignitoso e la transizione giusta: verso l’Osservatorio Lavoro e Ambiente nei sistemi alimentari.

We also acknowledge and thank all the colleagues who provided the research team with comments and suggestions, in particular Laura Marcis (University of Valle d'Aosta, IT), Renato Salvatore (University of Cassino and Southern Lazio, IT), Paul Parker (UCSC, USA) and Scott Holan (University of Missouri, USA) for the survey data integration.

Files

SCARFACE_logo_500x500.png

Files (14.0 GB)

Checksum  Size
md5:e84f53de07fecf3a3bf8091f680d7809  23.2 kB
md5:dca04d461a1413ecba93634f640a0253  304.0 kB
md5:b8739d57aa655043ec4724d2a6b61374  1.2 MB
md5:c0b90e4e4a94f166eefd1bcf4f4527dd  6.9 GB
md5:0437f0aac72de1a0355c55cc6c1c750e  6.9 GB
md5:35f66732fb225c725d0746908615f87d  550.0 kB
md5:300edcd2c3422dfe004a7ea2b682ec0a  469.6 kB
md5:2f3c4a05545754d251f2da0bc7e39133  39.6 MB
md5:0e4c7fb429a1fe6be0ec34d1a6146675  75.5 MB
md5:fdaadfd1c1f6002c4e971b26254cb2d6  39.3 MB
md5:29e5aec29f554dda608197e61a56de87  73.0 MB
md5:88d72c7be27ac45da049d1277bc6a380  74.2 kB
md5:ea5a326e8b7f5a4b1a2e430194584b0a  535.9 kB
md5:0d5f40f3b076dab0f919ece548218c63  7.2 kB

Additional details

Funding

University of Milano-Bicocca
Sequestering CARbon through Forests, AgriCulture, and land usE (SCARFACE) 2024-ATEQC-0048

Software

Repository URL
https://github.com/ScarfaceSeqCARForAgriCultLandusE
Programming language
R, Python
Development Status
Active