Working with regional data is very hard. If you have already tried, you know it, and if you have not, probably you would underestimate the problem. On average every 2-3 years there is change in national boundaries, such as the unification of Germany, or the separation of Ethiopia and Eritrea. In EU, OECD or global datasets, these are rare events and do not cause a big problem. Regional boundaries, on the other hand, change in a couple of dozen points just in the European Union every three years, and globally on countless ocassions every year. Comparing regions of Europe, Australia, United States, Canada, just to start with developed and data rich countries, is very difficult, and if you start adding other countries, it gets harder and harder.

Yet it is a very rewarding task. If you compare the 27 EU member states, the 35 OECD countries, or the 50 states of the USA, you usually do not have a large enough data panel for a detailed analysis on an aggregated level. And on a micro-data level, such broad comparisons are almost always impossible. However, if you are able to organize EU or OECD or US sub-national aggregates into data panels, you can work with hundreds of observations with sufficient heterogenity to find far more interesting

The Problem

regional_gdp <- eurostat::get_eurostat ("tgs00003",
                                        time_format = "num" ) %>%
  mutate ( country = regions::get_country_code(geo)) %>%
  dplyr::filter ( country == "FR")

population <- eurostat::get_eurostat ("tgs00096",
                                      time_format = "num" ) %>%
  mutate ( country = regions::get_country_code(geo)) %>%
  dplyr::filter ( country == "FR", 
                  sex == "T",     #TOTAL without division by sex
                  age == "TOTAL") #TOTAL without division by age

alt_regional_gdp <- eurostat::get_eurostat ( "nama_10r_2gdp",
                                             time_format = "num" ) %>%
  mutate ( country = regions::get_country_code(geo)) %>%
  dplyr::filter ( unit == "MIO_EUR", 
                  country == "FR" ) %>%
  dplyr::select ( all_of(c("geo", "time", "values")))

One problem is that you will not find the French GDP data prior to 2015 in the Eurostat Regional gross domestic product by NUTS 2 regions - million EUR. If this does not put you off, you will realize that some datasets are back-filled, others are not, and you find eventually the Gross domestic product (GDP) at current market prices by NUTS 2 regions data table. Even the famous wayback machine cannot help. PDF hardcopy? Only a colorful map in the Eurostat regional yearbook 2013. The wayback machine did its best but did not crawl the whole database. I am not a fan of the undo button when it comes to official statistical releases, but I think that Eurostat history is gone.

Let’s go back to France! Mind my French, you will not find anything like Regional gross domestic product by NUTS 2 regions - million EUR on the website of INSEE, because GDP is PIB in that part of the world. So what you are looking for is actually Produits intérieurs bruts régionaux de 1990 à 2015.

download.file('https://www.insee.fr/fr/statistiques/fichier/1893220/PIB_1990_2015_regions_diffusion.xls', 
              destfile = file.path('data-raw', 
                                   'regional_gdp_fr.xlx'), 
              mode = 'wb')  #for Windows machines

Joining Regional Data

To keep the comparison easy to follow, let’s set our sights to two particularly sunny overseas regions of France: Réunion and Guadeloupe!

load (file = file.path('..', 'data-raw' , 'regional_gdp_blogpost.rda'))

regional_gdp_france <- readxl::read_excel(
  path = file.path('..','data-raw', 'regional_gdp_fr.xlx'), 
  sheet = 'PIB en valeur 1990-2015', 
  skip = 4, 
  na = "n.d") %>%
  dplyr::filter ( !is.na("Région")) %>%
  purrr::set_names ( c("region", paste0("Y", 1990:2015))) %>%
  tidyr::pivot_longer( cols = starts_with("Y"), 
                       names_to = 'time', 
                       values_to = 'values') %>%
  dplyr::mutate ( time  = as.numeric(gsub("Y", "", time)))

pib <- regional_gdp_france %>%
  dplyr::filter ( region %in% c("Réunion", "Guadeloupe", "Martinique"), 
                  time %in% c(2008,2012)) %>%
  dplyr::rename ( gdp = values, 
                  geo = region )

knitr::kable(pib)
geo time gdp
Guadeloupe 2008 7550.798
Guadeloupe 2012 8038.617
Martinique 2008 8142.636
Martinique 2012 8590.551
Réunion 2008 15633.902
Réunion 2012 17108.526

OK, so which of FR, FR1, FR10, FRB, FRB0, FRC, FRC1, FRC2, FRD, FRD1, FRD2, FRE, FRE1, FRE2, FRF, FRF1, FRF2, FRF3, FRG, FRG0, FRH, FRH0, FRI, FRI1, FRI2, FRI3, FRJ, FRJ1, FRJ2, FRK, FRK1, FRK2, FRL, FRL0, FRM, FRM0, FRY, FRY1, FRY2, FRY3, FRY4, FRY5, FRZ, FRZZ is actually Guadeloupe? Or Réunion? Or is it La Réunion? Let’s have a look at the geographical name labels:

gdp_labelled <- eurostat::get_eurostat ( "nama_10r_2gdp",
                                         time_format = "num" ) %>%
  eurostat::label_eurostat( fix_duplicated = TRUE) %>%
  dplyr::filter ( unit == "Million euro", 
                  time == 2016) %>%
  dplyr::select ( all_of(c("geo", "time", "values")) ) %>%
  dplyr::filter ( geo %in% c("La Réunion", "Guadeloupe", "Martinique")) %>%
  dplyr::rename ( gdp = values )


population_labelled <- eurostat::get_eurostat ("tgs00096",
                                      time_format = "num" ) %>%
  eurostat::label_eurostat( fix_duplicated = TRUE) %>%
  dplyr::filter ( grepl("Réunion|Guadeloupe|Martinique", geo), 
                  sex == "Total",   
                  age == "Total", 
                  time %in% c(2008, 2012, 2016) ) %>%
  dplyr::select(geo, time, values) %>%
  dplyr::rename( population = values )

Let’s try to join them:

gdp_labelled %>%
  dplyr::full_join( population_labelled, by = c("geo", "time")) %>%
  dplyr::full_join( pib, by = c("geo", "time", "gdp") ) %>%
  knitr::kable()
geo time gdp population
Guadeloupe 2016 9904.940 429856
Martinique 2016 9268.370 376480
La Réunion 2016 19019.890 852924
Martinique (NUTS 2013) 2008 NA 397693
La Réunion (NUTS 2013) 2008 NA 808250
Martinique 2008 NA 397693
La Réunion 2008 NA 808250
Martinique (NUTS 2013) 2012 NA 388364
La Réunion (NUTS 2013) 2012 NA 833944
Martinique 2012 NA 388364
La Réunion 2012 NA 833944
Guadeloupe (NUTS 2013) 2016 NA 431602
Martinique (NUTS 2013) 2016 NA 378043
La Réunion (NUTS 2013) 2016 NA 855992
Guadeloupe 2008 7550.798 NA
Guadeloupe 2012 8038.617 NA
Martinique 2008 8142.636 NA
Martinique 2012 8590.551 NA
Réunion 2008 15633.902 NA
Réunion 2012 17108.526 NA

Obviously something does not look very good here:

gdp_labelled %>%
  dplyr::full_join( pib, by = c("geo", "time", "gdp") ) %>%
  tidyr::pivot_wider ( names_from = "geo", 
                       values_from = "gdp")
## # A tibble: 3 x 5
##    time Guadeloupe Martinique `La Réunion` Réunion
##   <dbl>      <dbl>      <dbl>        <dbl>   <dbl>
## 1  2016      9905.      9268.       19020.     NA 
## 2  2008      7551.      8143.          NA   15634.
## 3  2012      8039.      8591.          NA   17109.

Validation

The regions package contains a joined, tidy version of all Correspondence tables published by Eurostat. As we can see, there were several changes in the metadata:

data(nuts_changes, package ="regions")

nuts_changes %>%
  filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
  select ( contains(c("2010")) ) %>%
  filter ( !is.na(code_2010)) %>%
  knitr::kable()
code_2010 geo_name_2010 change_2010
FR92 Martinique NA
FR94 Réunion NA
FR920 Martinique NA
FR940 Réunion NA
nuts_changes %>%
  filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
  select ( contains(c("2013")) ) %>%
  filter (!is.na(geo_name_2013)) %>%
  knitr::kable()
code_2013 geo_name_2013 change_2013
FRA2 Martinique code change
FRA4 La Réunion code, name change
FRA20 Martinique code change
FRA40 La Réunion code, name change
nuts_changes %>%
  filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
  select ( contains(c("2016")) ) %>%
  knitr::kable()
code_2016 geo_name_2016 change_2016
FRY1 Guadeloupe recoded
FRY2 Martinique recoded
FRY4 La Réunion recoded
FRY10 Guadeloupe recoded
FRY20 Martinique recoded
FRY40 La Réunion recoded

First of all, Réunion was relabelled to La Réunion, which is a grammatically better representation in the French language, and makes column names a bit more awkward. Secondly, the boundaries of Guadeloupe were changed, which means that you should better check if this may affect data comparability. And because the data is not fully compatible, the French statistical authority correctly recoded them. The idea of using the alphanumeric geographical codes is that they are more computer friendly, so I will stick to them, but they cannot be uniquely matched with the PIB data, because both regions have two alphanumerical codes. How is this possible?

If you look at the duplication, it ends with a 0, which means that it is a technical duplication. Because both Guadeloupe, and La Réunion are small regions, they are not further sub-divided to NUTS3 levels. For comparability with both NUTS2 and NUTS3 level data, they are included in both typologies with a different code, adding a 0 to the NUTS3 level codes. (All NUTS2 level codes are made of two characters for the country and 2 numbers, and NUTS3 level codes are two characters for the country and three for the region.) This approach makes sense, because these regions are administratively behave like NUTS2 regions, but their size resembles NUTS3 regions. If you compare more political, administrative data, maybe you should compare them to other NUTS2 regions, and if your analysis is likely to be affected by homogenity of the population, then compare them to NUTS3 level regions.

Often Eurostat publishes datasets that contain mixed geo codes. The first step is to validate:

reunion_codes <- c("FRA4", "FRA40", "FR94", "FR940", "FRY4", "FRY40")

population %>%
  filter ( geo %in% c(reunion_codes), 
           time %in% c(2008,2012,2016)) %>%
  arrange(geo) %>%
  regions::validate_nuts_regions(nuts_year = 2016)%>%
  knitr::kable()
unit age sex geo time values country typology valid_2016
NR TOTAL T FRA4 2008 808250 FR NA FALSE
NR TOTAL T FRA4 2012 833944 FR NA FALSE
NR TOTAL T FRA4 2016 855992 FR NA FALSE
NR TOTAL T FRY4 2008 808250 FR nuts_level_2 TRUE
NR TOTAL T FRY4 2012 833944 FR nuts_level_2 TRUE
NR TOTAL T FRY4 2016 852924 FR nuts_level_2 TRUE

This dataset clearly breaches the definition of tidy data, but at least we have the data. Eurostat releases the data in the same table with two different coding.

We have a different problem with Guadeloupe: we do not have historical data because of the boundary change.

guadeloupe_codes <- c("FRA1", "FRA10", "FRY1", "FRY10")

population %>%
  filter ( geo %in% c(guadeloupe_codes), 
           time %in% c(2008,2012,2016)) %>%
  arrange(geo) %>%
  regions::validate_nuts_regions(nuts_year = 2016)%>%
  knitr::kable()
unit age sex geo time values country typology valid_2016
NR TOTAL T FRA1 2016 431602 FR NA FALSE
NR TOTAL T FRY1 2016 429856 FR nuts_level_2 TRUE

And at last:

martinique_codes <- c("FR92", "FR920", "FRA2", "FRA20", "FRY2", "FRY20")

population %>%
  filter ( geo %in% c(martinique_codes), 
           time %in% c(2008,2012,2016)) %>%
  arrange(geo) %>%
  regions::validate_nuts_regions(nuts_year = 2016)%>%
  knitr::kable()
unit age sex geo time values country typology valid_2016
NR TOTAL T FRA2 2008 397693 FR NA FALSE
NR TOTAL T FRA2 2012 388364 FR NA FALSE
NR TOTAL T FRA2 2016 378043 FR NA FALSE
NR TOTAL T FRY2 2008 397693 FR nuts_level_2 TRUE
NR TOTAL T FRY2 2012 388364 FR nuts_level_2 TRUE
NR TOTAL T FRY2 2016 376480 FR nuts_level_2 TRUE

Since I wanted to show the use of NUTS2 data, I’ll use the compatible alphanumerical codes from the NUTS2016 typology in this case:

regional_gdp  %>% 
  select ( geo, time, values )  %>%
  rename ( gdp = values )  %>%
  filter ( geo %in% c("FRY1", "FRY2", "FRY4"), 
           time %in% c(2008,2012,2016) ) %>%
  bind_rows (pib %>%
               mutate ( geo = case_when (
                 geo == "Guadeloupe" ~ 'FRY1', 
                 geo ==  "Martinique" ~ 'FRY2',
                 TRUE ~ "FRY4"
               )) ) %>%
  left_join ( 
    population %>%
      select ( all_of(c("geo", "time", "values"))) %>%
      filter ( geo %in% c(guadeloupe_codes, 
                          reunion_codes, 
                          martinique_codes), 
               time %in% c(2008,2012,2016)) %>%
      arrange(geo) %>%
      regions::validate_nuts_regions(nuts_year = 2016) %>%
      filter ( valid_2016 ) %>%
      select ( all_of(c("geo", "time", "values"))) %>%
      dplyr::rename( population = values ), 
    by = c("geo", "time")
    ) %>%
  arrange ( time, geo ) %>%
  mutate ( gdp_capita = gdp*1000000/population)
## # A tibble: 9 x 5
##   geo    time    gdp population gdp_capita
##   <chr> <dbl>  <dbl>      <dbl>      <dbl>
## 1 FRY1   2008  7551.         NA        NA 
## 2 FRY2   2008  8143.     397693     20475.
## 3 FRY4   2008 15634.     808250     19343.
## 4 FRY1   2012  8039.         NA        NA 
## 5 FRY2   2012  8591.     388364     22120.
## 6 FRY4   2012 17109.     833944     20515.
## 7 FRY1   2016  9905.     429856     23042.
## 8 FRY2   2016  9268.     376480     24618.
## 9 FRY4   2016 19020.     852924     22300.

Sadly, Guadeloupe is missing from all Eurostat population datasets, including demo_r_pjanaggr3 - which is a NUTS3 dataset in name only, it contains NUTS1, NUTS2 and NUTS3 data and, and from demo_r_pjangroup, too, which is in name a NUTS2 dataset, but it contains NUTS1 and NUTS2 data.

Next Steps

The regions package in its current, fledling 0.1.0 version currently helps