Working with regional data is very hard. If you have already tried, you know it, and if you have not, probably you would underestimate the problem. On average every 2-3 years there is change in national boundaries, such as the unification of Germany, or the separation of Ethiopia and Eritrea. In EU, OECD or global datasets, these are rare events and do not cause a big problem. Regional boundaries, on the other hand, change in a couple of dozen points just in the European Union every three years, and globally on countless ocassions every year. Comparing regions of Europe, Australia, United States, Canada, just to start with developed and data rich countries, is very difficult, and if you start adding other countries, it gets harder and harder.
Yet it is a very rewarding task. If you compare the 27 EU member states, the 35 OECD countries, or the 50 states of the USA, you usually do not have a large enough data panel for a detailed analysis on an aggregated level. And on a micro-data level, such broad comparisons are almost always impossible. However, if you are able to organize EU or OECD or US sub-national aggregates into data panels, you can work with hundreds of observations with sufficient heterogenity to find far more interesting
regional_gdp <- eurostat::get_eurostat ("tgs00003",
time_format = "num" ) %>%
mutate ( country = regions::get_country_code(geo)) %>%
dplyr::filter ( country == "FR")
population <- eurostat::get_eurostat ("tgs00096",
time_format = "num" ) %>%
mutate ( country = regions::get_country_code(geo)) %>%
dplyr::filter ( country == "FR",
sex == "T", #TOTAL without division by sex
age == "TOTAL") #TOTAL without division by age
alt_regional_gdp <- eurostat::get_eurostat ( "nama_10r_2gdp",
time_format = "num" ) %>%
mutate ( country = regions::get_country_code(geo)) %>%
dplyr::filter ( unit == "MIO_EUR",
country == "FR" ) %>%
dplyr::select ( all_of(c("geo", "time", "values")))
One problem is that you will not find the French GDP data prior to 2015 in the Eurostat Regional gross domestic product by NUTS 2 regions - million EUR. If this does not put you off, you will realize that some datasets are back-filled, others are not, and you find eventually the Gross domestic product (GDP) at current market prices by NUTS 2 regions data table. Even the famous wayback machine cannot help. PDF hardcopy? Only a colorful map in the Eurostat regional yearbook 2013. The wayback machine did its best but did not crawl the whole database. I am not a fan of the undo button when it comes to official statistical releases, but I think that Eurostat history is gone.
Let’s go back to France! Mind my French, you will not find anything like Regional gross domestic product by NUTS 2 regions - million EUR on the website of INSEE, because GDP is PIB in that part of the world. So what you are looking for is actually Produits intérieurs bruts régionaux de 1990 à 2015.
download.file('https://www.insee.fr/fr/statistiques/fichier/1893220/PIB_1990_2015_regions_diffusion.xls',
destfile = file.path('data-raw',
'regional_gdp_fr.xlx'),
mode = 'wb') #for Windows machines
To keep the comparison easy to follow, let’s set our sights to two particularly sunny overseas regions of France: Réunion and Guadeloupe!
load (file = file.path('..', 'data-raw' , 'regional_gdp_blogpost.rda'))
regional_gdp_france <- readxl::read_excel(
path = file.path('..','data-raw', 'regional_gdp_fr.xlx'),
sheet = 'PIB en valeur 1990-2015',
skip = 4,
na = "n.d") %>%
dplyr::filter ( !is.na("Région")) %>%
purrr::set_names ( c("region", paste0("Y", 1990:2015))) %>%
tidyr::pivot_longer( cols = starts_with("Y"),
names_to = 'time',
values_to = 'values') %>%
dplyr::mutate ( time = as.numeric(gsub("Y", "", time)))
pib <- regional_gdp_france %>%
dplyr::filter ( region %in% c("Réunion", "Guadeloupe", "Martinique"),
time %in% c(2008,2012)) %>%
dplyr::rename ( gdp = values,
geo = region )
knitr::kable(pib)
geo | time | gdp |
---|---|---|
Guadeloupe | 2008 | 7550.798 |
Guadeloupe | 2012 | 8038.617 |
Martinique | 2008 | 8142.636 |
Martinique | 2012 | 8590.551 |
Réunion | 2008 | 15633.902 |
Réunion | 2012 | 17108.526 |
OK, so which of FR, FR1, FR10, FRB, FRB0, FRC, FRC1, FRC2, FRD, FRD1, FRD2, FRE, FRE1, FRE2, FRF, FRF1, FRF2, FRF3, FRG, FRG0, FRH, FRH0, FRI, FRI1, FRI2, FRI3, FRJ, FRJ1, FRJ2, FRK, FRK1, FRK2, FRL, FRL0, FRM, FRM0, FRY, FRY1, FRY2, FRY3, FRY4, FRY5, FRZ, FRZZ is actually Guadeloupe? Or Réunion? Or is it La Réunion? Let’s have a look at the geographical name labels:
gdp_labelled <- eurostat::get_eurostat ( "nama_10r_2gdp",
time_format = "num" ) %>%
eurostat::label_eurostat( fix_duplicated = TRUE) %>%
dplyr::filter ( unit == "Million euro",
time == 2016) %>%
dplyr::select ( all_of(c("geo", "time", "values")) ) %>%
dplyr::filter ( geo %in% c("La Réunion", "Guadeloupe", "Martinique")) %>%
dplyr::rename ( gdp = values )
population_labelled <- eurostat::get_eurostat ("tgs00096",
time_format = "num" ) %>%
eurostat::label_eurostat( fix_duplicated = TRUE) %>%
dplyr::filter ( grepl("Réunion|Guadeloupe|Martinique", geo),
sex == "Total",
age == "Total",
time %in% c(2008, 2012, 2016) ) %>%
dplyr::select(geo, time, values) %>%
dplyr::rename( population = values )
Let’s try to join them:
gdp_labelled %>%
dplyr::full_join( population_labelled, by = c("geo", "time")) %>%
dplyr::full_join( pib, by = c("geo", "time", "gdp") ) %>%
knitr::kable()
geo | time | gdp | population |
---|---|---|---|
Guadeloupe | 2016 | 9904.940 | 429856 |
Martinique | 2016 | 9268.370 | 376480 |
La Réunion | 2016 | 19019.890 | 852924 |
Martinique (NUTS 2013) | 2008 | NA | 397693 |
La Réunion (NUTS 2013) | 2008 | NA | 808250 |
Martinique | 2008 | NA | 397693 |
La Réunion | 2008 | NA | 808250 |
Martinique (NUTS 2013) | 2012 | NA | 388364 |
La Réunion (NUTS 2013) | 2012 | NA | 833944 |
Martinique | 2012 | NA | 388364 |
La Réunion | 2012 | NA | 833944 |
Guadeloupe (NUTS 2013) | 2016 | NA | 431602 |
Martinique (NUTS 2013) | 2016 | NA | 378043 |
La Réunion (NUTS 2013) | 2016 | NA | 855992 |
Guadeloupe | 2008 | 7550.798 | NA |
Guadeloupe | 2012 | 8038.617 | NA |
Martinique | 2008 | 8142.636 | NA |
Martinique | 2012 | 8590.551 | NA |
Réunion | 2008 | 15633.902 | NA |
Réunion | 2012 | 17108.526 | NA |
Obviously something does not look very good here:
gdp_labelled %>%
dplyr::full_join( pib, by = c("geo", "time", "gdp") ) %>%
tidyr::pivot_wider ( names_from = "geo",
values_from = "gdp")
## # A tibble: 3 x 5
## time Guadeloupe Martinique `La Réunion` Réunion
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 9905. 9268. 19020. NA
## 2 2008 7551. 8143. NA 15634.
## 3 2012 8039. 8591. NA 17109.
The regions
package contains a joined, tidy version of all Correspondence tables published by Eurostat. As we can see, there were several changes in the metadata:
data(nuts_changes, package ="regions")
nuts_changes %>%
filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
select ( contains(c("2010")) ) %>%
filter ( !is.na(code_2010)) %>%
knitr::kable()
code_2010 | geo_name_2010 | change_2010 |
---|---|---|
FR92 | Martinique | NA |
FR94 | Réunion | NA |
FR920 | Martinique | NA |
FR940 | Réunion | NA |
nuts_changes %>%
filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
select ( contains(c("2013")) ) %>%
filter (!is.na(geo_name_2013)) %>%
knitr::kable()
code_2013 | geo_name_2013 | change_2013 |
---|---|---|
FRA2 | Martinique | code change |
FRA4 | La Réunion | code, name change |
FRA20 | Martinique | code change |
FRA40 | La Réunion | code, name change |
nuts_changes %>%
filter ( grepl ( "Réunion|Guadeloupe|Martinique", geo_name_2016) ) %>%
select ( contains(c("2016")) ) %>%
knitr::kable()
code_2016 | geo_name_2016 | change_2016 |
---|---|---|
FRY1 | Guadeloupe | recoded |
FRY2 | Martinique | recoded |
FRY4 | La Réunion | recoded |
FRY10 | Guadeloupe | recoded |
FRY20 | Martinique | recoded |
FRY40 | La Réunion | recoded |
First of all, Réunion was relabelled to La Réunion, which is a grammatically better representation in the French language, and makes column names a bit more awkward. Secondly, the boundaries of Guadeloupe were changed, which means that you should better check if this may affect data comparability. And because the data is not fully compatible, the French statistical authority correctly recoded them. The idea of using the alphanumeric geographical codes is that they are more computer friendly, so I will stick to them, but they cannot be uniquely matched with the PIB data, because both regions have two alphanumerical codes. How is this possible?
If you look at the duplication, it ends with a 0, which means that it is a technical duplication. Because both Guadeloupe, and La Réunion are small regions, they are not further sub-divided to NUTS3
levels. For comparability with both NUTS2
and NUTS3
level data, they are included in both typologies with a different code, adding a 0 to the NUTS3
level codes. (All NUTS2
level codes are made of two characters for the country and 2 numbers, and NUTS3
level codes are two characters for the country and three for the region.) This approach makes sense, because these regions are administratively behave like NUTS2
regions, but their size resembles NUTS3
regions. If you compare more political, administrative data, maybe you should compare them to other NUTS2
regions, and if your analysis is likely to be affected by homogenity of the population, then compare them to NUTS3
level regions.
Often Eurostat publishes datasets that contain mixed geo codes. The first step is to validate:
reunion_codes <- c("FRA4", "FRA40", "FR94", "FR940", "FRY4", "FRY40")
population %>%
filter ( geo %in% c(reunion_codes),
time %in% c(2008,2012,2016)) %>%
arrange(geo) %>%
regions::validate_nuts_regions(nuts_year = 2016)%>%
knitr::kable()
unit | age | sex | geo | time | values | country | typology | valid_2016 |
---|---|---|---|---|---|---|---|---|
NR | TOTAL | T | FRA4 | 2008 | 808250 | FR | NA | FALSE |
NR | TOTAL | T | FRA4 | 2012 | 833944 | FR | NA | FALSE |
NR | TOTAL | T | FRA4 | 2016 | 855992 | FR | NA | FALSE |
NR | TOTAL | T | FRY4 | 2008 | 808250 | FR | nuts_level_2 | TRUE |
NR | TOTAL | T | FRY4 | 2012 | 833944 | FR | nuts_level_2 | TRUE |
NR | TOTAL | T | FRY4 | 2016 | 852924 | FR | nuts_level_2 | TRUE |
This dataset clearly breaches the definition of tidy data, but at least we have the data. Eurostat releases the data in the same table with two different coding.
We have a different problem with Guadeloupe: we do not have historical data because of the boundary change.
guadeloupe_codes <- c("FRA1", "FRA10", "FRY1", "FRY10")
population %>%
filter ( geo %in% c(guadeloupe_codes),
time %in% c(2008,2012,2016)) %>%
arrange(geo) %>%
regions::validate_nuts_regions(nuts_year = 2016)%>%
knitr::kable()
unit | age | sex | geo | time | values | country | typology | valid_2016 |
---|---|---|---|---|---|---|---|---|
NR | TOTAL | T | FRA1 | 2016 | 431602 | FR | NA | FALSE |
NR | TOTAL | T | FRY1 | 2016 | 429856 | FR | nuts_level_2 | TRUE |
And at last:
martinique_codes <- c("FR92", "FR920", "FRA2", "FRA20", "FRY2", "FRY20")
population %>%
filter ( geo %in% c(martinique_codes),
time %in% c(2008,2012,2016)) %>%
arrange(geo) %>%
regions::validate_nuts_regions(nuts_year = 2016)%>%
knitr::kable()
unit | age | sex | geo | time | values | country | typology | valid_2016 |
---|---|---|---|---|---|---|---|---|
NR | TOTAL | T | FRA2 | 2008 | 397693 | FR | NA | FALSE |
NR | TOTAL | T | FRA2 | 2012 | 388364 | FR | NA | FALSE |
NR | TOTAL | T | FRA2 | 2016 | 378043 | FR | NA | FALSE |
NR | TOTAL | T | FRY2 | 2008 | 397693 | FR | nuts_level_2 | TRUE |
NR | TOTAL | T | FRY2 | 2012 | 388364 | FR | nuts_level_2 | TRUE |
NR | TOTAL | T | FRY2 | 2016 | 376480 | FR | nuts_level_2 | TRUE |
Since I wanted to show the use of NUTS2
data, I’ll use the compatible alphanumerical codes from the NUTS2016
typology in this case:
regional_gdp %>%
select ( geo, time, values ) %>%
rename ( gdp = values ) %>%
filter ( geo %in% c("FRY1", "FRY2", "FRY4"),
time %in% c(2008,2012,2016) ) %>%
bind_rows (pib %>%
mutate ( geo = case_when (
geo == "Guadeloupe" ~ 'FRY1',
geo == "Martinique" ~ 'FRY2',
TRUE ~ "FRY4"
)) ) %>%
left_join (
population %>%
select ( all_of(c("geo", "time", "values"))) %>%
filter ( geo %in% c(guadeloupe_codes,
reunion_codes,
martinique_codes),
time %in% c(2008,2012,2016)) %>%
arrange(geo) %>%
regions::validate_nuts_regions(nuts_year = 2016) %>%
filter ( valid_2016 ) %>%
select ( all_of(c("geo", "time", "values"))) %>%
dplyr::rename( population = values ),
by = c("geo", "time")
) %>%
arrange ( time, geo ) %>%
mutate ( gdp_capita = gdp*1000000/population)
## # A tibble: 9 x 5
## geo time gdp population gdp_capita
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 FRY1 2008 7551. NA NA
## 2 FRY2 2008 8143. 397693 20475.
## 3 FRY4 2008 15634. 808250 19343.
## 4 FRY1 2012 8039. NA NA
## 5 FRY2 2012 8591. 388364 22120.
## 6 FRY4 2012 17109. 833944 20515.
## 7 FRY1 2016 9905. 429856 23042.
## 8 FRY2 2016 9268. 376480 24618.
## 9 FRY4 2016 19020. 852924 22300.
Sadly, Guadeloupe is missing from all Eurostat population datasets, including demo_r_pjanaggr3 - which is a NUTS3
dataset in name only, it contains NUTS1
, NUTS2
and NUTS3
data and, and from demo_r_pjangroup, too, which is in name a NUTS2
dataset, but it contains NUTS1
and NUTS2
data.
The regions package in its current, fledling 0.1.0 version currently helps