Downloading Community Contextual Data
Research Question
Once we have identified the appropriate access metric, we can add contextual data to understand what drives access and to identify any disparities. Such datasets are often sourced from the US Census Bureau. In this tutorial we demonstrate how to explore and download the most commonly used population datasets from the Census Bureau, with and without spatial components. Please note that this tutorial focuses only on the American Community Survey (ACS) datasets available via the Census Bureau API.
Environment Setup
To replicate the code and functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.
Packages used
We will use the following packages:
- sf: to read/write sf (spatial) objects
- tidycensus: to download census variables using ACS API
- tidyverse: to manipulate and clean data
- tigris: to download census tiger shapefiles
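Before running the examples, these four packages need to be installed and attached. A minimal sketch, assuming the packages come from CRAN; the variable names pkgs and notInstalled are illustrative and not part of the original tutorial:

```r
# Packages used in this tutorial
pkgs <- c("sf", "tidycensus", "tidyverse", "tigris")

# Identify any packages that are not yet installed
notInstalled <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(notInstalled) > 0) {
  # Tell the user what to install instead of failing later
  message("Install first: install.packages(c(",
          paste0('"', notInstalled, '"', collapse = ", "), "))")
} else {
  # Attach all packages for the current session
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```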
Required Inputs and Expected Outputs
We will not be using an external input for this exercise.
Our output will be a .csv file and shapefile (.shp suite) with race data at the census tract level.
Get your Census API Key
To use the Census API, we need to sign up for an API key. The key is a string that identifies your requests to the Census Bureau's server. A key can be requested with an email address at https://api.census.gov/data/key_signup.html. Once we have the key, we can install it with the census_api_key() function from tidycensus, using install = TRUE so that the key is saved to .Renviron.
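A minimal sketch of the installation step, assuming the tidycensus function census_api_key() and a placeholder key string (replace "YOUR_API_KEY" with the key received by email):

```r
library(tidycensus)

# install = TRUE writes the key to .Renviron so it is available in future sessions.
# "YOUR_API_KEY" is a placeholder, not a real key.
census_api_key("YOUR_API_KEY", install = TRUE)

# Reload .Renviron so the key can also be used in the current session
readRenviron("~/.Renviron")
```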
In instances where we might not want to save our key in .Renviron, for example when using a shared computer, we can set the same key for the current session only by calling census_api_key() with install = FALSE.
To check an already installed Census API key, run Sys.getenv("CENSUS_API_KEY").
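A slightly more defensive version of this check, which avoids printing the full key to the console, might look like the sketch below. It assumes the key is stored in the CENSUS_API_KEY environment variable, which is where tidycensus looks for it:

```r
# tidycensus reads the key from the CENSUS_API_KEY environment variable
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  # Show only the first few characters rather than the whole key
  message("Census API key found: ", substr(key, 1, 4), "...")
} else {
  message("No Census API key is set for this session.")
}
```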
Download variables of interest
We can now start using the tidycensus package to download population-based datasets from the US Census Bureau. In this tutorial, we cover methods to download data at the state, county, ZIP code (ZCTA), and census tract levels, with and without the geometry of the geographic entities.
To download a particular variable or table using tidycensus, we need the relevant variable ID, which can be found by reviewing the variables available via the load_variables() function. For details on exploring the variables available via tidycensus and getting their identifiers, see the Explore variables available section in the Appendix.
We can now download the variables using the get_acs() function. Because ACS data is based on an annual sample, each data point is reported as an estimate with a margin of error (moe). The package provides both values for any requested variable in tidy format.
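To make the estimate and margin of error concrete, here is a small sketch in base R. It assumes that the published ACS margins of error correspond, by default, to a 90 percent confidence level, and it uses two rows copied from the state-level output shown later in this tutorial. The data frame name acsSample is illustrative.

```r
# Two rows taken from the Alabama output of get_acs() in this tutorial
acsSample <- data.frame(
  variable = c("white", "asian"),
  estimate = c(3317453, 64609),
  moe      = c(3345, 1251)
)

# Interval implied by the margin of error: estimate +/- moe
acsSample$lower <- acsSample$estimate - acsSample$moe
acsSample$upper <- acsSample$estimate + acsSample$moe

# Relative size of the margin of error, as a share of the estimate
acsSample$moeShare <- acsSample$moe / acsSample$estimate

print(acsSample)
```

In this sample the smaller group carries a proportionally larger margin of error (roughly 1.9 percent of the estimate for asian versus roughly 0.1 percent for white), which is worth keeping in mind when working at small geographies such as tracts.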
For the examples covered in this tutorial, the 4 main inputs for get_acs() function are:
- geography: for what scale to source the data for (state / county / tract / zcta)
- variables: character string or a vector of character strings of variable IDs to source
- year: the year to source the data for
- geometry: whether or not to include the geometry feature in the tibble (TRUE / FALSE)
State Level
To get data for only a specific state, we can add state = sampleStateName.
stateDf <- get_acs(geography = 'state', variables = c(totPop18 = "B01001_001",
hispanic ="B03003_003",
notHispanic = "B03003_002",
white = "B02001_002",
afrAm = "B02001_003",
asian = "B02001_005"),
year = 2018, geometry = FALSE)
head(stateDf)
## # A tibble: 6 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama totPop18 4864680 NA
## 2 01 Alabama white 3317453 3345
## 3 01 Alabama afrAm 1293186 2745
## 4 01 Alabama asian 64609 1251
## 5 01 Alabama notHispanic 4661534 393
## 6 01 Alabama hispanic 203146 393
As we can see, the data is available in tidy format. We can use other tools in the tidyverse to clean and manipulate it.
stateDf <- stateDf %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate) %>%
mutate(hispPr18 = hispanic/totPop18, WhitePr18 = white/totPop18,
AfrAmPr18 = afrAm/totPop18, AsianPr18 = asian/totPop18) %>%
select(GEOID,totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
head(stateDf)
## # A tibble: 6 x 6
## GEOID totPop18 hispPr18 WhitePr18 AfrAmPr18
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01 4864680 0.0418 0.682 0.266
## 2 02 738516 0.0693 0.648 0.0327
## 3 04 6946685 0.311 0.772 0.0439
## 4 05 2990671 0.0732 0.770 0.154
## 5 06 39148760 0.389 0.601 0.0579
## 6 08 5531141 0.214 0.842 0.0412
## # … with 1 more variable: AsianPr18 <dbl>
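In current versions of tidyr, spread() is superseded by pivot_wider(). The reshaping step could equivalently be sketched as follows, using a small stand-in for the long-format get_acs() output. The Alabama values are copied from the output shown earlier; the object names longDf and wideDf are illustrative.

```r
library(dplyr)
library(tidyr)

# Stand-in for the long (tidy) output of get_acs() for one state
longDf <- tibble(
  GEOID    = "01",
  NAME     = "Alabama",
  variable = c("totPop18", "white", "afrAm", "asian", "hispanic"),
  estimate = c(4864680, 3317453, 1293186, 64609, 203146)
)

# One row per geography, one column per variable, then compute shares
wideDf <- longDf %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(hispPr18 = hispanic / totPop18,
         WhitePr18 = white / totPop18) %>%
  select(GEOID, totPop18, hispPr18, WhitePr18)

wideDf
```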
County Level
Similarly, for the county level:
- use geography = 'county' to download data for all counties in the U.S.
- use geography = 'county', state = sampleStateName for all counties within a state
- use geography = 'county', state = sampleStateName, county = sampleCountyName for a specific county
We can also use the FIPS codes for the relevant state & counties. Finally, we can also write the tibble to a .csv file.
countyDf <- get_acs(geography = 'county', variables = c(totPop18 = "B01001_001",
hispanic ="B03003_003",
notHispanic = "B03003_002",
white = "B02001_002",
afrAm = "B02001_003",
asian = "B02001_005"),
year = 2018, state = 'IL', geometry = FALSE) %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate) %>%
mutate(hispPr18 = hispanic/totPop18, WhitePr18 = white/totPop18,
AfrAmPr18 = afrAm/totPop18, AsianPr18 = asian/totPop18) %>%
select(GEOID,totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
head(countyDf)
## # A tibble: 6 x 6
## GEOID totPop18 hispPr18 WhitePr18 AfrAmPr18
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 17001 66427 0.0154 0.931 0.0408
## 2 17003 6532 0.0112 0.624 0.332
## 3 17005 16712 0.0346 0.909 0.0624
## 4 17007 53606 0.214 0.874 0.0222
## 5 17009 6675 0.0428 0.774 0.204
## 6 17011 33381 0.0897 0.936 0.00932
## # … with 1 more variable: AsianPr18 <dbl>
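The step of writing the tibble to a .csv file is not shown in the code above. A minimal sketch using base R follows, with a two-row stand-in for countyDf (values copied from the county output above) and a temporary file path as a placeholder for the real output location. The names countySample, csvPath and countyBack are illustrative.

```r
# Stand-in for the first two rows of countyDf
countySample <- data.frame(
  GEOID     = c("17001", "17003"),
  totPop18  = c(66427, 6532),
  hispPr18  = c(0.0154, 0.0112),
  WhitePr18 = c(0.931, 0.624),
  stringsAsFactors = FALSE
)

# Placeholder path; replace with the desired file name
csvPath <- file.path(tempdir(), "IL_County_18.csv")
write.csv(countySample, csvPath, row.names = FALSE)

# Read GEOID back as character so codes with leading zeros stay intact
countyBack <- read.csv(csvPath, colClasses = c(GEOID = "character"))
str(countyBack)
```

Reading GEOID as character matters most for states such as Alabama ("01"), where a numeric column would silently drop the leading zero and break later joins.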
Zipcode Level
For the ZIP code level, use geography = 'zcta'. Given that ZIP codes cross state lines, ZCTA data is only available for the entire U.S.
zctaDf <- get_acs(geography = 'zcta',variables = c(totPop18 = "B01001_001",
hispanic ="B03003_003",
notHispanic = "B03003_002",
white = "B02001_002",
afrAm = "B02001_003",
asian = "B02001_005"),
year = 2018, geometry = FALSE) %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate) %>%
mutate(hispPr18 = hispanic/totPop18, WhitePr18 = white/totPop18,
AfrAmPr18 = afrAm/totPop18, AsianPr18 = asian/totPop18) %>%
select(GEOID,totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
head(zctaDf)
## # A tibble: 6 x 6
## GEOID totPop18 hispPr18 WhitePr18 AfrAmPr18
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 00601 17242 0.997 0.755 0.00841
## 2 00602 38442 0.935 0.794 0.0278
## 3 00603 48814 0.974 0.765 0.0395
## 4 00606 6437 0.998 0.408 0.0231
## 5 00610 27073 0.962 0.755 0.0257
## 6 00612 60303 0.993 0.807 0.0456
## # … with 1 more variable: AsianPr18 <dbl>
dim(zctaDf)
## [1] 33120 6
Census Tract Level
For the census tract level, at a minimum a state needs to be provided.
- use geography = 'tract', state = sampleStateName to download all tracts within a state
- use geography = 'tract', state = sampleStateName, county = sampleCountyName to download all tracts within a specific county
tractDf <- get_acs(geography = 'tract',variables = c(totPop18 = "B01001_001",
hispanic ="B03003_003",
notHispanic = "B03003_002",
white = "B02001_002",
afrAm = "B02001_003",
asian = "B02001_005"),
year = 2018, state = 'IL', geometry = FALSE) %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate) %>%
mutate(hispPr18 = hispanic/totPop18, WhitePr18 = white/totPop18,
AfrAmPr18 = afrAm/totPop18, AsianPr18 = asian/totPop18) %>%
select(GEOID,totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
head(tractDf)
For more details on the other geographies available via the tidycensus package, see the tidycensus documentation.
Get Geometry
The datasets downloaded so far did not have a spatial geometry feature attached to them. To run any spatial analysis on the race data above, we would need to join these dataframes to another spatially-enabled sf object. We can do so by joining on the ‘GEOID’ or any other identifier. We can download the geometry information using two methods :
- using tigris
- using tidycensus
Using tigris
To download and use the TIGER shapefiles shared by the US Census Bureau, we will use the tigris package. Set cb = TRUE to get generalized (cartographic boundary) files; these lack high-resolution detail and hence are smaller in size.
yeartoFetch <- 2018
stateShp <- states(year = yeartoFetch, cb = TRUE)
countyShp <- counties(year = yeartoFetch, state = 'IL', cb = TRUE)
zctaShp <- zctas(year = yeartoFetch, cb = TRUE)
tractShp <- tracts(state = 'IL', year = yeartoFetch, cb = TRUE)
Now we can merge these geometry files with the race data downloaded in the previous section.
For states:
# check object types & identifier variable type
# str(stateShp)
# str(stateDf)
stateShp <- merge(stateShp, stateDf, by.x = 'STATEFP', by.y = 'GEOID', all.x = TRUE)
head(stateShp)
## Simple feature collection with 6 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1489 ymin: 30.22333 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83
## STATEFP STATENS AFFGEOID GEOID STUSPS
## 1 01 01779775 0400000US01 01 AL
## 2 02 01785533 0400000US02 02 AK
## 3 04 01779777 0400000US04 04 AZ
## 4 05 00068085 0400000US05 05 AR
## 5 06 01779778 0400000US06 06 CA
## 6 08 01779779 0400000US08 08 CO
## NAME LSAD ALAND AWATER totPop18
## 1 Alabama 00 1.311740e+11 4593327154 4864680
## 2 Alaska 00 1.478840e+12 245481577452 738516
## 3 Arizona 00 2.941986e+11 1027337603 6946685
## 4 Arkansas 00 1.347689e+11 2962859592 2990671
## 5 California 00 4.035039e+11 20463871877 39148760
## 6 Colorado 00 2.684229e+11 1181621593 5531141
## hispPr18 WhitePr18 AfrAmPr18 AsianPr18
## 1 0.04175938 0.6819468 0.26583167 0.01328124
## 2 0.06930926 0.6483732 0.03267228 0.06303993
## 3 0.31141645 0.7721872 0.04394312 0.03294910
## 4 0.07324510 0.7700192 0.15413598 0.01470840
## 5 0.38881377 0.6010169 0.05792968 0.14315496
## 6 0.21420427 0.8417041 0.04120994 0.03122231
## geometry
## 1 MULTIPOLYGON (((-88.05338 3...
## 2 MULTIPOLYGON (((179.4825 51...
## 3 MULTIPOLYGON (((-114.8163 3...
## 4 MULTIPOLYGON (((-94.61783 3...
## 5 MULTIPOLYGON (((-118.6044 3...
## 6 MULTIPOLYGON (((-109.0603 3...
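Because all.x = TRUE keeps every row of the geometry table, geographies with no matching attribute row end up with NA values. A small sketch of how one might check for this, using plain data frames as stand-ins for the sf object and the ACS table. The population values come from the output above; the unmatched code "99" and all object names are hypothetical.

```r
# Stand-in for the geometry table (geometry column omitted)
shpSample <- data.frame(STATEFP = c("01", "02", "99"),
                        stringsAsFactors = FALSE)

# Stand-in for the ACS attribute table
dfSample <- data.frame(GEOID = c("01", "02"),
                       totPop18 = c(4864680, 738516),
                       stringsAsFactors = FALSE)

# Both key columns must be the same type (character here) for the join to match
stopifnot(is.character(shpSample$STATEFP), is.character(dfSample$GEOID))

merged <- merge(shpSample, dfSample, by.x = "STATEFP", by.y = "GEOID", all.x = TRUE)

# Codes that found no match in the attribute table
unmatched <- merged[is.na(merged$totPop18), "STATEFP"]
unmatched
```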
Similarly, for counties, ZCTAs, and census tracts we can use the code below, and then save the county results with geometry to a shapefile using write_sf.
countyShp <- merge(countyShp, countyDf, by.x = 'GEOID', by.y = 'GEOID', all.x = TRUE)%>%
select(GEOID, STATEFP, COUNTYFP, totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
zctaShp <- merge(zctaShp, zctaDf, by.x = 'GEOID10', by.y = 'GEOID', all.x = TRUE)
tractShp <- merge(tractShp, tractDf, by.x = 'GEOID', by.y = 'GEOID', all.x = TRUE)
write_sf(countyShp, "IL_County_18.shp")
Using tidycensus
The previous method adds an additional step: using the tigris package to download the shapefile.
The tidycensus package already wraps tigris within the get_acs() function, so we can simply download the dataset with its geometry by using geometry = TRUE.
The wrapper adds the geometry information to each variable sourced, so the file size can become large in the intermediate steps and slow down performance, even though the data is in tidy format. For large API requests, we recommend downloading the dataset without geometry information and obtaining the geometry separately, either by downloading a single nominal variable such as total population or per capita income with geometry = TRUE in get_acs(), or by using the tigris method covered in the previous section, and then merging the two.
tractDf <- get_acs(geography = 'tract', variables = c(totPop18 = "B01001_001",
hispanic ="B03003_003",
notHispanic = "B03003_002",
white = "B02001_002",
afrAm = "B02001_003",
asian = "B02001_005"),
year = 2018, state = 'IL', geometry = FALSE) %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate) %>%
mutate(hispPr18 = hispanic/totPop18, WhitePr18 = white/totPop18,
AfrAmPr18 = afrAm/totPop18, AsianPr18 = asian/totPop18) %>%
select(GEOID,totPop18,hispPr18,WhitePr18,AfrAmPr18, AsianPr18)
tractShp <- get_acs(geography = 'tract', variables = c(perCapitaIncome = "DP03_0088"),
year = 2018, state = 'IL', geometry = TRUE) %>%
select(GEOID, NAME, variable, estimate) %>%
spread(variable, estimate)
tractsShp <- merge(tractShp, tractDf, by.x = 'GEOID', by.y = 'GEOID', all.x = TRUE)
head(tractShp)
## Simple feature collection with 6 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -88.79336 ymin: 41.7943 xmax: -87.63536 ymax: 41.95088
## geographic CRS: NAD83
## GEOID
## 1 17031843800
## 2 17037001002
## 3 17031243000
## 4 17031250600
## 5 17031251700
## 6 17031260400
## NAME
## 1 Census Tract 8438, Cook County, Illinois
## 2 Census Tract 10.02, DeKalb County, Illinois
## 3 Census Tract 2430, Cook County, Illinois
## 4 Census Tract 2506, Cook County, Illinois
## 5 Census Tract 2517, Cook County, Illinois
## 6 Census Tract 2604, Cook County, Illinois
## perCapitaIncome geometry
## 1 19331 MULTIPOLYGON (((-87.64554 4...
## 2 11308 MULTIPOLYGON (((-88.79317 4...
## 3 48843 MULTIPOLYGON (((-87.68195 4...
## 4 22905 MULTIPOLYGON (((-87.7756 41...
## 5 14739 MULTIPOLYGON (((-87.74826 4...
## 6 12610 MULTIPOLYGON (((-87.74061 4...
Appendix
Explore variables available
Using tidycensus we can download datasets from various types of tables. Most commonly used are:
- Data Profiles - These are the most commonly used collection of variables grouped by category, e.g. Social (DP02), Economic (DP03), Housing (DP04), Demographic (DP05)
- Subject Tables - These generally have more detailed variables than the Data Profiles, grouped by category, e.g. Age & Sex (S0101), Disability Characteristics (S1810)
- The package also allows access to a suite of B & C tables.
We can explore all the variables for our year of interest by running the code below. Please note as the Profiles evolve, variable IDs might change from year to year.
sVarnames <- load_variables(2018, "acs5/subject", cache = TRUE)
pVarnames <- load_variables(2018, "acs5/profile", cache = TRUE)
otherVarnames <- load_variables(2018, "acs5", cache = TRUE)
head(pVarnames)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 DP02_0… Estimate!!HOUSEHOLDS… SELECTED SOCIAL CHAR…
## 2 DP02_0… Percent Estimate!!HO… SELECTED SOCIAL CHAR…
## 3 DP02_0… Estimate!!HOUSEHOLDS… SELECTED SOCIAL CHAR…
## 4 DP02_0… Percent Estimate!!HO… SELECTED SOCIAL CHAR…
## 5 DP02_0… Estimate!!HOUSEHOLDS… SELECTED SOCIAL CHAR…
## 6 DP02_0… Percent Estimate!!HO… SELECTED SOCIAL CHAR…
A tibble with table and variable information has three columns: name, label, and concept.
Name is a combination of the table ID and the variable ID within that table. Concept generally identifies the table name or grouping used to arrange variables. Label provides textual details about the variable.
We can explore these tibbles to identify the correct variable ID to use with the get_acs() function by using View(sVarnames) or other filters, e.g. for age:
sVarnames %>% filter(str_detect(concept, "AGE AND SEX")) %>% # search for this concept
filter(str_detect(label, "Under 5 years")) %>% # search for variables
mutate(label = sub('^Estimate!!', '', label)) %>% # remove unnecessary text
select(variableId = name, label) # drop unnecessary columns and rename
## # A tibble: 6 x 2
## variableId label
## <chr> <chr>
## 1 S0101_C01_002 Total!!Total population!!AGE!!Under …
## 2 S0101_C02_002 Percent!!Total population!!AGE!!Unde…
## 3 S0101_C03_002 Male!!Total population!!AGE!!Under 5…
## 4 S0101_C04_002 Percent Male!!Total population!!AGE!…
## 5 S0101_C05_002 Female!!Total population!!AGE!!Under…
## 6 S0101_C06_002 Percent Female!!Total population!!AG…
sVarnames %>% filter(str_sub(name, 1, 5) == "S0101") %>% # search for these tables
filter(str_detect(label, "Under 5 years")) %>% # search for variables
mutate(label = sub('^Estimate!!', '', label)) %>% # remove unnecessary text
select(variableId = name, label) # drop unnecessary columns and rename
## # A tibble: 6 x 2
## variableId label
## <chr> <chr>
## 1 S0101_C01_002 Total!!Total population!!AGE!!Under …
## 2 S0101_C02_002 Percent!!Total population!!AGE!!Unde…
## 3 S0101_C03_002 Male!!Total population!!AGE!!Under 5…
## 4 S0101_C04_002 Percent Male!!Total population!!AGE!…
## 5 S0101_C05_002 Female!!Total population!!AGE!!Under…
## 6 S0101_C06_002 Percent Female!!Total population!!AG…
For other variables, e.g. per capita income, we can search the Data Profile (DP) table variables.
pVarnames %>% filter(str_detect(label, "Per capita")) %>% # search for variables
mutate(label = sub('^Estimate!!', '', label)) %>% # remove unnecessary text
select(variable = name, label) # drop unnecessary columns and rename
## # A tibble: 2 x 2
## variable label
## <chr> <chr>
## 1 DP03_0088 INCOME AND BENEFITS (IN 2018 INFLATION-…
## 2 DP03_0088P Percent Estimate!!INCOME AND BENEFITS (…
pVarnames %>% filter(str_detect(label, "Under 5 years")) %>% # search for variables
mutate(label = sub('^Estimate!!', '', label)) %>% # remove unnecessary text
select(variable = name, label) # drop unnecessary columns and rename
## # A tibble: 2 x 2
## variable label
## <chr> <chr>
## 1 DP05_0005 SEX AND AGE!!Total population!!Under 5 …
## 2 DP05_0005P Percent Estimate!!SEX AND AGE!!Total po…
The order and structure of the profile tables can change from year to year, and with them the variable ID or label. When downloading the same dataset over different years, we therefore recommend using the standard B & C tables.
otherVarnames %>% filter(str_detect(label, "Per capita")) %>% # search for variables
mutate(label = sub('^Estimate!!', '', label)) %>% # remove unnecessary text
select(variable = name, label) # drop unnecessary columns and rename
## # A tibble: 10 x 2
## variable label
## <chr> <chr>
## 1 B19301_001 Per capita income in the past 12 mont…
## 2 B19301A_001 Per capita income in the past 12 mont…
## 3 B19301B_001 Per capita income in the past 12 mont…
## 4 B19301C_001 Per capita income in the past 12 mont…
## 5 B19301D_001 Per capita income in the past 12 mont…
## 6 B19301E_001 Per capita income in the past 12 mont…
## 7 B19301F_001 Per capita income in the past 12 mont…
## 8 B19301G_001 Per capita income in the past 12 mont…
## 9 B19301H_001 Per capita income in the past 12 mont…
## 10 B19301I_001 Per capita income in the past 12 mont…
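As an illustration of the recommendation to use B tables across years, the sketch below requests per capita income (B19301_001, from the list above) for several ACS 5-year releases. The helper name getIncomeForYear, the chosen years, and the guard around the API call are assumptions for illustration; the request is only made when tidycensus is installed and a Census API key is set.

```r
# Hypothetical helper: per capita income by state for one ACS 5-year release
getIncomeForYear <- function(yr) {
  df <- tidycensus::get_acs(geography = "state",
                            variables = c(perCapitaIncome = "B19301_001"),
                            year = yr, geometry = FALSE)
  df$year <- yr  # record which release each row came from
  df
}

# Illustrative choice of years
years <- 2016:2018

# Only call the API when the package and a key are available
if (requireNamespace("tidycensus", quietly = TRUE) &&
    nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  incomeByYear <- do.call(rbind, lapply(years, getIncomeForYear))
  head(incomeByYear)
}
```

Because the same variable ID is used for every year, the stacked result can be compared across releases without remapping identifiers.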
Contributors and Further Resources
Contributors
Moksha Menghaney, University of Chicago is the principal author of the initial version of this tutorial. Helpful improvements provided by Marynia Kolak.
Email: mmenghaney@uchicago.edu for any issues/comments.