2 Geocoding Resource Locations

2.1 Overview

A common goal in opioid environment research is to calculate and compare access metrics to different providers of Medications for Opioid Overuse Disorder (MOUDs). Before we can run any analytics on the resource location data, we need to convert resource addresses to spatial data points, which can be then used to calculate access metrics.

Geocoding is the process of converting addresses (like a street address) into geographic coordinates using a known coordinate reference system. We can then use these coordinates (latitude, longitude) to spatially enable our data. This means we convert to a spatial data frame (sf) within R for spatial analysis within our R session, and then save as a shapefile (a spatial data format) for future use. In this tutorial we demonstrate how to geocode resource location addresses and convert to spatial data points that can be used for future mapping and geospatial analysis.

Our objectives are thus to:

  • Geocode addresses to get geographic coordinates
  • Visualize the resource locations as points on a map in R
  • Transform a flat file (.CSV) to a spatially enabled shapefile (.SHP)

2.2 Environment Setup

To replicate the code & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.

2.2.1 Input/Output

Our input will be a CSV file that include addresses of our resources. This files can be found here.

  • Chicago Methadone Clinics, chicago_methadone.csv

We will convert these addresses to geographic coordinates using an appropriate coordinate reference system (CRS), and then spatially enable the data for mapping. We will then export the spatial dataframe as a shapefile.

2.2.2 Load Libraries

We will use the following packages in this tutorial:

  • sf: to manipulate spatial data
  • tmap: to visualize and create maps
  • tidygeocoder: to convert addresses to geographic coordinates

Then load the libraries for use. Note: The messages you see about GEOS, GDAL, and PROJ refer to software libraries that allow you to work with spatial data.

2.2.3 Load Data

We will use a CSV that includes methadone clinic addresses in Chicago as an example. We start with a small dataset to test our geocoding workflow, as best practice.

Let’s take a look at the first few rows of the dataset. Our data includes addresses but not geographic coordinates.

##   X                                                         Name                 Address    City
## 1 1                Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago
## 2 2                      Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview    3934 N. Lincoln Ave. Chicago
## 4 4                                        PDSSC - Chicago, Inc.     2260 N. Elston Ave. Chicago
## 5 5                          Center for Addictive Problems, Inc.        609 N. Wells St. Chicago
## 6 6                                Family Guidance Centers, Inc.     310 W. Chicago Ave. Chicago
##   State   Zip
## 1    IL 60640
## 2    IL 60640
## 3    IL 60613
## 4    IL 60614
## 5    IL 60654
## 6    IL 60654

2.3 Geocode addresses

2.3.1 Quality Control

Before geocoding, perform an initial quality check on the data. Note that the address, city, state, and zip code are all separated as different columns. This will make it easier to stitch together for a coherent, standard address for geocoding. Furthermore, there do not appear to be any major errors. The city name “Chicago” is spelled consistently, without missing addresses or zip codes. This will not always be the case, unfortunately. Data must be cleaned prior to loading into a geocoding service.

2.3.2 Selecting Geocoding Service

To get a geographic coordinate for each site, we’ll need to geocode. There are a number of geocoding options in R; here we use we the tidygeocoder package. It uses mutliple geocoding services, providing the user with an option to choose. It also provides the option to use a cascade method which queries other geocoding services incase the default method fails to provide coordinates.

When considering which geocoding service to use, consider scale and potential geocoding errors. Some geocoding services are more accurate than others, so if your coordinates were not coded precisely, try a different service. If you have thousands of addresses to geocode, you may require more complex data pipelines. The default method used here is via US Census geocoder, which allows around 10,000 addresses to be geocoded at once. Others have varying daily limits. The Google Maps API and ESRI Geocoding service are additional high-quality geocoding services with varying cost associated.

2.3.3 Test Geocoding Service

Before geocoding your entire dataset, first review the documentation for the geocoding service you’ll be using. In our example we use tidygeocoder, with documentation found here. Let’s test the service by starting with one address:

## # A tibble: 1 x 4
##   address                             latitude longitude geo_method
##   <chr>                                  <dbl>     <dbl> <chr>     
## 1 4545 North Broadway St. Chicago, IL     42.0     -87.7 census

What did the output look like? Get familiar with the input parameters, expected output, and review the documentation further if needed.

2.3.4 Prepare input parameter

To apply the function to multiple addresses, we first we need ensure that we have a character vector of full addresses.

## 'data.frame':    27 obs. of  6 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name   : Factor w/ 25 levels "*","A Rincon Family Services",..: 5 25 23 21 3 8 2 1 14 24 ...
##  $ Address: Factor w/ 27 levels "110 E. 79th St.",..: 20 21 17 6 23 10 16 3 5 8 ...
##  $ City   : Factor w/ 1 level "Chicago": 1 1 1 1 1 1 1 1 1 1 ...
##  $ State  : Factor w/ 1 level "IL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Zip    : int  60640 60640 60613 60614 60654 60654 60651 60607 60607 60616 ...

Next we convert all fields to character first to avoid issues with factors (a common peril of R!).

##   X                                                         Name                 Address    City
## 1 1                Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago
## 2 2                      Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview    3934 N. Lincoln Ave. Chicago
## 4 4                                        PDSSC - Chicago, Inc.     2260 N. Elston Ave. Chicago
## 5 5                          Center for Addictive Problems, Inc.        609 N. Wells St. Chicago
## 6 6                                Family Guidance Centers, Inc.     310 W. Chicago Ave. Chicago
##   State   Zip                                  fullAdd
## 1    IL 60640 4453 North Broadway st. Chicago IL 60640
## 2    IL 60640 4545 North Broadway St. Chicago IL 60640
## 3    IL 60613    3934 N. Lincoln Ave. Chicago IL 60613
## 4    IL 60614     2260 N. Elston Ave. Chicago IL 60614
## 5    IL 60654        609 N. Wells St. Chicago IL 60654
## 6    IL 60654     310 W. Chicago Ave. Chicago IL 60654

2.3.5 Batch Geocoding

Now we are ready to geocode the addresses. The “tibble” data structure below shows us the address, latitude, longitude and also the geocoding service used to get the coordinates. Note that geocoding takes a bit of time, so patience is required.

## # A tibble: 27 x 10
##        X Name               Address     City  State   Zip fullAdd        latitude longitude geo_method
##    <int> <fct>              <fct>       <fct> <fct> <int> <chr>             <dbl>     <dbl> <chr>     
##  1     1 "Chicago Treatmen… 4453 North… Chic… IL    60640 4453 North Br…     NA        NA   osm       
##  2     2 "Sundace Methadon… 4545 North… Chic… IL    60640 4545 North Br…     NA        NA   osm       
##  3     3 "Soft Landing Int… 3934 N. Li… Chic… IL    60613 3934 N. Linco…     42.0     -87.7 census    
##  4     4 "PDSSC - Chicago,… 2260 N. El… Chic… IL    60614 2260 N. Elsto…     41.9     -87.7 census    
##  5     5 "Center for Addic… 609 N. Wel… Chic… IL    60654 609 N. Wells …     41.9     -87.6 census    
##  6     6 "Family Guidance … 310 W. Chi… Chic… IL    60654 310 W. Chicag…     41.9     -87.6 census    
##  7     7 "A Rincon Family … 3809 W. Gr… Chic… IL    60651 3809 W. Grand…     41.9     -87.7 census    
##  8     8 "*"                140 N. Ash… Chic… IL    60607 140 N. Ashlan…     41.9     -87.7 osm       
##  9     9 "Healthcare Alter… 210 N. Ash… Chic… IL    60607 210 N. Ashlan…     41.9     -87.7 census    
## 10    10 "Specialized Assi… 2630 S. Wa… Chic… IL    60616 2630 S. Wabas…     41.8     -87.6 census    
## # … with 17 more rows

The code worked for all addresses except the first two. We already resolved the 4545 North Broadway St.address above but here in the dataframe we get NAs. It is pointing to some issue with the string input. These were missed in the previous quality check, but give us a clue to the types of errors we could see if geocoding more addresses. Unfortunately, such quirks are common across geocoding services in R and we just have to handle them. We manually update the full address strings to get apprpriate coordinates.

Now we can geocode the full suite of addresses with success:

## # A tibble: 27 x 10
##        X Name               Address     City  State   Zip fullAdd        latitude longitude geo_method
##    <int> <fct>              <fct>       <fct> <fct> <int> <chr>             <dbl>     <dbl> <chr>     
##  1     1 "Chicago Treatmen… 4453 North… Chic… IL    60640 4453 North Br…     42.0     -87.7 osm       
##  2     2 "Sundace Methadon… 4545 North… Chic… IL    60640 4545 North Br…     42.0     -87.7 osm       
##  3     3 "Soft Landing Int… 3934 N. Li… Chic… IL    60613 3934 N. Linco…     42.0     -87.7 census    
##  4     4 "PDSSC - Chicago,… 2260 N. El… Chic… IL    60614 2260 N. Elsto…     41.9     -87.7 census    
##  5     5 "Center for Addic… 609 N. Wel… Chic… IL    60654 609 N. Wells …     41.9     -87.6 census    
##  6     6 "Family Guidance … 310 W. Chi… Chic… IL    60654 310 W. Chicag…     41.9     -87.6 census    
##  7     7 "A Rincon Family … 3809 W. Gr… Chic… IL    60651 3809 W. Grand…     41.9     -87.7 census    
##  8     8 "*"                140 N. Ash… Chic… IL    60607 140 N. Ashlan…     41.9     -87.7 osm       
##  9     9 "Healthcare Alter… 210 N. Ash… Chic… IL    60607 210 N. Ashlan…     41.9     -87.7 census    
## 10    10 "Specialized Assi… 2630 S. Wa… Chic… IL    60616 2630 S. Wabas…     41.8     -87.6 census    
## # … with 17 more rows

2.4 Convert to Spatial Data

While we have geographic coordinates loaded in our data, it is still not spatially enabled. To convert to a spatial data format, we have to enable to coordinate reference system that connects the latitude and longitude recorded to actual points on Earth.

2.4.1 Spatial Reference Systems

There are thousands of ways to model the Earth, and each requires a different spatial reference system. This is a very complicated domain of spatial applications (for a primer see here), but for our purposes, we simplify by using a geodetic CRS that uses coordinates longitude and latitude. Not all coordinates will appear as a latitude/longitude, however, so it’s important to at least check for the CRS used when working with geographic data. The lat/long coordinates provided by the geocoding service we used report data using the CRS coded as 4326, a World Geodetic System (WGS84) model also used by Google Earth and many other applications. In this system, distance is measured as degrees and distorted. So while useful for visualizing points, we will need to convert to another CRS for other types of spatial analysis.

2.4.2 Enable Points

Next we convert our dataframe to a spatial data frame using the st_as_sf() function. The coords argument specifies which two columns are the X and Y for your data. We set the crs argument equal to 4326.

Please note longitude is entered as first column rather than the latitude. It is a very common mistake. The X, Y field actually refers to longitude, latitude.

##   X                                                         Name                 Address    City
## 1 1                Chicago Treatment and Counseling Center, Inc. 4453 North Broadway st. Chicago
## 2 2                      Sundace Methadone Treatment Center, LLC 4545 North Broadway St. Chicago
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview    3934 N. Lincoln Ave. Chicago
## 4 4                                        PDSSC - Chicago, Inc.     2260 N. Elston Ave. Chicago
## 5 5                          Center for Addictive Problems, Inc.        609 N. Wells St. Chicago
## 6 6                                Family Guidance Centers, Inc.     310 W. Chicago Ave. Chicago
##   State   Zip                                  fullAdd geo_method                   geometry
## 1    IL 60640 4453 North Broadway St.,Chicago IL 60640        osm POINT (-87.65566 41.96321)
## 2    IL 60640 4545 North Broadway St.,Chicago IL 60640        osm POINT (-87.65694 41.96475)
## 3    IL 60613    3934 N. Lincoln Ave. Chicago IL 60613     census POINT (-87.67818 41.95331)
## 4    IL 60614     2260 N. Elston Ave. Chicago IL 60614     census POINT (-87.67407 41.92269)
## 5    IL 60654        609 N. Wells St. Chicago IL 60654     census POINT (-87.63409 41.89268)
## 6    IL 60654     310 W. Chicago Ave. Chicago IL 60654     census POINT (-87.63636 41.89657)

Note that this is a data frame, but that it has a final column called “geometry” that stores the spatial information.

2.4.3 Visualize Points

We can now plot the location of the methadone clinics with base R. This is a recommended step to confirm that you translated your coordinates correctly. A common mistake is switching the lat/long values, so your points could plot across the globe. If that happens, repeat the step above with flipped long/lat values.

First we switch the tmap mode to view so we can look at the points with a live basemap layer.

## tmap mode set to interactive viewing

Next, we plot our points as dots, and add the basemap.

## Warning in `$.crs`(gm$shape.master_crs, "proj4string"): CRS uses proj4string, which is deprecated.

2.4.4 Convert to Shapefile

Finally, we save this spatial dataframe as a shapefile which can be used for further spatial analysis.