title: "rnassqs: An R package to access agricultural data via the USDA National Agricultural Statistics Service (USDA-NASS) ‘Quick Stats’ API"
authors:
The `rnassqs` R package [@rnassqs] is an API wrapper for the United States Department of Agriculture National Agricultural Statistics Service (USDA-NASS) ‘Quick Stats’ API [@quickstats]. The core functionality allows the user to query agricultural data from ‘Quick Stats’ in a reproducible and automated way.
`rnassqs` manages API authentication by setting a system environment variable for the duration of the R session. Convenience functions facilitate querying common data. Users can also use `rnassqs` to query the list of data parameters and the values available for a given parameter (for example, to see the commodities available). A query can request data in JSON or CSV format, and the response is parsed into a `data.frame` object.
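
As a rough illustration of this workflow, the sketch below registers an API key and then lists the query parameters and the values available for one of them. It assumes that the key is stored in an environment variable named `NASSQS_TOKEN`, and that `nassqs_auth()`, `nassqs_params()`, and `nassqs_param_values()` behave as described in the package documentation. The queries are skipped when no key or package installation is found.

```r
# Sketch only: assumes an API key is stored in the NASSQS_TOKEN
# environment variable before the session starts.
api_key <- Sys.getenv("NASSQS_TOKEN")
can_query <- nzchar(api_key) && requireNamespace("rnassqs", quietly = TRUE)

if (can_query) {
  # Register the key for the duration of the R session
  rnassqs::nassqs_auth(key = api_key)

  # List the parameters that can be used in a query
  print(rnassqs::nassqs_params())

  # List the first few values available for one parameter, here the commodities
  print(head(rnassqs::nassqs_param_values("commodity_desc")))
} else {
  message("No API key or rnassqs installation found; skipping the queries.")
}
```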
`rnassqs` was written to address issues that arise from accessing data via a point-and-click interface, particularly in terms of automation and reproducibility of data requests.
‘Quick Stats’ is a web interface to access data produced by USDA-NASS. The data comes primarily from the Census of Agriculture [@agCensus], but also includes data from USDA-NASS surveys on a wide range of topics. The Census of Agriculture is conducted every five years in years ending in ‘2’ and ‘7’. The most recent census was completed in 2017. ‘Quick Stats’ provides access to census data beginning in 1997, and some survey data as far back as 1850.
Aggregate data from the census and surveys is released primarily at the national, state, and county level, though some data may be released for congressional districts, watersheds, and zip codes. It includes a range of data classified under five sectors: Animals & Products, Crops, Demographics, Economics, and Environmental. Examples of data available in these sectors include counts of farms, farm operators, acres of cropland, farm sales, farm expenses, and crop yields, to name a few.
`rnassqs` over ‘Quick Stats’

‘Quick Stats’ provides a number of selection fields in which the user can select categories of data. The selections available in each field change to reflect the options that remain given the user's other selections. Data requests are limited to 50,000 records. This works well for exploring the available data or for quick access to data for a single use. However, there are several cases in which the ‘Quick Stats’ interface is not ideal:

1. The data needed exceeds the 50,000-record limit and must be split across multiple requests.
2. The same request must be repeated over time to access new data as it is released.
3. The request needs to be reproducible by others.
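
The 50,000-record limit can be checked before any data is requested. The sketch below shows one way to do so. It assumes that `nassqs_record_count()` returns an object with a `count` element, as described in the package documentation, and that the API key is available in the `NASSQS_TOKEN` environment variable. The helper `exceeds_limit()` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: TRUE when a record count is above the request limit
exceeds_limit <- function(count, limit = 50000) {
  as.integer(count) > limit
}

# Example query: county-level crop yields for a single year
params <- list(sector_desc = "CROPS",
               statisticcat_desc = "YIELD",
               agg_level_desc = "COUNTY",
               year = 2017)

# Only contact the API when a key and the package are available
if (nzchar(Sys.getenv("NASSQS_TOKEN")) &&
    requireNamespace("rnassqs", quietly = TRUE)) {
  records <- rnassqs::nassqs_record_count(params)
  if (exceeds_limit(records$count)) {
    message("Too many records for one request; split the query, e.g. by year.")
  }
}
```

Checking the count first avoids sending a request that the API would reject for being too large.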
`rnassqs` addresses each of these issues by making the ‘Quick Stats’ API accessible from R code. The user can loop over a series of requests to address the first issue, rerun the same query (perhaps automatically on a schedule) to pick up new data to address the second, and share code that lets others reproduce the same data request to address the third.
For example, there are currently 332,125 records of crop yields in all U.S. counties from 2000 to 2018. Accessing this data through ‘Quick Stats’ would require manually selecting either a set of years or a set of states to reduce each request to fewer than 50,000 records, and then combining the results. With `rnassqs` this can be done with:

    # Access yields for all counties and all crops
    # (assumes the API key has already been set, e.g. with rnassqs::nassqs_auth())
    params <- list(sector_desc = "CROPS",
                   group_desc = c("FIELD CROPS", "FRUIT & TREE NUTS",
                                  "HORTICULTURE", "VEGETABLES"),
                   statisticcat_desc = "YIELD",
                   agg_level_desc = "COUNTY")

    # Get all years from 2000 to 2018 in a list of data.frames
    data_list <- lapply(2000:2018, function(yr) {
      params$year <- yr
      # Request the data for this year
      rnassqs::nassqs(params)
    })

    # Combine the list of data.frames into a single data.frame
    d <- do.call("rbind", data_list)

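
The same pattern can be wrapped in a small function so that it is easy to rerun, for instance from a scheduled script. The sketch below is an illustration rather than part of the package: `fetch_by_year()`, its `fetch` argument, and `fake_fetch()` are hypothetical names. By default the function simply calls `rnassqs::nassqs()` once per year; the demonstration substitutes a stand-in so that it runs without contacting the API.

```r
# Hypothetical helper: request one year at a time and combine the results.
# `fetch` defaults to rnassqs::nassqs but can be swapped out, e.g. for testing.
fetch_by_year <- function(params, years,
                          fetch = function(p) rnassqs::nassqs(p)) {
  data_list <- lapply(years, function(yr) {
    params$year <- yr   # modifies a local copy only
    fetch(params)
  })
  do.call("rbind", data_list)
}

# Offline demonstration with a stand-in for the API call
fake_fetch <- function(p) {
  data.frame(year = p$year, sector_desc = p$sector_desc,
             stringsAsFactors = FALSE)
}
demo <- fetch_by_year(list(sector_desc = "CROPS"), 2000:2002,
                      fetch = fake_fetch)
```

Passing the request function as an argument keeps the looping logic separate from the API call, which makes the loop easy to check without an API key.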
rnassqs
USDA-NASS also provides FTP access to text data files^[Available at: ftp://ftp.nass.usda.gov/quickstats/]. By accessing the data via FTP, users can avoid the ‘Quick Stats’ selection interface and the limit on the number of records per request, but this does not resolve the issues of automating repeated requests or of making data requests reproducible.
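
For comparison, a file downloaded from the FTP site could be read into R along the following lines. This is a sketch under stated assumptions: it assumes the files are gzip-compressed, tab-delimited text, which should be verified against the files themselves. `read_quickstats_file()` is a hypothetical helper, and the demonstration uses a small stand-in file with made-up contents rather than real USDA-NASS data.

```r
# Assumption: the bulk files are gzip-compressed, tab-delimited text.
# `read_quickstats_file()` is a hypothetical helper, not part of rnassqs.
read_quickstats_file <- function(path) {
  read.delim(gzfile(path), stringsAsFactors = FALSE)
}

# Offline demonstration using a small stand-in file (made-up contents)
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
writeLines(c("SECTOR_DESC\tYEAR\tVALUE", "CROPS\t2017\t1"), con)
close(con)

bulk <- read_quickstats_file(tmp)
```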