Quickstart_Flagging_problematic_coordinates_in_a_nutshell.RmdThe clean_coordinates function enables a fast, automated and reproducible flagging of potentially erroneous occurrence coordinates based on geographic gazetteers. The function flags records based on known problems common to biological and palaeontological collection databases.
Individual tests can be switched on and off by a logical flag (see ?clean_coordinates) and distance thresholds for all tests can be adapted. Custom gazetteers can be provided for all tests (for instance for a higher level of detail). See here for a detailed tutorial on how to clean occurrence records using CoordinateCleaner.
Please find a detailed tutorial on how to clean occurrence records (e.g. from GBIF) here and how to clean fossil data (e.g. from PBDB) here.
clean_coordinates wraps around multiple tests for common error sources in species distribution records. Individual test can be included or excluded from a run with the tests argument of clean_coordinates, e.g. "seas" switches the seas test off. Most basic tests are switched on by defaults.
library(CoordinateCleaner)
library(dplyr)
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
decimallongitude = runif(250, min = 42, max = 51),
decimallatitude = runif(250, min = -26, max = -11),
countries = "MDG")
#run all tests
dat <- clean_coordinates(exmpl, tests = c("capitals", "centroids", "countries", "equal", "gbif", "institutions", "outliers", "seas", "urban", "zeros"), countries = "countries")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\RtmpWGCnPS", layer: "ne_50m_land"
## with 1420 features
## It has 3 fields
## Integer64 fields read as strings: scalerank
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\RtmpWGCnPS", layer: "ne_50m_urban_areas"
## with 2143 features
## It has 4 fields
## Integer64 fields read as strings: scalerank
#only run the validity test
dat <- clean_coordinates(exmpl, tests = c(""))| Test | Function | Background | Default |
|---|---|---|---|
| capitals | radius around capitals | georeferenced from location description | on |
| centroids | radius around country and province centroids | geo-referenced from description | on |
| countries | coordinates in the right country | switched lon/lat, data entry errors | off |
| duplicates | records from one species with identical coordinates | repetitive observation of identical individual, same voucher from multiple data sources, genetic data | off |
| gbif | radius around GBIF headquarters | data entry errors, falsely geo-referenced | on |
| institutions | radius around biodiversity institutions | falsely geo-referenced, zoo or garden records | on |
| iucn | records outside the natural range, or any custom polygon | off | |
| outliers | records far away from all other records of this species | various | off |
| seas | in the sea | switched lon/lat | on |
| urban | within urban area | cultivated/captivity | off |
| validity | outside reference coordinate system | missing data, data entry errors | on |
| zeros | plain zeros, lat = lon | missing data, data entry errors | on |
capitals, centroids and institutions
The capitals, centroids and institutions test use a radius around gazetteers to flag coordinates. You can change this radius for each test using the *.rad arguments. The radius is specified in meters. If you want to specify it in degrees rather, use the individual cc_* functions.
clean_coordinates(exmpl, capitals_rad = 5000)You can use custom gazetteers for all CleanCoordinates tests, via the *.ref arguments of the function. For example the capitals.ref argument controls the reference for the capitals test. Customized reference data must follow the same format as the default reference for the same test. You can check the structure of gazetteers via their documentation or by looking at the gazetteer (e.g. head(capitals)). For example:
#check the format of the default capitals reference
head(countryref) #a data.frame with four columns: ISO3, capital, longitude, latitude
#create new reference data set from scratch. For real analysis you
#probably want to load the alternative file from a .txt file
my.cap <- data.frame(ISO3 = LETTERS[1:10],
capital = letters[1:10],
capital.longitude = runif(10, -180, 180),
capital.latitude = runif(10, -90, 90))
flags <- clean_coordinates(exmpl, capitals.ref = my.cap)In this way, you can fully customize the tests, and for example provide a gazetteer with the locations of hardware stores (in the capitals format) if you want to flag records around hardware stores.
Classes of the default gazetteers of clean_coordinates.
| Test | Default gazetteer | Class | Argument |
|---|---|---|---|
| capitals | countryref | data.frame |
capitals.ref |
| centroids | countryref | data.frame |
centroids.ref |
| countrycheck | rnaturalearth::ne_countries(scale = “medium”) | SpatialPolygonsDataFrame |
country.ref |
| institutions | institutions | data.frame |
inst.ref |
| seas | landmass | SpatialPolygonsDataFrame |
seas.ref |
| urban | rnaturalearth::ne_download(scale = ‘medium’, type = ‘urban_areas’) | SpatialPolygonsDataFrame |
urban.ref |
You cane easily summarize the results of clean_coordinates either with the report option or via summary. If report == T the summary is written to the working directory as a .txt file, if report is a character, it is the path to which the summary file will be written, Alternatively, you can get a summary of the number of records flagged with summary.
#via the report option
flags <- clean_coordinates(exmpl)
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\RtmpWGCnPS", layer: "ne_50m_land"
## with 1420 features
## It has 3 fields
## Integer64 fields read as strings: scalerank
#flags <- clean_coordinates(exmpl, report = T)
#via summary
summary(flags)
## .val .equ .zer .cap .cen .sea .otl .gbf
## 0 0 0 0 0 163 0 0
## .inst .summary
## 0 163The output of clean_coordinates is in the same order as the input, thus you can easily exclude flagged records.
#exclude records flagged by any test
clean <- exmpl[flags$.summary, ]
#exclude records flagged by the centroids test
clean <- exmpl[flags$.cen, ]Alternatively, you can run the individual functions in a pipe compatible way:
cleaned <- exmpl%>%
cc_val()%>%
cc_cap()%>%
cc_cen()%>%
cc_dupl()%>%
cc_equ()%>%
cc_gbif()%>%
cc_inst()%>%
cc_outl()%>%
cc_sea()%>%
cc_zero()
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\RtmpWGCnPS", layer: "ne_110m_land"
## with 127 features
## It has 3 fieldsIn this way, you can also add the individual test results as columns to your intial data.frame:
exmpl %>%
as_tibble() %>%
mutate(val = cc_val(., value = "flagged"),
sea = cc_sea(., value = "flagged"))
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\RtmpWGCnPS", layer: "ne_110m_land"
## with 127 features
## It has 3 fields
## # A tibble: 250 x 6
## species decimallongitude decimallatitude countries val sea
## <fct> <dbl> <dbl> <fct> <lgl> <lgl>
## 1 k 49.2 -12.1 MDG TRUE FALSE
## 2 o 47.4 -19.6 MDG TRUE TRUE
## 3 f 43.5 -24.1 MDG TRUE FALSE
## 4 m 43.4 -16.8 MDG TRUE FALSE
## 5 c 46.8 -14.1 MDG TRUE FALSE
## 6 c 43.5 -15.3 MDG TRUE FALSE
## 7 z 46.3 -16.9 MDG TRUE TRUE
## 8 b 49.4 -20.3 MDG TRUE FALSE
## 9 a 44.0 -20.8 MDG TRUE TRUE
## 10 q 50.0 -17.9 MDG TRUE FALSE
## # ... with 240 more rows