Artificial Hotspot Occurrence Inventory (AHOI)

Species occurrence records are essential to understanding Earth's biodiversity and addressing global environmental issues, but do not always reflect actual locations of occurrence. Certain geographical coordinates are assigned repeatedly to thousands of observation/collection records. This may result from imperfect data management and georeferencing practices, and can greatly bias the inferred distribution of biodiversity and associated environmental conditions. Nonetheless, these ‘biodiverse’ coordinates are often overlooked in taxon‐centric studies, as they are identifiable only in aggregate across taxa and datasets, and it is difficult to determine their true circumstance without in‐depth, focused investigation. Here we assess highly recurring coordinates in biodiversity data to determine artificial hotspots of occurrences.


| INTRODUCTION
Records of species' occurrences in space and time are essential to understanding our planet's biodiversity and addressing global environmental issues. These primary biodiversity data have been used to prioritize areas for conservation (Daru et al., 2019; Jung et al., 2021), predict climate-driven changes in species distributions in space and time (Zanatta et al., 2020), assess the drivers of biological invasions (Cardador & Blackburn, 2020; Park & Potter, 2015a, 2015b), evaluate the impacts of agriculture (Duchenne et al., 2020; Schooler et al., 2020), and model the origins and spread of disease (Redding et al., 2016; Zhou et al., 2021).
Species occurrence data enable us to identify hotspots of biodiversity and elucidate their associations with environmental conditions (Chapman, 2005c). Indeed, it has been shown that occurrence data are most frequently used to assess species distributions and diversity (Ball-Damerow et al., 2019). However, these data are subject to biases, and may not always accurately reflect species occurrences (Daru et al., 2018; Park & Davis, 2017). For instance, Zizka et al. (2020) found that over 40% of occurrence records belonging to 18 Neotropical taxa on GBIF were unfit for most downstream analyses. Errors such as flipped, mistyped, or incomplete coordinates are not uncommon in primary biodiversity data (Chapman, 2005a, 2005b). Due to their historic nature, many records do not comprise accurate geographical information beyond a general description of the collection location. In other cases, locality information can be intentionally recorded at coarser resolutions, such as occurrences on a grid or political area. The coordinates assigned to such records have generally been approximated at a later date by those other than the collector/observer, and are thus often imprecise to varying degrees. These georeferencing practices are not inherently problematic, and can often be desirable and add utility and value to occurrence records as long as sufficient information on how these records were processed is provided. However, documentation of any applied georeferencing and its associated uncertainties is not always accessible to users of occurrence data. Relevant fields such as 'georeferenceProtocol' or 'georeferenceRemarks' often remain unpopulated, or are described in widely varying, unstandardized language in primary biodiversity data (Table S1).
As a consequence, users may unintentionally treat these georeferenced coordinates as precise point occurrences, and by extension assume accurate knowledge of the environmental variables associated with the locality (Feeley & Silman, 2010). Along these lines, approaches to estimate a georeferenced biodiversity record's probability of being true or false have been proposed (Arle et al., 2021).
One consequence of georeferencing and errors is that certain geographical coordinates are repeatedly assigned to thousands of species observation and/or collection records (Figure 1). Though some of these records may represent fairly accurate information, others may result from imperfect data management and georeferencing practices. In the latter case, such records can greatly bias the inferred distribution of biodiversity (Maldonado et al., 2015).
Nonetheless, many of these potentially problematic coordinates are often overlooked in taxon-centric studies utilizing such data, as (i) they can be identified often only in aggregate across taxa and datasets, and (ii) it is difficult to determine their true circumstance without in-depth, focused investigation. Existing data cleaning tools can alleviate some of these issues, for instance, by searching for centroid coordinates of political divisions or museums (Zizka et al., 2019).
However, such tools are often limited in their effectiveness as there is a plethora of possible errors and georeferencing practices that cannot be uniformly accounted for programmatically. There are a multitude of ways to define geographical centroids alone. Indeed, as noted by the United States Geological Survey (U.S. Department of the Interior, 1964), 'There is no generally accepted definition of geographic center, and no completely satisfactory method for determining it. Because of this, there may be as many geographic centers of a State or country as there are definitions of the term'.
Rather than applying the standard approach of trying to determine if the coordinates assigned to individual occurrence records are inaccurate, here we assess why certain coordinates are assigned repeatedly to numerous occurrence records. To this end, we compiled an inventory of the most highly recurring coordinates across all plant, bird, insect and mammal records with associated coordinates in the Global Biodiversity Information Facility (GBIF), the largest aggregator of biodiversity data. We evaluated these coordinates to determine if and why they do not represent true observation and/or collection locations by manually examining the context and circumstance of associated occurrence records and datasets. The resulting Artificial Hotspot Occurrence Inventory (AHOI) provides information on artificial aggregates of species occurrences and can be used to (i) improve the accuracy of biodiversity assessments; (ii) estimate uncertainty associated with records from artificial biodiversity hotspots and make informed decisions on whether to include such records in scientific studies; and (iii) identify problems in biodiversity informatics workflows and priorities for improvement.

| Data compilation
Primary biodiversity records were queried from the Global Biodiversity Information Facility (GBIF). We assessed the frequency of the geographical coordinates and identified the most frequently recurring sets of coordinates across each taxonomic group.

FIGURE 1 Map of highly recurring coordinates across plant, bird, insect and mammal records in the Global Biodiversity Information Facility (GBIF). Some of these points were verified to represent actual data collection locations with relatively high precision (a). Other points were determined to likely represent artificial aggregates associated with large degrees of spatial uncertainty (b). Different colours indicate different taxonomic groups and circle size represents the number of occurrences recorded at the coordinates. Projection of maps is Mollweide.

Coordinates were assessed as provided in the 'decimalLatitude' and 'decimalLongitude' columns of the downloaded data without any rounding to be conservative. Rounding coordinates before assessing their frequency would increase the overall number of records associated with each set of coordinates and increase the risk of associating true points with georeferenced ones. Only exact matches were counted to calculate the frequency of each unique set of coordinates.
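The exact-match counting described above can be illustrated with a minimal Python sketch. It assumes that occurrence records are available as dictionaries keyed by the 'decimalLatitude' and 'decimalLongitude' column names and that coordinates are compared as the text strings provided in the download; the function name, the string comparison and the threshold parameter are illustrative assumptions, not a description of the workflow actually used.

```python
from collections import Counter


def count_exact_coordinates(records, min_records=1000):
    """Count how often each exact (latitude, longitude) pair occurs.

    Coordinates are compared as the strings provided in the data, so
    '-10.0' and '-10.00' count as different pairs (no rounding).
    Records with a missing latitude or longitude are skipped.
    """
    counts = Counter()
    for rec in records:
        lat = rec.get("decimalLatitude")
        lon = rec.get("decimalLongitude")
        if lat is None or lon is None:
            continue
        lat, lon = str(lat).strip(), str(lon).strip()
        if not lat or not lon:
            continue
        counts[(lat, lon)] += 1
    # Keep only pairs reaching the threshold, most frequent first.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_records]


# Toy example with made-up records (not real GBIF data).
records = (
    [{"decimalLatitude": "-10.0", "decimalLongitude": "-55.0"}] * 3
    + [{"decimalLatitude": "-10.00", "decimalLongitude": "-55.0"}]
    + [{"decimalLatitude": "", "decimalLongitude": "1.0"}]
)
hotspots = count_exact_coordinates(records, min_records=2)
```

Because no rounding is applied, the differently formatted '-10.00' record is not merged with the three '-10.0' records, which mirrors the conservative choice described in the text.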

| Verifying coordinates
We determined which of the highly recurrent coordinates are likely artificial by examining metadata and images from datasets comprising over 40 million records to date, assessing spatial distributions of associated datasets, contacting data managers and reviewing literature (Figure 2). We used QGIS software to validate grid centroid coordinates by plotting the grid systems over the reported occurrence coordinates to confirm the grid centroid, grid size and the coordinate reference system. Countries represented in our dataset that utilized such grids were identified through occurrence record metadata, visual inspection of associated datasets, literature review and correspondence with data managers, and included France, United Kingdom, Germany, the Netherlands, Belgium, Switzerland and Spain. For each group, we started by evaluating the most recurrent set of coordinates and proceeded in order of decreasing frequency. We initially examined the top 100 recurring coordinates for plants and the top 50 recurring coordinates for each animal group. These coordinates were manually curated into the following categories when possible: grid centroid, geopolitical centroid, georeferenced location and true observation or collection site. Some coordinates could be associated with multiple categories. It is possible that the determinations we made for highly recurrent coordinates could also be extended to additional, less recurrent coordinates that were assigned to other records in the datasets they belonged to (but were not included in our initial survey).
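The grid centroid check was carried out visually in QGIS. As a rough programmatic analogue, the following Python sketch tests whether a point lies at the centre of a square grid cell. It assumes the coordinates have already been projected into the grid's own coordinate reference system (in metres) and that the grid is aligned to a known origin; the function name, the origin parameters and the tolerance are hypothetical.

```python
def is_grid_centroid(x, y, cell_size, origin_x=0.0, origin_y=0.0, tol=1.0):
    """Return True if the projected point (x, y) lies within `tol` metres
    of the centre of a square grid cell.

    Assumes x, y, the grid origin and cell_size share the same projected
    coordinate reference system and units (metres).
    """
    half = cell_size / 2.0
    # Offset of the point from the centre of the cell that contains it.
    dx = (x - origin_x) % cell_size - half
    dy = (y - origin_y) % cell_size - half
    return abs(dx) <= tol and abs(dy) <= tol


# Made-up projected coordinates on a 10 km grid.
on_centre = is_grid_centroid(655000.0, 6865000.0, cell_size=10000.0)
on_corner = is_grid_centroid(650000.0, 6860000.0, cell_size=10000.0)
```

Running the test over several candidate cell sizes (for example 1 km, 5 km and 10 km) would give a first indication of the grid resolution, which would still need to be confirmed against the dataset's metadata.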
These data were compiled into AHOI, an inventory of highly recurrent GBIF coordinates, with their descriptions and determinations (Appendix S1).

| Technical validation
To validate our approach and assess whether artificial biodiversity hotspots are the result of systemic practices or errors, we additionally evaluated data from the Field Museum of Natural History, as some of the top 100 most recurring coordinates were associated with the institution. We downloaded all plant records from this dataset and evaluated all coordinates that were assigned to at least 1000 records. We found that the coordinates from this dataset represented artificial aggregates of specimens around geopolitical centroids. These verifications were also included in AHOI. Furthermore, we listed the rationale for each individual coordinate determination in the 'reasoning' field and provided examples of relevant information from occurrence record metadata in the 'example_description' field in Appendix S1.
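The determinations above were made manually. For readers who wish to screen coordinates against a list of candidate geopolitical centroids beforehand, a minimal Python sketch using great-circle distance is given below. The centroid list, the region name and the 1 km threshold are hypothetical placeholders; because a region can have several competing centroid definitions, any such list is necessarily incomplete.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_any_centroid(lat, lon, centroids, max_km=1.0):
    """Return the names of candidate centroids within max_km of the point."""
    return [
        name
        for name, (clat, clon) in centroids.items()
        if haversine_km(lat, lon, clat, clon) <= max_km
    ]


# Hypothetical candidate centroid; not taken from AHOI.
candidate_centroids = {"example_region": (-10.0, -55.0)}
matches = near_any_centroid(-10.0, -55.0, candidate_centroids)
no_matches = near_any_centroid(0.0, 0.0, candidate_centroids)
```

A match here only suggests that a coordinate merits closer inspection; it does not by itself establish that the associated records are artificial aggregates.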

| Artificial hotspot occurrence inventory
AHOI lists all sets of coordinates that were assigned to at least 1000 bird, insect, mammal and plant occurrence records and their frequency in GBIF at the time of evaluation. It also includes determinations of whether these coordinates constitute an artificial biodiversity hotspot, the category of each point (e.g. grid centroid) and information on how we arrived at the listed determinations for the most highly recurring coordinates (Appendix S1; Figure 3). The geographical uncertainty associated with these hotspots (i.e. the total area from which occurrences could have originated) can range from several square kilometres to millions. Such artificial biodiversity hotspots were most prevalent in plant records. For instance, the centroid coordinates of Brazil (>8.5 million km²) were assigned to over 100,000 plant occurrence records in GBIF, and points that have at least 1000 associated occurrences comprise over 9 million plant records. In contrast, a larger portion of highly recurring coordinates in animal data were associated with the actual sites of observation.
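One way an inventory of this kind might be applied is to separate occurrence records that fall exactly on a coordinate determined to be artificial from the remaining records. The Python sketch below illustrates the idea. The inventory column names used here ('decimalLatitude', 'decimalLongitude', 'artificial' and 'category') are assumptions made for illustration and may not match the field names in Appendix S1.

```python
def flag_artificial_hotspots(records, inventory):
    """Split records into (flagged, kept) by exact coordinate match.

    `inventory` is a list of dicts with hypothetical keys
    'decimalLatitude', 'decimalLongitude' and 'artificial' (bool).
    Matching is exact, on the coordinate strings as provided.
    """
    suspect = {
        (str(row["decimalLatitude"]), str(row["decimalLongitude"]))
        for row in inventory
        if row["artificial"]
    }
    flagged, kept = [], []
    for rec in records:
        key = (str(rec.get("decimalLatitude")), str(rec.get("decimalLongitude")))
        (flagged if key in suspect else kept).append(rec)
    return flagged, kept


# Made-up inventory rows and occurrence records for illustration.
inventory = [
    {"decimalLatitude": "-10.0", "decimalLongitude": "-55.0",
     "artificial": True, "category": "geopolitical centroid"},
    {"decimalLatitude": "42.5", "decimalLongitude": "-76.4",
     "artificial": False, "category": "true observation or collection site"},
]
records = [
    {"species": "sp. A", "decimalLatitude": "-10.0", "decimalLongitude": "-55.0"},
    {"species": "sp. B", "decimalLatitude": "42.5", "decimalLongitude": "-76.4"},
    {"species": "sp. C", "decimalLatitude": "1.2", "decimalLongitude": "3.4"},
]
flagged, kept = flag_artificial_hotspots(records, inventory)
```

Flagged records are returned rather than discarded, so that users can decide whether the associated spatial uncertainty is acceptable for a given analysis.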

| Unpopulated metadata fields
Occurrence data on GBIF comprise several metadata fields that can help flag erroneous or inaccurate coordinates and assess the degrees of spatial uncertainty. Though inherently useful, these fields are frequently left unpopulated (Table S1). Furthermore, the information in these fields is often not presented in a standardized format, making automated inferences difficult. Finally, while the information these

For instance, the field 'GeospatialIssues' was 'FALSE' for 99% of the records located at the highly recurrent coordinates we examined (Table S2).

Neotropical taxa, where it was found that the political centroids were more frequently assigned to plant occurrences compared to those of other taxa on average (Zizka et al., 2020). Also, it is difficult to ascertain whether occurrence records for a species of interest are from a gridded survey without examining additional data from each dataset that contributed to these records in aggregate unless clearly stated in the metadata (Feng et al., 2022). Furthermore, we found that individual datasets of occurrence data can comprise different types of records and both gridded and non-gridded coordinates simultaneously, adding additional complications. Though typically smaller than countries, grid cells can comprise immense environmental heterogeneity as well. For example, a single cell in the French 10 km Grille nationale can encompass −11.61 to 7.23°C in terms of mean annual temperature, 1073-2883 mm precipitation and 933-4736 m of elevation. Interestingly, almost none of the highly recurrent coordinates we evaluated were from Asia and Africa, despite the continents comprising several actual hotspots of biodiversity (Myers et al., 2000). This may reflect varying georeferencing practices across regions and/or a comparative lack of data from these areas on GBIF (Daru et al., 2018; Feng et al., 2021).
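The extent to which georeferencing metadata fields are left unpopulated, as noted at the start of this section, can be summarized per field. The Python sketch below computes the fraction of records in which each field holds a non-empty value; it assumes records are dictionaries keyed by field name, and the function name and toy records are illustrative only.

```python
def field_population_rates(records, fields):
    """Return the fraction of records with a non-empty value per field.

    A value counts as populated if it is present, not None, and not
    blank after stripping whitespace.
    """
    total = len(records)
    rates = {}
    for field in fields:
        populated = 0
        for rec in records:
            value = rec.get(field)
            if value is not None and str(value).strip() != "":
                populated += 1
        rates[field] = populated / total if total else 0.0
    return rates


# Made-up records for illustration.
records = [
    {"georeferenceProtocol": "centroid of municipality", "georeferenceRemarks": ""},
    {"georeferenceProtocol": " ", "georeferenceRemarks": None},
    {"georeferenceProtocol": None},
    {"georeferenceProtocol": "GPS", "georeferenceRemarks": "checked"},
]
rates = field_population_rates(
    records, ["georeferenceProtocol", "georeferenceRemarks"]
)
```

A populated field is not necessarily an informative one, so free-text entries would still need to be read or standardized before they can support automated inference.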

| Artificial patterns of biodiversity
Surprisingly, despite comprising the highest number of records per coordinates on average, most of the bird records we reviewed were likely to be relatively accurate (80%). Most of the bird coordinates either fell on, or near, birding sites and observatories.
This may be due to the immense popularity of birding: there are over 20 million people who regard themselves as regular birders in the United States alone (USFWS, 2006). A sizeable portion of the highly recurrent insect and mammal coordinates were also likely to be fairly accurate (29% and 37%, respectively). This was largely due to the methods of data collection. For instance, much of the insect data were gathered through traps, which are typically laid out in a location for a duration of time and thus represent multiple (if non-independent) sampling events. Similarly, highly recurring coordinates of mammal observations often represent the coordinates of camera traps or tagging/banding locations. Thus, in general, the higher proportion of relatively accurate coordinates in animal data can be attributed to their mobility, which allows multiple animals to occupy the same space at different times, and practices of data collection across time (e.g. repeat observations from a single location for a given duration). In contrast, due to their sessile nature, repeat observations of plant taxa from the same location are less common, and community turnover generally happens at a much slower pace.

| Accounting for uncertainty
These results do not imply that the practice of georeferencing occurrence records without accurate coordinate data is inherently bad. We note that not all applications of species occurrence records require accurate location information, and that certain degrees of spatial uncertainty can be acceptable and/or accounted for, depending on the purpose and goal of downstream analyses. However, it is critical that users are aware of the uncertainties associated with inaccurately placed occurrence data (Marcer et al., 2022). The necessity of a resource such as AHOI, in part, arises from the fact that occurrence metadata regarding georeferencing methods and associated uncertainties are not always clearly provided and/or standardized. We acknowledge that transcribing and standardizing metadata are still largely a manual process that requires significant time and effort. Nonetheless, to reduce the future creation and unknowing use of artificial hotspots in biodiversity databases, efforts should be made to record and provide such information in an easy to use, standardized format.

| Application and use
Given the demonstrated prevalence of artificial biodiversity hotspots, especially in plant occurrence data, and difficulties in accounting for such uncertainty within analytical workflows, it is critical to identify such locations and determine whether/how data from these locations may be used before conducting biodiversity assessments or spatial analyses in biogeographical studies. To this end, AHOI provides a growing inventory of artificial biodiversity hotspots, and the information necessary for users to responsibly account for the uncertainty associated with these coordinates. For instance, we provide the spatial resolution of grid systems underlying highly recurrent coordinates, which can be used to estimate spatial and/or environmental uncertainty as we demonstrated above. Even when the method of georeferencing for an occurrence record is stated in the associated metadata, the range of uncertainty is often ambigu-

| Limitations and extensions
AHOI represents a work in progress and should be used accordingly.
The ranking, frequency and determinations of highly recurring coordinates in GBIF can change as more occurrence records become available and the metadata of existing records are updated. Also, as AHOI does not provide evaluations of individual occurrence records, it is possible that a small number of records from artificial biodiversity hotspots were actually collected/observed at that location.
Finally, improved data cleaning workflows may eventually allow the automated filtering of identified artificial biodiversity hotspots. For instance, GBIF is testing an experimental data-clustering feature for identifying 'clusters' of potentially related records, though currently it only compares occurrences across (not within) datasets and seems primarily utilized for finding duplicate records of a given species.
Along these lines, AHOI will be continuously updated on DRYAD (doi:10.5061/dryad.v41ns1s0p) by adding and revising evaluations as new data become available and additional problematic coordinates are verified.
As we face a potential sixth mass extinction (Ceballos et al., 2015), assessing the distribution of biodiversity and prioritizing hotspots for conservation and management has become more important than ever before. Accurate occurrence data are critical to such endeavours, and AHOI will help identify and rectify inaccurate data that can significantly bias them. In particular, our in-depth, top-down investigations of potentially problematic coordinates have allowed us to identify artificial hotspots of biodiversity that would otherwise be difficult to detect programmatically, especially when evaluating these data by species. Though the required spatial accuracy of occurrence records can vary across applications, it is critical that users are aware of the uncertainties associated with inaccurately placed occurrence data. AHOI does not replace existing practices and software for cleaning and vetting coordinate data; rather, it complements these tools by providing an inventory of suspect points not easily detectable by traditional methods.

ACKNOWLEDGEMENTS
We express gratitude to the many collectors and curators of biodiversity data who made this research possible. No permits were needed to carry out the work presented here.

CONFLICT OF INTEREST
The authors declare no conflicting interests.

DATA AVAILABILITY STATEMENT
Data are archived on DRYAD (doi:10.5061/dryad.v41ns1s0p).
Additional data discussed in the paper are either publicly available through GBIF (https://www.gbif.org/) or attached supplements.