A set of six databases used in a study of the biogeography of Greater Caribbean reef fishes entitled:
Comparing biodiversity databases: Greater Caribbean reef-fishes as a case study
Iliana Chollett, D. Ross Robertson
Database Authors: D Ross Robertson and Ernesto Peña, Smithsonian Tropical Research Institute, Panamá
This set of six databases contains georeferenced location records from six sources, as described below. These six sources provided georeferenced records of occurrence of fishes found in the Greater Caribbean study area (6-33° N, 57-100° W). Each occurrence record consists of a species name and associated latitude and longitude. Five of the databases included in the comparisons made here are from major online aggregators. Since their content overlaps to some extent, and OBIS, iDigBio and FishNet2 collaborate with GBIF, their data might be expected to produce similar biogeographic patterns. The sixth, STRI, is a curated compendium of data from those five aggregators, enriched with data from many additional sources.
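As a rough illustration of the record structure and study-area limits described above, the following Python sketch keeps only records that fall inside the latitude and longitude limits. It is a minimal sketch, not the authors' code: the `Occurrence` type, the `in_study_area` helper, the sample coordinates, and the treatment of the study area as a simple rectangle are all assumptions made for illustration. A rectangle of this size also contains Pacific waters and land, which are handled by a separate step.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    """One occurrence record: a species name plus coordinates (hypothetical type)."""
    species: str
    lat: float  # decimal degrees, north positive
    lon: float  # decimal degrees, east positive (so 57-100° W is -100 to -57)


# Study-area limits quoted in the text: 6-33° N, 57-100° W.
# Treating them as a plain rectangle is an assumption of this sketch.
LAT_MIN, LAT_MAX = 6.0, 33.0
LON_MIN, LON_MAX = -100.0, -57.0


def in_study_area(rec: Occurrence) -> bool:
    """Return True if the record lies inside the rectangular limits."""
    return LAT_MIN <= rec.lat <= LAT_MAX and LON_MIN <= rec.lon <= LON_MAX


# Made-up sample records, for illustration only.
records = [
    Occurrence("Abudefduf saxatilis", 9.35, -79.90),   # inside the limits
    Occurrence("Abudefduf saxatilis", -8.50, 115.50),  # far outside (Indo-Pacific)
    Occurrence("Lutjanus griseus", 24.60, -81.40),     # inside the limits
]

kept = [r for r in records if in_study_area(r)]
```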
Only reef-associated fish species were included in the present analysis. These mostly represent demersal species known to occur on hard bottoms (coral, rock and oyster substrata), but also include species living on rubble, sand and vegetated bottoms within and around the immediate fringes of reefs, and pelagic species regularly found on reefs. All exotic and non-resident species and species other than reef-associated fishes were excluded from all databases prior to comparisons. Non-residents were defined as otherwise widespread species only rarely seen in the study area. Shore-fishes, including what are generally regarded as reef fishes, include those found in the waters of continental and insular shelves, i.e. between 0-200m. Reef-fish assemblages dominated by shallow-water taxa extend down to that depth in the study area (Baldwin et al. 2018). We used the shelf edge as a breakpoint and excluded records in areas deeper than 200m, identifying those areas using the General Bathymetric Chart of the Oceans (Kapoor, 1981; GEBCO Compilation Group, 2019).
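The 200 m shelf-edge cutoff can be sketched as a simple threshold on seabed elevation. This is a hedged illustration only: it assumes each record has already been assigned an elevation in metres from a bathymetric grid in which values below sea level are negative (the convention used by GEBCO grids), and the record tuples, elevations, and the `deeper_than_shelf` helper are hypothetical.

```python
SHELF_EDGE_M = 200.0  # shelf-edge breakpoint stated in the text


def deeper_than_shelf(elevation_m: float) -> bool:
    """True if the seabed at the record's position is deeper than 200 m.

    Assumes elevation is negative below sea level (an assumption of this sketch).
    """
    return elevation_m < -SHELF_EDGE_M


# Hypothetical records: (species, latitude, longitude, seabed elevation in metres).
records = [
    ("Lutjanus griseus", 24.60, -81.40, -12.0),    # shallow shelf: kept
    ("Lutjanus griseus", 25.10, -84.90, -1450.0),  # deep water: excluded
    ("Haemulon sciurus", 17.90, -87.90, -200.0),   # exactly at the breakpoint: kept
]

on_shelf = [r for r in records if not deeper_than_shelf(r[3])]
```

Records with positive elevation (on land) pass this filter; in the workflow described here they are removed in a separate step.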
Before the analyses, for all databases, duplicate records were deleted. Subsequently, records in the Pacific or on land were deleted. We used the Global Self-consistent, Hierarchical, High-resolution Geography Database (Wessel & Smith, 1996) to identify these areas. The spatial distribution of species-records in each database is shown in Figure 1 of the publication.
Global Biodiversity Information Facility (GBIF, https://www.gbif.org/): GBIF is an international network and research infrastructure aimed at providing open access to data about all types of life on earth. GBIF works through participant nodes using common standards and open-source tools that enable them to share information. Data from among the 49,000+ datasets hosted by GBIF that were used here range from those on museum specimens collected since the 18th century, to published scientific checklists, to curated local checklists produced by trained science sources such as the Atlantic and Gulf Rapid Reef Assessment program (AGRRA, https://www.agrra.org/), to geotagged smartphone photos (that act as vouchers allowing verification) shared by amateur and scientific naturalists through iNaturalist (https://www.inaturalist.org/), to unvouchered, unverified and unverifiable observation records from untrained divers such as those contributing to DiveBoard (http://www.diveboard.com). GBIF data are standardized in Darwin Core format. GBIF data were obtained from a polygon of the study area and subject to taxonomic review and selection after downloading (accessed through the GBIF portal, https://www.gbif.org/, on or about 2019-05-19).
Ocean Biogeographic Information System (OBIS, https://obis.org/): OBIS is a global open-access data and information clearing-house on marine biodiversity (OBIS, 2019) that was adopted as a project of the International Oceanographic Data and Information Exchange of the Intergovernmental Oceanographic Commission of UNESCO. Its range of sources is similar to that of GBIF. OBIS hosts data from organizations or programs that join it as one of 13 “nodes”, and harvests the data from the IPT (Integrated Publishing Toolkit), where providers publish their data. The IPT is developed and maintained by GBIF, and OBIS is a major contributor of marine data to GBIF. Data are standardized in Darwin Core format. OBIS data were obtained for the region of study by downloading data on each family, then retaining only data inside the study area, which were then subject to taxonomic review and selection (accessed through the OBIS portal, https://obis.org/, on or about 2019-05-19).
Integrated Digitized Biocollections (iDigBio, https://portal.idigbio.org/portal/search): iDigBio is sponsored by the US National Science Foundation and run by the University of Florida, and provides digital data from public, non-federal US collections. Data are standardized in Darwin Core format, and provided “as is”. iDigBio joined the GBIF network in 2017. iDigBio records were downloaded from a polygon of the region of study and subject to taxonomic review and selection (accessed through the iDigBio portal, https://portal.idigbio.org/portal/search, on or about 2019-05-19).
FishNet2 (http://www.fishnet2.net/): FishNet2 is a collaborative effort that aggregates data on fish collections around the world to share and distribute data on specimen holdings from ~75 museums, universities and other institutions. FishNet2 distributes data in Darwin Core, and data are provided “as is”. FishNet2 is part of the VertNet network, which has contributed to GBIF since 2013 and became part of iDigBio in 2016. While FishNet2 has made substantial efforts to georeference the location-record data it hosts, many hosted records still lack georeferencing. FishNet2 data were obtained from a polygon of the study area and subject to taxonomic review after downloading (accessed through the FishNet2 portal, http://www.fishnet2.net/, 2019-05-19).
FishBase (http://www.fishbase.org): FishBase is a global biodiversity information system supervised by a consortium of nine non-USA international institutions, and hosts data on fin fishes and elasmobranchs (Froese & Pauly, 2009). Information presented in FishBase is extracted from the scientific literature, reports and museum or aggregator (GBIF) databases, and standardized by a team of specialists. Data from FishBase were downloaded for the following ecosystems: Caribbean Sea, Gulf of Mexico, Southeast U.S. Continental Shelf, Atlantic Ocean, Sargasso Sea and Bermuda, and subject to taxonomic review and selection after downloading (2019-05-19).
Smithsonian Tropical Research Institute (STRI; https://biogeodb.stri.si.edu/caribbean/en/pages): The STRI database was compiled by DRR and Ernesto Peña at STRI’s Naos Marine Laboratory, and represents about 15 years' accumulation of curated data (see below) from the following sources: data downloaded at roughly two-year intervals from the five aggregators; data from online databases of various museums that supply aggregators (data directly downloaded from a museum sometimes differ from those available in an aggregator from the same museum), including the Swedish Museum of Natural History, the American Museum of Natural History, the Natural History Museum of Denmark, the Gulf Coast Research Laboratory, the Colombian Museum of Natural Marine History, the United States National Museum, and the United States Geological Survey; data from national aggregators of Colombia (Sistema de Información Sobre Biodiversidad de Colombia, https://sibcolombia.net/, and Sistema de Información Ambiental Marina de Colombia, https://siam.invemar.org.co/), Mexico (La Comisión Nacional para el Conocimiento y Uso de la Biodiversidad, CONABIO; http://www.conabio.gob.mx/informacion/gis/), and Costa Rica (Museo de Zoologia de la Universidad de Costa Rica, http://museo.biologia.ucr.ac.cr/); verified (by DRR) underwater photographs of fishes taken at known locations; peer-reviewed publications containing location information (species descriptions; taxonomic revisions of species, genera and families; regional and local checklists); fisheries reports; digital tagging data for species such as elasmobranchs; diving surveys and collections of local faunas by DRR (e.g. Robertson et al. 2019).
In addition, selected data are incorporated from two sources that collect species lists at sites scattered throughout the Greater Caribbean: the Atlantic and Gulf Rapid Reef Assessment program (AGRRA, https://www.agrra.org/; Kramer & Lang, 2003) and trained citizen scientists who contribute data on fishes to the Reef Environmental Education Foundation’s database (REEF; Pattengill-Semmens & Semmens, 2003). The bibliographic module (https://biogeodb.stri.si.edu/caribbean/en/library) of Robertson & Van Tassell (2019) contains ~1700 publications linked to species names, among them the publications from which location data were extracted.
Data from the aggregators are presented “as is” and the aggregators themselves do not curate the data. Duplicates (and occasionally triplicates and quadruplicates) of the same museum record are often included from multiple sources (e.g. the original museum source, derivative checklists, an aggregator), sometimes with slightly different georeferenced coordinates. Data available in one year may subsequently disappear from an aggregator, and different data may be available for the same species under different names (e.g. the old and new names when a species is reassigned to another genus). Errors, sometimes large errors (Robertson, 2008), are common in aggregator data, from museums as well as other sources, and longstanding errors can seem to take on a perpetual existence. For example, the damselfish Abudefduf saxatilis is a common and widespread inhabitant of tropical reefs on both sides of the Atlantic, but does not naturally occur outside that ocean. Despite the fact that its taxonomic status and range were resolved ~30 years ago (e.g. see Allen, 1991), museum data presented by all five aggregators that contributed to the multi-source database used in this study currently (December 10, 2019) show large numbers of records of this species throughout the entire tropical Indo-Pacific, as well as across its native range in the Atlantic. Since many of the databases accumulating on aggregators are derivative (lists derived from records and from other derivative lists), it will become increasingly difficult to eliminate such errors, as corrections to data in primary sources do not automatically propagate through the chain of usage by different databases.
Due to increasing limitations on resources for taxonomic work, museums themselves have difficulty dealing with errors in specimen identity and location. In addition, old specimens become unidentifiable, specimens loaned out are never returned or simply vanish, and entire collections can be destroyed by hurricanes or fires, or dumped when museums close or experience a major change in mission. Georeferenced location data on fish distributions in the Neotropics (and presumably most other areas) hosted by aggregators, particularly GBIF and OBIS, which take data from a broad range of source types, might best be described as messy. The significant potential for errors in location records, and the inability to verify records, always need to be taken into account when incorporating data from aggregators, primary museum sources, and analog sources.
Data considered for inclusion in the STRI database were screened as follows to exclude questionable records. Data from two databases hosted by OBIS and GBIF were excluded entirely due to lack of reliability: BioGoMx (https://www.gulfbase.org/project/biodiversity-gulf-mexico-biogomx-database) and Diveboard (http://www.diveboard.com). The only REEF data used were from “expert” REEF recorders on readily identifiable species that are unlikely to be confused with similar species (e.g. data for some genera of sparids, gerreids, labrisomids and gobies that include various sympatric species with very similar appearances were not used). After data from aggregators and museum sources were combined into a single database, duplicate records were filtered out by rounding all coordinates to three decimal places and eliminating duplicates, a process that inevitably deleted some valid records as well as duplicates. The sizes of the databases and the abundance of such duplicates precluded individual manual exclusion. Finally, all location data for each species were revised by DRR by examining the distribution of its georeferenced coordinates overlaid on a digital map of the current known distribution range of that species (for such range information see Carpenter & De Angelis, 2002; Ebert et al., 2013; Last et al., 2016; Robertson & Van Tassell, 2019; IUCN Redlist species accounts for most species considered here: https://www.iucnredlist.org/search). Such revision took into account any recent modifications to taxonomy and distributions due to new data and new publications, or as a result of discussions between DRR and experts in the taxonomy of particular species or genera. Source information provided by aggregators with the hosted data was inspected for many individual questionable records to try to assess their validity. Records thought likely to be erroneous were deleted. Those included inexplicable records lacking adequate documentation located well outside the known distribution range, and records in unlikely habitats (e.g. on land for marine species; in deep water for shallow-water species). This revision process reduced the number of records by about 30%.
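The rounding-based duplicate filter described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the procedure actually used: the `dedupe_rounded` function name is hypothetical, the sample records are made up, and treating the species name together with the rounded coordinates as the duplicate key is an assumption, since the text does not say which fields were compared.

```python
def dedupe_rounded(records, decimals=3):
    """Round coordinates and keep the first record for each (species, lat, lon) key.

    `records` is an iterable of (species, lat, lon) tuples. The choice of key
    is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for species, lat, lon in records:
        key = (species, round(lat, decimals), round(lon, decimals))
        if key not in seen:
            seen.add(key)
            unique.append(key)  # store the rounded form of the record
    return unique


# Made-up sample records, for illustration only.
records = [
    ("Lutjanus griseus", 24.60012, -81.40049),  # museum record
    ("Lutjanus griseus", 24.60040, -81.40010),  # same record via an aggregator, shifted slightly
    ("Haemulon sciurus", 24.60012, -81.40049),  # different species at the same point
    ("Lutjanus griseus", 24.60160, -81.40049),  # same species, different rounded latitude
]

unique = dedupe_rounded(records)
```

Three decimal places of latitude correspond to roughly 100 m, so two genuinely distinct records of one species that fall in the same rounded cell collapse into one. That is the mechanism by which this kind of filter removes some valid records along with true duplicates.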
Data from the five individual aggregator databases that are used in the comparisons described here were all downloaded from their online portals during May 2019. However, data from those five aggregators that were incorporated in the STRI database were downloaded in March 2017, with data from other sources described above added to the STRI database intermittently between then and May 2019, when the entire dataset was curated as described above. Hence the five individual aggregator databases analyzed in this study undoubtedly contain data not included in the version of the STRI database used in the present analyses.
Data acquisition and construction of the STRI database were supported by funds from STRI, the Smithsonian Marine Science Network, the Smithsonian Publications Fund, the Smithsonian’s Deep Reef Observation Project, the National Geographic Society, the IUCN Red List program, the Harte Research Institute, and CONABIO. We thank REEF and AGRRA for supplying species-location records, various people for taxonomic and location-record information used to construct that database (principal among them C Baldwin, S Brandl, K Conway, B Frable, T Menut, T Munroe, R Robins, L Tornabene, J Van Tassell and B Victor), and hundreds of citizen-scientist submarine photographers whose images (see https://biogeodb.stri.si.edu/caribbean/en/contributors/citizen_scientists) acted as vouchers for location records.
Allen, G.R. (1991) Damselfishes of the World. Mergus, Melle, 271 p.
Baldwin, C.C., Tornabene, L. & Robertson, D.R. (2018) Below the mesophotic. Scientific Reports, 8, 4920.
Carpenter, K.E. (Ed) (2002) The living marine resources of the Western Central Atlantic. Vols 1-3, FAO, Rome, 2127 p.
Ebert, D.A., Fowler, S. & Compagno, L. (2013) Sharks of the World: a fully illustrated guide. Wild Nature Press, Plymouth, 528 p.
GEBCO Compilation Group (2019) GEBCO 2019 Grid (doi:10.5285/836f016a-33be-6ddc-e053-6c86abc0788e).
Kapoor, D.C. (1981) General bathymetric chart of the oceans (GEBCO). Marine Geodesy, 5, 73–80.
Kramer, P.R. & Lang, J.C. (2003) Appendix one: The Atlantic and Gulf Rapid Reef Assessment (AGRRA) Protocols: Former Version 2.2. Atoll Research Bulletin, 496, 611–624.
Last, P.R., White, W.A., de Carvalho, M.R., Séret, B., Stehmann, F.W. & Naylor, J.P. (2016) Rays of the World. CSIRO, Clayton, 790 p.
Pattengill-Semmens, C.V. & Semmens, B.X. (2003) Conservation and management applications of the reef volunteer fish monitoring program. Coastal Monitoring through Partnerships: Proceedings of the Fifth Symposium on the Environmental Monitoring and Assessment Program (EMAP) Pensacola Beach, FL, U.S.A., April 24–27, 2001 (ed. by B.D. Melzian, V. Engle, M. McAlister, S. Sandhu and L.K. Eads), pp. 43–50. Springer Netherlands, Dordrecht.
Robertson, D.R. (2008) Global biogeographic databases on marine fishes: caveat emptor. Diversity and Distributions, 14, 891–892.
Robertson, D.R., Dominguez-Dominguez, O., Lopez Arollo, Y.M., Moreno Mendoza, R. & Simoes, N. (2019) Reef-associated fishes from the offshore reefs of western Campeche Bank, Mexico, with a discussion of mangroves and seagrass beds as nursery habitats. ZooKeys, 843, 71–115. https://doi.org/10.3897/zookeys.843.33873
Robertson, D.R. & Van Tassell, J. (2019) Shorefishes of the Greater Caribbean: online information system. Version 2.0. Smithsonian Tropical Research Institute, Balboa, Panamá. https://biogeodb.stri.si.edu/caribbean/en/pages.
Wessel, P. & Smith, W.H.F. (1996) A global, self-consistent, hierarchical, high-resolution shoreline database. Journal of Geophysical Research: Solid Earth, 101, 8741–8743.