Data and code for: "Global Sampling Decline Erodes Science Potential of Natural History Collections"

Forbes, Owen

doi:10.5281/zenodo.14010666

Published October 30, 2024 | Version 1.0

Dataset Open

Forbes, Owen (Researcher)¹

1. Commonwealth Scientific and Industrial Research Organisation

# GBIF Specimen Data Analysis and Forecasting

This repository contains the code and data for analysing and forecasting trends in Global Biodiversity Information Facility (GBIF) specimen records across three major taxonomic groups: Chordata, Arthropoda, and Plantae.
The analysis pipeline includes data cleaning, anomaly detection, primary analyses, and forecasting based on historical database snapshots.

These scripts and data correspond to analyses in the following manuscript:

Global Sampling Decline Erodes Science Potential of Natural History Collections

Authors:
Owen Forbes
Andrew G. Young
Peter H. Thrall

## Repository Structure

The repository consists of three main Quarto (.qmd) scripts and associated data files:

1. `1_DataCleaning_Forbes-et-al_2024.qmd`: Data cleaning and anomaly detection
2. `2_PrimaryAnalyses_Forbes-et-al_2024.qmd`: Primary analyses and visualisation
3. `3_SnapshotsForecasting_Forbes-et-al_2024.qmd`: Historical snapshot analysis and forecasting

## Requirements

- R (version 4.3.2 or later)
- Required R packages:
  - tidyverse (v2.0.0) - for data manipulation and visualisation
  - readr (v2.1.5) - for reading CSV/TSV files
  - ggplot2 (v3.4.0 or v3.5.0) - for creating visualisations
  - rnaturalearth (v1.0.1) - for accessing Natural Earth map data
  - dplyr (v1.1.0 or v1.1.4) - for data manipulation
  - countrycode (v1.6.0) - for converting country names and codes
  - spdep (v1.3-3) - for spatial dependence modelling
  - sp (v1.6-0 or v2.1-3) - for spatial data manipulation
  - sf (v1.0-15 or v1.0-16) - for simple features access
  - data.table (v1.14.8) - for fast aggregation of large data
  - lubridate (v1.9.2) - for date-time manipulation
  - viridis (v0.6.3) - for colour palettes
  - gridExtra (v2.3) - for arranging multiple plots
  - ggpubr (v0.6.0) - for creating publication-ready plots
  - zoo (v1.8-12) - for time series, including moving averages
  - scales (v1.3.0) - for graphical scales
  - forecast (v8.22.0) - for ARIMA forecast models
  - purrr (v1.0.2) - for mapping a custom forecast function onto each dataset
  - arrow (version not specified) - for working with parquet files

Install these packages before running the scripts.
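
One way to check which of the listed packages are still missing is sketched below. This is not part of the original scripts: the helper name `missing_packages` is hypothetical, and the sketch installs current CRAN releases rather than the exact versions listed above.

```r
# Packages named in the Requirements list of this README
required <- c(
  "tidyverse", "readr", "ggplot2", "rnaturalearth", "dplyr", "countrycode",
  "spdep", "sp", "sf", "data.table", "lubridate", "viridis", "gridExtra",
  "ggpubr", "zoo", "scales", "forecast", "purrr", "arrow"
)

# Hypothetical helper: return the packages that cannot be loaded
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install the current CRAN versions:
  # install.packages(to_install)
}
```

If exact version matching matters for reproducibility, a tool such as `renv` or `remotes::install_version()` may be a better fit than a plain `install.packages()` call.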

## How to Use

1. Download this repository to your local machine.
2. Set your working directory to the location of the scripts.
3. Download the raw datasets from GBIF (as required).
4. Ensure all required R packages are installed.
5. Run the scripts in RStudio or your preferred R environment.

### Data Cleaning (`1_DataCleaning_Forbes-et-al_2024.qmd`)

This script cleans the raw GBIF data and identifies anomalies. It produces files containing indexes of dataset records to be removed, which are used in subsequent analyses.

**Note**: The raw GBIF exported datasets for contemporary records are not included in this repository due to file size constraints. Download them from the GBIF links provided in the script and place them in the `data/` directory.

### Primary Analyses (`2_PrimaryAnalyses_Forbes-et-al_2024.qmd`)

This script performs the main analyses and generates visualisations. It uses the outputs from the data cleaning script to filter anomalous records.
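
The filtering step can be pictured as an anti-join between occurrence records and the index of flagged dataset + year combinations. The sketch below uses toy data only; the column names of the flagged index and the values shown are assumptions for illustration, not the structure of the actual output files.

```r
library(dplyr)

# Toy occurrence records (illustrative values only)
occurrences <- tibble(
  datasetKey = c("A", "A", "B", "B", "C"),
  year = c(1990, 1991, 1990, 1991, 1990),
  species = c("sp1", "sp2", "sp3", "sp4", "sp5")
)

# Hypothetical index of anomalous dataset + year combinations
flagged <- tibble(
  datasetKey = c("A", "B"),
  year = c(1991, 1990)
)

# Keep only records whose dataset + year pair is not flagged
filtered <- anti_join(occurrences, flagged, by = c("datasetKey", "year"))
print(filtered)
```

An anti-join keeps the filter declarative: the flagged index can change without touching the filtering code.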

To reproduce all analysis stages from the original raw .csv files:
- Start at the chunks labelled "DATA LOAD AND FILTERING".
- Run the pipeline for non-spatial analyses before spatial analyses.
- Because of memory constraints, run the analyses for one taxonomic group and one analysis stream at a time.

To skip to plot generation:
- Navigate to sections tagged as "@! SKIP TO PLOTTING !@".
- Ensure all required analysis output files are in the `data/` directory.
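
Before skipping to plotting, it may help to confirm that the analysis output files are in place. The sketch below is an assumption-laden convenience check, not part of the original scripts: the helper `find_missing_outputs` is hypothetical, and because this README lists the files without extensions, names are compared with any extension stripped.

```r
# Build the output names listed under "Analysis Outputs" in this README
taxa <- c("arthropoda", "chordata", "plantae")
outputs <- c("specimens_per_year", "unique_species_per_year",
             "grid_counts", "continent_count")
expected <- as.vector(
  outer(taxa, outputs, function(t, o) paste0(t, "_", o, "_080724"))
)

# Hypothetical helper: which expected outputs are absent from data_dir?
find_missing_outputs <- function(expected, data_dir = "data") {
  present <- tools::file_path_sans_ext(list.files(data_dir))
  setdiff(expected, present)
}

absent <- find_missing_outputs(expected)
if (length(absent) > 0) {
  message("Missing analysis outputs: ", paste(absent, collapse = ", "))
}
```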

### Forecasting (`3_SnapshotsForecasting_Forbes-et-al_2024.qmd`)

This script analyses historical GBIF database snapshots and forecasts future growth. It uses the cleaned snapshot data produced by the data cleaning script.
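
As a rough illustration of the general approach (ARIMA models from `forecast`, mapped over each dataset with `purrr`), the sketch below fits a model per taxonomic group on simulated series. The data are randomly generated, the helper `forecast_group` is hypothetical, and `auto.arima()` is used here as a stand-in; the model specification in the actual script may differ.

```r
library(forecast)
library(purrr)

set.seed(1)

# Simulated cumulative record totals per group (random data, illustration only)
snapshots <- list(
  Chordata = cumsum(runif(12, min = 50, max = 100)),
  Arthropoda = cumsum(runif(12, min = 50, max = 100)),
  Plantae = cumsum(runif(12, min = 50, max = 100))
)

# Hypothetical helper: fit an automatically selected ARIMA model and
# forecast `horizon` steps beyond the last snapshot
forecast_group <- function(counts, horizon = 5) {
  series <- ts(counts)
  fit <- auto.arima(series)
  forecast(fit, h = horizon)
}

# Apply the same forecasting function to every group
forecasts <- map(snapshots, \(x) forecast_group(x, horizon = 5))

# Point forecasts for one group; prediction intervals are in $lower and $upper
print(forecasts$Chordata$mean)
```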

## Data Files

### GBIF Exports - Raw Data (not included on Zenodo due to file size; download directly from GBIF)
- `0016915-240425142415019.csv` for Chordata - https://www.gbif.org/occurrence/download/0016915-240425142415019

- `0016914-240425142415019.csv` for Plantae - https://www.gbif.org/occurrence/download/0016914-240425142415019

- `0016913-240425142415019.csv` for Arthropoda - https://www.gbif.org/occurrence/download/0016913-240425142415019

### Included Data Files

#### Raw Data
- `GBIF_snapshots.parquet` # Historical snapshots RAW dataset (arrow/parquet format)
- `GBIF_integer_to_datasetKey.tsv` # Mapping old dataset IDs onto new datasetKey field
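
A minimal sketch for opening these two files is shown below. It assumes they have been extracted into `data/` (an assumption about where `data.zip` is unpacked), and the helper `load_if_present` is hypothetical; no column names are assumed.

```r
# Hypothetical helper: read a file with the given reader, or skip it if absent
load_if_present <- function(path, reader) {
  if (!file.exists(path)) {
    message("Not found, skipping: ", path)
    return(NULL)
  }
  reader(path)
}

snapshots <- load_if_present("data/GBIF_snapshots.parquet", arrow::read_parquet)
key_map <- load_if_present(
  "data/GBIF_integer_to_datasetKey.tsv",
  function(p) readr::read_tsv(p, show_col_types = FALSE)
)

# Inspect the dimensions once the files are available
if (!is.null(snapshots)) print(dim(snapshots))
if (!is.null(key_map)) print(dim(key_map))
```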

#### Contemporary Datasets - data cleaning outputs
- `chordata_counts_to_highlight_030724` # List of anomalous Chordata dataset + year indexes to filter
- `arthropoda_counts_to_highlight_OG_030724` # List of anomalous Arthropoda dataset + year indexes to filter
- `plantae_counts_to_highlight_030724` # List of anomalous Plantae dataset + year indexes to filter

#### Cleaned Snapshots
- `plantae_snapshots_filter_threshold_IN_040924` # Cleaned Plantae snapshots
- `arthropoda_snapshots_filter_threshold_IN_040924` # Cleaned Arthropoda snapshots
- `chordata_snapshots_filter_threshold_IN_040924` # Cleaned Chordata snapshots
- `gbif_dates_df_anomaly_filtered_090724` # Anomaly-filtered snapshots (combined dataset)
- `gbif_dates_df_anomalies_highlighted_090724` # Anomalies highlighted snapshots (combined dataset)

#### Analysis Outputs - for skipping straight to plot/figure generation
- `arthropoda_specimens_per_year_080724` # Arthropoda specimen counts per year
- `arthropoda_unique_species_per_year_080724` # Arthropoda unique species counts per year
- `arthropoda_grid_counts_080724` # Arthropoda grid counts
- `chordata_specimens_per_year_080724` # Chordata specimen counts per year
- `chordata_unique_species_per_year_080724` # Chordata unique species counts per year
- `chordata_grid_counts_080724` # Chordata grid counts
- `plantae_specimens_per_year_080724` # Plantae specimen counts per year
- `plantae_unique_species_per_year_080724` # Plantae unique species counts per year
- `plantae_grid_counts_080724` # Plantae grid counts
- `chordata_continent_count_080724` # Chordata continent-specific counts
- `arthropoda_continent_count_080724` # Arthropoda continent-specific counts
- `plantae_continent_count_080724` # Plantae continent-specific counts

Files

Files (409.6 MB)

| Name | MD5 | Size |
| --- | --- | --- |
| `1_DataCleaning_Forbes-et-al_2024.qmd` | `0464e542a8efc2f36d7977cdd6456eba` | 39.3 kB |
| `2_PrimaryAnalyses_Forbes-et-al_2024.qmd` | `451121ac34b48feb2ad006a4849923f2` | 84.9 kB |
| `3_SnapshotsForecasting_Forbes-et-al_2024.qmd` | `908d59766a5a4e7b01f17fd758374f33` | 42.8 kB |
| `data.zip` | `0bd436294f559ce0951b208143b74de6` | 409.4 MB |
| `README.md` | `30120bb3f97cab301952141a0fb79637` | 6.3 kB |
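
The MD5 checksums listed for these files can be used to confirm that downloads are intact. The sketch below uses `tools::md5sum()` from base R; the helper `verify_md5` is hypothetical and assumes the files sit in the current working directory.

```r
# Checksums as listed on the Zenodo record for this upload
expected_md5 <- c(
  "1_DataCleaning_Forbes-et-al_2024.qmd" = "0464e542a8efc2f36d7977cdd6456eba",
  "2_PrimaryAnalyses_Forbes-et-al_2024.qmd" = "451121ac34b48feb2ad006a4849923f2",
  "3_SnapshotsForecasting_Forbes-et-al_2024.qmd" = "908d59766a5a4e7b01f17fd758374f33",
  "data.zip" = "0bd436294f559ce0951b208143b74de6",
  "README.md" = "30120bb3f97cab301952141a0fb79637"
)

# Hypothetical helper: compare each file's MD5 against the expected value.
# Files that cannot be read are reported as not OK.
verify_md5 <- function(expected, dir = ".") {
  paths <- file.path(dir, names(expected))
  actual <- unname(tools::md5sum(paths))
  data.frame(
    file = names(expected),
    ok = !is.na(actual) & actual == unname(expected)
  )
}

print(verify_md5(expected_md5))
```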

Additional details

Available: 2024-11-01

Upload to Zenodo

Programming language: R
