# GBIF Specimen Data Analysis and Forecasting

This repository contains the code and data for analysing and forecasting trends in Global Biodiversity Information Facility (GBIF) specimen records across three major taxonomic groups: Chordata, Arthropoda, and Plantae. 
The analysis pipeline includes data cleaning, anomaly detection, primary analyses, and forecasting based on historical database snapshots.

These scripts and data correspond to analyses in the following manuscript:

Global Sampling Decline Erodes Science Potential of Natural History Collections

Authors: Owen Forbes, Andrew G. Young, Peter H. Thrall


## Repository Structure

The repository consists of three main Quarto (.qmd) scripts and associated data files:

1. `1_DataCleaning_Forbes-et-al_2025.qmd`: Data cleaning and anomaly detection
2. `2_PrimaryAnalyses_Forbes-et-al_2025.qmd`: Primary analyses and visualisation
3. `3_SnapshotsForecasting_Forbes-et-al_2025.qmd`: Historical snapshot analysis and forecasting

## Requirements

- R (version 4.3.2 or later)
- Required R packages:
  - tidyverse (v2.0.0) - for data manipulation and visualisation
  - readr (v2.1.5) - for reading CSV/TSV files
  - ggplot2 (v3.4.0 or v3.5.0) - for creating visualisations
  - rnaturalearth (v1.0.1) - for accessing Natural Earth map data
  - dplyr (v1.1.0 or v1.1.4) - for data manipulation
  - countrycode (v1.6.0) - for converting country names and codes
  - spdep (v1.3-3) - for spatial dependence modelling
  - sp (v1.6-0 or v2.1-3) - for spatial data manipulation
  - sf (v1.0-15 or v1.0-16) - for simple features access
  - data.table (v1.14.8) - for fast aggregation of large data
  - lubridate (v1.9.2) - for date-time manipulation
  - viridis (v0.6.3) - for colour palettes
  - gridExtra (v2.3) - for arranging multiple plots
  - ggpubr (v0.6.0) - for creating publication-ready plots
  - zoo (v1.8-12) - for time series, including moving averages
  - scales (v1.3.0) - for graphical scales
  - forecast (v8.22.0) - for ARIMA forecast models
  - purrr (v1.0.2) - for mapping custom forecast function onto each dataset
  - arrow - for working with parquet files

Install these packages before running the scripts.
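
As a convenience, the check below lists which of the packages above are not yet installed. It is a minimal sketch: `find_missing()` is a hypothetical helper, not part of the repository scripts, and the commented-out `install.packages()` call would install the current CRAN releases rather than the specific versions listed above.

```r
# Packages listed under "Requirements"
required_pkgs <- c(
  "tidyverse", "readr", "ggplot2", "rnaturalearth", "dplyr", "countrycode",
  "spdep", "sp", "sf", "data.table", "lubridate", "viridis", "gridExtra",
  "ggpubr", "zoo", "scales", "forecast", "purrr", "arrow"
)

# Hypothetical helper: return the names in `pkgs` that are not installed
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(required_pkgs)
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install the current CRAN releases of the missing packages:
  # install.packages(missing_pkgs)
}
```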

## How to Use

1. Download this repository to your local machine.
2. Set your working directory to the location of the scripts.
3. Download the raw datasets from GBIF if you plan to rerun the analyses from the raw exports (see Data Files).
4. Ensure all required R packages are installed.
5. Run the scripts in RStudio or your preferred R environment.

### Data Cleaning (`1_DataCleaning_Forbes-et-al_2025.qmd`)

This script cleans the raw GBIF data and identifies anomalies. It produces files that index the anomalous dataset and year combinations to be removed; the subsequent analyses use these files to filter records.

**Note**: The raw GBIF exported datasets for contemporary records are not included in this repository due to file size constraints. Download them from the GBIF links provided in the script and place them in the `data/` directory.
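
Before running the cleaning script, it can help to confirm that the three raw exports are where the script expects them. The sketch below uses the file names listed under Data Files and assumes the working directory is the location of the scripts; `check_raw_files()` is a hypothetical helper, not a function from the repository.

```r
# Raw GBIF export file names, as listed under "Data Files"
raw_files <- c(
  Chordata   = "0016915-240425142415019.csv",
  Plantae    = "0016914-240425142415019.csv",
  Arthropoda = "0016913-240425142415019.csv"
)

# Hypothetical helper: report which raw exports exist in `data_dir`
check_raw_files <- function(data_dir = "data") {
  present <- file.exists(file.path(data_dir, raw_files))
  names(present) <- names(raw_files)
  present
}

status <- check_raw_files("data")
if (!all(status)) {
  message("Missing raw exports for: ",
          paste(names(status)[!status], collapse = ", "))
}
```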

### Primary Analyses (`2_PrimaryAnalyses_Forbes-et-al_2025.qmd`)

This script performs the main analyses and generates visualisations. It uses the outputs from the data cleaning script to filter anomalous records.
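
The filtering step can be pictured as an anti-join between the occurrence records and the list of anomalous dataset and year combinations. The example below is an illustrative sketch on toy data, not the code used in the script: the column names `datasetKey`, `year`, and `n_records` are assumptions about the layout of the index files.

```r
library(dplyr)

# Toy occurrence summary (illustrative values only)
records <- tibble(
  datasetKey = c("A", "A", "B", "B", "C"),
  year       = c(1990L, 1991L, 1990L, 1991L, 1990L),
  n_records  = c(10L, 5000L, 7L, 8L, 3L)
)

# Toy index of anomalous dataset + year combinations to remove
to_remove <- tibble(datasetKey = "A", year = 1991L)

# Keep only the rows that have no match in the anomaly index
filtered <- anti_join(records, to_remove, by = c("datasetKey", "year"))
print(filtered)
```

An anti-join removes only the flagged dataset and year pair, so the other years from the same dataset remain in the analysis.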

To reproduce all analysis stages from the original raw .csv files:
- Start at the chunks labelled "DATA LOAD AND FILTERING".
- Run the pipeline for non-spatial analyses before spatial analyses.
- Because of memory constraints, run the analyses for one taxonomic group and one analysis stream at a time.

To skip to plot generation:
- Navigate to sections tagged as "@! SKIP TO PLOTTING !@".
- Ensure all required analysis output files are in the `data/` directory.
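
To confirm that the analysis outputs are in place before plotting, a check along the following lines may be useful. It is a sketch under assumptions: the file stems come from the "Analysis Outputs" list in Data Files, matching is done on the start of each file name because no extensions are listed there, and `missing_outputs()` is a hypothetical helper.

```r
# File stems from the "Analysis Outputs" list (taxon + output type + date suffix)
taxa    <- c("arthropoda", "chordata", "plantae")
outputs <- c("specimens_per_year", "unique_species_per_year",
             "grid_counts", "continent_count")
expected <- paste0(as.vector(outer(taxa, outputs, paste, sep = "_")), "_080724")

# Hypothetical helper: return the expected stems with no matching file in `data_dir`
missing_outputs <- function(data_dir = "data") {
  found <- list.files(data_dir)
  is_present <- vapply(
    expected,
    function(stem) any(startsWith(found, stem)),
    logical(1)
  )
  expected[!is_present]
}

not_found <- missing_outputs("data")
if (length(not_found) > 0) {
  message("Missing analysis outputs: ", paste(not_found, collapse = ", "))
}
```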

### Forecasting (`3_SnapshotsForecasting_Forbes-et-al_2025.qmd`)

This script analyses historical GBIF database snapshots and forecasts future growth. It uses the cleaned snapshot data produced by the data cleaning script.
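
As a rough illustration of the approach, the sketch below fits an ARIMA model to each of several series and maps a forecast function over them with purrr. The data are synthetic, `make_series()` and `forecast_series()` are hypothetical helpers, and the use of `auto.arima()` with a five-step horizon is an assumption; the model specification actually used is defined in the script itself.

```r
library(forecast)
library(purrr)

set.seed(42)

# Synthetic cumulative record counts (illustrative only, not GBIF data)
make_series <- function(n = 30) {
  ts(cumsum(rnorm(n, mean = 100, sd = 20)), start = 1995)
}
series_list <- list(plantae = make_series(), chordata = make_series())

# Hypothetical helper: fit an automatically selected ARIMA model and forecast h steps
forecast_series <- function(x, h = 5) {
  fit <- auto.arima(x)
  forecast(fit, h = h)
}

# Apply the same forecast function to every series
forecasts <- map(series_list, forecast_series, h = 5)

# Point forecasts for one series; prediction intervals are in $lower and $upper
forecasts$plantae$mean
```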

## Data Files

### GBIF Exports - Raw Data

These files are not included on Zenodo because of their size. Please download them directly from GBIF.

- `0016915-240425142415019.csv` for Chordata - https://www.gbif.org/occurrence/download/0016915-240425142415019
- `0016914-240425142415019.csv` for Plantae - https://www.gbif.org/occurrence/download/0016914-240425142415019
- `0016913-240425142415019.csv` for Arthropoda - https://www.gbif.org/occurrence/download/0016913-240425142415019

### Included Data Files

#### Raw Data
- `GBIF_snapshots.parquet` # Historical snapshots RAW dataset (arrow/parquet format)
- `GBIF_integer_to_datasetKey.tsv` # Mapping from old integer dataset IDs to the new `datasetKey` field

#### Contemporary Datasets - data cleaning outputs
- `chordata_counts_to_highlight_030724` # List of anomalous Chordata dataset + year indexes to filter
- `arthropoda_counts_to_highlight_OG_030724` # List of anomalous Arthropoda dataset + year indexes to filter
- `plantae_counts_to_highlight_030724` # List of anomalous Plantae dataset + year indexes to filter

#### Cleaned Snapshots
- `plantae_snapshots_filter_threshold_IN_040924` # Cleaned Plantae snapshots
- `arthropoda_snapshots_filter_threshold_IN_040924` # Cleaned Arthropoda snapshots
- `chordata_snapshots_filter_threshold_IN_040924` # Cleaned Chordata snapshots
- `gbif_dates_df_anomaly_filtered_090724` # Combined snapshot dataset with anomalies filtered out
- `gbif_dates_df_anomalies_highlighted_090724` # Combined snapshot dataset with anomalies highlighted

#### Analysis Outputs - for skipping straight to plot/figure generation
- `arthropoda_specimens_per_year_080724` # Arthropoda specimen counts per year
- `arthropoda_unique_species_per_year_080724` # Arthropoda unique species counts per year
- `arthropoda_grid_counts_080724` # Arthropoda grid counts
- `chordata_specimens_per_year_080724` # Chordata specimen counts per year
- `chordata_unique_species_per_year_080724` # Chordata unique species counts per year
- `chordata_grid_counts_080724` # Chordata grid counts
- `plantae_specimens_per_year_080724` # Plantae specimen counts per year
- `plantae_unique_species_per_year_080724` # Plantae unique species counts per year
- `plantae_grid_counts_080724` # Plantae grid counts
- `chordata_continent_count_080724` # Chordata continent-specific counts
- `arthropoda_continent_count_080724` # Arthropoda continent-specific counts
- `plantae_continent_count_080724` # Plantae continent-specific counts
