Published March 26, 2024 | Version v1
Software Open

Simulated Herbarium data for testing the accuracy with which specimen data can predict the timing and duration of population-level flowering displays

  • 1. University of California, Santa Barbara
  • 2. University of Maine
  • 3. Harvard University

Description

This dataset provides code and example data for simulating specimen collections of flowering plants across North America, and for developing phenological predictions of population-level flowering onset and termination for these data. It further presents code for assessing the accuracy of these predictions relative to known (simulated) population-level flowering dates at the location of each collection.

Notes

Funding provided by: National Science Foundation
Crossref Funder Registry ID: https://ror.org/021nxhr62
Award Number: NSF DEB-1556768

Funding provided by: National Science Foundation
Crossref Funder Registry ID: https://ror.org/021nxhr62
Award Number: DEB-2105932

Funding provided by: National Science Foundation
Crossref Funder Registry ID: https://ror.org/021nxhr62
Award Number: DEB-2242804

Methods

Creating a reference dataset: generating sample locations representing known population-level phenological distributions and individual phenological parameters

We simulated phenological data for 1200 hypothetical "species" in the coterminous USA that varied in the attributes of their individual- and population-level flowering phenology. For each of these simulated species, we selected 1000 locations within the continental United States, each representing a local population observed during a single year from which a simulated specimen was later obtained. The coordinates for each location, year, and associated mean annual temperature in the year of collection were randomly selected without replacement from 4-km² PRISM pixels (PRISM Climate Group 2011) for the years 1901 to 2020, and were restricted to locations with 1991–2020 mean annual temperature normals of 1–20 °C and mean annual precipitation normals for the same period of 60–3800 mm.

Each species generated this way was assigned a series of attributes defining its individual- and population-level flowering phenology. The peak flowering date of an individual was assumed to coincide with its mean flowering date. We then defined a linear equation describing the relationship between the mean date of peak flowering among individuals within a population and local temperature conditions. Each species was assigned a median population flowering DOY of 50 at 0 °C (i.e., the intercept) as well as a phenological responsiveness (i.e., slope) of median flowering DOY to mean annual temperature: advancing by 1, 4, or 8 days per °C increase. Next, we assigned each species a low or high magnitude of intrapopulation variation in phenological timing (i.e., in peak flowering DOYs) among individuals, based on normal distributions with standard deviations (σ) of either 10 or 30 days, representing the magnitude of variation in the flowering times of early- to late-flowering individuals within each local population. Then, each species was assigned a short, moderate, or long duration of the flowering period by each individual within each population (15, 30, or 60 days), representing the duration of time each individual plant was in flower. Fifty species were simulated for each of these 18 combinations of phenological responsiveness, flowering duration, and intrapopulation variation in phenological timing.
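The factorial design described above can be sketched in a few lines of Python. This is a minimal illustration, not the code distributed with this record: the dictionary keys, the function names, and the negative sign on the slope (warmer sites flower earlier) are assumptions made here for clarity.

```python
from itertools import product

# Attribute levels taken from the text above.
RESPONSIVENESS = (1, 4, 8)   # days of advance per degree C
SIGMA = (10, 30)             # intrapopulation SD of peak flowering DOY (days)
DURATION = (15, 30, 60)      # individual flowering duration (days)
SPECIES_PER_COMBO = 50
INTERCEPT_DOY = 50           # median flowering DOY at 0 degrees C, as stated in the text


def build_species_table():
    """Return one dict per simulated species in the fixed-attribute design."""
    species = []
    for slope, sigma, duration in product(RESPONSIVENESS, SIGMA, DURATION):
        for _ in range(SPECIES_PER_COMBO):
            species.append({
                "species_id": len(species),  # hypothetical identifier
                "intercept": INTERCEPT_DOY,
                "slope": -slope,             # assumed sign: warmer means earlier
                "sigma": sigma,
                "duration": duration,
            })
    return species


def median_doy(sp, temperature_c):
    """Population median flowering DOY from the linear temperature relationship."""
    return sp["intercept"] + sp["slope"] * temperature_c


species = build_species_table()
print(len(species))                  # 18 combinations x 50 species
print(median_doy(species[0], 10.0))  # first species: 1 day per degree C
```

Storing the slope with its sign keeps the later prediction step a plain linear equation; the temperature-dependent species described next would need extra fields.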

To accommodate the possibility that the magnitude of variation in phenological timing within a population could depend on local climate conditions, we also simulated 50 species with temperature-sensitive intrapopulation phenological variation (σ) ranging from 10 to 30 days. For these species, σ of the DOY among individuals in a given population increased by 1 day for every 1 °C increase in the mean annual temperature of its location. For these simulated species, individual flowering duration was fixed at 30 days. Additionally, to accommodate the possibility that individual flowering durations could exhibit linear relationships with local climate conditions, we also simulated 50 species that exhibited individual-level variation in flowering duration resulting from changes in temperature (increasing by 1 day per °C increase in mean annual temperature, and ranging from 10 days to 30 days). For these species, the degree of intrapopulation variation in peak flowering dates was held constant at σ = 30 days (i.e., high intrapopulation variation).
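A temperature-dependent attribute of this kind can be written as a clamped linear function. The sketch below is only an illustration: the text gives the rate (1 day per °C) and the range (10 to 30 days), but not the value at 0 °C, so the `base` default of 10 days is an assumption, as is the function name.

```python
def temperature_dependent_trait(temperature_c, base=10.0, rate=1.0,
                                lower=10.0, upper=30.0):
    """Linear trait (sigma or flowering duration, in days) clamped to a range.

    `base` (the value at 0 degrees C) is an assumed parameter; the text only
    specifies the rate of change and the overall range.
    """
    value = base + rate * temperature_c
    return max(lower, min(upper, value))


# Same function for both cases described above (hypothetical usage):
sigma_at_5c = temperature_dependent_trait(5.0)      # intrapopulation sigma
duration_at_12c = temperature_dependent_trait(12.0) # individual flowering duration
print(sigma_at_5c, duration_at_12c)
```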

 

Calculation of population-level onset, median, and termination dates of flowering

For each population of each species described above, we calculated a distribution of individual-level peak flowering dates, assumed to be normally distributed (Clark and Thompson 2011), based on the flowering attributes of the species and the temperature conditions corresponding to its site and year of observation. First, we calculated the median flowering DOY at the location and year from which each specimen was collected based on its pre-defined intercept and phenological responsiveness to mean annual temperature (i.e., 1, 4, or 8 days per °C). Then, we obtained the standard deviation of each local population (i.e., its degree of intrapopulation variation in flowering dates) based on the flowering attributes of the simulated species as outlined above. Next, we arbitrarily defined population-level flowering onset DOYs for each population and year as the 10th percentile of a normally distributed population whose mean and standard deviation we obtained in the previous steps (i.e., the DOYs by which the first 10% of individuals in a local population at a given location and year would have reached their peak (or mean) flowering dates). Similarly, the population-level flowering termination dates were calculated as the 90th percentile of a normally distributed population with the same characteristics as described above (i.e., the DOYs by which all but 10% of individuals in a local population at a given location and year would have reached their peak (or mean) flowering dates).
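As a worked illustration of these steps, the onset and termination DOYs are the 10th and 90th percentiles of a normal distribution, which Python's standard library can evaluate directly. The function name and argument names below are hypothetical, and the slope is passed with its sign (negative for advancement), which is a convention assumed here.

```python
from statistics import NormalDist


def population_phenology(intercept, slope, sigma, temperature_c,
                         lower_q=0.10, upper_q=0.90):
    """Known (simulated) onset, median, and termination DOYs for one population."""
    median = intercept + slope * temperature_c      # linear phenoclimate relationship
    dist = NormalDist(mu=median, sigma=sigma)       # individual peak flowering dates
    return {
        "onset": dist.inv_cdf(lower_q),             # 10th percentile
        "median": median,
        "termination": dist.inv_cdf(upper_q),       # 90th percentile
    }


# Example: intercept 50, 1 day per degree C advance, sigma = 10, site at 10 degrees C.
pop = population_phenology(intercept=50.0, slope=-1.0, sigma=10.0, temperature_c=10.0)
print({k: round(v, 2) for k, v in pop.items()})
```

Because the distribution is symmetric, onset and termination sit about 1.28 σ on either side of the median.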

Through this process, we obtained a sample of 1000 annual population-level distributions of flowering dates for each of 1200 hypothetical species. For each of these populations, the quantiles of their flowering distribution—representing the nth individual reaching peak flowering within a population—were known a priori, representing a benchmark against which to compare estimates derived from simulated specimen data.

 

Simulating randomly selected (unbiased) phenological snapshots from pre-defined populations

For each species, we then generated simulated specimens by: (1) randomly selecting an individual within each population and (2) selecting a random DOY within its individual-level flowering period that emulated the phenological snapshot provided by real herbarium specimens.  Specifically, using the distribution of peak flowering dates of each population, we selected an individual at random.  From its peak flowering date, we then obtained onset and termination dates by subtracting (for flowering onset) or adding (for flowering termination) half the individual's flowering duration for that species to the sampled date of peak flowering.  To simulate a phenological snapshot for that individual, we then randomly selected a DOY between the onset and termination of that individual's flowering period.  As a result, the simulated datum represented a simulated herbarium specimen generated accounting for uncertainty in both the timing of the individual relative to its source population, and in the timing of the collection relative to the onset and termination of that individual's flowering period. This procedure was repeated across all locations for each simulated species, generating 1000 data points (i.e., simulated specimens or phenological snapshots) per species.
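The two-stage draw described above (an individual from the population, then a date within that individual's flowering period) might look like the following sketch. The function name and the use of Python's `random` module are choices made here for illustration, not necessarily those of the accompanying notebooks.

```python
import random


def simulate_specimen(pop_mean, sigma, duration, rng):
    """Draw one unbiased simulated specimen; returns (peak DOY, collection DOY)."""
    peak = rng.gauss(pop_mean, sigma)        # step 1: random individual in the population
    onset = peak - duration / 2              # individual flowering onset
    termination = peak + duration / 2        # individual flowering termination
    collection_doy = rng.uniform(onset, termination)  # step 2: random day in flower
    return peak, collection_doy


rng = random.Random(42)                      # seeded for repeatability
specimens = [simulate_specimen(pop_mean=40.0, sigma=30.0, duration=30.0, rng=rng)
             for _ in range(1000)]
print(specimens[0])
```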

 

Simulating biases in collection effort across population-level flowering periods

To simulate biases towards collection of specimens during the early or late portion of their local population-level flowering displays, we selected an individual at random within each population and year using both left- and right-skewed normal probability distributions. These distributions were constructed by modulating the shape parameter α of the skew-normal distribution (Azzalini and Capitanio 1998) as implemented in scipy.stats.skewnorm (Python package SciPy v1.10.1), such that if the underlying plant population was treated as exhibiting a normal distribution (α = 0), samples were collected from that population with a left-skewed (α = -1.0) or right-skewed (α = 1.0) probability distribution. Once an individual was selected from these skewed distributions, the timing of sample collection within the individual flowering duration of each of these 'specimens' was generated using methods similar to those used for unbiased specimens. We then determined the accuracy of the model predictions generated from datasets exhibiting biased and unbiased sampling of local populations by comparing predicted population-level flowering onset and termination dates with the actual (i.e., known, simulated) flowering dates that were produced using a normal distribution. To minimize computation time, population-level biases were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/°C (representing moderate responsiveness to climate stimuli), intrapopulation variation was high (σ = 30), and individual flowering duration was moderate (30 days).
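To show what skewed sampling of individuals does, the sketch below draws from a skew-normal distribution using only the standard library, via the standard stochastic representation of that distribution. It is a stand-in for `scipy.stats.skewnorm.rvs`, not the authors' code, and it assumes that the location and scale parameters are set to the population mean and σ; how these were set in the original simulations is not stated in the text.

```python
import math
import random


def skewnorm_draw(alpha, loc, scale, rng):
    """One skew-normal draw: loc + scale * (delta*|Z0| + sqrt(1 - delta^2)*Z1)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    z0 = abs(rng.gauss(0.0, 1.0))
    z1 = rng.gauss(0.0, 1.0)
    return loc + scale * (delta * z0 + math.sqrt(1.0 - delta * delta) * z1)


def mean_sampled_peak(alpha, n=20000, loc=10.0, scale=30.0, seed=0):
    """Average peak DOY of sampled individuals under a given skew."""
    rng = random.Random(seed)
    return sum(skewnorm_draw(alpha, loc, scale, rng) for _ in range(n)) / n


early = mean_sampled_peak(-1.0)    # left-skewed: early individuals over-represented
unbiased = mean_sampled_peak(0.0)  # alpha = 0 reduces to the normal distribution
late = mean_sampled_peak(1.0)      # right-skewed: late individuals over-represented
print(round(early, 1), round(unbiased, 1), round(late, 1))
```

With α = ±1 the expected shift in the mean is scale × (1/√2) × √(2/π), roughly 0.56 σ, which is why even a moderate skew moves the sampled dates by more than two weeks when σ = 30.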

 

Simulating biases in the timing of collection within flowering periods of individuals

In addition to biases towards collection of early or late individuals within a local population, botanists may also preferentially collect individuals from the early or late portion of their individual flowering period (i.e., individual collection bias). In some cases, collectors may preferentially collect individuals that are proximate to their peak flowering date because this is when the most flowers are displayed. In other cases, collectors may preferentially collect specimens that have only recently begun to flower, when floral structures may exhibit less damage from inclement weather or herbivores, or proximate to flowering termination in cases where the collector prefers specimens that include both flowers and fruits. Accordingly, for each population of each species, we simulated DOYs within each individual's flowering period both at random (i.e., without bias) and with three different types of bias. Unbiased collections were simulated by selecting a random date chosen uniformly within the flowering period of each sampled individual. To represent a bias toward collection of individuals close to their peak (mean) flowering DOY, we sampled collection dates from a truncated normal distribution centered on an individual's mean flowering date and with σ = 25% of the flowering duration for that species and location (henceforth referred to as mean-biased collection data). To represent a bias toward collection dates shortly after flowering onset (henceforth, onset-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% of the flowering duration earlier than the mean flowering date of that individual (σ = 25% of the flowering duration). Finally, to represent a bias toward collection on dates shortly before flowering termination (henceforth, termination-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% of the flowering duration later than the mean flowering date of that individual (σ = 25% of the flowering duration).

As with examinations of population-level bias, collection biases within the flowering periods of individuals were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/°C, intrapopulation variation was high (σ = 30), individual flowering duration was moderate (30 days), and no population-level bias was present.
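The individual-level collection biases can be illustrated with a truncated normal sampler. The sketch below uses simple rejection sampling and assumes that the onset- and termination-biased centers sit 25% of the flowering duration before and after the individual's mean flowering date, with truncation at the individual's flowering onset and termination; the function and dictionary names are hypothetical.

```python
import random

# Offset of the distribution center from the individual's mean (peak) date,
# as a fraction of flowering duration. These offsets are an assumed reading.
BIAS_CENTER_OFFSET = {"mean": 0.0, "onset": -0.25, "termination": 0.25}


def biased_collection_doy(peak, duration, bias, rng, sigma_frac=0.25):
    """Collection DOY within one individual's flowering period."""
    onset = peak - duration / 2
    termination = peak + duration / 2
    if bias == "none":
        return rng.uniform(onset, termination)   # unbiased: uniform in the period
    center = peak + BIAS_CENTER_OFFSET[bias] * duration
    sigma = sigma_frac * duration
    while True:                                   # rejection sampling = truncation
        doy = rng.gauss(center, sigma)
        if onset <= doy <= termination:
            return doy


rng = random.Random(7)
draws = {bias: [biased_collection_doy(100.0, 30.0, bias, rng) for _ in range(5000)]
         for bias in ("none", "mean", "onset", "termination")}
means = {bias: sum(v) / len(v) for bias, v in draws.items()}
print({bias: round(m, 1) for bias, m in means.items()})
```

Rejection sampling is adequate here because most of each normal distribution's mass already lies inside the flowering period, so few draws are discarded.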

 

Estimating population-level flowering onsets and terminations from simulated herbarium data

We generated phenoclimate models for each species from each set of simulated specimen collection dates using quantile regression (Koenker et al. 2018) in RStudio (R Team 2020). In all cases, each model regressed observed DOYs of the phenological snapshots of all sampled individuals of a given species against mean annual temperature. From these 1450 models (representing each of the species-specific models for all 1200 species plus the additional 150 models exhibiting population-level collection biases and the 100 models exhibiting individual-level collection biases), we predicted the 10th, 50th, and 90th percentiles of flowering DOYs for each species from mean annual temperatures corresponding to the years and locations of their source populations. We then calculated the mean absolute error (MAE) of the linear regression of the known timing of the onset (or termination) of the peak flowering period for each reference population on the predicted DOYs produced by each phenoclimate model based on the simulated herbarium data.  For each metric of population-level phenology (i.e., flowering onset, peak (i.e., median DOY), and termination), we then used Tukey HSD tests to compare the mean accuracies (estimated as MAE) of these predicted DOYs versus the actual population-level metrics among models constructed from species that differed in their phenological sensitivities to climate, flowering durations, degrees of intrapopulation variation in phenological timing, and collection biases.
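The analysis above was run with quantile regression in R. Purely to illustrate the idea in Python without extra dependencies, the sketch below fits a single-predictor quantile regression by brute force: for one predictor, an optimal line passes through at least two data points, so it is enough to test every pair and keep the line with the lowest pinball (check) loss. This is a small-sample teaching device, not the authors' implementation, and all names and the simulated inputs are assumptions.

```python
import random
from itertools import combinations
from statistics import NormalDist


def pinball_loss(residual, tau):
    """Check loss for quantile tau (residual = observed minus predicted)."""
    return residual * tau if residual >= 0 else residual * (tau - 1.0)


def fit_quantile_line(x, y, tau):
    """Simple-linear quantile regression by exhaustive search over point pairs."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(pinball_loss(yk - (intercept + slope * xk), tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


def mean_absolute_error(truth, predicted):
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


# Simulate 80 unbiased specimens for one moderate-responsiveness species.
rng = random.Random(1)
INTERCEPT, SLOPE, SIGMA, DURATION = 50.0, -4.0, 30.0, 30.0
temps = [rng.uniform(1.0, 20.0) for _ in range(80)]
doys = []
for t in temps:
    peak = rng.gauss(INTERCEPT + SLOPE * t, SIGMA)
    doys.append(rng.uniform(peak - DURATION / 2, peak + DURATION / 2))

# Known population-level onset (10th percentile of peak dates) at each site.
z10 = NormalDist().inv_cdf(0.10)
true_onset = [INTERCEPT + SLOPE * t + z10 * SIGMA for t in temps]

a10, b10 = fit_quantile_line(temps, doys, 0.10)
a90, b90 = fit_quantile_line(temps, doys, 0.90)
pred_onset = [a10 + b10 * t for t in temps]
mae_onset = mean_absolute_error(true_onset, pred_onset)
print(round(b10, 2), round(mae_onset, 2))
```

The exhaustive search scales poorly (cubic in the number of specimens), so a dedicated quantile regression routine is the practical choice for 1000 specimens per species.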

Similarly, within each group of simulated species that exhibited the same flowering duration, phenological responsiveness, and intrapopulation phenological variation, we tested whether the mean MAE of estimated flowering onset and termination dates differed significantly from the mean MAE of estimated median flowering dates. We used Tukey HSD tests to compare the accuracy of estimated onset, median, and termination dates of the peak flowering period among all species produced from each of the simulated datasets.

Finally, we re-fit all 1200 models (including all 24 combinations of species parameters but excluding models constructed to test the effects of collection biases) with randomly selected subsets of data (100–1000 specimens per species) to determine how sample size affected model performance and predictive accuracy. To evaluate whether more data would be needed when variation in phenology among populations is not perfectly explained by the climate variables included in the model, we ran additional simulations in which population-level mean DOYs (and associated onset and termination DOYs of the flowering period) of each species at each sampled location and year included random variation not associated with local climate: adding either ±5 days (i.e., a low-noise scenario) or ±15 days (i.e., a high-noise scenario) to the onset, median, and termination DOYs of flowering. For each location and year, the random offsets of the DOYs of flowering onset, median flowering DOY, and flowering termination were identical, such that random variation was incorporated only into the timing of flowering, and not its duration.
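The noise and subsampling steps can be sketched as follows. The text gives only the magnitude of the offsets (±5 or ±15 days), so drawing each population's offset uniformly within that range is an assumption, as are the function names. The key property shown is that a single shared offset shifts the timing of flowering without changing its duration.

```python
import random


def add_population_noise(onset, median, termination, max_offset, rng):
    """Shift all three population-level DOYs by one shared random offset."""
    offset = rng.uniform(-max_offset, max_offset)   # assumed uniform within +/- max_offset
    return onset + offset, median + offset, termination + offset


def subsample(specimens, n, rng):
    """Random subset of n specimens, drawn without replacement."""
    return rng.sample(specimens, n)


rng = random.Random(3)
noisy = add_population_noise(27.2, 40.0, 52.8, max_offset=5.0, rng=rng)  # low-noise scenario
subset = subsample(list(range(1000)), 100, rng)                          # 100 of 1000 specimens
print([round(v, 1) for v in noisy], len(subset))
```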

 

Files

01_Simulated_distribution-popskew.ipynb

Files (2.8 MB)

Checksum  Size
md5:beaa66a666bb81fc723ea293d5a8bb6a  116.3 kB
md5:57b8c93295f937c0f938cf48d9746c1d  297.1 kB
md5:bc006bcdf44da1c8e7cdc9de01f03b02  13.6 kB
md5:7ad8bf3dd9182c9f528ab54da41633ca  329.4 kB
md5:3c5bf7f41401e1b158bd5e527ab37051  299.4 kB
md5:a8d15494573a9e55fcd5575cd5582ece  180.2 kB
md5:24bce6099fd763cd14b8669ae469dc94  1.1 MB
md5:3072c222a75fdee8ba5d1d6d24d2bc6d  308.0 kB
md5:5242c4ce8a2b838eae0015a05e7b5e1c  154.7 kB
md5:c9dd475e4e3d37aa1fc170e1aba46e96  8.9 kB

Additional details