Published January 25, 2019 | Version 1.0.0
Dataset Open

Large-scale and fine-grained phenological stage annotation of herbarium specimens datasets

  • 1. LIRMM, University of Montpellier, Inria
  • 2. R. K. Godfrey Herbarium, Florida State University
  • 3. La Brea Tar Pits and Museum
  • 4. AMAP, University of Montpellier, CIRAD, CNRS, INRA, IRD
  • 5. Division of Botany, Peabody Museum of Natural History, Yale University
  • 6. Department of Biological Sciences, California Polytechnic State University
  • 7. Agriculture and Agri-Food Canada
  • 8. School of Computing, Costa Rica Institute of Technology
  • 9. iDigBio, Florida State University
  • 10. Florida Museum of Natural History, University of Florida
  • 11. INRIA Sophia-Antipolis - ZENITH team, LIRMM

Description

This upload comprises four datasets of specimens from American herbaria, covering different levels of information precision and different floras, from temperate to equatorial.

Three of these datasets consist of selected specimens from herbaria located in different geographic and environmental regions. Each specimen in these three datasets was annotated with the following fields: family, genus, species name, fertile / non-fertile, presence / absence of flower(s), presence / absence of fruit(s). Together, these three datasets comprise 163,233 herbarium specimens belonging to 7,782 species, 1,906 genera, and 236 families. Specimens were annotated as “fertile” if any reproductive structures were present, such as sporangia (ferns), cones (gymnosperms), or flowers or fruits (angiosperms). Non-fertile specimens were those that lacked any reproductive structures.

The fourth dataset consists of 20,371 herbarium specimens from 11 genera in the sunflower family (Asteraceae). Unlike the other three, this dataset is annotated with fine-grained phenophase scores rather than presence/absence attributes (see description below).

Each of these datasets is described below:

  • NEVP: this dataset of New England vascular plant (NEVP) specimens was produced by members of the Consortium of Northeastern Herbaria. The dataset comprises 42,658 digitized specimens that belong to 1,375 species and come from several North American institutions. Most of the specimens in this dataset are from the north-temperate region of the northeastern United States.

  • FSU: this dataset was produced by the Florida State University's Robert K. Godfrey Herbarium (FSU), a collection that focuses on northern Florida and the U.S. Southeast Coastal Plain, one of North America's biodiversity hotspots. This dataset contains 54,263 digitized herbarium specimen records that belong to 3,870 species, making it the taxonomically richest dataset in this study. Most species in this dataset grow under subtropical or warm temperate conditions in the southeastern region of the United States.

  • CAY: this dataset comes from the IRD’s Herbarium of French Guiana (CAY). CAY is dedicated to the Guayana Shield flora, with a strong focus on tropical tree species. This dataset is composed of 66,312 herbarium specimens that belong to 3,024 species. All digitized specimens of this herbarium are accessible online. Most specimens were collected in the tropical rainforests of French Guiana, with the remaining specimens coming mostly from Suriname and Guyana.

  • PHENO: this dataset includes 20,371 herbarium specimens of 139 species in the Asteraceae produced in a study of phenological trends in the U.S. Southeast Coastal Plain. The dataset is composed of specimen records from 57 herbaria. Each recorded specimen was annotated for quartile percentages (0, 25, 50, 75, or 100%) of (i) closed buds, (ii) buds transformed into flowers, and (iii) fruits. According to the distribution of these three categories for each specimen, a phenophase code was computed.
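As a quick consistency check on the figures quoted above, the per-collection counts can be added up. This is only a sketch built from the numbers in this description. The reading that the gap between the summed and the distinct species counts reflects species shared by several collections is an inference, not something stated in the source.

```python
# Specimen and species counts for the three fertility datasets,
# copied from the descriptions above.
specimens = {"NEVP": 42_658, "FSU": 54_263, "CAY": 66_312}
species = {"NEVP": 1_375, "FSU": 3_870, "CAY": 3_024}

# Overall figures quoted for the three datasets taken together.
stated_total_specimens = 163_233
stated_distinct_species = 7_782

total_specimens = sum(specimens.values())
summed_species = sum(species.values())

# The specimen counts add up exactly to the stated total.
specimens_match = total_specimens == stated_total_specimens

# The species counts add up to more than the stated number of distinct
# species; the excess is presumably due to species that occur in more than
# one collection (an assumption, see the lead-in).
species_excess = summed_species - stated_distinct_species

print(total_specimens, specimens_match, summed_species, species_excess)
```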

 

Datasets format

These datasets are organized into three tasks:

  1. fertility detection
  2. flowers and/or fruit detection
  3. phenophase classification

The first two tasks are carried out on the first three datasets (NEVP, FSU, and CAY) and are therefore based on the same set of images, whereas the third task has its own disjoint set of images. This is why the data are split into two separate archives, one for each set of images.

Fertility detection & flower/fruit detection

These tasks are contained in the herbarium_fertility_annotations.zip archive, which consists of three files:

  • metadata.csv: general information about all the herbarium specimens for these tasks
    • id: specimen identifier
    • collection: the collection the specimen comes from (NEVP, FSU, or CAY)
    • herbarium: institution of origin of the specimen, relevant mainly for the NEVP collection
    • clade, family, genus, species: classification of the specimen
    • URL: URL of the scan
  • fertility_task.csv: specific information regarding the fertility detection task
    • id: specimen identifier
    • is_fertile: True if the specimen has an expression of fertility, False otherwise
    • train_test_set: the subset the specimen belongs to; possible values are train, random_test, species_test, and herbarium_test
  • flower_fruit_task.csv: specific information regarding the flower/fruit detection task
    • id: specimen identifier; note that not all the specimens described in metadata.csv are included in this task
    • has_flower: True if the specimen has at least one flower, False otherwise
    • has_fruit: True if the specimen has at least one fruit, False otherwise
    • train_test_set: the subset the specimen belongs to; possible values are train, random_test, species_test, and herbarium_test
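The files listed above share the id column, so a task file can be joined with metadata.csv. The following is a minimal sketch using only the Python standard library. It assumes that the CSV files are comma-separated, UTF-8 encoded, have a header row, and store the flags as the text True / False. None of this is confirmed by this description. The helper names are hypothetical.

```python
import csv
import io
import zipfile


def read_csv_from_zip(archive, member_name):
    """Return the rows of one CSV file inside a zip archive as a list of dicts.

    `archive` is a path or a binary file object. The member is looked up by
    base name, so it is found whether or not the archive has a top-level folder.
    """
    with zipfile.ZipFile(archive) as zf:
        matches = [n for n in zf.namelist() if n.split("/")[-1] == member_name]
        if not matches:
            raise KeyError(f"{member_name} not found in archive")
        with zf.open(matches[0]) as raw:
            # Assumption: UTF-8 text with a header row.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def join_on_id(metadata_rows, task_rows):
    """Attach the metadata columns to each task row, matching on `id`.

    Task rows without a metadata entry are skipped.
    """
    metadata_by_id = {row["id"]: row for row in metadata_rows}
    joined = []
    for row in task_rows:
        meta = metadata_by_id.get(row["id"])
        if meta is not None:
            joined.append({**meta, **row})
    return joined


def parse_flag(value):
    """Convert a textual True/False flag to a Python bool (assumed encoding)."""
    return value.strip().lower() == "true"
```

With the archive downloaded locally, `read_csv_from_zip("herbarium_fertility_annotations.zip", "metadata.csv")` followed by `join_on_id(...)` would give one dictionary per annotated specimen, with the classification and scan URL alongside the task labels.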

Phenophase classification

This task is contained in the herbarium_asteraceae_phenophase_annotations.zip archive, which consists of a single file:

  • annotations.csv:
    • id: specimen identifier
    • URL: URL of the scan
    • genus: genus of the specimen
    • phenophase: integer from 1 to 9 describing the phenophase of the specimen
    • train_test_set: the subset the specimen belongs to; possible values are train and test
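Once the rows of annotations.csv are loaded (for example with `csv.DictReader`), they can be separated by subset and the phenophase codes tallied. The sketch below uses hypothetical rows with placeholder identifiers and genus names. It assumes that phenophase is stored as an integer-valued string. The function names are illustrative only.

```python
from collections import Counter, defaultdict


def split_by_subset(rows, subset_column="train_test_set"):
    """Group annotation rows by the value of their subset column."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[subset_column]].append(row)
    return dict(subsets)


def phenophase_counts(rows):
    """Count specimens per phenophase code, rejecting codes outside 1..9."""
    counts = Counter()
    for row in rows:
        code = int(row["phenophase"])
        if not 1 <= code <= 9:
            raise ValueError(f"unexpected phenophase {code} for id {row['id']}")
        counts[code] += 1
    return counts


# Hypothetical rows, shaped like annotations.csv; not real data.
example_rows = [
    {"id": "a", "URL": "", "genus": "GenusA", "phenophase": "1", "train_test_set": "train"},
    {"id": "b", "URL": "", "genus": "GenusA", "phenophase": "5", "train_test_set": "train"},
    {"id": "c", "URL": "", "genus": "GenusB", "phenophase": "5", "train_test_set": "test"},
]

subsets = split_by_subset(example_rows)
train_counts = phenophase_counts(subsets["train"])
test_counts = phenophase_counts(subsets["test"])
```

The same `split_by_subset` helper applies to fertility_task.csv and flower_fruit_task.csv, whose train_test_set column takes the values train, random_test, species_test, and herbarium_test.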

 

Additional resources

More information can be found in the related paper:
Lorieul, T., K. D. Pearson, E. R. Ellwood, H. Goëau, J.-F. Molino, P. W. Sweeney, J. M. Yost, J. Sachs, E. Mata-Montero, G. Nelson, P. S. Soltis, P. Bonnet, and A. Joly. 2019. Toward a large-scale and deep phenological stage annotation of herbarium specimens: Case studies from temperate, tropical, and equatorial floras. Applications in Plant Sciences 7(3): e1233.

For an example of usage of these datasets as well as a baseline, see: http://doi.org/10.5281/zenodo.2549996

 

Files

herbarium_asteraceae_phenophase_annotations.zip

Files (3.1 MB)

  • md5:0f2e27deb7c47c6648335574f882d54c (266.7 kB)
  • md5:85452b677be117451bdad72bd2c3a0f0 (2.8 MB)
