Published November 22, 2024 | Version 1.1

Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process - Datasets

Description

This dataset contains supporting data for a research project aimed at analysing herbarium samples from the New England area at a large scale with deep learning techniques. Details on the methodology are given in the accompanying paper (to be published).

Content:

  • dataset600k_withAI.csv: A dataset of over 600,000 herbarium samples with their record metadata and the corresponding AI phenological annotations with matching confidence scores. All record headers are provided, extracted directly from the NEVP portal. In addition, the AI labels are defined by the following headers. These 8 columns represent 4 binary classifiers, giving the presence or absence of each of the 4 traits with the corresponding confidence (the presence and absence confidences for each trait sum to 1).
    • Flowering, Not Flowering, Budding, Not Budding, Fruiting, Not Fruiting, Reproductive, Not Reproductive
  • data_species_with_statuses.csv: A processed dataset summarizing the flowering period shift at the species level. Two types of headers are provided.
    • First, metadata concerning the flowering shift and the data used to compute that value:
      • genus: Genus of the species
      • genus_species: Binomial name of the species
      • slope: Regression slope quantifying the flowering shift
      • nb_specimens: Number of herbarium specimens used to compute the shift
      • p_value_significance: P-value significance of the slope being non-zero ('Non Significant'/'Significant')
      • trend_category: Summary of the shift as a binary characteristic ('Earlier'/'Later')
    • Second, metadata summarizing various traits associated with each species:
      • lifeform_status: Growth form from the USDA PLANTS Database ('Forb_Herb', 'Shrub_Tree' or 'Vine')
      • native_introduced_status: 'Native'/'Introduced' status from the USDA PLANTS Database
      • wetland_status: National Wetland Plant List (NWPL) Wetland Indicator Status within the Northcentral and Northeast Region ('OBL'/'FACW'/'FAC'/'FACU'/'UPL')
      • seasonality_average: A characteristic of the flowering season of the species based on the mean Day of Year of the analysed specimens: if <=180: 'Early', else 'Late'
      • seasonality_spread: A characteristic of the flowering season of the species based on the spread of the flowering season. Less than 28 days: 'Narrow', larger: 'Large'
  • phylogenetic_tree.tre: The raw data used to generate the visualization of the flowering seasonality character and the detected flowering shift for each species on a phylogenetic tree.
  • phylogenetic_processed_dataset.csv: The processed dataset resulting from the phylogenetic signal analysis. For each trait, an associated binary significance value is provided.
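
The column rules described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: it assumes the AI columns are spelled exactly as listed ('Flowering', 'Not Flowering', and so on) and hold values between 0 and 1, the 0.9 confidence threshold is a hypothetical choice, and a spread of exactly 28 days is assumed to count as 'Large'. The sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Trait names as listed in the description; the exact CSV spelling is an assumption.
TRAITS = ["Flowering", "Budding", "Fruiting", "Reproductive"]


def label_row(row, threshold=0.9):
    """Map each trait to True (present), False (absent) or None (uncertain).

    The threshold is a hypothetical value, not one stated in the description.
    """
    labels = {}
    for trait in TRAITS:
        present = float(row[trait])
        absent = float(row[f"Not {trait}"])
        if present >= threshold:
            labels[trait] = True
        elif absent >= threshold:
            labels[trait] = False
        else:
            labels[trait] = None  # neither side is confident enough
    return labels


def seasonality_average(mean_day_of_year):
    """'Early' if the mean flowering Day of Year is <= 180, else 'Late'."""
    return "Early" if mean_day_of_year <= 180 else "Late"


def seasonality_spread(spread_days):
    """'Narrow' if the spread is under 28 days, else 'Large' (28 itself assumed 'Large')."""
    return "Narrow" if spread_days < 28 else "Large"


# Made-up row for illustration only.
SAMPLE = (
    "Flowering,Not Flowering,Budding,Not Budding,"
    "Fruiting,Not Fruiting,Reproductive,Not Reproductive\n"
    "0.97,0.03,0.40,0.60,0.02,0.98,0.99,0.01\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = [label_row(r) for r in rows]
print(labels[0])
print(seasonality_average(165), seasonality_spread(35))
```

To apply the same idea to the real file, the in-memory sample would be replaced by an open handle on dataset600k_withAI.csv, after checking the actual header spelling.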

Files

Files (332.5 MB)

Checksum  Size
md5:9727e8f9129d1649e4834099b4ae37da  91.5 kB
md5:057dc40613b8bdcbe53578b187fa01d8  332.3 MB
md5:65a58c81d2e162d2d9008f33fb84ae33  83.5 kB
md5:8465acd88080abb20bb103fee71f2ee2  22.8 kB
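
A downloaded file can be checked against the md5 checksums listed above. The sketch below uses only the Python standard library; the function names are illustrative, and which checksum belongs to which file is not stated here, so the path and the listed value have to be paired by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:9727e8f9...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low, which matters for the 332.3 MB file.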