Published November 22, 2024 | Version 1.1
Dataset
Open
Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process - Datasets
Authors/Creators
Description
This dataset contains supporting data for a research project aimed at analysing herbarium samples from the New England area at a large scale with deep learning techniques. Details on the methodology are shared in the accompanying paper (to be published).
Content:
- dataset600k_withAI.csv: A dataset of over 600,000 herbarium samples, each with its record metadata and the corresponding AI phenological annotations with matching confidence scores. All record headers are provided as extracted directly from the NEVP portal. In addition, the AI labels are given in the following 8 columns. These columns represent 4 binary classifiers: for each of the 4 traits, a Presence column and an Absence column hold the corresponding confidence (the presence and absence values for a trait sum to 1).
  - Flowering / Not Flowering
  - Budding / Not Budding
  - Fruiting / Not Fruiting
  - Reproductive / Not Reproductive
- data_species_with_statuses.csv: A processed dataset summarizing flowering period shift at a species level. Two types of headers are provided.
- First metadata concerning the flowering shift and the data used to compute that value:
  - genus: Genus of the species
  - genus_species: Binomial name of the species
  - slope: Regression slope defining the flowering shift
  - nb_specimens: Number of herbarium specimens used to compute the shift
  - p_value_significance: P-value significance of the slope being non-zero ('Non Significant'/'Significant')
  - trend_category: Summary of the shift as a binary characteristic ('Earlier'/'Later')
- Second, metadata summarizing various traits associated to each species:
  - lifeform_status: Growth form from the USDA PLANTS Database: 'Forb_Herb', 'Shrub_Tree' or 'Vine'
  - native_introduced_status: 'Native'/'Introduced' status from the USDA PLANTS Database
  - wetland_status: National Wetland Plant List (NWPL) Wetland Indicator Status within the Northcentral and Northeast Region: 'OBL'/'FACW'/'FAC'/'FACU'/'UPL'
  - seasonality_average: A characteristic of the flowering season of the species based on the mean Day of Year of the analysed specimens: if <=180: 'Early', else 'Late'
  - seasonality_spread: A characteristic of the flowering season of the species based on the spread of the flowering season. Less than 28 days: 'Narrow', larger: 'Large'
- phylogenetic_tree.tre: The raw data used to generate the visualization of the flowering seasonality character and the detected flowering shift for each species on a phylogenetic tree.
- phylogenetic_processed_dataset.csv: The processed dataset resulting from the phylogenetic signal analysis. For each trait, an associated binary significance value is provided.
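
As an illustration of how the columns described above might be used, here is a minimal Python sketch. It is not the method from the accompanying paper. It assumes that the 8 AI label columns are named exactly as listed and hold values on a 0 to 1 scale, and that a spread of exactly 28 days counts as 'Large'. The helper names (`trait_labels`, `seasonality`), the `min_confidence` threshold, and the sample row are hypothetical and made up for the example.

```python
import csv
import io

# The 4 traits; each has a presence column and a "Not ..." absence column.
TRAITS = ["Flowering", "Budding", "Fruiting", "Reproductive"]


def trait_labels(row, min_confidence=0.9):
    """Map each trait to True/False, or None when confidence is too low.

    min_confidence is a hypothetical cut-off, not a value from the dataset.
    """
    labels = {}
    for trait in TRAITS:
        presence = float(row[trait])
        absence = float(row[f"Not {trait}"])
        # The description states that presence and absence sum to 1.
        if abs(presence + absence - 1.0) > 1e-6:
            raise ValueError(f"{trait}: confidences do not sum to 1")
        confidence = max(presence, absence)
        if confidence >= min_confidence:
            labels[trait] = presence >= absence
        else:
            labels[trait] = None  # too uncertain to keep the AI label
    return labels


def seasonality(mean_doy, spread_days):
    """Apply the documented rules for seasonality_average and seasonality_spread."""
    average = "Early" if mean_doy <= 180 else "Late"
    # Assumption: a spread of exactly 28 days is treated as 'Large'.
    spread = "Narrow" if spread_days < 28 else "Large"
    return average, spread


# Made-up sample row with the 8 AI label columns (not real data).
SAMPLE = (
    "Flowering,Not Flowering,Budding,Not Budding,"
    "Fruiting,Not Fruiting,Reproductive,Not Reproductive\n"
    "0.97,0.03,0.40,0.60,0.05,0.95,0.98,0.02\n"
)

row = next(csv.DictReader(io.StringIO(SAMPLE)))
labels = trait_labels(row)
print(labels)
print(seasonality(mean_doy=165, spread_days=21))
```

To run this against the real file, the in-memory sample can be replaced with `open("dataset600k_withAI.csv", newline="")`; the actual value scale should be checked first, since the threshold only makes sense on a 0 to 1 scale.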