Published March 5, 2024 | Version v1
Dataset Open

Pl@ntNet-CrowdSWE: Pl@ntNet collaborative learning with South-Western-Europe dataset

  • 1. Institut Montpelliérain Alexander Grothendieck
  • 2. National Institute for Research in Computer and Control Sciences
  • 3. Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier
  • 4. Université de Montpellier
  • 5. Centre National de la Recherche Scientifique
  • 6. Centre de Coopération Internationale en Recherche Agronomique pour le Développement
  • 7. French Agricultural Research Centre for International Development
  • 8. UMR Botanique et Modélisation de l'Architecture des Plantes et des végétations
  • 9. Institut Universitaire de France

Description

This repository contains the files for the Pl@ntNet South Western Europe (SWE) crowdsourced dataset.
It contains all species identifications and user votes for observations made between 2017 and 2023 in the SWE flora.

In total, more than 6 699 593 plant observations were labeled by 823 251 users between January 2017 and October 2023. In addition, 98 experts were selected to obtain ground truth values for 26 811 observations.

The structure of the dataset is described below, and a `readme.md` file is available in the record.

Directory structure in short

Pl@ntNet SWE dataset
├── answers
│   ├── answers.json
│   └── ground_truth.txt
├── converters
│   ├── tasks.json
│   └── classes.json
└── aggregation
    ├── authors.txt
    ├── ai_classes.json
    ├── ai_answers.json
    ├── ai_scores.json
    └── k-southwestern-europe.json

Crowdsourced data

The answers folder contains the crowdsourced answers and the associated ground truths.
The crowdsourced answers are stored in the answers.json file. It gathers more than 6 million tasks with answers from 823 251 users. It is formatted as a JSON object with nested levels: the observation ID, then the users who voted on it, each with their vote for the species label.

{
obsID: {userID: vote, userID2: vote,...},
...
}
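
To make the layout above concrete, here is a minimal Python sketch that reads a file in this format and computes a plain majority vote per observation. It is only an illustration of how the nested structure can be traversed, not the aggregation strategy used by Pl@ntNet. The function names `load_answers` and `majority_vote` and the toy IDs and labels are assumptions made for this example.

```python
import json
from collections import Counter


def load_answers(path):
    """Read a file laid out as {obsID: {userID: vote, ...}, ...}."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def majority_vote(answers):
    """Return {obsID: most frequent vote}.

    Ties are broken in favour of the vote encountered first.
    """
    return {
        obs_id: Counter(votes.values()).most_common(1)[0][0]
        for obs_id, votes in answers.items()
        if votes  # skip observations without any vote
    }


# Toy data in the same layout (IDs and labels are made up).
toy_answers = {
    "0": {"12": 3, "45": 3, "7": 8},
    "1": {"12": 5},
}
toy_labels = majority_vote(toy_answers)
```

On the real data one would call `majority_vote(load_answers("answers/answers.json"))`, with the path taken from the directory structure above. Since the file holds several million observations, loading it fully in memory may require a few gigabytes.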

A list of 98 experts was created to gather a partial ground truth in the ground_truth.txt file.
Each row represents an observation, and the associated class label is the currently considered ground truth.
This file lets us compute several performance metrics such as the accuracy of the label aggregation.
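
As an illustration of such a metric, the sketch below computes the accuracy of a set of aggregated labels against a partial ground truth. It rests on several assumptions that should be checked against the `readme.md` of the record: that row i of ground_truth.txt corresponds to obsID i, that observations without an expert label carry a placeholder value (here `-1`), and that obsIDs appear as strings once loaded from JSON. The function name `accuracy` is hypothetical.

```python
def accuracy(predicted, ground_truth, missing=-1):
    """Share of expert-labeled observations whose predicted label is correct.

    predicted    : {obsID (str): label}
    ground_truth : list of labels, position i assumed to match obsID i
    missing      : assumed placeholder for observations without ground truth
    """
    scored = 0
    correct = 0
    for obs_id, true_label in enumerate(ground_truth):
        if true_label == missing:
            continue  # no expert label, so the observation is not scored
        scored += 1
        if predicted.get(str(obs_id)) == true_label:
            correct += 1
    return correct / scored if scored else float("nan")


# Toy example: three observations, the second has no ground truth.
toy_truth = [3, -1, 9]
toy_predicted = {"0": 3, "1": 5, "2": 4}
toy_accuracy = accuracy(toy_predicted, toy_truth)
```

Only the observations with a known label enter the denominator, which matters here because the ground truth covers a small fraction of the dataset.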

Converters

In the converters folder, you can find the converters to obtain the Pl@ntNet official observation numbers (the last part of the URL https://identify.plantnet.org/fr/k-world-flora/observations/<id>) from the obsID used in answers.json. This is stored in the tasks.json file.
A similar dictionary converts the species proposed by users to a single label in {0, 1, 2, ...}.
This mapping is stored in classes.json.
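
A minimal sketch of how these two converters might be used follows. It assumes that tasks.json is a dictionary keyed by obsID whose values are the Pl@ntNet observation numbers, and that classes.json maps species names to integer labels. The key direction of both files is an assumption to verify on the actual data, and the helper names and toy values are hypothetical.

```python
BASE_URL = "https://identify.plantnet.org/fr/k-world-flora/observations/"


def observation_url(obs_id, tasks):
    """Build the Pl@ntNet URL of an observation from its obsID.

    Assumes `tasks` maps obsID (as a string) to the official observation number.
    """
    return BASE_URL + str(tasks[str(obs_id)])


def invert_classes(classes):
    """Turn an assumed {species name: label} mapping into {label: species name}."""
    return {label: name for name, label in classes.items()}


# Toy converters (values are made up for illustration).
toy_tasks = {"0": 1000000001, "1": 1000000002}
toy_classes = {"Species A": 0, "Species B": 1}

toy_url = observation_url(0, toy_tasks)
toy_names = invert_classes(toy_classes)
```

The inverted mapping is convenient for turning aggregated integer labels back into readable species names.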

As plant species can also have synonyms, we release the two files used to clean the user answers. The species.json file contains a list of all the accepted species determinations from the World Checklist of Vascular Plants.
Then, we focused on the SWE flora and replaced synonyms with the underlying accepted species using the k-southwestern-europe.json checklist from Plants of the World Online (POWO), maintained by the Royal Botanic Gardens, Kew. This checklist is written as follows:

[
   {
    "species": species name,
    "synonyms": [
        synonym1,
        synonym2,
        ...
        ]
   },
   ...
]
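
Given a checklist in this format, synonym replacement can be sketched as a dictionary lookup. The example below is an illustration under the format shown above, not the cleaning code actually used for the dataset. The function names and the placeholder species names are hypothetical.

```python
def build_synonym_map(checklist):
    """Map every accepted name and every synonym to its accepted species."""
    to_accepted = {}
    for entry in checklist:
        accepted = entry["species"]
        for synonym in entry.get("synonyms", []):
            # Keep the first accepted species seen for a given synonym.
            to_accepted.setdefault(synonym, accepted)
    for entry in checklist:
        # Accepted names always map to themselves.
        to_accepted[entry["species"]] = entry["species"]
    return to_accepted


def normalize(name, to_accepted):
    """Return the accepted species for `name`, or None if it is not in the checklist."""
    return to_accepted.get(name)


# Toy checklist with placeholder names.
toy_checklist = [
    {"species": "Species A", "synonyms": ["Old name A1", "Old name A2"]},
    {"species": "Species B", "synonyms": []},
]
toy_map = build_synonym_map(toy_checklist)
```

Returning `None` for names outside the checklist leaves the decision of how to treat out-of-flora answers to the caller.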

Files to run the Pl@ntNet label aggregation strategy


To run the Pl@ntNet label aggregation strategy available in the peerannot library, several other pieces of information are needed and located in the aggregation folder.

- First, we need to know for each task which user was the author (if they proposed an initial species determination).
This information is stored in the authors.txt file, where each row corresponds to an obsID and the value is the userID of the author. If the author did not propose any species, the value is set to -1.

- Second, to run the label aggregation strategies that take the AI vote into account, we extend the `classes.json` file with the AI-predicted classes in the ai_classes.json file. Each species is associated with a number, including species newly introduced by the AI.
- Third, we need the AI predictions. The AI answers are stored in the ai_answers.json file, where each key is the obsID and each value is the class predicted by the AI. Synonyms were also removed using the k-southwestern-europe.json file.
- Finally, for strategies that take the prediction score into account, we release the ai_scores.json file, where each key is the obsID and each value is the probability given to the predicted class.
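
To show how these files fit together, here is a deliberately simple sketch in which the AI acts as one additional voter whenever its score reaches a threshold. This is not the Pl@ntNet label aggregation strategy implemented in peerannot; it is a swapped-in toy rule meant only to illustrate the data layout. The function name `vote_with_ai`, the `min_score` parameter and its default of 0.5 are assumptions of this example.

```python
from collections import Counter


def vote_with_ai(answers, ai_answers, ai_scores, min_score=0.5):
    """Majority vote where a confident AI prediction counts as one extra vote.

    answers    : {obsID: {userID: vote}}
    ai_answers : {obsID: class predicted by the AI}
    ai_scores  : {obsID: probability of the predicted class}
    """
    labels = {}
    for obs_id, votes in answers.items():
        counts = Counter(votes.values())
        confident = ai_scores.get(obs_id, 0.0) >= min_score
        if obs_id in ai_answers and confident:
            counts[ai_answers[obs_id]] += 1
        if counts:
            labels[obs_id] = counts.most_common(1)[0][0]
    return labels


# Toy inputs (IDs, labels and scores are made up).
toy_answers = {"0": {"12": 3, "45": 8}, "1": {"12": 5}}
toy_ai_answers = {"0": 8, "1": 2}
toy_ai_scores = {"0": 0.9, "1": 0.2}

toy_labels = vote_with_ai(toy_answers, toy_ai_answers, toy_ai_scores)
```

In the toy data the AI breaks the tie on observation "0" and is ignored on observation "1" because its score is below the threshold. Labels used here must come from ai_classes.json rather than classes.json, since the AI can predict species that no user proposed.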

Files

plantnet_swe.zip

Files (183.3 MB)

md5:037a6abccd51aa7cd9018e280d3bbdd5 (183.3 MB)
md5:330bb61c4ba2b45eb0aa30b6ea43335d (3.8 kB)
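
After downloading, the checksums listed above can be verified with the Python standard library. The sketch below reads the file in chunks so that the archive does not need to fit in memory. Which checksum belongs to which file name is not stated in the listing, so pairing the 183.3 MB checksum with plantnet_swe.zip is an assumption. The helper names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches("plantnet_swe.zip", "md5:037a6abccd51aa7cd9018e280d3bbdd5")` should return `True` for an intact download, assuming that checksum is the one for the archive.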

Additional details

Funding

Pl@ntAgroEco 22-PEAE0009
Agence Nationale de la Recherche
IA CaMeLOt ANR-20-CHIA-0001-01
Agence Nationale de la Recherche
Grand Équipement National de Calcul Intensif (France)
GUARDEN 101060693
Centre de Coopération Internationale en Recherche Agronomique pour le Développement
MAMBO (Horizon EU) 101060639
European Union