WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models
Description
Authors: Valentin Gabeff, Marc Russwurm, Devis Tuia & Alexander Mathis
Affiliation: EPFL
Date: January 2024
Link to the article: https://link.springer.com/article/10.1007/s11263-024-02026-6
--------------------------------
WildCLIP is a fine-tuned CLIP model that retrieves camera-trap events from the Snapshot Serengeti dataset using natural-language queries. This project aims to demonstrate how vision-language models can assist the annotation of camera-trap datasets.
Here we provide the processed Snapshot Serengeti data used to train and evaluate WildCLIP, along with four WildCLIP model checkpoints (weights).
Details on how to run these models can be found in the project GitHub repository.
Provided data (images and attribute annotations):
The data consist of 380 x 380 image crops obtained from the MegaDetector outputs on Snapshot Serengeti, keeping detections with a confidence above 0.7. We considered only camera-trap images containing a single individual.
A description of the original data can be found on LILA (https://lila.science/datasets/snapshot-serengeti); the data are released under the Community Data License Agreement (permissive variant).
We warmly thank the authors of LILA for making the MegaDetector outputs publicly available, as well as for structuring the dataset and facilitating access to it.
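For illustration, the snippet below sketches this preprocessing step on a standard MegaDetector batch-output JSON (normalized [x, y, w, h] boxes, category "1" = animal). File paths are placeholders, and the authors' actual preprocessing (available in the GitHub repository) may differ in details such as crop padding.

```python
# Minimal sketch of the crop-extraction step described above.
# Assumes a standard MegaDetector batch-output JSON; paths are hypothetical.
import json
from pathlib import Path
from PIL import Image

MD_JSON = "megadetector_output.json"      # hypothetical MegaDetector output file
IMAGE_ROOT = Path("snapshot_serengeti")   # hypothetical image root
OUT_DIR = Path("crops_380")
OUT_DIR.mkdir(exist_ok=True)

with open(MD_JSON) as f:
    md = json.load(f)

for entry in md["images"]:
    # Keep animal detections ("category" == "1") above the 0.7 confidence threshold.
    dets = [d for d in (entry.get("detections") or [])
            if d["category"] == "1" and d["conf"] > 0.7]
    # Keep only images with a single detected individual, as in the provided dataset.
    if len(dets) != 1:
        continue
    x, y, w, h = dets[0]["bbox"]          # normalized [x_min, y_min, width, height]
    img = Image.open(IMAGE_ROOT / entry["file"]).convert("RGB")
    W, H = img.size
    crop = img.crop((int(x * W), int(y * H), int((x + w) * W), int((y + h) * H)))
    crop = crop.resize((380, 380))
    crop.save(OUT_DIR / (Path(entry["file"]).stem + ".jpg"))
```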
Adapted CLIP model (model weights):
WildCLIP models provided (a minimal loading example is sketched after this list):
- [New] WildCLIP_vitb16_t1.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1. Trained on both base and novel vocabulary (see paper for details).
- [New] WildCLIP_vitb16_t1_lwf.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1, and with the additional VR-LwF loss. Trained on both base and novel vocabulary (see paper for details).
- WildCLIP_vitb16_t1_base.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1. Model used for evaluation and trained on base vocabulary only. (previously named WildCLIP_vitb16_t1.pth)
- WildCLIP_vitb16_t1t7_lwf_base.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following templates 1 to 7, and with the additional VR-LwF loss. Model used for evaluation and trained on base vocabulary only. (previously named WildCLIP_vitb16_t1t7_lwf.pth)
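As a usage illustration, the sketch below loads one of these checkpoints into the OpenAI CLIP ViT-B/16 implementation and ranks a few crops against a free-text query. It assumes the .pth file stores a CLIP-compatible state_dict and uses placeholder image names; the reference loading code lives in the GitHub repository and may differ (for example, a wrapped checkpoint format).

```python
# Minimal retrieval sketch: rank image crops by similarity to a text query.
# Assumes the checkpoint is a state_dict for the OpenAI CLIP ViT-B/16 model.
import torch
import clip            # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

state = torch.load("WildCLIP_vitb16_t1_lwf.pth", map_location=device)
model.load_state_dict(state.get("state_dict", state))  # handle either layout
model.eval()

# Hypothetical crop file names; replace with paths to the provided 380 x 380 crops.
crop_paths = ["crop_0001.jpg", "crop_0002.jpg"]
queries = clip.tokenize(["a lion lying in the grass at night"]).to(device)
images = torch.stack([preprocess(Image.open(p)) for p in crop_paths]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(queries)
    # Cosine similarity between L2-normalized embeddings; higher = better match.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = (txt_feat @ img_feat.T).squeeze(0)

print(scores.argsort(descending=True))   # indices of the most relevant crops first
```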
We also provide the CSV files containing the train / val / test splits. The train / test splits follow the camera-level split from LILA (https://lila.science/datasets/snapshot-serengeti). The validation split is custom and also defined at the camera level. A short example of reading these files follows the list below.
- train_dataset_crops_single_animal_template_captions_T1T7_ID.csv: Train set with captions from templates 1 through 7 (column "all captions") or template 1 only (column "template 1")
- val_dataset_crops_single_animal_template_captions_T1T7_ID.csv: Validation set with captions from templates 1 through 7 (column "all captions") or template 1 only (column "template 1")
- test_dataset_crops_single_animal_template_captions_T1T8T10.csv: Test set with captions from templates 1, 8, 9 and 10 (column "all captions")
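A minimal example of reading these splits with pandas is sketched below; the caption column names follow the description above, but the exact CSV layout should be checked against the files themselves.

```python
# Sketch of loading a split file and accessing the caption columns.
import pandas as pd

train = pd.read_csv("train_dataset_crops_single_animal_template_captions_T1T7_ID.csv")
print(train.columns.tolist())        # inspect the actual column names first

# Captions generated from template 1 only:
t1_captions = train["template 1"]
# Captions generated from templates 1 through 7:
all_captions = train["all captions"]
print(t1_captions.head())
```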
Details on how the models were trained can be found in the associated publication.
References:
If you find our code or weights useful, please cite:
@article{gabeff2024wildclip,
  title={WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models},
  author={Gabeff, Valentin and Ru{\ss}wurm, Marc and Tuia, Devis and Mathis, Alexander},
  journal={International Journal of Computer Vision},
  pages={1--17},
  year={2024},
  publisher={Springer}
}
If you use the adapted Snapshot Serengeti data, please also cite the original article:
@article{swanson2015snapshot,
  title={Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna},
  author={Swanson, Alexandra and Kosmala, Margaret and Lintott, Chris and Simpson, Robert and Smith, Arfon and Packer, Craig},
  journal={Scientific Data},
  volume={2},
  number={1},
  pages={1--14},
  year={2015},
  publisher={Nature Publishing Group}
}
Files
(27.2 GB)

| Name | Size | MD5 |
| --- | --- | --- |
| README.md | 4.7 kB | 2f4e89dee02aa4035b34c8320634a1cf |
|  | 20.5 GB | ffaaaa29a3679332713d861ca11a7994 |
|  | 576.2 MB | 9c172064f99b3801333b74a3fcc3a198 |
|  | 894.5 MB | f74e038058d7b5093a7ee8b068fa187d |
|  | 95.9 MB | 037c6f4ac5b120d36a0c8346ef16cec4 |
|  | 1.3 GB | 94ee4fb8b8d8d8dd8dddd171251aa9b6 |
|  | 1.3 GB | 76592825733e7fcfcef840d6e953de09 |
|  | 1.3 GB | 6b370915b888bd9af4929fddb656fbe3 |
|  | 1.3 GB | ef359082d3f840546d496aa171f559d2 |
Additional details
Related works
- Is supplement to: Publication 10.1007/s11263-024-02026-6 (DOI)