VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from MVK Dataset

Giuseppe Amato; Paolo Bolettieri; Fabio Carrara; Fabrizio Falchi; Claudio Gennaro; Nicola Messina; Lucia Vadicamo; Claudio Vairo

doi:10.5281/zenodo.8355037

Published September 18, 2023 | Version v1

Dataset Open

1. CNR-ISTI

Contributors

Data curators:

1. CNR-ISTI

This repository contains a diverse set of features extracted from the marine video (underwater) dataset (MVK). These features were used in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).

We used a 2023 snapshot of the MVK dataset, which can be downloaded by following the instructions at https://download-dbis.dmi.unibas.ch/mvk/. The snapshot comprises 1,372 video files. We divided each video into 1-second segments.

This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:

@inproceedings{amato2023visione, 
title={VISIONE at Video Browser Showdown 2023}, 
author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio}, 
booktitle={International Conference on Multimedia Modeling}, 
pages={615--621}, 
year={2023}, 
organization={Springer} 
}

This repository comprises the following files:

msb.tar.gz contains tab-separated files (.tsv) for each video. Each tsv file reports, for each video segment, the timestamp and frame number marking the start/end of the video segment, along with the timestamp of the extracted middle frame and the associated identifier ("id_visione").
extract-keyframes-from-msb.tar.gz contains a Python script designed to extract the middle frame of each video segment from the MSB files. To run the script successfully, please ensure that you have the original MVK videos available.
features-aladin.tar.gz^† contains ALADIN [Messina et al. 2022] features extracted from the middle frame of every segment.
features-clip-laion.tar.gz^† contains CLIP ViT-H/14 - LAION-2B [Schuhmann et al. 2022] features extracted from the middle frame of every segment.
features-clip-openai.tar.gz^† contains CLIP ViT-L/14 [Radford et al. 2021] features extracted from the middle frame of every segment.
features-clip2video.tar.gz^† contains CLIP2Video [Fang H. et al. 2021] features extracted from every 1-second video segment.
objects-frcnn-oiv4.tar.gz^* contains the objects detected using Faster R-CNN+Inception ResNet (trained on Open Images V4 [Kuznetsova et al. 2020]).
objects-mrcnn-lvis.tar.gz^* contains the objects detected using Mask R-CNN [He et al. 2017] (trained on LVIS).
objects-vfnet64-coco.tar.gz^* contains the objects detected using VFNet [Zhang et al. 2021] (trained on the COCO dataset [Lin et al. 2014]).
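
As a rough illustration of how one MSB file might be read, the sketch below indexes the rows of a tab-separated file by "id_visione". The description above only states which pieces of information each row carries; the column names other than "id_visione", the identifiers, and the sample values are hypothetical, so check the header of an actual .tsv file before relying on them.

```python
import csv
import io

# Hypothetical sample. Real files come from msb.tar.gz, and their header
# names (other than "id_visione") may differ from the ones assumed here.
SAMPLE_TSV = (
    "id_visione\tstart_frame\tstart_time\tend_frame\tend_time\tmiddle_time\n"
    "video_a-001\t0\t0.0\t30\t1.0\t0.5\n"
    "video_a-002\t30\t1.0\t60\t2.0\t1.5\n"
)

def load_segments(tsv_file):
    """Return a dict mapping id_visione to the remaining columns of its row."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row.pop("id_visione"): row for row in reader}

segments = load_segments(io.StringIO(SAMPLE_TSV))
# Values are read as strings, so convert them explicitly when needed.
middle = float(segments["video_a-002"]["middle_time"])
print(middle)
```

For a real file, the same function could be called on `open(path, newline="")` instead of the in-memory sample.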

*Note on the object annotations: Each object archive contains one jsonl file per video, in which each row is the record of one video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Each record also holds three arrays with the detected objects, their scores, and their bounding boxes. The format of these arrays is as follows:

"object_class_names": vector with the class name of each detected object.
"object_scores": scores corresponding to each detected object.
"object_boxes_yxyx": bounding boxes of the detected objects in the format (ymin, xmin, ymax, xmax).

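The record layout described above could be consumed along the following lines. This is a minimal sketch: the sample record, its identifier, class names, and values are made up, and the 0.5 score threshold is an arbitrary choice. Whether the boxes are normalized or in pixels is not stated here, and the code does not depend on it.

```python
import json

# Hypothetical record in the layout described above; all values are made up.
SAMPLE_JSONL = (
    '{"_id": "video_a-001", '
    '"object_class_names": ["fish", "coral", "fish"], '
    '"object_scores": [0.91, 0.42, 0.67], '
    '"object_boxes_yxyx": [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 0.3, 0.3], '
    '[0.4, 0.5, 0.9, 0.8]]}\n'
)

def detections(jsonl_text, min_score=0.5):
    """Yield (segment_id, class_name, score, box_xyxy) for confident objects."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows = zip(record["object_class_names"],
                   record["object_scores"],
                   record["object_boxes_yxyx"])
        for name, score, (ymin, xmin, ymax, xmax) in rows:
            if score >= min_score:
                # Reorder to (xmin, ymin, xmax, ymax), which many tools expect.
                yield record["_id"], name, score, (xmin, ymin, xmax, ymax)

kept = list(detections(SAMPLE_JSONL))
for item in kept:
    print(item)
```
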
^†Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the MVK dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used (see links above). Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.
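
The query-by-image comparison mentioned above can be sketched as a ranking by dot product. The on-disk format of the feature archives is not described here, so the in-memory dictionary, the segment identifiers, and the 3-dimensional toy vectors below are stand-ins, not the real data. The sketch also does not normalize the vectors, since the description does not say whether that is needed.

```python
# Toy 3-dimensional vectors standing in for real features (hypothetical ids).
features = {
    "video_a-001": [0.9, 0.1, 0.0],
    "video_a-002": [0.0, 1.0, 0.0],
    "video_b-001": [0.7, 0.7, 0.1],
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def search(query, index, k=2):
    """Return the k ids whose features have the largest dot product with query."""
    scored = sorted(index.items(),
                    key=lambda item: dot(query, item[1]),
                    reverse=True)
    return [segment_id for segment_id, _ in scored[:k]]

# Query by image: use the feature of one segment as the query vector.
results = search(features["video_a-001"], features)
print(results)
```

With real features, a vectorized library would likely replace the pure-Python loop; the ranking logic stays the same.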

We plan to release code that reproduces the VISIONE system, including the services that transform text into cross-modal features. This work is still in progress, and the code is not currently available.

References:

[Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.

[Amato et al. 2022] Amato, G. et al., 2022. VISIONE at Video Browser Showdown 2022. In MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.

[Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

[He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

[Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

[Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

[Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).

[Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

[Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.

[Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CV

Files (1.3 GB)

Name	MD5	Size
extract-keyframes-from-msb.tar.gz	d20bc40dd6ebaadeb39326d1c1ab662f	1.7 kB
features-aladin.tar.gz	d9cb47522018cdf5868d070eb25523a2	128.2 MB
features-clip-laion.tar.gz	6f11dee8de120e66f2e838e2dd4d2352	170.4 MB
features-clip-openai.tar.gz	83adffb11629efd5c89d4a39c73943d5	128.4 MB
features-clip2video.tar.gz	a413c562416d43eaba87169513796d1a	86.3 MB
msb.tar.gz	9bcbc0c32447a7ee42198774d0b7259b	107.8 kB
objects-frcnn-oiv4.tar.gz	3bd8cbf48326498d84c3b9101f64c4e0	218.1 MB
objects-mrcnn-lvis.tar.gz	4904ac96a22d2e610c2551c193176261	447.8 MB
objects-vfnet64-coco.tar.gz	f953a972ea30c8f8d5cd7161c2b00b8f	93.6 MB
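
The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of` and `verify` are illustrative, and the self-check runs on a temporary file so that the example works without the archives being present.

```python
import hashlib
import os
import tempfile

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_md5):
    """Return True if the file at path matches the expected MD5 digest."""
    return md5_of(path) == expected_md5.lower()

# Self-check on a temporary file so the sketch runs without the archives.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = verify(tmp.name, "5d41402abc4b2a76b9719d911017c592")
os.remove(tmp.name)
print(demo_ok)
```

After downloading, a call such as `verify("msb.tar.gz", "9bcbc0c32447a7ee42198774d0b7259b")` would compare the local file against the listed value.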

Additional details

Is part of: Conference paper: 10.1145/3591106.3592226 (DOI); Conference paper: 10.1007/978-3-031-27077-2_48 (DOI)
Is source of: Conference paper: 10.1007/978-3-030-98355-0_52 (DOI)

Funding

European Commission
AI4Media - A European Excellence Centre for Media, Society and Democracy (grant 951911)

