There is a newer version of the record available.

Published September 16, 2022 | Version train
Dataset Open

Toloka Visual Question Answering Dataset

Description

Our dataset consists of images paired with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, https://toloka.ai/.

Our dataset has 45,199 instances split among three subsets: train (38,990 instances), public test (1,705 instances), and private test (4,504 instances). The entire train dataset will be available to everyone from the start of the challenge. The public test dataset will be available from the evaluation phase of the competition, but without any ground truth labels. The private test dataset will not be available until the challenge ends.

The datasets will be provided as comma-separated values (CSV) files containing the following columns.

Column    Type     Description
image     string   URL of an image on a public content delivery network
width     integer  image width
height    integer  image height
left      integer  bounding box coordinate: left
top       integer  bounding box coordinate: top
right     integer  bounding box coordinate: right
bottom    integer  bounding box coordinate: bottom
question  string   question in English
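
As a minimal sketch of how the columns above might be read, the following Python snippet parses rows with the standard `csv` module and runs a basic sanity check on each bounding box. It assumes that the CSV files carry a header row whose names match the column names listed above, and that coordinates are measured from the top-left corner of the image with `left < right` and `top < bottom`; neither assumption is stated in this record. The `Instance` class, the helper functions, and the sample row are hypothetical and exist only for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Instance:
    """One question-image pair with its ground truth bounding box."""
    image: str
    width: int
    height: int
    left: int
    top: int
    right: int
    bottom: int
    question: str


# Columns documented as integers in the record.
INT_COLUMNS = ("width", "height", "left", "top", "right", "bottom")


def read_instances(handle):
    """Yield one Instance per CSV row, converting the integer columns."""
    for row in csv.DictReader(handle):
        values = {name: int(row[name]) for name in INT_COLUMNS}
        yield Instance(image=row["image"], question=row["question"], **values)


def box_inside_image(inst):
    """Check that the box is non-empty and lies within the image bounds.

    Assumption: top-left origin, with right/bottom allowed to equal the
    image width/height.
    """
    return (0 <= inst.left < inst.right <= inst.width
            and 0 <= inst.top < inst.bottom <= inst.height)


# Made-up sample row for illustration only (not taken from train.csv).
SAMPLE = (
    "image,width,height,left,top,right,bottom,question\n"
    "https://example.com/img.jpg,640,480,10,20,110,220,What do we use to sit?\n"
)

instances = list(read_instances(io.StringIO(SAMPLE)))
print(instances[0].question, box_inside_image(instances[0]))
```

To read a downloaded file instead of the in-memory sample, the same function can be given an open file, for example `open("train.csv", newline="", encoding="utf-8")`.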

This upload also contains a ZIP file with the images from MS COCO.

Files

train.csv

Files (6.5 GB)

Checksum                              Size
md5:65bbfff1ad9fe258c8eb31d831375c46  4.8 MB
md5:31fe4d950e1e0357db7a35d30fc6769a  6.3 GB
md5:32f10a2ff738822fcdb952f629d55e5d  123.6 kB
md5:443bfd2782bbe2384ea372804087a69a  160.7 MB
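
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below shows one way to do this with Python's standard `hashlib` module; it reads the file in chunks so that the multi-gigabyte archive does not need to fit in memory. The function names and the demo file are hypothetical, and the demo hashes a small temporary stand-in rather than a real file from this record.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listing entry such as 'md5:65bb...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a temporary stand-in file (not a file from this record).
payload = b"stand-in for a downloaded file"
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(payload)
    listed = "md5:" + hashlib.md5(payload).hexdigest()
    demo_ok = matches_listing(demo_path, listed)

print(demo_ok)
```

For a real download, `matches_listing` would be called with the local path of the file and the corresponding `md5:` entry copied from the listing.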

Additional details

Related works

Is compiled by
Other: https://toloka.ai/research/ (URL)
Is supplement to
Project deliverable: https://toloka.ai/challenges/wsdm2023/ (URL)
Software: https://github.com/Toloka/WSDMCup2023 (URL)
Dataset: https://cocodataset.org/ (URL)
Project deliverable: https://codalab.lisn.upsaclay.fr/competitions/7434 (URL)