#PraCegoVer dataset

Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila

doi:10.5281/zenodo.5710562

Published November 18, 2021 | Version v1

1. Institute of Computing, University of Campinas

Automatically describing images in natural-language sentences is essential to the inclusion of visually impaired people on the Internet. Although the literature offers many image captioning datasets, most contain only English captions, and datasets with captions in other languages are scarce.

The PraCegoVer movement arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we propose #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

#PraCegoVer has 533,523 image-caption pairs, with captions in Portuguese, collected from more than 14 thousand different profiles. The average caption length is 39.3 words, with a standard deviation of 29.7.
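
As a rough illustration of how caption-length statistics like those above can be computed, here is a minimal sketch. It assumes that words are split on whitespace and that the population standard deviation is used; the description does not state which tokenization or estimator produced the reported figures, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def caption_length_stats(captions):
    """Return (mean, standard deviation) of caption lengths in words.

    Assumptions: words are whitespace-separated tokens and the
    population standard deviation is used. Neither choice is
    confirmed by the dataset description.
    """
    lengths = [len(caption.split()) for caption in captions]
    return mean(lengths), pstdev(lengths)
```

Different tokenization rules, for example around hashtags or punctuation, would shift both numbers slightly.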

Dataset Structure

The #PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json contains a list of JSON objects with the following attributes:

user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the filename attribute. We also provide a sample with five instances, so users can get an overview of the dataset before downloading it in full.
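
The structure described above can be read with a few lines of Python. This is a minimal sketch, assuming that dataset.json holds a single top-level JSON list and that the extracted images live in a directory named images; the helper names are hypothetical and are not part of the PraCegoVer repository.

```python
import json
from pathlib import Path

# Attribute names as listed in the dataset description.
EXPECTED_KEYS = {"user", "filename", "raw_caption", "caption", "date"}


def load_records(json_path):
    """Load dataset.json, assumed to be one top-level JSON list."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of objects")
    return records


def missing_keys(record):
    """Return the documented attributes that are absent from a record."""
    return EXPECTED_KEYS - set(record)


def image_path(record, images_dir="images"):
    """Build the path of the image a record points to.

    The directory name "images" is an assumption about where the
    archive is extracted.
    """
    return Path(images_dir) / record["filename"]
```

Checking missing_keys on a few records of the five-instance sample is a cheap way to confirm the layout before downloading the full dataset.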

Download Instructions

If you only want an overview of the dataset structure, you can download sample.tar.gz. To use the dataset or any of its subsets (63k and 173k), however, you must download all the files and run the following commands to join and uncompress them:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
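
On systems without cat and tar, the same two steps can be approximated in Python. This is a minimal sketch rather than part of the official tooling: it assumes the part files sit in one directory and that sorting them by name reproduces the order in which the shell expands images.tar.gz.part*. The function names are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(directory=".", output="images.tar.gz"):
    """Concatenate images.tar.gz.part* files into a single archive.

    Parts are joined in name order, which is assumed to match the
    order of the shell glob in the cat command.
    """
    directory = Path(directory)
    parts = sorted(directory.glob("images.tar.gz.part*"))
    if not parts:
        raise FileNotFoundError("no images.tar.gz.part* files found")
    out_path = directory / output
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
    return out_path


def extract(archive, dest="."):
    """Uncompress a gzip-compressed tar archive into dest."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

A call such as extract(join_parts("."), ".") mirrors the two shell commands. The joined archive needs as much free disk space as all the parts combined.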

Alternatively, you can download the entire dataset from the terminal using the Python script download_dataset.py, available in the PraCegoVer repository. In this case, you first have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=<your access token>

Notes

Funding acknowledgements: G.O.S. is funded by the São Paulo Research Foundation (FAPESP) (2019/24041-4). E.L.C. and S.A. are partially funded by H.IAAC (Artificial Intelligence and Cognitive Architectures Hub). S.A. is also partially funded by FAPESP (2013/08293-7), a CNPq PQ-2 grant (315231/2020-3), and Google LARA 2020. The opinions expressed in this work do not necessarily reflect those of the funding agencies.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

The #PraCegoVer dataset consists of data collected from public Instagram profiles, for which we do not hold the copyright. Moreover, the images and raw captions might contain sensitive data that reveal racial or ethnic origins, sexual orientations, and religious beliefs. Researchers interested in accessing this dataset for non-commercial research or educational purposes may request access under the following conditions:

Terms of Access
By requesting access to the #PraCegoVer dataset (the "Database"), the "Researcher" agrees to the following terms and conditions:

1. The Researcher shall use the Database only for non-commercial research and educational purposes.
2. The Researcher assumes full responsibility for the use of the Database and agrees to defend, indemnify, and hold harmless the #PraCegoVer team and the University of Campinas, including its employees, trustees, officers, and agents, from any claims arising from the Researcher’s use of the Database. This includes, but is not limited to, any claims related to the use of copyrighted images contained in the Database.
3. Under no circumstances shall the Researcher attempt to identify any individuals in the Database.
4. The Researcher may share access to the Database with research associates or colleagues, provided that these individuals first agree to be bound by these same terms and conditions.
5. The #PraCegoVer team and the University of Campinas reserve the right to revoke the Researcher’s access to the Database at any time without prior notice.

Access Request
To request access to the dataset, please fill out this form with the following information:

1. Name of the institution with which you are affiliated.
2. A brief description of the intended use of the dataset.

Please note that requests must be submitted using an official institutional email account; otherwise, your request will be automatically rejected.

Additional details

Is documented by: Journal article: https://www.mdpi.com/2306-5729/7/2/13 (URL)
Is supplemented by: Software: https://github.com/larocs/PraCegoVer (URL)

              All versions   This version
Views         1,915          1,156
Downloads     440            286
Data volume   7.9 TB         6.8 TB
