Published November 18, 2021 | Version v2
Dataset Restricted

#PraCegoVer dataset

  • 1. Institute of Computing, University of Campinas

Description

Automatically describing images in natural-language sentences is an essential task for the inclusion of visually impaired people on the Internet. Although many image captioning datasets exist in the literature, most of them contain only English captions, and datasets with captions in other languages are scarce.

The PraCegoVer movement arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we propose #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

#PraCegoVer has 533,523 image-caption pairs, with captions in Portuguese, collected from more than 14 thousand different profiles. The average caption length in #PraCegoVer is 39.3 words, with a standard deviation of 29.7.

 

New Release

We release pracegover_400k.json, which contains 403,337 examples from the original dataset.json after preprocessing and duplicate removal. It is split into train, validation, and test sets with 242,036, 80,628, and 80,673 examples, respectively.
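
As a quick sanity check on the figures above, the split sizes add up to the stated total and correspond to roughly a 60/20/20 partition. The sketch below uses only the numbers quoted in this release note and assumes nothing about the internal layout of pracegover_400k.json, which is not described here.

```python
# Split sizes as quoted in the release note above.
splits = {"train": 242_036, "validation": 80_628, "test": 80_673}

# Total number of examples and the share of each split.
total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

for name, size in splits.items():
    print(f"{name:<10} {size:>7,}  {fractions[name]:.1%}")
print(f"{'total':<10} {total:>7,}")
```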

 

Dataset Structure

The #PraCegoVer dataset comprises a main file, dataset.json, and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json contains a list of JSON objects with the following attributes:

  • user: anonymized user that made the post;
  • filename: image file name;
  • raw_caption: raw caption;
  • caption: clean caption;
  • date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the filename attribute. We also provide a sample with five instances, so that users can get an overview of the dataset before downloading it in full.
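
The structure described above can be read with a few lines of Python. This is a minimal sketch, not part of the official tooling: the helper names load_instances, image_path, and caption_length_stats are hypothetical, the default directory name "images" follows the description above, and counting words by splitting on whitespace is an assumption that may not reproduce the statistics reported for the dataset exactly.

```python
import json
import statistics
from pathlib import Path


def load_instances(json_path):
    """Read dataset.json, assumed to hold a single JSON list of objects."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def image_path(instance, images_dir="images"):
    """Resolve the image file of one instance via its `filename` attribute."""
    return Path(images_dir) / instance["filename"]


def caption_length_stats(instances):
    """Return (mean, population std) of the clean caption length in words.

    Words are counted by whitespace splitting, which is an assumption.
    """
    lengths = [len(instance["caption"].split()) for instance in instances]
    return statistics.mean(lengths), statistics.pstdev(lengths)
```

For example, after extracting the sample or the full dataset, load_instances("dataset.json") returns the list of instances, and image_path(instances[0]) gives the path of the image that belongs to the first one.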

 

Download Instructions

If you only want an overview of the dataset structure, you can download sample.tar.gz. If you want to use the dataset or any of its subsets (63k, 173k, and 400k), you must download all the files and run the following commands to join and uncompress them:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
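
On systems where cat and tar are not available, the same two steps can be sketched in Python with the standard library. The function names join_parts and extract_archive are hypothetical; the parts are concatenated in lexicographic order of their file names, which is assumed to match the order in which the shell expands images.tar.gz.part*.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(parts_dir=".", prefix="images.tar.gz.part", output="images.tar.gz"):
    """Concatenate all files named `prefix*` into one archive, like `cat`."""
    parts_dir = Path(parts_dir)
    # Lexicographic order, assumed to match the shell glob expansion.
    parts = sorted(parts_dir.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {parts_dir}")
    out_path = parts_dir / output
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, no full read into memory
    return out_path


def extract_archive(archive_path, dest="."):
    """Extract a gzip-compressed tar archive, like `tar -xzf`.

    Only extract archives obtained from a source you trust.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
```

Calling extract_archive(join_parts()) in the download directory is then roughly equivalent to the two shell commands above.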

Alternatively, you can download the entire dataset from the terminal using the Python script download_dataset.py, available in the PraCegoVer repository. To do so, first download the script and create an access token here. Then run the following command to download and uncompress the image files:

python download_dataset.py --access_token=<your access token>

 

Notes

Funding acknowledgements: G.O.S. is funded by the São Paulo Research Foundation (FAPESP) (2019/24041-4). E.L.C. and S.A. are partially funded by H.IAAC (Artificial Intelligence and Cognitive Architectures Hub). S.A. is also partially funded by FAPESP (2013/08293-7), a CNPq PQ-2 grant (315231/2020-3), and Google LARA 2020. The opinions expressed in this work do not necessarily reflect those of the funding agencies.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

#PraCegoVer does not own the copyright of the image files. Also, this dataset consists of data collected from public profiles on Instagram. Thus, the images and raw captions might contain sensitive data that reveal racial or ethnic origins, sexual orientations, and religious beliefs. For researchers who wish to use the image files for non-commercial research and/or educational purposes, we can provide access through the Hub under certain conditions and terms. 


Terms of Access:
    The "Researcher" has requested permission to use the #PraCegoVer dataset (the "Database") at University of Campinas. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:


    1. Researcher shall use the Database only for non-commercial research and educational purposes.   
    2. Researcher accepts full responsibility for his/her use of the Database and shall defend and indemnify the #PraCegoVer team and University of Campinas, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted image files that he or she may create from the Database.
    3. Researcher will not under any circumstances attempt to determine the identity of individuals in the Database.
    4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
    5. The #PraCegoVer team and University of Campinas reserve the right to terminate Researcher's access to the Database at any time.

Access Request:

Please fill out this form with the following information:

    1. Name of the institution where you work.
    2. A brief description of what the dataset will be used for.

Bear in mind that the request must be made from an institutional email account.


Additional details

Related works

Is documented by
Journal article: https://www.mdpi.com/2306-5729/7/2/13 (URL)
Is supplemented by
Software: https://github.com/larocs/PraCegoVer (URL)