
Published November 15, 2022 | Version 1.0.0
Dataset Open

Scaled and Translated Image Recognition (STIR)

  • 1. Friedrich-Alexander-Universität Erlangen-Nürnberg
  • 2. Fraunhofer-Institut für Integrierte Schaltungen

Description

Paper: [2211.10288] Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks (arxiv.org)
Code: taltstidl/scale-equivariant-cnn: Official code for "Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks" (github.com)

While convolutions are known to be equivariant to (discrete) translations, scaling continues to be a challenge, and most image recognition networks are not invariant to it. To explore these effects, we have created the Scaled and Translated Image Recognition (STIR) dataset. This dataset contains objects of size \(s \in [17,64]\) pixels, each randomly placed in a \(64 \times 64\) pixel image.
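To make the construction concrete, the following is a minimal sketch of how a square object of side s could be placed on a 64 x 64 canvas. It is not the authors' generation code: the function name `place_object`, the black background, and the uniform choice of the top-left corner are assumptions made for illustration.

```python
import numpy as np

CANVAS = 64  # image side length stated in the description


def place_object(patch, rng):
    """Paste a square s x s patch at a random position on an empty canvas."""
    s = patch.shape[0]
    if patch.shape[1] != s or not 17 <= s <= CANVAS:
        raise ValueError("patch must be square with side in [17, 64]")
    # Assumption: the top-left corner is drawn uniformly from all positions
    # that keep the object fully inside the image.
    left = int(rng.integers(0, CANVAS - s + 1))
    top = int(rng.integers(0, CANVAS - s + 1))
    canvas = np.zeros((CANVAS, CANVAS) + patch.shape[2:], dtype=patch.dtype)
    canvas[top:top + s, left:left + s] = patch
    return canvas, (left, top)


rng = np.random.default_rng(0)
patch = np.full((32, 32), 255, dtype=np.uint8)  # stand-in for a rendered object
image, (left, top) = place_object(patch, rng)
```

Under these assumptions an object of side s has 65 - s valid offsets along each axis, so an object of size 64 always fills the whole image.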

Using the dataset

Depending on which data you are planning to use, download one or more of the following files. Data is stored in compressed .npz format and can be loaded with NumPy's numpy.load function.
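As a quick check after downloading, the arrays in an archive can be listed with `numpy.load`. The sketch below uses a hypothetical helper `summarize_npz` and a small stand-in archive so that it runs without the real files; for the actual data, pass the path to a downloaded file such as emoji.npz.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {key: shape} for every array stored in an .npz archive."""
    # np.load returns a lazy NpzFile; the context manager closes it again.
    with np.load(path) as archive:
        return {key: archive[key].shape for key in archive.files}


# Stand-in archive (not real STIR data) so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez_compressed(
    demo_path,
    imgs=np.zeros((3, 48, 2, 1, 64, 64), dtype=np.uint8),
    lbls=np.zeros((3, 48, 2, 1), dtype=np.int64),
)
shapes = summarize_npz(demo_path)
```

If an array in a file was saved with object dtype, `numpy.load` additionally requires `allow_pickle=True`; whether that applies to these files is not stated here.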

| File | Description |
| --- | --- |
| emoji.npz | Emoji vector icons rendered as white icon on black background |
| mnist.npz | Classic MNIST handwritten digits rescaled to varying sizes |
| trafficsign.npz | Traffic signs from street imagery downscaled to varying sizes |
| aerial.npz | Objects in aerial imagery downscaled to varying sizes |

Each file contains multiple arrays that can be accessed in a dictionary-like fashion. The keys are documented below, where n is the number of classes for a given file and m is the number of instances for each class. Both emoji.npz (36 classes, 1 instance) and mnist.npz (10 classes, 50 instances) are in black & white while trafficsign.npz (16 classes, 25 instances) and aerial.npz (9 classes, 25 instances) are in color.

| Key | Shape | Description |
| --- | --- | --- |
| imgs | (3, 48, n, m, 64, 64) black & white, (3, 48, n, m, 64, 64, 3) color | Images grouped into 3 sets (training, validation, testing) and 48 different scales. Values will be in range 0 to 255. |
| lbls | (3, 48, n, m) | Indices referencing ground truth labels. See lbldata for descriptive names. Values will be in range 0 to n - 1. |
| scls | (3, 48, n, m) | Known scales as given by bounding box size. Values will be in range 17 to 64. |
| psts | (3, 48, n, m, 2) | Known position of bounding box. First value is distance to left edge, second value is distance to top edge. |
| metadata | (6, 2) | Metadata on title, description, author, license, version and date. |
| lbldata | (n,) | Descriptive names for each ground truth label. |
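The leading axes can be indexed directly to pick one split and one scale. The sketch below is illustrative only: the helper `select`, the split names `'valid'` and `'test'`, and the mapping from scale to index (integer scales 17 to 64 stored in ascending order, which matches 64 - 17 + 1 = 48) are assumptions rather than documented behavior. The demo arrays are synthetic stand-ins.

```python
import numpy as np

# Split order follows the table: training, validation, testing.
SPLITS = {"train": 0, "valid": 1, "test": 2}
MIN_SCALE, MAX_SCALE = 17, 64  # 64 - 17 + 1 = 48 scales


def select(array, split, scale):
    """Pick one split and scale, then merge the class and instance axes."""
    if not MIN_SCALE <= scale <= MAX_SCALE:
        raise ValueError("scale must lie in [17, 64]")
    # Assumption: index 0 on the scale axis corresponds to scale 17.
    block = array[SPLITS[split], scale - MIN_SCALE]  # shape (n, m, ...)
    n, m = block.shape[:2]
    return block.reshape((n * m,) + block.shape[2:])


# Synthetic stand-ins with n = 4 classes and m = 2 instances.
imgs = np.zeros((3, 48, 4, 2, 64, 64), dtype=np.uint8)
scls = np.broadcast_to(np.arange(17, 65).reshape(1, 48, 1, 1), (3, 48, 4, 2))

train_images = select(imgs, "train", 17)  # shape (8, 64, 64)
test_scales = select(scls, "test", 40)    # eight entries, all equal to 40
```

With real data, comparing `select(scls, split, s)` against `s` is a cheap way to confirm that the assumed scale ordering holds.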

For use in Python, a dataset class is provided that implements the basic functionality for loading a certain split and scale selection, as illustrated in the code below. It ensures shuffling is done in a consistent manner such that ground truth scales and positions can be retrieved. Metadata and label descriptions can be retrieved via the metadata and labeldata attributes, respectively.

from data.dataset import STIRDataset

dataset = STIRDataset('data/emoji.npz')
# Obtain images and labels for training
images, labels = dataset.to_torch(split='train', scales=[32, 64], shuffle=True)
# Obtain known scales and positions for above
scales, positions = dataset.get_latents(split='train', scales=[32, 64], shuffle=True)
# Get metadata and label descriptions
metadata = dataset.metadata
label_descriptions = dataset.labeldata
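Once scales and positions are available, they can be used to cut each object out of its image. The following is a sketch under stated assumptions: `crop_object` is a hypothetical helper, the bounding box is taken to be a square whose side equals the scale, and the position is read as (distance to left edge, distance to top edge) of its top-left corner. The demo image is synthetic.

```python
import numpy as np


def crop_object(image, scale, position):
    """Cut the object's bounding box out of a single 64 x 64 image."""
    # Position order follows the psts description: left first, then top.
    left, top = int(position[0]), int(position[1])
    size = int(scale)
    # Assumption: square box of side `scale` with top-left corner (left, top).
    return image[top:top + size, left:left + size]


# Synthetic image: a 20 x 20 white square, 5 px from the left, 30 px from the top.
image = np.zeros((64, 64), dtype=np.uint8)
image[30:50, 5:25] = 255

crop = crop_object(image, scale=20, position=(5, 30))
```

Because the helper only uses slicing on the first two axes, it works unchanged for color images with a trailing channel axis.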

License and Attribution

When using this dataset for your own research, please respect the individual licenses of the original data. These are distributed within the data files' metadata. For attribution in papers, we recommend the following citations.

  1. D. Gandy, J. Otero, E. Emanuel, F. Botsford, J. Lundien, K. Jackson, M. Wilkerson, R. Madole, J. Raphael, T. Chase, G. Taglialatela, B. Talbot, and T. Chase. Font Awesome. https://fontawesome.com/v5/download, Nov. 2022.
  2. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
  3. C. Ertler, J. Mislej, T. Ollmann, L. Porzi, G. Neuhold, and Y. Kuang. The Mapillary Traffic Sign Dataset for Detection and Classification on a Global Scale. In 2020 16th Eur. Conf. Comput. Vision (ECCV), Glasgow, UK, Aug. 2020.
  4. G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In 2018 IEEE/CVF Conf. Comput. Vision and Pattern Recognition (CVPR), pages 3974–3983, Salt Lake City, UT, USA, June 2018.
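Since the license of each source is distributed in the data file's metadata, it can be convenient to turn that array into a dictionary. The sketch below assumes that each of the rows holds a (field name, value) pair, which is suggested by the (6, 2) shape but not confirmed here; `metadata_to_dict` is a hypothetical helper and the demo values are placeholders, not the real metadata.

```python
import numpy as np


def metadata_to_dict(metadata):
    """Turn a (k, 2) array of (field, value) rows into a dictionary."""
    metadata = np.asarray(metadata)
    if metadata.ndim != 2 or metadata.shape[1] != 2:
        raise ValueError("expected an array of shape (k, 2)")
    # Assumption: column 0 holds the field name, column 1 its value.
    return {str(field): str(value) for field, value in metadata}


# Placeholder rows standing in for the array stored under the 'metadata' key.
demo_metadata = np.array([
    ["title", "Example title"],
    ["license", "Example license"],
])
info = metadata_to_dict(demo_metadata)
license_text = info["license"]
```

With the provided dataset class, the same helper could be applied to `dataset.metadata`, assuming that attribute returns the stored array unchanged.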

Files

Files (934.7 MB)

| Checksum | Size |
| --- | --- |
| md5:b80315d8c3a9dfe44d140fbaaf9fb901 | 314.4 MB |
| md5:ba9c26a5d506a83c8d339fe1e5bb99c7 | 1.4 MB |
| md5:733eb17e09acce7ac9cdfca4d2df36df | 44.7 MB |
| md5:9747d941385e41f2edf19d134e2f8136 | 574.2 MB |
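A downloaded file can be checked against its listed MD5 checksum with Python's standard `hashlib` module. The helpers `md5_of_file` and `matches_checksum` below are illustrative names; the checksum passed in must be the one listed for the file in question on the record page. The demo uses a small temporary file rather than real data.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a temporary file with known content.
demo_file = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_file, "wb") as handle:
    handle.write(b"hello")
ok = matches_checksum(demo_file, "md5:" + hashlib.md5(b"hello").hexdigest())
```

For a real download, call `matches_checksum` with the local path and the checksum string copied from the listing above.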