Scaled and Translated Image Recognition (STIR)

Thomas Altstidl; An Nguyen; Leo Schwinn; Franz Köferl; Christopher Mutschler; Björn Eskofier; Dario Zanca

doi:10.5281/zenodo.6578038

Published November 15, 2022 | Version 1.0.0

Dataset Open

1. Friedrich-Alexander-Universität Erlangen-Nürnberg
2. Fraunhofer-Institut für Integrierte Schaltungen

Paper: Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks (arXiv:2211.10288)
Code: taltstidl/scale-equivariant-cnn on GitHub, the official code for "Just a Matter of Scale? Reevaluating Scale Equivariance in Convolutional Neural Networks"

While convolutions are known to be equivariant to (discrete) translations, scaling remains a challenge, and most image recognition networks are not invariant to it. To explore these effects, we have created the Scaled and Translated Image Recognition (STIR) dataset. This dataset contains objects of size \(s \in [17,64]\) pixels, each randomly placed in a \(64 \times 64\) pixel image.
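
To make the placement concrete: an object of size \(s\) has \(64 - s + 1\) valid offsets along each axis, so a 17-pixel object can sit at \(48 \times 48 = 2304\) positions, while a 64-pixel object fills the whole image. The following is a minimal sketch of such a placement, not the authors' generation code. The helper name `place_object`, the square object shape and the black background are assumptions made for illustration.

```python
import numpy as np

CANVAS = 64  # image side length in pixels, as stated in the description


def place_object(obj, rng):
    """Paste a square object at a random position on an empty canvas.

    Hypothetical helper for illustration only.
    Assumes obj has shape (s, s) or (s, s, 3) with 17 <= s <= 64.
    """
    size = obj.shape[0]
    if obj.shape[1] != size or not 17 <= size <= CANVAS:
        raise ValueError("expected a square object with side length in [17, 64]")
    # Valid offsets are 0 .. CANVAS - size; the upper bound of integers() is exclusive
    left = int(rng.integers(0, CANVAS - size + 1))
    top = int(rng.integers(0, CANVAS - size + 1))
    canvas = np.zeros((CANVAS, CANVAS) + obj.shape[2:], dtype=obj.dtype)
    canvas[top:top + size, left:left + size] = obj
    return canvas, (left, top)


rng = np.random.default_rng(0)
image, (left, top) = place_object(np.full((17, 17), 255, dtype=np.uint8), rng)
print(image.shape, left, top)
```

The returned `(left, top)` pair uses the same convention as the dataset's position annotations: distance to the left edge first, distance to the top edge second.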

Using the dataset

Depending on which data you plan to use, download one or more of the following files. Data is stored in compressed .npz format and can be loaded with NumPy's `numpy.load`.

File	Description
`emoji.npz`	Emoji vector icons rendered as white icon on black background
`mnist.npz`	Classic MNIST handwritten digits rescaled to varying sizes
`trafficsign.npz`	Traffic signs from street imagery downscaled to varying sizes
`aerial.npz`	Objects in aerial imagery downscaled to varying sizes
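
As a quick check after downloading, the arrays in a file can be listed with plain NumPy. This is a minimal sketch that assumes the arrays are stored as regular (non-object) NumPy arrays; `describe_npz` is a hypothetical helper name, not part of the official code.

```python
import numpy as np


def describe_npz(path):
    """Return a mapping from array name to (shape, dtype) for an .npz file."""
    summary = {}
    # np.load returns a lazy, dictionary-like NpzFile; arrays are read on access
    with np.load(path) as data:
        for key in data.files:
            array = data[key]
            summary[key] = (array.shape, str(array.dtype))
    return summary
```

Calling `describe_npz('emoji.npz')` on a downloaded file should list each stored array together with its shape and data type.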

Each file contains multiple arrays that can be accessed in a dictionary-like fashion. The keys are documented below, where n is the number of classes for a given file and m is the number of instances for each class. Both emoji.npz (36 classes, 1 instance) and mnist.npz (10 classes, 50 instances) are in black & white while trafficsign.npz (16 classes, 25 instances) and aerial.npz (9 classes, 25 instances) are in color.

Key	Shape	Description
`imgs`	`(3, 48, n, m, 64, 64)` black & white, `(3, 48, n, m, 64, 64, 3)` color	Images grouped into 3 sets (training, validation, testing) and 48 different scales. Values will be in range `0` to `255`.
`lbls`	`(3, 48, n, m)`	Indices referencing ground truth labels. See `lbldata` for descriptive names. Values will be in range `0` to `n - 1`.
`scls`	`(3, 48, n, m)`	Known scales as given by bounding box size. Values will be in range `17` to `64`.
`psts`	`(3, 48, n, m, 2)`	Known position of bounding box. First value is distance to left edge, second value distance to top edge.
`metadata`	`(6, 2)`	Metadata on title, description, author, license, version and date.
`lbldata`	`(n,)`	Descriptive name for each ground truth label.
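
The layout above can be indexed directly with NumPy. The sketch below picks one split and one scale, then crops an object using its annotated bounding box. It is an illustration under stated assumptions, not the official loader: the split names `valid` and `test` and both helper names are hypothetical, all instances along one index of the scale axis are assumed to share the same scale, and bounding boxes are assumed to be square.

```python
import numpy as np

# Order of the first axis as documented: training, validation, testing.
# The string names are assumptions made for this sketch.
SPLITS = {"train": 0, "valid": 1, "test": 2}


def select(arrays, split, scale):
    """Return images, labels, scales and positions for one split and one scale.

    `arrays` is any mapping with the keys imgs, lbls, scls and psts,
    for example the object returned by np.load.
    """
    s = SPLITS[split]
    # Look up the scale axis index from the stored scales instead of assuming an offset
    matches = np.flatnonzero(arrays["scls"][s, :, 0, 0] == scale)
    if matches.size == 0:
        raise ValueError(f"scale {scale} not found")
    k = int(matches[0])
    return {key: arrays[key][s, k] for key in ("imgs", "lbls", "scls", "psts")}


def crop_object(image, scale, position):
    """Cut the annotated bounding box out of a single 64 x 64 image."""
    left, top = int(position[0]), int(position[1])  # left edge first, top edge second
    size = int(scale)
    return image[top:top + size, left:left + size]
```

After `subset = select(arrays, 'train', 32)`, the image of class `c` and instance `i` is `subset['imgs'][c, i]`, and `crop_object(subset['imgs'][c, i], subset['scls'][c, i], subset['psts'][c, i])` returns the object region.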

For use in Python, a dataset class is provided that implements basic functionality for loading a given split and selection of scales, as illustrated in the code below. It shuffles in a consistent manner, so the ground truth scales and positions that match the returned images can be retrieved. Metadata and label descriptions are available via `metadata` and `labeldata`, respectively.

from data.dataset import STIRDataset

dataset = STIRDataset('data/emoji.npz')
# Obtain images and labels for training
images, labels = dataset.to_torch(split='train', scales=[32, 64], shuffle=True)
# Obtain known scales and positions for above
scales, positions = dataset.get_latents(split='train', scales=[32, 64], shuffle=True)
# Get metadata and label descriptions
metadata = dataset.metadata
label_descriptions = dataset.labeldata
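
The idea behind consistent shuffling can be sketched independently of the dataset class: if two separate calls use the same seeded permutation, images and their latents stay aligned. This is a generic NumPy illustration and makes no claim about how `STIRDataset` implements it; `shuffle_together` and its `seed` parameter are hypothetical.

```python
import numpy as np


def shuffle_together(arrays, seed=0):
    """Reorder several equally long arrays with one shared, seeded permutation."""
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must have the same length")
    order = np.random.default_rng(seed).permutation(length)
    return [np.asarray(a)[order] for a in arrays]


# Toy data: the "image" at index i has scale 17 + i
images = np.arange(6)
scales = 17 + np.arange(6)

# Two separate calls with the same seed produce the same ordering
shuffled_images, = shuffle_together([images], seed=42)
shuffled_scales, = shuffle_together([scales], seed=42)
assert np.array_equal(shuffled_scales, 17 + shuffled_images)
```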

License and Attribution

When using this dataset for your own research, please respect the individual licenses of the original data. These are distributed within the data files' metadata. For attribution in papers, we recommend the following citations.

D. Gandy, J. Otero, E. Emanuel, F. Botsford, J. Lundien, K. Jackson, M. Wilkerson, R. Madole, J. Raphael, T. Chase, G. Taglialatela, B. Talbot, and T. Chase. Font Awesome. https://fontawesome.com/v5/download, Nov. 2022.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, Nov. 1998.
C. Ertler, J. Mislej, T. Ollmann, L. Porzi, G. Neuhold, and Y. Kuang. The Mapillary Traffic Sign Dataset for Detection and Classification on a Global Scale. In 2020 16th Eur. Conf. Comput. Vision (ECCV), Glasgow, UK, Aug. 2020.
G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In 2018 IEEE/CVF Conf. Comput. Vision and Pattern Recognition (CVPR), pages 3974–3983, Salt Lake City, UT, USA, June 2018.

Files

Files (934.7 MB)

Name	MD5	Size
`aerial.npz`	`b80315d8c3a9dfe44d140fbaaf9fb901`	314.4 MB
`emoji.npz`	`ba9c26a5d506a83c8d339fe1e5bb99c7`	1.4 MB
`mnist.npz`	`733eb17e09acce7ac9cdfca4d2df36df`	44.7 MB
`trafficsign.npz`	`9747d941385e41f2edf19d134e2f8136`	574.2 MB
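
The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the checksums are copied from the file listing.

```python
import hashlib

# Checksums as published in the file listing
EXPECTED_MD5 = {
    "aerial.npz": "b80315d8c3a9dfe44d140fbaaf9fb901",
    "emoji.npz": "ba9c26a5d506a83c8d339fe1e5bb99c7",
    "mnist.npz": "733eb17e09acce7ac9cdfca4d2df36df",
    "trafficsign.npz": "9747d941385e41f2edf19d134e2f8136",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns an empty bytes object at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at `path` matches the published checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download('data/emoji.npz', 'emoji.npz')` should return `True` for an intact download, where the local path is whatever location the file was saved to.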
