Novel benchmark which features aspects of natural scenes, e.g. a complex 3D object and different lighting conditions, while still providing access to the continuous ground-truth factors.
We use the Blender rendering engine to create visually complex 3D images. Each image in the dataset shows a colored 3D object which is located and rotated above a colored ground in a 3D space. Additionally, each scene contains a colored spotlight which is focused on the object and located on a half-circle around the scene. The observations are encoded with an RGB color space, and the spatial resolution is 224x224 pixels.
The images are rendered based on a 10-dimensional latent, where: (1) three dimensions describe the XYZ position, (2) three dimensions describe the rotation of the object in Euler angles, (3) two dimensions describe the color of the object and the ground of the scene, respectively, and (4) two dimensions describe the position and color of the spotlight. We use the HSV color space to describe the color of the object and the ground with only one latent each by having the latent factor control the hue value.
The training set and test set contain 250,000 and 25,000 observation-latent pairs, respectively, whereby the latents are uniformly sampled from the unit hyperrectangle.