How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots
Description
Supplementary Materials
By mimicking how insects peer side-to-side, robots can now see through clutter in real time.
Abstract: In vegetated environments such as forests, exploratory robots play a vital role in navigating complex, cluttered terrain where human access is limited and traditional equipment struggles. Visual occlusion from obstacles such as foliage can severely obstruct a robot’s sensors and impair scene understanding. We show that “peering”, a characteristic side-to-side movement used by insects to overcome their visual limitations, also allows robots to markedly improve visual reasoning under partial occlusion. This is accomplished by applying core signal processing principles, specifically optical synthetic aperture sensing, together with the vision reasoning capabilities of modern large multimodal models. Peering enables real-time, high-resolution, and wavelength-independent perception, which is crucial for vision-based scene understanding across a wide range of applications. The approach is low-cost and immediately deployable on any camera-equipped robot. We investigated different peering motions and occlusion masking strategies, demonstrating that, unlike peering, state-of-the-art multi-view 3D vision techniques fail under these conditions due to their high susceptibility to occlusion. Our experiments were carried out on an industrial-grade quadrupedal robot; however, the ability to peer is not limited to such platforms and is potentially also applicable to bipedal, hexapod, wheeled, or crawling platforms. Robots that can effectively see through partial occlusion will gain superior perception abilities, including enhanced scene understanding, situational awareness, camouflage breaking, and advanced navigation in complex environments.
Supplementary Movies:
Movie S1. Animal and robotic peering enhance scene understanding under limited vision. This movie supplements Fig. 1, illustrating dynamic peering motions from locusts, the Florida bush cricket, and ANYbotics' ANYmal robot. It also shows the corresponding images from the robot's perspective during peering, along with the manual focusing process within the synthetic aperture (SA) integral images (here, using a planar synthetic focal surface). Finally, it presents visual reasoning results from a large multimodal model (specifically, ChatGPT-5.0) for both indoor RGB and outdoor near-infrared (NIR) recordings.
Movie S2. Various robotic peering motions. This movie demonstrates the peering motions implemented on a quadruped robot, including horizontal rotation, horizontal shift, and diagonal shift. These movements are constrained by the robot's equilibrium limits (to avoid tipping over). For the ANYbotics ANYmal, this yields a maximum vertical SA of approximately 20 cm and a horizontal SA of 30 cm in our experiments.
Supplementary Data:
Data S1. Data for Figure 1. This dataset contains the images, poses, and parameters for the RGB and NIR recordings used for Fig. 1. Use Software S1 to load this data (copy images and poses folders as well as parameters.txt to the main folder) and compute the SA integral images (a conceptual sketch of the integration step follows this data list).
Data S2. Data for Figure 5. This dataset contains the images, poses, and parameters for the rotation, horizontal shift, and diagonal shift peering motions used for Fig. 5. Use Software S1 to load this data (copy images and poses folders as well as parameters.txt to the main folder) and compute the SA integral images.
Data S3. Data for Figure 6. This dataset contains the images, poses, and parameters for the reasoning examples used for Fig. 6. Use Software S1 to load this data (copy images and poses folders as well as parameters.txt to the main folder) and compute the SA integral images.
Data S4. Data for Figures 7 and S1. This dataset provides the non-binarized occlusion masks, images, poses, and parameters for the RGB recordings from Fig. 7 and the NIR recordings from Fig. S1. Use Software S1 to load this data (a minimal setup sketch follows this item):
1.) Copy parameters.txt and the poses folder to the main folder.
2.) Select an occlusion mask model and copy the corresponding masks folder to the main folder.
3.) Copy either the images_1008 or the images_504 folder (check the model's resolution support in Figs. 7 and S1: either 1008×1008 px or 504×504 px) to the main folder and rename it to images.
4.) Compute the SA integral images and change the occlusion mask threshold under the Displays menu.
Note that if no masks folder is located in the main folder of the software, VDVI values can be computed online. If a masks folder is available, the mask values are loaded from the contained .tiff files. Make sure that the mask resolution (.tiff files in the masks folder) matches the image resolution (.png files in the images folder) by selecting the images from the correct folder (either images_1008 or images_504). In either case, thresholding is done with the occlusion mask parameters of the software (see Software S1).
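For convenience, the four setup steps above can be scripted. The following is a hypothetical helper sketch, not part of the released software: the function name and the mask-model folder argument are placeholders, and only parameters.txt, poses, masks, images_1008, images_504, and images are names taken from the description above.

```python
# Hypothetical helper for the Data S4 setup steps; folder names for the
# occlusion-mask models are placeholders and may differ in the dataset.

import shutil
from pathlib import Path


def setup_data_s4(dataset_dir, sai_dir, mask_model_folder=None, resolution=1008):
    data, sai = Path(dataset_dir), Path(sai_dir)

    # 1.) parameters.txt and the poses folder go into the SAI main folder.
    shutil.copy(data / "parameters.txt", sai / "parameters.txt")
    shutil.copytree(data / "poses", sai / "poses", dirs_exist_ok=True)

    # 2.) the masks folder of the chosen occlusion-mask model (if omitted,
    #     the software computes VDVI values online instead).
    if mask_model_folder is not None:
        shutil.copytree(data / mask_model_folder, sai / "masks",
                        dirs_exist_ok=True)

    # 3.) images_1008 or images_504, renamed to "images"; the resolution must
    #     match what the chosen mask model supports (see Figs. 7 and S1).
    shutil.copytree(data / f"images_{resolution}", sai / "images",
                    dirs_exist_ok=True)

    # 4.) Then run SAI.exe, compute the SA integral images, and adjust the
    #     occlusion mask threshold under the Displays menu.
```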
Data S5. Data for Figure S2. This dataset contains the 10 conventional images and 10 focal stack layers used for Fig. S2. Use Software S1 to generate focal stacks with the Focal Stack menu. Layers are stored in the integrals folder.
Data S6. Data for Figure S4. This dataset contains the images, poses, and parameters for the camouflage breaking examples used for Fig. S4. Use Software S1 to load this data (copy images and poses folders as well as parameters.txt to the main folder) and compute the SA integral images.
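All of the datasets above are processed in the same way: Software S1 registers the peering images to a chosen synthetic focal surface and averages them into an SA integral image. The following is a minimal conceptual sketch of that integration step under simplifying assumptions (a planar focal surface, hypothetical pinhole intrinsics K, per-image poses given as world-to-camera rotations and translations, and optional per-image occlusion masks); it is not the Software S1 CUDA implementation.

```python
# Conceptual sketch of synthetic-aperture (SA) integration: each peering image
# is re-projected onto a planar synthetic focal surface at depth z_focus in the
# reference (virtual camera) frame and the re-projections are averaged.

import numpy as np
import cv2


def sa_integral(image_files, poses, K, z_focus, out_size=(504, 504), masks=None):
    h, w = out_size
    # Back-project the output pixel grid to 3D points on the focal plane.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    points = (np.linalg.inv(K) @ pix) * z_focus                        # 3 x N

    accum = np.zeros((h, w, 3), np.float64)
    weight = np.zeros((h, w, 1), np.float64)

    for i, (fname, (R, t)) in enumerate(zip(image_files, poses)):
        img = cv2.imread(fname).astype(np.float32)
        # Project the focal-plane points into camera i and sample its image.
        cam = R @ points + t.reshape(3, 1)
        proj = K @ cam
        uv = (proj[:2] / proj[2]).T.reshape(h, w, 2).astype(np.float32)
        warped = cv2.remap(img, uv[..., 0], uv[..., 1], cv2.INTER_LINEAR)
        # Optional per-image occlusion/alpha mask (1 = keep, 0 = discard),
        # warped the same way as the image.
        if masks is not None:
            a = cv2.remap(masks[i].astype(np.float32), uv[..., 0], uv[..., 1],
                          cv2.INTER_LINEAR)[..., None]
        else:
            a = np.ones((h, w, 1))
        accum += warped * a
        weight += a

    # Normalize by the per-pixel sum of alphas to obtain the SA integral image.
    return (accum / np.maximum(weight, 1e-6)).astype(np.uint8)
```

Averaging many registered viewpoints keeps points on the focal surface sharp while spreading out off-surface occluders such as foliage, which is what makes partially occluded scenes interpretable again. A focal stack (Data S5) is simply a series of such integrals computed for different z_focus values.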
Supplementary Software:
Software S1. Synthetic Aperture Integrator. This CUDA implementation was used to compute the SA integral images from a given set of conventional input images and the corresponding recording poses. It allows adjusting the focal surface interactively and supports occlusion masking based on the Visible Difference Vegetation Index (78). It requires Microsoft Windows and a reasonably fast GPU. Please note that the following instructions describe only the functionality for viewing the supplementary data. For more detailed information, please contact the corresponding author.

To begin, choose a supplementary dataset and copy the images and poses folders, along with the parameters.txt file, into the main SAI directory. By default, the dense scene shown in Fig. S4 is preinstalled. Then run SAI.exe. Please be aware that loading may take some time, depending on the number of images.

Navigate the Virtual Camera using the mouse wheel and left mouse button. Within the menu, you can adjust the Focal Surface parameters. The surface can be shifted into the scene using the z-parameter, where a positive value moves it forward. Additionally, it can be translated (TX, TY), rotated (RX, RY, RZ), and scaled (SX, SY, SZ). Note that the focal surface is based on a unit half-sphere: selecting large values for SX and SY makes the surface approximate a flat plane, while a positive SZ value scales it in the +z direction. Besides planes, cylindrical and spherical focal surfaces can also be created by setting appropriate values for SX, SY, and SZ. The grid flag toggles the focal surface grid visualization on or off; the center of this grid indicates the current focus point.

To view the individual captured images, select the pinhole aperture option in the Virtual Camera tab and use the Jump to +/- buttons to browse the sequence. To return to the synthetic aperture integration view, simply click the open aperture option.

The occlusion mask parameters (T, UB, LB) threshold the occlusion mask values (ranging from -1 to 1). Within this range, high values identify occluder pixels, while low values identify non-occluder pixels; the threshold T serves as the central value for this separation. Pixels recognized as occluders are assigned low alpha values, potentially becoming fully transparent (0), whereas non-occluders are assigned high alpha values, potentially becoming fully opaque (1). The LB and UB parameters define lower and upper bounds around T to create a smooth transition: a pixel value below LB is assigned an alpha of 1, a value above UB is assigned an alpha of 0, and values between LB and UB are linearly interpolated between 1 and 0. This process is applied to each image individually, resulting in a unique alpha mask for each one. By default, these occlusion mask values are computed as the Visible Difference Vegetation Index (VDVI), where a high value indicates vegetation and a low value indicates non-vegetation. VDVI thresholds are relatively low (e.g., T = 0.025-0.115).
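The masking behavior described above amounts to a small amount of per-pixel arithmetic. The following is a minimal sketch, not the SAI.exe implementation: it assumes the commonly used VDVI definition (2G - R - B) / (2G + R + B) and treats LB and UB as absolute bounds below and above the central threshold T; the example bounds in the comment are illustrative only.

```python
# Minimal sketch of the VDVI-based occlusion masking described above
# (not the SAI.exe implementation).

import numpy as np


def vdvi(rgb):
    """Visible Difference Vegetation Index in [-1, 1] for a float RGB image
    (channels last, values in [0, 1]); high values indicate vegetation."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (2.0 * g - r - b) / np.maximum(2.0 * g + r + b, 1e-6)


def occlusion_alpha(mask_values, LB, UB):
    """Per-pixel alpha from occlusion-mask values: 1 below LB (non-occluders),
    0 above UB (occluders), linearly interpolated in between."""
    alpha = (UB - mask_values) / max(UB - LB, 1e-6)
    return np.clip(alpha, 0.0, 1.0)


# Example with illustrative bounds only: high-VDVI foliage pixels become
# transparent before the images are integrated.
# alpha = occlusion_alpha(vdvi(image / 255.0), LB=0.0, UB=0.05)
```

Because the alpha mask is computed for each input image individually, foliage that occludes the target from one viewpoint can still be contributed by another viewpoint during integration.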
Files (8.1 GB)

| MD5 checksum | Size |
|---|---|
| 4d58a69d418246bb982f23cf3fc77027 | 710.6 MB |
| f43d67ae89c2fbddf9545500da021c63 | 136.0 MB |
| f869211d9e66cb47b8e9815f813f5ca9 | 594.7 MB |
| 76459af0a972b783526959a8de946c2a | 6.0 GB |
| 173fccdb05b1a1d7e5f952b8e4170e7f | 38.6 MB |
| fb4f189166aba0747745c9a99af8bfbb | 37.0 MB |
| 4b5e9400076810178bff367bc90dd2d2 | 361.6 MB |
| 7d4d4cae225908c040acc091acfba380 | 218.1 MB |
Additional details
Funding
- FWF Austrian Science Fund: Wide Synthetic Aperture Sampling for Motion Classification (I 6046-N)
- FWF Austrian Science Fund: Wide Synthetic Aperture Sampling (P 32185-NBL)
- European Union: Forest Robotic Monitoring and Automation (FORMA), EFRE1079