Tauray: A Scalable Real-Time Open-Source Path Tracer for Stereo and Light Field Displays

Light field displays represent yet another step in the continual increase of display pixel counts. Rendering realistic real-time 3D content for them with ray tracing-based methods is a major challenge even with recent hardware acceleration features, as renderers have to scale to tens or hundreds of distinct viewpoints. To this end, we contribute an open-source, cross-platform real-time 3D renderer called Tauray. The primary focus of Tauray is on using photorealistic path tracing techniques to generate real-time content for multi-view displays, such as VR headsets and light field displays; this aspect is generally overlooked in existing renderers. Both ray tracing hardware acceleration and multi-GPU rendering are supported. We compare Tauray to other open-source real-time path tracers, such as Lighthouse 2, and show that it can meet or significantly exceed their performance.


INTRODUCTION
We introduce an open-source, cross-platform real-time 3D rendering tool called Tauray. Tauray focuses on using ray tracing techniques to generate real-time content for multi-view displays, such as VR headsets and light field displays. It includes multiple rendering modes, such as forward path tracing and DDISH-GI [Ikkala et al. 2021], and supports using multiple GPUs.
Even though some existing renderers do support rendering on multiple GPUs, their focus tends to be on increasing total throughput, leveraging alternate frame rendering or high-spp dynamically scheduled tiled rendering. Tauray's multi-GPU support instead aims to minimize latency and only splits the workload in ways that also benefit low-spp rendering and do not introduce latency beyond the necessary memory transfers. Figure 1 shows Tauray rendering a scene in real time on a light field display (Looking Glass Portrait). The left photograph pair shows 1 spp path tracing with 5 light bounces denoised with SVGF, running at around 50 ms per frame. The right-side pair shows DDISH-GI with an 8 × 8 × 8 probe volume and 256 8-bounce rays per probe, running at around 15 ms per frame. Both sides render 64 different views at 512 × 683.
While there are already several open-source rendering tools available (a comparison table is included in the supplementary material), the novelty of Tauray is in its unique combination of the following features:
• Open-source code (https://github.com/vga-group/tauray).
• Hardware-accelerated path tracing.
• Multi-GPU support for real-time rendering.
• Real-time light field & virtual reality (VR) display output.

IMPLEMENTATION
Tauray uses the Vulkan API for rendering. Many Vulkan extensions are used in Tauray, but none of them are vendor-specific. These choices keep Tauray from being locked to a single GPU vendor or operating system. Tauray runs on both Linux and Windows, though multi-GPU support is limited to Linux. Tauray records command buffers and assigns descriptor sets only at the start of the program or when models are dynamically added to or removed from the scene; this reduces CPU overhead and gives the GPU drivers more opportunity for optimization. This approach is not optimal for rasterization-based rendering, because it prevents per-frame culling of draw calls. Since ray tracing-based methods do not issue draw calls for individual scene objects but rather one command to start tracing rays, there is no need to modify the contents of command buffers each frame.
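As a rough illustration of this pattern, the sketch below records a ray tracing pass once and leaves it valid across frames. It is a hedged example, not an excerpt of Tauray: the pipeline, descriptor set, and shader binding table regions are assumed to exist already, and per-frame data (camera, lights) is assumed to live in buffers referenced by the descriptor set so that the recorded commands never need to change.

```cpp
#include <vulkan/vulkan.h>

// "Record once, submit every frame" for a ray tracing pass. Assumes the
// VK_KHR_ray_tracing_pipeline entry points are loaded (e.g. via volk).
void record_trace_commands(
    VkCommandBuffer cmd, VkPipeline rt_pipeline, VkPipelineLayout layout,
    VkDescriptorSet scene_set,
    const VkStridedDeviceAddressRegionKHR* raygen,
    const VkStridedDeviceAddressRegionKHR* miss,
    const VkStridedDeviceAddressRegionKHR* hit,
    const VkStridedDeviceAddressRegionKHR* callable,
    uint32_t width, uint32_t height, uint32_t view_count)
{
    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    vkBeginCommandBuffer(cmd, &begin);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_RAY_TRACING_KHR, rt_pipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_RAY_TRACING_KHR,
        layout, 0, 1, &scene_set, 0, nullptr);
    // A single launch covers every pixel of every view, so the command buffer
    // contents do not depend on which objects are visible this frame.
    vkCmdTraceRaysKHR(cmd, raygen, miss, hit, callable, width, height, view_count);
    vkEndCommandBuffer(cmd);
}
// Each frame, the prerecorded command buffer is simply resubmitted with
// vkQueueSubmit; re-recording is only needed when scene objects are added or removed.
```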
Scenes are specified using Khronos' glTF 2.0 format. Keeping shading complexity low is crucial for ray tracing performance [Dunn 2019], so we keep our material model simple by limiting it to the core metallic-roughness workflow in glTF, plus transmission for transparent objects. Specifically, Tauray currently uses the isotropic GGX/Trowbridge-Reitz BSDF [Trowbridge and Reitz 1975; Walter et al. 2007] and a Lambertian diffuse BRDF. Additionally, spherical lights with a radius and directional lights with a non-zero angular diameter are supported through a custom plugin that extends Blender's glTF exporter.
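For reference, the two lobes named above reduce to the following textbook expressions; this is a minimal sketch with our own naming, not Tauray's shader code, and alpha is assumed to follow the common convention of perceptual roughness squared.

```cpp
#include <cmath>

constexpr float PI = 3.14159265358979f;

// Isotropic GGX/Trowbridge-Reitz normal distribution D(h).
// n_dot_h is the cosine between the surface normal and the half vector;
// alpha is the roughness parameter (commonly perceptual roughness squared).
float ggx_ndf(float n_dot_h, float alpha)
{
    float a2 = alpha * alpha;
    float d = n_dot_h * n_dot_h * (a2 - 1.0f) + 1.0f;
    return a2 / (PI * d * d);
}

// Lambertian diffuse BRDF: a constant albedo / pi per color channel.
float lambertian_brdf(float albedo)
{
    return albedo / PI;
}
```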

Rendering modes
While Tauray supports many rendering modes for debugging, dataset generation and comparison purposes, its primary focus is on methods aiming for photorealism: a forward path tracer and DDISH-GI [Ikkala et al. 2021] are available. Both methods use the cross-vendor Vulkan extensions VK_KHR_ray_tracing_pipeline and VK_KHR_acceleration_structure for ray tracing.
The forward path tracer supports Next Event Estimation, hash-based Owen scrambling [Burley 2020] and Russian roulette sampling [Arvo and Kirk 1990; Kahn 1955]. SVGF [Schied et al. 2017] and BMFR [Koskela et al. 2019] are available for real-time denoising. Box and Blackman-Harris filters are available for primary ray sampling in order to achieve anti-aliasing. Temporal Anti-Aliasing [Karis 2014] is also supported.
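The following is a hedged sketch of how next event estimation and throughput-based Russian roulette typically fit into such a path tracing loop. The Ray, Hit, Rng and sampling helpers are hypothetical stand-ins rather than Tauray's API, and multiple importance sampling weights and emissive surface hits are omitted for brevity.

```cpp
#include <glm/glm.hpp>
#include <algorithm>

glm::vec3 trace_path(Ray ray, Rng& rng, int max_bounces)
{
    glm::vec3 radiance(0.0f);
    glm::vec3 throughput(1.0f);
    for (int bounce = 0; bounce < max_bounces; ++bounce)
    {
        Hit hit;
        if (!trace(ray, hit))
        {
            radiance += throughput * environment(ray);
            break;
        }

        // Next event estimation: sample a light directly from the hit point.
        LightSample ls = sample_light(hit, rng);
        if (!occluded(hit.position, ls.position))
            radiance += throughput * eval_bsdf(hit, ls.direction) * ls.radiance / ls.pdf;

        // Continue the path with a BSDF sample.
        BsdfSample bs = sample_bsdf(hit, rng);
        throughput *= bs.value / bs.pdf;
        ray = Ray{hit.position, bs.direction};

        // Russian roulette: kill unpromising paths probabilistically and divide
        // the survivors' throughput by the continuation probability to stay unbiased.
        float p = std::min(1.0f, std::max({throughput.x, throughput.y, throughput.z}));
        if (rng.next() >= p) break;
        throughput /= p;
    }
    return radiance;
}
```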
Figure 2: Diagram of the cross-GPU memory transfer as implemented in Tauray. Resources tied to the secondary GPU are marked with green, while the resources of the primary GPU are in purple. Host process resources are in red. When more than two GPUs are used, each non-primary GPU forms a similar pair with the primary GPU.

The DDISH-GI renderer supports locally rendered or streamed spherical harmonics probes. Both the client-side renderer and the probe server are available in Tauray. The streaming mode uses ZeroMQ [Hintjens 2013] and is resilient to poor bandwidth, high network latency and unstable connections. This method is well suited for light field rendering, because the probes can be reused for all views. The ray tracing workload is independent of the number of views and pixels, and multi-view rendering scales practically as well as plain rasterization does.
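As an illustration of the probe streaming pattern (not Tauray's actual wire format; the port, address and data layout are placeholders), a publish/subscribe setup in which the client keeps only the newest probe update can be sketched with ZeroMQ's C++ bindings as follows:

```cpp
#include <zmq.hpp>
#include <cstdint>
#include <vector>

// Probe server side: publish the latest spherical harmonics probe volume.
// A non-blocking send drops updates instead of stalling rendering when the
// link is saturated.
void publish_probes(zmq::socket_t& pub, const std::vector<uint8_t>& sh_data)
{
    pub.send(zmq::buffer(sh_data), zmq::send_flags::dontwait);
}

// Client side: poll for a new probe update without blocking the render loop;
// returns true and fills `out` if an update arrived.
bool receive_probes(zmq::socket_t& sub, std::vector<uint8_t>& out)
{
    zmq::message_t update;
    if (!sub.recv(update, zmq::recv_flags::dontwait))
        return false;
    const uint8_t* p = static_cast<const uint8_t*>(update.data());
    out.assign(p, p + update.size());
    return true;
}

// Socket setup: the server binds a PUB socket, the client connects a SUB
// socket with ZMQ_CONFLATE so only the newest update is kept when the
// connection is slow or unstable (addresses are placeholders):
//   zmq::context_t ctx(1);
//   zmq::socket_t pub(ctx, zmq::socket_type::pub);  pub.bind("tcp://*:5555");
//   zmq::socket_t sub(ctx, zmq::socket_type::sub);
//   sub.set(zmq::sockopt::conflate, 1);
//   sub.set(zmq::sockopt::subscribe, "");
//   sub.connect("tcp://probe-server:5555");
```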

Multi-GPU rendering
Multi-GPU rendering is implemented so that devices from different Vulkan device groups can cooperate. Since there is currently no DMA extension with cross-vendor support, we pass GPU-to-GPU memory transfers through host memory. This memory transfer is done in a way that avoids synchronizing the host process with the GPUs, as shown in Figure 2. We exploit two Vulkan extensions that are typically used for cross-API interoperation: VK_EXT_external_memory_host and VK_KHR_external_semaphore.
We use the external memory extension to create buffers backed by the same host-provided memory on both GPUs taking part in the memory transfer. We then issue commands to write the transferred data from the sender GPU to this host buffer. Once that transfer has finished, we can issue a read command on the corresponding buffer on the other GPU. Because regular Vulkan semaphores do not work across devices, external semaphores are used to synchronize the read with the preceding write on the other GPU. While the OS and GPU driver are most likely involved on the CPU side, the host process (Tauray) itself does not need to synchronize with the GPUs for this memory transfer.
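A hedged sketch of these two pieces on Linux follows; it is one possible realization, not an excerpt of Tauray. The staging buffer is assumed to have been created with the host-allocation external handle type, the semaphores with the matching export/import capability, and the extension entry points to be loaded (e.g. via volk).

```cpp
#include <vulkan/vulkan.h>

// 1) Bind a staging buffer on `device` to a page-aligned host allocation
//    imported via VK_EXT_external_memory_host; doing this on both devices
//    makes them share the same staging memory.
VkDeviceMemory import_host_staging(
    VkDevice device, VkBuffer staging_buffer, void* host_ptr,
    VkDeviceSize transfer_size, uint32_t host_visible_type_index)
{
    VkImportMemoryHostPointerInfoEXT import_info{
        VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT};
    import_info.handleType =
        VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    import_info.pHostPointer = host_ptr;  // aligned to minImportedHostPointerAlignment

    VkMemoryAllocateInfo alloc_info{VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO};
    alloc_info.pNext = &import_info;
    alloc_info.allocationSize = transfer_size;
    alloc_info.memoryTypeIndex = host_visible_type_index;  // queried per device

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &alloc_info, nullptr, &memory);
    vkBindBufferMemory(device, staging_buffer, memory, 0);
    return memory;
}

// 2) Share a semaphore across devices via opaque fd handles (Linux): export it
//    from the sender, import it on the receiver. Submissions on the receiver
//    can then wait on the sender's signal without host-side synchronization
//    by the application.
void share_semaphore(
    VkDevice sender_device, VkSemaphore sender_semaphore,
    VkDevice receiver_device, VkSemaphore receiver_semaphore)
{
    VkSemaphoreGetFdInfoKHR get_fd{VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR};
    get_fd.semaphore = sender_semaphore;
    get_fd.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT;
    int fd = -1;
    vkGetSemaphoreFdKHR(sender_device, &get_fd, &fd);

    VkImportSemaphoreFdInfoKHR import_fd{
        VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR};
    import_fd.semaphore = receiver_semaphore;
    import_fd.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT;
    import_fd.fd = fd;
    vkImportSemaphoreFdKHR(receiver_device, &import_fd);
}
```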
Figure 3: Typical multi-view, multi-GPU rendering pipeline diagram of Tauray; the details can change depending on parameters. The "G-Buffer" needed by many post-processing steps (like denoising) is rasterized on the primary output GPU in Tauray whenever possible. In our case, rendering the G-Buffer separately on the primary GPU is generally marginally faster than distributing it to multiple GPUs.

The ray tracing workload of each view is split between every GPU taking part in rendering. Tauray provides a way to program new workload splitting methods fairly easily; a scanline-based approach is included as an example, which is adequate when the GPUs have matching performance. Due to our low-latency real-time aim, splitting the workload by alternate frame rendering (AFR) is not considered, as it does not reduce latency below that of a single GPU [Monfort and Grossman 2009]. Certain short tasks, such as scene data refreshes (typically on the order of 0.1 ms in total) and acceleration structure updates, are duplicated on each GPU. This is done when transferring their results would incur greater latency or when there is no guarantee of data compatibility between the GPUs.
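For instance, a scanline split of the kind mentioned above could be expressed as follows. This is a hypothetical sketch with per-device weights, not Tauray's actual interface; equal weights give the even split that suits GPUs with matching performance.

```cpp
#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

struct ScanlineRange { uint32_t first; uint32_t count; };

// Split `height` image rows between devices in proportion to their weights.
std::vector<ScanlineRange> split_scanlines(
    uint32_t height, const std::vector<float>& weights)
{
    float total = std::accumulate(weights.begin(), weights.end(), 0.0f);
    std::vector<ScanlineRange> ranges;
    uint32_t start = 0;
    for (size_t i = 0; i < weights.size(); ++i)
    {
        // The last device takes the remainder so every row is covered exactly once.
        uint32_t count = (i + 1 == weights.size())
            ? height - start
            : static_cast<uint32_t>(std::round(height * weights[i] / total));
        ranges.push_back({start, count});
        start += count;
    }
    return ranges;
}
```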

Stereo & light field rendering
Both stereo and light field rendering use the same rendering architecture for multi-view rendering. Views are stored in an image array instead of separate images. This lets all viewport-related rendering stages of Tauray operate on multiple views in a single pass, which minimizes the overhead involved with launching and synchronizing shaders. Rasterization-based render passes are accelerated for multi-view rendering using the VK_KHR_multiview extension. Compute and ray tracing render passes operate on all views in one pass. In an example 128-view case, doing this allows the path tracing renderer to roughly halve the total frametime and go from about 90% GPU utilization to 99-100% utilization. Figure 3 shows an overview of the multi-view rendering pipeline in Tauray.
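As a hedged sketch of the rasterization side (our illustration, not a verbatim excerpt; attachment and subpass contents are omitted), enabling VK_KHR_multiview amounts to attaching a view mask to the render pass so that one draw call writes every layer of the view array. Note that the 32-bit mask caps a single rasterization pass at 32 views, whereas compute and ray tracing passes simply launch work for all layers of the image array.

```cpp
#include <vulkan/vulkan.h>

// Create a render pass whose single subpass renders to `view_count` layers of
// layered (2D array) attachments in one go; the shaders distinguish views via
// gl_ViewIndex.
VkRenderPass create_multiview_pass(VkDevice device, uint32_t view_count)
{
    uint32_t view_mask = (view_count >= 32) ? ~0u : ((1u << view_count) - 1u);

    VkRenderPassMultiviewCreateInfo multiview{
        VK_STRUCTURE_TYPE_RENDER_PASS_MULTIVIEW_CREATE_INFO};
    multiview.subpassCount = 1;
    multiview.pViewMasks = &view_mask;

    VkSubpassDescription subpass{};
    subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    // Real use attaches layered color/depth targets here; they are set up
    // exactly as for a single-view pass.

    VkRenderPassCreateInfo rp_info{VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO};
    rp_info.pNext = &multiview;
    rp_info.subpassCount = 1;
    rp_info.pSubpasses = &subpass;

    VkRenderPass render_pass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &rp_info, nullptr, &render_pass);
    return render_pass;
}
```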
As an alternative to brute-force rendering of all viewports, Tauray also has a simple, real-time capable spatial reprojection implementation for quickly generating more viewports from only a few rendered ones, though it does not yet fill in disocclusions intelligently.
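A minimal sketch of the per-pixel reprojection math underlying such a scheme is given below; it is our illustrative formulation, not Tauray's code, and assumes Vulkan-style [0,1] depth and the usual view-projection matrices.

```cpp
#include <glm/glm.hpp>

// Reproject a pixel from a rendered source view into a target view using its
// depth: reconstruct the world-space position, then project it with the
// target camera. Target pixels that no source pixel maps to remain
// disoccluded holes, matching the "no intelligent hole filling yet" caveat.
glm::vec2 reproject_pixel(
    glm::vec2 src_uv, float src_depth,      // source texture coordinate and depth
    const glm::mat4& src_inv_view_proj,     // source view: clip -> world
    const glm::mat4& dst_view_proj)         // target view: world -> clip
{
    // Source pixel to clip space.
    glm::vec4 clip(src_uv * 2.0f - 1.0f, src_depth, 1.0f);
    glm::vec4 world = src_inv_view_proj * clip;
    world /= world.w;

    // World position to the target view's clip space and back to texcoords.
    glm::vec4 dst_clip = dst_view_proj * world;
    glm::vec2 dst_ndc = glm::vec2(dst_clip) / dst_clip.w;
    return dst_ndc * 0.5f + 0.5f;
}
```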
VR is supported with the OpenXR API. As an example platform for real-time light field rendering, Tauray also supports rendering to the Looking Glass light field displays [Looking Glass Factory, Inc. 2022]. Content for arbitrary multi-view displays can be generated with offline rendering by setting up a grid of cameras. Spatial and temporal reprojection modes are also available, enabling the reuse of samples across different views and frames.

COMPARISON TO RELATED WORK
We compare Tauray to three other renderers: Falcor, Lighthouse 2, and Blender (Cycles). The first two were chosen because they are also open-source renderers with a similar real-time path tracing focus. All scenes are lit by one punctual directional light; this choice was dictated by the renderers' somewhat different feature sets, as it was the lowest common denominator for an identical lighting setup.
All benchmarks in Tables 1 and 2 are measured when path tracing at 1920 × 1080 with 2 ray bounces (effectively 3, as all compared renderers implement next event estimation). Because the renderers do not provide identical denoising schemes, denoising is disabled. RTX 2080 Ti GPUs are used for the measurements. For all renderers, the timing measurements were full frame times as measured on the CPU. For Table 1, performance is averaged over 5 separate runs of 50 frames each. For Table 2, performance is averaged over 10 runs.
Lighthouse 2 was modified to do two light bounces instead of just one. Furthermore, its offline rendering benchmarks accumulate 8 spp frames, because higher per-frame spp counts caused Lighthouse 2 to run out of memory on our setup.
In both the online and offline cases, Tauray is consistently as fast as or faster than the compared renderers. The dual-GPU setup in Lighthouse 2 seemed to be poorly supported: GPU utilization was low (30-40% on both GPUs) and its self-reported "frametime overhead" was on the order of ten milliseconds. Unfortunately, CUDA ran out of memory with Lighthouse 2 while loading the Emerald Square scene. Figure 4 shows how Tauray scales linearly when path tracing multiple views simultaneously for real-time light field rendering. These measurements are done on a single RTX 2080 Ti. The Looking Glass Portrait is used as the output light field display, so timings include compositing the views into the format the display expects. The blue line represents performance with multiple views (512 × 512 each), while the red line represents single-view performance with an equivalent total number of pixels.
Other than resolution and view count, settings are the same as earlier.
The multi-view rendering overhead depends greatly on the scene: at 128 views, Emerald Square and Sponza were about 26-29% slower to render than the single-view equivalent with the same total number of pixels, while the same metric for the Breakfast Room scene is only around 4%.

CONCLUSIONS
We introduced a scalable cross-platform real-time 3D rendering tool called Tauray. To our knowledge, it is the first open-source hardware-accelerated path tracer optimized for real-time rendering on light field and stereo displays. We demonstrated the optimized and scalable performance of Tauray: in both online and offline cases, its speed consistently matches or exceeds that of all compared renderers (Blender, Lighthouse 2, Falcor) across the tested GPU setups. Tauray also scales efficiently when rendering multi-view content for VR and light field displays, with cost growing roughly linearly with the number of views.