Published January 2, 2026 | Version v1
Video/Audio Open

Ep. 132: Can AI Map Your House Just by Looking Around?

  • 1. My Weird Prompts
  • 2. Google DeepMind
  • 3. Resemble AI

Description

Episode summary: In this episode of My Weird Prompts, hosts Herman and Corn dive into the cutting-edge landscape of 2026's video-based multimodal AI. They explore how the industry moved beyond simple frame-sampling to adopt spatial-temporal tokenization, allowing models to treat time as a physical dimension. The discussion covers the technical hurdles of real-time video-to-video interaction, including Simultaneous Localization and Mapping (SLAM) for floor plan generation and the use of speculative decoding to minimize latency. By examining the integration of Neural Radiance Fields (NeRFs) and native multimodality, Herman and Corn reveal how AI is finally crossing the uncanny valley to create digital avatars that are indistinguishable from reality.

Show Notes

In the latest episode of *My Weird Prompts*, hosts Herman and Corn take a deep dive into the state of artificial intelligence in 2026, focusing specifically on the rapid evolution of video-based multimodal models. The conversation was sparked by a prompt from their housemate, Daniel, who attempted to use a modern AI model to generate a floor plan simply by walking through their apartment. This exercise in "stress-testing the state of the art" served as the jumping-off point for a high-level technical discussion on how AI perceives space, time, and human interaction.

### From Frames to Volumes: The Tokenization Revolution

Herman begins by explaining that the fundamental way AI "sees" video has undergone a massive shift. In the early days of video AI, models relied on sampling: taking individual frames at set intervals and trying to stitch the context together. However, as Herman points out, this method is inefficient and often loses the nuance of motion.

The breakthrough came with the advent of spatial-temporal tokenization. Instead of treating a video as a stack of 2D photos, modern models like Gemini 3 use "3D patches." Herman describes these as data cubes. A single token no longer represents just a square of pixels; it represents a volume of space extended through time (for example, across eight or sixteen frames). This "temporal compression" allows the model to capture the essence of motion—sliding, rotating, or shifting—within a single token. This innovation is what allows current models to process massive amounts of video data without their context windows "exploding" under the weight of the information.
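For readers who want to see what a "3D patch" looks like in code, below is a minimal sketch of the tubelet-style embedding used in video transformers such as ViViT. The layer and dimensions here are illustrative assumptions, not the actual configuration of Gemini 3 or any production model.

```python
import torch
import torch.nn as nn

class SpatioTemporalTokenizer(nn.Module):
    """Turn a video into '3D patch' tokens: each token covers a
    patch_size x patch_size region of pixels across tubelet_len frames."""
    def __init__(self, tubelet_len=8, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # A 3D convolution with stride equal to its kernel size carves the
        # video into non-overlapping space-time cubes and projects each
        # cube to a single embedding vector.
        self.proj = nn.Conv3d(
            in_ch, dim,
            kernel_size=(tubelet_len, patch_size, patch_size),
            stride=(tubelet_len, patch_size, patch_size),
        )

    def forward(self, video):  # video: (batch, channels, time, height, width)
        cubes = self.proj(video)                  # (batch, dim, T', H', W')
        return cubes.flatten(2).transpose(1, 2)   # (batch, num_tokens, dim)

video = torch.randn(1, 3, 16, 224, 224)  # 16 frames of 224x224 RGB
tokens = SpatioTemporalTokenizer()(video)
print(tokens.shape)  # torch.Size([1, 392, 768])
```

The payoff is visible in the shapes: 16 frames become 392 tokens (2 × 14 × 14 cubes), versus 3,136 tokens if each frame were patched independently. That eight-fold temporal compression is what keeps the context window from "exploding."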

### World Modeling and SLAM

The discussion then shifts to the practical application of this technology: mapping physical spaces. When Daniel walks through his apartment, the AI isn't just recognizing objects; it is performing a version of Simultaneous Localization and Mapping (SLAM).

Herman notes that the model maintains a "latent representation" of the environment. As the camera moves, the AI uses its spatial-temporal understanding to predict how the environment should change. It understands that if the camera pans left, objects on the right must disappear in a geometrically consistent way. This "world modeling" is the difference between an AI that simply describes a video and an AI that understands the constraints of physical reality. For a floor plan to be accurate, the model must remember where the front door is even when the user has reached the back balcony, a feat made possible by context windows that now reach into the millions of tokens.
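The geometric heart of that idea can be sketched in a few lines. The toy example below covers only the "mapping" half of SLAM, with camera poses assumed known; a real system estimates poses and map jointly and corrects drift. The function name and inputs are hypothetical, chosen for readability.

```python
import numpy as np

def rotation_z(theta):
    """Yaw rotation in the floor plane (camera heading)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def accumulate_floor_plan(observations):
    """Fuse per-frame observations into one global 2D map.

    `observations` is a list of (camera_xy, heading, points_xy) tuples,
    where points_xy are wall/landmark points in the *camera* frame.
    """
    world_points = []
    for cam_xy, heading, pts in observations:
        R = rotation_z(heading)
        # Transform each local observation into world coordinates:
        # world = R @ local + camera_position
        world_points.append(pts @ R.T + cam_xy)
    return np.vstack(world_points)

# Two views of the same wall corner: once looking straight ahead,
# once after moving and turning 90 degrees. Both observations must
# land on the same world point for the map to be consistent.
obs = [
    (np.array([0.0, 0.0]), 0.0,       np.array([[2.0, 1.0]])),
    (np.array([1.0, 0.0]), np.pi / 2, np.array([[1.0, -1.0]])),
]
print(accumulate_floor_plan(obs))  # both rows come out as [2., 1.]
```

That consistency requirement is exactly the "geometrically consistent" constraint Herman describes: if the registered points from different viewpoints disagree, the model's implicit pose estimate has drifted.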

### The Quest for Real-Time Interaction

One of the most ambitious frontiers discussed is real-time video-to-video interaction. Corn raises the point that for an AI avatar to be truly "indistinguishable," the latency must be nearly imperceptible. The industry's gold standard is a response time of under 100 milliseconds, roughly the threshold below which an exchange feels instantaneous to a human.

To achieve this, Herman explains that models have moved away from strictly linear, token-by-token processing in favor of "speculative decoding" and "streaming inference." The model essentially gambles on the future: as a user begins a sentence or a movement, the AI starts generating multiple possible responses in parallel. If the user's action matches one of the predictions, the AI displays it instantly. If the user does something unexpected, the model pivots. This requires immense computational power but is essential for bridging the "uncanny valley" of conversational rhythm.
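A stripped-down version of the text-generation flavor of this idea looks like the sketch below. `target_step` and `draft_step` are hypothetical stand-ins for the expensive and cheap models, and the greedy accept/reject loop is a simplification; production systems verify all drafted tokens in a single batched forward pass and accept or reject them probabilistically.

```python
def speculative_decode(target_step, draft_step, prefix, k=4):
    """One round of speculative decoding (greedy variant).

    A cheap draft model proposes k tokens ahead; the expensive target
    model checks them, and the longest matching prefix is committed.
    When the draft guesses well, several tokens land for roughly the
    cost of one big-model step.
    """
    # 1. Draft model speculates k tokens ahead.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_step(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Target model verifies: accept drafts until the first mismatch,
    #    at which point the target's own token replaces the bad guess.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_step(ctx)
        if expected != tok:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

target = lambda seq: (seq[-1] + 1) % 10                          # toy "big" model
draft  = lambda seq: (seq[-1] + 1) % 10 if len(seq) < 5 else 0   # diverges late
print(speculative_decode(target, draft, [1, 2, 3]))  # [4, 5, 6]
```

The example commits three tokens even though the draft went wrong on its third guess, which is the "pivot" behavior Herman describes.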

### NeRFs and the Physics of Light

Beyond just timing, the visual fidelity of AI avatars has seen a massive upgrade through the integration of Neural Radiance Fields (NeRFs). Herman explains that instead of drawing a flat image, the model renders a three-dimensional volume in real time. This allows for dynamic lighting consistency. If a user moves their phone while talking to an AI avatar, the shadows on the avatar's face shift realistically because the model understands the virtual light source's position relative to the user's camera. This creates a sense of shared physical space, making the digital entity feel like it truly exists within the room.
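The rendering step Herman describes follows the standard NeRF volume-rendering formula, which can be sketched for a single camera ray as below. The neural network that actually predicts density and color at each 3D point is omitted, and the sample values are made up for illustration.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Classic NeRF alpha compositing along one camera ray.

    densities: (N,) volume density sigma at each sample point
    colors:    (N, 3) RGB predicted at each sample point
    deltas:    (N,) distance between consecutive samples
    Returns the final pixel color. Because color is queried per 3D
    point (and, in the full model, per view direction), moving the
    camera changes shading consistently across the whole volume.
    """
    alphas = 1.0 - np.exp(-densities * deltas)   # opacity of each sample
    # Transmittance: how much light survives to reach each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                     # contribution per sample
    return (weights[:, None] * colors).sum(axis=0)

# Three samples along a ray: empty space, then a dense red surface.
sigma = np.array([0.0, 5.0, 5.0])
rgb   = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.1], [0.5, 0.0, 0.0]])
print(render_ray(sigma, rgb, deltas=np.ones(3) * 0.5))
```

Note how the first opaque sample soaks up nearly all of the ray's weight; occlusion falls out of the math for free, which is why shadows and parallax stay consistent as the camera moves.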

### The Power of Native Multimodality

Finally, the hosts discuss the importance of native multimodality. In the past, AI systems were a "Frankenstein's monster" of separate models (one for vision, one for text, one for speech) taped together, creating significant lag. In 2026, models are built to be multimodal from the ground up.

In these native models, audio waves and video pixels are converted into the same token space. This allows for "cross-modal attention," where the AI can use visual cues (like lip movement) to help it understand muffled audio. This unified processing is why modern AI avatars have perfect lip-syncing and emotional resonance; the voice and the facial expression are generated as a single, cohesive output rather than two separate files trying to stay in sync.
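A single step of that cross-modal attention can be sketched as below, assuming both modalities have already been projected into the same token width. The random projection matrices stand in for learned weights, and a production model would interleave many such layers, often as full self-attention over the concatenated audio-plus-video token stream.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(audio_tokens, video_tokens, d=64):
    """Audio queries attend over video keys/values.

    Each audio token (say, a muffled syllable) computes alignment
    scores against every space-time video token, so visual evidence
    like lip movement can sharpen the audio representation.
    """
    # In a real model these projections are learned; random here.
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    q = audio_tokens @ Wq                        # (A, d) queries
    k = video_tokens @ Wk                        # (V, d) keys
    v = video_tokens @ Wv                        # (V, d) values
    attn = F.softmax(q @ k.T / d**0.5, dim=-1)   # (A, V) alignment scores
    return attn @ v                              # visually informed audio tokens

audio = torch.randn(20, 64)   # 20 audio tokens (e.g. ~1 s of speech)
video = torch.randn(50, 64)   # 50 space-time video tokens
fused = cross_modal_attention(audio, video)
print(fused.shape)  # torch.Size([20, 64])
```

Because both streams live in one token space, lip-sync is not a post-hoc alignment problem; the voice and the face are attended to, and generated, together.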

### Conclusion: A New Era of Telepresence

As Herman and Corn wrap up, they reflect on the implications of these technologies for the future of communication. The convergence of spatial mapping, low-latency streaming, and 3D rendering points toward a world of advanced telepresence. We are moving toward a reality where a colleague from across the globe could appear as a high-fidelity, 3D avatar sitting in the chair across from you, reacting to your world with the same physical and temporal accuracy as a person standing in the room.

The episode serves as a reminder that as we continue to tokenize the world around us, the line between the digital and the physical continues to blur, driven by the complex, hidden mathematics of spatial-temporal AI.

Listen online: https://myweirdprompts.com/episode/video-multimodal-ai-evolution

Notes

My Weird Prompts is an AI-generated podcast. Episodes are produced using an automated pipeline: voice prompt → transcription → script generation → text-to-speech → audio assembly. Archived here for long-term preservation. AI CONTENT DISCLAIMER: This episode is entirely AI-generated. The script, dialogue, voices, and audio are produced by AI systems. While the pipeline includes fact-checking, content may contain errors or inaccuracies. Verify any claims independently.

Files

video-multimodal-ai-evolution-cover.png

Files (22.7 MB)

Name                                    Size
md5:43151fb8a83a9ff7cf3db514e3c299ac    5.9 MB
md5:1a1b6d580dad1ffeaf2abfdbdee14733    1.9 kB
md5:138e25874fdc0a6be4800e7231e604ec    16.8 MB
md5:b1ed3bd59054100a0bf03a7e1b1f6796    18.5 kB
