Published March 8, 2026 | Version v2
Preprint | Open Access

NeuroCodec: Efficient Video Prediction via Residual Latent Dynamics

  • 1. Universität Hildesheim

Description

We introduce NeuroCodec, a video prediction system that predicts future frames as residual updates in a pre-trained video VAE latent space. Our key insight is that adjacent video frames share the vast majority of their latent content, making residual updates (L_{t+1} = L_t + Δ) fundamentally more efficient than full-frame decoding—an observation motivated by hierarchical predictive processing in biological vision.
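The residual update rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the latent shape and the `predict_delta` stub are hypothetical stand-ins for the learned dynamics model.

```python
import numpy as np

def predict_delta(latent):
    # Hypothetical stand-in for a learned dynamics model; the real
    # system predicts the residual with a small Transformer.
    return 0.01 * np.tanh(latent)

def rollout(l0, steps):
    """Autoregressive residual rollout: L_{t+1} = L_t + Delta_t."""
    latents = [l0]
    for _ in range(steps):
        latents.append(latents[-1] + predict_delta(latents[-1]))
    return latents

lat = np.zeros((64, 16), dtype=np.float32)  # 64 slot tokens x 16 channels (illustrative)
traj = rollout(lat, 8)
print(len(traj))  # initial latent plus 8 predicted frames
```

Because only the small residual Δ is predicted per step, the expensive full-frame decode can be amortized or skipped, which is the source of the latency advantage claimed above.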

We combine slot-based spatial compression (1024→64 tokens), a lightweight dynamics Transformer (438K params), learned event boundary detection, and cross-attention residual decoding (1.43M params) into a unified architecture operating on frozen CogVideoX latents. On UCF-101, our system achieves a 36.8% MSE reduction over the copy baseline (0.216 vs. 0.342) with 130–406× lower per-frame latency than a 50-step UNet at equal batch sizes, while maintaining stable 8-frame rollouts (2.07× error ratio over 50 validation videos).
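The 1024→64 slot compression can be illustrated as cross-attention pooling, where a fixed set of learned queries attends over the spatial tokens. A minimal single-head sketch, with illustrative shapes (the real module's head count and dimensions are not specified here):

```python
import numpy as np

def slot_compress(tokens, queries):
    """Cross-attention pooling: each of 64 learned query slots computes a
    softmax-weighted average over 1024 spatial tokens (single-head sketch)."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])   # (64, 1024)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax
    return weights @ tokens                                  # (64, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 32))   # spatial latent tokens (illustrative dim)
queries = rng.normal(size=(64, 32))    # learned slot queries
slots = slot_compress(tokens, queries)
print(slots.shape)  # (64, 32)
```

Compressing to 64 slots before the dynamics Transformer is what keeps the dynamics model small (438K params), since attention cost scales with token count.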

The architecture generalizes to Something-Something v2 (220K videos, 2.65M frame pairs) without modification. Slot compression captures 88.6% of latent variance, and dynamics prediction improves over the copy baseline by 32.2%. Full pixel-level evaluation on SSv2 against a SimVP baseline (2.08M params) confirms that NeuroCodec achieves a strong trade-off between stability and perceptual quality: 2.67× rollout stability (vs. SimVP's 1.93×), FID 31.5 (vs. SimVP's 84.8, 2000 samples), and LPIPS 0.194 in single-step prediction (vs. SimVP's 0.355), at comparable model size (1.92M trainable parameters).

A spectral frequency-matching loss reduces variance-ratio deviation by 55%, achieving LPIPS 0.105 on UCF-101—within 8% of the copy baseline's 0.097 despite copy using ground-truth VAE encodings.
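A frequency-matching loss of this kind can be sketched by comparing FFT magnitude spectra of predicted and target latents. This is an illustrative form only; the paper's exact formulation and weighting are not shown here.

```python
import numpy as np

def spectral_matching_loss(pred, target):
    """Penalize mismatch between the 2D frequency magnitudes of predicted
    and target latent maps (illustrative sketch of a spectral loss)."""
    p = np.abs(np.fft.rfft2(pred))
    t = np.abs(np.fft.rfft2(target))
    return float(np.mean((p - t) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 16))
print(spectral_matching_loss(a, a))  # 0.0 for identical inputs
```

Matching spectra directly counteracts the variance collapse typical of MSE-trained predictors, which is consistent with the reported 55% reduction in variance-ratio deviation.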

Systematic evaluation of six improvement strategies provides empirical evidence that the off-manifold bottleneck—previously identified in image generation—extends to video prediction. Predicted latents lie off the VAE decoder's data manifold, and no latent-space loss can fully close this gap without modifying the decoder. A lightweight manifold projector (55K params) with decoupled feedback—correcting the output path while preserving the raw feedback loop—reduces rollout MSE by 3.3% at 0.25 ms additional latency.
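The decoupled-feedback idea can be made concrete with a short sketch: the raw predicted latent drives the next prediction step, while only the projected latent is passed to the output (decoder) path. The `predict_delta` and `project` stubs below are hypothetical placeholders for the learned modules.

```python
import numpy as np

def predict_delta(latent):
    # Hypothetical dynamics-model stand-in.
    return 0.01 * np.tanh(latent)

def project(latent):
    # Hypothetical manifold projector; in the paper this is a small
    # learned module (~55K params), here an identity placeholder.
    return latent

def rollout_decoupled(l0, steps):
    """Decoupled feedback: the raw latent feeds the next step, so the
    projector corrects the output path without altering the rollout loop."""
    raw = l0
    outputs = []
    for _ in range(steps):
        raw = raw + predict_delta(raw)   # feedback loop uses raw latents
        outputs.append(project(raw))     # only the output path is projected
    return outputs

outs = rollout_decoupled(np.zeros((64, 16), dtype=np.float32), 8)
print(len(outs))  # 8 projected latents
```

Keeping the feedback loop raw avoids compounding the projector's own approximation error across the rollout, which is the design rationale behind decoupling.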

Files

NeuroCodec_preprint_v2.pdf (5.6 MB)
md5:17ff05ad03424607a094dac119e37fd4

Additional details

Related works

Is supplemented by
Software: https://github.com/dukejosch4/NeuroCodec (URL)