Published March 8, 2026 | Version v2
Preprint | Open Access

NeuroCodec: Efficient Video Prediction via Residual Latent Dynamics

  • 1. Universität Hildesheim

Description

We introduce NeuroCodec, a video prediction system that predicts future frames as residual updates in a pre-trained video VAE latent space. Our key insight is that adjacent video frames share the vast majority of their latent content, making residual updates (L_{t+1} = L_t + Δ) fundamentally more efficient than full-frame decoding—an observation motivated by hierarchical predictive processing in biological vision.
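The residual update rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the latent shape and the `predict_delta` stub are hypothetical stand-ins for the learned dynamics model.

```python
import numpy as np

def predict_delta(latent):
    # Hypothetical stand-in for a learned dynamics model; the real
    # system predicts the residual with a small Transformer.
    return 0.01 * np.tanh(latent)

def rollout(l0, steps):
    """Autoregressive residual rollout: L_{t+1} = L_t + Delta_t."""
    latents = [l0]
    for _ in range(steps):
        latents.append(latents[-1] + predict_delta(latents[-1]))
    return latents

lat = np.zeros((64, 16), dtype=np.float32)  # 64 slot tokens x 16 channels (illustrative)
traj = rollout(lat, 8)
print(len(traj))  # initial latent plus 8 predicted frames
```

Because only the small residual Δ is predicted per step, the expensive full-frame decode can be amortized or skipped, which is the source of the latency advantage claimed above.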

We combine slot-based spatial compression (1024→64 tokens), a lightweight dynamics Transformer (438K params), learned event boundary detection, and cross-attention residual decoding (1.43M params) into a unified architecture operating on frozen CogVideoX latents. On UCF-101, our system achieves a 36.8% MSE reduction over the copy baseline (0.216 vs. 0.342) with 130–406× lower per-frame latency than a 50-step UNet at equal batch sizes, while maintaining stable 8-frame rollouts (2.07× error ratio over 50 validation videos).
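The 1024→64 slot compression can be illustrated as cross-attention pooling, where a fixed set of learned queries attends over the spatial tokens. A minimal single-head sketch, with illustrative shapes (the real module's head count and dimensions are not specified here):

```python
import numpy as np

def slot_compress(tokens, queries):
    """Cross-attention pooling: each of 64 learned query slots computes a
    softmax-weighted average over 1024 spatial tokens (single-head sketch)."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])   # (64, 1024)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax
    return weights @ tokens                                  # (64, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 32))   # spatial latent tokens (illustrative dim)
queries = rng.normal(size=(64, 32))    # learned slot queries
slots = slot_compress(tokens, queries)
print(slots.shape)  # (64, 32)
```

Compressing to 64 slots before the dynamics Transformer is what keeps the dynamics model small (438K params), since attention cost scales with token count.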

The architecture generalizes to Something-Something v2 (220K videos, 2.65M frame pairs) without modification. Slot compression captures 88.6% of latent variance, and dynamics prediction improves over the copy baseline by 32.2%. Full pixel-level evaluation on SSv2 against a SimVP baseline (2.08M params) confirms that NeuroCodec achieves a strong trade-off between stability and perceptual quality: 2.67× rollout stability (vs. SimVP's 1.93×), FID 31.5 (vs. SimVP's 84.8, 2000 samples), and LPIPS 0.194 in single-step prediction (vs. SimVP's 0.355), at comparable model size (1.92M trainable parameters).

A spectral frequency-matching loss reduces variance-ratio deviation by 55%, achieving LPIPS 0.105 on UCF-101—within 8% of the copy baseline's 0.097 despite copy using ground-truth VAE encodings.
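A frequency-matching loss of this kind can be sketched by comparing FFT magnitude spectra of predicted and target latents. This is an illustrative form only; the paper's exact formulation and weighting are not shown here.

```python
import numpy as np

def spectral_matching_loss(pred, target):
    """Penalize mismatch between the 2D frequency magnitudes of predicted
    and target latent maps (illustrative sketch of a spectral loss)."""
    p = np.abs(np.fft.rfft2(pred))
    t = np.abs(np.fft.rfft2(target))
    return float(np.mean((p - t) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 16))
print(spectral_matching_loss(a, a))  # 0.0 for identical inputs
```

Matching spectra directly counteracts the variance collapse typical of MSE-trained predictors, which is consistent with the reported 55% reduction in variance-ratio deviation.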

Systematic evaluation of six improvement strategies provides empirical evidence that the off-manifold bottleneck—previously identified in image generation—extends to video prediction. Predicted latents lie off the VAE decoder's data manifold, and no latent-space loss can fully close this gap without modifying the decoder. A lightweight manifold projector (55K params) with decoupled feedback—correcting the output path while preserving the raw feedback loop—reduces rollout MSE by 3.3% at 0.25 ms additional latency.
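The decoupled-feedback idea can be made concrete with a short sketch: the raw predicted latent drives the next prediction step, while only the projected latent is passed to the output (decoder) path. The `predict_delta` and `project` stubs below are hypothetical placeholders for the learned modules.

```python
import numpy as np

def predict_delta(latent):
    # Hypothetical dynamics-model stand-in.
    return 0.01 * np.tanh(latent)

def project(latent):
    # Hypothetical manifold projector; in the paper this is a small
    # learned module (~55K params), here an identity placeholder.
    return latent

def rollout_decoupled(l0, steps):
    """Decoupled feedback: the raw latent feeds the next step, so the
    projector corrects the output path without altering the rollout loop."""
    raw = l0
    outputs = []
    for _ in range(steps):
        raw = raw + predict_delta(raw)   # feedback loop uses raw latents
        outputs.append(project(raw))     # only the output path is projected
    return outputs

outs = rollout_decoupled(np.zeros((64, 16), dtype=np.float32), 8)
print(len(outs))  # 8 projected latents
```

Keeping the feedback loop raw avoids compounding the projector's own approximation error across the rollout, which is the design rationale behind decoupling.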

Files

NeuroCodec_preprint_v2.pdf (5.6 MB)
md5:17ff05ad03424607a094dac119e37fd4

Additional details

Related works

Is supplemented by
Software: https://github.com/dukejosch4/NeuroCodec (URL)