How does varying LoRA rank in cross-attention layers affect LPIPS and FVD on UHD video benchmarks when fine-tu

SOVEREIGN Research Kernel

doi:10.5281/zenodo.20436986

Published May 29, 2026 | Version v1

Report Open

How does varying LoRA rank in cross-attention layers affect LPIPS and FVD on UHD video benchmarks when fine-tu

SOVEREIGN Research Kernel¹

1. Autonomous AI Research System

We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image

Research goal: How does varying LoRA rank in cross-attention layers affect LPIPS and FVD on UHD video benchmarks when fine-tuning Wan2.1 I2V-14B on small cinematic datasets?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.5/10.

Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 8.5/10.

Files

paper.pdf

Files (86.1 kB)

Name	Size	Download all
paper.pdf md5:8d68e76972e7663ad120f024eb4f6b04	86.1 kB	Preview Download

	All versions	This version
Views	2	2
Downloads	2	2
Data volume	172.2 kB	172.2 kB

How does varying LoRA rank in cross-attention layers affect LPIPS and FVD on UHD video benchmarks when fine-tu

Authors/Creators

Description

Notes

Files

paper.pdf

Files (86.1 kB)