Image Reasoning and Generative Models - A First-Principles Survey of Architectures, Training, and Open Problems

Published January 26, 2026 | Version v1

Technical note | Open

This survey provides a comprehensive examination of modern image generation and reasoning, tracing architectural developments from first principles through frontier systems. We begin with foundational theory: variational autoencoders, vector quantization, score functions, and the U-Net and Transformer architectures that underpin modern generators. We then analyze the three dominant paradigms: diffusion models (DDPM, latent diffusion, classifier-free guidance), rectified flow (Stable Diffusion 3, FLUX), and autoregressive approaches (LlamaGen, VAR, MaskGIT). The survey examines critical components including visual tokenizers, text encoders (CLIP, T5, native multimodal), and conditioning mechanisms (cross-attention, ControlNet, IP-Adapter), alongside landmark commercial systems (Imagen, DALL-E, GPT-Image, Gemini/Nano Banana, Qwen-Image). We explore the unification of image understanding and generation through architectures like Chameleon, Emu3, Show-o, and Transfusion, as well as structured generation via scene graphs, layout conditioning, neuro-symbolic methods, and programmatic synthesis. Practical considerations of training (datasets, scaling laws, distributed infrastructure, and distillation) are covered alongside evaluation frameworks spanning FID, compositional benchmarks (T2I-CompBench, GENEVAL), learned preference metrics, and human evaluation protocols. We document persistent weaknesses including compositional reasoning failures, visual quality limitations, and control gaps, then project future trends: unified multimodal models, efficiency advances, video and 3D generation, and evolving ecosystem dynamics. Throughout, we emphasize the fundamental tradeoffs (expressiveness versus tractability, quality versus speed, flexibility versus control) that shape this rapidly advancing field.

Files

Name	Size
ImageReasoningGenerativeModels2026.pdf md5:ae3af5bf5bceee26c1c06d0651d237a0	532.2 kB