Image Reasoning and Generative Models - A First-Principles Survey of Architectures, Training, and Open Problems
Authors/Creators
Description
This survey provides a comprehensive examination of modern image generation and reasoning, tracing architectural
developments from first principles through frontier systems. We begin with foundational theory—variational autoencoders,
vector quantization, score functions, and the U-Net and Transformer architectures that underpin modern generators. We
then analyze the three dominant paradigms: diffusion models (DDPM, latent diffusion, classifier-free guidance), rectified
flow (Stable Diffusion 3, FLUX), and autoregressive approaches (LlamaGen, VAR, MaskGIT). The survey examines critical
components including visual tokenizers, text encoders (CLIP, T5, native multimodal), and conditioning mechanisms
(cross-attention, ControlNet, IP-Adapter), alongside landmark commercial systems (Imagen, DALL-E, GPT-Image, Gemini/Nano
Banana, Qwen-Image). We explore the unification of image understanding and generation through architectures like
Chameleon, Emu3, Show-o, and Transfusion, as well as structured generation via scene graphs, layout conditioning,
neuro-symbolic methods, and programmatic synthesis. Practical considerations of training (datasets, scaling laws,
distributed infrastructure, and distillation) are covered alongside evaluation frameworks spanning FID, compositional benchmarks
(T2I-CompBench, GENEVAL), learned preference metrics, and human evaluation protocols. We document persistent
weaknesses including compositional reasoning failures, visual quality limitations, and control gaps, then project future trends:
unified multimodal models, efficiency advances, video and 3D generation, and evolving ecosystem dynamics. Throughout, we
emphasize the fundamental tradeoffs—expressiveness versus tractability, quality versus speed, flexibility versus control—that
shape this rapidly advancing field.
Files
| Name | Size |
|---|---|
| ImageReasoningGenerativeModels2026.pdf (md5:ae3af5bf5bceee26c1c06d0651d237a0) | 532.2 kB |