Published September 30, 2023 | Version v1
Journal article | Open Access

Dense Caption Imagining

Description

A great deal of recent research has focused on computer vision and natural language processing. Our work lies at the intersection of the two: generating images from captions. We focus on the lower-data regime, using the COCO and CUB data sets, which contain roughly 200k and 11k image and caption pairs, respectively. We use a hierarchical GAN architecture as our baseline [7][24][26]. To improve on this baseline, we try several methods targeting the up-sampling blocks and adding residual or attention-based layers. We compare the inception score of each method to analyze our results, and we also examine qualitative results to ensure there is minimal mode collapse and memorization. We find that, of all our improvements, replacing the up-sampling technique with a Laplacian pyramid method using transposed convolutional layers obtains the best results, with a minimal increase in computation time and memory requirements.
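As a rough illustration of the kind of up-sampling stage described above, the following PyTorch sketch refines a coarse image with a transposed-convolution block in a Laplacian-pyramid style. The class name UpsampleStage, the channel sizes, and the overall wiring are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (not the paper's exact code) of a generator up-sampling stage
# that refines a coarser image with a transposed-convolution block, in the
# spirit of a Laplacian-pyramid / hierarchical GAN. Channel sizes and the
# class name UpsampleStage are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpsampleStage(nn.Module):
    """Doubles spatial resolution and predicts a residual to add to the
    bilinearly up-sampled coarse image (Laplacian-pyramid style)."""

    def __init__(self, in_channels: int = 64, hidden_channels: int = 64):
        super().__init__()
        self.up = nn.Sequential(
            # Transposed convolution: 2x spatial up-sampling of the feature map.
            nn.ConvTranspose2d(in_channels, hidden_channels,
                               kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(hidden_channels),
            nn.ReLU(inplace=True),
        )
        # Predict an RGB residual (the "Laplacian" band) at the new resolution.
        self.to_residual = nn.Conv2d(hidden_channels, 3, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor, coarse_image: torch.Tensor):
        feats = self.up(features)
        residual = self.to_residual(feats)
        # Up-sample the coarse image and add the learned residual detail.
        upsampled = F.interpolate(coarse_image, scale_factor=2,
                                  mode="bilinear", align_corners=False)
        refined = torch.tanh(upsampled + residual)
        return feats, refined


# Toy usage: refine a 64x64 image to 128x128.
if __name__ == "__main__":
    stage = UpsampleStage()
    feats = torch.randn(1, 64, 64, 64)         # feature map from previous stage
    coarse = torch.rand(1, 3, 64, 64) * 2 - 1  # coarse image in [-1, 1]
    new_feats, refined = stage(feats, coarse)
    print(refined.shape)  # torch.Size([1, 3, 128, 128])

In this sketch the transposed convolution doubles the feature-map resolution, and the learned residual plays the role of the Laplacian band added onto a bilinear up-sampling of the coarser image; the same idea would be stacked once per pyramid level in a hierarchical generator.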

Files

IJISRT22MAY1170.pdf (795.7 kB)
md5:2bc7a15f3ed77a5f602eb968ea8d3a5c