Published November 7, 2022 | Version 1
Resource type: Other | Access: Open

OCR-VQGAN trained on Paper2Fig100k

  • 1. Computer Vision Center, Barcelona
  • 2. ServiceNow Research
  • 3. ÉTS Montreal

Description

OCR-VQGAN: Taming Text-within-Image Generation

Synthetic image generation has recently seen significant improvements in domains such as natural image and art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure.
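
A minimal sketch of how such a text perceptual loss can be computed, assuming a frozen, pre-trained OCR feature extractor that exposes intermediate activations (the backbone interface, layer choice, and weighting below are illustrative placeholders, not the exact components of the released model):

```python
import torch
import torch.nn.functional as F

def ocr_perceptual_loss(ocr_backbone, reconstruction, target):
    """LPIPS-style feature-matching loss computed with a frozen OCR backbone.

    `ocr_backbone(x)` is assumed to return a list of intermediate feature maps.
    """
    with torch.no_grad():
        target_feats = ocr_backbone(target)      # frozen reference features
    recon_feats = ocr_backbone(reconstruction)   # gradients flow to the decoder
    loss = reconstruction.new_zeros(())
    for fr, ft in zip(recon_feats, target_feats):
        # Channel-wise normalization before comparison, as in LPIPS
        fr = F.normalize(fr, dim=1)
        ft = F.normalize(ft, dim=1)
        loss = loss + F.mse_loss(fr, ft)
    return loss
```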

Here we provide the model pre-trained on the Paper2Fig100k dataset, which downsamples images by a factor of f=16 and uses a discrete codebook of 16,384 entries with vectors of dimension 256. See github.com/joanrod/ocr-vqgan/ for the implementation and details.
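
A hypothetical loading sketch, assuming the taming-transformers-style `VQModel` API that the ocr-vqgan repository builds on; the config and checkpoint paths below are placeholders for the contents of the zip in this record:

```python
import torch
from omegaconf import OmegaConf
from taming.models.vqgan import VQModel

# Placeholder paths; adjust to the actual contents of the downloaded zip.
config = OmegaConf.load("ocr-vqgan-f16-c16384-d256/config.yaml")
model = VQModel(**config.model.params)
state = torch.load("ocr-vqgan-f16-c16384-d256/model.ckpt", map_location="cpu")
model.load_state_dict(state["state_dict"], strict=False)
model.eval()

# Encode/decode a 256x256 RGB image: with f=16, the latent is a 16x16 grid
# of 256-dimensional codebook vectors.
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    quant, _, _ = model.encode(x)
    recon = model.decode(quant)
```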

Our WACV 2023 paper presents how we design an OCR perceptual loss for use within the VQGAN framework (OCR-VQGAN). The paper also introduces the proposed Paper2Fig100k dataset.
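
For context, the OCR perceptual loss enters training as one extra weighted term alongside the standard VQGAN objectives; a hedged sketch of that composition follows (the weights and term names are illustrative, see the paper for the exact formulation):

```python
def total_loss(l_rec, l_lpips, l_ocr, l_codebook, l_adv,
               w_lpips=1.0, w_ocr=1.0, w_adv=0.5):
    # l_rec:      pixel-space reconstruction loss (e.g. L1)
    # l_lpips:    standard perceptual (LPIPS) loss
    # l_ocr:      OCR perceptual loss that targets text regions
    # l_codebook: vector-quantization/commitment loss
    # l_adv:      patch-based adversarial loss from the VQGAN discriminator
    # w_*:        placeholder weights, not the paper's values
    return l_rec + w_lpips * l_lpips + w_ocr * l_ocr + l_codebook + w_adv * l_adv
```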

Files

ocr-vqgan-f16-c16384-d256.zip

Size: 961.7 MB
md5: 606c21f29a9ff186b29ec2c8d7dddfb6