OCR-VQGAN trained on Paper2Fig100k
- 1. Computer Vision Center, Barcelona
- 2. ServiceNow Research
- 3. ÉTS Montreal
Description
OCR-VQGAN: Taming Text-within-Image Generation
Synthetic image generation has recently seen significant improvements in domains such as natural images and art. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure.
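To illustrate the idea, here is a minimal sketch of such a text perceptual loss in PyTorch. The OCR backbone (`ocr_backbone`), the layer selection, and the weighting are hypothetical placeholders rather than the exact loss from the paper; the point is to compare intermediate features of a frozen OCR-pretrained network between the input and the reconstruction, in the spirit of LPIPS:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCRPerceptualLoss(nn.Module):
    """Sketch of a text perceptual loss: compare intermediate features of a
    frozen OCR-pretrained backbone between an image and its reconstruction.
    `ocr_backbone` is a hypothetical stand-in for any OCR feature extractor
    that returns a list of feature maps."""

    def __init__(self, ocr_backbone: nn.Module, layer_weights=None):
        super().__init__()
        self.backbone = ocr_backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # OCR features stay frozen during training
        self.layer_weights = layer_weights

    @staticmethod
    def _normalize(feat, eps=1e-8):
        # Channel-wise unit normalization, as in LPIPS-style losses.
        norm = feat.pow(2).sum(dim=1, keepdim=True).sqrt()
        return feat / (norm + eps)

    def forward(self, x, x_rec):
        feats_x = self.backbone(x)      # feature maps for the input image
        feats_r = self.backbone(x_rec)  # feature maps for the reconstruction
        weights = self.layer_weights or [1.0] * len(feats_x)
        loss = x.new_zeros(())
        for w, fx, fr in zip(weights, feats_x, feats_r):
            loss = loss + w * F.mse_loss(self._normalize(fx), self._normalize(fr))
        return loss
```

In the VQGAN framework, a term like this would be added to the reconstruction objective alongside the codebook and adversarial losses.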
Here we provide the pre-trained model trained on the Paper2Fig100k dataset, which downsamples images by a factor of f=16, using a discrete codebook of 16,384 entries and latent vectors of dimension 256. Refer to github.com/joanrod/ocr-vqgan/ for the implementation and details.
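As a usage sketch, encoding and decoding a figure image with the released checkpoint might look like the following. The file names inside the zip (`config.yaml`, `model.ckpt`) and the config layout are assumptions based on the taming-transformers codebase that OCR-VQGAN builds on; check the repository for the actual layout:

```python
import torch
from omegaconf import OmegaConf
from taming.models.vqgan import VQModel  # from the taming-transformers codebase

# Paths inside the released zip are assumptions; adjust to the actual layout.
config = OmegaConf.load("ocr-vqgan-f16-c16384-d256/config.yaml")
ckpt = torch.load("ocr-vqgan-f16-c16384-d256/model.ckpt", map_location="cpu")

model = VQModel(**config.model.params)
model.load_state_dict(ckpt["state_dict"], strict=False)
model.eval()

# Encode an image to discrete codes and decode it back.
# x is a batch in [-1, 1], shape (B, 3, H, W) with H and W divisible by 16.
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    quant, _, _ = model.encode(x)  # (1, 256, 16, 16): f=16 downsampling
    x_rec = model.decode(quant)    # reconstructed image
```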
Our paper at WACV 2023 presents how we design an OCR perceptual loss for use in the VQGAN framework (OCR-VQGAN). The paper also introduces the proposed Paper2Fig100k dataset.
Files

| Name | MD5 | Size |
|---|---|---|
| ocr-vqgan-f16-c16384-d256.zip | md5:606c21f29a9ff186b29ec2c8d7dddfb6 | 961.7 MB |
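To verify the download against the listed checksum, a minimal sketch (the filename and hash are taken from the table above):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

assert md5sum("ocr-vqgan-f16-c16384-d256.zip") == "606c21f29a9ff186b29ec2c8d7dddfb6"
```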