Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence,
largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However,
text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and
faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally,
collecting billions of text-image pairs for a specific domain can be time-consuming and costly.
Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging
task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available
semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates images and their corresponding semantic layouts. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets, where text-image pairs are scarce.
"Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis"
Minho Park, Jooyeol Yun, Seunghwan Choi, and Jaegul Choo.
ICCV, 2023.
Introduction
Recall of facial attributes specified in the text descriptions. Text-to-image generation approaches trained
on a subset of the Multi-Modal CelebA-HQ often fail to reflect text conditions. Facial attributes are
classified with a pretrained attribute classifier.
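For reference, the recall reported here can be computed as the fraction of attributes mentioned in the text descriptions that the pretrained classifier also detects in the generated images. A minimal sketch under that assumption (the inputs and helper below are hypothetical, not the paper's evaluation script):

import numpy as np

def attribute_recall(mentioned, detected):
    # mentioned: (num_samples, num_attrs) binary array of attributes named in each text.
    # detected:  (num_samples, num_attrs) binary array of classifier predictions.
    mentioned = np.asarray(mentioned, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    true_positives = (mentioned & detected).sum()
    return true_positives / max(mentioned.sum(), 1)

# Toy example with three attributes, e.g. ("smiling", "eyeglasses", "bangs").
mentioned = [[1, 1, 0], [0, 1, 0]]
detected = [[1, 0, 0], [0, 1, 0]]
print(attribute_recall(mentioned, detected))  # 2 of 3 mentioned attributes found -> ~0.667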
Visualization of a Gaussian-categorical distribution with a single variable
($N = 1, M = 1, K = 4,$ and $S = 4$).
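For reference, the simplest joint density over a continuous variable $x \in \mathbb{R}^{N}$ and a discrete variable $y$ taking one of $K$ values is the product of a Gaussian density and a categorical probability mass function, written here illustratively (the paper gives the exact parameterization of the Gaussian-categorical distribution):

$$\mathcal{NC}(x, y; \mu, \Sigma, \theta) = \mathcal{N}(x; \mu, \Sigma)\,\mathrm{Cat}(y; \theta), \qquad \theta_k \geq 0,\ \sum_{k=1}^{K} \theta_k = 1.$$

Coupling between the two components arises when the parameters are predicted jointly, as happens when a single denoising network outputs both the image and the layout.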
Gaussian-categorical Diffusion Process
Illustration of the Gaussian-categorical diffusion process on the image-layout distribution of MM CelebA-HQ.
We define a Gaussian-categorical diffusion process for modeling joint image-layout distributions, which is the first approach to unify the Gaussian and categorical diffusion processes for image-layout generation.
Derivation of the objective function is available in the paper.
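To make the construction concrete, below is a minimal sketch of one forward (noising) step on an image-layout pair, assuming the common choices of additive Gaussian noise for the continuous image and a uniform-transition categorical kernel for the discrete layout; the actual schedules and transition matrices in the paper may differ.

import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, y_onehot, beta):
    # x:        (H, W, 3) image in [-1, 1], diffused with Gaussian noise.
    # y_onehot: (H, W, K) one-hot layout, diffused with a uniform transition kernel.
    # Gaussian branch: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise.
    x_t = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    # Categorical branch: keep each label w.p. (1 - beta), else resample uniformly.
    K = y_onehot.shape[-1]
    probs = (1.0 - beta) * y_onehot + beta / K
    idx = np.array([rng.choice(K, p=p) for p in probs.reshape(-1, K)])
    y_t = np.eye(K)[idx.reshape(y_onehot.shape[:-1])]
    return x_t, y_t

# Toy 4x4 "image" paired with a 3-class layout.
x0 = rng.uniform(-1, 1, size=(4, 4, 3))
y0 = np.eye(3)[rng.integers(0, 3, size=(4, 4))]
x1, y1 = forward_step(x0, y0, beta=0.1)

Because both branches admit closed-form marginals $q(x_t \mid x_0)$ and $q(y_t \mid y_0)$, the joint objective can be derived analogously to the individual Gaussian and categorical diffusion objectives.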
Qualitative results
Examples of text-guided generation of image-layout pairs from the Gaussian-categorical diffusion trained on
MM CelebA-HQ100 and Cityscapes. The text descriptions at the bottom are given as conditions for generating the image-layout pairs.
Quantitative results
(a) FID-Semantic Recall trade-off in the Cityscapes dataset. (b) Semantic Recall for minor classes. Semantic
Recall is measured using the HRNet-w48 model. (c) Proportion of each semantic class in the entire Cityscapes
dataset. Class proportions are compared on a log scale for visibility.
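Assuming Semantic Recall counts a semantic class as recovered when the segmentation model (here HRNet-w48) predicts it somewhere in a generated image whose reference layout contains that class, a per-class version can be sketched as follows (illustrative only; the paper gives the exact definition):

import numpy as np

def semantic_recall(gt_layouts, pred_layouts, num_classes):
    # Per-class recall: among samples whose ground-truth layout contains class k,
    # the fraction whose generated image is also segmented as containing class k.
    present = np.zeros(num_classes)
    recovered = np.zeros(num_classes)
    for gt, pred in zip(gt_layouts, pred_layouts):
        for k in np.unique(gt):
            present[k] += 1
            recovered[k] += float((pred == k).any())
    return recovered / np.maximum(present, 1)

gt = [np.array([[0, 0], [1, 2]])]
pred = [np.array([[0, 0], [1, 1]])]
print(semantic_recall(gt, pred, num_classes=3))  # class 2 is missed -> [1. 1. 0.]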
Analyzing the internal representation
Visualization of clustering results between the internal features of the Gaussian-categorical diffusion and
the Gaussian diffusion.
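Clusterings of this kind are commonly obtained by running k-means over intermediate U-Net activations collected at a fixed denoising timestep; the sketch below makes that assumption rather than reproducing the paper's exact protocol.

import numpy as np
from sklearn.cluster import KMeans

def cluster_features(features, n_clusters=6, seed=0):
    # features: (H, W, C) activations from one U-Net block at a fixed timestep.
    # Returns an (H, W) map of cluster ids, viewable as a pseudo-segmentation.
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(flat)
    return ids.reshape(h, w)

# Toy stand-in for real activations.
features = np.random.default_rng(0).standard_normal((32, 32, 64))
cluster_map = cluster_features(features)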
Cross-modal outpainting
Cross-modal outpainting for (a) text-guided image-to-layout generation and (b) text-guided layout-to-image
generation. Segmentation layouts are generated with $n = 1$ resampling step and images with $n = 5$ resampling steps per timestep.
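Resampling of this kind is commonly implemented as in RePaint (Lugmayr et al., 2022): at each timestep the observed modality is re-noised from its clean value and pasted over the current sample, and the denoise/re-noise pair is repeated $n$ times before moving on. A schematic sketch, with every model-specific operation passed in as a hypothetical callable:

import numpy as np

def outpaint(x_T, known, mask, denoise_step, renoise_from_clean, renoise_one_step, T, n):
    # x_T:   initial noise for the joint image-layout tensor.
    # known: the observed modality (e.g., the layout), kept fixed via mask.
    # mask:  1 where observed, 0 where generated.
    # denoise_step(x, t):        model reverse step, x_t -> x_{t-1}.
    # renoise_from_clean(x0, s): sample of q(x_s | x_0) for the observed part.
    # renoise_one_step(x, t):    one forward step, x_{t-1} -> x_t, between resamples.
    x = x_T
    for t in range(T, 0, -1):
        for r in range(n):
            x = denoise_step(x, t)
            x = mask * renoise_from_clean(known, t - 1) + (1 - mask) * x
            if r < n - 1:
                x = renoise_one_step(x, t)
    return x

# Toy stand-ins that only exercise the control flow.
rng = np.random.default_rng(0)
shape = (8, 8)
result = outpaint(
    x_T=rng.standard_normal(shape),
    known=np.zeros(shape),
    mask=(np.arange(8) < 4).astype(float)[None, :],  # left half observed
    denoise_step=lambda x, t: 0.99 * x,
    renoise_from_clean=lambda x0, s: x0 + 0.01 * s * rng.standard_normal(x0.shape),
    renoise_one_step=lambda x, t: x + 0.01 * rng.standard_normal(x.shape),
    T=10, n=2,
)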
Citation
@inproceedings{park2023learning,
title={Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis},
author={Park, Minho and Yun, Jooyeol and Choi, Seunghwan and Choo, Jaegul},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={7591--7600},
year={2023}
}