Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis

ICCV 2023

Minho Park*
m.park@kaist.ac.kr
Jooyeol Yun*
blizzard072@kaist.ac.kr
Seunghwan Choi
shadow2496@kaist.ac.kr
Jaegul Choo
jchoo@kaist.ac.kr
Korea Advanced Institute of Science and Technology (KAIST)
* indicates equal contributions.
We jointly generate image-layout pairs from textual descriptions using the Gaussian-categorical diffusion process.

Abstract

Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs. Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates images and their corresponding semantic layouts. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets, where text-image pairs are scarce.


Paper

[arXiv] [Github] [Video] [Slide] [Poster]

ICCV, 2023.
Minho Park, Jooyeol Yun, Seunghwan Choi, and Jaegul Choo.
"Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis"


Introduction

Recall of facial attributes specified in the text descriptions. Text-to-image generation approaches trained on a subset of the Multi-Modal CelebA-HQ often fail to reflect text conditions. Facial attributes are classified with a pretrained attribute classifier.

Method

Gaussian-categorical Distribution

$ \begin{align} \mathcal{NC}(\mathbf{x}, \mathbf{y}; \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\Theta}) &= \mathcal{C}(\mathbf{y}; \boldsymbol{\Theta}) \cdot \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\mathbf{y}, \boldsymbol{\Sigma}_\mathbf{y}) \\ &= \left( \prod_{i=1}^{M} \Theta_{i, \mathbf{y}_i} \right) \left( 2\pi \right)^{-\frac{N}{2}} \left| \boldsymbol{\Sigma}_\mathbf{y} \right|^{-\frac{1}{2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_\mathbf{y})^\top \boldsymbol{\Sigma}_\mathbf{y}^{-1} (\mathbf{x} - \boldsymbol{\mu}_\mathbf{y}) \right) \end{align} $
where $ \mathbf{x} \in \mathbb{R}^{N} $, $ \mathbf{y} \in \{ 1, 2, \ldots, K \}^{M} \subset \mathbb{R}^{M} $, $ \boldsymbol{\mu} \in \mathbb{R}^{S \times N} $, $ \boldsymbol{\Sigma} \in \mathbb{R}^{S \times N \times N} $, $ \boldsymbol{\Theta} \in \mathbb{R}^{M \times K} $ with $ S = K^{M} $, and $ \boldsymbol{\mu}_\mathbf{y} \in \mathbb{R}^{N} $, $ \boldsymbol{\Sigma}_\mathbf{y} \in \mathbb{R}^{N \times N} $.
Visualization of a Gaussian-categorical distribution with a single variable ($N = 1, M = 1, K = 4,$ and $S = 4$).
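In code, the factorization above is direct: a categorical term over the layout plus a Gaussian term over the image conditioned on the layout. Below is a minimal NumPy sketch of the log-density, assuming a diagonal covariance $ \boldsymbol{\Sigma}_\mathbf{y} $ (a common simplification in diffusion models; the general form uses a full $ N \times N $ covariance) and 0-indexed labels; all names are illustrative.

import numpy as np

def gaussian_categorical_log_density(x, y, mu_y, var_y, theta):
    """Log-density of NC(x, y; mu, Sigma, Theta), diagonal-covariance sketch.

    x:     (N,)   continuous image variables.
    y:     (M,)   layout labels, 0-indexed in {0, ..., K-1}.
    mu_y:  (N,)   Gaussian mean selected by the layout y.
    var_y: (N,)   diagonal of Sigma_y (assumed diagonal here).
    theta: (M, K) categorical probabilities; each row sums to 1.
    """
    # Categorical term: log C(y; Theta) = sum_i log Theta[i, y_i]
    log_cat = np.sum(np.log(theta[np.arange(y.shape[0]), y]))
    # Gaussian term with diagonal covariance:
    # -0.5 * sum_j [ log(2*pi*var_j) + (x_j - mu_j)^2 / var_j ]
    log_gauss = -0.5 * np.sum(np.log(2.0 * np.pi * var_y) + (x - mu_y) ** 2 / var_y)
    return log_cat + log_gauss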

Gaussian-categorical Diffusion Process

Illustration of the Gaussian-categorical diffusion process on the image-layout distribution of MM CelebA-HQ. We define a Gaussian-categorical diffusion process for modeling joint image-layout distributions, the first approach to unify two diffusion processes for image-layout generation. The derivation of the objective function is available in the paper.
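Concretely, the joint forward process can be read as two standard corruptions run in lockstep on a shared noise schedule: the image follows the Gaussian transition of DDPMs, while the layout follows a uniform categorical transition as in multinomial diffusion. The sketch below illustrates one forward step under that reading; the exact parameterization in the paper may differ, and all names are illustrative.

import numpy as np

def forward_step(x_prev, y_prev, beta_t, K, rng):
    """One forward (noising) step of a joint Gaussian-categorical diffusion.

    x_prev: (N,) image at step t-1.
    y_prev: (M,) layout labels at step t-1, in {0, ..., K-1}.
    beta_t: noise level shared by both modalities at step t.
    """
    # Gaussian transition: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = rng.standard_normal(x_prev.shape)
    x_t = np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise
    # Categorical transition: with probability beta_t, resample a label
    # uniformly from the K classes; otherwise keep the current label.
    resample = rng.random(y_prev.shape) < beta_t
    y_t = np.where(resample, rng.integers(0, K, size=y_prev.shape), y_prev)
    return x_t, y_t

# Usage: x_t, y_t = forward_step(x0, y0, 0.02, K=19, rng=np.random.default_rng(0))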

Qualitative results

Examples of text-guided generation of image-layout pairs from the Gaussian-categorical diffusion trained on MM CelebA-HQ$_{100}$ and Cityscapes. The text descriptions at the bottom are given as conditions for generating the image-layout pairs.

Quantitative results

(a) FID-Semantic Recall trade-off on the Cityscapes dataset. (b) Semantic Recall for minor classes. Semantic Recall is measured using the HRNet-w48 model. (c) Proportion of each semantic class in the entire Cityscapes dataset. Class proportions are compared on a log scale for visibility.
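For readers implementing such a metric, one plausible recipe is to segment each generated image with a pretrained network (HRNet-w48 in the figure) and measure the fraction of expected classes that actually appear. The helper below is a hypothetical sketch of that recipe; the paper's exact protocol (e.g., the min_pixels threshold used here) may differ.

import numpy as np

def semantic_recall(target_classes, pred_layout, min_pixels=10):
    """Fraction of expected classes present in a predicted layout.

    target_classes: set of class ids expected in the image.
    pred_layout:    (H, W) per-pixel class ids from a pretrained
                    segmentation model (e.g., HRNet-w48).
    min_pixels:     ignore classes covering fewer pixels than this.
    """
    ids, counts = np.unique(pred_layout, return_counts=True)
    present = {int(i) for i, c in zip(ids, counts) if c >= min_pixels}
    if not target_classes:
        return 1.0
    return len(target_classes & present) / len(target_classes)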

Analyzing the internal representation

Visualization of clustering results for the internal features of the Gaussian-categorical diffusion and the Gaussian diffusion.

Cross-modal outpainting

Cross-modal outpainting for (a) text-guided image-to-layout generation and (b) text-guided layout-to-image generation. Segmentation layouts are generated with $n = 1$ resampling step and images with $n = 5$ resampling steps per timestep.
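One way to realize such resampling is RePaint-style harmonization applied across modalities: the observed modality is repeatedly re-noised to the current timestep while the missing modality is denoised, with $n$ inner resampling steps per timestep. The sketch below covers the layout-to-image direction under that reading; q_sample, p_sample, and renoise are hypothetical stand-ins for the forward and reverse transitions of the joint diffusion.

def outpaint_image_from_layout(model, y0, T, n, q_sample, p_sample, renoise, x_T):
    """RePaint-style cross-modal outpainting sketch (layout -> image).

    y0:       observed layout, kept fixed at its clean value.
    q_sample: forward sampler q(z_t | z_0) for one modality.
    p_sample: one reverse step (x_t, y_t, t) -> (x_{t-1}, y_{t-1}).
    renoise:  one forward step z_{t-1} -> z_t, used to resample.
    n:        number of resampling steps per timestep.
    """
    x_t = x_T  # the image starts from pure noise
    for t in range(T, 0, -1):
        for u in range(n):
            y_t = q_sample(y0, t)                     # noised view of the known layout
            x_prev, _ = p_sample(model, x_t, y_t, t)  # denoise jointly, keep the image
            if u < n - 1:
                x_t = renoise(x_prev, t)              # re-noise to harmonize, as in RePaint
            else:
                x_t = x_prev
    return x_t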

Citation

@inproceedings{park2023learning,
  title={Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis},
  author={Park, Minho and Yun, Jooyeol and Choi, Seunghwan and Choo, Jaegul},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7591--7600},
  year={2023}
}