SphereDiff: Tuning-free 360° Static and Dynamic Panorama Generation via Spherical Latent Representation

Teaser examples with their generation captions:
- Storm: "Storm, Ocean, Storm clouds, White foam swirls, etc."
- Volcano: "Volcano, magma, ash, lava, etc."
- Aurora: "Aurora, Northern lights, Northern sky, etc."

Abstract

The increasing demand for AR/VR applications has highlighted the need for high-quality 360-degree panoramic content. However, generating high-quality 360-degree panoramic images and videos remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or attempt tuning-free methods that still rely on ERP latent representations, leading to discontinuities near the poles. In this paper, we introduce SphereDiff, a novel approach for seamless 360-degree panoramic image and video generation using state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures uniform distribution across all perspectives, mitigating the distortions inherent in ERP. We extend MultiDiffusion to spherical latent space and propose a spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality in the projection process. Our method outperforms existing approaches in generating 360-degree panoramic content while maintaining high fidelity, making it a robust solution for immersive AR/VR applications.
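At the heart of the method is a latent representation defined directly on the sphere rather than on an ERP grid. As a minimal illustration of why this matters (not the paper's exact construction; the Fibonacci lattice, function name, and point count below are our own assumptions), the sketch places latent points near-uniformly on the unit sphere, whereas an ERP grid of comparable resolution concentrates samples at the poles:

```python
import numpy as np

def fibonacci_sphere(n_points: int) -> np.ndarray:
    """Near-uniform points on the unit sphere via a Fibonacci lattice.

    Unlike an equirectangular (ERP) grid, which allocates a full row of
    samples to every latitude and therefore oversamples the poles, this
    spacing keeps the per-point solid angle roughly constant everywhere.
    """
    i = np.arange(n_points)
    golden = (1 + 5 ** 0.5) / 2           # golden ratio
    z = 1 - 2 * (i + 0.5) / n_points      # uniform in z => uniform on the sphere
    r = np.sqrt(1 - z ** 2)               # radius of the latitude circle at height z
    theta = 2 * np.pi * i / golden        # golden-angle increments in azimuth
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=-1)

points = fibonacci_sphere(4096)           # e.g., one latent token per point
```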

Conceptual illustration of SphereDiff.
SphereDiff enables tuning-free 360° panorama generation via a spherical latent representation. It is compatible with various diffusion backbones, including FLUX, SANA, and HunyuanVideo.

Motivation

Motivation. Previous fine-tuning approaches (360 LoRA, PanFusion) often fail to generate continuous scenes near the poles due to limited ERP training data. The tuning-free approach (DynamicScaler) also fails to produce seamless frames because it relies on an ERP latent representation. In contrast, SphereDiff generates a seamless image.

Overall Pipeline

Method overview
We initialize uniform spherical latents and, at each denoising step, extract perspective latents for multiple views using dynamic latent sampling. These latents are denoised and fused via MultiDiffusion with distortion-aware weighted averaging, enabling seamless, distortion-free 360-degree panoramic image and video generation in a tuning-free manner.
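The sketch below illustrates one such fused denoising step under our reading of the pipeline; `sample_view`, `splat_to_sphere`, `distortion_weight`, and the model call signature are hypothetical placeholders, not the paper's API:

```python
import torch

def denoise_step(spherical_latent, views, model, t,
                 sample_view, splat_to_sphere, distortion_weight):
    """One fused denoising step over a spherical latent (illustrative sketch).

    `sample_view` extracts a perspective latent grid from the sphere for a
    given camera rotation, `splat_to_sphere` scatters a grid back onto the
    spherical points, and `distortion_weight` down-weights pixels far from
    each view's center, where perspective projection is most distorted.
    """
    num = torch.zeros_like(spherical_latent)            # weighted sum of view updates
    den = torch.zeros_like(spherical_latent[..., :1])   # accumulated weights
    for view in views:                                   # e.g., a set of camera rotations
        x = sample_view(spherical_latent, view)          # sphere -> perspective latent
        x = model(x, t)                                  # one step of a frozen pretrained model
        w = distortion_weight(view)                      # higher near the view center
        num += splat_to_sphere(w * x, view)
        den += splat_to_sphere(w, view)
    return num / den.clamp_min(1e-8)                     # MultiDiffusion-style weighted average
```

Dividing the accumulated weighted updates by the accumulated weights is the standard MultiDiffusion fusion; the distortion-aware weights simply bias that average toward each view's center, where perspective sampling is least distorted.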

Qualitative Comparison

Qualitative comparison. Compared to our method, baseline approaches often struggle to maintain spatial continuity near the poles, revealing their limitations in achieving seamless 360° generation in dynamic settings.

Additional comparison

Additional results

Quantitative Results

User study results. The 360° static and live wallpapers generated by SphereDiff achieve state-of-the-art user preference across most metrics, particularly on panoramic criteria such as distortion and end continuity.


Automated quantitative evaluation. SphereDiff consistently outperforms existing methods on all metrics except image quality, where it ranks second.
