SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model

Anonymous

Given a textual description of a house, SceneLCM automatically generates multi-room, multi-scale indoor scenes without human intervention.

Abstract

Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality.

To address these limitations, we propose SceneLCM, an end-to-end framework that combines a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular stages: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts). A programmatic validation mechanism then iteratively refines the layout parameters through LLM-mediated dialogue loops. (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by the LCM, to produce semantically rich, high-quality representations quickly. We also offer two theoretical justifications, showing that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver. (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to anchor locations in geometrically heterogeneous instances. (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism.

Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.
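
The iterative validation in stage (1) can be pictured as a check-and-repair loop. The sketch below is a minimal illustration, not SceneLCM's implementation: the `Box` layout format, the single boundary rule in `validate`, and the `revise` callback (a toy stand-in for the LLM dialogue turn) are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned furniture footprint (hypothetical layout format)."""
    name: str
    x: float  # min corner, metres
    y: float
    w: float  # extent along x
    d: float  # extent along y


def validate(room_w: float, room_d: float, boxes: List[Box]) -> List[str]:
    """Return readable error messages; an empty list means the layout passes."""
    errors = []
    for b in boxes:
        if b.x < 0 or b.y < 0 or b.x + b.w > room_w or b.y + b.d > room_d:
            errors.append(f"{b.name} extends beyond the room boundary")
    return errors


def refine(room_w: float, room_d: float, boxes: List[Box],
           revise: Callable[[List[Box], List[str]], List[Box]],
           max_rounds: int = 5) -> List[Box]:
    """Alternate validation and revision until the layout passes or rounds run out."""
    for _ in range(max_rounds):
        errors = validate(room_w, room_d, boxes)
        if not errors:
            return boxes
        boxes = revise(boxes, errors)  # in a real system: feed errors back to the LLM
    return boxes  # best effort after max_rounds


def clamp_revise(room_w: float, room_d: float):
    """Toy stand-in for the LLM turn: pushes every box back inside the room."""
    def revise(boxes: List[Box], errors: List[str]) -> List[Box]:
        return [replace(b,
                        x=min(max(b.x, 0.0), room_w - b.w),
                        y=min(max(b.y, 0.0), room_d - b.d))
                for b in boxes]
    return revise


layout = [Box("bed", 3.0, 0.5, 2.0, 1.5)]  # sticks out of a 4 m x 3 m room
fixed = refine(4.0, 3.0, layout, clamp_revise(4.0, 3.0))
print(fixed)
```

Passing the error messages to `revise` mirrors the idea of feeding programmatic feedback into the next dialogue turn; a real validator would check many more constraints than room bounds.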

Environment Editing

bedroom style1

prompt: A Boho-Hippe style bedroom, beautiful floor, a window on wall, photorealistic, HD, 8K

bedroom style2

prompt: A Bohemian style bedroom, beautiful floor, a window on wall, photorealistic, HD, 8K

bedroom style3

prompt: A cubism art style bedroom, beautiful floor, a window on wall, photorealistic, HD, 8K

bedroom style4

prompt: A Modern Children bedroom, beautiful floor, a window on wall, photorealistic, HD, 8K

dining room style1

prompt: A Gypsy-classic style dining room, beautiful floor, a window on wall, photorealistic, HD, 8K

dining room style2

prompt: A cubism art style dining room, beautiful floor, a window on wall, photorealistic, HD, 8K

dining room style3

prompt: A Neo-hipple style dining room, beautiful floor, a window on wall, photorealistic, HD, 8K

dining room style4

prompt: A Gypsy dining room, beautiful floor, a window on wall, photorealistic, HD, 8K

Additional Results

Render Environment Across Different Rooms

We can navigate between multiple rooms while rendering them simultaneously.

Physical Editing

We tilt the room by 30 degrees, causing the furniture to move due to gravity.
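
As a rough worked example of why the furniture moves: in the frame of a floor tilted by an angle theta, gravity splits into a component g*sin(theta) along the floor and g*cos(theta) into it, so under a simple Coulomb friction model an object starts to slide once tan(theta) exceeds its static friction coefficient. The sketch below assumes g = 9.81 m/s^2 and illustrative friction coefficients; neither value comes from the SceneLCM simulation, and toppling is ignored.

```python
import math

G = 9.81  # m/s^2, standard gravity (assumed value)


def gravity_in_tilted_frame(tilt_deg: float):
    """Return the (along-floor, into-floor) components of gravity for a tilted floor."""
    theta = math.radians(tilt_deg)
    return G * math.sin(theta), G * math.cos(theta)


def slides(tilt_deg: float, mu_static: float) -> bool:
    """Coulomb friction model: the object slides when tan(theta) > mu_static."""
    return math.tan(math.radians(tilt_deg)) > mu_static


along, into = gravity_in_tilted_frame(30.0)
print(f"along floor: {along:.3f} m/s^2, into floor: {into:.3f} m/s^2")
print("mu = 0.4 slides:", slides(30.0, 0.4))
print("mu = 0.7 slides:", slides(30.0, 0.7))
```

At 30 degrees, tan(theta) is about 0.577, so objects with a lower friction coefficient slide while those with a higher one stay put.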

Object Generation

Object1

prompt: A cozy office chair with a big pink back, HD, 4K.

Object2

prompt: A swivel office chair with some beautiful texture, HD, 4K.

Object3

prompt: A rectangular glass-top coffee table with a metal frame. HD, 4K.

Object4

prompt: A beautiful office chair, photorealistic, HD, 8K

Object5

prompt: A modern comfortable sofa, HD, 4K.

Object6

prompt: A wooden desk with metal legs, photorealistic, HD, 8K

Object7

prompt: A wooden desk, rich texture, photorealistic, HD, 8K

Object8

prompt: A green comfortable sofa, photorealistic, HD, 4K

Object9

prompt: A portrait of the Ghost Rider, head, HDR, photorealistic, 8K.

Object10

prompt: A portrait of Groot, head, HDR, photorealistic, 8K

Object11

prompt: A Gundam model, with detailed panel lines and decals, photorealistic, 8K, HDR

Object12

prompt: A Gundam Barbatos Lupus Rex model, Gundam, Barbatos, with detailed panel lines and decals, photorealistic, 8K, HDR.

Animation

Render Environment Across Different Rooms

We generate multiple rooms simultaneously. Although each video in the previous examples was rendered within a single room, other rooms and their furniture remain visible through the room's doorway.

Japanese Style VS Chinese Traditional Style

Japanese Style

Prompt: A Japanese style bedroom, beautiful floor, a window on wall, photirealistic, HD, 8k

Although Japanese and Chinese styles share similar color schemes, Japanese style predominantly features rectangular patterns with cherry blossoms as decorative motifs.

Chinese Style

Prompt: A chinese traditional style entrance, beautiful floor, a window on wall, photirealistic, HD, 8k

Chinese style predominantly incorporates stripes and paper-cut window decorations as ornamental elements.

Texture map optimization VS UV Parameters

Texture map optimization

Prompt: A baroque style entrance, beautiful floor, a window on wall, photirealistic, HD, 8k

The texture map is optimized directly via the CTS loss.

The room shows significant multi-view inconsistency: one side appears red while the other appears green. Numerous noise points are also present.

UV Parameters

Prompt: A baroque style entrance, beautiful floor, a window on wall, photirealistic, HD, 8k

Our method.

The textures are consistent across views and visually appealing.

Comparison on Additional Styles (Industrial, Baroque)

SceneCraft with Industrial Style

prompt: This is one view of a bedroom painted by Industrial style.

DreamScene with Industrial Style

prompt1: Industrial style, 4k, 8k, best quality, ultra-detailed, finely detail, highres, high resolution

prompt2: A DSLR photo of an Industrial style bedroom

Ours with Industrial Style

prompt: A Industrial style entrance, beautiful floor, a window on wall, photirealistic, HD, 8k

SceneCraft depth map with Industrial Style

DreamScene depth map with Industrial Style

Our depth map with Industrial Style

SceneCraft with Baroque Style

prompt: This is one view of a bedroom painted by baroque style.

DreamScene with Baroque Style

prompt1: Baroque style, 4k, 8k, best quality, ultra-detailed, finely detail, highres, high resolution

prompt2: A DSLR photo of an Baroque style bedroom

Ours with Baroque Style

prompt: A Baroque style entrance, beautiful floor, a window on wall, photirealistic, HD, 8k

SceneCraft depth map with Baroque Style

DreamScene depth map with Baroque Style

Our depth map with Baroque Style

Ablation Study of CTS

With the same Latent Consistency Model guidance, the SDS and ISM losses require significantly more iterations and longer training times to reach performance comparable to ours.

Latent Consistency Model + SDS

5000 epochs, batch size = 4, ~1.1 h

Latent Consistency Model + ISM

4500 epochs, batch size = 4, ~40 min (unstable)

Ours

3000 epochs, batch size = 4, ~28 min

Derivation of Flow Matching loss

The Flow Matching loss does not work in our experiments.

The main reason is that the denoising function converges faster than the noise function, so the two convergence speeds are inconsistent. We prove this conclusion below.

Concretely,

  1. Intuitively, noise matching is a local match against the next step from the current image, whereas denoising matching is a global match between the current image and the expected image. The two diverge significantly at the beginning, which leads to gradient conflicts.
  2. Theoretically, the denoising function converges faster than the noise function, so their convergence speeds are inconsistent.
  3. Although the CTS loss also consists of two terms, each term is independently controlled by its own coefficient, which regulates its convergence speed.
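
One way to see why the two matching terms can behave differently is the standard diffusion parameterization, which is an assumption here because the page does not state the noise schedule. If x_t = alpha_t * x_0 + sigma_t * eps and the network predicts eps_hat, the implied denoised estimate is x0_hat = (x_t - sigma_t * eps_hat) / alpha_t, so the denoising residual equals the noise residual scaled by -sigma_t / alpha_t. The same prediction error therefore carries a timestep-dependent weight in the two terms. The scalar sketch below checks this identity numerically; all values are illustrative.

```python
import random


def residuals(x0, eps, eps_hat, alpha, sigma):
    """Return (noise residual, denoising residual) for one scalar sample."""
    x_t = alpha * x0 + sigma * eps            # forward noising
    x0_hat = (x_t - sigma * eps_hat) / alpha  # denoised estimate implied by eps_hat
    return eps_hat - eps, x0_hat - x0


random.seed(0)
x0, eps = random.gauss(0, 1), random.gauss(0, 1)
eps_hat = eps + 0.1  # a prediction with a fixed noise error of 0.1

# Three illustrative (alpha, sigma) pairs: low, medium, and high noise levels.
for alpha, sigma in [(0.99, 0.14), (0.71, 0.71), (0.14, 0.99)]:
    r_eps, r_x0 = residuals(x0, eps, eps_hat, alpha, sigma)
    print(f"sigma/alpha={sigma / alpha:5.2f}  "
          f"noise residual={r_eps:+.3f}  denoising residual={r_x0:+.3f}")
```

The noise residual stays fixed while the denoising residual grows with sigma/alpha, which is consistent with giving each term its own coefficient as in point 3.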

The denoising function converges faster than the noise function in terms of the sample likelihood

Quantitative comparison of our method against baselines.

OOB: percentage of layouts in which objects extend beyond the room’s boundaries or intersect other objects; ORI: percentage of correctly oriented objects; FFR: furniture footprint ratio.

Score                            | Set-the-Scene       | SceneCraft          | DreamScene          | Ours
CLIP Score (style, room, prompt) | 19.50, 22.34, 23.35 | 21.12, 23.37, 23.53 | 23.85, 24.45, 23.33 | 24.10, 23.59, 24.20
Avg. CLIP Score                  | 21.73               | 22.67               | 23.87               | 23.96
Inception Score                  | 3.36                | 3.59                | 4.46                | 4.40
VQAScore                         | 0.33                | 0.65                | 0.71                | 0.82
Consistency                      | -                   | 1.55                | 1.36                | 1.12
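
The Avg. CLIP Score row agrees, to within 0.01, with the arithmetic mean of the three per-category CLIP scores (style, room, prompt) in the row above it. The short check below recomputes the mean from the table values; the dictionary layout is only for illustration.

```python
# Per-category CLIP scores (style, room, prompt) copied from the table.
clip_scores = {
    "Set-the-Scene": (19.50, 22.34, 23.35),
    "SceneCraft": (21.12, 23.37, 23.53),
    "DreamScene": (23.85, 24.45, 23.33),
    "Ours": (24.10, 23.59, 24.20),
}
# Reported averages copied from the Avg. CLIP Score row.
reported_avg = {
    "Set-the-Scene": 21.73,
    "SceneCraft": 22.67,
    "DreamScene": 23.87,
    "Ours": 23.96,
}

for method, scores in clip_scores.items():
    mean = sum(scores) / len(scores)
    diff = abs(mean - reported_avg[method])
    print(f"{method}: mean={mean:.4f}, reported={reported_avg[method]:.2f}, diff={diff:.4f}")
```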

Quantitative comparison of our layouts against baselines.

Score | AnyHome (multi-room) | Holodeck (multi-room) | InstructScene (single-room) | Architect (single-room) | Ours (multi-room)
OOB   | 0                    | 0                     | 28.57%                      | 0                       | 0
ORI   | 76.66%               | 74.35%                | 89.22%                      | 73.32%                  | 83.57%
FFR   | 7.14%                | 9.01%                 | -                           | 7.85%                   | 9.52%
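
To make the OOB and FFR definitions concrete, the sketch below computes both on toy layouts. It assumes axis-aligned rectangular footprints and takes FFR to be summed furniture area divided by floor area; the page does not state the exact formulas, so both are assumptions. ORI is omitted because it needs a judgment of which orientation counts as correct, which is not specified here.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in metres


def out_of_room(r: Rect, room: Rect) -> bool:
    """True if any part of the footprint lies outside the room rectangle."""
    return r[0] < room[0] or r[1] < room[1] or r[2] > room[2] or r[3] > room[3]


def overlap(a: Rect, b: Rect) -> bool:
    """True if the interiors of two rectangles intersect (touching edges are allowed)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def layout_is_oob(room: Rect, objs: List[Rect]) -> bool:
    """Flag a layout if any object leaves the room or any two objects intersect."""
    if any(out_of_room(o, room) for o in objs):
        return True
    return any(overlap(a, b) for i, a in enumerate(objs) for b in objs[i + 1:])


def oob_rate(layouts: List[Tuple[Rect, List[Rect]]]) -> float:
    """Percentage of layouts flagged as OOB."""
    return 100.0 * sum(layout_is_oob(room, objs) for room, objs in layouts) / len(layouts)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def ffr(room: Rect, objs: List[Rect]) -> float:
    """Furniture footprint ratio in percent (assumed: summed footprint area / floor area)."""
    return 100.0 * sum(area(o) for o in objs) / area(room)


room = (0.0, 0.0, 4.0, 3.0)
good = [(0.0, 0.0, 2.0, 1.5), (3.0, 2.0, 4.0, 3.0)]
bad = [(0.0, 0.0, 2.0, 1.5), (1.0, 1.0, 3.0, 2.0)]  # these two footprints overlap
print("OOB rate (%):", oob_rate([(room, good), (room, bad)]))
print("FFR of good layout (%):", round(ffr(room, good), 2))
```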