Opening — Why this matters now
Single-image 3D scene generation has quietly become one of the most overloaded promises in computer vision. We ask a model to hallucinate geometry, infer occluded objects, reason about spatial relationships, and place everything in a coherent 3D world — all from a single RGB frame. When it fails, we call it a data problem. When it half-works, we call it progress.
SceneMaker makes a less fashionable but far more effective claim: the problem is architectural. If de-occlusion, geometry generation, and pose estimation are trained together, they poison each other’s learning signals. Open-set generalization collapses not because models lack capacity, but because they are forced to learn incompatible priors from the same data.
Background — Where prior methods hit a wall
Existing 3D scene generation systems tend to fall into two camps:
- Scene-native methods, trained end-to-end on indoor datasets. These work — until the objects are unfamiliar, small, or partially hidden.
- Object-native methods, which import large-scale 3D object datasets for better geometry, but still struggle with occlusion and pose under real-world clutter.
Both camps share the same structural flaw: they assume scene datasets can simultaneously provide three priors — de-occlusion, object geometry, and pose estimation. In practice, no dataset does this well at scale.
The result is familiar: warped geometry for small objects, drifting poses, and brittle performance the moment we step outside curated indoor scenes.
Analysis — What SceneMaker does differently
SceneMaker’s core contribution is not a new loss function or a bigger transformer. It is a decoupled pipeline, where each sub-task is trained on the dataset that actually contains the right priors.
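Concretely, the decoupling can be pictured as three independently trained stages composed only at inference time. A structural sketch in Python, where every callable is a placeholder standing in for a real model, not SceneMaker's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecoupledScenePipeline:
    """Three independently trained stages, composed only at inference.
    Every callable here is a placeholder, not SceneMaker's actual API."""
    deocclude: Callable          # (scene image, object mask) -> amodal object image
    generate_geometry: Callable  # amodal object image -> 3D mesh
    estimate_poses: Callable     # (scene image, meshes) -> list of (rotation, translation, size)

    def __call__(self, image, object_masks: Sequence):
        amodal = [self.deocclude(image, m) for m in object_masks]  # stage 1: image-space completion
        meshes = [self.generate_geometry(a) for a in amodal]       # stage 2: per-object geometry
        poses = self.estimate_poses(image, meshes)                 # stage 3: joint pose/size prediction
        return list(zip(meshes, poses))
```

Because each stage trains on its own data, a weak prior in one sub-task no longer contaminates the others.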
1. De-occlusion is treated as an image problem
Instead of forcing 3D models to infer what they cannot see, SceneMaker introduces a standalone de-occlusion diffusion model trained on large-scale image data.
Key design choices:
- Initialized from a powerful image-editing model, retaining its language understanding
- Fine-tuned on a custom 10K-image dataset with three realistic occlusion patterns: object masking, image-boundary cropping, and user-like brush occlusions (synthesis sketched below)
- Text-controllable, enabling semantic completion rather than geometric guessing
This alone eliminates a major failure mode of prior 3D-native approaches: learning occlusion priors from data that barely contains them.
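To make the occlusion patterns concrete, here is one plausible way to synthesize such training inputs from clean images. This is a sketch under stated assumptions: the gray fill, mask sizes, and stroke widths are illustrative choices, not the paper's recipe.

```python
import random
from PIL import Image, ImageDraw

FILL = (127, 127, 127)  # neutral gray stands in for the occluder

def occlude(img: Image.Image, mode: str, seed: int = 0) -> Image.Image:
    """Synthesize one occluded input from a clean object image,
    mirroring the three patterns named above. Parameters are
    illustrative, not SceneMaker's actual augmentation recipe."""
    rng = random.Random(seed)
    out = img.copy()
    w, h = out.size
    draw = ImageDraw.Draw(out)
    if mode == "object_mask":
        # a rectangle standing in for a foreground object
        x0, y0 = rng.randint(0, w // 2), rng.randint(0, h // 2)
        draw.rectangle((x0, y0, x0 + w // 3, y0 + h // 3), fill=FILL)
    elif mode == "boundary_crop":
        # blank a strip along one edge, as if the object left the frame
        strip = rng.randint(w // 5, w // 3)
        draw.rectangle((w - strip, 0, w, h), fill=FILL)
    elif mode == "brush":
        # a few thick strokes, mimicking a user's eraser brush
        for _ in range(3):
            pts = [(rng.randint(0, w), rng.randint(0, h)) for _ in range(4)]
            draw.line(pts, fill=FILL, width=max(8, w // 20))
    return out
```

The clean image then serves as the completion target, giving the diffusion model supervised amodal pairs at scale.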
2. Geometry generation focuses purely on shape quality
Once objects are de-occluded, SceneMaker delegates geometry generation to existing high-fidelity 3D object models. Crucially, these models now receive clean, amodal object images instead of partially visible inputs.
The impact is measurable:
| Method | Chamfer ↓ | F-Score ↑ | Volume IoU ↑ |
|---|---|---|---|
| MIDI | 0.0508 | 0.5533 | 0.4214 |
| Amodal3R | 0.0443 | 0.7124 | 0.5279 |
| SceneMaker | 0.0409 | 0.7454 | 0.5985 |
Decoupling doesn’t add complexity — it removes confusion.
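For reference, the two headline metrics are straightforward to compute. A brute-force sketch of symmetric Chamfer distance and F-score on point clouds; the threshold `tau` and the averaging convention are illustrative choices, since the post does not state the ones behind the table:

```python
import numpy as np

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05):
    """Symmetric Chamfer distance and F-score between point clouds of
    shape (N, 3) and (M, 3). Brute-force O(N*M) pairwise distances,
    fine for a sanity check on small clouds. Conventions (squared vs.
    unsquared distances, sum vs. mean) vary across papers."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pred_to_gt = d.min(axis=1)   # nearest gt point for each pred point
    d_gt_to_pred = d.min(axis=0)   # nearest pred point for each gt point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```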
3. Pose estimation becomes a first-class model
Pose estimation is where SceneMaker is most opinionated.
Instead of predicting poses as an afterthought, it introduces a unified diffusion-based pose model that explicitly predicts:
- Rotation, in a continuous 6D representation (decoded as in the sketch after this list)
- Translation
- Object size
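The 6D format is presumably the continuous rotation representation of Zhou et al. (2019): two 3-vectors that Gram-Schmidt orthonormalization turns back into a rotation matrix. A minimal PyTorch sketch of that standard decoding; whether SceneMaker decodes exactly this way is an assumption:

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Decode a (..., 6) tensor into (..., 3, 3) rotation matrices via
    Gram-Schmidt, per the standard continuous 6D representation."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)                                  # first basis vector
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.linalg.cross(b1, b2)                               # completes a right-handed frame
    return torch.stack((b1, b2, b3), dim=-2)                      # orthonormal basis as matrix rows
```

The representation is preferred in pose regression because, unlike quaternions or Euler angles, it has no discontinuities for gradient descent to stumble over.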
The architectural insight lies in attention routing:
- Rotation tokens attend only to object-level features
- Translation and size tokens attend to scene-level context
- Both local (within-object) and global (cross-object) self-attention are used
This prevents the classic failure where global scene context corrupts object-centric pose variables.
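A minimal sketch of how such routing could be implemented with PyTorch's scaled-dot-product attention; the tensor layout and token granularity are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def routed_pose_attention(rot_q, trans_q, size_q, obj_feats, scene_feats):
    """One cross-attention step with routed context. All tensors are
    (batch, tokens, dim). Rotation queries see only object-level
    features; translation and size queries see scene-level context."""
    rot_out = F.scaled_dot_product_attention(rot_q, obj_feats, obj_feats)
    ts_q = torch.cat([trans_q, size_q], dim=1)
    ts_out = F.scaled_dot_product_attention(ts_q, scene_feats, scene_feats)
    trans_out, size_out = ts_out.split([trans_q.size(1), size_q.size(1)], dim=1)
    return rot_out, trans_out, size_out
```

The local and global self-attention over pose tokens would sit alongside this cross-attention step; the essential constraint is simply that rotation queries never condition on global scene layout.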
Findings — Why it works (and keeps working)
SceneMaker consistently outperforms state-of-the-art baselines on both indoor and open-set benchmarks, especially under severe occlusion.
On open-set scenes:
| Model | Chamfer (CD) ↓ | F-Score ↑ | IoU-B ↑ |
|---|---|---|---|
| PartCrafter | 0.2171 | 0.2613 | – |
| MIDI3D | 0.1425 | 0.3211 | 0.5079 |
| SceneMaker | 0.0285 | 0.6125 | 0.7549 |
Even more telling: SceneMaker trained without open-set data still beats competitors on indoor scenes. That is structural robustness, not dataset memorization.
Implications — Why this matters beyond benchmarks
SceneMaker quietly shifts the narrative around generative 3D systems:
- For AIGC pipelines: decoupling enables controllability without sacrificing realism
- For embodied AI: accurate pose and size estimation is non-negotiable for interaction
- For simulation and digital twins: open-set generalization stops being a research footnote and becomes practical
Perhaps most importantly, it suggests a broader lesson: when a model fails across domains, the fix is not always more data — sometimes it is admitting that different priors deserve different teachers.
Conclusion — A rare case of architectural humility
SceneMaker does not try to do everything at once. It does the opposite: it splits the problem until each part can be learned properly.
That restraint is precisely why it works.
Cognaptus: Automate the Present, Incubate the Future.