Opening — Why this matters now
Single-image 3D scene generation has quietly become one of the most overloaded promises in computer vision. We ask a model to hallucinate geometry, infer occluded objects, reason about spatial relationships, and place everything in a coherent 3D world — all from a single RGB frame. When it fails, we call it a data problem. When it half-works, we call it progress.
SceneMaker makes a less fashionable but far more effective claim: the problem is architectural. If de-occlusion, geometry generation, and pose estimation are trained together, they poison each other’s learning signals. Open-set generalization collapses not because models lack capacity, but because they are forced to learn incompatible priors from the same data.
Background — Where prior methods hit a wall
Existing 3D scene generation systems tend to fall into two camps:
- Scene-native methods, trained end-to-end on indoor datasets. These work — until the objects are unfamiliar, small, or partially hidden.
- Object-native methods, which import large-scale 3D object datasets for better geometry, but still struggle with occlusion and pose under real-world clutter.
Both camps share the same structural flaw: they assume scene datasets can simultaneously provide three priors — de-occlusion, object geometry, and pose estimation. In practice, no dataset does this well at scale.
The result is familiar: warped geometry for small objects, drifting poses, and brittle performance the moment we step outside curated indoor scenes.
Analysis — What SceneMaker does differently
SceneMaker’s core contribution is not a new loss function or a bigger transformer. It is a decoupled pipeline, where each sub-task is trained on the dataset that actually contains the right priors.
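Concretely, the decoupling can be pictured as three independently trained stages composed only at inference time. A structural sketch in Python, where every callable is a placeholder standing in for a real model, not SceneMaker's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecoupledScenePipeline:
    """Three independently trained stages, composed only at inference.
    Every callable here is a placeholder, not SceneMaker's actual API."""
    deocclude: Callable          # (scene image, object mask) -> amodal object image
    generate_geometry: Callable  # amodal object image -> 3D mesh
    estimate_poses: Callable     # (scene image, meshes) -> list of (rotation, translation, size)

    def __call__(self, image, object_masks: Sequence):
        amodal = [self.deocclude(image, m) for m in object_masks]  # stage 1: image-space completion
        meshes = [self.generate_geometry(a) for a in amodal]       # stage 2: per-object geometry
        poses = self.estimate_poses(image, meshes)                 # stage 3: joint pose/size prediction
        return list(zip(meshes, poses))
```

Because each stage trains on its own data, a weak prior in one sub-task no longer contaminates the others.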
1. De-occlusion is treated as an image problem
Instead of forcing 3D models to infer what they cannot see, SceneMaker introduces a standalone de-occlusion diffusion model trained on large-scale image data.
Key design choices:
- Initialized from a powerful image-editing model, retaining its language understanding
- Fine-tuned on a custom 10K-image dataset with three realistic occlusion patterns: object masking, image-boundary cropping, and user-like brush occlusions (synthesis sketched below)
- Text-controllable, enabling semantic completion rather than geometric guessing
This alone eliminates a major failure mode of prior 3D-native approaches: learning occlusion priors from data that barely contains them.
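To make the occlusion patterns concrete, here is one plausible way to synthesize such training inputs from clean images. This is a sketch under stated assumptions: the gray fill, mask sizes, and stroke widths are illustrative choices, not the paper's recipe.

```python
import random
from PIL import Image, ImageDraw

FILL = (127, 127, 127)  # neutral gray stands in for the occluder

def occlude(img: Image.Image, mode: str, seed: int = 0) -> Image.Image:
    """Synthesize one occluded input from a clean object image,
    mirroring the three patterns named above. Parameters are
    illustrative, not SceneMaker's actual augmentation recipe."""
    rng = random.Random(seed)
    out = img.copy()
    w, h = out.size
    draw = ImageDraw.Draw(out)
    if mode == "object_mask":
        # a rectangle standing in for a foreground object
        x0, y0 = rng.randint(0, w // 2), rng.randint(0, h // 2)
        draw.rectangle((x0, y0, x0 + w // 3, y0 + h // 3), fill=FILL)
    elif mode == "boundary_crop":
        # blank a strip along one edge, as if the object left the frame
        strip = rng.randint(w // 5, w // 3)
        draw.rectangle((w - strip, 0, w, h), fill=FILL)
    elif mode == "brush":
        # a few thick strokes, mimicking a user's eraser brush
        for _ in range(3):
            pts = [(rng.randint(0, w), rng.randint(0, h)) for _ in range(4)]
            draw.line(pts, fill=FILL, width=max(8, w // 20))
    return out
```

The clean image then serves as the completion target, giving the diffusion model supervised amodal pairs at scale.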
2. Geometry generation focuses purely on shape quality
Once objects are de-occluded, SceneMaker delegates geometry generation to existing high-fidelity 3D object models. Crucially, these models now receive clean, amodal object images instead of partially visible inputs.
The impact is measurable:
| Method | Chamfer ↓ | F-Score ↑ | Volume IoU ↑ |
|---|---|---|---|
| MIDI | 0.0508 | 0.5533 | 0.4214 |
| Amodal3R | 0.0443 | 0.7124 | 0.5279 |
| SceneMaker | 0.0409 | 0.7454 | 0.5985 |
Decoupling doesn’t add complexity — it removes confusion.
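For reference, the two headline metrics are straightforward to compute. A brute-force sketch of symmetric Chamfer distance and F-score on point clouds; the threshold `tau` and the averaging convention are illustrative choices, since the post does not state the ones behind the table:

```python
import numpy as np

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05):
    """Symmetric Chamfer distance and F-score between point clouds of
    shape (N, 3) and (M, 3). Brute-force O(N*M) pairwise distances,
    fine for a sanity check on small clouds. Conventions (squared vs.
    unsquared distances, sum vs. mean) vary across papers."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    d_pred_to_gt = d.min(axis=1)   # nearest gt point for each pred point
    d_gt_to_pred = d.min(axis=0)   # nearest pred point for each gt point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```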
3. Pose estimation becomes a first-class model
Pose estimation is where SceneMaker is most opinionated.
Instead of predicting poses as an afterthought, it introduces a unified diffusion-based pose model that explicitly predicts:
- Rotation, in a continuous 6D representation (decoded as in the sketch after this list)
- Translation
- Object size
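The 6D format is presumably the continuous rotation representation of Zhou et al. (2019): two 3-vectors that Gram-Schmidt orthonormalization turns back into a rotation matrix. A minimal PyTorch sketch of that standard decoding; whether SceneMaker decodes exactly this way is an assumption:

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Decode a (..., 6) tensor into (..., 3, 3) rotation matrices via
    Gram-Schmidt, per the standard continuous 6D representation."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)                                  # first basis vector
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.linalg.cross(b1, b2)                               # completes a right-handed frame
    return torch.stack((b1, b2, b3), dim=-2)                      # orthonormal basis as matrix rows
```

The representation is preferred in pose regression because, unlike quaternions or Euler angles, it has no discontinuities for gradient descent to stumble over.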
The architectural insight lies in attention routing:
- Rotation tokens attend only to object-level features
- Translation and size tokens attend to scene-level context
- Both local (within-object) and global (cross-object) self-attention are used
This prevents the classic failure where global scene context corrupts object-centric pose variables.
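A minimal sketch of how such routing could be implemented with PyTorch's scaled-dot-product attention; the tensor layout and token granularity are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def routed_pose_attention(rot_q, trans_q, size_q, obj_feats, scene_feats):
    """One cross-attention step with routed context. All tensors are
    (batch, tokens, dim). Rotation queries see only object-level
    features; translation and size queries see scene-level context."""
    rot_out = F.scaled_dot_product_attention(rot_q, obj_feats, obj_feats)
    ts_q = torch.cat([trans_q, size_q], dim=1)
    ts_out = F.scaled_dot_product_attention(ts_q, scene_feats, scene_feats)
    trans_out, size_out = ts_out.split([trans_q.size(1), size_q.size(1)], dim=1)
    return rot_out, trans_out, size_out
```

The local and global self-attention over pose tokens would sit alongside this cross-attention step; the essential constraint is simply that rotation queries never condition on global scene layout.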
Findings — Why it works (and keeps working)
SceneMaker consistently outperforms state-of-the-art baselines on both indoor and open-set benchmarks, especially under severe occlusion.
On open-set scenes:
| Model | Chamfer (CD) ↓ | F-Score ↑ | IoU-B ↑ |
|---|---|---|---|
| PartCrafter | 0.2171 | 0.2613 | – |
| MIDI3D | 0.1425 | 0.3211 | 0.5079 |
| SceneMaker | 0.0285 | 0.6125 | 0.7549 |
Even more telling: SceneMaker trained without open-set data still beats competitors on indoor scenes. That is structural robustness, not dataset memorization.
Implications — Why this matters beyond benchmarks
SceneMaker quietly shifts the narrative around generative 3D systems:
- For AIGC pipelines: decoupling enables controllability without sacrificing realism
- For embodied AI: accurate pose and size estimation is non-negotiable for interaction
- For simulation and digital twins: open-set generalization stops being a research footnote and becomes practical
Perhaps most importantly, it suggests a broader lesson: when a model fails across domains, the fix is not always more data — sometimes it is admitting that different priors deserve different teachers.
Conclusion — A rare case of architectural humility
SceneMaker does not try to do everything at once. It does the opposite: it splits the problem until each part can be learned properly.
That restraint is precisely why it works.
Cognaptus: Automate the Present, Incubate the Future.