Opening — Why this matters now

Single-image 3D scene generation has quietly become one of the most overloaded promises in computer vision. We ask a model to hallucinate geometry, infer occluded objects, reason about spatial relationships, and place everything in a coherent 3D world — all from a single RGB frame. When it fails, we call it a data problem. When it half-works, we call it progress.

SceneMaker makes a less fashionable but far more effective claim: the problem is architectural. If de-occlusion, geometry generation, and pose estimation are trained together, they poison each other’s learning signals. Open-set generalization collapses not because models lack capacity, but because they are forced to learn incompatible priors from the same data.

Background — Where prior methods hit a wall

Existing 3D scene generation systems tend to fall into two camps:

  • Scene-native methods, trained end-to-end on indoor datasets. These work — until the objects are unfamiliar, small, or partially hidden.
  • Object-native methods, which import large-scale 3D object datasets for better geometry, but still struggle with occlusion and pose under real-world clutter.

Both camps share the same structural flaw: they assume scene datasets can simultaneously provide three priors — de-occlusion, object geometry, and pose estimation. In practice, no dataset does this well at scale.

The result is familiar: warped geometry for small objects, drifting poses, and brittle performance the moment we step outside curated indoor scenes.

Analysis — What SceneMaker does differently

SceneMaker’s core contribution is not a new loss function or a bigger transformer. It is a decoupled pipeline in which each sub-task is trained on the data that actually contains the right priors.
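To make the decoupling concrete, here is a minimal orchestration sketch. The three models are passed in as callables; the names (deocclude, generate_mesh, estimate_poses) are hypothetical stand-ins, not SceneMaker’s actual API, and only data flows between stages:

```python
def generate_scene(image, object_masks, prompts,
                   deocclude, generate_mesh, estimate_poses):
    """Hypothetical orchestration of a decoupled single-image 3D pipeline.

    Each stage is an independently trained model; no gradients or
    training data are shared between them.
    """
    amodal_crops, meshes = [], []
    for mask, prompt in zip(object_masks, prompts):
        # Stage 1: image-space de-occlusion (text-controllable diffusion)
        crop = deocclude(image, mask, prompt)
        amodal_crops.append(crop)
        # Stage 2: off-the-shelf object-level 3D generation on the clean crop
        meshes.append(generate_mesh(crop))
    # Stage 3: joint rotation/translation/size estimation with scene context
    poses = estimate_poses(image, amodal_crops)
    return list(zip(meshes, poses))
```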

1. De-occlusion is treated as an image problem

Instead of forcing 3D models to infer what they cannot see, SceneMaker introduces a standalone de-occlusion diffusion model trained on large-scale image data.

Key design choices:

  • Initialized from a powerful image editing model to retain language understanding
  • Fine-tuned on a custom 10K-image dataset with three realistic occlusion patterns: object masking, image-boundary cropping, and user-like brush occlusions (roughly synthesizable as sketched below)
  • Text-controllable, enabling semantic completion rather than geometric guessing

This alone eliminates a major failure mode of prior 3D-native approaches: learning occlusion priors from data that barely contains them.
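A minimal sketch of how those three occlusion patterns could be synthesized for fine-tuning. The rectangle sizes, band widths, and stroke counts here are our own assumptions; the paper does not specify these parameters:

```python
import numpy as np

def synthesize_occlusion(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one of three occlusion patterns; zeroed pixels stand in
    for the masked region the de-occlusion model must complete."""
    h, w = img.shape[:2]
    occluded = img.copy()
    kind = rng.choice(["object_mask", "boundary_crop", "brush"])
    if kind == "object_mask":
        # Filled random rectangle, mimicking a foreground occluder.
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        occluded[y:y + h // 3, x:x + w // 3] = 0
    elif kind == "boundary_crop":
        # Zero out a band along one edge, simulating frame truncation.
        band = int(rng.integers(h // 8, h // 3))
        occluded[:band] = 0
    else:
        # Random-walk strokes approximating user-drawn brush masks.
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(500):
            y = int(np.clip(y + rng.integers(-3, 4), 0, h - 1))
            x = int(np.clip(x + rng.integers(-3, 4), 0, w - 1))
            occluded[max(0, y - 5):y + 5, max(0, x - 5):x + 5] = 0
    return occluded
```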

2. Geometry generation focuses purely on shape quality

Once objects are de-occluded, SceneMaker delegates geometry generation to existing high-fidelity 3D object models. Crucially, these models now receive clean, amodal object images instead of partially visible inputs.

The impact is measurable:

Method       Chamfer ↓   F-Score ↑   Volume IoU ↑
MIDI         0.0508      0.5533      0.4214
Amodal3R     0.0443      0.7124      0.5279
SceneMaker   0.0409      0.7454      0.5985

Decoupling doesn’t add complexity — it removes confusion.
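For reference, the metrics in the table can be computed roughly as follows. This is a brute-force sketch; the threshold tau and the exact Chamfer variant (sum of means over unsquared distances) are assumptions, since papers differ in normalization:

```python
import numpy as np

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05):
    """Chamfer distance and F-score between (N,3) and (M,3) point clouds.

    O(N*M) pairwise version for illustration only.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    d_pred_to_gt = d.min(axis=1)  # nearest-gt distance per predicted point
    d_gt_to_pred = d.min(axis=0)  # nearest-pred distance per gt point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```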

3. Pose estimation becomes a first-class model

Pose estimation is where SceneMaker is most opinionated.

Instead of predicting poses as an afterthought, it introduces a unified diffusion-based pose model that explicitly predicts:

  • Rotation (6D; decoded as sketched after this list)
  • Translation
  • Object size
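The 6D rotation output is presumably the continuous representation of Zhou et al. (2019), which avoids the discontinuities of quaternions and Euler angles and is common in learned pose estimation. A minimal decoder from the 6D vector back to a rotation matrix:

```python
import numpy as np

def rotation_6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Map a 6D rotation vector to a 3x3 rotation matrix via
    Gram-Schmidt orthonormalization of its two halves."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1   # remove component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)           # complete the right-handed frame
    return np.stack([b1, b2, b3], axis=-1)  # columns are basis vectors
```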

The architectural insight lies in attention routing:

  • Rotation tokens attend only to object-level features
  • Translation and size tokens attend to scene-level context
  • Both local (within-object) and global (cross-object) self-attention are used

This prevents the classic failure where global scene context corrupts object-centric pose variables.
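One straightforward way to implement this routing is a boolean attention mask over the cross-attention keys. The construction below is a hypothetical sketch, not the paper’s code; it only encodes which query tokens may see which feature groups:

```python
import numpy as np

def build_cross_attention_mask(n_obj_tokens: int, n_scene_tokens: int) -> np.ndarray:
    """Routing mask for one object's pose queries.

    Rows: query tokens in order [rotation, translation, size].
    Columns: [object-level features | scene-level features].
    True = attention allowed.
    """
    n_keys = n_obj_tokens + n_scene_tokens
    mask = np.zeros((3, n_keys), dtype=bool)
    mask[0, :n_obj_tokens] = True   # rotation -> object features only
    mask[1, n_obj_tokens:] = True   # translation -> scene context
    mask[2, n_obj_tokens:] = True   # size -> scene context
    return mask
```

Masking at the attention level, rather than training separate networks per variable, keeps the pose model unified while still isolating the prior each variable needs.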

Findings — Why it works (and keeps working)

SceneMaker consistently outperforms state-of-the-art baselines on both indoor and open-set benchmarks, especially under severe occlusion.

On open-set scenes:

Model         CD ↓      F-Score ↑   IoU-B ↑
PartCrafter   0.2171    0.2613      –
MIDI3D        0.1425    0.3211      0.5079
SceneMaker    0.0285    0.6125      0.7549

Even more telling: SceneMaker trained without open-set data still beats competitors on indoor scenes. That is structural robustness, not dataset memorization.

Implications — Why this matters beyond benchmarks

SceneMaker quietly shifts the narrative around generative 3D systems:

  • For AIGC pipelines: decoupling enables controllability without sacrificing realism
  • For embodied AI: accurate pose and size estimation is non-negotiable for interaction
  • For simulation and digital twins: open-set generalization stops being a research footnote and becomes practical

Perhaps most importantly, it suggests a broader lesson: when a model fails across domains, the fix is not always more data — sometimes it is admitting that different priors deserve different teachers.

Conclusion — A rare case of architectural humility

SceneMaker does not try to do everything at once. It does the opposite: it splits the problem until each part can be learned properly.

That restraint is precisely why it works.

Cognaptus: Automate the Present, Incubate the Future.