SceneMaker: When 3D Scene Generation Stops Guessing

A chair behind a table is not half a chair

A single image can be a very rude input.

It shows the front of a room, hides the back of objects, compresses depth into pixels, and then asks a model to produce a coherent 3D scene. The model must decide what the hidden side of a chair looks like, how large the chair is, whether it sits behind the table or intersects with it, and where everything belongs in 3D space. Naturally, when the result looks wrong, we often blame “weak 3D generation.”

SceneMaker makes a more useful argument: the bottleneck is not only object generation. It is the fact that many systems ask one pipeline to learn three different priors at once — what hidden object parts should look like, how object geometry should be generated, and how objects should be placed in a scene.¹ That is not a heroic end-to-end learning problem. It is a traffic accident with gradients.

The paper’s main contribution is a decoupled framework for open-set 3D scene generation from a single image. SceneMaker separates the task into three modules: de-occlusion, 3D object generation, and pose estimation. Each module learns from the kind of data that actually contains the right signal. Image data teaches occlusion patterns. 3D object data teaches geometry. Scene data teaches relative pose and scale.

That sounds almost too simple. It is not. The useful idea is not “split the pipeline” in the lazy engineering sense. The useful idea is that open-set 3D scene generation fails when the wrong prior is learned from the wrong dataset.

The real bottleneck is not the 3D object generator alone

The easy misconception is that 3D scene generation mainly improves when object generators become better. Better meshes, better textures, better shape priors — all useful. But a scene is not a folder of pretty assets.

A scene also needs amodal reasoning: the ability to infer the full object from a partially visible view. It needs pose reasoning: the ability to place the object with the right rotation, translation, and size. It needs inter-object consistency: the sofa and the table cannot both occupy the same physical space unless the model is secretly designing modern art.

SceneMaker frames the problem as a missing-prior problem:

Required prior	What it means	Why common data sources struggle
De-occlusion prior	Inferring hidden object parts from visible evidence and context	3D datasets are smaller and less diverse in real-world occlusion patterns
Object geometry prior	Generating complete 3D shapes	Large-scale 3D object datasets help, but often focus on isolated objects
Pose and size prior	Placing generated objects coherently in scene space	Scene datasets are limited, often indoor-heavy, and less open-set

This matters because prior methods tend to lean on one insufficient source. Scene-native approaches learn from scene datasets, but those datasets are limited in object diversity. Object-native approaches benefit from large 3D object datasets, but still inherit weak de-occlusion and pose priors. SceneMaker’s redirection is structural: do not make one data source pretend to be three.

The framework starts with scene perception. It uses segmentation to identify object masks, estimates depth, projects visible pixels into 3D point clouds, de-occludes object images, generates 3D objects from those completed images, estimates pose and size, and finally composites the scene. The paper uses existing components where appropriate, including Grounded-SAM for segmentation and MoGe for depth estimation. The novelty is not that every wheel is newly invented. Blessedly.

The novelty is that the wheels are attached to the right vehicle.
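
The step that projects visible pixels into 3D point clouds can be made concrete with a small sketch. It assumes a pinhole camera, and the function name `backproject_mask` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift one object's visible pixels into camera-space 3D points.

    Assumes a pinhole camera (an illustrative assumption, not a detail
    from the paper). depth is an (H, W) depth map, mask is a boolean
    (H, W) array marking the object's visible pixels.
    """
    v, u = np.nonzero(mask)            # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) partial point cloud

# Toy input: a flat surface 2 units away, with a 2x2 object mask.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = backproject_mask(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

The result is only the visible shell of the object, which is exactly why the later modules still have to supply the hidden side and the final pose.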

De-occlusion is treated as an image task, not a 3D miracle

The first major module is a standalone de-occlusion model. Instead of asking a 3D generator to produce complete geometry from a partly hidden crop, SceneMaker first tries to reconstruct the complete object image.

This is a practical decision. Image datasets are broader than 3D datasets, especially for occlusion patterns. Real images contain objects blocked by other objects, clipped by image borders, degraded by small scale, and interrupted by arbitrary shapes. A 3D object dataset may know what a clean chair looks like. It may not know what a chair looks like when half of it is hidden behind a table in a messy room.

SceneMaker initializes the de-occlusion module from Flux Kontext and fine-tunes it on a curated 10K triplet dataset: masked image, text prompt, and target image. The paper designs three occlusion patterns: object cutouts without background to simulate object occlusion, right-angle cropping to simulate image borders, and random brush strokes to simulate user-prompt-style missing regions. It also randomizes object and image sizes to simulate small objects and low-resolution cases.
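
The three occlusion patterns are easy to picture as mask generators. The sketch below is a rough illustration, not the authors' data pipeline: a filled ellipse stands in for a real object cutout, and every function name and parameter is hypothetical.

```python
import numpy as np

def ellipse_occluder(h, w, rng):
    """Stand-in for an object cutout: a random filled ellipse."""
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = rng.uniform(0, h), rng.uniform(0, w)
    ry, rx = rng.uniform(h / 8, h / 3), rng.uniform(w / 8, w / 3)
    return ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0

def right_angle_crop(h, w, rng):
    """Simulate an image border: hide everything past a random row and column."""
    mask = np.zeros((h, w), dtype=bool)
    mask[rng.integers(h // 2, h):, :] = True
    mask[:, rng.integers(w // 2, w):] = True
    return mask

def brush_strokes(h, w, rng, steps=40, radius=3):
    """Random-walk brush: stamp a square brush along a wandering path."""
    mask = np.zeros((h, w), dtype=bool)
    y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
    for _ in range(steps):
        y = int(np.clip(y + rng.integers(-4, 5), 0, h - 1))
        x = int(np.clip(x + rng.integers(-4, 5), 0, w - 1))
        mask[max(0, y - radius):y + radius + 1,
             max(0, x - radius):x + radius + 1] = True
    return mask

def occlude(image, rng):
    """Return (masked_image, hidden_mask) using one randomly chosen pattern."""
    h, w = image.shape[:2]
    patterns = [ellipse_occluder, right_angle_crop, brush_strokes]
    hidden = patterns[rng.integers(0, 3)](h, w, rng)
    masked = image.copy()
    masked[hidden] = 0
    return masked, hidden

rng = np.random.default_rng(0)
target = np.ones((64, 64, 3), dtype=np.float32)   # the complete object image
masked, hidden = occlude(target, rng)             # the model's masked input
```

Pairing `masked` with a text prompt and the untouched `target` gives the shape of one training triplet as the article describes it.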

The de-occlusion evidence has two layers. First, the authors compare the de-occlusion model against BrushNet and Flux Kontext on a 1K-image validation set spanning more than 500 classes. SceneMaker’s de-occlusion model reports higher PSNR, SSIM, and CLIP score:

Method	PSNR ↑	SSIM ↑	CLIP ↑	Likely purpose of test
BrushNet	11.07	0.6760	0.2659	Comparison with image inpainting
Flux Kontext	13.91	0.7309	0.2674	Comparison with the initialization model
SceneMaker de-occlusion	15.03	0.7566	0.2698	Main evidence for the fine-tuned de-occlusion prior
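
For readers who do not handle PSNR daily, here is the standard definition as a minimal sketch. The paper's exact preprocessing (value range, resizing, color space) is not stated in this article, so `max_val=1.0` is an assumption.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((8, 8))
pred = np.full((8, 8), 0.1)
value = psnr(pred, target)  # MSE = 0.01, so 10 * log10(1 / 0.01) = 20 dB
```

Because the scale is logarithmic, a gain of about one decibel is a real reduction in pixel error, not a rounding artifact.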

The second layer is more important for 3D: does better de-occlusion actually help object generation under occlusion? On occluded 3D object generation, SceneMaker improves over MIDI and Amodal3R:

Method	Chamfer Distance ↓	F-Score ↑	Volume IoU ↑	What it supports
MIDI	0.0508	0.5533	0.4214	Scene-generation baseline under occlusion
Amodal3R	0.0443	0.7124	0.5279	3D-native occluded object baseline
SceneMaker	0.0409	0.7454	0.5985	De-occlusion improves downstream object completeness

This is not just a “higher score” story. The operational meaning is that the 3D generator receives a cleaner object-level input. It no longer has to infer missing visual evidence and generate geometry in the same breath. The pipeline removes ambiguity before geometry generation instead of worshipping ambiguity as an end-to-end feature.
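
The two geometry metrics in that table can be sketched on point clouds as follows. Conventions vary between papers (squared versus unsquared distances, sampling density, F-Score threshold), so this is one common variant rather than the paper's evaluation code.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the distance to its nearest point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""
    return nearest_distances(pred, gt).mean() + nearest_distances(gt, pred).mean()

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = (nearest_distances(pred, gt) < threshold).mean()
    recall = (nearest_distances(gt, pred) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pred = gt + np.array([0.0, 0.0, 0.1])    # every point shifted 0.1 along z
cd = chamfer_distance(pred, gt)           # 0.1 + 0.1 = 0.2
fs = f_score(pred, gt, threshold=0.2)     # all points matched, so 1.0
```

A missing back half of a chair hurts recall directly, which is why completing the object before generation shows up in F-Score as well as in Chamfer Distance.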

A side benefit is controllability. Because de-occlusion is image-and-text conditioned, the paper shows examples where prompts can control hidden areas, such as changing the color of a pot or the object held by a penguin. This is an exploratory extension rather than the paper’s core quantitative claim. Still, it points to a business-relevant feature: controllable 3D asset completion is more useful than passive reconstruction.

Geometry generation becomes easier when it stops doing everyone else’s job

SceneMaker does not present itself as a new universal 3D object generator. It delegates object generation to existing image-to-3D methods after de-occlusion. This is a sensible design choice.

The hidden lesson is that object geometry models are already relatively strong when the object is visible and isolated. Their failure under scene conditions is partly an input problem. If a small lamp is half occluded and badly cropped, the 3D generator is not merely generating a lamp; it is guessing what the segmentation model missed, what the occluder covered, and how the object should be completed.

By moving de-occlusion upstream, SceneMaker lets the geometry module specialize. It can do what it is good at: turning a cleaner object image into a 3D asset.

For production use, this decomposition matters. A pipeline with separable modules is easier to debug than a monolith. If the chair shape is wrong, inspect de-occlusion and object generation. If the chair is beautiful but floating above the floor, inspect pose estimation. If every object is perfect but the scene is physically absurd, congratulations, you have found the next research problem.

That is not a joke. It is the difference between a demo and an operational system.

Pose estimation is where SceneMaker becomes more than asset assembly

The most important part of SceneMaker is arguably the pose estimation module. In a scene, pose is not just a final transform applied to an object. It determines whether the generated 3D world is coherent.

SceneMaker uses a diffusion-based pose estimation model that directly predicts rotation, translation, and size. Size is important because modern 3D object generators often output normalized objects in canonical space. A generated mug and a generated sofa can both be “complete” in their own coordinate systems. The scene still needs to know which one is supposed to be large.
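
A small sketch shows why size has to be predicted alongside rotation and translation. The transform order (scale, then rotate, then translate), the per-axis size vector, and the yaw-only rotation are simplifying assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation about the vertical (y) axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def place_object(canonical_points, rotation, translation, size):
    """Map canonical-space points into the scene: scale, rotate, translate."""
    scaled = canonical_points * size          # per-axis size, shape (3,)
    return scaled @ rotation.T + translation

# The same canonical corner becomes a mug-sized or a sofa-sized point.
corner = np.array([[0.5, 0.5, 0.5]])
mug = place_object(corner, yaw_rotation(0.0),
                   np.array([0.0, 0.8, 0.0]), np.array([0.1, 0.1, 0.1]))
sofa = place_object(corner, yaw_rotation(np.pi / 2),
                    np.array([2.0, 0.0, 1.0]), np.array([2.0, 0.9, 1.0]))
```

Both objects start from identical normalized geometry. Only the predicted pose and size tell the scene that one sits on a table and the other fills a wall.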

The model’s attention design is the key mechanism. Each object is represented through separate tokens for rotation, translation, size, and geometry. Local self-attention lets variables inside the same object interact. Global self-attention lets objects in the same scene interact, supporting coherent relative placement. Cross-attention is deliberately decoupled: rotation attends to canonical object-level conditions, while translation and size attend to scene-level conditions.

That routing is not decorative architecture. It encodes a judgment about what information each pose variable should use.

Rotation can often be estimated from object-centric evidence. Translation and size need scene context. If every token attends to everything, the model may gain flexibility but lose discipline. SceneMaker’s design says: let the chair know about the room when deciding where it sits, but do not let room-level clutter corrupt every object-centric rotation cue.
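
The routing can be written down as two masks and a lookup table. This is a layout sketch only, with hypothetical names and no learned weights. The article does not say where the geometry token cross-attends, so it is left unrouted here.

```python
import numpy as np

TOKEN_TYPES = ["rotation", "translation", "size", "geometry"]

# Which condition each pose token cross-attends to, per the description above.
# The geometry token's cross-attention is not described, so it has no entry.
CROSS_ATTENTION_SOURCE = {
    "rotation": "object",      # canonical object-level condition
    "translation": "scene",    # scene-level condition
    "size": "scene",
}

def build_attention_layout(num_objects):
    """Return (local_mask, global_mask, routes) for a scene's token sequence."""
    owner = np.repeat(np.arange(num_objects), len(TOKEN_TYPES))
    # Local self-attention: a token only sees tokens of the same object.
    local_mask = owner[:, None] == owner[None, :]
    # Global self-attention: every token sees every token in the scene.
    global_mask = np.ones_like(local_mask)
    routes = [CROSS_ATTENTION_SOURCE.get(t) for t in TOKEN_TYPES * num_objects]
    return local_mask, global_mask, routes

local_mask, global_mask, routes = build_attention_layout(num_objects=2)
```

With two objects, `local_mask` is block-diagonal with two 4x4 blocks, while `global_mask` is fully dense. The discipline lives in `routes`, not in the self-attention masks.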

The authors also build a synthetic open-set scene dataset. They curate about 90K usable Objaverse models and compose 200K scenes, each with 2 to 5 randomly selected objects, rendered from 20 viewpoints. The reported total is 8 million rendered images; since 200K scenes at 20 viewpoints each comes to 4 million views, that figure presumably counts more than one rendered image per view. The dataset is designed not as a photorealistic world simulator, but as a source of diverse pose mappings across open-set object geometries.
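
Composing such a scene can be sketched with rejection sampling. The article states the object count (2 to 5), the 20 viewpoints, and, in its limitations section, that objects sit on a shared plane without bounding-box intersections. Everything else below is an assumption: footprint sizes, scene extent, and evenly spaced camera azimuths.

```python
import random

def overlaps(a, b):
    """Axis-aligned footprint overlap; boxes are (x_min, z_min, x_max, z_max)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def compose_scene(footprints, rng, extent=4.0, max_tries=1000):
    """Place (width, depth) footprints on a shared ground plane without overlap."""
    placed = []
    for width, depth in footprints:
        for _ in range(max_tries):
            x = rng.uniform(-extent, extent - width)
            z = rng.uniform(-extent, extent - depth)
            box = (x, z, x + width, z + depth)
            if not any(overlaps(box, other) for other in placed):
                placed.append(box)
                break
        else:
            raise RuntimeError("could not place object without overlap")
    return placed

rng = random.Random(0)
num_objects = rng.randint(2, 5)                      # 2 to 5 objects per scene
footprints = [(rng.uniform(0.3, 1.5), rng.uniform(0.3, 1.5))
              for _ in range(num_objects)]
scene = compose_scene(footprints, rng)
viewpoints = [360.0 * k / 20 for k in range(20)]     # 20 azimuths, assumed even
```

The simplicity is the point: the layout does not need to be photorealistic, it needs to produce many distinct pose and size combinations.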

This is the paper’s second major practical insight: open-set pose estimation needs scene-level synthetic diversity, not just better object meshes.

The benchmark results support the decomposition, not just the leaderboard

The main quantitative results compare SceneMaker with prior scene generation methods on the MIDI test set, indoor severe-occlusion scenes, and open-set scenes.

On the MIDI test set, SceneMaker reports the best overall performance among the listed baselines:

Method	CD-S ↓	F-Score-S ↑	CD-O ↓	F-Score-O ↑	IoU-B ↑
MIDI	0.080	0.5019	0.103	0.5358	0.518
SceneMaker	0.051	0.5642	0.0963	0.6544	0.671

This is the main evidence: the full framework improves scene-level and object-level metrics on an established benchmark.
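
IoU-B rewards getting placement and size right together. A minimal sketch for axis-aligned 3D boxes follows. Whether the paper's boxes are axis-aligned or oriented is not stated in this article, so treat that as an assumption.

```python
def box_volume(box):
    """Volume of a box given as ((x0, y0, z0), (x1, y1, z1)); empty boxes give 0."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)

def box_iou(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(a[0][i], b[0][i]) for i in range(3))
    hi = tuple(min(a[1][i], b[1][i]) for i in range(3))
    inter = box_volume((lo, hi))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

gt = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
pred = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))   # right size, shifted half a unit in x
iou = box_iou(pred, gt)                      # intersection 0.5, union 1.5
```

A correctly sized object shifted by half its width already drops to an IoU of one third, so the metric punishes placement errors that a per-object shape score would never see.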

The open-set and severe-occlusion results are more revealing. On open-set scenes, SceneMaker’s full model substantially improves scene-level Chamfer Distance, scene F-Score, and bounding-box IoU over MIDI3D and PartCrafter:

Method	Open-set CD-S ↓	Open-set F-Score-S ↑	Open-set IoU-B ↑	Interpretation
PartCrafter	0.2171	0.2613	—	Struggles in open-set scenes
MIDI3D	0.1425	0.3211	0.5079	Better, but still limited
SceneMaker without open-set data	0.1538	0.4644	0.6248	Architecture helps, but open-set data matters
SceneMaker full model	0.0285	0.6125	0.7549	Synthetic open-set scene data sharply improves generalization

This table is where the paper’s mechanism-first story becomes stronger. SceneMaker without the open-set dataset is already competitive in some ways because the framework is better organized. But the full model is much stronger on open-set scenes because pose estimation has finally seen enough diverse object-scene combinations.

The lesson is not “synthetic data solves open-set 3D.” That would be too convenient. The lesson is narrower and more useful: synthetic data helps when it is aimed at the right missing prior. Here, it is not used to teach everything. It is used to teach pose and size generalization across diverse object geometries.

The ablations are not a second thesis; they tell us which module pays rent

The paper’s ablations should be read by purpose, not as a pile of bonus tables.

The component contribution table in the supplementary material is especially useful because it starts from a baseline and adds modules incrementally.²

Variant	CD-S ↓	F-Score-S ↑	CD-O ↓	F-Score-O ↑	IoU-B ↑	Likely purpose
Baseline	0.1501	0.3429	0.3623	0.2171	0.6448	Starting point
+ Open-set data	0.0387	0.5247	0.3419	0.2704	0.6948	Tests synthetic data contribution
+ Decoupled de-occlusion	0.0363	0.5662	0.0707	0.5752	0.7020	Tests de-occlusion contribution
+ Attention mechanisms	0.0285	0.6125	0.0671	0.5948	0.7549	Tests full pose model contribution

The biggest object-level jump arrives when decoupled de-occlusion is added. CD-O falls from 0.3419 to 0.0707, and object F-Score rises from 0.2704 to 0.5752. That is not a small architectural garnish. It is the evidence that occlusion handling was indeed a bottleneck for object geometry.

The final attention mechanisms further improve the full system, especially scene-level placement and bounding-box IoU. That supports the claim that pose estimation needs more careful attention routing.
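
Reading the table step by step is simple arithmetic, and a few lines make the attribution explicit. The numbers are copied from the table above; the helper names are illustrative.

```python
ABLATION = [
    ("Baseline", {"CD-S": 0.1501, "F-Score-S": 0.3429, "CD-O": 0.3623,
                  "F-Score-O": 0.2171, "IoU-B": 0.6448}),
    ("+ Open-set data", {"CD-S": 0.0387, "F-Score-S": 0.5247, "CD-O": 0.3419,
                         "F-Score-O": 0.2704, "IoU-B": 0.6948}),
    ("+ Decoupled de-occlusion", {"CD-S": 0.0363, "F-Score-S": 0.5662, "CD-O": 0.0707,
                                  "F-Score-O": 0.5752, "IoU-B": 0.7020}),
    ("+ Attention mechanisms", {"CD-S": 0.0285, "F-Score-S": 0.6125, "CD-O": 0.0671,
                                "F-Score-O": 0.5948, "IoU-B": 0.7549}),
]

def step_deltas(rows):
    """Change in every metric contributed by each added component."""
    deltas = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        deltas.append((name, {m: round(cur[m] - prev[m], 4) for m in cur}))
    return deltas

def biggest_step(rows, metric, lower_is_better):
    """Name of the component that moved the given metric the most in the good direction."""
    sign = -1.0 if lower_is_better else 1.0
    return max(step_deltas(rows), key=lambda item: sign * item[1][metric])[0]

best_for_objects = biggest_step(ABLATION, "CD-O", lower_is_better=True)
best_for_scene = biggest_step(ABLATION, "CD-S", lower_is_better=True)
best_for_boxes = biggest_step(ABLATION, "IoU-B", lower_is_better=False)
```

The three answers differ: open-set data drives the largest CD-S drop, de-occlusion drives the largest CD-O drop, and the attention mechanisms give the largest IoU-B gain, narrowly ahead of open-set data. Because the components are added cumulatively, these are incremental contributions in this particular order, not isolated effects.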

The main paper also ablates global self-attention, local self-attention, and local cross-attention in the pose model using ground-truth meshes to reduce geometry confounding. This is an ablation, not the headline result. Its purpose is to isolate whether pose architecture matters when object geometry is not the limiting factor. Most metrics degrade when attention components are removed, although individual metrics are not perfectly monotonic. That is normal. Multi-metric 3D evaluation rarely behaves like a morality play.

The robustness tests in the supplement serve a different purpose. The authors perturb segmentation masks and depth maps to simulate perception noise. SceneMaker degrades but remains reasonably strong: CD-S rises from 0.0285 to 0.0302 under segmentation noise and to 0.0297 under depth perturbation; IoU-B drops from 0.7549 to 0.7197 and 0.6993, respectively. This is a robustness/sensitivity test. It supports tolerance to moderate perception noise. It does not prove safety under arbitrary segmentation failure, bad lighting, reflective surfaces, or industrial-grade chaos.
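
Expressed as relative changes, the degradation is easier to judge. The figures come from the paragraph above; the code is just the arithmetic.

```python
def relative_change(clean, perturbed):
    """Fractional change from the clean value to the perturbed value."""
    return (perturbed - clean) / clean

CLEAN = {"CD-S": 0.0285, "IoU-B": 0.7549}
PERTURBED = {
    "segmentation noise": {"CD-S": 0.0302, "IoU-B": 0.7197},
    "depth perturbation": {"CD-S": 0.0297, "IoU-B": 0.6993},
}

report = {
    condition: {metric: relative_change(CLEAN[metric], value)
                for metric, value in values.items()}
    for condition, values in PERTURBED.items()
}
```

That works out to roughly a 6% and 4% increase in CD-S, and roughly a 5% and 7% drop in IoU-B, for segmentation noise and depth perturbation respectively. Depth errors hurt box overlap more than they hurt Chamfer Distance.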

The “complete point cloud” result is another distinct category. With complete point clouds, performance improves dramatically, with CD-S reported at 0.0064 and F-Score-S at 0.9197. This is an upper-bound or exploratory extension. It suggests that video or multi-view inputs could improve pose estimation by providing richer scene structure. It does not mean the current single-image system already has that information.

Business value: cheaper scene reconstruction, not magical world understanding

The business interpretation should stay sober. SceneMaker directly shows improved open-set 3D scene generation under the paper’s benchmark settings. Cognaptus’ inference is that this kind of decomposition can reduce the cost and friction of generating usable 3D scenes from images.

The likely practical pathways are clear:

Use case	What SceneMaker-like systems could improve	What remains uncertain
3D asset creation	Faster conversion from scene images to editable object-level assets	Quality control, mesh usability, artist workflow integration
Real-to-sim pipelines	More reliable reconstruction of cluttered scenes for simulation	Physical accuracy, dynamics, contact forces
Synthetic training data	More diverse generated scenes for perception models	Domain gap and evaluation validity
Embodied AI simulation	Better object geometry, scale, and pose for interaction environments	Whether generated scenes support decision-making and control
Digital twins	Cheaper initialization from visual evidence	Accuracy requirements for engineering or facility use

The most attractive business point is not that SceneMaker makes prettier 3D scenes. Pretty scenes are useful, but pretty lies are still lies.

The stronger point is diagnostic separability. A decoupled system can expose where failure happens. For companies building AI content tools, robotics simulators, product visualization systems, or synthetic data pipelines, that matters more than benchmark aesthetics. If a system fails, you need to know whether the failure came from segmentation, depth, de-occlusion, object generation, or pose estimation. Otherwise, debugging becomes corporate astrology.

There is also a cost boundary. The supplement reports inference on a single HGX A100 80GB GPU. Segmentation takes about 1 second, depth about 0.4 seconds, de-occlusion about 20 seconds per object, 3D generation about 10 seconds per object, and pose estimation about 12 seconds. The authors report total scene generation around 40 seconds, with de-occlusion as the VRAM bottleneck requiring at least 35GB. De-occlusion and 3D generation can be parallelized, but this is still not a lightweight edge pipeline.
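
A back-of-the-envelope latency model helps translate those stage timings into a budget. It assumes stages simply add up and that the two per-object stages can run for several objects at once when memory allows. Both are simplifying assumptions of this sketch, as is the `parallel_objects` parameter.

```python
STAGE_SECONDS = {
    "segmentation": 1.0,
    "depth": 0.4,
    "de-occlusion": 20.0,      # per object
    "3d generation": 10.0,     # per object
    "pose estimation": 12.0,
}
PER_OBJECT = {"de-occlusion", "3d generation"}

def scene_latency(num_objects, parallel_objects=1):
    """Estimate end-to-end seconds under a simple additive model (an assumption)."""
    waves = -(-num_objects // parallel_objects)   # ceiling division
    total = 0.0
    for stage, seconds in STAGE_SECONDS.items():
        total += seconds * waves if stage in PER_OBJECT else seconds
    return total

one_object = scene_latency(1)                          # 43.4 seconds
four_serial = scene_latency(4)                         # 133.4 seconds
four_parallel = scene_latency(4, parallel_objects=4)   # 43.4 seconds
```

The single-wave figure of about 43 seconds is in line with the reported total of around 40 seconds. Processing four objects one after another would push this model past two minutes, and running them concurrently multiplies a memory requirement that already starts at 35GB.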

So the near-term business fit is not “instant 3D reconstruction on every phone.” It is more likely offline or semi-automated content production, simulation asset generation, research tooling, and high-value workflows where a few dozen seconds per scene is acceptable.

The limits are physical, interactive, and downstream

SceneMaker is careful about its limitations, and those limits are not decorative.

First, physical plausibility remains hard. The dataset places objects on a shared plane and avoids bounding-box intersections, but real scenes involve support relations, stacking, contact, deformation, and force interactions. A generated object can look plausible and still be physically wrong. For embodied AI, that difference matters. A robot cannot grasp a visually convincing impossibility.

Second, control remains limited. The paper shows text-controllable de-occlusion for object parts, but broader scene control through rich natural language is not solved. Users may want to specify relationships: “put the mug on the left side of the laptop, half behind the notebook, handle facing outward.” SceneMaker points in that direction, but it does not fully deliver that interface.

Third, downstream decision-making is unresolved. A generated high-quality 3D scene is not automatically useful for embodied planning. The model can produce assets and poses; it does not prove that an agent can reason safely or effectively inside the generated scene.

These limitations do not weaken the paper’s core contribution. They locate it. SceneMaker is a strong step toward open-set single-image 3D scene generation. It is not a complete real-world simulator, not a physics engine, and not an embodied intelligence stack. There, said once, precisely. No need to sprinkle caution confetti over every paragraph.

The broader lesson: assign each prior to the data that can teach it

SceneMaker is valuable because it resists the temptation to solve a structured problem with an unstructured wish.

A single image does not contain enough visible information. A scene dataset does not contain enough open-set object diversity. A 3D object dataset does not contain enough scene-level pose structure. An image dataset does not contain full 3D relationships. The right response is not to pretend one dataset will magically cover all missing priors. The right response is to build a system where each prior is learned from the source that can actually teach it.

That is why the mechanism-first interpretation matters. The paper is not just saying “our model is better.” It is saying that open-set 3D scene generation becomes more tractable when de-occlusion, geometry, and pose stop interfering with one another.

For business readers, the lesson travels beyond 3D. When an AI workflow keeps failing across domains, the answer may not be a larger model, more generic data, or another end-to-end slogan. Sometimes the answer is to identify which priors are being confused, separate them, and train each part with the evidence it deserves.

SceneMaker stops asking a 3D model to guess what an image model could infer, what an object generator could build, and what a pose model should place.

That is not less ambitious. It is just less confused.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, and Lei Zhang, “SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model,” arXiv:2512.10957, 2025. https://arxiv.org/html/2512.10957
² Supplementary material for “SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model,” CVPR 2026 supplemental PDF. https://openaccess.thecvf.com/content/CVPR2026/supplemental/Shi_SceneMaker_Open-set_3D_CVPR_2026_supplemental.pdf
