Lighting is a cruel product demo.
A relighting model can look impressive when the input is clean, the geometry is polite, the materials are obedient, and the benchmark has been assembled in the reassuringly sterile world of synthetic data. Then someone points it at a real outdoor scene: leaves moving in the wind, glass behaving like glass, the sun half-occluded by a branch, indirect light bouncing from surfaces nobody bothered to model, and the whole thing starts to look rather less like computational photography and rather more like a confident intern guessing where shadows should go.
That is the useful starting point for WildRelight, a new arXiv paper introducing a real-world benchmark for single-image relighting and a physics-guided adaptation framework built around it.[1] The paper is not interesting because it adds yet another relighting architecture to the pile. It is interesting because it asks a more operationally expensive question: do the impressive synthetic-benchmark results actually survive contact with measured outdoor light?
The answer is: not reliably. Which is inconvenient, because “not reliably” is usually where product roadmaps go to acquire a budget problem.
The benchmark result is the headline, not the model add-on
The paper’s most important evidence is the synthetic-to-real failure it exposes. WildRelight evaluates representative single-image relighting systems on outdoor scenes captured under real natural illumination. The tested models include RGBX, DiffusionRenderer, and Materialist. The first two are diffusion-based neural relighting systems trained on synthetic data; Materialist is an optimisation-based method evaluated with known ground-truth illumination in its protocol.
The headline numbers are not subtle:
| Method | Setup | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| RGBX | Zero-shot neural relighting | 15.87 | 0.4507 | 0.4917 |
| DiffusionRenderer | Zero-shot neural relighting | 22.81 | 0.6218 | 0.3927 |
| Materialist | Optimisation with ground-truth illumination | 24.19 | 0.5819 | 0.3639 |
The clean interpretation is not “Materialist beats everything.” The protocol matters. Materialist uses the known HDR environment map during optimisation, so it is not facing the same zero-shot relighting problem as the diffusion-based models. It is better read as evidence that known physical illumination helps, not as a fair consumer-product leaderboard.
The more important reading is this: models trained on synthetic worlds lose authority when asked to relight real outdoor scenes. RGBX struggles badly. DiffusionRenderer does better, but still leaves a wide gap. The qualitative examples described in the paper are exactly the sort of failure product teams should expect but rarely enjoy admitting: poor brightness reproduction, weak handling of high-frequency shadows, and difficulty with vegetation, glass, reflections, and outdoor indirect illumination.
That is the paper’s useful discomfort. It turns relighting from a generative-image problem into a measurement problem.
WildRelight measures the thing synthetic benchmarks politely avoid
WildRelight contains 30 outdoor scenes. Each scene is captured from a fixed viewpoint under 5 to 7 natural illumination conditions. For every high-resolution scene image, the authors also capture a full 360-degree HDR environment map. The scene camera is a Sony A7; the environment map camera is an Insta360 Pro 2. The point is not gadget trivia. The whole benchmark depends on making the scene image and the measured incident illumination physically correspond.
That is why the rig matters. The authors co-locate the panoramic camera’s optical centre with the entrance pupil, or no-parallax point, of the Sony A7 lens. This is not aesthetic fussiness. If the environment camera is even slightly displaced, it may see the sun while the scene camera sees the sun occluded by foliage, or vice versa. Congratulations: the benchmark now contains false shadows. One camera has measured a light source that the other camera’s view does not physically share.
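A back-of-the-envelope check makes "even slightly displaced" concrete. The sketch below is only an illustration: the 10 cm camera offset and the 2 m distance to the occluding branch are hypothetical values, not figures from the paper, and the solar disc is taken to be roughly 0.53 degrees across.

```python
import math

SUN_ANGULAR_DIAMETER_DEG = 0.53  # approximate apparent diameter of the solar disc


def parallax_shift_deg(camera_offset_m: float, occluder_distance_m: float) -> float:
    """Apparent angular shift of a nearby occluder against the (effectively
    infinitely distant) sun when the viewpoint moves sideways."""
    return math.degrees(math.atan2(camera_offset_m, occluder_distance_m))


# Hypothetical numbers for illustration only (not taken from the paper).
shift = parallax_shift_deg(camera_offset_m=0.10, occluder_distance_m=2.0)
solar_diameters = shift / SUN_ANGULAR_DIAMETER_DEG
print(f"shift = {shift:.2f} deg, about {solar_diameters:.1f} solar diameters")
```

Under these assumed numbers the branch moves by several solar diameters relative to the sun, which is more than enough to flip the sun between visible and occluded. That is the geometric reason the no-parallax alignment is not optional.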
The paper’s acquisition protocol therefore prioritises alignment over scale. That is the correct trade-off. A million loosely paired outdoor images would not solve this benchmark’s core problem. It would merely produce a larger pile of photometric ambiguity, which is how many computer vision datasets quietly become landfill with citations.
WildRelight’s design choices are unusually operational:
| Design choice | Technical reason | Business translation |
|---|---|---|
| Fixed single viewpoint | Enables pixel-aligned comparison across illumination states | Lets teams evaluate whether relighting changed the right pixels, not just whether the image looks plausible |
| Co-located HDR environment maps | Captures incident illumination at the same vantage point as the scene image | Makes lighting a measured input, not a hallucinated decoration |
| RAW linear capture and HDR synthesis | Preserves radiance relationships and highlight/shadow detail | Reduces benchmark noise from camera processing pipelines |
| Manual dynamic-element masks | Excludes wind-blown foliage, clouds, and moving regions from metrics when needed | Prevents model evaluation from being polluted by scene motion |
| Time-varying natural light | Captures real outdoor illumination changes across afternoon and sunset | Tests the part of relighting that matters outside the studio |
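The masking row is easy to make concrete. The following is a minimal sketch of a masked PSNR, assuming images are float arrays in the range 0 to 1 and the mask is a boolean array marking dynamic pixels. The function name and conventions are illustrative, not the paper's evaluation code.

```python
import numpy as np


def masked_psnr(pred, target, dynamic_mask, max_val=1.0):
    """PSNR computed over static pixels only.

    pred, target: arrays of shape (H, W, 3); dynamic_mask: (H, W) bool,
    True where the scene moved (foliage, clouds) and should be ignored.
    """
    static = ~np.asarray(dynamic_mask, dtype=bool)
    if not static.any():
        raise ValueError("mask excludes every pixel")
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    diff = pred[static] - target[static]  # shape (num_static_pixels, 3)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))


# Toy example: the prediction is perfect except in a region that moved.
target = np.full((4, 4, 3), 0.5)
pred = target.copy()
pred[:2, :2] = 0.0                      # large error, but only in the moving region
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                     # mark that region as dynamic
unmasked_score = masked_psnr(pred, target, np.zeros((4, 4), dtype=bool))
masked_score = masked_psnr(pred, target, mask)
```

In the toy example the unmasked score is dragged down by pixels the model could never have predicted, while the masked score reflects only the static scene. That is the "metric integrity" argument in miniature.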
The comparison with prior datasets clarifies the gap. Controlled light-stage datasets provide excellent material and object measurements, but they are not outdoor scenes under full natural illumination. Multi-view outdoor datasets are useful for NeRF-style reconstruction, but they are not built for single-image relighting with strict fixed-view alignment. Indoor single-view datasets often lack HDR environment maps, natural light, or precise illumination correspondence.
WildRelight is therefore best understood as measurement infrastructure. It is less glamorous than a new diffusion model. It is also more useful.
The “small dataset” objection misunderstands the job
Thirty scenes is not large by modern deep-learning standards. But the paper is explicit that WildRelight is not intended as a massive pretraining corpus. It is closer in spirit to a precision benchmark: small enough to curate carefully, strict enough to expose whether a method respects physical illumination.
This matters because real outdoor relighting is not just a data-volume problem. The signal is difficult to capture. Natural light changes over hours. Afternoon illumination evolves slowly, so the authors sample every 45 to 60 minutes. Near sunset, light intensity and chromaticity change quickly, so they sample every 10 to 15 minutes. The capture process also has to avoid pedestrians and transient objects while keeping the scene and environment map temporally close.
The supplementary analysis is important here. The authors report a median delay of 38 seconds, a mean delay of 40.14 seconds, and a maximum delay of 114 seconds between scene image and environment map capture. They then analyse solar angular displacement and argue that the resulting lighting misalignment is physically negligible for relighting; at a 256-pixel-wide environment-map resolution, even the worst-case delay corresponds to a sub-pixel shift of about 0.3 pixels.
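The sub-pixel figure can be sanity-checked with simple arithmetic. This sketch assumes the sun's apparent motion is bounded by the Earth's rotation rate (360 degrees per 24 hours, or 15 degrees per hour) and that the environment map is an equirectangular image whose width spans 360 degrees. The paper's own analysis may be more careful than this upper bound.

```python
SOLAR_RATE_DEG_PER_S = 360.0 / (24 * 3600)  # about 15 degrees per hour


def envmap_shift_px(delay_s: float, envmap_width_px: int = 256) -> float:
    """Upper-bound horizontal shift of the sun, in pixels, in an
    equirectangular environment map captured delay_s seconds late."""
    angle_deg = delay_s * SOLAR_RATE_DEG_PER_S
    return angle_deg * envmap_width_px / 360.0


# Delays reported in the supplementary analysis (seconds).
for label, delay in [("median", 38), ("mean", 40.14), ("max", 114)]:
    print(f"{label:>6}: {envmap_shift_px(delay):.2f} px")
```

With a 114-second delay the bound comes out at roughly a third of a pixel, consistent with the "about 0.3 pixels" figure quoted above.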
This appendix is not a second thesis. It is a robustness check for the acquisition protocol. Its purpose is to defend the benchmark’s central claim: the paired images and illumination maps are aligned well enough to support meaningful quantitative evaluation.
That is exactly the kind of detail that separates a benchmark from a pretty dataset.
Finetuning shows the benchmark contains learnable real-world signal
After showing the zero-shot gap, the paper asks whether WildRelight can actually help models adapt. The answer appears to be yes.
The authors finetune DiffusionRenderer using LoRA rather than full-parameter retraining. In the main table, performance improves as follows:
| DiffusionRenderer variant | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Zero-shot | 23.28 | 0.6165 | 0.3790 |
| Finetuned on WildRelight | 25.95 | 0.6687 | 0.3368 |
This is the main evidence for dataset utility. It suggests WildRelight contains the right kind of supervision: real outdoor materials, real illumination variation, and paired HDR environment maps that help a synthetic-trained model move toward real-world statistics.
It does not prove that a production relighting system can be fixed with a small weekend finetune and a hopeful Slack message. The training described in the supplementary material still uses serious compute, including a 48-hour run on a single NVIDIA H100 for the reported finetuning setup. It also depends on a carefully constructed dataset, not casual image scraping.
There is also a reporting wrinkle worth handling cleanly. The main paper reports the 23.28 to 25.95 PSNR improvement in Table 3. A supplementary paragraph later gives a different pair of numbers for the finetuning result. The defensible reading is directional rather than theological: supervised adaptation on WildRelight improves relighting quality, but the exact headline number should be taken from the main table, not over-generalised into a product promise.
In other words, the benchmark works. The invoice is still real.
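For readers who have not met LoRA, the idea behind the finetuning above is to freeze a pretrained weight matrix and learn only a low-rank correction to it. The sketch below is a generic NumPy illustration of that idea; the rank, scaling factor, and layer size are hypothetical and say nothing about the configuration the authors actually used.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus low-rank correction; the correction is zero at init.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))          # hypothetical pretrained weight
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 64))
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained one exactly, and only the small `A` and `B` matrices need gradients. That keeps the parameter count low. It does not make the 48-hour run disappear.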
DPS and TTA are a proof that the dataset enables adaptation, not a finished product
The method section introduces a reference framework combining two ideas: physics-guided inverse rendering through Diffusion Posterior Sampling, and sampling-aware temporal Test-Time Adaptation. This is the paper’s third contribution, but it should not eclipse the benchmark.
The mechanics are straightforward enough if we strip away the inevitable acronym fog.
First, DPS acts as a physical constraint during inference. The model predicts scene components, such as base colour, normal, roughness, and metallicity. These are rendered under measured illumination using a differentiable Cook–Torrance-style renderer. The rendered image is compared with the observed image, and the diffusion sampling trajectory is nudged toward physically consistent decompositions. This helps prevent the model from inventing shadows or materials that look plausible but violate the measured scene-light relationship.
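To make the "nudge" less abstract, here is a deliberately tiny stand-in. It keeps only the measurement-consistency gradient step: render the current estimate, compare it with the observation, and step against the gradient of the squared error. Everything else is an assumption made for brevity. A toy Lambertian product replaces the Cook–Torrance-style renderer, there is no diffusion model or noise schedule, and the step size is arbitrary.

```python
import numpy as np


def render(albedo, shading):
    """Toy Lambertian renderer: a stand-in for the differentiable renderer."""
    return albedo * shading


def guidance_step(albedo_est, shading, observed, step_size=0.5):
    """One gradient step on ||render(albedo) - observed||^2 w.r.t. albedo."""
    residual = render(albedo_est, shading) - observed
    grad = 2.0 * residual * shading
    return np.clip(albedo_est - step_size * grad, 0.0, 1.0)


rng = np.random.default_rng(0)
true_albedo = rng.uniform(0.2, 0.8, size=(8, 8))
shading = rng.uniform(0.5, 1.0, size=(8, 8))   # plays the role of measured illumination
observed = render(true_albedo, shading)

albedo = np.full((8, 8), 0.5)                  # initial guess from the "model"
err_before = np.abs(render(albedo, shading) - observed).mean()
for _ in range(20):
    albedo = guidance_step(albedo, shading, observed)
err_after = np.abs(render(albedo, shading) - observed).mean()
print(f"render error: {err_before:.4f} -> {err_after:.4f}")
```

In actual Diffusion Posterior Sampling the gradient is taken through the denoiser's clean-image estimate at each sampling step, so the physical constraint steers the trajectory rather than overwriting the result. The toy loop only shows why measured illumination makes that constraint informative.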
Second, temporal TTA uses WildRelight’s repeated captures of the same scene under different lighting conditions. The authors use a leave-one-lighting-out protocol: adapt using the other observed lighting states for that scene, then test on the held-out lighting. The diffusion backbone is frozen, and lightweight LoRA modules are adapted. This is parameter-efficient, but not magic. It uses extra observations from the same scene. It is not the same as taking one random phone photo and instantly relighting the universe.
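The leave-one-lighting-out protocol is simple to write down. The sketch below rotates through every lighting state of a scene as the held-out target; whether the paper rotates through all of them or holds out a particular one is an assumption here, and the capture labels are hypothetical.

```python
def leave_one_lighting_out(lighting_ids):
    """Yield (adapt_ids, held_out_id) pairs for a single scene."""
    for i, held_out in enumerate(lighting_ids):
        adapt = lighting_ids[:i] + lighting_ids[i + 1:]
        yield adapt, held_out


scene_lightings = ["t0", "t1", "t2", "t3", "t4"]  # hypothetical capture labels
splits = list(leave_one_lighting_out(scene_lightings))
for adapt, held_out in splits:
    print(f"adapt on {adapt}, test on {held_out}")
```

Each split adapts on the remaining captures of the same scene, which is exactly why the method needs more than one photograph to work.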
The ablation table shows why the combination matters:
| Configuration | Mechanism | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| Baseline | Zero-shot | 21.63 | 0.6311 | 0.3901 |
| + DPS | Inference prior | 22.58 | 0.6578 | 0.3825 |
| + TTA | Optimisation | 24.10 | 0.6451 | 0.3923 |
| + DPS & TTA | Constrained adaptation | 25.04 | 0.6829 | 0.3453 |
The pattern is more informative than the absolute score. DPS alone gives a modest but consistent improvement, suggesting that physical guidance helps anchor the decomposition. TTA alone gives a larger PSNR boost but slightly worsens LPIPS, implying a classic optimisation problem: the model gets better at matching pixels while drifting away from perceptual naturalness. The combined method performs best across the reported metrics because DPS constrains TTA. The model adapts, but not quite so freely that it starts optimising itself into ugliness. A rare victory for restraint.
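The pattern is easier to see as differences from the baseline. This snippet only restates the ablation table's numbers as deltas; nothing in it is new data.

```python
# (PSNR, SSIM, LPIPS) copied from the ablation table above.
ablation = {
    "Baseline":    (21.63, 0.6311, 0.3901),
    "+ DPS":       (22.58, 0.6578, 0.3825),
    "+ TTA":       (24.10, 0.6451, 0.3923),
    "+ DPS & TTA": (25.04, 0.6829, 0.3453),
}

base = ablation["Baseline"]
deltas = {
    name: tuple(round(value - ref, 4) for value, ref in zip(values, base))
    for name, values in ablation.items()
    if name != "Baseline"
}

for name, (d_psnr, d_ssim, d_lpips) in deltas.items():
    # Higher PSNR/SSIM is better; lower LPIPS is better.
    print(f"{name:<12} PSNR {d_psnr:+.2f}  SSIM {d_ssim:+.4f}  LPIPS {d_lpips:+.4f}")
```

The only positive LPIPS delta belongs to TTA alone, which is the perceptual drift described above; adding DPS turns it into the largest LPIPS reduction in the table.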
The result also nearly reaches the supervised finetuning reference in PSNR: 25.04 versus 25.95. That is useful evidence. But the boundary is important. This is a proof-of-concept for instance-specific adaptation using temporally aligned real observations. It is not yet a latency-optimised relighting engine ready for mobile deployment, AR headsets, real-estate platforms, or virtual production pipelines.
The experimental pieces have different jobs
A useful way to read the paper is to separate the experiments by purpose. Otherwise, the benchmark, the finetuning result, and the proposed method blur into one general “it works” claim, which is emotionally satisfying and analytically lazy.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Zero-shot SOTA benchmark | Main evidence | Synthetic-trained relighting systems struggle on real outdoor illumination | That any one architecture is permanently inferior |
| Supervised LoRA finetuning | Dataset utility test | WildRelight contains learnable real-world signal | That adaptation is cheap enough for all deployment settings |
| DPS-only ablation | Ablation | Physical rendering consistency improves inference | That physics guidance alone solves relighting |
| TTA-only ablation | Ablation | Scene-specific temporal adaptation improves photometric fit | That pixel-level adaptation preserves perceptual quality |
| DPS + TTA | Main method evidence | Physical constraints and temporal adaptation are complementary | That the method is production-ready |
| Temporal alignment appendix | Robustness check | Capture delays are unlikely to invalidate illumination pairing | That all outdoor dynamics are solved |
| RAW/HDR and rig justification | Implementation detail with benchmark consequences | High-fidelity capture is necessary for physical evaluation | That every commercial product needs this exact rig |
That distinction is not academic housekeeping. It changes the business interpretation. The benchmark result tells teams that their synthetic validation may be misleading. The finetuning result says real measured data can close part of the gap. The DPS/TTA result says temporally repeated observations can be exploited without full retraining. These are related claims, not interchangeable ones.
The business lesson is measurement before magic
For computational photography, AR, virtual production, real-estate imagery, e-commerce visuals, and creative tools, the paper’s message is blunt: believable relighting requires control over the light signal. Generative models can fill in texture, hallucinate plausible shadows, and produce images that look expensive in a slide deck. But if the task is physically meaningful relighting, plausibility is not enough. The model must know how illumination actually changed.
This creates a practical hierarchy.
At the lowest level, a company can use synthetic benchmarks to prototype architectures. That is fine. Synthetic data is convenient, scalable, and wonderfully cooperative. It just should not be mistaken for field validation.
At the next level, teams need real-world benchmarks with measured illumination. WildRelight shows what such a benchmark needs: HDR environment maps, radiometric consistency, fixed viewpoints, careful camera alignment, and masks for dynamic regions. This is not optional ceremony. It is how one prevents the benchmark from testing dataset noise instead of model behaviour.
At the highest level, deployment systems may use test-time or scene-specific adaptation. For example, a property-visualisation workflow could capture the same space at multiple times of day, then adapt a relighting model to that property’s geometry and materials. A virtual production team could build calibrated lighting capture into location scouting. An e-commerce or product-imaging pipeline could use controlled-but-real illumination sweeps to adapt models to specific material classes. These are not claims the paper directly proves. They are reasonable operational inferences from the mechanism it demonstrates.
The uncomfortable conclusion is that high-quality relighting may be less about a single heroic model and more about capture discipline. Naturally, this is less fashionable than saying “foundation model,” but buildings, glass, trees, and sunlight remain stubbornly unimpressed by fashion.
What WildRelight does not settle
The limitations are not footnotes. They determine where the paper can and cannot be used.
First, WildRelight is outdoor-focused and relatively small. Its strength is precision, not coverage. It tells us a great deal about aligned outdoor relighting, but not necessarily about every indoor, product, human, automotive, or mixed-reality scenario.
Second, the setup assumes fixed viewpoints across illumination changes. That is ideal for evaluation and adaptation, but many commercial workflows involve moving cameras, changing layouts, people entering scenes, or handheld capture. The benchmark deliberately controls viewpoint so that illumination can be evaluated cleanly. Real users, tragically, continue to move around.
Third, dynamic elements are masked rather than modelled. This is sensible for metric integrity, but it means the benchmark partly sidesteps the harder problem of relighting moving vegetation and clouds. Water surfaces and dynamic reflections/refractions are handled differently: the supplementary material explicitly notes that they are not masked in the same way because they are complex and central to the relighting challenge. That is not a flaw; it is a boundary.
Fourth, the adaptation method uses temporal observations from the same scene. The phrase “single-image relighting” describes the underlying task, but the proposed TTA framework benefits from multiple lighting states for adaptation. For business users, that means the method is most relevant when the capture workflow can gather repeated observations or calibrated lighting variation. It is less directly applicable to one-off user-generated photos.
Fifth, the evaluation uses global least-squares intensity alignment before computing metrics. This is a reasonable response to scale ambiguity in single-image relighting, but it also means the reported metrics should be read as assessing relative illumination structure and appearance after global intensity correction, not as proof that the model perfectly predicts absolute exposure.
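A minimal sketch shows what such an alignment does. It assumes a single global scalar fitted by least squares; whether the paper fits one scalar or one per colour channel is not specified here, so treat that as an assumption.

```python
import numpy as np


def align_intensity(pred, target):
    """Return (s * pred, s) where s minimises ||s * pred - target||^2."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    denom = np.sum(pred * pred)
    scale = float(np.sum(pred * target) / denom) if denom > 0 else 1.0
    return scale * pred, scale


rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(4, 4, 3))
too_dark = target / 2.5                       # right structure, wrong exposure
aligned, scale = align_intensity(too_dark, target)
residual = np.abs(aligned - target).max()
print(f"fitted scale = {scale:.2f}, residual after alignment = {residual:.2e}")
```

A prediction that is uniformly 2.5 times too dark is scored as perfect after alignment. That is the intended behaviour under scale ambiguity, and also the reason the metrics say nothing about absolute exposure.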
Finally, computational efficiency remains open. The authors present DPS and sampling-aware TTA as a demonstration of what the dataset enables. They do not present it as a finished low-latency product architecture. Anyone translating this into an app, platform, or pipeline would need to solve runtime, memory, capture usability, and failure-detection issues. The research has opened the door. It has not installed the elevator.
The real contribution is a better way to be wrong
WildRelight matters because it gives relighting research a sharper failure surface. Synthetic benchmarks often let models be wrong in ways that remain visually tolerable. Real aligned illumination data is less forgiving. It asks whether the shadow moved because the light moved, whether glass reflects the right environment, whether foliage occlusion is physically consistent, and whether the model has learned anything about outdoor light beyond the average aesthetic of a training set.
That is valuable for research. It is even more valuable for business. Companies do not just need models that produce attractive outputs; they need to know when those outputs are physically grounded, when they are merely plausible, and when the benchmark has been too polite to reveal the difference.
The paper’s best insight is therefore not that DPS plus TTA improves PSNR. It is that real-world relighting becomes tractable only when the data collection process respects the physics of the task. Measured light, aligned viewpoints, HDR capture, and temporal variation are not supporting details. They are the product boundary.
The industry has spent several years learning that generative models can make images look real. WildRelight asks a stricter question: can they make images respond to reality?
That is a much better question. Less glamorous, perhaps. But glamour has always been lighting-dependent.
Cognaptus: Automate the Present, Incubate the Future.
[1] Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, and Jeppe Revall Frisvad, “WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting,” arXiv:2605.11696, 12 May 2026, https://arxiv.org/abs/2605.11696.