MI-ZO: Teaching Vision-Language Models Where to Look

Camera placement is an unglamorous way to lose an AI project.

A vision-language model may recognize doors, ladders, rocks, chairs, and surface textures perfectly well in ordinary images. Point the camera at the wrong side of an object, however, and the relevant feature disappears. Show the model eight similarly unhelpful views and it has received more data without receiving more evidence.

The usual response is to improve the model: collect 3D training data, finetune a larger architecture, or replace the system with something designed specifically for spatial reasoning. The paper behind MI-ZO takes a less expensive route. It leaves the vision-language model untouched and learns to control what the model sees.¹

That distinction matters. A 2D-trained vision-language model does not perceive a 3D scene as a continuous world. It receives a sequence of rendered viewpoints. The sequence becomes the model’s practical representation of the scene, so choosing those viewpoints is part of the reasoning system rather than a neutral preprocessing step.

MI-ZO treats this selection problem as online camera control. It combines several visual and linguistic signals, learns which combinations are associated with correct or incorrect VLM responses, suppresses redundant signals, and uses the resulting information measure to choose better camera actions.

The model’s brain remains the same. The evidence policy improves.

A capable model can still be shown useless evidence

Consider a remote inspector checking a digital reconstruction of industrial equipment. A component has a small crack on its rear surface. The VLM receives several front and side views, produces a fluent explanation, and confidently reports no visible defect.

The failure may come from three different sources:

The model cannot recognize the crack under any viewing condition.
The system never shows the model the crack.
The system shows the crack, but surrounds the useful view with enough irrelevant or confusing inputs that the final decision deteriorates.

Only the first failure necessarily calls for a better perception model. The other two concern evidence acquisition and presentation.

This is the misconception that MI-ZO usefully corrects. Moving from 2D images to 3D scenes does not always require a 3D-native foundation model or another training cycle. Where a controllable camera or renderer exists, the system can adapt by learning which views are worth presenting.

The authors formalize the task as selecting a sequence of camera actions that produces an accurate VLM response using as few views as possible. Each camera action creates another visual input and another VLM interaction. Poorly chosen actions therefore cost both time and inference capacity.
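One way to write that objective down is sketched here. The notation is an assumption of this article, not the paper's own: $a_{1:T}$ is the sequence of camera actions, $v(a_t)$ the view rendered by action $a_t$, $d$ the text description, $f$ the frozen VLM, $y$ the correct response, and $T_{\max}$ the action budget.

```latex
\min_{T,\,a_{1:T}} \; T
\quad \text{subject to} \quad
f\bigl(v(a_1), \dots, v(a_T),\, d\bigr) = y,
\qquad T \le T_{\max}
```

Because correctness cannot be guaranteed in advance, a softer reading trades expected accuracy against the number of views, with a hypothetical penalty $\lambda > 0$ per action:

```latex
\max_{T,\,a_{1:T}} \;
\Pr\bigl[\, f\bigl(v(a_1), \dots, v(a_T),\, d\bigr) = y \,\bigr] \;-\; \lambda\, T
```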

The control policy has to answer a deceptively difficult question:

Which viewpoint is likely to add useful evidence rather than repeat, obscure, or contradict what the system has already observed?

The 3D problem becomes a camera-policy problem

The paper evaluates a two-round process.

During the measurement round, the camera follows a default sequence of views. The VLM assesses the scene, and the system records its responses alongside information extracted from the visual and linguistic inputs.

During the correction round, the controller uses earlier demonstrations and measurement-round feedback to predict a more useful sequence of camera actions.

The overall loop can be summarized as:

3D scene and description
        ↓
Default viewpoints
        ↓
VLM responses and correctness feedback
        ↓
MI-ZO estimates which information sources are useful
        ↓
Controller predicts revised camera actions
        ↓
VLM receives a more informative sequence of views
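
The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_vlm`, `toy_scores`, and both function names are hypothetical, camera actions are reduced to viewpoint indices, and the information scores are hard-coded where the real system would compute them.

```python
from typing import Callable, Dict, List, Tuple


def measurement_round(default_actions: List[int],
                      ask_vlm: Callable[[int], str],
                      expected: str) -> List[Tuple[int, bool]]:
    """Show the default views and record whether each VLM response was correct."""
    return [(action, ask_vlm(action) == expected) for action in default_actions]


def correction_round(feedback: List[Tuple[int, bool]],
                     info_score: Dict[int, float],
                     budget: int) -> List[int]:
    """Rank candidate views by information score, skipping views that already failed."""
    failed = {action for action, correct in feedback if not correct}
    candidates = [action for action in info_score if action not in failed]
    return sorted(candidates, key=lambda a: info_score[a], reverse=True)[:budget]


# Toy scene: six viewpoints; the crack is visible only from view 5 (the rear).
def toy_vlm(view: int) -> str:
    return "crack" if view == 5 else "no defect"


# Hypothetical information scores (assumed values, standing in for MI-ZO output).
toy_scores = {0: 0.10, 1: 0.20, 2: 0.15, 3: 0.40, 4: 0.60, 5: 0.90}

feedback = measurement_round([0, 1, 2], toy_vlm, expected="crack")
revised = correction_round(feedback, toy_scores, budget=2)
print(feedback)  # [(0, False), (1, False), (2, False)]
print(revised)   # [5, 4]
```

The default sequence never reaches the rear view, so every measurement-round response is wrong. The correction round spends its two actions on the highest-scoring unseen views, one of which exposes the crack.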

The camera does not merely search for visually attractive images. It searches for views whose information patterns have been associated with reliable VLM decisions.

That makes the VLM’s own errors part of the controller’s training signal. A viewpoint can be visually clear yet unhelpful for a particular model and question. Conversely, an unusual angle may expose precisely the feature that changes the model’s answer.

MI-ZO rewards useful evidence and discounts repetition

The technical difficulty begins when the system tries to measure how informative a viewpoint is.

The paper extracts multiple low-cost information sources from each visual-and-language pair. These include global colour information, object-level colour information, local edge density, and linguistic signals derived from noun phrases and descriptive terms.
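
To make "low-cost information sources" concrete, here is a minimal sketch of three such signals computed with the standard library only. The function names, the tiny grayscale images, and the exact definitions are assumptions for illustration; the paper's feature definitions may differ, and object-level colour is omitted because it would need a segmentation step.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def colour_entropy(pixels: List[int]) -> float:
    """Shannon entropy (bits) of the pixel-value histogram: a global colour signal."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def edge_density(gray: List[List[int]], threshold: int = 50) -> float:
    """Fraction of horizontally adjacent pixel pairs whose intensity gap exceeds threshold."""
    pairs = 0
    edges = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            pairs += 1
            edges += abs(left - right) > threshold
    return edges / pairs if pairs else 0.0


def descriptor_overlap(description: str, vocabulary: Set[str]) -> float:
    """Share of task vocabulary terms that appear in the description."""
    tokens = set(description.lower().split())
    return len(tokens & vocabulary) / len(vocabulary)


def extract_sources(gray: List[List[int]], description: str,
                    vocabulary: Set[str]) -> Dict[str, float]:
    """Bundle the three signals for one view-and-description pair."""
    flat = [value for row in gray for value in row]
    return {
        "colour_entropy": colour_entropy(flat),
        "edge_density": edge_density(gray),
        "descriptor_overlap": descriptor_overlap(description, vocabulary),
    }


flat_view = [[10, 10, 10], [10, 10, 10]]       # featureless surface
striped_view = [[0, 255, 0], [255, 0, 255]]    # high-contrast structure
vocab = {"ladder", "doorway"}

flat_sources = extract_sources(flat_view, "a grey tower", vocab)
striped_sources = extract_sources(striped_view, "a grey tower with a ladder", vocab)
```

None of these signals requires access to the VLM. They are cheap descriptions of the input that a controller can later correlate with correct and incorrect responses.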

Using several sources should produce a richer assessment than relying on one feature alone. Yet adding variables does not guarantee a better information measure. Two sources may largely repeat each other. Another may introduce noise. In multivariate mutual-information estimation, overlapping contributions can even reduce the resulting measurement.

MI-ZO addresses this through a weighted mixture of information sources. The component weights sum to one and are adjusted using online correctness feedback. A zeroth-order optimization process proposes changes to the weights without backpropagating through the VLM or requiring access to its parameters.

The process retains weight changes when they improve the running multi-information estimate. Sources that add useful distinction gain influence. Sources whose contribution is redundant or reductive lose it.
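
The accept-if-better logic can be sketched as a derivative-free random search over weights that sum to one. Everything here is an assumption for illustration: the `separation` objective (the gap between mean mixture scores for correct and incorrect responses) is a simple stand-in for the paper's running multi-information estimate, and the sample values are invented.

```python
import random
from typing import List, Sequence, Tuple

# (source values for one view, was the VLM correct on that view?)
Sample = Tuple[Sequence[float], bool]


def normalise(weights: Sequence[float]) -> List[float]:
    """Rescale weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def mixture_score(weights: Sequence[float], sources: Sequence[float]) -> float:
    return sum(w * s for w, s in zip(weights, sources))


def separation(weights: Sequence[float], samples: Sequence[Sample]) -> float:
    """Stand-in objective: mean score on correct views minus mean score on incorrect views."""
    correct = [mixture_score(weights, s) for s, ok in samples if ok]
    wrong = [mixture_score(weights, s) for s, ok in samples if not ok]
    return sum(correct) / len(correct) - sum(wrong) / len(wrong)


def zeroth_order_step(weights: List[float], samples: Sequence[Sample],
                      rng: random.Random, step: float = 0.1) -> List[float]:
    """Propose a random perturbation; keep it only if the objective improves.

    Uses function values only: no gradients and no access to the VLM.
    """
    proposal = normalise([max(1e-6, w + rng.uniform(-step, step)) for w in weights])
    if separation(proposal, samples) > separation(weights, samples):
        return proposal
    return weights


# Invented data: source 0 tracks correctness, source 1 is constant, source 2 is misleading.
samples = [
    ([0.9, 0.5, 0.1], True),
    ([0.8, 0.5, 0.2], True),
    ([0.2, 0.5, 0.8], False),
    ([0.1, 0.5, 0.9], False),
]

rng = random.Random(0)
initial = [1 / 3, 1 / 3, 1 / 3]
weights = initial
for _ in range(300):
    weights = zeroth_order_step(weights, samples, rng)
```

Because a proposal survives only when the objective rises, the informative source ends up with more weight than the misleading one, and the weights remain a valid mixture throughout.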

The paper describes this redundancy as regret. In operational terms, regret is the cost of treating repeated or counterproductive signals as though they were new evidence.

A simple analogy is a meeting with several witnesses. Five independent observations may strengthen a decision. Five people repeating the same mistaken account do not provide five times the evidence. MI-ZO tries to learn the difference.
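
The witness analogy can be checked numerically. The sketch below uses plain empirical mutual information on a toy outcome with four equally likely values; the variable names and data are illustrative assumptions, and this is textbook information theory, not the paper's estimator.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


outcome: List[int] = [0, 1, 2, 3] * 2          # two bits of uncertainty
witness_a = [y // 2 for y in outcome]          # reports the high bit
witness_b = [y % 2 for y in outcome]           # reports the low bit

single = mutual_information(witness_a, outcome)
repeated = mutual_information(list(zip(witness_a, witness_a)), outcome)
independent = mutual_information(list(zip(witness_a, witness_b)), outcome)

print(single, repeated, independent)  # 1.0 1.0 2.0
```

Hearing the same witness twice leaves the evidence at one bit. Adding a witness who observed something different doubles it. A measure that counted the repeated testimony as new would overstate what the system knows.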

Mechanism stage	What the system observes	What it learns	Operational purpose
Source extraction	Colour, edges, object-level features, and linguistic descriptors	Candidate signals associated with a viewpoint	Represent the scene without accessing VLM internals
Weighted mixture	Multiple potentially overlapping sources	How much each source should contribute	Reduce redundancy in the information estimate
Active regret minimization	Correct and incorrect VLM responses over rounds	Which weight updates improve separation between outcomes	Adapt online from limited feedback
Zeroth-order optimization	Function values rather than gradients	Better source weights without backpropagation	Support black-box or inaccessible VLMs
Camera control	Information scores, predicted errors, and axis-level rankings	Which camera actions are likely to be useful	Prioritize views before spending more VLM calls

The theoretical analysis supports the existence of a function that bounds the negative contribution of reductive information units. The practical contribution is easier to state: the controller receives a more discriminative signal for deciding where the camera should move.

The controller converts information scores into camera actions

MI-ZO is only the measurement layer. The paper also builds a lightweight controller that converts those measurements into camera actions.

The controller uses two component models and a central unit:

One component estimates the probability of VLM errors for different viewpoints.
Another ranks camera-axis levels using scene attributes and MI-ZO scores.
The central unit combines these predictions, incorporates model-fit indicators, and updates a low-dimensional interaction matrix representing the camera-action space.

The implementation relies on polynomial regression, iterative least squares, and derivative-free estimation. It is deliberately designed for settings with limited demonstrations rather than large supervised datasets.
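
As a rough illustration of the first component only, the sketch below fits a quadratic to error rates observed at a few camera-axis levels and then ranks all levels, including unmeasured ones, by predicted error. The one-dimensional axis, the error values, and the function names are assumptions; the paper's controller also includes a ranking component and a central unit that this sketch does not attempt.

```python
from typing import List, Sequence


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int = 2) -> List[float]:
    """Least-squares polynomial fit via the normal equations (small systems only)."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y, where A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def predict_error(coeffs: Sequence[float], level: float) -> float:
    """Evaluate the fitted polynomial and clamp the result to a valid probability."""
    value = sum(c * level ** k for k, c in enumerate(coeffs))
    return min(1.0, max(0.0, value))


# Hypothetical VLM error rates observed at four of six azimuth levels.
measured_levels = [0, 1, 2, 5]
measured_error = [0.90, 0.47, 0.18, 0.15]

coeffs = fit_polynomial(measured_levels, measured_error, degree=2)
ranked = sorted(range(6), key=lambda level: predict_error(coeffs, level))
print(ranked)  # [4, 3, 5, 2, 1, 0]
```

With only four demonstrations, the fit already suggests that the two unmeasured levels are the most promising. That is the appeal of a low-parameter model in a low-data regime, and also its risk: a quadratic will extrapolate confidently whether or not the true error surface is smooth.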

This design choice is important when interpreting the comparisons. MI-ZO is not presented as a universal replacement for neural control policies. Its intended environment is one where the organization has a working VLM, a controllable 3D scene or camera, a small amount of feedback, and little appetite for another training program.

The authors report that the controller adds no GPU-processing or VRAM requirement beyond the hardware allocation used for VLM inference. That claim should be read narrowly. The full system still has to run the VLM for every selected view. Camera control is inexpensive; repeated model calls remain the dominant cost.

The diagnostic tests the mechanism before the benchmarks test the system

Before evaluating end-to-end camera control, the paper introduces a diagnostic dataset called UC-3DS-MI.

It contains 24 uniform and 24 complex scenes composed of abstract polygon objects. Using simple shapes reduces the influence of class familiarity from VLM training data. Complexity is defined through controlled differences in colour and geometry, while VLM predictions are collected across six viewpoints.

This diagnostic serves a specific purpose: it tests whether active regret minimization makes the information metric more capable of separating inputs that lead to correct VLM responses from those that lead to incorrect responses.

The result supports the proposed mechanism. Multivariate measures using active regret minimization distribute correct and incorrect responses into more identifiable regions. Single-variable measures tend to conflate the two groups. Multivariate measures without active regret also fail to benefit reliably from additional correctness feedback.

This is mechanism evidence, rather than the paper’s main application result. It indicates that MI-ZO is producing a more useful control signal. The three downstream benchmarks test whether that signal actually improves VLM decisions.

The appendix adds a useful sensitivity check through posterior concentration. Lower dispersion indicates more stable updates as the number of samples changes. For the GO-LED-OL metric, active regret reduces the reported dispersion from 298 to 23. For GH-LED, it falls from 162 to 55.

Those figures support the low-data motivation. They do not prove that MI-ZO will remain stable in every deployment environment, but they show that its advantage is visible before the camera controller reaches the final task.

Across three benchmarks, active regret is the part that moves the numbers

The paper evaluates Video-LLaMA-13B and Chat-UniVi-13B on three newly introduced benchmarks. Each benchmark isolates a different reason why viewpoint selection matters.

Benchmark	What the controller must solve	Main metric
GeoProperties-3DS	Select views that reveal distinguishing geological surface properties	Balanced error rate; lower is better
FeatureID-3DS	Find viewpoints exposing a specific feature under a limited action budget	Accuracy after five or eight camera actions
PartialView-3DS	Move around partitions that obscure objects from different viewpoints	Accuracy after eight camera actions

Geological reasoning: the best views reduce false conclusions

GeoProperties-3DS uses scenes extracted from 3D reconstructions of rocks, regolith, formations, and outcrops observed during Mars missions. Each collection contains five scenes, and the description correctly matches only one of them.

The task is difficult because a VLM may incorrectly claim that a property is present after seeing an unhelpful or ambiguous view. Balanced error rate is therefore used to capture both false positives and false negatives.
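
A short sketch shows why plain accuracy would be misleading here. It uses the standard definition of balanced error rate, the mean of the false-negative and false-positive rates, which is assumed to match the paper's usage. The counts are invented: ten collections of five scenes, each with exactly one matching scene.

```python
def balanced_error_rate(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the false-negative rate and the false-positive rate."""
    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (tn + fp)
    return 0.5 * (false_negative_rate + false_positive_rate)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    return (tp + tn) / (tp + fn + tn + fp)


# A degenerate model that always answers "property absent":
# it misses all 10 matching scenes and correctly rejects all 40 others.
always_absent = dict(tp=0, fn=10, tn=40, fp=0)

print(accuracy(**always_absent))             # 0.8
print(balanced_error_rate(**always_absent))  # 0.5
```

With one match in five, a model that never commits scores 80% accuracy while its balanced error rate sits at chance level. BER exposes the missed detections that accuracy hides.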

Method	Video-LLaMA-13B BER	Chat-UniVi-13B BER
VLM without control	59.8	61.3
Best standard controller: Extended Kalman	54.5	55.5
Poly+ZO+MI without the added MI-ZO metrics	53.8	54.7
Best active-regret variant	40.4	42.2

The best MI-ZO configuration lowers balanced error by 19.4 points for Video-LLaMA and 19.1 points for Chat-UniVi relative to the uncontrolled VLM.

More revealingly, the base Poly+ZO+MI controller reaches BER values of 53.8 and 54.7. The strongest active-regret variants reach 40.4 and 42.2. Camera control alone helps, but the large improvement arrives when the controller receives an information metric that actively discounts redundancy.

Feature identification: extra actions help only when the policy spends them well

FeatureID-3DS contains 60 scenes built from tower-like ShapeNet models. A relevant feature—such as a ladder or doorway—is visible only from selected viewpoints. The controller must prioritize those views under budgets of five and eight camera actions.

System	No control, Acc@5	Best MI-ZO, Acc@5	No control, Acc@8	Best MI-ZO, Acc@8
Video-LLaMA-13B	19.8	31.4	25.1	53.3
Chat-UniVi-13B	14.8	27.4	19.6	44.4

The result at eight actions deserves attention. Video-LLaMA’s accuracy rises from 25.1 to 53.3, while Chat-UniVi rises from 19.6 to 44.4.

The action budget alone does not explain this gain. An uncontrolled model already receives more views at eight actions, yet remains far below the MI-ZO-guided system. The value comes from allocating the available interactions toward views that expose discriminating features.
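
The arithmetic behind that claim can be read straight off the table. The snippet below only restates the reported figures; the dictionary keys and helper name are illustrative.

```python
# Accuracy figures from the FeatureID-3DS table above.
results = {
    "Video-LLaMA-13B": {"none@5": 19.8, "mizo@5": 31.4, "none@8": 25.1, "mizo@8": 53.3},
    "Chat-UniVi-13B": {"none@5": 14.8, "mizo@5": 27.4, "none@8": 19.6, "mizo@8": 44.4},
}


def budget_gain(row: dict, policy: str) -> float:
    """Accuracy points gained by moving from five to eight camera actions."""
    return round(row[f"{policy}@8"] - row[f"{policy}@5"], 1)


gains = {
    model: {"none": budget_gain(row, "none"), "mizo": budget_gain(row, "mizo")}
    for model, row in results.items()
}
print(gains["Video-LLaMA-13B"])  # {'none': 5.3, 'mizo': 21.9}
print(gains["Chat-UniVi-13B"])   # {'none': 4.8, 'mizo': 17.0}
```

Three extra actions are worth about five points without control and roughly seventeen to twenty-two points with it. One caveat: the "best MI-ZO" column is the best variant at each budget, so the guided gain compares two different metric configurations.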

The best metric also changes with the budget. GO-LED-OL_ar leads at five actions, while GH-LED_ar leads at eight. This suggests that the optimal information mixture may depend on how quickly the controller must commit to a viewpoint sequence.

Occlusion handling: the controller learns to move around the partition

PartialView-3DS contains 60 scenes in which two objects are separated by a partition. Depending on the camera position, one object is fully or partially hidden. The VLM must combine the collected views and select the correct scene summary.

Method	Video-LLaMA-13B accuracy	Chat-UniVi-13B accuracy
VLM without control	20.4	18.1
Poly+ZO+MI without added MI-ZO metrics	24.5	20.4
Best active-regret variant	39.3	35.3

Again, the base controller produces a modest improvement. Adding the active-regret information metric produces the larger step.

This pattern across all three benchmarks is the paper’s most persuasive evidence. MI-ZO is not merely placing a conventional controller beside a VLM. It is improving the signal used to decide which camera movements are worthwhile.
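
For the two benchmarks whose tables include the base controller, the improvement can be split into a camera-control step and an active-regret step. The code restates the reported numbers only; the function name is illustrative.

```python
from typing import Tuple


def decompose(uncontrolled: float, base_controller: float, active_regret: float,
              lower_is_better: bool) -> Tuple[float, float]:
    """Return (gain from base controller, further gain from active regret) in points."""
    sign = -1.0 if lower_is_better else 1.0
    base_gain = round(sign * (base_controller - uncontrolled), 1)
    regret_gain = round(sign * (active_regret - base_controller), 1)
    return base_gain, regret_gain


# GeoProperties-3DS, balanced error rate (lower is better).
geo_video = decompose(59.8, 53.8, 40.4, lower_is_better=True)
geo_chat = decompose(61.3, 54.7, 42.2, lower_is_better=True)

# PartialView-3DS, accuracy (higher is better).
partial_video = decompose(20.4, 24.5, 39.3, lower_is_better=False)
partial_chat = decompose(18.1, 20.4, 35.3, lower_is_better=False)

print(geo_video, geo_chat)          # (6.0, 13.4) (6.6, 12.5)
print(partial_video, partial_chat)  # (4.1, 14.8) (2.3, 14.9)
```

In all four cases the second step is at least about twice the size of the first, which is the basis for saying that active regret, rather than camera control as such, moves the numbers.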

The appendix tests robustness, sensitivity, and the ceiling

Several appendix experiments are easy to misread as additional headline results. Their real value is in defining what the mechanism depends on.

Test	Likely purpose	What it supports	What it does not prove
Additional colour spaces and sources	Ablation	HSV and CIELAB-derived sources are useful; adding sources beyond roughly five reaches a ceiling	More visual variables will always improve control
MI-ZO metrics added to neural controllers	Comparison with alternative implementations	The lightweight controller is well suited to the paper’s low-data regime	MI-ZO automatically improves every controller architecture
Variance over ten runs	Robustness check	Improvements are not produced by one lucky run	Performance will be equally stable in uncontrolled real-world environments
Wall-clock time by action count	Efficiency analysis	Each extra camera action adds approximately linear VLM-processing time	Overall deployment cost is negligible
Random n-grams and two-sentence descriptions	Linguistic sensitivity test	Useful language-scene alignment is necessary	The method is insensitive to prompt format
Larger object counts	Exploratory boundary test	View control helps while the underlying VLM can still reason over the scene	Camera control solves multi-object reasoning generally
Qualitative all-view failures	Hard-limit analysis	Some errors cannot be corrected through viewpoint changes	Every VLM error has a useful alternative view

The colour-space ablation is particularly instructive. Adding more sources does not continue improving performance after a point. The paper reports no further error-rate impact when the source count rises beyond five. This result is consistent with the central argument: information quantity and information usefulness are different things.

The language sensitivity test exposes another dependency. Replacing descriptions with unrelated random character n-grams causes FeatureID-3DS accuracy to collapse to roughly 15%. Extending descriptions from one sentence to two also lowers the best variants by about nine points. MI-ZO depends on the relationship between language and visual evidence; it does not treat text as decorative metadata.

Object count creates a sharper boundary. With Video-LLaMA, the best active-regret configuration scores 0.83 with two objects and 0.79 with three, but only 0.26 with four. The uncontrolled VLM also deteriorates, reaching 0.21 with four objects.

This is where the inference-time strategy reaches its ceiling. A camera controller can reveal hidden information. It cannot force the underlying VLM to reason correctly about evidence it fundamentally fails to process.

The paper’s qualitative analysis makes the same point more directly: in some scenes, the tested VLMs return incorrect answers from every available viewpoint. Once all views fail, view selection has nothing left to optimize.

The business value is better evidence allocation, not free intelligence

The paper directly demonstrates that a lightweight inference-time controller can improve two off-the-shelf 13B VLMs on constrained 3D multi-object benchmarks without finetuning their parameters.

The broader business interpretation requires one additional step.

Many applied vision systems already operate through a controllable evidence pipeline:

a camera can rotate around an inspected component;
a robot can reposition itself;
a digital twin can render arbitrary views;
a 3D asset pipeline can generate new previews;
a remote-science interface can prioritize observations for human review.

In such systems, model inference is only one decision. Another decision occurs earlier: which evidence deserves an inference call?

MI-ZO suggests that this upstream policy can be optimized using limited feedback, inexpensive scene features, and black-box access to the VLM.

Technical contribution	Operational consequence	Potential ROI pathway
No VLM finetuning	Existing models can remain in production	Avoid another training and validation cycle
Derivative-free optimization	Internal model gradients are unnecessary	Support third-party or restricted models
Active regret minimization	Repeated and misleading evidence receives less weight	Reduce wasted views and unhelpful inference calls
Action-budget performance	Useful viewpoints can be prioritized earlier	Lower latency in inspection and generation workflows
Online feedback	The policy adapts from observed errors	Improve evidence selection as operational data accumulates

The most plausible near-term use cases are those where the camera or renderer is already software-controlled and mistakes are strongly related to viewpoint selection. Digital-twin inspection, 3D asset quality assurance, remote scientific analysis, and constrained robotic observation fit this description.

The economic case should be evaluated against three quantities:

The cost of obtaining another view. Rendering a digital twin may be cheap. Repositioning a physical robot or remote instrument may not be.
The cost of another VLM interaction. The paper shows that action count is the primary driver of wall-clock time because every action triggers another conversational turn with the VLM.
The cost of an incorrect conclusion. Better viewpoint control is more valuable when false positives or missed features create expensive downstream decisions.

A deployment with negligible inference cost and no consequence from mistakes may not need a learned controller. A deployment where each observation is expensive and a missed feature matters has a much stronger case.
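
The three quantities combine into a simple expected-cost comparison. The sketch below is a back-of-envelope model, not something the paper provides: all cost values are hypothetical, and it borrows the Video-LLaMA FeatureID-3DS accuracies at eight actions as stand-ins for deployment accuracy, which a real evaluation would need to measure directly.

```python
def expected_cost(actions: int, view_cost: float, vlm_call_cost: float,
                  accuracy: float, error_cost: float) -> float:
    """Cost of acquiring and processing the views plus the expected cost of a wrong answer."""
    acquisition = actions * (view_cost + vlm_call_cost)
    return acquisition + (1.0 - accuracy) * error_cost


# Hypothetical unit costs: cheap rendering, moderate inference, expensive mistakes.
costs = dict(actions=8, view_cost=0.01, vlm_call_cost=0.05, error_cost=10.0)

uncontrolled = expected_cost(accuracy=0.251, **costs)
guided = expected_cost(accuracy=0.533, **costs)
print(round(uncontrolled, 2), round(guided, 2))  # 7.97 5.15

# When mistakes cost nothing, view selection changes nothing.
no_stakes = dict(costs, error_cost=0.0)
assert expected_cost(accuracy=0.251, **no_stakes) == expected_cost(accuracy=0.533, **no_stakes)
```

The gap between the two policies scales with the cost of an error. As that cost approaches zero, the case for a learned controller disappears, whatever the accuracy difference.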

Where MI-ZO fits—and where it will not

MI-ZO is most suitable when four conditions hold.

First, the evidence source must be controllable. The system needs a camera, renderer, robot, or observation process capable of producing alternative views.

Second, viewpoint choice must materially affect the answer. If the task concerns a property invisible from every available camera position, control cannot help.

Third, the organization needs usable correctness feedback or reliable proxy labels. MI-ZO learns online from the relationship between information measurements and VLM errors. Weak feedback will produce a weak policy.

Fourth, the chosen visual and linguistic sources must reflect the task. Colour and edge-density signals are sensible for the paper’s benchmarks. An industrial inspection system may require texture, thermal variation, depth discontinuities, or domain-specific geometry instead.

The evidence base also remains deliberately narrow. The experiments use two open-source 13B VLMs, relatively small benchmark collections, fixed camera-action spaces, and scenes containing multiple objects from the same class. The method performs poorly when object counts become too demanding, when language descriptions shift substantially, or when the VLM is wrong from every viewpoint.

These are not footnotes to be politely acknowledged and forgotten. They define the deployment boundary.

MI-ZO is an evidence-routing strategy for capable but viewpoint-sensitive models. It is not a repair mechanism for a model that lacks the required visual reasoning ability.

Better views are an engineering decision

AI teams often treat the model as the center of the system and everything before it as plumbing. MI-ZO shows why that division becomes expensive in 3D environments.

A VLM’s response depends on the evidence sequence it receives. Camera movements, observation budgets, feature measurements, and feedback loops therefore belong inside the reasoning architecture.

The paper’s strongest contribution is not simply that an information-theoretic controller improves three benchmarks. It demonstrates a practical causal chain:

redundant measurements weaken the information signal;
active regret minimization improves that signal;
the improved signal guides better camera actions;
better camera actions produce better VLM decisions;
fewer wasted actions reduce the cost of obtaining those decisions.

For organizations facing poor 3D performance, this creates a useful diagnostic order.

Before collecting another training dataset, ask whether the model was shown the relevant evidence. Before replacing the model, test whether a better observation policy changes its answer. Before adding more views, determine whether those views contribute anything new.

A larger model may still be necessary. MI-ZO simply makes sure the current one has been allowed to look in the right direction first.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jason Armitage and Rico Sennrich, “Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control,” arXiv:2512.24826, 2025. https://arxiv.org/abs/2512.24826
