From Seeing to Doing: Why Agentic AI Still Trips Over Reality

Tools do not make an agent; they make the failure more interesting

Camera. Browser. Crop tool. Search engine. Python sandbox.

That sounds like the beginning of an intelligent workflow. Give a multimodal model these tools, and it should move from merely seeing the world to actually doing something with it: zoom into the blurry sign, search the extracted clue, cross-check the result, and produce the answer.

Except, in practice, the agent often crops the wrong region, searches the wrong phrase, forgets which image it just created, loops through redundant checks, and then answers with the quiet confidence of someone who has spent twenty minutes rearranging papers on a desk without reading the document. Progress, yes. Intelligence, technically pending.

The arXiv paper “Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?” introduces a benchmark designed precisely for this gap between visual recognition and reliable multimodal action.¹ Its central point is not merely that multimodal agents remain imperfect. Everyone knows that. The sharper point is that final-answer accuracy hides the place where the breakdown actually happens.

A model may have visual ability. It may have web search. It may have code execution. It may even call tools enthusiastically. None of that proves it can coordinate visual evidence and external knowledge across a real workflow.

Agentic-MME therefore asks a more operational question: when an AI agent fails, did it fail because it could not see, because it did not act, because it acted on the wrong thing, because it searched badly, or because it kept acting after the useful evidence was already available?

That question is much closer to what businesses need to know before deploying multimodal agents into inspection, document review, claims processing, field operations, procurement verification, retail auditing, compliance workflows, or any task where “look at this image and find out what matters” is only the beginning.

The benchmark tests coordination, not just perception

Agentic-MME contains 418 real-world tasks across six domains and three difficulty levels. Each task gives the agent one or more images and a question. The agent can manipulate images through visual tools and, when needed, retrieve external information through open-web tools.

The paper frames two capabilities:

Visual Expansion: the agent actively transforms images to reveal hidden or hard-to-read evidence. This includes operations such as cropping, rotating, enhancing, resizing, denoising, sharpening, thresholding, edge detection, and related manipulations.
Knowledge Expansion: the agent uses web retrieval to obtain facts not contained directly in the image, such as identifying a logo, checking a place, verifying an entity, or retrieving contextual information.

The benchmark matters because these two capabilities are not tested as separate party tricks. Level-3 tasks require the agent to interleave them. The agent may need to crop a weak logo, search possible candidates, use the search result to decide what to inspect next, crop another region, and only then answer.

That is the useful part. Many existing evaluations still reward the equivalent of “recognize object in image” or “search web for fact.” Agentic-MME is closer to: “notice that the answer is hidden in the corner, extract the clue, turn it into a useful query, verify the retrieved candidate against the visual evidence, and avoid wasting the rest of the afternoon.”

Difficulty level	What it tests	Typical workflow	Why it matters operationally
Level 1	Core visual expansion	One decisive visual action, such as crop or enhancement	Tests whether the model knows when passive viewing is insufficient
Level 2	Short multi-step workflow	Several visual actions, light bookkeeping, sometimes simple search	Tests whether the model can preserve intermediate observations
Level 3	Synergistic visual-search reasoning	Iterative visual manipulation, retrieval, hypothesis checking, and cross-validation	Tests whether the agent can coordinate tools rather than merely call them

This structure is not just taxonomy. It is a mechanism map. Level 1 asks whether the model can surface evidence. Level 2 asks whether it can chain evidence. Level 3 asks whether it can use one type of evidence to guide another.

That is where agentic AI becomes difficult. Not at the moment of possessing tools, but at the moment of deciding what the tools are for.

The paper’s real contribution is process visibility

Final-answer accuracy is easy to report and dangerously comforting. If an agent gives the wrong answer, the usual benchmark tells us that it failed. Wonderful. A thermometer that only says “sick.”

Agentic-MME tries to make the failure diagnosable.

Each task is paired with a human reference trajectory: a step-by-step path showing how a competent solver would manipulate the image, search the web, inspect intermediate artifacts, and reach the final answer. Across the benchmark, the authors annotate more than 2,000 stepwise checkpoints, with an average of over 10 person-hours of manual annotation per task.

These checkpoints are scored along two axes:

S-axis: strategy and search execution. Did the agent take the right high-level action? Did it use relevant search queries? Did the retrieved results contain the needed intermediate information?
V-axis: visual evidence verification. Did the processed image actually contain the decisive cue? Cropping is not enough. The crop must reveal the right thing.
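To make the two axes concrete, here is a minimal sketch of how a checkpoint record might be stored and aggregated. The `Checkpoint` class, its field names, and the pass/fail simplification are assumptions made for illustration; the paper's actual scoring scheme may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    """One annotated step in a reference trajectory (hypothetical schema)."""
    description: str
    s_pass: Optional[bool] = None  # strategy/search axis; None = not applicable
    v_pass: Optional[bool] = None  # visual-evidence axis; None = not applicable


def axis_score(checkpoints: List[Checkpoint], axis: str) -> Optional[float]:
    """Fraction of applicable checkpoints passed on one axis ('s' or 'v')."""
    values = [getattr(c, f"{axis}_pass") for c in checkpoints]
    applicable = [v for v in values if v is not None]
    if not applicable:
        return None
    return sum(applicable) / len(applicable)


# Illustrative trace: the first crop was the right action (S passes)
# but missed the decisive cue (V fails).
trace = [
    Checkpoint("crop the logo in the top-right corner", s_pass=True, v_pass=False),
    Checkpoint("search the brand name read from the crop", s_pass=False),
    Checkpoint("crop the storefront sign to confirm", s_pass=True, v_pass=True),
]
s_score = axis_score(trace, "s")  # two of three S checkpoints passed
v_score = axis_score(trace, "v")  # one of two V checkpoints passed
```

Keeping the axes as separate fields means a trace can score well on strategy while still failing on evidence, which is the case a bare tool log would hide.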

This distinction is valuable because tool logs alone can be misleading. A model can call crop() and still crop the wrong corner. It can search the web and still search for the wrong entity. It can produce an intermediate image and still fail to read what is inside it. The paper’s evaluation design separates these cases instead of flattening them into one unhelpful “incorrect.”

The benchmark also includes an overthinking metric that penalizes redundant tool use relative to human reference trajectories. This matters because long agent traces are often mistaken for deep reasoning. Sometimes they are just expensive hesitation in serialized form.
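One simple way to express a redundancy penalty is the share of an agent's tool calls that exceed the human reference count. The formula and the function name `overthinking_penalty` below are assumptions for illustration only; the paper defines its own metric, which is not reproduced here.

```python
def overthinking_penalty(agent_calls: int, reference_calls: int) -> float:
    """Share of the agent's calls beyond the reference count (0.0 = none).

    Illustrative formula, not the paper's definition.
    """
    if agent_calls < 0 or reference_calls < 0:
        raise ValueError("call counts must be non-negative")
    if agent_calls == 0:
        return 0.0
    excess = max(0, agent_calls - reference_calls)
    return excess / agent_calls


heavy = overthinking_penalty(agent_calls=10, reference_calls=2)   # 8 of 10 calls are excess
matched = overthinking_penalty(agent_calls=2, reference_calls=2)  # no excess
sparse = overthinking_penalty(agent_calls=1, reference_calls=3)   # under-use is not penalized here
```

Note that this sketch deliberately does not penalize under-use; an agent that makes too few calls would need a separate check.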

Why tool access is not the same as agency

The most useful misconception to kill here is simple: giving an AI system tools does not automatically create agentic intelligence.

It creates the possibility of agentic behavior. The rest depends on coordination.

A multimodal agent must solve several subproblems that are easy to describe but hard to execute consistently:

Coordination layer	What can go wrong	Business translation
Action recognition	The model does not realize it needs a tool	It guesses from an unreadable document instead of zooming in
Action targeting	The model uses the tool on the wrong region	It crops the wrong product label, meter, receipt field, or damage area
Artifact faithfulness	The output artifact does not contain the required evidence	The workflow records “tool used” but the useful evidence was never captured
Search formulation	The model turns a visual clue into a poor query	It searches the vague category instead of the discriminating identifier
Cross-modal verification	The model fails to compare web results against image evidence	It accepts a plausible external fact that does not match the visual record
Execution efficiency	The model keeps exploring after sufficient evidence exists	It burns time, tokens, and tool calls without improving reliability

This is why Agentic-MME is more interesting than a leaderboard. Its design says: the unit of evaluation should not be only the final answer. The unit should be the workflow.

For business teams, this is the difference between buying a demo and buying an operational system. A demo can show that an agent has search and vision. A deployment must show that the agent knows when to use them, where to apply them, and when to stop.

The main results show a large autonomy gap

The headline result is blunt. Human solvers achieve 93.8% overall accuracy on Agentic-MME. The best reported model mode, Gemini 3 Pro in Atomic tool mode, reaches 56.3% overall accuracy. On the hardest Level-3 tasks, humans reach 82.3%, while Gemini 3 Pro reaches 33.3% in its best reported mode. In the paper’s code-generation mode, the same model’s Level-3 accuracy is reported at 23.0%.

That is not a small residual gap. That is the difference between “useful assistant with supervision” and “please do not let this thing approve field claims alone.”

The pattern is also more informative than the absolute score:

Result	Likely purpose in the paper	What it supports	What it does not prove
Human vs model accuracy	Main evidence	Frontier models remain far below human workflow reliability	It does not prove models are commercially useless
Level-wise degradation	Main evidence	Difficulty rises sharply when visual and search operations must be coordinated	It does not isolate which component fails without process metrics
Gen vs Atomic comparison	Implementation/interface comparison	Structured tool APIs often reduce interface-level errors and improve efficiency	It does not prove atomic tools are always superior for every workflow
Tool availability study	Ablation-style capability comparison	Image tools and search tools are complementary, especially on Level 3	It does not show that more tools always help
Oracle guidance study	Diagnostic intervention	Better intermediate visual cues and stepwise guidance improve performance	It does not eliminate the autonomous execution gap
Judge robustness study	Robustness/sensitivity check	S/V process scores are relatively stable across judges and human expert review	It does not remove all risks of judge-model bias

This table is important because not every experiment in the paper should be read the same way. The main result is the human-model gap under the benchmark. The tool availability study explains why both visual and search tools matter. The oracle guidance study probes whether the annotated trajectories contain useful planning information. The judge robustness study supports evaluation stability; it is not a second thesis about multimodal intelligence.

The tool ablation says synergy is real, not decorative

One of the paper’s more useful tests compares model performance under different tool availability settings: perception-only, image-only, search-only, and full image-plus-search access.

For Gemini 3 Flash, overall accuracy rises from 39.82% in perception-only mode to 52.24% with both image and search tools. For Qwen3-VL-235B, overall accuracy rises from 31.94% to 42.85% under the full setting.

The Level-3 pattern is more revealing. For Qwen3-VL-235B, image-only access actually performs worse than perception-only on Level 3: 6.25% versus 7.41%. Search-only improves to 11.11%. But full image-plus-search access reaches 19.23%.

That is the paper’s coordination argument in numeric form. Visual tools alone can be counterproductive when the agent cannot use them faithfully or verify their outputs. Search alone helps only modestly when the search query lacks the right visual clue. The two together matter when the task requires an evidence loop: visual cue → candidate search → external context → visual re-check → final answer.
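The Level-3 figures quoted above for Qwen3-VL-235B can be turned into a small worked calculation. The comparison against a simple additive expectation is a framing chosen here for illustration, not an analysis taken from the paper.

```python
# Level-3 accuracy (%) for Qwen3-VL-235B, as quoted in the text above.
level3 = {
    "perception_only": 7.41,
    "image_only": 6.25,
    "search_only": 11.11,
    "image_plus_search": 19.23,
}

base = level3["perception_only"]
image_gain = round(level3["image_only"] - base, 2)         # change from adding image tools alone
search_gain = round(level3["search_only"] - base, 2)       # change from adding search alone
full_gain = round(level3["image_plus_search"] - base, 2)   # change from adding both

# If the two tool families contributed independently, the combined gain
# would be roughly the sum of the individual gains.
additive_expectation = round(image_gain + search_gain, 2)
synergy = round(full_gain - additive_expectation, 2)
```

Under these numbers the combined gain (about 11.8 points) is far larger than the sum of the separate gains (about 2.5 points), which is what an evidence loop between the two tool families would predict.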

This is also where enterprise evaluations often under-test systems. A procurement team may test whether an agent can read invoice screenshots. A compliance team may test whether it can search external registries. The harder workflow is the bridge between them: extract the ambiguous identifier, search the right source, reconcile the returned entity with the visual artifact, and explain the basis of the answer.

The benchmark is saying: test the bridge.

The oracle study shows that planning help is useful but not sufficient

The authors also run an oracle-style guidance study. They provide models with additional ground-truth intermediate support:

Visual cues: the correct intermediate visual artifacts, such as the properly cropped region;
Stepwise guidance: checkpoint descriptions from the human reference trajectory, with answer-related keywords masked to prevent leakage.

For Gemini 3 Flash, accuracy improves from 52.24% under the full baseline to 58.37% with visual cues and 76.21% with stepwise guidance. For Qwen3-VL-235B, accuracy improves from 42.85% to 49.38% with visual cues and 72.80% with stepwise guidance.

This has two meanings.

First, the annotations are not decorative. They contain actionable planning structure. When the model receives a better route through the problem, it performs much better.

Second, the remaining gap is the more uncomfortable part. Even with guidance, performance does not saturate, especially on Level 3. The agent still has to execute calls, maintain state, interpret artifacts, manage context, and avoid compounding small mistakes.
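The size of the remaining gap can be read off the overall figures quoted above. The sketch below only rearranges those numbers; comparing against the human overall accuracy of 93.8% is a choice made here, and the Level-3 breakdown is not included because it is not quoted in this article.

```python
HUMAN_OVERALL = 93.8  # human overall accuracy (%) quoted earlier in this article

# Overall accuracy (%) per setting, as quoted above.
oracle = {
    "Gemini 3 Flash": {"full": 52.24, "visual_cues": 58.37, "stepwise": 76.21},
    "Qwen3-VL-235B": {"full": 42.85, "visual_cues": 49.38, "stepwise": 72.80},
}

summary = {}
for model, acc in oracle.items():
    summary[model] = {
        # Points gained over the full-tool baseline.
        "gain_visual_cues": round(acc["visual_cues"] - acc["full"], 2),
        "gain_stepwise": round(acc["stepwise"] - acc["full"], 2),
        # Points still missing relative to human solvers, even with guidance.
        "gap_to_human": round(HUMAN_OVERALL - acc["stepwise"], 2),
    }

for model, row in summary.items():
    print(model, row)
```

Stepwise guidance adds roughly 24 to 30 points, yet both models remain about 18 to 21 points below the human overall figure.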

In other words, a plan is not the same as an executed workflow. Many enterprise AI pilots quietly die inside that sentence.

Overthinking is not reasoning; it is often waste with better posture

The paper’s efficiency analysis is especially relevant for real deployments because agentic systems do not only fail by under-acting. They also fail by over-acting.

Human reference trajectories average 2.15 calls per task. In the reported efficiency table, several models make far more calls. GPT-5-mini in code-generation mode averages 12.13 calls per task with a high overthinking score, yet reaches only 33.5% overall accuracy. Gemini 3 Pro makes fewer calls in Atomic mode than in code-generation mode, and Atomic mode is also where it reaches the best reported overall accuracy.

The lesson is not “use fewer tools.” DeepEyesV2 under-utilizes tools and performs poorly. The lesson is that tool use has a quality frontier. Too little action means passive guessing. Too much action means noisy search, duplicated crops, budget exhaustion, and trace confusion. Reliable agents need targeted action.

This is a useful antidote to a common product demo pattern: long traces are shown as proof that the agent is thinking deeply. Sometimes a long trace is valuable. Sometimes it is a model repeatedly checking the same window because it has lost confidence in its own intermediate state. The difference should be measured, not admired.
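Measuring that difference can start with something as plain as counting exact repeats in a tool trace. The trace format below (a tool name plus a tuple of arguments) and both function names are hypothetical; real traces would also need fuzzy matching for near-duplicate crops and queries.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (tool name, hashable arguments); a hypothetical trace format.
ToolCall = Tuple[str, Tuple]


def repeated_calls(trace: List[ToolCall]) -> Dict[ToolCall, int]:
    """Return each call that appears more than once, with its count."""
    counts = Counter(trace)
    return {call: n for call, n in counts.items() if n > 1}


def redundancy_ratio(trace: List[ToolCall]) -> float:
    """Fraction of calls that exactly repeat an earlier call."""
    if not trace:
        return 0.0
    return (len(trace) - len(set(trace))) / len(trace)


trace = [
    ("crop", (0, 0, 100, 100)),
    ("search", ("acme logo",)),
    ("crop", (0, 0, 100, 100)),
    ("crop", (0, 0, 100, 100)),
    ("search", ("acme logo brand",)),
]
repeats = repeated_calls(trace)   # the same crop issued three times
ratio = redundancy_ratio(trace)   # two of five calls are exact repeats
```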

The error taxonomy names the real bottlenecks

The fine-grained error analysis classifies failures into seven modes:

missing search tools;
bad search query;
unfaithful visual tool use;
missing visual tool use;
overthinking collapse;
tool misexecution;
post-visual perception deficit.

This is one of the most business-useful parts of the paper because it maps model failure to engineering interventions.

If the model misses visual tool use, better prompting, routing, or tool-trigger policies may help. If it uses the visual tool unfaithfully, the system needs better region proposal, visual grounding, or artifact verification. If it searches badly, the bridge between visual cue extraction and query construction needs work. If it misexecutes tools, the interface should be simplified, validated, or replaced with structured APIs. If it overthinks, the agent needs stopping rules, confidence checks, or interaction budgets.
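The mapping from failure mode to intervention can be kept as data, so that labeled failures roll up into a prioritized list. This is a sketch: the seven mode names follow the list above, the interventions are the ones named in the preceding paragraph, and the two modes the text does not discuss are left empty rather than filled in. The `triage` helper is hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Tuple


class FailureMode(Enum):
    MISSING_SEARCH = "missing search tools"
    BAD_SEARCH_QUERY = "bad search query"
    UNFAITHFUL_VISUAL = "unfaithful visual tool use"
    MISSING_VISUAL = "missing visual tool use"
    OVERTHINKING = "overthinking collapse"
    TOOL_MISEXECUTION = "tool misexecution"
    POST_VISUAL_PERCEPTION = "post-visual perception deficit"


INTERVENTIONS: Dict[FailureMode, List[str]] = {
    FailureMode.MISSING_VISUAL: ["better prompting", "routing", "tool-trigger policies"],
    FailureMode.UNFAITHFUL_VISUAL: ["region proposal", "visual grounding", "artifact verification"],
    FailureMode.BAD_SEARCH_QUERY: ["improve cue-to-query construction"],
    FailureMode.TOOL_MISEXECUTION: ["simplify interface", "validate calls", "structured APIs"],
    FailureMode.OVERTHINKING: ["stopping rules", "confidence checks", "interaction budgets"],
    # Not discussed in the text above, so deliberately left empty.
    FailureMode.MISSING_SEARCH: [],
    FailureMode.POST_VISUAL_PERCEPTION: [],
}


def triage(failures: List[FailureMode]) -> List[Tuple[FailureMode, int, List[str]]]:
    """Order observed failure modes by frequency, with candidate interventions."""
    counts = Counter(failures)
    return [(mode, n, INTERVENTIONS[mode]) for mode, n in counts.most_common()]


observed = [
    FailureMode.BAD_SEARCH_QUERY,
    FailureMode.OVERTHINKING,
    FailureMode.BAD_SEARCH_QUERY,
]
plan = triage(observed)
```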

A final-answer score cannot tell you which investment to make. A process taxonomy can.

That is why the paper’s mechanism-first value is stronger than its leaderboard value. The ranking may change as models improve. The failure categories will remain useful for longer.

Atomic tools reduce some failures, but they do not solve agency

Agentic-MME supports two tool-use modes:

Gen mode (code generation): the model writes sandboxed Python code for visual transformations and calls search tools through function interfaces;
Atm mode (Atomic): the model uses structured atomic function calls for image manipulation and retrieval.

Atomic mode is operationally attractive because it removes many low-level execution errors. The paper reports that structured APIs can eliminate much of the tool-misexecution problem seen in free-form code mode. That is not surprising. If the model does not need to write brittle image-processing code, fewer things break at the syntax, file-path, and artifact-saving layer.

But the benchmark also shows that interface simplification is not enough. Atomic tools may help the model execute the requested action, but they do not guarantee that the requested action is the right one. A clean crop() call can still crop nonsense. A well-formed search query can still search the wrong entity. A perfectly logged trace can still be perfectly wrong.

This matters for product design. Enterprises should standardize tool interfaces where possible, but they should not confuse interface reliability with task reliability. The first is a prerequisite. The second requires process-level verification.
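The gap between the two kinds of reliability can be shown with a structured crop call. In this sketch, `CropCall`, `validate_crop`, and `contains_target` are hypothetical names, and the annotated target region stands in for whatever ground truth an evaluation team has; none of this is the benchmark's actual tool API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CropCall:
    """A structured crop request in pixel coordinates (hypothetical schema)."""
    left: int
    top: int
    right: int
    bottom: int


def validate_crop(call: CropCall, image_size: Tuple[int, int]) -> None:
    """Interface-level check: the box is non-empty and inside the image."""
    width, height = image_size
    if not (0 <= call.left < call.right <= width):
        raise ValueError("horizontal bounds fall outside the image or are empty")
    if not (0 <= call.top < call.bottom <= height):
        raise ValueError("vertical bounds fall outside the image or are empty")


def contains_target(call: CropCall, target: CropCall) -> bool:
    """Task-level check: does the crop fully cover the annotated evidence region?"""
    return (call.left <= target.left and call.top <= target.top
            and call.right >= target.right and call.bottom >= target.bottom)


image_size = (1000, 800)
target = CropCall(700, 50, 900, 150)   # where the decisive label actually is
agent_crop = CropCall(0, 0, 300, 300)  # well-formed, but the wrong corner

validate_crop(agent_crop, image_size)          # passes: the interface is satisfied
task_ok = contains_target(agent_crop, target)  # False: the evidence was never captured
```

The call passes every schema check and still misses the evidence, so a validation layer alone would log it as a success.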

What Cognaptus would infer for enterprise evaluation

The paper directly shows that current multimodal agents struggle on a benchmark that requires visual manipulation, web retrieval, process verification, and efficient execution. It also shows that process-aware evaluation can distinguish several bottlenecks hidden by final-answer accuracy.

From this, Cognaptus would infer a practical evaluation rule:

Do not evaluate multimodal agents only by whether they answer correctly. Evaluate whether they expose the right evidence, retrieve the right external support, preserve intermediate state, and stop after sufficient proof.

A business evaluation framework should therefore log at least five things:

Evaluation object	What to log	Why it matters
Visual actions	Tool name, target region, transformed artifact, reason for action	Prevents fake success where the tool was called but evidence was not surfaced
Search actions	Query, source, retrieved payload, expected entity or fact	Separates search failure from visual failure
Intermediate artifacts	Crops, enhanced images, downloaded images, parsed pages	Makes later audit possible
Evidence links	Which artifact or source supports which subclaim	Prevents unsupported final answers
Efficiency	Number of calls, repeated calls, stop condition	Detects overthinking and cost inflation
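As a starting point, the five evaluation objects in the table could be captured in one serializable record per task. Every class and field name below is an assumption made for illustration; teams would adapt the schema to their own tooling.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class VisualAction:
    tool: str
    target_region: List[int]  # [left, top, right, bottom] in pixels
    artifact_path: str
    reason: str


@dataclass
class SearchAction:
    query: str
    source: str
    retrieved_payload: str
    expected: str  # expected entity or fact


@dataclass
class TaskLog:
    task_id: str
    visual_actions: List[VisualAction] = field(default_factory=list)
    search_actions: List[SearchAction] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    # Maps each subclaim to the artifacts or sources that support it.
    evidence_links: Dict[str, List[str]] = field(default_factory=dict)
    total_calls: int = 0
    repeated_calls: int = 0
    stop_condition: str = ""


log = TaskLog(task_id="demo-001")
log.visual_actions.append(
    VisualAction("crop", [700, 50, 900, 150], "artifacts/crop_1.png",
                 "label text unreadable at full size")
)
log.search_actions.append(
    SearchAction("acme model 42 spec", "vendor site", "spec sheet text", "Acme Model 42")
)
log.artifacts.append("artifacts/crop_1.png")
log.evidence_links["product is Acme Model 42"] = ["artifacts/crop_1.png", "vendor site"]
log.total_calls = 2
log.stop_condition = "evidence sufficient"

serialized = json.dumps(asdict(log), indent=2)  # ready to store for later audit
```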

This is not just academic neatness. It changes procurement.

A vendor saying “our agent achieved 85% on your test set” is useful but incomplete. Better questions are: how often did it crop the correct region? How often did it retrieve the expected source? How often did it answer correctly after retrieving the wrong evidence? How often did it use ten actions where a human needed two? How often did it stop because it had evidence rather than because it ran out of budget?
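These questions translate directly into rates over logged runs. The sketch below assumes each run has already been labeled with a few hypothetical fields (`crop_correct`, `source_correct`, `calls`, `stop`); the four runs are made-up illustrative data, and the threshold of ten calls is arbitrary.

```python
from typing import Callable, Dict, List

Run = Dict[str, object]


def share(runs: List[Run],
          condition: Callable[[Run], bool],
          within: Callable[[Run], bool] = lambda r: True) -> float:
    """Fraction of runs in the `within` subset that satisfy `condition`."""
    subset = [r for r in runs if within(r)]
    if not subset:
        return 0.0
    return sum(1 for r in subset if condition(r)) / len(subset)


# Illustrative, hand-labeled runs (not real results).
runs: List[Run] = [
    {"answer_correct": True,  "crop_correct": True,  "source_correct": True,  "calls": 3,  "stop": "evidence"},
    {"answer_correct": True,  "crop_correct": True,  "source_correct": False, "calls": 11, "stop": "evidence"},
    {"answer_correct": False, "crop_correct": False, "source_correct": False, "calls": 20, "stop": "budget"},
    {"answer_correct": False, "crop_correct": True,  "source_correct": True,  "calls": 4,  "stop": "evidence"},
]

report = {
    "crop_correct_rate": share(runs, lambda r: r["crop_correct"]),
    "source_correct_rate": share(runs, lambda r: r["source_correct"]),
    # Among correct answers, how many rested on the wrong retrieved evidence?
    "correct_despite_wrong_source": share(
        runs, lambda r: not r["source_correct"], within=lambda r: r["answer_correct"]
    ),
    "heavy_call_rate": share(runs, lambda r: r["calls"] >= 10),
    "stopped_on_budget_rate": share(runs, lambda r: r["stop"] == "budget"),
}
```

A headline accuracy of 50% on these four runs would hide that half of the correct answers were reached with the wrong source.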

Those questions expose deployment risk earlier.

Where this benchmark should not be overread

Agentic-MME is a benchmark, not a deployment study. It does not prove that a specific enterprise workflow will fail at the same rate. A claims-inspection agent, warehouse-audit assistant, or real-estate document reviewer may face narrower distributions, better domain tools, private databases, cleaner images, and stricter workflow templates.

The benchmark also relies on automated judges for intermediate scoring, although the paper reports robustness checks across multiple judges and human expert review for selected evaluations. That supports consistency, but it does not make judge-based scoring metaphysically pure. Nothing involving evaluation of AI systems should be described as metaphysically pure unless one is trying to lose credibility efficiently.

There is another boundary. The tasks are designed to stress active multimodal capability. That is appropriate for the paper’s purpose, but not every business task needs Level-3 visual-search synergy. Some workflows only need structured OCR plus validation. Others need simple human-in-the-loop review. For those cases, a heavyweight agent may be unnecessary.

The right use of the paper is therefore not: “agentic AI is not ready.” That is too broad and not especially helpful.

The better use is: “when a workflow requires coordinated visual manipulation and external verification, final-answer benchmarks are too weak; evaluate the process.”

The business value is cheaper diagnosis, not prettier autonomy

Agentic AI is often sold through the language of autonomy. The system will see, think, search, decide, and act. A tidy little office worker made of tool calls. Charming. Also incomplete.

The more immediate business value is not full autonomy. It is cheaper diagnosis of partial autonomy.

If an agent can reliably crop the right region but fails at search, you know where to intervene. If it searches well but fails to compare returned facts against image evidence, you need cross-modal verification. If it works in Atomic mode but collapses in code mode, the interface is the bottleneck. If it keeps making redundant calls, the problem is not insufficient intelligence; it may be missing stopping criteria.

This is the practical message of Agentic-MME. The frontier is no longer just perception. It is controlled coordination.

Seeing is now table stakes. Doing requires a workflow that knows what evidence it needs, how to surface it, where to verify it, and when enough is enough.

Until then, agentic AI will continue to trip over reality—not because it lacks tools, but because reality has the irritating habit of requiring the right tool, at the right time, for the right reason.

Cognaptus: Automate the Present, Incubate the Future.

¹ Qianshan Wei et al., “Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?”, arXiv:2604.03016, 2026. https://arxiv.org/abs/2604.03016
