Video is where AI demos go to become expensive.
A model can describe a short clip. It can answer a question about a few sampled frames. It can even sound confident while doing so, which is apparently a product feature now. But business video work is rarely “what is happening in this five-second clip?” It is usually messier: find the exact moment in a two-hour training recording, count repeated actions without double-counting adjacent clips, verify whether an event appears in audio, subtitles, and frames, or decide whether a safety incident is real rather than just visually similar to one.
The obvious response is to ask for more context. More frames. Longer video windows. Larger multimodal models. More heroic GPU invoices.
The paper “ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding” argues that this is only part of the problem, and not the most interesting part.[1] Its core claim is operational: video agents fail not just because they cannot “see” enough, but because they do not have a good way to route, transform, verify, and aggregate evidence once they have seen something.
That distinction matters. A video agent is not merely a vision-language model with a play button attached. It is closer to an evidence-processing workflow: retrieve candidate moments, inspect clips, compare modalities, merge overlapping detections, count events, discard distractors, and stop when the answer is justified. In other words, less magic eyeball, more competent operations clerk. Less glamorous, more useful. How tragic for the keynote slide.
The paper is really about evidence orchestration, not just video perception
ReTool-Video has three main contributions, but they are best read as one mechanism.
First, the authors build MVTL, a MetaAug-Video Tool Library with 134 registered tools. These include 26 base tools for multimodal evidence acquisition and 108 meta tools for operations such as filtering, aggregation, reranking, formatting, temporal merging, and lightweight computation.
Second, they introduce ReTool-Video, a recursive tool-using framework. Instead of forcing every planner action to match an existing executable tool, the system allows the planner to express a higher-level video intent. If that intent does not match a registered tool, a resolver translates it through parameter repair, tool substitution, or decomposition into lower-level tool calls.
Third, they train the planner with reinforcement learning, using planner-token-masked GRPO. The resolver, tool outputs, and environment text are masked out of the policy loss, so training focuses on the planner’s global decisions: what evidence to request, when to delegate, how to judge sufficiency, and when to stop.
The business translation is simple enough, though not necessarily cheap: do not treat video AI as one model call. Treat it as a managed evidence pipeline.
That is the article’s organizing idea. If we start with the benchmark table, we get a familiar story: model A beats model B. Useful, but thin. If we start with the mechanism, the paper becomes more interesting. It gives a design pattern for enterprise video intelligence: base evidence tools, meta-processing tools, a planner, a resolver, a shared observation format, failure typing, caching, and runtime budgets.
This is not a chatbot watching YouTube. It is a small bureaucracy for visual evidence. Finally, bureaucracy gets its revenge.
MVTL separates seeing evidence from processing evidence
The key move in MVTL is the separation between base tools and meta tools.
Base tools acquire evidence. They touch the video, image, audio, transcript, scene graph, or other structured representation. Meta tools work on intermediate outputs. They filter, merge, count, rank, reformat, group, threshold, or compute over what previous tools have returned.
That sounds mundane until we remember what video questions often require. Suppose a user asks: “How many times does the worker clean the toilet?” A raw clip-level tool may identify four positive clips. But if those clips are adjacent fragments of two continuous events, the answer is not four. The agent needs temporal sorting and segment merging. That is not perception in the narrow sense. It is bookkeeping over perception.
MVTL makes that bookkeeping explicit.
| Layer | What it does | Examples from the paper’s design | Business analogue |
|---|---|---|---|
| Base tools | Acquire multimodal evidence | Video/visual inspection, retrieval/search, audio/speech access, frame or clip analysis, structured video access | Search cameras, inspect frames, read transcripts, retrieve candidate windows |
| Meta tools | Process intermediate results | Filtering, aggregation, reranking, temporal merging, formatting, counting, lightweight computation | Deduplicate incidents, merge adjacent detections, rank evidence, prepare audit records |
| Runtime-control tools | Support execution rather than direct reasoning | Tool retrieval, memory folding, context compression, exception recovery, scheduling assistance | Workflow plumbing, cache management, failure recovery, system governance |
The registry design is important. Each tool has metadata: name, natural-language description, tags, input schema, output schema, availability conditions, and runtime constraints. This allows the runtime to validate calls, retrieve relevant tools, normalize outputs, and apply bounded repair or fallback when something goes wrong.
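To make the registry idea concrete, here is a minimal sketch of a schema-carrying tool entry and a call validator. It is an illustration under assumptions, not the paper's implementation: `ToolSpec`, `Registry`, the `clip_qa` tool, and the `max_seconds` default are hypothetical names and values, and availability conditions are left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Hypothetical registry entry; fields mirror the metadata listed above."""
    name: str
    description: str
    tags: list[str]
    input_schema: dict[str, type]    # parameter name -> expected Python type
    output_schema: dict[str, type]
    run: Callable[..., dict[str, Any]]
    max_seconds: float = 30.0        # illustrative runtime constraint (assumed value)


class Registry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def validate(self, name: str, args: dict[str, Any]) -> list[str]:
        """Return a list of problems; an empty list means the call is executable."""
        spec = self._tools.get(name)
        if spec is None:
            return [f"unknown tool: {name}"]
        problems = []
        for param, expected in spec.input_schema.items():
            if param not in args:
                problems.append(f"missing argument: {param}")
            elif not isinstance(args[param], expected):
                problems.append(f"wrong type for {param}")
        for param in args:
            if param not in spec.input_schema:
                problems.append(f"unexpected argument: {param}")
        return problems


registry = Registry()
registry.register(ToolSpec(
    name="clip_qa",                  # hypothetical base tool
    description="Answer a question about one clip.",
    tags=["base", "visual"],
    input_schema={"start": float, "end": float, "question": str},
    output_schema={"answer": str},
    run=lambda start, end, question: {"answer": "unknown"},
))

ok = registry.validate("clip_qa", {"start": 0.0, "end": 5.0, "question": "Is anyone cleaning?"})
bad = registry.validate("clip_qa", {"start": 0.0})
```

Here `ok` is an empty list and `bad` names the two missing arguments. A list of typed problems, rather than a bare exception, is what gives a repair step something to work with.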
This is where the paper quietly becomes more enterprise-relevant than a normal video benchmark paper. A tool library without schemas is a script drawer. A tool library with routing, validation, normalized observations, execution traces, and failure states starts to resemble a governable system.
MVTL also supports dual-level access. The structured level gives the agent preprocessed evidence such as subtitles, captions, event descriptions, and knowledge graphs. The raw-evidence level gives timestamped frames and clips. The intended pattern is efficient narrowing first, faithful verification second.
That pattern is familiar from document AI and search systems. You retrieve candidate evidence cheaply, then inspect the original source when precision matters. Video just makes the cost of being sloppy more obvious.
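The narrow-then-verify pattern can be sketched in a few lines. This is a toy under stated assumptions: the caption keyword scoring stands in for whatever retrieval the structured level really uses, and `verify_raw` is a hypothetical callable representing an expensive check against the original clip.

```python
from typing import Callable


def narrow_then_verify(
    segments: list[dict],
    keywords: list[str],
    verify_raw: Callable[[float, float], bool],
    top_k: int = 5,
) -> list[tuple[float, float]]:
    """Cheap search over structured captions first, raw-clip verification second."""

    def score(seg: dict) -> int:
        text = seg["caption"].lower()
        return sum(kw.lower() in text for kw in keywords)

    # Stage 1 (structured level): keep the best-scoring candidate segments.
    candidates = sorted(
        (s for s in segments if score(s) > 0), key=score, reverse=True
    )[:top_k]
    # Stage 2 (raw level): confirm each candidate against the source clip.
    return [(s["start"], s["end"]) for s in candidates
            if verify_raw(s["start"], s["end"])]


# Invented example data: one true hit and one visually similar distractor.
segments = [
    {"start": 0.0, "end": 10.0, "caption": "a person mops the floor"},
    {"start": 10.0, "end": 20.0, "caption": "worker cleaning the toilet"},
    {"start": 20.0, "end": 30.0, "caption": "a toilet in an empty room"},
]
confirmed = narrow_then_verify(
    segments,
    keywords=["toilet", "cleaning"],
    verify_raw=lambda start, end: (start, end) == (10.0, 20.0),  # stub verifier
)
```

The caption search surfaces two candidates, and only the one the stub verifier accepts survives, so `confirmed` is `[(10.0, 20.0)]`.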
Recursive grounding closes the gap between user intent and executable tools
The second mechanism is recursive grounding.
In a flat tool-use setup, the planner must choose a registered tool directly. If the planner wants to “compare whether these adjacent clips are one continuous event,” but no exact tool exists for that intent, the system either chooses a poor substitute, produces invalid parameters, or stops too early. Flat interfaces punish natural reasoning intents because those intents are often not executable units.
ReTool-Video changes the interface. The planner can output three broad action types: primitive executable actions, abstract video-reasoning intents, or a final answer. The runtime first checks whether an action directly matches a registered tool and satisfies its schema. If yes, it executes the tool. If no, it treats the action as an abstract intent and sends it to the resolver.
The resolver has a narrower job. It does not decide whether the whole task is complete. It only grounds the current action. It may repair missing parameters, search for a semantically similar tool, or decompose the intent into lower-level tool calls.
A simplified execution loop looks like this:
Question + video context
↓
Root planner decides what evidence is missing
↓
Action is checked against MVTL registry
↓
If executable: run tool directly
If abstract: send to resolver
↓
Resolver repairs, substitutes, or decomposes into executable calls
↓
Tool outputs become normalized observations
↓
Root planner judges whether evidence is sufficient
↓
Finish only when the root planner answers
The restriction that only the root planner can produce the final answer is not decorative. It prevents a local resolver from confusing “I completed this sub-action” with “the entire question is solved.” In production terms, it separates local workflow completion from global decision authority. That is the sort of separation that looks boring until a system starts confidently answering from half-processed evidence.
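The loop above can be sketched as a pair of small functions. This is a simplified reading, not the authors' code: `plan`, `resolve`, and the action dictionaries are hypothetical interfaces, and registry matching is reduced to a name lookup with no schema check. The round and depth limits reuse the budget figures the paper reports.

```python
from typing import Any, Callable, Optional

Action = dict[str, Any]   # {"name": str, "args": dict} or {"final": str}


def ground(action: Action, tools: dict[str, Callable[..., Any]],
           resolve: Callable[[Action], list[Action]],
           depth: int = 0, max_depth: int = 5) -> list[dict[str, Any]]:
    """Execute an action, recursing through the resolver if it is not a registered tool."""
    name, args = action["name"], action["args"]
    if name in tools:
        return [{"tool": name, "args": args, "result": tools[name](**args), "status": "ok"}]
    if depth >= max_depth:
        return [{"tool": name, "args": args, "result": None, "status": "unresolved"}]
    observations: list[dict[str, Any]] = []
    for sub in resolve(action):      # resolver repairs, substitutes, or decomposes
        observations.extend(ground(sub, tools, resolve, depth + 1, max_depth))
    return observations


def run_agent(plan: Callable[[list], Action], tools, resolve,
              max_rounds: int = 15) -> Optional[str]:
    observations: list[dict[str, Any]] = []
    for _ in range(max_rounds):
        action = plan(observations)
        if "final" in action:        # only the root planner may finish
            return action["final"]
        observations.extend(ground(action, tools, resolve))
    return None                      # budget exhausted without an answer


# Toy wiring: one registered tool, one abstract intent that decomposes into two calls.
tools = {"clip_qa": lambda start, end: start >= 10}

def resolve(action: Action) -> list[Action]:
    return [{"name": "clip_qa", "args": {"start": s, "end": e}}
            for s, e in action["args"]["windows"]]

def plan(observations: list) -> Action:
    if not observations:
        return {"name": "verify_candidates", "args": {"windows": [(0, 5), (10, 15)]}}
    return {"final": str(sum(o["result"] is True for o in observations))}

answer = run_agent(plan, tools, resolve)
```

In this toy run the planner emits an unregistered intent, the resolver splits it into two `clip_qa` calls, and the planner answers `"1"` after seeing one positive clip. The resolver never returns a final answer; it only returns observations.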
The paper also allows controlled parallel execution. The planner can issue multiple independent actions in one round, and the runtime schedules them together when dependencies permit. For video, this matters because many tasks require checking several windows, modalities, or entity clues. Parallelism is useful only when the system can manage dependencies and merge observations coherently. Otherwise, it just manufactures more context sludge. We already have enough of that.
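A minimal sketch of one parallel round, assuming the actions have already been judged independent (the dependency analysis itself is not shown and is the harder part). `run_round` and the action format are hypothetical; the thread pool is Python's standard `concurrent.futures`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_round(actions: list[dict[str, Any]],
              tools: dict[str, Callable[..., Any]],
              max_workers: int = 4) -> list[dict[str, Any]]:
    """Run independent tool calls concurrently; results come back in submission order."""

    def call(action: dict[str, Any]) -> dict[str, Any]:
        try:
            result = tools[action["name"]](**action["args"])
            return {"action": action, "result": result, "status": "ok"}
        except Exception as exc:     # a failed call becomes an observation, not a crash
            return {"action": action, "result": None, "status": f"error: {exc}"}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, actions))


tools = {"double": lambda x: 2 * x}  # stand-in for a real evidence tool
actions = [{"name": "double", "args": {"x": i}} for i in range(3)]
actions.append({"name": "missing", "args": {}})
observations = run_round(actions, tools)
```

Because `pool.map` preserves input order, the merged observations line up with the planner's requests, and the unknown tool shows up as an error status instead of aborting the round.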
Planner-only reinforcement learning trains the conductor, not every instrument
The reinforcement-learning part of ReTool-Video is narrower than it first appears.
The authors optimize the planner policy while treating the resolver, tool library, and execution modules as non-trainable infrastructure. The reward is dominated by final answer correctness. Smaller auxiliary terms reward structural validity and penalize ineffective trajectories such as empty progress, repeated probing, or forced termination at the maximum step limit.
The important design choice is planner-token masking. During policy optimization, only planner-generated tokens are included in the loss. Resolver outputs, tool observations, execution logs, image or video evidence descriptions, and environment-generated text are excluded.
This focuses credit assignment on the decisions that matter at the orchestration layer:
- Which action should be taken next?
- Should the system call a primitive tool or delegate an abstract intent?
- Is the current evidence sufficient?
- Is the agent repeating itself?
- Should it stop?
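A rough sketch of what planner-token masking means for the loss. It is deliberately simplified and should not be read as the paper's exact objective: it shows group-relative advantages and a masked average, and omits the ratio clipping and KL regularization that GRPO-style training normally includes. All function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: each rollout's reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_policy_loss(logprobs: list[list[float]],
                       planner_mask: list[list[int]],
                       advantages: list[float]) -> float:
    """Average of -advantage * logprob over planner-generated tokens only."""
    total, count = 0.0, 0
    for lp_seq, mask_seq, adv in zip(logprobs, planner_mask, advantages):
        for lp, is_planner in zip(lp_seq, mask_seq):
            if is_planner:           # resolver, tool, and environment tokens are skipped
                total += -adv * lp
                count += 1
    return total / max(count, 1)


# Two rollouts for one question: the first answered correctly, the second did not.
adv = group_advantages([1.0, 0.0])
mask = [[1, 0], [0, 1]]              # 1 = planner token, 0 = everything else
loss_a = masked_policy_loss([[-1.0, -2.0], [-1.0, -3.0]], mask, adv)
loss_b = masked_policy_loss([[-1.0, -99.0], [-42.0, -3.0]], mask, adv)
```

`loss_a` and `loss_b` are identical even though the masked positions differ wildly, which is the point: text the planner did not write cannot move the planner's gradient.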
For business readers, this is the cleaner interpretation: the paper is not trying to retrain every perceptual component. It trains the planner to use an existing evidence infrastructure more effectively. The analogy is not “teach every employee every job.” It is “train the dispatcher so the right specialist gets called at the right time, with the right evidence packet, and nobody sends twelve people to stare at the same door.”
This matters for deployment thinking. In many enterprise settings, the perceptual tools will be heterogeneous: OCR, ASR, object detection, domain classifiers, clip QA, retrieval indexes, policy rules, and audit systems. ReTool-Video suggests that a high-leverage training target may be the orchestration policy rather than each tool’s internal model.
That is an inference from the paper, not something the paper proves in a live enterprise setting. The evidence is still benchmark-based. But the architecture points in a practical direction.
The main results show where evidence orchestration helps most
The paper evaluates ReTool-Video on three video understanding benchmarks: MVBench, MLVU, and Video-MME. The main result table is best read as a comparison with prior systems under benchmark protocols, not as a deployment cost study.
| Benchmark | What it mainly stresses | ReTool-Video result | How to read the result |
|---|---|---|---|
| MVBench | Short-video temporal understanding across 20 task types | 72.9 | Improvement is present but more moderate, consistent with the paper’s point that short/local tasks are less dependent on long evidence orchestration |
| MLVU | Long-video reasoning, global comprehension, local evidence retrieval | 81.5 | Stronger gain, suggesting value in retrieval, verification, and aggregation over extended temporal contexts |
| Video-MME without subtitles | Open-domain video QA across diverse durations and scenarios | 76.6 | Supports the system’s ability to work under subtitle-free evaluation, though the pipeline may still use generated ASR when valid |
Compared with InternVL3.5-30B-A3B, ReTool-Video reports higher results on MLVU and Video-MME despite using a 9B backbone for planner and resolver. The table gives 81.5 versus 73.0 on MLVU and 76.6 versus 68.7 on Video-MME, or gains of 8.5 and 7.9 points respectively.
The magnitude is meaningful because it appears where the mechanism should help: long-video and open-domain settings where answer-critical evidence may be sparse, distributed, or cross-modal. The gain on MVBench is smaller, which is also informative. If a benchmark can often be answered from short sampled context, then elaborate evidence routing has less room to shine. A screwdriver looks less impressive when the screw is already loose.
This is a useful sanity check. A mechanism-first paper should not improve equally everywhere. If it did, we would suspect either a very broad capability jump or a benchmark table doing too much emotional labor. Here, the pattern fits the claimed mechanism: orchestration helps most when evidence is hard to locate, combine, or verify.
The ablation table says tools are necessary, but uncontrolled tools are not enough
The ablation study is not a second thesis. Its likely purpose is to separate the contribution of active tool use, meta tools, recursion, parallel execution, and planner-level RL.
The headline is clear: direct answering performs much worse than tool-using variants. But the more interesting lesson is that “more tool use” is not the same as better evidence orchestration.
| Setting | MLVU | MVBench | Video-MME | Likely interpretation |
|---|---|---|---|---|
| Direct response without tools | 47.6 | 46.6 | 57.4 | No-tool answering struggles, especially where retrieval and local verification matter |
| Tool-use without parallelism and recursion | 70.5 | 63.5 | 71.8 | Basic tool use already gives a large jump over direct answering |
| Tool-use without meta tools | 71.2 | 64.1 | 71.7 | Meta-level operations add value beyond raw evidence tools |
| Tool-use without parallelism | 68.4 | 64.4 | 70.2 | Parallel execution helps, but not uniformly or by itself |
| Tool-use without recursion | 63.2 | 64.8 | 66.7 | Removing recursion is costly, especially for long-video and open-domain settings |
| Tool-use before planner RL | 74.6 | 65.3 | 73.3 | Full tool infrastructure helps even before RL |
| ReTool-Video with planner RL | 81.5 | 72.9 | 76.6 | Planner training adds a further gain in action selection, sufficiency judgment, and stopping |
The odd-looking part is worth pausing on. The variant without recursion underperforms the variant without both parallelism and recursion on MLVU (63.2 versus 70.5) and on Video-MME (66.7 versus 71.8). The authors interpret this as evidence that parallel calls are useful only when abstract actions can be reliably grounded. Without recursion, parallelism may inject noisy observations and context pressure.
That is exactly the kind of result business teams should notice. Tool access is not automatically a capability. It can become a liability if the system cannot validate, ground, merge, and stop. A poorly governed tool-using agent is not an analyst. It is an intern with unlimited browser tabs and no manager.
The RL gain also deserves disciplined interpretation. The full tool-use system rises from 74.6 to 81.5 on MLVU after planner-level RL, from 65.3 to 72.9 on MVBench, and from 73.3 to 76.6 on Video-MME, gains of 6.9, 7.6, and 3.3 points respectively. This supports the paper’s claim that action selection and evidence sufficiency remain bottlenecks even after the tool infrastructure exists.
It does not prove that the same training recipe will transfer cleanly to every enterprise domain. It does suggest that once a company has built a stable evidence tool layer, tuning the planner’s orchestration behavior may be a high-value next step.
The behavior analysis is a quality-control chart, not an invitation to spam tools
The tool-behavior analysis is best read as diagnostic evidence. Its likely purpose is to understand how tool-use patterns relate to final accuracy, not to introduce a new benchmark claim.
The authors examine final accuracy against tool-call count, reasoning iterations, and tool-call success rate. Their interpretation is nicely unsentimental: effective tool use depends more on quality than quantity. Accuracy remains stable with a small to moderate number of calls, especially one to four calls, but does not improve monotonically with longer tool chains. Similarly, moderate multi-turn reasoning helps, while overly long trajectories often signal unresolved uncertainty or repeated probing.
The strongest behavior signal is tool-call success rate. Usable observations matter more than raw tool frequency.
That is the production lesson. The enterprise metric should not be “how many tools did the agent use?” It should be closer to:
| Bad metric | Better metric |
|---|---|
| Number of tool calls | Share of calls producing usable evidence |
| Chain length | Evidence sufficiency per step |
| Number of modalities touched | Whether the decisive modality was inspected |
| Agent verbosity | Trace quality and auditability |
| Completion rate | Correct completion under budget |
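The “better metric” column can be computed from execution traces with very little machinery. The sketch below assumes a hypothetical trace format in which each episode records whether the final answer was correct and whether each call returned usable evidence; the call budget is an illustrative threshold, not a figure from the paper.

```python
def trace_metrics(episodes: list[dict], max_calls: int) -> dict[str, float]:
    """Summarize traces by evidence quality rather than by raw tool volume."""
    calls = [c for ep in episodes for c in ep["calls"]]
    usable = sum(1 for c in calls if c["usable"])
    correct_in_budget = sum(
        1 for ep in episodes if ep["correct"] and len(ep["calls"]) <= max_calls
    )
    n = len(episodes)
    return {
        "usable_call_share": usable / len(calls) if calls else 0.0,
        "correct_under_budget": correct_in_budget / n if n else 0.0,
        "mean_calls": len(calls) / n if n else 0.0,
    }


# Invented traces: a short correct run, a long correct run, and a failed run.
episodes = [
    {"correct": True, "calls": [{"usable": True}, {"usable": False}]},
    {"correct": True, "calls": [{"usable": True}] * 6},
    {"correct": False, "calls": [{"usable": False}]},
]
metrics = trace_metrics(episodes, max_calls=4)
```

On these invented traces, two of three answers are correct, but only one is correct within a four-call budget. That gap is exactly what a plain completion rate hides.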
The runtime tool distribution also supports the architecture. Base tools dominate, because most tasks still require first-hand evidence acquisition. Meta tools appear less frequently but remain important for long-tail reasoning: filtering, aggregation, temporal merging, memory management, and computation.
This is how many real workflows behave. The common path does most of the volume. The value of the long-tail tools appears when the common path would otherwise produce a subtly wrong answer.
The case study shows why video AI needs bookkeeping
The paper’s case study uses an MLVU example involving toilet-cleaning events. The likely purpose is illustrative: it shows the mechanism operating on a concrete failure mode rather than proving aggregate performance.
The initial grounding stage retrieves sparse candidate intervals, including true toilet-cleaning clips and visually similar distractors. The planner does not answer directly from those coarse candidates. It issues an abstract visual-analysis intent, which the resolver decomposes into multiple clip-level QA calls for local verification.
The system identifies four positive clips. A naive clip counter would answer four. But those clips correspond to only two continuous cleaning events. ReTool-Video then applies meta tools: it sorts time ranges and merges adjacent temporal segments under a tolerance. Four fragmented detections collapse into a count of two distinct events.
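The sort-and-merge step is ordinary interval merging. Here is a minimal sketch; the timestamps and the one-second tolerance are invented for illustration, and `merge_segments` is a hypothetical name, not one of the paper's registered meta tools.

```python
def merge_segments(segments: list[tuple[float, float]],
                   tolerance: float = 1.0) -> list[tuple[float, float]]:
    """Sort time ranges, then merge any whose gap is within `tolerance` seconds."""
    merged: list[list[float]] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= tolerance:
            merged[-1][1] = max(merged[-1][1], end)   # extend the current event
        else:
            merged.append([start, end])               # start a new event
    return [(s, e) for s, e in merged]


# Four positive clips (invented timestamps, in seconds), returned out of order.
positive_clips = [(130.0, 140.0), (12.0, 20.0), (20.0, 28.0), (140.5, 150.0)]
events = merge_segments(positive_clips)
event_count = len(events)
```

The four clips merge into `[(12.0, 28.0), (130.0, 150.0)]`, so `event_count` is 2 rather than 4. The tolerance is a policy decision as much as a technical one: set it too wide and separate events fuse, too narrow and one event is counted several times.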
This is a small example, but it captures the paper’s central mechanism better than the benchmark table does. Many business video tasks fail at exactly this boundary. The system sees something real, but then counts it incorrectly, merges it incorrectly, or fails to distinguish one event from repeated fragments.
Quality inspection has the same issue. So does safety monitoring. So do sports analytics, classroom video review, retail shelf monitoring, warehouse exception detection, and training-video search. The business error is often not “the model saw nothing.” It is “the model saw pieces and assembled them badly.”
That assembly layer is where ReTool-Video is most interesting.
Business value comes from designing the evidence pipeline before scaling the model
The paper directly shows benchmark improvements for a video-agent architecture built around a richer tool library, recursive action grounding, meta tools, controlled parallelism, and planner-level RL. It does not directly show production ROI.
Cognaptus’ business inference is narrower: for video-heavy workflows, the first architecture question should not be “Which VLM do we buy?” It should be “What evidence pipeline must exist before any model answer is allowed into operations?”
A practical enterprise translation looks like this:
| Architecture element in the paper | Operational equivalent | Business relevance |
|---|---|---|
| Structured video access | Segments, transcripts, captions, event descriptions, searchable indexes | Reduces search cost before expensive inspection |
| Raw evidence access | Original clips, frames, audio snippets, timestamped references | Preserves fidelity for verification and audit |
| Base tools | Clip QA, frame inspection, ASR, retrieval, visual analysis | Acquires task-specific evidence |
| Meta tools | Filtering, sorting, merging, counting, reranking, formatting | Prevents fragmented evidence from becoming bad answers |
| Recursive resolver | Converts vague user intent into executable steps | Handles real user requests that do not match clean APIs |
| Observation normalization | Shared evidence buffer with tool call, returned evidence, and execution signal | Makes multi-step reasoning traceable |
| Failure states | Schema error, missing argument, unavailable tool, empty result, invalid output, budget violation | Separates “no evidence” from “broken workflow” |
| Planner-level RL | Tunes decisions around action choice and stopping | Improves orchestration without retraining every specialist tool |
For a company, this suggests a staged build plan.
First, segment and index video evidence. Do not begin with the fantasy that the model will reason over full raw video perfectly. ReTool-Video’s own preprocessing pipeline constructs ASR-guided semantic segments when possible, falls back to scene or fixed-length segmentation when needed, samples frames, generates captions, caches segment payloads, retrieves and reranks evidence, and then gives the planner a structured evidence block.
Second, expose both structured and raw evidence. Structured evidence supports speed. Raw evidence supports truth. If the system cannot return to original clips or frames, it is not a reliable decision system; it is a summarization rumor mill.
Third, build meta-operations early. Merging, filtering, counting, sorting, and reranking may sound less exciting than multimodal reasoning, but these are often where operational correctness lives. The toilet-cleaning example is not about grand intelligence. It is about not counting fragments of one continuous event as separate events. Enterprise AI will survive or die on such unromantic details.
Fourth, log execution traces. The paper records planner action, resolved tool call, execution result, normalized observation, failure status, and budget consumption. That is not only useful for research analysis and RL. It is also the beginning of auditability.
Fifth, tune the planner only after the tool environment is stable. Training an orchestration policy before the evidence tools are reliable is like optimizing airport scheduling while the runways are still imaginary. Bold, but not recommended.
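The trace fields from the fourth step, combined with the failure states in the table above, might be captured in a record like the following. The class and field names are hypothetical; only the list of failure types and the list of logged items come from the article's description of the paper.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any


class Failure(str, Enum):
    NONE = "none"
    SCHEMA_ERROR = "schema_error"
    MISSING_ARGUMENT = "missing_argument"
    UNAVAILABLE_TOOL = "unavailable_tool"
    EMPTY_RESULT = "empty_result"
    INVALID_OUTPUT = "invalid_output"
    BUDGET_VIOLATION = "budget_violation"


@dataclass
class TraceRecord:
    planner_action: str
    resolved_call: str
    result: Any
    observation: str
    failure: Failure
    seconds_used: float              # budget consumption for this step

    def is_workflow_fault(self) -> bool:
        # An empty result means the tool ran and found nothing ("no evidence").
        # Every other failure means the workflow itself broke.
        return self.failure not in (Failure.NONE, Failure.EMPTY_RESULT)

    def to_json(self) -> str:
        payload = asdict(self)
        payload["failure"] = self.failure.value
        return json.dumps(payload)


no_evidence = TraceRecord("verify candidates", "clip_qa(start=12.0, end=20.0)",
                          [], "no matching clip", Failure.EMPTY_RESULT, 1.2)
broken = TraceRecord("verify candidates", "clip_qa(start=12.0)",
                     None, "call rejected", Failure.MISSING_ARGUMENT, 0.0)
```

`no_evidence.is_workflow_fault()` is false while `broken.is_workflow_fault()` is true. An audit log that keeps those two cases apart can tell a reviewer whether the footage was clean or the pipeline was not.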
The result has clear boundaries, which is where the useful caution belongs
The first boundary is that the evidence is benchmark evidence. MVBench, MLVU, and Video-MME are relevant, but they are not production deployments. They do not measure integration cost, long-term maintenance, security review, legal auditability, data retention policy, human escalation, or operator trust.
The second boundary is runtime cost. The paper’s implementation uses preprocessing, caching, multi-round planner and resolver loops, recursive depth limits, and NVIDIA A800 GPUs. The runtime budget includes step and time limits, with the root planner allowed up to 15 rounds, resolver invocations up to 3 rounds, maximum recursion depth of 5, and a common per-sample wall-clock budget of 480 seconds during RL data collection. That is not a casual API call. Business adoption would need workload-specific cost modeling.
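Those limits are easy to express as an explicit budget object, which is also a convenient starting point for cost modeling. The defaults below are the figures quoted above; the class, field, and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    max_planner_rounds: int = 15
    max_resolver_rounds: int = 3
    max_recursion_depth: int = 5
    max_wall_seconds: float = 480.0  # per-sample limit reported for RL data collection

    def violation(self, planner_rounds: int, resolver_rounds: int,
                  depth: int, elapsed_seconds: float) -> Optional[str]:
        """Return the name of the first exceeded limit, or None if within budget."""
        checks = [
            ("planner_rounds", planner_rounds, self.max_planner_rounds),
            ("resolver_rounds", resolver_rounds, self.max_resolver_rounds),
            ("recursion_depth", depth, self.max_recursion_depth),
            ("wall_seconds", elapsed_seconds, self.max_wall_seconds),
        ]
        for name, used, limit in checks:
            if used > limit:
                return name
        return None


budget = Budget()
at_limit = budget.violation(15, 3, 5, 480.0)
too_slow = budget.violation(3, 1, 2, 481.0)
```

`at_limit` is `None` because reaching a limit is still allowed, while `too_slow` is `"wall_seconds"`. A production deployment would likely set far tighter values per workload than a research data-collection run.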
The third boundary is tool quality. ReTool-Video assumes the tool ecosystem can produce useful observations. If ASR is poor, captions are misleading, retrieval misses decisive segments, or clip QA is weak for a specialized domain, the planner can orchestrate a bad orchestra beautifully. The result will still be noise, but now with a trace.
The fourth boundary is comparison design. The paper follows a full-system comparison protocol, running baselines under the settings of their original papers, official implementations, or official APIs. That is reasonable for comparing systems as designed. It is less precise if the question is “which exact component caused every point of improvement?” The ablation table helps, but production teams should still validate components in their own environment.
The fifth boundary is governance. The paper has explicit failure states, observation normalization, caching, and execution records, which are good ingredients. But a production system still needs policy-specific controls: when to escalate to a human, when not to answer, how to store clips, who can inspect sensitive footage, and how to handle disputed evidence.
These boundaries do not weaken the paper. They prevent us from using it like a magic coupon for enterprise automation. Those coupons keep expiring at procurement.
The lesson is not “use more tools.” It is “make evidence executable.”
ReTool-Video is useful because it reframes video agents as evidence-routing systems.
The paper’s strongest idea is not simply that a 9B agent with tools can score well on video benchmarks. The stronger idea is architectural: complex video understanding requires a tool space rich enough to acquire and process evidence, and an action space flexible enough to translate abstract reasoning intents into executable chains.
That is the bridge from research result to business practice.
For video-heavy companies, the temptation will be to wait for a bigger multimodal model and hope the workflow problem disappears. Sometimes scale will help. But ReTool-Video points to a more durable pattern: build the evidence library, normalize observations, distinguish base perception from meta-processing, ground abstract requests recursively, trace failures explicitly, and train the planner to stop only when evidence is sufficient.
In other words, video AI becomes useful when it stops merely watching and starts routing.
A little less cinematic. A lot more operational.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, and Jiang Zhong, “ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding,” arXiv:2605.13228, 2026. https://arxiv.org/abs/2605.13228