When One Clip Isn’t Enough: Teaching LLMs to Watch Long Videos Like Adults

Video is a terrible place to hide evidence.

Not because the evidence is invisible. Because it is usually obvious only after someone has already found the right minute, the right scene, and the right visual detail. A person reviewing a long customer-support screen recording, a training video, a compliance recording, or a surveillance clip rarely watches everything with equal attention. They skim, localize, zoom in, check the detail, and then answer. Primitive, yes. Effective, also yes.

That is the practical idea behind LongVideoAgent, a paper that treats long-video question answering less like “feed a giant video into a giant model” and more like coordinated investigation.¹ The system uses a MasterAgent to decide what to do next, a GroundingAgent to locate the relevant clip, and a VisionAgent to extract visual facts from that localized segment. The master does not heroically consume every raw frame. It asks for the part that matters, then asks what can be seen there.

The obvious story would be: “multi-agent video QA improves benchmark accuracy.” True, but not very useful. The better story is a comparison between two operating models.

One model compresses the whole video and hopes the answer survives.

The other localizes first, inspects second, and answers only after enough evidence has been gathered.

That difference matters far beyond TV-show benchmarks, because many business video workflows fail in exactly the same place: not at final reasoning, but at evidence selection.

The tempting but brittle design: throw the whole video into the model

The natural first instinct with long-video AI is to scale the input. Sample more frames. Extend the context. Add subtitles. Maybe summarize everything first. Then ask the model the question.

This works until the answer depends on a small visual cue buried in a long sequence. A single object, a face, a gesture, a written sign, the side of a bed near a window — all the annoying little things that make video video, rather than a transcript with decorative pixels.

The paper frames this as a lossy-compression problem. Many long-video systems either downsample frames, compress video into textual or visual summaries, or rely on a static representation produced before the question is fully resolved. Once the important clue is compressed away, the language model can reason beautifully over the wrong evidence. Elegant nonsense, now with frame sampling.

LongVideoAgent takes a different route. It does not ask the master model to be omnivorous. It gives the master model a bounded set of actions:

Action	What it does	Why it matters
Grounding	Locate the segment likely to contain the answer	Reduces the search space before visual inspection
Visual query	Ask for targeted visual observations from the localized segment	Retrieves details not inferable from subtitles
Answer	Stop when evidence is sufficient	Prevents endless tool use and keeps the process bounded

The important design choice is that the master agent only consumes text: subtitles, symbolic clip tags, and textual observations returned by the vision agent. Raw images are handled by the VisionAgent, not directly by the master. This makes the architecture closer to a review workflow than a monolithic perception model.
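A minimal Python sketch of that bounded action space may help. The tag syntax (`<grounding>`, `<visual_query>`, `<answer>`), the `Action` type, and `parse_action` are all hypothetical names chosen for illustration, not the paper's actual protocol. The sketch only shows the property the design leans on: the master's text output either contains exactly one well-formed action or it is rejected.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    GROUNDING = "grounding"
    VISUAL_QUERY = "visual_query"
    ANSWER = "answer"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    payload: str


# Hypothetical tag format; the paper's real tag names may differ.
_TAG = re.compile(r"<(grounding|visual_query|answer)>(.*?)</\1>", re.DOTALL)


def parse_action(master_output: str) -> Optional[Action]:
    """Return the single action found in the master's text output,
    or None when the output holds zero or several action tags."""
    matches = _TAG.findall(master_output)
    if len(matches) != 1:
        return None
    name, payload = matches[0]
    return Action(ActionType(name), payload.strip())


if __name__ == "__main__":
    print(parse_action("Need the scene first. <grounding>Sheldon sitting outside</grounding>"))
    print(parse_action("<answer>B</answer><answer>C</answer>"))  # ambiguous, rejected
```

Because everything the master emits and receives is text, a parser like this is the whole interface between the controller and the specialist agents.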

In business language: the system separates case routing, evidence retrieval, and decision-making. That is usually how real operations work. The “one giant model sees everything” fantasy is neater on a diagram, but production systems are rarely paid for diagram elegance.

LongVideoAgent replaces passive viewing with evidence gathering

The paper’s architecture is simple enough to explain without pretending it is magic.

The MasterAgent receives the question and episode subtitles. It can ask the GroundingAgent to return a relevant clip tag, such as a specific segment in an episode timeline. Once a clip is localized, the master can ask the VisionAgent a focused visual question about that clip. The VisionAgent returns a textual observation — for example, objects, people, actions, scene cues, OCR, or spatial layout. The master then decides whether to ask another question, re-ground, or answer.

The loop is bounded. In the default experimental setup, the maximum number of execution steps is $K=5$, and the default evidence window is one localized clip, although the paper later tests larger windows. The step limit is not a minor engineering detail. It tells us the authors are not simply giving the agent unlimited tool calls until something works. They are testing whether a small number of targeted actions can outperform broader but less selective processing.
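The bounded loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the three agents are plain callables standing in for model calls, the action names are hypothetical, and returning `None` when the budget runs out is a choice made here because the article does not say how the paper handles that case.

```python
from typing import Callable, List, Optional, Tuple


def run_episode(
    question: str,
    master: Callable[[str, List[str]], Tuple[str, str]],  # -> (action, payload)
    grounder: Callable[[str], str],                       # query -> clip tag
    vision: Callable[[str, str], str],                    # (clip tag, query) -> text
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Run at most `max_steps` actions and return (answer, trajectory)."""
    context: List[str] = []   # text-only evidence the master can read
    trajectory: List[Tuple[str, str]] = []
    clip: Optional[str] = None
    for _ in range(max_steps):
        action, payload = master(question, context)
        trajectory.append((action, payload))
        if action == "answer":
            return payload, trajectory
        if action == "grounding":
            clip = grounder(payload)
            context.append(f"clip: {clip}")
        elif action == "visual_query" and clip is not None:
            # Visual queries issued before any grounding are ignored.
            context.append(f"observation: {vision(clip, payload)}")
    return None, trajectory  # budget exhausted without an answer


def scripted_master(question: str, context: List[str]) -> Tuple[str, str]:
    """Toy stand-in for the MasterAgent: ground, look once, then answer."""
    if not context:
        return "grounding", question
    if len(context) == 1:
        return "visual_query", "Where is the person sitting?"
    return "answer", "Bus Stop"


if __name__ == "__main__":
    answer, steps = run_episode(
        "Where is Sheldon sitting?",
        scripted_master,
        grounder=lambda query: "clip_07",
        vision=lambda clip, query: "a bench at night on an urban sidewalk",
    )
    print(answer, steps)
```

The trajectory list is the audit trail: each entry records which action the master chose and what it asked for.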

A useful way to read the method is as a replacement for “watch everything”:

Single-pass long-video model	LongVideoAgent
Preprocess or downsample the video before answering	Decide what evidence is needed after seeing the question
Treat long context as the main resource	Treat localization as the main resource
Hope the relevant clue survives compression	Ask a targeted visual question once the segment is found
Output an answer with limited traceability	Produce an interpretable trajectory of grounding, visual query, and answer
Often pays attention everywhere weakly	Pays attention somewhere strongly

The system’s example cases make the mechanism concrete. In one case, the subtitles do not reveal where Sheldon is sitting. The GroundingAgent localizes the relevant clip; the VisionAgent describes a bench at night in an urban sidewalk setting, with nearby visual cues; the MasterAgent chooses “Bus Stop.” In another case, the first visual read is too generic, so the master asks a more precise follow-up about the window’s position relative to the bed. The second visual answer resolves the left/right question.

This is the adult part of “watching video”: not staring harder at the whole thing, but asking the next better question.

The benchmark is built by turning short clips into episode-level search

The authors construct LongTVQA and LongTVQA+ from TVQA and TVQA+. TVQA originally contains 152.5K multiple-choice questions over 21.8K short clips of 60–90 seconds, with subtitles and moment annotations. TVQA+ adds spatio-temporal grounding on a subset, including 29.4K QAs from 4,198 clips and 310.8K frame-level boxes for referenced entities.

The paper aggregates clips from the same TV episode into a single episode-level sequence. Clip timestamps are re-indexed into the episode timeline, subtitles and questions are merged, and TVQA+ bounding boxes are preserved.
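The re-indexing step is simple to picture in code. The sketch below assumes clips are laid end to end in episode order with no gaps, which the article does not confirm; the `Clip` fields and `reindex_episode` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    clip_id: str
    duration: float               # clip length in seconds
    moment: Tuple[float, float]   # annotated (start, end), clip-local seconds


def reindex_episode(clips: List[Clip]) -> Tuple[List[Tuple[str, float, float]], float]:
    """Concatenate clips in order and shift each annotated moment by the
    total duration of the clips that precede it.

    Returns the episode-level moments and the total episode duration."""
    offset = 0.0
    timeline = []
    for clip in clips:
        start, end = clip.moment
        timeline.append((clip.clip_id, start + offset, end + offset))
        offset += clip.duration
    return timeline, offset


if __name__ == "__main__":
    episode = [
        Clip("c1", 60.0, (10.0, 20.0)),
        Clip("c2", 90.0, (5.0, 15.0)),
    ]
    print(reindex_episode(episode))
```

In this toy episode, the moment annotated at 5 to 15 seconds inside the second clip lands at 65 to 75 seconds on the episode timeline, which is the position a grounding step would then have to find.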

That construction is important. The model is no longer answering from a neat short clip where the relevant moment has already been handed over. It must operate over a longer episode-level context and find where the answer lives.

This is also where the business analogy becomes legitimate, with limits. Many enterprise videos are not naturally clipped into perfect evidence windows. A safety review may involve hours of footage. A customer support recording may contain long stretches of irrelevant navigation. A product training video may bury the answer in one demonstration step. The LongTVQA setup captures the search problem more realistically than short-clip QA.

Still, it is not a complete proxy for enterprise video. These are TV-show datasets with provided subtitles, controlled question formats, and multiple-choice answers. The paper demonstrates a strong architectural pattern. It does not prove that the same system can be dropped into noisy factory footage or privacy-sensitive call-center recordings tomorrow morning. Very sad for anyone hoping to invoice by lunch.

The main result: agentic routing beats passive consumption

The overall performance table compares non-agent and agentic versions across closed-source and open-source backbones.

Model / setting	LongTVQA Acc. (%)	LongTVQA+ Acc. (%)	Interpretation
GPT-4o, non-agent, subtitle+frame	70.78	78.32	Strong multimodal baseline processing the long video directly
Gemini-2.5 Pro, non-agent, subtitle+frame	78.90	81.28	Stronger closed-source full-video baseline
GPT5-mini, subtitle only	62.40	66.70	Non-agent text baseline
Agentic-GPT5-mini, subtitle+frame	71.11	78.90	Large gain from agentic grounding and visual inspection
Grok, subtitle only	76.90	81.80	Strong text baseline
Agentic-Grok, subtitle+frame	82.65	85.60	Best reported result in the table
DeepSeek-R1 671B, subtitle only	68.99	75.04	Strong open-source text baseline
Agentic-DeepSeek-R1 671B, subtitle+frame	70.30	79.70	Moderate LongTVQA gain, larger LongTVQA+ gain
Agentic-Qwen2.5 3B	23.50	27.70	Small open-source master without RL struggles
AgenticRL-Qwen2.5 3B	47.40	50.10	RL gives very large gains
Agentic-Qwen2.5 7B	46.10	60.30	Larger open-source master, still weak without RL
AgenticRL-Qwen2.5 7B	60.20	70.80	RL narrows the gap materially

The cleanest comparison is not “which row is biggest?” The cleanest comparison is what happens when a similar backbone moves from passive mode to agentic mode.

GPT5-mini rises from 62.40 to 71.11 on LongTVQA and from 66.70 to 78.90 on LongTVQA+. Grok rises from 76.90 to 82.65 and from 81.80 to 85.60. DeepSeek-R1 sees a smaller gain on LongTVQA, from 68.99 to 70.30, but a more meaningful gain on LongTVQA+, from 75.04 to 79.70.

These are not identical-input comparisons in every case, because the agentic setting adds subtitle+frame access through tools while some non-agent baselines are subtitle-only. The authors are clear that the agentic system adds both procedure and visual evidence. That matters for interpretation: the gain is not “agency alone” in isolation. It is the combined effect of localization, targeted visual observation, and iterative coordination.
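For readers who want to check the paired gains, the short Python snippet below recomputes them from the table rows quoted above. The numbers are copied from the table; the dictionary layout and the `gains` helper are just a convenient arrangement chosen here.

```python
# (non-agent accuracy, agentic accuracy) per benchmark, copied from the table.
PAIRS = {
    "GPT5-mini": {"LongTVQA": (62.40, 71.11), "LongTVQA+": (66.70, 78.90)},
    "Grok": {"LongTVQA": (76.90, 82.65), "LongTVQA+": (81.80, 85.60)},
    "DeepSeek-R1 671B": {"LongTVQA": (68.99, 70.30), "LongTVQA+": (75.04, 79.70)},
}


def gains(pairs):
    """Accuracy-point gain of the agentic row over its non-agent counterpart."""
    return {
        model: {
            bench: round(agentic - baseline, 2)
            for bench, (baseline, agentic) in benches.items()
        }
        for model, benches in pairs.items()
    }


if __name__ == "__main__":
    for model, by_bench in gains(PAIRS).items():
        print(model, by_bench)
```

The spread is wide: GPT5-mini gains 8.71 and 12.20 points, while DeepSeek-R1 gains only 1.31 on LongTVQA. Since the baselines in these pairs are subtitle-only, each difference bundles the agentic procedure with the added visual evidence.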

That is also why the ablation table is more important than the headline leaderboard.

The ablation tells the real story: grounding first, vision second

The strongest evidence for the paper’s mechanism appears in the ablation study:

Setting	Accuracy (%)
Non-agent, text-only	64.3
Multi-Agent, grounding	69.0
Multi-Agent, grounding + vision	74.8

This is the paper’s most business-relevant result. It decomposes the workflow into stages.

Adding grounding increases accuracy by 4.7 points. That supports the idea that localization itself has value: before asking “what is in the scene?”, the system must first identify which scene matters. In a long-video workflow, finding the moment is often the largest practical cost. Human reviewers tend to spend less of their time interpreting than finding the moment worth interpreting.

Adding the VisionAgent lifts accuracy by another 5.8 points, for a total gain of 10.5 over the non-agent text-only baseline. This supports the second part of the architecture: subtitles are not enough. Some answers depend on visual layout, objects, scene identity, spatial relations, or other cues not present in dialogue.

The sequence matters:

Grounding filters the haystack.
Vision reads the needle.
The master decides whether the needle is enough.

Reverse that order and the workflow becomes expensive. Skip grounding, and visual inspection becomes too broad. Skip vision, and the system remains trapped in transcript-land, where every room is whatever the dialogue fails to deny.

The sensitivity tests are about operating budgets, not a second thesis

The paper includes several additional tests. These should not be read as separate grand claims. They are mostly sensitivity and implementation tests that tell us how the system behaves under practical constraints.

Test	Likely purpose	What it supports	What it does not prove
Max steps $K$	Sensitivity to tool-use budget	More steps help up to a point; $K=5$ is enough in this setup	Unlimited reasoning would keep improving
Evidence window size	Sensitivity to adjacent context	Nearby clips can disambiguate references across shots	Larger windows are always worth the latency
Vision model choice	Component ablation	Stronger visual extraction improves final QA	The architecture is independent of perception quality
Qualitative examples	Mechanism illustration	The system can refine queries when evidence is insufficient	Typical production reliability

For max steps, raising $K$ from 2 to 5 increases grounding accuracy from 67.00 to 71.00 and answer accuracy from 68.30 to 73.67. Raising $K$ to 10 nudges grounding to 72.00 but leaves answer accuracy unchanged at 73.67. The lesson is not “more agent steps are always better.” The lesson is that a small action budget can capture most of the value, while extra loops may add latency without improving answers.

For evidence window size, increasing the window from 1 to 2 clips raises answer accuracy from 70.33 to 75.00, and expanding to 3 clips raises it to 77.26. Grounding accuracy also improves from 71.67 to 78.67 and then 81.94. This makes intuitive sense: TV scenes and real videos often refer across adjacent moments. But the authors also note the cost: larger windows mean more visual queries and higher latency. For business systems, this is a familiar tradeoff. Wider search improves recall until the bill arrives.
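One way to read both sensitivity tests is as marginal return per unit of budget. The Python sketch below uses only the answer-accuracy figures quoted above; `marginal_gains` is a hypothetical helper, and treating one extra step and one extra clip as comparable "units" is a simplification made here, not a claim from the paper.

```python
def marginal_gains(points):
    """points: (budget, accuracy) pairs sorted by budget.

    Returns, for each step up in budget, the accuracy points gained
    per extra unit of budget."""
    return [
        (b1, round((a1 - a0) / (b1 - b0), 2))
        for (b0, a0), (b1, a1) in zip(points, points[1:])
    ]


# Answer accuracy figures quoted in the text.
MAX_STEPS = [(2, 68.30), (5, 73.67), (10, 73.67)]
WINDOW_CLIPS = [(1, 70.33), (2, 75.00), (3, 77.26)]

if __name__ == "__main__":
    print("per extra step:", marginal_gains(MAX_STEPS))
    print("per extra clip:", marginal_gains(WINDOW_CLIPS))
```

Going from 2 to 5 steps buys about 1.79 points per step and going from 5 to 10 buys nothing, while the second and third clips buy 4.67 and 2.26 points. Both curves flatten, which is the operating-budget lesson in numeric form.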

For the vision model ablation, GPT-4o outperforms Qwen3-VL-235B in the VisionAgent role, with answer accuracy of 78.00 versus 73.67. That is not a small footnote. It means the architecture depends on the quality of the specialist modules. An agentic workflow can route intelligently, but it cannot extract visual facts that the vision model fails to see. Delegation is not alchemy. Annoying, but financially useful to remember.

Reinforcement learning teaches the master to coordinate, not to see

The second major contribution is reinforcement learning for the open-source MasterAgent. The authors use GRPO while keeping the grounding and vision agents frozen. The reward has two parts:

$$ R(\tau)=\alpha\sum_{t=0}^{T} r_t^{\mathrm{fmt}}+r^{\mathrm{ans}} $$

The first term, scaled by the weight $\alpha$ and summed over the steps of a trajectory $\tau$, rewards structural validity: the master must emit exactly one properly formed action tag at each step. The second rewards answer correctness at termination using exact match on the multiple-choice answer.

This is deliberately sparse. There is no elaborate reward for “good curiosity,” “beautiful reasoning,” or “agentic vibes,” thank goodness. The model is rewarded for following the action protocol and getting the final answer right.
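As a rough illustration, the reward above can be written as a small Python function. Several details are assumptions made here and not stated in the article: each per-step format reward is taken to be 0 or 1, the answer reward is taken to be 0 or 1, and the value `alpha=0.1` is a placeholder rather than the paper's setting.

```python
from typing import Optional, Sequence


def trajectory_reward(
    format_ok: Sequence[bool],
    final_answer: Optional[str],
    gold_answer: str,
    alpha: float = 0.1,  # placeholder weight, not the paper's value
) -> float:
    """R(tau) = alpha * sum_t r_t_fmt + r_ans, with binary components.

    format_ok[t] is True when step t emitted exactly one well-formed
    action tag; the answer term uses exact match on the chosen option."""
    fmt_total = sum(1.0 for ok in format_ok if ok)
    ans = 1.0 if final_answer is not None and final_answer == gold_answer else 0.0
    return alpha * fmt_total + ans


if __name__ == "__main__":
    # Three valid steps and a correct final answer.
    print(trajectory_reward([True, True, True], "B", "B"))
    # One malformed step and a wrong answer.
    print(trajectory_reward([True, False], "A", "B"))
```

Under these assumptions a small `alpha` keeps the answer term dominant, so a trajectory cannot score well simply by emitting tidy tags.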

The reported gains are large for smaller open-source masters. Qwen2.5-3B improves from 23.50 to 47.40 on LongTVQA and from 27.70 to 50.10 on LongTVQA+. Qwen2.5-7B improves from 46.10 to 60.20 and from 60.30 to 70.80.

The interpretation needs to be precise: RL improves the master’s ability to use the workflow. It does not train a better vision model. It does not jointly optimize grounding and perception. It teaches the central controller to produce valid actions, decide when to call tools, and stop with an answer.
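For context on why such a sparse reward can still train a controller: GRPO, as it is commonly described, samples a group of trajectories for the same question and scores each one relative to the group, using the group mean and standard deviation instead of a learned value function. The Python sketch below shows only that normalization step. The reward values are made up for illustration, and nothing here reflects the paper's group size or other hyperparameters.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's reward against its sampling group.

    Trajectories that beat the group average get a positive advantage,
    the rest a negative one; eps guards against a zero-variance group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four rollouts of the same question.
    print(group_relative_advantages([1.3, 0.3, 0.3, 1.3]))
    # A group where every rollout scored the same carries no learning signal.
    print(group_relative_advantages([0.3, 0.3, 0.3]))
```

The second call is the practical catch: if every rollout in a group fails or every rollout succeeds, the advantages are all zero and that question teaches the master nothing.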

This is useful for companies because many agent failures are not perception failures. They are orchestration failures. The model calls the wrong tool, asks a vague question, stops too early, loops too long, or mixes evidence from the wrong segment. LongVideoAgent’s RL design addresses exactly that coordination layer.

The cost is also visible. The paper reports training Qwen2.5-7B for 12 hours on 4× NVIDIA H800 GPUs, while the 3B variant took 6 hours under the same setup. That is not outrageous for research, but it is not free. The business question is whether the accuracy gain, repeated across many queries, justifies the cost of training a controller, or whether prompting and rule-based orchestration are enough for a given use case.

The business lesson is selective inspection, not video omniscience

For practical AI deployment, the paper’s value is not that businesses should copy LongVideoAgent line by line. The value is the workflow pattern.

A video AI system should not begin with “summarize everything.” It should begin with: what evidence would answer this question, and where should we look for it?

Business workflow	Naive design	LongVideoAgent-style design	Practical benefit	Boundary
Surveillance review	Sample frames across the whole recording	Localize candidate moments, then inspect visual details	Less wasted review and clearer evidence trail	Real surveillance has poor lighting, occlusion, and privacy constraints
Training-video QA	Summarize the entire training module	Find the relevant demonstration step, then answer	Better answers for procedural questions	Depends on accurate transcript and scene segmentation
Customer-support screen recordings	Ask the model over a long screen video	Locate the user action or error moment, then inspect UI state	Faster issue diagnosis	OCR, UI changes, and sensitive data handling matter
Media archive search	Embed or caption everything uniformly	Retrieve likely scene, then run targeted visual queries	More precise search over long archives	Rights, metadata, and multilingual subtitles complicate deployment
Compliance monitoring	Generate broad video summaries	Ground the suspected event and request specific evidence	Stronger audit traceability	Must avoid overclaiming automated compliance judgment

The paper directly shows benchmark improvements on LongTVQA and LongTVQA+.

Cognaptus’ inference is broader: many video-heavy operations should separate temporal localization, visual evidence extraction, and final reasoning. This is not just an AI architecture preference. It is an operating model for reducing wasted context and improving auditability.

What remains uncertain is domain transfer. TV-show QA has cleaner structure than many enterprise videos. Questions are multiple choice. Subtitles are available. The system uses strong external tools, including Grok-4-fast-reasoning for temporal localization and GPT-4o as the default VisionAgent. In production, tool availability, latency, privacy rules, and video quality may dominate benchmark accuracy.

Still, the direction is hard to ignore. If your video workflow asks the model to “watch everything,” you may not have a video intelligence problem. You may have a routing problem wearing sunglasses.

Where this architecture applies — and where it probably breaks

LongVideoAgent is best understood as an architecture signal for long, sparse-evidence video tasks. It is especially relevant when the answer depends on a small localized moment and when textual context alone is insufficient.

It is less convincing as evidence for open-ended video understanding. The benchmark is multiple-choice QA, which gives the model a closed answer set. Real business questions are often messier: “What happened?”, “Was the process followed?”, “Did the customer encounter friction?”, “Was this safety incident preventable?” Those questions require judgment, policy interpretation, and often causal reasoning across many events.

The paper also relies on provided subtitles and does not process raw audio. That matters. In many real settings, audio quality is poor, speakers overlap, domain vocabulary is specialized, and ASR errors change the evidence. The authors acknowledge this and list raw audio integration as future work.

The grounding and vision modules are also fixed during RL. This makes the experiment cleaner for studying the master controller, but it limits full-system adaptation. If the GroundingAgent consistently selects the wrong segment in a new domain, the master may become a very disciplined coordinator of bad evidence. Very organized failure is still failure.

Finally, the reward is intentionally simple: format plus answer correctness. That simplicity is a strength because it shows useful gains without ornate supervision. But it may not be enough for tasks requiring calibrated uncertainty, refusal, multi-answer synthesis, or evidence citation.

The real shift: from bigger context to better attention contracts

LongVideoAgent does not say long context is useless. It says long context is not the whole answer.

The paper’s strongest contribution is to turn long-video QA into an attention contract among agents. The GroundingAgent promises to find where to look. The VisionAgent promises to describe what is visible there. The MasterAgent promises to coordinate, stop, and answer. Each role reduces one kind of uncertainty.

That is a better mental model for business AI than the usual “model eats data, answer comes out” cartoon. In long videos, the answer often depends less on how much the model can ingest and more on whether the system can spend perception budget at the right moment.

The comparison is therefore not “agentic AI versus non-agentic AI” in the abstract. The comparison is operational:

passive compression versus active evidence gathering;
broad context versus localized inspection;
generic summarization versus question-conditioned observation;
one-shot answering versus bounded multi-step review.

For video-heavy organizations, that is the useful lesson. Don’t make the model watch like a bored intern forced through a three-hour recording. Make it watch like an investigator: find the scene, inspect the clue, answer only when the evidence is good enough.

One clip is not enough. But neither is every clip.

The trick is knowing which clip deserves a second look.

Cognaptus: Automate the Present, Incubate the Future.

¹ Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, and Qifeng Chen, “LongVideoAgent: Multi-Agent Reasoning with Long Videos,” arXiv:2512.20618, 2025, https://arxiv.org/abs/2512.20618.
