Video is easy to collect and expensive to understand.

That is the awkward little truth behind many enterprise “AI video intelligence” projects. A warehouse camera records everything. A body camera records everything. A meeting room system records everything. A field-service headset records everything. Then someone asks a very human question: who handled the device after lunch, what did they say, and was the machine hot when they touched it?

At that point, “everything” becomes less impressive. It becomes a storage bill wearing a product roadmap.

The MARS technical report for the CASTLE Challenge at EgoVis 2026 is useful because it does not pretend that long-horizon multimodal understanding is solved by feeding more raw media into a bigger model.[1] Its core idea is more operational and therefore more interesting: when the evidence is scattered across long videos, transcripts, gaze traces, photos, thermal images, heartrate logs, and auxiliary clips, the system should not consume all evidence equally. It should build a compact evidence memory, start with cheap reusable sources, and let an agent request the missing modality only when the current uncertainty justifies it.

That sounds less glamorous than “end-to-end multimodal intelligence”. Good. Glamour is usually where compute budgets go to die.

MARS achieved 0.57 accuracy on the 185-question CASTLE Challenge test set and ranked second on the final leaderboard, just behind the top score of 0.58. The number matters, but it is not the main lesson. The more useful lesson is the architecture: source taxonomy, offline evidence memory, agentic modality selection, consistency checking, and a forced fallback when the benchmark still demands an answer.

In other words, MARS is not mainly a story about making the model bigger. It is a story about making evidence legible before reasoning begins.

The real bottleneck is not vision; it is evidence control

CASTLE is not a short-video benchmark. The paper describes a dataset setting with four days of activity, 10 participants, 15 synchronized perspectives, more than 600 hours of video, official transcripts, photos, gaze traces, thermal imagery, and biometric signals. The challenge asks systems to answer 185 closed-form, four-choice questions.

That design creates a different kind of difficulty. A question may depend on a spoken explanation, an object visible on a screen, a person’s identity, a room-level camera angle, a thermal cue, or an attention cue from gaze. The problem is not merely “can the model see?” It is “does the system know where to look, when to stop looking, and which modality can settle the dispute?”

This is where many naive multimodal systems behave like interns on their first week: enthusiastic, expensive, and very eager to open every file.

MARS takes a more disciplined route. The authors treat CASTLE as an evidence-selection problem. The official data layout already separates main video directories by day and includes auxiliary folders for gaze, heartrate, photo, thermal, and video data. MARS turns this filesystem structure into a reasoning structure.

The system groups sources into two primary sources and four auxiliary sources:

| Source category | Evidence type | Operational role |
| --- | --- | --- |
| Primary | Video | Converted into searchable captions, OCR notes, and summaries because raw long video cannot be repeatedly inserted into context |
| Primary | Transcript | Kept as text, indexed by day, stream, and time segment |
| Auxiliary | Gaze | Useful when attention or interaction target matters |
| Auxiliary | Heartrate | Weak but useful signal for activity intensity |
| Auxiliary | Photos | High-resolution identity, object, vehicle, food, and screen cues |
| Auxiliary | Thermal imagery | Heat-related evidence for cooking, appliances, drinks, or environmental state |

This table is not decoration. It is the system’s thesis.

Different evidence types have different costs, different reliability profiles, and different best-use cases. A transcript may cheaply answer a question about what someone said. A photo may settle an identity or object detail. Thermal imagery may matter for a cooking or appliance state. Heartrate is weaker, but may still help when the question concerns physical effort. Gaze is not “truth”, but it can narrow what a participant likely attended to.
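One way to make that taxonomy operational is to encode it as data a retriever can consult. The sketch below is illustrative only: the `SourceSpec` type, the `relative_cost` ranks, and the `answers` tags are assumptions introduced here, not values from the MARS report. Only the primary/auxiliary split comes from the paper's grouping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    name: str
    category: str        # "primary" or "auxiliary", following the paper's grouping
    relative_cost: int   # hypothetical rank: 1 = cheapest to consult
    answers: frozenset   # hypothetical tags for question types the source can help with


# Categories follow the article; costs and tags are illustrative guesses.
SOURCES = [
    SourceSpec("transcript", "primary", 1, frozenset({"speech"})),
    SourceSpec("video_summary", "primary", 2, frozenset({"action", "screen"})),
    SourceSpec("gaze", "auxiliary", 3, frozenset({"attention"})),
    SourceSpec("heartrate", "auxiliary", 3, frozenset({"exertion"})),
    SourceSpec("photo", "auxiliary", 4, frozenset({"identity", "object", "screen"})),
    SourceSpec("thermal", "auxiliary", 4, frozenset({"heat"})),
]


def candidates(question_tag: str) -> list[str]:
    """Return the sources that can address a question tag, cheapest first."""
    matching = [s for s in SOURCES if question_tag in s.answers]
    return [s.name for s in sorted(matching, key=lambda s: s.relative_cost)]


print(candidates("screen"))  # ['video_summary', 'photo']
```

The point of the sketch is the ordering: a screen question tries the already-indexed video notes before reaching for a photo.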

The boring word here is source. The important word is selection.

MARS first compresses the world into evidence memory

The first major contribution of MARS is not the decision loop. It is the evidence memory that makes the loop possible.

For video, MARS follows the HCQA-style pipeline: long videos are sliced into short clips; representative frames are captioned; OCR is applied when screens or text appear; clip-level captions are compressed into event summaries using DeepSeek. The reason is practical. CASTLE’s videos are too long to place directly into the model context for every question. Summarisation is not a stylistic choice; it is a survival mechanism.

For transcripts, the system does something different. Because transcripts are already textual, they are normalised into short utterance windows while preserving time information. That allows transcript snippets to align with nearby video summaries.
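A minimal sketch of that windowing step might look like the following. The `Utterance` type and the 30-second limit are assumptions for illustration; the report does not state a window length here, and the example utterances are invented.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float   # seconds from the start of the stream
    end: float
    speaker: str
    text: str


def window_utterances(utterances, max_seconds=30.0):
    """Group time-ordered utterances into short windows, keeping start and end times."""
    windows, current = [], []
    for u in utterances:
        # Close the current window if this utterance would stretch it past the limit.
        if current and u.end - current[0].start > max_seconds:
            windows.append(current)
            current = []
        current.append(u)
    if current:
        windows.append(current)
    return [
        {
            "start": w[0].start,
            "end": w[-1].end,
            "text": " ".join(f"{u.speaker}: {u.text}" for u in w),
        }
        for w in windows
    ]


demo = [
    Utterance(0.0, 4.0, "A", "Is the kettle on?"),
    Utterance(5.0, 9.0, "B", "Yes, just boiled."),
    Utterance(40.0, 44.0, "A", "Pass me the mug."),
]
for window in window_utterances(demo):
    print(window)
```

Because every window keeps its own start and end time, it can later be matched against a video summary covering the same interval.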

Auxiliary media are treated as source-specific evidence rather than flattened into one universal soup. Photo notes describe static visual details. Thermal notes describe visible heat patterns. Gaze notes describe likely attended targets. Heartrate notes describe local increases, decreases, or abnormal peaks.

The resulting evidence entries share a common schema: source type, day, stream or owner, time if available, and evidence content.

That schema is the quiet part of the system. Without it, the agent would not be selecting evidence; it would merely be asking for more text. The difference matters. Evidence memory gives the system coordinates. It lets the agent say, in effect: “I need the photo evidence for this participant and time window,” rather than “please give me more context and let us hope the answer appears.”
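The schema described above can be sketched as a small record type with a coordinate-based lookup. Field names, the `lookup` helper, the 60-second tolerance, and the sample entries are all hypothetical; the report names the fields but this is not its implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceEntry:
    source: str              # e.g. "transcript", "video_summary", "photo"
    day: int
    owner: str               # stream or participant identifier
    start: Optional[float]   # seconds; None when the source carries no timestamp
    end: Optional[float]
    content: str


def lookup(memory, source, day, owner=None, around=None, tolerance=60.0):
    """Filter entries by coordinates. Untimed entries are kept even when a time is given."""
    hits = []
    for e in memory:
        if e.source != source or e.day != day:
            continue
        if owner is not None and e.owner != owner:
            continue
        if around is not None and e.start is not None:
            end = e.end if e.end is not None else e.start
            if end < around - tolerance or e.start > around + tolerance:
                continue
        hits.append(e)
    return hits


# Invented entries, for illustration only.
memory = [
    EvidenceEntry("transcript", 2, "p03", 600.0, 620.0, "p03: the pan is still hot"),
    EvidenceEntry("photo", 2, "p03", None, None, "red mug on the counter"),
    EvidenceEntry("photo", 3, "p03", None, None, "laptop showing a spreadsheet"),
    EvidenceEntry("transcript", 2, "p03", 5000.0, 5010.0, "p03: let's go out"),
]

for entry in lookup(memory, "transcript", day=2, around=630.0):
    print(entry.content)
```

A request such as "photo evidence for this participant on day 2" becomes `lookup(memory, "photo", 2, owner="p03")` rather than a free-text plea for more context.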

Hope, unfortunately, is not a retrieval strategy.

The agent is a source controller, not a magic reasoner

At inference time, MARS receives the question, four answer options, and an initial evidence package retrieved from video summaries and transcripts. The paper uses GPT-5.4 as the decision model. The model is not simply asked to answer. It must choose from a fixed action space:

| Agent action | What it does | Why it matters |
| --- | --- | --- |
| Continue thinking | Reorganises current evidence, identifies missing facts, refines the retrieval query | Prevents premature retrieval when the issue is reasoning clarity |
| Add data | Requests one specific source, such as transcript, video summary, gaze, heartrate, photo, or thermal evidence, with target day/person/room/time | Turns multimodality into controlled evidence acquisition |
| Answer | Selects one option when evidence is sufficient | Stops the loop before cost expands unnecessarily |
| Random fallback | Chooses randomly when the reasoning budget is exhausted and evidence is insufficient | Handles the benchmark requirement that every question must receive one option |

This is a mechanism worth slowing down for.

The decision model is not valuable because it has a mystical inner sense of video truth. It is valuable because it is forced to operate under an evidence policy. It can continue reasoning. It can request one missing source. It can answer. Or it can admit, operationally, that the evidence does not support any option and use the benchmark’s required fallback.
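The loop can be sketched in a few lines, with heavy caveats. Here `decide` stands in for the decision model and `fetch` for the evidence retriever; both are hypothetical callables, the step budget of four is an assumption, and the random fallback is triggered by the loop when the budget runs out rather than chosen by the model, which is a simplification of the action table.

```python
import random
from enum import Enum


class Action(Enum):
    CONTINUE_THINKING = "continue_thinking"
    ADD_DATA = "add_data"
    ANSWER = "answer"


def run_episode(decide, fetch, question, options, evidence, max_steps=4, rng=None):
    """Loop until the policy answers or the step budget is exhausted.

    decide(question, options, evidence) -> (Action, payload)
    fetch(request) -> list of new evidence strings
    """
    rng = rng or random.Random()
    evidence = list(evidence)
    for _ in range(max_steps):
        action, payload = decide(question, options, evidence)
        if action is Action.ANSWER:
            return payload, "answered"
        if action is Action.ADD_DATA:
            evidence.extend(fetch(payload))       # payload names one source and its coordinates
        else:
            evidence.append(f"note: {payload}")   # keep the refined reasoning in context
    # Budget exhausted: the benchmark still requires an option, so pick one at random.
    return rng.choice(options), "random_fallback"


def toy_decide(question, options, evidence):
    # Toy policy: ask for thermal evidence once, then answer.
    if not any(e.startswith("thermal:") for e in evidence):
        return Action.ADD_DATA, {"source": "thermal", "day": 2, "time": "13:05"}
    return Action.ANSWER, "B"


def toy_fetch(request):
    return [f"{request['source']}: kettle region is hot"]


print(run_episode(toy_decide, toy_fetch, "Was the kettle hot?", ["A", "B", "C", "D"], []))
# ('B', 'answered')
```

Returning a status alongside the answer is a design choice of this sketch: it lets downstream code tell a supported answer from a forced guess.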

For business readers, the “random fallback” sounds inelegant. It is. But it is also honest. The CASTLE Challenge requires a closed-form answer for every question. In a real enterprise system, that fallback would not be random. It would probably be “escalate to human review”, “mark as unsupported”, “request additional capture”, or “do not automate this decision”.

The mechanism still transfers. The random fallback is the competition-specific endpoint. The useful business pattern is the explicit unsupported state.
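In an enterprise setting, the same endpoint could be modelled as an explicit state rather than a guess. The following is a minimal sketch under that assumption; the `Resolution` type, the `resolve` helper, and the escalation wording are hypothetical and do not come from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    status: str                # "answered" or "unsupported"
    answer: Optional[str]
    next_step: Optional[str]   # what a human or upstream process should do next


def resolve(answer, supported, escalation="escalate to human review"):
    """Return the answer only when it is supported; otherwise surface an unsupported state."""
    if supported and answer is not None:
        return Resolution("answered", answer, None)
    return Resolution("unsupported", None, escalation)


print(resolve("B", supported=True))
print(resolve("B", supported=False, escalation="request additional capture"))
```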

The evidence supports the mechanism, but not every row is an ablation

The paper gives two main quantitative tables. They should not be read the same way.

Table 2 is the clean leaderboard comparison. MARS, submitted under the participant name “ilearn_zhy”, ranked second with 0.57 accuracy. The top participant scored 0.58. Other listed scores were 0.55, 0.50, 0.35, and 0.21. This is the main comparative evidence that MARS was competitive in the challenge setting.

Table 1 is more subtle. It reports the solution evolution during the challenge:

| Step | Accuracy | Modification |
| --- | --- | --- |
| 1 | 0.35 | Zero-shot QA using only question text and answer options |
| 2 | 0.42 | Inject official transcripts from likely streams |
| 3 | 0.44 | Add person, day, room, and event cue parsing |
| 4 | 0.48 | Convert retrieved video clips into HCQA-style captions, OCR notes, and DeepSeek summaries |
| 5 | 0.51 | Add source-specific auxiliary evidence |
| 6 | 0.55 | Introduce a GPT-5.4 decision agent |
| 7 | 0.57 | Add final consistency checking and unsupported-case random fallback |

This table is informative, but it is not a clean controlled ablation in the strict experimental sense. It is an engineering progression. Each later step builds on earlier changes, so we cannot isolate each modification as an independent causal effect. The +0.07 from transcripts, for example, does not prove that transcripts would always add seven points under every retrieval setup. The +0.04 from the decision agent does not prove that the same agent would help without the evidence memory already in place.

Still, the progression is useful because it reveals where performance came from.

The jump from 0.35 to 0.42 suggests that question-only guessing was weak and that official transcripts carried substantial task-relevant evidence. The smaller move to 0.44 from parsing person, day, room, and event cues suggests that retrieval narrowing helped, but only modestly at that stage. The rise to 0.48 after video captions, OCR notes, and summaries indicates that searchable visual evidence mattered. The move to 0.51 after auxiliary evidence supports the paper’s claim that not all questions can be solved from video-transcript evidence alone.

The improvement to 0.55 after introducing the decision agent, +0.04 and the joint-largest gain after the transcript step, is the strongest signal for the paper’s mechanism-first interpretation. Once evidence exists in a structured form, deciding what to ask for next becomes valuable. The final move to 0.57 after consistency checking and unsupported-case fallback suggests that answer hygiene also matters, though the reported gain is smaller.
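The step-to-step gains discussed above can be tabulated directly from the reported accuracies. The short labels are paraphrases, and the increments are cumulative engineering steps, not isolated effects.

```python
# (step, accuracy, modification) as reported in the solution-evolution table
STEPS = [
    (1, 0.35, "question text and options only"),
    (2, 0.42, "official transcripts"),
    (3, 0.44, "person/day/room/event cue parsing"),
    (4, 0.48, "video captions, OCR notes, summaries"),
    (5, 0.51, "source-specific auxiliary evidence"),
    (6, 0.55, "decision agent"),
    (7, 0.57, "consistency check and fallback"),
]


def step_gains(steps):
    """Return (step, gain over the previous step) pairs, rounded to two decimals."""
    return [(cur[0], round(cur[1] - prev[1], 2)) for prev, cur in zip(steps, steps[1:])]


for step, gain in step_gains(STEPS):
    print(f"step {step}: {gain:+.2f}")

total_gain = round(STEPS[-1][1] - STEPS[0][1], 2)
print(f"total: {total_gain:+.2f}")
```

The gains are +0.07, +0.02, +0.04, +0.03, +0.04, and +0.02, for a total of +0.22 over the question-only baseline.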

A fair reading is this: MARS did not win by one brilliant trick. It accumulated gains by turning a messy multimodal archive into an evidence system.

That is usually how practical AI improves. Less lightning bolt, more plumbing.

Why “just add everything to context” is the wrong lesson

The obvious misconception is that long-context multimodal models will make source selection unnecessary. Why build evidence memory? Why classify sources? Why ask for one modality at a time? Just feed the model the whole universe and let it reason.

This belief is attractive because it removes system design from the story. Unfortunately, it also removes cost control, auditability, and much of the ability to diagnose failure.

CASTLE’s structure makes the weakness visible. The relevant evidence may be distributed across days, rooms, participants, and modalities. If a question concerns spoken content, dumping thermal images into context is mostly noise. If a question concerns a cooking state, transcript-only reasoning may hallucinate confidence. If a question concerns a screen, OCR notes may be more decisive than general visual captioning. If a question concerns attention, gaze may matter even when the scene itself is visually obvious.

MARS replaces indiscriminate context expansion with evidence routing.

That gives three practical advantages:

| Design choice | Technical effect | Business meaning |
| --- | --- | --- |
| Offline evidence memory | Converts long media into searchable, indexed notes | Reusable preprocessing instead of repeated raw-media analysis |
| Source-aware retrieval | Requests evidence by modality, day, stream, person, room, or time | Lower cost and better failure diagnosis |
| Agentic source control | Decides whether to reason, add data, answer, or fall back | More disciplined automation under uncertainty |
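To make evidence routing concrete, here is a deliberately crude sketch that maps cue words in a question to the first source worth requesting. MARS delegates this choice to a decision model, not to keyword rules; the `ROUTES` table, the cue lists, and the transcript default are assumptions made purely for illustration.

```python
import re

# Hypothetical cue prefixes -> first source to request. Order encodes priority.
ROUTES = [
    (("say", "said", "mention", "talk"), "transcript"),
    (("screen", "display", "written", "text"), "video_summary"),
    (("hot", "warm", "cook", "boil", "temperature"), "thermal"),
    (("look", "attention", "notice", "watch"), "gaze"),
    (("run", "exert", "stress", "tired"), "heartrate"),
    (("who", "wearing", "brand", "colour", "color"), "photo"),
]


def first_source(question: str) -> str:
    """Pick the first source to consult by matching word prefixes in the question."""
    tokens = re.findall(r"[a-z]+", question.lower())
    for cues, source in ROUTES:
        if any(tok.startswith(cue) for tok in tokens for cue in cues):
            return source
    return "transcript"  # default to the cheapest textual source


for q in [
    "What did Anna say about the trip?",
    "Was the pan hot when he touched it?",
    "What was he looking at on the shelf?",
    "Who was wearing the red jacket?",
]:
    print(q, "->", first_source(q))
```

Even a toy router like this shows the diagnostic benefit: when an answer is wrong, the log records which source was consulted and why.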

The business lesson is not “use MARS exactly as implemented”. The lesson is that multimodal AI systems need an operating layer between raw data and model reasoning. That layer should know what evidence exists, where it came from, how expensive it is, and what kind of question it can answer.

Without that layer, the model is not doing multimodal reasoning. It is rummaging.

The business use case is auditable decision support, not omniscient video intelligence

MARS is a competition system, not a production deployment study. But the operating pattern maps clearly to business workflows that involve long, messy, multimodal records.

Consider field operations. A technician wears a camera, speaks with a remote supervisor, photographs a component, and works around machinery that may generate heat signatures. Later, a manager asks whether the technician followed a procedure. A useful AI system should not simply summarise the whole day. It should retrieve the procedure-relevant transcript segment, align it with video notes, check photos for component state, and request thermal evidence only if heat condition is part of the claim.

Or take compliance review. A company may need to verify whether safety equipment was used during a specific operation. The system should identify the relevant day, stream, room, and time range; use video summaries and OCR where needed; and mark unsupported cases rather than inventing a confident answer because the form requires one.

Or training analytics. A trainee’s performance may involve what they saw, what they said, what action they performed, and whether stress or physical exertion signals changed. The right architecture is not a single “training video summariser”. It is a source-selective evidence workflow.

Here is the practical translation:

| What the paper directly shows | Cognaptus business inference | What remains uncertain |
| --- | --- | --- |
| MARS improves from a 0.35 question-only baseline to 0.57 through staged evidence additions and agentic source selection | Long-record AI systems should invest in evidence indexing and modality routing before chasing larger end-to-end models | The paper does not test production latency, cost, human-review integration, or real operational error costs |
| Source-specific auxiliary evidence improves the challenge system from 0.48 to 0.51 in the reported evolution | Auxiliary modalities can create marginal value when primary video and transcripts are ambiguous | The contribution of each auxiliary modality is not isolated independently |
| The decision agent improves the reported progression from 0.51 to 0.55 | Agentic control is useful when it operates over structured evidence rather than raw chaos | The report does not compare many alternative agent policies or decision models |
| Consistency checking and unsupported fallback improve the final reported score to 0.57 | Automated workflows need explicit unsupported states and final answer checks | Random fallback is benchmark-specific and should not be copied into high-stakes business decisions |

This distinction matters. The paper supports a source-selective architecture for benchmark QA. It does not prove that every enterprise should deploy autonomous video reasoning tomorrow morning. That would be the kind of conclusion consultants put on slides when they have run out of analysis.

The better conclusion is narrower and stronger: when multimodal evidence is long, heterogeneous, and partially relevant, the architecture should separate evidence preparation from evidence selection from final decision.

The leaderboard result is close, so the mechanism matters more than the trophy

MARS ranked second with 0.57 accuracy, just below the top score of 0.58. A leaderboard gap of 0.01 is not a philosophical event. It tells us the system was competitive, not that it revealed the final form of multimodal intelligence.
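A quick back-of-the-envelope check shows how small that gap is on a 185-question test set. The sketch assumes each question is an independent pass/fail trial, which is an assumption made here; the report does not publish confidence intervals.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate, assuming independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


N = 185  # size of the CASTLE Challenge test set, as reported
scores = {"top entry": 0.58, "MARS": 0.57}

for name, acc in scores.items():
    se = standard_error(acc, N)
    print(f"{name}: {acc:.2f} +/- {se:.3f} (one standard error)")

gap_in_questions = (scores["top entry"] - scores["MARS"]) * N
print(f"gap in questions: about {gap_in_questions:.1f}")
```

Under those assumptions each score carries a standard error of about 0.036, several times the 0.01 gap, and the gap itself corresponds to roughly two questions.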

The more durable insight comes from the solution evolution. Each performance gain corresponds to a more disciplined handling of evidence:

First, use transcripts because they are cheap and semantically dense.

Second, parse entities and context cues because retrieval without coordinates is wasteful.

Third, convert long video into captions, OCR notes, and summaries because raw video does not fit repeated reasoning.

Fourth, add auxiliary sources because some questions need evidence that video summaries and transcripts do not capture cleanly.

Fifth, use an agent to decide what is missing rather than applying the same retrieval recipe to every question.

Finally, check consistency and define what happens when support remains insufficient.

This is mechanism-first progress. It is also painfully familiar to anyone who has built real automation: the model matters, but the surrounding evidence system decides whether the model gets the right facts at the right time.

Where the result should not be overextended

The paper’s boundaries are straightforward.

First, the test set contains 185 closed-form questions. That is enough for a challenge result, but not enough to claim broad production reliability across all long-horizon multimodal tasks.

Second, the reported solution evolution is cumulative. It is useful as an engineering history, but it should not be interpreted as a fully controlled ablation isolating every component’s independent effect.

Third, the fallback action is shaped by the benchmark. Since the challenge requires one of four options, MARS uses random fallback when evidence remains insufficient. In enterprise settings, the analogous action should usually be escalation, abstention, or additional data request, not random selection. Randomness is not governance. It is what happens when a multiple-choice test refuses to accept “not enough evidence”.

Fourth, the report does not deeply evaluate cost, latency, privacy, security, human-in-the-loop review, or maintenance complexity. These are not minor details for business deployment. A system that indexes four days of multimodal human activity is also a system that handles sensitive behavioural evidence. The architecture may be technically sensible, but implementation governance would decide whether it is acceptable.

These boundaries do not weaken the paper’s core contribution. They keep it in the right box.

The practical design rule: route evidence before asking for judgement

The most useful way to read MARS is as a design rule:

Do not ask an AI system to judge before you have taught it how to route evidence.

For long-horizon multimodal AI, that rule has several consequences.

A production system should not store video, audio, photos, sensor traces, and transcripts as disconnected assets and then expect an LLM to “understand the context”. It should build evidence memory with source, time, owner, location, and modality metadata. It should distinguish cheap textual evidence from expensive visual evidence. It should know when a question needs OCR, when it needs identity grounding, when it needs thermal cues, and when it simply lacks support.

The agent should not be treated as the whole product. The agent is useful because it sits on top of structured evidence and chooses controlled actions. Without evidence memory, the agent is just a confident narrator with a filing cabinet problem.

That is the understated value of MARS. It shows a pragmatic route from multimodal data accumulation to multimodal decision support. Not omniscience. Not magic. Not “upload everything and ask nicely”. A source-aware system that compresses, indexes, requests, checks, and sometimes admits that it does not know.

For business AI, that is not a small lesson. It is the difference between automation that can be audited and automation that merely sounds fluent while losing the evidence trail.

And if the future of multimodal AI is going to involve more cameras, more sensors, more transcripts, and more questions over longer periods, then source selection will not be a technical footnote. It will be the control surface.

The data lake is already full. The next advantage belongs to the systems that know which cup of water to drink.

Cognaptus: Automate the Present, Incubate the Future.


1. Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, and Liqiang Nie, “MARS: Technical Report for the CASTLE Challenge at EgoVis 2026,” arXiv:2605.18176v1, 18 May 2026.