Stage Before You Shoot: Why Reliable AI Needs a Middle Game

TL;DR for operators

AI systems are increasingly being asked to work in messy, high-dimensional environments: long video archives, multilingual evidence, persona-specific retrieval, humanoid motion, physical contact, timing, perception, and real-world deployment. The temptation is familiar: throw a stronger model at the whole thing and hope intelligence leaks out of the parameter count. Charming. Also expensive.

Two recent papers from very different corners of AI point to a more useful lesson. One paper proposes C2F-RAG, a training-free coarse-to-fine pipeline for long-video retrieval-augmented generation, where semantic retrieval, logical reranking, persona filtering, and grounded JSON generation are separated into distinct stages.¹ The other introduces RoboNaldo, a three-stage motion-guided curriculum reinforcement-learning system for humanoid soccer shooting, where stable kicking, target adaptation, and moving-ball timing are learned progressively rather than demanded from a single reward soup.²

The shared insight is simple but operationally powerful:

Reliable AI does not come from asking one signal to supervise everything. It comes from staging the problem so each signal is used only where it is trustworthy.

For managers, builders, and AI buyers, this is not a robotics curiosity or a video-search trick. It is a design pattern for operational AI. Use broad cheap signals for narrowing. Use stronger reasoning only after the search space is smaller. Use task rewards only after basic behavior exists. Use hard schemas, triggers, stabilizers, and evaluation gates where the system must execute. In other words: build the middle game. The opening move and the final answer are not enough.

The shared problem: one task, too many incompatible signals

The two papers look unrelated at first glance. One is about retrieving and generating answers from a 110,000-plus video corpus. The other is about a Unitree G1 humanoid kicking soccer balls toward targets. One lives in vector search and LLM prompting. The other lives in reinforcement learning, control loops, and high-impulse contact.

That difference is exactly why the pairing is useful.

Both papers deal with tasks where a naive single-stage system breaks because the available signals are useful, but not universally useful. Semantic similarity is good for finding candidates, but bad at determining persona-grounded relevance. OCR and ASR are noisy for broad vector retrieval, but useful for fine-grained evidence reasoning. Motion imitation is good for balance and coordination, but bad at target adaptation. Sparse task rewards know what success looks like, but do not know how to teach a humanoid to stay upright long enough to kick anything.

This is the operational disease: a signal gets promoted beyond its competence.

The C2F-RAG paper names its version of the disease the “semantic-logical gap.” Dense retrieval can surface videos that look topically similar while still being logically irrelevant to the user’s persona or question. Those hard negatives then contaminate downstream generation. A language model, being ever polite and occasionally too imaginative, may synthesize from the wrong evidence with great confidence.

RoboNaldo faces the embodied version. A fixed human-kick reference gives the robot a stable motion scaffold, but cannot tell it where to strike a randomly placed ball or when to kick a moving one. Pure task reward, meanwhile, gives the final objective but not enough guidance through the enormous search space of humanoid balance, swing coordination, contact, aim, and recovery.

Different medium. Same failure: the wrong signal is asked to carry the whole system.

The shared insight: stage the trust, not just the computation

The better framing is not “more model” versus “less model.” It is “which stage deserves which signal?”

A useful staged system has three properties:

It knows what each signal is good for.
It prevents early-stage noise from poisoning later-stage decisions.
It introduces stronger, slower, or more specific supervision only after the search space has been made manageable.

A generic version looks like this:

$$ \text{Reliable behavior} \approx \sum_{i=1}^{n} \text{Stage}_i(\text{signal}_i, \text{constraint}_i, \text{failure mode}_i) $$

That is not a mathematical law. It is a management reminder with symbols attached, which is often what organizations need before buying another dashboard.

The papers instantiate the pattern differently:

System	Coarse stage	Middle stage	Execution stage	Failure being controlled
C2F-RAG	Dense retrieval over clean global visual/text summaries	LLM-based logical reranking over full multimodal context, including OCR and ASR	Persona-constrained generation with temporal grounding and strict JSON schema	Semantic distractors, hallucination, ungrounded citations, persona drift
RoboNaldo	Human-motion tracking to learn stable whole-body kick prior	Free-kick adaptation with ball and target rewards	Moving-ball shooting with locomotion command, kick trigger, and stabilization	Falling, mistimed contact, weak strikes, poor target accuracy, poor sim-to-real behavior

The systems differ technically, but they share the same architecture of restraint. They do not let the first useful signal become king. They give it a job description.

What the video RAG paper shows: semantic similarity is a useful intern, not a judge

The C2F-RAG paper is built around a practical problem in long-video retrieval-augmented generation: the system must retrieve relevant videos from a massive multilingual, event-centric corpus and generate persona-constrained answers with temporal grounding. This is not a toy search box. The system has to find evidence, respect the user’s informational lens, avoid hallucination, and cite the right video chunks.

The authors’ key move is to separate broad retrieval from logical selection.

In the coarse stage, they use BGE-M3 dense retrieval over high-level global summaries and dense frame descriptions. Crucially, they exclude OCR and ASR from this broad embedding stage. That may sound counterintuitive, because OCR and ASR can contain exactly the details needed for evidence. But the paper’s logic is sound: those modalities are fragmented, local, and noisy. They may help a reasoner inspect a candidate, but they can degrade a large vector space if used too early.

So the first stage asks only: “What is semantically plausible?”

It retrieves a wide candidate pool, reportedly the top 1,000 videos from a corpus of more than 110,000. Then the fine stage reintroduces the messy evidence. It serializes global summaries, localized visual frames, OCR, and ASR into a time-ordered multimodal context and gives that to an LLM-based adapted A.I.R. filtering agent. That agent scores logical alignment with the query and persona, explicitly pruning hard negatives.

The last stage then generates the answer from the distilled subset, with prompt instructions for persona adherence, zero hallucination, sentence-level support, temporal grounding, and strict JSON output.
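The split between coarse and fine inputs can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: the `VideoRecord` and `Segment` shapes and the helper names `coarse_text` and `fine_context` are hypothetical. The point is only that the text sent to the embedder leaves out OCR and ASR, while the text sent to the reranker includes every modality in time order.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float   # segment start time in seconds
    modality: str    # "frame", "ocr", or "asr" (hypothetical labels)
    text: str


@dataclass
class VideoRecord:
    video_id: str
    summary: str
    segments: list[Segment] = field(default_factory=list)


def coarse_text(video: VideoRecord) -> str:
    """Text for the broad embedding stage: summary and frame descriptions only."""
    frames = [s.text for s in video.segments if s.modality == "frame"]
    return " ".join([video.summary, *frames])


def fine_context(video: VideoRecord) -> str:
    """Time-ordered context for the reranker: every modality, labeled."""
    ordered = sorted(video.segments, key=lambda s: s.start_s)
    lines = [f"[summary] {video.summary}"]
    lines += [f"[{s.start_s:.1f}s {s.modality}] {s.text}" for s in ordered]
    return "\n".join(lines)


# Invented example record, for illustration only.
video = VideoRecord(
    video_id="v1",
    summary="Flood response briefing",
    segments=[
        Segment(12.0, "asr", "we are closing the bridge"),
        Segment(3.0, "frame", "official at a podium"),
        Segment(5.5, "ocr", "LIVE: RIVER LEVEL 4.2M"),
    ],
)

print(coarse_text(video))
print(fine_context(video))
```

The same record feeds both stages; what changes is which fields each stage is allowed to see.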

The performance evidence supports the design direction. In the reported retrieval table, C2F-RAG reaches nDCG@10 of 0.848 and Recall@100 of 0.837, ahead of the listed baselines. Under the oracle generation setting, it trades off some precision for higher recall and stronger F1-style balance, with the paper reporting higher overall average performance than the official CAG baseline. The interesting part for operators is not the leaderboard bragging. Leaderboards come and go, usually wearing conference badges. The useful part is why the pipeline works: it refuses to treat semantic similarity as proof of relevance.
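For readers who have not computed these retrieval metrics by hand, here is a minimal sketch of nDCG@k and Recall@k. It assumes linear gain and a log2 rank discount, which is one common convention; the paper's evaluation script may use a different variant, and the document IDs below are invented.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: linear gain, log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k for one query; `relevance` maps doc id -> graded relevance."""
    gains = [relevance.get(doc, 0.0) for doc in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevance, k):
    """Share of relevant docs that appear in the top k."""
    relevant = {doc for doc, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Invented toy query: two relevant videos, one distractor ranked between them.
relevance = {"a": 1.0, "b": 1.0}
ranked = ["a", "x", "b"]

print(round(ndcg_at_k(ranked, relevance, 3), 3))
print(recall_at_k(ranked, relevance, 2))
```

nDCG rewards putting relevant items near the top, while Recall@k only asks whether they made the cut. A coarse stage mostly needs the second; a reranker is judged on the first.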

The system’s logic is basically this:

Question	Stage that answers it	Why not earlier?
Is this video broadly about the right topic?	Coarse dense retrieval	Cheap, scalable, high recall
Does this video actually support this persona-specific query?	LLM-based fine reranking	Requires reasoning over full multimodal context
Can the system answer without inventing unsupported claims?	Prompt-sculpted generation and schema enforcement	Requires grounded synthesis and auditable output
Can the output be evaluated and consumed automatically?	Deterministic JSON validation	Requires structural control after generation

This is a serious enterprise lesson. Many RAG failures are not “LLM failures” in the glamorous sense. They are admission-control failures. The wrong documents enter the final context, and then the generator is expected to behave as if a poisoned evidence pool were a minor inconvenience.

It is not.

A generator cannot reliably compensate for a retrieval system that confuses “sounds related” with “is decision-relevant.” That is how an internal assistant cites a policy that mentions procurement but does not apply to the current contract. Or how a sales-support bot retrieves a case study from the wrong industry and produces something that is factually elegant, legally awkward, and commercially useless.

The paper’s staged design is a reminder that retrieval is not one problem. It is at least three: find candidates, judge evidence, and enforce accountable synthesis.
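The "enforce accountable synthesis" step is the easiest one to make deterministic. The sketch below uses only the standard library and assumes a hypothetical output shape (a `sentences` list whose items carry `text` and `citations` with `video_id`, `start_s`, and `end_s`); the paper's actual JSON schema is not reproduced here. It checks that the output parses, that every sentence cites something, and that every citation points at a video the reranker actually admitted.

```python
import json


def validate_answer(raw: str, admitted_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    sentences = payload.get("sentences") if isinstance(payload, dict) else None
    if not isinstance(sentences, list) or not sentences:
        return ["missing or empty 'sentences' list"]

    problems = []
    for i, sentence in enumerate(sentences):
        if not isinstance(sentence, dict) or not sentence.get("text"):
            problems.append(f"sentence {i}: missing text")
            continue
        citations = sentence.get("citations")
        if not isinstance(citations, list) or not citations:
            problems.append(f"sentence {i}: no citation")
            continue
        for cite in citations:
            if not isinstance(cite, dict):
                problems.append(f"sentence {i}: malformed citation")
                continue
            vid = cite.get("video_id")
            if not isinstance(vid, str) or vid not in admitted_ids:
                problems.append(f"sentence {i}: cites unadmitted video {vid!r}")
            start, end = cite.get("start_s"), cite.get("end_s")
            numeric = all(isinstance(v, (int, float)) for v in (start, end))
            if not numeric or not 0 <= start <= end:
                problems.append(f"sentence {i}: invalid time span")
    return problems


# Invented example output, for illustration only.
good = json.dumps({"sentences": [{
    "text": "The bridge was closed during the briefing.",
    "citations": [{"video_id": "v1", "start_s": 10, "end_s": 14}],
}]})
print(validate_answer(good, {"v1"}))
```

A validator like this cannot tell whether the cited span truly supports the sentence. It only guarantees that the generator cannot cite evidence that never passed admission control, which is the cheaper half of the problem.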

What the robotics paper shows: reward is not a substitute for readiness

RoboNaldo makes the same point with knees, ankles, ball contact, and the kind of timing window that makes software people suddenly appreciate physics.

The task sounds easy only to humans, who have spent years casually solving balance, visual prediction, impact control, and foot placement while pretending soccer is mostly about “instinct.” For a humanoid robot, shooting requires whole-body stability, high-impulse interaction, point-level accuracy, and generalization across stationary and moving balls.

The authors explicitly contrast three inadequate approaches:

Approach	What it gives	What it fails to give
Motion tracking	Stable coordination and human-like kicking structure	Adaptation to ball position, target, contact point, and strike timing
Pure task RL	The final objective	Efficient discovery of balance, swing, contact, and aim from sparse feedback
Motion prior / AMP-style behavior shaping	More natural movement	Still weak supervision for when and where to kick

RoboNaldo’s answer is a three-stage curriculum.

Stage 1 tracks a retargeted human side-foot kick without ball or task rewards. The purpose is not to score. The purpose is to acquire a stable whole-body kicking prior. In business language, this is capability bootstrapping before KPI optimization. Radical concept.

Stage 2 introduces the ball, target, and shooting rewards in stationary-ball free-kick settings. Now the policy must adapt the demonstrated kick. It must alter approach, contact point, strike direction, and impact speed depending on sampled ball and target positions.

Stage 3 introduces moving-ball shooting. This adds timing. The paper uses a locomotion-command and kick-trigger interface, with a high-level heuristic planner during training. The low-level policy can then be driven by other high-level controllers at inference. The system also adds stabilization after contact, because a robot that scores and immediately collapses has, depending on the procurement contract, either succeeded narrowly or failed expensively.
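To make the idea of a kick trigger concrete, here is a toy heuristic in Python. It is not RoboNaldo's planner, whose details are not given here. It assumes a ball moving at constant velocity on a flat plane, a fixed strike point in front of the robot, and invented values for swing time and reach. The trigger fires when the ball will pass close enough to the strike point, soon enough that the swing must start now.

```python
import math


def time_to_closest_approach(ball_pos, ball_vel, strike_point):
    """Time (s) at which a constant-velocity ball is nearest the strike point."""
    rx = strike_point[0] - ball_pos[0]
    ry = strike_point[1] - ball_pos[1]
    speed_sq = ball_vel[0] ** 2 + ball_vel[1] ** 2
    if speed_sq == 0.0:
        return 0.0  # stationary ball: it is already as close as it will get
    return max(0.0, (rx * ball_vel[0] + ry * ball_vel[1]) / speed_sq)


def should_trigger_kick(ball_pos, ball_vel, strike_point,
                        swing_time_s=0.4, reach_m=0.15):
    """Fire when the ball will be within reach no later than one swing from now.

    swing_time_s and reach_m are hypothetical tuning values, not paper numbers.
    """
    t = time_to_closest_approach(ball_pos, ball_vel, strike_point)
    cx = ball_pos[0] + ball_vel[0] * t
    cy = ball_pos[1] + ball_vel[1] * t
    miss = math.hypot(strike_point[0] - cx, strike_point[1] - cy)
    return miss <= reach_m and t <= swing_time_s


strike = (0.0, 0.0)
print(should_trigger_kick((2.0, 0.0), (-2.0, 0.0), strike))  # ball still far away
print(should_trigger_kick((0.6, 0.0), (-2.0, 0.0), strike))  # ball about to arrive
```

Even this crude version shows why timing is its own stage: the decision depends on a prediction about the ball, not on anything the kick motion itself can know.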

The empirical story supports the staging. The paper reports that in simulation, pure PPO and AMP struggle to make stable directed contact, while pure motion tracking replays a kick rather than adapting it. Stage 2 learns accurate and powerful free-kicks but does not zero-shot cleanly to moving balls. Stage 3 generalizes across free-kick and moving-ball settings. On hardware, the system runs onboard on a Unitree G1 with LiDAR-camera perception and reports 0.73 m average free-kick target error from 3 m, 0.86 m moving-ball error, 13.10 m/s peak ball speed, and 74% contact in human-passed moving-ball trials.

Again, the business lesson is not “humanoid soccer is ready to replace midfielders.” It is that task reward is not the same as teachability.

A final KPI can be too sparse, too delayed, or too brittle to train useful behavior from scratch. This is familiar outside robotics. “Increase customer retention” is a final objective, not an operational learning signal. “Reduce claim leakage” is a final objective, not enough guidance for a triage model. “Improve analyst productivity” is a dashboard outcome, not a process controller.

The RoboNaldo paper makes the architecture explicit: train the system first to be stable, then to be effective in a controlled task, then to handle timing and deployment variation.
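A curriculum of this kind can be written down as plain configuration. The sketch below is a hedged illustration: the stage names follow the description above, but every reward term name, weight, and promotion threshold is invented for the example and should not be read as the paper's values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    reward_weights: dict[str, float]  # hypothetical term names and weights
    promote_at: float                 # hypothetical success rate needed to advance


CURRICULUM = [
    # Stage 1: imitation only, no ball or task rewards.
    Stage("track_reference_kick", {"motion_tracking": 1.0}, promote_at=0.9),
    # Stage 2: stationary ball, task rewards switched on.
    Stage("free_kick",
          {"motion_tracking": 0.3, "target_error": 1.0, "ball_speed": 0.5},
          promote_at=0.8),
    # Stage 3: moving ball, timing and recovery terms added.
    Stage("moving_ball",
          {"motion_tracking": 0.3, "target_error": 1.0, "ball_speed": 0.5,
           "contact_timing": 1.0, "post_kick_stability": 0.5},
          promote_at=1.0),
]


def total_reward(stage: Stage, terms: dict[str, float]) -> float:
    """Weighted sum over only the terms this stage is allowed to use."""
    return sum(w * terms.get(name, 0.0) for name, w in stage.reward_weights.items())


def next_stage_index(current: int, success_rate: float) -> int:
    """Advance one stage when the promotion gate is met; never skip or wrap."""
    is_last = current >= len(CURRICULUM) - 1
    if not is_last and success_rate >= CURRICULUM[current].promote_at:
        return current + 1
    return current


# In stage 1 a large target_error signal is simply ignored.
print(total_reward(CURRICULUM[0], {"motion_tracking": 0.8, "target_error": 5.0}))
```

The useful property is that a reward term absent from a stage's weights cannot influence that stage, however loudly it shouts.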

The pattern: coarse, adapt, execute

Across both papers, the staged pattern can be reduced to three questions.

1. What can we safely do broadly?

C2F-RAG can safely use semantic retrieval for high-recall candidate generation, but not for final evidence judgment. RoboNaldo can safely use motion tracking to learn a stable kick prior, but not for target-specific shooting.

This is the “broad but shallow” stage. It should be cheap, scalable, and conservative about what it claims. Its job is not to be right in the final sense. Its job is to keep enough good options alive while eliminating obvious waste.

2. What must be adapted with richer supervision?

C2F-RAG uses LLM-based cognitive filtering over full multimodal context to distinguish logical evidence from semantic distractors. RoboNaldo uses shooting rewards and randomized stationary-ball settings to adapt a human-like kick into a target-directed interaction skill.

This is the middle stage, and it is where many production systems are underbuilt. Organizations love retrieval and they love final generation. They love imitation learning and they love final reward. The middle, where candidates become evidence and motion becomes task behavior, is less glamorous. It is also where reliability is often decided.

3. What must be constrained at execution?

C2F-RAG constrains generation through persona instructions, temporal grounding, strict schemas, and deterministic post-processing. RoboNaldo constrains moving-ball execution through locomotion commands, kick triggers, contact timing, and stabilization.

This is the execution stage. It is where the system must stop exploring and start obeying. Outputs need to be citeable, machine-readable, auditable, or physically stable. “The model probably understood” is not a control policy. It is a mood.

The useful tension: one paper avoids training; the other depends on it

The contrast between the papers is worth keeping, because it prevents the wrong lesson.

C2F-RAG is training-free. It uses existing embedding models, a commercial LLM, prompt design, reranking logic, and deterministic schema enforcement. Its thesis is modular orchestration. The economic appeal is clear: if your environment changes quickly, if labeled data is scarce, or if the problem is more about evidence selection than task execution, a training-free staged pipeline can be attractive.

RoboNaldo is the opposite. It is explicitly a training-heavy curriculum RL method. The system succeeds because the policy is trained through sequential stages, reward redesign, simulation, domain randomization, and real-world deployment. Prompting a humanoid with “please kick accurately and remain upright” is, regrettably, not yet a robotics strategy.

The shared lesson is therefore not “avoid training” or “always train.” The lesson is:

Match the learning mechanism to the failure mode.

If the core failure is evidence contamination, logical irrelevance, persona drift, or formatting, modular retrieval and reasoning gates may be enough. If the core failure is embodied coordination, sparse feedback, and timing under physics, training is not optional. The staging principle survives, but the implementation changes.

For business AI, this distinction matters. Too many teams debate “fine-tuning versus RAG” as if the answer were theological. It is usually mechanical. Ask what failure mode you are controlling:

Failure mode	Likely intervention
Wrong evidence enters context	Better retrieval, reranking, metadata filters, evidence gates
Right evidence enters but answer is poorly structured	Prompt constraints, schema enforcement, post-processing
Model lacks domain behavior or style	Fine-tuning, preference optimization, examples, evaluation
Agent cannot coordinate multi-step action	Workflow decomposition, policy learning, tool contracts
Physical or operational timing is unstable	Control loops, triggers, simulation, staged deployment
Final objective is too sparse	Intermediate rewards, curricula, proxy tasks, demonstrations

This is the kind of table that should appear before budget approval, not after the pilot fails.

Business interpretation: build stage contracts, not just pipelines

Now let’s translate the research into operating principles. The papers show technical systems. The business interpretation is broader: every serious AI workflow needs explicit stage contracts.

A stage contract defines five things:

Contract element	Practical question
Input	What does this stage receive, and what has already been filtered out?
Signal	What evidence, reward, model, or heuristic is trusted here?
Decision	What is this stage allowed to decide?
Output	What must it pass forward, and in what format?
Rejection rule	What must it refuse, suppress, delay, or flag?

C2F-RAG has a clear version of this. The coarse stage receives global summaries and frame descriptions, not all noisy modalities. Its decision is candidate retrieval. The fine stage receives richer multimodal context and decides logical/persona relevance. The generation stage receives distilled evidence and must output grounded JSON.

RoboNaldo also has stage contracts. Stage 1 receives a reference motion and learns coordination. It is not asked to optimize ball placement. Stage 2 receives ball and target variation and learns stationary shooting adaptation. Stage 3 receives moving-ball timing through command and trigger interfaces and learns deployment-ready interaction behavior.
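A stage contract can be made literal in code. The sketch below is one possible encoding, with hypothetical names throughout (`StageContract`, `run_stage`, and the toy `similarity` and `persona_match` fields stand in for real scoring). Each stage declares what it needs, which signal it trusts, what it may decide, and which fields it is allowed to pass forward.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageContract:
    name: str
    signal: str                      # what this stage trusts
    input_keys: tuple[str, ...]      # what it must receive
    decide: Callable[[dict], bool]   # the one decision it may make
    output_keys: tuple[str, ...]     # what it may pass forward


def run_stage(contract: StageContract, items: list[dict]):
    """Apply one contract; return (passed, rejected) so rejections stay visible."""
    passed, rejected = [], []
    for item in items:
        missing = [k for k in contract.input_keys if k not in item]
        if missing:
            raise ValueError(f"{contract.name}: input missing {missing}")
        if contract.decide(item):
            passed.append({k: item[k] for k in contract.output_keys})
        else:
            rejected.append(item)
    return passed, rejected


# Toy stand-ins for real scoring, invented for illustration.
coarse = StageContract("coarse_retrieval", "embedding similarity",
                       ("id", "similarity", "persona_match"),
                       lambda it: it["similarity"] >= 0.5,
                       ("id", "persona_match"))
fine = StageContract("fine_rerank", "LLM logical judgment",
                     ("id", "persona_match"),
                     lambda it: it["persona_match"],
                     ("id",))

items = [
    {"id": "a", "similarity": 0.9, "persona_match": True},
    {"id": "b", "similarity": 0.8, "persona_match": False},  # a hard negative
    {"id": "c", "similarity": 0.2, "persona_match": True},
]
candidates, dropped_coarse = run_stage(coarse, items)
evidence, dropped_fine = run_stage(fine, candidates)
print(evidence)
```

Note that the similarity score does not survive past the coarse stage. Once it has done its job, later stages cannot lean on it.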

This is how enterprises should think about AI systems that must survive contact with messy operations.

A customer-support assistant should not use the same mechanism to retrieve all policies, determine jurisdiction, infer customer intent, draft the answer, and decide whether escalation is required. A credit workflow should not use one model score as if it were simultaneously probability, explanation, compliance rationale, and decision authority. A procurement copilot should not treat document similarity as contractual applicability. A field-service robot should not treat a success reward as a substitute for safe motion primitives.

One signal, one job. Preferably with supervision.

Where the pattern breaks

Staging is not magic. It introduces its own failure modes.

First, a bad early stage can still starve later stages. If C2F-RAG’s coarse retrieval misses the correct video entirely, the fine-stage reasoner cannot resurrect it from the void. If RoboNaldo’s Stage 1 learns a brittle motion prior, later stages inherit the constraint.

Second, stage boundaries can hide accountability. If a final answer is wrong, was retrieval too broad, reranking too strict, generation under-grounded, or schema validation too forgiving? Modular systems need stage-level telemetry. Otherwise, the architecture becomes a blame relay.

Third, latency and cost move around rather than disappear. C2F-RAG reports parallelized fine filtering and generation overhead; that may be acceptable for challenge evaluation or analyst workflows, but not necessarily for sub-second consumer interaction. RoboNaldo’s curriculum improves deployment behavior, but it requires simulation infrastructure, reward engineering, perception design, and hardware validation.

Fourth, the middle stage can become a hand-built shrine. Prompt sculpting, heuristic triggers, reward weights, and handcrafted planners are powerful, but they can also encode hidden assumptions. RoboNaldo’s authors acknowledge limitations: reliance on a single reference kick, a handcrafted high-level trigger policy, and a perception module tailored to a retro-reflective ball. C2F-RAG’s training-free design is economically attractive, but its performance depends on prompt engineering, chosen models, corpus preprocessing, and evaluation setting.

Staging reduces chaos. It does not abolish engineering.
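The stage-level telemetry mentioned above does not have to be elaborate. A minimal sketch, assuming each stage logs the set of item IDs it kept (the stage names and IDs here are invented): given a known-correct item, find the first stage that lost it.

```python
from typing import Optional


def funnel(stage_outputs: dict[str, set[str]]) -> list[tuple[str, int]]:
    """How many items survive each stage, in pipeline order."""
    return [(stage, len(kept)) for stage, kept in stage_outputs.items()]


def first_drop(stage_outputs: dict[str, set[str]], gold_id: str) -> Optional[str]:
    """Name of the first stage whose output no longer contains the gold item.

    Relies on dict insertion order matching pipeline order.
    Returns None when the item survives every stage.
    """
    for stage, kept in stage_outputs.items():
        if gold_id not in kept:
            return stage
    return None


# Invented trace for one query.
trace = {
    "coarse_retrieval": {"v1", "v2", "v3", "v7"},
    "fine_rerank": {"v2", "v3"},
    "generation_citations": {"v2"},
}

print(funnel(trace))
print(first_drop(trace, "v7"))  # the reranker was too strict for this item
print(first_drop(trace, "v9"))  # never retrieved: a coarse-stage miss
```

Aggregated over an evaluation set, the counts from `first_drop` show which stage is starving the next one, which is more useful than learning that the final answer was wrong.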

The operator’s framework: the Signal Assignment Test

Before deploying or buying an AI system, ask one unfashionably useful question:

Which signal is being trusted to make which decision?

Then audit the system with this five-step test.

1. Identify the final behavior

Do not start with model type. Start with the operational behavior.

Should the system retrieve evidence?
Rank options?
Generate a cited answer?
Trigger an action?
Move a robot?
Escalate a decision?
Refuse unsafe requests?

Different final behaviors require different reliability mechanisms.

2. Split the behavior into stage decisions

A useful AI system usually contains hidden decisions. Make them visible.

For RAG, the decisions might be: parse query, retrieve candidates, rerank evidence, generate answer, validate citations, format output. For robotics, they might be: perceive scene, choose skill, approach target, trigger action, execute contact, stabilize.

If the system diagram has one box labeled “AI,” that is not architecture. That is graphic design.

3. Assign one trusted signal per stage

Semantic similarity can retrieve. It should not certify truth. User clicks can suggest usefulness. They should not define compliance. Motion imitation can bootstrap coordination. It should not decide optimal contact timing. Final reward can evaluate success. It may not teach the path there.

This is the core discipline.

4. Add rejection and fallback rules

Every stage needs a way to say no.

Retrieval should drop low-confidence or out-of-scope candidates.
Reranking should suppress hard negatives.
Generation should refuse unsupported claims.
Control policies should avoid unstable actions.
Human escalation should catch cases where the system cannot justify itself.

No rejection rule means every stage is a yes-machine. Businesses already have enough of those. They are called meetings.
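A rejection rule can be as small as a three-way gate. The sketch below assumes each candidate arrives with a confidence score between 0 and 1; the thresholds and the names `gate` and `admit` are hypothetical and would need calibration against real data.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ESCALATE = "escalate"


def gate(score: float, accept_at: float = 0.8, reject_below: float = 0.4) -> Verdict:
    """Three-way decision: pass, drop, or hand to a human.

    Thresholds are illustrative defaults, not recommended values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= accept_at:
        return Verdict.ACCEPT
    if score < reject_below:
        return Verdict.REJECT
    return Verdict.ESCALATE


def admit(scored: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split candidates into (accepted, needs_review); rejects are dropped."""
    accepted = [k for k, s in scored.items() if gate(s) is Verdict.ACCEPT]
    review = [k for k, s in scored.items() if gate(s) is Verdict.ESCALATE]
    return accepted, review


accepted, review = admit({"a": 0.9, "b": 0.5, "c": 0.1})
print(accepted, review)
```

The middle band matters most. A gate with only accept and reject forces uncertain cases into one of two wrong answers.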

5. Measure by stage, not only by final output

Final metrics are necessary but insufficient. The C2F-RAG paper reports retrieval metrics, generation/grounding metrics, and latency. RoboNaldo reports simulation metrics, ablations, hardware results, contact, alive rate, shot error, and ball speed. That decomposition is useful because it exposes where performance comes from.

For enterprise systems, measure:

Layer	Example metric
Candidate generation	Recall, coverage, latency, cost
Evidence selection	Precision, false positive rate, hard-negative rejection
Generation	Groundedness, citation validity, schema compliance
Action execution	Success rate, rollback rate, safety incidents
Human interface	Escalation quality, review time, override rate

A single “accuracy” number is usually where nuance goes to die.
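The first two rows of that table can be computed from sets of IDs alone. A minimal sketch, assuming you have labeled relevant items and known hard negatives for an evaluation query (the function names and the example IDs are invented):

```python
def candidate_recall(candidates: set, relevant: set) -> float:
    """Candidate generation: how much of the relevant set survived retrieval."""
    return len(candidates & relevant) / len(relevant) if relevant else 0.0


def selection_precision(selected: set, relevant: set) -> float:
    """Evidence selection: how much of what was kept is actually relevant."""
    return len(selected & relevant) / len(selected) if selected else 0.0


def hard_negative_rejection(selected: set, hard_negatives: set) -> float:
    """Evidence selection: share of known distractors that were filtered out."""
    if not hard_negatives:
        return 1.0  # nothing to reject, so nothing was wrongly admitted
    return 1.0 - len(selected & hard_negatives) / len(hard_negatives)


def stage_report(candidates, selected, relevant, hard_negatives) -> dict:
    """One row per layer instead of a single blended accuracy number."""
    return {
        "candidate_recall": candidate_recall(candidates, relevant),
        "selection_precision": selection_precision(selected, relevant),
        "hard_negative_rejection": hard_negative_rejection(selected, hard_negatives),
    }


# Invented labels for one evaluation query.
report = stage_report(
    candidates={"a", "b", "c", "d"},
    selected={"a", "c"},
    relevant={"a", "b", "e"},
    hard_negatives={"c", "d"},
)
print(report)
```

In this toy case retrieval missed one relevant item and the reranker admitted one distractor. A single accuracy figure would have blended the two faults into one unhelpful number.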

The bigger lesson: intelligence is not the same as admissibility

Both papers are ultimately about admissibility.

C2F-RAG asks: which pieces of video evidence are admissible for this persona-specific answer? RoboNaldo asks: which learned behaviors are admissible before the robot is allowed to strike a moving ball? In both cases, the system becomes stronger by narrowing what each stage is permitted to do.

This is countercultural in AI, where capability is often marketed as expansion: more modalities, more context, more parameters, more autonomy, more tools, more everything. The papers point in the opposite direction. Expansion must be paired with disciplined staging, or the system merely becomes more creative about failing.

The video system does not let noisy OCR and ASR contaminate first-pass retrieval; it waits until reasoning can use them properly. The robot does not try to learn target shooting before it can execute a stable kick; it waits until the body has a usable prior. Both systems delay certain information until the moment it becomes useful.

That is the mature design instinct: not using every signal immediately just because it exists.

What managers should take away

For business readers, the main takeaway is not that every AI system needs three stages. Three is a convenient number, not a sacred architecture. The takeaway is that reliability comes from assigning responsibility.

A staged AI system should make clear:

what each stage is allowed to decide;
what evidence or reward that stage can trust;
what gets filtered before moving forward;
what constraints govern the final output or action;
where failure is measured.

That applies to RAG, agents, robotics, finance workflows, healthcare triage, legal retrieval, operations planning, and customer support. Anywhere a system must operate under noisy evidence, incomplete context, time pressure, or real-world consequences, the same principle appears: do not ask one model, embedding, reward, prompt, or controller to solve the whole problem.

The C2F-RAG paper shows this in evidence selection. RoboNaldo shows it in embodied control. Together, they make a broader point that is easy to miss if you read them as separate technical artifacts.

The future of applied AI will not be won by the most theatrical end-to-end demo. It will be won by systems that know when to retrieve, when to reason, when to adapt, when to trigger, and when to shut up.

A little staging, it turns out, is not bureaucracy. It is how the system survives the performance.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jiaxin Dai, Zehang Wei, Jiamin Yan, and Xiang Xiang, “Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation,” arXiv:2606.07924, 2026. https://arxiv.org/html/2606.07924
² Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, and Hongyang Li, “RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning,” arXiv:2606.11092, 2026. https://arxiv.org/html/2606.11092
