Think First, Grasp Later: Why Robots Need Reasoning Benchmarks

A robot receives a simple instruction: pick up the blue cup.

It approaches the blue cup, positions its gripper badly, and knocks the cup over. Another robot moves smoothly, closes its gripper precisely—and picks up the red cup.

On the operations dashboard, both attempts may appear under the same pleasantly uninformative label: task failed.

Yet the first robot understood the instruction and executed it poorly. The second executed competently after misunderstanding the instruction. Fixing one requires better control. Fixing the other requires better reasoning. Treating both as the same failure is an efficient way to spend more on training while learning very little.

The paper “Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training”¹ addresses this diagnostic problem through two connected contributions. First, it introduces ERIQ, a benchmark that evaluates embodied reasoning without requiring a robot to execute an action. Second, it proposes GenieReasoner and its Flow-matching Action Tokenizer, or FACT, to connect semantic reasoning with precise continuous control.

The paper is easiest to understand as a comparison among three imperfect robots:

a robot that knows what to do but cannot execute precisely;
a robot that moves precisely but may choose the wrong action;
a robot architecture designed to preserve both semantic judgment and motor precision.

That comparison matters more than the usual benchmark-versus-benchmark horse race. It reveals that reasoning and execution are separate capabilities, require separate measurements, and can quietly damage each other when trained together carelessly.

Two Robots Can Fail for Opposite Reasons

Vision-Language-Action models attempt to combine visual perception, language understanding, reasoning, and robot control in one system. The ambition is sensible. The implementation is awkward.

Language models naturally operate with discrete tokens. A model predicts one token after another: a word, a subword, a coordinate label, or perhaps an encoded action. Robot controllers, however, must produce continuous movements. A gripper does not move to token number 4,217. It moves through physical space with continuously varying positions, rotations, and joint states.

Existing VLA systems therefore tend to lean toward one of two architectural families.

Architecture family	What it preserves well	Characteristic failure in the paper’s comparison
Discrete action-token models	Compatibility with the VLM’s semantic and autoregressive representations	The model identifies and approaches the correct target but loses precision during grasping
Continuous action-head models	Smooth, high-fidelity physical control	The model performs competent movements toward an incorrectly identified target
GenieReasoner with FACT	Discrete semantic alignment followed by continuous trajectory reconstruction	Attempts to retain instruction understanding without surrendering fine motor precision

The discrete approach has an attractive simplicity. Reasoning outputs and robot actions can be predicted through the same next-token machinery. Unfortunately, physical accuracy becomes difficult to represent compactly.

Uniformly dividing continuous action space into bins can provide precision, but only by creating a large action vocabulary or long token sequences. Learned quantizers compress more efficiently, but their reconstructed movements may be too crude for contact-sensitive tasks. FAST-style encodings improve compression through variable-length sequences, but the paper argues that variable lengths introduce additional autoregressive decoding instability.
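The binning trade-off is easy to see in a toy setting. The sketch below assumes a single action dimension normalized to [-1, 1] and decoding to bin centres; the function names are illustrative and do not come from the paper.

```python
def quantize(value: float, num_bins: int, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous value to the index of its uniform bin."""
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    # The upper edge of the range belongs to the last bin.
    return min(int((clipped - low) / width), num_bins - 1)


def dequantize(index: int, num_bins: int, low: float = -1.0, high: float = 1.0) -> float:
    """Decode a bin index back to the centre of that bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def worst_case_error(num_bins: int, low: float = -1.0, high: float = 1.0) -> float:
    """Largest possible round-trip error: half of one bin width."""
    return (high - low) / (2 * num_bins)


# Hypothetical vocabulary sizes for one action dimension.
for bins in (16, 256, 4096):
    print(bins, worst_case_error(bins))
```

With 16 bins the worst-case error is 0.0625 of the normalized range; reaching 0.00390625 takes 256 bins. Halving the error means doubling the vocabulary for that dimension, and a real trajectory multiplies the cost across every action dimension and time step.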

Continuous action heads solve a different problem. Diffusion- or flow-based controllers can generate smooth trajectories without forcing every physical detail into a token. But attaching a continuous control objective to a discrete language-and-vision backbone creates its own tension. Gradients optimized for motor reconstruction may interfere with the semantic representations that made the VLM useful in the first place.

The result is not merely a technical trade-off. It is a diagnostic trap. A robot can fail because it selected the wrong target, because it generated the wrong sequence of subtasks, because it did not recognize that an earlier action had failed, or because its otherwise sensible plan became inaccurate during execution. An end-to-end success rate collapses all four into one number.

Convenient, certainly. Informative, less so.
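A small sketch makes the collapse visible. The outcome labels below mirror the four failure causes named above, but the enum, the helper functions, and the trial logs are all hypothetical and not part of the paper.

```python
from collections import Counter
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    SUCCESS = "success"
    WRONG_TARGET = "wrong_target"
    WRONG_SUBTASK_ORDER = "wrong_subtask_order"
    MISSED_EARLIER_FAILURE = "missed_earlier_failure"
    IMPRECISE_EXECUTION = "imprecise_execution"


def success_rate(trials: Sequence[Outcome]) -> float:
    """The single end-to-end number a dashboard would show."""
    return sum(t is Outcome.SUCCESS for t in trials) / len(trials)


def failure_breakdown(trials: Sequence[Outcome]) -> Counter:
    """Count failures by cause, ignoring successful trials."""
    return Counter(t for t in trials if t is not Outcome.SUCCESS)


# Hypothetical logs: two robots with identical success rates.
robot_a = [Outcome.SUCCESS] * 6 + [Outcome.IMPRECISE_EXECUTION] * 4
robot_b = [Outcome.SUCCESS] * 6 + [Outcome.WRONG_TARGET] * 4

print(success_rate(robot_a), failure_breakdown(robot_a))
print(success_rate(robot_b), failure_breakdown(robot_b))
```

Both robots report a success rate of 0.6. Only the breakdown reveals that one needs better control and the other needs better grounding.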

ERIQ Tests the Decision Before Motor Error Can Hide It

The paper’s first contribution, the Embodied Reasoning Intelligence Quotient benchmark, separates reasoning from physical execution.

ERIQ contains 6,052 question–answer pairs constructed from real-world robotic trials. Instead of asking a robot to complete a task, it asks a vision-language model questions about what the robot sees, what should happen next, whether something went wrong, and what a human intends.

Its tasks are organized around four reasoning pillars.

ERIQ pillar	Questions it is designed to answer
Spatial Perception and Grounding	Which object is being referenced? Where is it relative to other objects? Do different camera views show the same target?
Planning and Monitoring	What sequence of actions is needed? Has the task progressed or completed?
Error Detection and Recovery	Did a mistake occur? What kind of mistake was it? What corrective action is appropriate?
Human Intent Understanding	What is the nearby person trying to accomplish, and how should the robot cooperate?

These pillars are divided into 15 finer-grained subtasks. The benchmark spans more than 100 task scenarios across household, restaurant, supermarket, industrial, and office settings. Its inputs include single images, sequential images, and interleaved image–text sequences.

Most importantly, ERIQ uses multiple-choice or binary answers. That makes scoring deterministic and reproducible. It avoids outsourcing evaluation to another language model and then pretending the resulting subjectivity has disappeared because it arrived through an API.

This format also creates a boundary. Multiple-choice reasoning is easier to score than open-ended reasoning, but it does not fully represent the ambiguity of physical environments. A model may recognize the correct answer among supplied alternatives without being able to generate a useful recovery plan independently. ERIQ is therefore a diagnostic instrument, not a complete definition of embodied intelligence.

That distinction strengthens the benchmark rather than weakening it. Diagnostic tests are valuable precisely because they isolate a capability. Nobody expects a blood test to perform surgery.
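The deterministic scoring that multiple-choice and binary answers allow can be sketched in a few lines. This is an illustration under assumptions, not ERIQ's released evaluation code: the question ids, the answer format, and the rule that a missing prediction counts as wrong are all hypothetical.

```python
def score_multiple_choice(predictions: dict, answer_key: dict) -> float:
    """Exact-match accuracy over the answer key; a missing prediction counts as wrong."""
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        predictions.get(qid, "").strip().upper() == gold.strip().upper()
        for qid, gold in answer_key.items()
    )
    return correct / len(answer_key)


# Hypothetical items: three multiple-choice questions and one binary question.
answer_key = {"q1": "B", "q2": "Yes", "q3": "D", "q4": "A"}
predictions = {"q1": "b", "q2": "no", "q3": "D "}  # q4 was left unanswered

print(score_multiple_choice(predictions, answer_key))
```

The same predictions always produce the same score, here 0.5, with no judge model in the loop. The cost is the one already noted: the function can only check a choice, not the quality of a freely generated plan.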

The 82.72% Score Measures Reasoning, Not Robot Success

After embodied pre-training, the paper’s 3-billion-parameter model raises its average ERIQ score from the base model’s 58.64% to 82.72%. That is an absolute improvement of 24.08 percentage points.

The model scores particularly highly on action understanding and human-intention comprehension, reaching 96.67% and 96.44%, respectively. It also makes large gains in dual-view matching and relative-position grounding, suggesting that embodied co-training improves the model’s ability to connect instructions with spatially grounded observations.

The aggregate score, however, should not be read as evidence that reasoning has been solved. The same model scores 55.36% on fine-grained planning and 51.60% on task-progress assessment. ERIQ therefore does something useful even for the paper’s own system: it exposes where a strong average conceals weaker operational capabilities.

This is one reason a multidimensional benchmark is more valuable than a single success rate. A model can be excellent at recognizing a person’s intention while remaining unreliable at determining whether a long task is actually finished. In a collaborative demonstration, both may look impressive. In an unattended workflow, the second weakness is the one that keeps the incident-response team employed.
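A per-subtask view can be sketched as follows. The four scores are the ones quoted above; the subtask keys and the 60-point floor are assumptions chosen for illustration, and these four numbers are only a subset of the 15 subtasks, so their mean is not the reported 82.72% average.

```python
# Subtask scores quoted in this article (percent); the key names are illustrative.
reported = {
    "action_understanding": 96.67,
    "human_intention": 96.44,
    "fine_grained_planning": 55.36,
    "task_progress_assessment": 51.60,
}


def below_floor(scores: dict, floor: float) -> list:
    """Names of subtasks scoring under the floor, in alphabetical order."""
    return sorted(name for name, score in scores.items() if score < floor)


def spread(scores: dict) -> float:
    """Gap between the strongest and weakest subtask, in percentage points."""
    return round(max(scores.values()) - min(scores.values()), 2)


print(below_floor(reported, floor=60.0))  # hypothetical floor
print(spread(reported))
```

With a floor of 60, the check flags fine-grained planning and task-progress assessment, and the spread between the best and worst of these four subtasks is 45.07 points. Neither fact is visible in a single average.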

The model also performs better than its Qwen2.5-VL-3B base on the evaluated open-source spatial benchmarks, suggesting that embodied training did not simply improve ERIQ by destroying general visual-language capability. It does not dominate every larger external model on every task, nor would such a comparison establish that it should. The relevant evidence is narrower: targeted embodied co-training substantially improves the model’s reasoning on the intended diagnostic dimensions while preserving broader capabilities.

The paper further argues that higher ERIQ performance is positively associated with stronger end-to-end generalization. Its training-recipe experiments support that interpretation, but the relationship should be stated carefully. The experiments compare a small set of related model variants produced within one training pipeline. They show that reasoning strength and later execution performance move together under those conditions. They do not establish that increasing an arbitrary robot’s ERIQ score will causally raise its task-success rate by a predictable amount.

ERIQ is best understood as a promising leading indicator, not yet a universal conversion formula from benchmark points to completed warehouse picks.

FACT Lets the VLM Predict Tokens Without Making the Robot Move Like One

Measuring reasoning separately solves the diagnostic problem. It does not solve the robot-control problem.

To reconnect reasoning with physical action, the paper introduces FACT, a Flow-matching Action Tokenizer. FACT converts continuous robot-action trajectories into compact discrete codes that the VLM can predict autoregressively. It then uses a flow-matching decoder to reconstruct precise continuous movements from those codes.

The pipeline is conceptually simple:

Observation and instruction → VLM reasoning → discrete action codes → FACT flow decoder → continuous robot trajectory

FACT’s encoder compresses an action sequence across time and action dimensions. A lookup-free, bitwise quantizer then maps the compressed representation into discrete codes. These codes are stable, fixed-format targets for the autoregressive VLM.
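One common way to build a lookup-free, bitwise quantizer is to keep only the sign of each latent dimension and pack the resulting bits into an integer. The sketch below illustrates that general idea; it is an assumption about the flavor of quantizer, not FACT's actual implementation, and the latent values are made up.

```python
from typing import List


def bitwise_quantize(latent: List[float]) -> List[int]:
    """Quantize each latent dimension to one bit by its sign; no codebook is consulted."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_code(bits: List[int]) -> int:
    """Pack bits into one integer token id, first bit least significant."""
    return sum(bit << i for i, bit in enumerate(bits))


def code_to_latent(code: int, num_bits: int) -> List[float]:
    """Recover the quantized latent (+1.0 or -1.0 per dimension) from a token id."""
    return [1.0 if (code >> i) & 1 else -1.0 for i in range(num_bits)]


# Hypothetical 4-dimensional encoder output for one chunk of a trajectory.
latent = [0.7, -0.2, 0.1, -1.3]
bits = bitwise_quantize(latent)
code = bits_to_code(bits)
print(bits, code, code_to_latent(code, num_bits=len(latent)))
```

Here the bits are [1, 0, 1, 0] and the token id is 5. With n bits per token the vocabulary has 2^n entries, yet nothing has to be searched or stored, and every token has the same fixed format. The quantized latent is deliberately coarse; the fine detail is left to the decoder.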

Precision is recovered later. The decoder begins with noise and learns a velocity field that transports the noisy sample toward the target action trajectory. During inference, it integrates that learned flow to generate continuous control signals conditioned on the predicted discrete codes.
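The mechanics can be sketched without a neural network. The example assumes the straight-line path commonly used in flow matching, where the point at time t is (1 - t) * noise + t * target and the regression target is the constant velocity target - noise. The `oracle_velocity` function is a stand-in for the learned decoder: it cheats by knowing the target, whereas the real decoder would be a trained network conditioned on the predicted discrete codes. All names and values are hypothetical.

```python
import random
from typing import List, Tuple


def training_pair(noise: List[float], target: List[float], t: float) -> Tuple[List[float], List[float]]:
    """Point on the straight noise-to-target path at time t, and the velocity to regress."""
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, target)]
    velocity = [a - n for n, a in zip(noise, target)]
    return x_t, velocity


def oracle_velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Stand-in for the learned network: exact velocity toward one known target (t < 1)."""
    return [(a - xi) / (1 - t) for xi, a in zip(x, target)]


def integrate(noise: List[float], target: List[float], steps: int = 10) -> List[float]:
    """Euler integration of the velocity field from t=0 to t=1."""
    x, dt = list(noise), 1.0 / steps
    for k in range(steps):
        v = oracle_velocity(x, k * dt, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.10, -0.25, 0.40]               # hypothetical action values
noise = [rng.gauss(0.0, 1.0) for _ in target]
decoded = integrate(noise, target)
print(decoded)
```

Because the oracle field is exact, the integration lands on the target up to floating-point error. A learned field is only approximate, so the number of integration steps becomes a real knob trading latency against fidelity.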

The important design choice is where the burden of precision sits.

Earlier discrete systems effectively ask the token representation itself to preserve every detail of the motion. FACT asks the discrete codes to preserve enough structured information for a generative decoder to reconstruct the details. The VLM can remain in a discrete semantic space, while the decoder handles continuous physical fidelity.

This does not mean the entire system uses one identical loss. The FACT decoder still requires its own flow-matching training objective. The unification occurs at the VLM backbone: reasoning responses and action codes can both be learned as discrete autoregressive targets, avoiding the need to inject a competing continuous-action head directly into the reasoning model.

The paper’s tokenizer ablation compares FACT with FAST+ across code lengths and vocabulary settings. At equivalent compressed lengths, FACT produces substantially lower reconstruction error, often by roughly an order of magnitude. A code length of 20 is selected as the preferred balance between reconstruction fidelity and prediction difficulty.

This experiment supports a specific claim: FACT is a more accurate action-reconstruction mechanism under the tested compression settings.

It does not, by itself, prove that the resulting robot will select the correct object, recover from errors, or safely operate in an unfamiliar workplace. Reconstruction quality is necessary for precise execution. It is not a substitute for reasoning or system-level evaluation.

Language Following and Task Completion Reveal Different Failures

The paper’s most informative experiment is not the highest benchmark score. It is the training-recipe ablation that measures language following separately from task success.

Language following evaluates whether the robot reaches the vicinity of the intended target. It is primarily a test of semantic grounding: did the system understand what object or location the instruction referred to?

Task success requires full completion, including a successful grasp. It therefore combines semantic correctness with physical execution.

The distinction creates a more revealing evaluation matrix.

Outcome	Likely interpretation
Poor language following and poor task success	The system does not reliably understand the instruction
Strong language following but poor task success	The system understands the target but cannot execute precisely
Strong task mechanics after approaching the wrong object	The controller is capable, but semantic grounding is weak
Strong language following and strong task success	Reasoning and execution are sufficiently aligned for the tested task
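The matrix translates directly into a per-trial diagnosis. The sketch assumes each trial records two hypothetical booleans: whether the robot reached the vicinity of the intended target, and whether it completed a clean grasp on whatever it approached. The paper reports aggregate rates rather than this exact schema.

```python
def diagnose(reached_intended_target: bool, clean_grasp: bool) -> str:
    """Map one trial onto the four interpretations in the evaluation matrix."""
    if reached_intended_target and clean_grasp:
        return "reasoning and execution aligned"
    if reached_intended_target:
        return "understood the target, executed imprecisely"
    if clean_grasp:
        return "capable controller, weak semantic grounding"
    return "instruction not reliably understood"


# The two robots from the opening example.
print(diagnose(reached_intended_target=True, clean_grasp=False))   # knocked over the blue cup
print(diagnose(reached_intended_target=False, clean_grasp=True))   # picked up the red cup
```

Both trials would be logged as failures under a single success flag. Recording the two observations separately is what lets the failure be routed to the right team.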

The ablation compares different combinations of general VQA data, embodied VQA data, action-alignment data, and post-training mixtures.

Embodied VQA pre-training alone raises ERIQ from 58.64% to 82.72% and improves language following, but end-to-end task success remains negligible without action alignment. The model understands more, yet still cannot reliably act.

Action training produces a substantial improvement in physical execution. A variant trained with action alignment but without the same embodied-reasoning foundation can complete more tasks than the baseline, but its semantic capabilities remain weaker.

The strongest configuration combines embodied VQA and action data during pre-training and retains both during post-training. Compared with action-only post-training, continuing to include embodied VQA improves several end-to-end results. For direct target instructions, success rises from 0.18 to 0.25. For spatial instructions, it rises from 0.05 to 0.35. For color instructions, it rises from 0.14 to 0.22.

The language-following results are not uniformly higher in every category; spatial language following falls from 0.68 to 0.54 between the two final variants, even as spatial task success rises sharply. This is precisely why the paper’s separated metrics matter. A single aggregate story about “better alignment” would hide the fact that semantic proximity and successful completion can move differently.
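The arithmetic behind that divergence is short enough to check directly. The figures are the ones quoted above for the two final variants; the dictionary keys and helper names are illustrative.

```python
# (action-only post-training, post-training that keeps embodied VQA), as quoted above.
task_success = {"direct": (0.18, 0.25), "spatial": (0.05, 0.35), "color": (0.14, 0.22)}
spatial_language_following = (0.68, 0.54)


def change(pair: tuple) -> float:
    """Absolute change from the first variant to the second, rounded to two decimals."""
    before, after = pair
    return round(after - before, 2)


def moves_in_opposite_directions(a: float, b: float) -> bool:
    """True when one metric rises while the other falls."""
    return a * b < 0


gains = {name: change(pair) for name, pair in task_success.items()}
lf_change = change(spatial_language_following)
print(gains, lf_change, moves_in_opposite_directions(lf_change, gains["spatial"]))
```

Task success rises by 0.07, 0.30, and 0.08 across the three instruction types, while spatial language following changes by -0.14. For spatial instructions the two metrics move in opposite directions, which a combined score would average away.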

The broader lesson is not merely that more mixed data is useful. It is that reasoning capabilities can be weakened during action-focused post-training unless the training distribution continues to exercise them. Reasoning is not a decorative pre-training phase that can be safely abandoned once the robot starts moving.

Apparently, even machines forget the theory once management tells them to focus exclusively on execution.

The Real-World Comparison Shows the Trade-Off in Physical Form

The paper evaluates GenieReasoner on real robots across five increasingly difficult conditions:

objects seen during training;
unseen objects;
color variations;
spatial or pose variations;
semantic instructions that require interpreting a function rather than naming an object directly.

For example, an instruction such as “pick up something to clean the table” requires the model to infer which available object satisfies the intended purpose.

The real-world results compare GenieReasoner with both continuous-action and discrete-action baselines. The reported pattern closely matches the paper’s central diagnosis.

The discrete baseline demonstrates relatively strong instruction adherence. It often identifies and approaches the correct target. Its task-completion rate then drops because quantization artifacts reduce the precision needed for grasping.

The continuous baselines display stronger manipulation once they have selected the correct target. Their more frequent failure is semantic: in harder unseen-object and color-variation settings, they may approach the wrong object.

GenieReasoner is designed to occupy the missing quadrant. Its action tokens remain connected to the VLM’s semantic representations, while FACT reconstructs continuous trajectories for physical execution. Across the paper’s real-world comparisons, it achieves the highest aggregate result and combines stronger instruction following with competitive task completion.

The paper also presents qualitative demonstrations on AgiBot G1 and ARX AC-One platforms, including out-of-distribution object retrieval, shelf restoration, and deformable-object manipulation such as garment folding. These demonstrations extend the scope of the system and illustrate cross-platform use.

They should still be interpreted as qualitative extensions. Demonstrations show that the system can perform selected complex tasks. They do not establish failure rates, recovery reliability, deployment costs, or unattended operational safety across those broader task categories.

Each Experiment Answers a Different Question

The paper contains several forms of evidence. Treating all of them as interchangeable would recreate the same diagnostic problem that ERIQ is meant to solve.

Evidence	Likely purpose	What it supports	What it does not establish
ERIQ comparison across models	Main reasoning evidence and comparison with prior models	Embodied co-training improves the tested reasoning dimensions; ERIQ differentiates model capabilities	That a high ERIQ score guarantees physical task completion
FACT reconstruction comparison	Component ablation and representation test	FACT reconstructs continuous actions more accurately than FAST+ at comparable code lengths	That lower reconstruction error alone produces better semantic decisions
Training-recipe ablation	Mechanism and data-mixture ablation	Embodied VQA, action alignment, and continued mixed post-training play distinct roles	A universal optimal data mixture for every robot or task
Real-world language-following evaluation	Main semantic-grounding evidence	The unified action representation reduces target-selection errors in the tested settings	General semantic reliability outside the evaluated scenarios
Real-world task-success evaluation	Main end-to-end evidence	GenieReasoner better combines correct targeting with precise execution	Production-level reliability, safety, latency, or economics
Cross-platform and complex-task demonstrations	Exploratory qualitative extension	The architecture can transfer to multiple embodiments and more varied tasks	Robust performance distributions across those task families

This separation matters because robotics papers often contain impressive demonstrations surrounding much narrower quantitative evidence. A benchmark result, a tokenizer ablation, and a garment-folding video can all be useful. They answer different questions.

The paper is strongest when those questions remain separate and then form a coherent chain:

ERIQ measures whether the model understands embodied situations.
The training ablation shows that stronger embodied reasoning improves semantic grounding but cannot replace action learning.
FACT improves the fidelity of discrete action representations.
Real-world evaluation tests whether the combined system can translate better reasoning into precise behavior.

That chain is more persuasive than any single headline number.

The Business Value Is Cheaper Diagnosis Before Expensive Motion

The most immediate business implication is not that every robotics company should adopt FACT. It is that robotics teams should stop using end-to-end task success as their only meaningful diagnostic measure.

Physical robot evaluation is expensive. It consumes hardware time, engineering supervision, reset labor, compute, and occasionally whatever object the robot was supposed to handle gently. When a policy fails, teams may retrain the entire system without knowing whether the bottleneck was perception, reasoning, semantic grounding, action representation, or low-level control.

An ERIQ-style reasoning gate could move part of that diagnosis earlier in the development pipeline.

Development stage	Diagnostic question	Potential operational benefit
Before action-policy training	Does the model understand objects, relations, task progress, errors, and human intent?	Avoid investing in expensive control training for a semantically unsuitable backbone
During joint training	Are reasoning capabilities being preserved as action performance improves?	Detect capability forgetting before it appears as confusing physical failures
Before physical deployment	Does the model generalize to novel instructions and configurations in reasoning-only tests?	Prioritize the most promising models for limited robot-test capacity
During failure analysis	Did the system choose the wrong target or fail to execute the correct choice?	Route the problem to the appropriate data, model, or controls team

This is a Cognaptus inference from the paper’s results, not a directly measured return-on-investment claim. The paper does not report development costs saved, testing cycles reduced, or deployment incidents prevented.

Still, the pathway is credible. If reasoning-only evaluation can eliminate weak model candidates before action training and physical trials, the savings may come less from cheaper training per run than from avoiding the wrong runs entirely.
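One way such a gate might look during joint training is a regression check between checkpoints. This is a sketch of the idea only: the subtask names, the scores, and the two-point tolerance are hypothetical, and the paper does not describe a gating procedure.

```python
def regressions(reference: dict, candidate: dict, tolerance: float = 2.0) -> dict:
    """Subtasks whose score fell by more than `tolerance` points versus the reference.

    Assumes every subtask in `reference` is also present in `candidate`.
    """
    return {
        name: round(candidate[name] - ref, 2)
        for name, ref in reference.items()
        if ref - candidate[name] > tolerance
    }


def passes_gate(reference: dict, candidate: dict, tolerance: float = 2.0) -> bool:
    """A checkpoint passes when no subtask regressed beyond the tolerance."""
    return not regressions(reference, candidate, tolerance)


# Hypothetical reasoning scores before and after an action-focused training stage.
before_action_training = {"spatial_grounding": 80.0, "task_progress": 52.0, "intent": 96.0}
after_action_training = {"spatial_grounding": 71.5, "task_progress": 51.0, "intent": 95.5}

print(regressions(before_action_training, after_action_training))
print(passes_gate(before_action_training, after_action_training))
```

In this made-up case the check flags an 8.5-point drop in spatial grounding and fails the checkpoint before any robot time is spent on it. The tolerance would need tuning against the benchmark's run-to-run noise.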

FACT suggests a second operational pathway. Firms building VLA systems currently face a choice between architectures that preserve semantic alignment and architectures optimized for control fidelity. A tokenizer that retains discrete compatibility while delegating precision to a continuous decoder may reduce the need for complex safeguards between reasoning and action objectives.

The practical value would be architectural simplification and more interpretable failure isolation. Whether FACT delivers that value in production depends on factors the paper does not measure: decoder latency, integration-step requirements, compute footprint, controller compatibility, retraining burden, and behavior under safety-critical disturbances.

What the Paper Does Not Yet Establish

The paper provides a coherent diagnosis and a technically plausible treatment. Several boundaries remain important.

First, the relationship between ERIQ performance and downstream success is suggestive rather than causal. The benchmark, model, training data, and evaluation pipeline were developed within the same research program. An independently developed model might score well on ERIQ without achieving comparable execution gains, particularly if it learns benchmark-specific patterns.

Second, multiple-choice evaluation improves reproducibility but limits expressiveness. Real failures rarely arrive with four conveniently labeled recovery options. A robot that selects the right answer may still fail to generate, verify, and execute an appropriate recovery sequence autonomously.

Third, the reported real-world evaluation demonstrates relative performance in selected manipulation settings. It does not establish the reliability levels required for deployment in factories, restaurants, healthcare facilities, or homes. Safety behavior, inference latency, failure recovery under repeated disturbances, and human-override requirements remain outside the paper’s main evidence.

Fourth, FACT improves reconstruction accuracy, but a flow-matching decoder introduces its own operational costs. The paper does not provide a production-level analysis of latency, energy use, hardware requirements, or the trade-off between the number of integration steps and control quality.

Finally, the benchmark’s coverage is broad within robotic manipulation, but manipulation is not the whole embodied world. Navigation, mobile coordination, sustained human collaboration, and safety-critical intervention may require different reasoning tests and different forms of action representation.

These limitations do not undermine the paper’s central contribution. They define the next validation layer.

Think First, Then Verify the Grasp

The robot that selects the correct cup and drops it is not making the same mistake as the robot that smoothly grasps the wrong cup.

This paper’s most useful contribution is making that distinction measurable.

ERIQ isolates the reasoning capabilities that end-to-end task success tends to obscure. Its results show that embodied VQA training can substantially improve spatial grounding, planning, error analysis, and intent understanding before the model becomes capable of reliable physical execution.

FACT addresses the complementary problem. It allows the VLM to predict stable discrete action codes while using a flow-matching decoder to recover continuous precision. The real-world comparisons then demonstrate why both pieces are necessary: semantic competence without precise execution is ineffective, while precise execution without semantic competence is merely an efficient way to perform the wrong task.

For robotics developers, the strategic lesson is straightforward. Evaluate whether the system understands the task before spending heavily on teaching it to move. During action training, keep checking whether that understanding survives. And when a physical trial fails, resist the temptation to call the entire model “bad” until someone has identified whether it thought incorrectly or grasped badly.

Robots need better hands. They also need examinations that reveal what, exactly, is happening between their cameras and those hands.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yi Liu et al., “Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training,” arXiv:2512.24125. https://arxiv.org/abs/2512.24125
