When Models Guess the Verb by Looking at the Drawer

Drawer.

That is the easy part. A model sees a drawer, and it knows that drawers are often opened. Then it watches a video where someone is closing the drawer and predicts opening anyway.

This is not the kind of error that makes a demo look silly for five seconds and then disappear into the benchmark appendix. It is the kind of error that reveals what the system is really using as evidence. The model is not necessarily watching the motion. It may be recognizing the object, remembering the most common verb attached to that object during training, and calling that “video understanding.” Very efficient. Also wrong.

The paper behind this article, Why Can’t I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition, studies exactly this failure mode in zero-shot compositional action recognition, or ZS-CAR.¹ The task sounds technical, but the underlying problem is familiar to anyone deploying automation systems: can the model recognize a new combination of known parts, or does it collapse back to the combinations it saw most often?

The authors’ answer is uncomfortable but useful. Strong video-language backbones do not automatically solve the problem. Even when the verb and object are both known, a model can still fail on a new verb-object pairing because it has learned the object as a shortcut for the verb. “Drawer” becomes evidence for “opening.” “Paper” becomes evidence for “folding” or “tearing.” “Knife” becomes evidence for “take.” The object becomes a lazy manager assigning verbs by historical habit.

The paper’s contribution is therefore not just another accuracy gain. It offers a diagnostic frame for asking whether a video model is learning actions from temporal evidence or merely laundering co-occurrence bias through a more impressive backbone.

The drawer mistake is a compact version of the whole problem

The appendix gives the cleanest entry point. In the Sth-com failure-case table, the baseline model misclassifies unseen (Closing, Drawer) samples as (Opening, Drawer) in a large share of those errors. The object-side training statistic is revealing: Opening is the verb most frequently paired with Drawer in the training data. So when the model encounters the unfamiliar composition Closing + Drawer, it drifts toward the familiar pair.

That is the paper in miniature.

The model has seen both primitives before. It knows the verb class Closing. It knows the object class Drawer. What it struggles with is the composition: the pairing of a known verb with a known object in a combination that was not observed during training.

This is the central difference between ordinary recognition and compositional generalization. In ordinary classification, the model asks, “Which label have I seen before?” In ZS-CAR, it must ask, “Can I recombine known primitives into a pair I did not see during training?” That second question is much closer to real operational use. Businesses rarely deploy models into a world where all possible process variants have been politely pre-enumerated for the training set.

The drawer error also clarifies why this is not just a data-volume problem. More data helps only if it breaks the shortcut. If the extra data reinforces the same co-occurrence patterns, the model becomes more confident in the wrong evidence. It learns the office routine, not the actual action.

Why objects dominate verbs in video models

The authors argue that object-driven shortcuts arise from two pressures that reinforce each other.

First, verb-object supervision is sparse and skewed. A dataset cannot cover every plausible verb-object combination. Some combinations appear frequently; many never appear. If Opening + Drawer appears often and Closing + Drawer appears rarely or not at all, the training distribution quietly teaches the model that the drawer predicts the verb.
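The skew described here can be read straight off the training labels. Below is a minimal sketch, assuming the labels are available as (verb, object) pairs; the function names and the toy counts are made up for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


def verb_prior_per_object(train_pairs):
    """Estimate P(verb | object) from (verb, object) training labels."""
    counts = defaultdict(Counter)
    for verb, obj in train_pairs:
        counts[obj][verb] += 1
    prior = {}
    for obj, verb_counts in counts.items():
        total = sum(verb_counts.values())
        prior[obj] = {verb: n / total for verb, n in verb_counts.items()}
    return prior


def dominant_verb(prior, obj):
    """The verb a pure object-to-verb shortcut would guess for this object."""
    return max(prior[obj], key=prior[obj].get)


# Toy, invented label counts: "closing drawer" never appears in training.
toy_train = (
    [("opening", "drawer")] * 8
    + [("pushing", "drawer")] * 2
    + [("closing", "door")] * 5
)
toy_prior = verb_prior_per_object(toy_train)
print(dominant_verb(toy_prior, "drawer"))
```

If one verb carries most of the probability mass for an object, that object is a candidate shortcut, and unseen pairings involving it are worth testing first.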

Second, objects are easier to learn than verbs. An object is often visible in a single frame. A verb may require temporal reasoning across several frames. Opening and closing can be visually similar if the model is weak at representing direction, sequence, or state change. The shortcut is obvious: use the easier object cue, then infer the likely verb from training co-occurrence.

The paper tests this asymmetry directly. In a controlled subset of Sth-com, the authors train a randomly initialized ViT on a balanced set designed to remove compositional sparsity. The model learns object prediction faster and reaches higher object accuracy than verb accuracy. That experiment is not the main benchmark result; its purpose is diagnostic. It isolates learning difficulty and shows that even without the usual sparsity problem, objects are easier targets.

Then the authors build a bias-controlled split where each verb is strongly associated with a particular object. On bias-conflict unseen compositions, the model can still achieve high object accuracy while verb accuracy drops below uniform chance. The important interpretation is not “CLIP is bad.” The paper also reports the phenomenon with random initialization. The shortcut is structural: if the training setup rewards object-to-verb inference, the model will take the cheap path. Models, like interns, learn what the evaluation system actually rewards.

FSP and FCP turn a vague suspicion into a measurable failure

A useful part of the paper is that it does not stop at saying “shortcut learning exists.” It proposes two diagnostic metrics for the specific failure.

False Seen Prediction (FSP) measures how often a misclassified unseen composition is predicted as a seen training composition. If unseen cases keep collapsing into seen pairs, the model is not navigating the full compositional space well.

False Co-occurrence Prediction (FCP) narrows this further by asking whether those wrong predictions fall into frequent training compositions. This matters because not every false-seen prediction is equally suspicious. A collapse into a frequent pair is stronger evidence that the model is leaning on training co-occurrence priors.
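As a rough sketch of how the two metrics could be computed from prediction logs: the version below assumes both are reported as fractions of misclassified unseen samples, and that the caller supplies the set of "frequent" training pairs. How the paper defines "frequent" and which denominator it uses are not spelled out in this article, so treat both as assumptions.

```python
def fsp_fcp(records, seen_pairs, frequent_pairs):
    """Compute (FSP, FCP) over unseen-composition test samples.

    records: iterable of (true_pair, pred_pair), unseen test samples only.
    seen_pairs: set of compositions present in training.
    frequent_pairs: subset of seen_pairs the caller considers frequent
        (the selection rule is an assumption left to the caller).
    """
    errors = [(true, pred) for true, pred in records if true != pred]
    if not errors:
        return 0.0, 0.0
    false_seen = sum(1 for _, pred in errors if pred in seen_pairs)
    false_freq = sum(1 for _, pred in errors if pred in frequent_pairs)
    return false_seen / len(errors), false_freq / len(errors)


# Toy, invented predictions on unseen compositions.
seen = {("opening", "drawer"), ("pushing", "drawer"), ("closing", "door")}
frequent = {("opening", "drawer")}
toy_records = [
    (("closing", "drawer"), ("opening", "drawer")),  # collapses to frequent seen pair
    (("closing", "drawer"), ("closing", "drawer")),  # correct
    (("closing", "drawer"), ("pushing", "drawer")),  # collapses to rare seen pair
    (("opening", "door"), ("pushing", "door")),      # wrong, but not a seen pair
]
fsp, fcp = fsp_fcp(toy_records, seen, frequent)
print(f"FSP={fsp:.2f} FCP={fcp:.2f}")
```

Tracking both numbers across training checkpoints, rather than only at the end, is what lets a rising shortcut show up as a trend.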

The authors also break FSP and FCP into component-level patterns:

Error pattern	What happens	Why it matters
Verb-collapse	Object is correct, verb is wrong	Strong evidence of object-driven verb shortcut
Object-collapse	Verb is correct, object is wrong	Suggests object recognition or object-conditioned confusion
Dual-collapse	Both verb and object are wrong	More general failure, less diagnostic of the specific shortcut

This breakdown is valuable because it prevents a common mistake in model evaluation: treating all wrong answers as the same kind of wrong. In this paper, the key problem is not merely that the model misclassifies unseen pairs. It is that many errors keep the object and replace the verb with a training-frequent counterpart. That is a very specific pathology.
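The three error patterns in the table reduce to a couple of comparisons. Here is a minimal sketch, assuming predictions are stored as (verb, object) pairs; restricting the tally to errors that land on seen pairs mirrors the FSP framing, but the function names are hypothetical.

```python
from collections import Counter


def collapse_pattern(true_pair, pred_pair):
    """Classify one prediction using the table's three error patterns."""
    (true_verb, true_obj), (pred_verb, pred_obj) = true_pair, pred_pair
    if true_verb == pred_verb and true_obj == pred_obj:
        return "correct"
    if true_obj == pred_obj:
        return "verb-collapse"    # object right, verb wrong
    if true_verb == pred_verb:
        return "object-collapse"  # verb right, object wrong
    return "dual-collapse"        # both wrong


def collapse_breakdown(records, seen_pairs):
    """Count error patterns among misclassified samples predicted as seen pairs."""
    tally = Counter()
    for true_pair, pred_pair in records:
        if true_pair != pred_pair and pred_pair in seen_pairs:
            tally[collapse_pattern(true_pair, pred_pair)] += 1
    return tally
```

A tally dominated by verb-collapse is the signature the article describes: the object survives and the verb is swapped for a familiar one.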

The learning curves support the diagnosis. With a CLIP backbone, the C2C baseline shows a widening gap between seen and unseen accuracy as FSP/FCP rise during training. With InternVideo2, a stronger video-pretrained backbone, the behavior is milder but still present. The paper’s message is therefore not “use a better backbone.” It is “a better backbone still needs the right pressure to learn temporal evidence.”

That distinction matters in business settings. Upgrading the foundation model may improve averages while leaving the damaging failure mode intact. Average accuracy is pleasant. Failure-mode diagnosis is operationally useful.

The compositional gap asks whether the model is doing more than primitive recognition

The paper also introduces a Compositional Gap diagnostic. The idea is to compare joint composition accuracy against what would be expected from recognizing the verb and object independently. The exact metric is used as a diagnostic rather than a standalone leaderboard score.

That positioning is sensible. A compositional gap is not magic proof of “reasoning.” It is a way to ask whether the model’s joint prediction benefits from modeling the verb-object relation, instead of merely stitching together separately recognized components.
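One simple way to make the comparison concrete is sketched below. It assumes the independent reference is the product of verb accuracy and object accuracy; the paper's exact formula is not given in this article, so this is an illustrative stand-in, not the authors' definition.

```python
def compositional_gap(records):
    """Joint composition accuracy minus an independence reference.

    records: list of (true_pair, pred_pair), each pair being (verb, object).
    Assumption: the reference is verb_accuracy * object_accuracy.
    """
    n = len(records)
    verb_acc = sum(t[0] == p[0] for t, p in records) / n
    obj_acc = sum(t[1] == p[1] for t, p in records) / n
    comp_acc = sum(t == p for t, p in records) / n
    return comp_acc - verb_acc * obj_acc


# Toy, invented predictions: verb accuracy 3/4, object accuracy 2/4, joint 2/4.
toy_records = [
    (("closing", "drawer"), ("closing", "drawer")),
    (("opening", "door"), ("opening", "door")),
    (("closing", "door"), ("closing", "drawer")),
    (("pushing", "box"), ("pulling", "cup")),
]
print(compositional_gap(toy_records))
```

Under this reading, a positive value means verb and object errors are not independent and the joint prediction is doing better than the two marginals would suggest; a negative value means it is doing worse.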

On Sth-com, the baseline C2C model has a positive gap on seen compositions but a negative gap on unseen compositions, even with InternVideo2. That is the interesting asymmetry. The model can benefit from composition modeling where the compositions are familiar, but the benefit degrades when the composition is new. In plain language: the model knows the old routines better than the underlying action logic.

RCORE later improves this picture. On unseen compositions, it pushes the compositional gap positive in the main results for Sth-com and EK100-com. That does not mean the model has achieved human-like action understanding. It means the joint composition prediction is no longer falling below the independent reference in the same way. A modest result, but a meaningful one. Not every paper needs to declare a revolution before breakfast.

RCORE attacks the shortcut from two sides

The proposed method is called RCORE, short for Robust COmpositional REpresentations. It has two main components: CPR and TORC.

The names are a little acronym-heavy, as research papers are legally required to be. The mechanisms are clearer than the naming.

CPR expands supervision without drowning the model in impossible negatives

Co-occurrence Prior Regularization (CPR) addresses the sparse and skewed composition problem.

The authors synthesize new verb-object training examples by mixing static object cues from another video into high-motion regions of the original video. The goal is to create plausible new compositions while preserving the verb signal. The model then receives soft supervision for these synthesized compositions.

But there is a training problem. Conventional ZS-CAR pipelines often optimize only over seen compositions, which makes it hard to supervise unseen combinations directly. The opposite extreme, optimizing over the full verb-object space, can be harmful because most unseen compositions are then treated as negatives during training.

The authors’ solution is batch-adaptive label-space expansion. For each mini-batch, the label space expands only to include the synthesized compositions relevant to that batch. This injects new compositional supervision without forcing the model to classify against the entire possible universe each time.
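The idea of a per-batch label space can be sketched in a few lines. This is a simplified illustration, assuming each composition has a scalar score and the synthesized sample carries a soft target; the function names and the plain softmax cross-entropy are assumptions, not the paper's implementation.

```python
import math


def expanded_label_space(seen_pairs, batch_synth_pairs):
    """Seen compositions plus only the synthesized pairs present in this batch."""
    labels = list(seen_pairs)
    for pair in batch_synth_pairs:
        if pair not in labels:
            labels.append(pair)
    return labels


def soft_cross_entropy(scores, labels, soft_target):
    """Cross-entropy with the softmax normalized over `labels` only.

    scores: {pair: logit}; soft_target: {pair: weight}, weights summing to 1.
    Pairs outside `labels` never enter the denominator, so they are not
    pushed down as negatives for this batch.
    """
    top = max(scores[label] for label in labels)
    log_z = top + math.log(sum(math.exp(scores[label] - top) for label in labels))
    return -sum(w * (scores[pair] - log_z) for pair, w in soft_target.items())
```

The contrast with full open-world training is in the denominator: here it grows by a handful of pairs per batch instead of covering every verb-object combination at every step.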

CPR also uses frequent hard negatives. If a synthesized target is likely to be confused with a frequent seen pair, the loss penalizes that collapse. The penalty is not applied to every hard negative; the ablation shows why. Penalizing all hard negatives hurts object prediction and overall composition performance. Penalizing frequent hard negatives gives a better trade-off. In operational language: do not punish every nearby alternative; punish the historically over-dominant confounders.
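To make the "frequent hard negative" distinction tangible, here is a hedged sketch. It assumes hard negatives are the highest-scoring wrong labels and that "frequent" means a training count above a threshold; `top_k`, `min_count`, and the hinge-style penalty are all illustrative choices, not the loss used in the paper.

```python
def frequent_hard_negatives(scores, target, train_counts, top_k=3, min_count=50):
    """Top-scoring wrong labels, filtered to those frequent in training."""
    wrong = [label for label in scores if label != target]
    hardest = sorted(wrong, key=scores.get, reverse=True)[:top_k]
    return [label for label in hardest if train_counts.get(label, 0) >= min_count]


def margin_penalty(scores, target, negatives, margin=0.0):
    """Hinge-style penalty when a selected negative outscores the target."""
    return sum(max(0.0, scores[neg] - scores[target] + margin) for neg in negatives)


# Toy, invented scores for a synthesized "closing drawer" sample.
target = ("closing", "drawer")
toy_scores = {
    ("opening", "drawer"): 2.0,
    ("pushing", "drawer"): 1.5,
    target: 1.0,
    ("closing", "door"): 0.1,
}
toy_counts = {("opening", "drawer"): 500, ("pushing", "drawer"): 3}
selected = frequent_hard_negatives(toy_scores, target, toy_counts, top_k=2)
print(selected, margin_penalty(toy_scores, target, selected, margin=0.5))
```

In this toy case only the over-represented pair is penalized; the rare neighbor is left alone, which is the trade-off the ablation points to.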

TORC forces the verb representation to care about temporal order

Temporal Order Regularization for Composition (TORC) addresses the second root cause: weak temporal verb learning.

If a model treats a forward sequence and a reversed sequence as nearly the same, it is unlikely to distinguish actions like opening and closing reliably. The paper reports that the baseline maintains high cosine similarity between original and reversed verb features during training, around +0.92 in the appendix figure. That is not a comforting number if your use case depends on direction, sequence, or state transition.

TORC applies two pressures. First, it separates original and reversed temporal features, discouraging the model from treating opposite temporal semantics as equivalent. Second, it encourages high-entropy verb predictions when temporal order is shuffled, preventing the model from confidently inferring verbs from static cues alone.

The logic is simple: when temporal structure is destroyed, the model should become less certain about the verb. If it remains confident, it is probably not using the temporal structure in the first place.
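The two pressures can be written as two small penalty terms. The sketch below assumes verb features are plain vectors and the verb prediction is a probability list; the clamped cosine term, the entropy shortfall, and the equal weighting are simplifications chosen for illustration rather than the paper's exact losses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def torc_style_loss(feat_orig, feat_rev, probs_shuffled):
    """Penalize (1) similar original/reversed features, (2) confident shuffled predictions."""
    separation = max(0.0, cosine(feat_orig, feat_rev))
    shortfall = math.log(len(probs_shuffled)) - entropy(probs_shuffled)
    return separation + shortfall
```

The loss is zero when reversed features are orthogonal (or opposed) to the originals and the prediction on shuffled frames is uniform, and it grows as the model ignores order or stays confident without it.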

This is one of the paper’s most transferable ideas. Even outside this exact architecture, a deployment team can ask a practical diagnostic question: does the model’s verb prediction degrade when temporal order is disrupted? If not, perhaps the “video model” is mostly an image model with better manners.
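That question can be turned into a small audit harness. The sketch below assumes a hypothetical `predict_verb_probs(frames)` callable that wraps whatever model is deployed and returns a list of verb probabilities; the two reported statistics are illustrative choices, and the reversal check is only meaningful for clips of direction-dependent verbs.

```python
import math
import random


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def temporal_audit(predict_verb_probs, clips, seed=0):
    """Compare verb predictions on original, reversed, and shuffled frames."""
    rng = random.Random(seed)
    flips = 0
    entropy_gain = 0.0
    for clip in clips:
        frames = list(clip)
        shuffled = frames[:]
        rng.shuffle(shuffled)
        p_orig = predict_verb_probs(frames)
        p_rev = predict_verb_probs(frames[::-1])
        p_shuf = predict_verb_probs(shuffled)
        flips += argmax(p_orig) != argmax(p_rev)          # did reversal change the verb?
        entropy_gain += entropy(p_shuf) - entropy(p_orig)  # did shuffling add uncertainty?
    n = len(clips)
    return {"reverse_flip_rate": flips / n, "mean_entropy_gain": entropy_gain / n}
```

A model that returns the same distribution regardless of frame order scores zero on both statistics, which is the "image model with better manners" case.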

The main results show a trade-off, not a miracle

The paper evaluates RCORE on two datasets: Sth-com and the newly curated EK100-com. Sth-com comes from Something-Something-style action data. EK100-com is repurposed from EPIC-KITCHENS-100 and is sparser: the paper reports a label coverage ratio of 7.5%, lower than Sth-com’s 12.8%. That makes EK100-com a useful, harder test for co-occurrence bias.

The authors use an open-world evaluation protocol as the default. This matters. In closed-world evaluation, inference can be restricted to compositions known to appear in validation or test sets. That is convenient, but unrealistic: real systems do not receive a polite mask telling them which combinations are allowed today. The open-world protocol scores across the full verb-object space and avoids test-set-tuned bias calibration as the main setting.
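The difference between the two protocols comes down to which labels are allowed to win the argmax. A minimal sketch, assuming the model exposes a score per verb-object pair; the function names and toy scores are invented for illustration.

```python
def predict_open_world(scores):
    """Pick the best-scoring pair over the full verb-object space."""
    return max(scores, key=scores.get)


def predict_closed_world(scores, allowed_pairs):
    """Pick the best-scoring pair among a pre-approved subset only."""
    return max((pair for pair in scores if pair in allowed_pairs), key=scores.get)


# Toy, invented scores for a clip whose true label is ("closing", "drawer").
toy_scores = {
    ("opening", "drawer"): 1.5,
    ("closing", "drawer"): 1.2,
    ("closing", "door"): 0.3,
}
unseen_test_pairs = {("closing", "drawer"), ("closing", "door")}
print(predict_open_world(toy_scores))
print(predict_closed_world(toy_scores, unseen_test_pairs))
```

In this toy case the mask removes the seen pair that the model actually prefers, so the closed-world prediction looks correct while the open-world one exposes the collapse.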

The key results are directionally consistent:

Dataset / backbone	Baseline unseen composition accuracy	RCORE unseen composition accuracy	Interpretation
Sth-com / CLIP	30.08	33.90	Better unseen generalization, with lower seen accuracy
Sth-com / InternVideo2	39.53	43.98	Gain persists with video-pretrained backbone
EK100-com / CLIP	21.56	28.41	Larger gain under severe sparsity
EK100-com / InternVideo2	30.33	38.07	Strong unseen improvement, again with seen-accuracy trade-off

The seen-accuracy trade-off should not be hidden. On both datasets, RCORE often reduces seen composition accuracy while improving unseen composition accuracy and harmonic mean. This is not a free lunch; it is a redistribution of model competence away from memorized frequent pairs toward novel compositions.
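For readers who want the gains as numbers, the unseen accuracies in the table above can be differenced directly. The `harmonic_mean` helper assumes the usual definition of a harmonic mean over seen and unseen accuracy; this article does not list the seen accuracies, so no harmonic means are computed here.

```python
# Unseen composition accuracy copied from the table: (baseline, RCORE).
unseen_acc = {
    ("Sth-com", "CLIP"): (30.08, 33.90),
    ("Sth-com", "InternVideo2"): (39.53, 43.98),
    ("EK100-com", "CLIP"): (21.56, 28.41),
    ("EK100-com", "InternVideo2"): (30.33, 38.07),
}

# Absolute gain in percentage points for each dataset/backbone setting.
gains = {key: round(rcore - base, 2) for key, (base, rcore) in unseen_acc.items()}


def harmonic_mean(seen, unseen):
    """Harmonic mean of seen and unseen accuracy (standard definition)."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)


for (dataset, backbone), gain in gains.items():
    print(f"{dataset} / {backbone}: +{gain:.2f} points")
```

The gains come out to roughly 3.8 and 4.5 points on Sth-com and roughly 6.9 and 7.7 points on EK100-com, which matches the pattern of larger improvements on the sparser dataset.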

For many business applications, that trade-off may be desirable. If the system only needs to classify a fixed, stable set of known actions, memorizing frequent pairs is not a disaster. It may even be efficient. But if the system is deployed into changing conditions (shifting workflows, new product handling patterns, warehouse exceptions, kitchen tasks, robotic demonstrations, or customer behaviors), unseen compositions are not edge cases. They are Tuesday.

The paper’s evidence is strongest when the diagnostic and performance results line up. RCORE reduces FSP/FCP growth during training. It improves unseen composition accuracy. It produces better verb behavior under temporal perturbation. It improves the compositional gap on unseen pairs. These are different measurements pointing toward the same mechanism.

The ablations explain why the method is designed this way

The ablation studies are not decorative. They answer several “why not just…” questions that a practitioner would reasonably ask.

Test	Likely purpose	What it supports	What it does not prove
CPR only vs. TORC only vs. both	Ablation	CPR and TORC target complementary causes: co-occurrence bias and temporal weakness	That this exact combination is optimal for all video tasks
Full open-world label space vs. batch-adaptive expansion	Sensitivity / design validation	Naively training over the full space can damage unseen accuracy; selective expansion is safer	That batch-adaptive expansion is the only possible solution
Frequent hard negatives vs. all hard negatives	Ablation	Penalizing frequent confounders works better than penalizing everything	That frequent-negative selection needs no domain tuning
TORC component losses	Ablation	Both temporal separation and entropy under disruption contribute	That temporal reversal captures every important action relation
Bias calibration with validation tuning	Robustness / deployment realism	RCORE performs well without requiring post-processing bias calibration	That calibration is useless in all applications
InternVideo2-1B extension	Robustness to backbone size	Gains persist with a much larger video-pretrained backbone	That scale and RCORE have identical effects

The label-space ablation is especially important. Full open-world training sounds principled: if all verb-object combinations are possible, train against all of them. But the result is poor because most unseen pairs are treated as negatives throughout training. The model is punished for compositions it might need to recognize later. That is the machine-learning equivalent of telling employees to be innovative while penalizing every idea not already in the handbook.

Batch-adaptive expansion is more careful. It introduces specific synthesized compositions into the training denominator only when they are relevant. That design choice turns out to matter.

The penalty-set ablation carries a similar lesson. If the model collapses into frequent seen pairs, one tempting solution is to penalize all hard negatives. The paper shows that this is too blunt. The frequent hard-negative strategy is more targeted: it suppresses the confounders that reflect training co-occurrence priors while avoiding excessive damage to object learning.

EK100-com makes the business lesson harder to ignore

The paper’s new EK100-com benchmark deserves attention because it is closer to messy operational data than a neatly balanced academic toy.

The dataset is built from EPIC-KITCHENS-100, using egocentric videos with verb-object labels. The final benchmark includes 71,238 samples, 81 verbs, 216 objects in training statistics, and 1,320 total compositions. Its label coverage is only 7.5%, reflecting severe sparsity. In practical terms, many plausible verb-object combinations are absent from the observed training set.
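The 7.5% figure can be sanity-checked from the counts in the paragraph above. The sketch assumes label coverage means the number of labeled compositions divided by the size of the full verb-object grid; that reading is an assumption, though it is consistent with the reported numbers.

```python
def label_coverage(num_compositions, num_verbs, num_objects):
    """Fraction of the verb-object grid that has at least one labeled composition."""
    return num_compositions / (num_verbs * num_objects)


# Counts quoted for EK100-com: 1,320 compositions, 81 verbs, 216 objects.
coverage = label_coverage(1320, 81, 216)
print(f"{100 * coverage:.1f}% of {81 * 216} possible pairs")
```

Put differently, more than nine out of ten cells in the verb-object grid have no labeled example at all under this reading.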

That sparsity is not an academic inconvenience. It resembles enterprise reality.

A warehouse video system may have many objects and many actions, but only a small subset of combinations will be observed during pilot data collection. A retail monitoring system may see common customer-object interactions but miss rare ones. A robot-learning setup may collect demonstrations for standard tasks and later face recombinations that were never explicitly labeled. A healthcare or manufacturing workflow camera may observe frequent routines but fail on unusual pairings that matter precisely because they are unusual.

On EK100-com, RCORE improves unseen composition accuracy substantially over C2C: from 21.56 to 28.41 with CLIP, and from 30.33 to 38.07 with InternVideo2. The top-line accuracy is still not high enough to pretend the problem is solved. But the gain is meaningful because the dataset is sparse enough to expose the failure mode clearly.

The appendix failure cases are also revealing. In EK100-com, the model often confuses opposing or frequent verbs around the same object: put versus take, turn-on versus turn-off, and similar patterns. These are not exotic semantic distinctions. They are basic operational differences. In a business process, “put” and “take” are not interchangeable. Inventory systems tend to be fussy about that sort of thing.

What the paper directly shows, and what business teams should infer

The paper directly shows three things.

First, object-driven shortcuts are measurable in ZS-CAR. FSP and FCP expose how unseen compositions collapse into seen or frequent training pairs, and the breakdown shows that verb-collapse is a major pattern.

Second, stronger backbones reduce but do not eliminate the problem. InternVideo2 performs better than CLIP in many settings, and InternVideo2-1B adds further capacity, but RCORE still provides gains. Scale helps; it does not repeal distributional bias.

Third, training pressure matters. CPR and TORC improve unseen composition accuracy and reduce shortcut diagnostics by expanding compositional supervision and forcing temporal sensitivity.

The business inference is narrower but still important: when video models are used to classify actions in real workflows, teams should not evaluate only aggregate action accuracy. They should audit whether the model is overusing object priors when verbs require temporal evidence.

A deployment-oriented version of the paper’s lesson might look like this:

Deployment question	Paper-inspired diagnostic	Practical meaning
Does the model fail on new action-object pairings?	Compare seen vs. unseen composition performance	Measures recombination ability, not just memorization
Are wrong unseen predictions collapsing into common training pairs?	FSP-style false-seen analysis	Detects over-reliance on historical routines
Are failures mostly wrong verbs with correct objects?	Verb-collapse breakdown	Flags object-driven verb shortcuts
Does temporal disruption reduce verb confidence?	Shuffle/reverse temporal order	Tests whether the model uses motion sequence
Are frequent confounders dominating errors?	Frequent-pair hard-negative audit	Identifies where targeted regularization may help

This is not a claim that every business needs to implement RCORE literally. The architecture assumes a fixed verb-object vocabulary and encoder-based VLM scoring. Many deployed systems use different architectures, decoder-based models, or task-specific pipelines. But the diagnostic idea transfers well: test whether the model knows the action, or merely recognizes the object and guesses the historically common verb.

The boundary: useful diagnosis, not open-ended video intelligence

The paper stays within a controlled setting. That is a strength for diagnosis, but it also defines the boundary of interpretation.

ZS-CAR assumes fixed verb and object vocabularies. The test-time novelty is in the composition, not in entirely new verbs or objects. That makes the evaluation cleaner, but it is not the same as open-ended video understanding. A deployed assistant may need to describe actions outside the taxonomy, handle multiple objects, infer intent, or reason over longer temporal chains.

The paper also focuses on encoder-based VLMs because they expose explicit scores over fixed verb-object labels. Decoder-based multimodal models often produce language outputs rather than standardized logits over a controlled label space. That makes them harder to evaluate using the same FSP/FCP machinery. The failure mode may still exist, but the diagnostic interface changes.

The gains are meaningful but moderate. On Sth-com, RCORE improves unseen composition accuracy by a few points. On EK100-com, the improvements are larger, but absolute accuracy remains limited. A business system would still need domain-specific data, careful label design, temporal coverage, and failure-case review before relying on such models for high-stakes automation.

Finally, synthetic supervision is not automatically safe. CPR works here because it is designed to preserve verb evidence while injecting object variation. In another domain, naive mixing could create unrealistic examples or damage the signal the model needs. The paper’s own ablations show that careless expansion and overly broad penalties can hurt performance. The useful lesson is targeted regularization, not “generate synthetic data and hope the spreadsheet smiles.”

The real value is a better audit habit

The drawer example is memorable because it is embarrassingly simple. The model sees a drawer and guesses opening. But the deeper point is not about drawers. It is about a class of AI failure that looks like understanding from far away and looks like historical autocomplete up close.

In business automation, that distinction matters. A system that recognizes frequent routines can be useful. A system that silently confuses new actions with old co-occurrences can be dangerous in exactly the cases where automation is supposed to help: exceptions, process variation, unusual object handling, and edge workflows.

RCORE’s technical design is specific: CPR for co-occurrence prior regularization, TORC for temporal order sensitivity, open-world evaluation, FSP/FCP diagnostics, and benchmark tests on Sth-com and EK100-com. The broader operational lesson is simpler:

Do not ask only whether a video model gets the label right.

Ask what evidence it used.

If the model guesses the verb by looking at the drawer, it has not learned the action. It has learned the office gossip.

Cognaptus: Automate the Present, Incubate the Future.

¹ Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, and Jinwoo Choi, “Why Can’t I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition,” arXiv:2601.16211v2, 2026, https://arxiv.org/abs/2601.16211.
