Logs are where agentic AI gets honest

A business agent rarely fails in the dramatic way demo videos imply.

It does not usually announce, with theatrical humility, that it has misunderstood the workflow, misread the screen, or built a wrong model of the task. More often, it produces a tidy chain of steps, a reasonable explanation, a few reassuring intermediate notes, and then quietly stores the wrong conclusion as if it were company policy.

That is the useful way to read the Sensi paper [1]. It is not mainly a story about an LLM game agent that failed to win. It is a story about a learning system that became very efficient at learning, while still being vulnerable to learning the wrong thing.

The punchline is almost too neat. Sensi v1, a simpler two-player architecture, solves 2 ARC-AGI-3 levels. Sensi v2, the more sophisticated version with curriculum learning, database-controlled context, and LLM-based learning evaluation, solves 0 levels. Normally, that would be enough for a brutal one-sentence review: nice architecture, shame about the game.

But that reading misses the part that matters. Sensi v2 completes its whole learning curriculum in roughly 32 action attempts, while the paper reports that Agentica requires around 1,600 to 3,000 interactions to build understanding of a game. The agent does not win, but it compresses the learning process dramatically. The result is not success. It is a more interesting failure: fast, structured, auditable, and wrong.

That distinction matters for business AI. In enterprise workflows, the hard problem is often not whether an agent can eventually discover a procedure after enough attempts. The hard problem is whether it can form a reliable working model before it burns time, API calls, customer patience, or compliance budget. Sensi suggests that structured test-time learning can reduce the number of attempts needed to understand a new environment. It also shows why that efficiency is dangerous unless perception and verification are grounded outside the agent’s own tidy internal story.

Sensi is not training a model; it is organizing a learning process

The paper places Sensi in ARC-AGI-3, an interactive game environment where agents must infer rules, action meanings, and win conditions from experience. The agent is not handed a manual. It must look, act, observe what changes, and gradually infer the mechanics.

That setting is useful because it strips the problem down to a brutal loop:

  1. observe the world;
  2. choose an action;
  3. interpret the result;
  4. update beliefs;
  5. repeat until the task becomes understandable.
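The loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `ToyEnv`, its two actions, and `learn_by_interaction` are hypothetical names invented here, and the "world" is a single integer position.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical one-dimensional world used only for illustration."""
    position: int = 0

    def step(self, action: str) -> int:
        # 'A' moves right, anything else moves left; the agent is not told this.
        self.position += 1 if action == "A" else -1
        return self.position


def learn_by_interaction(env: ToyEnv, actions=("A", "B"), max_turns: int = 10) -> dict:
    beliefs = {}                            # action -> observed position change
    before = env.position                   # 1. observe the world
    for _ in range(max_turns):
        untested = [a for a in actions if a not in beliefs]
        if not untested:
            break                           # 5. stop once every action is understood
        action = untested[0]                # 2. choose an action
        after = env.step(action)
        delta = after - before              # 3. interpret the result
        beliefs[action] = delta             # 4. update beliefs
        before = after
    return beliefs
```

Everything interesting in Sensi happens at step 3. In this toy the interpretation is an exact subtraction; in ARC-AGI-3 it is a model's reading of a pixel-art frame, which is where the trouble starts.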

This is not the same as conventional benchmark answering. It is closer to the enterprise problem of giving an agent a new software interface, a new claims-processing workflow, a new data extraction task, or a new internal tool and expecting it to learn through interaction. No amount of confident language matters if the agent cannot tell what changed after it clicked a button.

Sensi’s core move is therefore architectural rather than model-level. It does not fine-tune weights at test time. It does not claim to solve the environment through brute-force inference. Instead, it builds an external learning system around frozen LLM calls. The model remains the reasoning engine, but learning is organized through roles, curriculum, state transitions, stored facts, and evaluator feedback.

That is why the paper’s contribution is more interesting than the leaderboard result. It asks whether an LLM agent can be made into a structured learner without retraining the LLM itself.

Sensi v1 separates seeing from acting, which is already a serious design choice

Sensi v1 starts with a simple split: one LLM role observes, another acts.

The Observer maintains hypotheses about the game world. It tracks guesses and figured-out facts: what actions seem to do, what objects may mean, what conditions may trigger progress or failure. The Actor receives that evolving understanding and chooses the next action.

This sounds obvious only after someone says it. Many agent designs still ask one model call to perceive the situation, remember the history, infer the rule, decide the next experiment, and take the action. That is not elegance. That is asking one intern to be analyst, operator, QA engineer, and database administrator while pretending the org chart is “agentic.”

Sensi v1’s split creates a basic epistemic discipline:

| Role | Main job | Operational consequence |
| --- | --- | --- |
| Observer | Interpret changes and maintain hypotheses | The system has a place to accumulate beliefs instead of burying them in transient reasoning text. |
| Actor | Choose actions that test or use those hypotheses | The system can separate “what we think is true” from “what we should try next.” |
| Shared hypothesis lists | Store guesses and confirmed observations | The agent gains continuity across turns without relying only on model memory. |
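A minimal sketch of that division of labor, under assumptions: the paper does not publish this interface, so `HypothesisStore`, `observer_turn`, and `actor_turn` are hypothetical names, and the `interpret` and `choose` callables stand in for the two LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HypothesisStore:
    """Shared lists that both roles read; only the Observer writes to them."""
    guesses: list[str] = field(default_factory=list)
    figured_out: list[str] = field(default_factory=list)


def observer_turn(store: HypothesisStore, change: str,
                  interpret: Callable[[str, HypothesisStore], dict]) -> None:
    """Observer: interpret what changed and update the shared hypotheses."""
    update = interpret(change, store)           # stand-in for an LLM call
    for guess in update.get("guesses", []):
        if guess not in store.guesses:
            store.guesses.append(guess)
    for fact in update.get("figured_out", []):
        if fact not in store.figured_out:
            store.figured_out.append(fact)
        if fact in store.guesses:               # a confirmed guess stops being a guess
            store.guesses.remove(fact)


def actor_turn(store: HypothesisStore,
               choose: Callable[[HypothesisStore], str]) -> str:
    """Actor: pick the next action from current beliefs, without editing them."""
    return choose(store)                        # stand-in for a second LLM call
```

The design point is the write boundary: beliefs live in one place, and the role that acts does not get to revise them on the fly.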

The reported v1 result is modest but meaningful. Using ChatGPT 5.1 as the backbone, Sensi v1 solves 2 levels of game LS20 and discovers 15 correct facts about game mechanics. The appendix lists facts such as action-direction mappings, energy consumption, key-door interactions, and the win condition.

The more subtle point is reproducibility. The paper reports that pass@10 equals pass@1 for v1 under the same accumulated knowledge. In plain language, once the agent has the right figured-out list, it tends to act consistently. That is important because business systems usually prefer boring reliability over stochastic brilliance. “It worked beautifully on attempt seven” is not a deployment strategy. It is a gambling habit with a UI.

But v1 also exposes the next problem. It has no strong control over what the agent learns first, no external verifier for whether something has really been learned, and no robust mechanism to prevent degeneracy. At level 3, the agent reportedly enters a repetitive state and stops meaningfully updating its understanding. So v1 shows that role separation helps, but it does not yet make learning governable.

Sensi v2 turns learning into a queue, not a vibe

Sensi v2 adds the paper’s main architectural idea: curriculum-based test-time learning.

Instead of letting the agent explore everything at once, v2 gives it an ordered list of learning items. The default curriculum begins with action meanings, then energy effects, then the win condition. Each item moves through a three-state lifecycle:

| State | Meaning | Why it matters |
| --- | --- | --- |
| not_reached | The agent has not started this item | Learning is sequenced rather than scattered. |
| learning | The item is active and being evaluated | The agent’s actions are focused on one uncertainty. |
| completed | The item has passed the learning threshold | Figured-out items are promoted into facts for later stages. |

This is the part of the paper that deserves more attention than the win/loss column. The curriculum does not merely tell the agent what to care about. It changes the structure of exploration.

A monolithic agent in an unknown environment can waste attempts mixing together too many questions: What does this button do? What is the goal? Why did the score change? Did energy decrease? Is the red object a key, an enemy, or a decoration? That is cognitively expensive for humans and mechanically expensive for agents.

Sensi v2 imposes a narrower question at each stage. First learn actions. Then learn energy. Then learn the win condition. The sequence is hand-designed, which is a limitation, but it is also the point: the system demonstrates that test-time learning can be organized as a controlled process rather than left to the model’s improvisational charm.
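The three-state lifecycle is small enough to sketch directly. The state names and the default item order come from the paper as described here; the `Curriculum` class, the `report_score` method, and the 0.8 threshold are assumptions made for illustration.

```python
from enum import Enum


class ItemState(Enum):
    NOT_REACHED = "not_reached"
    LEARNING = "learning"
    COMPLETED = "completed"


class Curriculum:
    """Ordered learning items; exactly one item is in the 'learning' state at a time."""

    def __init__(self, items, threshold: float = 0.8):  # threshold value is assumed
        self.items = list(items)
        self.threshold = threshold
        self.states = {item: ItemState.NOT_REACHED for item in self.items}
        if self.items:
            self.states[self.items[0]] = ItemState.LEARNING

    def active(self):
        """Return the item currently being learned, or None when all are done."""
        for item in self.items:
            if self.states[item] is ItemState.LEARNING:
                return item
        return None

    def report_score(self, score: float):
        """Complete the active item if the score clears the threshold, then advance."""
        item = self.active()
        if item is None or score < self.threshold:
            return item
        self.states[item] = ItemState.COMPLETED
        nxt = self.items.index(item) + 1
        if nxt < len(self.items):
            self.states[self.items[nxt]] = ItemState.LEARNING
        return self.active()
```

Note what the state machine does not know: where the score came from. It advances on any number above the threshold, which is exactly the property the failure analysis later exploits.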

For enterprise agents, this maps directly to workflow onboarding. A claims-processing agent should not discover every policy rule, form field, exception case, and approval path simultaneously. It should learn document types, then field meanings, then validation rules, then escalation conditions. In other words: a curriculum. Revolutionary, in the same way that checklists were revolutionary to people who previously relied on memory and confidence.

SQLite is the control plane, not a storage closet

The most transferable idea in Sensi v2 is not the game setting. It is the database.

Sensi v2 stores the agent’s cognitive state in SQLite tables: learning items, inputs, game history, guesses, figured-out items, and losing action sequences. Every turn reads from and writes to this database. The prompt becomes a view over structured state rather than a hand-packed bundle of recent chat history.

That design changes what it means to manage an agent.

| Technical choice | What it does inside Sensi | Business meaning |
| --- | --- | --- |
| External database state | Stores facts, guesses, history, curriculum status | Agent cognition becomes inspectable and auditable. |
| Programmatic prompt construction | Injects selected tables into context each turn | Context can be controlled without rewriting prompts manually. |
| Curriculum state machine | Advances learning items when thresholds are met | Learning progress becomes operationally governable. |
| Stored action/failure history | Tracks attempts and losing sequences | Debugging can use logs, not vibes. |

This is why “database-as-control-plane” is a serious phrase rather than a fancy way of saying “we saved some stuff.” In ordinary prompt engineering, context is usually a blob. In Sensi, context is constructed from structured state. That means a human or another program can inspect, edit, reorder, seed, or remove pieces of the agent’s working knowledge.

Want to skip learning action meanings because a human already knows them? Seed the facts table. Want to change the learning order? Modify the curriculum rows. Want to audit why the agent took a certain action? Inspect the stored guesses, figured-out items, and prior diffs.
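To make those three operations concrete, here is a sketch using Python's built-in `sqlite3` module. The schema is an assumption: the table and column names below are invented for illustration and are not the paper's actual schema, and the seeded fact is a made-up example rather than a reported game mechanic.

```python
import sqlite3


def open_control_plane() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE learning_items (
            name     TEXT PRIMARY KEY,
            position INTEGER NOT NULL,
            state    TEXT NOT NULL DEFAULT 'not_reached'
        );
        CREATE TABLE facts (
            item      TEXT NOT NULL REFERENCES learning_items(name),
            statement TEXT NOT NULL,
            source    TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO learning_items (name, position) VALUES (?, ?)",
        [("action meanings", 1), ("energy effects", 2), ("win condition", 3)],
    )
    return conn


def seed_fact(conn: sqlite3.Connection, item: str, statement: str) -> None:
    """A human supplies a fact directly and marks its learning item as done."""
    conn.execute(
        "INSERT INTO facts (item, statement, source) VALUES (?, ?, 'human')",
        (item, statement),
    )
    conn.execute("UPDATE learning_items SET state = 'completed' WHERE name = ?", (item,))


def build_prompt_context(conn: sqlite3.Connection) -> str:
    """Render the prompt as a view over structured state, not a chat transcript."""
    facts = conn.execute("SELECT statement, source FROM facts ORDER BY rowid").fetchall()
    pending = conn.execute(
        "SELECT name FROM learning_items WHERE state != 'completed' ORDER BY position"
    ).fetchall()
    lines = ["Known facts:"]
    lines += [f"- {statement} (source: {source})" for statement, source in facts]
    lines.append("Still to learn, in order:")
    lines += [f"- {name}" for (name,) in pending]
    return "\n".join(lines)
```

Seeding is one `seed_fact` call, reordering is one `UPDATE` on `position`, and auditing is a `SELECT`. The `source` column is the modest beginning of provenance: it records whether a fact came from a person or from the agent's own promotion step.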

This is how agent systems begin to look less like chatbot magic and more like distributed software. The LLM is not the whole system. It is one execution component inside a controlled architecture.

Sense scoring makes progress measurable, then exposes the trap

Sensi v2 also adds an LLM-as-judge component called sense scoring. The idea is straightforward. For each active learning item, one LLM generates a learning metric. Another LLM evaluates whether the agent’s current facts and figured-out items satisfy that metric. If the score reaches the threshold, the state machine marks the item completed and promotes the figured-out items into facts.

This is elegant. It is also the source of the failure.

The sense scorer is not checking the game’s ground truth directly. It is checking whether the agent’s current understanding looks good under a generated rubric. If the agent’s understanding is coherent but false, the judge can reward it. This is not a minor implementation nuisance. It is the central governance problem of self-evaluating agents.

The learning loop can be simplified as:

$$ \text{observe} \rightarrow \text{hypothesize} \rightarrow \text{score understanding} \rightarrow \text{promote facts} $$

That loop is only as reliable as the weakest epistemic link. If observation is wrong, the rest of the system can become beautifully organized around the wrong world model.

And that is exactly what happens.

Zero wins and thirty-two tries is not a contradiction

The experimental result has two layers.

The surface layer is task performance. Sensi v1 solves 2 levels. Sensi v2 solves 0 levels.

The deeper layer is learning-process performance. Sensi v2 completes its learning curriculum in about 32 action attempts. The paper compares this with Agentica’s reported 1,600 to 3,000 interactions per game.

$$ \frac{1600}{32}=50, \quad \frac{3000}{32}=93.75 $$

So the reported sample-efficiency improvement is roughly 50–94×.

That number should not be overread. It is not a controlled apples-to-apples benchmark. The paper itself notes several confounds: v1 and v2 use different backbone models, the Agentica comparison uses reported numbers rather than a reproduced baseline, and v2 does not demonstrate end-to-end task success. Still, the magnitude is large enough to change the interpretation.

| Result | What it directly shows | What it does not show |
| --- | --- | --- |
| Sensi v1 solves 2 levels | The two-player Observer/Actor architecture can accumulate correct game knowledge when perception works well enough. | It does not prove curriculum learning is necessary. |
| Sensi v2 solves 0 levels | The full curriculum architecture did not achieve task success in this experiment. | It does not prove the learning-control architecture is useless. |
| V2 completes the curriculum in ~32 attempts | The state machine, learning queue, and sense-scoring loop can drive rapid structured learning. | It does not prove that the learned facts are correct. |
| Agentica reported at 1,600–3,000 interactions | Sensi’s learning loop may be far more sample-efficient under the paper’s comparison axis. | It is not a reproduced baseline under identical conditions. |

This is why the headline “v2 failed” is technically true and analytically lazy. The more precise reading is that Sensi v2 successfully executes a learning process, but the process validates the wrong knowledge because the perception layer contaminates the pipeline.

That is a different class of failure. It is less like an agent that cannot learn and more like an organization that has excellent process discipline built around bad source data. Anyone who has seen a corporate dashboard built on broken definitions will recognize the genre.

The hallucination cascade turns perception errors into policy

The paper’s failure analysis is the strongest part of the work because it identifies a concrete causal chain.

Sensi v2 uses a frame-differencing module to describe visual changes between consecutive game frames. This is supposed to tell the Observer what happened after an action. But the frames are small pixel-art images, and the multimodal model can misidentify objects, directions, positions, and UI elements.

A wrong visual diff then enters the learning pipeline. The Observer updates its hypotheses based on that diff. The sense scorer evaluates the resulting figured-out items. Because the items may be internally coherent, the scorer gives a high score. The curriculum marks the item completed. The wrong figured-out items are promoted into facts. Later learning items build on those facts.

That is the self-consistent hallucination cascade:

| Stage | Failure mechanism | Why it is dangerous |
| --- | --- | --- |
| 1. Perception error | The frame-diff module describes the visual change incorrectly. | Ground truth never enters the reasoning pipeline. |
| 2. Hypothesis contamination | The Observer builds a plausible but false rule. | The system now has structured misinformation, not random noise. |
| 3. Spurious validation | The sense scorer rewards internal consistency. | Coherence is mistaken for correctness. |
| 4. Premature completion | The state machine advances the curriculum. | Learning appears complete before it is grounded. |
| 5. Error compounding | Wrong facts are reused in later learning items. | One bad observation becomes downstream “knowledge.” |
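The five stages can be reproduced in a toy simulation. Everything here is hypothetical: the perception bug (a flipped vertical sign), the `coherence_judge` rule, and the 0.8 threshold are invented to illustrate the mechanism, not taken from the paper's code.

```python
def faulty_frame_diff(true_delta):
    """Hypothetical perception bug: vertical motion is reported with its sign flipped."""
    dx, dy = true_delta
    return (dx, -dy)


def coherence_judge(beliefs) -> float:
    """Stand-in for a judge that checks only internal consistency:
    full marks if no two actions are believed to have the same effect."""
    effects = list(beliefs.values())
    return 1.0 if len(set(effects)) == len(effects) else 0.0


def run_cascade(true_effects, perceive, threshold: float = 0.8):
    # Stages 1-2: perception feeds the hypotheses; ground truth is never consulted.
    beliefs = {action: perceive(delta) for action, delta in true_effects.items()}
    # Stage 3: the judge scores coherence, not correctness.
    score = coherence_judge(beliefs)
    # Stage 4: a passing score promotes every belief to a fact.
    facts = dict(beliefs) if score >= threshold else {}
    # Stage 5: later items would build on these; list the ones that are wrong.
    wrong = sorted(a for a, d in facts.items() if d != true_effects[a])
    return score, facts, wrong


TRUE_EFFECTS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
score, facts, wrong = run_cascade(TRUE_EFFECTS, faulty_frame_diff)
```

With the flipped sign, the four beliefs are still mutually distinct, so the judge returns a perfect score and two of the four promoted "facts" (`up` and `down`) are false. Nothing in the pipeline raised an error, which is the point.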

This is the callback to the opening problem. Agentic AI does not always fail by being confused. Sometimes it fails by becoming organized.

That is worse in business settings than a visible error. A visible error can be caught. A coherent false belief stored in a persistent control plane can spread. It can shape future actions, suppress alternative hypotheses, and make dashboards look clean while the underlying model of the world is quietly rotten.

The paper’s analogy is essentially a buggy comparator inside an otherwise functioning sorting algorithm. The architecture can run correctly while producing wrong outcomes because one input function is wrong. In enterprise terms, the workflow engine may be fine; the document parser is lying.

The paper’s evidence is useful, but it is not all the same kind of evidence

Not every result in the paper carries the same weight. The paper combines architecture proposal, implementation detail, main result, comparison, and failure diagnosis. Those pieces support different claims.

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Observer/Actor split in v1 | Main architectural evidence | Separating perception and action can support hypothesis accumulation and consistent action selection. | It does not isolate the contribution from the backbone model or environment specifics. |
| Appendix list of 15 discovered facts | Implementation evidence and qualitative validation | V1 can recover concrete game mechanics through interaction. | It is not a broad benchmark across many games. |
| V2 learning queue and state machine | Architectural mechanism | Learning can be sequenced and externally controlled. | It does not guarantee correct learned content. |
| Database schema | Implementation detail with general design value | Agent state can be made persistent, inspectable, and steerable. | It does not by itself improve perception or task success. |
| Sense-score progression | Main process evidence, partly illustrative | The curriculum loop can advance through learning items. | Representative curves are not exact per-turn reproducible measurements. |
| 32 attempts vs. 1,600–3,000 | Comparison with prior reported system | Sensi may greatly reduce interaction budgets for structured learning. | It is not a controlled reproduced baseline. |
| Self-consistent hallucination cascade | Failure analysis | The v2 failure is localized to perceptual grounding and validation design. | It does not show that fixing perception alone will guarantee wins. |

This classification is important because the paper’s strongest business lesson is not “Sensi is ready for deployment.” It is not.

The stronger lesson is architectural: agent learning can be made modular enough that failure becomes diagnosable. That is valuable. A bad end-to-end result with a clear failure chain is more useful than a good demo with no idea why it worked.

What Cognaptus would infer for business agents

The direct result belongs to ARC-AGI-3. The business inference belongs to agent onboarding.

A company deploying agents into unfamiliar workflows faces a recurring problem: the agent must learn operational structure from limited interaction. It may need to understand a new CRM screen, a procurement approval path, a spreadsheet template, a shipping exception procedure, or a support escalation rule. The agent does not need to “win a game.” It needs to reduce uncertainty without turning guesses into policy.

Sensi suggests a practical design pattern:

| Sensi mechanism | Enterprise translation | ROI relevance | Required guardrail |
| --- | --- | --- | --- |
| Observer/Actor separation | Split workflow understanding from action execution. | Fewer reckless actions; clearer debugging. | Prevent the Actor from acting on low-confidence beliefs. |
| Curriculum queue | Teach the agent workflow components in a defined order. | Faster onboarding to new procedures. | Human or programmatic review of curriculum design. |
| Database-as-control-plane | Store beliefs, facts, attempts, failures, and state transitions in tables. | Auditability, reuse, and easier correction. | Versioned state, rollback, and provenance metadata. |
| LLM-as-judge | Score whether the agent has learned enough to proceed. | Automated progress checking. | Ground-truth checks, not only internal consistency. |
| Failure history | Preserve losing action sequences and bad attempts. | Avoid repeated costly mistakes. | Distinguish true failures from context-specific exceptions. |

This is not marketing copy for autonomous agents. The inference is more restrained: Sensi-like architecture may reduce the interaction budget required for agents to learn unfamiliar workflows, provided perception and validation are externally grounded.

That condition is not decorative. It is the whole ballgame.

For document workflows, grounding may mean deterministic parsers, schema validation, sampled human review, and database constraints. For UI agents, it may mean DOM-level inspection rather than screenshot-only interpretation. For trading or finance agents, it may mean market-data provenance, execution logs, and hard risk checks outside the LLM. For compliance workflows, it may mean policy references and rule engines that can veto the agent’s self-declared understanding.

The common principle is simple: never let the same system invent the rubric, grade itself, and promote its own guesses into permanent facts without external verification. That is not autonomy. That is a beautifully formatted conflict of interest.
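One way to encode that principle is to make promotion require two independent signatures. This is a sketch of the guardrail, not something the paper implements: `Fact`, `promote`, the verifier interface, and the 0.8 threshold are all hypothetical, and the verifier stands in for whatever external check fits the workflow (a schema validator, a rule engine, a sampled human review).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Fact:
    statement: str
    judge_score: float
    verified_by: str    # provenance: which external check signed off


def promote(candidates: Iterable[str],
            judge: Callable[[str], float],
            verifier: Callable[[str], bool],
            verifier_name: str,
            threshold: float = 0.8):    # threshold value is assumed
    """Promote a candidate only if the judge AND an external verifier both accept it."""
    promoted, held = [], []
    for statement in candidates:
        score = judge(statement)                    # internal consistency check
        if score >= threshold and verifier(statement):   # external grounding check
            promoted.append(Fact(statement, score, verifier_name))
        else:
            held.append(statement)                  # stays a guess; nothing is discarded
    return promoted, held
```

A candidate that the judge likes but the verifier rejects is held, not deleted. It remains available as a guess, which keeps the alternative hypothesis alive instead of letting a coherent story crowd it out.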

The real boundary: efficient learning is not reliable learning

The paper is unusually useful because it does not hide the negative result. Still, the boundary conditions matter.

First, v2 solved 0 levels. That means the full curriculum architecture has not yet demonstrated end-to-end game success in the reported experiment. The process worked; the task outcome did not.

Second, the curriculum is hand-designed. The agent is not automatically discovering the best order of concepts. In business settings, that may be acceptable, since many workflows already have natural dependency structures, but it limits claims about general autonomous learning.

Third, the comparison between v1 and v2 is confounded by different backbone models. The paper reports ChatGPT 5.1 for v1 and Gemini 3.1 Pro for v2. That makes it hard to attribute differences purely to architecture.

Fourth, the sample-efficiency comparison to Agentica is based on reported interaction counts rather than a controlled reproduction. The 50–94× number is useful as a magnitude signal, not as a final benchmark verdict.

Fifth, and most importantly, the judge evaluates internal consistency more than correspondence to ground truth. This is the failure mode that business readers should remember. An agent can pass its own learning checkpoint while being wrong about the world.

These limitations do not erase the contribution. They define where it can be used. Sensi is not evidence that self-directed LLM agents are ready to learn any environment safely. It is evidence that structured test-time learning can be made fast, inspectable, and modular enough to reveal where reliability breaks.

That is a narrower claim. It is also the more useful one.

The useful lesson is not “trust the agent,” but “instrument the learning”

Sensi v2 is easy to mock if the only metric is winning. Zero levels solved is not the traditional badge of triumph.

But serious system design often advances through failures that become inspectable. Sensi’s failure is valuable because it names the bottleneck. The learning architecture did not merely flail. It executed the curriculum, advanced the state machine, stored knowledge, and exposed the perception-to-validation failure chain.

For business AI, that is the lesson: do not only ask whether the agent completed the task. Ask whether the learning process is instrumented enough to explain what the agent thinks it learned, how it learned it, who verified it, and whether that verification touched reality.

Sensi’s most memorable result is that an agent can learn less randomly, learn more quickly, and still learn incorrectly. That sounds like bad news only if one expected intelligence to emerge from longer prompts and a generous budget. For everyone else, it is a useful engineering diagnosis.

The next generation of business agents will not be judged only by how well they reason. They will be judged by how safely they update their working model of the world. Sensi shows one way to make that update process structured. It also shows exactly why structure without grounding can turn error into fact at industrial speed.

Efficiently wrong is still wrong. But at least now we know where to put the wrench.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mohsen Arjmandi, “Sensi: Learn One Thing at a Time—Curriculum-Based Test-Time Learning for LLM Game Agents,” arXiv:2603.17683, 2026. https://arxiv.org/abs/2603.17683