Opening — Why this matters now

The industry has quietly shifted its obsession.

Not long ago, the benchmark question was simple: Can AI solve the task?

Today, a more uncomfortable question is emerging: How many tries does it take before the AI even understands the task?

In a world of agentic systems—autonomous traders, copilots, and decision engines—test-time learning efficiency is no longer a technical curiosity. It is an economic constraint.

The paper behind Sensi does something mildly heretical: it presents a system that learns dramatically faster… and still fails.

And yet, that failure might be the most important result in the entire work.


Background — From brute-force agents to structured learners

Most LLM agents today learn like distracted interns.

They poke the environment repeatedly, accumulate noisy observations, and slowly converge toward something resembling understanding. Systems like Agentica reportedly require 1,600–3,000 interactions just to grasp a single game.

This is not intelligence. It is persistence with a GPU bill.

Sensi reframes the problem by asking a sharper question:

What if the issue isn’t reasoning ability—but how learning itself is structured at test time?

Instead of scaling compute, Sensi introduces structure:

| Traditional Agents | Sensi Approach |
|---|---|
| Monolithic reasoning | Split cognition (Observer vs Actor) |
| Unstructured exploration | Curriculum-driven learning |
| Static prompts | Programmable context (database) |
| Implicit evaluation | LLM-as-judge with explicit scoring |

The result is less “try everything” and more “learn one thing properly before moving on.”

A surprisingly rare discipline in AI systems.


Analysis — What Sensi actually builds

1. Two-player cognition: separating thinking from acting

Sensi’s first move is almost embarrassingly simple.

Instead of one LLM doing everything, it splits the agent into:

  • Observer → builds hypotheses about the world
  • Actor → chooses actions to test those hypotheses

This separation creates something resembling epistemology inside the agent:

  • What do I think is true?
  • What action would confirm or reject that?

The result is not just better reasoning—but more reproducible reasoning. The system achieves deterministic behavior (pass@1 ≈ pass@10), which is rare in stochastic LLM pipelines.


2. Curriculum at test time: learning like a human (for once)

Sensi v2 introduces a constraint most AI systems conveniently avoid:

Learn things in order.

Instead of exploring everything simultaneously, the agent follows a queue:

  1. Learn actions
  2. Learn energy system
  3. Learn win conditions

Each item must be completed before the next begins.

This is enforced by a state machine:

| State | Meaning |
|---|---|
| not_reached | Not yet attempted |
| learning | Currently being explored |
| completed | Verified and promoted to facts |

The implication is subtle but powerful:

The agent is no longer optimizing for reward—it is optimizing for understanding.

The reward (winning the game) becomes an emergent outcome, not the objective function.


3. LLM-as-judge: self-evaluation with moving goalposts

Sensi does not rely on fixed metrics.

Instead, it generates its own evaluation criteria dynamically:

  • One LLM defines how learning should be measured
  • Another LLM scores progress against that metric

This creates a feedback loop:

$$ \text{Learning Progress} \rightarrow \text{Self-Evaluation} \rightarrow \text{Curriculum Advancement} $$

Elegant. Also slightly dangerous.

Because the system is now judging itself based on criteria it invented.
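
A minimal sketch of the two-model loop makes the closed circle explicit. The function names, prompt wording, and yes/no protocol here are invented for illustration; stub callables stand in for real LLM calls.

```python
from typing import Callable

def advance(define_llm: Callable[[str], str],
            judge_llm: Callable[[str], str],
            item: str, progress_log: str) -> bool:
    """True when the self-invented criterion says learning is complete."""
    # Model 1 invents the rubric; model 2 scores progress against it.
    metric = define_llm(f"Write a pass/fail criterion for having learned: {item}")
    verdict = judge_llm(f"Criterion: {metric}\nLog: {progress_log}\nAnswer yes or no.")
    return verdict.strip().lower().startswith("yes")

# Stubs make the risk visible: no external ground truth is ever consulted.
define_stub = lambda prompt: "agent can name every legal action"
judge_stub = lambda prompt: "yes"
done = advance(define_stub, judge_stub, "actions", "tried UP, DOWN, LEFT, RIGHT")
```

Note what is absent: nothing in the loop checks the criterion itself against the environment. That absence is the governance problem discussed later.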


4. Database-as-control-plane: the real innovation

The most underappreciated idea in the paper is not the curriculum.

It is the database.

All agent state—facts, hypotheses, history—is stored externally in structured tables and injected into prompts each turn.

This means:

  • Behavior can be changed without modifying prompts
  • Learning can be inspected like logs
  • State becomes programmable, persistent, and auditable

| Layer | Role |
|---|---|
| Database | Control plane (what the agent knows) |
| LLM | Execution engine (how it reasons) |

This is not prompt engineering.

This is neuro-symbolic orchestration disguised as SQLite.
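
A sketch of the pattern, assuming a SQLite backing store (the schema and example rows are invented; the paper's actual tables may differ): state lives in tables, and the prompt is just a rendered view over them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (topic TEXT, claim TEXT)")
conn.execute("CREATE TABLE hypotheses (claim TEXT, status TEXT)")

# State is edited with plain SQL, not prompt surgery (rows are illustrative):
conn.execute("INSERT INTO facts VALUES ('actions', 'UP moves the avatar one tile')")
conn.execute("INSERT INTO hypotheses VALUES ('energy decays each turn', 'testing')")

def build_context(conn: sqlite3.Connection) -> str:
    """Render the tables into the context injected into the prompt each turn."""
    facts = conn.execute("SELECT claim FROM facts").fetchall()
    hyps = conn.execute("SELECT claim, status FROM hypotheses").fetchall()
    return ("Known facts:\n"
            + "\n".join(f"- {c}" for (c,) in facts)
            + "\nOpen hypotheses:\n"
            + "\n".join(f"- {c} [{s}]" for c, s in hyps))

prompt = build_context(conn)  # auditable: the prompt is a view over the tables
```

Because the prompt is derived rather than hand-written, a human (or another process) can change the agent's beliefs with an UPDATE statement and inspect its learning history with a SELECT.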


Findings — Efficiency up, correctness down

Let’s address the uncomfortable part.

Performance summary

| System | Levels Solved | Interactions Needed | Sample Efficiency |
|---|---|---|---|
| Random Agent | 0 | – | Baseline chaos |
| Agentica | Unknown | 1,600–3,000 | Low |
| Sensi v1 | 2 | Variable | Moderate |
| Sensi v2 | 0 | ~32 | Extremely high |

Yes—Sensi v2 solves zero levels.

And yet:

$$ \text{Efficiency Gain} = \frac{1600\text{–}3000}{32} \approx 50\text{–}94\times $$

That number is the real story.


The failure: a beautifully consistent hallucination

Sensi doesn’t fail randomly.

It fails coherently.

The paper identifies a precise failure chain:

| Step | What Happens |
|---|---|
| 1 | Perception error (wrong frame interpretation) |
| 2 | Hypothesis built on wrong data |
| 3 | Judge validates internal consistency |
| 4 | Curriculum marks learning as complete |
| 5 | Wrong knowledge becomes permanent fact |

This is the key dynamic:

$$ \text{Wrong Perception} \rightarrow \text{Consistent Belief} \rightarrow \text{High Confidence} \rightarrow \text{Locked-In Error} $$

In other words:

The system doesn’t fail because it is confused. It fails because it is confidently wrong in a structured way.

Which, incidentally, is also a known human failure mode.
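
The failure chain can be compressed into a toy sketch. Everything here is invented for illustration (the claim, the evidence string, the consistency check): the point is that the judge consults only the agent's own observations, so a perception error upstream is promoted and never re-examined.

```python
facts: set[str] = set()   # permanent store; nothing is ever evicted

def internally_consistent(claim: str, evidence: list[str]) -> bool:
    # Toy stand-in for the LLM judge: it checks the belief against the
    # agent's own (misread) observations, never the real environment.
    return all("contradicts" not in e for e in evidence)

# Steps 1-2: a perception error produces a hypothesis from wrong data.
hypothesis = "lava tiles are safe"
evidence = ["frame misread: avatar stood on lava, nothing happened"]

# Steps 3-5: the judge approves, the curriculum promotes, the error locks in.
if internally_consistent(hypothesis, evidence):
    facts.add(hypothesis)   # confidently wrong, permanently
```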


Implications — What this means beyond games

1. The bottleneck has shifted

Before Sensi:

  • Problem: inefficient learning

After Sensi:

  • Problem: unreliable perception

This is progress.

Efficiency is a scaling problem. Perception is an engineering problem.

One is expensive. The other is fixable.


2. Test-time learning is now economically viable

Reducing interactions from ~3000 to ~30 changes the deployment equation:

  • Lower latency
  • Lower cost
  • Higher adaptability

This matters for:

  • AI trading systems adapting to new market regimes
  • Autonomous agents using unfamiliar APIs
  • Enterprise copilots handling new workflows

In short: learning at runtime becomes practical.


3. Governance risk: self-validated intelligence

Sensi introduces a subtle governance issue.

The system:

  • Defines its own evaluation criteria
  • Judges its own performance
  • Promotes its own knowledge to “facts”

Without external grounding, this creates a closed epistemic loop.

From a business perspective, this is not just a bug—it’s a risk category:

| Risk Type | Description |
|---|---|
| Self-confirming bias | Agent validates incorrect beliefs |
| Silent failure | High confidence masks errors |
| Audit difficulty | Errors embedded in internal state |

This is precisely where AI assurance frameworks will need to evolve.


4. Database-as-control-plane will generalize

This pattern will likely outlive the paper.

Expect to see:

  • Multi-agent systems sharing a common state DB
  • Human-in-the-loop editing agent beliefs directly
  • Audit trails for AI reasoning

In other words, AI systems will start to look suspiciously like distributed systems.

With all the same control-plane vs data-plane abstractions.


Conclusion — Efficiently wrong is still progress

Sensi v2 is, on paper, a failure.

Zero levels solved.

But that framing misses the point.

The architecture demonstrates that:

  • LLM agents can learn structured knowledge in ~30 interactions
  • Curriculum and state machines can reliably guide learning
  • Externalized memory enables controllable, inspectable intelligence

The remaining issue—perception—is not philosophical.

It is technical.

And that distinction matters.

Because once perception is fixed, the system doesn’t need to learn faster.

It already does.


Cognaptus: Automate the Present, Incubate the Future.