Opening — Why this matters now
The industry has quietly shifted its obsession.
Not long ago, the benchmark question was simple: Can AI solve the task?
Today, a more uncomfortable question is emerging: How many tries does it take before the AI even understands the task?
In a world of agentic systems—autonomous traders, copilots, and decision engines—test-time learning efficiency is no longer a technical curiosity. It is an economic constraint.
The paper behind Sensi does something mildly heretical: it presents a system that learns dramatically faster… and still fails.
And yet, that failure might be the most important result in the entire work.
Background — From brute-force agents to structured learners
Most LLM agents today learn like distracted interns.
They poke the environment repeatedly, accumulate noisy observations, and slowly converge toward something resembling understanding. Systems like Agentica reportedly require 1,600–3,000 interactions just to grasp a single game.
This is not intelligence. It is persistence with a GPU bill.
Sensi reframes the problem by asking a sharper question:
What if the issue isn’t reasoning ability—but how learning itself is structured at test time?
Instead of scaling compute, Sensi introduces structure:
| Traditional Agents | Sensi Approach |
|---|---|
| Monolithic reasoning | Split cognition (Observer vs Actor) |
| Unstructured exploration | Curriculum-driven learning |
| Static prompts | Programmable context (database) |
| Implicit evaluation | LLM-as-judge with explicit scoring |
The result is less “try everything” and more “learn one thing properly before moving on.”
A surprisingly rare discipline in AI systems.
Analysis — What Sensi actually builds
1. Two-player cognition: separating thinking from acting
Sensi’s first move is almost embarrassingly simple.
Instead of one LLM doing everything, it splits the agent into:
- Observer → builds hypotheses about the world
- Actor → chooses actions to test those hypotheses
This separation creates something resembling epistemology inside the agent:
- What do I think is true?
- What action would confirm or reject that?
The result is not just better reasoning—but more reproducible reasoning. The system achieves deterministic behavior (pass@1 ≈ pass@10), which is rare in stochastic LLM pipelines.
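The split can be sketched in a few lines. This is a minimal illustration of the Observer/Actor division of labor, not the paper's actual API; the class names, the `Hypothesis` record, and the stubbed belief-update rule are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    claim: str         # e.g. "actions cost energy"
    confidence: float  # observer's current belief strength

class Observer:
    """Builds and revises hypotheses about the world (the 'what do I think is true?' half)."""
    def update(self, observation: str, hypotheses: list[Hypothesis]) -> list[Hypothesis]:
        # The real system would make an LLM call here; a trivial rule stands in.
        if "energy decreased" in observation:
            hypotheses.append(Hypothesis("actions cost energy", 0.6))
        return hypotheses

class Actor:
    """Chooses the action that best tests the weakest hypothesis (the 'what would confirm it?' half)."""
    def act(self, hypotheses: list[Hypothesis]) -> str:
        if not hypotheses:
            return "explore"  # nothing to test yet
        target = min(hypotheses, key=lambda h: h.confidence)
        return f"test: {target.claim}"

observer, actor = Observer(), Actor()
beliefs = observer.update("energy decreased after move", [])
print(actor.act(beliefs))  # -> "test: actions cost energy"
```

Because the Actor's choice is a deterministic function of the stored hypotheses, the same belief state always yields the same action, which is consistent with the pass@1 ≈ pass@10 behavior described above.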
2. Curriculum at test time: learning like a human (for once)
Sensi v2 introduces a constraint most AI systems conveniently avoid:
Learn things in order.
Instead of exploring everything simultaneously, the agent follows a queue:
- Learn actions
- Learn energy system
- Learn win conditions
Each item must be completed before the next begins.
This is enforced by a state machine:
| State | Meaning |
|---|---|
| not_reached | Not yet attempted |
| learning | Currently being explored |
| completed | Verified and promoted to facts |
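The state machine above is small enough to sketch directly. This is an illustrative reconstruction under stated assumptions: the item names come from the queue listed earlier, but the class shape and the promotion rule are hypothetical.

```python
CURRICULUM = ["actions", "energy system", "win conditions"]

class Curriculum:
    """Enforces in-order learning: each item must be completed before the next begins."""
    def __init__(self, items):
        self.order = list(items)
        self.state = {item: "not_reached" for item in items}

    def current(self):
        """First item not yet completed; marks it 'learning'. None when done."""
        for item in self.order:
            if self.state[item] != "completed":
                self.state[item] = "learning"
                return item
        return None

    def complete(self, item):
        # Only the item currently being learned can be promoted,
        # which is what makes progression strictly sequential.
        assert self.state[item] == "learning", "out-of-order completion"
        self.state[item] = "completed"

cur = Curriculum(CURRICULUM)
print(cur.current())   # "actions"
cur.complete("actions")
print(cur.current())   # "energy system"
```

The `assert` in `complete` is the whole discipline in one line: the agent cannot skip ahead, so exploration budget is spent on one concept at a time.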
The implication is subtle but powerful:
The agent is no longer optimizing for reward—it is optimizing for understanding.
The reward (winning the game) becomes an emergent outcome, not the objective function.
3. LLM-as-judge: self-evaluation with moving goalposts
Sensi does not rely on fixed metrics.
Instead, it generates its own evaluation criteria dynamically:
- One LLM defines how learning should be measured
- Another LLM scores progress against that metric
This creates a feedback loop:
$$ \text{Learning Progress} \rightarrow \text{Self-Evaluation} \rightarrow \text{Curriculum Advancement} $$
Elegant. Also slightly dangerous.
Because the system is now judging itself based on criteria it invented.
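The loop can be made concrete with a small sketch. Here `llm` stands in for any chat-completion call, and the prompt wording and the 0.8 threshold are assumptions, not values from the paper.

```python
def define_rubric(llm, item: str) -> str:
    """One model invents the measuring stick."""
    return llm(f"Write a pass/fail criterion for having learned: {item}")

def judge(llm, rubric: str, evidence: str) -> float:
    """Another model scores progress against that same invented rubric."""
    reply = llm(f"Score 0-1 how well this evidence meets the rubric.\n"
                f"Rubric: {rubric}\nEvidence: {evidence}")
    return float(reply)

def learning_step(llm, item: str, evidence: str, threshold: float = 0.8) -> bool:
    rubric = define_rubric(llm, item)
    return judge(llm, rubric, evidence) >= threshold  # True => curriculum advances

# Toy stand-in LLM so the loop runs end to end.
def fake_llm(prompt: str) -> str:
    return "0.9" if "Score" in prompt else "Agent can list all legal moves."

print(learning_step(fake_llm, "actions", "moved in 4 directions"))  # True
```

Note what never appears in this loop: the environment. Both the rubric and the score come from model calls, which is exactly the closed-loop risk flagged above.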
4. Database-as-control-plane: the real innovation
The most underappreciated idea in the paper is not the curriculum.
It is the database.
All agent state—facts, hypotheses, history—is stored externally in structured tables and injected into prompts each turn.
This means:
- Behavior can be changed without modifying prompts
- Learning can be inspected like logs
- State becomes programmable, persistent, and auditable
| Layer | Role |
|---|---|
| Database | Control plane (what the agent knows) |
| LLM | Execution engine (how it reasons) |
This is not prompt engineering.
This is neuro-symbolic orchestration disguised as SQLite.
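A minimal sketch of the pattern, using Python's built-in `sqlite3`. The schema and column names are illustrative assumptions; the point is that prompt content is rendered from queries, so editing rows changes behavior without touching any prompt template.

```python
import sqlite3

# All agent knowledge lives in tables: the control plane.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (claim TEXT, source TEXT)")
db.execute("CREATE TABLE hypotheses (claim TEXT, confidence REAL)")

db.execute("INSERT INTO facts VALUES (?, ?)",
           ("moving costs 1 energy", "curriculum:energy"))
db.execute("INSERT INTO hypotheses VALUES (?, ?)",
           ("blue tiles restore energy", 0.4))

def render_context() -> str:
    """Inject current state into the prompt each turn.
    Inspecting learning is now just reading these tables, like logs."""
    facts = [row[0] for row in db.execute("SELECT claim FROM facts")]
    hyps = [f"{c} (p={p})" for c, p in db.execute("SELECT * FROM hypotheses")]
    return ("KNOWN FACTS:\n- " + "\n- ".join(facts) +
            "\nOPEN HYPOTHESES:\n- " + "\n- ".join(hyps))

print(render_context())
```

A human operator (or another agent) can `UPDATE` or `DELETE` a row to correct a belief, which is what makes the state programmable, persistent, and auditable.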
Findings — Efficiency up, correctness down
Let’s address the uncomfortable part.
Performance summary
| System | Levels Solved | Interactions Needed | Sample Efficiency |
|---|---|---|---|
| Random Agent | 0 | — | Baseline chaos |
| Agentica | Unknown | 1,600–3,000 | Low |
| Sensi v1 | 2 | Variable | Moderate |
| Sensi v2 | 0 | ~32 | Extremely high |
Yes—Sensi v2 solves zero levels.
And yet:
$$ \text{Efficiency Gain} = \frac{1600}{32} \text{ to } \frac{3000}{32} \approx 50\text{–}94\times $$
That number is the real story.
The failure: a beautifully consistent hallucination
Sensi doesn’t fail randomly.
It fails coherently.
The paper identifies a precise failure chain:
| Step | What Happens |
|---|---|
| 1 | Perception error (wrong frame interpretation) |
| 2 | Hypothesis built on wrong data |
| 3 | Judge validates internal consistency |
| 4 | Curriculum marks learning as complete |
| 5 | Wrong knowledge becomes permanent fact |
This is the key dynamic:
$$ \text{Wrong Perception} \rightarrow \text{Consistent Belief} \rightarrow \text{High Confidence} \rightarrow \text{Locked-In Error} $$
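The five-step chain can be sketched end to end. This is a hypothetical illustration of the dynamic, not the paper's code: a misread frame passes the judge because the judge checks internal consistency rather than ground truth, and the curriculum then freezes the error as a fact.

```python
facts: set[str] = set()
prior_beliefs: set[str] = set()

def perceive(frame: dict) -> str:
    # Steps 1-2: the frame is misread and a hypothesis is built on wrong data.
    return "red tiles are safe"  # ground truth in this toy world: red tiles drain energy

def judge_consistency(belief: str, beliefs: set[str]) -> bool:
    # Step 3: pass unless the agent has already recorded the exact opposite.
    # The environment is never consulted, so a coherent error always passes.
    return ("NOT " + belief) not in beliefs

belief = perceive({"tile": "red"})
if judge_consistency(belief, prior_beliefs):
    facts.add(belief)            # Steps 4-5: marked complete, promoted to permanent fact
    prior_beliefs.add(belief)

print(facts)  # {'red tiles are safe'}
```

Nothing downstream of `perceive` can repair the mistake, because every later check compares the belief against records that were themselves derived from the bad frame.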
In other words:
The system doesn’t fail because it is confused. It fails because it is confidently wrong in a structured way.
Which, incidentally, is also a known human failure mode.
Implications — What this means beyond games
1. The bottleneck has shifted
Before Sensi:
- Problem: inefficient learning
After Sensi:
- Problem: unreliable perception
This is progress.
Efficiency is a scaling problem. Perception is an engineering problem.
One is expensive. The other is fixable.
2. Test-time learning is now economically viable
Reducing interactions from ~3,000 to ~32 changes the deployment equation:
- Lower latency
- Lower cost
- Higher adaptability
This matters for:
- AI trading systems adapting to new market regimes
- Autonomous agents using unfamiliar APIs
- Enterprise copilots handling new workflows
In short: learning at runtime becomes practical.
3. Governance risk: self-validated intelligence
Sensi introduces a subtle governance issue.
The system:
- Defines its own evaluation criteria
- Judges its own performance
- Promotes its own knowledge to “facts”
Without external grounding, this creates a closed epistemic loop.
From a business perspective, this is not just a bug—it’s a risk category:
| Risk Type | Description |
|---|---|
| Self-confirming bias | Agent validates incorrect beliefs |
| Silent failure | High confidence masks errors |
| Audit difficulty | Errors embedded in internal state |
This is precisely where AI assurance frameworks will need to evolve.
4. Database-as-control-plane will generalize
This pattern will likely outlive the paper.
Expect to see:
- Multi-agent systems sharing a common state DB
- Human-in-the-loop editing agent beliefs directly
- Audit trails for AI reasoning
In other words, AI systems will start to look suspiciously like distributed systems.
With all the same control-plane vs data-plane abstractions.
Conclusion — Efficiently wrong is still progress
Sensi v2 is, on paper, a failure.
Zero levels solved.
But that framing misses the point.
The architecture demonstrates that:
- LLM agents can learn structured knowledge in ~30 interactions
- Curriculum and state machines can reliably guide learning
- Externalized memory enables controllable, inspectable intelligence
The remaining issue—perception—is not philosophical.
It is technical.
And that distinction matters.
Because once perception is fixed, the system doesn’t need to learn faster.
It already does.
Cognaptus: Automate the Present, Incubate the Future.