Opening — Why this matters now

LLM agents are getting disturbingly good at finishing tasks. They click the right buttons, traverse web pages, solve text-based games, and close tickets. Benchmarks applaud. Dashboards glow green.

Yet something feels off. Change the environment slightly, rotate the layout, tweak the constraints — and suddenly the same agent behaves like it woke up in a stranger’s apartment. The problem isn’t execution. It’s comprehension.

This paper asks the uncomfortable question the agent community has mostly avoided: do LLM agents actually understand the worlds they act in, or are they merely optimizing trajectories?

Background — From “doing” to “knowing”

Most agent benchmarks today are trajectory-centric. They measure whether the agent reaches a predefined goal, how efficiently it does so, or how closely it follows an optimal action sequence. These metrics are excellent at grading behavior — and nearly useless at diagnosing knowledge.

The authors identify a critical blind spot: an agent can complete a task while holding a fragmented or outright incorrect internal model of the environment. Conversely, an agent may acquire correct world knowledge yet fail due to planning errors or horizon limits. Current benchmarks collapse these two failure modes into one number and call it evaluation.

In short: task success conflates competence with coincidence.

Analysis — The Task-to-Quiz (T2Q) paradigm

The paper introduces Task-to-Quiz (T2Q), a deterministic, environment-grounded evaluation framework designed to decouple execution from understanding.

T2Q splits evaluation into two explicit stages:

Stage 1: Coverage-oriented tasks

Instead of a single goal, agents are given a set of tasks designed to maximize exposure to the environment:

  • Traversing rooms
  • Interacting with objects
  • Triggering latent states (locked, unlocked, open, closed)

Task generation is formalized as a weighted set-cover problem, ensuring broad coverage with minimal redundancy. The emphasis is on exploration, not optimization.
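The paper does not spell out its exact set-cover formulation here, but the standard greedy approximation conveys the idea: repeatedly pick the task that covers the most still-unseen environment elements per unit cost. The sketch below is a minimal illustration with hypothetical task names, elements, and weights, not the benchmark's actual generator.

```python
# Minimal sketch of greedy weighted set cover for coverage-oriented task
# selection. Task names, element labels, and costs are hypothetical.

def greedy_weighted_set_cover(tasks: dict[str, set[str]],
                              cost: dict[str, float],
                              universe: set[str]) -> list[str]:
    """Select tasks until every environment element (room, object, latent
    state) in `universe` is covered, preferring cheap tasks that cover
    many still-uncovered elements."""
    uncovered = set(universe)
    selected: list[str] = []
    while uncovered:
        # Best cost per newly covered element among tasks that still add coverage.
        best = min(
            (t for t in tasks if tasks[t] & uncovered),
            key=lambda t: cost[t] / len(tasks[t] & uncovered),
            default=None,
        )
        if best is None:  # remaining elements are unreachable by any task
            break
        selected.append(best)
        uncovered -= tasks[best]
    return selected

# Example: three candidate tasks covering rooms, objects, and latent states.
tasks = {
    "traverse_hallway": {"room:hallway", "room:kitchen"},
    "open_chest":       {"object:chest", "state:chest_open"},
    "unlock_door":      {"object:key", "state:door_unlocked", "room:cellar"},
}
cost = {"traverse_hallway": 2.0, "open_chest": 1.0, "unlock_door": 3.0}
universe = set().union(*tasks.values())
print(greedy_weighted_set_cover(tasks, cost, universe))
```

The greedy rule is the usual choice for this kind of coverage problem because it carries the classic logarithmic approximation guarantee for weighted set cover.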

Stage 2: Environment quizzes

After interaction, agents are quizzed on the world they just inhabited. Questions are generated automatically from ground-truth environment metadata and cover five dimensions:

  • Location: where objects are
  • Connectivity: which rooms connect
  • Direction: spatial orientation
  • Matching: key–lock relationships
  • Properties: latent object states
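To make the five categories concrete, here is a rough sketch of how such questions could be generated from ground-truth metadata. The schema, field names, and world-model layout are illustrative assumptions, not the paper's actual data model; only two of the five categories are generated to keep the sketch short.

```python
# Illustrative sketch: turn ground-truth environment metadata into grounded QA
# pairs. Schema and prompts are assumptions, not the benchmark's real format.
from dataclasses import dataclass, field

@dataclass
class QuizQuestion:
    category: str                  # location | connectivity | direction | matching | properties
    prompt: str
    answer: str
    prerequisites: set[str] = field(default_factory=set)  # trajectory events needed to answer

def generate_questions(world: dict) -> list[QuizQuestion]:
    questions: list[QuizQuestion] = []
    # Location questions: answerable only if the agent visited the room.
    for obj, room in world["object_locations"].items():
        questions.append(QuizQuestion(
            category="location",
            prompt=f"Which room contains the {obj}?",
            answer=room,
            prerequisites={f"visited:{room}"},
        ))
    # Connectivity questions: answerable once the agent has stood in one endpoint.
    for a, b in world["connections"]:
        questions.append(QuizQuestion(
            category="connectivity",
            prompt=f"Does {a} connect directly to {b}?",
            answer="yes",
            prerequisites={f"visited:{a}"},
        ))
    return questions

world = {
    "object_locations": {"brass key": "kitchen", "chest": "cellar"},
    "connections": [("kitchen", "hallway"), ("hallway", "cellar")],
}
for q in generate_questions(world):
    print(q.category, "|", q.prompt, "->", q.answer)
```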

Crucially, each question is paired with trajectory-based prerequisites. If the agent never visited a room or opened a container, the question is marked non-answerable, preventing unfair penalties and hallucinated grading.
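The gating itself can be as simple as a subset check between a question's prerequisites and the events logged during interaction. The sketch below uses plain dicts so it stands alone; the event vocabulary and answer matching are made up for illustration.

```python
# Sketch of trajectory-based gating: questions whose prerequisites were never
# satisfied are marked non-answerable rather than wrong. Event strings such as
# "visited:cellar" are illustrative, not the benchmark's actual log format.

def gate_and_grade(questions: list[dict], trajectory: set[str],
                   answers: dict[str, str]) -> dict:
    answerable, correct, non_answerable = 0, 0, 0
    for q in questions:
        if not set(q["prerequisites"]) <= trajectory:
            non_answerable += 1   # evidence never observed; no penalty
            continue
        answerable += 1
        if answers.get(q["prompt"], "").strip().lower() == q["answer"].lower():
            correct += 1
    return {
        "answerable": answerable,
        "non_answerable": non_answerable,
        "accuracy": correct / answerable if answerable else 0.0,
    }

questions = [
    {"prompt": "Which room contains the brass key?", "answer": "kitchen",
     "prerequisites": ["visited:kitchen"]},
    {"prompt": "Is the chest locked?", "answer": "yes",
     "prerequisites": ["visited:cellar", "inspected:chest"]},
]
trajectory = {"visited:kitchen"}
print(gate_and_grade(questions, trajectory,
                     {"Which room contains the brass key?": "Kitchen"}))
```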

The result is a fully automated, reproducible pipeline — no human judges, no LLM-as-referee theatrics.

Findings — When success and understanding diverge

The benchmark, T2QBench, spans 30 environments and nearly 2,000 grounded QA pairs across three difficulty levels. The results are quietly devastating.

1. Task success is a poor proxy for understanding

As difficulty increases, Task Success Rate (TSR) drops sharply — while Environment Understanding Score (EUS) remains relatively stable. Agents still know roughly the same amount about the world even as they fail more tasks.

Doing degrades faster than knowing.
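The divergence is easier to see when the two metrics are written out. The definitions below are a plausible reading based on the metric names: TSR as the share of coverage tasks completed, EUS as quiz accuracy over answerable questions. The paper's exact formulas may differ.

```python
# Hedged sketch of the two headline metrics; T2QBench's actual definitions
# may differ from these assumed ratios.

def task_success_rate(tasks_completed: int, tasks_attempted: int) -> float:
    """TSR: fraction of coverage tasks the agent finished (doing)."""
    return tasks_completed / tasks_attempted if tasks_attempted else 0.0

def environment_understanding_score(quiz_correct: int, quiz_answerable: int) -> float:
    """EUS: quiz accuracy over questions whose prerequisites were met (knowing)."""
    return quiz_correct / quiz_answerable if quiz_answerable else 0.0
```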

2. Memory systems don’t help (yet)

Across models, a naive in-context baseline often matches or outperforms sophisticated memory systems (Mem0, LangMem, A-MEM). The likely culprit: memory abstraction pipelines discard fine-grained spatial and relational evidence without replacing it with structured world models.

In other words, memory systems organize less than they erase.
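The paper's baseline details are not reproduced here, but "naive in-context" typically means something like the sketch below: append raw observations to the prompt verbatim instead of summarizing them into a separate store, so no spatial or relational detail gets abstracted away. The function and truncation policy are assumptions for illustration.

```python
# Sketch of a naive in-context baseline (an assumption about what "naive"
# means here): keep the raw observation log in the prompt, truncating only
# when the context budget overflows.

def build_prompt(task: str, observations: list[str], max_chars: int = 30_000) -> str:
    history = "\n".join(observations)
    if len(history) > max_chars:
        # Drop the oldest text first; recent observations matter most.
        history = history[-max_chars:]
    return f"Task: {task}\n\nObservation log:\n{history}\n\nNext action:"

prompt = build_prompt(
    "Find the brass key",
    ["You are in the kitchen. Exits: north.", "You see a chest. It is locked."],
)
print(prompt)
```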

3. Exploration is the real bottleneck

Agents perform best on questions answerable via short-term recall (locations), worse on relational reasoning (directions, connectivity), and worst on latent properties that require deliberate interaction.

The dominant failure mode isn’t retrieval — it’s reluctance to explore.

Implications — What this means for agent design

Three uncomfortable implications follow:

  1. Benchmark inflation is real. High task scores can mask shallow world models.
  2. Memory ≠ understanding. Without representations tailored to environment structure, memory is cosmetic.
  3. Generalization demands curiosity. Agents optimized for efficiency actively avoid the very interactions that would make them robust.

For businesses deploying agents in dynamic environments — ops workflows, simulations, autonomous tools — this matters. An agent that “works” today but fails silently tomorrow is not automation. It’s technical debt with confidence.

Conclusion — Measuring what agents actually know

Task-to-Quiz reframes agent evaluation around a simple but overdue idea: execution and understanding are not the same skill. By grounding evaluation in environment facts rather than success trajectories, T2Q exposes why today’s agents generalize poorly — and where future progress must focus.

If we want agents that adapt, we must stop rewarding them for merely getting lucky.

Cognaptus: Automate the Present, Incubate the Future.