Opening — Why this matters now

Agentic AI has entered its Minecraft phase again. Not because blocks are trendy, but because open-world games remain one of the few places where planning, memory, execution, and failure collide in real time.

Yet most agent benchmarks still cheat.

They rely on synthetic prompts, privileged world access, or oracle-style evaluation that quietly assumes the agent already knows where everything is. The result: impressive demos, fragile agents, and metrics that flatter models more than they inform builders.

MineNPC-Task is a deliberate rejection of that pattern. It asks a brutally simple question: Can an AI NPC survive real player requests without hidden shortcuts—and can we measure that fairly?

Background — From scripted agents to mixed-initiative reality

Minecraft has long been a sandbox for embodied AI. Platforms like Malmo, MineDojo, and Voyager explored navigation, skill acquisition, and long-horizon play. But most evaluations share two structural weaknesses:

  1. Synthetic intent — tasks written for agents, not by humans.
  2. Privileged perception — access to maps, seeds, or global state that no human player has.

MineNPC-Task flips both.

Tasks are elicited from expert co-play, then normalized into templates. Execution is constrained to public Mineflayer APIs only, under a strict bounded-knowledge policy. If a human couldn’t do it from in-world perception, neither can the agent.
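
To make the bounded-knowledge constraint concrete, here is a minimal sketch using only public mineflayer calls; the host, port, and bot name are placeholders, and this is an illustration of the constraint rather than the paper's actual harness. The agent may inspect what a player could perceive in-world (nearby blocks, its own inventory), never seeds or global state.

```typescript
// Sketch: bounded, player-like perception via public mineflayer APIs only.
// Host, port, and username are placeholders; no world seed or global map access.
import { createBot } from "mineflayer";

const bot = createBot({
  host: "localhost",
  port: 25565,
  username: "MineNPC",
});

bot.once("spawn", () => {
  // What the agent may know: blocks within a human-scale radius...
  const nearbyOre = bot.findBlock({
    matching: (block) => block.name.includes("ore"),
    maxDistance: 32, // bounded perception, roughly what a player could see
  });

  // ...and its own inventory, nothing more.
  const items = bot.inventory.items().map((i) => `${i.count}x ${i.name}`);

  bot.chat(
    nearbyOre
      ? `I can see ore at ${nearbyOre.position}. Carrying: ${items.join(", ") || "nothing"}.`
      : `No ore in sight. Carrying: ${items.join(", ") || "nothing"}.`
  );
});
```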

This isn’t nostalgia for “hard mode.” It’s about comparability.

Analysis — What the paper actually builds

MineNPC-Task is not just a task list. It is a full evaluation harness designed around how real agents fail.

1. Task design: small, dependent, and human-authored

  • 44 high-level player-authored tasks
  • 216 atomic subtasks
  • Average: 4.9 subtasks per task

Each subtask explicitly encodes:

| Field | Purpose |
| --- | --- |
| Dependencies | Forces sequencing (no magical jumps) |
| Required parameters | Tools, locations, counts |
| Clarifying question | Exactly one, if something is missing |
| Success criterion | Machine-checkable, in-world only |

No latent credit. No vibes-based grading.
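
A rough sketch of what such a subtask record might look like is below; the field names are assumptions for illustration, not the paper's exact schema.

```typescript
// Illustrative shape of an atomic subtask; names are assumptions, not the paper's schema.
interface Subtask {
  id: string;
  description: string;                 // human-authored intent, normalized into a template
  dependencies: string[];              // ids that must succeed first; forces sequencing
  requiredParameters: Record<string, string | number | null>; // tools, locations, counts
  clarifyingQuestion?: string;         // asked at most once, only if a required slot is missing
  successCriterion: string;            // machine-checkable predicate over in-world state
}

// Example: one subtask inside a hypothetical "get me iron tools" request.
const mineIron: Subtask = {
  id: "mine-iron-ore",
  description: "Mine 3 iron ore with a stone pickaxe",
  dependencies: ["craft-stone-pickaxe"],
  requiredParameters: { tool: "stone_pickaxe", block: "iron_ore", count: 3 },
  clarifyingQuestion: "How many iron ore should I bring back?",
  successCriterion: "inventory contains >= 3 iron_ore",
};
```

The structure is the point: a judge can evaluate the success criterion against observable world state alone, without peeking at the model's internal reasoning.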

2. Plan–Clarify–Act–Judge loop

The framework enforces a disciplined interaction cycle:

  1. Plan preview (3–5 steps, visible to the player)
  2. Single-turn clarification if a slot is missing
  3. Code generation against Mineflayer APIs
  4. Lightweight code review (retry capped)
  5. Execution with progress toasts
  6. Judgment from bounded evidence only

Crucially, the agent must ask when it does not know—rather than hallucinate competence.
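
The cycle can be skeletoned roughly as follows; every helper here is a trivial stand-in so the sketch type-checks and runs, whereas the real framework wires these steps to an LLM planner, a code reviewer, and a Mineflayer executor.

```typescript
// Skeleton of a Plan -> Clarify -> Act -> Judge cycle; all helpers are hypothetical stubs.
interface Plan { steps: string[] }                          // 3-5 steps, shown to the player
interface Outcome { success: boolean; evidence: string[] }  // in-world observations only

const draftPlan = (task: string): Plan => ({ steps: [`do: ${task}`] });
const missingSlot = (task: string): string | null => (task.includes("?") ? "target" : null);
const askPlayer = async (q: string): Promise<string> => `player answer to "${q}"`;
const generateCode = (plan: Plan): string => `// code for: ${plan.steps.join("; ")}`;
const reviewCode = (code: string): boolean => code.length > 0;  // lightweight static check
const execute = async (code: string): Promise<Outcome> =>
  ({ success: true, evidence: [`ran ${code.length} chars`] });
const judge = (o: Outcome): boolean => o.success && o.evidence.length > 0;

const MAX_RETRIES = 2; // retry cap is illustrative

async function runTask(task: string): Promise<boolean> {
  let plan = draftPlan(task);                      // 1. plan preview

  const slot = missingSlot(task);                  // 2. at most one clarifying question
  if (slot !== null) {
    const answer = await askPlayer(`Which ${slot} do you mean?`);
    plan = draftPlan(`${task} [${slot}: ${answer}]`);
  }

  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const code = generateCode(plan);               // 3. code generation (Mineflayer target)
    if (!reviewCode(code)) continue;               // 4. lightweight review, retries capped
    const outcome = await execute(code);           // 5. execution with progress toasts
    return judge(outcome);                         // 6. judgment from bounded evidence only
  }
  return false;                                    // ran out of retries
}

runTask("bring me 3 iron ore?").then((ok) => console.log("task succeeded:", ok));
```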

3. Memory, but on a leash

Memory exists, but it is intentionally modest:

  • Landmarks
  • Artifacts
  • Preferences
  • Commitments and breakdowns

Each memory entry carries provenance (seen, told, inferred) and can go stale. There is no silent omniscience—only retrieval scoped to the current task.

This design choice matters. It keeps memory legible and debuggable, rather than turning it into an uninspectable vector soup.
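
A sketch of how such an entry might be represented: the provenance tags mirror the paper, while the rest of the shape and the retrieval helper are assumptions made for illustration.

```typescript
// Illustrative memory entry with provenance and staleness; fields beyond
// "seen | told | inferred" are assumptions about the design, not the paper's code.
type Provenance = "seen" | "told" | "inferred";

interface MemoryEntry {
  kind: "landmark" | "artifact" | "preference" | "commitment" | "breakdown";
  content: string;            // e.g. "village chest at (120, 64, -35)"
  provenance: Provenance;     // how the agent came to know this
  recordedAt: number;         // ms timestamp, used to age the entry out
  staleAfterMs: number;       // entries can go stale instead of living forever
}

// Retrieval is scoped: only fresh entries relevant to the current task are surfaced.
function recall(memory: MemoryEntry[], taskKeywords: string[], now: number): MemoryEntry[] {
  return memory.filter(
    (m) =>
      now - m.recordedAt < m.staleAfterMs &&
      taskKeywords.some((k) => m.content.toLowerCase().includes(k.toLowerCase()))
  );
}
```

Because each entry is a plain record with a provenance tag, a bad recall can be traced to a concrete entry rather than an opaque embedding.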

Findings — What breaks when GPT‑4o plays fair

The authors evaluate GPT‑4o as an initial snapshot, not a leaderboard entry. Across 216 subtasks:

  • 71 failures
  • ≈33% subtask failure rate

That number is not embarrassing. It is informative.

Failure modes (by frequency)

| Category | What goes wrong |
| --- | --- |
| Code execution | Invalid parameters, NaNs, brittle logic |
| Inventory/tool handling | Wrong tool, missing item |
| Context misunderstanding | Ambiguous language, scope drift |
| Implicit referencing | “That block”, “over here” |
| Navigation | Ill-defined or shifting targets |

Mining and construction tasks cluster around code and inventory errors. Retrieval and navigation expose referencing failures.

In other words: the agent fails where humans are usually precise, and succeeds where humans adapt.

Player perception

Despite failures:

  • Interaction quality rated positively by most participants
  • Memory recall seen as helpful—but insufficiently persistent
  • Transparency softened frustration when things broke

A key insight emerges: recoverability matters more than perfection.

Implications — Why this benchmark matters beyond games

MineNPC-Task is not really about Minecraft.

It is about agent evaluation under constraint.

For anyone building:

  • LLM-based autonomous agents
  • Tool-using copilots
  • Embodied or robotic systems
  • Long-horizon workflow automation

…the lessons generalize:

  1. Ask when uncertain beats guessing confidently.
  2. Bounded perception exposes real brittleness.
  3. Memory must be visible, scoped, and fallible.
  4. Evaluation must rely on observable evidence, not internal traces.

If your agent only looks good with privileged access, it isn’t good—it’s rehearsed.

Conclusion — Benchmarks that remember what agents forget

MineNPC-Task is intentionally unglamorous.

It does not promise superhuman NPCs. It does not optimize for leaderboard dominance. Instead, it offers something rarer: a fair mirror.

By grounding tasks in real player intent, enforcing public-API constraints, and judging success only from in-world evidence, the benchmark exposes the exact seams where today’s agentic systems still unravel.

That honesty is its contribution.

Cognaptus: Automate the Present, Incubate the Future.