Opening — Why this matters now
Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses.
The uncomfortable truth is that most LLM benchmarks still reward descriptive intelligence rather than operational competence. TowerMind arrives as a corrective: a deliberately constrained, real-time strategy environment designed to test whether LLMs can actually act — not just narrate intent — under pressure.
Background — From grand RTS to practical evaluation
RTS games have long been the gold standard for evaluating long-term planning and decision-making. StarCraft II–based benchmarks dominate the space, but they are heavy, expensive, GPU-hungry, and operationally awkward for rapid experimentation.
TowerMind takes a different route. Instead of simulating the entire geopolitical chaos of a full RTS, it focuses on the tower defense subgenre — a stripped-down but still strategically demanding slice of real-time decision-making. The result is an environment that preserves macro–micro tension while shedding unnecessary computational weight.
In practical terms: TowerMind runs on CPUs, requires a fraction of the storage of SC2-based setups, and still manages to stress-test planning, execution, and adaptation.
What TowerMind actually is
TowerMind is a Unity-based tower defense environment wrapped in an OpenAI Gym–compatible interface. It offers three observation modalities:
- Pixel-based (512×512 RGB frames)
- Textual (structured JSON game state)
- Numerical (structured state vectors)
Actions combine continuous spatial coordinates with discrete commands — build, upgrade, deploy units, move heroes, trigger abilities. Only actions that are valid under current game constraints are executed; everything else is silently ignored.
That design choice is not cosmetic. It enables TowerMind’s most quietly brutal feature: hallucination measurement through invalid actions.
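To make the interface and the silent-ignore behavior concrete, here is a minimal interaction sketch. The environment ID, action dictionary layout, and info fields are illustrative assumptions rather than TowerMind's documented API, and the classic Gym reset/step signature is shown; newer Gym and Gymnasium releases return extra values.

```python
import gym
import numpy as np

# Illustrative sketch only: the environment ID, action layout, and info fields
# below are assumptions, not TowerMind's documented API.
env = gym.make("TowerMind-Lv1-v0")   # hypothetical environment ID
obs = env.reset()

done = False
while not done:
    # Hybrid action: continuous (x, y) placement plus a discrete command index.
    action = {
        "xy": np.array([0.42, 0.77], dtype=np.float32),  # normalized map coordinates
        "command": 0,                                     # e.g. 0 = "build basic tower"
    }
    obs, reward, done, info = env.step(action)
    # Invalid actions (no tower point at xy, unaffordable upgrade, dead hero)
    # are silently ignored; we assume `info` flags whether the action executed.
```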
Hallucination, quantified
In TowerMind, hallucination is not a philosophical debate — it is a metric.
If an agent attempts to:
- Build a tower where no tower point exists
- Upgrade a tower it cannot afford
- Command a dead hero
…the action simply fails. The valid action rate becomes a proxy for how often the model’s internal world model diverges from reality.
This reframing is subtle but powerful: correctness is separated from effectiveness. A model may understand the rules and still lose badly.
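As a rough illustration (not TowerMind's published evaluation code), the metric reduces to a simple ratio over an episode's action log; the logging format used here is an assumption.

```python
def valid_action_rate(action_log):
    """Fraction of attempted actions the environment actually executed.

    `action_log` is assumed to be a list of dicts with a boolean "executed"
    field; the exact logging format is an assumption, not part of TowerMind's
    published interface.
    """
    if not action_log:
        return 1.0  # nothing attempted, nothing hallucinated
    executed = sum(1 for step in action_log if step["executed"])
    return executed / len(action_log)

# Example: one of four attempts tried to build where no tower point exists.
log = [{"executed": True}, {"executed": False}, {"executed": True}, {"executed": True}]
print(valid_action_rate(log))  # 0.75
```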
Benchmark structure and difficulty
TowerMind ships with five benchmark levels of increasing difficulty. Difficulty is not hand-waved; it is explicitly modeled as a composite of:
- Road complexity
- Tower placement density
- Enemy diversity and volume
- Resource constraints and sell-back penalties
| Level | Roads | Tower Points | Enemy Types | Avg Enemies/Wave | Difficulty |
|---|---|---|---|---|---|
| Lv1 | 1 | 4 | 14 | 20.8 | 2.45 |
| Lv2 | 1 | 5 | 13 | 9.2 | 2.77 |
| Lv3 | 3 | 12 | 14 | 12.0 | 3.42 |
| Lv4 | 3 | 12 | 14 | 17.0 | 3.55 |
| Lv5 | 4 | 13 | 11 | 16.4 | 3.74 |
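To make the idea of a composite score concrete, here is a toy sketch that turns the table above into a weighted sum. The weights are invented purely for illustration, and the real score also folds in resource constraints and sell-back penalties that the table does not show; none of this is TowerMind's actual difficulty formula.

```python
from dataclasses import dataclass

@dataclass
class LevelSpec:
    roads: int
    tower_points: int
    enemy_types: int
    avg_enemies_per_wave: float

# Values taken from the benchmark table above.
LEVELS = {
    "Lv1": LevelSpec(1, 4, 14, 20.8),
    "Lv2": LevelSpec(1, 5, 13, 9.2),
    "Lv3": LevelSpec(3, 12, 14, 12.0),
    "Lv4": LevelSpec(3, 12, 14, 17.0),
    "Lv5": LevelSpec(4, 13, 11, 16.4),
}

def toy_difficulty(spec: LevelSpec) -> float:
    # Made-up weights, purely to show what a weighted composite looks like;
    # TowerMind's real score also accounts for resources and sell-back penalties.
    return (0.6 * spec.roads
            + 0.1 * spec.tower_points
            + 0.05 * spec.enemy_types
            + 0.02 * spec.avg_enemies_per_wave)

for name, spec in LEVELS.items():
    print(name, round(toy_difficulty(spec), 2))
```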
The design intentionally includes misleading tower points — placements that look plausible but are tactically useless. Human experts avoid them. LLMs do not.
Findings — Fluent, valid, and still wrong
1. LLMs lag far behind humans
Even the strongest commercial models achieve less than half of human expert performance on harder levels. On Level 5, the performance gap relative to human experts exceeds 80%.
2. Vision helps — except when it doesn’t
Most models improve when visual input is added. Notably, some large open-source models degrade under vision-language input, suggesting brittle multimodal integration rather than genuine situational awareness.
3. Hallucination scales with difficulty
As levels become more complex, invalid actions rise sharply — especially for smaller and open-source models. In several cases, performance drops below random baselines, an unflattering but revealing outcome.
4. Correctness ≠ effectiveness
Commercial models maintain relatively high valid action rates while still performing poorly on score. They understand the rules — they simply do not understand strategy.
Qualitative failure modes
TowerMind exposes three recurring behavioral pathologies:
- Unvalidated planning — models build towers in locations that never engage enemies, despite having all required spatial information.
- No multifinality — unlike humans, LLMs fail to combine objectives (e.g., collecting resources while fighting).
- Action underutilization — upgrades ignored, abilities wasted, resources mismanaged.
These are not bugs. They are structural limitations of current agent reasoning.
RL baselines: not a free lunch either
Classic RL algorithms (Ape-X DQN and PPO) were also evaluated. After 100 million steps, both could partially solve easy levels — and still failed dramatically on harder ones.
TowerMind is not an easy benchmark wearing a friendly UI. It is deliberately unforgiving.
Implications — What TowerMind actually measures
TowerMind does not test whether a model can describe a plan. It tests whether a model can:
- Maintain a consistent world model
- Validate plans against outcomes
- Avoid being misled by plausible but useless options
- Translate intent into effective, timely action
In short: it measures whether an agent understands that actions have consequences.
Conclusion
TowerMind is not flashy. It is not massive. It does not pretend that LLMs are one prompt away from strategic mastery.
Instead, it does something far more valuable: it reveals — cleanly, cheaply, and repeatably — where today’s agents fall apart once language stops being enough.
That makes it an unusually honest benchmark.
Cognaptus: Automate the Present, Incubate the Future.