Opening — Why this matters now
Large Language Models have become fluent planners. Ask them to outline a strategy, decompose a task, or explain why something should work, and they rarely hesitate. Yet when placed inside an environment where actions cost resources, mistakes compound, and time does not politely pause, that fluency often collapses.
The uncomfortable truth is that most LLM benchmarks still reward descriptive intelligence rather than operational competence. TowerMind arrives as a corrective: a deliberately constrained, real-time strategy environment designed to test whether LLMs can actually act — not just narrate intent — under pressure.
Background — From grand RTS to practical evaluation
RTS games have long been the gold standard for evaluating long-term planning and decision-making. StarCraft II–based benchmarks dominate the space, but they are heavy, expensive, GPU-hungry, and operationally awkward for rapid experimentation.
TowerMind takes a different route. Instead of simulating the entire geopolitical chaos of a full RTS, it focuses on the tower defense subgenre — a stripped-down but still strategically demanding slice of real-time decision-making. The result is an environment that preserves macro–micro tension while shedding unnecessary computational weight.
In practical terms: TowerMind runs on CPUs, requires a fraction of the storage of SC2-based setups, and still manages to stress-test planning, execution, and adaptation.
What TowerMind actually is
TowerMind is a Unity-based tower defense environment wrapped in an OpenAI Gym–compatible interface. It offers three observation modalities:
- Pixel-based (512×512 RGB frames)
- Textual (structured JSON game state)
- Numerical (structured state vectors)
Actions combine continuous spatial coordinates with discrete commands — build, upgrade, deploy units, move heroes, trigger abilities. Only actions that are valid under current game constraints are executed; everything else is silently ignored.
That design choice is not cosmetic. It enables TowerMind’s most quietly brutal feature: hallucination measurement through invalid actions.
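To make the interface and the silent-ignore behavior concrete, here is a minimal interaction sketch. The environment ID, action dictionary layout, and info fields are illustrative assumptions rather than TowerMind's documented API, and the classic Gym reset/step signature is shown; newer Gym and Gymnasium releases return extra values.

```python
import gym
import numpy as np

# Illustrative sketch only: the environment ID, action layout, and info fields
# below are assumptions, not TowerMind's documented API.
env = gym.make("TowerMind-Lv1-v0")   # hypothetical environment ID
obs = env.reset()

done = False
while not done:
    # Hybrid action: continuous (x, y) placement plus a discrete command index.
    action = {
        "xy": np.array([0.42, 0.77], dtype=np.float32),  # normalized map coordinates
        "command": 0,                                     # e.g. 0 = "build basic tower"
    }
    obs, reward, done, info = env.step(action)
    # Invalid actions (no tower point at xy, unaffordable upgrade, dead hero)
    # are silently ignored; we assume `info` flags whether the action executed.
```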
Hallucination, quantified
In TowerMind, hallucination is not a philosophical debate — it is a metric.
If an agent attempts to:
- Build a tower where no tower point exists
- Upgrade a tower it cannot afford
- Command a dead hero
…the action simply fails. The valid action rate becomes a proxy for how often the model’s internal world model diverges from reality.
This reframing is subtle but powerful: correctness is separated from effectiveness. A model may understand the rules and still lose badly.
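As a rough illustration (not TowerMind's published evaluation code), the metric reduces to a simple ratio over an episode's action log; the logging format used here is an assumption.

```python
def valid_action_rate(action_log):
    """Fraction of attempted actions the environment actually executed.

    `action_log` is assumed to be a list of dicts with a boolean "executed"
    field; the exact logging format is an assumption, not part of TowerMind's
    published interface.
    """
    if not action_log:
        return 1.0  # nothing attempted, nothing hallucinated
    executed = sum(1 for step in action_log if step["executed"])
    return executed / len(action_log)

# Example: one of four attempts tried to build where no tower point exists.
log = [{"executed": True}, {"executed": False}, {"executed": True}, {"executed": True}]
print(valid_action_rate(log))  # 0.75
```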
Benchmark structure and difficulty
TowerMind ships with five benchmark levels of increasing difficulty. Difficulty is not hand-waved; it is explicitly modeled as a composite of:
- Road complexity
- Tower placement density
- Enemy diversity and volume
- Resource constraints and sell-back penalties
| Level | Roads | Tower Points | Enemy Types | Avg Enemies/Wave | Difficulty |
|---|---|---|---|---|---|
| Lv1 | 1 | 4 | 14 | 20.8 | 2.45 |
| Lv2 | 1 | 5 | 13 | 9.2 | 2.77 |
| Lv3 | 3 | 12 | 14 | 12.0 | 3.42 |
| Lv4 | 3 | 12 | 14 | 17.0 | 3.55 |
| Lv5 | 4 | 13 | 11 | 16.4 | 3.74 |
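To make the idea of a composite score concrete, here is a toy sketch that turns the table above into a weighted sum. The weights are invented purely for illustration, and the real score also folds in resource constraints and sell-back penalties that the table does not show; none of this is TowerMind's actual difficulty formula.

```python
from dataclasses import dataclass

@dataclass
class LevelSpec:
    roads: int
    tower_points: int
    enemy_types: int
    avg_enemies_per_wave: float

# Values taken from the benchmark table above.
LEVELS = {
    "Lv1": LevelSpec(1, 4, 14, 20.8),
    "Lv2": LevelSpec(1, 5, 13, 9.2),
    "Lv3": LevelSpec(3, 12, 14, 12.0),
    "Lv4": LevelSpec(3, 12, 14, 17.0),
    "Lv5": LevelSpec(4, 13, 11, 16.4),
}

def toy_difficulty(spec: LevelSpec) -> float:
    # Made-up weights, purely to show what a weighted composite looks like;
    # TowerMind's real score also accounts for resources and sell-back penalties.
    return (0.6 * spec.roads
            + 0.1 * spec.tower_points
            + 0.05 * spec.enemy_types
            + 0.02 * spec.avg_enemies_per_wave)

for name, spec in LEVELS.items():
    print(name, round(toy_difficulty(spec), 2))
```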
The design intentionally includes misleading tower points — placements that look plausible but are tactically useless. Human experts avoid them. LLMs do not.
Findings — Fluent, valid, and still wrong
1. LLMs lag far behind humans
Even the strongest commercial models achieve less than half of human expert performance on harder levels. On Level 5, the performance gap relative to human experts exceeds 80%.
2. Vision helps — except when it doesn’t
Most models improve when visual input is added. Notably, some large open-source models degrade under vision-language input, suggesting brittle multimodal integration rather than genuine situational awareness.
3. Hallucination scales with difficulty
As levels become more complex, invalid actions rise sharply — especially for smaller and open-source models. In several cases, performance drops below random baselines, an unflattering but revealing outcome.
4. Correctness ≠ effectiveness
Commercial models maintain relatively high valid action rates while still performing poorly on score. They understand the rules — they simply do not understand strategy.
Qualitative failure modes
TowerMind exposes three recurring behavioral pathologies:
- Unvalidated planning — models build towers in locations that never engage enemies, despite having all required spatial information.
- No multifinality — unlike humans, LLMs fail to combine objectives (e.g., collecting resources while fighting).
- Action underutilization — upgrades ignored, abilities wasted, resources mismanaged.
These are not bugs. They are structural limitations of current agent reasoning.
RL baselines: not a free lunch either
Classic RL algorithms (Ape-X DQN and PPO) were also evaluated. After 100 million steps, both could partially solve easy levels — and still failed dramatically on harder ones.
TowerMind is not an easy benchmark wearing a friendly UI. It is deliberately unforgiving.
Implications — What TowerMind actually measures
TowerMind does not test whether a model can describe a plan. It tests whether a model can:
- Maintain a consistent world model
- Validate plans against outcomes
- Avoid being misled by plausible but useless options
- Translate intent into effective, timely action
In short: it measures whether an agent understands that actions have consequences.
Conclusion
TowerMind is not flashy. It is not massive. It does not pretend that LLMs are one prompt away from strategic mastery.
Instead, it does something far more valuable: it reveals — cleanly, cheaply, and repeatably — where today’s agents fall apart once language stops being enough.
That makes it an unusually honest benchmark.
Cognaptus: Automate the Present, Incubate the Future.