Opening — Why this matters now

AI agents are having a good year. They write code, refactor repositories, fix production bugs, and occasionally embarrass junior developers. Naturally, biology is next. Spatial transcriptomics, arguably one of the messiest, most insight-rich data domains in modern life science, looks like a perfect proving ground. If agents can reason over spatial biology data, the promise is compelling: fewer bottlenecks, faster discovery, and less dependence on scarce bioinformatics talent.

SpatialBench arrives as a cold shower.

Across 146 real, verifiable spatial analysis tasks, frontier AI agents achieve accuracy rates that can only politely be described as aspirational. The headline result is not that models fail, but how and why they fail, and what that implies for anyone trying to deploy AI agents beyond clean benchmarks and toy datasets.

Background — From benchmarks to biological reality

Most existing AI evaluations in biology test knowledge recall: multiple-choice questions, paper abstracts, or clinical facts. Spatial biology does not cooperate with that framing. It is empirical, noisy, platform-specific, and deeply procedural. Analysts spend more time deciding how to look than what to conclude.

SpatialBench is built from this reality. Each task snapshots a real analysis workflow midstream—right before a meaningful biological decision: quality control thresholds, normalization choices, dimensionality reduction interpretation, clustering, cell typing, differential expression, or spatial reasoning. The agent must interact with actual data objects, not just narrate an answer.
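
To make that concrete, here is a minimal sketch of what such a midstream decision point might look like, assuming a scanpy/AnnData toolchain. The filename, the marker gene, and the presence of a precomputed leiden clustering are illustrative assumptions, not SpatialBench's actual task format.

```python
import scanpy as sc

# Hypothetical midstream task (illustrative, not the benchmark's format):
# the agent is handed a partially processed AnnData object and asked a
# concrete question, e.g. which cluster is most enriched for a marker gene.
adata = sc.read_h5ad("visium_snapshot.h5ad")   # illustrative filename

marker = "EPCAM"  # assumed to be present in this dataset
# The answer cannot be narrated from prior knowledge; it has to be
# computed from the data object itself.
expr = adata[:, marker].to_df()[marker]
answer = expr.groupby(adata.obs["leiden"]).mean().idxmax()
print(f"Cluster with highest mean {marker} expression: {answer}")
```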

Crucially, every task is paired with a deterministic grader. No vibes. No partial credit. Either the biological insight is recovered, or it isn’t.
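
In pseudocode terms, the grading contract might look like the sketch below; the function name, answer formats, and tolerance handling are assumptions for illustration, not SpatialBench's actual grading code.

```python
# Minimal sketch of a deterministic grader in the spirit described above.
# Function name, answer formats, and tolerance handling are assumptions.
def grade(agent_answer: str, ground_truth: str, tolerance: float | None = None) -> bool:
    """Return True only if the biological insight is recovered; no partial credit."""
    if tolerance is not None:
        # Numeric answers (e.g. a cell count after QC) may allow a small band.
        try:
            return abs(float(agent_answer) - float(ground_truth)) <= tolerance
        except ValueError:
            return False
    # Categorical answers (a cluster ID, a cell type, a gene name): exact match
    # after normalization, nothing in between.
    return agent_answer.strip().lower() == ground_truth.strip().lower()

assert grade("142", "140", tolerance=5)
assert not grade("macrophage", "monocyte")
```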

Analysis — What SpatialBench actually tests

SpatialBench spans:

  • 146 evaluation tasks
  • 5 spatial technologies (Xenium, Visium, MERFISH, Seeker, AtlasXomics)
  • 7 task categories (QC, normalization, dimensionality reduction, clustering, cell typing, differential expression, spatial analysis)

The design philosophy is strict but fair:

  • Tasks are scientifically durable—answers should not hinge on fragile hyperparameters.
  • Domain knowledge alone is insufficient; agents must compute, inspect, and reason.
  • Shortcuts are actively eliminated through adversarial testing.

This places SpatialBench in a different category from classic ML benchmarks. It is less about raw model intelligence, and more about whether an agent can survive contact with real scientific workflows.

Findings — The numbers (and what they actually mean)

Aggregate accuracy: underwhelming by design

Model             Accuracy (%)   Avg. Steps   Avg. Cost (USD)
Opus-4.5          ~38            ~2.8         ~0.14
GPT-5.2           ~34            ~2.1         ~0.04
Sonnet-4.5        ~28            ~2.4         ~0.08
Gemini-2.5-Pro    ~20            ~3.6         ~0.19
Grok-4.x          ~23–25         ~10          ~0.05–0.08

No model crosses 40% accuracy in the base configuration. This is not a failure of the benchmark—it is evidence that spatial biological reasoning remains an unsolved problem for general-purpose agents.

Task-level fracture lines

Performance varies wildly by task type:

  • Dimensionality reduction & spatial analysis: models reach ~50% accuracy in the best cases.
  • Quality control & cell typing: accuracy often collapses toward zero.

These are not random failures. QC and cell typing require contextual judgment: knowing that spatial assays tolerate lower gene counts than scRNA-seq, or that marker expression shifts with tissue and disease. Models default to generic heuristics—and are punished accordingly.
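
The contrast is easy to see in code. The sketch below uses illustrative threshold values (assumptions, not numbers from the paper) to show why a cutoff borrowed from scRNA-seq erases an imaging-based spatial dataset.

```python
# Illustrative, platform-aware QC thresholds; the numbers are assumptions,
# not values taken from the paper. Imaging-based panels (Xenium, MERFISH)
# measure only a few hundred genes, so a generic scRNA-seq cutoff of
# ~200 genes per cell would discard nearly every cell.
GENERIC_SCRNASEQ_MIN_GENES = 200

SPATIAL_MIN_GENES = {
    "Xenium": 10,     # targeted panel, a few hundred genes total
    "MERFISH": 20,    # targeted panel
    "Visium": 200,    # sequencing-based spots, closer to scRNA-seq depth
}

def min_genes_threshold(platform: str) -> int:
    # Falling back to the generic heuristic when the platform is unknown
    # is exactly the behavior the benchmark punishes.
    return SPATIAL_MIN_GENES.get(platform, GENERIC_SCRNASEQ_MIN_GENES)

print(min_genes_threshold("Xenium"))    # 10  -> keeps the tissue
print(min_genes_threshold("unknown"))   # 200 -> the generic default
```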

Platform matters (a lot)

The same model can swing 15–20 percentage points depending on the spatial technology. Seeker datasets, for example, consistently depress performance across all models, indicating higher intrinsic complexity. This alone invalidates any claim that “one agent workflow fits all spatial assays.”

Harness design — The uncomfortable conclusion

The most important result in the paper is not about models.

It is about harnesses.

When the same Opus-4.5 model is wrapped in a better agent harness—improved tool routing, prompts, control flow, and execution environment—accuracy jumps:

  • Base harness: ~38%
  • Claude Code harness: ~48%
  • Latch harness: ~62%

That is an absolute gain of roughly 24 points without changing the underlying model.

Certain task categories (clustering, differential expression, dimensionality reduction) benefit disproportionately from better harnesses—precisely those requiring multi-step exploration and intermediate inspection. What is often dismissed as “glue code” turns out to be the difference between thrashing and insight.
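
As a rough illustration of what that glue code does, consider the loop below. It assumes the agent exposes a chat-style interface and a sandboxed Python execution tool; the interfaces and step budget are assumptions, not the paper's harness implementation.

```python
# Minimal harness loop: route model output to a code-execution tool,
# feed intermediate results back, and enforce a step budget. The llm
# and sandbox interfaces are assumed for this sketch.
MAX_STEPS = 12

def run_task(llm, sandbox, task_prompt: str) -> str:
    history = [{"role": "user", "content": task_prompt}]
    for _ in range(MAX_STEPS):
        reply = llm.chat(history)                 # assumed interface
        if reply.final_answer is not None:
            return reply.final_answer
        # Control flow the model never writes itself: execute the proposed
        # code, capture output and errors, and return them to the model so
        # the next step builds on intermediate inspection instead of guesses.
        result = sandbox.execute(reply.code)      # assumed interface
        history.append({"role": "assistant", "content": reply.code})
        history.append({"role": "user", "content": f"Execution result:\n{result}"})
    return "no answer within step budget"
```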

Behavioral diagnosis — How models actually fail

Trajectory-level analysis reveals distinct failure modes:

  • Instruction compliance: Grok variants hemorrhage steps on formatting errors and retries.
  • Over-efficiency: GPT models are fast and cheap, but underexplore and underutilize intermediate results.
  • Domain calibration: Only Opus-4.5 consistently applies spatially appropriate QC thresholds.
  • Productive exploration: Inspecting data structures is not enough—models must use what they find.

In short: reasoning quality is not just about intelligence, but about behavior under uncertainty.
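
One way to make that operational is to tag each saved trajectory with a failure mode, as in the toy sketch below; the per-step record format is an assumption, not the paper's logging schema.

```python
# Toy trajectory diagnosis. The per-step flags (errored, inspected_data,
# used_result) are an assumed record format for illustration only.
def diagnose(trajectory: list[dict]) -> str:
    errors = sum(step["errored"] for step in trajectory)
    unused = sum(step["inspected_data"] and not step["used_result"]
                 for step in trajectory)
    if errors > len(trajectory) / 2:
        return "instruction compliance"    # steps burned on retries
    if len(trajectory) <= 2:
        return "over-efficiency"           # answered before exploring
    if unused > 0:
        return "unproductive exploration"  # inspected data, never used it
    return "productive"

example_run = [
    {"errored": False, "inspected_data": True,  "used_result": False},
    {"errored": False, "inspected_data": True,  "used_result": False},
    {"errored": False, "inspected_data": False, "used_result": False},
]
print(diagnose(example_run))   # -> unproductive exploration
```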

Implications — What this means for AI in science

Three uncomfortable truths emerge:

  1. Model scaling alone will not fix this. Spatial biology failures are structural, not just parametric.
  2. Agent design is first-class. Tools, prompts, and control flow deserve the same scrutiny as model weights.
  3. Benchmarks must reflect reality. SpatialBench succeeds precisely because it is messy, narrow, and unforgiving.

For businesses building AI-driven scientific products, the takeaway is sobering but actionable: competitive advantage will come less from chasing the next frontier model, and more from engineering disciplined, domain-aware agent stacks.

Conclusion — A benchmark with teeth

SpatialBench does not flatter today’s AI agents—and that is its virtue. It exposes the gap between synthetic competence and empirical reasoning, between fluent narration and scientific judgment.

If AI is to become a reliable collaborator in biology, it must first survive benchmarks like this one. Not by guessing better—but by working better.

Cognaptus: Automate the Present, Incubate the Future.