SpatialBench: When AI Meets Messy Biology

A dataset arrives.

Not a clean demo dataset. Not a tidy CSV with three columns and a tutorial notebook waiting nearby like a hotel concierge. A real spatial biology dataset arrives: high-dimensional, platform-specific, noisy, partially processed, full of tacit assumptions, and attached to a scientific question that cannot be answered by knowing biology in the abstract.

This is where many AI agent demos quietly stop smiling.

The paper behind SpatialBench asks a simple but inconvenient question: can frontier AI agents actually analyze real-world spatial biology data?¹ The answer is not “no.” That would be too easy, and therefore suspicious. The answer is more useful: today’s agents can sometimes recover meaningful biological results, but their failures are systematic, measurable, and strongly shaped by the execution harness around the model.

That last phrase matters. The paper is not merely another leaderboard where one model gets a small crown and the others get polite applause from their investors. SpatialBench is more interesting because it turns agent failure into a diagnostic object. It asks where the breakdown happens: in empirical data interaction, in domain calibration, in multi-step exploration, in instruction following, or in the surrounding tool-and-control layer that people still like to call “glue code,” as if glue has never held anything important together.

The short version is this: a better base model helps, but dropping a frontier LLM into messy spatial biology and expecting a reliable analyst is not a strategy. It is procurement with a lab coat.

SpatialBench tests the part between “knows biology” and “can do biology”

Spatial biology sits in an awkward zone for AI evaluation. It is not pure knowledge recall. A model may know what astrocytes are, what differential expression means, or why quality control matters in single-cell workflows. That knowledge is useful. It is not enough.

Spatial transcriptomics assays preserve molecular information in tissue context. The analyst is not merely asking, “What gene marks this cell type?” The analyst is often asking something closer to: given this platform, this tissue, this disease state, this preprocessing stage, this artifact profile, and this intermediate analysis object, what should be done next?

SpatialBench is designed around that reality. The benchmark contains 146 verifiable problems derived from real spatial analysis workflows. These problems span five spatial technologies and seven task categories: quality control, normalization, dimensionality reduction, clustering, cell typing, differential expression, and spatial analysis. Each problem snapshots a workflow immediately before a target analysis step, gives the agent access to the relevant data objects, and asks for a structured answer that can be graded deterministically.

That design choice is the core contribution. The benchmark does not ask whether the model can recite biology. It asks whether the agent can interact with messy data and recover a specific biological result.

A typical problem may require computing summary statistics, inspecting adata.obs, adata.var, or adata.uns, choosing a quality-control threshold, interpreting principal components, identifying enriched marker genes, or comparing cell neighborhoods. Some tasks are multiple choice. Others use numeric tolerance, marker-gene precision/recall, label-set Jaccard similarity, or distribution comparison. The benchmark’s grading system is deliberately mechanical because otherwise evaluation would collapse into expert vibes, and we already have enough of those.
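To make those grading styles concrete, here is a minimal sketch of three deterministic graders. The function names and the 5% relative tolerance are illustrative assumptions, not the benchmark's actual implementation; the gene and label names are just examples.

```python
def grade_numeric(predicted: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Pass if the prediction is within a relative tolerance of the expected value."""
    return abs(predicted - expected) <= rel_tol * abs(expected)


def jaccard(predicted: set, expected: set) -> float:
    """Jaccard similarity between two label sets (1.0 when both are empty)."""
    union = predicted | expected
    if not union:
        return 1.0
    return len(predicted & expected) / len(union)


def precision_recall(predicted: list, expected: set) -> tuple:
    """Precision and recall of a predicted marker-gene list against a reference set."""
    unique_predicted = set(predicted)
    hits = len(unique_predicted & expected)
    precision = hits / len(unique_predicted) if unique_predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Example calls with made-up answers.
numeric_ok = grade_numeric(0.52, 0.50)
label_overlap = jaccard({"Astrocyte", "Neuron"}, {"Astrocyte", "Microglia"})
marker_scores = precision_recall(
    ["GFAP", "AQP4", "SNAP25"], {"GFAP", "AQP4", "SLC1A3", "ALDH1L1"}
)
```

None of these functions reads an explanation. They only compare a structured answer against a reference, which is exactly why a fluent but wrong response earns nothing.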

The authors also apply manual quality control and adversarial shortcut testing. That matters because biological benchmarks are vulnerable to accidental leakage through prior knowledge. If a task can be solved by guessing from generic biology without touching the provided data, it is not really testing analysis. SpatialBench tries to avoid that trap by requiring empirical interaction with the actual artifacts in the workspace.

So the first mechanism is already visible:

Layer	What SpatialBench forces the agent to do	Why ordinary biology QA misses it
Data interaction	Load, inspect, compute, and summarize real workflow objects	QA can be answered from memorized concepts
Domain calibration	Apply thresholds and interpretations appropriate to assay context	Generic heuristics may be biologically wrong
Structured output	Return a gradeable answer in a required schema	Free-form explanation can hide failure
Scientific durability	Recover results stable under reasonable method choices	Toy tasks often reward one brittle implementation
Workflow realism	Start from a midstream snapshot, not a blank notebook	Real analysis is path-dependent

This is why SpatialBench is not just “harder.” It is harder in a more operationally relevant way.

The headline scores are low, but the failure pattern is the real result

In the base configuration, no evaluated model reaches 40% aggregate accuracy. Opus-4.5 leads with 38.36%. GPT-5.2 follows at 34.02%. Sonnet-4.5 reaches 28.31%. GPT-5.1 reaches 27.40%. Grok-4 and Grok-4.1 land at 22.83% and 24.66%, while Gemini-2.5-Pro reaches 20.09%.

These numbers are not flattering. They also should not be read as a simple “AI cannot do biology” verdict. The more important point is that the benchmark stratifies capability. It shows that frontier agents are not failing uniformly; they are failing differently.

The efficiency metrics sharpen the picture. GPT-5.1 and GPT-5.2 are relatively cheap and fast, with reported costs around $0.02–$0.04 per evaluation and latency between 56 and 89 seconds. Opus-4.5 is more accurate but more expensive, at about $0.143 per evaluation and roughly 124 seconds of latency. Grok variants use nearly ten steps on average and frequently hit the maximum step budget, while GPT and Claude models usually complete tasks in about two to three steps.
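One way to combine accuracy and cost is spend per correct answer. The sketch below is back-of-the-envelope arithmetic on the figures quoted above, not a metric the paper reports. It assumes the per-evaluation cost does not depend on whether the run passes, and because only a range is given for the GPT models, both endpoints are used as bounds for GPT-5.2.

```python
def cost_per_correct(cost_per_eval: float, accuracy: float) -> float:
    """Expected spend per correct answer, assuming cost does not depend on outcome."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_eval / accuracy


# Figures quoted in the text; the GPT-5.2 cost is only known as a range.
opus = cost_per_correct(0.143, 0.3836)
gpt52_low = cost_per_correct(0.02, 0.3402)
gpt52_high = cost_per_correct(0.04, 0.3402)

print(round(opus, 3), round(gpt52_low, 3), round(gpt52_high, 3))
# prints: 0.373 0.059 0.118
```

Under those assumptions the more accurate model is also several times more expensive per correct answer, which is a throughput decision rather than a leaderboard one.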

This is where the paper becomes useful for builders. Accuracy alone says who won this benchmark run. Steps, latency, cost, and trajectory behavior say what kind of system you are dealing with.

A fast but under-calibrated agent is not the same product risk as a slow agent that thrashes through retries. A model that follows schema perfectly but misses biological context is not the same failure mode as a model that finds relevant intermediate data but cannot use it. In business terms, these are different engineering problems, different monitoring problems, and different liability problems.

The paper’s aggregate table is the main evidence: it establishes that today’s agents remain unreliable on the full benchmark. But the aggregate table is not the full argument. The diagnostic sections explain why the unreliability happens.

Task type changes the meaning of “good model”

The task-stratified results reveal large model-task interactions. Some categories are difficult for nearly everyone. Quality control is especially weak: model accuracies range roughly from 10% to 22% in the base setting, with several confidence intervals reaching close to zero. Cell typing is also poor, generally landing in the 20–36% range. By contrast, the best models reach higher performance in normalization, dimensionality reduction, and spatial analysis.

The difference is not random.

Quality control and cell typing require contextual scientific judgment. A spatial assay may have lower per-cell gene counts than dissociated single-cell RNA-seq. A generic single-cell heuristic can therefore be confidently wrong. Cell typing also depends on tissue, disease, platform, marker reliability, and sometimes subtle state distinctions. The agent has to calibrate its expectations to the assay, not merely import whatever rule it saw most often during training.

The paper’s trajectory analysis gives a useful example: for some targeted and imaging-based spatial assays, reasonable min_genes thresholds can sit between 5 and 20. Opus-4.5 applies a median threshold of 10, while other models tend to default to scRNA-seq-like thresholds around 100–200. That difference aligns with higher quality-control performance for Opus-4.5. This is not mystical “reasoning.” It is domain calibration.
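A small sketch shows why the default matters. The threshold table below is hypothetical: 10 echoes the median threshold quoted above for targeted and imaging-based assays, 200 mirrors a scRNA-seq-like default, and neither is a recommendation. The count matrix is synthetic, standing in for a 300-gene targeted panel.

```python
import numpy as np

# Hypothetical defaults for illustration only; real thresholds should come
# from inspecting the data and the assay documentation.
MIN_GENES_BY_ASSAY = {
    "imaging_targeted": 10,   # inside the 5-20 range discussed above
    "scrna_seq_like": 200,    # generic dissociated single-cell default
}


def genes_per_cell(counts: np.ndarray) -> np.ndarray:
    """Number of genes with at least one count, per cell (rows are cells)."""
    return (counts > 0).sum(axis=1)


def fraction_kept(counts: np.ndarray, min_genes: int) -> float:
    """Fraction of cells that survive a min_genes filter."""
    return float((genes_per_cell(counts) >= min_genes).mean())


# Synthetic targeted panel: 1,000 cells x 300 genes with sparse counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.15, size=(1000, 300))

for assay, threshold in MIN_GENES_BY_ASSAY.items():
    print(assay, threshold, fraction_kept(counts, threshold))
```

On this toy matrix the panel-appropriate threshold keeps essentially every cell, while the scRNA-seq-like default removes all of them. The code runs cleanly in both cases, which is the problem: nothing crashes when the assumption is wrong.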

That point deserves emphasis. Many AI workflow products fail not because they cannot execute code, but because they execute plausible code under the wrong assumptions. In scientific settings, plausible wrongness is particularly dangerous because it produces outputs that look professional enough to survive a casual review.

The task breakdown is therefore the main diagnostic evidence. It shows that the bottleneck is not one homogeneous capability called “biology intelligence.” It is a mixture of assay-specific calibration, procedural analysis, empirical inspection, and structured decision-making.

A builder should not ask, “Does the model know spatial biology?” That question is too soft.

The better question is: “For which workflow steps does this agent possess calibrated operational judgment, and where does it merely imitate familiar analysis patterns?”

The answer will not be uniform across tasks.

Platform dependence quietly kills one-size-fits-all agent design

SpatialBench also stratifies performance by spatial technology: AtlasXomics, MERFISH, Seeker, Visium, and Xenium. The same model can swing by roughly 15–20 percentage points across platforms. Seeker is consistently difficult, with model accuracies around 19–31% despite having the largest number of evaluations.

This is not just a benchmark curiosity. It is a warning label for product architecture.

Spatial biology is not one workflow. It is a family of platforms with different measurement technologies, data structures, artifacts, conventions, and analysis habits. An agent that performs adequately on one platform may quietly degrade on another. A general “spatial omics assistant” that does not know which platform it is touching is therefore not a clever generalist. It is an accident waiting for a dashboard.

The platform-stratified results function like a sensitivity test. They show that performance is conditional on assay context. They do not prove that every platform gap is caused by platform artifacts alone; dataset composition and task distribution also matter. But they do show that platform identity cannot be treated as metadata decoration.
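Stratifying an evaluation log takes only a few lines. The sketch below uses invented records purely to show the shape of the computation; the field names are assumptions and the pass/fail values are not taken from the paper.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Accuracy per stratum; each record is a dict with `key` and a boolean 'passed'."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        stratum = record[key]
        totals[stratum] += 1
        passes[stratum] += int(record["passed"])
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}


# Hypothetical evaluation log, invented for illustration.
records = [
    {"platform": "Xenium", "task": "qc", "passed": True},
    {"platform": "Xenium", "task": "clustering", "passed": True},
    {"platform": "Seeker", "task": "qc", "passed": False},
    {"platform": "Seeker", "task": "clustering", "passed": True},
    {"platform": "Seeker", "task": "cell_typing", "passed": False},
    {"platform": "Seeker", "task": "qc", "passed": False},
]

overall = sum(r["passed"] for r in records) / len(records)
by_platform = stratified_accuracy(records, "platform")
by_task = stratified_accuracy(records, "task")
```

In this toy log the overall accuracy is 0.5, yet the per-platform view splits into 1.0 and 0.25. The aggregate alone would hide where the agent is actually failing.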

For business use, the implication is practical:

Product decision	SpatialBench evidence	Business interpretation	Boundary
Use one generic agent workflow	Accuracy varies strongly by platform	Generic workflows may hide platform-specific failure	Benchmark composition may partly influence platform gaps
Add assay-aware prompts and tools	QC and platform results suggest calibration matters	Domain-specific scaffolding can reduce predictable errors	Prompting alone may not fix deeper reasoning gaps
Evaluate by workflow step	Task categories show different failure regimes	Validation should be modular, not only end-to-end	Step-level success may not capture long-horizon compounding
Track cost and latency	Efficiency separates model families sharply	Deployment choice depends on throughput, not accuracy alone	Benchmark costs may shift with pricing and infrastructure
Treat harness as core IP	Harness comparison shows large uplift	Agent reliability may come from workflow engineering	Results are strongest for tested harnesses and models

Notice what this table does not say. It does not say every biotech team should immediately build a fully autonomous spatial biology agent. That would be adorable. It says teams should stop evaluating scientific agents as if the model name were the system.

Harness design is not glue code; it is the experiment

The most important result in the paper is the harness comparison.

The authors compare Opus-4.5 across different agent harnesses: a base configuration, Claude Code, and the Latch agent harness. The same base model reaches 38.4% accuracy in the base setup, 48.1% with Claude Code, and 61.7% with the Latch harness. The absolute improvement from base to Latch is 23.3 percentage points. The improvement from Claude Code to Latch is 13.6 points.

That is larger than many model-to-model gaps.
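The arithmetic is easy to check. The snippet below only recomputes the differences from the accuracies quoted in this article and, for scale, compares them with the gap between the top two base-configuration models. The relative gain is a derived figure, not one reported by the paper.

```python
# Accuracies (percent) quoted in the text for Opus-4.5 under each harness.
harness_accuracy = {"base": 38.4, "claude_code": 48.1, "latch": 61.7}

base_to_latch = harness_accuracy["latch"] - harness_accuracy["base"]
claude_to_latch = harness_accuracy["latch"] - harness_accuracy["claude_code"]
relative_gain = harness_accuracy["latch"] / harness_accuracy["base"] - 1

# Gap between the top two models in the base configuration (Opus-4.5 vs GPT-5.2).
top_model_gap = 38.36 - 34.02

print(round(base_to_latch, 1), round(claude_to_latch, 1))   # prints: 23.3 13.6
print(round(relative_gain, 2), round(top_model_gap, 2))     # prints: 0.61 4.34
```

On these numbers, changing the harness moves the same model by roughly five times the distance that separates the two best base-configuration models.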

This comparison is not a pure ablation in the narrow laboratory sense, because the harness changes multiple things at once: prompts, tools, control flow, runtime behavior, and execution environment. But as a system-level intervention, it is exactly the kind of comparison practitioners need. It asks: if we keep the base model fixed but change the agent wrapper, how much capability becomes usable?

The answer is: a lot.

The task-level harness results are even more informative. Latch improves especially on tasks requiring multi-step programming and intermediate analysis: clustering, differential expression, and dimensionality reduction. For example, Opus-4.5 with Latch reaches 65.9% on clustering compared with 33.3% in the base configuration; 64.1% on differential expression compared with 37.2%; and 75.6% on dimensionality reduction compared with 51.1%.

This is mechanism-first evidence. Better harnesses help not by sprinkling “domain magic” over the model, but by stabilizing the sequence of work: inspect data, run code, handle errors, preserve intermediate findings, follow answer formats, and stop at the right time. In other words, the harness turns a language model into a more disciplined analyst.
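A toy control loop makes that sequence of work concrete. This is a sketch of the general idea, not a description of how Claude Code or the Latch harness works. `policy`, `execute`, and `is_valid_answer` are hypothetical callables standing in for a model call, a sandboxed executor, and a schema check.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    notes: list = field(default_factory=list)  # preserved intermediate findings
    errors: int = 0
    steps: int = 0
    answer: object = None


def run_harness(policy, execute, is_valid_answer, max_steps=10, max_errors=3):
    """Drive a policy through compute steps until it submits a valid answer.

    policy(state) returns ("run", code) or ("answer", payload).
    The loop stops on a valid answer, or when the step or error budget runs out.
    """
    state = RunState()
    while state.steps < max_steps and state.errors < max_errors:
        state.steps += 1
        kind, payload = policy(state)
        if kind == "answer":
            if is_valid_answer(payload):
                state.answer = payload
                return state
            state.errors += 1  # a rejected answer still consumes budget
            state.notes.append("format error: answer rejected")
            continue
        try:
            state.notes.append(execute(payload))
        except Exception as exc:  # surface the failure to the next step
            state.errors += 1
            state.notes.append(f"error: {exc}")
    return state  # budget exhausted; answer stays None


def toy_policy(state):
    # Inspect once, then answer using what was found.
    if not state.notes:
        return ("run", "n_cells")
    return ("answer", {"n_cells": state.notes[-1]})


result = run_harness(
    toy_policy,
    execute=lambda code: {"n_cells": 1200}[code],
    is_valid_answer=lambda a: isinstance(a, dict) and "n_cells" in a,
)
```

The toy policy finishes in two steps because it uses what it inspected. A policy that keeps submitting free text would instead burn its error budget and return no answer, which is the cheap version of hitting a step limit.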

That matters because many AI products still treat the harness as an implementation afterthought. The paper suggests the opposite: for scientific agents, the harness is part of the scientific instrument.

A model without a good harness is like a microscope without a focusing mechanism. Technically impressive. Operationally irritating.

More steps help only when they are productive

The trajectory analysis is the paper’s diagnostic extension. It is not the main benchmark score, and it should not be read as a second leaderboard. Its purpose is to explain behavioral regimes behind the numbers.

The authors inspect session logs containing reasoning traces, tool invocations, terminal output, and errors. Several patterns emerge.

First, step count is not inherently good or bad. For Opus-4.5, pass rate rises with more steps: from 26.0% for one-step runs to 50.0% for runs with six or more steps. That suggests productive exploration. The model does not merely spend more time; it uses additional actions to inspect, compute, and refine.

For Grok variants, the story is different. They average nearly ten steps and generate high format-error counts. All 119 instances of 100-step limit exhaustion occur in Grok runs, and every one of them ends in failure. That is not exploration. That is a washing machine with a PhD.

Second, instruction following matters because structured scientific evaluation depends on schema compliance. Grok variants average more than seven format errors per evaluation, while GPT-5.2 produces zero format errors in the reported runs and Claude/Gemini models remain near-perfect. Every format correction consumes agent budget that could have been used for analysis.
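Schema compliance is cheap to check before an answer ever reaches a grader. The validator below is a minimal sketch that assumes answers are submitted as JSON objects; the schema and field names are hypothetical and not taken from the benchmark.

```python
import json


def validate_answer(raw: str, required: dict) -> list:
    """Return a list of format errors; an empty list means the answer is gradeable.

    `required` maps field names to the Python type each field must have.
    """
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(answer, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    for name, expected_type in required.items():
        if name not in answer:
            errors.append(f"missing field: {name}")
        elif not isinstance(answer[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


# Hypothetical schema for a marker-gene question.
schema = {"cluster": str, "marker_genes": list}

good = validate_answer('{"cluster": "3", "marker_genes": ["GFAP", "AQP4"]}', schema)
partial = validate_answer('{"cluster": 3}', schema)
prose = validate_answer("The top markers are GFAP and AQP4.", schema)
```

Here `good` is empty, `partial` reports both a wrong type and a missing field, and `prose` fails at the parsing stage. Each non-empty result is a retry the agent has to spend on formatting rather than on analysis.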

Third, finding information is not the same as using it. Opus-4.5 inspects adata.uns more frequently than GPT models, but the bigger point is utilization: when Opus inspects adata.uns, its pass rate rises by 26 percentage points. Grok variants inspect it too, but gain only 4–6 points. The difference is not access. The difference is interpretation.
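The access-versus-utilization split can be measured directly from trajectory logs. The sketch below assumes the uplift is a simple difference in pass rates between runs that did and did not inspect a field, which may not match the paper's exact method. The run records and the `inspected_uns` flag are invented for illustration.

```python
def pass_rate(runs):
    """Fraction of runs that passed; None when there are no runs."""
    if not runs:
        return None
    return sum(run["passed"] for run in runs) / len(runs)


def inspection_uplift(runs, flag="inspected_uns"):
    """Pass-rate difference, in percentage points, between runs with and without the flag."""
    with_flag = [run for run in runs if run[flag]]
    without_flag = [run for run in runs if not run[flag]]
    rate_with, rate_without = pass_rate(with_flag), pass_rate(without_flag)
    if rate_with is None or rate_without is None:
        return None
    return 100 * (rate_with - rate_without)


# Hypothetical trajectory summaries, invented for illustration.
runs = [
    {"inspected_uns": True, "passed": True},
    {"inspected_uns": True, "passed": True},
    {"inspected_uns": True, "passed": False},
    {"inspected_uns": True, "passed": True},
    {"inspected_uns": False, "passed": True},
    {"inspected_uns": False, "passed": False},
    {"inspected_uns": False, "passed": False},
    {"inspected_uns": False, "passed": False},
]

print(inspection_uplift(runs))  # prints: 50.0
```

This is an observational difference, so it cannot say on its own whether inspection caused the improvement. It does separate two questions that a plain tool-call counter merges: how often the agent looked, and how much looking was worth.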

This distinction is crucial for business AI systems. Tool use is often measured crudely: did the agent call the tool, retrieve the file, inspect the object, search the database? SpatialBench reminds us that tool use has a second stage: did the agent recognize what mattered and integrate it into the answer?

A mediocre agent with more tools can simply fail more ornamentally.

What the paper directly shows, and what business should infer

The paper directly shows three things.

First, frontier agents remain unreliable on real spatial biology workflows under base harnesses. The top base accuracy is 38.36%, and several model families remain around 20–28%. This does not mean the systems are useless. It means they are not ready to be trusted as independent analysts across the tested workflow distribution.

Second, performance varies strongly by task type and platform. QC and cell typing are weak points; normalization and some dimensionality-reduction tasks are more tractable; platform identity changes performance meaningfully. Scientific agent evaluation must therefore be stratified. A single aggregate score is too blunt.

Third, harness design materially changes performance. The same Opus-4.5 model moves from 38.4% in the base configuration to 61.7% with the Latch harness. That is the practical center of gravity.

Cognaptus would infer three business lessons from this, with appropriate restraint.

The first lesson is that scientific AI agents should be evaluated as full stacks, not models. The unit of deployment is not “GPT” or “Claude” or “Gemini.” It is model plus system prompt, tools, code execution environment, retry policy, schema enforcement, intermediate validation, data connectors, and domain calibration routines. Procurement teams love simple model comparisons because they fit into slides. Real workflows do not care.

The second lesson is that domain-aware verification is likely where defensibility lives. If every vendor has access to strong base models, competitive advantage shifts toward workflow-specific evaluation sets, calibrated analysis templates, verified intermediate checks, and careful failure-mode logging. In spatial biology, that means assay-aware thresholds, platform-specific data loaders, marker-reference context, and guardrails that force empirical inspection before final answers.

The third lesson is that the first valuable products may be assistant systems, not autonomous scientists. SpatialBench tasks are step-level snapshots. They test whether agents can recover key intermediate results. That is already valuable. But a full scientific workflow involves revisiting earlier choices, interpreting contradictory evidence, debugging assumptions, negotiating with human expertise, and deciding when not to continue. Autonomy should be earned step by step, not declared in a product launch.

A useful deployment roadmap would look less like “replace the analyst” and more like this:

Deployment layer	Near-term role	Required control
Data inspection assistant	Summarize objects, surface metadata, suggest plots	Must cite inspected fields and generated artifacts
QC co-pilot	Propose thresholds and show sensitivity	Must expose platform assumptions and alternative cutoffs
Marker and DE assistant	Generate candidate gene lists and comparisons	Must report filtering choices and statistical settings
Spatial-pattern explainer	Interpret niches, neighborhoods, and gradients	Must link claims to computed evidence
Workflow orchestrator	Chain steps under human supervision	Must support rollback, audit logs, and checkpoint review

That is not as glamorous as “AI scientist discovers drug while you sleep.” It is also much less likely to embarrass everyone before lunch.

How to read the paper’s evidence without over-reading it

SpatialBench is strong because it is concrete. But concrete does not mean unlimited.

The deterministic graders are necessary for reproducible benchmarking. They also compress scientific judgment into checkable outputs. That is a trade-off. Numeric tolerance, Jaccard similarity, precision-at-K, and multiple-choice grading make comparison possible, but they cannot fully represent the texture of expert reasoning. A scientifically defensible answer may sometimes be more nuanced than a grader can capture.

The benchmark also uses snapshotted workflow steps. This is sensible: it isolates specific capabilities and avoids the chaos of evaluating entire open-ended projects. But real scientific analysis is iterative. A poor QC threshold may distort clustering, which may distort cell typing, which may distort differential expression, which may force the analyst to revisit QC. SpatialBench does not fully measure that long-horizon feedback loop.

The harness comparison is highly informative, but it is not a universal law. Latch improves Opus-4.5 substantially in this benchmark, but the result does not prove that the same architecture will dominate every scientific domain, every model family, or every future benchmark revision. It proves the more important practical point: harness choice can be large enough to change deployment conclusions.

The platform analysis is also conditional. Platform-level differences may reflect assay difficulty, data distribution, task mix, or all of the above. The safe interpretation is not “Seeker is always harder” in every imaginable setting. The safe interpretation is that platform context materially affects agent performance and should be explicitly evaluated.

These limitations do not weaken the paper’s business relevance. They define it. SpatialBench is best read as a diagnostic benchmark for building better scientific agents, not as a final certificate of biological intelligence.

The benchmark is really a specification for agent engineering

The most useful way to read SpatialBench is as a workflow specification disguised as a benchmark.

It says a scientific agent must do at least five things reliably:

It must inspect the actual dataset, not hallucinate from biological priors.
It must apply assay-aware calibration, not generic single-cell defaults.
It must execute multi-step analysis without losing the plot.
It must produce structured answers that can be verified.
It must use intermediate discoveries, not merely collect them.

That list is more valuable than the leaderboard.

For biotech and pharma teams, the implication is that AI adoption should begin with workflow decomposition. Identify the specific analysis steps where current bottlenecks are painful and outputs can be verified. Build small evaluation sets from historical workflows. Test agents under realistic data snapshots. Track not only final answers but also failure trajectories: wrong assumptions, unused evidence, schema errors, excessive retries, and platform-specific collapse.

For AI vendors, the paper suggests that “agentic biology” will not be won by model access alone. It will be won by disciplined harness engineering: tools that know the data objects, prompts that force empirical checks, control flow that prevents thrashing, and graders that make failure visible.

For scientific leaders, the managerial lesson is simpler: do not confuse fluency with analysis. A model that explains spatial transcriptomics beautifully may still choose the wrong threshold, ignore the relevant field in adata.uns, or return a polished answer in the wrong schema. Biology does not award points for tone.

Conclusion: messy biology is the right test

SpatialBench is valuable because it refuses to make AI agents look cleaner than they are.

It does not ask whether models can sound like computational biologists. It asks whether agent systems can recover biological results from messy, midstream, platform-specific workflows. The answer is: sometimes, but not reliably enough—and the difference often lies in the harness, calibration, and control flow rather than the base model alone.

That is the article’s central correction to the obvious misconception. The future of AI in scientific analysis is not simply “wait for the next better model.” Better models will matter. But SpatialBench shows that practical reliability will also depend on something less fashionable and more important: the engineering discipline around the model.

In messy biology, intelligence is not just what the model knows. It is what the system can verify, repeat, and repair.

That is less cinematic than the autonomous AI scientist fantasy. It is also how real scientific tools usually become useful: not by being magical, but by becoming dependable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kenny Workman, Zhen Yang, Harihara Muralidharan, and Hannah Le, “SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?”, arXiv:2512.21907, https://arxiv.org/html/2512.21907.
