Opening — Why this matters now
Multimodal Large Language Models (MLLMs) have become impressively fluent readers of the world. They can caption images, parse charts, and answer questions about documents that would once have required a human analyst and a strong coffee. Naturally, chemistry was next.
But chemistry does not speak in sentences. It speaks in arrows, wedges, dashed bonds, cryptic tables, and reaction schemes buried three pages away from their explanations. If we want autonomous “AI chemists,” the real test is not trivia or SMILES strings — it is whether models can read actual chemical papers.
That is exactly where RxnBench steps in. And it does not flatter today’s models.
Background — Context and prior art
Most existing chemistry benchmarks fall into one of two camps:
- Text-only benchmarks (ChemBench, ChemLLMBench, MMLU, SciBench) that test factual recall or symbolic reasoning.
- Lightly multimodal benchmarks that include molecular images or tables, but treat them as static objects rather than parts of a reaction process.
What these benchmarks largely miss is how chemists actually work: hopping between figures, scanning tables, resolving references like “compound 3a,” and mentally simulating reaction mechanisms.
Scientific PDFs — the dominant format for chemistry — make this even worse. They are visually dense, poorly structured for machines, and unforgiving of errors. A flipped stereocenter is not a typo; it is a different molecule.
RxnBench is built around a simple but uncomfortable premise: if a model cannot read a real chemistry paper, it does not understand chemistry.
Analysis — What RxnBench actually does
RxnBench introduces a two-tier evaluation framework designed to mirror a chemist’s cognitive workflow:
1. Single-Figure QA (SF-QA): Perception under pressure
SF-QA isolates reaction schemes from PDFs and asks models to answer detailed questions about them. This includes:
- Extracting reaction conditions (temperature, solvent, catalyst)
- Identifying reagent roles
- Comparing yields and selectivity
- Reasoning about mechanisms
- Translating molecular drawings into precise (E-)SMILES
The dataset includes 1,525 expert-verified questions derived from 305 reaction schemes, spanning modern organic and catalytic chemistry.
Crucially, the benchmark uses adversarially edited distractors. Wrong answers are chemically plausible — enantiomers, regioisomers, subtly incorrect conditions — forcing models to truly look and reason rather than guess.
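To see why such distractors bite, consider a stereochemical pair. The sketch below is a minimal illustration assuming RDKit is available; the molecules and option names are hypothetical, not drawn from the benchmark. It shows that an enantiomer decoy is indistinguishable from the correct answer unless the comparison is stereo-aware:

```python
from rdkit import Chem  # pip install rdkit

def canonical(smiles: str, keep_stereo: bool = True) -> str:
    """Return a canonical SMILES, optionally discarding stereochemistry."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol, isomericSmiles=keep_stereo)

# Hypothetical option pair: the two enantiomers of 1-phenylethanol,
# identical in every respect except one stereocenter.
correct_option   = "C[C@H](O)c1ccccc1"
enantiomer_decoy = "C[C@@H](O)c1ccccc1"

# Stereo-aware comparison tells them apart; a stereo-blind one does not.
print(canonical(correct_option) == canonical(enantiomer_decoy))   # False
print(canonical(correct_option, keep_stereo=False) ==
      canonical(enantiomer_decoy, keep_stereo=False))             # True
```

Any pipeline that drops stereochemistry treats the two options as the same molecule, which is exactly the failure mode these distractors are built to probe.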
2. Full-Document QA (FD-QA): Reading like a chemist
FD-QA raises the difficulty sharply. Models are given entire chemistry papers (rendered as page images) and must answer questions that require:
- Cross-referencing text, figures, and tables
- Resolving entity references across pages
- Synthesizing scattered experimental details
- Performing structure-level reasoning
Each question may have zero to four correct answers, plus a mandatory “None of the above” option. This design explicitly penalizes hallucinated confidence.
In short: FD-QA tests whether a model can survive unsupervised literature reading.
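As a rough illustration of how answers to such questions might be scored, here is a minimal sketch assuming exact set matching over lettered options, with "None of the above" standing in for an empty answer set. The option labels and the scoring rule are assumptions for illustration, not the paper's published protocol:

```python
# Hypothetical FD-QA scoring sketch: options A-D plus "None of the above" (here "E");
# an answer is the set of selected letters.
NONE_OF_THE_ABOVE = "E"

def score_question(predicted: set[str], gold: set[str]) -> bool:
    """Exact set match; an empty gold set means only 'None of the above' is correct."""
    pred = predicted or {NONE_OF_THE_ABOVE}
    ref = gold or {NONE_OF_THE_ABOVE}
    return pred == ref

print(score_question({"B", "D"}, {"B", "D"}))        # True: both correct options selected
print(score_question({"B", "C", "D"}, {"B", "D"}))   # False: one hallucinated extra costs the question
print(score_question({NONE_OF_THE_ABOVE}, set()))    # True: nothing applies, and the model says so
```

Under exact matching there is no partial credit: selecting one extra, plausible-sounding option forfeits the question, which is what penalizing hallucinated confidence looks like in practice.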
Findings — Results that should make us pause
SF-QA: Strong perception, brittle vision
On single figures, top-tier models perform extremely well:
| Capability | Observation |
|---|---|
| Fact extraction | Near-solved (≈96% for frontier models) |
| Mechanistic reasoning | Strong with inference-time reasoning |
| Comparative analysis | Consistently high |
| Structure recognition | Persistent weakness across models |
Inference-time “thinking” models dramatically outperform standard instruction-tuned models. Reasoning helps — but it does not fix weak visual encoders.
FD-QA: The real bottleneck
Full-document performance is where optimism collapses.
- No evaluated model exceeds 50% accuracy
- Non-reasoning models often fail to reach 15%
- Even the best models struggle with structure-based reasoning
A clear pattern emerges:
| Task Type | Relative Performance |
|---|---|
| Context reasoning (text + tables) | Moderate |
| Structure reasoning (visual chemistry) | Consistently poor |
Models can retrieve facts. They can summarize. They cannot reliably verify molecular structures across a document.
Implications — What this means for AI chemists
RxnBench exposes a hard truth: reasoning alone cannot compensate for weak chemical vision.
Three implications stand out:
- **Domain-specific visual encoders are non-negotiable.** Generic vision backbones are not designed for stereochemistry, bond topology, or reaction arrows.
- **Inference-time reasoning is necessary, but insufficient.** Chain-of-thought helps integrate information, but garbage perception still yields garbage conclusions.
- **Agentic workflows are the next frontier.** Real AI chemists will need to navigate documents, cross-check answers, and verify structures using external tools (e.g., RDKit), not just answer static questions (a sketch of such a check follows below).
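As a concrete flavor of that kind of tool use, the sketch below shows a hypothetical verification step an agent might run with RDKit, cross-checking a structure it extracted from a reaction scheme against the molecular formula reported elsewhere in the paper. The function name and example values are illustrative, not part of RxnBench:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

def verify_extracted_structure(smiles, expected_formula=None, expected_mw=None, tol=0.5):
    """Hypothetical agent tool: sanity-check a SMILES the model extracted from a
    scheme against details reported elsewhere in the paper."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {"valid": False, "reason": "extracted SMILES does not parse"}
    report = {
        "valid": True,
        "canonical_smiles": Chem.MolToSmiles(mol),
        "formula": rdMolDescriptors.CalcMolFormula(mol),
        "mol_weight": round(Descriptors.MolWt(mol), 2),
    }
    if expected_formula is not None:
        report["formula_matches"] = report["formula"] == expected_formula
    if expected_mw is not None:
        report["mw_matches"] = abs(report["mol_weight"] - expected_mw) <= tol
    return report

# Illustrative call: the agent drew a structure from a scheme and found "C9H12O"
# quoted in the supporting information.
print(verify_extracted_structure("CC(C)c1ccccc1O", expected_formula="C9H12O"))
```

The point is not this particular check but the loop it enables: extract, verify against independent evidence in the document, and only then answer.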
RxnBench is less a leaderboard and more a diagnostic scan. It shows us exactly where today’s models break — and where future research must focus.
Conclusion — A benchmark that refuses to be easy
RxnBench does something rare in AI evaluation: it respects the complexity of the domain it tests.
By grounding evaluation in real chemistry papers, enforcing visual precision, and punishing hallucinated confidence, it draws a clear boundary between looking smart and being correct.
For now, multimodal LLMs can read chemistry papers the way a distracted undergraduate might: skimming confidently, missing the details that matter most.
RxnBench reminds us that scientific understanding is not about eloquence. It is about precision.
Cognaptus: Automate the Present, Incubate the Future.