RxnBench: Reading Chemistry Like a Human (Turns Out That’s Hard)

A reaction scheme looks like a picture.

To a chemist, it is closer to a compressed process model. A few arrows may encode the starting materials, catalysts, solvents, temperatures, intermediate states, selectivity, yield, and the structural change that makes the entire experiment worth publishing.

Reading that scheme correctly is already difficult. Reading the paper around it is worse.

The explanation may appear two pages later. A compound may be called “3a” in the text, drawn without that label in a figure, and compared against several near-identical structures in a table. The decisive difference may be the position of one substituent or the direction of one stereochemical bond.

Multimodal large language models can now describe scientific images fluently. RxnBench asks a less flattering question: can they read chemistry accurately enough for the details to count?¹

Its answer depends heavily on what “reading chemistry” means.

When models receive one carefully cropped reaction scheme, several achieve scores above 90%. When they must process a complete research paper, combine information across pages, and verify molecular structures, the best overall score falls below 50%.

That contrast is the central result. It separates models that can perform impressively on bounded chemical questions from systems that can reliably interpret scientific literature as part of an actual research workflow.

One Figure and One Paper Are Not Slightly Different Tasks

RxnBench contains two evaluation tiers designed around different parts of a chemist’s reading process.

The first, Single-Figure Question Answering, isolates a reaction-scheme image and asks questions grounded in that figure. The benchmark contains 1,525 expert-verified questions derived from 305 reaction schemes.

The second, Full-Document Question Answering, gives the model an entire chemistry paper rendered as page images. It contains 540 questions from 108 papers and requires the model to connect text, structures, figures, and tables distributed throughout the document.

The distinction sounds simple: small context versus large context.

Operationally, it changes almost everything.

Evaluation tier	What the model receives	What the model must do	Primary failure risk
Single-Figure QA	One isolated reaction scheme	Read labels, identify roles, compare outcomes, interpret mechanisms, recognize structures	Misreading local visual or chemical details
Full-Document QA	A complete paper rendered as page images	Locate evidence, resolve references, integrate modalities, verify structures, select all correct answers	Losing evidence across pages or reasoning from an incorrectly perceived structure

The single-figure task resembles asking an analyst to inspect one well-prepared slide.

The full-document task resembles giving the analyst an unfamiliar technical report and asking for an exact conclusion supported by evidence scattered across the document. The analyst must first find the relevant pages, then align the terminology, then interpret the evidence, then avoid confusing a plausible answer with the correct one.

A model can be excellent at the first job without being dependable at the second.

RxnBench Makes Wrong Answers Look Chemically Reasonable

Many benchmarks become easier once a model learns the habits of benchmark writers.

The correct answer may be longer, more carefully qualified, or more technically phrased. Incorrect options may contain obvious factual errors. A capable model can sometimes answer without deeply interpreting the source material.

RxnBench attempts to remove those shortcuts.

For Single-Figure QA, an initial model-generated question-and-answer pipeline is followed by expert review and adversarial editing. The annotators, doctoral researchers specializing in organic chemistry, revise the distractors so that they remain chemically plausible.

A wrong option may represent an enantiomer, a regioisomer, a different reaction condition, or a subtly altered molecular structure. It should look believable until the model checks the specific evidence in the figure.

Full-Document QA makes the selection problem stricter. Each question can have zero to four correct answers among options A through D. Option E, “None of the Above,” covers the case in which none of them is correct. Incorrect choices are often derived from elsewhere in the same paper rather than invented from nothing.

This matters because scientific errors are rarely theatrical. A wrong reaction condition copied from another row of the correct table looks entirely respectable. So does the right compound with the substituent attached at the wrong position.

RxnBench is therefore testing more than whether a model knows chemistry. It is testing whether the model can resist answering from chemical plausibility when the document requires chemical verification.

The 96% Score and the 46% Score Describe Different Capabilities

On Single-Figure QA, the leading model in the paper, Gemini-3-Flash-preview, reaches a mean score of 96.23% across the English and Chinese versions.

Several other reasoning-enabled models also exceed 90%. Qwen3-VL-235B-A22B-Think, an open-weight model, reaches 91.77%, compared with 85.84% for its instruction-tuned counterpart.

These are strong results. They suggest that frontier multimodal models can extract and reason over a substantial amount of information contained within a reaction scheme.

Then the full paper arrives.

On Full-Document QA, Gemini-3-Flash-preview leads with an overall score of 46.30%. Gemini-3-Pro-preview and Gemini-2.5-Pro both score 42.87%. GPT-5(high) reaches 34.54%. GPT-4o scores 14.07%.

Model	Single-Figure mean score	Full-Document overall score
Gemini-3-Flash-preview	96.23%	46.30%
Gemini-3-Pro-preview	93.61%	42.87%
Gemini-2.5-Pro	92.59%	42.87%
GPT-5(high)	92.63%	34.54%
Qwen3-VL-235B-A22B-Think	91.77%	30.56%
GPT-4o	74.49%	14.07%

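The obvious interpretation is that models lose half or more of their capability when moving from a figure to a document: the leading Gemini models fall to roughly half of their single-figure scores, and GPT-4o drops from 74.49% to 14.07%.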

That is directionally useful but technically incomplete.

The two tiers do not use identical question formats or scoring rules. Single-Figure QA uses conventional multiple-choice questions with one correct answer. Full-Document QA uses multiple-select questions and exact-match scoring: omitting one correct option or adding one incorrect option makes the entire response wrong.
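The exact-match rule described above can be made concrete with a minimal Python sketch. It is not the benchmark's own scoring code: the function names are hypothetical, and it assumes that a question with no correct option among A through D has the gold set {E}.

```python
from typing import Iterable, Tuple

VALID_OPTIONS = frozenset("ABCDE")  # E stands for "None of the Above"


def normalize(selection: Iterable[str]) -> frozenset:
    """Upper-case and de-duplicate a selection, rejecting unknown letters."""
    letters = frozenset(s.strip().upper() for s in selection)
    unknown = letters - VALID_OPTIONS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return letters


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> int:
    """Return 1 only when the predicted set equals the gold set exactly."""
    return int(normalize(predicted) == normalize(gold))


def exact_match_accuracy(pairs: Iterable[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Mean exact-match score over (predicted, gold) pairs."""
    scores = [exact_match(p, g) for p, g in pairs]
    return sum(scores) / len(scores)


# Gold answer is {A, C}. Order does not matter, but completeness does.
print(exact_match("CA", "AC"))             # 1: same set
print(exact_match(["A"], ["A", "C"]))      # 0: one correct option omitted
print(exact_match("ABC", ["A", "C"]))      # 0: one incorrect option added
```

Under this rule a response that finds one of two correct options scores the same as a response that finds none, which is part of why the two tiers' percentages are not directly comparable.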

Full-Document QA is also deliberately dominated by difficult structure-related questions. Structure Reasoning accounts for 75.6% of its questions, while Context Reasoning accounts for 24.4%.

The score gap therefore should not be read as a controlled measurement of how much context length alone harms performance. It combines several changes: more pages, more modalities, more retrieval demands, stricter scoring, more complex answer formats, and a heavier concentration of structure-sensitive tasks.

But that does not weaken the practical conclusion. It clarifies it.

Real literature analysis is not simply single-figure analysis with additional tokens. It is a compound workflow in which locating, aligning, perceiving, reasoning, and verifying must all succeed together. A high score on one isolated component cannot establish reliability for the complete process.

Models Can Follow the Discussion Better Than They Can Verify the Molecule

RxnBench divides Full-Document QA into two broad categories.

Context Reasoning requires models to combine information from text, tables, and reaction images. A question may ask the model to reconstruct an experimental procedure, compare reported outcomes, or connect a textual claim with supporting evidence elsewhere in the paper.

Structure Reasoning requires precise deductions involving molecular structures, Markush representations, reaction components, or mechanisms.

Across nearly every evaluated model, Context Reasoning scores exceed Structure Reasoning scores.

Gemini-2.5-Pro achieves 56.82% on Context Reasoning but only 36.52% on Structure Reasoning. Gemini-3-Flash-preview scores 55.30% and 42.64%, respectively. Qwen3-VL-235B-A22B-Think scores 40.91% on context and 27.21% on structure.

This comparison identifies the real bottleneck more precisely than the overall leaderboard.

The models are not merely becoming confused because papers are long. They are comparatively better at retrieving and connecting contextual information than at making exact judgments from molecular diagrams.
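A short calculation shows how strongly the question mix pulls a pooled score toward the structure result. The sketch below assumes that a pooled score is the question-weighted mean of the two category scores, and it infers the category counts (408 and 132) from the reported 75.6% and 24.4% shares of 540 questions. Both points are assumptions, not figures quoted from the paper.

```python
TOTAL_QUESTIONS = 540
# Counts inferred from the reported 75.6% / 24.4% split (an assumption).
N_STRUCTURE = round(0.756 * TOTAL_QUESTIONS)   # 408
N_CONTEXT = TOTAL_QUESTIONS - N_STRUCTURE      # 132


def pooled_score(context_pct: float, structure_pct: float) -> float:
    """Question-weighted mean of the two category scores, in percent."""
    return (N_CONTEXT * context_pct + N_STRUCTURE * structure_pct) / TOTAL_QUESTIONS


# (Context Reasoning %, Structure Reasoning %) as quoted in this article.
reported = {
    "Gemini-2.5-Pro": (56.82, 36.52),
    "Gemini-3-Flash-preview": (55.30, 42.64),
    "Qwen3-VL-235B-A22B-Think": (40.91, 27.21),
}

for model, (context, structure) in reported.items():
    print(f"{model}: pooled ~ {pooled_score(context, structure):.1f}%")
# Gemini-2.5-Pro: pooled ~ 41.5%
# Gemini-3-Flash-preview: pooled ~ 45.7%
# Qwen3-VL-235B-A22B-Think: pooled ~ 30.6%
```

In each case the pooled value sits much closer to the structure score than to the context score, so a model's headline Full-Document number is largely a statement about its structure reasoning.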

That difference matters because chemical structures are not decorative illustrations accompanying the “real” textual content. They are part of the argument.

A model may correctly find the paragraph explaining a reaction’s selectivity and still attach that explanation to the wrong structure. It may retrieve the reported yield but misidentify which product achieved it. It may understand a mechanistic discussion in general terms while overlooking that the depicted compound has different connectivity or stereochemistry.

In ordinary document automation, a minor visual mistake may create an inconvenient extraction error. In chemistry, it can change the object being discussed.

Reasoning Helps Most When There Is Something Correct to Reason About

The paper repeatedly finds that models equipped with inference-time reasoning outperform standard instruction-tuned models.

The paired Qwen3-VL results make the contrast especially visible.

On Single-Figure QA, Qwen3-VL-235B-A22B-Think scores 91.77%, compared with 85.84% for Qwen3-VL-235B-A22B-Instruct.

On Full-Document QA, the gap expands. The thinking variant reaches 30.56%, while the instruction-tuned variant reaches 13.61%.

Seed1.5 shows a similar full-document pattern: 33.61% for Seed1.5-VL-Think versus 16.85% for Seed1.5-VL-Instruct.

Reasoning appears particularly valuable when the task requires multiple steps: locating evidence, comparing alternatives, resolving references, and checking whether every selected option is supported.

Still, these comparisons should not be treated as a clean ablation proving that additional reasoning tokens alone caused every improvement. The evaluated variants may differ in training, architecture, prompting behaviour, and other implementation details. The results provide strong comparative evidence for the value of reasoning-enabled models, not a laboratory-isolated estimate of its causal effect.

More importantly, reasoning does not eliminate the structural bottleneck.

On Single-Figure QA, many thinking models score above 90% on mechanism, process, and global-understanding questions while performing materially worse on structure recognition. Gemini-3-Pro-preview, for example, scores 95.27% on Mechanism and Process but 74.63% on Structure Recognition.

The pattern is revealing. Once the visual information has been interpreted correctly, reasoning can connect it impressively. When the model fails to perceive the molecular graph precisely, additional reasoning may simply produce a more coherent explanation of the wrong molecule.

Reasoning can inspect a conclusion. It cannot reliably repair evidence it never saw correctly.

Single-Figure Structure Recognition Is Improving, but Document-Level Structure Reasoning Remains Fragile

It would be too simple to say that multimodal models cannot recognize chemical structures.

The best Single-Figure QA result for Structure Recognition is 90.30%, achieved by Gemini-3-Flash-preview. Qwen3-VL-235B-A22B-Think reaches 84.33%, and GPT-5(high) reaches 83.58%.

Those results show that structure recognition in isolated figures is not uniformly poor. For some models and carefully presented inputs, it is becoming highly capable.

The fragility becomes clearer when structure interpretation is embedded inside full-document reasoning.

Gemini-3-Flash-preview falls from 90.30% on isolated Structure Recognition to 42.64% on Full-Document Structure Reasoning. These categories are not directly equivalent, but the contrast shows how much harder the problem becomes when perception is only one step inside a longer chain.

In the full-document setting, the model may need to:

identify the relevant page;
locate the correct figure or table;
map a textual reference such as “compound 3a” to a visual structure;
distinguish it from similar structures elsewhere;
infer the requested chemical relationship;
evaluate several plausible options;
return the exact complete answer set.

The workflow succeeds only when every stage cooperates.
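A rough way to see why chained steps are fragile is to multiply per-stage success rates. The sketch below is purely illustrative: the 90% figure is hypothetical, not a measurement from the paper, and treating the stages as independent is a simplifying assumption.

```python
import math

# The seven steps listed above, as short hypothetical labels.
STAGES = [
    "identify page",
    "locate figure or table",
    "map text reference to structure",
    "distinguish similar structures",
    "infer chemical relationship",
    "evaluate options",
    "return exact answer set",
]


def end_to_end_success(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return math.prod(stage_success_rates)


hypothetical_rates = [0.90] * len(STAGES)  # illustrative value only
print(round(end_to_end_success(hypothetical_rates), 3))  # 0.478
```

Even when each step looks reliable in isolation, seven of them in sequence can leave fewer than half of the questions answered correctly under these assumptions.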

This is why “better chemical vision” should not be interpreted narrowly as improving image recognition accuracy. A production system also needs document navigation, entity resolution, structured extraction, and verification tools that preserve molecular identity across the workflow.

The Supplementary Results Reinforce the Diagnosis Rather Than Introduce a Second Thesis

The paper’s supplementary tables provide the complete model leaderboards behind the selected results in the main text. They expand the comparison from representative models to 41 models for Single-Figure QA and 23 models for Full-Document QA.

Their main purpose is coverage. They show that the reported patterns are not based on one convenient comparison.

Larger and reasoning-enabled models generally perform better. Open-weight models can compete strongly on Single-Figure QA. Yet Full-Document QA remains difficult across proprietary and open-weight systems, especially for instruction-tuned models without explicit reasoning behaviour.

The English and Chinese evaluations function as a linguistic robustness check. Top models generally produce similar results across the two languages. Gemini-3-Flash-preview, for example, scores 95.93% in English and 96.52% in Chinese on Single-Figure QA, while its Full-Document scores are 45.74% and 46.85%, respectively.

This suggests that prompt language is not the dominant explanation for the main capability gap among the strongest models. The difficult part is not primarily translating the question. It is interpreting the chemistry and integrating the document.

Other evaluation choices are implementation details rather than separate findings. The papers are rendered as page images at 144 dpi, and a GPT-4o-based pipeline converts verbose model outputs into standardized answer selections. These choices make broad model comparison feasible, but they also become part of what the benchmark measures.

A model that performs differently under higher-resolution rendering, native PDF parsing, or tool-assisted navigation might achieve a different score. RxnBench evaluates a clearly defined reading setup; it does not exhaust every possible system design.

What RxnBench Directly Shows—and What Businesses Should Infer Carefully

RxnBench directly shows that current multimodal models perform much better on bounded reaction-scheme questions than on strict, full-document chemical reasoning tasks.

It also shows that reasoning-enabled models generally outperform standard instruction-tuned models, while structure-dependent questions remain comparatively difficult.

The paper does not directly measure the effect of these models on research productivity, drug-development timelines, experimental safety, or commercial returns.

Those business implications require an additional reasoning step.

For chemical, pharmaceutical, and materials-science organizations, the results support a staged deployment model rather than a choice between “use AI” and “do not use AI.”

Workflow role	Plausible use of current MLLMs	Required controls
Literature triage	Identify potentially relevant papers, topics, reaction classes, or reported outcomes	Human review of inclusion and exclusion decisions
Bounded extraction	Extract reaction conditions, yields, named compounds, or textual claims from selected pages	Schema validation and source-page links
Comparative review	Summarize differences across papers or experimental conditions	Verification against tables, figures, and original text
Structure-sensitive analysis	Match compounds, compare stereochemistry, reason about molecular changes	Specialized chemical parsers, molecular representations, and expert review
Autonomous research decisions	Propose or prioritize experiments based on literature evidence	External tools, validation pipelines, audit trails, and accountable scientific oversight
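The “schema validation and source-page links” control in the table can be sketched as a small record type plus a validator. Every field name, threshold, and example value below is hypothetical and chosen for illustration; a real schema would depend on the organization's data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReactionRecord:
    """One extracted claim, tied to the page it was read from."""
    compound_label: str              # label as used in the paper, e.g. "3a"
    yield_percent: Optional[float]   # None when no yield is reported
    temperature_c: Optional[float]   # None when no temperature is reported
    source_page: int                 # 1-based page number in the document
    source_quote: str                # verbatim text supporting the values


def validate(record: ReactionRecord, page_count: int) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.compound_label.strip():
        problems.append("missing compound label")
    if record.yield_percent is not None and not 0 <= record.yield_percent <= 100:
        problems.append("yield outside 0-100%")
    if record.temperature_c is not None and record.temperature_c < -273.15:
        problems.append("temperature below absolute zero")
    if not 1 <= record.source_page <= page_count:
        problems.append("source page outside document")
    if not record.source_quote.strip():
        problems.append("no supporting quote")
    return problems


# Made-up example: an implausible yield, a page that does not exist, no quote.
suspect = ReactionRecord("3a", 920.0, 25.0, 40, "")
print(validate(suspect, page_count=12))
```

Checks like these cannot confirm that an extracted value is right, but they reject records that cannot be right and ensure every surviving value points back to a page a reviewer can open.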

The cheapest useful deployment is likely not an autonomous AI chemist. It is a literature assistant that reduces the cost of finding, organizing, and initially interpreting evidence while making verification easier for researchers.

That may sound less cinematic. It is also closer to a system that an R&D organization can evaluate responsibly.

A useful assistant might highlight every source passage and structure used in its conclusion. It might convert recognized molecules into machine-readable representations and validate them with a cheminformatics toolkit. It might flag low-confidence structural matches instead of smoothing them into confident prose.

The business value would come from reducing search and review effort without silently converting uncertain perception into apparently definitive scientific claims.
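As one possible shape for such a check, the sketch below compares a model-proposed structure with a reference structure using RDKit, a widely used open-source cheminformatics toolkit. It assumes both structures are already available as SMILES strings, which is itself a hard recognition step that this example does not cover. The function names and status labels are hypothetical. If RDKit is not installed, the sketch reports “unverified” rather than guessing.

```python
from typing import Optional

try:
    from rdkit import Chem  # third-party dependency, assumed installed in real use
except ImportError:
    Chem = None  # the sketch then reports "unverified" instead of guessing


def canonical(smiles: str, keep_stereo: bool = True) -> Optional[str]:
    """Canonical SMILES from RDKit, or None if RDKit is missing or parsing fails."""
    if Chem is None or not smiles.strip():
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None for unparseable input
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=keep_stereo)


def compare_structures(predicted: str, reference: str) -> str:
    """Classify a predicted structure against a reference structure."""
    pred_full, ref_full = canonical(predicted), canonical(reference)
    if pred_full is None or ref_full is None:
        return "unverified"        # never silently upgrade to a match
    if pred_full == ref_full:
        return "match"
    if canonical(predicted, keep_stereo=False) == canonical(reference, keep_stereo=False):
        # Same connectivity; differs only in stereo (or isotope) annotation.
        return "stereo_mismatch"
    return "different"


# Two alanine enantiomers: identical connectivity, opposite stereocentre.
print(compare_structures("C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O"))
```

Returning a distinct status for a stereochemistry-only difference keeps the most chemically consequential near-miss visible to the reviewer instead of folding it into a generic mismatch.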

The Right Architecture Is a Verified Workflow, Not a Larger Chat Window

The paper recommends domain-specific visual encoders and external chemical tools such as RDKit. Its results make the logic behind that recommendation clear.

A general multimodal model is being asked to solve at least four different technical problems:

parse an unstructured scientific document;
recognize chemical structures precisely;
connect structures with textual references;
reason over the resulting evidence.

Expecting one model to perform all four internally creates a convenient interface and an inconvenient reliability problem.

A more defensible system would separate them.

Scientific PDF
      |
      v
Layout and page parsing
      |
      v
Text, tables, figures, and references
      |
      +----------------------+
      |                      |
      v                      v
Chemical structure       Context extraction
recognition                  and retrieval
      |                      |
      v                      v
Machine-readable        Evidence-linked
molecular graphs        document records
      |                      |
      +----------+-----------+
                 |
                 v
        Multimodal reasoning
                 |
                 v
     Structure and claim verification
                 |
                 v
          Researcher review

This architecture treats the language model as a coordinator and reasoning layer rather than the sole location of scientific truth.

It also creates measurable failure points. Teams can separately evaluate document retrieval, structure-recognition accuracy, reference resolution, answer generation, and final verification.

That is more useful than reporting one attractive overall accuracy score and discovering later that the system fails whenever two compounds differ by a wedge bond.
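One way to make those failure points measurable is to log a pass or fail outcome per stage for each test question and count where the chain first breaks. The sketch below assumes such per-stage labels exist, which in practice requires stage-level ground truth. The stage names and sample outcomes are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# Stage names follow the evaluation points listed above; order matters.
STAGES = [
    "retrieval",
    "structure_recognition",
    "reference_resolution",
    "answer_generation",
    "verification",
]


def first_failure(outcome: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed, or None if every stage passed."""
    for stage in STAGES:
        if not outcome.get(stage, False):  # a missing stage counts as a failure
            return stage
    return None


def failure_report(outcomes: List[Dict[str, bool]]) -> Dict[str, int]:
    """Count, per stage, how many questions first failed at that stage."""
    counts = Counter(first_failure(o) for o in outcomes)
    counts.pop(None, None)  # drop questions that passed every stage
    return {stage: counts.get(stage, 0) for stage in STAGES}


all_pass = {stage: True for stage in STAGES}
outcomes = [
    all_pass,
    {**all_pass, "structure_recognition": False},
    {**all_pass, "structure_recognition": False, "verification": False},
    {**all_pass, "retrieval": False},
]
print(failure_report(outcomes))
```

Attributing each failure to its earliest stage avoids double counting downstream errors that were caused by an upstream mistake, such as a verification failure that follows from a misread structure.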

Deployment Boundaries Matter More Than the Leaderboard Winner

RxnBench is a substantial improvement over chemistry benchmarks that test only textual knowledge or isolated molecular puzzles. Still, several boundaries affect how its results should be used.

First, the benchmark uses curated multiple-choice questions. Real researchers usually begin with open-ended objectives, incomplete terminology, and uncertain evidence. Multiple-choice evaluation measures whether a model can verify proposed answers, not whether it can independently formulate the right scientific question.

Second, Full-Document QA uses strict exact-match scoring. This appropriately penalizes unsupported selections, but it also means a nearly correct multi-select response receives the same score as a substantially incorrect one. The scores are therefore excellent for testing reliability under a strict criterion, but they do not describe the full distribution of partial usefulness.

Third, the benchmark corpus is selected from recent open-access papers in prominent chemistry journals, focused on organic reaction methodology and total synthesis. That provides sophisticated and relevant material, but performance may differ across patents, internal laboratory reports, older scanned literature, regulatory documents, or less standardized publications.

Fourth, Full-Document QA is heavily weighted toward Structure Reasoning. This makes it a valuable stress test for chemical precision, but its overall score should not be treated as a universal estimate of performance across every literature-review workflow.

Finally, the benchmark evaluates models answering static questions from rendered documents. It does not evaluate an agent that can actively search within a paper, enlarge a figure, call a structure-recognition system, query a chemical database, run validation code, or request human clarification.

That last boundary is especially important. RxnBench demonstrates the weakness of unaided models under a demanding reading setup. It also indicates exactly where tools and workflow design could help.
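The second boundary above, strict exact-match scoring, can be illustrated by scoring the same responses with a partial-credit metric alongside it. Jaccard similarity is used here only as a familiar stand-in; RxnBench does not report it, and the answer sets are invented for the example.

```python
def exact_match(predicted: set, gold: set) -> float:
    """1.0 only when the two answer sets are identical."""
    return 1.0 if predicted == gold else 0.0


def jaccard(predicted: set, gold: set) -> float:
    """Size of the overlap divided by size of the union."""
    union = predicted | gold
    if not union:
        return 1.0  # two empty sets agree completely
    return len(predicted & gold) / len(union)


gold = {"A", "B", "C"}
responses = {
    "near miss": {"A", "B"},   # two of three correct options, nothing wrong
    "far miss": {"D"},         # no correct option selected
}

for name, predicted in responses.items():
    print(name, exact_match(predicted, gold), round(jaccard(predicted, gold), 2))
# near miss 0.0 0.67
# far miss 0.0 0.0
```

Exact match treats both responses as equally wrong, while the partial-credit view separates them. Teams evaluating an assistant for triage rather than final decisions may want to report both.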

Reading Chemistry Requires Knowing When Looking Is Not Enough

RxnBench’s most useful contribution is not the declaration that chemistry is difficult. Chemists had noticed.

Its contribution is the separation of capabilities that are too often bundled together under the phrase “document understanding.”

A model may read labels correctly without recognizing a molecular structure. It may recognize a structure in isolation without tracking it across a paper. It may retrieve all relevant evidence without selecting the exact supported conclusion. It may reason well while beginning from a subtly incorrect visual interpretation.

The benchmark’s comparison between isolated figures and complete papers exposes these differences with unusual clarity.

For business users, the lesson is neither that multimodal models are ready to replace research chemists nor that they are useless until they become perfect.

The useful middle ground is more specific: deploy them where evidence can be bounded, traced, and verified. Add specialist tools where molecular identity matters. Measure each stage of the workflow rather than trusting fluency as a proxy for accuracy.

Today’s strongest models can often discuss a reaction scheme convincingly. RxnBench shows that reading the entire chemistry paper—and remaining correct when the molecules begin to look almost identical—is still another reaction entirely.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hanzheng Li et al., “RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature,” arXiv:2512.23565, https://arxiv.org/abs/2512.23565.
