Opening — Why this matters now

In demos, AI agents look impressively capable. They summarize reports, answer questions, and sometimes even automate workflows. But most demonstrations rely on relatively clean datasets or short context windows.

Real enterprises do not look like that.

Government archives, financial reports, compliance filings, and corporate records are messy, multi‑format, and historically layered. Information is scattered across decades of PDFs, tables, footnotes, and inconsistent layouts.

The question is simple: can modern AI agents actually reason over these documents?

A recent benchmark from Databricks—OfficeQA Pro—suggests the answer is: not yet.


Background — The missing benchmark for enterprise reasoning

Most AI benchmarks focus on tasks like coding, math, or knowledge recall. These are useful but somewhat detached from the daily reality of enterprises.

Enterprise decision‑making typically requires several capabilities at once:

| Capability | Why It Matters |
| --- | --- |
| Document parsing | Corporate records often exist as PDFs or scanned tables |
| Retrieval | Relevant facts may be spread across multiple documents |
| Numerical reasoning | Financial and policy data frequently involves calculations |
| Cross‑document grounding | Answers depend on evidence across different reports |

Existing benchmarks rarely combine all four.

To address this gap, researchers created OfficeQA Pro, a dataset built around a particularly demanding document corpus: nearly a century of U.S. Treasury Bulletins.

Key dataset characteristics:

| Dataset Property | Value |
| --- | --- |
| Documents | ~89,000 pages |
| Numerical values | 26+ million |
| Time span | ~100 years |
| Questions | 133 complex queries |

These questions require multi‑step reasoning, including reading tables, combining values from different sections, and tracing information across documents.

In other words: exactly the kind of work analysts, auditors, and regulators do every day.


Analysis — What the paper actually tests

The benchmark evaluates full AI agent systems, not just raw language models.

Three major agent frameworks were tested:

| Agent Framework | Models |
| --- | --- |
| Google Agent | Gemini models |
| OpenAI Agent | Codex / GPT models |
| Anthropic Agent | Claude models |

Each system was tested under several conditions:

  1. Parametric knowledge only – the model relies purely on training data.
  2. Web access – the agent can search external sources.
  3. Direct document access – the model receives the full corpus.
  4. Structured document parsing – documents are converted into structured representations.

This setup isolates the core challenge: can AI reliably read and reason over enterprise documentation?
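The four-condition matrix above can be sketched as a tiny evaluation harness. This is a minimal illustration, not the paper's actual code: the `Condition` enum, `run_eval`, and the toy agent are all invented for the example.

```python
from enum import Enum, auto
from typing import Callable

class Condition(Enum):
    """The four access conditions under which each agent is evaluated."""
    PARAMETRIC_ONLY = auto()    # no external evidence; training data only
    WEB_ACCESS = auto()         # agent may search external sources
    FULL_DOCUMENTS = auto()     # raw corpus handed to the model
    STRUCTURED_PARSING = auto() # corpus pre-parsed into structured form

def run_eval(agent: Callable[[str, Condition], str],
             questions: list[tuple[str, str]],
             condition: Condition) -> float:
    """Score an agent on (question, gold_answer) pairs under one condition."""
    correct = sum(agent(q, condition).strip() == gold for q, gold in questions)
    return correct / len(questions)

# Toy agent that has "memorized" exactly one fact, mimicking the
# parametric-knowledge-only failure mode.
def toy_agent(question: str, condition: Condition) -> str:
    return "1934" if "first issue" in question else "unknown"

accuracy = run_eval(
    toy_agent,
    [("first issue year?", "1934"), ("total 1950 debt?", "257B")],
    Condition.PARAMETRIC_ONLY,
)
print(accuracy)  # 0.5
```

Holding the agent fixed while sweeping `Condition` is what lets the benchmark attribute accuracy differences to document access rather than to the model itself.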


Findings — The performance gap is surprisingly large

The headline result is blunt.

Even frontier AI systems struggle badly with this task.

| Configuration | Average Accuracy |
| --- | --- |
| Parametric knowledge only | <5% |
| With web access | <12% |
| With full document access | 34.1% |
| With structured document parsing | +16.1% relative improvement |

Two conclusions immediately stand out.

1. LLM memory is not enough

Relying on a model’s internal knowledge performs extremely poorly. Enterprise questions require grounded evidence, not memorized facts.

2. Raw PDFs are hostile to AI reasoning

Even when models receive the entire corpus, performance remains low. The bottleneck is not reasoning alone—it is document structure.

When documents are converted into structured formats (using Databricks’ document parsing tools), accuracy improves significantly.

But even then, models remain far from reliable.
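As a sanity check on the numbers above: a +16.1% relative gain on the 34.1% baseline implies roughly 39.6% absolute accuracy. The implied figure is my arithmetic, not a number reported in the paper.

```python
baseline = 34.1        # accuracy with full document access (%)
relative_gain = 0.161  # reported relative improvement from structured parsing

with_parsing = baseline * (1 + relative_gain)
print(round(with_parsing, 1))  # 39.6
```

Even after the single largest intervention in the study, the best configuration still answers most questions incorrectly.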


Implications — The real bottleneck is data structure

The most important insight from the benchmark is subtle but powerful:

Enterprise AI performance depends more on document structure than on model size.

This has several practical implications.

1. Document infrastructure matters

Organizations hoping to deploy AI agents should invest in:

  • document parsing
  • table extraction
  • semantic indexing
  • structured knowledge layers

Without this infrastructure, even the best LLMs underperform.

2. Agent frameworks need better retrieval logic

Multi‑document reasoning requires smarter orchestration between retrieval systems and reasoning models.

This is less about prompting and more about system design.

3. Enterprise benchmarks will become strategic assets

Benchmarks like OfficeQA Pro are valuable because they mirror real workflows.

Expect more domain‑specific evaluation suites for:

  • finance
  • compliance
  • government archives
  • legal research

These benchmarks will increasingly shape how enterprise AI systems are designed.


Conclusion — AI still cannot read the paperwork

Frontier language models are extraordinary generalists.

But when confronted with the dense, messy document ecosystems that define real organizations, their capabilities shrink dramatically.

OfficeQA Pro exposes a reality many enterprise teams already suspect:

the hardest part of enterprise AI is not the model—it is the documents.

Until AI systems can reliably parse, retrieve, and reason across complex document collections, truly autonomous enterprise agents will remain just out of reach.

And ironically, the path forward may involve less model scaling—and more attention to the humble PDF.

Cognaptus: Automate the Present, Incubate the Future.