Opening — Why this matters now
In demos, AI agents look impressively capable. They summarize reports, answer questions, and sometimes even automate workflows. But most demonstrations rely on relatively clean datasets or short context windows.
Real enterprises do not look like that.
Government archives, financial reports, compliance filings, and corporate records are messy, multi‑format, and historically layered. Information is scattered across decades of PDFs, tables, footnotes, and inconsistent layouts.
The question is simple: can modern AI agents actually reason over these documents?
A recent benchmark from Databricks—OfficeQA Pro—suggests the answer is: not yet.
Background — The missing benchmark for enterprise reasoning
Most AI benchmarks focus on tasks like coding, math, or knowledge recall. These are useful but somewhat detached from the daily reality of enterprises.
Enterprise decision‑making typically requires several capabilities at once:
| Capability | Why It Matters |
|---|---|
| Document parsing | Corporate records often exist as PDFs or scanned tables |
| Retrieval | Relevant facts may be spread across multiple documents |
| Numerical reasoning | Financial and policy data frequently involves calculations |
| Cross‑document grounding | Answers depend on evidence across different reports |
Existing benchmarks rarely combine all four.
To address this gap, researchers created OfficeQA Pro, a dataset built around a particularly demanding document corpus: nearly a century of U.S. Treasury Bulletins.
Key dataset characteristics:
| Dataset Property | Value |
|---|---|
| Corpus size | ~89,000 pages |
| Numerical values | 26+ million |
| Time span | ~100 years |
| Questions | 133 complex queries |
These questions require multi‑step reasoning, including reading tables, combining values from different sections, and tracing information across documents.
In other words: exactly the kind of work analysts, auditors, and regulators do every day.
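The shape of these questions can be sketched with a toy example: a value must be located in one table, another value in a separate table, and the two combined. All figures and names below are invented for illustration and are not actual Treasury Bulletin data.

```python
# Toy illustration of OfficeQA Pro-style multi-step reasoning:
# values live in separate tables and must be combined.
# All figures are invented for illustration.

# Table from one bulletin section: outstanding debt by year ($B)
debt_outstanding = {1970: 389.2, 1971: 424.1}

# Table from a different section: interest paid by year ($B)
interest_paid = {1970: 19.3, 1971: 22.0}

def effective_rate(year: int) -> float:
    """Combine values from two separate tables: interest / debt."""
    return interest_paid[year] / debt_outstanding[year]

# A question like "Did the effective interest rate rise from 1970
# to 1971?" needs two lookups per year plus a comparison.
rose = effective_rate(1971) > effective_rate(1970)
print(f"1970: {effective_rate(1970):.4f}")
print(f"1971: {effective_rate(1971):.4f}")
print(f"rate rose: {rose}")
```

Even this trivial version fails if either table is mis-parsed, which is precisely the failure mode the benchmark probes.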
Analysis — What the paper actually tests
The benchmark evaluates full AI agent systems, not just raw language models.
Three major agent frameworks were tested:
| Agent Framework | Model Provider |
|---|---|
| Google Agent | Gemini models |
| OpenAI Agent | Codex / GPT models |
| Anthropic Agent | Claude models |
Each system was tested under several conditions:
- Parametric knowledge only – the model relies purely on training data.
- Web access – the agent can search external sources.
- Direct document access – the model receives the full corpus.
- Structured document parsing – documents are converted into structured representations.
This setup isolates the core challenge: can AI reliably read and reason over enterprise documentation?
Findings — The performance gap is surprisingly large
The headline result is blunt.
Even frontier AI systems struggle badly with this task.
| Configuration | Average Accuracy |
|---|---|
| Parametric knowledge only | <5% |
| With web access | <12% |
| With full document access | 34.1% |
| With structured document parsing | +16.1% relative improvement over full document access |
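Note that the last row is a relative gain, not an absolute accuracy. Assuming the improvement is multiplicative over the 34.1% full-document baseline, it works out as follows:

```python
# The structured-parsing row reports a relative improvement over
# the full-document-access baseline, not an absolute accuracy.
baseline = 0.341        # accuracy with full document access
relative_gain = 0.161   # +16.1% relative improvement

structured_accuracy = baseline * (1 + relative_gain)
print(f"{structured_accuracy:.1%}")  # → 39.6%
```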
Two conclusions immediately stand out.
1. LLM memory is not enough
A model relying purely on its internal knowledge performs extremely poorly. Enterprise questions require grounded evidence, not memorized facts.
2. Raw PDFs are hostile to AI reasoning
Even when models receive the entire corpus, performance remains low. The bottleneck is not reasoning alone—it is document structure.
When documents are converted into structured formats (using Databricks’ document parsing tools), accuracy improves significantly.
But even then, models remain far from reliable.
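The kind of structuring that drives this gain can be as simple as converting raw extracted table text into typed records before a model ever sees it. The sketch below shows only the principle; real pipelines (including Databricks' parsing tools) must also handle merged cells, footnotes, multi-page tables, and OCR noise, and the figures here are invented.

```python
# Minimal sketch: turn raw table text (as a PDF extractor might
# emit it) into typed records an agent can query reliably.
# Figures are invented; real pipelines are far more involved.

raw = """Year  Receipts  Outlays
1950  39,443   42,562
1951  51,616   45,514"""

def parse_table(text: str) -> list[dict]:
    """Split header and rows, strip thousands separators, cast to int."""
    lines = text.strip().splitlines()
    headers = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        rows.append({h: int(c.replace(",", ""))
                     for h, c in zip(headers, cells)})
    return rows

records = parse_table(raw)

# Structured access instead of string matching over raw PDF text:
row_1951 = next(r for r in records if r["Year"] == 1951)
print(row_1951["Receipts"] - row_1951["Outlays"])  # → 6102
```

The difference matters because an agent querying `records` gets a number; an agent scanning raw PDF text gets a string it may misread.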
Implications — The real bottleneck is data structure
The most important insight from the benchmark is subtle but powerful:
Enterprise AI performance depends more on document structure than model size.
This has several practical implications.
1. Document infrastructure matters
Organizations hoping to deploy AI agents should invest in:
- document parsing
- table extraction
- semantic indexing
- structured knowledge layers
Without this infrastructure, even the best LLMs underperform.
2. Agent frameworks need better retrieval logic
Multi‑document reasoning requires smarter orchestration between retrieval systems and reasoning models.
This is less about prompting and more about system design.
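One way to picture the orchestration problem: retrieval must surface the right chunks from a huge corpus before any reasoning happens, and whatever it misses is simply invisible to the model. Below, a bare-bones keyword scorer stands in for real retrieval (BM25, embeddings, reranking); the document IDs and snippets are illustrative, not from the actual corpus.

```python
# Bare-bones sketch of retrieval-before-reasoning orchestration.
# A real system would use BM25 or embedding search plus reranking;
# the point is that retrieval quality bounds what the model can do.
# Document IDs and contents are illustrative only.

corpus = {
    "bulletin_1955_p12": "Table 4 shows gross public debt outstanding ...",
    "bulletin_1955_p13": "Footnote: excludes guaranteed obligations ...",
    "bulletin_1980_p02": "Summary of receipts and outlays for fiscal 1980 ...",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank chunks by keyword overlap with the query (toy scorer)."""
    terms = set(query.lower().split())
    return sorted(
        corpus,
        key=lambda doc_id: len(terms & set(corpus[doc_id].lower().split())),
        reverse=True,
    )[:k]

# The agent reasons only over what retrieval returns; if the
# footnote on page 13 is missed, the answer silently ignores
# the exclusion it describes.
top = retrieve("gross public debt outstanding")
print(top)
```

This is why the benchmark treats the agent system as a whole: a perfect reasoner downstream of weak retrieval still answers from incomplete evidence.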
3. Enterprise benchmarks will become strategic assets
Benchmarks like OfficeQA Pro are valuable because they mirror real workflows.
Expect more domain‑specific evaluation suites for:
- finance
- compliance
- government archives
- legal research
These benchmarks will increasingly shape how enterprise AI systems are designed.
Conclusion — AI still cannot read the paperwork
Frontier language models are extraordinary generalists.
But when confronted with the dense, messy document ecosystems that define real organizations, their capabilities shrink dramatically.
OfficeQA Pro exposes a reality many enterprise teams already suspect:
the hardest part of enterprise AI is not the model—it is the documents.
Until AI systems can reliably parse, retrieve, and reason across complex document collections, truly autonomous enterprise agents will remain just out of reach.
And ironically, the path forward may involve less model scaling—and more attention to the humble PDF.
Cognaptus: Automate the Present, Incubate the Future.