Paperwork is where enterprise AI demos go to lose their charm.
In a product demo, an AI agent usually receives a clean PDF, a friendly question, and a document that has the decency to behave like a document. It summarizes, retrieves, answers, maybe even produces a small spreadsheet. Everyone nods. Someone says “workflow automation.” Someone else says “agentic.” The meeting ends before anyone asks whether the same system can handle 89,000 pages of historical reports, nested tables, revised statistics, scanned pages, ambiguous row headers, and a calculation that must be correct to the last digit.
That less theatrical question is exactly where OfficeQA Pro becomes interesting. The benchmark, introduced by Databricks AI Research, tests AI systems on grounded reasoning over U.S. Treasury Bulletins spanning nearly a century: 89,000 pages, more than 26 million numerical values, and 133 difficult questions requiring document parsing, retrieval, and quantitative analysis.[1]
The result is not “AI is useless.” That would be emotionally satisfying and analytically lazy. The sharper conclusion is that frontier models are increasingly capable, but enterprise document intelligence is not a single capability. It is a chain. Parse the page. Find the right document. Distinguish the latest revised number from an earlier one. Preserve table structure. Retrieve the correct evidence. Run the calculation. Format the answer exactly.
Break one link, and the final answer is wrong.
That is the central lesson of OfficeQA Pro: the model may be smart, but the paperwork is still smarter in all the most annoying ways.
The misconception: file access is not enterprise intelligence
A common belief in enterprise AI is that once a model has tool access, search access, and a large document corpus, it becomes a reliable analyst. The logic sounds plausible. The model can search, read, compute, and reason. What else is missing?
OfficeQA Pro’s answer is: quite a lot.
The benchmark is designed around “grounded reasoning,” meaning the system must answer by using evidence from a large document collection rather than relying on memorized knowledge. This matters because many real enterprise tasks are not closed-book reasoning problems. They are open-archive tasks. The answer exists somewhere, but the difficulty lies in locating it, interpreting it, and transforming it into the requested output.
The U.S. Treasury Bulletin corpus is a useful stress test because it behaves like an enterprise archive rather than a tidy dataset. Reports change format across decades. Older pages are scanned. Later PDFs are more digital-native. Tables contain nested row and column structures. Footnotes alter interpretation. Values may be revised in later issues. A number that looks right may be merely the first plausible wrong number.
That last phrase deserves attention. “First plausible wrong number” is probably the unofficial national language of enterprise AI failure.
OfficeQA Pro filters for questions that cannot be solved by easy memory or shallow search. The Pro split contains 133 difficult questions. Many are numerical. Some require data from multiple bulletins. Some require external information, such as historical CPI or exchange-rate values. Some require visual reasoning over charts. Most go beyond basic arithmetic: the paper notes that 62% require further data analysis, including tasks such as linear regression.
So the benchmark is not asking whether AI can summarize a PDF. It is asking whether AI can behave like a careful analyst working through a messy archive.
That distinction is the article.
What OfficeQA Pro actually tests
The paper tests several levels of assistance, from direct LLM prompting to full agent workflows.
At the simplest level, frontier models are asked to answer using only the question. This tests parametric knowledge: what the model “knows” from training. Unsurprisingly, performance is poor. Exact accuracy at the strict 0.0% error threshold is below 3% for the tested frontier models in the appendix table.
Adding web search helps, but not enough. GPT-5.4 reaches 11.28% accuracy in the web-search-only setting, while Claude Opus 4.6 and Gemini 3.1 Pro Preview remain around 3.01%. The problem is not merely that the information is unavailable online. The problem is that finding adjacent information is not the same as applying the correct source, definition, revision, unit, and calculation.
The paper then moves to more informative conditions. In “oracle page” settings, models receive the exact pages needed to answer the question. This isolates retrieval from reading and reasoning. With raw PDF oracle pages plus web search, model accuracy ranges from 36.09% to 57.14%. With Databricks-parsed oracle pages plus web search, performance improves to 56.39%–65.41%.
That sounds much better until one remembers what “oracle” means. The system has already been handed the right pages. The treasure map has a red circle around the treasure. And the best configuration still leaves roughly one-third of answers wrong.
The agent experiments are closer to the real business problem. Agents are given full access to the corpus and tools such as file search, shell execution, code interpreters, and web search. With the full corpus, supplied either as raw PDFs or as parsed documents, the three frontier agents score:
| Agent / model | Full PDF accuracy | Full parsed accuracy | Operational reading |
|---|---|---|---|
| Claude Opus 4.6 agent | 48.12% | 54.14% | Best raw-PDF performer, but still below reliability |
| GPT-5.4 agent | 36.09% | 56.39% | Strong gain from parsing; cheaper and faster in some settings |
| Gemini 3.1 Pro Preview agent | 18.05% | 29.32% | Much more sensitive to retrieval and parsing quality |
The average full-corpus raw PDF performance is about 34.1%. That is not “ready for autonomous finance operations.” That is “please keep the analyst in the loop, and preferably one who has coffee.”
The strongest full-corpus parsed result reaches 56.39%. The strongest oracle parsed agent result reaches 66.92%. These are meaningful improvements. They are also not close to enterprise-grade reliability for tasks where one wrong value can distort a compliance report, financial analysis, audit trail, or regulatory submission.
The pipeline fails before the model gets to “reason”
The temptation is to interpret OfficeQA Pro as another model leaderboard. Claude versus GPT versus Gemini. Who wins, who loses, who gets the slide title.
That is the least useful reading.
The more valuable reading is pipeline diagnosis. OfficeQA Pro shows that document reasoning failure is distributed across several stages:
| Stage | What can go wrong | Why it matters in business workflows |
|---|---|---|
| Parsing | OCR errors, broken reading order, missing rows, corrupted tables | The model reasons over a damaged representation of the document |
| Retrieval | The agent finds a plausible page, not the right page | Wrong evidence can look convincing when the archive is large |
| Revision handling | The agent uses an earlier value instead of the latest revised value | Historical financial and operational records often change over time |
| Table interpretation | Nested headers and footnotes are lost or flattened badly | A correct number under the wrong header is still wrong |
| External lookup | CPI, exchange-rate, or other supporting values are pulled from inconsistent sources | Small source differences can become large answer errors |
| Calculation | Wrong formula, premature rounding, internal arithmetic errors | Even correct evidence can produce an incorrect final answer |
The first business implication is simple: buying a stronger model is not the same as building a stronger document intelligence system.
Model selection matters. The paper’s custom-agent ablations show that Claude Opus 4.6 achieves the highest correctness among the tested custom-agent models at 57.10%, while other models differ in latency, tool calls, and cost. But model choice does not eliminate the bottlenecks. The system still depends heavily on parsing quality, retrieval strategy, table serialization, and answer verification.
In other words, enterprise AI performance is not located inside the model alone. It is located in the interface between the model and the paperwork.
That interface is currently fragile.
Parsing is not preprocessing; it is the agent’s reality
The paper’s clearest operational finding is that parsing quality materially changes outcomes.
In the baseline agent experiments, using Databricks-parsed documents instead of raw PDFs improves accuracy by 6.0 to 20.3 absolute percentage points, depending on the agent. It also makes agents 4–9 times faster. This is not a cosmetic improvement. It means the agent is no longer burning most of its effort trying to translate a document into something it can use.
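The gains quoted above can be recomputed from the full-corpus agent table. The short Python sketch below does only that arithmetic; the dictionary layout and the function name `parsing_gain` are illustrative choices, not anything defined by the paper.

```python
# Accuracy figures (percent) copied from the full-corpus agent table in this article.
full_corpus = {
    "Claude Opus 4.6": {"raw_pdf": 48.12, "parsed": 54.14},
    "GPT-5.4": {"raw_pdf": 36.09, "parsed": 56.39},
    "Gemini 3.1 Pro Preview": {"raw_pdf": 18.05, "parsed": 29.32},
}


def parsing_gain(scores):
    """Absolute gain, in percentage points, from raw PDFs to parsed documents."""
    return round(scores["parsed"] - scores["raw_pdf"], 2)


gains = {agent: parsing_gain(scores) for agent, scores in full_corpus.items()}

# Mean raw-PDF accuracy across the three agents.
raw_mean = round(
    sum(scores["raw_pdf"] for scores in full_corpus.values()) / len(full_corpus), 2
)

for agent, gain in gains.items():
    print(f"{agent}: +{gain:.2f} points from parsing")
print(f"Mean raw-PDF accuracy: {raw_mean:.2f}%")
```

The smallest and largest gains match the 6.0 to 20.3 point range, and the mean matches the roughly 34.1% raw-PDF average cited earlier.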
The custom parser comparison sharpens the point. Across custom-agent configurations, Databricks’ parser reaches an average accuracy of 50.4%, compared with 38.4% for Docling and 31.1% for unstructured.io. The cost comparison is also notable: the paper reports $178 to parse the full corpus with Databricks’ tool, versus $2,670 for unstructured.io.
For a business reader, the lesson is not “use this specific parser.” The lesson is that parser choice can move the result by more than the difference between some frontier models.
This changes how AI projects should be scoped. Document parsing should not be treated as a low-level engineering detail delegated to the cheapest extraction pipeline. It is a core model-interface decision. If a table loses its header hierarchy, if a scanned number becomes a different number, or if a footnote disappears, the downstream model does not know that the source has been damaged. It simply reasons confidently over the wreckage. Very modern.
OfficeQA Pro’s failure-mode analysis makes this explicit. Baseline agents using original PDFs have a 40–50% failure rate attributable to parsing problems such as misread numbers, corrupted text, and misaligned tables. Even state-of-the-art parsing still leaves errors that propagate into retrieval and calculation.
This is why the phrase “we uploaded the PDFs” should make enterprise teams nervous. Uploading a PDF is not the same as making the document usable.
Retrieval fails when relevance depends on structure, not keywords
Retrieval looks easy when the question contains unique terms. Search for the phrase, read the page, answer.
OfficeQA Pro is designed to break that comfort. The relevant information may be spread across multiple pages or bulletins. Values may recur over time. Tables may contain repeated labels under different sections. A query can retrieve a semantically similar chunk that lacks the table header needed to interpret it.
The paper’s search-tool ablations show this clearly. Standard vector search performs worse than file search on average, with a reported 27% relative drop. That does not mean vector search is bad. It means naive chunk embeddings can detach table fragments from the metadata that gives them meaning.
The authors improve vector search by adding contextual information to chunks: document name, bulletin date and month, page number, page header, and table or section titles. This contextual vector search improves performance by 21% on average over standard vector search, while reducing tool calls, latency, and cost. Combining file search with contextual vector search improves performance further, producing the highest accuracy in two of the three tested model families.
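One way to picture contextual chunking is as a small preprocessing step that prepends provenance metadata to each chunk before it is embedded. The sketch below is a minimal illustration under that assumption. The `Chunk` class, the `contextualize` function, the field names, and all example values are hypothetical; the paper's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A text fragment plus the metadata that gives it meaning (hypothetical schema)."""
    text: str
    document_name: str
    bulletin_date: str
    page_number: int
    page_header: str
    section_title: str


def contextualize(chunk: Chunk) -> str:
    """Prepend provenance metadata so the embedded text carries its own context."""
    header = (
        f"Document: {chunk.document_name}\n"
        f"Bulletin date: {chunk.bulletin_date}\n"
        f"Page: {chunk.page_number}\n"
        f"Page header: {chunk.page_header}\n"
        f"Table or section: {chunk.section_title}\n"
    )
    return header + "---\n" + chunk.text


# All values below are invented placeholders, not real bulletin content.
example = Chunk(
    text="| Item | Amount |\n|---|---|\n| Example row | 123 |",
    document_name="treasury_bulletin_example.pdf",
    bulletin_date="1950-06",
    page_number=12,
    page_header="Example page header",
    section_title="Example table title",
)

# This string, not the bare table fragment, is what would be sent to the embedder.
embedding_input = contextualize(example)
print(embedding_input)
```

The design point is that the table fragment no longer has to stand alone: a query that mentions the bulletin date or table title can now match the chunk that actually holds the number.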
This is a useful corrective to the fashionable belief that vector search is a universal retrieval solvent. It is not. Vector search is useful when the chunk contains enough meaning to stand alone. Enterprise documents often violate that condition. Tables, appendices, footnotes, and repeated forms are relational objects. A row is not meaningful without its column context. A number is not meaningful without a date, unit, and definition.
For business systems, retrieval design should preserve provenance and structure, not merely semantic similarity. The retrieval layer must answer questions such as:
| Retrieval question | Why it matters |
|---|---|
| Which document version produced this value? | Revision history changes answers |
| Which page, table, row, and column contain the evidence? | Table position defines meaning |
| Was this value preliminary, revised, or final? | Financial reports often update later |
| Which external source was used for supporting data? | CPI, exchange-rate, and benchmark values vary by source |
| Can the answer be reproduced from cited evidence? | Auditability is not optional in serious workflows |
This is where “RAG” becomes too vague a word. Retrieval-augmented generation can mean anything from a loose semantic search wrapper to a carefully engineered evidence system. OfficeQA Pro rewards the latter.
Revision awareness is a separate capability
One of the paper’s most important failure modes is temporal revision verification.
Treasury Bulletins form a continuously updated archive. The same statistic may appear in one issue and then be revised in a later issue. An agent must not simply find a number. It must determine whether that number is the correct version of the number for the question.
This is very close to real enterprise behavior. Financial reports are restated. Compliance records are updated. CRM histories are corrected. Inventory snapshots change. Policy manuals are revised. A system that retrieves the first plausible value will fail quietly.
OfficeQA Pro shows that agents frequently converge too early on plausible values. Even when prompted to identify the most recently published figure, they can enter repeated search loops, consume context, lose track of validated intermediate values, and fall back to approximation near step limits.
This is not a “prompting problem” in the narrow sense. It is a workflow-state problem. Revision-aware retrieval needs a designed procedure:
- Identify candidate values.
- Extract document dates and publication context.
- Compare later documents for revisions.
- Track which candidate has been superseded.
- Preserve that reasoning path until final calculation.
An agent may be able to improvise this procedure sometimes. But enterprise reliability requires making it explicit. The system should maintain a revision ledger, not ask the model to remember everything in a long scratchpad that gradually becomes a swamp.
The business translation is direct: if your documents contain revised values, the agent must be designed for temporal reconciliation. Otherwise, it is not doing analysis. It is doing numerically decorated search.
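A revision ledger of the kind described above can be sketched in a few lines of Python. This is an illustrative design, not the paper's: the `Candidate` and `RevisionLedger` names, their fields, and the example values are all assumptions. It records every candidate value with its publication date, returns the most recently published one, and keeps superseded values visible.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One observed value for a statistic, with its publication context."""
    series: str      # which statistic, for example a table row label
    period: str      # the period the value describes
    value: float
    published: date  # date of the issue that printed this value
    source: str      # document and page pointer


class RevisionLedger:
    """Stores every candidate so superseded values are tracked, not forgotten."""

    def __init__(self):
        self._entries = {}

    def record(self, candidate: Candidate) -> None:
        key = (candidate.series, candidate.period)
        self._entries.setdefault(key, []).append(candidate)

    def latest(self, series: str, period: str, as_of: Optional[date] = None) -> Candidate:
        """Most recently published value, optionally restricted to issues up to as_of."""
        candidates = [
            c for c in self._entries.get((series, period), [])
            if as_of is None or c.published <= as_of
        ]
        if not candidates:
            raise KeyError(f"no value recorded for {series!r} in {period!r}")
        return max(candidates, key=lambda c: c.published)

    def superseded(self, series: str, period: str) -> list:
        """Every recorded value except the most recently published one."""
        ordered = sorted(self._entries.get((series, period), []), key=lambda c: c.published)
        return ordered[:-1]


# Invented example values, for illustration only.
ledger = RevisionLedger()
ledger.record(Candidate("example series", "1950", 10.0, date(1951, 1, 1), "issue A, p. 3"))
ledger.record(Candidate("example series", "1950", 12.0, date(1952, 1, 1), "issue B, p. 5"))
print(ledger.latest("example series", "1950").value)
```

Keeping this state outside the model's context means a long search loop cannot silently overwrite a validated value.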
Tables are still hostile territory
The paper’s table-representation ablation looks modest at first glance. HTML table representations slightly outperform hierarchical Markdown overall, winning in 7 out of 11 agent configurations. But the differences are model-dependent. Some Claude variants perform better with hierarchical Markdown, while GPT and Gemini Pro variants generally favor HTML.
This is not a small technical footnote. It tells us that there is no universally optimal table serialization format for all models. The same table can become easier or harder depending on how the model has learned to interpret structure.
Markdown is token-efficient, but it does not naturally represent nested headers. HTML preserves more hierarchy but may cost more tokens or interact differently with model training. Hierarchical Markdown attempts to collapse nested headers into a single header string, which can help but also creates its own distortions.
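To make the tradeoff concrete, the sketch below serializes one small table with two-level column headers in both styles. It is an assumption-laden illustration: the exact hierarchical Markdown format used in the paper is not given here, so joining header levels with " > " is a guess at the general idea, and the table labels and values are invented.

```python
from html import escape

# Two-level column headers as (top level, sub level); labels and values are invented.
columns = [("Receipts", "Budget"), ("Receipts", "Trust"), ("Expenditures", "Budget")]
rows = [("1950", [1.0, 2.0, 3.0])]


def to_hierarchical_markdown(columns, rows, row_label="Period"):
    """Collapse each header path into one string, e.g. 'Receipts > Budget'."""
    headers = [row_label] + [" > ".join(path) for path in columns]
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    for label, values in rows:
        lines.append("| " + " | ".join([label] + [str(v) for v in values]) + " |")
    return "\n".join(lines)


def to_html(columns, rows, row_label="Period"):
    """Keep the hierarchy explicit with colspan on the top-level header cells."""
    groups = []  # consecutive columns sharing a top-level label
    for top, _ in columns:
        if groups and groups[-1][0] == top:
            groups[-1][1] += 1
        else:
            groups.append([top, 1])
    top_row = f'<th rowspan="2">{escape(row_label)}</th>' + "".join(
        f'<th colspan="{span}">{escape(top)}</th>' for top, span in groups
    )
    sub_row = "".join(f"<th>{escape(sub)}</th>" for _, sub in columns)
    body = "".join(
        "<tr><th>" + escape(label) + "</th>"
        + "".join(f"<td>{value}</td>" for value in values) + "</tr>"
        for label, values in rows
    )
    return (
        f"<table><thead><tr>{top_row}</tr><tr>{sub_row}</tr></thead>"
        f"<tbody>{body}</tbody></table>"
    )


markdown_table = to_hierarchical_markdown(columns, rows)
html_table = to_html(columns, rows)
print(markdown_table)
print(html_table)
```

The Markdown version repeats "Receipts" in every collapsed header, while the HTML version states the grouping once but spends more characters on markup.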
For enterprise AI, the practical answer is not to hold a theological debate about Markdown versus HTML. The answer is to test table serialization against the actual task family.
Financial documents, invoices, policy schedules, insurance forms, shipping manifests, procurement tables, and regulatory filings all have different structural pain points. A representation that works for flat transaction logs may fail on multi-level balance-sheet tables. A representation that works for modern digital PDFs may fail on scanned legacy reports.
The paper’s broader message is that table representation is an empirical system-design variable, not a formatting preference.
Calculation errors survive even after the evidence is found
A comforting theory says that once the model retrieves the right evidence, the remaining problem is easy. Just calculate.
OfficeQA Pro does not fully support that comfort.
The paper’s remaining failure modes include wrong formulas, misaligned definitions, premature rounding, and agents performing calculations internally instead of delegating to scripts. These are not exotic errors. They are exactly the errors human analysts make when tired, except the AI does not look tired. It looks like a machine, which somehow makes people forgive it faster.
The benchmark’s strict evaluation is important here. Most answers are numerical, and unless otherwise stated, results are judged at a 0.0% allowable absolute relative error threshold. This is severe, but appropriate. In many enterprise settings, “approximately correct” is not a success condition. If the task asks for a rounded slope and intercept from a regression, the model must retrieve the correct values, run the right regression, and format the output exactly.
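A threshold on absolute relative error is simple to state in code. The function below is a minimal sketch of such a check, not the benchmark's actual scorer; the name `is_correct` and the handling of a zero ground truth are assumptions.

```python
def is_correct(predicted: float, truth: float, tolerance: float = 0.0) -> bool:
    """True when |predicted - truth| / |truth| is within the allowed tolerance."""
    if truth == 0:
        # Relative error is undefined at zero; requiring an exact match is an assumption.
        return predicted == 0
    return abs(predicted - truth) / abs(truth) <= tolerance


# At the strict 0.0% threshold only an exact match passes.
print(is_correct(100.0, 100.0))
print(is_correct(100.01, 100.0))

# A looser 1% threshold would accept a near miss.
print(is_correct(100.5, 100.0, tolerance=0.01))
```

Under the default tolerance, an answer that is off by one hundredth of a percent scores the same as one that is off by a factor of ten.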
The paper also reports that prompt-only models can make reasonable Fermi estimates at looser thresholds. That is intellectually interesting and operationally dangerous. A model that can estimate plausibly may appear useful in a chat interface, but approximate competence is not the same as grounded correctness.
For business use, the implication is clear: numerical agents should externalize computation. Use scripts. Preserve intermediate data. Store calculation traces. Check units. Delay rounding. Validate formulas. Make the final answer reproducible.
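As a hedged illustration of externalized computation, the sketch below fits a least-squares line in a script, keeps full precision until the end, and stores a trace of inputs and results. The data points are invented, and `fit_line` is an ordinary textbook least-squares fit written for this example rather than anything taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical extracted values; not taken from any Treasury Bulletin.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Correct order: compute at full precision, round only when reporting.
slope, intercept = fit_line(xs, ys)

# Premature rounding: the inputs are rounded before the fit.
early_slope, early_intercept = fit_line(xs, [round(y) for y in ys])

# The trace keeps inputs, full-precision results, and the reported figures together.
trace = {
    "inputs": {"xs": xs, "ys": ys},
    "full_precision": {"slope": slope, "intercept": intercept},
    "reported": {"slope": round(slope, 2), "intercept": round(intercept, 2)},
}
print(trace["reported"])
print(round(early_slope, 2), round(early_intercept, 2))
```

With these made-up inputs the full-precision fit reports a slope of 1.94 and an intercept of 0.15, while rounding the inputs first gives 2.0 and 0.0. Under a 0.0% error threshold, that difference is a wrong answer.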
A model should not be trusted because it sounds numerically fluent. Numerically fluent wrongness is still wrongness, just with better diction.
How to read the experiments without overreading them
The paper includes several experiment types. Treating them all as equal “results” would flatten the argument, so it is useful to classify what each part is doing.
| Experiment or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Prompt-only and web-search LLM baselines | Main evidence for insufficiency of memory and shallow search | Frontier models cannot solve this benchmark from knowledge or web search alone | That web search is useless in all workflows |
| Oracle page experiments | Diagnostic isolation of retrieval from reading/reasoning | Even with correct pages, parsing and reasoning remain hard | That oracle performance reflects production performance |
| Full-corpus agent baselines | Main end-to-end enterprise proxy | Agents struggle when they must search, parse, and calculate across a large corpus | That all enterprise corpora are equally difficult |
| Parser comparison | Ablation of document extraction quality | Parsing choice materially affects accuracy, speed, and cost | That one parser is universally best across domains |
| Table representation comparison | Ablation of serialization format | Table format matters, but effects are model-dependent and modest | That HTML or Markdown is always superior |
| Search tool comparison | Ablation of retrieval method | Contextual retrieval and hybrid search improve quality-cost tradeoffs | That vector search alone should be abandoned |
| Test-time scaling / plurality voting | Robustness and variance exploration | Multiple rollouts can improve results, especially for weaker baselines | That voting solves systematic evidence or parsing errors |
| Human annotator study | Comparison with human workflows on a subset | Agents can outperform humans in speed and accuracy when representation is usable | That agents are reliable enough for unsupervised deployment |
This classification matters because the paper is not simply saying “AI agents are bad.” In several controlled settings, agents outperform human annotators on both speed and accuracy. On a 30-question OfficeQA Full subset, average human accuracy with full corpus access is 34.6% at 31.4 minutes, while agents using parsed documents reach 56.7% at 3.5 minutes.
That is a serious result. It means AI agents can already be useful productivity tools for document-heavy analysis. But usefulness is not autonomy. A system can outperform a human average and still be too unreliable for unsupervised financial or compliance work.
That is the uncomfortable middle ground where most enterprise AI actually lives.
The business value is cheaper diagnosis, not magical autonomy
The practical value of OfficeQA Pro is not that every company should copy its exact benchmark. Most firms are not asking agents to analyze Treasury Bulletins from 1940. The value is that the benchmark decomposes enterprise document intelligence into testable failure points.
A firm deploying AI over internal documents should ask:
| Business question | OfficeQA Pro lesson |
|---|---|
| Are our PDFs machine-usable? | Parsing quality can shift accuracy dramatically |
| Can the agent find the right document version? | Revision-aware search must be designed, not hoped for |
| Are tables represented faithfully? | Serialization format affects model comprehension |
| Can calculations be reproduced? | Numerical reasoning needs auditable computation |
| Does retrieval preserve context? | Chunks need document, page, section, and table metadata |
| Is performance measured end-to-end? | Isolated model tests overstate production readiness |
This suggests a more disciplined deployment model.
First, build a small benchmark from real internal workflows. Not 10 toy questions. Not “summarize this policy.” Use tasks where the answer is verifiable and where the current human process is painful. Include revisions, tables, multi-document retrieval, and calculations if those exist in the business.
Second, evaluate the pipeline, not only the model. Run variants with different parsers, retrieval methods, table formats, and calculation controls. The question is not “Which model is best?” The question is “Which combination produces reliable answers at acceptable cost and latency?”
Third, separate answer generation from answer verification. If a system produces a number, require evidence pointers, document dates, extracted values, calculation scripts, and final formatting checks. The goal is not to make the agent sound professional. The goal is to make the answer inspectable.
Fourth, keep humans where the failure cost is high. OfficeQA Pro’s best results are impressive but still below the reliability threshold needed for unsupervised operation in many enterprise contexts. The near-term business value is analyst acceleration, not analyst removal. Yes, that is less exciting than “autonomous agent workforce.” It is also less likely to create a spreadsheet-shaped crater.
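The third step above, separating generation from verification, can be approximated with a record that an answer must fill in before it is accepted. The sketch below is one possible shape for such a check. The `Evidence` and `AnswerRecord` classes, their fields, and the `verification_problems` function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Pointer to one extracted value (hypothetical schema)."""
    document: str
    page: int
    published: str          # document date as printed in the source
    extracted_value: float


@dataclass
class AnswerRecord:
    """Everything a reviewer needs to inspect an answer (hypothetical schema)."""
    question: str
    final_answer: str
    evidence: list = field(default_factory=list)
    calculation_script: str = ""  # the code that produced final_answer
    expected_format: str = ""     # optional regular expression for the output


def verification_problems(record: AnswerRecord) -> list:
    """Return a list of reasons the answer is not yet inspectable; empty means pass."""
    problems = []
    if not record.evidence:
        problems.append("no evidence pointers")
    for index, item in enumerate(record.evidence):
        if not item.document or item.page < 1:
            problems.append(f"evidence {index} lacks a document or page")
        if not item.published:
            problems.append(f"evidence {index} lacks a document date")
    if not record.calculation_script.strip():
        problems.append("no calculation script")
    if record.expected_format and not re.fullmatch(record.expected_format, record.final_answer):
        problems.append("final answer does not match the required format")
    return problems


# Invented example: a bare number with nothing to back it up.
bare = AnswerRecord(question="example question", final_answer="1.9", expected_format=r"\d+\.\d{2}")
print(verification_problems(bare))
```

A check like this says nothing about whether the answer is right. It only ensures that a human or a second system has what it needs to find out.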
What applies beyond Treasury Bulletins, and what may not
The paper’s strongest evidence concerns numerical reasoning over historical U.S. Treasury PDFs. That gives the benchmark an unusually clean evaluation: most answers are numerical, ground truth can be checked automatically, and strict accuracy can be measured.
This is both a strength and a boundary.
The findings should transfer well to domains with similar structure: finance, audit, regulatory reporting, insurance, procurement, legal discovery, public-sector records, and compliance documentation. These domains also contain large document archives, revised values, tables, footnotes, scanned material, and high penalties for small errors.
The findings transfer less directly to tasks where correctness is qualitative, where multiple answers are acceptable, or where the document corpus is already highly structured. A customer-support knowledge base with clean versioning is not the same problem as a century of Treasury Bulletins. A modern data warehouse is not a scanned 1950 table. Different paperwork, different poison.
There is also a benchmark-contamination risk. The paper notes that none of the models referenced the publicly available GitHub dataset containing exact questions, answers, and ground-truth links during web-search experiments. That is reassuring for the reported runs but also a reminder: public benchmarks can eventually become less clean as models and agents learn to search for benchmark artifacts.
Finally, the paper evaluates a particular generation of frontier models and agent frameworks. These will change. The deeper mechanism is more durable than the exact leaderboard. Models will improve, but enterprise document systems will still need faithful parsing, structure-aware retrieval, revision handling, visual interpretation, and auditable computation.
The PDF will not become kind just because the model becomes larger.
The real message: intelligence must be grounded in infrastructure
OfficeQA Pro is useful because it punctures a specific fantasy: that frontier reasoning models become reliable enterprise analysts once given tools and documents.
They do not. They become potentially useful agents operating inside a brittle evidence pipeline.
That distinction should shape how companies invest. If the task is document-heavy, the budget should not go only to model access and prompt engineering. It should go to document parsing, metadata design, retrieval evaluation, table reconstruction, calculation audit trails, and domain-specific benchmarks. These are not glamorous. Neither is plumbing. Buildings still need it.
The paper also suggests a healthier way to talk about enterprise AI progress. The question is not whether AI can “read documents.” That phrase is too broad to be useful. The question is whether a system can perform a specific grounded workflow with known documents, known revision behavior, known output requirements, and measurable error tolerance.
For many firms, the answer today will be: partly, with good infrastructure and human review.
That is not a failure. It is a roadmap.
The next wave of enterprise AI advantage will not come from pretending the document layer is solved. It will come from firms that treat paperwork as a first-class technical object: parsed carefully, indexed structurally, versioned explicitly, and checked numerically.
Because in real organizations, intelligence does not begin with the model.
It begins with whether the system can read the damn table.
Cognaptus: Automate the Present, Incubate the Future.
[1] Databricks AI Research, OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning, arXiv:2603.08655, March 2026.