Search looks easy until someone asks where the answer actually came from.
A researcher types a rough query into a literature assistant. The system retrieves several papers, writes a fluent answer, and appends citations. Everyone relaxes a little. The citation tag has done its small administrative magic. The answer now looks grounded.
That is the dangerous part.
A citation does not prove that the model used the cited text. A retrieved document does not prove that the generated sentence is faithful to that document. And a polished paragraph can still be an invented synthesis with a bibliography attached like a respectable hat.
The paper “ACL-Verbatim: hallucination-free question answering for research” tackles this problem by making a deceptively simple design choice: do not let the model write the final evidence-bearing answer at all. Instead, retrieve research-paper chunks and return verbatim spans from those chunks.[1] The user does not receive a confident miniature essay. The user receives highlighted source text.
This is not the glamorous version of RAG. It is closer to a receipt printer. That is precisely the point.
The key design move is to remove free-form generation from the evidence path
Most RAG systems try to solve hallucination by improving retrieval, adding better citations, or asking the generator to be more faithful. These improvements help, but they leave the final dangerous step intact: a language model still composes the answer shown to the user.
ACL-Verbatim changes the interface. It treats question answering as query-conditioned evidence extraction. Given a user query and a retrieved chunk, the model’s job is not to explain, summarize, or synthesize. Its job is to mark the exact words in the source that answer the query. If the chunk does not contain relevant evidence, the model should return nothing.
The mechanism is therefore not:
query -> retrieve documents -> generate answer with citations
It is closer to:
query -> retrieve chunks -> extract verbatim spans -> show evidence
That small arrow change matters. In conventional RAG, the user must inspect whether the generated answer is supported by the retrieved context. In VerbatimRAG, the output is already constrained to source text. It can still be incomplete, poorly selected, or irrelevant if retrieval or extraction fails. But it cannot hallucinate in the ordinary sense of inventing new factual wording, because the output is copied from the document.
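The "copied from the document" guarantee is cheap to enforce mechanically. A minimal sketch, assuming extractor output arrives as a list of candidate strings: keep a candidate only if it occurs as an exact substring of the chunk, and store its character offsets. The function names `locate_verbatim` and `keep_verbatim_spans` are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple


def locate_verbatim(chunk: str, span: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of span in chunk, or None.

    Only an exact, case-sensitive substring match counts as verbatim.
    """
    if not span:
        return None
    start = chunk.find(span)
    if start == -1:
        return None
    return start, start + len(span)


def keep_verbatim_spans(chunk: str, candidates: List[str]) -> List[Tuple[int, int]]:
    """Keep candidates that occur verbatim in the chunk; silently drop the rest."""
    located = []
    for candidate in candidates:
        offsets = locate_verbatim(chunk, candidate)
        if offsets is not None:
            located.append(offsets)
    return located


chunk = "We train a token classifier. It labels each token as evidence or not."
candidates = [
    "It labels each token as evidence or not.",   # copied from the chunk
    "The classifier labels tokens as evidence.",  # paraphrase, gets rejected
]
spans = keep_verbatim_spans(chunk, candidates)
print([chunk[start:end] for start, end in spans])
```

The check says nothing about whether the surviving span is the right one. It only guarantees that whatever is shown exists in the source, which is exactly the narrow sense of "hallucination-free" discussed here.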
This is why the paper’s “hallucination-free” claim should be read technically, not magically. The system is hallucination-free because it returns source spans verbatim. It is not error-free. A wrong receipt is still a receipt, not a prophecy.
The corpus work is part of the model, not housekeeping
The paper’s first contribution is an application of VerbatimRAG to the ACL Anthology, a large research-paper library in computational linguistics and NLP. This sounds like data plumbing. It is more important than that.
For extractive research QA, the quality of the document pipeline determines what the extractor is even allowed to see. The authors process ACL Anthology metadata from February 26, 2026. They start from 120,034 paper entries, map 114,567 to PDFs, and convert 114,475 PDFs into markdown using Docling. Papers not covered by the permissive ACL Anthology license are discarded, and fewer than 100 mapped PDFs are skipped due to Docling errors.
The conversion step is not neutral. Text, headers, lists, tables, and captions are rendered into markdown. Figures and some formulas may be replaced by placeholders. That means the system is strong where the answer lives in text, captions, or tables that survive conversion. It is weaker where the answer depends on visual figures, mathematical notation, or layout-sensitive material. This is not a minor limitation hiding in the basement. It defines the evidence surface.
The chunking strategy is also domain-aware. Instead of blindly splitting every paper into equal blocks, the system parses research-paper section structure, segments along section boundaries, prefixes section and subsection titles, avoids splitting tables and code blocks, and keeps chunk sizes between 500 and 5,000 characters. The chunks are indexed for both full-text search and dense vector search using IBM’s granite-embedding-english-r2.
That gives ACL-Verbatim a fairly practical lesson: an evidence-first RAG system is not just a better model wrapped around the same old PDFs. It needs paper-aware ingestion, chunking, metadata handling, and indexing. Otherwise the extractor may be asked to highlight evidence that the pipeline has already damaged, separated, or hidden.
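To make the chunking idea concrete, here is a rough sketch of section-aware chunking over converted markdown. It is not the authors' code. The 500 and 5,000 character bounds come from the paper as summarized above; everything else is an assumption for illustration: the helper names, splitting on `#` headings, packing blank-line-separated blocks greedily, and folding a short tail into the previous chunk. Markdown tables survive because they contain no blank lines, but fenced code containing blank lines is not handled, and a single block longer than the maximum is kept whole.

```python
import re
from typing import List, Tuple

MIN_CHARS = 500    # lower bound reported in the paper
MAX_CHARS = 5000   # upper bound reported in the paper


def split_sections(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) pairs at lines that start with '#'."""
    sections: List[Tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            if heading or any(ln.strip() for ln in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or any(ln.strip() for ln in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_section(heading: str, body: str) -> List[str]:
    """Greedily pack blank-line-separated blocks without cutting inside a block.

    Size bounds apply to the body; the heading prefix is added afterwards.
    """
    chunks: List[str] = []
    current = ""
    for block in re.split(r"\n\s*\n", body):
        if current and len(current) + len(block) + 2 > MAX_CHARS:
            chunks.append(current)
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        fits = bool(chunks) and len(chunks[-1]) + len(current) + 2 <= MAX_CHARS
        if len(current) < MIN_CHARS and fits:
            # Assumption: fold a short tail into the previous chunk.
            chunks[-1] = f"{chunks[-1]}\n\n{current}"
        else:
            chunks.append(current)
    # Prefix the section title so each chunk carries its own context.
    return [f"{heading}\n\n{c}" if heading else c for c in chunks]


doc = "# Intro\n\n" + "a" * 3000 + "\n\n" + "b" * 3000 + "\n\n## Method\n\nshort text"
chunks = [c for heading, body in split_sections(doc) for c in chunk_section(heading, body)]
print([len(c) for c in chunks])
```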
The benchmark asks whether the system can abstain, not just find nice sentences
The paper’s second contribution is a small but carefully designed gold benchmark for span-level evidence extraction.
The authors sample 333 English-language ACL papers, retrieve chunks, and generate synthetic queries following the ScIRGen methodology. Their pipeline first generates possible question types for a chunk, then generates questions for those types, and finally rewrites long academic questions into shorter, fragmented search-style queries. That last step is useful because real users rarely type elegant benchmark questions. They type things like they are arguing with a search bar at midnight.
Generated queries are then run through the retrieval system. For each query, the top five retrieved chunks become annotation candidates. Annotators first decide whether a chunk is relevant at all. If it is relevant, they highlight the span or spans that best satisfy the query. If a table or figure matters, annotators highlight the caption.
The resulting manually annotated benchmark is small: 20 synthetic queries, five retrieved chunks per query, and therefore 100 query–chunk pairs. Of those 100 rows, 47 chunks are relevant and contain 78 gold evidence spans. The remaining 53 chunks are irrelevant and should produce no extracted span.
That last number is crucial. More than half the benchmark rows are negative examples. The extractor is not merely asked, “Can you find the useful sentence?” It is also asked, “Can you resist highlighting plausible-but-wrong text?”
This is the right pressure for RAG. In retrieval-augmented systems, the top retrieved chunk is often close to the topic but not actually responsive to the query. Vocabulary overlap is cheap. Relevance is expensive.
The paper gives a good annotation example from parsing research. A query about “parsing merge predicate sequence equivalence conditions” retrieves several chunks from syntactic parsing papers. Some look relevant by terminology. Deciding which spans actually answer the query requires reading and domain judgment. In one case, a 4,700-character algorithm description is reduced to a 92-character sentence. In another, most of a 1,902-character introduction remains relevant.
This example is not just colorful. It explains why the benchmark is small. High-quality extraction labels for research papers are labor-intensive, subjective, and domain-dependent. The paper’s benchmark is therefore best read as a focused diagnostic set, not a broad leaderboard carved into stone.
The best model wins by being selective, not by sounding smarter
The extraction experiments compare three families of systems.
First, the authors test LLM-based span extractors: Mistral Small 2603, Nemotron-120B-A12B, GLM-5, and Qwen 3.6 35B. These models receive a query and retrieved chunk, then return verbatim spans. For several models, the paper compares a default extraction prompt with a paragraph-oriented prompt that encourages broader supporting passages.
Second, the paper evaluates existing pruning and highlighting baselines: Zilliz Semantic Highlight and Provence. These are relevant because they represent a neighboring idea: remove irrelevant context or highlight important text before feeding or showing it elsewhere.
Third, the authors train a compact student model. It is a query-conditioned token classifier over an 8,192-token ModernBERT backbone. The input is the question plus chunk; the output is a binary evidence label per token, decoded back into character spans. The model is trained on silver supervision generated from the ACL corpus, using Qwen 3.6 35B with the paragraph-oriented extraction prompt as the teacher. The final released ACL-specialized student is based on the GTE-reranker ModernBERT backbone and has about 150M parameters.
The main result is straightforward, but the interpretation is not.
| Model / configuration | Word precision | Word recall | Word F1 | Latency per row |
|---|---|---|---|---|
| ACL-Verbatim ModernBERT student | 65.43 | 45.43 | 53.63 | 0.47s |
| GLM-5, default prompt | 44.50 | 53.80 | 48.71 | 1.04s |
| Mistral Small 2603, default prompt | 40.41 | 55.99 | 46.94 | 0.78s |
| Qwen 3.6 35B, paragraph prompt | 39.43 | 57.35 | 46.73 | 1.20s |
| Generic multi-domain ModernBERT model | 63.00 | 36.58 | 46.29 | 0.40s |
| Provence | 27.58 | 45.70 | 34.40 | 2.40s |
| Zilliz Semantic Highlight | 46.97 | 22.11 | 30.07 | 1.04s |
The ACL-specialized ModernBERT student achieves the best word-level F1: 53.63. It also has the highest word-level precision among the reported systems, 65.43, while running faster than most evaluated LLM extractors and using far fewer parameters.
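As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the reported precision and recall columns. How the paper aggregates scores across rows is not described here, so treat a match as a sanity check on the numbers, not as a reconstruction of the authors' evaluation code.

```python
# (precision, recall, reported F1) copied from the table above, in percent.
rows = {
    "ACL-Verbatim ModernBERT student": (65.43, 45.43, 53.63),
    "GLM-5, default prompt": (44.50, 53.80, 48.71),
    "Mistral Small 2603, default prompt": (40.41, 55.99, 46.94),
    "Qwen 3.6 35B, paragraph prompt": (39.43, 57.35, 46.73),
    "Generic multi-domain ModernBERT model": (63.00, 36.58, 46.29),
    "Provence": (27.58, 45.70, 34.40),
    "Zilliz Semantic Highlight": (46.97, 22.11, 30.07),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1(p, r):.2f}, reported {reported:.2f}")
```

Every recomputed value agrees with the reported column to two decimals, so the ranking really does come down to how each system trades precision against recall.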
But the important operational result is not merely “small model beats large model.” That headline is tempting, and therefore slightly too easy.
The better interpretation is that the student model behaves more like a filter. It is willing to abstain. On the 100 benchmark chunks, 53 have no gold evidence spans. The ACL-specialized student predicts no spans for 60 chunks. By contrast, the paragraph-based Mistral extractor abstains only 35 times. The latter has higher recall, but it extracts more text from chunks that should have been left alone.
This is the central trade-off:
| Extractor behavior | Looks good when | Fails when | Business meaning |
|---|---|---|---|
| High recall, broad passage extraction | The relevant answer is present and the user wants generous context | Retrieved chunks are topically close but not actually responsive | More review burden and more false confidence |
| High precision, frequent abstention | The system must filter noisy retrieval results | Users expect every query to produce something | Better audit posture, but less satisfying as a chatbot |
| Exact span extraction | The user needs traceable evidence | The answer requires synthesis across documents | Good evidence layer, incomplete answer layer |
For research QA, precision and abstention have special value. A system that highlights too much is not just verbose; it transfers the verification burden back to the user. The whole point was to avoid making the researcher fact-check a polite machine with a citation habit.
Word-level F1 is the right main metric because span boundaries are negotiable
The paper evaluates extraction primarily with word-level precision, recall, and F1. This is a sensible choice for the task.
Exact span matching would punish small boundary differences too harshly. If the gold answer consists of two relevant spans separated by a short irrelevant phrase, and the model highlights the whole surrounding passage, exact span scoring may treat the output as completely wrong. But for a user reading evidence, that prediction may still be mostly useful.
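The scenario above can be made concrete with a small sketch of word-level scoring. Several choices here are assumptions, not the paper's stated protocol: words are whitespace-delimited, a word counts as selected if any of its characters fall inside a span, and a row with nothing predicted or nothing gold scores 0.0. The names `words_in_spans` and `word_prf` are hypothetical.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def words_in_spans(text: str, spans: List[Span]) -> Set[int]:
    """Indices of whitespace-delimited words that overlap any span."""
    selected = set()
    for index, match in enumerate(re.finditer(r"\S+", text)):
        if any(match.start() < end and start < match.end() for start, end in spans):
            selected.add(index)
    return selected


def word_prf(text: str, gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for one query-chunk row."""
    gold_words = words_in_spans(text, gold)
    pred_words = words_in_spans(text, pred)
    overlap = len(gold_words & pred_words)
    precision = overlap / len(pred_words) if pred_words else 0.0
    recall = overlap / len(gold_words) if gold_words else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "alpha beta gamma delta epsilon zeta"
beta, delta = text.index("beta"), text.index("delta")
gold = [(beta, beta + 4), (delta, delta + 5)]  # two spans split by "gamma"
pred = [(beta, delta + 5)]                     # one span covering all three words
print(word_prf(text, gold, pred))
```

Here the prediction picks up one extra word, so precision is 2/3, recall is 1.0, and F1 is 0.8. Exact span matching would score the same prediction as a complete miss, since the predicted span equals neither gold span.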
The authors therefore also report containment and coverage metrics. Containment asks how much of the predicted span lies inside gold evidence. Coverage asks how much of the gold evidence is covered by predictions. These metrics expose different preferences. A lawyer might prefer high containment: do not highlight unsupported material. A researcher doing exploratory reading might tolerate broader coverage: give me the paragraph, I can inspect it.
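One plausible character-level reading of these two definitions is sketched below. The paper's exact formulas are not reproduced in this article, so the functions `containment` and `coverage`, the interval merging, and the 0.0 convention for empty inputs are all assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans so no character is counted twice."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(spans: List[Span]) -> int:
    """Number of characters covered by a list of spans."""
    return sum(end - start for start, end in merge(spans))


def overlap_length(a: List[Span], b: List[Span]) -> int:
    """Number of characters covered by both span lists."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in merge(a)
        for s2, e2 in merge(b)
    )


def containment(gold: List[Span], pred: List[Span]) -> float:
    """Share of predicted characters that lie inside gold evidence."""
    pred_len = total_length(pred)
    return overlap_length(gold, pred) / pred_len if pred_len else 0.0


def coverage(gold: List[Span], pred: List[Span]) -> float:
    """Share of gold characters that the predictions cover."""
    gold_len = total_length(gold)
    return overlap_length(gold, pred) / gold_len if gold_len else 0.0


gold = [(100, 200)]
broad = [(50, 250)]    # paragraph-style highlight around the evidence
narrow = [(120, 160)]  # conservative highlight inside the evidence
print(containment(gold, broad), coverage(gold, broad))
print(containment(gold, narrow), coverage(gold, narrow))
```

The broad highlight gets full coverage but only half containment; the narrow one gets full containment but covers 40% of the gold evidence. That is the lawyer-versus-explorer preference in two numbers.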
The appendix’s detailed model comparison is best interpreted as diagnostic support for this trade-off. It is not a second thesis. The LLMs with paragraph prompts often improve coverage because they select broader passages. But broader passages reduce precision and containment. Again, the question is not whether a model can produce more highlighted text. The question is whether it can highlight the right text and leave the wrong chunks blank.
The threshold sweep is an implementation choice, not a robustness miracle
The student model is a binary token classifier. At inference time, token probabilities must be converted into span decisions. That requires a threshold. The appendix reports a small threshold sweep for the GTE-reranker student while keeping post-processing fixed: spans shorter than 10 characters are dropped, and neighboring spans separated by at most 20 characters are merged.
| Threshold | Word precision | Word recall | Word F1 | Likely purpose |
|---|---|---|---|---|
| 0.2 | 0.654 | 0.454 | 0.536 | Selected operating point for best F1 |
| 0.3 | 0.667 | 0.421 | 0.516 | Higher precision, lower recall |
| 0.4 | 0.678 | 0.403 | 0.506 | More conservative extraction |
| 0.5 | 0.701 | 0.380 | 0.493 | Highest precision in sweep, weaker F1 |
This is a sensitivity and operating-point test. It supports the model configuration used in the main table, and it shows the expected precision–recall trade-off. It does not prove the model is robust across domains, query styles, or annotation policies.
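The decoding step can be sketched as follows, assuming per-token character offsets and evidence probabilities are already available. The 10-character minimum and 20-character merge gap are the values reported in the appendix; the function name `decode_spans`, the `>=` comparison, and the order of operations (merge first, then drop short spans) are assumptions, since the article does not state them.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def decode_spans(
    offsets: List[Span],
    probs: List[float],
    threshold: float = 0.2,
    min_chars: int = 10,
    max_gap: int = 20,
) -> List[Span]:
    """Turn per-token evidence probabilities into character spans."""
    # 1. Runs of consecutive tokens at or above the threshold become raw spans.
    raw: List[Span] = []
    current: Optional[Span] = None
    for (start, end), prob in zip(offsets, probs):
        if prob >= threshold:
            current = (current[0], end) if current else (start, end)
        elif current:
            raw.append(current)
            current = None
    if current:
        raw.append(current)

    # 2. Merge neighboring spans separated by at most max_gap characters.
    merged: List[Span] = []
    for start, end in raw:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # 3. Drop spans shorter than min_chars (assumed to happen after merging).
    return [(start, end) for start, end in merged if end - start >= min_chars]


# Hypothetical token offsets and probabilities for one chunk.
offsets = [(0, 5), (6, 12), (13, 20), (21, 24), (25, 40), (41, 45)]
probs = [0.9, 0.8, 0.1, 0.05, 0.35, 0.6]
print(decode_spans(offsets, probs, threshold=0.2))
print(decode_spans(offsets, probs, threshold=0.5))
```

On this toy input, the 0.2 threshold yields one merged span covering characters 0 to 45, while 0.5 keeps only characters 0 to 12: the later fragment shrinks to four characters, sits too far away to merge, and is dropped. Same model, same probabilities, different amount of highlighted text.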
For business deployment, however, this threshold matters. A compliance team may prefer a higher threshold and accept more abstentions. A research discovery tool may prefer a lower threshold and more coverage. The model is not merely “accurate” or “inaccurate.” It has a policy knob.
That knob should be exposed deliberately. Hiding it behind a cheerful chatbot interface would be a very 2024 way to create a 2026 audit problem.
The prompts reveal the philosophical split between extraction and answering
The paper’s appendix includes the extraction prompts used for LLM-based span extraction. The default prompt asks models to extract exact verbatim text spans that answer the question, never paraphrase, preserve original wording, and return empty arrays when no relevant information exists. The paragraph-style prompt asks for complete supporting passages, including setup sentences, directly relevant sentences, interpretation sentences, and table captions where relevant.
This prompt comparison is an implementation detail with conceptual value. It shows that “extractive QA” is not one behavior. There are at least two plausible extraction styles:
- Minimal evidence extraction: mark only the words or sentences directly needed.
- Passage support extraction: mark a broader paragraph that helps justify the answer.
The paper’s results suggest that broader passage prompts can increase recall and coverage, but they also pull in more irrelevant text. In an ordinary summarization task, extra context may feel harmless. In an evidence interface, extra context is noise with official formatting.
This is where the paper’s mechanism-first contribution becomes clearest. ACL-Verbatim is not just asking LLMs to behave nicely. It is defining a narrower output contract: evidence must be copied, bounded, and optionally absent.
What the paper directly shows, and what business users should infer carefully
The paper directly shows that, on a 100-row ACL research-paper extraction benchmark, a roughly 150M-parameter ModernBERT token classifier trained on silver supervision outperforms the evaluated LLM extractors on word-level F1. It also shows that the model’s stronger precision and abstention behavior are practically important for filtering irrelevant retrieved chunks.
Cognaptus’ business inference is broader, but should remain disciplined: in high-stakes knowledge systems, the evidence layer and the explanation layer should be separated.
| Paper component | What the paper shows | Business interpretation | Boundary |
|---|---|---|---|
| Verbatim span output | Answers can be constrained to source text | Use extraction as an audit layer before generation | Exact spans may still be incomplete or wrongly selected |
| ACL markdown corpus | Large paper collections can be converted and indexed for extraction | Enterprise document QA needs ingestion engineering, not only prompting | Figures, formulas, and layout-sensitive evidence may degrade in conversion |
| Gold benchmark | Expert annotation can evaluate query–chunk relevance and spans | Build domain-specific evidence benchmarks before trusting a RAG system | The benchmark is small and focused on ACL papers |
| Silver-trained student | A compact model can outperform evaluated LLM extractors on this task | Specialized extractors may reduce cost and latency | Silver supervision may inherit teacher-model bias |
| Threshold tuning | Precision and recall can be shifted by operating point | Different business workflows need different abstention policies | The sweep is small and benchmark-specific |
In practice, this points toward a three-layer architecture for enterprise RAG:
Layer 1: Retrieval
Find candidate documents and chunks from trusted sources.
Layer 2: Evidence extraction
Highlight exact spans, abstain on irrelevant chunks, and store evidence traces.
Layer 3: Optional synthesis
Generate summaries only after evidence is visible, logged, and inspectable.
The third layer can still use an LLM. The difference is governance. The LLM should not be the first component that decides what the evidence is. It should consume an already visible evidence set, and its output should remain downstream from a source-bound extraction step.
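The layering can be sketched as a small orchestration function in which synthesis is optional and only ever sees evidence that has already been extracted and recorded. Everything here is hypothetical scaffolding: the `Evidence` record, the `answer_with_evidence` function, and the toy retriever and extractor stand in for real components and are not part of the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Evidence:
    doc_id: str
    start: int  # character offset into the chunk
    end: int
    text: str   # always chunk[start:end], never generated


Retriever = Callable[[str], List[Tuple[str, str]]]          # query -> (doc_id, chunk)
Extractor = Callable[[str, str], List[Tuple[int, int]]]     # (query, chunk) -> spans
Synthesizer = Callable[[str, List[Evidence]], str]


def answer_with_evidence(
    query: str,
    retrieve: Retriever,
    extract: Extractor,
    synthesize: Optional[Synthesizer] = None,
) -> Tuple[List[Evidence], Optional[str]]:
    evidence: List[Evidence] = []
    for doc_id, chunk in retrieve(query):                   # Layer 1: retrieval
        for start, end in extract(query, chunk):            # Layer 2: extraction
            evidence.append(Evidence(doc_id, start, end, chunk[start:end]))
    summary = None
    if synthesize is not None and evidence:                 # Layer 3: optional
        summary = synthesize(query, evidence)
    return evidence, summary


# Toy stand-ins so the sketch runs end to end.
CORPUS = [
    ("paper-a", "We fine-tune ModernBERT. The threshold is tuned on held-out data."),
    ("paper-b", "We study parsing. Results improve."),
]


def toy_retrieve(query: str) -> List[Tuple[str, str]]:
    return CORPUS


def toy_extract(query: str, chunk: str) -> List[Tuple[int, int]]:
    """Return offsets of sentences containing every query word; else abstain."""
    words = query.lower().split()
    spans = []
    for match in re.finditer(r"[^.]+\.", chunk):
        sentence = match.group()
        if all(word in sentence.lower() for word in words):
            lead = len(sentence) - len(sentence.lstrip())
            spans.append((match.start() + lead, match.end()))
    return spans


evidence, summary = answer_with_evidence("threshold", toy_retrieve, toy_extract)
print(evidence, summary)
```

Because `Evidence.text` is sliced from the chunk, the audit trail is complete before any generator is called. Passing no `synthesize` function is a valid configuration, not a degraded one.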
This matters for legal research, compliance monitoring, medical literature review, technical support knowledge bases, due diligence, and internal policy QA. In these settings, the user often does not need a charming answer. The user needs to know which sentence in which document supports the next decision.
A cited generated answer says, “Trust me; I read something.” A highlighted span says, “Here is the thing.” The second is less theatrical. It is also easier to audit.
The result stops before full research intelligence begins
The paper’s limitations are not cosmetic. They define the scope of adoption.
First, the gold benchmark contains only 100 query–chunk pairs. That is enough for a focused extraction study, but not enough to conclude that the approach dominates across all research QA conditions. The authors acknowledge that the annotation task is complex and that the benchmark did not include rigorous agreement measurement and adjudication across a broader annotator pool.
Second, the benchmark is built from synthetic queries. The authors make those queries more search-like through rewriting, but real users have their own habits: abbreviations, malformed intent, missing context, and occasionally the kind of query that looks like it fell down stairs. Deployment should test real user logs where privacy and governance allow.
Third, the benchmark evaluates extraction on retrieved chunks. It does not prove that retrieval always finds the right chunk. A perfect extractor cannot highlight evidence that retrieval failed to bring into view. In production, retrieval quality, chunking, metadata filtering, document permissions, and freshness all remain part of the risk surface.
Fourth, the document conversion process may lose some content. Figures and some formulas can be replaced by placeholders. For fields where visual evidence, equations, or table structure carry the answer, extraction from markdown text alone may underperform.
Fifth, the student model learns from silver supervision generated by an LLM teacher. That is economically attractive, but it can propagate teacher biases or systematic extraction preferences. Silver data is useful. It is not holy water.
Finally, extractive evidence is not the same as synthesized understanding. Many business questions require comparing multiple documents, resolving contradictions, computing implications, or translating evidence into a decision. ACL-Verbatim is strongest as an evidence layer. It should not be mistaken for the entire analyst.
The practical lesson is to design RAG around evidence, not prose
The paper’s most useful idea is not that ModernBERT beat several LLM extractors on a small benchmark. That is interesting, but it is not the strategic lesson.
The strategic lesson is architectural: if the business risk is hallucinated evidence, then the answer pipeline should not begin by asking a generative model to write. It should begin by forcing the system to show what it found.
This reverses the usual RAG product instinct. Many teams start with a chatbot and add retrieval, citations, and safety checks around it. ACL-Verbatim suggests starting with the evidence surface: corpus conversion, chunking, retrieval, span extraction, abstention, and threshold policy. Only after that should generation enter, and even then as a downstream summarizer rather than the owner of truth.
That is less dazzling than a conversational assistant that speaks in perfect consultant prose. It is also more honest. In serious knowledge work, the receipt is not a decoration after the answer. The receipt is the answer’s license to exist.
Cognaptus: Automate the Present, Incubate the Future.
[1] Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, and Ádám Kovács, “ACL-Verbatim: hallucination-free question answering for research,” arXiv:2605.21102v1, 20 May 2026, https://arxiv.org/abs/2605.21102.