A receipt is useful because it tells you what was bought, where, and when. It does not prove the product was good. It does not prove the cashier understood economics. It certainly does not prove the shop was honest.
Citations in enterprise AI have a similar problem.
A support chatbot that says “according to [1]” looks more trustworthy than one that simply improvises. A compliance assistant that appends source markers feels less reckless than one that delivers uncited confidence. A multilingual knowledge assistant that can cite sources in English and Hindi looks like a serious operational system rather than a demo with subtitles.
But the paper Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs makes the uncomfortable point hiding under that neat product story: citation markers are not the same thing as grounding.¹ They can be evidence trails. They can also be decorative receipts.
That distinction matters because the paper’s headline result is genuinely striking. In its experiments, citation-grounded supervised fine-tuning reduces hallucination to 0.0% for encoder-decoder models under automatic NLI-based evaluation from Stage 2 onward. That sounds like a vendor slide trying to win procurement by dazzling a spreadsheet.
The paper is more interesting than the headline. Its real value is not “zero hallucination.” It is the mechanism chain showing how citation behavior is learned, where it transfers across languages, why reinforcement learning adds little after strong supervised fine-tuning, and how models can learn the appearance of citation without causally relying on the cited source.
So the practical lesson is not “citations solve hallucination.” That would be too convenient, and convenience is where AI governance usually goes to nap. The lesson is narrower and more useful: if a business wants citation-grounded AI, it needs to train the behavior, test the grounding, and avoid mistaking citation format for source-conditioned reasoning.
XKD-Dial trains trust as a sequence of skills, not a single virtue
The paper proposes XKD-Dial, a progressive four-stage pipeline for English-Hindi knowledge-grounded dialogue. The task is simple to describe: the model receives a user query and numbered knowledge passages, then must generate a response that is fluent, factually consistent with the passages, and explicitly linked to them using citation markers such as [1] or [2].
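To make the task format concrete, here is a minimal Python sketch of the input and output contract described above. The prompt layout and the names `build_prompt`, `cited_ids`, and `has_valid_citations` are illustrative assumptions, not the paper's actual template.

```python
import re

# Matches inline citation markers such as [1] or [2].
CITATION_RE = re.compile(r"\[(\d+)\]")


def build_prompt(query: str, passages: list[str]) -> str:
    """Number the knowledge passages from 1 and append the user query.

    The exact layout is a hypothetical template for illustration.
    """
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Knowledge:\n{numbered}\n\nUser: {query}\nAssistant:"


def cited_ids(response: str) -> set[int]:
    """Return the set of passage numbers cited in a response."""
    return {int(m) for m in CITATION_RE.findall(response)}


def has_valid_citations(response: str, num_passages: int) -> bool:
    """True if the response cites at least one passage and none out of range."""
    ids = cited_ids(response)
    return bool(ids) and all(1 <= i <= num_passages for i in ids)
```

A check like `has_valid_citations` only verifies format. It says nothing about whether the cited passage actually supports the claim, which is the distinction the rest of the article turns on.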
The implementation is less simple. The authors evaluate six models across encoder-decoder and decoder-only families, from Flan-T5-Base at 250M parameters to Mistral-7B. They build a bilingual dataset by combining DSTC9, FaithDial, and Wizard of Wikipedia, then translating English examples into Hindi with citation markers preserved. The resulting dataset contains 135,000 training examples, 7,500 validation examples, and 7,500 test examples, with a roughly balanced English-Hindi split.
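One way to keep citation markers intact through machine translation is to split the text on the markers and translate only the segments between them. The sketch below assumes a caller-supplied `translate` function (hypothetical; any translation backend could be plugged in) and is not the authors' actual preprocessing code.

```python
import re
from typing import Callable

# The capturing group makes re.split keep the markers in the result list.
MARKER_SPLIT = re.compile(r"(\[\d+\])")


def translate_preserving_markers(text: str, translate: Callable[[str], str]) -> str:
    """Translate the text between citation markers, copying markers verbatim."""
    out = []
    for part in MARKER_SPLIT.split(text):
        if MARKER_SPLIT.fullmatch(part):
            out.append(part)             # citation marker: never sent to the translator
        elif part.strip():
            out.append(translate(part))  # ordinary text segment
        else:
            out.append(part)             # empty or whitespace-only segment
    return "".join(out)
```

The trade-off is that each segment is translated without its neighbors, so sentence-level context can be lost around a marker. A placeholder-token approach would keep context but risks the translator mangling the placeholder.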
The training pipeline is best read as a curriculum:
| Stage | What is trained | Main role in the mechanism | Business interpretation |
|---|---|---|---|
| Stage 1 | English-Hindi translation adaptation | Gives the model bilingual representational footing | Prepare localization before asking for governed answers |
| Stage 2 | English citation-grounded dialogue SFT | Teaches the model to answer with source-linked claims | Build the core compliance behavior directly into generation |
| Stage 3 | Bilingual citation-grounded dialogue SFT | Transfers and strengthens citation behavior in Hindi while retaining English | Localize the workflow, not just the text |
| Stage 4 | GRPO alignment with citation-aware rewards | Attempts to refine factuality and citation quality | Expensive optimization layer after the basic behavior exists |
This sequence is important. Many enterprise AI conversations quietly pretend that “trustworthiness” is a property that can be added at the end: take a capable model, connect retrieval, add a system prompt, install a policy checker, and hope the machine becomes responsible. XKD-Dial instead treats trust-like behavior as a staged training problem.
That is the right framing. A model cannot reliably cite sources in Hindi if it has not first built enough Hindi capability. It cannot reliably ground claims if citation format is only an output decoration. And reinforcement learning has little to optimize if the supervised examples have already taught the model the required structure.
The paper’s contribution is therefore less about one miraculous metric and more about sequencing. It asks: which stage gives the model which skill, and which apparent improvement is merely format compliance wearing a lab coat?
Stage 2 is the phase transition: supervised fine-tuning teaches the answer contract
The most dramatic movement happens at Stage 2, where the models are fine-tuned on English knowledge-grounded dialogue with inline citations. This stage teaches three things at once: answer style, citation placement, and the expectation that claims should be tied to numbered passages.
For encoder-decoder models, the effect is sharp. In the overall results, Flan-T5-Base moves from weak baseline generation to Stage 2 scores of BLEU 0.094, ROUGE-L 0.388, Citation-F1 0.859, BERTScore 0.739, and hallucination 0.000. Flan-T5-Large similarly reaches Citation-F1 0.901 and hallucination 0.000 at Stage 2. Flan-T5-XL collapses at Stage 2, which we will return to, but recovers at Stage 3.
The English-specific results make the capacity story even more pointed. After Stage 2, Flan-T5-Base and Flan-T5-Large both reach English BLEU 0.172, BERTScore 0.889, and Citation-F1 0.980, with hallucination at 0.000. A 250M model and a 780M model converge on the same English performance once the task is sufficiently structured.
That should annoy anyone selling scale as a universal substitute for task design. Good. Annoyance is sometimes the beginning of better budgeting.
The mechanism is not mystical. Citation-grounded SFT converts the output space from “produce a plausible answer” into “produce a supported answer with explicit references.” The model is no longer merely learning what fluent responses look like. It is learning a contract: claims belong next to source markers.
This does not prove every generated claim is true. The hallucination metric is automatic and NLI-based, not human adjudication. But the pattern is still operationally meaningful. It suggests that for structured enterprise tasks—support answers, policy explanations, internal knowledge-base assistants, regulated workflow summaries—well-designed SFT can be a stronger first investment than more elaborate alignment machinery.
The business inference is not “use small models everywhere.” It is more precise: when the output format is narrow, repetitive, and source-bound, smaller models can become economically attractive if the training examples encode the task contract clearly enough.
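For readers who want to see what a Citation-F1 number summarizes, here is a minimal sketch. It assumes a set-based definition, comparing the passage numbers a response cites against the passage numbers a reference answer cites. The paper's exact definition may differ, so treat this as an illustration of the idea rather than a reimplementation.

```python
def citation_f1(predicted: set[int], gold: set[int]) -> float:
    """Harmonic mean of citation precision and recall over passage numbers.

    Assumption: citations are compared as sets of passage ids per response.
    """
    if not predicted and not gold:
        return 1.0  # nothing to cite and nothing cited
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a response that cites [1] and [2] when only [1] is warranted gets precision 0.5 and recall 1.0. A response with no markers at all scores zero whenever the reference cites something.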
Stage 3 shows localization is a training problem, not a translation button
Stage 3 extends the system to bilingual citation-grounded dialogue. The authors use a Hindi-weighted mixture of English and Hindi training examples, with English examples retained as a replay buffer to reduce catastrophic forgetting.
This stage matters because multilingual AI often gets treated as a post-processing layer. Translate the input, run the English system, translate the output, and pretend the governance problem did not multiply. Elegant. Also lazy.
The Hindi results show why that view is too thin. Stage 3 gives the largest Hindi improvement. For Flan-T5-Base, Hindi ROUGE-1 jumps from 0.481 at Stage 2 to 0.691 at Stage 3; Hindi Citation-F1 rises from 0.718 to 0.812. For Gemma-2-2B, Hindi ROUGE-1 reaches 0.719 and Citation-F1 reaches 0.812 at Stage 4, comparable to the encoder-decoder models on citation quality.
The paper also shows cross-lingual transfer before full bilingual training. For Flan-T5-Base, Stage 2 is English-only, yet Hindi Citation-F1 improves from 0.485 after Stage 1 to 0.718 after Stage 2. The citation pattern appears to behave partly as a language-agnostic structural skill. Once the model learns that claims can be bound to markers like [1], some of that behavior transfers across languages.
But Stage 3 still matters. Structural transfer is not the same as full linguistic competence. Hindi examples improve Hindi response quality, and the model needs language-specific practice to make citation behavior useful rather than merely syntactically present.
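The Hindi-weighted mixture with an English replay buffer can be sketched as a simple sampling routine. The 70/30 ratio below is a hypothetical value chosen for illustration; the article only establishes that the mixture is Hindi-weighted and that English examples are retained.

```python
import random


def mix_with_replay(hindi, english, hindi_fraction=0.7, size=None, seed=0):
    """Sample a training mixture weighted toward Hindi, with English replay.

    hindi_fraction=0.7 is an assumed value, not a figure from the paper.
    """
    rng = random.Random(seed)
    size = size if size is not None else len(hindi)
    n_hindi = round(size * hindi_fraction)
    n_english = size - n_hindi
    # Sample with replacement so either pool can be smaller than its quota.
    batch = rng.choices(hindi, k=n_hindi) + rng.choices(english, k=n_english)
    rng.shuffle(batch)
    return batch
```

Keeping a fixed share of English examples in every epoch is the replay idea: the model keeps seeing the Stage 2 behavior while it practices the new language.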
One detail deserves special care: Hindi BLEU remains near zero for encoder-decoder models even when ROUGE and BERTScore indicate meaningful improvement. The authors interpret this as a metric limitation rather than a pure model failure. That interpretation is plausible because Hindi’s morphology and word order can punish exact n-gram matching. For business teams, this is not an academic footnote. It means multilingual model evaluation must not inherit English-centric metrics by default. A procurement dashboard that treats BLEU as universal is not rigorous; it is just tidy.
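A small sketch shows how exact n-gram matching can punish a legitimate word-order variation. It computes the clipped n-gram precision that BLEU is built from (not full BLEU: no brevity penalty, no smoothing). The sentence pair is an illustrative example chosen for this article, not taken from the paper's data.

```python
from collections import Counter


def ngram_precision(hypothesis: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision of a tokenized hypothesis against one reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Two orderings of the same four Hindi words ("Ram read the book").
reference = "राम ने किताब पढ़ी".split()
hypothesis = "किताब राम ने पढ़ी".split()

unigram = ngram_precision(hypothesis, reference, 1)   # every word matches
fourgram = ngram_precision(hypothesis, reference, 4)  # the single 4-gram does not
```

Because unsmoothed BLEU-4 is a geometric mean over the 1-gram to 4-gram precisions, a zero 4-gram precision drives the whole score to zero even though every word in the hypothesis appears in the reference.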
Stage 4 asks whether reinforcement learning adds value; mostly, it adds a bill
The fourth stage applies Group Relative Policy Optimization, or GRPO, with a citation-aware reward. The reward combines factual consistency, entity overlap, citation attribution, a fluency proxy, length penalty, hallucination penalty, correct citation bonus, and wrong citation penalty. The hallucination penalty is deliberately high.
That sounds like the part where the sophisticated alignment method should rescue everything. Instead, the results are modest to the point of comedy.
From Stage 3 to Stage 4, changes are negligible for encoder-decoder models. Flan-T5-Base keeps Citation-F1 at 0.902, hallucination at 0.000, and BERTScore at 0.766, with FactScore only moving from 0.096 to 0.098. Flan-T5-Large and Flan-T5-XL are similarly flat. Among decoder-only models, Mistral gains only +0.004 Citation-F1 and +0.004 FactScore; LLaMA gains +0.003 Citation-F1 and +0.008 FactScore; Gemma’s main metrics barely move.
The likely reason is not that reinforcement learning is useless. That would be a lazy conclusion, and there are already enough lazy conclusions wandering around AI commentary unsupervised. The more disciplined reading is that GRPO had little additional signal to exploit after strong SFT on a well-structured task.
The authors themselves identify several possible constraints: reward saturation after SFT, KL penalty strength, binary citation signals that may lack gradient richness, and a limited 500-step training budget with group size 4. The GRPO reward trajectory also declines from best to final for most models, suggesting instability in this configuration.
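The saturation argument is easier to see in code. The sketch below combines the reward components named above into a weighted sum and then computes group-relative advantages in the usual GRPO style (each reward standardized against its sampling group). All weights are hypothetical, since the article only says the hallucination penalty is deliberately high, and implementations differ on details such as population versus sample standard deviation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardSignals:
    factual_consistency: float   # 0..1, e.g. from an NLI scorer
    entity_overlap: float        # 0..1
    citation_attribution: float  # 0..1
    fluency: float               # 0..1 proxy score
    length_excess: float         # 0 when the response is within budget
    hallucinated: bool
    correct_citations: int
    wrong_citations: int


# Hypothetical weights for illustration only.
W = {"factual": 1.0, "entity": 0.5, "attribution": 1.0, "fluency": 0.25,
     "length": 0.5, "hallucination": 2.0, "correct": 0.25, "wrong": 0.5}


def citation_aware_reward(s: RewardSignals) -> float:
    """Weighted sum of positive signals minus penalties."""
    reward = (W["factual"] * s.factual_consistency
              + W["entity"] * s.entity_overlap
              + W["attribution"] * s.citation_attribution
              + W["fluency"] * s.fluency)
    reward -= W["length"] * s.length_excess
    reward -= W["hallucination"] * (1.0 if s.hallucinated else 0.0)
    reward += W["correct"] * s.correct_citations
    reward -= W["wrong"] * s.wrong_citations
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If SFT has already made all four sampled responses in a group equally well grounded, their rewards are identical and every advantage is zero, so the policy update has nothing to push on. That is one plausible reading of "reward saturation after SFT" in a group-size-4 setup.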
For enterprises, the translation is blunt:
| Decision point | What the paper directly shows | Cognaptus inference | Boundary |
|---|---|---|---|
| Should teams start with RL alignment? | GRPO adds marginal gains after strong SFT in this setup | Start with high-quality citation-grounded SFT before expensive RL | Other GRPO settings might perform better |
| Is reward design unnecessary? | Not established; the tested reward adds little once SFT is strong | Reward design may be more useful for edge cases and preference tradeoffs | The paper tests one reward configuration |
| Does “advanced alignment” imply better governance? | Not automatically | Governance value comes from measurable behavior, not method prestige | Human evaluation was not included |
This is the economic core of the paper. If the required behavior is highly structured, the cheapest path to reliability may be carefully built supervised examples, not a heroic alignment stage. Enterprise AI projects rarely fail for lack of exotic methods. They more often fail because the basic task contract was never made explicit.
The LLaMA-1B counterexample: zero hallucination can mean zero commitment
The paper’s best warning comes from LLaMA-3.2-1B.
At Stage 1, this model suffers a hallucination explosion. Its overall hallucination rate rises from 13.5% at baseline to 66.5% after multilingual adaptation; in English, the paper reports a jump from 16.0% to 81.0%. This is a useful reminder that “adaptation” is not automatically benign. Small decoder-only models can be disrupted by translation-style training.
Stage 2 appears to fix the hallucination problem. English hallucination drops to 0.0%. Wonderful, yes? Not quite.
English Citation-F1 simultaneously drops to 0.000 and stays there. The model eliminates hallucination not by producing grounded, cited answers, but by becoming conservative and non-committal. The paper gives a qualitative example where the model answers without citation markers and falls into repetition. It avoids unsupported claims partly by avoiding useful specificity.
This is the paper’s most business-relevant anti-metric.
A system can be “safe” because it is well-grounded. It can also be “safe” because it says almost nothing. These are not the same product. One is an accountable assistant. The other is a very polite fog machine.
The LLaMA case separates three concepts that dashboards often blur together:
| Surface metric | What it can mean | What it can also mean (check next) |
|---|---|---|
| Low hallucination | The model grounds claims in sources | Or the model avoids claims entirely |
| High Citation-F1 | The model places citation markers correctly | Or it learned citation syntax without using sources |
| Good fluency | The response reads naturally | Or it hides weak grounding behind smooth text |
For AI governance, the implication is uncomfortable but helpful: “hallucination rate” is not a sufficient KPI. A model that refuses to make claims will look clean under some factuality metrics. A model that cites every sentence may look compliant under citation metrics. Neither proves the system is useful or grounded.
The correct operational question is not simply, “Did it hallucinate?” It is, “Did it make the right claims, supported by the right sources, while remaining useful enough for the task?”
That longer question is less dashboard-friendly. Naturally, it is also the one that matters.
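A dashboard rule that reads hallucination rate and Citation-F1 together, rather than separately, might look like the sketch below. The thresholds and label strings are hypothetical choices for illustration, not values from the paper.

```python
def diagnose(hallucination_rate: float, citation_f1: float,
             max_hallucination: float = 0.05,
             min_citation_f1: float = 0.5) -> str:
    """Classify a model from two aggregate metrics read jointly.

    Thresholds are assumed defaults; tune them per deployment.
    """
    if hallucination_rate > max_hallucination:
        return "hallucinating"
    if citation_f1 < min_citation_f1:
        # Clean on factuality only because it is not making cited claims.
        return "non-committal"
    # Surface metrics look fine; causal grounding is still unverified.
    return "needs grounding test"
```

With the LLaMA-3.2-1B English figures quoted above, the Stage 1 profile (81.0% hallucination) lands in "hallucinating", while the Stage 2 profile (0.0% hallucination, 0.000 Citation-F1) lands in "non-committal" rather than passing as safe. No branch returns "trusted": good surface metrics only earn a model the next test.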
The real trust test is causal grounding, not citation formatting
The paper’s explainability analyses are where the “citation equals grounding” misconception gets dismantled properly.
The authors apply three post-hoc analyses: cross-attention alignment for encoder-decoder models, gradient saliency, and occlusion sensitivity. These tests serve different purposes and should not be read as interchangeable evidence.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Cross-attention alignment | Mechanistic evidence for encoder-decoder grounding | Whether generated citation tokens attend to cited knowledge tokens | Does not apply cleanly to decoder-only models |
| Gradient saliency | Diagnostic evidence of attribution spread | Whether input tokens meaningfully influence generation | Not a full proof of factual correctness |
| Occlusion sensitivity | Causal grounding test | Whether citations disappear when the source passage is removed | Tested on representative examples, not a full human audit |
The encoder-decoder results are encouraging. Flan-T5-Base and Flan-T5-Large improve in occlusion causal grounding from 0.647 and 0.656 at baseline to 0.889 and 0.909 at Stage 3. This suggests that training did not merely teach the models to print [1]; it made citations more dependent on the cited passages.
The decoder-only results are more alarming. Mistral-7B reaches strong citation scores, yet its occlusion causal grounding drops to 0.000 at Stage 3 and Stage 4 in the paper’s analysis. Gemma-2-2B also drops to 0.000 from Stage 1 onward on the explainability subset, despite strong overall Citation-F1 elsewhere. In other words, these models can produce citation markers while those markers are not causally dependent on the source passage being present.
This is not a minor measurement quirk. It is the governance problem in miniature.
A citation marker is an observable behavior. Source grounding is a causal relationship. The former can be learned as syntax; the latter requires the output to actually depend on the cited evidence. Enterprise systems usually monitor the observable behavior because it is easier. The paper shows why that is not enough.
Architecture helps explain the split. Encoder-decoder models have explicit cross-attention from decoder outputs to encoded input passages. That gives a more direct mechanism for source-conditioned generation. Decoder-only models process query, context, and response in one causal sequence, so citation grounding must emerge through self-attention patterns. They may learn where citations should appear without reliably using the cited text as the reason for the claim.
This does not mean decoder-only models are unusable for citation-grounded systems. Gemma-2-2B performs well on many citation metrics, and Mistral-7B has strong fluency and FactScore. But it does mean businesses should not certify grounding by checking citation format alone. If the model family lacks a transparent grounding mechanism, the evaluation system must compensate.
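An occlusion check of the kind discussed in this section can be approximated with any model that exposes a generate call. The sketch below blanks each cited passage in turn and measures how often the matching citation disappears. The scoring rule and the `generate` signature are assumptions for illustration; the paper's exact occlusion metric may be defined differently.

```python
import re
from typing import Callable

CITATION_RE = re.compile(r"\[(\d+)\]")


def cited_ids(response: str) -> set[int]:
    """Passage numbers cited in a response."""
    return {int(m) for m in CITATION_RE.findall(response)}


def occlusion_grounding(query: str, passages: list[str],
                        generate: Callable[[str, list[str]], str]) -> float:
    """Fraction of cited passages whose citation vanishes when the passage is blanked."""
    baseline = cited_ids(generate(query, passages))
    if not baseline:
        return 0.0  # no citations at all, so nothing is grounded
    dropped = 0
    for pid in baseline:
        # Blank the passage but keep its slot so the numbering stays stable.
        occluded = [("" if i == pid else p) for i, p in enumerate(passages, start=1)]
        if pid not in cited_ids(generate(query, occluded)):
            dropped += 1
    return dropped / len(baseline)


# Toy stand-ins for a model: one uses the passages, one prints a marker regardless.
def grounded_toy(query: str, passages: list[str]) -> str:
    hits = [i for i, p in enumerate(passages, start=1) if query.lower() in p.lower()]
    return " ".join(f"See [{i}]." for i in hits) or "I do not know."


def decorative_toy(query: str, passages: list[str]) -> str:
    return "The answer is in the policy [1]."
```

Both toy generators produce a well-formed `[1]` on the original input, so a format check cannot tell them apart. Only the occlusion pass separates the one that depends on the passage from the one that does not.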
The headline is right, but the mechanism is the article
“Zero hallucination, zero trust” is a useful phrase because it captures the paradox. But the useful content is not the paradox itself. The useful content is the operational chain behind it.
The paper’s results can be compressed into four claims:
| Claim | Evidence in the paper | Business meaning | Boundary |
|---|---|---|---|
| Citation-grounded SFT is powerful | Encoder-decoder hallucination reaches 0.000 from Stage 2 onward under automatic NLI evaluation | Train the answer contract directly rather than relying on prompts | Automatic metrics may miss human-perceived errors |
| Multilingual grounding must be staged | Stage 3 produces the largest Hindi gains, while Stage 2 transfers citation structure | Localize training and evaluation, not just interface language | Hindi examples are machine-translated |
| GRPO is marginal here | Stage 3 to Stage 4 deltas are near zero for most metrics | Do not buy alignment complexity before exhausting SFT | Only one GRPO setup is tested |
| Citation metrics can deceive | Decoder-only models can show high Citation-F1 but 0.000 occlusion grounding | Add causal grounding tests to AI assurance | Explainability analysis is limited in scope |
That table is also a practical deployment sequence.
First, define the answer contract: what sources must be cited, how claims should attach to sources, and what the model should do when the source is insufficient. Second, train on examples that actually demonstrate that contract. Third, evaluate usefulness, citation accuracy, hallucination, and causal grounding separately. Fourth, only then consider reinforcement learning if there is still a measurable gap that reward optimization is likely to improve.
This order sounds obvious. Many AI projects ignore it anyway. There is a whole industry of expensive detours built on ignoring the obvious with confidence.
What this means for customer support, compliance, and knowledge-base assistants
The most immediate use case is not open-domain chat. It is controlled enterprise dialogue: product support, internal policy Q&A, compliance knowledge bases, financial documentation assistants, HR policy bots, medical-administrative support, and multilingual service workflows.
These settings share three properties. The source material is bounded. The answer format can be standardized. Users need a way to verify claims. That makes them a good match for citation-grounded SFT.
But the paper also changes how such systems should be evaluated. A serious enterprise evaluation should separate at least five layers:
- Retrieval quality: Did the system retrieve the relevant source passages?
- Answer usefulness: Did the response actually answer the user’s question?
- Citation placement: Did the response include source markers in the right places?
- Factual consistency: Are the claims supported by the provided passages?
- Causal grounding: Does the answer change appropriately when the cited source is removed or replaced?
Most deployed systems over-measure the first and third layers because they are easy to automate. The paper argues, indirectly but strongly, that the fifth layer is where trust gets expensive.
For businesses, that expense is not optional if the system is being used in high-impact contexts. A chatbot that cites the wrong policy paragraph can be worse than a chatbot with no citations, because it gives users a false audit trail. A multilingual assistant that works in English but fails asymmetrically in Hindi, Tagalog, Bahasa Indonesia, or Thai does not have a “translation issue.” It has a governance issue that happens to speak multiple languages.
The ROI case is therefore not simply cheaper generation. The ROI case is cheaper diagnosis. Citation-grounded training can reduce hallucination under the tested conditions, but explainability and occlusion tests help identify whether the system is genuinely source-dependent or merely citation-fluent. That diagnostic value matters because remediation differs: poor retrieval requires retrieval fixes; poor citation formatting requires SFT examples; poor causal grounding may require architecture choice, training objective changes, or stronger verification layers.
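The five evaluation layers and the remediation mapping above can be kept separate in code so that a strong score on one layer never masks a weak score on another. The field names, the 0.8 threshold, and the example scores are hypothetical; the remediation strings paraphrase the mapping in the preceding paragraph.

```python
from dataclasses import dataclass, fields


@dataclass
class LayeredEval:
    retrieval_quality: float
    answer_usefulness: float
    citation_placement: float
    factual_consistency: float
    causal_grounding: float


# Only the three mappings stated in the article are filled in.
REMEDIATION = {
    "retrieval_quality": "fix retrieval",
    "citation_placement": "add or repair SFT examples",
    "causal_grounding": "revisit architecture, training objective, or verification layers",
}


def failing_layers(result: LayeredEval, threshold: float = 0.8) -> list[str]:
    """Names of layers scoring below the threshold, in declaration order."""
    return [f.name for f in fields(result) if getattr(result, f.name) < threshold]


def remediation_plan(result: LayeredEval, threshold: float = 0.8) -> dict[str, str]:
    """Map each failing layer to its suggested fix, if the article names one."""
    return {name: REMEDIATION.get(name, "no fix specified in the article")
            for name in failing_layers(result, threshold)}
```

The design choice worth noting is the absence of an overall score. Averaging the five layers would let a model with perfect citation placement and zero causal grounding look acceptable, which is exactly the failure mode this article warns about.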
The limits are not decorative; they define the deployment boundary
The paper is careful enough to leave several boundaries visible. They should not be treated as fine print.
First, the hallucination result depends on automatic NLI-based evaluation. That is useful for scale, but it is not a replacement for human review in domains where subtle factuality, policy interpretation, or legal meaning matters.
Second, the Hindi data is machine-translated. Preserving citation markers through regex processing is sensible, but natural Hindi support conversations may differ from translated benchmark examples. A business deploying multilingual support should test native user queries, not just translated English templates.
Third, GRPO is tested under one configuration: 500 steps, group size 4, and a particular reward design. The result “GRPO adds little” should be read as “GRPO added little here after strong SFT,” not “RL alignment is pointless forever.” Eternal conclusions from one configuration are a known disease. Fortunately, it is treatable with experiments.
Fourth, the explainability analysis is post-hoc and limited. Cross-attention alignment is naturally available for encoder-decoder models but not directly comparable for decoder-only models. Occlusion sensitivity is more causal, but the paper’s explainability tests are still representative analyses rather than a full production assurance suite.
These limitations do not weaken the paper’s business relevance. They sharpen it. The paper is not a universal guarantee of zero hallucination. It is a blueprint for asking better deployment questions.
The strange economics: SFT buys behavior, evaluation buys trust
The economics of citation-grounded AI are strange because the visible part is cheap to fake.
A citation marker costs almost nothing to generate. A fluent answer costs less every year. A polished UI can make both look responsible. The expensive part is proving that the answer depended on the right evidence.
XKD-Dial suggests that supervised fine-tuning can buy a large amount of desired behavior when the task is structured: answer with citations, stay close to the source, transfer the citation pattern across languages, and reduce unsupported claims. That is good news for deployment cost.
But it also shows that behavior is not trust. Trust requires evaluation that can distinguish source-conditioned answers from citation theater. That is less glamorous than a model leaderboard and more useful than most of them.
The paper’s most practical lesson is therefore a two-part rule:
Train citation behavior with explicit examples. Validate grounding with tests that can fail even when citations look perfect.
That second sentence is where many systems will be underbuilt.
The future of enterprise AI will not be won by models that merely sound accountable. It will be won by systems whose accountability signals remain meaningful when the evidence is perturbed, removed, contradicted, translated, or audited by someone who is not impressed by brackets.
Citations are a start. Receipts are useful.
But nobody should confuse the receipt with the goods.
Cognaptus: Automate the Present, Incubate the Future.
¹ Vedant Pandya, “Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs,” arXiv:2603.18911v1, 19 March 2026, https://arxiv.org/abs/2603.18911.