Whispers Against the Noise: How Contrastive Decoding Tames Long‑Form ASR Hallucinations

A transcript is usually treated as boring infrastructure. It sits underneath meeting summaries, call-center analytics, podcast search, earnings-call review, legal discovery, medical documentation, and the cheerful dashboard that tells managers everything is now “AI-powered.”

Then the transcript invents a sentence.

Not a typo. Not a small mishearing. A fluent, confident, context-shaped sentence that nobody said. In short clips, this is irritating. In long recordings, it becomes structural. One bad segment can become context for the next segment; the next segment inherits the mistake; and soon the system is not transcribing a recording so much as continuing a badly seeded story.

The paper Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding proposes a useful way to think about this failure.¹ The problem is not only that Whisper sometimes chooses the wrong token. The deeper problem is that, in long-form transcription, wrong tokens can become self-reinforcing context. Beam search may explore more paths, but it does not necessarily challenge the model’s own probability distribution. If a hallucinated continuation already looks highly probable, searching harder can become a very expensive way to be wrong with confidence. Delightful, in the way only production AI failures can be.

Whisper-CD attacks the issue at decoding time. It keeps the Whisper model frozen, feeds the same segment through clean and deliberately degraded audio paths, and subtracts the tokens that remain attractive even when the acoustic evidence has been damaged. The practical idea is simple: if a word is likely when the real speech is present, that is useful. If the same word is also likely when the speech has been replaced by noise, silence, or temporal misalignment, it may be the model’s textual prior talking too loudly.

That is the mechanism worth understanding. The business implication follows from it: hallucination control may not always require retraining, a new architecture, or an LLM post-editor with a heroic job title. Sometimes reliability improves when inference becomes more skeptical.

The real bug is not one bad segment; it is error inheritance

Whisper processes long audio by splitting it into short segments, commonly around 30 seconds. Previous segment text can be passed forward as context so the transcript stays coherent across the recording. That sounds reasonable. It is also exactly where the trap begins.
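The segment loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Whisper's actual implementation: `decode_segment` is a hypothetical callable standing in for the model, `toy_decoder` is an invented stub that mimics a repetition loop, and the sample rate is chosen only to keep the demo small.

```python
from typing import Callable, List, Sequence


def transcribe_long_form(
    audio: Sequence[float],
    decode_segment: Callable[[Sequence[float], str], str],
    condition_on_previous: bool = True,
    sample_rate: int = 16_000,   # assumed default for this sketch
    segment_seconds: int = 30,   # segment length mentioned in the text
) -> List[str]:
    """Split audio into fixed windows and decode them in order.

    `decode_segment(chunk, prefix)` is a hypothetical stand-in for the model.
    When `condition_on_previous` is True, each segment receives the previous
    segment's text as its prefix, which is how one bad segment can leak
    into the next.
    """
    window = sample_rate * segment_seconds
    outputs: List[str] = []
    prefix = ""
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        text = decode_segment(chunk, prefix)
        outputs.append(text)
        prefix = text if condition_on_previous else ""
    return outputs


def toy_decoder(chunk: Sequence[float], prefix: str) -> str:
    """Invented stub: hallucinates on silence and keeps looping once the
    hallucinated phrase appears in its prefix."""
    if "thank you" in prefix or max(chunk, default=0.0) == 0.0:
        return "thank you thank you"
    return "real speech"


rate = 100  # tiny sample rate so the demo stays small
speech = [0.1] * (rate * 30)
silence = [0.0] * (rate * 30)
audio = speech + silence + speech

with_context = transcribe_long_form(audio, toy_decoder, True, sample_rate=rate)
without_context = transcribe_long_form(audio, toy_decoder, False, sample_rate=rate)
```

In this toy setup the silent middle segment produces a hallucinated phrase either way. Only the context-conditioned run carries it into the third segment, which contains real speech.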

The paper reports that previous-context conditioning can degrade performance sharply. On Whisper Large-v3, the authors find that conditioning on previous text increases WER by over 190 percentage points on CORAAL and over 500 percentage points on Earnings22. Those are not normal “model is a bit worse than expected” numbers. They are symptoms of runaway repetition and error accumulation.

The paper separates the long-form failure pattern into three recurring forms:

Failure pattern	What happens	Why long-form decoding makes it worse
Silence-region hallucination	The model emits words during non-speech intervals	The generated text can become context for later segments
Repetition loops	A phrase or sentence repeats across boundaries	The loop reinforces itself as the prefix grows
Content omission	Real spoken content is skipped	Later decoding has less correct context to recover from

This is why the common instinct—“just use a bigger model” or “just search harder”—is incomplete. The paper’s more interesting observation is that the failure is not merely acoustic. It is acoustic plus textual plus sequential. Once the model has written a bad prefix, later segments are decoded under a contaminated history.

A post-editor can clean some output after the fact. Voice activity detection can avoid some silence mistakes. Beam search can explore multiple hypotheses. But none of these directly changes the token distribution at the moment where the model is deciding what to say next.

Whisper-CD lives exactly at that moment.

Contrastive decoding asks what the model says when the audio evidence disappears

The core contrastive decoding formula is compact:

$$ \ell^{CD}_t = (1 + \alpha)\ell^{pos}_t - \alpha \ell^{neg}_t $$

Here, $\ell^{pos}_t$ is the model’s logit vector for the clean audio at decoding step $t$, and $\ell^{neg}_t$ is the logit vector for a degraded version of the same audio. The coefficient $\alpha$ controls how strongly the negative path is subtracted.

The intuition is more important than the notation. A token should be rewarded when it is supported by the clean signal. It should be penalized when it is also preferred under a corrupted signal that should not contain enough evidence for that token. In other words, Whisper-CD tries to separate acoustic evidence from model habit.
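A toy example makes the formula concrete. The sketch below applies $(1+\alpha)\ell^{pos}_t - \alpha\ell^{neg}_t$ to a three-token vocabulary. The vocabulary and logit values are invented for illustration and do not come from the paper.

```python
from typing import List


def contrastive_logits(pos: List[float], neg: List[float], alpha: float) -> List[float]:
    """Apply (1 + alpha) * pos - alpha * neg element-wise over the vocabulary."""
    if len(pos) != len(neg):
        raise ValueError("clean and degraded logits must cover the same vocabulary")
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Invented toy vocabulary. Token 0 is a habitual phrase that stays likely even
# when the audio is degraded; token 1 is supported only by the clean audio.
vocab = ["thank you", "revenue", "<eos>"]
pos_logits = [2.0, 1.8, 0.1]   # clean audio path
neg_logits = [2.1, 0.2, 0.1]   # degraded audio path

greedy_plain = vocab[argmax(pos_logits)]
greedy_cd = vocab[argmax(contrastive_logits(pos_logits, neg_logits, alpha=1.0))]
```

With these made-up numbers, plain greedy decoding picks the habitual phrase, while the contrastive score picks the token whose support disappears when the audio is degraded. Setting `alpha=0.0` recovers the clean logits unchanged.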

The paper uses the same Whisper model for both paths. No smaller assistant model. No fine-tuning. No parameter update. The negative signal comes from perturbing the audio input, not from changing the model.

That distinction matters operationally. In many deployed ASR systems, retraining is not a casual Tuesday afternoon activity. It implies data collection, annotation, validation, infrastructure, compliance checks, and the tedious ceremony of discovering that the new model fixed one domain and damaged another. Whisper-CD instead changes the decoding process around an existing model.

Three negative paths expose three different hallucination tendencies

A single degraded audio path would be too narrow. Different hallucination patterns come from different weaknesses. Whisper-CD therefore uses three negative signals and aggregates them into one multi-negative objective.

Negative path	What it does	What it is designed to reveal
Gaussian noise	Adds controlled noise at SNR$_{dB}=10$	Tokens the model prefers when phonetic evidence is weakened
Silence signal	Replaces the spectrogram with zeros	Textual priors and silence-region hallucination phrases
Audio temporal shift	Shifts the waveform left by $\Delta_s=7$ seconds	Segment-boundary and prefix-alignment failures

The silence path is especially revealing. If the model still wants to produce a fluent phrase when the audio is effectively blank, that phrase is not being driven by the speaker. It is being driven by the decoder’s learned language prior. This is useful information, although not exactly flattering to the model.

The temporal shift path targets a different weakness. Long-form ASR is not just about recognizing words; it is about keeping local audio aligned with local text under a moving segment window. A shifted input disrupts that alignment and helps expose tokens that survive despite poor temporal grounding.
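The three perturbations can be sketched with the standard library alone. This is an illustrative sketch, not the paper's code. The function names are hypothetical, the zero-padding of the shifted tail is an assumption (the text only says the waveform is shifted left), and real pipelines would operate on arrays rather than Python lists.

```python
import math
import random
from typing import List


def add_gaussian_noise(wave: List[float], snr_db: float = 10.0, seed: int = 0) -> List[float]:
    """Add white Gaussian noise scaled so signal power / noise power matches snr_db."""
    if not wave:
        return []
    rng = random.Random(seed)
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def silence_like(features: List[List[float]]) -> List[List[float]]:
    """Replace a spectrogram-shaped feature matrix with zeros of the same shape."""
    return [[0.0] * len(row) for row in features]


def shift_left(wave: List[float], shift_seconds: float = 7.0, sample_rate: int = 16_000) -> List[float]:
    """Shift the waveform left: drop the first samples and zero-pad the tail.
    Zero-padding is an assumption of this sketch."""
    n = min(len(wave), int(round(shift_seconds * sample_rate)))
    return wave[n:] + [0.0] * n


# Small synthetic demo: a 5 Hz sine sampled at 100 Hz for 10 seconds.
sample_rate = 100
wave = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(10 * sample_rate)]

noisy = add_gaussian_noise(wave, snr_db=10.0)
shifted = shift_left(wave, shift_seconds=7.0, sample_rate=sample_rate)
blank = silence_like([[0.3, 0.7], [0.1, 0.9]])
```

Each function returns an input of the same shape as the original, so the clean path and the three negative paths can share one decoder prefix.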

The multi-negative formula aggregates the three negative logits using a log-mean-exp operation:

$$ \ell^{CD}_t = (1+\alpha\tau)\ell^{pos}_t - \alpha\tau \log \left( \frac{1}{K} \sum_{k=1}^{K} \exp(\ell^{neg}_{k,t}/\tau) \right) $$

The paper sets $K=3$, $\tau=1.0$, and varies $\alpha$ between 0.5 and 2.0. The clean input and three perturbed inputs are processed in parallel paths. For efficiency, the encoder outputs and autoregressive decoding paths are batched, so the method does not require four separate end-to-end transcription runs.
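The log-mean-exp aggregation can be written out directly. The sketch below is a hedged illustration of the multi-negative formula as stated in this article, with invented logit values. The function names are hypothetical, and the max-subtraction trick is a standard numerical-stability choice rather than something the text specifies.

```python
import math
from typing import List, Sequence


def log_mean_exp(values: Sequence[float], tau: float) -> float:
    """Numerically stable log( mean( exp(v / tau) ) )."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))


def multi_negative_logits(
    pos: List[float],
    negs: List[List[float]],
    alpha: float,
    tau: float = 1.0,
) -> List[float]:
    """(1 + alpha*tau) * pos - alpha*tau * log-mean-exp over the K negative paths."""
    weight = alpha * tau
    out = []
    for i, p in enumerate(pos):
        aggregated = log_mean_exp([neg[i] for neg in negs], tau)
        out.append((1.0 + weight) * p - weight * aggregated)
    return out


# Invented logits for a 3-token vocabulary and K = 3 negative paths.
pos_logits = [2.0, 1.8, 0.1]
neg_paths = [
    [2.1, 0.2, 0.1],    # e.g. Gaussian-noise path
    [2.5, -1.0, 0.0],   # e.g. silence path
    [1.9, 0.5, 0.2],    # e.g. temporal-shift path
]
cd_logits = multi_negative_logits(pos_logits, neg_paths, alpha=1.0, tau=1.0)
best_token = max(range(len(cd_logits)), key=cd_logits.__getitem__)
```

With a single negative path and $\tau=1$, the aggregation reduces to the single-negative formula given earlier, which is a useful sanity check when wiring this into a decoder.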

This is the small engineering detail that makes the idea commercially interesting. The method adds negative paths, but it can batch them. And if it suppresses repetition loops, it may also reduce the total number of generated tokens. In long-form ASR, avoiding nonsense can save compute. A rare case where being less verbose is literally cheaper.

The main result is broad WER reduction, but the baseline numbers need interpretation

The headline result is that Whisper-CD reduces WER across five English long-form benchmarks: CORAAL, VoxPopuli, Earnings22, TED-LIUM, and REV-16.

Model	Method	CORAAL	VoxPopuli	Earnings22	TED-LIUM	REV-16	Speed (tokens/s)	RTF
Large-v3	Baseline	208.76	44.95	520.94	66.42	173.69	30.6	0.2886
Large-v3	+ CD	45.77	19.86	57.08	25.62	21.38	27.3	0.1655
Large-v3-Turbo	Baseline	38.75	30.63	33.25	12.93	19.82	168.9	0.0239
Large-v3-Turbo	+ CD	14.43	25.71	16.16	10.11	14.81	144.2	0.0346

The Large-v3 baseline WER values above 100% are not a typo. WER can exceed 100% when the output contains severe insertions, especially repetition loops that inflate transcript length far beyond the reference. That makes these numbers both impressive and awkward. Whisper-CD is clearly suppressing catastrophic loops, but the Large-v3 results also tell us that some model-context combinations are already in a very damaged state before contrastive decoding intervenes.
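A short calculation shows how insertions push WER past 100%. The sketch below computes word-level edit distance divided by reference length, which is the usual WER definition. The sentences are invented, and real evaluations typically apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "thank you for joining the call"
# The hypothesis gets the sentence right, then falls into a repetition loop.
looping = reference + " thank you" * 10

wer_clean = word_error_rate(reference, reference)
wer_loop = word_error_rate(reference, looping)
```

The looping hypothesis contains every reference word correctly, yet its 20 inserted words against a 6-word reference give a WER of about 333%. That is the same mechanism behind the three-digit baseline numbers in the table.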

For Large-v3-Turbo, the improvement looks cleaner and more deployable. CORAAL falls from 38.75% to 14.43%. Earnings22 falls from 33.25% to 16.16%. TED-LIUM improves from 12.93% to 10.11%. REV-16 improves from 19.82% to 14.81%. VoxPopuli improves more modestly, from 30.63% to 25.71%.

The business reading should therefore be precise: this is not a universal guarantee that every transcript becomes near-perfect. The paper shows that, under the tested English long-form settings, multi-negative contrastive decoding consistently reduces WER and is especially valuable when hallucination and repetition are major contributors to error.

The ablations show complementarity, not just decoration

The most useful experimental sections are not only the main benchmark table. The ablations explain why the three-negative design matters.

First, the paper varies $\alpha$, the contrastive strength, on CORAAL, Earnings22, and TED-LIUM using Whisper Large-v3-Turbo.

$\alpha$	CORAAL	Earnings22	TED-LIUM
0.0	38.75	33.25	12.93
0.5	26.58	16.16	11.19
1.0	14.43	17.70	11.65
1.5	19.65	19.44	10.11
2.0	20.82	21.77	11.00

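This is best read as a sensitivity test, not a second thesis. Higher contrastive strength is not automatically better. CORAAL benefits most at $\alpha=1.0$, Earnings22 at $\alpha=0.5$, and TED-LIUM at $\alpha=1.5$. Every nonzero setting beats the $\alpha=0$ baseline on all three datasets, but the optimum does not track baseline difficulty in any simple way: TED-LIUM, the cleanest of the three, peaks at $\alpha=1.5$, while Earnings22 degrades steadily once $\alpha$ goes past 0.5. The safer reading is that over-subtraction is a real risk and that the right strength is dataset-specific.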
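The table can be queried directly to see how the preferred strength moves between datasets. The sketch below copies the reported WER values and picks the best $\alpha$ per dataset. The "global" choice uses an unweighted mean across the three datasets, which is an assumption of this sketch and not a criterion from the paper.

```python
# WER (%) by contrastive strength, copied from the table above.
alpha_sweep = {
    0.0: {"CORAAL": 38.75, "Earnings22": 33.25, "TED-LIUM": 12.93},
    0.5: {"CORAAL": 26.58, "Earnings22": 16.16, "TED-LIUM": 11.19},
    1.0: {"CORAAL": 14.43, "Earnings22": 17.70, "TED-LIUM": 11.65},
    1.5: {"CORAAL": 19.65, "Earnings22": 19.44, "TED-LIUM": 10.11},
    2.0: {"CORAAL": 20.82, "Earnings22": 21.77, "TED-LIUM": 11.00},
}
datasets = ["CORAAL", "Earnings22", "TED-LIUM"]

# Best alpha for each dataset taken on its own.
best_alpha = {
    name: min(alpha_sweep, key=lambda a: alpha_sweep[a][name])
    for name in datasets
}

# One fixed alpha for all datasets, scored by unweighted mean WER (assumption).
mean_wer = {
    a: sum(row[name] for name in datasets) / len(datasets)
    for a, row in alpha_sweep.items()
}
global_alpha = min(mean_wer, key=mean_wer.get)
```

Comparing `best_alpha` with `global_alpha` gives a quick view of how much a single fixed setting leaves on the table for each dataset.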

That matters for deployment. A fixed global $\alpha$ may work well enough as a first patch, but production systems will probably want dynamic control by segment, domain, audio quality, or confidence profile.

Second, the paper tests each perturbation strategy alone.

Strategy	CORAAL	Earnings22	TED-LIUM
Gaussian	38.11	19.50	12.49
Silence	18.99	17.41	21.62
Audio Shift	18.77	15.54	13.81
Multi-negative CD	14.43	16.16	10.11

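This is an ablation of design choice. It shows that no single negative path dominates. Silence helps CORAAL but damages TED-LIUM badly (21.62 against a 12.93 baseline). Audio shift is the best single path on Earnings22, where it even edges out the combined method (15.54 versus 16.16), but it is worse than the baseline on TED-LIUM. Gaussian noise alone barely moves CORAAL. The combined method is best on CORAAL and TED-LIUM and within about 0.6 points of the best result on Earnings22, so it is the most consistent option across the three reported datasets.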

That is the paper’s quiet but important evidence for the multi-negative formulation. The three perturbations are not three ways to say the same thing. They expose different failure tendencies, and the aggregate is more robust than betting on one perturbation.

Beam search explores alternatives; Whisper-CD changes the scoring landscape

The beam-search comparison is the cleanest way to explain the engineering distinction.

Method	CORAAL	TED-LIUM	Speed (tokens/s)	RTF
Baseline	38.75	12.93	174.3	0.0246
Beam search, size 5	22.65	17.50	99.0	0.0436
Contrastive decoding	14.43	10.11	147.0	0.0302

Beam search improves CORAAL but worsens TED-LIUM. It also cuts throughput substantially. Whisper-CD performs better on both datasets in this comparison and runs at 147 tokens per second versus 99 for beam search, roughly 48% faster token throughput.
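The throughput comparison is simple arithmetic on the table values, shown here for completeness. The helper name is hypothetical.

```python
def relative_change(new: float, old: float) -> float:
    """Fractional change of `new` relative to `old` (0.10 means 10% higher)."""
    return (new - old) / old


# Tokens per second from the beam-search comparison table.
baseline_tps = 174.3
beam_tps = 99.0
cd_tps = 147.0

cd_vs_beam = relative_change(cd_tps, beam_tps)            # CD relative to beam search
cd_vs_baseline = relative_change(cd_tps, baseline_tps)    # CD relative to greedy baseline
beam_vs_baseline = relative_change(beam_tps, baseline_tps)
```

Contrastive decoding comes out roughly 48% ahead of beam search in token throughput, while still running about 16% below the plain baseline. Beam search with size 5 loses roughly 43% against that same baseline.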

The reason is conceptual. Beam search explores more candidate sequences under the model’s existing distribution. Whisper-CD modifies the logits before token selection. If hallucinated tokens receive high probability because the model’s textual prior is overpowering the acoustic signal, wider search does not necessarily solve the problem. It may just find more polished versions of the same mistake.

For business systems, this distinction is not academic. Beam search is often the default “make decoding better” knob. Whisper-CD suggests another class of knob: test-time distribution correction. Instead of asking the model for more candidates, ask which candidates survive when the audio evidence is degraded. Then penalize those.

What the evidence supports, and what it does not

Paper component	Likely purpose	What it supports	What it does not prove
Main benchmark table	Main evidence	Whisper-CD reduces WER across five English long-form ASR benchmarks	Universal performance across languages, accents, domains, or all Whisper settings
Qualitative examples	Failure-mode illustration	CD can break repetition loops and recover more plausible transcript flow	Frequency of qualitative improvements in all cases
$\alpha$ table	Robustness and sensitivity test	Contrastive strength matters and nonzero values help hallucination-prone datasets	A single optimal setting for production
Individual perturbation table	Ablation	The three negative paths are complementary	Every perturbation is always useful
Beam-search table	Comparison with common decoding alternative	CD can offer a better accuracy-throughput trade-off than beam search in tested settings	That beam search is never useful or that CD replaces all search methods

This separation is important because the paper’s contribution is easy to over-market. Whisper-CD is not “speech hallucination solved.” It is a training-free decoding framework that performs well in a specific set of long-form English ASR tests and gives a strong mechanism for why the improvement occurs.

That is already useful. There is no need to inflate it into magic.

The practical value is an inference-layer reliability patch

The most direct business use case is any product that already relies on Whisper-like long-form transcription:

meeting intelligence and executive summaries;
call-center analytics;
earnings-call monitoring;
interview transcription;
podcast and video indexing;
legal and compliance review;
voice-note search and summarization.

In these settings, hallucinations are not merely accuracy defects. They contaminate downstream AI. A meeting summarizer may summarize a sentence nobody said. A compliance classifier may flag an invented promise. A sales-coaching dashboard may critique a representative for a phrase generated during silence. Once transcripts become input data for other models, transcription hallucination becomes pipeline hallucination.

Whisper-CD is attractive because it acts before the transcript is finalized. That gives it a different operational profile from LLM post-correction. Post-correction may improve readability, but it also risks smoothing over uncertainty. Contrastive decoding is closer to a preventive control: adjust token selection before the false phrase enters the transcript.

A practical deployment framework would look like this:

Deployment layer	What Whisper-CD changes	Operational question
Decoder	Reweights token logits using negative audio paths	Does it reduce insertions and loops without harming clean speech?
Latency budget	Adds batched negative paths	Is the slower Turbo RTF acceptable for the product tier?
Monitoring	Tracks WER proxies, repetition rate, no-speech emissions, and segment resets	Can production detect when contrastive strength is too high or too low?
Domain policy	Tunes $\alpha$ by audio type or risk level	Should legal or medical transcription use stronger hallucination suppression than casual notes?
Downstream AI	Feeds cleaner transcripts into summarizers and classifiers	Do fewer hallucinated words reduce downstream false claims?

The ROI case is therefore not “better ASR because the table improved.” The ROI case is lower correction cost, fewer downstream false positives, less human review on hallucination-prone files, and reduced reputational risk when transcripts become evidence for business decisions.

The boundaries are clear enough to matter

The paper tests English long-form benchmarks. That does not automatically transfer to multilingual settings, low-resource languages, code-switching meetings, heavy background noise, medical jargon, legal cross-talk, or domain-specific audio hardware.

The method also depends on perturbation choices. Gaussian noise at SNR$_{dB}=10$ and a 7-second temporal shift are reasonable experimental settings, but production audio is not one dataset. A customer-support call, a board meeting, a podcast, and a courtroom recording may not want the same contrastive strength.

Model architecture is another boundary. Whisper-CD is natural for encoder-decoder ASR because the audio path can be perturbed while the decoder prefix is preserved. The paper explicitly notes that decoder-only ASR models are less straightforward because audio and text are processed in a single stream. This does not make the idea irrelevant; it means the negative-path construction has to be redesigned.

Finally, severe repetition loops may be difficult to recover from once established. The paper’s Large-v3 discussion is a useful warning: when the model assigns overwhelming probability mass to self-reinforcing continuations, logit-level contrast can reduce damage but may not always pull decoding back to a clean trajectory. In production terms, Whisper-CD should probably be paired with loop detection, segment reset policies, and no-speech handling—not treated as a lonely hero with a cape.
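A loop detector of the kind suggested here can be very small. The sketch below flags a segment when too many of its word n-grams are repeats. It is a hypothetical heuristic, not something the paper specifies, and the n-gram size and threshold are placeholder values that would need tuning on real transcripts.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram already seen in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def should_reset_context(text: str, threshold: float = 0.5, n: int = 3) -> bool:
    """Hypothetical policy: drop the text prefix when the segment looks like a loop."""
    return repeated_ngram_ratio(text, n) >= threshold


normal_segment = "we expect revenue to grow next quarter"
looping_segment = "thank you so much " * 6

normal_flag = should_reset_context(normal_segment)
loop_flag = should_reset_context(looping_segment)
```

A check like this could run after each segment and clear the carried-over text before the next one is decoded, so that a loop does not become the prefix for everything that follows.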

A small decoding change points to a larger AI systems pattern

The broader lesson is not limited to Whisper. As AI models become expensive to retrain and increasingly embedded inside business workflows, more reliability work will move to inference-time control.

That does not mean training is unimportant. It means deployment teams need more tools between “accept the model’s output” and “train a new model.” Contrastive decoding is one such tool. It asks a disciplined question at generation time: which tokens are supported by the input, and which tokens are merely what the model likes to say when evidence is weak?

For long-form speech recognition, that question is unusually valuable. The transcript is often treated as ground truth by everything downstream. If the first layer fabricates, the rest of the stack becomes very good at processing fiction.

Whisper-CD is therefore best understood as a mechanism-first contribution. Its value is not only that WER goes down. Its value is that it identifies a practical control point: the token distribution can be challenged at inference time using negative evidence derived from the input itself.

That is a useful idea for ASR systems today, and a useful design pattern for AI reliability more broadly.

Sometimes the model does not need a lecture, a new degree, or six more months of fine-tuning. Sometimes it just needs to hear what silence sounds like—and stop treating it as permission to keep talking.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hoseong Ahn, Jeongyun Chae, Yoonji Park, and Kyuhong Shim, “Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding,” arXiv:2603.06193, 2026. https://arxiv.org/pdf/2603.06193
