A transcript is usually treated as boring infrastructure. It sits underneath meeting summaries, call-center analytics, podcast search, earnings-call review, legal discovery, medical documentation, and the cheerful dashboard that tells managers everything is now “AI-powered.”
Then the transcript invents a sentence.
Not a typo. Not a small mishearing. A fluent, confident, context-shaped sentence that nobody said. In short clips, this is irritating. In long recordings, it becomes structural. One bad segment can become context for the next segment; the next segment inherits the mistake; and soon the system is not transcribing a recording so much as continuing a badly seeded story.
The paper “Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding” proposes a useful way to think about this failure [1]. The problem is not only that Whisper sometimes chooses the wrong token. The deeper problem is that, in long-form transcription, wrong tokens can become self-reinforcing context. Beam search may explore more paths, but it does not necessarily challenge the model’s own probability distribution. If a hallucinated continuation already looks highly probable, searching harder can become a very expensive way to be wrong with confidence. Delightful, in the way only production AI failures can be.
Whisper-CD attacks the issue at decoding time. It keeps the Whisper model frozen, runs the same segment through one clean and several deliberately degraded audio paths, and penalizes tokens that remain attractive even when the acoustic evidence has been damaged. The practical idea is simple: if a word is likely when the real speech is present, that is useful. If the same word is also likely when the speech has been replaced by noise, silence, or temporal misalignment, it may be the model’s textual prior talking too loudly.
That is the mechanism worth understanding. The business implication follows from it: hallucination control may not always require retraining, a new architecture, or an LLM post-editor with a heroic job title. Sometimes reliability improves when inference becomes more skeptical.
The real bug is not one bad segment; it is error inheritance
Whisper processes long audio by splitting it into short segments, commonly around 30 seconds. Previous segment text can be passed forward as context so the transcript stays coherent across the recording. That sounds reasonable. It is also exactly where the trap begins.
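The inheritance mechanism can be sketched in a few lines. This is a toy illustration, not Whisper's actual pipeline: `split_segments`, `transcribe_long`, and `toy_decoder` are hypothetical names, the fixed non-overlapping 30-second window is a simplification, and the decoder is deliberately rigged so that a hallucinated phrase in the prompt keeps winning.

```python
from typing import Callable, List

SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz audio
SEGMENT_SECONDS = 30   # approximate window length described above


def split_segments(samples: List[float], seconds: int = SEGMENT_SECONDS) -> List[List[float]]:
    """Cut a waveform into fixed-length windows (the last one may be shorter)."""
    step = SAMPLE_RATE * seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def transcribe_long(
    samples: List[float],
    decode_segment: Callable[[List[float], str], str],
    carry_context: bool = True,
) -> List[str]:
    """Decode window by window, optionally feeding the previous text forward."""
    transcript: List[str] = []
    prompt = ""
    for chunk in split_segments(samples):
        text = decode_segment(chunk, prompt)
        transcript.append(text)
        prompt = text if carry_context else ""
    return transcript


def toy_decoder(chunk: List[float], prompt: str) -> str:
    """Hypothetical stand-in for a real decoder, rigged to show error inheritance."""
    if "thanks for watching" in prompt:   # a contaminated prefix keeps winning
        return "thanks for watching"
    if max(chunk, default=0.0) == 0.0:    # silent window: emit a stock phrase
        return "thanks for watching"
    return "real speech"


one_window = SAMPLE_RATE * SEGMENT_SECONDS
# Three windows: speech, silence, speech.
audio = [0.1] * one_window + [0.0] * one_window + [0.1] * one_window

with_context = transcribe_long(audio, toy_decoder, carry_context=True)
without_context = transcribe_long(audio, toy_decoder, carry_context=False)
```

With context carried forward, the phrase invented during the silent window also replaces the third window's real speech. With context disabled, the damage stays inside the silent window.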
The paper reports that previous-context conditioning can degrade performance sharply. On Whisper Large-v3, the authors find that conditioning on previous text increases WER by over 190 percentage points on CORAAL and over 500 percentage points on Earnings22. Those are not normal “model is a bit worse than expected” numbers. They are symptoms of runaway repetition and error accumulation.
The paper separates the long-form failure pattern into three recurring forms:
| Failure pattern | What happens | Why long-form decoding makes it worse |
|---|---|---|
| Silence-region hallucination | The model emits words during non-speech intervals | The generated text can become context for later segments |
| Repetition loops | A phrase or sentence repeats across boundaries | The loop reinforces itself as the prefix grows |
| Content omission | Real spoken content is skipped | Later decoding has less correct context to recover from |
This is why the common instinct—“just use a bigger model” or “just search harder”—is incomplete. The paper’s more interesting observation is that the failure is not merely acoustic. It is acoustic plus textual plus sequential. Once the model has written a bad prefix, later segments are decoded under a contaminated history.
A post-editor can clean some output after the fact. Voice activity detection can avoid some silence mistakes. Beam search can explore multiple hypotheses. But none of these directly changes the token distribution at the moment where the model is deciding what to say next.
Whisper-CD lives exactly at that moment.
Contrastive decoding asks what the model says when the audio evidence disappears
The core contrastive decoding formula is compact:
$$ \ell^{CD}_t = (1 + \alpha)\ell^{pos}_t - \alpha \ell^{neg}_t $$
Here, $\ell^{pos}_t$ is the model’s logit vector for the clean audio at decoding step $t$, and $\ell^{neg}_t$ is the logit vector for a degraded version of the same audio. The coefficient $\alpha$ controls how strongly the negative path is subtracted.
The intuition is more important than the notation. A token should be rewarded when it is supported by the clean signal. It should be penalized when it is also preferred under a corrupted signal that should not contain enough evidence for that token. In other words, Whisper-CD tries to separate acoustic evidence from model habit.
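A small numeric sketch makes the formula concrete. The three-token vocabulary and the logit values are invented for illustration, and `contrastive_logits` is a hypothetical helper name, not something taken from the paper's code.

```python
from typing import List


def contrastive_logits(pos: List[float], neg: List[float], alpha: float) -> List[float]:
    """Apply (1 + alpha) * pos - alpha * neg to each vocabulary entry."""
    return [(1.0 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


vocab = ["budget", "thank", "you"]   # invented toy vocabulary
pos = [2.0, 2.2, 0.5]                # logits given the clean audio
neg = [0.1, 2.5, 0.4]                # logits given the degraded audio

cd = contrastive_logits(pos, neg, alpha=1.0)

greedy_token = vocab[argmax(pos)]       # what plain greedy decoding would pick
contrastive_token = vocab[argmax(cd)]   # what the contrastive scores pick
```

In this toy case "thank" narrowly wins on the clean logits, but it scores even higher when the audio is degraded, so the subtraction demotes it and "budget", which is only supported by the clean audio, takes over.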
The paper uses the same Whisper model for both paths. No smaller assistant model. No fine-tuning. No parameter update. The negative signal comes from perturbing the audio input, not from changing the model.
That distinction matters operationally. In many deployed ASR systems, retraining is not a casual Tuesday afternoon activity. It implies data collection, annotation, validation, infrastructure, compliance checks, and the tedious ceremony of discovering that the new model fixed one domain and damaged another. Whisper-CD instead changes the decoding process around an existing model.
Three negative paths expose three different hallucination tendencies
A single degraded audio path would be too narrow. Different hallucination patterns come from different weaknesses. Whisper-CD therefore uses three negative signals and aggregates them into one multi-negative objective.
| Negative path | What it does | What it is designed to reveal |
|---|---|---|
| Gaussian noise | Adds controlled noise at an SNR of 10 dB | Tokens the model prefers when phonetic evidence is weakened |
| Silence signal | Replaces the spectrogram with zeros | Textual priors and silence-region hallucination phrases |
| Audio temporal shift | Shifts the waveform left by $\Delta_s=7$ seconds | Segment-boundary and prefix-alignment failures |
The silence path is especially revealing. If the model still wants to produce a fluent phrase when the audio is effectively blank, that phrase is not being driven by the speaker. It is being driven by the decoder’s learned language prior. This is useful information, although not exactly flattering to the model.
The temporal shift path targets a different weakness. Long-form ASR is not just about recognizing words; it is about keeping local audio aligned with local text under a moving segment window. A shifted input disrupts that alignment and helps expose tokens that survive despite poor temporal grounding.
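The three perturbations can be sketched at the waveform level. This is an approximation under stated assumptions rather than the paper's implementation: the function names are hypothetical, the paper zeroes the spectrogram whereas this sketch uses an all-zero waveform as a stand-in, and zero-padding the tail after the shift is an assumption.

```python
import math
import random
from typing import List

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz audio


def add_gaussian_noise(wave: List[float], snr_db: float = 10.0, seed: int = 0) -> List[float]:
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in wave) / len(wave)   # expects a non-empty wave
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def silence_like(wave: List[float]) -> List[float]:
    """All-zero input of the same length (stand-in for a zeroed spectrogram)."""
    return [0.0] * len(wave)


def shift_left(wave: List[float], seconds: float = 7.0, sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Drop the first `seconds` of audio and zero-pad the end (padding is an assumption)."""
    k = min(int(seconds * sample_rate), len(wave))
    return wave[k:] + [0.0] * k


# Ten seconds of a 440 Hz tone as placeholder "speech".
wave = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE * 10)]
negatives = [add_gaussian_noise(wave), silence_like(wave), shift_left(wave)]
```

Each negative keeps the original length, so all three can share a batch with the clean input.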
The multi-negative formula aggregates the three negative logits using a log-mean-exp operation:
$$ \ell^{CD}_t = (1+\alpha\tau)\,\ell^{pos}_t - \alpha\tau \log \left( \frac{1}{K} \sum_{k=1}^{K} \exp\left(\ell^{neg}_{k,t}/\tau\right) \right) $$
The paper sets $K=3$, $\tau=1.0$, and varies $\alpha$ between 0.5 and 2.0. The clean input and three perturbed inputs are processed in parallel paths. For efficiency, the encoder outputs and autoregressive decoding paths are batched, so the method does not require four separate end-to-end transcription runs.
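A minimal sketch of the aggregation, applied per vocabulary entry. The function names and toy logits are illustrative assumptions; only the formula comes from the text above. With $K=1$ and $\tau=1$ the expression collapses to the single-negative form $(1+\alpha)\ell^{pos}_t - \alpha\ell^{neg}_t$.

```python
import math
from typing import List, Sequence


def log_mean_exp(values: Sequence[float], tau: float) -> float:
    """Numerically stable log(mean(exp(v / tau)))."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))


def multi_negative_logits(
    pos: List[float],
    negs: List[List[float]],
    alpha: float,
    tau: float = 1.0,
) -> List[float]:
    """(1 + alpha*tau) * pos - alpha*tau * log-mean-exp over the K negative paths."""
    out = []
    for i, p in enumerate(pos):
        aggregated = log_mean_exp([neg[i] for neg in negs], tau)
        out.append((1.0 + alpha * tau) * p - alpha * tau * aggregated)
    return out


# Invented logits over a toy vocabulary ["budget", "thank", "you"].
pos = [2.0, 2.2, 0.5]
negs = [
    [1.0, 2.0, 0.4],   # Gaussian-noise path
    [0.1, 3.0, 0.2],   # silence path: strongly prefers "thank"
    [0.5, 2.4, 0.3],   # temporal-shift path
]

cd = multi_negative_logits(pos, negs, alpha=1.0, tau=1.0)
best_index = max(range(len(cd)), key=cd.__getitem__)
```

Because log-mean-exp leans toward the largest negative logit, a token that any one degraded path strongly prefers is penalized more than a plain average of the three paths would penalize it.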
This is the small engineering detail that makes the idea commercially interesting. The method adds negative paths, but it can batch them. And if it suppresses repetition loops, it may also reduce the total number of generated tokens. In long-form ASR, avoiding nonsense can save compute. A rare case where being less verbose is literally cheaper.
The main result is broad WER reduction, but the baseline numbers need interpretation
The headline result is that Whisper-CD reduces WER across five English long-form benchmarks: CORAAL, VoxPopuli, Earnings22, TED-LIUM, and REV-16.
| Model | Method | CORAAL | VoxPopuli | Earnings22 | TED-LIUM | REV-16 | Speed tokens/s | RTF |
|---|---|---|---|---|---|---|---|---|
| Large-v3 | Baseline | 208.76 | 44.95 | 520.94 | 66.42 | 173.69 | 30.6 | 0.2886 |
| Large-v3 | + CD | 45.77 | 19.86 | 57.08 | 25.62 | 21.38 | 27.3 | 0.1655 |
| Large-v3-Turbo | Baseline | 38.75 | 30.63 | 33.25 | 12.93 | 19.82 | 168.9 | 0.0239 |
| Large-v3-Turbo | + CD | 14.43 | 25.71 | 16.16 | 10.11 | 14.81 | 144.2 | 0.0346 |
The Large-v3 baseline WER values above 100% are not a typo. WER can exceed 100% when the output contains severe insertions, especially repetition loops that inflate transcript length far beyond the reference. That makes these numbers both impressive and awkward. Whisper-CD is clearly suppressing catastrophic loops, but the Large-v3 results also tell us that some model-context combinations are already in a very damaged state before contrastive decoding intervenes.
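To see how a loop pushes WER past 100%, here is a small word-level edit-distance sketch. The sentences are invented, and `wer` is a plain textbook implementation, not the scoring script used in the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances from an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (r != h),       # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "revenue grew ten percent"
looped = reference + " thank you" * 5        # correct words plus a 10-word loop

clean_wer = wer(reference, reference)         # 0.0
looped_wer = wer(reference, looped)           # 10 insertions / 4 reference words = 2.5
```

Every reference word is transcribed correctly here, yet the ten inserted words against a four-word reference give a WER of 250%.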
For Large-v3-Turbo, the improvement looks cleaner and more deployable. CORAAL falls from 38.75% to 14.43%. Earnings22 falls from 33.25% to 16.16%. TED-LIUM improves from 12.93% to 10.11%. REV-16 improves from 19.82% to 14.81%. VoxPopuli improves more modestly, from 30.63% to 25.71%.
The business reading should therefore be precise: this is not a universal guarantee that every transcript becomes near-perfect. The paper shows that, under the tested English long-form settings, multi-negative contrastive decoding consistently reduces WER and is especially valuable when hallucination and repetition are major contributors to error.
The ablations show complementarity, not just decoration
The main benchmark table is not the only useful experimental evidence. The ablations explain why the three-negative design matters.
First, the paper varies $\alpha$, the contrastive strength, on CORAAL, Earnings22, and TED-LIUM using Whisper Large-v3-Turbo.
| $\alpha$ | CORAAL | Earnings22 | TED-LIUM |
|---|---|---|---|
| 0.0 | 38.75 | 33.25 | 12.93 |
| 0.5 | 26.58 | 16.16 | 11.19 |
| 1.0 | 14.43 | 17.70 | 11.65 |
| 1.5 | 19.65 | 19.44 | 10.11 |
| 2.0 | 20.82 | 21.77 | 11.00 |
This is best read as a sensitivity test, not a second thesis. Higher contrastive strength is not automatically better. CORAAL benefits most at $\alpha=1.0$, Earnings22 at $\alpha=0.5$, and TED-LIUM at $\alpha=1.5$. Every nonzero setting beats the $\alpha=0$ baseline on all three datasets, but the optimum differs by dataset, and pushing $\alpha$ past it gives back part of the gain, which is consistent with over-subtracting useful evidence.
That matters for deployment. A fixed global $\alpha$ may work well enough as a first patch, but production systems will probably want dynamic control by segment, domain, audio quality, or confidence profile.
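The trade-off between one global setting and per-domain tuning can be read straight off the reported numbers. The sketch below uses the $\alpha$ sweep from the table above; the helper names and the use of an unweighted mean across datasets as the "global" criterion are assumptions made for illustration.

```python
from typing import Dict

# WER (%) from the alpha sweep reported for Whisper Large-v3-Turbo.
WER_BY_ALPHA: Dict[float, Dict[str, float]] = {
    0.0: {"CORAAL": 38.75, "Earnings22": 33.25, "TED-LIUM": 12.93},
    0.5: {"CORAAL": 26.58, "Earnings22": 16.16, "TED-LIUM": 11.19},
    1.0: {"CORAAL": 14.43, "Earnings22": 17.70, "TED-LIUM": 11.65},
    1.5: {"CORAAL": 19.65, "Earnings22": 19.44, "TED-LIUM": 10.11},
    2.0: {"CORAAL": 20.82, "Earnings22": 21.77, "TED-LIUM": 11.00},
}


def best_alpha_per_dataset(table: Dict[float, Dict[str, float]]) -> Dict[str, float]:
    """For each dataset, the alpha with the lowest WER."""
    datasets = next(iter(table.values())).keys()
    return {d: min(table, key=lambda a: table[a][d]) for d in datasets}


def best_global_alpha(table: Dict[float, Dict[str, float]]) -> float:
    """Single alpha with the lowest unweighted mean WER (an assumed criterion)."""
    return min(table, key=lambda a: sum(table[a].values()) / len(table[a]))


per_dataset = best_alpha_per_dataset(WER_BY_ALPHA)
global_alpha = best_global_alpha(WER_BY_ALPHA)

# WER points given up on each dataset by using the single global alpha.
cost_of_global = {
    d: round(WER_BY_ALPHA[global_alpha][d] - WER_BY_ALPHA[a][d], 2)
    for d, a in per_dataset.items()
}
```

On these three datasets a single $\alpha=1.0$ gives up about 1.5 WER points on Earnings22 and on TED-LIUM relative to per-dataset tuning, which is roughly what "works well enough as a first patch" looks like in numbers.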
Second, the paper tests each perturbation strategy alone.
| Strategy | CORAAL | Earnings22 | TED-LIUM |
|---|---|---|---|
| Gaussian | 38.11 | 19.50 | 12.49 |
| Silence | 18.99 | 17.41 | 21.62 |
| Audio Shift | 18.77 | 15.54 | 13.81 |
| Multi-negative CD | 14.43 | 16.16 | 10.11 |
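This is an ablation of design choice. It shows that no single negative path dominates. Silence helps CORAAL but damages TED-LIUM badly (21.62 against the 12.93 Turbo baseline). Audio shift is the best single path on Earnings22, where it edges out the combined method (15.54 versus 16.16), but it trails on TED-LIUM. Gaussian noise alone barely moves CORAAL (38.11 against 38.75). The combined method is best on CORAAL and TED-LIUM and close to the best on Earnings22.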
That is the paper’s quiet but important evidence for the multi-negative formulation. The three perturbations are not three ways to say the same thing. They expose different failure tendencies, and the aggregate is more robust than betting on one perturbation.
Beam search explores alternatives; Whisper-CD changes the scoring landscape
The beam-search comparison is the cleanest way to explain the engineering distinction.
| Method | CORAAL | TED-LIUM | Speed tokens/s | RTF |
|---|---|---|---|---|
| Baseline | 38.75 | 12.93 | 174.3 | 0.0246 |
| Beam search, size 5 | 22.65 | 17.50 | 99.0 | 0.0436 |
| Contrastive decoding | 14.43 | 10.11 | 147.0 | 0.0302 |
Beam search improves CORAAL but worsens TED-LIUM. It also cuts throughput substantially, from 174.3 to 99.0 tokens per second. Whisper-CD performs better on both datasets in this comparison and runs at 147 tokens per second versus 99 for beam search, roughly 48% higher token throughput.
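For reference, the ratios behind that comparison, computed directly from the table (throughput in tokens per second, then RTF):

$$ \frac{147.0}{99.0} \approx 1.48, \qquad \frac{147.0}{174.3} \approx 0.84, \qquad \frac{0.0302}{0.0436} \approx 0.69 $$

So contrastive decoding delivers about 48% more tokens per second than beam search, keeps about 84% of the baseline's token throughput, and has a real-time factor about 31% lower than beam search.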
The reason is conceptual. Beam search explores more candidate sequences under the model’s existing distribution. Whisper-CD modifies the logits before token selection. If hallucinated tokens receive high probability because the model’s textual prior is overpowering the acoustic signal, wider search does not necessarily solve the problem. It may just find more polished versions of the same mistake.
For business systems, this distinction is not academic. Beam search is often the default “make decoding better” knob. Whisper-CD suggests another class of knob: test-time distribution correction. Instead of asking the model for more candidates, ask which candidates survive when the audio evidence is degraded. Then penalize those.
What the evidence supports, and what it does not
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark table | Main evidence | Whisper-CD reduces WER across five English long-form ASR benchmarks | Universal performance across languages, accents, domains, or all Whisper settings |
| Qualitative examples | Failure-mode illustration | CD can break repetition loops and recover more plausible transcript flow | Frequency of qualitative improvements in all cases |
| $\alpha$ table | Robustness and sensitivity test | Contrastive strength matters and nonzero values help hallucination-prone datasets | A single optimal setting for production |
| Individual perturbation table | Ablation | The three negative paths are complementary | Every perturbation is always useful |
| Beam-search table | Comparison with common decoding alternative | CD can offer a better accuracy-throughput trade-off than beam search in tested settings | That beam search is never useful or that CD replaces all search methods |
This separation is important because the paper’s contribution is easy to over-market. Whisper-CD is not “speech hallucination solved.” It is a training-free decoding framework that performs well in a specific set of long-form English ASR tests and gives a strong mechanism for why the improvement occurs.
That is already useful. There is no need to inflate it into magic.
The practical value is an inference-layer reliability patch
The most direct business use case is any product that already relies on Whisper-like long-form transcription:
- meeting intelligence and executive summaries;
- call-center analytics;
- earnings-call monitoring;
- interview transcription;
- podcast and video indexing;
- legal and compliance review;
- voice-note search and summarization.
In these settings, hallucinations are not merely accuracy defects. They contaminate downstream AI. A meeting summarizer may summarize a sentence nobody said. A compliance classifier may flag an invented promise. A sales-coaching dashboard may critique a representative for a phrase generated during silence. Once transcripts become input data for other models, transcription hallucination becomes pipeline hallucination.
Whisper-CD is attractive because it acts before the transcript is finalized. That gives it a different operational profile from LLM post-correction. Post-correction may improve readability, but it also risks smoothing over uncertainty. Contrastive decoding is closer to a preventive control: adjust token selection before the false phrase enters the transcript.
A practical deployment framework would look like this:
| Deployment layer | What Whisper-CD changes | Operational question |
|---|---|---|
| Decoder | Reweights token logits using negative audio paths | Does it reduce insertions and loops without harming clean speech? |
| Latency budget | Adds batched negative paths | Is the slower Turbo RTF acceptable for the product tier? |
| Monitoring | Tracks WER proxies, repetition rate, no-speech emissions, and segment resets | Can production detect when contrastive strength is too high or too low? |
| Domain policy | Tunes $\alpha$ by audio type or risk level | Should legal or medical transcription use stronger hallucination suppression than casual notes? |
| Downstream AI | Feeds cleaner transcripts into summarizers and classifiers | Do fewer hallucinated words reduce downstream false claims? |
The ROI case is therefore not “better ASR because the table improved.” The ROI case is lower correction cost, fewer downstream false positives, less human review on hallucination-prone files, and reduced reputational risk when transcripts become evidence for business decisions.
The boundaries are clear enough to matter
The paper tests English long-form benchmarks. That does not automatically transfer to multilingual settings, low-resource languages, code-switching meetings, heavy background noise, medical jargon, legal cross-talk, or domain-specific audio hardware.
The method also depends on perturbation choices. Gaussian noise at an SNR of 10 dB and a 7-second temporal shift are reasonable experimental settings, but production audio is not one dataset. A customer-support call, a board meeting, a podcast, and a courtroom recording may not want the same contrastive strength.
Model architecture is another boundary. Whisper-CD is natural for encoder-decoder ASR because the audio path can be perturbed while the decoder prefix is preserved. The paper explicitly notes that decoder-only ASR models are less straightforward because audio and text are processed in a single stream. This does not make the idea irrelevant; it means the negative-path construction has to be redesigned.
Finally, severe repetition loops may be difficult to recover from once established. The paper’s Large-v3 discussion is a useful warning: when the model assigns overwhelming probability mass to self-reinforcing continuations, logit-level contrast can reduce damage but may not always pull decoding back to a clean trajectory. In production terms, Whisper-CD should probably be paired with loop detection, segment reset policies, and no-speech handling—not treated as a lonely hero with a cape.
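As one example of such a guardrail, a loop detector can be as simple as counting repeated n-grams in a segment's text. This is a heuristic sketch, not something from the paper: the function names, the trigram size, and the threshold of four repeats are all illustrative assumptions that would need tuning on real transcripts.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 3) -> int:
    """Highest number of times any word n-gram occurs in the text."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(grams.values(), default=0)


def looks_like_loop(text: str, n: int = 3, threshold: int = 4) -> bool:
    """Flag a segment whose most frequent n-gram repeats at least `threshold` times."""
    return max_ngram_repeats(text, n) >= threshold


looping_segment = "thank you for watching " * 6
normal_segment = "revenue grew ten percent while costs fell"

should_reset = looks_like_loop(looping_segment)   # candidate for a segment reset
should_keep = not looks_like_loop(normal_segment)
```

A flagged segment could trigger a reset policy, for example re-decoding without the previous-text prompt, so the loop is not handed to the next window as context.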
A small decoding change points to a larger AI systems pattern
The broader lesson is not limited to Whisper. As AI models become expensive to retrain and increasingly embedded inside business workflows, more reliability work will move to inference-time control.
That does not mean training is unimportant. It means deployment teams need more tools between “accept the model’s output” and “train a new model.” Contrastive decoding is one such tool. It asks a disciplined question at generation time: which tokens are supported by the input, and which tokens are merely what the model likes to say when evidence is weak?
For long-form speech recognition, that question is unusually valuable. The transcript is often treated as ground truth by everything downstream. If the first layer fabricates, the rest of the stack becomes very good at processing fiction.
Whisper-CD is therefore best understood as a mechanism-first contribution. Its value is not only that WER goes down. Its value is that it identifies a practical control point: the token distribution can be challenged at inference time using negative evidence derived from the input itself.
That is a useful idea for ASR systems today, and a useful design pattern for AI reliability more broadly.
Sometimes the model does not need a lecture, a new degree, or six more months of fine-tuning. Sometimes it just needs to hear what silence sounds like—and stop treating it as permission to keep talking.
[1] Hoseong Ahn, Jeongyun Chae, Yoonji Park, and Kyuhong Shim, “Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding,” arXiv:2603.06193, 2026. https://arxiv.org/pdf/2603.06193