A voice assistant can transcribe your question correctly and still answer like it heard something else.

That is the awkward part of modern audio-language models. The obvious diagnosis is usually “better speech recognition.” The less obvious diagnosis is nastier: the model may receive an audio input that is semantically equivalent to the text prompt, but once generation begins, its audio-conditioned reasoning trajectory drifts away from the reasoning trajectory it would have followed if the same question had been typed.

In other words, the model listens, then stops thinking like itself.

That is the problem behind CORD, short for Cross-modal Weighted On-policy Reward-guided Distillation, proposed in the paper “CORD: Bridging the Audio–Text Reasoning Gap via Weighted On-policy Cross-modal Distillation” [1]. The paper’s central idea is not to bolt on a bigger recognizer, buy a mountain of labeled speech data, or ask a separate giant teacher model to lecture the audio model. Instead, it uses the model’s own text-conditioned behavior as an internal teacher, then trains the audio-conditioned pathway on the actual audio rollouts where the model makes its mistakes.

That small phrase — actual audio rollouts — is the mechanism that matters.

Most business readers will look at the benchmark table first, because tables feel reassuringly managerial. Resist the temptation for a moment. The interesting contribution is not merely that CORD reduces the audio–text gap by 41.6% on Qwen2-Audio-7B-Instruct and 44.8% on Step-Audio2-mini. The interesting contribution is the diagnosis: audio reasoning errors are not spread evenly across a generated answer. They concentrate at a small number of high-divergence tokens, often early in the reasoning process, where one wrong semantic turn can contaminate everything that follows.

Apparently, even neural networks have meetings where the first bad assumption ruins the whole agenda.

The failure is not just hearing; it is trajectory drift

Large audio-language models are usually built by taking a text-based LLM, adding an audio encoder, and training an alignment module so that audio representations can enter the LLM’s semantic space. This architecture carries a convenient assumption: if the audio and text inputs mean the same thing, the model should reason similarly.

The paper argues that this assumption is not reliable.

For the same underlying question, the model induces two conditional distributions during generation:

  • one distribution when the question enters as audio;
  • another distribution when the question enters as text.

CORD compares these distributions along the generated answer, step by step. The point is not just whether the final answer is wrong. The point is where the audio-conditioned path begins to diverge from the text-conditioned path.

The authors visualize this using token-level reverse KL divergence. In one incorrect example, the model generates a formulaic phrase — “We refer to …” — and eventually selects the wrong multiple-choice option. The final option token has high divergence, but the more important observation is that several earlier reasoning tokens already show elevated divergence. The model’s reasoning did not fail at the last second. It was already wandering.

This matters because a normal loss that averages across all tokens treats boring filler words and decisive reasoning tokens too similarly. If most tokens are already aligned, the average loss becomes a swamp: the few important errors are diluted by many harmless words.

CORD’s empirical analysis on MMSU finds exactly that pattern. The token-level divergence distribution is heavy-tailed. The 80th percentile KL threshold is only 0.23, meaning most tokens have low divergence. High-divergence tokens are relatively sparse, and the paper’s word-cloud analysis shows that these tokens are enriched for reasoning words and answer-choice markers such as “Therefore,” “answer,” “A,” and “B.” Low-divergence regions, by contrast, are dominated by common functional words and background text.
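To make the heavy-tail observation concrete, here is a minimal sketch of the kind of check one could run on a single rollout: compute a percentile threshold over per-token divergences and see which tokens land above it. The token list and divergence values are made up for illustration, and `percentile` and `split_by_divergence` are hypothetical helper names, not anything from the paper.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_by_divergence(tokens, divergences, q=80.0):
    """Split tokens into low and high groups around the q-th percentile."""
    threshold = percentile(divergences, q)
    high = [tok for tok, d in zip(tokens, divergences) if d > threshold]
    low = [tok for tok, d in zip(tokens, divergences) if d <= threshold]
    return threshold, low, high


# Illustrative values only: most tokens sit near zero, a few spike.
tokens = ["We", "refer", "to", "the", "passage", ".",
          "Therefore", ",", "the", "answer", "is", "B"]
divergences = [0.02, 0.05, 0.01, 0.01, 0.08, 0.00,
               1.40, 0.03, 0.02, 0.90, 0.04, 2.10]

threshold, low, high = split_by_divergence(tokens, divergences)
print(high)  # ['Therefore', 'answer', 'B']
```

In this toy rollout the three tokens above the threshold are the reasoning connective and the answer markers, which is the shape of the pattern the paper reports, not a reproduction of its data.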

So the mechanism is not “audio makes every token worse.” It is more selective and more dangerous: audio can perturb a few decision-bearing tokens, and those early deviations can cascade.

That is the replacement belief business readers should take away:

| Reader belief | Paper’s correction | Operational meaning |
|---|---|---|
| Audio reasoning fails mainly because speech is hard to recognize. | The model may hear enough but reason differently once conditioned on audio. | Evaluation should compare typed and spoken versions of the same task, not only transcription accuracy. |
| Distillation should match the teacher’s ideal text trajectory. | Training should correct the audio model on the states it actually visits. | Voice-agent training should focus on failure trajectories, not polished demonstrations. |
| Every generated token deserves roughly equal alignment pressure. | A few high-divergence and early tokens carry disproportionate semantic risk. | Debugging should locate decision points, not merely count final wrong answers. |

This is why the paper is better read as a mechanism paper than as a benchmark paper. The benchmark numbers are the receipt. The mechanism is the purchase.

CORD uses the model’s text pathway as the teacher

CORD’s first design choice is internal self-distillation.

For each text prompt, the authors synthesize a semantically equivalent audio input. The same model can then process the question in two modes: audio-conditioned and text-conditioned. Instead of asking an external teacher model to generate a perfect answer, CORD uses the model’s own text-conditioned distribution as a reference for its audio-conditioned behavior.

That is attractive for a practical reason. External teacher distillation can introduce model mismatch: the teacher may have a different architecture, vocabulary behavior, reasoning style, or calibration profile. CORD avoids that particular mismatch by making the text pathway of the same model the teacher.

But there is a second, more important design choice: on-policy alignment.

Traditional teacher-rollout distillation often supervises the student along a trajectory produced by the teacher. That sounds sensible until one notices the problem: the audio model, during inference, does not live on the teacher’s ideal text trajectory. It lives on its own audio-conditioned trajectory, including the strange prefixes it creates after early mistakes.

CORD samples from the current audio-conditioned policy, then compares the audio and text distributions at each state along that sampled path. This means the training signal lands where the model actually gets lost.

A simple analogy: if an employee keeps making mistakes during customer calls, you do not only train them on a pristine script written by a senior manager. You review the calls where they actually stumbled, including the awkward detours. Less elegant, more useful.

Technically, CORD measures token-level discrepancy using reverse KL divergence:

$$ D_t = KL\left(p_\theta(\cdot \mid y_{<t}, x^a) \,\|\, p_\theta(\cdot \mid y_{<t}, x^t)\right) $$

Here, $x^a$ is the audio input, $x^t$ is the semantically equivalent text input, and $y_{<t}$ is the current generated prefix. The model compares what it would predict next under audio conditioning against what it would predict next under text conditioning, given the same prefix.

The paper argues that reverse KL is useful because it pushes the audio-conditioned policy toward the high-probability decisions preferred by the text-conditioned pathway. In business language: it tries to recover the text model’s decision discipline when the input arrives through speech.
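A minimal sketch of that per-token comparison, in plain Python. It assumes you can obtain next-token logits from both pathways at each step of the same audio rollout; the logits below are toy values over a three-token vocabulary, and the function names are hypothetical. The direction of the divergence (audio distribution first, text distribution second) follows the usual reading of reverse KL with the audio pathway as the student.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(audio_probs, text_probs, eps=1e-12):
    """KL(audio || text): the expectation is taken under the audio distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(audio_probs, text_probs)
    )


def token_divergences(audio_logits, text_logits):
    """One divergence value per step, given per-step logits from both pathways."""
    return [
        reverse_kl(softmax(a), softmax(t))
        for a, t in zip(audio_logits, text_logits)
    ]


# Toy rollout: 2 steps, 3-token vocabulary, made-up logits.
# Step 1: both pathways agree. Step 2: they prefer different tokens.
audio_logits = [[2.0, 0.5, 0.1], [0.2, 2.5, 0.1]]
text_logits = [[2.0, 0.5, 0.1], [2.5, 0.2, 0.1]]

divergences = token_divergences(audio_logits, text_logits)
```

The first step contributes essentially nothing; the second step, where the two pathways back different tokens, carries almost all of the divergence. That asymmetry is what the weighting scheme in the next section exploits.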

Token weighting targets the places where reasoning actually breaks

If token divergence is heavy-tailed, uniform alignment is wasteful.

CORD therefore applies two token-level weights.

First, it selects the top-$K$ highest-divergence tokens, with $K=20$ in the experiments, and gives them higher weight. This is the importance-aware part. Instead of spending most of the gradient budget on already-aligned tokens, it emphasizes the states where the audio pathway most strongly disagrees with the text pathway.

Second, it applies a position-based decay weight, giving earlier tokens more importance and linearly reducing that emphasis over the generated sequence. The intuition is straightforward: an early wrong turn in reasoning is more likely to poison the later answer than a late stylistic variation.

The final token weight is the product of these two components:

$$ w_t = w^{KL}_t \cdot w^{pos}_t $$

The authors set both weighting hyperparameters, $\alpha$ and $\beta$, to 2 in the main experiments. Their sensitivity analysis later supports this choice: $\alpha=\beta=2.0$ gives the most consistent gains, while 1.0 behaves more like uniform KL and 2.5 appears to over-concentrate the gradient on too narrow a token subset.
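The article does not spell out the exact formulas behind $w^{KL}_t$ and $w^{pos}_t$, so the sketch below is one plausible reading, not the authors’ implementation. It assumes that top-$K$ tokens receive weight $\alpha$ and all others weight 1, that the position weight decays linearly from $\beta$ at the first token to 1 at the last, and that the weighted loss is normalized by the sum of weights. All three choices, and all function names, are assumptions made for illustration.

```python
def kl_weights(divergences, k=20, alpha=2.0):
    """Assumed form: weight alpha for the k highest-divergence tokens, else 1."""
    k = min(k, len(divergences))
    ranked = sorted(range(len(divergences)),
                    key=lambda i: divergences[i], reverse=True)
    top = set(ranked[:k])
    return [alpha if i in top else 1.0 for i in range(len(divergences))]


def position_weights(length, beta=2.0):
    """Assumed form: linear decay from beta (first token) to 1 (last token)."""
    if length == 1:
        return [beta]
    return [beta - (beta - 1.0) * t / (length - 1) for t in range(length)]


def token_weights(divergences, k=20, alpha=2.0, beta=2.0):
    """Final weight is the product of the two components, as in w_t above."""
    w_kl = kl_weights(divergences, k, alpha)
    w_pos = position_weights(len(divergences), beta)
    return [a * b for a, b in zip(w_kl, w_pos)]


def weighted_divergence(divergences, **kwargs):
    """Weighted mean of per-token divergences (normalization is an assumption)."""
    weights = token_weights(divergences, **kwargs)
    return sum(w * d for w, d in zip(weights, divergences)) / sum(weights)


# Toy example with k=2 so the effect is visible on five tokens.
divergences = [0.1, 1.5, 0.05, 0.02, 0.9]
weights = token_weights(divergences, k=2)
print(weights)  # [2.0, 3.5, 1.5, 1.25, 2.0]
```

Note how the second token, which is both early and high-divergence, ends up with the largest weight, while a late high-divergence token is boosted less. Under this reading, setting $\alpha=\beta=1$ reduces every weight to 1, which matches the article’s remark that 1.0 behaves like uniform KL.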

That sensitivity result is not the main thesis. It is a robustness and tuning test. Its purpose is to show that the weighting mechanism is not arbitrary hand-waving, while also warning that “more focus” is not always better. Over-focus can suppress useful long-tail semantic information. A model trained only on the loudest errors may become very good at correcting the loudest errors and slightly worse at understanding the quieter context. Tragic, but familiar.

Sequence-level GRPO keeps local fixes from becoming global nonsense

Token-level alignment solves a local problem. It does not guarantee that the whole answer is semantically aligned.

CORD adds a sequence-level reward using GRPO, or Group Relative Policy Optimization. For a given audio input, the model samples multiple audio-conditioned trajectories. A judge model evaluates whether each audio-conditioned answer is semantically consistent with the text-conditioned answer. The reward is binary: aligned or not aligned.

The GRPO objective then increases the likelihood of audio trajectories that receive higher rewards relative to other sampled trajectories in the same group.
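The “relative to other sampled trajectories” part is easiest to see in the advantage computation. The sketch below uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation); whether CORD uses exactly this form is an assumption, and the rewards are invented.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four audio rollouts for one question; the judge marks two as aligned.
rewards = [1, 0, 0, 1]
advantages = group_relative_advantages(rewards)

# If every rollout gets the same verdict, there is no relative signal.
flat = group_relative_advantages([1, 1, 1, 1])
```

With a binary reward, aligned rollouts get a positive advantage and misaligned ones a negative advantage of the same size. A group where every rollout receives the same verdict yields all-zero advantages, so it contributes no gradient at all, which is one reason the quality and discrimination of the judge matter.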

This is where the paper’s “no external teacher” claim needs careful reading. CORD does not rely on an external teacher model to generate the reference reasoning trajectory for distillation. The text-conditioned pathway inside the same model is the reference. However, the sequence-level component does use a judge model. The paper says this judge was developed by distilling evaluation outputs of proprietary frontier models on millions of text-based instruction-following samples and reports self-evaluation accuracy above 99%.

For research, that is acceptable as an implementation detail. For business adoption, it is a cost and governance detail. A company reproducing this recipe still needs an evaluator, and evaluator quality matters. A weak judge can reward shallow agreement. A biased judge can encode the wrong standard of “semantic consistency.” A proprietary judge can make the pipeline harder to audit. Yes, the teacher may be internal; the referee still has a badge.

The mechanism stack therefore looks like this:

| CORD component | Likely purpose in the paper | What it supports | What it does not prove |
|---|---|---|---|
| On-policy audio rollouts | Main mechanism | The model is trained on states it actually encounters under audio input. | It does not prove the method generalizes to every speech environment. |
| Internal text pathway as teacher | Main mechanism | Avoids external teacher trajectory mismatch for token-level alignment. | It does not remove the need for a judge in sequence-level optimization. |
| High-KL token weighting | Ablation-supported mechanism | Focuses learning on sparse, semantically important divergence points. | Its absolute gain over plain on-policy distillation (OPD) is modest in the reported ablation. |
| Early-token positional weighting | Mechanism plus sensitivity test | Reflects the cascade risk of early reasoning errors. | It does not mean late tokens are irrelevant. |
| GRPO sequence reward | Main evidence and ablation focus | Aligns complete reasoning trajectories and improves global answer consistency. | GRPO alone can become unstable without OPD anchoring. |

The last row matters. In the ablation, GRPO alone improves at 500 steps but collapses by 1000 steps. On Qwen2-Audio-7B-Instruct, GSM8K accuracy rises to 35.59 at 500 GRPO steps, then falls to 19.89 at 1000 steps, below the base model’s 20.73. MMSU and OpenBookQA (OBQA) also collapse sharply.

The paper’s interpretation is that on-policy distillation (OPD) acts as a stabilizer. With GRPO plus OPD, the model trains for 3000 steps without that collapse. Adding token-level weighting gives the best final numbers: 38.06 on MMSU, 52.77 on OBQA, and 36.20 on GSM8K.

This ablation is not just a component checklist. It is a warning: reinforcement-style sequence alignment can push the model in the right direction, then push it off a cliff. The distillation anchor matters.

The benchmark gains are real, but the pattern is uneven

The main results compare CORD with SFT and Forward-KL baselines on two audio-language backbones: Qwen2-Audio-7B-Instruct and Step-Audio2-mini. The benchmarks are MMSU and OpenBookQA from VoiceBench for knowledge-based question answering, GSM8K for mathematical reasoning, and MMAU for acoustic and paralinguistic reasoning.

The paper defines the modality gap as:

$$ \Delta_{base} = Acc^{Base}_{text} - Acc^{Method}_{audio} $$

So the question is not merely “does the audio model improve?” It is “how much closer does the audio-conditioned model get to the base model’s text-conditioned performance?”

The headline result:

| Backbone | Base average audio-text gap | CORD average gap | Relative gap reduction | Forward-KL relative gap reduction |
|---|---|---|---|---|
| Qwen2-Audio-7B-Instruct | 15.25 | 8.90 | 41.6% | 28.5% |
| Step-Audio2-mini | 10.86 | 6.00 | 44.8% | 10.5% |

This is the cleanest evidence for CORD’s main claim. Across the two backbones, CORD narrows the audio–text reasoning gap more consistently than SFT or Forward-KL.
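The headline percentages can be checked from the reported average gaps. The sketch below assumes “relative gap reduction” means the drop in gap divided by the base gap; that definition is inferred here, not quoted from the paper, but it reproduces both reported figures.

```python
def modality_gap(base_text_acc, method_audio_acc):
    """Gap between the base model's text accuracy and a method's audio accuracy."""
    return base_text_acc - method_audio_acc


def relative_gap_reduction(base_gap, method_gap):
    """Share of the base gap that the method closes, in percent."""
    return 100.0 * (base_gap - method_gap) / base_gap


qwen_reduction = relative_gap_reduction(15.25, 8.90)
step_reduction = relative_gap_reduction(10.86, 6.00)

print(round(qwen_reduction, 1))  # 41.6
print(round(step_reduction, 1))  # 44.8
```

In absolute terms, that is 6.35 points of gap closed on Qwen2-Audio-7B-Instruct and 4.86 points on Step-Audio2-mini, with 8.90 and 6.00 points still open.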

The per-task picture is more informative:

| Backbone and method | MMSU audio accuracy | OBQA audio accuracy | GSM8K audio accuracy |
|---|---|---|---|
| Qwen2-Audio base | 36.04 | 51.20 | 20.73 |
| Qwen2-Audio + CORD | 38.06 | 52.77 | 36.20 |
| Step-Audio2 base | 52.31 | 72.30 | 43.75 |
| Step-Audio2 + CORD | 57.63 | 77.74 | 47.56 |

For Qwen2-Audio, the biggest improvement is on GSM8K, where audio accuracy rises from 20.73 to 36.20. For Step-Audio2-mini, the larger improvements are on MMSU and OBQA, while GSM8K improves more modestly. This nuance matters because the paper also argues for cross-domain generalization: CORD trains on 80,000 NuminaMath examples converted into audio using Kokoro, yet also improves non-math benchmarks.

That argument is plausible, but it should not be oversold. The evidence shows that gains are not confined to the training domain. It does not show that one math-only synthetic dataset is enough for production voice agents in medicine, logistics, banking, or multilingual call centers. A benchmark transfer signal is not the same thing as domain readiness. There, the moat is still wet.

Preserving audio ability is almost as important as improving reasoning

A useful voice model should not become a better reasoner over spoken questions by forgetting how to understand audio in general. This is why the MMAU result matters.

The paper evaluates Qwen2-Audio-7B-Instruct on music, sound, and speech categories. The goal is to test whether reasoning alignment damages auxiliary audio capabilities beyond speech-question answering.

| Method | Music | Sound | Speech | Average |
|---|---|---|---|---|
| Base model | 58.98 | 64.74 | 58.73 | 60.81 |
| SFT | 56.29 | 64.44 | 51.51 | 57.39 |
| Forward KL | 55.99 | 61.70 | 53.01 | 56.90 |
| CORD | 60.18 | 64.44 | 55.42 | 60.01 |

CORD is not perfectly lossless. Speech drops from 58.73 to 55.42, and the average remains slightly below the base model. But it preserves auxiliary audio ability much better than SFT and Forward KL, and it even improves the music score from 58.98 to 60.18.

The likely interpretation is not “CORD magically improves all audio skills.” It is narrower and more useful: on-policy alignment appears to reduce collateral damage compared with conventional distillation. The model becomes better at audio-conditioned reasoning without paying the same forgetting tax on general audio tasks.

For business deployment, this is highly relevant. Voice agents are rarely pure reasoning machines. They operate in environments where speech tone, background sounds, turn-taking, speaker uncertainty, and non-verbal acoustic cues may matter. If training a model to reason better through audio makes it worse at recognizing useful audio signals, the enterprise has not improved the system. It has merely moved the failure to a place the benchmark might not look.

The business value is diagnostic leverage, not just benchmark lift

CORD’s practical value is not that every company should immediately implement this exact training pipeline. Most will not. The method assumes access to model internals, paired audio-text data, rollout generation, token-level distribution comparison, and a competent judge model. This is not a no-code weekend experiment, despite what someone on LinkedIn will inevitably imply.

The more general business lesson is diagnostic.

For voice agents, audio copilots, and spoken enterprise assistants, performance evaluation should not stop at transcription accuracy or final answer accuracy. The real question is whether the audio-conditioned model follows the same reasoning policy it follows when the input is typed.

That suggests a more useful evaluation framework:

| Business question | CORD-inspired diagnostic | Why it matters |
|---|---|---|
| Does the assistant understand spoken requests? | Compare typed and spoken versions of the same task. | Isolates modality-induced degradation from general reasoning weakness. |
| Where does the spoken interaction fail? | Inspect divergence across the generated trajectory, not only final answers. | Finds early decision points that trigger cascading errors. |
| Is fine-tuning damaging other audio skills? | Test reasoning gains alongside music, sound, speech, and paralinguistic tasks. | Prevents “improvement” that quietly breaks broader audio intelligence. |
| Can the system improve without massive labeled speech data? | Use synthetic paired audio-text prompts where domain ambiguity is controlled. | Reduces data cost, though production realism still requires validation. |
| Is the evaluator trustworthy? | Audit the judge model or human evaluation standard used for sequence rewards. | A bad judge turns alignment into optimized self-deception. Very efficient, unfortunately. |

The ROI pathway is therefore not simply “CORD makes audio models better.” It is more precise:

  1. Find the audio-text reasoning gap by comparing spoken and typed equivalents.
  2. Locate trajectory divergence rather than treating every wrong answer as a black box.
  3. Prioritize high-impact tokens and early reasoning states where small deviations cause large downstream failures.
  4. Use sequence-level evaluation to ensure local token alignment improves complete answers.
  5. Check collateral damage on non-target audio capabilities.

That is a practical workflow even for teams that do not reproduce CORD end to end. It changes how voice AI failures are investigated.
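The first step of that workflow needs nothing more exotic than paired results. Here is a minimal sketch, assuming each task has been run once as typed text and once as speech and graded for correctness; the `PairedResult` record, the `modality_report` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    task_id: str
    typed_correct: bool    # graded answer when the question was typed
    spoken_correct: bool   # graded answer when the same question was spoken


def modality_report(results):
    """Summarize the typed-vs-spoken gap and list modality-induced failures."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    typed_acc = 100.0 * sum(r.typed_correct for r in results) / n
    spoken_acc = 100.0 * sum(r.spoken_correct for r in results) / n
    # Tasks the model solves when typed but misses when spoken are the
    # ones worth inspecting for trajectory drift.
    drifted = [r.task_id for r in results
               if r.typed_correct and not r.spoken_correct]
    return {
        "typed_accuracy": typed_acc,
        "spoken_accuracy": spoken_acc,
        "gap": typed_acc - spoken_acc,
        "modality_failures": drifted,
    }


# Made-up results for four paired tasks.
results = [
    PairedResult("q1", True, True),
    PairedResult("q2", True, False),
    PairedResult("q3", True, True),
    PairedResult("q4", False, False),
]
report = modality_report(results)
```

The `modality_failures` list is the useful output. A task that fails in both modes (like `q4` here) is a general reasoning problem; a task that fails only when spoken is a candidate for the trajectory-level inspection in step 2.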

The boundaries are narrow enough to respect

The paper is strongest when read as evidence for a mechanism: audio-conditioned reasoning can drift from text-conditioned reasoning, and on-policy cross-modal alignment can reduce that drift.

It is weaker if inflated into a universal claim about voice AI readiness.

First, the training data is 80,000 NuminaMath examples converted to audio using Kokoro. That is controlled and useful for isolating the alignment objective. It is not equivalent to messy real-world speech. Customer calls contain accents, interruptions, domain terminology, background noise, emotional tone, partial information, and people who start sentences with “actually” and then abandon civilization.

Second, the evaluation covers selected benchmarks: MMSU, OpenBookQA, GSM8K, and MMAU. These are valuable, but they do not substitute for domain-specific acceptance testing. A legal intake assistant, a medical triage assistant, and a warehouse voice copilot have very different failure costs.

Third, the judge model is powerful but not transparent from a deployment perspective. The paper reports that it was distilled from proprietary frontier model evaluations and achieves self-evaluation accuracy above 99%. That supports the experimental setup, but businesses still need to ask whether their judge reflects their own standards, risks, and compliance requirements.

Fourth, the ablation shows that GRPO alone can collapse. This is not a cosmetic detail. It means sequence-level reward optimization is powerful enough to destabilize the model if not anchored properly. The full method works because the pieces reinforce and restrain one another.

Finally, the MMAU result is encouraging but not absolute. CORD preserves audio capabilities better than the baselines, yet it still reduces the speech category score relative to the base model. Any production adaptation should monitor the full audio skill portfolio, not just the reasoning benchmark that motivated training.

Audio agents need to be tested as listeners and thinkers

CORD’s contribution is not that it discovers audio models are weaker than text models. That has been visible for a while. Its contribution is to show a plausible internal failure pattern and a targeted training response.

The pattern is simple enough to remember:

  • audio and text inputs may be semantically equivalent;
  • the model’s audio-conditioned trajectory may still diverge from its text-conditioned trajectory;
  • the divergence is sparse, high-impact, and often early;
  • training should therefore focus on the model’s real audio mistakes, not only ideal teacher rollouts;
  • local token alignment needs a global sequence-level check;
  • global reward optimization needs an on-policy distillation anchor to avoid collapse.

For businesses building voice agents, this reframes the problem. The question is not only whether the model can hear. The question is whether hearing preserves the reasoning behavior the model already has when reading.

That distinction matters. A voice agent that transcribes correctly but reasons differently is hard to debug with traditional speech metrics. It may pass ASR tests, pass some final-answer tests, and still fail in precisely the complex spoken workflows where enterprises wanted AI help in the first place.

CORD offers a sharper lens. It says: compare the paths, not just the outputs.

For once, that is not a poetic metaphor. It is the training signal.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, and Haifeng Wang, “CORD: Bridging the Audio–Text Reasoning Gap via Weighted On-policy Cross-modal Distillation,” arXiv:2601.16547, 2026. https://arxiv.org/abs/2601.16547