Lost in Translation (Literally): Why ASR Still Breaks in the Age of Voice Agents

Voice is supposed to be the easy interface.

No menus. No forms. No training session. A user speaks, the agent understands, and some neat piece of software magic happens in the background. That is the sales pitch. It is also mostly true in a demo room, which is a place where microphones behave, users speak politely, and nobody’s child interrupts from the back seat.

Production is less considerate.

A customer calls from a noisy street. A child says one syllable and waits. A patient with an older voice pauses mid-sentence. A bilingual user switches language halfway through a command. The voice agent hears something, transcribes something else, and then hands that text to a downstream system that may retrieve records, trigger tools, confirm payments, or update account settings. At that point, a transcription error is no longer a typo. It is a control input.

That is the useful warning in Back to Basics: Revisiting ASR in the Age of Voice Agents, which introduces WildASR, a multilingual diagnostic benchmark for automatic speech recognition under real-world distribution shifts.¹ The paper is not interesting because it says ASR sometimes fails. Everyone who has used a phone menu already owns that research result. It is interesting because it breaks the problem into operational categories: where the audio is degraded, who is speaking, and what kind of linguistic pattern the user produces.

That category structure is the point. Businesses do not need another abstract reminder that “edge cases matter.” They need to know which edge cases belong in the pre-launch test plan, which ones require fallback logic, and which ones should make a voice agent refuse to act without confirmation. WildASR is best read less as a benchmark leaderboard and more as a deployment audit template.

The real misconception is not “ASR is bad.” It is “average ASR is enough.”

The tempting story is that speech recognition has matured. Curated benchmarks show very low error rates. Large audio-language models are improving quickly. Speech-to-speech systems can sometimes bypass explicit transcription altogether. So the easy conclusion is that ASR is now plumbing: necessary, boring, and basically solved.

WildASR attacks that conclusion from a practical angle. The authors note that modern systems can report word error rates below 5% on curated benchmarks, yet voice agents operate under conditions those benchmarks do not systematically cover: telephony compression, far-field microphones, reverberation, regional accents, disfluencies, clipped audio, short utterances, incomplete speech, and code-switching. The issue is not that benchmark progress is fake. The issue is that benchmark progress is compressed into a single number that hides where the failures are.

A single ASR score answers a vendor-friendly question: “How accurate is the model on this dataset?” Production teams need to ask a less pleasant one: “Which user, in which condition, speaking in which pattern, causes the system to become unsafe or useless?”

WildASR’s contribution is to make that second question testable.

WildASR organizes speech failure into where, who, and what

The benchmark covers four languages: English, Chinese, Japanese, and Korean. It evaluates seven systems: Whisper Large V3, GPT-4o Transcribe, Gemini 2.5 Pro, Gemini 3 Pro, Qwen2-Audio, Nova 2, and Scribe V1. The authors report WER for English, CER for Chinese/Japanese/Korean, and MER for code-switching, where mixed-script tokenization is needed.
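The mixed-script tokenization mentioned above can be sketched as follows. This is a minimal illustration of one common convention (one token per CJK character, one token per Latin-script word), not the paper's actual tokenizer; the function name `mixed_tokens` and the exact character ranges are assumptions made for the example.

```python
import re

# Hypothetical tokenizer for code-switched text (an assumption, not the paper's rules):
# - each CJK ideograph, kana, or Hangul syllable becomes its own token
# - each run of Latin letters, digits, or apostrophes becomes one token
_TOKEN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]|[A-Za-z0-9']+")


def mixed_tokens(text: str) -> list[str]:
    """Split mixed-script text into character-level and word-level tokens."""
    # Lowercasing only affects Latin tokens; CJK characters are unchanged.
    return [token.lower() for token in _TOKEN.findall(text)]


print(mixed_tokens("我想 cancel 订单"))
print(mixed_tokens("Please 取消 it"))
```

With tokens defined this way, an edit-distance error rate can be computed over a single sequence even when the utterance mixes scripts, which is why a mixed metric is reported for code-switching rather than plain WER or CER.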

The design follows a useful principle: real source speech, controlled stress. The source audio comes from human recordings rather than text-to-speech generation. Controlled perturbations are then applied where needed, so the benchmark can isolate a specific stress factor without pretending that synthetic speech contains the messy details of actual human articulation.

That matters. Synthetic speech can imitate pitch or rough age category, but it often misses hesitations, unstable articulation, irregular prosody, and other small human irregularities that speech systems quietly trip over. In the paper’s discussion, Whisper Large V3 performs near ceiling on synthetic English child speech at 3.7% WER, but reaches 21.7% WER on real English child speech. That gap is not a rounding error. It is the difference between testing a costume and testing the person wearing it.

WildASR splits the failure surface into three categories:

Category	What it tests	Examples in WildASR	Why it matters operationally
Environmental degradation	The recording condition	Reverberation, far-field audio, phone codecs, noise gaps, clipping	Determines whether audio quality should trigger fallback, repair, or abstention
Demographic shift	The speaker population	Children, older adults, accented speech	Determines whether the voice product works for the users it claims to serve
Linguistic diversity	The interaction pattern	Short utterances, incomplete audio, code-switching	Determines whether the agent can survive real conversation rather than textbook sentences

This structure is more useful than a leaderboard because it maps directly to deployment decisions. A call center, classroom assistant, healthcare intake bot, in-car agent, and multilingual sales assistant do not face the same speech risk. Treating them as one “voice AI” category is how optimistic dashboards become legal discovery exhibits. Lovely, but not ideal.

Where: bad audio does not degrade smoothly

The environmental subset tests five controlled perturbations: reverberation, far-field audio, phone codec effects, noise gaps, and clipping. The paper evaluates these on both FLEURS, which is closer to read speech, and MagicData, which captures more spontaneous conversational speech.

The first result is unsurprising but important: every acoustic perturbation increases error. The more useful result is that degradation is uneven across languages and recording settings.

Noise gaps are especially brutal for conversational speech. On MagicData, the average model error increase under noise gaps is +67.7 percentage points for English and +10.3 for Chinese. For Japanese and Korean, the increases are much larger: +118.9 and +121.0 percentage points respectively. Because WER/CER can exceed 100 when insertions dominate, these figures are not just “the system missed some words.” They indicate systems generating a lot of extra content.
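A small worked example makes the "above 100" point concrete. The sketch below computes a word-level error rate as token edit distance divided by reference length, which is the standard WER definition; the function names and the sample utterances are made up for illustration.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference token missing from output
                curr[j - 1] + 1,         # insertion: extra token in output
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: str, hyp: str) -> float:
    """Word error rate in percent; the denominator is the reference length."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# A dropped word on a two-word reference: 1 error / 2 words.
missed = error_rate("hold on", "hold")
# Seven invented words on the same reference: 7 errors / 2 words.
padded = error_rate("hold on", "hold on please let me check my account balance")

print(missed)  # 50.0
print(padded)  # 350.0
```

Because the denominator is the reference length, not the output length, every inserted token adds error without any cap. A short reference with a long fabricated output is exactly the shape that produces three-digit scores.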

That is the first business lesson: audio robustness is not a transferable certificate. A model that survives one language, corpus, or perturbation may fail badly under another. Buying “noise-robust ASR” without specifying which noise, which language, and which interaction pattern is procurement theater with a microphone attached.

The paper’s P90 elbow analysis is particularly useful here. Instead of looking only at the mean WER as reverberation increases, the authors track the 90th percentile error rate. The mean grows gradually, but the P90 curve rises faster and exposes severe tail failures earlier. The elbow marks the point where the tail begins accelerating.

That is deployable. A production system could use a similar threshold to decide when to continue, when to ask the user to repeat, when to switch channel, or when to prevent high-impact tool execution. The key is that the P90 elbow is not trying to make the model look better. It is trying to reveal when the worst 10% of interactions stop being tolerable.
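One way to sketch the idea in code: compute the mean and the 90th percentile of per-utterance error at each severity level, then look for the level where the P90 curve's slope increases the most. The elbow rule used here (largest discrete second difference) is an assumption chosen for simplicity, not necessarily the paper's procedure, and every number in `errors_by_severity` is invented.

```python
from statistics import mean, quantiles

# Hypothetical per-utterance error rates (%) at increasing reverberation levels.
# These values are invented for illustration; they are not from the paper.
errors_by_severity = {
    0: [2, 3, 4, 5, 5, 6, 6, 7, 8, 10],
    1: [2, 3, 4, 5, 6, 6, 7, 8, 9, 12],
    2: [3, 4, 5, 6, 6, 7, 8, 9, 12, 40],
    3: [3, 4, 6, 7, 8, 9, 10, 14, 45, 120],
    4: [4, 5, 7, 9, 11, 14, 20, 60, 110, 180],
}


def p90(values):
    """90th percentile with linear interpolation between sorted data points."""
    return quantiles(values, n=10, method="inclusive")[8]


def elbow_index(curve):
    """Index where the curve's slope increases the most (largest second difference)."""
    if len(curve) < 3:
        raise ValueError("need at least three points to locate an elbow")
    second = [curve[i + 1] - 2 * curve[i] + curve[i - 1] for i in range(1, len(curve) - 1)]
    return 1 + second.index(max(second))


severities = sorted(errors_by_severity)
mean_curve = [mean(errors_by_severity[s]) for s in severities]
p90_curve = [p90(errors_by_severity[s]) for s in severities]
elbow_severity = severities[elbow_index(p90_curve)]

for s, m, p in zip(severities, mean_curve, p90_curve):
    print(f"severity={s}  mean={m:5.1f}  p90={p:6.1f}")
print("P90 elbow at severity", elbow_severity)
```

In this toy data the mean at severity 2 is still only 10, while the P90 has already started to accelerate. That is the pattern a threshold based on the tail is meant to catch; the specific threshold would have to be re-derived per model, language, and device.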

For voice agents, that tail is often the product. Most users remember the one dangerous misunderstanding, not the 97 routine turns that worked.

Who: demographic robustness is not evenly distributed

The demographic subset covers children, older adults, and accented speech, but only for English and Chinese because high-quality child and older-adult speech resources are limited for the other languages. That boundary matters: the paper diagnoses a real issue, but it does not claim complete demographic coverage for all languages.

The results show a sharp asymmetry. English accent and older-adult speech remain relatively robust for many systems, with error rates often in the low single digits. Chinese demographic conditions are much harder. In Table 3, Chinese accent CER ranges from 7.5% for Qwen2-Audio to 62.5% for Gemini 3 Pro; Chinese child speech ranges from 23.4% for Qwen2-Audio to 65.1% for Scribe V1; Chinese older-adult speech ranges from 18.6% for Qwen2-Audio to 52.6% for Gemini 2.5 Pro.

English child speech is also persistently difficult. The best reported English child result is still 18.2% WER from Gemini 3 Pro. That is not catastrophic for a dictation toy. It is much more concerning for an educational tutor, family assistant, child-safety tool, or any product where children are treated as first-class users rather than adorable noise sources in the background.

There is also an instructive model-specific pattern. Qwen2-Audio performs best on the Chinese demographic subsets, which the authors suggest may reflect stronger coverage in Chinese training data. The broader point is not that one model wins. It is that robustness can be local to language, data mixture, and speaker group.

For business teams, this changes the evaluation question. Do not ask, “Does the vendor support Chinese?” Ask, “Does the vendor support the Chinese-speaking users in our actual demographic distribution?” The difference sounds small until a product is deployed to older adults, children, regional speakers, or second-language users.

Prompt sensitivity turns ASR into a configuration risk

One of the paper’s most useful diagnostics is easy to overlook because it sounds almost too simple: test the same audio with different prompts.

The authors evaluate Gemini 2.5 Pro on demographic subsets using ten paraphrased prompts. Each prompt asks for the same thing: transcribe the audio in the target language and output only the transcript. The wording changes; the task does not.

For English, variation is minimal: the standard deviation is at most 0.6 percentage points across conditions. For Chinese, the effect is much larger: the standard deviation across prompts reaches 13.7 percentage points for accent, 46.1 for children, and 8.3 for older adults.

This is not an exotic jailbreak issue. It is basic transcription. The model is not being asked to reason about moral philosophy while blindfolded. It is being asked to write down what was said.

The operational implication is awkward but clear: prompt wording can become part of the ASR system’s reliability profile. A team that evaluates one prompt, ships another, and later modifies the wrapper instruction during a UI refactor may have changed performance without changing the model. That is the kind of silent configuration drift that makes postmortems unnecessarily theatrical.

A practical evaluation should therefore include prompt robustness, not just prompt selection. The metric is simple: use a controlled set of semantically equivalent prompts, then report both mean error and variance. Low average error with high prompt variance is not stable ASR. It is a system that happens to behave under one phrasing.
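The metric described above is easy to sketch. The example below assumes the error rate has already been measured once per paraphrased prompt on the same audio set; the numbers, the function names, and the `max_std` threshold are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical error rates (%) for the same audio under five paraphrased prompts.
# Both conditions have roughly the same mean; only the spread differs.
stable_condition = [4.8, 5.1, 5.0, 5.3, 4.9]
unstable_condition = [3.0, 3.5, 2.5, 4.0, 12.0]


def prompt_stability(errors_per_prompt):
    """Return (mean, sample standard deviation) of error across prompts."""
    if len(errors_per_prompt) < 2:
        raise ValueError("need at least two prompts to measure variance")
    return mean(errors_per_prompt), stdev(errors_per_prompt)


def is_stable(errors_per_prompt, max_std=2.0):
    """Flag a condition as stable if prompt-to-prompt spread stays under max_std.

    max_std is an illustrative threshold in percentage points, not a recommendation.
    """
    _, spread = prompt_stability(errors_per_prompt)
    return spread <= max_std


for name, errors in [("stable", stable_condition), ("unstable", unstable_condition)]:
    avg, spread = prompt_stability(errors)
    print(f"{name}: mean={avg:.2f}  std={spread:.2f}  ok={is_stable(errors)}")
```

A report that shows only the mean would rate both conditions at about 5% error. Reporting the spread next to it is what separates the two.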

What: short, incomplete, and mixed speech exposes hallucination

The linguistic diversity subset is where WildASR becomes most relevant to voice agents. Real dialogue is full of tiny utterances, interruptions, fragments, and language mixing. Traditional ASR benchmarks often prefer complete, well-formed input. Humans, regrettably, did not sign that agreement.

The paper tests three scenarios: short utterances, incomplete audio, and code-switching. It also reports Hallucination Error Rate (HER), which is designed to capture semantic-level fabrication that lexical metrics may understate.

Short utterances are consistently hard. Even in English, reported error rates for short utterances range from 38.7% to 73.9% across the seven models. This should make anyone building a voice interface pause. Short utterances are not rare edge cases. They are the grammar of conversation: “yes,” “no,” “stop,” “next,” “again,” “hold on,” “cancel.”

The failure mechanism is plausible. Short clips contain little acoustic evidence, are vulnerable to voice activity detection errors, and give decoder-style systems more room to lean on language priors. If the audio is ambiguous, the model may complete what it thinks the user probably meant. That is useful in autocomplete. It is less charming when the system is supposed to transcribe actual speech.

The most alarming results involve insertion-heavy failures, where WER/CER/MER can exceed 100%. The paper reports Qwen2-Audio reaching 102.6% CER on Korean short utterances, 211.7% MER on Korean code-switching, and 224.4% CER on Japanese incomplete audio. These numbers mean the output is not merely inaccurate; it contains substantial generated content beyond the target transcript.

HER clarifies why this matters. Nova 2 on Chinese code-switching has 33.7% MER but 68.4% HER. That gap means surface error and semantic risk are not the same thing. A transcript can look close enough lexically while still distorting meaning in a way that matters for downstream action. The paper gives the example of a single inserted negation changing meaning. Voice agents do not need many such errors before users stop trusting them.

This is the second business lesson: ASR error types are not equal.

Failure type	Product interpretation	Typical mitigation
Omission	The system missed part of what was said	Ask for repetition when confidence is low
Substitution	The system replaced one phrase with another	Confirm before irreversible actions
Hallucination	The system added content not spoken	Block high-impact automation unless transcript evidence is strong
Semantic distortion	The text changes intent despite seeming plausible	Combine ASR confidence with intent-level risk checks

A missed filler word is annoying. A fabricated “cancel,” “approve,” “transfer,” or “delete all” is a product incident wearing a headset.
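The mitigations in the table can be combined into a simple gate in front of tool execution. The sketch below is one possible policy under stated assumptions: the intent names, the `asr_confidence` field, and both thresholds are hypothetical, and it presumes the ASR system exposes a usable confidence score, which not every system does.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"
    CONFIRM = "confirm"
    ASK_REPEAT = "ask_repeat"


# Hypothetical intent names; a real product would define its own list.
HIGH_IMPACT_INTENTS = frozenset({"cancel", "approve", "transfer", "delete_all"})


@dataclass(frozen=True)
class Turn:
    transcript: str
    intent: str
    asr_confidence: float  # assumed to lie in [0, 1]


def gate(turn: Turn, low: float = 0.5, high: float = 0.9) -> Action:
    """Decide what the agent may do with one recognized turn."""
    if turn.asr_confidence < low:
        # Weak evidence: do not act on the transcript at all.
        return Action.ASK_REPEAT
    if turn.intent in HIGH_IMPACT_INTENTS:
        # Irreversible actions are never executed without explicit confirmation.
        return Action.CONFIRM
    if turn.asr_confidence >= high:
        return Action.EXECUTE
    # Low-impact intent with middling confidence: a cheap confirmation.
    return Action.CONFIRM


print(gate(Turn("next", "next_item", 0.95)))         # Action.EXECUTE
print(gate(Turn("cancel my order", "cancel", 0.99)))  # Action.CONFIRM
print(gate(Turn("uh", "cancel", 0.30)))               # Action.ASK_REPEAT
```

The design choice worth noting is that high confidence never unlocks a high-impact intent on its own. A confident transcript can still be a hallucinated one, so the gate keys on the cost of being wrong as well as the score.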

What the tests support—and what they do not

WildASR’s experiments serve different evidential purposes. Treating all tables and appendices as the same kind of proof would flatten the paper’s value.

Test or analysis	Likely purpose	What it supports	What it does not prove
Clean benchmark reference	Main comparison baseline	Curated performance can look strong while OOD performance fails	That clean benchmarks are useless
Environmental perturbations	Main evidence	Acoustic degradation creates non-uniform, language-dependent failures	That these five perturbations cover every production audio condition
P90 elbow under reverberation	Robustness/sensitivity diagnostic	Tail risk can accelerate before mean error looks catastrophic	A universal threshold across models, languages, or hardware
Demographic subsets	Main evidence and fairness/product-risk probe	Age, accent, and language interact strongly with ASR reliability	Complete demographic coverage across all languages
Prompt sensitivity	Sensitivity test	Instruction wording can materially affect transcription stability	That prompt engineering alone solves ASR robustness
Linguistic diversity with HER	Main evidence for semantic risk	Short, incomplete, and mixed-language inputs can trigger hallucination	That HER replaces WER/CER/MER for all use cases
Real versus synthetic child speech	Exploratory validation of benchmark design	Synthetic evaluation may underestimate real speech difficulty	That all synthetic speech is useless in all testing pipelines

This table is also a decent template for internal evaluation. Not every test has to be a leaderboard. Some tests exist to find thresholds. Some exist to expose instability. Some exist to validate the benchmark itself. A mature voice-agent team should know which kind of test it is running before it celebrates the result.

Turning WildASR into a voice-agent risk audit

The strongest business use of WildASR is not copying the benchmark wholesale. It is adopting the shape of the evaluation.

A practical pre-deployment audit would look something like this:

Audit question	WildASR-inspired method	Deployment decision
Where will audio quality collapse?	Test reverberation, far-field, codec compression, clipping, and silence/noise gaps using target devices and channels	Set abstention, repeat-request, or channel-switch thresholds
Who will be underserved?	Build demographic slices for the actual user base, not a generic “global” sample	Localize model choice, routing, or fallback policy
What dialogue patterns are risky?	Test short commands, interrupted speech, and code-switching from real interactions	Require confirmation for high-impact intents
Is prompt behavior stable?	Evaluate semantically equivalent prompts and measure variance	Freeze prompt templates and monitor configuration drift
Are errors semantically dangerous?	Pair WER/CER/MER with HER or intent-level semantic checks	Separate harmless recognition errors from tool-execution risks

This is not glamorous. It is also cheaper than discovering, after launch, that the agent understands adult English in clean audio but fails for children, older users, bilingual customers, or mobile callers in echoing rooms. The market has many ways to punish that kind of surprise. Most are more expensive than testing.

The audit framing also helps procurement. Vendor claims about “95% accuracy” should be treated as the beginning of a conversation, not the end. Accuracy under what codec? Which accent distribution? Which user age? Which language pair? Which utterance length? Which prompt? Which tail threshold? If the answers are unavailable, the number is decorative.
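Those procurement questions amount to asking for error rates per slice rather than one average. A minimal sketch of a sliced report follows; the record fields and every number are invented for illustration and do not come from the paper or any vendor.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records: one per utterance, with slice metadata.
records = [
    {"language": "en", "speaker": "adult", "channel": "clean", "error": 3.0},
    {"language": "en", "speaker": "adult", "channel": "clean", "error": 5.0},
    {"language": "en", "speaker": "child", "channel": "phone", "error": 20.0},
    {"language": "en", "speaker": "child", "channel": "phone", "error": 24.0},
    {"language": "zh", "speaker": "older", "channel": "phone", "error": 30.0},
    {"language": "zh", "speaker": "adult", "channel": "clean", "error": 8.0},
]


def error_by_slice(rows, keys):
    """Mean error per slice, where a slice is a tuple of the chosen metadata keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["error"])
    return {slice_key: mean(errors) for slice_key, errors in groups.items()}


overall = mean(row["error"] for row in records)
slices = error_by_slice(records, keys=("language", "speaker"))
worst_slice = max(slices, key=slices.get)

print(f"overall mean error: {overall:.1f}")
for slice_key, err in sorted(slices.items(), key=lambda item: item[1], reverse=True):
    print(f"{slice_key}: {err:.1f}")
print("worst slice:", worst_slice)
```

In this toy data the overall mean is 15.0, which describes none of the slices well: adult English sits at 4.0 while the worst slice is at 30.0. Changing `keys` to include `channel` or utterance length gives the other cuts the audit table asks for.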

Speech-to-speech does not make ASR irrelevant

The paper addresses a fashionable counterargument: perhaps explicit ASR matters less now because end-to-end speech-to-speech systems can operate directly on audio representations. In that architecture, transcription may not be the central interface.

That is partially true, but it does not eliminate the problem. It changes where the problem hides.

Explicit ASR gives teams an inspectable layer. Transcripts can be logged, audited, searched, reviewed, compared, and used to debug downstream failures. Speech-to-speech systems may preserve paralinguistic cues and reduce latency, but when they fail under OOD audio, the failure can be harder to inspect. A fluent response does not prove the system heard correctly. Fluency is very good at laundering uncertainty.

For businesses, the better framing is not ASR versus speech-to-speech. It is: which parts of the voice pipeline must be auditable, and which parts can remain latent? For customer support, healthcare intake, financial instructions, enterprise workflow automation, and regulated interactions, explicit transcription remains valuable as a safety and compliance interface.

The paper’s conclusion is therefore conservative in the best sense: ASR is not legacy plumbing. It is a stabilizing layer for voice agents, especially when downstream systems can act.

Boundaries: what WildASR does not settle

WildASR is a diagnostic benchmark, not a cure. It tells us where systems break; it does not train a more robust model or prescribe a universal architecture.

Its language coverage is four languages. That is useful, but not universal. The demographic subset covers English and Chinese only, because high-quality child and older-adult data are hard to obtain for every language. The environmental perturbations are controlled and valuable, but they do not cover every messy condition in production: overlapping speakers, streaming latency artifacts, microphone switching, emotional speech, background media, domain-specific jargon, and multi-party conversation all remain important.

There is also a deeper measurement issue. “Ground truth” transcription is not culturally neutral. Whether to preserve fillers, partial words, backchannels, and incomplete utterances can differ by language and use case. For a meeting transcript, normalization may be acceptable. For a voice command, the difference between a fragment and a completed phrase may determine whether the system should act.

So the right lesson is not “use WildASR and you are safe.” The right lesson is “evaluate ASR as a situated control layer.” That means testing the environment, the speaker population, the linguistic pattern, the prompt wrapper, and the downstream action risk together.

The edge case is not outside the product

The old ASR question was: can the system transcribe clean speech accurately?

The voice-agent question is harsher: can the system avoid doing the wrong thing when speech is messy, short, partial, accented, age-shifted, multilingual, or degraded?

WildASR helps because it refuses to hide those cases inside one average score. It turns ASR evaluation into a set of operational categories: where the audio fails, who is speaking, and what kind of conversational behavior breaks the model. Then it adds diagnostics that businesses can actually reuse: P90 elbow for tail instability, prompt sensitivity for configuration risk, and HER for semantic fabrication.

That is a better standard for voice agents. Not because it is more elegant. Because it is harder to fool.

A voice agent that works only when users speak like benchmark samples is not ready for the world. It is ready for a demo. The world, as usual, has worse acoustics.

Cognaptus: Automate the Present, Incubate the Future.

¹ Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, and Alex Smola, “Back to Basics: Revisiting ASR in the Age of Voice Agents,” arXiv:2603.25727, 2026. https://arxiv.org/abs/2603.25727
