Voice is supposed to be the easy interface.
No menus. No forms. No training session. A user speaks, the agent understands, and some neat piece of software magic happens in the background. That is the sales pitch. It is also mostly true in a demo room, which is a place where microphones behave, users speak politely, and nobody’s child interrupts from the back seat.
Production is less considerate.
A customer calls from a noisy street. A child says one syllable and waits. A patient with an older voice pauses mid-sentence. A bilingual user switches language halfway through a command. The voice agent hears something, transcribes something else, and then hands that text to a downstream system that may retrieve records, trigger tools, confirm payments, or update account settings. At that point, a transcription error is no longer a typo. It is a control input.
That is the useful warning in “Back to Basics: Revisiting ASR in the Age of Voice Agents,” which introduces WildASR, a multilingual diagnostic benchmark for automatic speech recognition under real-world distribution shifts [1]. The paper is not interesting because it says ASR sometimes fails. Everyone who has used a phone menu already owns that research result. It is interesting because it breaks the problem into operational categories: where the audio is degraded, who is speaking, and what kind of linguistic pattern the user produces.
That category structure is the point. Businesses do not need another abstract reminder that “edge cases matter.” They need to know which edge cases belong in the pre-launch test plan, which ones require fallback logic, and which ones should make a voice agent refuse to act without confirmation. WildASR is best read less as a benchmark leaderboard and more as a deployment audit template.
The real misconception is not “ASR is bad.” It is “average ASR is enough.”
The tempting story is that speech recognition has matured. Curated benchmarks show very low error rates. Large audio-language models are improving quickly. Speech-to-speech systems can sometimes bypass explicit transcription altogether. So the easy conclusion is that ASR is now plumbing: necessary, boring, and basically solved.
WildASR attacks that conclusion from a practical angle. The authors note that modern systems can report word error rates (WER) below 5% on curated benchmarks, yet voice agents operate under conditions those benchmarks do not systematically cover: telephony compression, far-field microphones, reverberation, regional accents, disfluencies, clipped audio, short utterances, incomplete speech, and code-switching. The issue is not that benchmark progress is fake. The issue is that benchmark progress gets compressed into a single number.
A single ASR score answers a vendor-friendly question: “How accurate is the model on this dataset?” Production teams need a less pleasant question: “Which user, in which condition, speaking in which pattern, causes the system to become unsafe or useless?”
WildASR’s contribution is to make that second question testable.
WildASR organizes speech failure into where, who, and what
The benchmark covers four languages: English, Chinese, Japanese, and Korean. It evaluates seven systems: Whisper Large V3, GPT-4o Transcribe, Gemini 2.5 Pro, Gemini 3 Pro, Qwen2-Audio, Nova 2, and Scribe V1. The authors report word error rate (WER) for English, character error rate (CER) for Chinese, Japanese, and Korean, and mixed error rate (MER) for code-switching, where mixed-script tokenization is needed.
The design follows a useful principle: real source speech, controlled stress. The source audio comes from human recordings rather than text-to-speech generation. Controlled perturbations are then applied where needed, so the benchmark can isolate a specific stress factor without pretending that synthetic speech contains the messy details of actual human articulation.
That matters. Synthetic speech can imitate pitch or rough age category, but it often misses hesitations, unstable articulation, irregular prosody, and other small human irregularities that speech systems quietly trip over. In the paper’s discussion, Whisper Large V3 performs near ceiling on synthetic English child speech at 3.7% WER, but reaches 21.7% WER on real English child speech. That gap is not a rounding error. It is the difference between testing a costume and testing the person wearing it.
WildASR splits the failure surface into three categories:
| Category | What it tests | Examples in WildASR | Why it matters operationally |
|---|---|---|---|
| Environmental degradation | The recording condition | Reverberation, far-field audio, phone codecs, noise gaps, clipping | Determines whether audio quality should trigger fallback, repair, or abstention |
| Demographic shift | The speaker population | Children, older adults, accented speech | Determines whether the voice product works for the users it claims to serve |
| Linguistic diversity | The interaction pattern | Short utterances, incomplete audio, code-switching | Determines whether the agent can survive real conversation rather than textbook sentences |
This structure is more useful than a leaderboard because it maps directly to deployment decisions. A call center, classroom assistant, healthcare intake bot, in-car agent, and multilingual sales assistant do not face the same speech risk. Treating them as one “voice AI” category is how optimistic dashboards become legal discovery exhibits. Lovely, but not ideal.
Where: bad audio does not degrade smoothly
The environmental subset tests five controlled perturbations: reverberation, far-field audio, phone codec effects, noise gaps, and clipping. The paper evaluates these on both FLEURS, which is closer to read speech, and MagicData, which captures more spontaneous conversational speech.
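To make "controlled perturbation" concrete, here is a minimal sketch of two of the five stressors applied to a waveform held as a plain list of floats in the range -1 to 1. The function names, the clipping limit, and the noise level are all hypothetical, and this sketch overwrites a span with noise rather than splicing new audio in; the benchmark's actual parameters are not reproduced here.

```python
import random

def hard_clip(samples: list[float], limit: float) -> list[float]:
    """Clamp every sample to [-limit, +limit], imitating an overdriven input."""
    return [max(-limit, min(limit, s)) for s in samples]

def overwrite_with_noise(samples, start, length, noise_level=0.05, seed=0):
    """Return a copy where samples[start:start+length] is low-level random noise."""
    rng = random.Random(seed)  # fixed seed keeps the perturbation reproducible
    out = list(samples)
    end = min(len(out), start + length)
    for i in range(start, end):
        out[i] = rng.uniform(-noise_level, noise_level)
    return out

clean = [0.2, -0.9, 1.0, 0.4, -0.3, 0.8]
print(hard_clip(clean, 0.5))            # [0.2, -0.5, 0.5, 0.4, -0.3, 0.5]
print(len(overwrite_with_noise(clean, 2, 3)))  # 6
```

The useful property is repeatability: the same clean recording can be replayed at several severities, so any change in error is attributable to the stressor rather than to a different speaker or sentence.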
The first result is unsurprising but important: every acoustic perturbation increases error. The more useful result is that degradation is uneven across languages and recording settings.
Noise gaps are especially brutal for conversational speech. On MagicData, the average model error increase under noise gaps is +67.7 percentage points for English and +10.3 for Chinese. For Japanese and Korean, the increases are much larger: +118.9 and +121.0 percentage points respectively. Because WER/CER can exceed 100% when insertions dominate, these figures are not just “the system missed some words.” They indicate systems generating a lot of extra content.
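An error rate above 100% looks like a bug until the arithmetic is written out. The sketch below uses the standard definition (token-level edit distance divided by reference length) with whitespace tokenization; the function names and the example sentences are invented for illustration.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens: substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference token missing
                curr[j - 1] + 1,         # insertion: extra hypothesis token
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]

def error_rate(ref: str, hyp: str) -> float:
    """Word error rate in percent: edits divided by reference length."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

print(error_rate("cancel my order", "cancel my order"))  # 0.0
print(error_rate("hold on", "hold on please do not cancel my order today"))  # 350.0
```

In the second call the model got both spoken words right and still scored 350%, because the denominator is the two-word reference and the numerator is seven invented words.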
That is the first business lesson: audio robustness is not a transferable certificate. A model that survives one language, corpus, or perturbation may fail badly under another. Buying “noise-robust ASR” without specifying which noise, which language, and which interaction pattern is procurement theater with a microphone attached.
The paper’s P90 elbow analysis is particularly useful here. Instead of looking only at the mean WER as reverberation increases, the authors track the 90th percentile error rate. The mean grows gradually, but the P90 curve rises faster and exposes severe tail failures earlier. The elbow marks the point where the tail begins accelerating.
That is deployable. A production system could use a similar threshold to decide when to continue, when to ask the user to repeat, when to switch channel, or when to prevent high-impact tool execution. The key is that the P90 elbow is not trying to make the model look better. It is trying to reveal when the worst 10% of interactions stop being tolerable.
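A rough version of that diagnostic fits in a few lines. The sketch below computes a nearest-rank P90 at each severity level and picks the level where the P90 curve bends upward most sharply, using the largest second difference as the elbow rule. That rule, the assumption of evenly spaced severity levels, the function names, and the numbers are all illustrative; the paper's exact elbow definition is not reproduced here.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def p90_elbow(errors_by_level):
    """Return (elbow_level, p90_by_level) for evenly spaced severity levels.

    errors_by_level maps a severity level to per-utterance error rates.
    """
    levels = sorted(errors_by_level)
    if len(levels) < 3:
        raise ValueError("need at least three severity levels")
    p90s = [percentile(errors_by_level[lv], 90) for lv in levels]
    # Second difference: how much the slope changes at each interior level.
    accel = [p90s[i + 1] - 2 * p90s[i] + p90s[i - 1]
             for i in range(1, len(levels) - 1)]
    best = max(range(len(accel)), key=accel.__getitem__)
    return levels[best + 1], dict(zip(levels, p90s))

# Made-up per-utterance error rates (%) at five reverberation levels.
runs = {
    1: [1] * 8 + [5, 50],
    2: [1] * 8 + [6, 50],
    3: [1] * 8 + [8, 50],
    4: [1] * 8 + [20, 50],
    5: [1] * 8 + [35, 50],
}
level, p90s = p90_elbow(runs)
print(level, p90s)  # 3 {1: 5, 2: 6, 3: 8, 4: 20, 5: 35}
```

An elbow found this way is specific to the model, language, and hardware it was measured on, so it is a threshold to re-derive per deployment rather than a constant to copy.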
For voice agents, that tail is often the product. Most users remember the one dangerous misunderstanding, not the 97 routine turns that worked.
Who: demographic robustness is not evenly distributed
The demographic subset covers children, older adults, and accented speech, but only for English and Chinese because high-quality child and older-adult speech resources are limited for the other languages. That boundary matters: the paper diagnoses a real issue, but it does not claim complete demographic coverage for all languages.
The results show a sharp asymmetry. English accent and older-adult speech remain relatively robust for many systems, often in the low single digits. Chinese demographic conditions are much harder. In Table 3, Chinese accent CER ranges from 7.5% for Qwen2-Audio to 62.5% for Gemini 3 Pro; Chinese child speech ranges from 23.4% for Qwen2-Audio to 65.1% for Scribe V1; Chinese older-adult speech ranges from 18.6% for Qwen2-Audio to 52.6% for Gemini 2.5 Pro.
English child speech is also persistently difficult. The best reported English child result is still 18.2% WER from Gemini 3 Pro. That is not catastrophic for a dictation toy. It is much more concerning for an educational tutor, family assistant, child-safety tool, or any product where children are treated as first-class users rather than adorable noise sources in the background.
There is also an instructive model-specific pattern. Qwen2-Audio performs best on the Chinese demographic subsets, which the authors suggest may reflect stronger coverage in Chinese training data. The broader point is not that one model wins. It is that robustness can be local to language, data mixture, and speaker group.
For business teams, this changes the evaluation question. Do not ask, “Does the vendor support Chinese?” Ask, “Does the vendor support the Chinese-speaking users in our actual demographic distribution?” The difference sounds small until a product is deployed to older adults, children, regional speakers, or second-language users.
Prompt sensitivity turns ASR into a configuration risk
One of the paper’s most useful diagnostics is easy to overlook because it sounds almost too simple: test the same audio with different prompts.
The authors evaluate Gemini 2.5 Pro on demographic subsets using ten paraphrased prompts. Each prompt asks for the same thing: transcribe the audio in the target language and output only the transcript. The wording changes; the task does not.
For English, variation is minimal: the standard deviation is at most 0.6 percentage points across conditions. For Chinese, the effect is much larger: the standard deviation across prompts reaches 13.7 percentage points for accent, 46.1 for children, and 8.3 for older adults.
This is not an exotic jailbreak issue. It is basic transcription. The model is not being asked to reason about moral philosophy while blindfolded. It is being asked to write down what was said.
The operational implication is awkward but clear: prompt wording can become part of the ASR system’s reliability profile. A team that evaluates one prompt, ships another, and later modifies the wrapper instruction during a UI refactor may have changed performance without changing the model. That is the kind of silent configuration drift that makes postmortems unnecessarily theatrical.
A practical evaluation should therefore include prompt robustness, not just prompt selection. The metric is simple: use a controlled set of semantically equivalent prompts, then report both mean error and variance. Low average error with high prompt variance is not stable ASR. It is a system that happens to behave under one phrasing.
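The bookkeeping for that metric is small. The sketch below assumes you have already scored the same audio set once per prompt and hold one error rate per prompt; the function name, the dictionary keys, and the numbers are hypothetical and do not come from the paper.

```python
from statistics import mean, stdev

def prompt_robustness(error_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarize error rates measured on identical audio under different prompts."""
    rates = list(error_by_prompt.values())
    if len(rates) < 2:
        raise ValueError("need at least two prompts to measure variance")
    return {
        "mean": mean(rates),
        "std": stdev(rates),                 # sample standard deviation
        "worst": max(rates),
        "spread": max(rates) - min(rates),
    }

# Made-up CER values (%) for three paraphrases of one transcription instruction.
report = prompt_robustness({
    "Transcribe the audio.": 10.0,
    "Write down what was said.": 12.0,
    "Output only the transcript.": 14.0,
})
print(report)  # {'mean': 12.0, 'std': 2.0, 'worst': 14.0, 'spread': 4.0}
```

Reporting `worst` alongside `mean` is a deliberate choice: the prompt that ships after the next refactor may well be the unlucky one.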
What: short, incomplete, and mixed speech exposes hallucination
The linguistic diversity subset is where WildASR becomes most relevant to voice agents. Real dialogue is full of tiny utterances, interruptions, fragments, and language mixing. Traditional ASR benchmarks often prefer complete, well-formed input. Humans, regrettably, did not sign that agreement.
The paper tests three scenarios: short utterances, incomplete audio, and code-switching. It also reports Hallucination Error Rate (HER), which is designed to capture semantic-level fabrication that lexical metrics may understate.
Short utterances are consistently hard. Even in English, reported error rates for short utterances range from 38.7% to 73.9% across the seven models. This should make anyone building a voice interface pause. Short utterances are not rare edge cases. They are the grammar of conversation: “yes,” “no,” “stop,” “next,” “again,” “hold on,” “cancel.”
The failure mechanism is plausible. Short clips contain little acoustic evidence, are vulnerable to voice activity detection errors, and give decoder-style systems more room to lean on language priors. If the audio is ambiguous, the model may complete what it thinks the user probably meant. That is useful in autocomplete. It is less charming when the system is supposed to transcribe actual speech.
The most alarming results involve insertion-heavy failures, where WER/CER/MER can exceed 100%. The paper reports Qwen2-Audio reaching 102.6% CER on Korean short utterances, 211.7% MER on Korean code-switching, and 224.4% CER on Japanese incomplete audio. These numbers mean the output is not merely inaccurate; it contains substantial generated content beyond the target transcript.
HER clarifies why this matters. Nova 2 on Chinese code-switching has 33.7% MER but 68.4% HER. That gap means surface error and semantic risk are not the same thing. A transcript can look close enough lexically while still distorting meaning in a way that matters for downstream action. The paper gives the example of a single inserted negation changing meaning. Voice agents do not need many such errors before users stop trusting them.
This is the second business lesson: ASR error types are not equal.
| Failure type | Product interpretation | Typical mitigation |
|---|---|---|
| Omission | The system missed part of what was said | Ask for repetition when confidence is low |
| Substitution | The system replaced one phrase with another | Confirm before irreversible actions |
| Hallucination | The system added content not spoken | Block high-impact automation unless transcript evidence is strong |
| Semantic distortion | The text changes intent despite seeming plausible | Combine ASR confidence with intent-level risk checks |
A missed filler word is annoying. A fabricated “cancel,” “approve,” “transfer,” or “delete all” is a product incident wearing a headset.
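The mitigations in the table can be combined into a single gate that sits between the transcript and the tool call. The sketch below is one possible policy under stated assumptions: the intent names, the three thresholds, and the idea that your ASR stack exposes a confidence score and a clip duration are all hypothetical.

```python
from enum import Enum

class Action(Enum):
    EXECUTE = "execute"
    CONFIRM = "confirm_with_user"
    REPEAT = "ask_to_repeat"

# Hypothetical values; tune them on your own traffic.
HIGH_IMPACT = {"cancel", "approve", "transfer", "delete_all"}
MIN_CONFIDENCE = 0.50     # below this, do not act on the transcript at all
STRONG_CONFIDENCE = 0.90  # required before a high-impact intent is even offered
MIN_SECONDS = 0.8         # very short clips carry little acoustic evidence

def gate(intent: str, confidence: float, audio_seconds: float) -> Action:
    """Decide what the agent may do with a recognized intent."""
    if confidence < MIN_CONFIDENCE:
        return Action.REPEAT
    if intent in HIGH_IMPACT:
        strong = confidence >= STRONG_CONFIDENCE and audio_seconds >= MIN_SECONDS
        # Even strong evidence only earns a confirmation, never silent execution.
        return Action.CONFIRM if strong else Action.REPEAT
    return Action.EXECUTE

print(gate("cancel", 0.95, 1.2))  # Action.CONFIRM
print(gate("cancel", 0.70, 1.2))  # Action.REPEAT
print(gate("next", 0.70, 0.4))    # Action.EXECUTE
```

The asymmetry is the point: a misheard "next" costs one extra turn, so it runs on moderate evidence, while a possibly hallucinated "cancel" never runs without the user saying so twice.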
What the tests support—and what they do not
WildASR’s experiments serve different evidential purposes. Treating all tables and appendices as the same kind of proof would flatten the paper’s value.
| Test or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Clean benchmark reference | Main comparison baseline | Curated performance can look strong while OOD performance fails | That clean benchmarks are useless |
| Environmental perturbations | Main evidence | Acoustic degradation creates non-uniform, language-dependent failures | That these five perturbations cover every production audio condition |
| P90 elbow under reverberation | Robustness/sensitivity diagnostic | Tail risk can accelerate before mean error looks catastrophic | A universal threshold across models, languages, or hardware |
| Demographic subsets | Main evidence and fairness/product-risk probe | Age, accent, and language interact strongly with ASR reliability | Complete demographic coverage across all languages |
| Prompt sensitivity | Sensitivity test | Instruction wording can materially affect transcription stability | That prompt engineering alone solves ASR robustness |
| Linguistic diversity with HER | Main evidence for semantic risk | Short, incomplete, and mixed-language inputs can trigger hallucination | That HER replaces WER/CER/MER for all use cases |
| Real versus synthetic child speech | Exploratory validation of benchmark design | Synthetic evaluation may underestimate real speech difficulty | That all synthetic speech is useless in all testing pipelines |
This table is also a decent template for internal evaluation. Not every test has to be a leaderboard. Some tests exist to find thresholds. Some exist to expose instability. Some exist to validate the benchmark itself. A mature voice-agent team should know which kind of test it is running before it celebrates the result.
Turning WildASR into a voice-agent risk audit
The strongest business use of WildASR is not copying the benchmark wholesale. It is adopting the shape of the evaluation.
A practical pre-deployment audit would look something like this:
| Audit question | WildASR-inspired method | Deployment decision |
|---|---|---|
| Where will audio quality collapse? | Test reverberation, far-field, codec compression, clipping, and silence/noise gaps using target devices and channels | Set abstention, repeat-request, or channel-switch thresholds |
| Who will be underserved? | Build demographic slices for the actual user base, not a generic “global” sample | Localize model choice, routing, or fallback policy |
| What dialogue patterns are risky? | Test short commands, interrupted speech, and code-switching from real interactions | Require confirmation for high-impact intents |
| Is prompt behavior stable? | Evaluate semantically equivalent prompts and measure variance | Freeze prompt templates and monitor configuration drift |
| Are errors semantically dangerous? | Pair WER/CER/MER with HER or intent-level semantic checks | Separate harmless recognition errors from tool-execution risks |
This is not glamorous. It is also cheaper than discovering, after launch, that the agent understands adult English in clean audio but fails for children, older users, bilingual customers, or mobile callers in echoing rooms. The market has many ways to punish that kind of surprise. Most are more expensive than testing.
The audit framing also helps procurement. Vendor claims about “95% accuracy” should be treated as the beginning of a conversation, not the end. Accuracy under what codec? Which accent distribution? Which user age? Which language pair? Which utterance length? Which prompt? Which tail threshold? If the answers are unavailable, the number is decorative.
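Those questions translate into a slice report: the same per-utterance results, grouped by whatever metadata the deployment cares about. The sketch below assumes each result is a dictionary with an `error` field plus metadata fields; the field names, the function name, and the numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def slice_report(results, keys):
    """Group per-utterance error rates by the given metadata keys."""
    groups = defaultdict(list)
    for row in results:
        groups[tuple(row[k] for k in keys)].append(row["error"])
    return {
        name: {"n": len(errs), "mean": mean(errs), "worst": max(errs)}
        for name, errs in sorted(groups.items())
    }

# Made-up evaluation rows: error is a per-utterance error rate in percent.
results = [
    {"language": "en", "age": "adult", "error": 2.0},
    {"language": "en", "age": "adult", "error": 4.0},
    {"language": "en", "age": "child", "error": 20.0},
    {"language": "zh", "age": "child", "error": 60.0},
]
for name, stats in slice_report(results, ["language", "age"]).items():
    print(name, stats)
```

The overall mean across these four rows is 21.5%, a number that describes none of the three groups it averages over. That is the case for asking vendors for slices rather than a headline figure.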
Speech-to-speech does not make ASR irrelevant
The paper addresses a fashionable counterargument: perhaps explicit ASR matters less now because end-to-end speech-to-speech systems can operate directly on audio representations. In that architecture, transcription may not be the central interface.
That is partially true, but it does not eliminate the problem. It changes where the problem hides.
Explicit ASR gives teams an inspectable layer. Transcripts can be logged, audited, searched, reviewed, compared, and used to debug downstream failures. Speech-to-speech systems may preserve paralinguistic cues and reduce latency, but when they fail under OOD audio, the failure can be harder to inspect. A fluent response does not prove the system heard correctly. Fluency is very good at laundering uncertainty.
For businesses, the better framing is not ASR versus speech-to-speech. It is: which parts of the voice pipeline must be auditable, and which parts can remain latent? For customer support, healthcare intake, financial instructions, enterprise workflow automation, and regulated interactions, explicit transcription remains valuable as a safety and compliance interface.
The paper’s conclusion is therefore conservative in the best sense: ASR is not legacy plumbing. It is a stabilizing layer for voice agents, especially when downstream systems can act.
Boundaries: what WildASR does not settle
WildASR is a diagnostic benchmark, not a cure. It tells us where systems break; it does not train a more robust model or prescribe a universal architecture.
It covers four languages. That is useful, but not universal. The demographic subset covers English and Chinese only, because high-quality child and older-adult data are hard to obtain for every language. The environmental perturbations are controlled and valuable, but they do not cover every messy condition in production: overlapping speakers, streaming latency artifacts, microphone switching, emotional speech, background media, domain-specific jargon, and multi-party conversation all remain important.
There is also a deeper measurement issue. “Ground truth” transcription is not culturally neutral. Whether to preserve fillers, partial words, backchannels, and incomplete utterances can differ by language and use case. For a meeting transcript, normalization may be acceptable. For a voice command, the difference between a fragment and a completed phrase may determine whether the system should act.
So the right lesson is not “use WildASR and you are safe.” The right lesson is “evaluate ASR as a situated control layer.” That means testing the environment, the speaker population, the linguistic pattern, the prompt wrapper, and the downstream action risk together.
The edge case is not outside the product
The old ASR question was: can the system transcribe clean speech accurately?
The voice-agent question is harsher: can the system avoid doing the wrong thing when speech is messy, short, partial, accented, age-shifted, multilingual, or degraded?
WildASR helps because it refuses to hide those cases inside one average score. It turns ASR evaluation into a set of operational categories: where the audio fails, who is speaking, and what kind of conversational behavior breaks the model. Then it adds diagnostics that businesses can actually reuse: P90 elbow for tail instability, prompt sensitivity for configuration risk, and HER for semantic fabrication.
That is a better standard for voice agents. Not because it is more elegant. Because it is harder to fool.
A voice agent that works only when users speak like benchmark samples is not ready for the world. It is ready for a demo. The world, as usual, has worse acoustics.
Cognaptus: Automate the Present, Incubate the Future.
[1] Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, and Alex Smola, “Back to Basics: Revisiting ASR in the Age of Voice Agents,” arXiv:2603.25727, 2026. https://arxiv.org/abs/2603.25727