Lost in Translation: When 14% WER Hides a 44% Failure Rate

Taxi dispatch is not a poetry recital.

When a passenger calls and says, “I’m on Arguello,” the system does not need to appreciate the full expressive richness of the sentence. It needs to identify one street name, map it to the right place, and send a vehicle there. This is not a broad language-understanding task. It is a narrow operational task with coordinates attached.

That is exactly why it is dangerous.

A recent paper by Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, and James Zou tests 15 speech recognition models from OpenAI, Deepgram, Google, and Microsoft on a deceptively simple problem: can they correctly transcribe U.S. street names spoken by U.S.-based participants?¹ The answer is not comforting. Across the evaluated models, the average street-name transcription error rate is 44%.

Nearly every other street name is wrong.

This would be less interesting if the models were obviously weak. They are not. Some are production-grade systems. Some are optimized for telephony. Some have strong benchmark reputations. Whisper-Large, for example, can achieve a low overall word error rate of about 14% in the paper’s analysis, yet its street-name transcription error rate rises to 27%.

That gap is the article.

The problem is not that speech recognition has failed. The problem is that the usual evaluation metric is looking in the wrong direction. Word Error Rate asks whether the transcript is textually close. Operations ask whether the ambulance, taxi, delivery rider, or field technician goes to the right place. These are related questions only in the same relaxed sense that a menu is related to dinner.

The one word that moves the vehicle

The paper’s case is street names, but the deeper object is operationally critical named entities.

A named entity is not just another word in a sentence. In many business processes, it is the handle that triggers allocation: a street name routes a taxi, a hospital name routes an ambulance, a product code routes a warehouse search, a client name pulls a record, and a medication name shapes clinical action. If the filler words are imperfect but the entity is right, the process often survives. If the entity is wrong, the rest of the sentence can be pristine and still useless.

This is why average transcription quality is a misleading comfort blanket. Consider two transcripts:

Spoken input	Model output	Textual severity	Operational severity
“I’m on Font.”	“I’m on Bont.”	Small edit distance	Potentially wrong location
“Can you pick me up at Cesar Chavez?”	“Can you pick me up at Caesar Chaves?”	Possibly acceptable if phonetically equivalent	Depends on routing resolution
“I’m near Lake Merced.”	“I’m near Lake Mercy.”	Small textual error	Could still fail downstream mapping

The paper handles this carefully. It does not blindly punish spelling variants. The authors define a street-name transcription error using phonetic matching, meaning orthographically different but phonetically equivalent spellings can still count as correct. That matters. The reported 44% average error rate is not just a spelling-pedantry score dressed up as a safety concern.
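To make the idea of phonetic matching concrete, here is a minimal sketch that uses classic American Soundex as a stand-in. This is an assumption for illustration only: the article does not say which phonetic algorithm the authors use, and `soundex` and `phonetic_match` are hypothetical helper names.

```python
# Soundex consonant classes; vowels, H, W and Y carry no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate consonants with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (first + "".join(digits) + "000")[:4]


def phonetic_match(reference: str, hypothesis: str) -> bool:
    """Treat two entity strings as equal if every word has the same Soundex code."""
    ref_codes = [soundex(w) for w in reference.split()]
    hyp_codes = [soundex(w) for w in hypothesis.split()]
    return ref_codes == hyp_codes


print(phonetic_match("Cesar Chavez", "Caesar Chaves"))  # spelling variant
print(phonetic_match("Font", "Bont"))                   # different street
```

Under this stand-in, "Caesar Chaves" counts as a correct rendering of "Cesar Chavez", while "Bont" and "Lake Mercy" still count as errors. Soundex keeps the first letter literally, so it is stricter than a true pronunciation-based comparison; a production scorer would likely need something more careful.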

The test dataset is also intentionally operational. The main SF Streets dataset contains 2,262 utterances from 78 participants pronouncing San Francisco boulevard names using the template “I’m on [STREET NAME].” The broader U.S. Streets dataset adds 3,600 recordings from 97 participants, covering 360 street names across 12 major U.S. cities. The design is simple enough to look almost unfair to the models. Then again, a customer call center does not usually provide a doctoral seminar before asking where you are.

Why low WER hides high-risk failure

Word Error Rate is a useful metric, but it is not a business-risk metric. It measures substitutions, deletions, and insertions between a reference transcript and a model transcript. If the model gets most of the sentence right, WER can look good even when the one decisive token is wrong.

That is the central misconception this paper corrects:

Reader belief	Correction	Business consequence
Low WER means the ASR system is reliable.	Low WER can coexist with high named-entity failure.	The deployment may look safe in aggregate testing while failing on the tokens that determine action.
More context should fix the problem.	Context helps less than expected because the model still has to discriminate the spoken entity.	Prompting is not a substitute for domain-specific evaluation.
Bigger models solve it.	Scale helps, but not cleanly or cheaply.	Production latency, memory, and cost still matter.
Errors are UX annoyances.	Errors become routing deviations, delays, and unequal burdens.	The KPI should be measured in miles, minutes, and money, not only tokens.

The paper gives a useful example of why the metric gap is structural. Whisper-Large can have an overall WER of about 14%, yet its street-name transcription error rate is 27%. This is not a rounding issue. It is a measurement problem.
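The dilution effect can be shown with a few lines of code. The sketch below computes WER as word-level edit distance divided by reference length, which is the standard definition. The longer utterance is invented for illustration and is not from the paper's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented 10-word call: only the street name is wrong.
long_ref = "hi can you send a taxi i am on arguello"
long_hyp = "hi can you send a taxi i am on argyle"

# The paper's short template leaves the error nowhere to hide.
short_ref = "i'm on arguello"
short_hyp = "i'm on argyle"

print(word_error_rate(long_ref, long_hyp))    # 0.1
print(word_error_rate(short_ref, short_hyp))  # one wrong word out of three
```

In both cases the vehicle goes to the wrong street. The only thing that changes is how many correct filler words are available to make the score look respectable.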

The authors also test whether additional context helps. They try prompting models with a lightweight situational cue: the user is giving an address. That produces little improvement. They then test a much stronger diagnostic condition: provide the full set of possible target street names in the prompt. This is not realistic deployment design; it is closer to giving the model the answer sheet and asking it to point. Even then, average accuracy across tested models reaches only about 76%.

That result is important because it narrows the failure mechanism. If the model still fails when the candidate list is available, the bottleneck is not merely “the model lacks context.” It is recognition and discrimination: hearing the entity, separating similar phonetic patterns, and selecting the right candidate.

For business readers, this is the part to underline. The cheap fix—“add a better system prompt”—is not enough. Very tragic. The prompt fairy has limits.

The fairness problem becomes a routing problem

The paper’s second contribution is not simply that speech models make errors. Everyone who has dictated a message near a ceiling fan already knows that. The sharper finding is that errors are not evenly distributed across speakers.

The study groups SF Streets participants by primary spoken language: English Only, Multilingual with English, and Non-English. Across the 15 evaluated models and variants, non-English-primary speakers show substantially lower street-name transcription accuracy than English-primary speakers: 46% versus 64%. That is an 18 percentage-point gap.

The authors report no significant effects of self-identified gender or age on transcription performance, but the language-background gap appears consistently across model families.

This distinction matters. In ordinary AI ethics writing, “fairness” often floats above the ground as a metric table. Here it lands on the map. If a model is less accurate for non-English-primary speakers, the harm is not only representational. The vehicle may go farther in the wrong direction.

The paper estimates this downstream impact using Google Maps API queries. It takes generated transcriptions from Whisper-Base, maps them to locations, and measures the distance between the intended and transcribed destinations. The method is lenient in useful ways: if the map system can autocorrect a transcription error, the error may resolve correctly; cases with no found location are dropped; and extreme out-of-city deviations are capped or excluded under the assumption that a human would likely intervene.
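A rough sketch of that measurement step follows. It makes two loud assumptions: it uses straight-line haversine distance rather than the driving distance the paper obtains from Google Maps, and it takes coordinates as given instead of calling any geocoding API. The function names are hypothetical.

```python
import math
from typing import Optional

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

Coordinate = tuple[float, float]  # (latitude, longitude) in degrees


def haversine_miles(a: Coordinate, b: Coordinate) -> float:
    """Great-circle distance between two points, in miles."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def mean_deviation_miles(pairs: list[tuple[Coordinate, Optional[Coordinate]]]) -> float:
    """Average distance between intended and transcribed destinations.

    Pairs whose transcription resolved to no location (None) are dropped,
    mirroring the lenient treatment described above.
    """
    distances = [haversine_miles(intended, got) for intended, got in pairs if got is not None]
    if not distances:
        raise ValueError("no resolvable destinations to average")
    return sum(distances) / len(distances)
```

A correctly resolved transcription contributes a distance of zero, so map-side autocorrection lowers the average. Dropping unresolved cases also biases the estimate downward, which is part of what makes the paper's procedure lenient.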

Even after this leniency, the routing difference is large.

Speaker group	Average driving distance error	Estimated cost impact	Estimated delay
English-primary speakers	1.26 miles	About $4	About 5 minutes
Non-English-primary speakers	2.4 miles	About $8	About 10 minutes

This is where the paper becomes more than an ASR benchmark. It converts a transcription metric into an operational metric. The unit changes from words to distance. That is the right move.

A business should not ask only:

$$ \text{How many words did the model transcribe correctly?} $$

It should also ask:

$$ \text{Expected operational loss} = P(\text{critical entity error}) \times \text{downstream cost of that error} $$
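As a worked instance of that formula, here is a minimal sketch. Every number in it is a placeholder chosen for easy arithmetic, not a figure from the paper, and `EntityRisk` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRisk:
    p_entity_error: float  # probability the critical entity is transcribed wrongly
    cost_per_error: float  # downstream cost of one such error, in dollars

    def expected_loss_per_call(self) -> float:
        """Expected operational loss = P(critical entity error) x downstream cost."""
        if not 0.0 <= self.p_entity_error <= 1.0:
            raise ValueError("p_entity_error must be a probability")
        return self.p_entity_error * self.cost_per_error

    def expected_annual_loss(self, calls_per_year: int) -> float:
        return self.expected_loss_per_call() * calls_per_year


# Placeholder inputs: 25% entity error rate, $4 per error, 100,000 calls a year.
risk = EntityRisk(p_entity_error=0.25, cost_per_error=4.0)
print(risk.expected_loss_per_call())          # 1.0 dollar per call
print(risk.expected_annual_loss(100_000))     # 100000.0 dollars per year
```

The arithmetic is trivial. The hard work is measuring the two inputs on your own traffic rather than borrowing them from a benchmark.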

The paper applies this logic to San Francisco taxi dispatch. Using assumptions about weekday taxi volume, phone-based dispatch share, average traffic speed, and taxi fare schedules, the authors estimate roughly 43,500 hours of avoidable delay per year and about $2.1 million in annual economic cost. They describe this as conservative because the calculation assumes all riders are English-primary speakers, even though the paper finds larger errors for non-English-primary speakers.

Do not overread the dollar figure. It depends on San Francisco taxi assumptions, voice-mediated dispatch volume, fare schedules, and the specific mapping procedure. But do not underread it either. The purpose is not to declare a universal taxi-loss constant. The purpose is to show the accounting method: a speech error can be priced once it touches routing.

That is the business relevance.

The model is not failing at language; it is failing at allocation

One easy but wrong reading of the paper is: “Speech models still struggle with accents.” That is partly true, but too blunt.

A better reading is: current ASR evaluation underweights short, high-stakes utterances where a rare named entity carries most of the operational load. The problem is not just accent robustness. It is the mismatch between benchmark incentives and deployment consequences.

Standard ASR benchmarks often involve longer-form speech. In longer speech, language models can lean on context. If a sentence has enough surrounding words, the model can reconstruct likely content. Named entities are different. They are often rare, local, historically layered, multilingual, and weakly predictable from surrounding syntax. San Francisco street names are a neat trap because many are not generic English words: Arguello, Alemany, Junipero Serra, Cesar Chavez, O’Shaughnessy. Local signage survives history; speech models inherit training distributions. The two are not always friends.

The authors also note that a small analysis of U.S. city street names found 33% came from non-English origins, classified using GPT-4.1. That result should be treated as supporting context rather than a core empirical pillar, but it reinforces the mechanism: street names are exactly the kind of long-tail vocabulary that aggregate ASR benchmarks can fail to stress.

For deployment, the issue is not whether the transcript looks readable. The issue is whether the transcript is actionable.

A call-center dashboard may show high completion rates. A vendor may show impressive benchmark WER. A pilot may sound fine in demos. But if the system fails on names, addresses, product identifiers, account numbers, airports, hospitals, or local landmarks, the process is quietly leaking operational quality. It will not always look like model failure. It may look like longer resolution time, more transfers, more manual corrections, more customer complaints, or more “driver could not find me” incidents.

AI failures often enter the P&L wearing a fake mustache.

What the synthetic-data fix actually proves

The paper’s third contribution is a mitigation: use multilingual synthetic text-to-speech data to fine-tune ASR models on diverse pronunciations of named entities.

The first attempted route did not work. The authors tried using XTTS voice cloning to clone accented English speech and generate more examples with similar pronunciation patterns. The model tended to preserve vocal identity while normalizing the accent toward American or British English. That failure is not a side note. It tells us something useful about current voice cloning systems: they may wash out precisely the variation needed for robustness.

The workaround is clever. Instead of asking the TTS system to speak English with a cloned accent, the authors generate speech in a non-English language while inserting English street names into the sentence. For example, the model might generate an Italian sentence containing “Washington,” then the authors extract the audio segment corresponding to the English street name. The foreign-language generation imposes phonetic structure on the inserted English token.

In simplified form:

Step	Purpose
Select multilingual speech from Common Voice	Obtain diverse speaker and language patterns
Use XTTS to synthesize non-English speech	Induce non-English phonetic structure
Insert English street names into the generated sentence	Create accented variants of target named entities
Extract the street-name audio segment	Build training samples focused on the entity
Fine-tune Whisper-Base with fewer than 1,000 samples	Improve recognition without retraining a large ASR model

The result is practically interesting because it is small. The authors fine-tune Whisper-Base with fewer than 1,000 synthetic utterances and report substantial gains; the paper's contribution summary cites a nearly 60% relative improvement for non-English-primary speakers. The detailed mitigation section reports strong relative gains for both non-English-primary and multilingual speakers, and appendix figures show the improvements holding across Whisper model sizes.

The distinction between main evidence and supporting tests matters here.

Test or result	Likely purpose	What it supports	What it does not prove
SF Streets evaluation across 15 models	Main evidence	Street-name recognition is a major failure mode for strong ASR systems	Universal ASR failure across all domains
Language-group accuracy gap	Main evidence	Errors are worse for non-English-primary speakers	Complete demographic fairness analysis across all populations
Google Maps routing-distance estimate	Operational impact analysis	Transcription errors can become miles, minutes, and dollars	Exact cost in every city or dispatch system
Prompting with address context and candidate lists	Diagnostic test	Context alone does not eliminate the bottleneck	Prompting is useless in all ASR workflows
Synthetic multilingual TTS fine-tuning	Mitigation evidence	Targeted synthetic data can improve named-entity recognition	Synthetic data fully solves accented speech recognition
Single-language and out-of-distribution tests	Robustness/exploratory extension	Aggregated multilingual synthetic data seems more useful than narrow single-language training	Direct language-to-language transfer guarantees

This is a good example of how to read applied AI papers without turning every figure into a slogan. The mitigation is promising because it is targeted, reproducible, and cheap relative to model-scale escalation. It is not magic accent coverage. It improves a specific named-entity task under specific dataset conditions.

That is enough. In business, a repair does not need to be metaphysical. It needs to reduce the failure rate where the money leaks.

The real lesson for AI automation teams

The obvious conclusion is “evaluate speech models better.” True, but too polite.

The more useful conclusion is that AI automation teams should stop treating vendor benchmarks as deployment evidence. Benchmarks tell you whether a model performs well under benchmark conditions. This is not a scandal. It is literally what benchmarks do. The problem begins when procurement, product, or operations teams mistake benchmark success for workflow reliability.

A serious voice-automation deployment should include at least four evaluation layers:

Evaluation layer	Question	Example metric
Aggregate transcription	Is the transcript generally accurate?	WER, character error rate
Critical-entity recognition	Are the operational tokens correct?	Named-entity accuracy, phonetic match rate
Group-stratified performance	Does accuracy differ by speaker group?	Accuracy by language background, accent, region
Downstream process cost	What happens when the entity is wrong?	Miles, minutes, dollars, escalation rate

For call centers, logistics, mobility, healthcare access, public services, and field operations, the fourth layer is not optional. It is the layer where AI risk becomes visible to management.

The paper also suggests a practical remediation workflow:

1. Identify the entity classes that drive action: addresses, facility names, customer names, product codes, medication names, policy IDs.
2. Build a domain-specific benchmark using real or representative utterances.
3. Score both aggregate transcription and entity-level accuracy.
4. Stratify results by relevant speaker groups, especially language background and accent exposure.
5. Translate errors into operational units: delay, wrong dispatch, manual correction, failed resolution, refund, SLA breach.
6. Fine-tune or adapt using targeted synthetic and real data.
7. Re-test the full process, not just the model transcript.
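The scoring and stratification steps in this workflow can be sketched in a few lines. The group labels and the four toy rows below are invented for illustration, and the default matcher is a plain case-insensitive comparison; a phonetic matcher could be passed in instead.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Utterance(NamedTuple):
    group: str               # speaker group label (hypothetical labels below)
    reference_entity: str    # what the speaker actually said
    transcribed_entity: str  # what the ASR system produced


def exact_match(reference: str, hypothesis: str) -> bool:
    """Case-insensitive string equality; swap in a phonetic matcher if needed."""
    return reference.strip().lower() == hypothesis.strip().lower()


def entity_accuracy_by_group(
    utterances: Iterable[Utterance],
    is_match: Callable[[str, str], bool] = exact_match,
) -> dict[str, float]:
    """Share of utterances per group whose critical entity was transcribed correctly."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for u in utterances:
        totals[u.group] += 1
        hits[u.group] += is_match(u.reference_entity, u.transcribed_entity)
    return {group: hits[group] / totals[group] for group in totals}


# Toy rows, invented for illustration only.
toy = [
    Utterance("english_primary", "Arguello", "Arguello"),
    Utterance("english_primary", "Alemany", "alemany"),
    Utterance("non_english_primary", "Arguello", "Argyle"),
    Utterance("non_english_primary", "Font", "Font"),
]
print(entity_accuracy_by_group(toy))
```

Reporting one number per group, next to the aggregate WER, is what makes a gap like the one in the paper visible at all.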

This is not glamorous. It is the AI equivalent of checking whether the bridge connects to the road. Apparently we still have to say this.

Where the paper’s evidence should not be stretched

The paper is strongest on a specific claim: strong ASR systems can perform poorly on U.S. street-name recognition, and those failures can produce unequal routing harms for non-English-primary speakers. It is also persuasive that targeted synthetic data can materially improve performance in this setting.

Several boundaries matter.

First, the main empirical setting is U.S. street names, especially San Francisco boulevards. The broader U.S. Streets dataset adds coverage across 12 cities, but this remains an address-like named-entity task. We should not automatically generalize the 44% error rate to every speech workflow.

Second, the financial model is an estimate, not an audited loss statement. It depends on taxi volume, phone-dispatch assumptions, fare schedules, traffic speed, and mapping behavior. Its value is methodological: it demonstrates how to convert ASR errors into operational cost.

Third, the synthetic-data mitigation is promising but not a universal fairness fix. It improves recognition by exposing the ASR model to more phonetic variation around target entities. It does not prove full coverage across all accents, all languages, all acoustic conditions, or all named-entity classes.

Fourth, the paper’s language-group categories are useful but coarse. “Non-English-primary speaker” is not a single linguistic condition. It groups many language backgrounds, pronunciation patterns, and user histories. For business deployment, this means local testing matters. A model that works well in one city, customer base, or service channel may fail in another.

These boundaries do not weaken the paper. They make it more usable. The right takeaway is not “all ASR is unsafe.” The right takeaway is “aggregate ASR evaluation is insufficient for workflows where one named entity moves resources.”

The KPI is not whether the transcript looks good

The deeper business lesson is metric design.

AI systems are often evaluated in the language of model builders, then deployed into the economics of operations. This creates a translation gap. A model team reports WER. A dispatcher experiences wrong addresses. A customer experiences waiting. A regulator sees unequal service quality. Finance sees avoidable cost. Everyone is describing the same failure, but with different units.

The paper’s contribution is to force those units into the same frame.

For Cognaptus readers thinking about automation, the question is not whether speech models are good enough in general. That question is too vague to be useful. The better question is:

**What are the five words in this workflow that must not be wrong?**

Once you know those words, the evaluation becomes concrete. Build the benchmark around them. Test them across user groups. Price the downstream error. Then decide whether prompting, fine-tuning, human confirmation, or workflow redesign is the right control.

Sometimes the answer will be synthetic data. Sometimes it will be a confirmation step: “Did you say Arguello Boulevard?” Sometimes it will be geolocation fallback. Sometimes it will be a human-in-the-loop threshold for low-confidence named entities. The point is not to worship one fix. The point is to stop pretending that a low average error rate means the system understands what matters.
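One way to picture the confirmation and human-in-the-loop options is as a simple routing rule on entity confidence. This is a sketch under strong assumptions: it presumes the ASR system exposes a usable confidence score for the entity, which not every system does, and the two thresholds are arbitrary placeholders that would need tuning on real traffic.

```python
def choose_control(
    entity: str,
    confidence: float,
    confirm_below: float = 0.9,   # placeholder threshold, not a recommendation
    escalate_below: float = 0.5,  # placeholder threshold, not a recommendation
) -> str:
    """Pick a control for a transcribed entity based on its confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < escalate_below:
        return "escalate_to_human"
    if confidence < confirm_below:
        return f'confirm: "Did you say {entity}?"'
    return "accept"


print(choose_control("Arguello Boulevard", 0.95))  # accept
print(choose_control("Arguello Boulevard", 0.70))  # confirm: "Did you say Arguello Boulevard?"
print(choose_control("Arguello Boulevard", 0.30))  # escalate_to_human
```

The rule only helps if confidence is calibrated for the speakers who are most often misheard, which is itself something the stratified evaluation should check.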

Speech models are now good enough to be deployed widely. That is precisely why their evaluation must become less lazy.

The old question was:

Can the model transcribe speech?

The operational question is:

Can the model correctly hear the word that moves the resource?

For taxis, that word has coordinates. For hospitals, it has consequences. For businesses, it has cost.

And for anyone still waving around aggregate WER as proof of deployment readiness: congratulations, the spreadsheet is clean. The car is still on the wrong street.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, and James Zou, “‘Sorry, I Didn’t Catch That’: How Speech Models Miss What Matters Most,” arXiv:2602.12249, 2026, https://arxiv.org/abs/2602.12249.
