Talk Is Cheap, Until It Trains ASR
Call centers are very good at producing audio. They are much worse at producing clean, labeled, domain-matched, multi-speaker training data.
That distinction matters. A business may have thousands of hours of customer calls, branch conversations, medical consultations, field-service recordings, or internal support audio. But most of it is noisy, consent-constrained, poorly transcribed, unevenly distributed across accents and topics, and inconveniently full of humans doing human things: interrupting, pausing, talking over each other, drifting off-topic, and using domain-specific shorthand as if the ASR model had attended the onboarding session.
The paper “Efficient ASR Training with Conversations that Never Happened” addresses exactly this bottleneck [1]. Its core proposal is not simply “use synthetic data,” which is now the AI equivalent of saying “try electricity.” The more useful claim is narrower: if you need conversational ASR for a lower-resource language or a niche domain, you can generate structured synthetic conversations with LLMs, synthesize them with metadata-matched TTS voices, simulate multi-speaker timing, and train an ASR model on the resulting audio.
The paper’s most interesting lesson is not that synthetic speech helps. It does. The sharper lesson is that synthetic speech is not fungible. Generator choice matters. Mixtures matter. Fixed-budget diversity can hurt. Real-utterance simulation still contributes something LLM-generated speech does not. More data helps only when the added data resembles the missing part of the target world. Apparently even fake conversations need quality control. How inconvenient.
The real comparison is not synthetic versus real, but matched versus generic
The obvious reading of the paper is: “LLMs can generate data for speech recognition.” That is true, but it is not the useful reading.
The business reading is different. Companies do not usually ask, “Can we generate artificial audio?” They ask whether they can avoid collecting and annotating another mountain of expensive, messy, sensitive, domain-specific speech. The paper’s answer is: perhaps, if the synthetic data is designed to match the conversational task rather than merely inflate the training set.
The authors evaluate the method on Hungarian BEA-Dialogue, a multi-speaker conversational ASR benchmark. The experimental setup is deliberately comparative. They do not just add one synthetic dataset and declare victory. They compare five LLM families, test single-generator scaling, test fixed-budget mixtures, then remove the fixed budget and scale selected generator combinations. Finally, they compare against several baselines, including real-only training, speaker-aware simulated conversations from real utterances, Whisper zero-shot, and a Hungarian monolingual model trained on 2700 hours of labeled speech.
This makes the paper unusually useful for operational interpretation. It separates several questions that are often blurred together:
| Question | Paper test | Likely purpose | Business reading |
|---|---|---|---|
| Does synthetic conversation help at all? | BEA-Dialogue only vs synthetic augmentation | Main evidence | Synthetic augmentation can reduce error when real conversational data is scarce. |
| Does the LLM generator matter? | Five single-generator conditions | Ablation / sensitivity test | Vendor/model choice affects downstream ASR, not just text fluency. |
| Does mixing generators improve diversity? | Fixed 500-conversation mixtures | Ablation on composition | Diversity helps only when generators are complementary. |
| Does scaling synthetic data still help? | Full-data scale-up conditions | Main evidence / scaling test | More synthetic speech can help, but marginal quality matters. |
| Is LLM-generated data enough? | LLM scale-up + real-utterance simulation | Comparison with prior augmentation | Generated and real-derived synthetic conversations capture different useful signals. |
| Are gains likely noise? | Bootstrap significance testing | Robustness check | Some cpWER gains are statistically supported; cpCER gains are narrower. |
That table is the paper in business language. The problem is not whether synthetic data is “real enough.” The problem is whether it is useful enough in the right dimensions.
The pipeline creates situations, voices, and timing
The method has three stages.
First, an LLM generates a scenario, speaker metadata, and a structured dialogue. The metadata includes age, gender, occupation, and conversational role. The prompts encourage topic diversity, autobiographical or experiential discussion, realistic variation in ages and occupations, and avoidance of topics that are too broad or too niche.
Second, the text is converted into speech using xTTS-v2. This is not done with one generic voice. Each generated speaker is mapped to a reference voice profile from a filtered BEA-Large speaker bank. The selection matches gender and chooses the closest available age. The bank contains 287 speakers after filtering, excludes speakers from the development and evaluation sets, and uses spontaneous-speech reference segments rather than read speech.
Third, the synthesized turns are assembled into multi-speaker conversations using speaker-aware conversation simulation. This stage inserts turn timing, pauses, speaker changes, and overlap patterns. In other words, the output is not a pile of isolated TTS sentences. It is a timestamped conversation-shaped waveform with aligned transcripts.
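The voice-matching rule in the second stage (same gender, closest available age, held-out speakers excluded) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `ReferenceVoice` record, the `pick_reference_voice` function, the tie-breaking rule, and the toy bank are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceVoice:
    speaker_id: str
    gender: str
    age: int


def pick_reference_voice(gender, age, bank, excluded_ids=frozenset()):
    """Return the same-gender bank voice whose age is closest to the requested age."""
    candidates = [
        voice for voice in bank
        if voice.gender == gender and voice.speaker_id not in excluded_ids
    ]
    if not candidates:
        raise ValueError(f"no reference voice available for gender={gender!r}")
    # Assumption: ties on age distance are broken by speaker_id for determinism.
    return min(candidates, key=lambda v: (abs(v.age - age), v.speaker_id))


# Toy bank; the real bank described above has 287 filtered speakers.
bank = [
    ReferenceVoice("spk_a", "female", 24),
    ReferenceVoice("spk_b", "female", 51),
    ReferenceVoice("spk_c", "male", 33),
    ReferenceVoice("spk_d", "male", 67),
]

print(pick_reference_voice("female", 45, bank).speaker_id)
```

Passing development and evaluation speaker IDs through `excluded_ids` keeps the exclusion step explicit rather than relying on the bank having been filtered upstream.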
For business readers, this is the important mechanism:
LLM scenario + speaker metadata
↓
LLM dialogue turns
↓
metadata-matched TTS voice selection
↓
speaker-aware timing, pauses, and overlap
↓
ASR training data that resembles conversational audio
This is why the paper should not be filed under “synthetic audio generation” alone. Its target is the joint structure of conversational data: what is said, who says it, and how the turns collide in time.
That last part is easy to underestimate. Many enterprise ASR failures do not happen on clean single-speaker dictation. They emerge in the ugly zone: a customer interrupts, an agent talks over the customer, the speaker changes before the model has recovered, and the transcript becomes a polite hallucination wearing a headset.
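To make "how the turns collide in time" concrete, here is a deliberately simplified timing sketch. It does not reproduce the speaker-aware simulation used in the paper: the uniform pause and overlap distributions, the parameter values, and the names `TimedTurn` and `lay_out_turns` are assumptions chosen only to show how pauses and overlaps turn a list of turns into a timeline.

```python
import random
from dataclasses import dataclass


@dataclass
class TimedTurn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def lay_out_turns(turns, overlap_prob=0.2, max_overlap=0.5, max_pause=1.0, seed=0):
    """Place (speaker, duration) turns on a timeline with pauses and overlaps."""
    rng = random.Random(seed)
    timeline = []
    latest_end = 0.0
    prev_start = 0.0
    prev_speaker = None
    for speaker, duration in turns:
        if prev_speaker is None:
            start = 0.0
        elif speaker != prev_speaker and rng.random() < overlap_prob:
            # Overlap: the new speaker starts before the previous audio has ended,
            # but never earlier than the previous turn's start.
            start = max(prev_start, latest_end - rng.uniform(0.0, max_overlap))
        else:
            # Pause: a silent gap after everything spoken so far.
            start = latest_end + rng.uniform(0.0, max_pause)
        timeline.append(TimedTurn(speaker, start, start + duration))
        prev_start, prev_speaker = start, speaker
        latest_end = max(latest_end, start + duration)
    return timeline


turns = [("agent", 2.0), ("customer", 1.5), ("agent", 3.0)]
for turn in lay_out_turns(turns, seed=1):
    print(f"{turn.speaker:9s} {turn.start:5.2f} -> {turn.end:5.2f}")
```

In a real pipeline the resulting start and end times would drive waveform mixing and become the timestamps of the aligned transcript.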
Comparison one: real-only training hears too little of the target world
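The BEA-Dialogue-only baseline uses 67 hours of real conversational training speech and reaches 20.44 cpWER and 9.00 cpCER. Lower is better. cpWER and cpCER are concatenated minimum-permutation word and character error rates, designed for multi-speaker recognition: each speaker’s utterances are concatenated, and the error is computed under the speaker assignment that minimizes it, so the score reflects both what was said and who was credited with saying it.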
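For readers who have not met the metric before, the sketch below shows the general idea behind cpWER (concatenated minimum-permutation word error rate): concatenate each speaker's utterances, try every mapping of hypothesis speakers to reference speakers, and keep the mapping with the fewest word errors. It is a brute-force illustration under stated assumptions, not the scoring tool used in the paper; the function names and toy transcripts are hypothetical, and production scorers typically use an assignment solver instead of enumerating permutations.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Both arguments map a speaker label to that speaker's utterances in time order."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


ref = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
hyp = {"spk1": ["fine thank"], "spk2": ["hello there how are you"]}
print(f"{cp_wer(ref, hyp):.3f}")
```

The hypothesis labels (`spk1`, `spk2`) never need to match the reference labels; the permutation search handles that. cpCER follows the same logic with characters in place of words.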
Adding synthetic conversations improves performance across the tested LLM generators. In the best single-generator setup, GPT-5.4 mini reaches 17.75 cpWER and 8.20 cpCER with 146 synthetic hours added to the 67 hours of real training data. Claude Haiku, Gemini, Grok, and Qwen also improve over real-only training, although with different strength.
The best single-generator results are:
| Single generator | Best conversation count | cpCER | cpWER |
|---|---|---|---|
| GPT-5.4 mini | 500 | 8.20 | 17.75 |
| Claude Haiku 4.5 | 500 | 8.26 | 18.02 |
| Gemini 3.5 Flash | 500 | 8.26 | 18.18 |
| Grok 4.1 non-reasoning | 500 | 8.41 | 18.58 |
| Qwen3-235B-A22B | 500 | 8.47 | 18.60 |
This is main evidence, but it also functions as a sensitivity test. All generators help, but not equally. The 0.85-point cpWER gap between GPT (17.75) and Qwen (18.60) is not a rounding error. It is the gap between a stronger augmentation source and a weaker one under the same ASR recipe.
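The size of each generator's gain is easier to compare as a relative reduction against the real-only baseline. The short calculation below uses only the cpWER values quoted above; the helper name `relative_reduction` is illustrative.

```python
REAL_ONLY_CPWER = 20.44  # BEA-Dialogue-only baseline

BEST_SINGLE_CPWER = {
    "GPT-5.4 mini": 17.75,
    "Claude Haiku 4.5": 18.02,
    "Gemini 3.5 Flash": 18.18,
    "Grok 4.1 non-reasoning": 18.58,
    "Qwen3-235B-A22B": 18.60,
}


def relative_reduction(baseline, value):
    """Fraction of the baseline error removed by the new system."""
    return (baseline - value) / baseline


for name, cpwer in BEST_SINGLE_CPWER.items():
    absolute = REAL_ONLY_CPWER - cpwer
    relative = relative_reduction(REAL_ONLY_CPWER, cpwer)
    print(f"{name:24s} {absolute:.2f} points absolute, {relative:.1%} relative")
```

On these figures the strongest generator removes roughly 13% of the baseline word error and the weakest roughly 9%.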
The authors also show a scale-down ablation in Figure 1: each generator is tested across 100, 200, 300, 400, and 500 generated conversations. The dominant pattern is improvement with scale, especially as the synthetic set approaches 500 conversations. But the curves are not perfectly monotonic. Grok and Qwen fluctuate, and their relative ordering changes at intermediate scales.
That matters because it kills a lazy procurement rule: “buy the cheapest generator and generate more.” Grok is extremely cheap in the paper’s API-cost table, but its single-generator ASR result trails GPT, Haiku, and Gemini. The cost of generation is not the same as the cost of error. In production, that difference shows up as agent corrections, failed call summaries, bad search, compliance review, and angry customers saying “that is not what I said,” which is a remarkably old-fashioned but still valid evaluation metric.
Comparison two: the best generator is not automatically the best teammate
The next experiment asks whether mixing LLMs improves performance. The authors fix the synthetic budget at 500 conversations and test uniform mixtures. This is a composition ablation: it does not ask whether more data helps; it asks whether replacing some samples from one generator with samples from another improves the training set.
The answer is selective.
| Fixed-budget mixture | Best subset | cpCER | cpWER | Change vs best single generator |
|---|---|---|---|---|
| 1-mix | GPT | 8.20 | 17.75 | 0.00 |
| 2-mix | GPT + Haiku | 8.19 | 17.56 | -0.19 |
| 3-mix | GPT + Haiku + Qwen | 8.22 | 17.87 | +0.12 |
| 4-mix | GPT + Haiku + Qwen + Grok | 8.29 | 18.19 | +0.44 |
| 5-mix | GPT + Haiku + Qwen + Grok + Gemini | 8.35 | 18.27 | +0.52 |
The best fixed-budget mixture is GPT + Haiku, which improves cpWER by 0.19 absolute points over GPT alone. That is modest, but meaningful in this setting because the total number of conversations is fixed. The mixture is not winning by adding volume. It is winning by adding complementarity.
Then the result turns. Adding more generators under the same budget makes performance worse. The best three-way mixture is weaker than GPT alone. Four-way and five-way mixtures degrade further.
This is the paper’s cleanest correction to a common synthetic-data misconception. Diversity is not a holy substance. It is not sprinkled over a dataset to make models generalize out of gratitude. At a fixed budget, every added generator replaces some samples from another generator. If the added generator contributes less useful or less compatible variation, the average quality of the synthetic set falls.
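The replacement effect is simple arithmetic. The sketch below splits a fixed 500-conversation budget evenly across the generators in each best subset from the table. The uniform split follows the article's description of the mixtures; how a remainder is distributed is an assumption, and `uniform_allocation` is a hypothetical helper.

```python
def uniform_allocation(generators, budget=500):
    """Split a fixed conversation budget as evenly as possible across generators."""
    base, extra = divmod(budget, len(generators))
    # Assumption: the first `extra` generators each take one leftover conversation.
    return {name: base + (index < extra) for index, name in enumerate(generators)}


mixes = [
    ["GPT"],
    ["GPT", "Haiku"],
    ["GPT", "Haiku", "Qwen"],
    ["GPT", "Haiku", "Qwen", "Grok"],
    ["GPT", "Haiku", "Qwen", "Grok", "Gemini"],
]

for mix in mixes:
    allocation = uniform_allocation(mix)
    share = allocation["GPT"] / sum(allocation.values())
    print(f"{len(mix)}-mix: {allocation} (GPT share {share:.0%})")
```

By the five-way mixture, the strongest single generator supplies only a fifth of the conversations, so the other four have to contribute at least as much useful variation as the GPT samples they displaced.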
The figures sharpen this point. Figure 2 is an exploratory subset visualization: it shows that the best-performing mixtures cluster around GPT and Haiku rather than around larger mixtures. Figure 3 is a pairwise diagnostic heatmap: it shows that pairwise complementarity is concentrated in a small number of pairings. GPT + Haiku is strong; Haiku + Qwen is also comparatively good; some other pairs are clearly worse.
One subtle result is especially useful: Gemini is strong as a single generator, but absent from the top subset combinations discussed by the authors. Qwen is weaker alone, yet appears in two of the top five mixture configurations. That means standalone generator ranking does not fully predict mixture value.
For an enterprise team, the implication is blunt: do not evaluate synthetic-data providers only by single-model sample quality. Evaluate them as components of a training distribution. A generator can be beautiful alone and redundant in a mixture. Another can look weaker alone but add useful variation when paired with a stronger source. Yes, even data vendors can be socially awkward.
Comparison three: more data helps when the marginal data earns its place
The fixed-budget mixture experiment says that diversity can dilute quality. The scale-up experiment asks a different question: what happens when the budget is removed and adding a generator increases the total amount of synthetic speech?
Here the results are more favorable to scaling, but still not blindly so.
| Setup | Configuration | Training size | cpCER | cpWER |
|---|---|---|---|---|
| Whisper zero-shot | Whisper-large-v3 | N/A | 12.18 | 22.13 |
| Hungarian monolingual zero-shot | 2700h model | 2700h | 7.71 | 16.27 |
| BEA-Dialogue only | Real-only | 67h | 9.00 | 20.44 |
| BEA-Dialogue + simulated | Real + BEA-Large-derived simulation | 67h + 209h | 8.13 | 17.64 |
| 1-scale | GPT | 67h + 146h | 8.20 | 17.75 |
| 2-scale | GPT + Haiku | 67h + 273h | 8.03 | 16.96 |
| 3-scale | GPT + Haiku + Qwen | 67h + 346h | 8.05 | 16.86 |
| 4-scale | GPT + Haiku + Qwen + Grok | 67h + 427h | 7.97 | 16.65 |
| 5-scale | GPT + Haiku + Qwen + Grok + Gemini | 67h + 512h | 8.06 | 16.68 |
| 4-scale + sim | GPT + Haiku + Qwen + Grok + simulated | 67h + 427h + 209h | 7.57 | 15.40 |
This is the main evidence table. It contains three important comparisons.
First, GPT alone nearly matches the prior BEA-Dialogue + simulated baseline: 17.75 cpWER versus 17.64. That means LLM-generated conversational speech is already competitive with a strong real-utterance-derived simulation baseline, but not superior by itself.
Second, scaling selected generators helps. GPT + Haiku gives the largest marginal gain, reducing cpWER from 17.75 to 16.96. Adding Qwen improves cpWER further to 16.86, though cpCER slightly worsens relative to the two-generator condition. Adding Grok produces the best pure LLM-generated scale-up result: 16.65 cpWER and 7.97 cpCER.
Third, adding Gemini increases the synthetic pool from 427 to 512 hours but slightly worsens performance. This is the scale-up version of the same lesson: more is useful only when the marginal data is useful. Synthetic hours are not calories. You do not get stronger by consuming them indiscriminately.
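The marginal view can be read straight off the scale-up table. The snippet below computes, for each added generator, the hours added and the change in cpWER, using only the table's numbers; `marginal_gains` is an illustrative helper, and the per-100-hour figure is a convenience normalization rather than anything the paper reports.

```python
# (step, cumulative synthetic hours, cpWER) from the scale-up table
SCALE_UP = [
    ("GPT", 146, 17.75),
    ("+ Haiku", 273, 16.96),
    ("+ Qwen", 346, 16.86),
    ("+ Grok", 427, 16.65),
    ("+ Gemini", 512, 16.68),
]


def marginal_gains(rows):
    """Return (step, hours added, cpWER change, cpWER change per 100 added hours)."""
    gains = []
    for (_, hours_before, wer_before), (step, hours_after, wer_after) in zip(rows, rows[1:]):
        added = hours_after - hours_before
        delta = round(wer_after - wer_before, 2)  # negative means improvement
        gains.append((step, added, delta, round(100 * delta / added, 3)))
    return gains


for step, added, delta, per_100h in marginal_gains(SCALE_UP):
    print(f"{step:9s} +{added:3d}h  cpWER {delta:+.2f}  ({per_100h:+.3f} per 100h)")
```

Haiku's 127 hours buy far more than any later addition, and Gemini's 85 hours move cpWER slightly in the wrong direction.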
The 2700-hour comparison is impressive, but it is not magic
The most headline-friendly result is the final row: 67 hours of real BEA-Dialogue training data plus 427 hours of LLM-generated synthetic speech plus 209 hours of BEA-Large-derived simulated conversations reaches 15.40 cpWER and 7.57 cpCER.
That beats the 2700-hour Hungarian monolingual model, evaluated zero-shot, which reaches 16.27 cpWER and 7.71 cpCER. Only the cpWER gap is statistically supported: the authors report that the 4-scale + sim model significantly improves over the 2700-hour baseline in cpWER, while cpCER shows no significant difference.
This is strong evidence, but it should be interpreted carefully.
The 2700-hour model is a zero-shot comparison on BEA-Dialogue. Its training data did not include BEA-Dialogue, though it did include substantial broadcast news and broadcast conversations. The synthetic pipeline, by contrast, is explicitly tuned toward the target conversational benchmark style. So the result does not prove that 703 hours of mixed real and synthetic data are universally better than 2700 hours of real speech. It shows something more operationally relevant: target-matched conversational augmentation can outperform much larger generic training when the evaluation domain is specific.
That is exactly the kind of result enterprises should care about. Most businesses do not need the world’s best generic recognizer. They need a recognizer that performs well on their calls, their jargon, their accents, their interaction patterns, and their failure modes. A thousand hours of broad audio may be less useful than a smaller amount of carefully matched conversational data. This is not glamorous. It is just how distributions work.
The combined final row also clarifies why LLM-generated conversations do not make real speech obsolete. The best system combines two augmentation sources:
| Augmentation source | What it contributes | What it may miss |
|---|---|---|
| LLM-generated + TTS conversations | Controlled topic, scenario, role, and dialogue diversity | Natural acoustic texture, real lexical quirks, spontaneous speech artifacts |
| Real-utterance-derived simulation | Natural speech and lexical material from real utterances | New scenario generation and controlled metadata coverage |
The gain from 16.65 cpWER to 15.40 cpWER after adding BEA-Large-derived simulation suggests complementarity. LLM data supplies controllable conversational variety. Real-derived simulation preserves natural acoustic and lexical properties. A practical system should not treat these as rivals. It should treat them as different ingredients in a training-data portfolio.
The cost signal is real, but it is not the whole ROI
The paper reports approximate API generation costs for the synthetic dialogue generation step after batch discounts:
| Generator | Approximate API cost | Final synthetic audio |
|---|---|---|
| GPT-5.4 mini | $5.81 | 146h |
| Claude Haiku 4.5 | $8.70 | 127h |
| Gemini 3.5 Flash | $4.50 | 85h |
| Grok 4.1 non-reasoning | $0.55 | 81h |
| Qwen3-235B-A22B | $2.62 | 73h |
Those numbers are almost comically small compared with the cost of collecting and transcribing high-quality domain speech. But the paper’s better business message is not “synthetic data is cheap.” Cheap data that degrades a model is an expensive hobby.
The relevant ROI calculation has at least five layers:
- Generation cost: LLM API cost for scenario and dialogue generation.
- Synthesis cost: TTS inference cost and quality control.
- Training cost: GPU time and experiment iteration.
- Evaluation cost: real held-out test data, domain review, and error analysis.
- Operational error cost: downstream impact of transcription mistakes.
The paper mainly quantifies the first layer, generation cost, alongside the downstream ASR metrics. It does not fully price the whole deployment pipeline, and it does not need to. Its contribution is to show that data composition affects recognition enough that procurement decisions should be evaluation-driven, not price-list-driven.
For a business building ASR, the right question becomes: “Which synthetic data source reduces our target-domain errors per dollar of total pipeline cost?” Not “Which generator produces the most hours for the least money?” That second question is how one buys an impressively large pile of mediocre transcripts.
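A rough version of that question can be worked through with the article's own numbers. The sketch below divides each generator's single-generator cpWER gain by an assumed total cost. Only the API costs, hours, and cpWER values come from the tables above; the per-hour overhead for TTS and training is a hypothetical placeholder, and `gain_per_dollar` is an illustrative name.

```python
REAL_ONLY_CPWER = 20.44  # BEA-Dialogue-only baseline

# name: (API cost in USD, synthetic hours, best single-generator cpWER)
GENERATORS = {
    "GPT-5.4 mini": (5.81, 146, 17.75),
    "Claude Haiku 4.5": (8.70, 127, 18.02),
    "Gemini 3.5 Flash": (4.50, 85, 18.18),
    "Grok 4.1 non-reasoning": (0.55, 81, 18.58),
    "Qwen3-235B-A22B": (2.62, 73, 18.60),
}


def gain_per_dollar(other_usd_per_hour):
    """cpWER points gained over real-only training per dollar of assumed total cost."""
    scores = {}
    for name, (api_usd, hours, cpwer) in GENERATORS.items():
        # Hypothetical cost model: API cost plus a flat overhead per synthetic hour.
        total_usd = api_usd + hours * other_usd_per_hour
        scores[name] = (REAL_ONLY_CPWER - cpwer) / total_usd
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


for overhead in (0.0, 1.0):  # placeholder values, not measured costs
    ranking = gain_per_dollar(overhead)
    print(f"overhead ${overhead:.2f}/h:", [f"{n} {s:.3f}" for n, s in ranking.items()])
```

With API cost alone, the cheapest generator tops the list by a wide margin; once even a modest per-hour overhead is assumed, the ordering changes. The ratio is only as good as the cost model behind it, and it says nothing about mixture effects or the operational cost of the remaining errors.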
The practical playbook is a data portfolio, not a data dump
A company applying this idea should not begin by generating 10,000 fake conversations and hoping the model becomes grateful. The paper points toward a more disciplined workflow.
| Step | Practical action | Paper logic |
|---|---|---|
| Define target conversations | Identify call types, roles, topics, speaker attributes, and interaction patterns | The LLM stage generates scenarios and speaker metadata, not just sentences. |
| Build or select a reference voice bank | Cover relevant age/gender/accent ranges and exclude evaluation speakers | The TTS stage depends on metadata-conditioned voice matching. |
| Generate small competing sets | Test several LLMs or prompt variants under the same training recipe | Single-generator results differ materially. |
| Test mixture value under a fixed budget | Compare pairings before scaling | GPT + Haiku helped; larger mixtures diluted performance. |
| Scale only useful sources | Add full datasets from generators that contribute marginal gains | Scale-up helped until Gemini slightly worsened results. |
| Combine generated and real-derived simulation | Preserve natural acoustic and lexical properties where possible | The best result came from LLM synthetic plus BEA-Large-derived simulation. |
| Evaluate on real held-out conversations | Use target-domain eval data only, with speaker separation | The paper evaluates on real BEA-Dialogue eval data, not synthetic test data. |
This is the business value: cheaper diagnosis of what data the ASR system lacks.
Synthetic conversation generation gives teams a way to probe the training distribution. If adding a scenario family improves recognition on held-out real calls, the system lacked that conversational pattern. If adding a generator hurts under fixed budget, the new data probably replaces more valuable examples. If real-derived simulation still improves the best LLM setup, the model still needs natural speech properties that TTS has not captured.
That makes synthetic data not just a training shortcut, but an analytical instrument. It reveals which missing data dimensions matter.
Boundary conditions: where the result may not travel cleanly
The paper is careful about portability. The pipeline is said to be transferable to languages where the necessary resources exist, but those resources are not trivial.
First, the method needs usable TTS. If TTS quality is poor for a target language, accent, or speaking style, the ASR model may learn artifacts rather than useful variation. Voice cloning quality also depends on a reference bank with enough coverage.
Second, the method needs a speaker-reference bank with relevant metadata. The paper’s bank has 287 speakers after filtering and excludes speakers from the development and evaluation sets. A business cannot casually skip this hygiene. Speaker leakage can make results look better than they are, which is a delightful way to fool yourself before production humiliates you.
Third, the method depends on domain realism in the generated conversations. The authors prompt for autobiographical and experiential discussion to better match BEA-style conversations. A banking complaint call, a hospital triage exchange, and a port logistics radio conversation are not the same genre. Prompt design and scenario taxonomy would need to reflect the target workflow.
Fourth, the evidence is Hungarian BEA-Dialogue evidence. That is valuable precisely because it is not another English-only demonstration, but it is still one language, one benchmark family, one TTS setup, and one ASR training recipe. The results justify replication, not blind transfer.
Fifth, the strongest cpWER comparison against the 2700-hour model is a target-domain result against a zero-shot baseline. It supports the value of matched augmentation; it does not repeal the usefulness of large real corpora.
Finally, the statistical testing is more supportive for word-level error than for character-level error in some comparisons. The paper reports that scaled models significantly outperform the real-only and Whisper baselines on both cpCER and cpWER. Against the BEA-Dialogue + simulated baseline, several scaled models significantly improve cpWER, but only the 4-scale + sim model significantly improves cpCER. Against the 2700-hour baseline, the 4-scale + sim model significantly improves cpWER, while cpCER shows no significant difference. Translation: the result is strong, but not every metric celebrates equally. Metrics have personalities too.
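For teams that want to run a comparable check on their own held-out data, a generic paired bootstrap looks like the sketch below. The article does not spell out the paper's exact bootstrap procedure, so this is one common variant and not a reproduction: conversations are resampled with replacement, and the function reports how often the candidate system has the lower error rate. The function name and the toy counts are hypothetical.

```python
import random


def paired_bootstrap(per_conversation, n_resamples=2000, seed=0):
    """Fraction of resamples in which the candidate's error rate is lower.

    per_conversation: list of (ref_words, baseline_errors, candidate_errors),
    one tuple per held-out conversation, scored against the same references.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_resamples):
        # Resample whole conversations so both systems see the same draw.
        sample = rng.choices(per_conversation, k=len(per_conversation))
        ref_words = sum(words for words, _, _ in sample)
        baseline_rate = sum(errors for _, errors, _ in sample) / ref_words
        candidate_rate = sum(errors for _, _, errors in sample) / ref_words
        wins += candidate_rate < baseline_rate
    return wins / n_resamples


# Toy counts for illustration only; not results from the paper.
toy = [(100, 20, 15), (120, 30, 31), (80, 10, 9), (150, 33, 27), (90, 14, 15)]
print(paired_bootstrap(toy))
```

A value close to 1.0 suggests the improvement is unlikely to be a resampling accident; a value near 0.5 suggests the two systems are hard to tell apart on this test set. Running it separately on word-level and character-level error counts is how the two metrics can end up disagreeing.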
What this paper directly shows, and what Cognaptus infers
The paper directly shows that, on Hungarian BEA-Dialogue, LLM-generated synthetic conversations synthesized with metadata-conditioned TTS and assembled with speaker-aware simulation improve conversational ASR compared with real-only training. It also shows that generator choice and composition matter: GPT is the strongest single generator in the reported setup, GPT + Haiku is the best fixed-budget pair, larger fixed-budget mixtures can degrade performance, and the strongest system combines LLM-generated conversations with real-utterance-derived simulated conversations.
Cognaptus infers a broader operational lesson: for enterprises, synthetic conversational data should be managed as a targeted augmentation portfolio. The goal is not to maximize synthetic hours. The goal is to cover missing conversational structures, speaker attributes, domain topics, and timing patterns that the real training set underrepresents.
What remains uncertain is how this pipeline behaves across other languages, accents, regulated domains, and noisier business audio. The paper itself suggests future work on more domain-specific corpora and larger scaling studies. That is the right next step. In production, the only acceptable proof is held-out real audio from the target environment. Synthetic evaluation would be too convenient, and convenience is often where benchmarking goes to die.
Conclusion: the winning fake conversation is the one that fixes a real error
This paper is useful because it refuses the simple story. It does not say synthetic speech automatically solves low-resource ASR. It does not say one LLM is universally best. It does not say more generator diversity is always better. It does not even say LLM-generated audio replaces real-derived simulation.
It says something more practical: when real conversational data is scarce, structured synthetic conversations can improve ASR, but only when the synthetic data is matched, composed, scaled, and evaluated with discipline.
For business teams, that shifts synthetic data from a novelty to an engineering lever. Generate scenarios that match the target workflow. Use voices that represent the speaker population. Simulate the timing patterns that break recognizers in real conversations. Compare generators under the same recipe. Scale only what earns its place. Keep real held-out evaluation sacred.
Talk may be cheap. Useful synthetic conversation is cheaper than annotation, but only after it survives the audit.
Cognaptus: Automate the Present, Incubate the Future.
[1] Máté Gedeon and Péter Mihajlik, “Efficient ASR Training with Conversations that Never Happened,” arXiv:2606.03957, 2026.