A voice assistant has one job before it has any other job: do not make the user wonder whether it heard them.
That tiny silence after a user stops speaking is not merely awkward. It is a control signal. It tells the user whether the system is alive, attentive, confused, or quietly regretting its product roadmap. In text chat, a delay can be tolerated because the medium already feels asynchronous. In speech, delay feels personal. The room has a rhythm, and the machine has missed the beat.
That is the practical problem behind RelayS2S, a new paper on real-time spoken dialogue systems.1 The paper is easy to summarize badly: a system reaches 81 ms P90 response-onset latency while retaining about 99% of cascaded response quality in average score. Good headline. Slightly dangerous headline. The real point is not that the authors shaved milliseconds from a pipeline. It is that they changed which part of the answer must be ready first.
RelayS2S does not ask the whole AI system to become fast. It asks only the opening of the response to be fast, verified, and speakable. The rest of the answer can arrive through the slower, stronger path. This is less like making a thinker sprint and more like teaching a relay team when to pass the baton without dropping it in front of the customer.
The old voice-agent trade-off is really a timing trap
Most production-style voice assistants are built as cascaded systems: detect speech, transcribe it with ASR, send the text to an LLM, then synthesize the response with TTS. This architecture is attractive because each component can be strong. Whisper-like ASR can be robust. A text LLM can reason, follow instructions, and use tools. A good TTS model can sound natural. The unpleasant part is that these components often sit in a queue.
End-to-end speech-to-speech systems attack the queue directly. They listen and respond in the same modality, often with better support for turn-taking, interruption, and backchannels. They can sound conversational because they operate closer to the timing of speech itself. The problem is that semantic quality usually lags stronger cascaded pipelines. They are quick on their feet, but sometimes their feet are doing most of the thinking.
So the perceived choice has been unattractive:
| Architecture | What it gives you | What it makes you pay |
|---|---|---|
| End-to-end S2S | Low onset latency, natural turn-taking, interruption handling | Weaker semantic response quality |
| Cascaded ASR-LLM-TTS | Stronger reasoning and instruction following | Higher onset latency, especially with larger or API-served LLMs |
| RelayS2S | Fast audible start plus stronger continuation | Extra coordination risk at the handoff |
The usual workaround is filler. The system says “Well,” “Let me think,” or “Sure,” while the real answer forms somewhere behind the curtain. This is not wrong. Humans use fillers too. But a filler-only design buys a very small buffer and often says nothing useful. It is the conversational equivalent of a loading spinner wearing a tie.
RelayS2S is more ambitious. It speculates that the first few words of a response are often predictable enough to speak, then lets a stronger LLM continue from that committed opening. The paper reports that only 8.5% of five-word S2S prefixes in its analysis were judged contextually inappropriate. That number is the hinge of the whole design. If most openings are usable, then the first seconds of speech can be produced by a fast model while the stronger model catches up.
RelayS2S splits one answer into two clocks
The system runs two paths in parallel after the model decides that the user has yielded the floor.
The fast path is a duplex speech-to-speech model. Its job is not to solve the whole conversation. Its job is to listen, manage dialogue control, and draft a short response prefix. If the prefix is approved, it is sent immediately to streaming TTS. The user hears the beginning of an answer almost at once.
The slow path is the more familiar ASR-LLM pipeline. It transcribes the user’s utterance, sends the dialogue context to a stronger text LLM, and generates the continuation. If a fast-path prefix has already been committed, the LLM is instructed to continue from the exact prefix rather than produce a new response from scratch.
That distinction matters. The prefix is not a suggestion. It is already spoken, or at least already committed to be spoken. The slow-path LLM must treat it as a hard context and continue from there. The paper’s appendix gives the prompt template used for this continuation task: the model is told that the assistant has already started its next response and must output only the continuation. This is an implementation detail, but not a decorative one. Without it, the relay becomes two agents talking over each other, which is generally discouraged outside certain committee meetings.
A simplified view looks like this:
User finishes speaking
|
v
Fast S2S path: draft short prefix -> verifier -> TTS begins speaking
| |
| v
| user hears response onset
|
Slow ASR-LLM path: transcribe -> stronger LLM continues from committed prefix
|
v
Shared response buffer -> seamless continuation if timing works
The mechanism is not “make GPT-4o faster.” The mechanism is “remove GPT-4o from the first-audio critical path.” That is why the result becomes more interesting as the slow-path model grows. A larger or remote model can be slower, but the user may not experience that slowness if the prefix creates enough speech time for the slow continuation to arrive.
The first five words are a bridge, not the answer
RelayS2S relies on a simple asymmetry in dialogue: the first few words often carry less semantic burden than the middle of the response. Many useful openings are structurally predictable: “Sure, I can help,” “That depends on,” “The main issue is,” “Yes, but only if.” They are not content-free, but they do not usually require the full reasoning chain.
The paper uses five words as the main prefix length in the core experiments. Five spoken words correspond to roughly two seconds of audio, enough for streaming TTS to start and for the slow path to begin producing the continuation. In other words, the prefix is not valuable because five words solve the user’s request. It is valuable because five words create a temporal bridge.
This changes how we should interpret the latency metric. The reported response-onset latency is the time from turn detection to the availability of the first synthesizable words. It excludes turn-detection latency, network latency, and TTS synthesis time. That is a reasonable comparative definition inside the paper, because the same omitted components affect the compared systems. But it is not the same as “the user hears audio after exactly 81 ms in every production setting.” Product managers, please put the confetti cannon down.
The business-relevant version is narrower and more useful: RelayS2S reduces the part of latency caused by waiting for ASR plus LLM generation before any response can begin. In a production stack, total perceived latency still depends on endpointing, network conditions, TTS startup, device constraints, and orchestration. The paper does not erase those. It shows a way to stop the slowest reasoning component from blocking the first audible response.
Forked speculative generation keeps speed from destroying interruptibility
A naive version of this idea would be easy to break. Let the S2S model generate the first words as fast as possible, send them to TTS, and hope the user does not interrupt. That would lower latency but sacrifice the best reason to use a duplex model in the first place: it can keep listening while speaking.
RelayS2S avoids this with forked speculative generation. When the fast-path model emits a response-start control token, it forks into two streams.
One stream remains the main online stream. It keeps processing live user audio and preserves the model’s ability to detect barge-in. If the user interrupts, this stream can emit a stop token and halt playback.
The other stream becomes the speculative stream. It stops observing future speech inputs and generates the short text prefix at maximum decoding speed. Because it is no longer tied to the 160 ms speech update cycle, it can produce a buffered chunk quickly enough for TTS.
This is the paper’s most important engineering idea. The system does not choose between “listen carefully” and “speak quickly.” It splits those responsibilities. One branch keeps the conversation socially safe; the other branch manufactures the opening words.
The authors evaluate turn-taking separately because the fast path is responsible for dialogue control. The results are strong for the core control events: stay-silent F1 is 99.8%, start-speaking F1 is 88.5%, and stop-speaking F1 is 96.7%. Backchannel prediction is weaker, with F1 at 50.8%, driven by lower precision. That weakness is not a footnote to ignore. Backchannels are subjective and socially delicate. A badly timed “mm-hmm” can be worse than silence, especially if the user is explaining a billing error, a medical symptom, or why they are already irritated.
So the practical reading is: the system looks promising for start/stop conversational control, while backchanneling remains less mature. That is not a fatal limitation. It is a deployment warning label.
The verifier is a risk-control device, not a second brain
Speculation creates risk. If the fast S2S model says something awkward in the first five words, the slow LLM is forced to continue from a bad opening. That is not just a quality problem; it is a continuity problem. The system has already spoken the beginning, and speech cannot be edited retroactively. Voice interfaces are cruel that way.
RelayS2S therefore adds a lightweight prefix verifier. The verifier decides whether the speculative prefix should be committed or suppressed. If it approves the prefix, the system gets the latency benefit. If it rejects the prefix, the system falls back to the cascaded path and behaves like a normal slower pipeline.
The paper reports that the verifier has about 170K parameters and adds roughly 10 ms of overhead. It reuses decoder hidden states and calibration signals such as entropy, selected-token log probability, and top-two margin. That makes the verifier closer to a gate on already-computed evidence than a new model that rereads the whole conversation.
Its operating trade-off is worth reading carefully:
| Verifier threshold | Bad prefixes committed | Good prefixes committed | Overall fallback rate | Interpretation |
|---|---|---|---|---|
| 0.25 | 81.0% | 99.8% | 1.8% | Almost always takes the latency win, but lets too many bad prefixes through |
| 0.50 | 45.7% | 96.3% | 8.0% | The paper’s chosen balance between speed and safety |
| 0.75 | 11.8% | 72.8% | 32.3% | Much safer, but gives up the fast path too often |
This table should prevent a common misreading. The verifier does not make bad prefixes disappear. At the selected threshold, it still commits 45.7% of bad prefixes. The comfort is that bad prefixes are a minority, and the system commits 96.3% of good ones while falling back only 8.0% of the time.
For business deployment, this is exactly the kind of dial that matters. A casual tutoring assistant can accept more prefix risk. A healthcare triage bot, insurance claims assistant, or high-friction customer-service system should prefer a stricter threshold, more fallbacks, and fewer awkward openings. The same architecture can serve different risk appetites, assuming teams actually tune it rather than worship the default number. A bold assumption, but hope is free.
The main evidence shows latency decoupling, not magical quality improvement
The core experiment compares pure S2S, cascaded ASR-LLM systems, and RelayS2S across three slow-path LLM configurations. The test set contains about 3,000 held-out conversations, yielding 6,401 context-response pairs. Response quality is judged by Gemini-3 on a 1–5 scale, with low-quality rate also reported.
The headline table is compact but revealing:
| System | Slow-path LLM | P90 onset latency | Avg quality | Low-quality rate |
|---|---|---|---|---|
| S2S only | — | 71 ms | 3.04 | 59.3% |
| Cascaded | Qwen2.5-0.5B | 420 ms | 3.34 | 51.8% |
| RelayS2S | Qwen2.5-0.5B | 81 ms | 3.36 | 51.4% |
| Cascaded | Qwen2.5-7B | 513 ms | 4.38 | 21.3% |
| RelayS2S | Qwen2.5-7B | 81 ms | 4.35 | 22.3% |
| Cascaded | GPT-4o | 1,091 ms | 4.83 | 5.5% |
| RelayS2S | GPT-4o | 81 ms | 4.78 | 7.4% |
The correct interpretation is not that RelayS2S makes the small S2S model smart. The S2S-only baseline is fast but weak: 59.3% low-quality rate. RelayS2S works because the weak model is used only where it is most tolerable: the beginning of the utterance. The stronger LLM still carries the semantic body of the response.
With GPT-4o as the slow path, RelayS2S reduces P90 onset latency from 1,091 ms to 81 ms while average quality moves from 4.83 to 4.78. The low-quality rate rises from 5.5% to 7.4%. That is not zero cost. It is a small cost relative to the latency collapse, and whether it is acceptable depends on the product setting.
The result is most persuasive because the benefit grows with slow-path strength. Conventional cascaded latency increases as the backend becomes heavier: 420 ms for Qwen2.5-0.5B, 513 ms for Qwen2.5-7B, and 1,091 ms for GPT-4o. RelayS2S stays at 81 ms because the slow path is no longer responsible for the first synthesizable words.
That is the structural contribution. Latency becomes less coupled to backend intelligence.
The prefix-length ablation is a risk curve, not a side note
The paper’s prefix-length ablation asks a practical question: how much speech can the fast path safely claim before the slow path takes over?
Using GPT-4o as the backend, the authors compare 3-, 5-, and 7-word prefixes with and without verifier gating:
| Prefix length | Verifier | Avg quality | Low-quality rate | Likely purpose of the test |
|---|---|---|---|---|
| 3 | No | 4.74 | 8.6% | Ablation: shorter speculative bridge without safety gate |
| 3 | Yes | 4.80 | 6.7% | Ablation: checks whether verifier helps even when prefix is short |
| 5 | No | 4.69 | 10.3% | Ablation: main prefix length without gate |
| 5 | Yes | 4.78 | 7.4% | Main operating point in the paper |
| 7 | No | 4.66 | 11.5% | Ablation: longer bridge with higher semantic exposure |
| 7 | Yes | 4.76 | 8.0% | Sensitivity test for longer prefix under verifier control |
The pattern is intuitive and important. Longer prefixes buy more time but increase semantic risk. The verifier reduces the low-quality rate by roughly 2–3 percentage points across prefix lengths. It does not eliminate the trade-off; it makes the trade-off manageable.
For product design, this becomes a configuration question. A 3-word prefix may be enough for systems with a fast local backend. A 7-word prefix may be useful when the slow path is remote or tool-heavy. A 5-word prefix is a reasonable middle point in the paper’s setting, not a universal law of nature delivered on stone tablets.
What Cognaptus infers for business systems
The paper directly shows that, under its experimental setup, RelayS2S can approach S2S onset latency while preserving most cascaded response quality. Cognaptus infers a broader design lesson: in voice AI, perceived responsiveness can be engineered by separating first-audio onset from full-answer completion.
That matters most in domains where users punish silence before they evaluate intelligence.
| Business setting | Why RelayS2S is relevant | Practical design implication | Boundary |
|---|---|---|---|
| Customer service | Users read silence as failure or indifference | Use fast verified openings to acknowledge the request while a stronger backend prepares the answer | Risk threshold should be stricter in complaint, refund, or compliance contexts |
| Tutoring and coaching | Conversational rhythm affects engagement | Start with a short instructional bridge, then let the LLM produce the explanation | Bad prefixes can misframe a concept, so subject-specific tuning matters |
| Internal voice copilots | Workers want hands-free speed without shallow answers | Keep large backend reasoning while avoiding long dead air | Enterprise latency also depends on tool calls and authorization checks |
| In-car or device assistants | Timing is safety- and attention-sensitive | Maintain interruptibility while reducing onset delay | On-device constraints and noisy environments need separate validation |
| Sales and onboarding | Flow affects trust and conversion | Reduce “robot pause” without relying only on empty filler | Overly eager backchannels may annoy users rather than reassure them |
There is also an architectural implication. Teams often treat voice agents as a chain of best-of-breed components. RelayS2S suggests a more temporal view: which component owns the first 200 ms, which owns the next two seconds, and which owns the final semantic answer? The answer may not be the same component.
That framing is useful beyond voice. Many agentic systems already separate immediate acknowledgment from delayed reasoning in text interfaces. RelayS2S gives the idea a speech-native implementation where timing is no longer just UX decoration. It becomes part of the model architecture.
Where the result should not be over-read
The paper is careful enough to provide useful boundaries, and the business reader should keep them intact.
First, the training data is synthetic. The fast-path model is trained on 104,478 synthetic duplex conversations totaling 2,133 hours of audio, with injected backchannels, interruptions, pauses, and noise. Synthetic data is reasonable for building controlled duplex phenomena, especially when real full-duplex dialogue data is scarce. But real users interrupt in more chaotic ways. They mumble, overlap, hesitate, switch languages, and say “no, wait” at exactly the wrong moment because humans enjoy stress-testing engineering assumptions.
Second, response quality is judged by Gemini-3. LLM-as-judge evaluation is useful for scaling comparison, but it is not the same as human preference testing in live voice interaction. In speech, quality includes prosody, timing, emotional appropriateness, and repair behavior after interruption. The paper’s textual quality score captures an important slice, not the entire meal.
Third, the latency metric excludes turn detection, network latency, and TTS synthesis time. Again, that is acceptable for isolating the architecture’s contribution. It is not a full production latency budget. A deployment team still needs to measure the whole stack on target devices, target regions, and target workloads.
Fourth, backchannel prediction remains weak. The model’s backchannel F1 is 50.8%, with precision at 40.4%. For many products, this means teams should initially use RelayS2S for response onset and interruption handling, while treating backchannels as optional or tightly constrained behavior.
Finally, the verifier’s selected threshold is a policy choice. The paper’s 0.50 threshold keeps the latency benefit for most good prefixes, but still commits many bad prefixes among the small bad-prefix subset. In high-stakes settings, the right answer may be a stricter threshold and more frequent fallback. Fast is charming until it says the wrong thing first.
The useful lesson is not “speak faster”; it is “own the handoff”
RelayS2S is valuable because it treats conversation as a timed sequence of commitments. The system does not need to know the full answer before it begins. It does need to know whether the first fragment is safe enough to become irreversible speech.
That is a subtle but important shift. Voice agents are not just reasoning systems with a microphone attached. They are turn-taking systems. A technically correct answer that arrives late can feel broken. A fast answer that cannot be interrupted can feel rude. A filler that buys time but says nothing can feel fake. RelayS2S tries to occupy the narrow middle ground: speak early, stay interruptible, and let the stronger model finish the thought.
The paper’s best contribution is not the 81 ms number by itself. Numbers travel badly when copied out of their measurement setup. The stronger contribution is the design pattern: use a duplex S2S model for timing-sensitive conversational control, use a cascaded LLM for semantic quality, and use a verifier to decide when the fast opening is safe enough to hand off.
For businesses building voice agents, that suggests a practical question to ask before buying or building yet another “real-time AI” layer: which part of your system is responsible for the first words, and who checks whether those words deserve to be spoken?
If the answer is “the large model, after everything else finishes,” the pause will remain. If the answer is “a filler phrase and vibes,” good luck. RelayS2S offers a more disciplined alternative: a timed relay between immediacy and intelligence.
The machine does not stop thinking. It simply stops making the user listen to it think.
Cognaptus: Automate the Present, Incubate the Future.
-
Long Mai, “RelayS2S: Dual-Path Speculative Generation for Real-Time Dialogue,” arXiv:2603.23346v1, March 24, 2026. https://arxiv.org/html/2603.23346. ↩︎