How I built end-to-end realtime translation with LiveKit
I wanted a translation pipeline that feels like natural conversation, not a delayed transcription app. So I built an end-to-end realtime translation flow on top of LiveKit with a custom server-side agent.
The full path was:
- user speech stream in
- ASR + translation
- translated text converted back to audio
- translated audio streamed back live
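The path above can be sketched as four composable stages. This is a minimal illustration with stub functions standing in for the real services (Gummy ASR, translation, CosyVoice), not the actual agent code:

```python
# Minimal sketch of the end-to-end flow. Each stage is a stub here; in the
# real pipeline these are streaming calls inside a LiveKit server agent.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for streaming ASR (Gummy in the real pipeline)."""
    return "hello world"  # hypothetical recognized text

def translate(text: str, target_lang: str) -> str:
    """Stand-in for the translation step."""
    return {"hello world": "hola mundo"}.get(text, text)

def synthesize(text: str) -> bytes:
    """Stand-in for TTS (CosyVoice in the real pipeline)."""
    return text.encode("utf-8")  # placeholder audio payload

def translate_chunk(audio_chunk: bytes, target_lang: str = "es") -> bytes:
    """speech in -> ASR -> translation -> TTS -> audio out."""
    text = transcribe(audio_chunk)
    translated = translate(text, target_lang)
    return synthesize(translated)
```

In production each stage is streaming rather than call-and-return, but the composition order is the same.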
Why LiveKit as the base
LiveKit gave me the exact primitives I needed for low-latency audio:
- reliable room/session management
- real-time media ingress/egress
- strong control over tracks and server-side processing
The custom LiveKit server agent acts as the bridge between media transport and model pipelines.
Pipeline architecture
The agent ingests the audio stream and runs two main stages:
- Speech-to-text + translation
- Text-to-speech + audio return
1) Speech-to-text with Gummy ASR + translation
For the first stage, I used Gummy ASR to turn live speech into text, then translated it in the same pipeline. The key work here was segmentation and buffering:
- chunks too small -> unstable, flickering partial text
- chunks too large -> noticeably higher latency
I ended up using a balanced chunking strategy with partial hypotheses and confidence-aware updates so users can hear translation quickly without too many corrections.
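The balanced strategy can be sketched as a buffer that holds partial hypotheses until they are either confident enough or old enough to flush. The thresholds and the shape of the `feed` call are illustrative assumptions, not the real ASR interface:

```python
# Sketch of confidence-aware chunk emission: hold partials while they are
# short or uncertain, flush when confidence is high or latency runs out.

class ChunkBuffer:
    def __init__(self, min_ms: int = 400, max_ms: int = 1500,
                 min_confidence: float = 0.85):
        self.min_ms = min_ms              # below this: too unstable to emit
        self.max_ms = max_ms              # above this: emit even if unsure
        self.min_confidence = min_confidence
        self.buffered_ms = 0
        self.latest_text = ""

    def feed(self, partial_text: str, duration_ms: int, confidence: float):
        """Return text ready to translate, or None to keep buffering."""
        self.buffered_ms += duration_ms
        self.latest_text = partial_text
        if self.buffered_ms < self.min_ms:
            return None                   # chunk too small: wait for more
        if confidence >= self.min_confidence or self.buffered_ms >= self.max_ms:
            text = self.latest_text       # stable (or stale) enough: flush
            self.latest_text, self.buffered_ms = "", 0
            return text
        return None
```

The `max_ms` escape hatch is what keeps worst-case latency bounded even when the recognizer never gets confident.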
2) Text-to-speech with CosyVoice
After translation, I passed text into CosyVoice and piped the generated audio back into the LiveKit room as an outbound track.
The hardest part was not the model call itself but stream coordination:
- avoid overlap between source and translated voice
- keep timing smooth between sentence boundaries
- handle backpressure when network quality changes
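Those three constraints can be sketched as a small outbound scheduler: a bounded queue provides backpressure, and a speaker-activity flag prevents the translated voice from overlapping the source. This is an illustrative shape, not the actual track code:

```python
# Sketch of outbound audio coordination: bounded queue for backpressure,
# plus a gate so translated audio never plays over the source speaker.

from collections import deque

class OutboundScheduler:
    def __init__(self, max_pending: int = 8):
        self.queue = deque()
        self.max_pending = max_pending  # bound the queue instead of growing latency
        self.source_active = False      # true while the original speaker talks

    def enqueue(self, audio: bytes) -> None:
        """Queue translated audio; shed the oldest frame when we back up."""
        if len(self.queue) >= self.max_pending:
            self.queue.popleft()        # drop stale audio rather than fall behind
        self.queue.append(audio)

    def next_frame(self):
        """Emit the next frame only when it will not overlap the source voice."""
        if self.source_active or not self.queue:
            return None
        return self.queue.popleft()
```

Dropping the oldest frame on overflow is one policy; ducking the source track or speeding up playback are alternatives with different conversational feel.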
What mattered most in practice
Three lessons had the biggest impact:
- latency budget must be designed per stage, not just overall
- interruption behavior must be explicit (barge-in and restart rules)
- quality monitoring is essential (WER-like signals, translation drift, TTS clipping)
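The first lesson, budgeting per stage, is simple to make concrete. The numbers below are illustrative, not measured from my pipeline; the point is that each stage gets its own ceiling rather than one overall cap:

```python
# Sketch of a per-stage latency budget. Values are illustrative; the sum
# is the end-to-end target, but regressions are caught per stage.

BUDGET_MS = {
    "capture": 60,      # mic + network ingress
    "asr": 300,         # streaming recognition partials
    "translate": 150,   # MT step
    "tts": 250,         # synthesis time-to-first-byte
    "playback": 80,     # egress + jitter buffer
}

def over_budget(measured_ms: dict) -> list:
    """Return the stages that exceeded their individual budgets."""
    return [stage for stage, ms in measured_ms.items()
            if ms > BUDGET_MS.get(stage, 0)]
```

A single end-to-end number hides which stage regressed; per-stage ceilings make the slow link obvious.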
Realtime translation is not one model. It is a systems problem across transport, inference, and turn-taking.
What I would change in the next version
For the next iteration, I would add:
- adaptive chunk sizing based on speech tempo
- domain-specific terminology memory per room/session
- better speaker separation for multi-party meetings
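Adaptive chunk sizing, the first item, could look something like this: scale the chunk window inversely with speaking rate, clamped to sane bounds. The rate estimate and constants are assumptions for illustration:

```python
# Sketch of tempo-adaptive chunk sizing: fast speech gets shorter chunks
# (fresher partials), slow speech gets longer ones. Constants are illustrative.

def adaptive_chunk_ms(words_per_sec: float,
                      base_ms: int = 800,
                      min_ms: int = 300,
                      max_ms: int = 1500) -> int:
    """Pick a chunk window from an estimated speaking rate."""
    reference_rate = 2.5                          # "typical" speaking rate
    scaled = base_ms * reference_rate / max(words_per_sec, 0.1)
    return int(min(max(scaled, min_ms), max_ms))
```

The words-per-second estimate could come from the ASR partials themselves, so no extra model is needed.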
LiveKit made the realtime media side very workable. The rest is pipeline discipline: good ASR choices, careful translation logic, and audio orchestration that respects human conversation timing.