How I built end-to-end realtime translation with LiveKit
I wanted a translation pipeline that feels like natural conversation, not a delayed transcription app. So I built an end-to-end realtime translation flow on top of LiveKit with a custom server-side agent.
The full path was:
- user speech stream in
- ASR + translation
- translated text converted back to audio
- translated audio streamed back live
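The path above can be sketched as four composable stages. This is a minimal illustration with stub functions standing in for the real services (Gummy ASR, translation, CosyVoice), not the actual agent code:

```python
# Minimal sketch of the end-to-end flow. Each stage is a stub here; in the
# real pipeline these are streaming calls inside a LiveKit server agent.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for streaming ASR (Gummy in the real pipeline)."""
    return "hello world"  # hypothetical recognized text

def translate(text: str, target_lang: str) -> str:
    """Stand-in for the translation step."""
    return {"hello world": "hola mundo"}.get(text, text)

def synthesize(text: str) -> bytes:
    """Stand-in for TTS (CosyVoice in the real pipeline)."""
    return text.encode("utf-8")  # placeholder audio payload

def translate_chunk(audio_chunk: bytes, target_lang: str = "es") -> bytes:
    """speech in -> ASR -> translation -> TTS -> audio out."""
    text = transcribe(audio_chunk)
    translated = translate(text, target_lang)
    return synthesize(translated)
```

In production each stage is streaming rather than call-and-return, but the composition order is the same.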
Why LiveKit as the base
LiveKit gave me the exact primitives I needed for low-latency audio:
- reliable room/session management
- real-time media ingress/egress
- strong control over tracks and server-side processing
The custom LiveKit server agent acts as the bridge between media transport and model pipelines.
Pipeline architecture
The agent ingests the audio stream and runs two main stages:
- Speech-to-text + translation
- Text-to-speech + audio return
1) Speech-to-text with Gummy ASR + translation
For the first stage, I used Gummy ASR to turn live speech into text, then translated it in the same pipeline. The key work here was segmentation and buffering:
- chunks too small -> unstable, flickering partial text
- chunks too large -> noticeably higher latency
I ended up using a balanced chunking strategy with partial hypotheses and confidence-aware updates so users can hear translation quickly without too many corrections.
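The balanced strategy can be sketched as a buffer that holds partial hypotheses until they are either confident enough or old enough to flush. The thresholds and the shape of the `feed` call are illustrative assumptions, not the real ASR interface:

```python
# Sketch of confidence-aware chunk emission: hold partials while they are
# short or uncertain, flush when confidence is high or latency runs out.

class ChunkBuffer:
    def __init__(self, min_ms: int = 400, max_ms: int = 1500,
                 min_confidence: float = 0.85):
        self.min_ms = min_ms              # below this: too unstable to emit
        self.max_ms = max_ms              # above this: emit even if unsure
        self.min_confidence = min_confidence
        self.buffered_ms = 0
        self.latest_text = ""

    def feed(self, partial_text: str, duration_ms: int, confidence: float):
        """Return text ready to translate, or None to keep buffering."""
        self.buffered_ms += duration_ms
        self.latest_text = partial_text
        if self.buffered_ms < self.min_ms:
            return None                   # chunk too small: wait for more
        if confidence >= self.min_confidence or self.buffered_ms >= self.max_ms:
            text = self.latest_text       # stable (or stale) enough: flush
            self.latest_text, self.buffered_ms = "", 0
            return text
        return None
```

The `max_ms` escape hatch is what keeps worst-case latency bounded even when the recognizer never gets confident.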
2) Text-to-speech with CosyVoice
After translation, I passed text into CosyVoice and piped the generated audio back into the LiveKit room as an outbound track.
The hardest part was not the model call itself but stream coordination:
- avoid overlap between source and translated voice
- keep timing smooth between sentence boundaries
- handle backpressure when network quality changes
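Those three constraints can be sketched as a small outbound scheduler: a bounded queue provides backpressure, and a speaker-activity flag prevents the translated voice from overlapping the source. This is an illustrative shape, not the actual track code:

```python
# Sketch of outbound audio coordination: bounded queue for backpressure,
# plus a gate so translated audio never plays over the source speaker.

from collections import deque

class OutboundScheduler:
    def __init__(self, max_pending: int = 8):
        self.queue = deque()
        self.max_pending = max_pending  # bound the queue instead of growing latency
        self.source_active = False      # true while the original speaker talks

    def enqueue(self, audio: bytes) -> None:
        """Queue translated audio; shed the oldest frame when we back up."""
        if len(self.queue) >= self.max_pending:
            self.queue.popleft()        # drop stale audio rather than fall behind
        self.queue.append(audio)

    def next_frame(self):
        """Emit the next frame only when it will not overlap the source voice."""
        if self.source_active or not self.queue:
            return None
        return self.queue.popleft()
```

Dropping the oldest frame on overflow is one policy; ducking the source track or speeding up playback are alternatives with different conversational feel.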
What mattered most in practice
Three lessons had the biggest impact:
- latency budget must be designed per stage, not just overall
- interruption behavior must be explicit (barge-in and restart rules)
- quality monitoring is essential (WER-like signals, translation drift, TTS clipping)
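The first lesson, budgeting per stage, is simple to make concrete. The numbers below are illustrative, not measured from my pipeline; the point is that each stage gets its own ceiling rather than one overall cap:

```python
# Sketch of a per-stage latency budget. Values are illustrative; the sum
# is the end-to-end target, but regressions are caught per stage.

BUDGET_MS = {
    "capture": 60,      # mic + network ingress
    "asr": 300,         # streaming recognition partials
    "translate": 150,   # MT step
    "tts": 250,         # synthesis time-to-first-byte
    "playback": 80,     # egress + jitter buffer
}

def over_budget(measured_ms: dict) -> list:
    """Return the stages that exceeded their individual budgets."""
    return [stage for stage, ms in measured_ms.items()
            if ms > BUDGET_MS.get(stage, 0)]
```

A single end-to-end number hides which stage regressed; per-stage ceilings make the slow link obvious.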
Realtime translation is not one model. It is a systems problem across transport, inference, and turn-taking.
What I would change in the next version
For the next iteration, I would add:
- adaptive chunk sizing based on speech tempo
- domain-specific terminology memory per room/session
- better speaker separation for multi-party meetings
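Adaptive chunk sizing, the first item, could look something like this: scale the chunk window inversely with speaking rate, clamped to sane bounds. The rate estimate and constants are assumptions for illustration:

```python
# Sketch of tempo-adaptive chunk sizing: fast speech gets shorter chunks
# (fresher partials), slow speech gets longer ones. Constants are illustrative.

def adaptive_chunk_ms(words_per_sec: float,
                      base_ms: int = 800,
                      min_ms: int = 300,
                      max_ms: int = 1500) -> int:
    """Pick a chunk window from an estimated speaking rate."""
    reference_rate = 2.5                          # "typical" speaking rate
    scaled = base_ms * reference_rate / max(words_per_sec, 0.1)
    return int(min(max(scaled, min_ms), max_ms))
```

The words-per-second estimate could come from the ASR partials themselves, so no extra model is needed.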
LiveKit made the realtime media side very workable. The rest is pipeline discipline: good ASR choices, careful translation logic, and audio orchestration that respects human conversation timing.