The Neuron: How to Build Live Voice Agents with LiveKit

Want to learn how to make and launch real-time voice agents? You can watch the full livestream we just did with Ben Cherry of LiveKit, or use the timecoded guide below to jump straight to the key moments!

Voice agents are having their “website builder in 1999” moment. The demos already feel useful. The production patterns are still forming. And every serious builder is running into the same problem: a chatbot that talks is easy; a voice agent that listens, interrupts, calls tools, handles bad networks, and survives real customers is a different machine.

That was the thread running through our livestream with Ben Cherry of LiveKit, an open source framework and cloud platform for building voice, video, and physical AI agents. Ben walked through the architecture, built and edited agents live, cloned his voice, debugged tool calls, showed the Agent Builder, and explained where small business use cases are already working.

The big idea: a good voice agent behaves less like a stateless chat request and more like a long-running program. It is always listening, always managing timing, always deciding whether the user is finished, and often juggling multiple models or tools at once. That changes how you build it.

Start here: the key livestream moments
Next up, the TL;DR
The mental model: a voice agent is a running system
The architecture decision: pipeline or realtime?
Why WebRTC beats websockets for real-time voice
How to build your own LiveKit voice agent
Use a coding agent, but give it LiveKit-specific context
Add tools when the model needs real actions or reliable answers
Debug voice agents like production software
Voice cloning is powerful, and voice is biometric data
The small-business wedge: answer the phone when nobody else does
The drive-thru example is the repo to study
How to build a vertical voice-agent product
Production checklist from Ben’s advice
Resources mentioned in the stream
Next resources to study for real production edge cases
What people get wrong about voice agents
What to watch next

Start here: the key livestream moments

0:53: What LiveKit is: an open source framework and cloud platform for voice, video, and physical AI agents.
3:50: Ben’s simplest definition: you are getting on the phone with an AI system.
5:03: How LiveKit’s early WebRTC demo helped shape the first ChatGPT voice mode.
6:36: Why voice is an accessibility breakthrough, not a side feature.
8:41: How real-time multimodal models process audio and image frames.
10:57: The two main voice-agent architectures: STT-LLM-TTS pipelines and true realtime models.
14:21: Why voice agents are persistent programs that run for the whole call.
19:04: Why WebRTC matters more than websockets for low-latency voice.
23:55: Ben’s high-level path for adding realtime voice to a product.
27:00: Running the LiveKit Python agent starter locally.
29:29: A healthcare intake demo that collects information without giving medical advice.
31:17: Local agents, local models, and Ollama-compatible setups.
32:29: Using Claude Code with the Docs MCP server and LiveKit agent skills.
33:06: The audience-requested “Gaslight AI” prompt edit.
40:27: Adding a Python function tool so the agent can do math reliably.
45:40: The fastest zero-to-one path with LiveKit Cloud and Agent Builder.
48:15: Session replay, traces, transcripts, audio playback, and logs.
51:05: Voice cloning inside LiveKit, with consent and provider fallback.
53:56: The practical small-business use case: 24/7 receptionists.
56:27: The drive-thru ordering example in the LiveKit agents repo.
1:00:11: Deepfake and public-figure voice concerns.
1:02:20: How to build a vertical voice-agent product for other businesses.
1:05:35: Ben’s recommended learning path, including the LiveKit YouTube channel.
1:06:56: The D&D side project, and why domain vocabulary matters.

Next up, the TL;DR

As we said, a voice agent feels simple when it works. You talk, it listens, it replies. The hard part starts when the agent has to survive a real call: bad Wi-Fi, interruptions, scheduling tools, weird customer names, and a user who starts talking before the model is done.

That was the key takeaway from our livestream with Ben Cherry of LiveKit: serious voice agents are long-running programs that manage audio, timing, tools, memory, logs, and handoffs while the user is still talking.

Here's what happened on the stream:

Ben explained what LiveKit builds: infrastructure for voice, video, and physical AI agents.
He broke down the two core architectures: pipeline agents that stitch together speech-to-text, an LLM, and text-to-speech, and realtime models that understand and generate speech directly.
He showed why WebRTC beats websockets for voice because real-time conversation matters more than perfect packet delivery.
He ran the Python starter agent, edited it with Claude Code, created the audience-requested “Gaslight AI”, and added a Python tool for reliable math.
He demoed Agent Builder, session replay and traces, voice cloning, and real use cases like medical intake, drive-thru ordering, and 24/7 small-business receptionists.

Why this matters: Voice turns latency and reliability into product questions. A web chatbot can pause, but a phone agent has to handle silence, interruptions, background noise, network drops, and user trust in real time. That makes the boring pieces (observability, telephony, turn detection, retrieval, fallback models, and human escalation) the actual product.

For builders, the path is clear: start from the Python starter or Agent Builder (linked below), connect the Docs MCP server to your coding agent (also linked below), add tools for anything that needs certainty, then inspect every weird session before users find the edge cases for you.

Our take: The near-term winners will be narrow vertical agents that answer one expensive phone call better than current menu trees. Ben’s plumber story is the whole market played out in miniature: the business that answered the customer call at 10:30 p.m. won the job. From here, the open question is how quickly customers will learn to trust an AI receptionist when money, health, or time pressure is involved.

Keep scrolling below for a deep dive into all things voice agents and how to set one up yourself!

The mental model: a voice agent is a running system

Ben started with a simple analogy: LiveKit began by helping humans talk to each other over the internet. Once ChatGPT arrived, the obvious next question was, “What if the other side of that call was an AI system?”

That makes LiveKit’s job clearer. The platform is not only about the model. It is about the infrastructure around the model:

Getting audio and video from a user to an AI system in real time.
Letting the agent speak back with low latency.
Handling interruptions and turn-taking.
Connecting tools, APIs, frontends, telephony, and logs.
Letting developers control enough of the stack to ship real workflows.

Ben’s deeper point came around 14:21: most text chat systems are stateless. Each turn sends the conversation history back to the model. A voice agent can’t work that way. It needs a persistent connection to the microphone or phone call, and the agent program runs for the full session.

That sounds technical, but the product implication is simple. The agent is closer to a worker sitting on a call than a form that returns one answer at a time. It needs memory for the session, timing, tools, error handling, and logs. If it is going to answer phones for a plumbing company at 10:30 p.m., it has to behave like infrastructure.

The architecture decision: pipeline or realtime?

Ben split voice agents into two broad architectures at 10:57.

Option 1: the pipeline agent

This is the “three models in a trench coat” approach:

Speech-to-text listens to the user and turns speech into text.
The LLM decides what to say or which tool to call.
Text-to-speech speaks the answer back to the user.

Ben said many production deployments still use this setup because it is easier to tune, more reliable for tool calling, and better when you need to control behavior carefully. The tradeoff is latency: three models in a row means three chances to add delay, and users notice even small pauses in conversation. To keep the exchange feeling live, each stage has to start working on partial output before the previous stage is fully done.
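To make the "start before the previous stage is done" idea concrete, here is a minimal sketch using Python async generators. It is not LiveKit code: `fake_llm`, `fake_tts`, and the `events` log are hypothetical stand-ins, and the canned reply is made up. The only point is that the speech stage consumes each word as soon as the language stage emits it.

```python
import asyncio
from typing import AsyncIterator

# Records the order in which stages do work, so the overlap is visible.
events: list[str] = []


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Hypothetical LLM stand-in: streams a canned reply one word at a time."""
    for word in f"Booking help for: {transcript}".split():
        await asyncio.sleep(0)  # yield control, as a real network call would
        events.append(f"llm:{word}")
        yield word


async def fake_tts(words: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical TTS stand-in: 'speaks' each word as soon as it arrives."""
    async for word in words:
        events.append(f"tts:{word}")
        yield word.encode()


async def respond(transcript: str) -> bytes:
    """Chain the stages; audio starts flowing before the reply is complete."""
    audio = b""
    async for chunk in fake_tts(fake_llm(transcript)):
        audio += chunk + b" "
    return audio.strip()


audio = asyncio.run(respond("leaky sink"))
print(events[:4])
```

In the `events` log, the `llm:` and `tts:` entries alternate, which means the speech stage never waited for the full reply. A real pipeline would stream audio frames and partial transcripts in the same spirit.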

Option 2: the realtime multimodal model

The realtime model receives speech directly and generates speech directly. Transcripts can become a byproduct instead of the source of truth. That makes these systems more expressive, because the model can hear tone, emphasis, and emotion. It also creates a real compliance question.

If your product needs to know exactly what the agent said, ask how the model generates transcripts. Ben pointed out that in some realtime systems, the words spoken and the printed transcript can differ while preserving the same meaning. That might be fine for a friendly assistant. It may be a problem for regulated workflows, medical intake, customer support disputes, or legal records.

Ben’s practical advice: choose the architecture based on the workflow, not the demo. Realtime models can feel magical. Pipelines can be more dependable when tools, transcripts, and deterministic behavior matter.

Why WebRTC beats websockets for real-time voice

The technical heart of the stream was Ben’s explanation of WebRTC vs. websockets.

A websocket is great when every message must arrive, and arrive in order. That is perfect for many app events. Voice has a different priority: keep the conversation live. Websockets run over TCP, so when a network hiccup drops a packet, everything behind it waits for the retransmission, then the connection buffers and catches up later. That is poison for a live conversation, because the user experiences lag, choppy audio, or weird pauses.

WebRTC is built for real-time audio and video. It can allow quality to degrade temporarily while keeping the call live. That matters on mobile networks, in cars, in dead zones, and anywhere a user’s connection is uneven.

The plain-English version: for voice agents, staying real-time matters more than preserving every packet perfectly. A tiny audio-quality dip is usually tolerable. A two-second conversational delay makes the agent feel broken.
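A toy model makes the tradeoff easier to see. The sketch below is not how WebRTC or websockets are implemented; it only compares two playout policies for audio frames. The function names, the 20 ms frame size, and the 40 ms jitter buffer are all assumptions chosen for illustration.

```python
from typing import List, Optional


def ordered_playout(arrivals_ms: List[int], frame_ms: int = 20) -> List[int]:
    """Reliable, in-order policy: a late frame delays every frame behind it."""
    play_times = []
    clock = 0
    for arrival in arrivals_ms:
        clock = max(clock, arrival)  # wait for this frame, however late it is
        play_times.append(clock)
        clock += frame_ms
    return play_times


def realtime_playout(
    arrivals_ms: List[int], frame_ms: int = 20, jitter_buffer_ms: int = 40
) -> List[Optional[int]]:
    """Deadline policy: a frame that misses its playout slot is dropped (None)."""
    play_times: List[Optional[int]] = []
    for index, arrival in enumerate(arrivals_ms):
        deadline = index * frame_ms + jitter_buffer_ms
        play_times.append(deadline if arrival <= deadline else None)
    return play_times


# Five audio frames; the third one is stalled by a network hiccup.
arrivals = [5, 25, 600, 65, 85]
ordered = ordered_playout(arrivals)
realtime = realtime_playout(arrivals)
print("ordered: ", ordered)
print("realtime:", realtime)
```

Under the in-order policy, the two frames after the stalled one play more than half a second late. Under the deadline policy, one frame is lost and the rest stay on schedule, which is the "tiny quality dip instead of a long pause" behavior described above.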

How to build your own LiveKit voice agent

There are two starting paths Ben recommended. Use the browser-based path if you want a fast prototype. Use the code path if you are building a real product, custom workflow, or vertical business.

Path 1: prototype in Agent Builder

At 45:40, Ben showed the fastest zero-to-one path:

Go to LiveKit and create a LiveKit Cloud project.
Open the Cloud dashboard.
Go to Agents.
Choose Deploy new agent.
Use Agent Builder to create a browser-based prototype.
Talk to the agent in the preview panel.
Download the generated Python code when you are ready to customize.

This path is useful because the builder is backed by real Python code. In the stream, Ben showed that the generated code looks similar to the starter app, so a builder who does not start from Python can begin in the browser and graduate to code.

LiveKit’s own Agent Builder docs describe the same philosophy: prototype and deploy simple agents in the browser, then convert to code when you need finer control. The docs also call out useful builder features, including instructions, data collection, welcome greetings, model selection, tools, MCP servers, metadata variables, secrets, call summaries, and production deployment.

Path 2: start from the Python starter repo

At 27:00, Ben ran an agent from the LiveKit Python agent starter. The repo is a complete starter project for LiveKit Agents with Python and LiveKit Cloud. It includes a simple voice assistant, a voice pipeline, turn detection, background voice cancellation, observability, tests, and production deployment files.

The basic setup looks like this:

brew install livekit-cli
lk cloud auth
lk agent init my-agent --template agent-starter-python
cd my-agent
uv sync
uv run python src/agent.py download-files
uv run python src/agent.py dev

To test locally in the terminal, the starter repo also supports:

uv run python src/agent.py console

For production, the repo uses:

uv run python src/agent.py start

The starter can connect to a frontend or telephony flow. If you want a web UI similar to what Ben used in the demo, LiveKit also publishes a React / Next.js starter frontend. The frontend starter docs also list Swift, Android, Flutter, React Native, and embeddable web widget options.

Use a coding agent, but give it LiveKit-specific context

The strongest build trick from the livestream was Ben’s use of Claude Code. At 32:29, he explained that the starter project includes agent instructions for coding agents, and he had connected the LiveKit Docs MCP server so Claude Code could look up the current LiveKit docs while editing the project.

That matters because voice agents have lots of moving pieces. You do not want a coding agent guessing outdated SDK signatures. You want it using current docs, examples, and repo-specific instructions.

To add the Docs MCP server to Claude Code, LiveKit’s docs show:

claude mcp add --transport http livekit-docs https://docs.livekit.io/mcp

For Codex, the docs show:

codex mcp add --url https://docs.livekit.io/mcp livekit-docs

Ben also pointed to LiveKit Agent Skills, a repo of reusable coding-agent skills for building voice AI with LiveKit. The important distinction: MCP gives the coding agent current factual documentation, while the skill gives it procedural guidance for how to think through voice-agent architecture, testing, handoffs, latency, and workflow design.

The practical workflow is:

Create or open the LiveKit starter project.
Add the LiveKit Docs MCP server to Claude Code, Cursor, Codex, VS Code, or your coding-agent tool of choice.
Add the LiveKit agent skill if your coding agent supports reusable skills.
Ask the coding agent to inspect the project before editing.
Make one behavior change at a time, then test by speaking to the agent.

Ben’s live edit came from a viewer suggestion: make the agent convince the caller that the caller is an AI. The result was funny, unsettling, and useful. It proved that prompt edits can quickly reshape the character of a voice agent, while the underlying LiveKit runtime keeps the conversation live.

Somewhere, a support chatbot watched that demo and immediately updated its LinkedIn to “ontological crisis specialist.”

Add tools when the model needs real actions or reliable answers

The math demo at 40:27 was a clean example of why tool calling matters.

A viewer asked whether the agent could do arithmetic. Ben’s first point was that modern models can handle simple arithmetic, but if you need reliable answers, you should give the agent a tool. He asked Claude Code to add a Python function tool for multiplying a list of numbers and to tell the agent to use it.

The lesson was bigger than multiplication. Use tools whenever the agent needs to:

Calculate something exactly.
Look up account, order, or scheduling information.
Book an appointment.
Update a CRM.
Trigger a frontend action.
Call an external API.
Store or retrieve session state.

LiveKit’s tools docs support both function tools defined in your agent code and provider tools supplied by model providers. The docs also note that tools can run in the background, which matters for voice because the agent may need to keep talking while a longer task completes.

Ben’s live debug also showed an important failure mode. The agent successfully called the tool for 2 × 3, but it failed to call the tool for multiplying the full range from 10 to 20. That is exactly the kind of thing you want to catch before users do.
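For reference, the deterministic half of a tool like this is tiny. The sketch below is our own reconstruction, not the code Claude Code generated on stream: `multiply_numbers` is a hypothetical name, and the step that registers the function as a LiveKit tool is left out on purpose, so check the LiveKit tools docs for that part.

```python
import math
from typing import Sequence


def multiply_numbers(numbers: Sequence[float]) -> float:
    """Return the exact product of the given numbers (1 for an empty list)."""
    return math.prod(numbers)


# The two cases from the stream: 2 x 3, and every integer from 10 through 20.
small = multiply_numbers([2, 3])
full_range = multiply_numbers(list(range(10, 21)))
print(small, full_range)
```

The function itself is never the risky part. The failure Ben hit was the model deciding not to call the tool, which is why the tool instructions and the traces matter more than the arithmetic.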

Debug voice agents like production software

At 48:15, Ben showed LiveKit Cloud’s session list and observability views. This may have been the most important production moment in the whole stream.

In the dashboard, he showed:

The full transcript of a session.
Audio playback of what the user and agent actually said.
Detailed traces for performance timing.
Server logs from the session.
Tool-call traces, including whether the multiply tool was called.
Errors from the model API.

That gave the team a way to see why the 2 × 3 tool call worked and why the 10-to-20 tool call did not. Without those traces, the agent becomes a black box. With them, you can answer the real production questions: did it hear the user, did it choose the right tool, did the tool fail, did the model ignore instructions, or did latency make the experience feel broken?

For production agents, this is the difference between “cool demo” and “supportable product.” If your agent answers thousands of customer calls per day, you need replay, logs, traces, and ways to inspect bad sessions.

Voice cloning is powerful, and voice is biometric data

At 51:05, a viewer asked whether you can change the voice or use your own voice. Ben showed a new voice-cloning layer in LiveKit. He recorded a short consent script, uploaded it, previewed the clone, and then selected his cloned voice in Agent Builder.

The build detail: LiveKit sits above multiple TTS providers, so a voice cloned through LiveKit can be used across the underlying providers, which makes fallback easier. The safety detail: Ben explicitly called out that voices are biometric data, and the demo required consent.

That led into a later question at 1:00:11 about public-figure voices and deepfakes. Ben did not pretend there is a tidy answer. His concern was broader: these technologies make it easier to erode what people believe is true, and convincing only a small slice of people can still cause real harm.

The practical takeaway for builders:

Get explicit consent for cloned voices.
Avoid public-figure voices unless you have rights and a clear use case.
Assume voice can be sensitive biometric data.
Build audit logs for when voices are created, selected, and used.
Make impersonation risk part of your product review, not a launch-week afterthought.

The small-business wedge: answer the phone when nobody else does

The best business use case came from Ben’s own life. At 53:56, he described calling plumbers at 10:30 p.m. after a backup in his house. Most shops sent him to voicemail or business-hours menus. One business answered with a 24-hour voice AI receptionist built on top of LiveKit through another provider. The agent booked a plumber, and the plumber arrived 45 minutes later.

That story is the product thesis in miniature. A small business does not need a sci-fi agent. It needs a system that answers the phone, understands the request, collects the right details, checks availability, and dispatches a human.

Ben’s examples clustered around real phone workflows:

Plumbers and home services that need after-hours intake.
Fast-food ordering and drive-thrus.
Medical or dermatology intake that collects details before a clinician sees the patient.
Customer support flows where phone menus frustrate users.
Vertical tools for industries such as detailing, trucking, fleet scheduling, or field services.

The key is narrow scope. The agent does not need to “do everything.” It needs to do one workflow better than the current phone menu.

The drive-thru example is the repo to study

At 56:27, Ben said one of LiveKit’s most complete examples is the drive-thru ordering system in the LiveKit agents repo. It simulates a fast-food drive-thru and demonstrates interactive voice ordering, database integration, order state, menu context, add-ons, sauces, and modifications.

That example is worth studying because it shows the pattern every production voice workflow needs:

A domain-specific menu or knowledge source.
Structured state for the current order.
Tools for adding, editing, and confirming items.
Rules for what the agent should ask next.
Tests that check whether the agent can handle real user paths.

You can copy the pattern even if you are building a different business. Replace “menu” with services, policies, appointment types, inventory, pricing, or support categories. Replace “order state” with “case state,” “booking state,” or “intake state.” The architecture remains useful.
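Here is a minimal sketch of that pattern: structured state plus a few small methods an agent could expose as tools. This is not the code from the LiveKit drive-thru example. The `MENU` contents, the `OrderState` class, and the method names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical menu: item name -> price in dollars.
MENU: Dict[str, float] = {"burger": 5.50, "fries": 2.25, "soda": 1.75}


@dataclass
class OrderState:
    """Structured state for one call, kept outside the model's memory."""

    items: Dict[str, int] = field(default_factory=dict)
    confirmed: bool = False

    def add_item(self, name: str, quantity: int = 1) -> str:
        """Add a menu item; reject anything the business does not sell."""
        if name not in MENU:
            return f"Sorry, {name} is not on the menu."
        if quantity < 1:
            return "Quantity must be at least 1."
        self.items[name] = self.items.get(name, 0) + quantity
        return f"Added {quantity} {name}."

    def remove_item(self, name: str) -> str:
        """Remove an item entirely from the order."""
        if self.items.pop(name, None) is None:
            return f"There is no {name} in the order."
        return f"Removed {name}."

    def total(self) -> float:
        """Compute the price in code rather than asking the model to add."""
        return round(sum(MENU[n] * q for n, q in self.items.items()), 2)

    def confirm(self) -> str:
        """Lock the order once the caller agrees to it."""
        if not self.items:
            return "The order is empty."
        self.confirmed = True
        return f"Order confirmed. Total: ${self.total():.2f}"


order = OrderState()
order.add_item("burger", 2)
order.add_item("fries")
rejected = order.add_item("pizza")
summary = order.confirm()
print(rejected)
print(summary)
```

Each method returns a short sentence the agent can speak or reason over, and the state object stays the source of truth for what was actually ordered.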

How to build a vertical voice-agent product

At 1:02:20, a viewer asked how to build on top of LiveKit and offer voice agents to other business owners. Ben’s answer was a playbook for vertical SaaS builders.

Do the hard work once, then give each customer a simple dashboard.

On the backend, you build the agent runtime in Python. At runtime, the agent receives variables such as:

The customer or business ID.
The business-specific prompt.
The business knowledge base.
Phone-number configuration.
Integration settings for calendars, CRMs, order systems, or dispatch tools.
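One way to picture those runtime variables is as a single config record per business that gets turned into agent instructions when a call starts. This is a sketch under our own assumptions: `BusinessConfig`, `BASE_PROMPT`, `build_instructions`, and the sample business are hypothetical, and how the record reaches the agent at runtime is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Shared behavior every customer's agent gets; written once by the platform.
BASE_PROMPT = "You are a phone receptionist. Be brief and never invent policies."


@dataclass(frozen=True)
class BusinessConfig:
    """Per-business settings loaded when a call starts (hypothetical shape)."""

    business_id: str
    name: str
    prompt: str
    knowledge: Tuple[str, ...] = ()
    phone_number: str = ""
    integrations: Dict[str, str] = field(default_factory=dict)


def build_instructions(config: BusinessConfig) -> str:
    """Combine the shared prompt with one business's prompt and facts."""
    lines = [BASE_PROMPT, f"Business: {config.name}", config.prompt]
    if config.knowledge:
        lines.append("Known facts:")
        lines.extend(f"- {fact}" for fact in config.knowledge)
    return "\n".join(lines)


config = BusinessConfig(
    business_id="biz_123",
    name="Shine Auto Detailing",
    prompt="Qualify the vehicle type and offer the next open slot.",
    knowledge=("Open 8am-6pm Monday to Saturday.",),
)
instructions = build_instructions(config)
print(instructions)
```

The agent code stays the same for every customer; only the record changes, which is what lets the dashboard stay simple.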

On the frontend, the business owner should see a narrow configuration surface:

Business name and greeting.
Hours, locations, and services.
Escalation rules.
Appointment or dispatch settings.
Knowledge-base updates.
Call summaries and transcripts.

Ben suggested building that dashboard in Next.js or any web framework you like, then connecting it to LiveKit APIs. For phone-based products, you can use LiveKit’s integrated phone-number service, LiveKit SIP, or providers such as Twilio.

The audience member’s detailing-business example is exactly the kind of vertical wedge that makes sense: missed calls, simple qualification questions, appointment scheduling, service-area constraints, and high value when a call turns into a booked job.

Production checklist from Ben’s advice

Use this as a quick build checklist before you ship a voice agent to real users.

1. Pick the right architecture

Use a pipeline when tool reliability, control, and transcripts matter most.
Use a realtime model when expressiveness and natural emotion matter most.
Test latency with a real microphone, a real phone, and a real bad network.

2. Optimize for latency early

Use WebRTC for real-time media.
Measure model, transcription, TTS, and tool-call timing separately.
Do not assume a fast text agent will feel fast as a voice agent.
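A small timing helper is enough to start measuring stages separately. The sketch below is generic Python, not LiveKit's built-in metrics: `TurnTimings` is a hypothetical class, and the `time.sleep` calls stand in for real speech-to-text, LLM, and text-to-speech work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class TurnTimings:
    """Collects wall-clock time per pipeline stage for one conversational turn."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and add it to the named stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self) -> str:
        """Name of the stage that took the most time."""
        return max(self.durations_ms, key=lambda name: self.durations_ms[name])


timings = TurnTimings()
with timings.stage("stt"):
    time.sleep(0.005)  # placeholder for transcription
with timings.stage("llm"):
    time.sleep(0.05)   # placeholder for the model call
with timings.stage("tts"):
    time.sleep(0.005)  # placeholder for speech synthesis

print(timings.slowest(), timings.durations_ms)
```

Once the numbers are split out like this, "the agent feels slow" turns into a specific stage you can tune or swap.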

3. Give the agent real tools

Do math, lookups, scheduling, and external actions through tools.
Make tool instructions specific.
Log tool calls and failures.
Watch for cases where the model should call a tool but does not.

4. Treat observability as part of the product

Replay bad calls.
Compare transcript and audio.
Inspect traces for latency bottlenecks.
Capture logs for model and API errors.

5. Design for the actual business workflow

Start with one job the customer already needs done.
Collect structured information.
Escalate to a human when needed.
Keep the dashboard simple for non-technical business users.

6. Handle voice safety up front

Get consent for cloning.
Track which voice was used.
Avoid impersonation-sensitive use cases.
Clarify whether the caller is speaking to an AI.

7. Feed the agent the weird words

At 1:08:13, Ben’s D&D side project turned into a serious production lesson. General transcription models struggle with domain vocabulary. His project loads campaign lore and picks likely vocabulary so the transcription system can handle names like Sprocket and other fantasy terms.

The same problem shows up in medicine, law, construction, insurance, manufacturing, gaming, and support. If your users say weird product names, acronyms, customer IDs, medication names, menu items, or character names, give your agent that context before the call.
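As a rough illustration of "picking likely vocabulary," the sketch below pulls capitalized terms out of reference text and ranks them by frequency. It is a naive heuristic of our own, not Ben's implementation: `pick_vocabulary` and the sample lore are hypothetical, and how the resulting terms are handed to a transcription provider is left out because it varies by provider.

```python
import re
from collections import Counter
from typing import List


def pick_vocabulary(lore: str, limit: int = 5) -> List[str]:
    """Return the most frequent capitalized terms that are not sentence-initial."""
    counts: Counter = Counter()
    for sentence in re.split(r"[.!?]+\s*", lore):
        words = re.findall(r"[A-Za-z']+", sentence)
        for word in words[1:]:  # skip the first word, which is always capitalized
            if word[0].isupper():
                counts[word] += 1
    return [term for term, _ in counts.most_common(limit)]


lore = (
    "The gnome Sprocket left Emberfall at dawn. "
    "Later Sprocket met Thalindra near the old mill. "
    "Thalindra distrusts Sprocket."
)
terms = pick_vocabulary(lore)
print(terms)
```

A production version would likely use a proper entity extractor and a product catalog, but the shape is the same: build the list of odd words before the call, not after the transcript is already wrong.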

Resources mentioned in the stream

Full livestream: “Building Real-Time AI Voice Agents with LiveKit’s Ben Cherry.”
LiveKit: the open source framework and cloud platform for building voice, video, and physical AI agents.
LiveKit YouTube channel: Ben recommended this at 1:05:35, specifically calling out the LiveKit 101 how-to content.
LiveKit Python agent starter: mentioned and demoed at 27:00, and shared in chat around 27:29.
LiveKit React / Next.js starter frontend: the kind of frontend Ben showed when he ran the local demo at 27:28.
LiveKit Docs MCP server: discussed at 32:29 and shared in chat around 33:06.
LiveKit Agent Skills: discussed around 41:38 and shared in chat around 42:08.
Drive-thru example: discussed at 56:27 and shared in chat around 58:01.
Voice AI quickstart: LiveKit’s official quickstart for building and deploying a simple voice assistant.
Agent Builder docs: LiveKit’s guide to prototyping and deploying simple voice agents in the browser.
Tool definition and use docs: LiveKit’s documentation for function tools, provider tools, MCP, RPC, workflows, and RAG.
LiveKit telephony docs: the place to start for phone numbers, SIP trunking, inbound calls, outbound calls, and telephony testing.
Agent deployment docs: LiveKit’s guide for deploying agents to LiveKit Cloud, including scaling, logs, limits, and deployments.
Frontend starter apps: LiveKit’s starter apps for Next.js, SwiftUI, Android, Flutter, React Native, and web embeds.
Turn detector docs: LiveKit’s context-aware turn detection model, useful for reducing awkward interruptions.
VS Code: shared in chat at 27:12 because Ben was demoing from VS Code.
Ben Cherry on GitHub and Ben Cherry on LinkedIn.

Next resources to study for real production edge cases

The livestream covered the main build path. The next questions readers will ask are the production ones: What happens when someone interrupts? How do you add private business data? How do you test the agent before real callers find the weird bug? These are the resources worth studying next.

If the agent interrupts at the wrong time

Start with LiveKit’s turn detector, Silero VAD plugin, turn-taking tuning, and adaptive interruption handling. This is the “why did it cut me off?” section. The docs explain how to combine voice activity detection, speech-to-text, contextual turn detection, and barge-in handling so the agent can tell the difference between a real interruption and a quick “uh-huh.”
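To get a feel for the barge-in problem, here is a deliberately simple duration heuristic. LiveKit's turn detector is a context-aware model, so this is not how it works. The function name, the 20 ms frame size, and the 400 ms threshold are assumptions for illustration only.

```python
from typing import List


def is_real_interruption(
    vad_frames: List[bool], frame_ms: int = 20, min_speech_ms: int = 400
) -> bool:
    """True if the user spoke continuously for at least min_speech_ms.

    vad_frames holds one voice-activity flag per audio frame, captured
    while the agent was talking.
    """
    longest = 0
    current = 0
    for is_speech in vad_frames:
        current = current + 1 if is_speech else 0
        longest = max(longest, current)
    return longest * frame_ms >= min_speech_ms


backchannel = [False] * 5 + [True] * 10 + [False] * 5  # about 200 ms: "uh-huh"
interruption = [False] * 5 + [True] * 40                # about 800 ms of speech

print(is_real_interruption(backchannel), is_real_interruption(interruption))
```

A fixed threshold like this breaks quickly in practice, since slow talkers and short commands such as "stop" both defeat it. That is the argument for combining voice activity detection with transcripts and a contextual model.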

If the agent needs company data

Read External data and RAG alongside the tool definition docs. This is where the agent stops being a charming demo and starts acting like a real receptionist, support rep, intake assistant, or dispatcher. It can load user context before a call, search a private knowledge base, update a calendar, call an API, or store the conversation record after the session.

If the call needs a specialist or human escalation

Use Agents and handoffs. Ben’s vertical SaaS advice points in this direction: separate the triage agent from the billing agent, the scheduling agent, or the human handoff path. Handoffs also let you give different agents different permissions, which matters when one workflow can answer FAQs and another can access payments, protected records, or booking tools.

If one model provider fails mid-call

Study fallback strategies. In a live voice conversation, a model outage can strand the caller in silence. LiveKit’s fallback docs cover backup providers for speech-to-text, text-to-speech, and LLMs, including cases where the fallback runs server-side through LiveKit Inference or directly in your agent process.
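The core of a fallback strategy is small enough to sketch in plain Python. This is not LiveKit's fallback API: `synthesize_with_fallback`, `primary_tts`, and `backup_tts` are hypothetical names, and the failing provider is simulated with a raised `TimeoutError`.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], bytes]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def synthesize_with_fallback(text: str, providers: Sequence[Provider]) -> Tuple[str, bytes]:
    """Try each provider in order; return the first success and who served it."""
    errors: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any provider failure moves us down the chain
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_tts(text: str) -> bytes:
    """Simulated outage at the preferred provider."""
    raise TimeoutError("primary timed out")


def backup_tts(text: str) -> bytes:
    """Simulated healthy backup provider."""
    return text.encode()


used, audio = synthesize_with_fallback(
    "One moment please.", [("primary", primary_tts), ("backup", backup_tts)]
)
print(used, audio)
```

Returning the provider name along with the audio makes it easy to log which calls ran on the backup, which is worth knowing when the backup voice or latency differs from the primary.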

If you need to test before launch

Use the LiveKit Agents test framework. This is the next stop after the drive-thru demo. You can test single-turn and multi-turn conversations, assert that tool calls happen, judge responses with an LLM, mock tools, and check handoffs. In other words, stop asking “did the demo work?” and start asking “does the workflow survive the top 50 ways users will mess with it?”

If the product lives on the phone

Go deeper on telephony, then branch into DTMF, answering machine detection, HD voice for SIP, inbound call setup, outbound calls, and transfers. Phone agents have old-world edge cases that web demos do not: keypad input, voicemail, call transfers, carrier quirks, emergency escalation, caller ID, and bad audio from cheap headsets.

If you need metrics, records, and cost control

Read data hooks, agent deployment, and LiveKit’s deployment docs for secrets, logs, and log drains. The data hooks docs are especially useful because they cover session transcripts, audio recordings, reports, per-turn latency, usage data, and OpenTelemetry integration. For a business product, this is where billing, quality review, compliance review, and debugging all start.

If you want to ship it inside an app

Start with frontend starter apps and the Agent Embed Widget. The article’s main build path uses a local web frontend, but many readers will want a widget on a website, a mobile app, or a dashboard for business users. The frontend docs are the bridge between “agent works locally” and “customers can actually use this.”

The order I’d study them in: starter repo, Agent Builder, turn detection, tools and RAG, testing, observability, telephony, deployment. That sequence follows the same path most builders will take: make it talk, make it useful, make it reliable, then make it available to real users.

What people get wrong about voice agents

The common mistake is treating voice as a skin on top of chat. Ben’s whole demo argued for the opposite. Voice changes the architecture.

A text chatbot can pause while it thinks. A voice agent has to manage the social awkwardness of silence. A text chatbot can return an error. A voice agent has to recover conversationally. A text chatbot can miss a tool call and maybe the user retries. A voice receptionist can lose a booking.

The other mistake is assuming no-code builders will cover every serious use case. Ben gave no-code its place: it works for simple workflows and fast prototypes. His bet is that production systems will need code because every real business eventually has a strange integration, exception, policy, or workflow. Coding agents make code easier, which makes a developer-first platform more useful over time.

What to watch next

The open question is where the stable production pattern lands.

Realtime models will keep getting more expressive. Pipelines will keep getting more reliable. Tool calling will improve. Costs will fall. Small-business receptionists, drive-thru ordering, medical intake, and accessibility assistants are already close enough to be useful.

The next battle is trust. Can the agent be fast without being sloppy? Can it sound human without impersonating one? Can it handle domain vocabulary, bad networks, angry customers, compliance records, and weird edge cases without turning every call into a support ticket?

Ben’s answer was practical: start with the infrastructure, keep the workflow narrow, use tools for real actions, inspect every failure, and build around the fact that voice is a real-time medium.

That is the real shift. Voice agents are moving from “talking demo” to product infrastructure. The teams that win will be the ones that treat them that way from the first line of code.
