If you search "AI voice agents GitHub", you usually want one thing: repos that actually let you talk to a bot in real time. Not a chatbot demo that "could" become voice later.
This guide is a builder-first roundup. You will get practical picks, real-time pipeline basics, and fast starter paths for both web voice and phone-calling agents, built on open-source projects like Dograh AI, LiveKit, Pipecat, and Vocode.
What "AI voice agents GitHub" usually means (and what you will get here)
It means open-source code that supports streaming audio, not just text prompts. Key requirements include barge-in, turn-taking, and acceptable latency. For calling agents, add telephony streaming and DTMF (dual-tone multi-frequency) support.
This roundup focuses on repositories and frameworks that can realistically be built on today, along with criteria for evaluating whether a project is actively maintained and production-ready.
What this roundup covers (real-time voice assistants + calling agents)
You will get a curated list of GitHub projects across four buckets:
- Full platforms (UI + workflows) to build fast
- Real-time frameworks (build your own agent loop)
- Telephony/calling-focused projects (Twilio/Plivo/SIP patterns)
- Speech-to-speech and TTS components (for voice quality and speed)
The Dograh GitHub repo had a dedicated "Show HN" post on Hacker News, which signals real developer interest in building and testing voice agents quickly.
Use this selector to choose your starting point:
- If you need a UI + workflows + fast iteration + self-hosted + low latency, choose Dograh AI.
- If you need WebRTC-first real-time voice with tight infra integration, choose LiveKit Agents (see the LiveKit GitHub org).
- If you need high-control pipelines and transport flexibility, choose Pipecat.
- If you need telephony-style calling patterns with lots of provider switches, evaluate Vocode.
- If your main problem is voice quality or speech-to-speech, look at Chatterbox and NVIDIA Personaplex.
Myths that waste time (and what is true)
Most teams lose weeks because they assume voice is just chat + microphone. That assumption breaks as soon as you care about latency, interruptions, and audio transport.
Myth 1: You need a GPU to run any voice agent.
Many real-time stacks run fine on CPU if you use hosted STT/TTS and keep your local code focused on orchestration. GPUs help for local STT/TTS or heavy multimodal workloads, but they are not required for a working agent.
Myth 2: Any chatbot repo can become a real-time calling bot with no extra work.
Calling agents need streaming audio in/out, end-of-turn detection, barge-in, and telephony constraints (codec, sampling rate, jitter). A chat UI repo usually has none of that.
Myth 3: Telephony audio is the same as web audio so integration is always easy.
Telephony often means narrowband audio, different codecs, and strict streaming constraints. Sampling-rate mismatches and buffering issues show up immediately in STT accuracy and latency.
Before you choose: the voice agent pipeline and what "real-time" needs
Real-time voice is mostly about managing small delays across multiple moving parts. The repo you pick should make those moving parts visible and configurable.
Core pipeline: Audio in -> STT -> LLM -> TTS -> Audio out
A real-time voice agent loop typically looks like this (a minimal code sketch follows the list):
- Audio in : Microphone audio (web) or call audio (telephony) arrives as frames/packets.
- VAD + endpointing (turn detection) : You detect when the user is speaking and when they stop. This drives responsiveness.
- STT (Speech-to-Text) : Streaming transcription as partial and final results.
- LLM orchestration : Tool calls, retrieval, memory, guardrails, and response streaming.
- TTS (Text-to-Speech) : Streaming audio output, ideally starting quickly and continuing as text streams.
- Audio out : WebRTC/WebSocket playback for web, or telephony streaming back into the call.
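To make that concrete, here is a minimal sketch of the loop in Python. It is framework-agnostic: `stt`, `llm`, `tts`, and `speaker` are hypothetical streaming clients, not any specific repo's API. The point is where streaming happens, not the exact interfaces.

```python
# Minimal real-time turn loop (framework-agnostic sketch; client objects are hypothetical).
import asyncio

async def agent_turn(mic_frames, stt, llm, tts, speaker):
    # 1. Stream mic/call audio into STT until the endpointer signals end of turn.
    user_text = ""
    async for event in stt.stream(mic_frames):
        if event.is_final and event.end_of_turn:
            user_text = event.text
            break

    # 2. Stream LLM tokens and forward them to TTS as they arrive,
    #    so the first audio starts before the full reply is generated.
    async def token_stream():
        async for token in llm.stream(user_text):
            yield token

    # 3. Stream TTS audio chunks straight to the output transport.
    async for audio_chunk in tts.stream(token_stream()):
        await speaker.play(audio_chunk)
```

The detail that matters is step 2: tokens flow into TTS while the LLM is still generating, which is what keeps time-to-first-audio low.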
If you want natural interaction, you also need the following (a barge-in sketch follows this list):
- Barge-in: user interrupts while the bot is talking
- Interruption handling: stop TTS, cancel LLM, keep the conversation state consistent
- Echo and noise handling: especially in speakerphone use cases
- Jitter buffers: for WebRTC/telephony network variance
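Barge-in is mostly a cancellation problem. Here is a minimal sketch, assuming hypothetical `vad_events`, `speak_response`, and `flush_playback` hooks; real frameworks wire this differently, but the shape is the same.

```python
# Barge-in sketch: cancel the speaking task as soon as VAD reports new user speech.
import asyncio
import contextlib

async def handle_barge_in(vad_events, speak_response, flush_playback):
    # speak_response() streams LLM tokens into TTS and plays the audio.
    speaking_task = asyncio.create_task(speak_response())
    async for event in vad_events:                      # hypothetical VAD event stream
        if event == "speech_start" and not speaking_task.done():
            speaking_task.cancel()                      # stop LLM/TTS generation
            with contextlib.suppress(asyncio.CancelledError):
                await speaking_task
            await flush_playback()                      # drop queued audio so the bot goes quiet fast
            break
    # Record that the reply was interrupted so the next turn does not assume
    # the user heard all of it.
```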
Latency targets matter. A practical set of targets is:
- STT target: 50-250ms, with streaming buffers often 100-250ms to balance speed and accuracy.
- TTS target: 50-150ms to first audio, with streaming neural TTS often producing initial audio in ~100-250ms for a ~5s response.
- End-to-end ideal: <600-800ms, and <400ms feels great. The Simplismart latency guide cited later in this article puts TTFT (time-to-first-token) at ~200-300ms, with examples like Gemini ~280ms and GPT-4o ~250ms.
- UX note: humans notice gaps. >700ms can feel artificial, while <1s end-to-end is a good baseline.
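A quick way to use these numbers is as a per-turn budget. The values below are the targets from this section, not measurements; swap in your own.

```python
# Per-turn latency budget sanity check (milliseconds); values are illustrative.
budget_ms = {
    "endpointing_wait": 200,      # VAD hangover before the turn counts as done
    "stt_final": 150,             # streaming STT finalization
    "llm_ttft": 300,              # time to first token
    "tts_first_audio": 150,       # time to first synthesized audio chunk
    "network_and_playback": 100,
}
total = sum(budget_ms.values())
print(f"end-to-end estimate: {total} ms")   # 900 ms here: over the <600-800 ms target
if total > 800:
    print("over budget: tighten endpointing and TTFT first")
```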
Integrations checklist (STT, TTS, LLM, telephony, transports)
Use this checklist to judge any GitHub repo in 60 seconds; a minimal provider-config sketch follows it.
Speech-to-Text (STT)
- Open-source: Whisper variants
- Hosted: Deepgram, AssemblyAI, Google, Azure
- Key features: streaming partials, diarization (optional), language detection
LLMs
- Hosted: OpenAI, Anthropic, Google, Groq
- Self-hosted: local models (if you accept ops + GPU requirements)
- Key features: streaming output, function/tool calls, JSON mode, tool latency controls
Text-to-Speech (TTS)
- Hosted: ElevenLabs, Cartesia, PlayHT, Resemble
- Open-source/self-hosted: varies (quality and latency differ widely)
- Key features: streaming audio chunks, multiple voices, multilingual
Telephony
- Twilio, Plivo, SIP trunking
- Key features: real-time media streaming, call control webhooks, recording, DTMF events, transfers
Transports
- WebRTC (LiveKit, Daily, others)
- WebSockets (often simpler, sometimes higher latency or less robust under bad networks)
- gRPC (common in backend pipelines)
Business integrations
- Webhooks for your backend
- CRM integrations (HubSpot, Salesforce)
- Ticketing (Zendesk)
- Calendar and payments
- Observability: logs, metrics, tracing, conversation replay
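One practical way to keep that flexibility is to make every provider choice a config value. A minimal sketch, with environment variable names that are my own convention, not any specific repo's:

```python
# Bring-your-own-keys config sketch: provider choices and keys come from the
# environment so you can swap STT/LLM/TTS without touching code.
import os
from dataclasses import dataclass

@dataclass
class VoiceAgentConfig:
    stt_provider: str = os.getenv("STT_PROVIDER", "deepgram")
    stt_api_key: str = os.getenv("STT_API_KEY", "")
    llm_provider: str = os.getenv("LLM_PROVIDER", "openai")
    llm_api_key: str = os.getenv("LLM_API_KEY", "")
    tts_provider: str = os.getenv("TTS_PROVIDER", "elevenlabs")
    tts_api_key: str = os.getenv("TTS_API_KEY", "")
    telephony_provider: str = os.getenv("TELEPHONY_PROVIDER", "twilio")

    def validate(self) -> None:
        missing = [name for name, value in vars(self).items()
                   if name.endswith("_api_key") and not value]
        if missing:
            raise RuntimeError(f"missing credentials: {missing}")

config = VoiceAgentConfig()
config.validate()   # fail fast at startup instead of mid-call
```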
Cost also matters. A practical 2026 estimate for real-time voice agents is $0.05-$0.10 per minute total, often broken down as follows (a quick cost-arithmetic sketch follows the list):
- STT ~$0.006-$0.02/min
- LLM tokens ~$0.01-$0.05/min
- TTS ~$0.01-$0.04/min
- WebRTC egress ~$0.001-$0.005/min
- These are usage-based estimates aggregated from provider pricing and real deployments, with self-hosting sometimes cutting to <$0.08/min at high call volumes.
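If you want to sanity-check a budget, the arithmetic is simple. This sketch uses the midpoints of the ranges above and an assumed call volume:

```python
# Rough per-minute cost estimate from the ranges above (midpoints); adjust for
# your providers, prompt sizes, and call mix.
components_per_min = {
    "stt": (0.006 + 0.02) / 2,
    "llm": (0.01 + 0.05) / 2,
    "tts": (0.01 + 0.04) / 2,
    "webrtc_egress": (0.001 + 0.005) / 2,
}
per_minute = sum(components_per_min.values())
monthly_minutes = 50_000          # assumed call volume
print(f"~${per_minute:.3f}/min, ~${per_minute * monthly_minutes:,.0f}/month")
# -> ~$0.071/min, ~$3,550/month
```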
Production-ready signals on GitHub (what to check fast)
Stars help, but they do not prove reliability. Check these signals:
- CI/tests: GitHub Actions, test folders, badges
- Releases/tags: versioned releases, changelog
- Recent commits: not abandoned
- Issue response: maintainers reply, issues are triaged
- Security policy: SECURITY.md or guidance
- Contributing guide: CONTRIBUTING.md, code style, PR workflow
- Examples: an examples/ folder matters more than a long README
- Docker support: Dockerfile / compose for repeatable setups
- Clear docs: docs site or structured markdown docs
This is how you avoid weekend demos when you need production.
Glossary (key terms)
- Barge-in handling: The ability for the user to interrupt the agent mid-speech. A good system stops TTS quickly, cancels or redirects the LLM stream, and keeps context clean.
- Voice Activity Detection (VAD) tuning: Adjusting how your system detects speech vs silence. Tuning affects cut-offs, false endpoints, and perceived latency. Too aggressive feels jumpy. Too slow feels laggy. (A minimal tuning sketch follows this glossary.)
- Twilio Media Streams (real-time audio streaming): A pattern for streaming call audio in real time to your server and sending audio back. It is not the same as normal webhooks. It changes how you build low-latency STT/TTS for phone calls.
- DTMF fallback flows: "Press 1 for sales" keypad flows used as a reliability and compliance fallback. DTMF is simple, predictable, and useful when voice fails or when you need explicit consent steps.
- WebRTC: A real-time media protocol used for low-latency audio/video with NAT traversal, jitter buffers, and adaptive bitrate. WebRTC is often the difference between "demo works" and "works on bad Wi-Fi".
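Since VAD tuning comes up constantly, here is a minimal endpointing sketch. It treats a turn as finished only after a run of silent frames; the thresholds are illustrative starting points, not values taken from any framework.

```python
# Endpointing sketch: end the turn after `hangover_ms` of continuous silence.
def make_endpointer(hangover_ms: int = 400, frame_ms: int = 20):
    silence_frames_needed = hangover_ms // frame_ms
    state = {"silence_run": 0}

    def on_frame(is_speech: bool) -> bool:
        """Feed one VAD decision per audio frame; returns True when the turn ends."""
        if is_speech:
            state["silence_run"] = 0
            return False
        state["silence_run"] += 1
        return state["silence_run"] >= silence_frames_needed

    return on_frame

# Shorter hangover = snappier bot but more mid-sentence cut-offs;
# longer hangover = fewer cut-offs but the bot feels laggy.
endpoint_reached = make_endpointer(hangover_ms=400)
```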
Top frameworks overview: LiveKit Agents vs Pipecat (and where Dograh fits)
Most teams end up choosing between a code-first framework and a platform that reduces wiring so you can iterate faster.
Comparison table: Pipecat vs LiveKit Agents vs Dograh (best use cases)
A fast way to think about these three:
- Dograh: UI + workflows + self-hosting, the fastest path to a working calling agent you still own.
- LiveKit Agents: WebRTC-first real-time framework, best when low-latency web voice and infra integration matter most.
- Pipecat: code-first pipeline framework, best when you want maximum control over orchestration and transports.
There is also a meaningful difference in philosophy and cost signals between LiveKit Agents and Pipecat.
A published comparison frames it like this:
- LiveKit Agents: "Unified infrastructure (WebRTC + Agents)", "extremely low latency", "high scalability", and cost often $0.005-$0.01/min for their layer.
- Pipecat: "transport agnostic framework", "high-control orchestration", "low latency but tunable", and cost depends more on your infra choices.
Treat those numbers as directional guidance, not a full bill. Your real cost is still dominated by STT/LLM/TTS usage.
When to use a code-first framework vs a self-hosted platform UI
If you are shipping a real product, speed matters. Debuggability also matters.
Use a UI-driven platform (like Dograh) when:
- You want to go from idea to working call flow in hours
- Your agent needs decision-tree reliability, not only free-form chat
- Non-ML developers must edit flows safely
- You want built-in testing personas and repeatable evaluation
Dograh is built around that reality: drag-and-drop workflows, plain-English editing, multi-agent workflows to reduce hallucinations, and bring-your-own keys to avoid lock-in. The repo currently shows 157 stars and notable developer attention via a "Show HN" post.
Use code-first frameworks (Pipecat/LiveKit Agents/Vocode) when:
- You need custom audio handling and deep control over turn-taking
- You are integrating complex tools, RAG, or enterprise auth
- You need to embed voice into an existing product architecture
- Your team is comfortable operating real-time services
My view after building and testing these stacks: code-first gives the best control, but it also increases surface area. The first real bug is almost always audio framing, endpointing, or cancellation behavior.
Two first-person notes that match what I see in practice:
- "Having total control is really important to us, it's the benefit of OSS / having access to every line of code." (by Shayps)
- "Retell feels like the fastest way to ship, while LiveKit becomes compelling only if you need deeper control." (by Own_Professional6525)
Even if you disagree with parts of those takes, the pattern holds: speed vs control is the trade.
A practical ranking checklist (the one we actually use)
This is the checklist we use to rank voice-agent GitHub projects:
- Simplicity to get started (time to first conversation)
- UI simplicity (if any UI exists)
- Hardware/GPU requirements (CPU-first is easier)
- Extensibility/integrations (STT/TTS/LLM/telephony/tool calls)
- Quality of real-time outputs (barge-in, turn-taking, streaming TTS)
- Deployability (Docker/K8s readiness, config, secrets)
- License clarity (OSS license, commercial use allowed)
- Docs quality (examples, diagrams, troubleshooting)
- Safety controls (logging, redaction hooks, policy points, human handoff)
This checklist is more predictive than stars alone.
Hands-on GitHub picks: best AI voice agents and frameworks (grouped by use case)
These picks are oriented around "can I ship with it". Stars help you gauge ecosystem gravity, but your final decision should come from fit and maintenance signals.
Category A: Full voice agent platforms (self-hosted, faster to build)
You pick these when you want a working agent fast, plus a UI for iteration.
Dograh - open-source voice agent platform with workflow builder
One-line: A self-hostable, open-source platform to build inbound and outbound voice agents using a drag-and-drop workflow builder and plain-English edits.
- Best for: teams that want to ship fast and still own the code
- Stack: BYO STT/LLM/TTS keys; webhooks for your APIs
- Real-time: yes (designed for conversational voice)
- Telephony: yes (built for calling agents; provider-agnostic approach)
- License: open-source (see repo for current license details)
- Activity/popularity: 157 stars, plus developer interest via "Show HN" (Dograh repo)
- Platform features that matter in practice:
1. Multi-agent workflows (reduces hallucination by routing intent)
2. Built-in testing suite ("Looptalk") to stress test with AI personas (work in progress)
3. Multilingual support and multiple voices
4. Variable extraction from calls and follow-up actions
5. Bring-your-own-keys (no forced vendor lock-in)
3-step quick start
- Clone Dograh AI on GitHub.
- Set environment variables for your STT/LLM/TTS providers.
- Run the app, create a workflow, and place a test call or start a web voice session.
What I like here is the workflow-first approach. In real support and outbound use cases, deterministic routing beats a single prompt.
Category B: Real-time voice frameworks (build your own agent loop)
You pick these when you want code-level control and you can handle more wiring.
Pipecat - real-time voice + multimodal pipeline framework (Python)
One-line: A frame-based Python framework for building real-time voice and multimodal agents with pluggable transports and services.
- Best for: custom orchestration, multimodal inputs, deep control over the pipeline
- Stack: pluggable STT/TTS/LLMs; common choices include Deepgram/ElevenLabs/OpenAI
- Real-time: yes; designed for streaming and interruption management
- Telephony: possible via transport choices and integrations
- Popularity: 10,000+ GitHub stars (Pipecat)
- Why it is strong: pipelines are explicit. You can tune VAD, endpointing, cancellation, and tool calls.
3-step quick start
- Clone Pipecat.
- Create a .env with STT/LLM/TTS keys (per examples).
- Run an example pipeline and speak to it; then swap providers one by one.
When Pipecat feels hard, it is usually because real-time systems are hard. The framework makes the complexity visible, which is good for production.
LiveKit Agents - realtime voice agent framework in the LiveKit ecosystem
One-line: A realtime framework for building voice AI agents with WebRTC-first streaming and turn detection.
- Best for: low-latency web voice, scalable sessions, realtime media handling
- Stack: supports STT + LLM orchestration + TTS pipelines with VAD and turn detection
- Real-time: yes; WebRTC optimization is a major advantage
- Telephony: can be added with bridges; best when your core transport is WebRTC
- Popularity: ~9.2k stars (see the LiveKit GitHub org)
- Notable infra notes: SFU architecture reduces client CPU/bandwidth and handles ugly network conditions.
3-step quick start
- Start from the LiveKit GitHub org and find Agents examples.
- Configure your STT/LLM/TTS provider keys.
- Run a demo agent and connect from a web client.
If you want a guided walkthrough, this video is a decent practical path: Build Your First Voice AI Agent in 20 Minutes with LiveKit.
Vocode Core - programmable voice agents with provider integrations
One-line: A framework/SDK-style codebase to build voice agents with provider switching across STT/TTS/LLMs.
- Best for: calling-agent patterns, fast switching between vendors, inbound/outbound automation
- Stack: integrates with popular STT/TTS and multiple LLM options
- Real-time: yes, with turn-based and streaming abstractions
- Telephony: yes, commonly used for calling flows
- Popularity: ~3.7k GitHub stars (Vocode GitHub org)
3-step quick start
- Open Vocode on GitHub and pick the core repo.
- Set provider keys in env variables.
- Run the minimal agent example, then add telephony transport if needed.
Category C: Telephony and calling agents (Twilio/Plivo/SIP)
Calling agents add constraints that web voice does not have. You need streaming call audio, DTMF, and reliable fallbacks.
Even when your agent logic is solid, the call experience fails if:
- barge-in does not work
- sampling rate is wrong
- you do not handle long silences, hold music, and transfers
- you cannot hand off to a human safely
What features matter for the "best AI outbound calling bot" angle
- Answer detection (human vs voicemail)
- Compliance hooks (consent and recording notices)
- Rate limits and pacing (avoid call bursts and carrier flags; see the pacing sketch after this list)
- CRM handoff (webhooks to create leads, tickets, notes)
- Recording + summaries (with retention controls)
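Pacing is cheap to get right. This sketch caps concurrent calls and calls per minute; `place_call` is a hypothetical wrapper around your telephony provider's dial API.

```python
# Outbound pacing sketch: limit concurrency and dial rate so a campaign
# does not burst traffic at the carrier.
import asyncio
import time

async def run_campaign(numbers, place_call, max_concurrent=5, calls_per_minute=30):
    semaphore = asyncio.Semaphore(max_concurrent)
    min_gap = 60.0 / calls_per_minute
    last_dial = 0.0

    async def dial(number):
        async with semaphore:                  # cap simultaneous live calls
            await place_call(number)

    tasks = []
    for number in numbers:
        wait = min_gap - (time.monotonic() - last_dial)
        if wait > 0:
            await asyncio.sleep(wait)          # spread dials over time
        last_dial = time.monotonic()
        tasks.append(asyncio.create_task(dial(number)))
    await asyncio.gather(*tasks)
```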
What is answer detection in outbound calling bots?
Answer detection is the logic that decides what picked up your call: a real human, voicemail, or an IVR/menu.
In outbound calling, this matters because you do not want your bot talking to voicemail as if it is a person. It also affects compliance and user trust. The agent may need to change its script, leave a message, or end the call.
In practice, answer detection is a set of signals:
- early audio patterns (beep, greeting length)
- silence windows and timing
- DTMF/menu prompts
When you evaluate repos, check if they expose these hooks cleanly. If the repo hides telephony events behind one callback, you will struggle later.
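As a concrete illustration, here is what a first-pass heuristic can look like. The thresholds are assumptions made for this sketch, not any provider's actual logic; hosted answering-machine detection from your telephony provider will usually do better.

```python
# Answer-detection heuristic sketch based on the first seconds of call audio.
def classify_pickup(greeting_ms: int, heard_beep: bool, heard_dtmf_menu: bool) -> str:
    if heard_dtmf_menu:
        return "ivr"           # "Press 1 for..." style menu detected
    if heard_beep or greeting_ms > 4000:
        return "voicemail"     # long monologue followed by a beep is a strong signal
    if 300 <= greeting_ms <= 4000:
        return "human"         # short "Hello?" style greeting
    return "unknown"           # silence or noise: retry, escalate, or hang up

# The agent then branches: greet a human, leave or skip a voicemail message,
# navigate or abandon an IVR, and log the decision for compliance review.
```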
Recommended approach for telephony repos (practical, repo-agnostic)
Rather than listing dozens of small Twilio demo repos (many are unmaintained), use a framework plus a clear transport.
Good pairings are:
- Dograh (workflow + calling agent UX) + your telephony provider
- Pipecat (custom orchestration) + telephony media streaming
- Vocode (calling patterns) + telephony integration layer
3-step quick start (telephony pattern)
- Pick your transport: Twilio Media Streams, SIP, or a supported bridge.
- Ensure audio is normalized (codec + sampling rate) before STT; see the decode sketch after these steps.
- Implement DTMF fallback and human handoff from day one.
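For the normalization step, here is a minimal decode sketch. It assumes Twilio-style Media Streams frames (base64 mu-law at 8 kHz) and an STT engine that expects 16-bit PCM at 16 kHz; the naive upsample is fine for a sketch, but use a proper resampler in production.

```python
# Decode a telephony media frame: base64 mu-law 8 kHz -> 16-bit PCM 16 kHz.
import base64
import numpy as np

def ulaw_to_pcm16(ulaw: bytes) -> np.ndarray:
    u = ~np.frombuffer(ulaw, dtype=np.uint8)               # G.711 mu-law decode
    magnitude = (((u & 0x0F).astype(np.int32) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84
    return np.where(u & 0x80, -magnitude, magnitude).astype(np.int16)

def normalize_frame(payload_b64: str) -> bytes:
    pcm_8k = ulaw_to_pcm16(base64.b64decode(payload_b64))
    pcm_16k = np.repeat(pcm_8k, 2)    # crude 8k -> 16k; swap in a real resampler
    return pcm_16k.tobytes()
```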
This Reddit thread captures the common split between hosted platforms and open-source frameworks: "What's your current / best AI voice agents stack?".
Starter paths: copy-paste routes to your first real-time voice agent
These are straightforward paths. The goal is to get your first real conversation fast, then harden it.
Starter Path 1: Fastest local demo (talk to a bot in minutes)
Fast feedback matters more than architecture early on.
Recommended path
- Use Dograh if you want a UI and workflow edits fast.
- Use Pipecat if you want a minimal code pipeline and you will tune VAD/endpointing.
Steps
- Clone Dograh or Pipecat.
- Add your provider keys (STT/LLM/TTS) in .env.
- Run the default demo and speak.
- Turn on debug logs for VAD/endpointing decisions.
- Fix the first obvious issue: cut-offs or slow first response.
A simple latency goal for a "feels real" demo is: keep end-to-end below ~600-800ms, and aim for better as you tune buffers and TTFT.
Starter Path 2: Self-hosted production (Docker + env + scaling basics)
Self-hosting is mainly about repeatability, secrets, and visibility.
Steps
- Containerize your agent stack with Docker (or use the repo's Docker support).
- Store secrets properly (dotenv in dev, secret manager in prod).
- Add logs for: audio format, VAD events, STT timing, TTFT, TTS start time (see the metrics sketch after these steps).
- Add rate limiting and session limits.
- Store audio and transcripts with retention rules.
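For the logging step, one structured record per turn goes a long way. A minimal sketch, with field names that are my own rather than any repo's schema:

```python
# Per-turn timing record: capture monotonic timestamps at each stage, then log
# the derived latencies as one JSON line per turn.
import json
from dataclasses import dataclass

@dataclass
class TurnMetrics:
    t_user_stopped: float = 0.0      # endpointer decided the user finished
    t_stt_final: float = 0.0         # final transcript available
    t_llm_first_token: float = 0.0   # TTFT reference point
    t_tts_first_audio: float = 0.0   # first synthesized audio chunk played

    def log(self) -> None:
        record = {
            "stt_ms": round((self.t_stt_final - self.t_user_stopped) * 1000),
            "ttft_ms": round((self.t_llm_first_token - self.t_stt_final) * 1000),
            "tts_first_audio_ms": round((self.t_tts_first_audio - self.t_llm_first_token) * 1000),
            "end_to_end_ms": round((self.t_tts_first_audio - self.t_user_stopped) * 1000),
        }
        print(json.dumps(record))    # ship to your log pipeline instead of stdout
```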
Dograh being self-hostable and open source is a real advantage: you can inspect and control everything, and you are not blocked by a vendor's black box.
Also keep cost in mind when scaling. A realistic usage-based estimate for a typical hosted stack is $0.05-$0.25/min all-in, while self-hosting can reduce vendor costs but adds infra cost and ops time.
Starter Path 3: Telephony (Twilio/Plivo) with real-time streaming + barge-in
Telephony is where real-time gets serious.
Steps
- Use a telephony streaming method (for Twilio, this is commonly Media Streams).
- Normalize audio immediately (codec, sampling rate).
- Implement barge-in: stop TTS quickly when user speech resumes.
- Handle DTMF from day one (fallback menus, explicit consent, safe navigation).
- Add human handoff and failure fallbacks.
Common pitfalls:
- buffering too large (kills interactivity)
- wrong sampling rate (STT accuracy drops, audio sounds distorted)
- no cancellation logic (agent talks over the user)
- no retry logic for flaky streaming sessions
If you want to learn the WebRTC-first approach before telephony, start with LiveKit. This video walkthrough is a helpful baseline: Build Your First Voice AI Agent in 20 Minutes with LiveKit.
Common failure points in real-time voice systems
Most "it works locally" failures fall into predictable buckets:
- Echo and feedback loops
- Jitter and unstable network conditions
- Wrong sampling rate (audio distortion + STT errors)
- Buffering too large (kills barge-in and responsiveness)
- Token latency spikes (LLM TTFT variability)
- TTS delays (slow first audio makes it feel broken)
- Interruption bugs (agent continues speaking after user starts)
- Tool-call latency (agent pauses during API calls)
- Flaky WebSocket/WebRTC sessions (reconnect logic missing)
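The last item is cheap to fix and usually forgotten. A reconnect sketch with exponential backoff, using the `websockets` library and a hypothetical `handle_session` coroutine that runs your streaming loop until the connection drops:

```python
# Reconnect with exponential backoff for flaky streaming sessions.
import asyncio
import websockets

async def run_with_reconnect(url, handle_session, max_backoff=30):
    backoff = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1                    # reset after a successful connect
                await handle_session(ws)
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff)       # wait before retrying
            backoff = min(backoff * 2, max_backoff)
```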
Latency targets can be used as a sanity check: STT 50-250ms, TTS 50-150ms to first audio, and aim for <600-800ms end-to-end (Simplismart latency guide).
Compliance and safety basics for calling bots (what to plan for)
Calling bots touch sensitive data. Plan for safety early, even in prototypes.
Key basics:
- Consent prompts (especially if recording)
- Call recording notice and storage retention rules
- PII handling: redact where possible, limit transcript exposure (a redaction sketch follows this list)
- Human handoff: transfer to a person when confidence is low
- Auditability: logs and call summaries help, but store responsibly
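For the PII item, even a crude redaction pass before storage beats nothing. A minimal sketch; these regexes catch common formats only and are not a substitute for a real PII policy:

```python
# Transcript redaction sketch: mask obvious PII before storing or exporting.
import re

PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} redacted]", text)
    return text

print(redact("Call me at +1 415 555 0100 or mail jane@example.com"))
# -> "Call me at [phone redacted] or mail [email redacted]"
```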
When you evaluate repos, look for:
- webhook points for compliance prompts
- structured logging
- easy handoff and end-call controls
- configurable retention and export
Final recommendations (practical picks)
If you want a single starting point that is open source and fast to ship, start with Dograh. It is built for rapid voice workflows, self-hosting, and bring-your-own-keys. I would pick it when you want reliable call flows and fast iteration without rebuilding the whole stack from scratch.
If you want deep code control, choose:
- Pipecat for pipeline clarity and orchestration control (10k+ stars).
- LiveKit (Agents) for WebRTC-first, low-latency realtime architecture (~9.2k stars).
- Vocode when you want calling patterns and provider flexibility (~3.7k stars).
FAQs
1. What is a good AI voice assistant project on GitHub?
If you’re looking for an AI voice assistant GitHub project that supports real-time conversations, Dograh AI is a solid option. It’s a fully open-source, self-hosted platform for building real-time voice agents quickly, with a clean UI and a drag-and-drop workflow builder.
2. How do I test a voice AI agent?
A practical way to test a real-time voice agent is to simulate many realistic conversations and see where it fails: barge-ins, edge cases, silence, noise, and API errors. In Dograh AI, LoopTalk runs AI-to-AI stress tests with different caller personas to quickly expose weak prompts, broken flows, and missed handoffs, which is critical for reliable AI calling and AI cold-calling bots.
3. What are some open-source AI voice agents?
If you want open-source AI voice agents, top GitHub options include Dograh AI (full self-hostable platform), LiveKit Agents (WebRTC-first real-time audio), Pipecat (deeply customizable agent pipelines), and Vocode (telephony-focused calling flows).
4. How do I choose the best AI voice agent GitHub repo for a real-time calling bot?
To choose the right AI voice agents GitHub project, check for true end-to-end streaming (barge-in, low-latency TTS), fast demo setup, and clear production signals. Then match the tool to your goal: Dograh for platform speed, LiveKit Agents for WebRTC streaming, or Pipecat for deep pipeline control.
5. What is a real-time voice agent stack (STT, LLM, TTS), and why does streaming matter?
A real-time voice agent stack has three parts: STT, LLM, and TTS. The key difference between a demo and a product is streaming, which enables low latency, barge-in, and natural turn-taking.
6. Are there open-source AI voice agents for specific use cases like medical intake or recruiter screening?
An AI medical voice agent GitHub project needs structured intake, safe prompts, and clear escalation, while an AI recruiter voice agent GitHub focuses on branching questions, resume context, and consistent scoring. The fastest path is using an open-source platform like Dograh to design workflows and plug in your own STT/LLM/TTS.