Tavus

API-first conversational video AI for real-time face-to-face agents

Agent PlatformSupervised

Last reviewed 2026-06-20

Tavus is an API-first platform for building real-time conversational video AI: AI humans (digital replicas) that see, hear, and talk face to face on video. Its Conversational Video Interface (CVI) unifies speech, perception, dialogue, and real-time rendering behind a single API, so developers can drop a humanlike video agent into a product, website, or app with a few lines of code. It is built on Tavus's own models (Phoenix for rendering, Raven for perception, Sparrow for turn-taking). Tavus targets developers and product teams who want a video front-end for an LLM (onboarding, support, coaching, tutoring, sales, healthcare intake), plus enterprise teams via white-label deployments. It is bring-your-own-LLM and OpenAI-compatible, so the conversation logic and knowledge stay in the customer's stack while Tavus handles the real-time avatar, perception, and timing. A human still designs the agent, supplies the LLM and knowledge, and reviews behavior, so it is best described as a supervised agent rather than an autonomous one.

What it can do

Real-time conversational video agents (CVI)
Supervised
The Conversational Video Interface gives developers an out-of-the-box video agent that unifies speech, perception, dialogue, and rendering in one API, reportedly at roughly 600ms speech-to-video latency.
source
Real-time avatar rendering (Phoenix model)
Assistant
Tavus's Phoenix rendering model produces full-face animation with lip-sync and micro-expressions; the company states 1080p full-face rendering at 40+ FPS.
source
Visual and audio perception (Raven model)
Supervised
The Raven perception model analyzes facial expressions, tone, gaze, emotion, and ambient environment in real time, and can trigger tools from visual or audio events.
source
Turn-taking and interruption handling (Sparrow model)
Assistant
The Sparrow model handles natural pauses, interruptions, and conversational timing for smoother back-and-forth speech.
source
Custom digital replicas
Assistant
Builds a custom AI human (replica) from roughly two minutes of video, alongside 100+ stock replicas, in 30+ to 50+ languages per Tavus's materials.
source
Bring-your-own-LLM, function calling, RAG, and memory
Supervised
CVI is OpenAI-compatible and bring-your-own-LLM, with function calling, RAG knowledge bases, cross-session memory, and a data layer of transcripts, emotion timelines, and perception events.
source

Strengths

+API-first and bring-your-own-LLM, so the conversation logic and knowledge stay in your stack
+Low-latency real-time video (reportedly ~600ms speech-to-video) with perception and turn-taking, not just lip-sync
+Generous free tier and a clear usage-based ladder priced on conversational minutes

Limitations

−Minutes-based usage pricing can climb quickly for high-volume, always-on agents
−It supplies the video front-end, not the agent's reasoning, so you still build and own the LLM and knowledge
−Realistic talking-head agents raise consent and deepfake concerns that need policy guardrails

Overview

Tavus builds the real-time conversational video layer for AI: AI humans (digital replicas) that see, hear, and talk face to face on video. Its flagship product, the Conversational Video Interface (CVI), is API-first and unifies speech, perception, dialogue, and rendering behind a single API so developers can embed a humanlike video agent in a product or app. Tavus is a Y Combinator company headquartered in San Francisco.

What it does

CVI runs on Tavus's own models: Phoenix for real-time full-face rendering (Tavus states 1080p at 40+ FPS with lip-sync and micro-expressions), Raven for visual and audio perception (facial expressions, tone, gaze, emotion, ambient environment), and Sparrow for turn-taking, interruptions, and conversational timing. Tavus reports roughly 600ms speech-to-video latency. Developers bring their own LLM (the platform is OpenAI-compatible) and get function calling, RAG knowledge bases, cross-session memory, and a data layer of transcripts, emotion timelines, and perception events. Custom replicas can be trained from about two minutes of video, alongside 100+ stock replicas across many languages.

Integrations & setup

Tavus is shipped as a REST API with an OpenAI-compatible LLM interface and an @tavus/react-cvi npm package; Tavus markets a roughly ten-line integration. It deploys on the customer's own infrastructure (the docs reference Vercel, AWS, or custom) and offers white-label enterprise options. Tavus states SOC 2, HIPAA, and GDPR compliance.

Pricing

Freemium and usage-based on conversational minutes. Developer plans: Basic (free, 25 minutes, 25 stock replicas), Starter $59/mo (100 minutes, 3 custom replica trainings/mo), Growth $397/mo (1,250 minutes, 7 custom replica trainings/mo), and custom Enterprise with unlimited replicas and volume discounts. Paid plans add pay-as-you-go overages. Separate consumer-style PALs plans (Free, Plus $20/mo, Max $50/mo) cover personal voice and video calls.

Traction

Tavus raised an $18M Series A led by Scale Venture Partners (with Sequoia, Y Combinator, and HubSpot) in 2024, and in November 2025 announced a $40M Series B led by CRV, bringing total funding to roughly $64M per secondary sources. Tavus's own materials cite figures such as 2B agent interactions; those are vendor-reported.

Best for / not for

Best for developers and product teams that already have an LLM and want to add a real-time, perceptive video face to it (onboarding, support, coaching, tutoring, healthcare intake), and for enterprises that want a white-label conversational video layer. Less suited to teams that want a finished, scripted marketing-video tool (HeyGen, Synthesia) or that want the agent's reasoning and knowledge handled for them.

Alternatives

HeyGen and Synthesia are the closest avatar-video competitors but lean toward pre-scripted rendering; D-ID also offers conversational video agents. For voice-only real-time agents, ElevenLabs Agents and Bland AI are adjacent.

What people are saying

We aggregate real LinkedIn discussion into sentiment for the agents people search most. Tavus isn't tracked yet, want it added? Request tracking.

FAQ

What is Tavus's Conversational Video Interface (CVI)?+

CVI is Tavus's API-first product for real-time conversational video: it bundles speech, visual/audio perception, dialogue timing, and real-time avatar rendering behind one API so developers can add a face-to-face AI video agent with a few lines of code.

Is Tavus autonomous?+

No. Tavus provides the real-time video, perception, and turn-taking layer; the reasoning comes from a bring-your-own LLM, and a human designs the agent and supplies the knowledge. It is best described as a supervised agent.

How is Tavus different from HeyGen or Synthesia?+

HeyGen and Synthesia focus mostly on rendering pre-scripted avatar videos. Tavus focuses on real-time, two-way conversational video agents that perceive and respond live, exposed as a developer API.

Sources

Tavus CVI (official product page) · accessed 2026-06-20
Tavus (official site) · accessed 2026-06-20
Tavus pricing (official) · accessed 2026-06-20
Generative AI video startup Tavus raises $18M (TechCrunch) · accessed 2026-06-20
Tavus Raises $40M Series B (Business Wire) · accessed 2026-06-20
Tavus (Y Combinator profile) · accessed 2026-06-20

Last reviewed 2026-06-20

Alternatives & related

HeyGen

AI avatar video generator with translation and a video agent

Synthesia

AI video platform that turns scripts into avatar-presented videos

D-ID

AI avatar video generation plus real-time conversational Visual Agents

ElevenLabs Agents

Platform for building real-time voice and chat AI agents

Bland AI

Enterprise platform for AI phone agents that run full voice calls