guide·May 27, 2026·Rabbitpair

How to Auto-Generate Live Captions for Any Online Video – DualPiP + Deepgram Setup Guide

#ASR#live captions#speech recognition#Deepgram#Chrome extension#picture-in-picture#language learning#auto subtitles

How to auto-generate subtitles for online videos that have no captions?

Many online videos lack subtitles entirely, or only have low-quality auto-generated captions with poor punctuation and inaccurate word boundaries. DualPiP 1.7.0 introduces ASR (Automatic Speech Recognition) live captions that generate high-accuracy AI subtitles for any web video in real time, displayed directly inside the picture-in-picture window and fully integrated with learning mode and AI translation.

DualPiP ASR captures the video's audio stream from the browser, sends it to a speech recognition service like Deepgram for real-time transcription, and overlays timestamped captions on the video. If a video has native subtitles but poor quality, you can also use DualPiP's subtitle search to find better subtitle files from OpenSubtitles. For videos with no subtitle source at all, ASR live captions are the best solution.

If you prefer watching videos on the original page instead of PiP mode, or need ASR live captions on mobile browsers, CaptionGo offers the same ASR engine with in-page subtitle overlay and mobile browser support (Chrome, Edge, Firefox, Android).

What's the difference between DualPiP ASR and Chrome's built-in Live Caption?

Chrome has a built-in Live Caption feature available under Settings → Accessibility. However, Chrome's native live captions have significant limitations for video learning — most critically, captions disappear when you enter picture-in-picture mode, which makes it unusable for multitasking while watching videos.

Feature	Chrome Built-in Live Caption	DualPiP ASR Live Captions
Picture-in-picture	Captions disappear in PiP	Full captions inside PiP window
Accuracy	Basic, poor sentence boundaries	Deepgram nova-3 with smart punctuation
Bilingual translation	Separate translation feature, disconnected	Integrated with DualPiP's 12 translation engines, AI LLM translation recommended
Learning mode	Not supported	Subtitle panel, AB loop repeat
Caption styling	Fixed appearance	Font size, color, position, background fully customizable
Caption position	Browser bottom bubble, covers page content	Overlaid on video, follows playback window
Language support	~20 languages	22 languages + multilingual auto-detection
Recognition modes	Real-time streaming only	Real-time WebSocket + pre-download batch

The key advantage of DualPiP ASR is captions that persist in picture-in-picture mode. When you pop out a video into a floating window, Chrome's built-in captions vanish — DualPiP's ASR captions stay with the PiP window, ideal for watching videos while working.

Which speech recognition services does DualPiP ASR support?

DualPiP ASR uses a BYOK (Bring Your Own Key) architecture — you configure your own speech recognition service, and requests go directly from your browser to the provider without any intermediary server. Two types of backends are supported:

Cloud ASR: Deepgram

Deepgram is the pre-configured cloud ASR provider in DualPiP, using the nova-3 model for speech recognition. Nova-3 is one of the most accurate real-time speech recognition models available today:

Real-time WebSocket streaming: Sub-300ms latency, captions appear almost instantly
Smart punctuation and sentence boundaries: Automatic punctuation with accurate sentence segmentation
22 languages supported: English, Chinese, Japanese, Korean, French, German, Spanish, and more
Multilingual auto-detection: Deepgram's unique multi mode detects and switches languages automatically
Low cost: $0.007/minute (nova-3), approximately $0.84 for a 2-hour movie

Unlike extensions like Immersive Translate that offer built-in ASR with monthly limits (~50 videos/month), DualPiP's BYOK model has no video count limits — you pay only for what you use, with transparent API billing. For a full comparison of DualPiP with other subtitle extensions, see Best Chrome Bilingual Subtitle Extensions 2026.

Local ASR: Whisper

DualPiP also supports locally deployed OpenAI-compatible Whisper servers, processing audio entirely on your machine — ideal for privacy-conscious users or restricted network environments:

Local Solution	Description
Speaches	High-performance Whisper API server with GPU acceleration
whisper.cpp	Lightweight C++ implementation, runs on CPU
hwdsl2/whisper-server	One-command Docker deployment
Any OpenAI-compatible server	Any service with `/v1/audio/transcriptions` endpoint

Local backends use HTTP batch recognition mode, with DualPiP sending audio segments (default 5 seconds) for transcription — completely free and works offline.

How to get Deepgram's free $200 credit and API key

No credit card required. Deepgram offers $200 in free credits to new users — no payment method needed during signup. At nova-3's rate of $0.007/minute, $200 covers approximately 476 hours of audio — enough for about 238 two-hour movies.

Visit deepgram.com and click Sign Up Free
Register with your Google account or email (no credit card information required)
After logging in, you'll enter the Console dashboard with a default project already created
Navigate to Settings → API Keys in the left sidebar
Click Create a New API Key
Enter a name (e.g., "DualPiP"), set permissions to Member, and click Create Key
Copy and save the API key immediately — it cannot be viewed again after closing the page

Detail	Info
Free credit	$200 (on signup)
Credit card required	No
Credit expiration	Never expires
After credits used	Pay As You Go
Nova-3 pricing	$0.007/minute
$200 covers	~476 hours (~238 movies)

How to set up ASR live captions in DualPiP

Setup takes two steps: add an ASR provider in extension settings, then enable live captions in the picture-in-picture window.

Step 1: Add an ASR provider

Open DualPiP's Settings page (click the extension icon → gear icon)
Go to the ASR Settings tab
Click Add Provider
Select a preset: Deepgram (cloud) or Custom Local Backend (local)
Enter your Deepgram API key (obtained in the previous section) or local Whisper server address
Choose the default recognition language (Multilingual auto-detection recommended) and model
Click Save

Step 2: Enable live captions in the PiP window

Open DualPiP picture-in-picture mode on any video site (shortcut Ctrl+Shift+E)
Click the ASR button (microphone icon) in the PiP control bar
Toggle Live Caption on
Real-time captions start appearing immediately above the video

You can also use the shortcut Shift+A to toggle ASR inside the PiP window, or configure a global shortcut for "Toggle Live Captions" in Chrome's extension shortcuts settings (chrome://extensions/shortcuts).

What's the difference between streaming and pre-download modes?

DualPiP ASR offers two audio capture and recognition modes for different viewing scenarios:

Real-time streaming mode (WebSocket)

Audio streams to Deepgram via WebSocket in real time, with sub-300ms caption latency. Deepgram's Interim Results feature provides preliminary transcription before the final result, making captions appear even faster. Best for live streams, video calls, and real-time content.

Pre-download batch mode (HTTP)

DualPiP pre-downloads the video audio and splits it into segments, then sends them via HTTP for batch transcription. Best for published video content — complete captions are generated before playback begins, with zero delay during viewing. Pre-download mode supports both Deepgram cloud and local Whisper backends.

Comparison	Real-time Streaming	Pre-download Batch
Latency	< 300ms	Zero after pre-download completes
Best for	Live streams, real-time content	Published videos, complete coverage
Backends	Deepgram (WebSocket)	Deepgram + local Whisper
Coverage	Real-time, occasional gaps	Complete audio coverage

DualPiP defaults to Auto mode: it tries WebSocket streaming first, and falls back to pre-download batch if the provider doesn't support streaming.

How to use ASR captions with learning mode for language study

DualPiP ASR generates timestamped captions fully compatible with learning mode, turning any video — even those without native subtitles — into a language learning resource:

Subtitle list panel: Each ASR-recognized sentence appears chronologically in the right-side learning panel, clickable for navigation
AB loop repeat: Select any ASR caption for repeated playback and focused listening practice
Auto-pause: Automatic pause after each caption for shadowing and repetition
Bilingual display: ASR captions pair with AI translation engines for side-by-side original + translated subtitles

This means even a video with zero native subtitles can become a full-featured sentence-by-sentence learning tool once ASR generates captions.

ASR and traditional subtitles are mutually exclusive in DualPiP: enabling ASR automatically disables traditional subtitles, and vice versa. If a video has high-quality native subtitles, use those first — or find subtitle files via subtitle search. ASR is ideal for videos with no subtitles or poor-quality auto-generated captions.

How to combine ASR with AI translation for real-time bilingual captions

DualPiP's ASR and AI translation work together to generate real-time bilingual subtitles for any video in any language — solving a scenario traditional subtitles can't cover: the video has no native subtitles, but you need bilingual captions for language learning.

ASR + AI translation workflow

ASR recognizes the original language: Deepgram transcribes the video's audio into source-language text captions
AI LLM translates in real time: DualPiP's AI translation engine translates the ASR captions into your target language
Bilingual captions displayed together: Original and translated text appear as bilingual subtitles overlaid on the video

Why AI LLM translation is strongly recommended for ASR captions

ASR-generated captions differ from traditional subtitle files — they're real-time speech transcriptions that may have incomplete sentence boundaries, colloquial expressions, and proper nouns without context. AI LLM translation (DeepSeek, GPT, Claude) significantly outperforms traditional machine translation (Google, Microsoft) when translating ASR captions:

Aspect	Traditional MT (Google/Microsoft)	AI LLM Translation (DeepSeek/GPT/Claude)
Context awareness	Translates sentence by sentence, no context	DualPiP sends recent N captions as conversation history
Colloquial speech	Literal translation, awkward phrasing	Understands conversational context, natural output
Incomplete sentences	Breaks when ASR sentence boundaries are imperfect	Completes meaning using context, translates correctly
Proper nouns	Frequently mistranslates names and terms	Enhanced by DualPiP's movie info integration
Tone preservation	Mechanical, flat output	Preserves speaker tone and expression style

DualPiP's AI translation engine uses a sliding window context mechanism: each time it translates an ASR caption, it includes previously translated captions as conversation history, ensuring consistent and coherent translation. This is especially important for ASR scenarios — because speech recognition segmentation differs from traditional subtitles, the AI needs prior context to correctly understand the current sentence.

Use cases

Scenario	Description
Learning a language from unsubtitled videos	ASR recognizes the original + AI translates to your native language
Watching live streams	No pre-made subtitles — ASR generates + AI translates in real time
Academic lectures and courses	Some courses lack subtitles — ASR + AI generates translations
Podcasts and interviews	Audio-only content visualized as text via ASR, then translated

DualPiP supports 30+ AI translation providers (DeepSeek, GPT, Claude, Gemini, etc.). For ASR captions, we recommend DeepSeek V4 Flash (best value, ~$0.03–0.07/movie) or Groq Llama (free tier, fastest response). See the complete AI translation setup guide for details.

Which video websites work with DualPiP ASR?

DualPiP ASR uses the browser's Audio Capture API to capture audio, so it can theoretically generate captions for any video playing in Chrome. Verified platforms include:

Platform Type	Supported Sites
Video platforms	YouTube, Netflix, Disney+, Bilibili, Crunchyroll, HiAnime
Learning platforms	Coursera, Udemy, TED, edX, Khan Academy
Live streaming	Twitch, YouTube Live
Meeting tools	Zoom (web), Google Meet
Other	Any site using HTML5 `<video>` element

For videos without native subtitles (live streams, niche platforms, user uploads), ASR is the only way to get captions. Combined with DualPiP's AI LLM translation, you can generate real-time bilingual subtitles for any online video in any language. Since ASR captions are real-time speech transcriptions, AI LLM translation is strongly recommended over traditional machine translation for accurate, context-aware results.

Frequently asked questions

Q: How accurate is ASR real-time speech recognition? Deepgram nova-3 achieves a word error rate (WER) below 8% for English, making it one of the most accurate real-time speech recognition models in 2026. Chinese, Japanese, and other languages also perform well. Accuracy depends on audio quality, speaker accent, and background noise.

Q: What happens after the $200 free credit is used up? It switches to pay-as-you-go billing. Nova-3 costs $0.007/minute — about $0.84 for a 2-hour movie. You can also switch to a local Whisper backend for completely free transcription (requires a local GPU).

Q: Can ASR captions and traditional subtitles be displayed simultaneously? No. DualPiP treats them as mutually exclusive — enabling ASR automatically disables traditional subtitles, and vice versa. If a video has quality native subtitles or you can search for subtitle files, use those first.

Q: Is ASR a free feature or does it require DualPiP Pro? ASR live captions are a Pro feature. DualPiP's YouTube bilingual subtitles and basic PiP player are free. ASR, AI translation, and full learning mode require a Pro subscription.

Q: What hardware do I need for a local Whisper backend? An NVIDIA GPU with 6GB+ VRAM is recommended for smooth real-time recognition. CPU inference works but is slower. Deploying hwdsl2/whisper-server via Docker is the simplest approach — one command to start.

Q: Can it recognize mixed languages in a single video? Deepgram's Multilingual mode detects and switches languages automatically within the same audio stream, ideal for multilingual interviews, podcasts, and educational content. Local Whisper also supports language detection but with lower switching accuracy.

Get started with DualPiP ASR live captions

Four steps to generate AI live captions for any online video:

Install DualPiP: Chrome Web Store | Edge Add-ons
Sign up for Deepgram's free $200 credit: deepgram.com (no credit card needed)
Add a Deepgram provider in DualPiP settings and enter your API key
Open any video in PiP mode and click the ASR button to enable live captions

Whether it's an unsubtitled live stream, a niche platform video, or foreign language content for study, DualPiP ASR generates real-time AI captions. Combine with AI bilingual translation for dual-language subtitles on any video, or use learning mode with AB loop repeat and sentence panels to turn every video into effective language learning material.

Watching on mobile or prefer in-page subtitles? Try CaptionGo — same ASR live caption engine, designed for in-page bilingual subtitles with fullscreen and mobile browser support.

Back to all posts