How to Auto-Generate Live Captions for Any Online Video – DualPiP + Deepgram Setup Guide
How to auto-generate subtitles for online videos that have no captions?
Many online videos lack subtitles entirely, or only have low-quality auto-generated captions with poor punctuation and inaccurate word boundaries. DualPiP 1.7.0 introduces ASR (Automatic Speech Recognition) live captions that generate high-accuracy AI subtitles for any web video in real time, displayed directly inside the picture-in-picture window and fully integrated with learning mode and AI translation.
DualPiP ASR captures the video's audio stream from the browser, sends it to a speech recognition service like Deepgram for real-time transcription, and overlays timestamped captions on the video. If a video has native subtitles but poor quality, you can also use DualPiP's subtitle search to find better subtitle files from OpenSubtitles. For videos with no subtitle source at all, ASR live captions are the best solution.
What's the difference between DualPiP ASR and Chrome's built-in Live Caption?
Chrome has a built-in Live Caption feature available under Settings → Accessibility. However, Chrome's native live captions have significant limitations for video learning — most critically, captions disappear when you enter picture-in-picture mode, which makes it unusable for multitasking while watching videos.
| Feature | Chrome Built-in Live Caption | DualPiP ASR Live Captions |
|---|---|---|
| Picture-in-picture | Captions disappear in PiP | Full captions inside PiP window |
| Accuracy | Basic, poor sentence boundaries | Deepgram nova-3 with smart punctuation |
| Bilingual translation | Separate translation feature, disconnected | Integrated with DualPiP's 12 translation engines, AI LLM translation recommended |
| Learning mode | Not supported | Subtitle panel, AB loop repeat |
| Caption styling | Fixed appearance | Font size, color, position, background fully customizable |
| Caption position | Browser bottom bubble, covers page content | Overlaid on video, follows playback window |
| Language support | ~20 languages | 22 languages + multilingual auto-detection |
| Recognition modes | Real-time streaming only | Real-time WebSocket + pre-download batch |
The key advantage of DualPiP ASR is captions that persist in picture-in-picture mode. When you pop out a video into a floating window, Chrome's built-in captions vanish — DualPiP's ASR captions stay with the PiP window, ideal for watching videos while working.
Which speech recognition services does DualPiP ASR support?
DualPiP ASR uses a BYOK (Bring Your Own Key) architecture — you configure your own speech recognition service, and requests go directly from your browser to the provider without any intermediary server. Two types of backends are supported:
Cloud ASR: Deepgram
Deepgram is the pre-configured cloud ASR provider in DualPiP, using the nova-3 model for speech recognition. Nova-3 is one of the most accurate real-time speech recognition models available today:
- Real-time WebSocket streaming: Sub-300ms latency, captions appear almost instantly
- Smart punctuation and sentence boundaries: Automatic punctuation with accurate sentence segmentation
- 22 languages supported: English, Chinese, Japanese, Korean, French, German, Spanish, and more
- Multilingual auto-detection: Deepgram's unique multi mode detects and switches languages automatically
- Low cost: $0.007/minute (nova-3), approximately $0.84 for a 2-hour movie
Unlike extensions like Immersive Translate that offer built-in ASR with monthly limits (~50 videos/month), DualPiP's BYOK model has no video count limits — you pay only for what you use, with transparent API billing. For a full comparison of DualPiP with other subtitle extensions, see Best Chrome Bilingual Subtitle Extensions 2026.
Local ASR: Whisper
DualPiP also supports locally deployed OpenAI-compatible Whisper servers, processing audio entirely on your machine — ideal for privacy-conscious users or restricted network environments:
| Local Solution | Description |
|---|---|
| Speaches | High-performance Whisper API server with GPU acceleration |
| whisper.cpp | Lightweight C++ implementation, runs on CPU |
| hwdsl2/whisper-server | One-command Docker deployment |
| Any OpenAI-compatible server | Any service with /v1/audio/transcriptions endpoint |
Local backends use HTTP batch recognition mode, with DualPiP sending audio segments (default 5 seconds) for transcription — completely free and works offline.
How to get Deepgram's free $200 credit and API key
No credit card required. Deepgram offers $200 in free credits to new users — no payment method needed during signup. At nova-3's rate of $0.007/minute, $200 covers approximately 476 hours of audio — enough for about 238 two-hour movies.
Step-by-step: Sign up for Deepgram and create an API key
- Visit deepgram.com and click Sign Up Free
- Register with your Google account or email (no credit card information required)
- After logging in, you'll enter the Console dashboard with a default project already created
- Navigate to Settings → API Keys in the left sidebar
- Click Create a New API Key
- Enter a name (e.g., "DualPiP"), set permissions to Member, and click Create Key
- Copy and save the API key immediately — it cannot be viewed again after closing the page
| Detail | Info |
|---|---|
| Free credit | $200 (on signup) |
| Credit card required | No |
| Credit expiration | Never expires |
| After credits used | Pay As You Go |
| Nova-3 pricing | $0.007/minute |
| $200 covers | ~476 hours (~238 movies) |
How to set up ASR live captions in DualPiP
Setup takes two steps: add an ASR provider in extension settings, then enable live captions in the picture-in-picture window.
Step 1: Add an ASR provider
- Open DualPiP's Settings page (click the extension icon → gear icon)
- Go to the ASR Settings tab
- Click Add Provider
- Select a preset: Deepgram (cloud) or Custom Local Backend (local)
- Enter your Deepgram API key (obtained in the previous section) or local Whisper server address
- Choose the default recognition language (Multilingual auto-detection recommended) and model
- Click Save
Step 2: Enable live captions in the PiP window
- Open DualPiP picture-in-picture mode on any video site (shortcut
Ctrl+Shift+E) - Click the ASR button (microphone icon) in the PiP control bar
- Toggle Live Caption on
- Real-time captions start appearing immediately above the video
You can also use the shortcut Shift+A to toggle ASR inside the PiP window, or configure a global shortcut for "Toggle Live Captions" in Chrome's extension shortcuts settings (chrome://extensions/shortcuts).
What's the difference between streaming and pre-download modes?
DualPiP ASR offers two audio capture and recognition modes for different viewing scenarios:
Real-time streaming mode (WebSocket)
Audio streams to Deepgram via WebSocket in real time, with sub-300ms caption latency. Deepgram's Interim Results feature provides preliminary transcription before the final result, making captions appear even faster. Best for live streams, video calls, and real-time content.
Pre-download batch mode (HTTP)
DualPiP pre-downloads the video audio and splits it into segments, then sends them via HTTP for batch transcription. Best for published video content — complete captions are generated before playback begins, with zero delay during viewing. Pre-download mode supports both Deepgram cloud and local Whisper backends.
| Comparison | Real-time Streaming | Pre-download Batch |
|---|---|---|
| Latency | < 300ms | Zero after pre-download completes |
| Best for | Live streams, real-time content | Published videos, complete coverage |
| Backends | Deepgram (WebSocket) | Deepgram + local Whisper |
| Coverage | Real-time, occasional gaps | Complete audio coverage |
DualPiP defaults to Auto mode: it tries WebSocket streaming first, and falls back to pre-download batch if the provider doesn't support streaming.
How to use ASR captions with learning mode for language study
DualPiP ASR generates timestamped captions fully compatible with learning mode, turning any video — even those without native subtitles — into a language learning resource:
- Subtitle list panel: Each ASR-recognized sentence appears chronologically in the right-side learning panel, clickable for navigation
- AB loop repeat: Select any ASR caption for repeated playback and focused listening practice
- Auto-pause: Automatic pause after each caption for shadowing and repetition
- Bilingual display: ASR captions pair with AI translation engines for side-by-side original + translated subtitles
This means even a video with zero native subtitles can become a full-featured sentence-by-sentence learning tool once ASR generates captions.
ASR and traditional subtitles are mutually exclusive in DualPiP: enabling ASR automatically disables traditional subtitles, and vice versa. If a video has high-quality native subtitles, use those first — or find subtitle files via subtitle search. ASR is ideal for videos with no subtitles or poor-quality auto-generated captions.
How to combine ASR with AI translation for real-time bilingual captions
DualPiP's ASR and AI translation work together to generate real-time bilingual subtitles for any video in any language — solving a scenario traditional subtitles can't cover: the video has no native subtitles, but you need bilingual captions for language learning.
ASR + AI translation workflow
- ASR recognizes the original language: Deepgram transcribes the video's audio into source-language text captions
- AI LLM translates in real time: DualPiP's AI translation engine translates the ASR captions into your target language
- Bilingual captions displayed together: Original and translated text appear as bilingual subtitles overlaid on the video
Why AI LLM translation is strongly recommended for ASR captions
ASR-generated captions differ from traditional subtitle files — they're real-time speech transcriptions that may have incomplete sentence boundaries, colloquial expressions, and proper nouns without context. AI LLM translation (DeepSeek, GPT, Claude) significantly outperforms traditional machine translation (Google, Microsoft) when translating ASR captions:
| Aspect | Traditional MT (Google/Microsoft) | AI LLM Translation (DeepSeek/GPT/Claude) |
|---|---|---|
| Context awareness | Translates sentence by sentence, no context | DualPiP sends recent N captions as conversation history |
| Colloquial speech | Literal translation, awkward phrasing | Understands conversational context, natural output |
| Incomplete sentences | Breaks when ASR sentence boundaries are imperfect | Completes meaning using context, translates correctly |
| Proper nouns | Frequently mistranslates names and terms | Enhanced by DualPiP's movie info integration |
| Tone preservation | Mechanical, flat output | Preserves speaker tone and expression style |
DualPiP's AI translation engine uses a sliding window context mechanism: each time it translates an ASR caption, it includes previously translated captions as conversation history, ensuring consistent and coherent translation. This is especially important for ASR scenarios — because speech recognition segmentation differs from traditional subtitles, the AI needs prior context to correctly understand the current sentence.
Use cases
| Scenario | Description |
|---|---|
| Learning a language from unsubtitled videos | ASR recognizes the original + AI translates to your native language |
| Watching live streams | No pre-made subtitles — ASR generates + AI translates in real time |
| Academic lectures and courses | Some courses lack subtitles — ASR + AI generates translations |
| Podcasts and interviews | Audio-only content visualized as text via ASR, then translated |
DualPiP supports 30+ AI translation providers (DeepSeek, GPT, Claude, Gemini, etc.). For ASR captions, we recommend DeepSeek V4 Flash (best value, ~$0.03–0.07/movie) or Groq Llama (free tier, fastest response). See the complete AI translation setup guide for details.
Which video websites work with DualPiP ASR?
DualPiP ASR uses the browser's Audio Capture API to capture audio, so it can theoretically generate captions for any video playing in Chrome. Verified platforms include:
| Platform Type | Supported Sites |
|---|---|
| Video platforms | YouTube, Netflix, Disney+, Bilibili, Crunchyroll, HiAnime |
| Learning platforms | Coursera, Udemy, TED, edX, Khan Academy |
| Live streaming | Twitch, YouTube Live |
| Meeting tools | Zoom (web), Google Meet |
| Other | Any site using HTML5 <video> element |
For videos without native subtitles (live streams, niche platforms, user uploads), ASR is the only way to get captions. Combined with DualPiP's AI LLM translation, you can generate real-time bilingual subtitles for any online video in any language. Since ASR captions are real-time speech transcriptions, AI LLM translation is strongly recommended over traditional machine translation for accurate, context-aware results.
Frequently asked questions
Q: How accurate is ASR real-time speech recognition? Deepgram nova-3 achieves a word error rate (WER) below 8% for English, making it one of the most accurate real-time speech recognition models in 2026. Chinese, Japanese, and other languages also perform well. Accuracy depends on audio quality, speaker accent, and background noise.
Q: What happens after the $200 free credit is used up? It switches to pay-as-you-go billing. Nova-3 costs $0.007/minute — about $0.84 for a 2-hour movie. You can also switch to a local Whisper backend for completely free transcription (requires a local GPU).
Q: Can ASR captions and traditional subtitles be displayed simultaneously? No. DualPiP treats them as mutually exclusive — enabling ASR automatically disables traditional subtitles, and vice versa. If a video has quality native subtitles or you can search for subtitle files, use those first.
Q: Is ASR a free feature or does it require DualPiP Premium? ASR live captions are a Premium feature. DualPiP's YouTube in-page bilingual subtitles and basic PiP player are free. ASR, AI translation, and full learning mode require a Premium subscription.
Q: What hardware do I need for a local Whisper backend?
An NVIDIA GPU with 6GB+ VRAM is recommended for smooth real-time recognition. CPU inference works but is slower. Deploying hwdsl2/whisper-server via Docker is the simplest approach — one command to start.
Q: Can it recognize mixed languages in a single video? Deepgram's Multilingual mode detects and switches languages automatically within the same audio stream, ideal for multilingual interviews, podcasts, and educational content. Local Whisper also supports language detection but with lower switching accuracy.
Get started with DualPiP ASR live captions
Four steps to generate AI live captions for any online video:
- Install DualPiP: Chrome Web Store | Edge Add-ons
- Sign up for Deepgram's free $200 credit: deepgram.com (no credit card needed)
- Add a Deepgram provider in DualPiP settings and enter your API key
- Open any video in PiP mode and click the ASR button to enable live captions
Whether it's an unsubtitled live stream, a niche platform video, or foreign language content for study, DualPiP ASR generates real-time AI captions. Combine with AI bilingual translation for dual-language subtitles on any video, or use learning mode with AB loop repeat and sentence panels to turn every video into effective language learning material.