Tillbaka till listan
guide··Rabbitpair

How to Auto-Generate Live Captions for Any Online Video – DualPiP + Deepgram Setup Guide

#ASR#live captions#speech recognition#Deepgram#Chrome extension#picture-in-picture#language learning#auto subtitles

How to auto-generate subtitles for online videos that have no captions?

Many online videos lack subtitles entirely, or only have low-quality auto-generated captions with poor punctuation and inaccurate word boundaries. DualPiP 1.7.0 introduces ASR (Automatic Speech Recognition) live captions that generate high-accuracy AI subtitles for any web video in real time, displayed directly inside the picture-in-picture window and fully integrated with learning mode and AI translation.

DualPiP ASR captures the video's audio stream from the browser, sends it to a speech recognition service like Deepgram for real-time transcription, and overlays timestamped captions on the video. If a video has native subtitles but poor quality, you can also use DualPiP's subtitle search to find better subtitle files from OpenSubtitles. For videos with no subtitle source at all, ASR live captions are the best solution.


What's the difference between DualPiP ASR and Chrome's built-in Live Caption?

Chrome has a built-in Live Caption feature available under Settings → Accessibility. However, Chrome's native live captions have significant limitations for video learning — most critically, captions disappear when you enter picture-in-picture mode, which makes it unusable for multitasking while watching videos.

FeatureChrome Built-in Live CaptionDualPiP ASR Live Captions
Picture-in-pictureCaptions disappear in PiPFull captions inside PiP window
AccuracyBasic, poor sentence boundariesDeepgram nova-3 with smart punctuation
Bilingual translationSeparate translation feature, disconnectedIntegrated with DualPiP's 12 translation engines, AI LLM translation recommended
Learning modeNot supportedSubtitle panel, AB loop repeat
Caption stylingFixed appearanceFont size, color, position, background fully customizable
Caption positionBrowser bottom bubble, covers page contentOverlaid on video, follows playback window
Language support~20 languages22 languages + multilingual auto-detection
Recognition modesReal-time streaming onlyReal-time WebSocket + pre-download batch

The key advantage of DualPiP ASR is captions that persist in picture-in-picture mode. When you pop out a video into a floating window, Chrome's built-in captions vanish — DualPiP's ASR captions stay with the PiP window, ideal for watching videos while working.


Which speech recognition services does DualPiP ASR support?

DualPiP ASR uses a BYOK (Bring Your Own Key) architecture — you configure your own speech recognition service, and requests go directly from your browser to the provider without any intermediary server. Two types of backends are supported:

Cloud ASR: Deepgram

Deepgram is the pre-configured cloud ASR provider in DualPiP, using the nova-3 model for speech recognition. Nova-3 is one of the most accurate real-time speech recognition models available today:

  • Real-time WebSocket streaming: Sub-300ms latency, captions appear almost instantly
  • Smart punctuation and sentence boundaries: Automatic punctuation with accurate sentence segmentation
  • 22 languages supported: English, Chinese, Japanese, Korean, French, German, Spanish, and more
  • Multilingual auto-detection: Deepgram's unique multi mode detects and switches languages automatically
  • Low cost: $0.007/minute (nova-3), approximately $0.84 for a 2-hour movie

Unlike extensions like Immersive Translate that offer built-in ASR with monthly limits (~50 videos/month), DualPiP's BYOK model has no video count limits — you pay only for what you use, with transparent API billing. For a full comparison of DualPiP with other subtitle extensions, see Best Chrome Bilingual Subtitle Extensions 2026.

Local ASR: Whisper

DualPiP also supports locally deployed OpenAI-compatible Whisper servers, processing audio entirely on your machine — ideal for privacy-conscious users or restricted network environments:

Local SolutionDescription
SpeachesHigh-performance Whisper API server with GPU acceleration
whisper.cppLightweight C++ implementation, runs on CPU
hwdsl2/whisper-serverOne-command Docker deployment
Any OpenAI-compatible serverAny service with /v1/audio/transcriptions endpoint

Local backends use HTTP batch recognition mode, with DualPiP sending audio segments (default 5 seconds) for transcription — completely free and works offline.


How to get Deepgram's free $200 credit and API key

No credit card required. Deepgram offers $200 in free credits to new users — no payment method needed during signup. At nova-3's rate of $0.007/minute, $200 covers approximately 476 hours of audio — enough for about 238 two-hour movies.

Step-by-step: Sign up for Deepgram and create an API key

  1. Visit deepgram.com and click Sign Up Free
  2. Register with your Google account or email (no credit card information required)
  3. After logging in, you'll enter the Console dashboard with a default project already created
  4. Navigate to Settings → API Keys in the left sidebar
  5. Click Create a New API Key
  6. Enter a name (e.g., "DualPiP"), set permissions to Member, and click Create Key
  7. Copy and save the API key immediately — it cannot be viewed again after closing the page
DetailInfo
Free credit$200 (on signup)
Credit card requiredNo
Credit expirationNever expires
After credits usedPay As You Go
Nova-3 pricing$0.007/minute
$200 covers~476 hours (~238 movies)

How to set up ASR live captions in DualPiP

Setup takes two steps: add an ASR provider in extension settings, then enable live captions in the picture-in-picture window.

Step 1: Add an ASR provider

  1. Open DualPiP's Settings page (click the extension icon → gear icon)
  2. Go to the ASR Settings tab
  3. Click Add Provider
  4. Select a preset: Deepgram (cloud) or Custom Local Backend (local)
  5. Enter your Deepgram API key (obtained in the previous section) or local Whisper server address
  6. Choose the default recognition language (Multilingual auto-detection recommended) and model
  7. Click Save

Step 2: Enable live captions in the PiP window

  1. Open DualPiP picture-in-picture mode on any video site (shortcut Ctrl+Shift+E)
  2. Click the ASR button (microphone icon) in the PiP control bar
  3. Toggle Live Caption on
  4. Real-time captions start appearing immediately above the video

You can also use the shortcut Shift+A to toggle ASR inside the PiP window, or configure a global shortcut for "Toggle Live Captions" in Chrome's extension shortcuts settings (chrome://extensions/shortcuts).


What's the difference between streaming and pre-download modes?

DualPiP ASR offers two audio capture and recognition modes for different viewing scenarios:

Real-time streaming mode (WebSocket)

Audio streams to Deepgram via WebSocket in real time, with sub-300ms caption latency. Deepgram's Interim Results feature provides preliminary transcription before the final result, making captions appear even faster. Best for live streams, video calls, and real-time content.

Pre-download batch mode (HTTP)

DualPiP pre-downloads the video audio and splits it into segments, then sends them via HTTP for batch transcription. Best for published video content — complete captions are generated before playback begins, with zero delay during viewing. Pre-download mode supports both Deepgram cloud and local Whisper backends.

ComparisonReal-time StreamingPre-download Batch
Latency< 300msZero after pre-download completes
Best forLive streams, real-time contentPublished videos, complete coverage
BackendsDeepgram (WebSocket)Deepgram + local Whisper
CoverageReal-time, occasional gapsComplete audio coverage

DualPiP defaults to Auto mode: it tries WebSocket streaming first, and falls back to pre-download batch if the provider doesn't support streaming.


How to use ASR captions with learning mode for language study

DualPiP ASR generates timestamped captions fully compatible with learning mode, turning any video — even those without native subtitles — into a language learning resource:

  • Subtitle list panel: Each ASR-recognized sentence appears chronologically in the right-side learning panel, clickable for navigation
  • AB loop repeat: Select any ASR caption for repeated playback and focused listening practice
  • Auto-pause: Automatic pause after each caption for shadowing and repetition
  • Bilingual display: ASR captions pair with AI translation engines for side-by-side original + translated subtitles

This means even a video with zero native subtitles can become a full-featured sentence-by-sentence learning tool once ASR generates captions.

ASR and traditional subtitles are mutually exclusive in DualPiP: enabling ASR automatically disables traditional subtitles, and vice versa. If a video has high-quality native subtitles, use those first — or find subtitle files via subtitle search. ASR is ideal for videos with no subtitles or poor-quality auto-generated captions.


How to combine ASR with AI translation for real-time bilingual captions

DualPiP's ASR and AI translation work together to generate real-time bilingual subtitles for any video in any language — solving a scenario traditional subtitles can't cover: the video has no native subtitles, but you need bilingual captions for language learning.

ASR + AI translation workflow

  1. ASR recognizes the original language: Deepgram transcribes the video's audio into source-language text captions
  2. AI LLM translates in real time: DualPiP's AI translation engine translates the ASR captions into your target language
  3. Bilingual captions displayed together: Original and translated text appear as bilingual subtitles overlaid on the video

ASR-generated captions differ from traditional subtitle files — they're real-time speech transcriptions that may have incomplete sentence boundaries, colloquial expressions, and proper nouns without context. AI LLM translation (DeepSeek, GPT, Claude) significantly outperforms traditional machine translation (Google, Microsoft) when translating ASR captions:

AspectTraditional MT (Google/Microsoft)AI LLM Translation (DeepSeek/GPT/Claude)
Context awarenessTranslates sentence by sentence, no contextDualPiP sends recent N captions as conversation history
Colloquial speechLiteral translation, awkward phrasingUnderstands conversational context, natural output
Incomplete sentencesBreaks when ASR sentence boundaries are imperfectCompletes meaning using context, translates correctly
Proper nounsFrequently mistranslates names and termsEnhanced by DualPiP's movie info integration
Tone preservationMechanical, flat outputPreserves speaker tone and expression style

DualPiP's AI translation engine uses a sliding window context mechanism: each time it translates an ASR caption, it includes previously translated captions as conversation history, ensuring consistent and coherent translation. This is especially important for ASR scenarios — because speech recognition segmentation differs from traditional subtitles, the AI needs prior context to correctly understand the current sentence.

Use cases

ScenarioDescription
Learning a language from unsubtitled videosASR recognizes the original + AI translates to your native language
Watching live streamsNo pre-made subtitles — ASR generates + AI translates in real time
Academic lectures and coursesSome courses lack subtitles — ASR + AI generates translations
Podcasts and interviewsAudio-only content visualized as text via ASR, then translated

DualPiP supports 30+ AI translation providers (DeepSeek, GPT, Claude, Gemini, etc.). For ASR captions, we recommend DeepSeek V4 Flash (best value, ~$0.03–0.07/movie) or Groq Llama (free tier, fastest response). See the complete AI translation setup guide for details.


Which video websites work with DualPiP ASR?

DualPiP ASR uses the browser's Audio Capture API to capture audio, so it can theoretically generate captions for any video playing in Chrome. Verified platforms include:

Platform TypeSupported Sites
Video platformsYouTube, Netflix, Disney+, Bilibili, Crunchyroll, HiAnime
Learning platformsCoursera, Udemy, TED, edX, Khan Academy
Live streamingTwitch, YouTube Live
Meeting toolsZoom (web), Google Meet
OtherAny site using HTML5 <video> element

For videos without native subtitles (live streams, niche platforms, user uploads), ASR is the only way to get captions. Combined with DualPiP's AI LLM translation, you can generate real-time bilingual subtitles for any online video in any language. Since ASR captions are real-time speech transcriptions, AI LLM translation is strongly recommended over traditional machine translation for accurate, context-aware results.


Frequently asked questions

Q: How accurate is ASR real-time speech recognition? Deepgram nova-3 achieves a word error rate (WER) below 8% for English, making it one of the most accurate real-time speech recognition models in 2026. Chinese, Japanese, and other languages also perform well. Accuracy depends on audio quality, speaker accent, and background noise.

Q: What happens after the $200 free credit is used up? It switches to pay-as-you-go billing. Nova-3 costs $0.007/minute — about $0.84 for a 2-hour movie. You can also switch to a local Whisper backend for completely free transcription (requires a local GPU).

Q: Can ASR captions and traditional subtitles be displayed simultaneously? No. DualPiP treats them as mutually exclusive — enabling ASR automatically disables traditional subtitles, and vice versa. If a video has quality native subtitles or you can search for subtitle files, use those first.

Q: Is ASR a free feature or does it require DualPiP Premium? ASR live captions are a Premium feature. DualPiP's YouTube in-page bilingual subtitles and basic PiP player are free. ASR, AI translation, and full learning mode require a Premium subscription.

Q: What hardware do I need for a local Whisper backend? An NVIDIA GPU with 6GB+ VRAM is recommended for smooth real-time recognition. CPU inference works but is slower. Deploying hwdsl2/whisper-server via Docker is the simplest approach — one command to start.

Q: Can it recognize mixed languages in a single video? Deepgram's Multilingual mode detects and switches languages automatically within the same audio stream, ideal for multilingual interviews, podcasts, and educational content. Local Whisper also supports language detection but with lower switching accuracy.


Get started with DualPiP ASR live captions

Four steps to generate AI live captions for any online video:

  1. Install DualPiP: Chrome Web Store | Edge Add-ons
  2. Sign up for Deepgram's free $200 credit: deepgram.com (no credit card needed)
  3. Add a Deepgram provider in DualPiP settings and enter your API key
  4. Open any video in PiP mode and click the ASR button to enable live captions

Whether it's an unsubtitled live stream, a niche platform video, or foreign language content for study, DualPiP ASR generates real-time AI captions. Combine with AI bilingual translation for dual-language subtitles on any video, or use learning mode with AB loop repeat and sentence panels to turn every video into effective language learning material.