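How does the subtitle generator work?

It uses the browser's built-in Web Speech API or the Whisper AI model for speech recognition, automatically converting the dialogue in your video into text subtitles. Processing happens entirely on your device.

Does the subtitle generator upload my video?

No. Speech recognition runs in the browser, and your video and audio data never leave your device. This keeps your content fully private.

Which languages are supported for subtitle generation?

Many languages are supported, including English, Chinese, Spanish, and Japanese. Recognition accuracy depends on audio quality and on the speech recognition engine you choose.

Which subtitle formats can I export?

You can export SRT and VTT subtitles. Both formats are compatible with nearly all video players and social media platforms. You can edit the timing and text before exporting.

Is the subtitle generator free? Are there usage limits?

It is completely free, with no limits on video length or number of uses. Because processing happens in the browser, the only limit is your device's performance. It is well suited to content creators and video editors.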

🎙️ Whisper AI · No Upload · No API Key

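AI Subtitle Generator

Transcribe any audio or video to SRT, VTT, or plain text with OpenAI Whisper, entirely in your browser. No upload, no API key, no subscription. Works offline after the first model download.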

Drag an audio or video file here, or click to select

MP3, MP4, WAV, M4A, OGG, WebM, MOV, FLAC…

Best results: clear audio, minimal background noise, no longer than 30 minutes

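Output

Model

First use: the model is downloaded from the HuggingFace CDN (75–466 MB, depending on your selection). It is cached in the browser, so later runs start immediately, even offline.

Language

Select the spoken language. This significantly improves accuracy and speed.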

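💡 Tips for best results

• Start with the Base model.

• If you know the language, set it manually. It is about 30% faster.

• Clear audio transcribes better than noisy recordings.

• Long files (>30 minutes) may take several minutes on slower devices.

• SRT works with all video editors and social platforms. VTT is for web players.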

Whisper AI transcription — free, local, private

OpenAI Whisper is among the most capable automatic speech recognition models released to date. It understands 99 languages, handles accents gracefully, and produces subtitle-quality output including punctuation and casing. Services that offer Whisper as an API charge $0.006 per minute. For a 60-minute interview, that's $0.36, and your audio is sent to their servers. This tool runs Whisper directly in your browser via WebAssembly, so it costs you nothing and your audio never leaves your device.
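The cost comparison above is simple per-minute arithmetic. A minimal sketch, assuming a flat rate with no minimum charge; the function name `hostedCostUsd` is hypothetical and only the $0.006 rate comes from the text:

```typescript
// Flat per-minute rate quoted above for hosted Whisper APIs (USD).
const RATE_USD_PER_MINUTE = 0.006;

// Hypothetical helper: cost of transcribing `durationMinutes` of audio
// at a flat per-minute rate. Assumes no minimum charge or rounding rules.
function hostedCostUsd(durationMinutes: number, rate: number = RATE_USD_PER_MINUTE): number {
  // Round to 4 decimal places to hide floating-point noise.
  return Math.round(durationMinutes * rate * 10000) / 10000;
}

console.log(hostedCostUsd(60)); // 0.36
```

Running locally replaces that per-minute charge with your own CPU time.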

How it works

Drop your file. Any format your browser can decode: MP3, MP4, WAV, M4A, OGG, WebM, FLAC, MOV.
Load a model. We download the Whisper model weights from HuggingFace CDN (75–466 MB). This is a one-time download — the browser caches it, so subsequent runs work instantly, even offline.
Transcribe. The model runs in a Web Worker thread so the page stays responsive. A typical 5-minute audio clip takes 1–3 minutes to transcribe, depending on your CPU.
Export. Download as SRT (for video editors and social platforms), VTT (for web players), plain text, or JSON with raw timestamps.

Which model to choose?

Tiny (75 MB) — quick drafts, strong English. Useful for getting timestamps even if accuracy isn't perfect.
Tiny EN-only (75 MB) — same size as Tiny but faster on English because it skips the language-detection step.
Base (145 MB) — the sweet spot for most use cases. Better multilingual accuracy than Tiny at acceptable speed.
Small (466 MB) — near-professional quality. Recommended for interviews, podcasts, legal/medical content.
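The trade-off above can be expressed as a small lookup. This is only a sketch: the sizes are the ones listed here, but the `id` strings, the `ModelOption` type, and the "largest model that fits the download budget" policy are assumptions for illustration, not how this tool actually selects a model.

```typescript
// Hypothetical description of the four models listed above.
interface ModelOption {
  id: string;
  label: string;
  sizeMb: number;
  englishOnly: boolean;
}

const MODELS: ModelOption[] = [
  { id: "tiny", label: "Tiny", sizeMb: 75, englishOnly: false },
  { id: "tiny.en", label: "Tiny EN-only", sizeMb: 75, englishOnly: true },
  { id: "base", label: "Base", sizeMb: 145, englishOnly: false },
  { id: "small", label: "Small", sizeMb: 466, englishOnly: false },
];

// Pick the largest model that fits the download budget and supports the
// language. On a size tie, prefer the English-only variant (English audio only).
function pickModel(maxDownloadMb: number, language: string): ModelOption | null {
  let best: ModelOption | null = null;
  for (const m of MODELS) {
    if (m.sizeMb > maxDownloadMb) continue;
    if (m.englishOnly && language !== "en") continue;
    if (
      best === null ||
      m.sizeMb > best.sizeMb ||
      (m.sizeMb === best.sizeMb && m.englishOnly)
    ) {
      best = m;
    }
  }
  return best;
}
```

For example, `pickModel(100, "en")` returns the Tiny EN-only entry, while `pickModel(200, "es")` returns Base.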

Export formats

SRT — supported by YouTube, Vimeo, CapCut, DaVinci Resolve, Premiere, Final Cut Pro and every social platform that accepts caption files.
VTT (WebVTT) — the native caption format for HTML5 <video> elements and web streaming players.
Plain text — the transcript without any timing markers. Useful for documentation, meeting notes or full-text search.
JSON — raw transcript data with start and end timestamps per segment. Ideal if you want to process the transcript programmatically.
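To show how the formats relate, here is a sketch that turns per-segment timestamps into SRT and VTT text. The `TranscriptSegment` shape (start and end in seconds, plus text) is an assumption modeled on the JSON export described above, not the tool's actual schema. The two formats differ mainly in the millisecond separator (`,` for SRT, `.` for VTT), the numbered cues in SRT, and the `WEBVTT` header.

```typescript
// Assumed segment shape: times in seconds, as in a typical JSON export.
interface TranscriptSegment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS<sep>mmm.
function formatTimestamp(seconds: number, msSeparator: string): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + msSeparator + pad(ms, 3);
}

// SRT: numbered cues, comma before milliseconds, blank line between cues.
function toSrt(segments: TranscriptSegment[]): string {
  return segments
    .map((seg, i) =>
      String(i + 1) + "\n" +
      formatTimestamp(seg.start, ",") + " --> " + formatTimestamp(seg.end, ",") + "\n" +
      seg.text.trim() + "\n")
    .join("\n");
}

// WebVTT: "WEBVTT" header, period before milliseconds, cue numbers optional.
function toVtt(segments: TranscriptSegment[]): string {
  const cues = segments.map(seg =>
    formatTimestamp(seg.start, ".") + " --> " + formatTimestamp(seg.end, ".") + "\n" +
    seg.text.trim() + "\n");
  return "WEBVTT\n\n" + cues.join("\n");
}
```

The plain-text export would then be just the segment texts joined together, with the timing lines left out.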

Privacy

Your audio file is decoded and processed in your browser tab using WebAssembly. Nothing is uploaded. We don't store audio, we don't store transcripts, we don't log file names. The only network requests this page makes are the one-time model downloads from HuggingFace. After that, the model runs completely offline.

Common use cases

Subtitle a podcast episode or YouTube video without paying for Rev or Otter.ai.
Transcribe an interview for a journalism piece or research paper.
Generate captions for accessibility compliance.
Create timed subtitles for a Shorts/Reel from a longer video.
Transcribe meeting recordings for written notes.
Create an SRT file to upload to TikTok, YouTube or Instagram as native captions.

Limitations

First load is slow — 75–466 MB model download. Fine on broadband, slow on mobile data.
CPU-bound — transcription happens on your CPU. A modern laptop transcribes ~10× real time with the Tiny model. Older hardware will be slower.
Background noise — like all Whisper deployments, quality degrades significantly with heavy background noise or multiple speakers talking over each other.
Long files — files over 30 minutes should be split first. Very long files may exhaust browser memory.
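As a rough illustration of the last two points, the sketch below plans fixed-length chunks for a long recording and estimates processing time from a real-time factor. Both helpers are hypothetical. The 30-minute cap and the ~10× figure come from the text above; your actual speed depends on the model and hardware. Cutting at fixed boundaries can split a word, so in practice you may prefer to cut at a pause.

```typescript
// A chunk of the original recording, in seconds.
interface AudioChunk {
  start: number;
  end: number;
}

// Split a duration into consecutive chunks no longer than maxChunkSec.
function planChunks(durationSec: number, maxChunkSec: number): AudioChunk[] {
  if (maxChunkSec <= 0) throw new Error("maxChunkSec must be positive");
  const chunks: AudioChunk[] = [];
  for (let start = 0; start < durationSec; start += maxChunkSec) {
    chunks.push({ start: start, end: Math.min(start + maxChunkSec, durationSec) });
  }
  return chunks;
}

// Rough wall-clock estimate. realTimeFactor is audio seconds processed per
// second of compute (about 10 for Tiny on a modern laptop, per the text above).
function estimateSeconds(durationSec: number, realTimeFactor: number): number {
  return durationSec / realTimeFactor;
}

// A 75-minute recording split into chunks of at most 30 minutes.
const plan = planChunks(75 * 60, 30 * 60);
console.log(plan.length); // 3
```

When the chunks are transcribed separately, each segment's timestamps need the chunk's `start` added back before the results are merged into one subtitle file.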

