🎙️ Whisper AI · No Upload · No API Key

AI Subtitle Generator

Transcribe any audio or video to SRT, VTT or plain text with OpenAI Whisper, running entirely in your browser. No upload, no API key, no subscription. Works offline after the first model download.

Drop an audio or video file here, or click to select one

MP3, MP4, WAV, M4A, OGG, WebM, MOV, FLAC…

For best results: clear audio, minimal background noise, under 30 minutes

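Output

Model

First use: the model is downloaded from the HuggingFace CDN (75–466 MB, depending on your selection). It is cached in the browser, so later runs start immediately and work offline.

Language

Select the spoken language; this significantly improves accuracy and speed.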

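💡 Tips for best results

• Use the Base model as a starting point.

• If you know the language, set it manually; transcription is about 30% faster.

• Clear audio transcribes better than noisy recordings.

• Long files (>30 minutes) can take several minutes on slow devices.

• SRT works with all video editors and social platforms. VTT is for web players.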

Whisper AI transcription — free, local, private

OpenAI Whisper is one of the best automatic speech recognition models ever released. It understands 99 languages, handles accents gracefully, and produces subtitle-quality output including punctuation and casing. Services that offer Whisper as a hosted API charge $0.006 per minute: for a 60-minute interview that is $0.36, and your audio is uploaded to their servers. This tool runs Whisper directly in your browser via WebAssembly, so it costs you nothing and your audio never leaves your device.
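To make the pricing comparison concrete, here is a minimal sketch of the arithmetic behind the $0.36 figure. The function name `hosted_api_cost` is hypothetical, and the only input taken from the text is the $0.006 per-minute rate.

```python
from decimal import Decimal

# USD per minute of audio, the hosted-API rate quoted in the text.
API_RATE_PER_MINUTE = Decimal("0.006")

def hosted_api_cost(minutes: float) -> Decimal:
    """Cost in USD of transcribing `minutes` of audio, rounded to cents."""
    cost = API_RATE_PER_MINUTE * Decimal(str(minutes))
    return cost.quantize(Decimal("0.01"))

print(hosted_api_cost(60))       # the 60-minute interview from the text
print(hosted_api_cost(60 * 20))  # twenty such interviews
```

Running the model locally replaces this per-minute charge with a one-time model download and your own CPU time.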

How it works

Drop your file. Any format your browser can decode: MP3, MP4, WAV, M4A, OGG, WebM, FLAC, MOV.
Load a model. We download the Whisper model weights from HuggingFace CDN (75–466 MB). This is a one-time download — the browser caches it, so subsequent runs work instantly, even offline.
Transcribe. The model runs in a Web Worker thread so the page stays responsive. A typical 5-minute audio clip takes 1–3 minutes to transcribe, depending on your CPU.
Export. Download as SRT (for video editors and social platforms), VTT (for web players), plain text, or JSON with raw timestamps.

Which model to choose?

Tiny (75 MB) — quick drafts, strong English. Useful for getting timestamps even if accuracy isn't perfect.
Tiny EN-only (75 MB) — same size as Tiny but faster on English because it skips the language-detection step.
Base (145 MB) — the sweet spot for most use cases. Better multilingual accuracy than Tiny at acceptable speed.
Small (466 MB) — near-professional quality. Recommended for interviews, podcasts, legal/medical content.
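
Model size mostly matters for the first download. As a rough sketch, the snippet below estimates how long each model takes to fetch at a given connection speed. The dictionary keys are hypothetical labels, the sizes are the ones listed above, and the estimate ignores protocol overhead and CDN throttling, so treat the numbers as a lower bound.

```python
# Download sizes in megabytes, as listed above. Keys are illustrative labels.
MODEL_SIZES_MB = {"tiny": 75, "tiny.en": 75, "base": 145, "small": 466}

def download_seconds(model: str, mbps: float) -> float:
    """Seconds to fetch `model` at `mbps` megabits per second (1 MB = 8 megabits)."""
    return MODEL_SIZES_MB[model] * 8 / mbps

for name in MODEL_SIZES_MB:
    fast = download_seconds(name, 100)  # typical broadband
    slow = download_seconds(name, 10)   # weak mobile connection
    print(f"{name:8s} {fast:6.1f} s at 100 Mbit/s, {slow:6.1f} s at 10 Mbit/s")
```

Under these assumptions, Small takes about six minutes on a 10 Mbit/s connection, which is one reason to start with Base when bandwidth is limited.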

Export formats

SRT — supported by YouTube, Vimeo, CapCut, DaVinci Resolve, Premiere, Final Cut Pro and every social platform that accepts caption files.
VTT (WebVTT) — the native caption format for HTML5 <video> elements and web streaming players.
Plain text — the transcript without any timing markers. Useful for documentation, meeting notes or full-text search.
JSON — raw transcript data with start and end timestamps per segment. Ideal if you want to process the transcript programmatically.
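
If you process the JSON export yourself, converting it to SRT or VTT is mostly a matter of timestamp formatting. The sketch below assumes each segment is an object with `start` and `end` in seconds plus a `text` field. That schema is an assumption for illustration, not the tool's documented output, so check an exported file before relying on it.

```python
import json

def stamp(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"

def to_srt(segments) -> str:
    """Numbered cues, comma as the millisecond separator, blank line between cues."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        times = f"{stamp(seg['start'], ',')} --> {stamp(seg['end'], ',')}"
        cues.append(f"{i}\n{times}\n{seg['text'].strip()}\n")
    return "\n".join(cues)

def to_vtt(segments) -> str:
    """WEBVTT header, dot as the millisecond separator, cue numbers omitted."""
    cues = []
    for seg in segments:
        times = f"{stamp(seg['start'], '.')} --> {stamp(seg['end'], '.')}"
        cues.append(f"{times}\n{seg['text'].strip()}\n")
    return "WEBVTT\n\n" + "\n".join(cues)

# Hypothetical export with the assumed field names.
raw = '[{"start": 0.0, "end": 2.5, "text": " Hello there."}, {"start": 2.5, "end": 4.25, "text": "Second line."}]'
segments = json.loads(raw)
print(to_srt(segments))
print(to_vtt(segments))
```

The two formats differ in only three ways here: the `WEBVTT` header line, the millisecond separator, and the cue numbers that SRT requires.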

Privacy

Your audio file is decoded and processed in your browser tab using WebAssembly. Nothing is uploaded. We don't store audio, we don't store transcripts, we don't log file names. The only network requests this page makes are the one-time model downloads from HuggingFace. After that, the model runs completely offline.

Common use cases

Subtitle a podcast episode or YouTube video without paying for Rev or Otter.ai.
Transcribe an interview for a journalism piece or research paper.
Generate captions for accessibility compliance.
Create timed subtitles for a Shorts/Reel from a longer video.
Transcribe meeting recordings for written notes.
Create an SRT file to upload to TikTok, YouTube or Instagram as native captions.

Limitations

First load is slow — 75–466 MB model download. Fine on broadband, slow on mobile data.
CPU-bound — transcription happens on your CPU. A modern laptop transcribes roughly 10× faster than real time with the Tiny model; larger models and older hardware will be slower.
Background noise — like all Whisper deployments, quality degrades significantly with heavy background noise or multiple speakers talking over each other.
Long files — files over 30 minutes should be split first. Very long files may exhaust browser memory.
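
When you split a long file, each part is transcribed with timestamps that restart at zero, so they must be shifted before the parts are merged into one subtitle file. The following is a minimal sketch of that bookkeeping. The helper names and the segment fields (`start`, `end`, `text`, in seconds) are assumptions, and the actual audio cutting is left to whatever editor or tool you prefer.

```python
import math

MAX_CHUNK_S = 30 * 60  # the 30-minute guideline from the text

def chunk_bounds(duration_s: float, max_chunk_s: float = MAX_CHUNK_S):
    """Return (start, end) pairs in seconds: equal chunks, none longer than max_chunk_s."""
    n = max(1, math.ceil(duration_s / max_chunk_s))
    size = duration_s / n
    return [(i * size, min((i + 1) * size, duration_s)) for i in range(n)]

def merge_chunks(chunk_segments, bounds):
    """Shift each chunk's segments by the chunk's start offset and concatenate them."""
    merged = []
    for segs, (offset, _end) in zip(chunk_segments, bounds):
        for seg in segs:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged

# A 75-minute recording becomes three 25-minute parts.
bounds = chunk_bounds(75 * 60)
print(bounds)

# Hypothetical per-chunk transcripts, each timed from zero.
parts = [
    [{"start": 1.0, "end": 2.0, "text": "intro"}],
    [{"start": 0.5, "end": 1.5, "text": "middle"}],
    [{"start": 3.0, "end": 4.0, "text": "outro"}],
]
print(merge_chunks(parts, bounds))
```

Equal-length cuts can land in the middle of a word; moving each boundary to a nearby pause in speech usually gives cleaner subtitles at the seams.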

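Frequently asked questions

How does the subtitle generator work? It runs the OpenAI Whisper speech recognition model in your browser via WebAssembly and transcribes the audio into timed subtitles.
Which subtitle formats can I export? SRT, VTT, plain text and JSON. SRT and VTT are compatible with YouTube, TikTok and most video editing software.
Does it support multiple languages? Yes. Whisper understands 99 languages, and selecting the spoken language improves accuracy and speed.
Is my audio sent to a server? No. Transcription runs in your browser and your audio never leaves your device. The only network requests are the one-time model downloads from HuggingFace.
Is this subtitle generator free? Yes, it is completely free regardless of video length, although files over 30 minutes should be split first.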
