Cut the wait: transcribe hours of audio in minutes (or seconds)
Slow, costly cloud transcription or fragile third‑party services can be a bottleneck for teams that need quick, reliable transcripts. If you’ve ever waited forever for an accurate model to finish transcribing hours of audio, Insanely Fast Whisper aims straight for that pain point. This opinionated CLI brings OpenAI’s Whisper models on‑device and pairs them with Transformers, Optimum, and Flash Attention 2 to make transcription dramatically faster — often by orders of magnitude.
What It Does
Insanely Fast Whisper is a lightweight, community‑driven command line interface to run Whisper‑family automatic speech recognition (ASR) models locally on GPUs and Apple Silicon (MPS). It focuses on throughput and real‑world usability rather than trying to be a one‑size‑fits‑all platform.
The project’s core capabilities include:
- Fast, on‑device transcription of audio files using Hugging Face models (default: openai/whisper-large-v3).
- Support for high‑throughput optimisation strategies — fp16, batching, BetterTransformer, and Flash Attention 2 — to reduce wall‑clock time.
- Simple, opinionated CLI with sensible defaults for batch size, chunking, timestamps (chunk or word), and device selection (a CUDA device id, or mps for Macs); a short sketch of the equivalent knobs in the underlying pipeline follows this list.
- Optional diarization integration via pyannote.audio (requires a Hugging Face token) to split speakers.
- Support for different models, including distilled variants and distil‑whisper checkpoints, for faster, lighter transcription at a small accuracy trade‑off.
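As a quick illustration of those knobs, here is a minimal sketch of the underlying Transformers pipeline with word‑level timestamps; the audio path is a placeholder, and the device string depends on your hardware:

import torch
from transformers import pipeline

# Roughly what the CLI's device and timestamp options map to in the underlying pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "mps" on Apple Silicon
)

result = asr(
    "<file-name>",               # placeholder path to your audio file
    chunk_length_s=30,
    batch_size=24,
    return_timestamps="word",    # "word" for per-word timestamps, True for chunk-level ones
)
print(result["chunks"][:5])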
Benchmarks from the repo: on an Nvidia A100 (80GB), Whisper Large v3 with fp16 + batching + Flash Attention 2 transcribes ~150 minutes of audio in ~98 seconds, roughly 90x faster than real time. Other setups and model variants are listed in the repo's benchmarking table.
Who It’s For
This tool is designed for developers, ML engineers, researchers, podcasters, journalists, and teams who need to transcribe audio at scale on their own hardware or macOS machines. Typical use cases include:
- Batch transcription for long audio recordings such as podcasts, interviews, meetings, or lectures.
- Prototyping production workflows where you want maximum throughput without cloud costs.
- Teams who need offline transcription for privacy or compliance reasons.
No special skills are required beyond comfort with the command line and a basic understanding of Python packaging and GPU devices (CUDA or Apple MPS). The project is opinionated and aims to keep configuration minimal while letting advanced users tweak batching, model choice, and attention implementations.
How It Works
Under the hood the CLI composes a high‑throughput pipeline using components from the Hugging Face ecosystem:
- Transformers — the ASR pipeline and pre‑trained Whisper checkpoints (e.g., openai/whisper-large-v3).
- Optimum / BetterTransformer — to reduce compute by applying transformer optimisations transparently.
- Flash Attention 2 — an alternative attention implementation that significantly speeds up large models when installed and available.
- PyTorch and Accelerate — device management, mixed precision (fp16), and model offloading where applicable.
The CLI orchestrates model loading (with fp16 where supported), audio chunking (e.g., chunk_length_s), and batching (defaults to 24 but adjustable). For diarization, it calls into pyannote.audio with an optional Hugging Face token to segment speakers and then uses Whisper for ASR on those segments.
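As a rough sketch of that idea (not the CLI's exact post‑processing), diarization and ASR can be combined like this; the checkpoint name, token, file name, and midpoint‑based speaker alignment are illustrative assumptions, and the pyannote checkpoint requires accepting its terms on the Hugging Face Hub:

import torch
from pyannote.audio import Pipeline
from transformers import pipeline

HF_TOKEN = "hf_..."      # placeholder: Hugging Face token with access to the pyannote models
AUDIO = "<file-name>"    # placeholder audio file

# 1. Transcribe with chunk-level timestamps.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)
transcript = asr(AUDIO, chunk_length_s=30, batch_size=24, return_timestamps=True)

# 2. Diarize the same file with pyannote.audio.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN)
diarization = diarizer(AUDIO)

# 3. Naive alignment: label each transcript chunk with the speaker active at its midpoint.
for chunk in transcript["chunks"]:
    start, end = chunk["timestamp"]
    if end is None:      # the final chunk can have an open-ended timestamp
        end = start
    midpoint = (start + end) / 2
    speaker = "unknown"
    for turn, _, label in diarization.itertracks(yield_label=True):
        if turn.start <= midpoint <= turn.end:
            speaker = label
            break
    print(f"{speaker}: {chunk['text']}")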
Key technical notes from the repository
- Device support: NVIDIA CUDA GPUs and macOS MPS (Apple Silicon).
- Optimisations that matter: fp16, batching, BetterTransformer, and Flash Attention 2. The repo shows specific benchmark numbers comparing these combinations; a short BetterTransformer sketch follows this list.
- Flash Attention 2 must be installed carefully; the repo recommends installing it via pipx runpip with the --no-build-isolation flag (the full command is shown in Getting Started below).
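BetterTransformer comes up several times but isn't shown in the repo's snippet below, so here is a rough sketch of enabling it, assuming Optimum is installed; note that recent Transformers releases already default to PyTorch's SDPA attention, which largely supersedes this step:

import torch
from transformers import pipeline

# Plain fp16 pipeline without Flash Attention 2.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Let Optimum swap in BetterTransformer kernels (requires: pip install optimum).
pipe.model = pipe.model.to_bettertransformer()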
Getting Started
Install with pipx (recommended):
# Install a specific version (example from the repo)
pipx install insanely-fast-whisper==0.0.15 --force
# Or install the latest via pipx
pipx install insanely-fast-whisper
# If pipx misparses Python 3.11, force install with pip args
pipx install insanely-fast-whisper --force --pip-args="--ignore-requires-python"
Run transcriptions from any path:
insanely-fast-whisper --file-name <filename or URL>
# Use Flash Attention 2 when available
insanely-fast-whisper --file-name <filename or URL> --flash True
# Run a distil-whisper model
insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <filename or URL>
# Try without installing (one‑off run with pipx)
pipx run insanely-fast-whisper --file-name <filename or URL>
Minimal Python snippet (no CLI) from the repo to run a Transformers ASR pipeline. First install the dependencies:
pip install --upgrade transformers optimum accelerate
Then run:
import torch
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

# Prefer Flash Attention 2 when it's installed; otherwise fall back to PyTorch's SDPA attention.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
)

# Chunk long audio into 30-second windows and transcribe them in batches of 24.
outputs = pipe(
    "<file-name>",  # path or URL to your audio file
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)

print(outputs)
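Continuing from the snippet above, the returned dictionary holds the full transcript plus per‑chunk timestamps; a small follow‑up sketch of reading and saving it (the repo's CLI likewise writes its transcript out as JSON):

import json

# outputs["text"] is the full transcript; outputs["chunks"] carries (start, end) timestamps per chunk.
print(outputs["text"])
for chunk in outputs["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} - {end}] {chunk['text']}")

# Persist the result for later processing.
with open("output.json", "w") as f:
    json.dump(outputs, f, ensure_ascii=False, indent=2)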
Important install tips (from the repo): use
pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation
to install Flash Attention 2 correctly. On Windows you may need to install a CUDA‑compatible PyTorch wheel manually if you hit a “Torch not compiled with CUDA enabled” error. On macOS, reduce the batch size (for example, --batch-size 4) to avoid out‑of‑memory errors on MPS.
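For Apple Silicon, that tip expressed through the underlying pipeline looks roughly like this; the model and file name are placeholders, and if float16 misbehaves on your PyTorch build, drop the torch_dtype argument:

import torch
from transformers import pipeline

# Run on the MPS backend with a smaller batch size to fit Apple Silicon's unified memory.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="mps",
)

outputs = pipe(
    "<file-name>",    # placeholder audio path
    chunk_length_s=30,
    batch_size=4,     # mirrors the macOS tip above
    return_timestamps=True,
)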
Key Features
- Blazing throughput — transcription times dramatically faster than baseline fp32 runs, achieved by combining fp16, batching, BetterTransformer, and Flash Attention 2.
- Simple CLI — opinionated defaults so you can run transcriptions quickly without deep config.
- Model flexibility — switch between large checkpoints and distilled variants for speed/accuracy tradeoffs.
- Diarization support — optional integration with pyannote to split speakers before ASR.
- Device options — explicit support for CUDA device id or MPS for Mac users.
Comparisons in the repository show how the optimisations stack up: for example, large‑v3 in fp32 is dramatically slower than fp16 + batching + Flash Attention 2, and distilled models can reduce runtime further.
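If you want to reproduce that kind of comparison on your own hardware, a rough timing harness along these lines can help; the model, placeholder file name, and backend list are assumptions rather than the repo's benchmarking setup:

import time
import torch
from transformers import pipeline

def time_transcription(attn_implementation, audio_path="<file-name>", batch_size=24):
    # Build a fresh pipeline with the requested attention backend and time a single pass.
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
        model_kwargs={"attn_implementation": attn_implementation},
    )
    start = time.perf_counter()
    pipe(audio_path, chunk_length_s=30, batch_size=batch_size, return_timestamps=True)
    elapsed = time.perf_counter() - start
    del pipe
    torch.cuda.empty_cache()
    return elapsed

# Compare PyTorch's SDPA backend against Flash Attention 2 (requires flash-attn to be installed).
for impl in ("sdpa", "flash_attention_2"):
    print(impl, f"{time_transcription(impl):.1f}s")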
Why It’s Worth Trying
If you need fast, local transcription, this project turns Whisper from a research checkpoint into a practical, high‑throughput tool. The repo presents concrete benchmarks (e.g., transcribing 150 minutes of audio in ~98 seconds on A100 with Flash Attention 2) demonstrating the potential speedups.
The project is community‑driven, with multiple community showcases and forks that extend or adapt the approach. Links provided in the repo include:
- ochen1’s CLI MVP
- arihanv’s Shush app (NextJS + Modal)
- kadirnar’s whisper-plus package
Rather than leaning on star counts or contributor tallies, the repo's strongest signal is its community: active contributions and a growing set of third‑party projects built on top of its ideas.
GitHub Link
The official repository is named insanely-fast-whisper, maintained on GitHub by Vaibhav Srivastav (Vaibhavs10), with the community forks and related implementations linked above building on it. The project itself rests on the core resources referenced throughout this article: Hugging Face Transformers, Optimum, Flash Attention 2, pyannote.audio, and the distil‑whisper checkpoints.
Tip: for the latest CLI release, the repo recommends installing via pipx install insanely-fast-whisper or running the tool one‑off with pipx run insanely-fast-whisper.
Final Thoughts
Insanely Fast Whisper is a pragmatic, performance‑first wrapper that brings Whisper models into a production‑friendly CLI. Its opinionated design, strong optimisation recipes, and community contributions make it a great starting point if you need to transcribe a lot of audio fast and on your own hardware. Follow the repo’s installation notes carefully (especially around Flash Attention 2 and PyTorch/CUDA on Windows) and experiment with model choice and batch size to find the best mix of speed and accuracy for your workload.
Want to experiment? Start with the pipx install commands above and try the Flash Attention flag on hardware that supports it — the repo’s benchmarks show the payoff can be huge.