Vox
2026-05-15
On any given week I sit through more meetings than I can remember by Friday: a standup over Meet, a check-in over Zoom, two phone calls with a colleague who prefers to talk while he walks, half a dozen WhatsApp voice notes that arrive as a wall of green bubbles. The official tools transcribe roughly none of that. The phone calls live as .m4a files on my phone; the WhatsApp audio lives wherever WhatsApp decides; the Meet captions, when they exist, expire when the tab closes. My workflow had become embarrassingly manual: record on the phone, rename to 2026-02-14 Pedro Investor Call.m4a, AirDrop to the laptop, drop into a folder, and then nothing. The folder grew. I could not search it.
The cloud answer is to send everything to a transcription API and pay per minute. At my volume, and for material that includes things I would rather not hand to a third-party endpoint, that was not the answer. The open-source answer was supposed to be a CLI you point at a folder. What I found instead was a small zoo of GUIs whose install instructions started with a CUDA wheel and ended with an Electron window, none of which did batch, most of which did not do speaker labels, and none of which I would trust to run on five hours of audio overnight. So I wrote vox-transcribe: a local CLI for Apple Silicon that takes an audio file, gives back a transcript with timestamps and speaker labels, and ships through my own Homebrew tap so installing it is one line.
The interesting constraint on a Mac is that the pipeline does not run on a single device. Transcription uses faster-whisper, the maintained reimplementation of Whisper that most projects converged on after the original openai/whisper repo went quiet, built on CTranslate2, a C++ inference engine. CTranslate2 has no Metal backend; on Apple Silicon it falls back to CPU. The naive thing to do is run the whole pipeline on CPU and accept the speed. The better thing to do is notice that the other two stages, alignment (WhisperX's wav2vec2 aligner) and diarization (pyannote.audio), are plain PyTorch, and plain PyTorch on Apple Silicon runs perfectly well on MPS, the Metal Performance Shaders backend.
So get_device() returns a tuple, not a single device. On a Mac it returns ("cpu", "mps"): Whisper transcription runs on CPU with int8 quantization, alignment and diarization run on MPS, results are merged in plain Python. On a CUDA box it returns ("cuda", "cuda") and everything moves to the GPU. On a plain Linux laptop it returns ("cpu", "cpu") and the pipeline still works, slower. Each stage runs on the best-supported hardware path the machine offers, and nothing in the calling code knows the routing happened. The real-time factor on an M-series MacBook lands around 0.16x on the benchmark set, which means a one-hour recording transcribes in roughly ten minutes and a five-hour overnight batch is done before breakfast.
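A minimal sketch of that routing, not the code from the repo. The pick_devices helper is a hypothetical name I am using to keep the decision testable without hardware; the only PyTorch calls are torch.cuda.is_available() and torch.backends.mps.is_available().

```python
from typing import Tuple


def pick_devices(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Return (transcription_device, torch_device) for the given hardware flags.

    pick_devices is a hypothetical helper for illustration only.
    """
    if cuda_available:
        # Everything moves to the GPU.
        return ("cuda", "cuda")
    if mps_available:
        # CTranslate2 has no Metal backend, so Whisper stays on CPU while
        # the plain-PyTorch stages (alignment, diarization) use MPS.
        return ("cpu", "mps")
    # Plain CPU machine: still works, just slower.
    return ("cpu", "cpu")


def get_device() -> Tuple[str, str]:
    """Probe PyTorch if it is installed; otherwise assume CPU only."""
    try:
        import torch
    except ImportError:
        return pick_devices(False, False)
    return pick_devices(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
```

The caller unpacks the pair once, hands the first element to the transcription stage and the second to everything else, and never branches on platform again.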
WhisperX gives back segments with per-word speaker assignments. Raw, that structure is JSON, which is the right shape for the next program and the wrong shape for a human three weeks later. The default .txt formatter walks the segments, collapses consecutive same-speaker runs into a single paragraph, and prefixes each paragraph with [HH:MM:SS.mmm] SPEAKER_00:. That is the difference between an artifact you grep through and one you skim to find the moment the investor asked about pricing. SRT is available for video subtitle workflows; JSON is available for piping into the next tool. The format I care about is the one that lets me re-enter a meeting an hour into reading it.
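The collapsing step is small enough to sketch. This assumes each segment is a dict with start (seconds), speaker, and text keys; the UNKNOWN fallback label and both function names are mine, for illustration, and are not taken from the repo.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_transcript(segments: List[Dict]) -> str:
    """Merge consecutive same-speaker segments into one paragraph each."""
    paragraphs = []  # each entry: [start_seconds, speaker, [text pieces]]
    for seg in segments:
        # Assumption: segments without a speaker get a placeholder label.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if paragraphs and paragraphs[-1][1] == speaker:
            # Same speaker as the previous segment: extend the paragraph.
            paragraphs[-1][2].append(text)
        else:
            # Speaker changed: start a new paragraph at this timestamp.
            paragraphs.append([seg["start"], speaker, [text]])
    return "\n\n".join(
        f"[{format_timestamp(start)}] {speaker}: {' '.join(texts)}"
        for start, speaker, texts in paragraphs
    )
```

Each paragraph keeps the timestamp of its first segment, which is the one you need when jumping back into the audio.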
The diarization models are gated. To get pyannote/speaker-diarization-3.1 from Hugging Face you have to make an account, accept three different license agreements (one for the pipeline, one for the segmentation model, one for the speaker embedding model), generate a token, and export it as an environment variable. That is a fine workflow for a researcher. It is a non-starter for brew install and go. The whole point of shipping through a package manager is that someone who has never heard of Hugging Face can run one command and have a working tool.
The fix is that the maintainer pays the gate cost once and ships the models inside the wheel. scripts/snapshot_models.py downloads all three pyannote artifacts, drops them under src/vox_transcribe/models/, then patches the top-level config.yaml to point at relative paths instead of HF model IDs. uv build packs the result into a fat wheel. At load time, the pipeline checks whether the bundled models exist on disk and uses them if they do, falling back to the HF download path if they do not. The end user runs the brew command and the models are already there.
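The load-time check amounts to a few lines. A sketch under assumptions: the function name, the ("bundled" | "hub", location) return shape, and the rule that a bundle counts as present when config.yaml and at least one pytorch_model.bin exist are all hypothetical, not lifted from the repo.

```python
from pathlib import Path
from typing import Tuple

# Hub ID named in the text; used only as the fallback.
HF_PIPELINE_ID = "pyannote/speaker-diarization-3.1"


def choose_model_source(models_dir: Path) -> Tuple[str, str]:
    """Prefer models bundled in the wheel, else fall back to the HF download path.

    Hypothetical helper: returns ("bundled", path_to_config) or ("hub", model_id).
    """
    config = models_dir / "config.yaml"
    has_weights = models_dir.is_dir() and any(
        models_dir.rglob("pytorch_model.bin")
    )
    if config.is_file() and has_weights:
        return ("bundled", str(config))
    return ("hub", HF_PIPELINE_ID)
```

Keeping the fallback means a source checkout without the snapshot still works for anyone who does have a token.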
The specific landmine, which cost me an afternoon, is what you point the loader at. Pyannote's Pipeline.from_pretrained is clever in unhelpful ways. Pointing it at a config.yaml causes a pickle error: the loader sees a path and calls torch.load on the YAML text. Pointing it at the model directory causes it to consult HF Hub for updates and fail when there is no network or no token. Pointing it at the pytorch_model.bin inside that directory is the only path that loads purely locally. The comment in core.py calls this the sweet spot, with the quotation marks earned. The bundled-model patch walks the YAML, finds the relative references, and rewrites them to the absolute .bin paths on the user's machine.
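A sketch of the rewrite, operating on the already-parsed config (a nested dict) so it needs no YAML library. The function name is hypothetical, and the rule it applies (any string that resolves, relative to the config's directory, to a .bin file or to a directory holding pytorch_model.bin) is my assumption about what counts as a model reference.

```python
from pathlib import Path
from typing import Any


def absolutize_model_refs(node: Any, base_dir: Path) -> Any:
    """Return a copy of a parsed config with model references made absolute.

    Hypothetical helper. Strings that do not resolve to a weights file
    (class names, hub IDs, plain settings) are returned unchanged.
    """
    if isinstance(node, dict):
        return {k: absolutize_model_refs(v, base_dir) for k, v in node.items()}
    if isinstance(node, list):
        return [absolutize_model_refs(v, base_dir) for v in node]
    if isinstance(node, str):
        candidate = base_dir / node
        if candidate.is_dir():
            # A model directory: point at the weights file inside it.
            candidate = candidate / "pytorch_model.bin"
        if candidate.suffix == ".bin" and candidate.is_file():
            return str(candidate.resolve())
    return node
```

Running this at load time rather than at build time matters because the absolute path depends on where the package was installed on the user's machine.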
Releases go through a separate homebrew-tap repo. Bump the version, run the snapshot script, uv build, tag, draft a release with the wheel attached. A GitHub Action fires, computes the SHA256, patches the formula in the tap repo, commits. End users run brew upgrade vox-transcribe.
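The hash-and-patch step, sketched in Python for illustration. The real Action may well do this in shell; the function names are hypothetical, and the only thing assumed about the formula is that it carries the usual url "..." and sha256 "..." lines.

```python
import hashlib
import re
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so a fat wheel never sits in memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def patch_formula(formula: str, url: str, sha256: str) -> str:
    """Replace the first url and sha256 lines of a Homebrew formula's text."""
    formula, n_url = re.subn(
        r'^([ \t]*url[ \t]+)"[^"]*"',
        lambda m: f'{m.group(1)}"{url}"',
        formula, count=1, flags=re.MULTILINE,
    )
    formula, n_sha = re.subn(
        r'^([ \t]*sha256[ \t]+)"[^"]*"',
        lambda m: f'{m.group(1)}"{sha256}"',
        formula, count=1, flags=re.MULTILINE,
    )
    if n_url != 1 or n_sha != 1:
        # Refuse to commit a formula that was only half updated.
        raise ValueError("formula is missing a url or sha256 line")
    return formula
```

Failing loudly when either line is missing is the point: a formula with a new URL and a stale hash breaks every install.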
PyTorch 2.6 and later default torch.load to weights_only=True: it refuses to unpickle arbitrary Python objects, because pickle is an arbitrary code execution vector when the file came from somewhere you do not trust. Pyannote's models predate that default and use the old pickle format. They fail to load under the new rule. The tempting fix is one global line near the top of the program that switches the check off for the whole process. The right fix is a context manager: unsafe_torch_load() monkey-patches torch.load for the duration of one model load and restores the original inside finally. Everywhere else in the process, the security check is still on. The unsafe window is as narrow as it can be made without forking pyannote.
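A minimal sketch of that pattern, not the code from core.py. Taking the module as a parameter is my assumption, made so the example runs and can be tested without PyTorch installed; forcing weights_only=False even when a caller passes its own value is also a choice of this sketch.

```python
import contextlib
import functools


@contextlib.contextmanager
def unsafe_torch_load(torch_module):
    """Temporarily make torch_module.load run with weights_only=False.

    Sketch only: pass the real torch module in practice. The original
    loader is restored in finally, even if the model load raises.
    """
    original = torch_module.load

    @functools.wraps(original)
    def permissive_load(*args, **kwargs):
        # Override whatever the caller asked for: legacy pickled
        # checkpoints cannot load under weights_only=True.
        kwargs["weights_only"] = False
        return original(*args, **kwargs)

    torch_module.load = permissive_load
    try:
        yield
    finally:
        torch_module.load = original
```

In use, the with block wraps exactly one trusted model load, for example with unsafe_torch_load(torch): followed by the pipeline load on the next line. Anything that calls torch.load outside the block gets the strict default again.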
The benchmark numbers tell on the tool. On clean read speech (JFK reading from a prepared text) the word-error rate is 4.55%. On the LibriSpeech and Harvard sentence sets, both also clean and read, it sits around 15%. On a real two-speaker telephone conversation, the kind of audio I actually feed it, WER jumps to 28.75%. "This is Diane in New Jersey" becomes "This is Daddy and a New Jersey." Crosstalk, accents, the lossy codec on the phone leg, the speakerphone two feet from the mic: each of those is a known failure mode for Whisper and pyannote individually, and they compound. The tool I built is most useful on the audio it is least accurate on. The transcript is a first draft, not a court record, and the timestamps are what make the first draft usable: when a sentence reads as nonsense, I scrub the original audio at the marked second and listen.
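For reference, word-error rate is word-level edit distance divided by the number of reference words. A self-contained sketch follows; the normalization here (lowercase, punctuation stripped) is an assumption and may differ from what the benchmark script does.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split into words, dropping punctuation (assumed scheme)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance over words: substitutions + insertions + deletions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate as a fraction of the reference length."""
    ref = normalize(reference)
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_errors(ref, normalize(hypothesis)) / len(ref)


# The example sentence from above: two substitutions plus one insertion
# against six reference words.
print(wer("This is Diane in New Jersey", "This is Daddy and a New Jersey."))  # 0.5
```

That single sentence scores 50%, which shows how a few bad words in a short utterance swing the number; the 28.75% figure is the average over the whole conversation.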
Source at github.com/ma-r-s/vox-transcribe. Install with brew tap ma-r-s/tap && brew install vox-transcribe.
CC BY-NC 4.0 © ma-r-s