Emacs and whisper.el: Trying out different speech-to-text backends and models
I was curious about parakeet because I heard that it was faster than Whisper on the HuggingFace leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), my results seemed a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.
In order to try natrys/whisper.el with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick and dirty code for doing so, in case you want to try it out for yourself.
(defvar my-whisper-url-format "http://%s:%d/transcribe")

(defun whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))

;; Let arbitrary model names through instead of validating them.
(defun whisper--check-model-consistency () t)
Then I have this function for trying things out.
(defun my-test-whisper-api (url &optional args)
  "Send `whisper--temp-file' to URL via curl, with any extra curl ARGS."
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append
            (mapcan (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
            (mapcan (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
            args))
    (message "%s %s" (buffer-string) url)))
Here's the audio file I used (2026-01-19-14-17-53.wav, around 10 seconds long). I run each benchmark 3 times and report the average time.
Code for running the benchmarks:
;; `times' is the number of runs to average over.
(let ((times 3))
  (mapcar
   (lambda (group)
     (let ((whisper--temp-file "/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav"))
       ;; warm up the model
       (eval (cadr group))
       (list
        (format "%.3f"
                (/ (car (benchmark-call (lambda () (eval (cadr group))) times))
                   times))
        (car group))))
   '(("parakeet"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 5092)))
     ("whisper.cpp base-q4_0"
      (my-test-whisper-api
       (format "http://%s:%d/inference" whisper-server-host 8642)))
     ("speaches whisper-base"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-base")))
     ("speaches whisper-base.en"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-base.en")))
     ("speaches whisper-small"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-small")))
     ("speaches whisper-small.en"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-small.en")))
     ("speaches lorneluo/whisper-small-ct2-int8"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=lorneluo/whisper-small-ct2-int8")))
     ;; needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
     ("whisperx-server Systran/faster-whisper-small"
      (my-test-whisper-api
       (format "http://%s:%d/transcribe" whisper-server-host 8002))))))
| Average time (s) | Backend and model |
|------------------|-------------------|
| 3.694 | parakeet |
| 2.484 | whisper.cpp base-q4_0 |
| 1.547 | speaches whisper-base |
| 1.425 | speaches whisper-base.en |
| 4.076 | speaches whisper-small |
| 3.735 | speaches whisper-small.en |
| 2.870 | speaches lorneluo/whisper-small-ct2-int8 |
| 4.537 | whisperx-server Systran/faster-whisper-small |
I tried it with:
- parakeet
- whisper.cpp (as whisper.el sets it up)
- speaches, which is a front-end for faster-whisper, and
- whisperx-server, which is a front-end for whisperx
Looks like speaches + faster-whisper-base is the winner for now. I like how speaches lets me switch models on the fly, so maybe I can use base.en generally and switch to base when I want to try dictating in French. Here's how I've configured whisper.el to use that server.
(setq whisper-server-port 8001
      whisper-model "Systran/faster-whisper-base.en"
      my-whisper-url-format "http://%s:%d/v1/audio/transcriptions")
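Since speaches loads whichever model the request names, switching languages is mostly a matter of changing two variables. Here's a little helper sketch for flipping between English and French dictation (the function name is my own invention):

;; Hypothetical helper: toggle between the English-only and
;; multilingual base models, since speaches picks the model
;; per request.
(defun my-whisper-toggle-french ()
  "Toggle whisper.el between English and French dictation."
  (interactive)
  (if (string= whisper-language "fr")
      (setq whisper-language "en"
            whisper-model "Systran/faster-whisper-base.en")
    (setq whisper-language "fr"
          whisper-model "Systran/faster-whisper-base")))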
At some point, I'll override whisper--ensure-server so that starting it up is smoother.
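Here's a rough sketch of what that override might look like: check whether anything is listening on the port, and start the server if not. The actual command for launching speaches is a placeholder; adjust my-speaches-command to however you run it.

(defvar my-speaches-command '("speaches")
  "Placeholder command list for starting the speaches server; adjust as needed.")

(defun my-whisper-ensure-server (&rest _)
  "Start the speaches server if nothing is listening on the whisper port."
  (unless (ignore-errors
            ;; If we can open a connection, the server is already up.
            (let ((proc (open-network-stream
                         "whisper-port-check" nil
                         whisper-server-host whisper-server-port)))
              (delete-process proc)
              t))
    (apply #'start-process "speaches" "*speaches*" my-speaches-command)
    ;; Give the server a moment to come up before the first request.
    (sleep-for 2)))

(advice-add 'whisper--ensure-server :override #'my-whisper-ensure-server)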
Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. I haven't figured out how to get the GPU working under Ubuntu yet.