Emacs and whisper.el: Trying out different speech-to-text backends and models
| audio, emacs, speech-recognition

I was curious about parakeet because I heard that it was faster than Whisper on the HuggingFace leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), my results seemed a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.
In order to try natrys/whisper.el with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick and dirty code for doing so, in case you want to try it out for yourself.
(defvar my-whisper-url-format "http://%s:%d/transcribe")

;;;###autoload
(defun my-whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))
;;;###autoload
(defun my-whisper--check-model-consistency ()
  "Skip whisper.el's model-name validation so other backends' model names work."
  t)
I moved my server code to speech-input-speaches.el.
(with-eval-after-load 'whisper
  (advice-add 'whisper--transcribe-via-local-server :override #'my-whisper--transcribe-via-local-server)
  (advice-add 'whisper--check-model-consistency :override #'my-whisper--check-model-consistency)
  (advice-add 'whisper--ensure-server :override #'speech-input-speaches-server-start))
Then I have this function for trying things out.
(defun my-test-whisper-api (url &optional args)
  "Send `whisper--temp-file' to URL via curl, appending extra curl ARGS."
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append (mapcan
                    (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
                   (mapcan
                    (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
                   args))
    (message "%s %s" (buffer-string) url)))
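For example, here's roughly how I could call it against one of the speaches servers from the benchmarks below. This assumes the server is already listening on port 8001; the recording path here is just a placeholder.

```elisp
;; Assumes a speaches server is listening on port 8001, as in the
;; benchmarks below; the recording path is a placeholder.
(let ((whisper--temp-file "/home/sacha/recordings/whisper/example.wav"))
  (my-test-whisper-api
   (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
   (list "-F" "model=Systran/faster-whisper-base.en")))
```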
Here's the audio file. It is around 10 seconds long. I ran each benchmark 3 times and report the average time in seconds.
Download
Code for running the benchmarks
(let ((times 3))
  (mapcar
   (lambda (group)
     (let ((whisper--temp-file "/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav"))
       ;; warm up the model
       (eval (cadr group))
       (list
        (format "%.3f"
                (/ (car (benchmark-call (lambda () (eval (cadr group))) times))
                   times))
        (car group))))
   '(("parakeet"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 5092)))
     ("whisper.cpp base-q4_0"
      (my-test-whisper-api
       (format "http://%s:%d/inference" whisper-server-host 8642)))
     ("speaches whisper-base"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-base")))
     ("speaches whisper-base.en"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-base.en")))
     ("speaches whisper-small"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-small")))
     ("speaches whisper-small.en"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=Systran/faster-whisper-small.en")))
     ("speaches lorneluo/whisper-small-ct2-int8"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
       (list "-F" "model=lorneluo/whisper-small-ct2-int8")))
     ;; needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
     ("whisperx-server Systran/faster-whisper-small"
      (my-test-whisper-api
       (format "http://%s:%d/transcribe" whisper-server-host 8002))))))
| Average (s) | Backend |
|-------------+---------|
| 3.694 | parakeet |
| 2.484 | whisper.cpp base-q4_0 |
| 1.547 | speaches whisper-base |
| 1.425 | speaches whisper-base.en |
| 4.076 | speaches whisper-small |
| 3.735 | speaches whisper-small.en |
| 2.870 | speaches lorneluo/whisper-small-ct2-int8 |
| 4.537 | whisperx-server Systran/faster-whisper-small |
All the results above were CPU-only, no GPU acceleration.
I tried it with:
- parakeet
- whisper.cpp (as whisper.el sets it up)
- speaches, which is a front-end for faster-whisper, and
- whisperx-server, which is a front-end for whisperx
Update: After a lot of rebooting, I figured out how to get the Nvidia P1000 GPU on my Lenovo P52 working with models up to Whisper medium. Large models don't fit in its 4GB of VRAM. Here are the benchmarking results:
Benchmarking code
(let ((times 3))
  (mapcar
   (lambda (group)
     (let ((whisper--temp-file "/tmp/2026-01-19-14-17-53.wav"))
       ;; warm up the model
       (eval (cadr group))
       (list
        (format "%.3f"
                (/ (car (benchmark-call (lambda () (eval (cadr group))) times))
                   times))
        (car group))))
   '(("speaches whisper-tiny"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
       (list "-F" "model=Systran/faster-whisper-tiny.en")))
     ("speaches whisper-base"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
       (list "-F" "model=Systran/faster-whisper-base.en")))
     ("speaches whisper-small"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
       (list "-F" "model=Systran/faster-whisper-small.en")))
     ("speaches whisper-medium"
      (my-test-whisper-api
       (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
       (list "-F" "model=Systran/faster-whisper-medium.en"))))))
| Average (s) | Backend |
|-------------+---------|
| 0.596 | speaches whisper-tiny |
| 0.940 | speaches whisper-base |
| 2.909 | speaches whisper-small |
| 8.740 | speaches whisper-medium |
That probably means I can upgrade to using the small model on a regular basis. For large models, I can use the command-line tool, since the overhead of loading the model is probably small compared to the processing time. I could also run two Speaches servers (one on CPU and one on GPU), but I don't think I'll need that level yet.
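For the large-model case, here's a rough, untested sketch of what calling the command-line tool from Emacs could look like. It assumes whisper.cpp's whisper-cli binary is on my PATH and that I've downloaded a large model; the model path is a placeholder.

```elisp
;; A sketch, not tested: assumes whisper.cpp's whisper-cli is on PATH,
;; and the model path below is a placeholder.
(defun my-whisper-transcribe-with-large (file)
  "Transcribe FILE with a large model via the whisper.cpp command line."
  (interactive "fAudio file: ")
  (with-current-buffer (get-buffer-create "*whisper-large*")
    (erase-buffer)
    ;; -m: model file, -f: input audio, -nt: plain text without timestamps
    (call-process "whisper-cli" nil t nil
                  "-m" "/path/to/ggml-large-v3.bin"
                  "-f" (expand-file-name file)
                  "-nt")
    (display-buffer (current-buffer))))
```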
I like how speaches lets me switch models on the fly, so maybe I can use small.en generally and switch to a different model when I want to try dictating in French. Here's how I've set it up to use the server.
(setq whisper-server-port 8000
      whisper-model "Systran/faster-whisper-small.en"
      my-whisper-url-format "http://%s:%d/v1/audio/transcriptions")
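Since speaches picks the model per request, switching to French could be as simple as changing a couple of variables. This is just a sketch; the multilingual small model is the one from the benchmarks above.

```elisp
;; Sketch: switch between English and multilingual models on the fly.
(defun my-whisper-use-french ()
  (interactive)
  (setq whisper-model "Systran/faster-whisper-small"
        whisper-language "fr"))

(defun my-whisper-use-english ()
  (interactive)
  (setq whisper-model "Systran/faster-whisper-small.en"
        whisper-language "en"))
```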
Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. It has a soldered-on Nvidia P1000 with 4GB of VRAM.