Emacs and whisper.el: Trying out different speech-to-text backends and models

| audio, emacs, speech-recognition

I was curious about Parakeet because I had heard that it was faster than Whisper on the Hugging Face leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), it seemed a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.

In order to try natrys/whisper.el with other backends, I needed to work around how whisper.el validates model names and sends requests to the servers. Here's the quick-and-dirty code for doing so, in case you want to try it yourself.

(defvar my-whisper-url-format "http://%s:%d/transcribe"
  "Format string for the transcription endpoint, filled in with host and port.")
;;;###autoload
(defun my-whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))
;;;###autoload
(defun my-whisper--check-model-consistency ()
  "Skip whisper.el's model consistency check so any model name is accepted."
  t)

I moved my server code to speech-input-speaches.el.

(with-eval-after-load 'whisper
  (advice-add 'whisper--transcribe-via-local-server :override #'my-whisper--transcribe-via-local-server)
  (advice-add 'whisper--check-model-consistency :override #'my-whisper--check-model-consistency)
  (advice-add 'whisper--ensure-server :override #'speech-input-speaches-server-start))
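If I ever want to go back to whisper.el's stock behaviour, the same functions can be detached with `advice-remove`. A quick sketch (the command name here is made up):

```elisp
(defun my-whisper-remove-overrides ()
  "Remove my overrides and restore whisper.el's original behaviour."
  (interactive)
  (advice-remove 'whisper--transcribe-via-local-server #'my-whisper--transcribe-via-local-server)
  (advice-remove 'whisper--check-model-consistency #'my-whisper--check-model-consistency)
  (advice-remove 'whisper--ensure-server #'speech-input-speaches-server-start))
```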

Then I have this function for trying things out.

(defun my-test-whisper-api (url &optional args)
  "Send `whisper--temp-file' to the transcription endpoint at URL.
ARGS is a list of extra command-line arguments to pass to curl."
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append (mapcan
                    (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
                   (mapcan
                    (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
                   args))
    (message "%s %s" (buffer-string) url)))
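For example, here's how I might call it against a whisper.cpp server on port 8642 (the port matches the benchmarks below, but the file path is just a placeholder; adjust both to your setup):

```elisp
;; Hypothetical usage: point whisper--temp-file at a recording and
;; send it to a local whisper.cpp server's /inference endpoint.
(let ((whisper--temp-file "/tmp/test-recording.wav"))
  (my-test-whisper-api
   (format "http://%s:%d/inference" whisper-server-host 8642)))
```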

Here's the audio file. It is around 10 seconds long. I run each benchmark 3 times and report the average time in seconds.

Download

Code for running the benchmarks
(let ((times 3))
(mapcar
 (lambda (group)
   (let ((whisper--temp-file "/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav"))
     ;; warm up the model
     (eval (cadr group))
     (list
      (format "%.3f"
              (/ (car
                  (benchmark-call (lambda () (eval (cadr group))) times))
                 times))
      (car group))))
 '(
   ("parakeet"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 5092)))
   ("whisper.cpp base-q4_0"
    (my-test-whisper-api
     (format "http://%s:%d/inference" whisper-server-host 8642)))
   ("speaches whisper-base"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base")))
   ("speaches whisper-base.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base.en")))
   ("speaches whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small")))
   ("speaches whisper-small.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small.en")))
   ("speaches lorneluo/whisper-small-ct2-int8"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=lorneluo/whisper-small-ct2-int8")))
   ;; needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
   ("whisperx-server Systran/faster-whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/transcribe" whisper-server-host 8002)))))
)
3.694 parakeet
2.484 whisper.cpp base-q4_0
1.547 speaches whisper-base
1.425 speaches whisper-base.en
4.076 speaches whisper-small
3.735 speaches whisper-small.en
2.870 speaches lorneluo/whisper-small-ct2-int8
4.537 whisperx-server Systran/faster-whisper-small

All the results above were CPU-only, no GPU acceleration.

Update: After a lot of rebooting, I figured out how to get the NVIDIA P1000 GPU on my Lenovo P52 to work with models up to Whisper medium. The large models don't fit in its 4GB of VRAM. Here are the benchmarking results (average seconds per run):

Benchmarking code
(let ((times 3))
(mapcar
 (lambda (group)
   (let ((whisper--temp-file "/tmp/2026-01-19-14-17-53.wav"))
     ;; warm up the model
     (eval (cadr group))
     (list
      (format "%.3f"
              (/ (car
                  (benchmark-call (lambda () (eval (cadr group))) times))
                 times))
      (car group))))
 '(
   ("speaches whisper-tiny"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
     (list "-F" "model=Systran/faster-whisper-tiny.en")))
   ("speaches whisper-base"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
     (list "-F" "model=Systran/faster-whisper-base.en")))
   ("speaches whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
     (list "-F" "model=Systran/faster-whisper-small.en")))
   ("speaches whisper-medium"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" "localhost" 8000)
     (list "-F" "model=Systran/faster-whisper-medium.en"))))))
0.596 speaches whisper-tiny
0.940 speaches whisper-base
2.909 speaches whisper-small
8.740 speaches whisper-medium

That probably means I can use the small model on a regular basis. For large models, I can use the command-line tool, since the overhead of loading the model is probably small compared to the processing time. I could also run two speaches servers (one on CPU and one on GPU), but I don't think I'll need that yet.

I like how speaches lets me switch models on the fly, so maybe I can use small.en generally and switch to a different model when I want to try dictating in French. Here's how I've set it up to use the server.

(setq whisper-server-port 8000
      whisper-model "Systran/faster-whisper-small.en"
      my-whisper-url-format "http://%s:%d/v1/audio/transcriptions")
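Since speaches loads whichever model the request names, switching languages could be as simple as a pair of commands. A sketch (the command names are made up, and faster-whisper-small is just one multilingual option):

```elisp
(defun my-whisper-french ()
  "Dictate in French using a multilingual model."
  (interactive)
  (setq whisper-model "Systran/faster-whisper-small"
        whisper-language "fr"))

(defun my-whisper-english ()
  "Switch back to the English-only small model."
  (interactive)
  (setq whisper-model "Systran/faster-whisper-small.en"
        whisper-language "en"))
```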

Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. It has a soldered-on Nvidia P1000 with 4GB of VRAM.

This is part of my Emacs configuration.
View Org source for this post