Emacs and whisper.el: Trying out different speech-to-text backends and models


I was curious about parakeet because I heard that it was faster than Whisper on the HuggingFace leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), it seemed like my results were a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.

In order to try natrys/whisper.el with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick and dirty code for doing so, in case you want to try it out for yourself.

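;; Evaluate these after whisper.el is loaded so that they replace its definitions.
(defvar my-whisper-url-format "http://%s:%d/transcribe"
  "Format string for the transcription URL; takes the server host and port.")

;; Redefine how whisper.el sends requests so that the URL comes from
;; `my-whisper-url-format'.
(defun whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))

;; Skip the model name validation so that other backends' model names are accepted.
(defun whisper--check-model-consistency ()
  "Always report the model as consistent."
  t)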

Then I have this function for trying things out.

(defun my-test-whisper-api (url &optional args)
  "Send `whisper--temp-file' to URL with curl and echo the response.
ARGS is a list of extra curl arguments, such as (\"-F\" \"model=NAME\")."
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append (mapcan
                    (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
                   (mapcan
                    (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
                   args))
    (message "%s %s" (buffer-string) url)))

The test audio file is around 10 seconds long. I ran each benchmark 3 times and reported the average time.

Code for running the benchmarks
(mapcar
 (lambda (group)
   (let ((whisper--temp-file "/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav")
         ;; number of timed runs to average over
         (times 3))
     ;; warm up the model
     (eval (cadr group))
     (list
      (format "%.3f"
              (/ (car
                  (benchmark-call (lambda () (eval (cadr group))) times))
                 times))
      (car group))))
 '(
   ("parakeet"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 5092)))
   ("whisper.cpp base-q4_0"
    (my-test-whisper-api
     (format "http://%s:%d/inference" whisper-server-host 8642)))
   ("speaches whisper-base"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base")))
   ("speaches whisper-base.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base.en")))
   ("speaches whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small")))
   ("speaches whisper-small.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small.en")))
   ("speaches lorneluo/whisper-small-ct2-int8"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=lorneluo/whisper-small-ct2-int8")))
   ;; needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
   ("whisperx-server Systran/faster-whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/transcribe" whisper-server-host 8002)))))

Time (s)  Backend and model
3.694     parakeet
2.484     whisper.cpp base-q4_0
1.547     speaches whisper-base
1.425     speaches whisper-base.en
4.076     speaches whisper-small
3.735     speaches whisper-small.en
2.870     speaches lorneluo/whisper-small-ct2-int8
4.537     whisperx-server Systran/faster-whisper-small

I tried it with parakeet, whisper.cpp, speaches, and whisperx-server.

Looks like speaches + faster-whisper-base is the winner for now. I like how speaches lets me switch models on the fly, so maybe I can use base.en generally and switch to base when I want to try dictating in French. Here's how I've configured whisper.el to use that server.

(setq whisper-server-port 8001
      whisper-model "Systran/faster-whisper-base.en"
      my-whisper-url-format "http://%s:%d/v1/audio/transcriptions")
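
Since the overridden request function sends the model and language with every request, switching between base.en and base could be as small as flipping two variables. Here's a rough sketch of what that might look like. The command name my-whisper-toggle-french is made up for illustration, and it assumes that whisper.el is loaded and that "fr" and "en" are the language codes the server expects.

```elisp
;; These declarations only silence the byte compiler;
;; whisper.el is assumed to define both variables.
(defvar whisper-model)
(defvar whisper-language)

(defun my-whisper-toggle-french ()
  "Switch between English (base.en) and French (base) dictation.
Hypothetical helper, not part of whisper.el."
  (interactive)
  (if (equal whisper-language "fr")
      ;; Currently French: go back to the English-only model.
      (setq whisper-model "Systran/faster-whisper-base.en"
            whisper-language "en")
    ;; Otherwise switch to the multilingual base model for French.
    (setq whisper-model "Systran/faster-whisper-base"
          whisper-language "fr"))
  (message "whisper: %s (%s)" whisper-model whisper-language))
```

Running M-x my-whisper-toggle-french twice should bring things back to the English setup.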

At some point, I'll override whisper--ensure-server so that starting it up is smoother.
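
One possible starting point for that is a check for whether anything is listening on the server port. This is only a sketch: my-whisper-server-running-p is a made-up name, it doesn't start the server, and it can't tell whether the thing listening is actually speaches.

```elisp
(defun my-whisper-server-running-p (host port)
  "Return non-nil if something accepts TCP connections on HOST at PORT.
Hypothetical helper, not part of whisper.el."
  (condition-case nil
      (let ((proc (open-network-stream "my-whisper-probe" nil host port)))
        ;; The connection succeeded, so close it right away.
        (delete-process proc)
        t)
    ;; A refused or failed connection signals an error.
    (error nil)))
```

For example, (my-whisper-server-running-p whisper-server-host whisper-server-port) could decide whether the server needs to be started before transcribing.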

Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. I haven't figured out how to get the GPU working under Ubuntu yet.
