Queuing multiple transcriptions with whisper.el speech recognition
| audio, speech, emacs
I want to be able to talk out loud and have the ideas go into Emacs. I can do this in a number of different ways:
- I briefly demonstrated a step-by-step approach with natrys/whisper.el with a single file. I press a keyboard shortcut to start the recording, another shortcut to stop the recording, and it transcribes it in the background. But the way whisper.el is set up, if I press the keyboard shortcut to start recording again while it's still transcribing, it offers to interrupt the transcription process, which is not what I want. I want to just keep talking and have it process results as things come in.
- I'm also experimenting with Google Chrome's web speech API to do continuous speech recognition, which I can get into Emacs using a web socket.
- What I've just figured out is how to layer a semi-continuous interface for speech recognition on top of whisper.el so that while it's processing in the background, I can just press a keyboard shortcut (I'm using numpad 9 to call my-whisper-continue) to stop the previous recording, queue it for processing, and start the next recording. If I use this keyboard shortcut to separate my thoughts, then Whisper has a much easier time making sense of the whole sentence or paragraph or whatever, instead of trying to use the sliding 30-second context window that many streaming approaches to speech recognition try to use.
Question: Did you fix the keyboard delay you've got while speech catches what you're saying?
Sometimes, when the speech recognition kicks in, my computer gets busy. When my computer gets really busy, it doesn't process my keystrokes in the right order, which is very annoying because then I have to delete the previous word and retype it. I haven't sorted that out yet, but it seems like I probably have to lower the priority on different processes. On the plus side, as I mentioned, if I dictate things instead of typing them, then I don't run into that problem at all.
Also, other notes on delays: The continuous speech recognition via Google Chrome shows up fairly quickly, but it's not very precise, and it doesn't have punctuation. Even if there's a little bit of a delay, as long as I press the my-whisper-continue shortcut after each thought, then I can get that text into my Emacs buffer using the nicer transcription from my selected model. There is going to be a bit of a delay for that one because it gets processed at the end of the thought. Also, I need to start thinking in complete sentences instead of just adding one clause after the other as my brain goes on all of these tangents. I think it's pretty promising. There's the continuous speech recognition via Google Chrome if I don't mind the lower accuracy and lack of punctuation, and I can still get the pretty version on the other side.
Why talk out loud? I liked the Bookclub Tapas presentation that Maddie Sullivan did at EmacsConf 2025. Talking out loud helps me be a lot more verbose about what I'm saying, compared to typing things out or even like having to switch to my notes or interrupting my screen with an Org capture buffer. Of course I want to clean that up for putting into a blog post, but given that my life still sometimes has random interruptions from a kiddo who must have my attention at that very minute, having that kind of record that I can at least try to reread afterwards to reconstruct what I was thinking about sounds like it might be helpful.
Still, making sense out loud is hard. I'm not actually used to talking to people that much now. This is probably a good reason for me to experiment with streaming more. Then I get the practice in talking out loud, there are backup recordings, and people can ask questions when things are unclear.
Of course, sometimes the text doesn't quite make sense because of the speech recognition errors. I can usually figure it out from the context. I save the audio as well so that I can go back and listen to it again if I really need to.
Anyway, here's the code for sending the current recording to whisper in the background and starting another recording. It assumes a lot about how things are set up. For example, I'm only testing this with a local speaches server instead of whisper.cpp. You might need to look at my other speech-related configuration blog posts and sections in order to make sense of it.
Code for queuing whisper.el requests to a local server
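(defvar my-whisper--queue nil
  "List of plists for recordings waiting to be transcribed, oldest first.
Each entry starts with :file and :buffer, and later gains :process,
:command, and :results.")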
(defun my-whisper-continue (&optional arg)
  "Send what we've got so far for transcription and then continue recording.
Call with \\[universal-argument] to signal that we can stop."
  (interactive "P")
  (setq whisper--marker (point-marker) whisper--point-buffer (current-buffer))
  (when (process-live-p whisper--recording-process)
    ;; Append the current recording to the end of the queue
    (add-to-list
     'my-whisper--queue
     (list :file whisper--temp-file
           :buffer
           (format "*result: %s*" (file-name-base whisper--temp-file)))
     t)
    ;; Replace whisper.el's sentinel; handle results ourselves
    (set-process-sentinel whisper--recording-process
                          (lambda (_process _event)
                            (my-whisper-process-queue)))
    (interrupt-process whisper--recording-process))
  (unless arg
    (run-hooks 'whisper-before-transcription-hook)
    (whisper--record-audio)))
(defun my-whisper-discard ()
  "Ignore the previous recording."
  (interactive)
  (when (process-live-p whisper--recording-process)
    ;; Replace the sentinel so the recording is deleted instead of queued
    (set-process-sentinel whisper--recording-process
                          (lambda (_process _event)
                            (when (file-exists-p whisper--temp-file)
                              (delete-file whisper--temp-file))
                            (my-whisper-process-queue)))
    (interrupt-process whisper--recording-process)))
(defun my-whisper-discard-and-continue ()
  "Ignore the previous recording and continue."
  (interactive)
  (if (process-live-p whisper--recording-process)
      (progn
        ;; Replace the sentinel; the interrupted recording is never queued,
        ;; and the next recording starts once the old process has exited
        (set-process-sentinel whisper--recording-process
                              (lambda (_process _event)
                                (my-whisper-process-queue)
                                (my-whisper-continue)))
        (interrupt-process whisper--recording-process))
    (my-whisper-continue)))
(defun my-whisper-done ()
  "Queue the current recording for transcription and stop recording."
  (interactive)
  (when (process-live-p whisper--recording-process)
    (add-to-list
     'my-whisper--queue
     (list :file whisper--temp-file
           :buffer
           (format "*result: %s*" (file-name-base whisper--temp-file)))
     t)
    ;; Replace whisper.el's sentinel; handle results ourselves
    (set-process-sentinel whisper--recording-process
                          (lambda (_process _event)
                            (my-whisper-process-queue)))
    (interrupt-process whisper--recording-process)))
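(defun my-whisper-process-queue-result ()
  "Process the first part of the queue that already has results."
  ;; Only pop from the front, so text is inserted in the order it was
  ;; spoken even if a later request finishes first.
  (while (plist-get (car my-whisper--queue) :results)
    (let ((o (pop my-whisper--queue)))
      ;; my-whisper-target-markers is not defined in this snippet
      (unless my-whisper-target-markers
        (setq whisper--marker (point-marker)
              whisper--point-buffer (current-buffer)))
      (with-current-buffer (plist-get o :buffer)
        (erase-buffer)
        (insert (plist-get o :results))
        (goto-char (point-min))
        ;; Run each hook function in the result buffer; returning nil
        ;; tells `run-hook-wrapped' to keep going through the hook.
        (run-hook-wrapped
         'whisper-after-transcription-hook
         (lambda (f)
           (with-current-buffer (get-buffer (plist-get o :buffer))
             (save-excursion
               (funcall f)))
           nil))))))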
(defun my-whisper-process-queue ()
  "Start a curl request for each queued recording that has not been sent yet."
  (let (o)
    (while (setq o (seq-find (lambda (o) (and (plist-get o :file)
                                              (not (plist-get o :process))
                                              (not (plist-get o :results))))
                             my-whisper--queue))
      ;; my-whisper-url-format is not defined in this snippet
      (let* ((headers (list "Content-Type: multipart/form-data"))
             (params (list (concat "file=@"
                                   (plist-get o :file))
                           "temperature=0.0"
                           "temperature_inc=0.2"
                           "response_format=json"
                           (concat "model=" whisper-model)
                           (concat "language=" whisper-language)))
             (url (format my-whisper-url-format whisper-server-host whisper-server-port))
             (command `("curl" "-s"
                        ,url
                        ,@(mapcan (lambda (h) (list "-H" h)) headers)
                        ,@(mapcan (lambda (p) (list "-F" p)) params))))
        (with-current-buffer (get-buffer-create (plist-get o :buffer))
          (erase-buffer))
        (plist-put
         o :process
         (make-process
          :name "whisper-curl"
          :command command
          :buffer (plist-get o :buffer)
          :coding 'utf-8
          :sentinel
          (lambda (process event)
            (with-current-buffer (process-buffer process)
              (let ((current my-whisper--queue-item))
                ;; If curl exits abnormally, :results stays nil and this
                ;; entry keeps blocking later ones until the queue is
                ;; reprocessed or cleared.
                (when (and (get-buffer (plist-get current :buffer))
                           (string-equal "finished\n" event))
                  (with-current-buffer (plist-get current :buffer)
                    (goto-char (point-min))
                    ;; "" if the output is not valid JSON,
                    ;; "(error)" if the JSON has no "text" field
                    (plist-put current :results
                               (or
                                (condition-case nil
                                    (gethash "text" (json-parse-buffer))
                                  (error ""))
                                "(error)"))))))
            (my-whisper-process-queue-result))))
        (plist-put o :command (string-join command " "))
        ;; Let the sentinel find its queue entry from the process buffer
        (with-current-buffer (process-buffer (plist-get o :process))
          (setq-local my-whisper--queue-item o))))))
(defvar-local my-whisper--queue-item nil
  "Queue entry whose curl output is collected in the current buffer.")
(defun my-whisper-reprocess-queue ()
  "Cancel pending requests and send every recording still in the queue again."
  (interactive)
  (setq whisper--marker (point-marker) whisper--point-buffer (current-buffer))
  (mapc (lambda (o)
          (when (process-live-p (plist-get o :process))
            (kill-process (plist-get o :process)))
          (when (get-buffer (plist-get o :buffer))
            (kill-buffer (plist-get o :buffer)))
          (plist-put o :process nil)
          (plist-put o :results nil))
        my-whisper--queue)
  (my-whisper-process-queue))
(defun my-whisper-clear-queue ()
  "Cancel pending requests, kill their result buffers, and empty the queue."
  (interactive)
  (mapc (lambda (o)
          (when (process-live-p (plist-get o :process))
            (kill-process (plist-get o :process)))
          (when (get-buffer (plist-get o :buffer))
            (kill-buffer (plist-get o :buffer)))
          (plist-put o :process nil)
          (plist-put o :results nil))
        my-whisper--queue)
  (setq my-whisper--queue nil))
(keymap-global-set "<kp-9>" #'my-whisper-continue)
(keymap-global-set "<kp-8>" #'my-whisper-discard-and-continue)
(keymap-global-set "C-<kp-9>" #'my-whisper-done)
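The part that keeps the text in spoken order is the loop in my-whisper-process-queue-result, which only pops entries off the front of the queue. Here is a minimal sketch of just that idea, separate from the configuration above. Everything named demo-* is hypothetical, and plain strings stand in for the transcriptions that curl would return; it does not call whisper.el or any server.

```emacs-lisp
;;; -*- lexical-binding: t; -*-
(require 'seq)

(defvar demo-queue nil
  "Hypothetical queue of plists, oldest first.")

(defvar demo-output nil
  "Results delivered so far, oldest first.")

(defun demo-enqueue (id)
  "Add a pending entry named ID to the end of `demo-queue'."
  (setq demo-queue
        (append demo-queue (list (list :id id :results nil)))))

(defun demo-deliver ()
  "Pop entries off the front of `demo-queue' while the front one has results."
  (while (plist-get (car demo-queue) :results)
    (setq demo-output
          (append demo-output
                  (list (plist-get (pop demo-queue) :results))))))

(defun demo-finish (id text)
  "Store TEXT as the result for entry ID, then deliver whatever is ready."
  (let ((item (seq-find (lambda (o) (equal (plist-get o :id) id))
                        demo-queue)))
    (when item
      ;; :results is already a key in the plist, so this updates it in place
      (plist-put item :results text))
    (demo-deliver)))

;; Two thoughts are queued, and the second one finishes first.
(demo-enqueue "a")
(demo-enqueue "b")
(demo-finish "b" "second thought") ; nothing delivered yet: "a" is still pending
(demo-finish "a" "first thought")  ; now both are delivered, "a" before "b"
```

After the last call, demo-output holds the two strings in the order they were queued, and demo-queue is empty. The trade-off is that one slow or failed request at the front holds back everything behind it, which is why the real code has my-whisper-reprocess-queue and my-whisper-clear-queue as escape hatches.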