Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere

Tags: emacs, audio, speech

Updates:
  • Changed the main function to my-whisper-run, which uses seq-reduce to go through the functions.
  • Added code for automatically capturing screenshots, saving text, and working with a list of functions.
  • Added a demo and fixed some bugs.
  • Added a note about the difference from the MELPA package and fixed :vc.

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now, and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 using just the CPU, with no GPU.

I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.

(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
     (expand-file-name (file-name-as-directory whisper-install-directory))
     "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models, evaluate this:
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-run)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))

Let's see if we can process "Computer remind me to…":

(defvar my-whisper-org-reminder-template "t")

(defun my-whisper-org-process-reminder ()
  "Capture \"Computer, remind me to ...\" transcriptions with `org-capture'."
  (let ((text (buffer-string))
        reminder)
    (when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
      (setq reminder (match-string 1 text))
      (save-window-excursion
        (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
          (when (markerp whisper--marker) (goto-char whisper--marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))
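
To see what the pattern in my-whisper-org-process-reminder picks up, here's a minimal sketch that runs the same kind of regular expression on a plain string. my-whisper-demo-extract-reminder is a hypothetical helper for illustration only; it isn't part of whisper.el or of the configuration above.

```elisp
;; Hypothetical helper, not part of whisper.el.
(defun my-whisper-demo-extract-reminder (text)
  "Return the reminder in TEXT, or nil if TEXT is not a reminder command."
  (let ((case-fold-search t))           ; match both "Computer" and "computer"
    (when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
      (match-string 1 text))))

(my-whisper-demo-extract-reminder " Computer, remind me to water the plants.")
;; => "water the plants."
(my-whisper-demo-extract-reminder "Just a regular note.")
;; => nil
```

The optional "s" in "reminds?" is there because Whisper sometimes hears "reminds me" instead of "remind me".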

I want "checkbox …" to insert "- [ ] " at the beginning of the text.

(defun my-whisper-process-commands ()
  (goto-char (point-min))
  (when (looking-at " +") (replace-match ""))
  (let ((case-fold-search t))
    (cond
     ((looking-at " *check ?box[,.]? ")
      (replace-match "\n- [ ] "))))
  (goto-char (point-max))
  (insert " "))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-process-commands 70))
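
If more spoken commands get added, the cond could turn into a table. Here's a rough sketch, assuming an alist of regular expressions and replacements. The names my-whisper-demo-commands and my-whisper-demo-apply-commands and the "bullet" command are made up for this example; only the checkbox command comes from the code above. It works on a string instead of the transcription buffer so that it's easy to try out.

```elisp
(require 'subr-x)                       ; string-trim-left on older Emacs versions

;; Hypothetical table of spoken commands; only "checkbox" comes from the post.
(defvar my-whisper-demo-commands
  '(("check ?box[,.]? +" . "\n- [ ] ")
    ("bullet[,.]? +" . "\n- "))
  "Alist of (REGEXP . REPLACEMENT) applied at the start of a transcription.")

(defun my-whisper-demo-apply-commands (text)
  "Return TEXT with the first matching spoken command replaced."
  (let ((case-fold-search t)
        (text (string-trim-left text)))
    (catch 'done
      (dolist (command my-whisper-demo-commands)
        ;; \\` anchors the match to the start of the string
        (when (string-match (concat "\\`" (car command)) text)
          (throw 'done (replace-match (cdr command) t t text))))
      text)))

(my-whisper-demo-apply-commands " Checkbox, buy milk")
;; => "\n- [ ] buy milk"
```

Text that doesn't start with a known command comes back unchanged apart from the leading whitespace.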

Disk space is inexpensive and backups are great, so let's save each file using the timestamp.

(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))

(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))
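
One thing that might trip this up is a missing recordings directory. Here's a minimal sketch of a helper that builds the same kind of timestamped filename and creates the directory first. my-whisper-demo-recording-file is a hypothetical name, and I'm assuming the directory might not exist yet; if it's already there, make-directory with a non-nil second argument just does nothing.

```elisp
;; Sketch only: `my-whisper-demo-recording-file' is a hypothetical helper.
(defun my-whisper-demo-recording-file (dir &optional time)
  "Return a timestamped .wav filename in DIR, creating DIR if needed.
TIME defaults to the current time."
  (let ((dir (file-name-as-directory (expand-file-name dir))))
    (make-directory dir t)              ; also creates parents; no error if DIR exists
    (expand-file-name (format-time-string "%Y-%m-%d-%H-%M-%S.wav" time) dir)))
```

It could be called from my-whisper-set-temp-filename with my-whisper-dir in place of the plain expand-file-name call.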

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

(defun my-whisper-replay (&optional file)
  "Replay the last temporary recording."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (setq whisper--temp-file (or file whisper--temp-file))
  (mpv-play whisper--temp-file))

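(defun my-whisper-insert-retry (&optional file)
  "Transcribe the last recording again and insert the result at point.
With a prefix argument, prompt for FILE in `my-whisper-dir'."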
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (whisper--cleanup-transcription)
  (setq whisper--marker (point-marker)
        whisper--temp-file (or file whisper--temp-file))
  (whisper--transcribe-audio))

It can also understand French.

(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, it needs to be restarted for the new language to take effect
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))
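
If I ever want more than two languages, the toggle could cycle through a list instead. Here's a minimal sketch of just the "what comes next" part, with a hypothetical variable my-whisper-demo-languages and a hypothetical helper my-whisper-demo-next-language. The list contents are only an example.

```elisp
(defvar my-whisper-demo-languages '("en" "fr" "auto")
  "Hypothetical list of languages to cycle through.")

(defun my-whisper-demo-next-language (current)
  "Return the language after CURRENT in `my-whisper-demo-languages'.
Wrap around at the end, and fall back to the first entry if CURRENT
is not in the list."
  (or (cadr (member current my-whisper-demo-languages))
      (car my-whisper-demo-languages)))

(my-whisper-demo-next-language "en")   ; => "fr"
(my-whisper-demo-next-language "auto") ; => "en"
```

The toggle command would then set whisper-language to the result and restart the server as before.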

I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.

Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)

(defvar my-whisper-targets
  '(my-whisper-save-text
    my-whisper-save-to-file)
  "Where to send the transcribed text.

Nil means jump to the current clocked-in entry and insert it along with
a link, or prompt for a capture template if nothing is clocked in.

If this is set to a string, it should specify a key from
`org-capture-templates'. The text will be in %i, and you can use %a for the link.
For example, you could have a template entry like this:
\(\"c\" \"Contents to current clocked task\" plain (clock) \"%i%?\\n%a\" :empty-lines 1)

If this is set to a function, the function will be called from the
original marker with the text as the argument. Note that the window
configuration and message will not be preserved after this function is
run, so if you want to change the window configuration or display a
message, add a timer.

If this is set to a list of functions, the functions are called in
sequence.  The first function is called with the text. The second
function is called with the result from the first function, and so on.")

(defun my-whisper-process ()
  "Process the transcription."
  (let ((text (string-trim (buffer-string))))
    (unless (string= text "")
      (erase-buffer)    ; stops further processing
      (my-whisper-process-text text))
    (setq my-whisper-skip-annotation nil))
  (remove-hook 'my-whisper-targets #'my-whisper-org-save-to-clocked-task))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-process 70))

(defun my-whisper-process-text (text)
  (save-window-excursion
    (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
      (when (markerp whisper--marker) (goto-char whisper--marker))
      (cond
       ((and my-whisper-targets (listp my-whisper-targets))
        (setq text (seq-reduce
                    (lambda (prev cur)
                      (funcall cur prev))
                    my-whisper-targets
                    text)))
       ((functionp my-whisper-targets)
        (setq text (funcall my-whisper-targets text)))
       (my-whisper-targets
        (setq org-capture-initial text)
        (org-capture nil my-whisper-targets)
        (org-capture-finalize)
        ;; Delay the display of the message because whisper--cleanup-transcription clears it
        (run-at-time 0.5 nil (lambda (text) (message "Captured: %s" text)) text))
       (t (setq text (my-whisper-org-save-to-clocked-task text))))))
  (when text
    (my-whisper-insert text)))

(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")

(defun my-whisper-insert (text)
  (let ((markers
         (cond
          ((null my-whisper-target-markers)
           (list whisper--marker)) ; current point where whisper was started
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (when text
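      ;; mapc instead of mapcar, since this is only for the side effects
      (mapc (lambda (marker)
              (with-current-buffer (marker-buffer marker)
                (when (markerp marker) (goto-char marker))
                (whisper--insert-text
                 (concat
                  ;; Add a space unless we're at the start of a line or after whitespace
                  (if (or (bolp) (memq (char-before) '(?\s ?\t ?\n)))
                      ""
                    " ")
                  (string-trim text)))
                ;; Move the marker forward here
                (move-marker marker (point))))
            markers)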
      (when my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(defun my-whisper-maybe-type (text)
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

(defun my-whisper-clear-markers ()
  (interactive)
  (setq my-whisper-target-markers nil))

(defun my-whisper-use-current-point (&optional add)
  (interactive (list current-prefix-arg))
  (if add
      (push (point-marker) my-whisper-target-markers)
    (setq my-whisper-target-markers (list (point-marker)))))

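(defun my-whisper-set-current-point-and-run (&optional add)
  "Send the transcription to point, then start or stop recording.
With a prefix argument ADD, keep the existing target markers too."
  (interactive (list current-prefix-arg))
  (my-whisper-use-current-point add)
  (whisper-run))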

(keymap-global-set "<f9>" #'my-whisper-set-current-point-and-run)
(keymap-global-set "<kp-1>" #'whisper-run)

(defun my-whisper-jump-to-marker ()
  "Jump to the first marker in `my-whisper-target-markers'."
  (interactive)
  (let ((marker (car my-whisper-target-markers)))
    (when (markerp marker)
      (pop-to-buffer (marker-buffer marker))
      (goto-char marker))))

(defun my-whisper-use-currently-clocked-task (&optional add)
  (interactive (list current-prefix-arg))
  (save-window-excursion
    (save-restriction
      (save-excursion
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (if add
            (push (point-marker) my-whisper-target-markers)
          (setq my-whisper-target-markers (list (point-marker))))))))

(defun my-whisper-run (&optional skip-annotation)
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'my-whisper-targets #'my-whisper-org-save-to-clocked-task -10)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))

(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))

(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) ; not in focus, take a screenshot
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
          (if (org-clocking-p)
              (save-window-excursion
                (save-restriction
                  (save-excursion
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (unless (bolp)
                      (insert "\n"))
                    (insert "\n")
                    (if (and link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (if screenshot-filename
                             (concat "(" (org-link-make-string
                                          (concat "file:" screenshot-filename)
                                          "screenshot") ") ")
                           "")
                         link
                         "\n")
                      (when screenshot-filename
                        (insert (org-link-make-string
                                 (concat "file:" screenshot-filename)
                                 "screenshot")
                                "\n")))
                    (when region
                      (insert "#+begin_example\n" region "\n#+end_example\n"))
                    (insert text "\n")
                    (setq my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))))

(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))

(defun my-whisper-org-clear-saved-annotation ()
  (setq my-whisper-last-annotation nil))
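
The docstring for my-whisper-targets says that a list of functions is called in sequence, with each function getting the previous function's result. Here's a small sketch of that chaining outside of whisper.el, using seq-reduce the same way my-whisper-process-text does. The my-whisper-demo- functions are hypothetical stand-ins for real targets.

```elisp
(require 'seq)
(require 'subr-x)                       ; string-trim on older Emacs versions

;; Hypothetical target functions, for illustration only.
(defun my-whisper-demo-trim (text)
  "Return TEXT without surrounding whitespace."
  (when text (string-trim text)))

(defun my-whisper-demo-skip-short (text)
  "Return TEXT, or nil if it is shorter than three characters."
  (when (and text (>= (length text) 3)) text))

(defun my-whisper-demo-run-targets (targets text)
  "Call each function in TARGETS with the previous result, starting with TEXT."
  (seq-reduce (lambda (prev cur) (funcall cur prev)) targets text))

(my-whisper-demo-run-targets
 '(my-whisper-demo-trim my-whisper-demo-skip-short) "  hello world ")
;; => "hello world"
(my-whisper-demo-run-targets
 '(my-whisper-demo-trim my-whisper-demo-skip-short) " um ")
;; => nil
```

A target that returns nil keeps the text from being inserted, since my-whisper-process-text only calls my-whisper-insert when there's text left. The later targets still get called with nil, which is why the real ones start with (when text ...).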

Here's an idea for a my-whisper-targets function that saves the recognized text with a timestamp.

(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" link "\n")
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))
; (setq my-whisper-targets #'my-whisper-save-to-file)
; (setq my-whisper-targets '(my-whisper-save-to-file my-whisper-org-save-to-clocked-task))

And now I can redo things if needed:

(defun my-whisper-redo ()
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))

I think I've just figured out my PipeWire setup so that I can record audio in OBS while also doing speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the PipeWire connections and fixing them.

systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysToVMic'" \
    sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
    sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' " \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1

pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

Here's a demo:

Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function

Transcript

00:00:00 Inserting into the current buffer
Here's a quick demonstration of using whisper.el to log notes.
00:00:13 Inserting text and moving on
I can insert text into the current buffer one after the other.
00:00:31 Clocking in
If I clock into a task, I can add to the end of that clocked in task using my custom code by pressing C-<f9> or whatever my shortcut was. I can do that multiple times.
00:01:05 Logging a note from a different file
I can do that while looking at a different file.
00:01:15 I can look at an info page
I can do it looking at an info page, for example, and annotations will include a link back to whatever I was looking at.
00:01:33 Adding without an annotation (C-u)
I just added an optional argument so that I can also capture a note without saving an annotation. That way, if I'm going to say a lot of things about the same buffer, I don't have to have a lot of links that I need to edit out.
00:02:42 Saving to a different function
I can also have it save to a different function.

This is part of my Emacs configuration.