Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere
Tags: emacs, audio, speech-recognition, speech-
- Major change: I switched to my fork of natrys/whisper.el so that I can specify functions that change the window configuration, etc.
- Changed the main function to my-whisper-run; it uses seq-reduce to go through the functions.
- Added code for automatically capturing screenshots, saving text, and working with a list of functions.
- Added demo, fixed some bugs.
- Added note about difference from MELPA package, fixed :vc.
I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.
I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.
(use-package whisper
:vc (:url "https://github.com/natrys/whisper.el")
:load-path "~/vendor/whisper.el"
:config
(setq whisper--mode-line-recording-indicator "⏺")
(setq whisper-quantize "q4_0")
(setq whisper-install-directory "~/vendor")
(setq whisper--install-path (concat
(expand-file-name (file-name-as-directory whisper-install-directory))
"whisper.cpp/"))
;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
;; If you change models, you may need to evaluate:
;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
(setq whisper-server-mode 'local)
(setq whisper-model "base")
(setq whisper-return-cursor-to-start nil)
;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
(setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
(setq whisper-language "en")
(setq whisper-recording-timeout 3000)
(setq whisper-before-transcription-hook nil)
(setq whisper-use-threads (1- (num-processors)))
(setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
(add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
:bind
(("<f9>" . whisper-run)
("C-<f9>" . my-whisper-run)
("S-<f9>" . my-whisper-replay)
("M-<f9>" . my-whisper-toggle-language)))
Let's see if we can process "Computer remind me to…":
(defvar my-whisper-org-reminder-template "t")
(defun my-whisper-org-process-reminder ()
(let ((text (buffer-string))
reminder)
(when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
(setq reminder (match-string 1 text))
(save-window-excursion
(with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
(when (markerp whisper--marker) (goto-char whisper--marker))
(org-capture nil my-whisper-org-reminder-template)
(insert reminder)
(org-capture-finalize)))
(erase-buffer))))
(with-eval-after-load 'whisper
(add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))
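To see what that regular expression picks up, here is a small sketch that runs the pattern from above against plain strings instead of the transcription buffer. my-demo-extract-reminder is a hypothetical helper for illustration, not part of whisper.el or my config. It binds case-fold-search explicitly so that "Computer" and "computer" both match.

```elisp
;; Hypothetical helper for illustration only; not part of whisper.el.
(defun my-demo-extract-reminder (text)
  "Return the reminder in TEXT, or nil if TEXT is not a reminder command."
  (let ((case-fold-search t))           ; match "Computer" as well as "computer"
    (when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
      (match-string 1 text))))

(my-demo-extract-reminder "Computer, remind me to buy milk.")
;; => "buy milk."
(my-demo-extract-reminder "Just a regular note.")
;; => nil
```

Anything that doesn't start with the command phrase returns nil, which is why ordinary dictation falls through to the other hooks.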
Disk space is inexpensive and backups are great, so let's save each file using the timestamp.
(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
(setq whisper--temp-file (expand-file-name
(format-time-string "%Y-%m-%d-%H-%M-%S.wav")
my-whisper-dir)))
(with-eval-after-load 'whisper
(add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))
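Here's a minimal sketch of how that filename gets built, pulled out into a function that takes the directory and time as arguments so that it can be tried without recording anything. my-demo-whisper-filename is a hypothetical name, and the /tmp/whisper/ directory is just an example.

```elisp
;; Hypothetical helper for illustration only.
(defun my-demo-whisper-filename (dir &optional time zone)
  "Return a timestamped .wav path inside DIR for TIME (default: now).
ZONE is passed to `format-time-string'; t means UTC."
  (expand-file-name
   (format-time-string "%Y-%m-%d-%H-%M-%S.wav" time zone)
   dir))

;; A fixed UTC time (2024-01-02 03:04:05) makes the result predictable.
(my-demo-whisper-filename "/tmp/whisper/" (encode-time 5 4 3 2 1 2024 t) t)
;; => "/tmp/whisper/2024-01-02-03-04-05.wav" on GNU/Linux
```

Because the fields go from year down to second, sorting the files by name also sorts them by time.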
The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.
(defun my-whisper-replay (&optional file)
"Replay the last temporary recording."
(interactive (list
(when current-prefix-arg
(read-file-name "File: " my-whisper-dir))))
(setq whisper--temp-file (or file whisper--temp-file))
(mpv-play whisper--temp-file))
(defun my-whisper-insert-retry (&optional file)
(interactive (list
(when current-prefix-arg
(read-file-name "File: " my-whisper-dir))))
(whisper--cleanup-transcription)
(setq whisper--marker (point-marker)
whisper--temp-file (or file whisper--temp-file))
(whisper--transcribe-audio))
It can also understand French.
(defun my-whisper-toggle-language ()
"Set the language explicitly, since sometimes auto doesn't figure out the right one."
(interactive)
(setq whisper-language (if (string= whisper-language "en") "fr" "en"))
;; If using a server, we need to restart for the language
(when (process-live-p whisper--server-process) (kill-process whisper--server-process))
(message "%s" whisper-language))
I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to choose the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked-in task without having an Org capture buffer interrupt my display.
To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.
Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)
(defun my-whisper-reset (text)
(setq my-whisper-skip-annotation nil)
(remove-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task)
text)
;; Only works with my tweaks to whisper.el
;; https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
(with-eval-after-load 'whisper
(setq whisper-insert-text-at-point
'(my-whisper-handle-commands
my-whisper-save-text
my-whisper-save-to-file
my-whisper-maybe-type
my-whisper-insert
my-whisper-reset)))
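The idea behind that list is that each function receives the text and returns either text for the next function or nil to say that the text has been handled. Here is a minimal sketch of that pattern with seq-reduce. The my-demo- names are hypothetical, and this only approximates what my fork does; it is not its actual code.

```elisp
(require 'seq)
(require 'subr-x)   ; for `string-trim' on older Emacs versions

(defvar my-demo-log nil "Texts seen by `my-demo-record', newest first.")

(defun my-demo-trim (text)
  "Return TEXT without surrounding whitespace, or nil if there is no text."
  (when text (string-trim text)))

(defun my-demo-record (text)
  "Remember TEXT in `my-demo-log' and pass it along unchanged."
  (when text (push text my-demo-log))
  text)

(defun my-demo-consume (text)
  "Pretend to insert TEXT somewhere, then return nil so later steps skip it."
  (ignore text)
  nil)

(defun my-demo-run-pipeline (functions text)
  "Call each of FUNCTIONS in order, feeding each one the previous result."
  (seq-reduce (lambda (result fn) (funcall fn result)) functions text))

(my-demo-run-pipeline '(my-demo-trim my-demo-record) "  hello world \n")
;; => "hello world"
(my-demo-run-pipeline '(my-demo-trim my-demo-consume my-demo-record) " hi ")
;; => nil, and `my-demo-record' logs nothing for this call
```

Checking for nil at the top of every step is what lets an earlier function claim the text (for example, by typing it into another window) without the later ones inserting it again.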
(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")
(defun my-whisper-insert (text)
(let ((markers
(cond
((null my-whisper-target-markers)
(list whisper--marker)) ; current point where whisper was started
((listp my-whisper-target-markers)
my-whisper-target-markers)
((markerp my-whisper-target-markers)
(list my-whisper-target-markers))))
(orig-point (point))
(orig-buffer (current-buffer)))
(when text
(mapc (lambda (marker) ; mapc, since this is only for side effects
(with-current-buffer (marker-buffer marker)
(save-restriction
(widen)
(when (markerp marker) (goto-char marker))
(when (and (derived-mode-p 'org-mode) (org-at-drawer-p))
(insert "\n"))
(whisper--insert-text
(concat
(if (looking-back "[ \t\n]\\|^" (line-beginning-position))
""
" ")
(string-trim text)))
;; Move the marker forward here
(move-marker marker (point)))))
markers)
(when my-whisper-target-markers
(goto-char orig-point))
nil)))
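Moving the marker after each insertion is what lets several transcriptions land one after another at the same target. A marker created with point-marker stays in front of text inserted at its position, so it has to be advanced by hand. Here's a small sketch in a temporary buffer. my-demo-append-at-marker is a hypothetical helper, and it uses char-before instead of looking-back for the whitespace check.

```elisp
(require 'subr-x)   ; for `string-trim' on older Emacs versions

;; Hypothetical helper for illustration only.
(defun my-demo-append-at-marker (marker text)
  "Insert TEXT at MARKER with a separating space if needed, then advance MARKER."
  (with-current-buffer (marker-buffer marker)
    (save-excursion
      (goto-char marker)
      ;; Skip the space at the start of a line or after existing whitespace.
      (insert (if (or (bolp) (memq (char-before) '(?\s ?\t ?\n))) "" " ")
              (string-trim text))
      ;; Without this, the next insertion would land before this text.
      (move-marker marker (point)))))

(with-temp-buffer
  (insert "Notes:")
  (let ((marker (point-marker)))
    (my-demo-append-at-marker marker " first thought ")
    (my-demo-append-at-marker marker "second thought")
    (buffer-string)))
;; => "Notes: first thought second thought"
```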
(defun my-whisper-maybe-type (text)
(when text
(if (frame-focus-state)
text
(make-process :name "xdotool" :command
(list "xdotool" "type"
text))
nil)))
(defun my-whisper-clear-markers ()
(interactive)
(setq my-whisper-target-markers nil))
(defun my-whisper-use-current-point (&optional add)
(interactive (list current-prefix-arg))
(if add
(push (point-marker) my-whisper-target-markers)
(setq my-whisper-target-markers (list (point-marker)))))
(defun my-whisper-run-at-point (&optional add)
(interactive (list current-prefix-arg))
(my-whisper-clear-markers)
(whisper-run))
(keymap-global-set "<f9>" #'my-whisper-run-at-point)
(keymap-global-set "<kp-1>" #'whisper-run)
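(defun my-whisper-jump-to-marker ()
"Jump to the first marker in `my-whisper-target-markers'."
(interactive)
(let ((marker (car my-whisper-target-markers)))
(when (markerp marker)
(pop-to-buffer (marker-buffer marker)) ; with-current-buffer would not change what is displayed
(goto-char marker))))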
(defun my-whisper-use-currently-clocked-task (&optional add)
(interactive (list current-prefix-arg))
(save-window-excursion
(save-restriction
(save-excursion
(org-clock-goto)
(org-end-of-meta-data)
(org-end-of-subtree)
(if add
(push (point-marker) my-whisper-target-markers)
(setq my-whisper-target-markers (list (point-marker))))))))
(defun my-whisper-run (&optional skip-annotation)
(interactive (list current-prefix-arg))
(require 'whisper)
(add-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task -10)
(whisper-run)
(when skip-annotation
(setq my-whisper-skip-annotation t)))
(defun my-whisper-save-text (text)
"Save TEXT beside `whisper--temp-file'."
(when text
(let ((link (org-store-link nil)))
(with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
(when link
(insert link "\n"))
(insert text)))
text))
(defun my-whisper-org-save-to-clocked-task (text)
(when text
(save-window-excursion
(with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
(when (markerp whisper--marker) (goto-char whisper--marker))
;; Take a screenshot maybe
(let* ((link (and (not my-whisper-skip-annotation)
(org-store-link nil)))
(region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
(screenshot-filename
(when (or
(null link)
(not (string= my-whisper-last-annotation link))
(not (frame-focus-state))) ; not in focus, take a screenshot
(my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
(if (org-clocking-p)
(save-window-excursion
(save-restriction
(save-excursion
(org-clock-goto)
(org-end-of-subtree)
(unless (bolp)
(insert "\n"))
(insert "\n")
(if (and link (not (string= my-whisper-last-annotation link)))
(insert
(if screenshot-filename
(concat "(" (org-link-make-string
(concat "file:" screenshot-filename)
"screenshot") ") ")
"")
link
"\n")
(when screenshot-filename
(insert (org-link-make-string
(concat "file:" screenshot-filename)
"screenshot")
"\n")))
(when region
(insert "#+begin_example\n" region "\n#+end_example\n"))
(insert text "\n")
(setq my-whisper-last-annotation link)))
(run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
;; No clocked task, prompt for a place to capture it
(kill-new text)
(setq org-capture-initial text)
(call-interactively 'org-capture)
;; Delay the window configuration
(let ((config (current-window-configuration)))
(run-at-time 0.5 nil
(lambda (text config)
(set-window-configuration config)
(message "Copied: %s" text))
text config))))))))
(with-eval-after-load 'org
(add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))
(defun my-whisper-org-clear-saved-annotation ()
"Forget the last annotation so that the next note gets a fresh link."
(setq my-whisper-last-annotation nil))
Here's an idea for a function that saves the recognized text with a timestamp.
(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
(when text
(let ((link (org-store-link nil)))
(with-current-buffer (find-file-noselect my-whisper-notes)
(goto-char (point-max))
(insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
(save-buffer)
(run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
text))
And now I can redo things if needed:
(defun my-whisper-redo ()
(interactive)
(setq whisper--marker (point-marker))
(whisper--transcribe-audio))
I think I've just figured out my PipeWire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the PipeWire connections and fixing them.
systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
pactl load-module module-loopback \
source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
sink="VirtualMicSink" \
latency_msec=100 \
adjust_time=1 \
source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysToVMic'" \
sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
pactl load-module module-loopback \
source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
sink="CombinedSink" \
node_name="SystemOutToCombined" \
source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
pactl load-module module-loopback \
source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
sink="VirtualMicSink" \
latency_msec=100 \
adjust_time=1 \
source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
pactl load-module module-loopback \
source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
sink="CombinedSink" \
source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' " \
sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
source="VirtualMicSink.monitor" \
sink="CombinedSink" \
source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
latency_msec=100 adjust_time=1
pactl load-module module-null-sink \
sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1
pactl load-module module-loopback \
source="ExtraSink1.monitor" \
sink="CombinedSink" \
source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
latency_msec=100 adjust_time=1
Here's a demo: