Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere
Tags: emacs, audio, speech
Updates:
- Changed the main function to my-whisper-run, which uses seq-reduce to go through the functions.
- Added code for automatically capturing screenshots, saving text, and working with a list of functions.
- Added demo, fixed some bugs.
- Added note about difference from MELPA package, fixed :vc.
I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.
I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.
(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
                               (expand-file-name (file-name-as-directory whisper-install-directory))
                               "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models, run:
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-run)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))
Disk space is inexpensive and backups are great, so let's save each file using the timestamp.
(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))
(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))
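To see what kind of path that produces, here's a small sketch that builds the same filename from an explicit time. The name my-whisper-example-filename and the /tmp/whisper/ directory are made up for illustration; they aren't part of whisper.el or my config.

```elisp
;; Hypothetical helper for illustration only.
(defun my-whisper-example-filename (dir &optional time zone)
  "Return a timestamped .wav path under DIR.
TIME defaults to the current time.  ZONE is passed to
`format-time-string' (t means UTC)."
  (expand-file-name
   (format-time-string "%Y-%m-%d-%H-%M-%S.wav" time zone)
   dir))

;; A fixed time in UTC makes the result predictable:
;; 2024-01-02 03:04:05 UTC.
(my-whisper-example-filename "/tmp/whisper/"
                             (encode-time 5 4 3 2 1 2024 t)
                             t)
;; => "/tmp/whisper/2024-01-02-03-04-05.wav"
```

If the directory might not exist yet, (make-directory dir t) creates it along with any missing parents.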
The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.
(defun my-whisper-replay ()
  "Replay the last temporary recording."
  (interactive)
  (mpv-play whisper--temp-file))
It can also understand French.
(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart for the language
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))
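If I ever want more than two languages, the toggle could become a cycle. This is just a sketch: my-whisper-example-languages and my-whisper-example-next-language are hypothetical names, and the function only picks the next string from a list without touching whisper.el itself.

```elisp
(defvar my-whisper-example-languages '("en" "fr")
  "Hypothetical list of languages to cycle through.")

(defun my-whisper-example-next-language (current &optional languages)
  "Return the language after CURRENT in LANGUAGES, wrapping around.
LANGUAGES defaults to `my-whisper-example-languages'.  If CURRENT
is not in the list (for example \"auto\"), return the first one."
  (let* ((languages (or languages my-whisper-example-languages))
         ;; `member' compares with `equal', so it works for strings.
         (rest (cdr (member current languages))))
    (or (car rest) (car languages))))

(my-whisper-example-next-language "en")                    ; => "fr"
(my-whisper-example-next-language "fr")                    ; => "en"
(my-whisper-example-next-language "fr" '("en" "fr" "de"))  ; => "de"
```

In the toggle command, that would be something like (setq whisper-language (my-whisper-example-next-language whisper-language)), followed by the same server restart.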
I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.
To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.
Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)
(defvar my-whisper-targets
  '(my-whisper-save-text
    my-whisper-org-save-to-clocked-task
    my-whisper-save-to-file)
  "*Where to save the transcribed text.
Nil means jump to the current clocked-in entry and insert it along with
a link, or prompt for a capture template if nothing is clocked in.
If this is set to a string, it should specify a key from
`org-capture-templates'. The text will be in %i, and you can use %a for the link.
For example, you could have a template entry like this:
\(\"c\" \"Contents to current clocked task\" plain (clock) \"%i%?\n%a\" :empty-lines 1)
If this is set to a function, the function will be called from the
original marker with the text as the argument. Note that the window
configuration and message will not be preserved after this function is
run, so if you want to change the window configuration or display a
message, add a timer.
If this is set to a list of functions, the functions are called in
sequence. The first function is called with the text. The second
function is called with the result from the first function, and so on.")
(defun my-whisper-process ()
  "Process the transcription."
  (remove-hook 'whisper-after-transcription-hook #'my-whisper-process)
  (let ((text (string-trim (buffer-string))))
    (erase-buffer) ; stops further processing
    (my-whisper-process-text text)
    (setq my-whisper-skip-annotation nil)))
(defun my-whisper-process-text (text)
  (save-window-excursion
    (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
      (when (markerp whisper--marker) (goto-char whisper--marker))
      (cond
       ((and my-whisper-targets (listp my-whisper-targets))
        (seq-reduce
         (lambda (prev cur)
           (funcall cur prev))
         my-whisper-targets
         text))
       ((functionp my-whisper-targets)
        (funcall my-whisper-targets text))
       (my-whisper-targets
        (setq org-capture-initial text)
        (org-capture nil my-whisper-targets)
        (org-capture-finalize)
        ;; Delay the display of the message because whisper--cleanup-transcription clears it
        (run-at-time 0.5 nil (lambda (text) (message "Captured: %s" text)) text))
       (t (my-whisper-org-save-to-clocked-task text))))))
(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil
  "Non-nil means don't store a link with the next note.")
(defun my-whisper-run (&optional skip-annotation)
  "Run whisper and send the transcription to `my-whisper-targets'.
With a prefix argument, SKIP-ANNOTATION is non-nil and no link is stored."
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'whisper-after-transcription-hook #'my-whisper-process -50)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))
(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))
(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link)))
                  (my-screenshot-svg (concat (file-name-sans-extension whisper--temp-file) ".svg")))))
          (if (org-clocking-p)
              (progn
                (let ()
                  (org-clock-goto)
                  (org-end-of-subtree)
                  (unless (bolp)
                    (insert "\n"))
                  (insert "\n")
                  (when screenshot-filename
                    (insert (org-link-make-string
                             (concat "file:" screenshot-filename))
                            "\n"))
                  (when (and link (not (string= my-whisper-last-annotation link)))
                    (insert link "\n"))
                  (when region
                    (insert "#+begin_example\n" region "\n#+end_example\n"))
                  (insert text "\n")
                  (setq my-whisper-last-annotation link))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))
    text))
(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))
(defun my-whisper-org-clear-saved-annotation ()
  "Forget the last annotation so the next note stores its link again."
  (setq my-whisper-last-annotation nil))
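When my-whisper-targets is a list, each function gets the previous function's return value, which is why the target functions above end by returning text. Here's a stand-alone sketch of that chaining with seq-reduce. All the my-whisper-example-* names are hypothetical stand-ins for real targets.

```elisp
(require 'seq)
(require 'subr-x) ; string-trim and string-empty-p on older Emacs versions

(defun my-whisper-example-run-pipeline (functions text)
  "Call each of FUNCTIONS in order, feeding each the previous result.
TEXT is the argument to the first function."
  (seq-reduce (lambda (prev cur) (funcall cur prev))
              functions
              text))

;; Hypothetical steps: each takes a string or nil and returns one.
(defun my-whisper-example-trim (text)
  "Remove leading and trailing whitespace from TEXT."
  (and text (string-trim text)))

(defun my-whisper-example-drop-blank (text)
  "Return nil for empty TEXT so that later steps can skip it."
  (unless (or (null text) (string-empty-p text))
    text))

(defun my-whisper-example-capitalize-first (text)
  "Upcase the first character of TEXT, leaving nil or \"\" alone."
  (if (or (null text) (string-empty-p text))
      text
    (concat (upcase (substring text 0 1)) (substring text 1))))

(my-whisper-example-run-pipeline
 '(my-whisper-example-trim
   my-whisper-example-drop-blank
   my-whisper-example-capitalize-first)
 "  hello world  ")
;; => "Hello world"
```

A step that returns nil effectively stops the chain, as long as the later functions check for nil the way the (when text ...) wrappers do.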
Here's an idea for a my-whisper-targets function that saves the recognized text with a timestamp.
(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        ;; `insert' signals an error on nil, so only add the link if there is one
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))
; (setq my-whisper-targets #'my-whisper-save-to-file)
; (setq my-whisper-targets '(my-whisper-save-to-file my-whisper-org-save-to-clocked-task))
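If I wanted to tweak the entry layout, it might be easier to build the string separately from the buffer handling. A rough sketch, where my-whisper-example-format-note is a hypothetical helper and the optional TIME and ZONE arguments exist only to make the output predictable:

```elisp
;; Hypothetical helper for illustration only.
(defun my-whisper-example-format-note (text &optional link time zone)
  "Return a note entry for TEXT prefixed with an HH:MM timestamp.
LINK, if non-nil, goes on its own line after the text.  TIME and
ZONE are passed to `format-time-string'."
  (concat "\n\n"
          (format-time-string "%H:%M " time zone)
          text "\n"
          (if link (concat link "\n") "")))

;; Fixed time (2024-01-02 03:04:05 UTC) so the result is predictable.
(my-whisper-example-format-note "hello" "[[file:x.org]]"
                                (encode-time 5 4 3 2 1 2024 t) t)
;; => "\n\n03:04 hello\n[[file:x.org]]\n"
```

The saving function would then just insert that string at the end of the notes buffer.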
And now I can redo things if needed:
(defun my-whisper-redo ()
  "Run the transcription step again on the last recording."
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))
I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them.
systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
# Built-in audio, if present
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.name='SysToVMic'" \
    sink_input_properties="node.name='SysToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.name='SysOutToCombined' node.description='SysOutToCombined'" \
    sink_input_properties="node.name='SysOutToCombined' node.description='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
# Yeti USB microphone, if present
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' " \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
# Feed the virtual microphone into the combined sink
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1
# Extra sink that also feeds the combined sink
pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1
pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1
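Since whisper.el records from VirtualMicSink.monitor, it might be handy to check from Emacs that the source exists before recording. Here's a sketch that parses the output of pactl list short sources, assuming tab-separated lines with the source name in the second field. The my-whisper-example-* names are hypothetical, and the sample output is illustrative rather than captured from my system.

```elisp
(defun my-whisper-example-source-names (pactl-output)
  "Return the source names found in PACTL-OUTPUT.
PACTL-OUTPUT is the text printed by `pactl list short sources'.
This assumes tab-separated fields with the name in the second one."
  (delq nil
        (mapcar (lambda (line) (nth 1 (split-string line "\t")))
                (split-string pactl-output "\n" t))))

(defun my-whisper-example-source-available-p (name pactl-output)
  "Return t if NAME is one of the sources listed in PACTL-OUTPUT."
  (and (member name (my-whisper-example-source-names pactl-output)) t))

;; Illustrative sample output, not captured from a real system.
(defvar my-whisper-example-sample-output
  (concat "52\tVirtualMicSink.monitor\tPipeWire\tfloat32le 2ch 48000Hz\tSUSPENDED\n"
          "53\tCombinedSink.monitor\tPipeWire\tfloat32le 2ch 48000Hz\tSUSPENDED\n"))

(my-whisper-example-source-available-p "VirtualMicSink.monitor"
                                       my-whisper-example-sample-output)
;; => t
```

With the real thing, the second argument would be (shell-command-to-string "pactl list short sources"), and the first could be whisper--ffmpeg-input-device.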
Here's a demo: