Sacha Chua - category - audio

Expanding yasnippets by voice in Emacs and other applications

2026-03-04T16:17:40Z

Yasnippet is a template system for Emacs. I want to use it by voice. I'd like to be able to say things like "Okay, define interactive function" and have that expand to a matching snippet in Emacs or other applications. Here's a quick demonstration of expanding simple snippets:

Screencast of expanding snippets by voice in Emacs and in other applications

Using speech recognition for on-the-fly translations in Emacs and faking in-buffer completion for the results

2026-03-31T16:31:15Z

[2026-03-31 Tue]: I added the recognized text so that I can confirm what was translated. I also moved my-type-with-hint to learn-lang-type-with-hint.

When I'm writing a journal entry in French, I sometimes want to translate a phrase that I can't look up word by word using a dictionary. Instead of switching to a browser, I can use an Emacs function to prompt me for text and either insert or display the translation. The plz library makes HTTP requests slightly neater.

(defun my-lang-en-to-fr (text &optional display-only)
  (interactive (list (read-string "Text: ") current-prefix-arg))
  (let* ((url "https://translation.googleapis.com/language/translate/v2")
         (params `(("key" . ,(getenv "GOOGLE_API_KEY"))
                   ("q" . ,text)
                   ("source" . "en")
                   ("target" . "fr")
                   ("format" . "text")))
         (query-string (mapconcat
                        (lambda (pair)
                          (format "%s=%s"
                                  (url-hexify-string (car pair))
                                  (url-hexify-string (cdr pair))))
                        params
                        "&"))
         (full-url (concat url "?" query-string)))
    (let* ((response (plz 'get full-url :as #'json-read))
           (data (alist-get 'data response))
           (translations (alist-get 'translations data))
           ;; `elt' works whether json-read returns the array as a vector (its default) or a list
           (first-translation (elt translations 0))
           (translated-text (alist-get 'translatedText first-translation)))
      (when (called-interactively-p 'any)
        (if display-only
            (message "%s" translated-text)
          (insert translated-text)))
      translated-text)))
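
To make the two halves of that function easier to test without a network call or an API key, here is a minimal sketch that separates building the query string from parsing the response. The `my-example-` names and the sample JSON are assumptions for illustration; the sample only mirrors the response shape that the code above reads (data, translations, translatedText).

```elisp
;; Sketch only: the my-example- names are hypothetical helpers.
(require 'json)
(require 'url-util)

(defun my-example-query-string (params)
  "Encode PARAMS, an alist of (NAME . VALUE) strings, as a URL query string."
  (mapconcat (lambda (pair)
               (concat (url-hexify-string (car pair))
                       "="
                       (url-hexify-string (cdr pair))))
             params "&"))

(defun my-example-first-translation (json-text)
  "Return the first translatedText found in JSON-TEXT, or nil."
  (let* ((response (json-read-from-string json-text))
         (translations (alist-get 'translations (alist-get 'data response))))
    ;; With the default settings, JSON arrays are read as vectors.
    ;; `elt' accepts both lists and vectors.
    (when (> (length translations) 0)
      (alist-get 'translatedText (elt translations 0)))))

(my-example-query-string '(("q" . "good morning") ("target" . "fr")))
;; => "q=good%20morning&target=fr"

(my-example-first-translation
 "{\"data\":{\"translations\":[{\"translatedText\":\"bonjour\"}]}}")
;; => "bonjour"
```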

I think it would be even nicer if I could use speech recognition, so I can keep it a little more separate from my typing thoughts. I want to be able to say "Okay, translate …" or "Okay, … in French" to get a translation. I've been using my fork of natrys/whisper.el for speech recognition in English, and I like it a lot. By adding a function to whisper-after-transcription-hook, I can modify the intermediate results before they're inserted into the buffer.

(defun my-whisper-translate ()
  (goto-char (point-min))
  (let ((case-fold-search t))
    (when (re-search-forward "okay[,\\.]? translate[,\\.]? \\(.+\\)\\|okay[,\\.]? \\(.+?\\) in French" nil t)
      (let* ((s (or (match-string 1) (match-string 2)))
             (translation (save-match-data (my-lang-en-to-fr s))))
        (replace-match
         (propertize translation
                     'type-hint translation
                     'type-original s
                     'help-echo s))))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-translate 70))
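
To see what that regular expression captures, here is a small sketch that applies the same pattern to a plain string instead of the transcription buffer. `my-example-translation-request` is a hypothetical helper, not part of whisper.el.

```elisp
;; Sketch only: same regexp as above, applied with `string-match'.
(defun my-example-translation-request (text)
  "Return the phrase to translate from TEXT, or nil if TEXT is not a request."
  (let ((case-fold-search t))
    (when (string-match
           "okay[,\\.]? translate[,\\.]? \\(.+\\)\\|okay[,\\.]? \\(.+?\\) in French"
           text)
      ;; Group 1 belongs to "Okay, translate ...", group 2 to "Okay, ... in French".
      (or (match-string 1 text) (match-string 2 text)))))

(my-example-translation-request "Okay, translate good morning")
;; => "good morning"
(my-example-translation-request "Okay, this is a test, in French.")
;; => "this is a test,"
(my-example-translation-request "Just a normal sentence.")
;; => nil
```

Note that the second form keeps the comma that Whisper tends to add before "in French", so the text sent for translation may include trailing punctuation.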

But that's too easy. I want to actually type things myself so that I get more practice. Something like an autocomplete suggestion would be handy as a way of showing me a hint at the cursor. The usual completion-at-point functions are too eager to insert things if there's only one candidate, so we'll just fake it with an overlay. This code works only with my whisper.el fork because it supports using a list of functions for whisper-insert-text-at-point.

(defun my-whisper-maybe-type-with-hints (text)
  "Add this function to `whisper-insert-text-at-point'."
  (let* ((hint (and text (org-find-text-property-in-string 'type-hint text)))
         (original (and text (org-find-text-property-in-string 'type-original text))))
    (if hint
        (progn
          (learn-lang-type-with-hint hint original)
          nil)
      text)))
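
The "fake it with an overlay" idea can be sketched as follows. This is not the actual learn-lang-type-with-hint code; `my-example-show-hint` is a hypothetical helper that only shows how an empty overlay with an `after-string` property displays text at point without changing the buffer.

```elisp
;; Sketch only: display a hint at point without inserting it.
(defun my-example-show-hint (hint)
  "Display HINT in a dimmed face after point without inserting it.
Return the overlay so that the caller can delete it later."
  (let ((overlay (make-overlay (point) (point))))
    (overlay-put overlay 'after-string (propertize hint 'face 'shadow))
    overlay))

(with-temp-buffer
  (insert "Ceci ")
  (let ((overlay (my-example-show-hint "est un test")))
    ;; The buffer text is unchanged; only the display includes the hint.
    (prog1 (buffer-string)
      (delete-overlay overlay))))
;; => "Ceci "
```

A fuller version would presumably update or remove the overlay as matching characters are typed.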

Here's a demonstration of me saying "Okay, this is a test, in French.":

Screencast of using speech recognition to translate into French and provide a hint when typing

Since we're faking in-buffer completion here, maybe we can still get away with considering this as an entry for Emacs Carnival February 2026: Completion? =)

This is part of my Emacs configuration.

View Org source for this post

You can e-mail me at sacha@sachachua.com.

Using Silero voice activity detection to automatically queue multiple transcriptions with natrys/whisper.el

2026-03-27T23:03:01Z

[2026-03-27 Fri]: Moved code to sachac/speech-input.

I can queue multiple transcriptions with whisper.el so that they get processed sequentially with backup audio. It catches up when I pause to think. Now I want to use Silero voice activity detection to do that kind of segmentation for me automatically.

First, I need a Python server that can print out events when it notices the start or stop of a speech segment. If I print out the timestamps, I might be able to cross-reference them someday with interesting things. For now, even just paying attention to the end of a segment is enough for what I want to do.

Python script for printing out events

Queuing multiple transcriptions with whisper.el speech recognition

2026-01-23T19:50:37Z

I want to be able to talk out loud and have the ideas go into Emacs. I can do this in a number of different ways:

- I briefly demonstrated a step-by-step approach with natrys/whisper.el with a single file. I press a keyboard shortcut to start the recording, another shortcut to stop the recording, and it transcribes it in the background. But the way whisper.el is set up is that if I press the keyboard shortcut to start recording again it will offer to interrupt the transcription process, which is not what I want. I want to just keep talking and have it process results as things come in.
- I'm also experimenting with Google Chrome's web speech API to do continuous speech recognition, which I can get into Emacs using a web socket.
- What I've just figured out is how to layer a semi-continuous interface for speech recognition on top of whisper.el so that while it's processing in the background, I can just press a keyboard shortcut (I'm using numpad 9 to call my-whisper-continue) to stop the previous recording, queue it for processing, and start the next recording. If I use this keyboard shortcut to separate my thoughts, then Whisper has a much easier time making sense of the whole sentence or paragraph or whatever, instead of trying to use the sliding 30 second context window that many streaming approaches to speech recognition try to use.
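
The queuing idea (finish one recording, queue it, and process queued recordings one at a time) can be sketched like this. It is not the my-whisper-continue code: all the `my-example-` names are hypothetical, and `my-example-transcribe` is a stand-in that calls its callback right away instead of talking to a server.

```elisp
;; Sketch only: a first-in, first-out queue processed one item at a time.
(defvar my-example-queue nil
  "Recordings waiting to be transcribed, oldest first.")
(defvar my-example-busy nil
  "Non-nil while a transcription is in progress.")
(defvar my-example-results nil
  "Finished transcriptions, oldest first.")

(defun my-example-transcribe (file callback)
  "Stand-in for an asynchronous request: call CALLBACK with text for FILE."
  (funcall callback (format "text of %s" file)))

(defun my-example-process-next ()
  "Start the next queued transcription unless one is already running."
  (when (and my-example-queue (not my-example-busy))
    (setq my-example-busy t)
    (my-example-transcribe (pop my-example-queue) #'my-example-done)))

(defun my-example-done (text)
  "Record TEXT, then move on to the next queued recording."
  (setq my-example-results (append my-example-results (list text))
        my-example-busy nil)
  (my-example-process-next))

(defun my-example-enqueue (file)
  "Add FILE to the queue and start processing if nothing is running."
  (setq my-example-queue (append my-example-queue (list file)))
  (my-example-process-next))

(my-example-enqueue "first.wav")
(my-example-enqueue "second.wav")
my-example-results
;; => ("text of first.wav" "text of second.wav")
```

With a real asynchronous backend, `my-example-done` would be called from a process sentinel or a request callback, so new recordings could pile up in the queue while an earlier one is still being transcribed.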

Question: Did you fix the keyboard delay you've got while speech catches what you're saying?

Sometimes, when the speech recognition kicks in, my computer gets busy. When my computer gets really busy, it doesn't process my keystrokes in the right order, which is very annoying because then I have to delete the previous word and retype it. I haven't sorted that out yet, but it seems like I probably have to lower the priority on different processes. On the plus side, as I mentioned, if I dictate things instead of typing them, then I don't run into that problem at all.

Also, other notes on delays: The continuous speech recognition via Google Chrome shows up fairly quickly, but it's not very precise, and it doesn't have punctuation. Even if there's a little bit of a delay, as long as I press the my-whisper-continue shortcut after each thought, then I can get that text into my Emacs buffer using the nicer transcription from my selected model. There is going to be a bit of a delay for that one because it gets processed at the end of the thought. Also, I need to start thinking in complete sentences instead of just adding one clause after the other as my brain goes on all of these tangents. I think it's pretty promising. There's the continuous speech recognition via Google Chrome if I don't mind the lower accuracy and lack of punctuation, and I can still get the pretty version on the other side.

Why talk out loud? I liked the Bookclub Tapas presentation that Maddie Sullivan did at EmacsConf 2025. Talking out loud helps me be a lot more verbose about what I'm saying, compared to typing things out or even like having to switch to my notes or interrupting my screen with an Org capture buffer. Of course I want to clean that up for putting into a blog post, but given that my life still sometimes has random interruptions from a kiddo who must have my attention at that very minute, having that kind of record that I can at least try to reread afterwards to reconstruct what I was thinking about sounds like it might be helpful.

Still, making sense out loud is hard. I'm not actually used to talking to people that much now. This is probably a good reason for me to experiment with streaming more. Then I get the practice in talking out loud, there are backup recordings, and people can ask questions when things are unclear.

Of course, sometimes the text doesn't quite make sense because of the speech recognition errors. I can usually figure it out from the context. I save the audio as well so that I can go back and listen to it again if I really need to.

Anyway, here's the code for sending the current recording to whisper in the background and starting another recording. It assumes a lot about how things are set up. For example, I'm only testing this with a local speaches server instead of whisper.cpp. You might need to look at my other speech related configuration blog posts and sections in order to make sense of it.

Code for queuing whisper.el requests to a local server

Emacs and whisper.el: Trying out different speech-to-text backends and models

2026-01-20T19:21:46Z

I was curious about parakeet because I heard that it was faster than Whisper on the HuggingFace leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), it seemed like my results were a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.

In order to try natrys/whisper.el with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick and dirty code for doing so, in case you want to try it out for yourself.

(defvar my-whisper-url-format "http://%s:%d/transcribe")
;;;###autoload
(defun my-whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))
;;;###autoload
(defun my-whisper--check-model-consistency () t)

I moved my server code to speech-input-speaches.el.

(with-eval-after-load 'whisper
  (advice-add 'whisper--transcribe-via-local-server :override #'my-whisper--transcribe-via-local-server)
  (advice-add 'whisper--check-model-consistency :override #'my-whisper--check-model-consistency)
  (advice-add 'whisper--ensure-server :override #'speech-input-speaches-server-start))
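
If `:override` advice is unfamiliar, here is a tiny self-contained sketch of what it does, using hypothetical stand-in functions rather than whisper.el internals. The replacement runs instead of the original until the advice is removed.

```elisp
;; Sketch only: stand-in functions to show how :override advice behaves.
(defun my-example-check-model ()
  "Pretend to validate a model name the way a package might."
  (error "Unknown model"))

(defun my-example-skip-check ()
  "Replacement that accepts any model name."
  t)

(advice-add 'my-example-check-model :override #'my-example-skip-check)
(my-example-check-model)   ; => t, because the replacement runs instead

;; Removing the advice restores the original behaviour.
(advice-remove 'my-example-check-model #'my-example-skip-check)
```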

Then I have this function for trying things out.

(defun my-test-whisper-api (url &optional args)
  "Send `whisper--temp-file' to URL with curl and show the response.
ARGS is a list of extra arguments to pass to curl."
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append (mapcan
                    (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
                   (mapcan
                    (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
                   args))
    (message "%s %s" (buffer-string) url)))

Here's the audio file. It is around 10 seconds long. I run the benchmark 3 times and report the average time.

Download the audio

Code for running the benchmarks
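
The "run it 3 times and average" approach described above could be sketched like this. It is not the benchmark code from the post; `my-example-average-seconds` is a hypothetical helper, and the `sleep-for` call stands in for a real transcription request.

```elisp
;; Sketch only: time a function over several runs and average the result.
(defun my-example-average-seconds (fn &optional runs)
  "Call FN RUNS times (default 3) and return the mean elapsed time in seconds."
  (let ((runs (or runs 3))
        (total 0.0))
    (dotimes (_ runs)
      (let ((start (float-time)))
        (funcall fn)
        (setq total (+ total (- (float-time) start)))))
    (/ total runs)))

;; The lambda here is a placeholder for something like a call to
;; `my-test-whisper-api' with a particular URL.
(my-example-average-seconds (lambda () (sleep-for 0.01)))
```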

Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere

2026-01-13T18:38:54Z

[2026-01-30 Fri]: Major change: I switched to my fork of natrys/whisper.el so that I can specify functions that change the window configuration etc.
[2026-01-13 Tue]: Change main function to my-whisper-run, use seq-reduce to go through the functions.
[2026-01-09 Fri]: Added code for automatically capturing screenshots, saving text, working with a list of functions.
[2026-01-08 Thu]: Added demo, fixed some bugs.
[2026-01-04 Sun]: Added note about difference from MELPA package, fixed :vc

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.

I've bound to the command whisper-run. I press to start recording, talk, and then press to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.

(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper--mode-line-recording-indicator "⏺")
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
     (expand-file-name (file-name-as-directory whisper-install-directory))
     "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models,
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("" . whisper-run)
   ("C-" . my-whisper-run)
   ("S-" . my-whisper-replay)
   ("M-" . my-whisper-toggle-language)))

Let's see if we can process "Computer remind me to…":

(defvar my-whisper-org-reminder-template "t")

(defun my-whisper-org-process-reminder ()
  (let ((text (buffer-string))
        reminder)
    (when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
      (setq reminder (match-string 1 text))
      (save-window-excursion
        (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
          (when (markerp whisper--marker) (goto-char whisper--marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))
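
The number passed as the third argument to add-hook (50 here, -100 and 70 in other snippets) is a depth: functions with lower depths run earlier, regardless of the order in which they were added. Here is a small sketch with a hypothetical hook variable and stand-in functions.

```elisp
;; Sketch only: a stand-in hook to show how depth controls run order.
(defvar my-example-hook nil
  "Stand-in for a hook such as an after-transcription hook.")
(defvar my-example-order nil
  "Names of the functions that ran, most recent first.")

(defun my-example-translate () (push 'translate my-example-order))
(defun my-example-reminder () (push 'reminder my-example-order))
(defun my-example-fix-errors () (push 'fix-errors my-example-order))

;; Added in the "wrong" order on purpose.
(add-hook 'my-example-hook #'my-example-translate 70)
(add-hook 'my-example-hook #'my-example-reminder 50)
(add-hook 'my-example-hook #'my-example-fix-errors -100)

(run-hooks 'my-example-hook)
(reverse my-example-order)
;; => (fix-errors reminder translate)
```

So with these depths, common errors would be fixed first, then reminders processed, and translation would happen last.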

Disk space is inexpensive and backups are great, so let's save each file using the timestamp.

(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))

(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))
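
One thing that snippet takes for granted is that the directory already exists. Here is a sketch of a variant that creates it on demand; `my-example-recording-file` is a hypothetical helper and the directory used in the example is a throwaway one under the temporary directory.

```elisp
;; Sketch only: build a timestamped file name, creating the directory if needed.
(defun my-example-recording-file (directory &optional time)
  "Return a timestamped .wav file name in DIRECTORY.
Create DIRECTORY if it does not exist.  TIME defaults to the current time."
  (make-directory directory t)
  (expand-file-name (format-time-string "%Y-%m-%d-%H-%M-%S.wav" time)
                    directory))

;; Returns something like .../whisper-example/2026-01-13-18-38-54.wav
(my-example-recording-file
 (expand-file-name "whisper-example" temporary-file-directory))
```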

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

(defun my-whisper-replay (&optional file)
  "Replay the last temporary recording."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (setq whisper--temp-file (or file whisper--temp-file))
  (mpv-play whisper--temp-file))

(defun my-whisper-insert-retry (&optional file)
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (whisper--cleanup-transcription)
  (setq whisper--marker (point-marker)
        whisper--temp-file (or file whisper--temp-file))
  (whisper--transcribe-audio))

Il peut aussi comprendre le français.

(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart for the language
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))

I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, to start recording, to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.

Then, when I'm looking at something else and I want to record a note, I can press to start the recording, and then C- to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)

(defun my-whisper-reset (text)
  (setq my-whisper-skip-annotation nil)
  (remove-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task)
  text)

;; Only works with my tweaks to whisper.el
;; https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
(with-eval-after-load 'whisper
  (setq whisper-insert-text-at-point
        '(my-whisper-handle-commands
          my-whisper-save-text
          my-whisper-save-to-file
          my-whisper-maybe-expand-snippet
          my-whisper-maybe-type
          my-whisper-maybe-type-with-hints
          my-whisper-insert
          my-whisper-reset)))

(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")

(defun my-whisper-insert (text)
  (let ((markers
         (cond
          ((null my-whisper-target-markers)
           (list whisper--marker)) ; current point where whisper was started
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (when text
      (mapcar (lambda (marker)
                (with-current-buffer (marker-buffer marker)
                  (save-restriction
                    (widen)
                    (when (markerp marker) (goto-char marker))
                    (when (and (derived-mode-p 'org-mode) (org-at-drawer-p))
                      (insert "\n"))
                    (whisper--insert-text
                     (concat
                      (if (looking-back "[ \t\n]\\|^")
                          ""
                        " ")
                      (string-trim text)))
                    ;; Move the marker forward here
                    (move-marker marker (point)))))
              markers)
      (when my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(defun my-whisper-maybe-type (text)
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

(defun my-whisper-clear-markers ()
  (interactive)
  (setq my-whisper-target-markers nil))

(defun my-whisper-use-current-point (&optional add)
  (interactive (list current-prefix-arg))
  (if add
      (push (point-marker) my-whisper-target-markers)
    (setq my-whisper-target-markers (list (point-marker)))))

(defun my-whisper-run-at-point (&optional add)
  (interactive (list current-prefix-arg))
  (my-whisper-clear-markers)
  (whisper-run))

(keymap-global-set "" #'my-whisper-run-at-point)
(keymap-global-set "" #'whisper-run)

(defun my-whisper-jump-to-marker ()
  (interactive)
  (with-current-buffer (marker-buffer (car my-whisper-target-markers))
    (goto-char (car my-whisper-target-markers))))

(defun my-whisper-use-currently-clocked-task (&optional add)
  (interactive (list current-prefix-arg))
  (save-window-excursion
    (save-restriction
      (save-excursion
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (if add
            (push (point-marker) my-whisper-target-markers)
          (setq my-whisper-target-markers (list (point-marker))))))))

(defun my-whisper-run (&optional skip-annotation)
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task -10)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))

(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))

(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) ; not in focus, take a screenshot
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
          (if (org-clocking-p)
              (save-window-excursion
                (save-restriction
                  (save-excursion
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (unless (bolp)
                      (insert "\n"))
                    (insert "\n")
                    (if (and link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (if screenshot-filename
                             (concat "(" (org-link-make-string
                                          (concat "file:" screenshot-filename)
                                          "screenshot") ") ")
                           "")
                         link
                         "\n")
                      (when screenshot-filename
                        (insert (org-link-make-string
                                 (concat "file:" screenshot-filename)
                                 "screenshot")
                                "\n")))
                    (when region
                      (insert "#+begin_example\n" region "\n#+end_example\n"))
                    (insert text "\n")
                    (setq my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))))

(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))

(defun my-whisper-org-clear-saved-annotation ()
  "Forget the last annotation so that the next note gets a link again."
  (setq my-whisper-last-annotation nil))
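
Each function in the whisper-insert-text-at-point list above takes the text and returns either text for the next function or nil once the text has been handled. The changelog mentions seq-reduce, so here is a minimal sketch of that pattern. I'm assuming the general shape, not quoting the fork: the `my-example-` functions are hypothetical stand-ins.

```elisp
;; Sketch only: thread text through a list of handler functions.
(require 'seq)

(defun my-example-run-handlers (handlers text)
  "Pass TEXT through HANDLERS in order and return the final result.
Each handler receives the previous handler's return value, which may be nil."
  (seq-reduce (lambda (result handler) (funcall handler result))
              handlers
              text))

(defvar my-example-saved nil
  "Text seen by `my-example-save', most recent first.")

(defun my-example-save (text)
  "Remember TEXT and pass it along unchanged."
  (when text (push text my-example-saved))
  text)

(defun my-example-handle-command (text)
  "Consume TEXT if it starts with \"computer\"; otherwise pass it along."
  (unless (and text (string-prefix-p "computer" text t))
    text))

(my-example-run-handlers '(my-example-save my-example-handle-command)
                         "hello world")
;; => "hello world"
(my-example-run-handlers '(my-example-save my-example-handle-command)
                         "Computer, remind me to stretch")
;; => nil
```

Because later handlers still get called with nil, each one needs to check for nil text first, which is what the `(when text ...)` wrappers in the functions above do.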

Here's an idea for a function that saves the recognized text with a timestamp.

(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))

And now I can redo things if needed:

(defun my-whisper-redo ()
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))

I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them.

systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysToVMic'" \
    sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
    sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' " \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1

pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

Here's a demo:

Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function

#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time

2025-01-09T15:24:29Z

New in this video: subed-record-sum-time, #+PAD_LEFT and #+PAD_RIGHT

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Play by play:

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be.
- subed-record uses subtitles and directives in comments in a VTT subtitle file to edit audio and video. subed-record-sum-time calculates the resulting duration and displays it in the minibuffer.
First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times.
- I'm experimenting with an algorithmic way to combine the breaks from my script with the text from the transcript. subed-align calls the aeneas forced alignment tool to match up the text with the timestamps. I use subed-waveform-show-all to show all the waveforms.
Then I can skip my oopses.
- Adding a NOTE #+SKIP comment before a subtitle makes subed-record-compile-video and subed-record-compile-flow skip that part of the audio.
Sometimes WhisperX doesn't catch them,
- WhisperX sometimes doesn't transcribe my false starts if I repeat things quickly.
so I also look at waveforms
- subed-waveform-show-all adds waveforms for all the subtitles. If I notice there's a pause or a repeated shape in the waveform, or if I listen and notice the repetition, I can confirm by middle-clicking on the waveform to sample part of it.
and characters per second.
- Low characters per second is sometimes a sign that the timestamps are incorrect or there's a repetition that wasn't transcribed.
I already talk quickly, so I'm not going to speed that up
- Also, I already sound like a chipmunk; mechanically speeding up my recording to fit in a certain time will make that worse =)
but I can trim the pauses in between phrases which is easy to do with waveforms.
- left-click to set the start, right-click to set the stop. If I want to adjust the previous/next one at the same time, I would use shift-left-click or shift-right-click, but here I want to skip the gaps between phrases, so I adjust the current subtitle without making the previous/next one longer.
Sometimes, after reviewing a draft, I realize I need a little more time.
- I can specify visuals like a video, animated GIF, or an image by adding a [[file:...]] link in the comment for a subtitle. That visual will be used until the next visual is specified in a comment on a different subtitle. subed-record-compile-video can automatically speed up video clips to fit in the time for the current audio segment, which is the set of subtitles before the next visual is defined. After I compile and review the video, sometimes I notice that something goes by too quickly.
If the original audio has some silence, I can just copy and paste it.
- This can sometimes feel more natural than adding in complete silence.
If not, I can pad left or pad right to add some silence.
- I added a new feature so that I could specify something like #+PAD_RIGHT: 1.5 in a comment to add 1.5 seconds of silence after the audio specified by that subtitle.
I can try the flow of some sections
- I can select a region and then use M-x subed-record-compile-try-flow to play the audio or C-u M-x subed-record-compile-try-flow to play the audio+video for that region.
and compile the video when I'm ready.
- subed-record-compile-video compiles the video to the file specified in #+OUTPUT: filename. ffmpeg is very arcane, so I'm glad I can simplify my use of it with Emacs Lisp.
Emacs can do almost anything. Yay Emacs!
- Non-linear audio and video editing is actually pretty fun in a text editor, especially when I can just use M-x vundo to navigate my undo history.
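
To give a rough idea of the arithmetic behind a command like subed-record-sum-time, here is a sketch that adds up the durations of the subtitles that are kept plus any padding. This is not subed-record's implementation: the plist format, the `my-example-sum-time` name, and the assumption that padding counts toward the total are all mine.

```elisp
;; Sketch only: sum cue durations, skipping some cues and adding padding.
(defun my-example-sum-time (cues)
  "Return the total duration of CUES in milliseconds.
Each cue is a plist with :start and :stop in milliseconds, and optionally
:skip (non-nil to leave the cue out) and :pad-left / :pad-right in seconds."
  (let ((total 0))
    (dolist (cue cues)
      (unless (plist-get cue :skip)
        (setq total
              (+ total
                 (- (plist-get cue :stop) (plist-get cue :start))
                 (round (* 1000 (or (plist-get cue :pad-left) 0)))
                 (round (* 1000 (or (plist-get cue :pad-right) 0)))))))
    total))

(my-example-sum-time
 '((:start 0 :stop 2000)
   (:start 2000 :stop 3000 :skip t)            ; like NOTE #+SKIP
   (:start 3500 :stop 5000 :pad-right 1.5)))   ; like #+PAD_RIGHT: 1.5
;; => 5000
```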
