Categories: speech-recognition


Categorizing Emacs News items by voice in Org Mode

| speech, speech-recognition, emacs, org

I'm having fun exploring which things might actually be easier to do by voice than by typing. For example, after I wrote some code to expand yasnippets by voice, I realized that it was easier to:

  1. press my shortcut,
  2. say "okay, define interactive function",
  3. and then press my shortcut again,

than to:

  1. mentally say it,
  2. get the first initials,
  3. type in "dfi",
  4. and press Tab to expand.

Another area where I do this kind of mental translation for keyboard shortcuts is when I categorize dozens of Emacs-related links each week for Emacs News. I used to do this by hand. Then I wrote a function to try to guess the category based on regular expressions (my-emacs-news-guess-category in emacs-news/index.org, which is large). Then I set up a menu that lets me press numbers corresponding to the most frequent categories and use tab completion for the rest. 1 is Emacs Lisp, 2 is Emacs development, 3 is Emacs configuration, 4 is appearance, 5 is navigation, and so on. It's not very efficient, but some of it has at least gotten into muscle memory, which is also part of why it's hard to change the mapping. I don't come across that many links for Emacs development or Spacemacs, and I could probably change them to something else, but… Anyway.
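The guessing function itself is buried in my config, but the idea is just a first-match regexp table. Here's a rough Python sketch of that approach; the patterns and category names are illustrative, not my actual table:

```python
import re

# Illustrative patterns only, not my real category table.
CATEGORY_PATTERNS = [
    ("Org Mode", r"\borg([- ])?mode\b|\borg\b"),
    ("Emacs Lisp", r"\belisp\b|\bemacs lisp\b"),
    ("Dired", r"\bdired\b"),
]

def guess_category(title):
    """Return the first category whose pattern matches TITLE, else None."""
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, title, re.IGNORECASE):
            return category
    return None
```

First match wins, so more specific patterns should come before general ones.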

2026-03-23_20-38-33.png
Figure 1: Screenshot of my menu for categorizing links

I wanted to see if I could categorize links by voice instead. I might not always be able to count on being able to type a lot, and it's always fun to experiment with other modes of input. Here's a demonstration showing how Emacs can automatically open the URLs, wait for voice input, and categorize the links using a reasonably close match. The *Messages* buffer displays the recognized output to help with debugging.

Screencast with audio: categorizing links by voice

This is how it works:

  1. It starts an ffmpeg recording process.
  2. It starts Silero voice activity detection.
  3. When it detects that speech has ended, it uses curl to send the WAV to an OpenAI-compatible server (in my case, Speaches with the Systran/faster-whisper-base.en model) for transcription, along with a prompt to try to influence the recognition.
  4. It compares the result with the candidates using string-distance for an approximate match. It calls the code to move the current item to the right category, creating the category if needed.
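Step 4 is basically nearest-candidate matching by edit distance. Here's a rough Python sketch of that idea; the candidate list, aliases, and distance threshold are illustrative guesses, not the exact values my code uses:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Illustrative candidates: canonical category plus spoken aliases.
CANDIDATES = [
    ("Org Mode", ["org", "org mode"]),
    ("Emacs Lisp", ["emacs lisp"]),
    ("Reddit", ["reddit", "read it"]),
]

def best_match(heard, max_distance=3):
    """Pick the candidate whose alias is closest to the recognized text."""
    heard = heard.strip().lower()
    best = None
    for canonical, aliases in CANDIDATES:
        for alias in aliases:
            d = levenshtein(heard, alias)
            if best is None or d < best[0]:
                best = (d, canonical)
    return best[1] if best and best[0] <= max_distance else None
```

The distance cap keeps wild misrecognitions from being forced into the nearest category.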

Since this doesn't always result in the right match, I added an Undo command. I also have a Delete command for removing the current item, Scroll Up and Scroll Down, and a way to quit.

Initial thoughts

I used it to categorize lots of links in this week's Emacs News, and I think it's promising. I loved the way my hands didn't have to hover over the number keys or move between those and the characters. Using voice activity detection meant that I could just keep dictating categories instead of pressing keyboard shortcuts or using the foot pedal I recently dusted off. There's a slight delay, of course, but I think it's worth it. If this settles down and becomes a solid part of my workflow, I might even be able to knit or hand-sew while doing this step, or simply do some stretching exercises.

What about using streaming speech recognition? I've written some code to use streaming speech recognition, but the performance wasn't good enough when I tried it on my laptop (Lenovo P52 released in 2018, no configured GPU under Linux). The streaming server dropped audio segments in order to try to catch up. I'd rather have everything transcribed at the level of the model I want, even if I have to wait a little while. I also tried using the Web Speech API in Google Chrome for real-time speech transcription, but it's a little finicky. I'm happy with the performance I get from either manually queueing speech segments or using VAD and then using batch speech recognition with a model that's kept in memory (which is why I use a local server instead of a command-line tool). Come to think of it, I should try this with a higher-quality model like medium or large, just in case the latency turns out to be not that much more for this use case.

What about external voice control systems like Talon Voice or Cursorless? They seem like neat ideas and lots of people use them. I think hacking something into Emacs with full access to its internals could be lots of fun too.

A lot of people have experimented with voice input for Emacs over the years. It could be fun to pick up ideas for commands and grammars.

What about automating myself out of this loop? I've considered training a classifier or sending the list to a large language model to categorize links in order to set more reasonable defaults, but I think I'd still want manual control, since the fun is in getting a sense of all the cool things that people are tinkering around with in the Emacs community. I found that with voice control, it was easier for me to say the category than to look for the category it suggested and then say "Okay" to accept the default. If I display the suggested category in a buffer with very large text (and possibly category-specific background colours), then I can quickly glance at it or use my peripheral vision. But yeah, it's probably easier to look at a page and say "Org Mode" than to look at the page, look at the default text, see if it matches Org Mode, and then say okay if it is.

Ideas for next steps

I wonder how to line up several categories. I could probably rattle off a few without waiting for the next one to load, and just pause when I'm not sure. Maybe while there's a reasonably good match within the first 1-3 words, I'll take candidates from the front of the queue. Or I could delimit it with another easily-recognized word, like "next".
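The "next" delimiter idea could be as simple as splitting the transcript on that word. A quick Python sketch (the function name is made up):

```python
import re

def split_categories(transcript):
    """Split one long utterance into queued category names,
    treating the word 'next' as a delimiter (hypothetical)."""
    parts = re.split(r"\bnext\b", transcript, flags=re.IGNORECASE)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]
```

Each resulting piece would then go through the usual approximate-match step.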

I want to make a more synchronous version of this idea so that I can have a speech-enabled drop-in replacement that I can use as my y-or-n-p while still being able to type y or n. This probably involves using sit-for and polling to see if it's done. And then I can use that to play Twenty Questions, but also to do more serious stuff. It would also be nice to have replacements for read-string and completing-read, since those block Emacs until the user enters something.
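The sit-for-and-poll idea boils down to a loop that checks a result queue until an answer arrives or a timeout passes. Here's a language-agnostic sketch in Python (the function and queue are hypothetical stand-ins, not part of whisper.el or speech-input):

```python
import queue
import time

def y_or_n_poll(speech_results, timeout=5.0, poll_interval=0.1):
    """Poll a queue of recognized words until a yes/no arrives or the
    timeout expires; a stand-in for a sit-for polling loop in Emacs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            word = speech_results.get(timeout=poll_interval)
        except queue.Empty:
            continue
        w = word.strip().lower()
        if w in ("yes", "yeah", "y"):
            return True
        if w in ("no", "nope", "n"):
            return False
    return None  # timed out; the caller can fall back to the keyboard
```

Returning None on timeout is what would let a real y-or-n-p replacement keep accepting keystrokes in the meantime.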

I might take a side-trip into a conversational interface for M-x doctor and M-x dunnet, because why not. Naturally, it also makes sense to voice-enable agent-shell and gptel interactions.

I'd like to figure out a number- or word-based completion mechanism so that I can control Reddit link replacement as well, since I want to select from a list of links from the page. Maybe something similar to the way voicemacs adds numbers to helm and company or how flexi-choose.el works.
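Number-based selection could map spoken number words onto candidate positions. A small Python sketch of the idea (names are made up):

```python
def numbered_candidates(candidates):
    """Return a selector that maps a spoken number ('three' or '3')
    to the corresponding candidate, or None if out of range."""
    words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
    def select(spoken):
        spoken = spoken.strip().lower()
        n = words.get(spoken)
        if n is None and spoken.isdigit():
            n = int(spoken)
        if n is not None and 1 <= n <= len(candidates):
            return candidates[n - 1]
        return None
    return select
```

Number words are handy because short digits like "two"/"to" are easy for speech recognition to confuse with other words, so a real version would probably want aliases here too.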

I'm also thinking about how I can shift seamlessly between typing and speaking, like when I want to edit a link title. Maybe I can check if I'm in the minibuffer and what kind of minibuffer I'm in, perhaps like the way Embark does.

It would be really cool to define speech commands by reusing the keymap structure that menus also use. This is how to define a menu in Emacs Lisp:

(easy-menu-define words-menu global-map
  "Menu for word navigation commands."
  '("Words"
     ["Forward word" forward-word]
     ["Backward word" backward-word]))

and this is how to set just one binding:

(keymap-set-after my-menu "<drink>"
  '("Drink" . drink-command) 'eat)

That makes sense to reuse for speech commands. I'd also like to be able to specify aliases while hiding them or collapsing them for a "What can I say" help view… Also, if keymaps work, then maybe minor modes or transient maps could work? This sort of feels like it should be the voice equivalent of a transient map.
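The alias-hiding idea could work like this: keep a lookup table that accepts every alias, but a help view that shows only the canonical names. A rough Python sketch of the data structure (hypothetical names, not actual Emacs keymap code):

```python
def build_speech_map(commands):
    """commands: list of (canonical-name, aliases, handler).
    Return a lookup table that accepts every alias, plus a
    'What can I say' help view showing only canonical names."""
    lookup, help_view = {}, []
    for canonical, aliases, handler in commands:
        help_view.append(canonical)
        for name in [canonical, *aliases]:
            lookup[name.lower()] = handler
    return lookup, help_view
```

In the keymap version, the aliases would just be extra bindings pointing at the same command, with a property marking them as hidden from the help view.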

The code so far

(defun my-emacs-news-categorize-with-voice (&optional skip-browse)
  (interactive (list current-prefix-arg))
  (unless skip-browse
    (my-spookfox-browse))
  (speech-input-cancel-recording)
  (let ((default (if (fboundp 'my-emacs-news-guess-category) (my-emacs-news-guess-category))))
    (speech-input-from-list
     (if default
         (format "Category (%s): " default)
       "Category: ")
     '(("Org Mode" "Org" "Org Mode")
       "Other"
       "Emacs Lisp"
       "Coding"
       ("Emacs configuration" "Config" "Configuration")
       ("Appearance" "Appearance")
       ("Default" "Okay" "Default")
       "Community"
       "AI"
       "Writing"
       ("Reddit" "Read it" "Reddit")
       "Shells"
       "Navigation"
       "Fun"
       ("Dired" "Directory" "Dir ed")
       ("Mail, news, and chat" "News" "Mail" "Chat")
       "Multimedia"
       "Scroll down"
       "Scroll up"
       "Web"
       "Delete"
       "Skip"
       "Undo"
       ("Quit" "Quit" "Cancel" "All done"))
     (lambda (result text)
       (message "Recognized %s original %s" result text)
       (pcase result
         ("Undo"
          (undo)
          (my-emacs-news-categorize-with-voice t))
         ("Skip"
          (forward-line)
          (my-emacs-news-categorize-with-voice))
         ("Quit"
          (message "All done.")
          (speech-input-cancel-recording))
         ("Reddit"
          (my-emacs-news-replace-reddit-link)
          (my-emacs-news-categorize-with-voice t))
         ("Scroll down"
          (my-spookfox-scroll-down)
          (my-emacs-news-categorize-with-voice t))
         ("Scroll up"
          (my-spookfox-scroll-up)
          (my-emacs-news-categorize-with-voice t))
         ("Delete"
          (delete-line)
          (undo-boundary)
          (my-emacs-news-categorize-with-voice))
         ("Default"
          (my-org-move-current-item-to-category
           (concat default ":"))
          (undo-boundary)
          (my-emacs-news-categorize-with-voice))
         (_
          (my-org-move-current-item-to-category
           (concat result ":"))
          (undo-boundary)
          (my-emacs-news-categorize-with-voice))))
     t)))

It uses Spookfox to control Firefox from Emacs:

(defun my-spookfox-scroll-down ()
  (interactive)
  (spookfox-js-injection-eval-in-active-tab "window.scrollBy(0, document.documentElement.clientHeight);" t))

(defun my-spookfox-scroll-up ()
  (interactive)
  (spookfox-js-injection-eval-in-active-tab "window.scrollBy(0, -document.documentElement.clientHeight);"))

(defun my-spookfox-background-tab (url &rest args)
  "Open URL as a background tab."
  (if spookfox--connected-clients
      (spookfox-tabs--request (cl-first spookfox--connected-clients) "OPEN_TAB" `(:url ,url))
    (browse-url url)))

It also uses these functions for categorizing Org Mode items:

(defun my-org-move-current-item-to-category (category)
  "Move current list item under CATEGORY earlier in the list.
CATEGORY can be a string or a list of the form (text indent regexp).
Point should be on the next line to process, even if a new category
has been inserted."
  (interactive (list (completing-read "Category: " (my-org-get-list-categories))))
  (when category
    (let* ((col (current-column))
           (item (point-at-bol))
           (struct (org-list-struct))
           (category-text (if (stringp category) category (elt category 0)))
           (category-indent (if (stringp category) 2 (+ 2 (elt category 1))))
           (category-regexp (if (stringp category) category (elt category 2)))
           (end (elt (car (last struct)) 6))
           (pos (point))
           s)
      (setq s (org-remove-indentation (buffer-substring-no-properties item (org-list-get-item-end item struct))))
      (save-excursion
        (if (string= category-text "x")
            (org-list-send-item item 'delete struct)
          (goto-char (caar struct))
          (if (re-search-forward (concat "^ *- +" category-regexp) end t)
              (progn
                ;; needs a patch to ol.el to check if stringp
                (org-list-send-item item (point-at-bol) struct)
                (org-move-item-down)
                (org-indent-item))
            (goto-char end)
            (org-list-insert-item
             (point-at-bol)
             struct (org-list-prevs-alist struct))
            (let ((old-struct (copy-tree struct)))
              (org-list-set-ind (point-at-bol) struct 0)
              (org-list-struct-fix-bul struct (org-list-prevs-alist struct))
              (org-list-struct-apply-struct struct old-struct))
            (goto-char (point-at-eol))
            (insert category-text)
            (org-list-send-item item 'end struct)
            (org-indent-item)
            (org-indent-item))
          (recenter))))))

(defun my-org-guess-list-category (&optional categories)
  (interactive)
  (require 'cl-lib)
  (unless categories
    (setq categories
          (my-helm-org-list-categories-init-candidates)))
  (let* ((beg (line-beginning-position))
         (end (line-end-position))
         (string (buffer-substring-no-properties beg end))
         (found
          (cl-member string
                     categories
                     :test
                     (lambda (string cat-entry)
                       (unless (string= (car cat-entry) "x")
                         (string-match (regexp-quote (downcase (car cat-entry)))
                                       string))))))
    (when (car found)
      (my-org-move-current-item-to-category
       (cdr (car found)))
      t)))

For the speech-input functions, experimental code is at https://codeberg.org/sachac/speech-input .

View Org source for this post

Expanding yasnippets by voice in Emacs and other applications

| emacs, audio, speech-recognition

Yasnippet is a template system for Emacs. I want to use it by voice. I'd like to be able to say things like "Okay, define interactive function" and have that expand to a matching snippet in Emacs or other applications. Here's a quick demonstration of expanding simple snippets:

Screencast of expanding snippets by voice in Emacs and in other applications

Transcript
  • 00:00 So I've defined some yasnippets with names that I can say. Here, for example, in this menu, you can see I've got "define interactive function" and "with a buffer that I'll display." And in fundamental mode, I have some other things too. Let's give it a try.
  • 00:19 I press my shortcut. "Okay, define an interactive function." You can see that this is a yasnippet. Tab navigation still works.
  • 00:33 I can say, "OK, with a buffer that I'll display," and it expands that also.
  • 00:45 I can expand snippets in other applications as well, thanks to a global keyboard shortcut.
  • 00:50 Here, for example, I can say, "OK, my email." It inserts my email address.
  • 01:02 Yasnippet definitions can also execute Emacs Lisp. So I can say, "OK, date today," and have that evaluated to the actual date.
  • 01:21 So that's an example of using voice to expand snippets.

This is handled by the following code:

(defun my-whisper-maybe-expand-snippet (text)
  "Add to `whisper-insert-text-at-point'."
  (if (and text
           (string-match
            "^ok\\(?:ay\\)?[,\\.]? \\(.+\\)" text))
    (let* ((name
            (downcase
             (string-trim
              (replace-regexp-in-string "[,\\.]" "" (match-string 1 text)))))
           (matching
            (seq-find (lambda (o)
                        (subed-word-data-compare-normalized-string-distance
                         name
                         (downcase (yas--template-name o))))
                      (yas--all-templates (yas--get-snippet-tables)))))
      (if matching
          (progn
            (if (frame-focus-state)
                (progn
                  (yas-expand-snippet matching)
                  nil)
              ;; In another application
              (with-temp-buffer
                (yas-minor-mode)
                (yas-expand-snippet matching)
                (buffer-string))))
        text))
    text))

This code relies on my fork of whisper.el, which lets me specify a list of functions for whisper-insert-text-at-point. (I haven't asked for upstream review yet because I'm still testing things, and I don't know if it actually works for anyone else yet.) It does approximate matching on the snippet name using a function from subed-word-data.el which just uses string-distance. I could probably duplicate the function in my config, but then I'd have to update it in two places if I come up with more ideas.
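For reference, normalized string distance is usually just edit distance divided by the longer string's length, which makes the threshold independent of snippet-name length. Here's a rough Python sketch of that kind of comparison; the 0.3 threshold is a guess for illustration, not necessarily what subed-word-data uses:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_enough(spoken, snippet_name, threshold=0.3):
    """Normalized edit distance: distance divided by the longer length.
    The threshold here is illustrative, not subed-word-data's value."""
    a, b = spoken.lower().strip(), snippet_name.lower().strip()
    return levenshtein(a, b) / max(len(a), len(b), 1) <= threshold
```

Normalizing means a one-character slip in a long snippet name still matches, while short names stay strict.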

The code for inserting text into other applications is defined in my-whisper-maybe-type, which is very simple:

(defun my-whisper-maybe-type (text)
  "If Emacs is not the focused app, simulate typing TEXT.
Add this function to `whisper-insert-text-at-point'."
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

Someday I'd like to provide alternative names for snippets. I also want to make it easy to fill in snippet fields by voice. I'd love to be able to answer minibuffer questions from yas-choose-value, yas-completing-read, and other functions by voice too. Could be fun!


This is part of my Emacs configuration.

Using speech recognition for on-the-fly translations in Emacs and faking in-buffer completion for the results

| audio, speech-recognition, emacs, speech

When I'm writing a journal entry in French, I sometimes want to translate a phrase that I can't look up word by word using a dictionary. Instead of switching to a browser, I can use an Emacs function to prompt me for text and either insert or display the translation. The plz library makes HTTP requests slightly neater.

(defun my-french-en-to-fr (text &optional display-only)
  (interactive (list (read-string "Text: ") current-prefix-arg))
  (let* ((url "https://translation.googleapis.com/language/translate/v2")
         (params `(("key" . ,(getenv "GOOGLE_API_KEY"))
                   ("q" . ,text)
                   ("source" . "en")
                   ("target" . "fr")
                   ("format" . "text")))
         (query-string (mapconcat
                        (lambda (pair)
                          (format "%s=%s"
                                  (url-hexify-string (car pair))
                                  (url-hexify-string (cdr pair))))
                        params
                        "&"))
         (full-url (concat url "?" query-string)))
    (let* ((response (plz 'get full-url :as #'json-read))
           (data (alist-get 'data response))
           (translations (alist-get 'translations data))
           (first-translation (car translations))
           (translated-text (alist-get 'translatedText first-translation)))
      (when (called-interactively-p 'any)
        (if display-only
            (message "%s" translated-text)
          (insert translated-text)))
      translated-text)))

I think it would be even nicer if I could use speech recognition, so I can keep it a little more separate from my typing thoughts. I want to be able to say "Okay, translate …" or "Okay, … in French" to get a translation. I've been using my fork of natrys/whisper.el for speech recognition in English, and I like it a lot. By adding a function to whisper-after-transcription-hook, I can modify the intermediate results before they're inserted into the buffer.

(defun my-whisper-translate ()
  (goto-char (point-min))
  (let ((case-fold-search t))
    (when (re-search-forward "okay[,\\.]? translate[,\\.]? \\(.+\\)\\|okay[,\\.]? \\(.+?\\) in French" nil t)
      (let* ((s (or (match-string 1) (match-string 2)))
             (translation (save-match-data (my-french-en-to-fr s))))
        (replace-match
         (propertize translation
                     'type-hint translation
                     'help-echo s))))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-translate 70))

But that's too easy. I want to actually type things myself so that I get more practice. Something like an autocomplete suggestion would be handy as a way of showing me a hint at the cursor. The usual completion-at-point functions are too eager to insert things if there's only one candidate, so we'll just fake it with an overlay. This code works only with my whisper.el fork because it supports using a list of functions for whisper-insert-text-at-point.

(defun my-whisper-maybe-type-with-hints (text)
  "Add this function to `whisper-insert-text-at-point'."
  (let ((hint (and text (org-find-text-property-in-string 'type-hint text))))
    (if hint
        (progn
          (my-type-with-hint hint)
          nil)
      text)))

(defvar-local my-practice-overlay nil)
(defvar-local my-practice-target nil)
(defvar-local my-practice-start nil)

(defun my-practice-cleanup ()
  "Remove the overlay and stop monitoring."
  (when (overlayp my-practice-overlay)
    (delete-overlay my-practice-overlay))
  (setq my-practice-overlay nil
        my-practice-target nil
        my-practice-start nil)
  (remove-hook 'post-command-hook #'my-practice-monitor t))

(defun my-practice-monitor ()
  "Update the hint overlay, or cancel the practice session."
  (let* ((pos (point))
         (input (buffer-substring-no-properties my-practice-start pos))
         (input-len (length input))
         (target-len (length my-practice-target)))
    (cond
     ((or (< pos my-practice-start)
          (> pos (+ my-practice-start target-len))
          (string-match "[\n\t]" input)
          (string= input my-practice-target))
      (my-practice-cleanup))
     ((string-prefix-p (downcase input) (downcase my-practice-target))
      (let ((remaining (substring my-practice-target input-len)))
        (move-overlay my-practice-overlay pos pos)
        (overlay-put my-practice-overlay 'after-string
                     (propertize remaining 'face 'shadow))))
     (t                                 ; typo
      (move-overlay my-practice-overlay pos pos)
      (overlay-put my-practice-overlay 'after-string
                   (propertize (substring my-practice-target input-len) 'face 'error))))))

(defun my-type-with-hint (string)
  "Show hints for STRING."
  (interactive "sString to practice: ")
  (my-practice-cleanup)
  (setq-local my-practice-target string)
  (setq-local my-practice-start (point))
  (setq-local my-practice-overlay (make-overlay (point) (point) nil t t))
  (overlay-put my-practice-overlay 'after-string (propertize string 'face 'shadow))
  (add-hook 'post-command-hook #'my-practice-monitor nil t))

Here's a demonstration of me saying "Okay, this is a test, in French.":

Screencast of using speech recognition to translate into French and provide a hint when typing

Since we're faking in-buffer completion here, maybe we can still get away with considering this as an entry for Emacs Carnival February 2026: Completion? =)

This is part of my Emacs configuration.

Using Silero voice activity detection to automatically queue multiple transcriptions with natrys/whisper.el

| audio, speech-recognition, emacs

Update: Moved code to sachac/speech-input.

I can queue multiple transcriptions with whisper.el so that they get processed sequentially with backup audio. It catches up when I pause to think. Now I want to use Silero voice activity detection to do that kind of segmentation for me automatically.

First, I need a Python script that can print out events when it notices the start or stop of a speech segment. If I print out the timestamps, I might be able to cross-reference them someday with other interesting things. For now, even just paying attention to the end of a segment is enough for what I want to do.

Python script for printing out events
import sounddevice as sd
import numpy as np
import torch
import sys
from datetime import datetime, timedelta

SILENCE_DURATION = 500
SAMPLING_RATE = 16000
CHUNK_SIZE = 512
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=False)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
vad_iterator = VADIterator(model, threshold=0.5, min_silence_duration_ms=SILENCE_DURATION)

stream_start_time = None

def format_iso_with_offset(offset_seconds):
    if stream_start_time is None:
        return "PENDING"
    event_time = stream_start_time + timedelta(seconds=offset_seconds)
    return event_time.astimezone().isoformat(timespec='milliseconds')

def audio_callback(indata, frames, time, status):
    global stream_start_time
    if status:
        print(status, file=sys.stderr)
    if stream_start_time is None:
        stream_start_time = datetime.now()
    tensor_input = torch.from_numpy(indata.copy()).flatten()
    speech_dict = vad_iterator(tensor_input, return_seconds=True)
    if speech_dict:
        if "start" in speech_dict:
            print(f"START {format_iso_with_offset(speech_dict['start'])}", flush=True)
        if "end" in speech_dict:
            print(f"END {format_iso_with_offset(speech_dict['end'])}", flush=True)
try:
    with sd.InputStream(samplerate=SAMPLING_RATE,
                        channels=1,
                        callback=audio_callback,
                        blocksize=CHUNK_SIZE):
        while True:
            sd.sleep(1000)  # idle; audio arrives via the callback
except KeyboardInterrupt:
    print("\nStopping...")

I've moved the code to speech-input-vad.el, so now I can use speech-input-vad-start to start monitoring speech events.

(use-package speech-input :load-path "~/proj/speech-input/")

Because I added Pulse properties to the process environment, I can easily use qpwgraph to rewire the stream so that it gets its input from my VirtualMicSink instead of the default system audio device. (Someday I'll figure out how to specify that as the input automatically.)

Then I can start this process from Emacs:

(with-eval-after-load 'speech-input-vad
  (add-hook 'speech-input-vad-on-end-functions #'my-whisper-maybe-continue))

(defun my-whisper-maybe-continue ()
  (when (process-live-p whisper--recording-process)
    (my-whisper-continue)))

Now I can press my shortcut for my-whisper-continue to start the process. As I keep talking, it will continue to record. When I pause for more than a second between sentences, then it will send that chunk to the server for transcription without me having to press another button, while still listening for more speech.

How is this different from the streaming approach that many real-time speech recognition services offer? I think this gives me a bit more visibility into and control of the process. For my personal use, I don't need to have everything processed as quickly as possible, and I'm not trying to replicate live captions. I just want to be able to look back over the last five minutes to try to remember what I was talking about. I usually have a lot of quiet time as I think through my next steps, and it's fine to have it catch up then. I also like that I can save time-stamped audio files for later processing, divided according to the speech segments. Those might be a little bit easier to work with when I get around to compositing them into a video.

This is part of my Emacs configuration.

Queuing multiple transcriptions with whisper.el speech recognition

| audio, speech-recognition, emacs

I want to be able to talk out loud and have the ideas go into Emacs. I can do this in a number of different ways:

  1. I briefly demonstrated a step-by-step approach with natrys/whisper.el with a single file. I press a keyboard shortcut to start the recording, another shortcut to stop it, and it transcribes in the background. But whisper.el is set up so that if I press the shortcut to start recording again, it offers to interrupt the transcription process, which is not what I want. I want to just keep talking and have it process results as things come in.
  2. I'm also experimenting with Google Chrome's web speech API to do continuous speech recognition, which I can get into Emacs using a web socket.
  3. What I've just figured out is how to layer a semi-continuous interface for speech recognition on top of whisper.el. While it's processing in the background, I can press a keyboard shortcut (I'm using numpad 9 to call my-whisper-continue) to stop the previous recording, queue it for processing, and start the next recording. If I use this keyboard shortcut to separate my thoughts, then Whisper has a much easier time making sense of the whole sentence or paragraph, instead of relying on the sliding 30-second context window that many streaming approaches to speech recognition use.
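The queueing in option 3 boils down to a FIFO that gets drained in arrival order whenever a transcription finishes. A toy Python sketch of that idea (a stand-in, not the actual whisper.el mechanism):

```python
from collections import deque

def process_queue(pending, transcribe):
    """Drain queued WAV files in arrival order, collecting transcripts.
    A toy stand-in for the sentinel-driven queue in my-whisper-continue."""
    results = []
    while pending:
        results.append(transcribe(pending.popleft()))
    return results
```

In the real code, the process sentinel triggers the draining, so transcription catches up whenever I pause to think.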

Question: Did you fix the keyboard delay you've got while speech catches what you're saying?

Sometimes, when the speech recognition kicks in, my computer gets busy. When my computer gets really busy, it doesn't process my keystrokes in the right order, which is very annoying because then I have to delete the previous word and retype it. I haven't sorted that out yet, but it seems like I probably have to lower the priority on different processes. On the plus side, as I mentioned, if I dictate things instead of typing them, then I don't run into that problem at all.

Also, other notes on delays: The continuous speech recognition via Google Chrome shows up fairly quickly, but it's not very precise, and it doesn't have punctuation. Even if there's a little bit of a delay, as long as I press the my-whisper-continue shortcut after each thought, I can get that text into my Emacs buffer using the nicer transcription from my selected model. There is going to be a bit of a delay for that one because each chunk gets processed at the end of the thought. Also, I need to start thinking in complete sentences instead of just adding one clause after another as my brain goes on all of these tangents. I think it's pretty promising. There's the continuous speech recognition via Google Chrome if I don't mind the lower accuracy and lack of punctuation, and I can still get the pretty version on the other side.

Why talk out loud? I liked the Bookclub Tapas presentation that Maddie Sullivan did at EmacsConf 2025. Talking out loud helps me be a lot more verbose, compared to typing things out or even having to switch to my notes or interrupt my screen with an Org capture buffer. Of course I want to clean that up before putting it into a blog post, but given that my life still sometimes has random interruptions from a kiddo who must have my attention at that very minute, having that kind of record that I can at least try to reread afterwards to reconstruct what I was thinking about sounds like it might be helpful.

Still, making sense out loud is hard. I'm not actually used to talking to people that much now. This is probably a good reason for me to experiment with streaming more. Then I get the practice in talking out loud, there are backup recordings, and people can ask questions when things are unclear.

Of course, sometimes the text doesn't quite make sense because of the speech recognition errors. I can usually figure it out from the context. I save the audio as well so that I can go back and listen to it again if I really need to.

Anyway, here's the code for sending the current recording to Whisper in the background and starting another recording. It assumes a lot about how things are set up. For example, I'm only testing this with a local Speaches server instead of whisper.cpp. You might need to look at my other speech-related configuration blog posts and sections in order to make sense of it.

Code for queuing whisper.el requests to a local server
(defvar my-whisper--queue nil)
(defun my-whisper-continue (&optional arg)
  "Send what we've got so far for transcription and then continue recording.
Call with \\[universal-argument] to signal that we can stop."
  (interactive "P")
  (require 'whisper)
  (if arg
      (my-whisper-done)
    (setq whisper--marker (point-marker) whisper--point-buffer (current-buffer))
    (when (process-live-p whisper--recording-process)
      ;; queue only if the last one is not asking for the same file
      (unless
          (string=
           (plist-get
            (car
             (last my-whisper--queue))
            :file)
           whisper--temp-file)
        (add-to-list
         'my-whisper--queue
         (list :file whisper--temp-file
               :buffer
               (format "*result: %s*" (file-name-base whisper--temp-file)))
         t))
      ;; Remove the sentinel; handle results ourselves
      (set-process-sentinel whisper--recording-process
                            (lambda (process event)
                              (my-whisper-process-queue)))
      (interrupt-process whisper--recording-process))
    (run-hooks 'whisper-before-transcription-hook)
    (whisper--setup-mode-line :show 'recording)
    (whisper--record-audio)))

(defun my-whisper-discard ()
  "Ignore the previous recording."
  (interactive)
  (when (process-live-p whisper--recording-process)
    ;; Remove the sentinel; handle results ourselves
    (set-process-sentinel whisper--recording-process
                          (lambda (process event)
                            (when (file-exists-p whisper--temp-file)
                              (delete-file whisper--temp-file))
                            (my-whisper-process-queue)))
    (interrupt-process whisper--recording-process)))

(defun my-whisper-discard-and-continue ()
  "Ignore the previous recording and continue."
  (interactive)
  (if (process-live-p whisper--recording-process)
      (progn
        ;; Remove the sentinel; handle results ourselves
        (set-process-sentinel whisper--recording-process
                              (lambda (process event)
                                (my-whisper-process-queue)
                                (my-whisper-continue)))
        (interrupt-process whisper--recording-process))
    (my-whisper-continue)))

(defun my-whisper-done ()
  "Queue the current recording for transcription and stop recording."
  (interactive)
  (when (process-live-p whisper--recording-process)
    (add-to-list
     'my-whisper--queue
     (list :file whisper--temp-file
           :buffer
           (format "*result: %s*" (file-name-base whisper--temp-file)))
     t)
    ;; Remove the sentinel; handle results ourselves
    (set-process-sentinel whisper--recording-process
                          (lambda (process event)
                            (my-whisper-process-queue)))
    (whisper--setup-mode-line :hide 'recording)
    (interrupt-process whisper--recording-process)))

(defun my-whisper-process-queue-result ()
  "Process the first part of the queue that already has results."
  (while (plist-get (car my-whisper--queue) :results)
    (let ((o (pop my-whisper--queue)))
      (unless my-whisper-target-markers
        (setq whisper--marker (point-marker)
              whisper--point-buffer (current-buffer)))
      (with-current-buffer (plist-get o :buffer)
        (erase-buffer)
        (insert (plist-get o :results)))
      ;; Only works with my fork: https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
      (whisper--handle-transcription-output nil (plist-get o :buffer)))))

(defun my-whisper-process-queue ()
  "Start transcription requests for queued recordings without results."
  (let (o)
    (while (setq o (seq-find (lambda (o) (and (plist-get o :file)
                                              (not (plist-get o :process))
                                              (not (plist-get o :results))))
                             my-whisper--queue))
      (let* ((headers (list "Content-Type: multipart/form-data"))
             (params (list (concat "file=@"
                                   (plist-get o :file))
                           "temperature=0.0"
                           "temperature_inc=0.2"
                           "response_format=json"
                           (concat "model=" whisper-model)
                           (concat "language=" whisper-language)))
             (url (format my-whisper-url-format whisper-server-host whisper-server-port))
             (command `("curl" "-s"
                        ,url
                        ,@(mapcan (lambda (h) (list "-H" h)) headers)
                        ,@(mapcan (lambda (p) (list "-F" p)) params))))
        (with-current-buffer (get-buffer-create (plist-get o :buffer))
          (erase-buffer))
        (plist-put
         o :process
         (make-process
          :name "whisper-curl"
          :command command
          :buffer (plist-get o :buffer)
          :coding 'utf-8
          :sentinel
          (lambda (process event)
            (with-current-buffer (process-buffer process)
              (let ((current my-whisper--queue-item))
                (when (and (get-buffer (plist-get current :buffer))
                           (string-equal "finished\n" event))
                  (with-current-buffer (plist-get current :buffer)
                    (goto-char (point-min))
                    (plist-put current :results
                               (or
                                (condition-case nil
                                    (gethash "text" (json-parse-buffer))
                                  (error ""))
                                "(error)"))))))
            (my-whisper-process-queue-result))))
        (plist-put o :command (string-join command " "))
        (with-current-buffer (process-buffer (plist-get o :process))
          (setq-local my-whisper--queue-item o))))))
(defvar-local my-whisper--queue-item nil)

(defun my-whisper-reprocess-queue ()
  "Kill any in-flight requests and resubmit everything in the queue."
  (interactive)
  (setq whisper--marker (point-marker) whisper--point-buffer (current-buffer))
  (mapc (lambda (o)
          (when (process-live-p (plist-get o :process))
            (kill-process (plist-get o :process)))
          (when (get-buffer (plist-get o :buffer))
            (kill-buffer (plist-get o :buffer)))
          (plist-put o :process nil)
          (plist-put o :results nil))
        my-whisper--queue)
  (my-whisper-process-queue))

(defun my-whisper-clear-queue ()
  "Kill any in-flight requests and empty the queue."
  (interactive)
  (mapc (lambda (o)
          (when (process-live-p (plist-get o :process))
            (kill-process (plist-get o :process)))
          (when (get-buffer (plist-get o :buffer))
            (kill-buffer (plist-get o :buffer)))
          (plist-put o :process nil)
          (plist-put o :results nil))
        my-whisper--queue)
  (setq my-whisper--queue nil))

(keymap-global-set "<kp-9>" #'my-whisper-continue)
(keymap-global-set "<kp-8>" #'my-whisper-discard-and-continue)
(keymap-global-set "C-<kp-9>" #'my-whisper-done)
This is part of my Emacs configuration.
View org source for this post

Emacs and whisper.el: Trying out different speech-to-text backends and models

| audio, emacs, speech-recognition

I was curious about parakeet because I heard that it was faster than Whisper on the HuggingFace leaderboard. When I installed it and got it running on my laptop (CPU only, no GPU), it seemed like my results were a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.

To try natrys/whisper.el with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick-and-dirty code for doing so, in case you want to try it out yourself.

(defvar my-whisper-url-format "http://%s:%d/transcribe")
(defun my-whisper--transcribe-via-local-server ()
  "Transcribe audio using the local whisper server."
  (message "[-] Transcribing via local server")
  (whisper--setup-mode-line :show 'transcribing)
  (whisper--ensure-server)
  (setq whisper--transcribing-process
        (whisper--process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list "Content-Type: multipart/form-data")
         (list (concat "file=@" whisper--temp-file)
               "temperature=0.0"
               "temperature_inc=0.2"
               "response_format=json"
               (concat "model=" whisper-model)
               (concat "language=" whisper-language)))))
(defun my-whisper--check-model-consistency () t)
(defun my-whisper--ensure-server () t)

(with-eval-after-load 'whisper
  (advice-add 'whisper--transcribe-via-local-server :override #'my-whisper--transcribe-via-local-server)
  (advice-add 'whisper--check-model-consistency :override #'my-whisper--check-model-consistency)
  (advice-add 'whisper--ensure-server :override #'my-whisper--ensure-server))

Then I have this function for trying things out.

(defun my-test-whisper-api (url &optional args)
  (with-temp-buffer
    (apply #'call-process "curl" nil t nil "-s"
           url
           (append (mapcan
                    (lambda (h) (list "-H" h))
                    (list "Content-Type: multipart/form-data"))
                   (mapcan
                    (lambda (h) (list "-F" h))
                    (list (concat "file=@" whisper--temp-file)
                          "temperature=0.0"
                          "temperature_inc=0.2"
                          "response_format=verbose_json"
                          (concat "language=" whisper-language)))
                   args))
    (message "%s %s" (buffer-string) url)))

Here's the audio file. It is around 10 seconds long. I run the benchmark 3 times and report the average time.

Download

Code for running the benchmarks
(let ((times 3))
(mapcar
 (lambda (group)
   (let ((whisper--temp-file "/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav"))
     ;; warm up the model
     (eval (cadr group))
     (list
      (format "%.3f"
              (/ (car
                  (benchmark-call (lambda () (eval (cadr group))) times))
                 times))
      (car group))))
 '(
   ("parakeet"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 5092)))
   ("whisper.cpp base-q4_0"
    (my-test-whisper-api
     (format "http://%s:%d/inference" whisper-server-host 8642)))
   ("speaches whisper-base"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base")))
   ("speaches whisper-base.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-base.en")))
   ("speaches whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small")))
   ("speaches whisper-small.en"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=Systran/faster-whisper-small.en")))
   ("speaches lorneluo/whisper-small-ct2-int8"
    (my-test-whisper-api
     (format "http://%s:%d/v1/audio/transcriptions" whisper-server-host 8001)
     (list "-F" "model=lorneluo/whisper-small-ct2-int8")))
   ;; needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
   ("whisperx-server Systran/faster-whisper-small"
    (my-test-whisper-api
     (format "http://%s:%d/transcribe" whisper-server-host 8002)))))
)

I tried it with parakeet, whisper.cpp (base-q4_0), speaches (various Whisper models), and whisperx-server, as listed in the benchmark code above.

Looks like speaches + faster-whisper-base is the winner for now. I like how speaches lets me switch models on the fly, so maybe I can use base.en generally and switch to base when I want to try dictating in French. Here's how I've configured whisper.el to use that server.

(setq whisper-server-port 8001 whisper-model "Systran/faster-whisper-base.en"
      my-whisper-url-format "http://%s:%d/v1/audio/transcriptions")

At some point, I'll override whisper--ensure-server so that starting it up is smoother.
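As a rough sketch of what that override might look like (untested; `my-whisper-server-command` is a hypothetical placeholder for however you launch speaches on your system), it could probe the configured port and start the server in the background if nothing answers:

```elisp
;; Sketch only: probe the configured port; if nothing is listening,
;; start the server process in the background.
;; `my-whisper-server-command' is a hypothetical placeholder.
(defvar my-whisper-server-command '("speaches")
  "Command (program and args) for starting the transcription server.")

(defun my-whisper--ensure-server-smoother ()
  "Start the transcription server if the port isn't answering yet."
  (unless (ignore-errors
            (let ((probe (open-network-stream
                          "whisper-port-probe" nil
                          whisper-server-host whisper-server-port)))
              (delete-process probe)
              t))
    (apply #'start-process "whisper-server" "*whisper-server*"
           my-whisper-server-command)))

;; This could replace the stub advice used earlier:
;; (advice-add 'whisper--ensure-server :override
;;             #'my-whisper--ensure-server-smoother)
```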

Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. I haven't figured out how to get the GPU working under Ubuntu yet.

This is part of my Emacs configuration.
View org source for this post

Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere

Posted: - Modified: | emacs, audio, speech-recognition, speech
  • Major change: I switched to my fork of natrys/whisper.el so that I can specify functions that change the window configuration, etc.
  • Changed the main function to my-whisper-run; use seq-reduce to go through the functions.
  • Added code for automatically capturing screenshots, saving text, and working with a list of functions.
  • Added demo, fixed some bugs.
  • Added a note about the difference from the MELPA package; fixed :vc

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.

I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.

(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper--mode-line-recording-indicator "⏺")
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
     (expand-file-name (file-name-as-directory whisper-install-directory))
     "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models,
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;; (setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-run)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))

Let's see if we can process "Computer remind me to…":

(defvar my-whisper-org-reminder-template "t")

(defun my-whisper-org-process-reminder ()
  (let ((text (buffer-string))
        reminder)
    (when (string-match "computer[,\.]? reminds? me to \\(.+\\)" text)
      (setq reminder (match-string 1 text))
      (save-window-excursion
        (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
          (when (markerp whisper--marker) (goto-char whisper--marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))

Disk space is inexpensive and backups are great, so let's save each file using the timestamp.

(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))

(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

(defun my-whisper-replay (&optional file)
  "Replay the last temporary recording."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (setq whisper--temp-file (or file whisper--temp-file))
  (mpv-play whisper--temp-file))

(defun my-whisper-insert-retry (&optional file)
  "Retry transcribing FILE (or the last recording) and insert at point."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (whisper--cleanup-transcription)
  (setq whisper--marker (point-marker)
        whisper--temp-file (or file whisper--temp-file))
  (whisper--transcribe-audio))

Il peut aussi comprendre le français. (It can also understand French.)

(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart for the language
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))

I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.
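That binding is a one-liner in my config; a minimal sketch (C-c j is just my choice of key, and org-clock-goto comes with Org):

```elisp
;; Jump to the currently clocked-in task from anywhere.
(keymap-global-set "C-c j" #'org-clock-goto)
```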

Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)

(defun my-whisper-reset (text)
  (setq my-whisper-skip-annotation nil)
  (remove-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task)
  text)
;; Only works with my tweaks to whisper.el
;; https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
(with-eval-after-load 'whisper
  (setq whisper-insert-text-at-point
        '(my-whisper-handle-commands
          my-whisper-save-text
          my-whisper-save-to-file
          my-whisper-maybe-expand-snippet
          my-whisper-maybe-type
          my-whisper-maybe-type-with-hints
          my-whisper-insert
          my-whisper-reset)))
(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")

(defun my-whisper-insert (text)
  (let ((markers
         (cond
          ((null my-whisper-target-markers)
           (list whisper--marker)) ; current point where whisper was started
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (when text
      (mapcar (lambda (marker)
                (with-current-buffer (marker-buffer marker)
                  (save-restriction
                    (widen)
                    (when (markerp marker) (goto-char marker))
                    (when (and (derived-mode-p 'org-mode) (org-at-drawer-p))
                      (insert "\n"))
                    (whisper--insert-text
                     (concat
                      (if (looking-back "[ \t\n]\\|^")
                          ""
                        " ")
                      (string-trim text)))
                    ;; Move the marker forward here
                    (move-marker marker (point)))))
              markers)
      (when my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(defun my-whisper-maybe-type (text)
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

(defun my-whisper-clear-markers ()
  (interactive)
  (setq my-whisper-target-markers nil))

(defun my-whisper-use-current-point (&optional add)
  (interactive (list current-prefix-arg))
  (if add
      (push (point-marker) my-whisper-target-markers)
    (setq my-whisper-target-markers (list (point-marker)))))

(defun my-whisper-run-at-point (&optional add)
  (interactive (list current-prefix-arg))
  (my-whisper-clear-markers)
  (whisper-run))

(keymap-global-set "<f9>" #'my-whisper-run-at-point)
(keymap-global-set "<kp-1>" #'whisper-run)

(defun my-whisper-jump-to-marker ()
  (interactive)
  (with-current-buffer (marker-buffer (car my-whisper-target-markers))
    (goto-char (car my-whisper-target-markers))))

(defun my-whisper-use-currently-clocked-task (&optional add)
  (interactive (list current-prefix-arg))
  (save-window-excursion
    (save-restriction
      (save-excursion
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (if add
            (push (point-marker) my-whisper-target-markers)
          (setq my-whisper-target-markers (list (point-marker))))))))

(defun my-whisper-run (&optional skip-annotation)
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task -10)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))

(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))

(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) ; not in focus, take a screenshot
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
          (if (org-clocking-p)
              (save-window-excursion
                (save-restriction
                  (save-excursion
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (unless (bolp)
                      (insert "\n"))
                    (insert "\n")
                    (if (and link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (if screenshot-filename
                             (concat "(" (org-link-make-string
                                          (concat "file:" screenshot-filename)
                                          "screenshot") ") ")
                           "")
                         link
                         "\n")
                      (when screenshot-filename
                        (insert (org-link-make-string
                                 (concat "file:" screenshot-filename)
                                 "screenshot")
                                "\n")))
                    (when region
                      (insert "#+begin_example\n" region "\n#+end_example\n"))
                    (insert text "\n")
                    (setq my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))))

(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))

(defun my-whisper-org-clear-saved-annotation ()
  "Clear the saved annotation so the next note includes a link again."
  (setq my-whisper-last-annotation nil))

Here's an idea for a function that saves the recognized text with a timestamp.

(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))

And now I can redo things if needed:

(defun my-whisper-redo ()
  "Transcribe the most recent recording again, inserting at point."
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))

I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them.

systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysToVMic'" \
    sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
    sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined'" \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1

pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

Here's a demo:

Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function

Transcript

00:00:00 Inserting into the current buffer
Here's a quick demonstration of using whisper.el to log notes.
00:00:13 Inserting text and moving on
I can insert text into the current buffer one after the other.
00:00:31 Clocking in
If I clock into a task, I can add to the end of that clocked in task using my custom code by pressing C-<f9> or whatever my shortcut was. I can do that multiple times.
00:01:05 Logging a note from a different file
I can do that while looking at a different file.
00:01:15 I can look at an info page
I can do it looking at an info page, for example, and annotations will include a link back to whatever I was looking at.
00:01:33 Adding without an annotation (C-u)
I just added an optional argument so that I can also capture a note without saving an annotation. That way, if I'm going to say a lot of things about the same buffer, I don't have to have a lot of links that I need to edit out.
00:02:42 Saving to a different function
I can also have it save to a different function.

And then I define a global shortcut in KDE that runs:

/home/sacha/bin/xdotool-emacs key --clearmodifiers F9

So now I can dictate into other applications or save into Emacs. Which suggests, of course, that I should get it working with C-<f9> as well, if I can avoid a keyboard shortcut loop…
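For what it's worth, the C-<f9> version would presumably just be another KDE global shortcut passing the modifier along to the same wrapper (whose contents aren't shown here), something like:

```shell
# Sketch: same hypothetical wrapper, sending Ctrl+F9 instead of F9.
/home/sacha/bin/xdotool-emacs key --clearmodifiers ctrl+F9
```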

This is part of my Emacs configuration.
View Org source for this post