Categories: audio

RSS - Atom - Subscribe via email

Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere

Posted: - Modified: | emacs, audio, speech

: Added note about difference from MELPA package, fixed :vc

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.

I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.

(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models,
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  (setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  ;(setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper--ffmpeg-input-device "audiorelay-virtual-mic-sink:monitor_FL")
  (setq whisper-language "en")
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-org-capture-to-clock)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

(defun my-whisper-replay ()
  "Replay the last temporary recording."
  (interactive)
  (mpv-play whisper--temp-file))

Il peut aussi comprendre le français.

(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart for the language
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))

I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.

Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to save it to my currently clocked task along with a link to whatever I'm looking at.

(defvar my-whisper-org-target nil
  "*Where to save the target.

Nil means jump to the current clocked-in entry and insert it along with
a link, or prompt for a capture template if nothing is clocked in.

If this is set to a string, it should specify a key from
`org-capture-templates'. The text will be in %i, and you can use %a for the link.
For example, you could have a template entry like this:
\(\"c\" \"Contents to current clocked task\" plain (clock) \"%i%?\n%a\" :empty-lines 1)

If this is set to a function, the function will be called from the
original marker with the text as the argument. Note that the window
configuration and message will not be preserved after this function is
run, so if you want to change the window configuration or display a
message, add a timer.")

(defun my-whisper-org-capture-to-clock ()
  (interactive)
  (require 'whisper)
  (add-hook 'whisper-after-transcription-hook #'my-whisper-org-save 50)
  (whisper-run))

(defun my-whisper-org-save ()
  "Save the transcription."
  (let ((text (string-trim (buffer-string))))
    (remove-hook 'whisper-after-transcription-hook #'my-whisper-org-save)
    (erase-buffer)      ; stops further processing
    (save-window-excursion
      (with-current-buffer (marker-buffer whisper--marker)
        (goto-char whisper--marker)
        (cond
         ((functionp my-whisper-org-target)
          (funcall my-whisper-org-target text))
         (my-whisper-org-target
          (setq org-capture-initial text)
          (org-capture nil my-whisper-org-target)
          (org-capture-finalize)
          ;; Delay the display of the message because whisper--cleanup-transcription clears it
          (run-at-time 0.5 nil (lambda (text) (message "Captured: %s" text)) text))
         ((org-clocking-p)
          (let ((link (org-store-link nil)))
            (org-clock-goto)
            (org-end-of-subtree)
            (unless (bolp)
              (insert "\n"))
            (insert "\n" text "\n" link "\n"))
          (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
         (t
          (kill-new text)
          (setq org-capture-initial text)
          (call-interactively 'org-capture)
          ;; Delay the window configuration
          (let ((config (current-window-configuration)))
            (run-at-time 0.5 nil
                         (lambda (text config)
                           (set-window-configuration config)
                           (message "Copied: %s" text))
                         text config))))))))

Here's an idea for a my-whisper-org-target function that saves the recognized text with a timestamp.

(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (let ((link (org-store-link nil)))
    (with-current-buffer (find-file-noselect my-whisper-notes)
      (goto-char (point-max))
      (insert "\n\n" (format-time-string "%H:%M ") text "\n" link "\n")
      (save-buffer)
      (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text))))

I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them. Actually making a demonstration video will probably need to wait for another day, though!

This is part of my Emacs configuration.
View org source for this post

#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time

| audio, subed, yay-emacs, emacs, video

New in this video: subed-record-sum-time, #+PAD_LEFT and #+PAD_RIGHT

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Play by play:

  • I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be.
    • subed-record uses subtitles and directives in comments in a VTT subtitle file to edit audio and video. subed-record-sum-time calculates the resulting duration and displays it in the minibuffer.
  • First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times.
    • I'm experimenting with an algorithmic way to combine the breaks from my script with the text from the transcript. subed-align calls the aeneas forced alignment tool to match up the text with the timestamps. I use subed-waveform-show-all to show all the waveforms.
  • Then I can skip my oopses.
    • Adding a NOTE #+SKIP comment before a subtitle makes subed-record-compile-video and subed-record-compile-flow skip that part of the audio.
  • Sometimes WhisperX doesn't catch them,
    • WhisperX sometimes doesn't transcribe my false starts if I repeat things quickly.
  • so I also look at waveforms
    • subed-waveform-show-all adds waveforms for all the subtitles. If I notice there's a pause or a repeated shape in the waveform, or if I listen and notice the repetition, I can confirm by middle-clicking on the waveform to sample part of it.
  • and characters per second.
    • Low characters per second is sometimes a sign that the timestamps are incorrect or there's a repetition that wasn't transcribed.
  • I already talk quickly, so I'm not going to speed that up
    • Also, I already sound like a chipmunk; mechanically speeding up my recording to fit in a certain time will make that worse =)
  • but I can trim the pauses in between phrases which is easy to do with waveforms.
    • left-click to set the start, right-click to set the stop. If I want to adjust the previous/next one at the same time, I would use shift-left-click or shift-right-click, but here I want to skip the gaps between phrases, so I adjust the current subtitle without making the previous/next one longer.
  • Sometimes, after reviewing a draft, I realize I need a little more time.
    • I can specify visuals like a video, animated GIF, or an image by adding a [[file:...]] link in the comment for a subtitle. That visual will be used until the next visual is specified in a comment on a different subtitle. subed-record-compile-video can automatically speed up video clips to fit in the time for the current audio segment, which is the set of subtitles before the next visual is defined. After I compile and review the video, sometimes I notice that something goes by too quickly.
  • If the original audio has some silence, I can just copy and paste it.
    • This can sometimes feel more natural than adding in complete silence.
  • If not, I can pad left or pad right to add some silence.
    • I added a new feature so that I could specify something like #+PAD_RIGHT: 1.5 in a comment to add 1.5 seconds of silence after the audio specified by that subtitle.
  • I can try the flow of some sections
    • I can select a region and then use M-x subed-record-compile-try-flow to play the audio or C-u M-x subed-record-compile-try-flow to play the audio+video for that region.
  • and compile the video when I'm ready.
    • subed-record-compile-video compiles the video to the file specified in #+OUTPUT: filename. ffmpeg is very arcane, so I'm glad I can simplify my use of it with Emacs Lisp.
  • Emacs can do almost anything. Yay Emacs!
    • Non-linear audio and video editing is actually pretty fun in a text editor, especially when I can just use M-x vundo to navigate my undo history.

Links:

Related:

View org source for this post

Remove filler words at the start and upcase the next word

| audio, speechtotext, emacs

[2024-11-21 Thu]: Fixed the second filler words regexp, and make it work at the start of lines too. Thanks to @arialdo@mastodon.online for the feedback!

Like many people, I tend to use "So", "And", "You know", and "Uh" to bridge between sentences when thinking. WhisperX does a reasonable job of detecting sentences and splitting them up anyway, but it leaves those filler words in at the start of the sentence. I usually like to remove these from transcripts so that they read more smoothly.

Here's a short Emacs Lisp function that removes those filler words when they start a sentence, capitalizing the next word. When called interactively, it prompts while displaying an overlay. When called from Emacs Lisp, it changes without asking for confirmation.

(defvar my-filler-words-regexp "\\(\\. \\|^\\)\\(?:So?\\|And\\|You know\\|Uh\\)\\(?:,\\|\\.\\.\\.\\)? \\(.\\)")
(defun my-remove-filler-words-at-start ()
  (interactive)
  (save-excursion
    (let ((case-fold-search nil))
      (while (re-search-forward my-filler-words-regexp nil t)
        (if (and (called-interactively-p) (not current-prefix-arg))
            (let ((overlay (make-overlay (match-beginning 0)
                                         (match-end 0))))
              (overlay-put overlay 'common-edit t)
              (overlay-put
               overlay 'display
               (propertize (concat (match-string 0) " -> "
                                   (match-string 1)
                                   (upcase (match-string 2)))
                           'face 'modus-themes-mark-sel))
              (unwind-protect
                  (pcase (save-match-data (read-char-choice "Replace (y/n/!/q)? " "yn!q"))
                    (?!
                     (replace-match (concat (match-string 1) (upcase (match-string 2))) t)
                     (while (re-search-forward my-filler-words-regexp nil t)
                       (replace-match (concat (match-string 1) (upcase (match-string 2))) t)))
                    (?y
                     (replace-match (concat (match-string 1) (upcase (match-string 2))) t))
                    (?n nil)
                    (?q (goto-char (point-max))))
                (delete-overlay overlay)))
          (replace-match (concat (match-string 1) (upcase (match-string 2))) t))))))
This is part of my Emacs configuration.
View org source for this post

Using WhisperX to get word-level timestamps for audio editing with Emacs and subed-record

Posted: - Modified: | audio, emacs, subed
  • [2024-12-09 Mon]: Split whisperx to pass one file per call.
  • [2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
  • [2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.

I'm gradually shifting more things to this Lenovo P52 to take advantage of its newer processor, 64 GB of RAM, and 2 TB drive. (Whee!) One of the things I'm curious about is how I can make better use of multimedia. I couldn't get whisper.cpp to work on my Lenovo X230T, so I mostly relied on the automatic transcripts from Google Recorder (with timestamps generated by aeneas) or cloud-based transcription services like Deepgram.

I have a lot of silences in my voice notes when I think out loud. whisper.cpp got stuck in loops during silent parts, but WhisperX handles them perfectly. WhisperX is also fast enough for me to handle audio files locally instead of relying on Deepgram. With the default model, I can process the files faster than real-time:

File length Transcription time
42s 17s
7m48s 1m41s

I used this command to get word-level timing data. (Experimenting with options from this post)

MAX_LINE_WIDTH="${MAX_LINE_WIDTH:-50}"
MODEL="${MODEL:-large-v2}"
for FILE in "$@"; do
    ~/vendor/whisperx/.venv/bin/whisperx --model "$MODEL" --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --compute_type int8 --print_progress True --max_line_width $MAX_LINE_WIDTH --segment_resolution chunk --max_line_count 1 --language en "$FILE"
done

Among other things, it makes a text file that looks like this:

I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.

and a JSON file that looks like this:

{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}

Sometimes I just want the text so that I can use an audio braindump as the starting point for a blog post or for notes. WhisperX is way more accurate than Google Recorder, so that will probably be easier once I update my workflow for that.

Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:

(defun my-subed-word-tsv-from-whisperx-json (file)
  (interactive "FJSON: ")
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
     ;; numbers at the end of a sentence sometimes don't end up with times
     ;; so we need to fix them
    (while current
      (unless (elt (car current) 1)           ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2)
        (setf (elt (car current) 2) (1- (elt (cadr current) 1))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))
    (subed-create-file
     filename
     base
     t
     'subed-tsv-mode)
    (find-file filename)))

Here's my old code for parsing the highlighted VTT or SRT files that underline each word:

(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))

I like to use the TSV format for this one because it's easy to scan down the right side. Incidentally, this format is compatible with Audacity labels, so I could import that there if I wanted. I like Emacs much more, though. I'm used to having all my keyboard shortcuts at hand.

0.427000	0.507000	I
0.587000	0.887000	often
0.987000	1.227000	need
1.267000	1.508000	to...
4.329000	4.429000	I
4.469000	4.869000	sometimes
4.950000	5.170000	need
5.210000	5.410000	to
5.530000	6.090000	replace

Once I've deleted the words I don't want to include, I can merge subtitles for phrases so that I can keep the pauses between words. A quick heuristic is to merge subtitles if they don't have much of a pause between them.

(defvar my-subed-merge-close-subtitles-threshold 500)
(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-stop)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))

Then I can use subed-waveform-show-all to tweak the start and end timestamps. Here I switch to another file I've been editing…

2024-09-17-12-06-12.svg
Figure 1: Screenshot of subed-waveform

After that, I can use subed-record to compile the audio into an .opus file that sounds reasonably smooth.

I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.

I want to make lots of quick audio notes that I can shuffle and listen to in order to remember things I'm learning about Emacs (might even come up with some kind of spaced repetition system), and I'd like to make more videos someday too. I think WhisperX, subed, and Org Mode will be fun parts of my workflow.

This is part of my Emacs configuration.
View org source for this post