Using WhisperX to get word-level timestamps for audio editing with Emacs and subed-record

| emacs, subed

I'm gradually shifting more things to this Lenovo P52 to take advantage of its newer processor, 64 GB of RAM, and 2 TB drive. (Whee!) One of the things I'm curious about is how I can make better use of multimedia. I couldn't get whisper.cpp to work on my Lenovo X230T, so I mostly relied on the automatic transcripts from Google Recorder (with timestamps generated by aeneas) or cloud-based transcription services like Deepgram.

I have a lot of silences in my voice notes when I think out loud. whisper.cpp got stuck in loops during silent parts, but WhisperX handles them perfectly. WhisperX is also fast enough for me to handle audio files locally instead of relying on Deepgram. With the default model, I can process the files faster than real-time:

File length Transcription time
42s 17s
7m48s 1m41s

I used this command to get word-level timing data. It makes VTT and SRT files that underline the specific word:

~/vendor/whisperx/.venv/bin/whisperx --compute_type int8 --highlight_words True --print_progress True "$1"

The resulting VTT file looks like this:

WEBVTT

00:00.427 --> 00:00.507
<u>I</u> often need to... I sometimes need to replace or navigate by symbols.

00:00.507 --> 00:00.587
I often need to... I sometimes need to replace or navigate by symbols.

00:00.587 --> 00:00.887
I <u>often</u> need to... I sometimes need to replace or navigate by symbols.

00:00.887 --> 00:00.987
I often need to... I sometimes need to replace or navigate by symbols.

Sometimes I just want the text so that I can use an audio braindump as the starting point for a blog post or for notes. WhisperX is way more accurate than Google Recorder, so that will probably be easier once I update my workflow for that.

Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the underlined text like this:

(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))

I like to use the TSV format for this one because it's easy to scan down the right side. Incidentally, this format is compatible with Audacity labels, so I could import that there if I wanted. I like Emacs much more, though. I'm used to having all my keyboard shortcuts at hand.

0.427000	0.507000	I
0.587000	0.887000	often
0.987000	1.227000	need
1.267000	1.508000	to...
4.329000	4.429000	I
4.469000	4.869000	sometimes
4.950000	5.170000	need
5.210000	5.410000	to
5.530000	6.090000	replace
6.270000	6.370000	or
6.490000	6.971000	navigate

Once I've deleted the words I don't want to include, I can merge subtitles for phrases so that I can keep the pauses between words. A quick heuristic is to merge subtitles if they don't have much of a pause between them.

(defvar my-subed-merge-close-subtitles-threshold 500)
(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs " my-subed-merge-close-subtitles-threshold)))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-stop)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))

Then I can use subed-waveform-show-all to tweak the start and end timestamps.

2024-09-17-12-06-12.svg
Figure 1: Screenshot of subed-waveform

After that, I can use subed-record to compile the audio into an .opus file that sounds reasonably smooth.

I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.

I want to make lots of quick audio notes that I can shuffle and listen to in order to remember things I'm learning about Emacs (might even come up with some kind of spaced repetition system), and I'd like to make more videos someday too. I think WhisperX, subed, and Org Mode will be fun parts of my workflow.

View org source for this post
You can comment with Disqus or you can e-mail me at sacha@sachachua.com.