[2024-11-21 Thu]: Fixed the second filler words regexp and made it work at the start of lines too. Thanks to @arialdo@mastodon.online for the feedback!
Like many people, I tend to use "So", "And", "You
know", and "Uh" to bridge between sentences when
thinking. WhisperX does a reasonable job of
detecting sentences and splitting them up anyway,
but it leaves those filler words in at the start
of the sentence. I usually like to remove these
from transcripts so that they read more smoothly.
Here's a short Emacs Lisp function that removes
those filler words when they start a sentence,
capitalizing the next word. When called
interactively, it prompts while displaying an
overlay. When called from Emacs Lisp, it changes
without asking for confirmation.
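A minimal sketch of that idea, under stated assumptions: the names `my-sketch-filler-words-regexp` and `my-sketch-remove-filler-words` are hypothetical, the filler list is hard-coded to the four examples mentioned above, and the interactive prompt is a plain `y-or-n-p` rather than an overlay preview.

```emacs-lisp
;; Sketch only: names and the filler list are assumptions for illustration.
(defvar my-sketch-filler-words-regexp
  (concat "\\(^\\|[.?!] +\\)"                 ; group 1: line start or end of previous sentence
          "\\(?:So\\|And\\|You know\\|Uh\\)"  ; a filler word (matched case-sensitively)
          ",? +"                              ; optional comma, then spaces
          "\\([[:alpha:]]\\)")                ; group 2: first letter of the next word
  "Regexp matching a filler word at the start of a sentence.")

(defun my-sketch-remove-filler-words (&optional confirm)
  "Remove sentence-initial filler words in the current buffer.
Capitalize the word that follows.  When CONFIRM is non-nil (the
interactive default), ask before each change.  Return the number
of fillers removed."
  (interactive (list t))
  (let ((case-fold-search nil)
        (count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward my-sketch-filler-words-regexp nil t)
        (when (or (not confirm)
                  ;; Protect the match data while the prompt runs.
                  (save-match-data
                    (y-or-n-p (format "Remove filler in \"%s\"? "
                                      (match-string 0)))))
          ;; Keep whatever ended the previous sentence, drop the filler,
          ;; and upcase the first letter of the next word.
          (replace-match (concat (match-string 1)
                                 (upcase (match-string 2)))
                         t t)
          (setq count (1+ count)))))
    count))
```

Because the search continues after the replaced letter, this sketch does not catch two fillers in a row (for example "So, and then"); running it a second time would.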
[2024-12-09 Mon]: Split whisperx to pass one file per call.
[2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
[2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.
I'm gradually shifting more things to this Lenovo
P52 to take advantage of its newer processor, 64
GB of RAM, and 2 TB drive. (Whee!) One of the
things I'm curious about is how I can make better
use of multimedia. I couldn't get whisper.cpp to
work on my Lenovo X230T, so I mostly relied on the
automatic transcripts from Google Recorder (with
timestamps generated by aeneas) or cloud-based
transcription services like Deepgram.
I have a lot of silences in my voice notes when I
think out loud. whisper.cpp got stuck in loops
during silent parts, but WhisperX handles them
perfectly. WhisperX is also fast enough for me to
handle audio files locally instead of relying on
Deepgram. With the default model, I can process
the files faster than real-time:
Among other things, it makes a text file that looks like this:
I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.
and a JSON file that looks like this:
{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}
Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:
(require 'json)
(require 'seq)

(defun my-subed-word-tsv-from-whisperx-json (file)
  "Create a TSV file of word timestamps from the WhisperX JSON FILE."
  (interactive "FJSON: ")
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         ;; One (nil START-MS END-MS WORD) entry per word; times may be nil.
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
    ;; numbers at the end of a sentence sometimes don't end up with times
    ;; so we need to fix them
    (while current
      (unless (elt (car current) 1) ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2) ; end
        ;; Use the next word's start when there is one; otherwise fall
        ;; back to this word's own start so that `1-' never sees nil.
        (let ((next-start (elt (cadr current) 1)))
          (setf (elt (car current) 2)
                (if next-start
                    (1- next-start)
                  (elt (car current) 1)))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))
    (subed-create-file
     filename
     base
     t
     'subed-tsv-mode)
    (find-file filename)))
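The gap-filling loop is the subtle part, so here is a small standalone sketch of the same rule on plain lists, without subed or a JSON file. The function name `my-sketch-fill-missing-times` and the (START-MS END-MS TEXT) shape are assumptions for illustration. A missing start becomes the previous end plus 1 ms; a missing end becomes the next word's start minus 1 ms, or the word's own start when the next word has no start time.

```emacs-lisp
;; Sketch only: a pure-list version of the gap-filling rule.
(defun my-sketch-fill-missing-times (words)
  "Return a copy of WORDS with missing times filled in.
WORDS is a list of (START-MS END-MS TEXT); either time may be nil."
  (let ((last-end 0)
        (rest words)
        result)
    (while rest
      (let* ((start (or (nth 0 (car rest)) (1+ last-end)))
             ;; Start of the following word, or nil at the end of the
             ;; list or when that word is untimed too.
             (next-start (car (cadr rest)))
             (end (or (nth 1 (car rest))
                      (if next-start (1- next-start) start))))
        (push (list start end (nth 2 (car rest))) result)
        (setq last-end end
              rest (cdr rest))))
    (nreverse result)))

;; Example: "2024" came back from the aligner without times.
(my-sketch-fill-missing-times
 '((0 400 "in") (nil nil "2024") (2000 2400 "ok")))
;; => ((0 400 "in") (401 1999 "2024") (2000 2400 "ok"))
```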
Here's my old code for parsing the highlighted VTT or SRT files that underline each word:
(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))
I like to use the TSV format for this one because
it's easy to scan down the right side.
Incidentally, this format is compatible with
Audacity labels, so I could import it there if I
wanted. I like Emacs much more, though. I'm used
to having all my keyboard shortcuts at hand.
0.427000 0.507000 I
0.587000 0.887000 often
0.987000 1.227000 need
1.267000 1.508000 to...
4.329000 4.429000 I
4.469000 4.869000 sometimes
4.950000 5.170000 need
5.210000 5.410000 to
5.530000 6.090000 replace
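For reference, each line above can be produced from millisecond times with a one-line formatter. This is a sketch that assumes the layout shown in the sample (start and end in seconds with six decimals, then the text) and tab separators, which is also what Audacity label tracks use; the function name `my-sketch-tsv-line` is hypothetical.

```emacs-lisp
;; Sketch only: format one cue as a tab-separated line.
(defun my-sketch-tsv-line (start-ms end-ms text)
  "Return a TSV line for TEXT spanning START-MS to END-MS.
Times are written in seconds with six decimal places."
  (format "%.6f\t%.6f\t%s"
          (/ start-ms 1000.0)   ; float division: msecs to seconds
          (/ end-ms 1000.0)
          text))

(my-sketch-tsv-line 500 1250 "often")
;; => "0.500000\t1.250000\toften"
```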
Once I've deleted the words I don't want to
include, I can merge subtitles for phrases so that
I can keep the pauses between words. A quick
heuristic is to merge subtitles if they don't have
much of a pause between them.
(defvar my-subed-merge-close-subtitles-threshold 500
  "Default maximum gap in msecs between subtitles that should be merged.")

(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          ;; The gap is measured to the start of the next subtitle.
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-start)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))
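To make the heuristic easier to see without a subed buffer, here is the same idea as a sketch on a plain list of cues. The name `my-sketch-merge-close-cues` and the (START-MS END-MS TEXT) shape are assumptions for illustration, and joining the text with a single space is a choice made for this sketch.

```emacs-lisp
;; Sketch only: merge cues whose gap is smaller than THRESHOLD.
(defun my-sketch-merge-close-cues (cues threshold)
  "Merge adjacent CUES separated by less than THRESHOLD msecs.
CUES is a list of (START-MS END-MS TEXT) sorted by start time.
Return a new list; CUES itself is not modified."
  (let (result)
    (dolist (cue cues)
      (let ((prev (car result)))
        (if (and prev (< (- (nth 0 cue) (nth 1 prev)) threshold))
            ;; Small gap: extend the previous cue and append the text.
            (setcar result (list (nth 0 prev)
                                 (nth 1 cue)
                                 (concat (nth 2 prev) " " (nth 2 cue))))
          ;; Large gap (or first cue): start a new phrase.
          (push cue result))))
    (nreverse result)))

(my-sketch-merge-close-cues
 '((427 507 "I") (587 887 "often") (987 1227 "need") (1267 1508 "to...")
   (4329 4429 "I") (4469 4869 "sometimes"))
 500)
;; => ((427 1508 "I often need to...") (4329 4869 "I sometimes"))
```

The 2.8-second pause after "to..." is well over the 500 ms threshold, so it becomes the phrase boundary.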
Then I can use subed-waveform-show-all to tweak the start and end timestamps.
Here I switch to another file I've been editing…
I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.
I want to make lots of quick audio notes that I
can shuffle and listen to in order to remember
things I'm learning about Emacs (might even come
up with some kind of spaced repetition system),
and I'd like to make more videos someday too. I
think WhisperX, subed, and Org Mode will be fun
parts of my workflow.