Sometimes automatic subtitle timing tools like
Aeneas can get confused by silences, extraneous
sounds, filler words, mis-starts, and text that
I've edited out of the raw captions for easier
readability. It's good to quickly check each
caption. I used to listen to captions at 1.5x
speed, watching carefully as each caption
displayed. This took a fair bit of time and focus,
so… it usually didn't happen. Sampling the first
second of each caption is faster and requires a
little less attention.
Skimming with subed.el
Here's a function that I wrote to play the first
second of each subtitle.
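A rough sketch of the idea, assuming subed-mpv is already connected to the buffer's media file (the details below are illustrative rather than the exact original):

(defun my-subed-skim-starts (&optional msecs)
  "Play the first MSECS of each subtitle from point onwards.
Press any key to stop."
  (interactive)
  (setq msecs (or msecs 1000))
  (subed-mpv-unpause)
  (catch 'done
    (while (subed-subtitle-msecs-start)
      (subed-mpv-jump (subed-subtitle-msecs-start))
      ;; sit-for returns nil as soon as input arrives, so any key stops the loop
      (unless (sit-for (/ msecs 1000.0))
        (throw 'done t))
      (unless (subed-forward-subtitle-id)
        (throw 'done t)))))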
Now I can read the lines as the subtitles play,
and I can press any key to stop so that I can fix
timestamps.
Skimming with Javascript
I also want to check the times on the Web in case
there have been caching issues. Here's some
Javascript to skim the first second of each cue in
the first text track for a video, with some code
to make it easy to process the first video in the
visible area.
function getVisibleVideo() {
  const videos = document.querySelectorAll('video');
  for (const video of videos) {
    const rect = video.getBoundingClientRect();
    if (
      rect.top >= 0 &&
      rect.left >= 0 &&
      rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
      rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    ) {
      return video;
    }
  }
  return null;
}

async function skimVideo(video = getVisibleVideo(), msecs = 1000) {
  // Get the first text track (assumed to be captions/subtitles)
  const textTrack = video.textTracks[0];
  if (!textTrack) return;
  const remaining = [...textTrack.cues].filter((cue) => cue.endTime >= video.currentTime);
  video.play();
  // Play the first 1 second of each visible subtitle
  for (let i = 0; i < remaining.length && !video.paused; i++) {
    video.currentTime = remaining[i].startTime;
    await new Promise((resolve) => setTimeout(resolve, msecs));
  }
}
Then I can call it with skimVideo();. Actually,
in our backstage area, it might be useful to add a
Skim button so that I can skim things from my
phone.
For EmacsConf talks, where we typically format captions as one
long line each (under 60 characters), this looks like roughly a
4x speed-up compared to listening to the video at normal speed.
More usefully, it's different enough that my brain actually does
it instead of putting it off.
Most of the automatically-generated timestamps are
fine. It's just a few that might need tweaking.
It's nice to be able to skim them with fewer
keystrokes.
When I work with video captions, I often want to split long captions using subed-split-subtitle. If my player is somewhere in the current subtitle, it'll use that timestamp. If not, it'll make a reasonable guess based on character position.
I can use subed-word-data.el to load word-level times from WhisperX JSON or from YouTube SRV2 files. This allows me to split a subtitle using the timestamp for that word.
Because subed-word-data colours words based on transcription confidence, I can see where something might need to be closely examined, like when there's no timing information for the words at the start or end.
If I combine that with subed-waveform, I can see silences. Then I can tweak start times by shift-left-clicking on the waveform. This automatically adjusts the end time of the previous subtitle too.
I like how Emacs makes it easy to use word timing data when editing captions. Yay Emacs!
Note: Sometimes WhisperX gives me overlapping
timestamps for captions, so I use M-x
subed-align to get the aeneas forced alignment
tool to give me subtitle-level timestamps. Then I
use the word-level data from WhisperX for further
splitting.
Aside: I was trying to find some kind of
value-to-color translator for Emacs Lisp for
easier visualization, like the way the d3
Javascript library makes it easy to translate a
range of numbers (say, linear 0.0 to 1.0) to
colors (ex: red-yellow-green). I found
color-hsl-to-rgb and also the range of colours
defined by the faces calendar-scale-1 to
calendar-scale-10. There's also prism, which
colours code by depth and allows people to specify
the colour transformations (saturation, lightness,
etc.). I wonder if someone's already written a
general-purpose data-to-fg/bg-color Elisp library
that supports numerical and categorical data…
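As a rough sketch of the kind of thing I mean, color-hsl-to-rgb from color.el can already turn a 0.0 to 1.0 score into a red-yellow-green colour by mapping the value to a hue (illustrative only):

(require 'color)

(defun my-score-to-color (score)
  "Map SCORE between 0.0 and 1.0 to a colour from red through yellow to green."
  ;; Hue 0.0 is red and roughly 0.33 is green; keep saturation and lightness fixed.
  (apply #'color-rgb-to-hex (color-hsl-to-rgb (* 0.33 score) 0.8 0.7)))

(my-score-to-color 0.2) ;; low confidence => reddish
(my-score-to-color 0.9) ;; high confidence => greenish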
#!/usr/bin/python
# place in .git/hooks/post-commit
# Based on https://gist.github.com/ajmirsky/1245103
import subprocess
import re

print("checking for version change...",)
output = subprocess.check_output(['git', 'diff', 'HEAD^', 'HEAD', '-U0']).decode("utf-8")
version_info = None
for d in output.split("\n"):
    rg = re.compile(r'\+(?:;;\s+)?Version:\s+(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\.(?P<rev>[0-9]+)')
    m = rg.search(d)
    if m:
        version_info = m.groupdict()
        break
if version_info:
    tag = "v%s.%s.%s" % (version_info['major'], version_info['minor'], version_info['rev'])
    existing = subprocess.check_output(['git', 'tag']).decode("utf-8").split("\n")
    if tag in existing:
        print("%s is already tagged, not updating" % tag)
    else:
        result = subprocess.run(['git', 'tag', '-f', tag])
        if result.returncode:
            raise Exception('tagging not successful: %s %s' % (result.stdout, result.returncode))
        print("tagged revision: %s" % tag)
else:
    print("none found.")
I'm tweaking my video workflow. I use Orgzly Revived on my Android phone to write the text, and I use Easy Voice Recorder to record it. Syncthing automatically copies both to my laptop. I use WhisperX to transcribe my recording, and I use a little bit of Emacs Lisp to figure out timestamps for each word. I edit this to fix errors. I can even rearrange things and get rid of umms or ahs or anything I don't want. Then I use subed-convert to turn it into a VTT file. I can tweak the start and end times by looking at the waveforms. Then I add comments with the visuals I want. I can add images, animated GIFs, or videos, and they're automatically squeezed or stretched to fit. I can also have them play at original speed. Then I set up open captions and use subed-record-compile-video. Tada!
[2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
[2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.
I'm gradually shifting more things to this Lenovo
P52 to take advantage of its newer processor, 64
GB of RAM, and 2 TB drive. (Whee!) One of the
things I'm curious about is how I can make better
use of multimedia. I couldn't get whisper.cpp to
work on my Lenovo X230T, so I mostly relied on the
automatic transcripts from Google Recorder (with
timestamps generated by aeneas) or cloud-based
transcription services like Deepgram.
I have a lot of silences in my voice notes when I
think out loud. whisper.cpp got stuck in loops
during silent parts, but WhisperX handles them
perfectly. WhisperX is also fast enough for me to
handle audio files locally instead of relying on
Deepgram. With the default model, I can process
the files faster than real-time.
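For what it's worth, kicking off a transcription from Emacs can be as simple as something like this (a sketch; it just passes the audio file and lets WhisperX use its defaults):

(defun my-whisperx-transcribe (file)
  "Run WhisperX on FILE asynchronously, writing its output next to the audio."
  (interactive "fAudio file: ")
  (let ((default-directory (file-name-directory (expand-file-name file))))
    (start-process "whisperx" "*whisperx*" "whisperx" (expand-file-name file))))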
Among other things, it makes a text file that looks like this:
I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.
and a JSON file that looks like this:
{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}
Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:
(defun my-subed-word-tsv-from-whisperx-json (file)
  (interactive "FJSON: ")
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
    ;; numbers at the end of a sentence sometimes don't end up with times,
    ;; so we need to fix them
    (while current
      (unless (elt (car current) 1) ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2) ; end
        (setf (elt (car current) 2) (1- (elt (cadr current) 1))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))
    (subed-create-file
     filename
     base
     t
     'subed-tsv-mode)
    (find-file filename)))
Here's my old code for parsing the highlighted VTT or SRT files that underline each word:
(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))
I like to use the TSV format for this one because
it's easy to scan down the right side.
Incidentally, this format is compatible with
Audacity labels, so I could import the words there if I
wanted to. I like Emacs much more, though. I'm used
to having all my keyboard shortcuts at hand.
0.427000 0.507000 I
0.587000 0.887000 often
0.987000 1.227000 need
1.267000 1.508000 to...
4.329000 4.429000 I
4.469000 4.869000 sometimes
4.950000 5.170000 need
5.210000 5.410000 to
5.530000 6.090000 replace
Once I've deleted the words I don't want to
include, I can merge subtitles for phrases so that
I can keep the pauses between words. A quick
heuristic is to merge subtitles if they don't have
much of a pause between them.
(defvar my-subed-merge-close-subtitles-threshold 500)

(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-start)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))
Then I can use subed-waveform-show-all to tweak the start and end timestamps.
Here I switch to another file I've been editing…
I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.
I want to make lots of quick audio notes that I
can shuffle and listen to in order to remember
things I'm learning about Emacs (might even come
up with some kind of spaced repetition system),
and I'd like to make more videos someday too. I
think WhisperX, subed, and Org Mode will be fun
parts of my workflow.
It's nice to feel like you're saying someone's name correctly. We ask
EmacsConf speakers to introduce themselves in the first few seconds of
their video, but people often forget to do that, and that's okay. We
started recording introductions for EmacsConf 2022 so that stream
hosts don't have to worry about figuring out pronunciation while
they're live. Here's how I used subed-record to turn my recordings
into lots of little videos.
First, I generated the title images by using Emacs Lisp to replace
text in a template SVG and then using Inkscape to convert the SVG into
a PNG. Each image showed information for the previous talk as well as
the upcoming talk. (emacsconf-stream-generate-in-between-pages)
Then I generated the text for each talk based on the title, the
speaker names, pronunciation notes, pronouns, and type of Q&A. Each
introduction generally followed the pattern, "Next we have title by
speakers. Details about Q&A." (emacsconf-pad-expand-intro and
emacsconf-subed-intro-subtitles below)
00:00:00.000 --> 00:00:00.999
#+OUTPUT: sat-open.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/sat-open.svg.png]]
Next, we have "Saturday opening remarks".
00:00:05.000 --> 00:00:04.999
#+OUTPUT: adventure.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/adventure.svg.png]]
Next, we have "An Org-Mode based text adventure game for learning the basics of Emacs, inside Emacs, written in Emacs Lisp", by Chung-hong Chan. He will answer questions via Etherpad.
I copied the text into an Org note in my inbox, which Syncthing copied
over to the Orgzly Revived app on my Android phone. I used Google
Recorder to record the audio. I exported the m4a audio file and a
rough transcript, copied them back via Syncthing, and used
subed-record to edit the audio into a clean audio file without
oopses.
Each intro had a set of captions that started with a NOTE comment.
The NOTE comment specified the following:
#+AUDIO:: the audio source to use for the timestamped captions
that follow
[[file:...]]: the title image I generated for each talk. When
subed-record-compile-video sees a comment with a link to an image,
video, or animated GIF, it takes that visual and uses it for the
span of time until the next visual.
#+OUTPUT: the file to create.
NOTE
#+OUTPUT: hyperdrive.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/hyperdrive.svg.png]]
#+AUDIO: intros-2023-11-21-cleaned.opus

00:00:15.680 --> 00:00:17.599
Next, we have "hyperdrive.el:

00:00:17.600 --> 00:00:21.879
Peer-to-peer filesystem in Emacs", by Joseph Turner

00:00:21.880 --> 00:00:25.279
and Protesilaos Stavrou (also known as Prot).

00:00:25.280 --> 00:00:27.979
Joseph will answer questions via BigBlueButton,

00:00:27.980 --> 00:00:31.080
and Prot might be able to join depending on the weather.

00:00:31.081 --> 00:00:33.439
You can join using the URL from the talk page

00:00:33.440 --> 00:00:36.320
or ask questions through Etherpad or IRC.

NOTE
#+OUTPUT: steno.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/steno.svg.png]]
#+AUDIO: intros-2023-11-19-cleaned.opus

00:03:23.260 --> 00:03:25.480
Next, we have "Programming with steno",

00:03:25.481 --> 00:03:27.700
by Daniel Alejandro Tapia.

NOTE
#+AUDIO: intro-2023-11-29-cleaned.opus

00:00:13.620 --> 00:00:16.580
You can ask your questions via Etherpad and IRC.

00:00:16.581 --> 00:00:18.079
We'll send them to the speaker

00:00:18.080 --> 00:00:19.919
and post the answers in the talk page

00:00:19.920 --> 00:00:21.320
after the conference.
I could then call subed-record-compile-video to create the videos
for all the intros, or mark a region with C-SPC and then use
subed-record-compile-video to compile only the intros inside that region.
Using Emacs to edit the audio and compile videos worked out really
well because it made it easy to change things.
Changing pronunciation or titles: For EmacsConf 2023, I got the
recordings sorted out in time for the speakers to correct my
pronunciation if they wanted to. Some speakers also changed their
talk titles midway. If I wanted to redo an intro, I just had to
rerecord that part, run it through my subed-record audio cleaning
process, add an #+AUDIO: comment specifying which file I want to
take the audio from, paste it into my main intros.vtt, and
recompile the video.
Cancelling talks: One of the talks got cancelled, so I needed to
update the images for the talk before it and the talk after it. I
regenerated the title images and recompiled the videos. I didn't
even need to figure out which talk needed to be updated - it was easy
enough to just recompile all of them.
Changing type of Q&A: For example, some speakers needed to switch
from answering questions live to answering them after the
conference. I could just delete the old instructions, paste in the
instructions from elsewhere in my intros.vtt (making sure to set
#+AUDIO to the file if it came from a different take), and
recompile the video.
And of course, all the videos were captioned. Bonus!
So that's how using Emacs to edit and compile simple videos saved me a
lot of time. I don't know how I'd handle this otherwise. 47 video
projects that might all need to be updated if, say, I changed the
template? Yikes. Much better to work with text. Here are the technical
details.
Generating the title images
I used Inkscape to add IDs to our template SVG so that I could edit
them with Emacs Lisp. From emacsconf-stream.el:
emacsconf-stream-generate-in-between-pages: Generate the title images.
emacsconf-stream-svg-set-text: Update DOM to set the tspan in the element with ID to TEXT.
(defun emacsconf-stream-svg-set-text (dom id text)
  "Update DOM to set the tspan in the element with ID to TEXT.
If the element doesn't have a tspan child, use the element itself."
  (if (or (null text) (string= text ""))
      (let ((node (dom-by-id dom id)))
        (when node
          (dom-set-attribute node 'style "visibility: hidden")
          (dom-set-attribute (dom-child-by-tag node 'tspan) 'style "fill: none; stroke: none")))
    (setq text (svg--encode-text text))
    (let ((node (or (dom-child-by-tag
                     (car (dom-by-id dom id))
                     'tspan)
                    (dom-by-id dom id))))
      (cond
       ((null node)
        (error "Could not find node %s" id)) ; skip
       ((= (length node) 2)
        (nconc node (list text)))
       (t (setf (elt node 2) text))))))
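To give a feel for how this gets used, here's a hedged sketch; the template path and element IDs below are made up, and the real driver is emacsconf-stream-generate-in-between-pages:

(require 'dom)
(require 'svg) ;; emacsconf-stream-svg-set-text uses svg--encode-text

(let ((dom (car (xml-parse-file "in-between-template.svg"))))
  (emacsconf-stream-svg-set-text dom "title" "Programming with steno")
  (emacsconf-stream-svg-set-text dom "speakers" "Daniel Alejandro Tapia")
  (with-temp-file "steno.svg"
    (dom-print dom))
  ;; Inkscape 1.x can export the PNG from the command line
  (call-process "inkscape" nil nil nil "--export-type=png" "steno.svg"))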
emacsconf-pad-expand-intro: Make an intro for TALK.
(defun emacsconf-pad-expand-intro (talk)
  "Make an intro for TALK."
  (cond
   ((null (plist-get talk :speakers))
    (format "Next, we have \"%s\"." (plist-get talk :title)))
   ((plist-get talk :intro-note)
    (plist-get talk :intro-note))
   (t
    (let ((pronoun (pcase (plist-get talk :pronouns)
                     ((rx "she") "She")
                     ((rx "\"ou\"") "Ou")
                     ((or 'nil "nil" (rx string-start "he") (rx "him")) "He")
                     ((rx "they") "They")
                     (_ (or (plist-get talk :pronouns) "")))))
      (format "Next, we have \"%s\", by %s%s.%s"
              (plist-get talk :title)
              (replace-regexp-in-string ", \\([^,]+\\)$" ", and \\1"
                                        (plist-get talk :speakers))
              (emacsconf-surround " (" (plist-get talk :pronunciation) ")" "")
              (pcase (plist-get talk :q-and-a)
                ((or 'nil "") "")
                ((rx "after") " You can ask questions via Etherpad and IRC. We'll send them to the speaker, and we'll post the answers on the talk page afterwards.")
                ((rx "live")
                 (format " %s will answer questions via BigBlueButton. You can join using the URL from the talk page or ask questions through Etherpad or IRC."
                         pronoun))
                ((rx "pad")
                 (format " %s will answer questions via Etherpad."
                         pronoun))
                ((rx "IRC")
                 (format " %s will answer questions via IRC in the #%s channel."
                         pronoun
                         (plist-get talk :channel)))))))))
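For example, with a made-up talk plist, the generated intro text comes out something like this:

(emacsconf-pad-expand-intro
 (list :title "Programming with steno"
       :speakers "Daniel Alejandro Tapia"
       :pronouns "he/him"
       :q-and-a "after"))
;; => "Next, we have \"Programming with steno\", by Daniel Alejandro Tapia.
;;     You can ask questions via Etherpad and IRC. ..."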
Finding enough quiet focused time to record audio is a challenge. I
often have to re-record segments in order to correct brain hiccups or
to restart after interruptions. It's also hard for me to sit still and
listen to my recordings looking for mistakes to edit out. I'm not
familiar enough with Audacity to zip around with keyboard shortcuts,
and I don't like listening to myself again and again in order to find
my way around an audio file.
Sure, I could take the transcript, align it with subed-align and
Aeneas to get the timestamps, and then use subed-convert to get a CSV (actually a TSV since it uses tabs)
that I can import into Audacity as labels, but it still feels a little
awkward to navigate. I have to zoom in a lot for the text to be
readable.
So here's a workflow I've been experimenting with for cleaning up my
recorded audio.
Just like with my audio braindumps, I use Google Recorder on my phone
because I can get the audio file and a rough transcript, and because
the microphone on it is better than on my laptop. For narration
recordings, I hide in the closet because the clothes muffle echoes. I
don't feel as self-conscious there as I might be if I recorded in the
kitchen, where my computer usually is. I used to record in Emacs using
subed-record by pressing left to redo a segment and right to
move on to the next one, but using my phone means I don't have to deal
with the computer's noises or get the good mic from downstairs.
I start the recorder on my phone and then switch to my Org file in
Orgzly Revived, where I've added my script. I read it
as far as I can go. If I want to redo a segment, I say "Oops" and then
just redo the last phrase or so.
Screenshot of Google Recorder on my phone
I export the transcript and the M4A audio file using Syncthing, which
copies them to my computer. I have a function that copies the latest
recording and even sets things up for removing oops segments
(my-subed-copy-latest-phone-recording, which calls my-split-oops).
If I want to process several files, I can copy them over with
my-subed-copy-recording.
my-subed-copy-latest-phone-recording: Copy the latest recording transcript and audio to DESTINATION.
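The code for that isn't included here, but a simplified sketch of the idea looks like this (the Syncthing directory and file extensions are assumptions):

(defvar my-phone-recording-dir "~/sync/phone-recordings/"
  "Where Syncthing drops the audio files and transcripts from my phone.")

(defun my-subed-copy-latest-phone-recording (destination)
  "Copy the latest recording transcript and audio to DESTINATION."
  (interactive "DDestination: ")
  (let* ((audio (car (sort (directory-files my-phone-recording-dir t "\\.m4a\\'")
                           #'file-newer-than-file-p)))
         (transcript (concat (file-name-sans-extension audio) ".txt")))
    (copy-file audio destination t)
    (when (file-exists-p transcript)
      (copy-file transcript destination t)
      (find-file (expand-file-name (file-name-nondirectory transcript) destination))
      (my-split-oops))))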
I'll use Aeneas to get the timestamps for each line of text, so a
little bit of text processing will let me identify the segments that I
want to remove. The way my-split-oops works is that it looks for
"oops" in the transcript. Whenever it finds "oops", it adds a newline
afterwards. Then it takes the next five words and sees if it can
search backward for them within 300 characters. If it finds the words,
then that's the start of my repeated segment, and we can add a newline
before that. If it doesn't find the words, we try again with four
words, then three, then two, then one. I can also manually review the
file and see if the oopses are well lined up. When they're detected
properly, I should see partially duplicated lines.
I used to record using sub-record by using by. Oops,
I used to record. Oops,
I used to record an emacs using subhead record, by pressing left to reduce segment, and write to move on to the next one.
But using my phone means, I don't have to deal with them. Oops.
But using my phone means, I don't have to deal with the computer's noises or get the good mic from downstairs. I started recorder on my phone
my-split-oops: Look for oops and make it easier to split.
(defun my-split-oops ()
  "Look for oops and make it easier to split."
  (interactive)
  (let ((scan-window 300))
    (while (re-search-forward "oops[,\.]?[ \n]+" nil t)
      (let ((start (min (line-beginning-position) (- (point) scan-window)))
            start-search
            found
            search-for)
        (if (bolp)
            (progn
              (backward-char)
              (setq start (min (line-beginning-position) (- (point) scan-window))))
          (insert "\n"))
        (save-excursion
          (setq start-search (point))
          ;; look for 1..5 words back
          (goto-char
           (or
            (cl-loop
             for n downfrom 5 downto 1
             do
             (save-excursion
               (dotimes (_ n) (forward-word))
               (setq search-for (downcase (string-trim (buffer-substring start-search (point)))))
               (goto-char start-search)
               (when (re-search-backward (regexp-quote search-for) start t)
                 (goto-char (match-beginning 0))
                 (cl-return (point)))))
            (and (call-interactively 'isearch-backward) (point))))
          (insert "\n"))))))
Once the lines are split up, I use subed-align and get a VTT file.
The oops segments will be in their own subtitles.
The timestamps still need a bit of tweaking sometimes, so I use
subed-waveform-show-current or subed-waveform-show-all. I can use
the following bindings:
middle-click to play a sample
M-left-click to set the start and copy to the previous subtitle
left-click to set the start without changing the previous one
M-right-click to set the end and copy to the next subtitle
right-click to set the end without changing the next one
M-j to jump to the current subtitle and play it again in MPV
M-J to jump to close to the end of the current subtitle and play it in MPV
I use my-subed-delete-oops to delete the oops segments. I can also
just mark them for skipping by calling C-u M-x my-subed-delete-oops
instead.
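my-subed-delete-oops itself isn't shown here; a simplified sketch of the deleting part might look like this, assuming each retake ended up in its own subtitle ending with "Oops" after subed-align:

(defun my-subed-delete-oops-sketch ()
  "Delete subtitles whose text ends with \"oops\"."
  (interactive)
  ;; iterate in reverse so deleting a subtitle doesn't skip the next one
  (subed-for-each-subtitle (point-min) (point-max) t
    (when (string-match "oops[,.]?[ \t\n]*\\'"
                        (downcase (or (subed-subtitle-text) "")))
      (subed-kill-subtitle))))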
Then I add a #+OUTPUT: filename-cleaned.opus comment under a NOTE
near the beginning of the file. This tells
subed-record-compile-audio where to put the output.
WEBVTT

NOTE
#+SKIP

00:00:00.000 --> 00:00:10.319
Finding enough. Oops.

NOTE
#+OUTPUT: 2023-12-subed-record-cleaned.opus

00:00:10.320 --> 00:00:36.319
Finding enough quiet Focused. Time to record. Audio is a challenge. I often have to re-record segments in order to correct brain hiccups, or to restart after interruptions.
I can test short segments by marking the region with C-SPC and using
subed-record-compile-try-flow. This lets me check if the transitions
between segments make sense.
When I'm happy with everything, I can use subed-record-compile-audio
to extract the segments specified by the start and end times of each
subtitle and concatenate them one after the other in the audio file
specified by the output. The result should be a clean audio file.
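I believe subed-record-compile-audio drives ffmpeg through compile-media.el behind the scenes; conceptually, the compile step is equivalent to trimming each kept segment and concatenating the pieces, roughly like this stand-alone sketch (not the actual implementation):

(require 'cl-lib)

(defun my-concat-segments-sketch (audio-file segments output)
  "Cut SEGMENTS, a list of (START-MSECS . END-MSECS), from AUDIO-FILE into OUTPUT."
  (let ((parts (cl-loop for (start . end) in segments
                        for i from 0
                        for part = (format "/tmp/part-%d.opus" i)
                        do (call-process "ffmpeg" nil nil nil "-y"
                                         "-i" audio-file
                                         "-ss" (format "%.3f" (/ start 1000.0))
                                         "-to" (format "%.3f" (/ end 1000.0))
                                         part)
                        collect part))
        (list-file "/tmp/parts.txt"))
    ;; ffmpeg's concat demuxer reads a list of files and joins them
    (with-temp-file list-file
      (dolist (p parts) (insert (format "file '%s'\n" p))))
    (call-process "ffmpeg" nil nil nil "-y" "-f" "concat" "-safe" "0"
                  "-i" list-file "-c" "copy" output)))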
If I need to compile an audio file from several takes, I process each
take separately. Once I've adjusted the timestamps and deleted or
skipped the oops segments, I add #+AUDIO: input-filename.opus to a
NOTE at the beginning of the file.
subed-record-insert-audio-source-note makes this easier. Then I copy
the file's subtitles into my main file. subed-record-compile-audio will take
the audio from whichever file was specified by the #+AUDIO: comment,
so I can use audio from different files.
Example VTT segment with multiple audio files
NOTE
#+AUDIO: 2023-11-11-emacsconf.m4a

00:10:55.617 --> 00:10:58.136
Sometimes we send emails one at a time.

NOTE
#+AUDIO: 2023-11-15-emacsconf.m4a

00:10:55.625 --> 00:11:03.539
Like when you let a speaker know that we've received a proposal That's mostly a matter of plugging the talks properties into the right places in the template.
Now I have a clean audio file that corresponds to my script. I can use
subed-align on my script to get the timestamps for each line using
the cleaned audio. Once I have a subtitle file, I can use
emacsconf-subed-split (in emacsconf-subed.el - which I
probably should add to subed-mode sometime) to quickly split the
captions up to fit the line lengths. Then I redo the timestamps with
subed-align and adjust timestamps with
subed-waveform-show-current.
So that's how I go from rough recordings with stutters and oopses to a
clean audio file with captions based on my script. People can probably
edit faster with Audacity wizardry or the AI audio editors that are in
vogue these days, but this little workflow gets around my impatience
with audio by turning it into (mostly) text, so that's cool. Let's see
if I can make more presentations now that I've gotten the audio side
figured out!