category - subed :: Sacha Chua

Categories: geek » emacs » subed

View topic page - RSS - Atom - Subscribe via email

Improving subed-vtt parsing; using dedicated windows in Emacs; training my intuition

Jan 28, 2025| emacs, subed

While putting together some notes on how to use subed.el with auto-generated YouTube captions, I decided to get subed-word-data working with the Youtube VTT format. The sample file I downloaded from one of my YouTube videos had a cue whose text started with a blank line, so I ended up redoing the way subed-vtt.el parsed cues. Twice, actually. The first time, I changed it to handle multiple blocks in cue text (separated by blank lines). Then I came across this test suite for WebVTT parsing and found out that the timing line for a cue doesn't have to be preceded by a blank line, so then I needed to change the VTT parsing again.

Fortunately, I inherited a large Buttercup test suite from subed.el's original author. I've been adding to it over the years. As I learned more about how I wanted subed.el to behave, I added more cases. Then I worked on shifting the code to behave the way I wanted it to. At first, I didn't quite understand what I wanted the code to do, but as I pinned down more test cases, I was able to figure it out.

This was the first time I used toggle-window-dedicated (C-x w d) extensively. This function keeps the same buffer displayed in the window instead of letting Emacs Lisp functions replace it. First, I set up a large window for my subed-vtt.el. I split the other side into a smaller window for my test-subed-vtt.el, another window for a temporary VTT file, and a window for output from whatever command I'm running. I set most of the windows to dedicated except for my temporary output window. I also enabled winner-mode just in case I messed things up so that I could restore with winner-undo. Dedicating the windows meant I didn't have to keep fussing around with my windows and buffers to get things back to where I wanted them to be. I did use C-x o (other-window) a lot, so maybe it'll be worth getting the hang of ace-window.

Figure 1: Dedicated windows for the code, the test file, and a VTT buffer for interactive testing; a spare window for other output

Sometimes I wanted to focus on one of those small windows. prot/window-single-toggle was helpful for maximizing a window and then returning to the previous configuration.

I mostly evaluated or edebugged my code and then used my-buttercup-run-dwim to run a suite in my test-subed-vtt.el. sp-backward-up-sexp and sp-narrow-to-sexp were also helpful for navigating the suites and focusing on whatever I was working on.

I'm looking forward to exploring the other test cases from that repository. It feels good to get better as a coder.

I just finished reading Mathematica: a Secret World of Imagination and Curiosity, by David Bessis. The idea of training your intuition echoed in my mind as I wrote test cases and changed the code. I started to look forward to the gaps between my understanding of the spec and my understanding of the test cases, and the gaps between the test cases and my implementation. It felt almost like a conversation. Sometimes it was hard to translate an idea into code. I felt myself getting muddled and turned around. Whenever I noticed that, I just had to back up and start from something I understood, figure out what I was uncertain about, and then go from there.

I like the way that subed.el gives me tools for thinking about text, times, and metadata together. That abstraction has been helpful for editing audio and making videos. The more solid I can make it, the easier it will be to imagine other things that use those ideas.

View org source for this post

You can e-mail me at sacha@sachachua.com.

Automatically correcting phrasing and misrecognized words in speech-to-text captions by using a script

Jan 11, 2025| speechtotext, subed, emacs

I usually write my scripts with phrases that could be turned into the subtitles. I figured I might as well combine that information with the WhisperX transcripts which I use to cut out my false starts and oopses. To do that, I use the string-distance function, which calculates how similar strings are, based on the Levenshtein [distance] algorithm. If I take each line of the script and compare it with the list of words in the transcription, I can add one transcribed word at a time, until I find the number with the minimum distance from my current script phrase. This lets me approximately match strings despite misrecognized words. I use oopses to signal mistakes. When I detect those, I look for the previous script line that is closest to the words I restart with. I can then skip the previous lines automatically. When the script and the transcript are close, I can automatically correct the words. If not, I can use comments to easily compare them at that point. Even though I haven't optimized anything, it runs well enough for my short videos. With these subtitles as a base, I can get timestamps with subed-align and then there's just the matter of tweaking the times and adding the visuals.

2025-01-09-01 Matching a script with a transcript – subed emacs

Text from sketch

Matching a script with a transcript 2025-01-09-01

script
record on my phone
WhisperX transcript (with false starts and recognition errors)

My current implementation is totally unoptimized (n²) but it's fine for short videos.

Process:

While there are transcript words to process
- Find the script line that has the minimum distance to the words left in the transcript. restart after oopses
Script
Transcript: min. distance between script phrase & transcript
Restarting after oops: find script phrase with minimum distance
Ex. script phrase: The Emacs text editor
Transcript: The Emax text editor is a…
Bar graph of distance decreasing, and then increasing again
Minimum distance
Oops?
- N: Use transcript words, or diff > threshold?
  - Y: Add script words as comment
  - N: Correct minor errors
- Y: Mark caption for skipping and look for the previous script line with minimum distance.

Result:

Untimed captions with comments
Aeneas
Timed captions for editing

This means I can edit a nicely-split, mostly-corrected file.

I've included the links to various files below so you can get a sense of how it works. Let's focus on an excerpt from the middle of my script file.

it runs well enough for my short videos.
With these subtitles as a base,
I can get timestamps with subed-align

When I call WhisperX with large-v2 as the model and --max_line_width 50 --segment_resolution chunk --max_line_count 1 as the options, it produces these captions corresponding to that part of the script.

01:25.087 --> 01:29.069
runs well enough for my short videos. With these subtitles

01:29.649 --> 01:32.431
as a base, I can get... Oops. With these subtitles as a base, I

01:33.939 --> 01:41.205
can get timestamps with subedeline, and then there's just

Running subed-word-data-use-script-file results in a VTT file containing this excerpt:

00:00:00.000 --> 00:00:00.000
it runs well enough for my short videos.

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
I can get... Oops.

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE
#+TRANSCRIPT: I can get timestamps with subedeline,
#+DISTANCE: 0.14

00:00:00.000 --> 00:00:00.000
I can get timestamps with subed-align

There are no timestamps yet, but subed-align can add them. Because subed-align uses the Aeneas forced alignment tool to figure out timestamps by lining up waveforms for speech-synthesized text with the recorded audio, it's important to keep the false starts in the subtitle file. Once subed-align has filled in the timestamps and I've tweaked the timestamps by using the waveforms, I can use subed-record to create an audio file that omits the subtitles that have #+SKIP comments.

The code is available as subed-word-data-use-script-file in subed-word-data.el. I haven't released a new version of subed.el yet, but you can get it from the repository.

In addition to making my editing workflow a little more convenient, I think it might also come in handy for applying the segmentation from tools like sub-seg or lachesis to captions that might already have been edited by volunteers. (I got sub-seg working on my system, but I haven't figured out lachesis.) If I call subed-word-data-use-script-file with the universal prefix arg C-u, it should set keep-transcript-words to true and keep any corrections we've already made to the caption text while still approximately matching and using the other file's segments. Neatly-segmented captions might be more pleasant to read and may require less cognitive load.

There's probably some kind of fancy Python project that already does this kind of false start identification and script reconciliation. I just did it in Emacs Lisp because that was handy and because that way, I can make it part of subed. If you know of a more robust or full-featured approach, please let me know!

View org source for this post

You can view 1 comment or e-mail me at sacha@sachachua.com.

#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time

Jan 9, 2025| audio, subed, yay-emacs, emacs, video

New in this video: subed-record-sum-time, #+PAD_LEFT and #+PAD_RIGHT

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Play by play:

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be.
- subed-record uses subtitles and directives in comments in a VTT subtitle file to edit audio and video. subed-record-sum-time calculates the resulting duration and displays it in the minibuffer.
First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times.
- I'm experimenting with an algorithmic way to combine the breaks from my script with the text from the transcript. subed-align calls the aeneas forced alignment tool to match up the text with the timestamps. I use subed-waveform-show-all to show all the waveforms.
Then I can skip my oopses.
- Adding a NOTE #+SKIP comment before a subtitle makes subed-record-compile-video and subed-record-compile-flow skip that part of the audio.
Sometimes WhisperX doesn't catch them,
- WhisperX sometimes doesn't transcribe my false starts if I repeat things quickly.
so I also look at waveforms
- subed-waveform-show-all adds waveforms for all the subtitles. If I notice there's a pause or a repeated shape in the waveform, or if I listen and notice the repetition, I can confirm by middle-clicking on the waveform to sample part of it.
and characters per second.
- Low characters per second is sometimes a sign that the timestamps are incorrect or there's a repetition that wasn't transcribed.
I already talk quickly, so I'm not going to speed that up
- Also, I already sound like a chipmunk; mechanically speeding up my recording to fit in a certain time will make that worse =)
but I can trim the pauses in between phrases which is easy to do with waveforms.
- left-click to set the start, right-click to set the stop. If I want to adjust the previous/next one at the same time, I would use shift-left-click or shift-right-click, but here I want to skip the gaps between phrases, so I adjust the current subtitle without making the previous/next one longer.
Sometimes, after reviewing a draft, I realize I need a little more time.
- I can specify visuals like a video, animated GIF, or an image by adding a [[file:...]] link in the comment for a subtitle. That visual will be used until the next visual is specified in a comment on a different subtitle. subed-record-compile-video can automatically speed up video clips to fit in the time for the current audio segment, which is the set of subtitles before the next visual is defined. After I compile and review the video, sometimes I notice that something goes by too quickly.
If the original audio has some silence, I can just copy and paste it.
- This can sometimes feel more natural than adding in complete silence.
If not, I can pad left or pad right to add some silence.
- I added a new feature so that I could specify something like #+PAD_RIGHT: 1.5 in a comment to add 1.5 seconds of silence after the audio specified by that subtitle.
I can try the flow of some sections
- I can select a region and then use M-x subed-record-compile-try-flow to play the audio or C-u M-x subed-record-compile-try-flow to play the audio+video for that region.
and compile the video when I'm ready.
- subed-record-compile-video compiles the video to the file specified in #+OUTPUT: filename. ffmpeg is very arcane, so I'm glad I can simplify my use of it with Emacs Lisp.
Emacs can do almost anything. Yay Emacs!
- Non-linear audio and video editing is actually pretty fun in a text editor, especially when I can just use M-x vundo to navigate my undo history.

Links:

View org source for this post

You can e-mail me at sacha@sachachua.com.

Editing videos with Emacs and subed-record.el

Jan 1, 2025| emacs, subed, video

I want to document more of my Minecraft adventures with A+. Video is a natural way to do this. It builds on her familiarity with the tutorials and streams she enjoys watching. I set up OBS on her laptop and plugged in my Blue Yeti microphone. We did our first interview yesterday. I edited and subtitled it (because why not!), uploaded it as an unlisted YouTube video, and shared it with her dad, sister, and cousins.

I did the video editing in Emacs with subed-record. First, I used WhisperX to transcribe the video, and I used subed-align to fix the timestamps with aeneas. I normalized the audio with Audacity and I exported the .opus file for use in subed-record.el. Then I added NOTE #+SKIP before times I wanted to remove, like when she asked for a retake. Here's what that subtitle markup looks like:

WEBVTT

NOTE #+SKIP

00:00:00.000 --> 00:00:16.679
And then I'll record in my side also
and we'll just put it in somehow.
Somehow. Okay. We can edit that, right?
Yeah, we'll learn how to edit things.
It'll be great.

NOTE
Introduction
#+AUDIO: cuberventures-001.opus
[[file:intro.webm]]
#+OUTPUT: cuberventures-001-fxnt-create-2-windmill-home-cafe-trains-hotel-half-underwater.webm

00:00:16.680 --> 00:00:19.399
Okay, so now we're here with <username>.

00:00:19.400 --> 00:00:23.039
I want to find out what you like about Minecraft and

00:00:23.040 --> 00:00:26.079
all the cool things that you have been building lately.

This was a little different from my usual video creation workflow, where I record the audio and the video separately. When I wrote subed-record.el, I assumed I'd edit the audio first, choose images/GIFs/videos that were already ready to go, and then combine those visuals with sections of audio, speeding things up or slowing things down as needed. Now I wanted to apply the same edits to the video as I did to the audio. A+ did a great job of looking at stuff in Minecraft while talking about them, so I wanted to keep her narration in sync. I added some code to allow me to specify a same-edits keyword for the visuals. That meant that I would use the same selection list that I used for cutting the audio. Here's what that subtitle markup looks like:

NOTE
[[file:2024-12-31 10-35-14.mkv]]
#+OPTIONS: same-edits

00:00:43.860 --> 00:00:45.941
Shall we take a tour of my world?

00:00:45.942 --> 00:00:50.079
Sure, let's tell people which mod pack this is.

00:00:50.080 --> 00:00:55.639
This is FXNT Create 2, also known as FoxyNoTail Create 2.

NOTE Windmill

00:00:55.640 --> 00:00:58.239
I've got this little bit of path leading to the interview

00:00:58.240 --> 00:01:01.839
room. This is my unfinished windmill. I've been meaning to

This workflow lets me cut out segments in the middle of the video, like this:

00:17:30.200 --> 00:17:33.119
great start for a tour. I'm looking forward to seeing what

00:17:33.120 --> 00:17:34.112
you will build next.

NOTE #+SKIP

00:17:34.113 --> 00:18:02.379
Do you have any last words before
we try to figure out this video editing thing?
Yeah. We'll cut that last part out.
Let's just do a retake on that last part.
Someday. Out here. Okay. There you go.
This is a beautiful view.

00:18:02.380 --> 00:18:08.119
The last things I want to say about this world is there'll be

I also wanted to start the video with a segment from my recording, so we could see her avatar on screen during the introduction. She kept her computer on first-person POV instead of changing the camera. I used mpv to figure out the timestamps for the start and end of the part that I wanted to use, then I used ffmpeg to cut that clip. I added a comment with a link to that video in order to use it before the main video. That's the [[file:intro.webm]] in the first section's comments.

After testing a small section of the transcript by selecting a region and using subed-record-compile-video, I deselected the region and used subed-record-compile-video to produce the whole video.

I also modified subed-record-compile-subtitles to include the other non-directive comments, so I can include the section headings in the raw VTT file and have them turn up in the exported version. Then I can use the new subed-section-comments-as-chapters command to copy those as chapters for the YouTube description.

We're not going to share that particular video yet, but I'm looking forward to trying that technique with videos about stuff I'm figuring out in Minecraft or Emacs. It's also tempting me to think about ways to specify transitions like crossfades and other fancy effects like overlays.

I like using the transcript as the starting point for video editing. It just makes sense to me to work with it as text. I also like this experiment with documenting more of our Minecraft experimentation. It seems to get her talking and encourages her to build more. I'm looking forward to learning more about Minecraft and making videos too.

We did another video today using the new shortcuts I've just set up for toggling OBS recording. This time we didn't even need to do any editing. I used Org Export to make her a little HTML file that had the two videos on it, so she can review it any time. Onward!

View org source for this post

You can e-mail me at sacha@sachachua.com.

subed.el: Tweaking subtitle times

Dec 19, 2024| emacs, subed

When subtitle times are too far off from the video or audio, people start worrying if their video has frozen or jumped ahead. It's good to keep subtitles roughly in time with the audio.

For EmacsConf, we can get timing information from two places. WhisperX produces a JSON file with word data in the process of doing the speech recognition, and the aeneas forced alignment tool can use synthesized text-to-speech to figure out the timestamps for each line of text compared to a media file.

Aeneas timestamps are more helpful once we start editing, but it can be confused by long silences, extraneous noises, multiple speakers, and inaccurate transcripts (words added or removed).

When I combine the WhisperX word data with subtitles, I can see where the times might need a closer look because matching words weren't found.

Figure 1: Screenshot with word data loaded

Loading word data requires a pretty close match at the moment, but since we change only about 4% of the subtitle text when editing, those cues are still helpful. (I measured this by the Levenshtein distance between the combined cue texts of edited subtitles versus the original WhisperX transcripts, using string-distance to approximate the editing percentage.)

Calculating how much we edited

(let ((sum-original 0)
      (sum-dist 0))
  (append
   (seq-keep
    (lambda (talk)
      (when (and (emacsconf-talk-file talk "--main.vtt")
                 (emacsconf-talk-file talk "--reencoded.json"))
        (let* ((json-object-type 'alist)
               (json-array-type 'list)
               (edited-text
                (mapconcat (lambda (sub) (elt sub 3))
                           (subed-parse-file (emacsconf-talk-file talk "--main.vtt"))
                           " "))
               (original-text
                (mapconcat
                 (lambda (word)
                   (assoc-default 'word word))
                 (assoc-default
                  'word_segments
                  (json-read-file (emacsconf-talk-file talk "--reencoded.json")))
                 " "))
               (dist (string-distance original-text edited-text)))
          (setq sum-original (+ sum-original (length original-text)))
          (setq sum-dist (+ sum-dist dist))
          (list
           (length original-text)
           (length edited-text)
           dist))))
    (emacsconf-get-talk-info))
   '(hline)
   (list
    (list
     sum-original
     (format "%d%%" (/ (* 100.0 sum-dist) sum-original))
     sum-dist))))

To make it easier to correct subtitle timing, I added a few ways to tweak subtitle timing for a region of subtitles.

WhisperX: subed-word-data-fix-subtitle-timing in subed-word-data.el tries to match the word data from WhisperX against the text of the current subtitle, using string-distance for approximate matches. I start at about two words shorter than what's in the subtitle, and then increase the number of words taken from the data while the string distance decreases. I skip the data for words before the beginning of the first subtitle in the region.

Screencast of subed-word-data-fix-subtitle-timing

Aeneas: subed-align-region uses Aeneas to realign the subtitles from the region using the section of the media file between the start of the first subtitle and the end of the last subtitle in the region. When I notice that the times are off, I skim the subtitles (or just skim them visually) to find the last well-timed subtitle. Then I pick a subtitle that's in the incorrectly-timed section. I use subed-mpv-jump-to-current-subtitle (M-j) to jump to that position, and I play back that subtitle. It usually belongs to some text further down, so I reset to that position with M-j, set my mark before the previous correctly-timed subtitle with C-SPC, go to the subtitle that matches that time, and use subed-copy-player-pos-to-start-time (C-c [) to set the proper timestamp. Then I can go to the previous incorrectly-timed subtitle and use M-x subed-align-region. This runs the Aeneas forced alignment tool using just the subtitle text in the region, the starting timestamp of the first subtitle, and the ending timestamp of the last subtitle, making it easy to adjust that section. subed-align-region is in subed-align.el

Retiming by pressing SPC after each subtitle: As an experiment, I've also added a subed-retime-subtitles command that plays through the subtitles so that I can press SPC when the next subtitle starts. It begins with the current subtitle and stops when you press a key that's not in its keymap.

Screencast with audio: subed-retime-subtitles

Manual adjustments: For fine-tuning timestamps, I usually turn on subed-waveform-show-all and shift-left-click (subed-waveform-set-start-and-copy-to-previous) or shift-right-click (subed-waveform-set-stop-and-copy-to-next) on the waveforms because it's easy to see where the words and pauses are. When I'm not sure, I can use middle-click (subed-waveform-play-sample) to play part of the file without changing the subtitle start/stop or the MPV playback position.

Screencast with audio of using the waveforms

I'm experimenting with adding repeating keybindings. There was a subed-mpv-frame-step-map that was bound to C-c C-f, so I've renamed it to subed-mpv-control, added a whole bunch of keybindings to the subed-mpv-control-map based on MPV and Aegisub shortcuts, and made it a repeating transient map.

Screencast with audio, experimenting with the mpv control map

Ideas for next steps:

Gotta get the hang of all these new capabilities through practice! =)

To make my subed-align-region workflow even more convenient, I could use completing-read to let me select a future subtitle with completion, and then Emacs could automatically fix the subtitle start time, go to the previous subtitle, and realign the region.

Also, I think switching the waveforms from overlays to text properties could be a good idea. When I cut text, the overlays get left behind, but I want the waveforms to go away too.

While writing this post and fiddling with subed, I ended up adding a bunch of keybindings and a menu. I figured this was as good a time as any to stop tweaking it and finally publish. (But it's fun! Just one more idea…)

View org source for this post

You can e-mail me at sacha@sachachua.com.

Checking caption timing by skimming with Emacs Lisp or JS

Nov 17, 2024| js, emacs, subed

Sometimes automatic subtitle timing tools like Aeneas can get confused by silences, extraneous sounds, filler words, mis-starts, and text that I've edited out of the raw captions for easier readability. It's good to quickly check each caption. I used to listen to captions at 1.5x speed, watching carefully as each caption displayed. This took a fair bit of time and focus, so… it usually didn't happen. Sampling the first second of each caption is faster and requires a little less attention.

Skimming with subed.el

Here's a function that I wrote to play the first second of each subtitle.

(defvar my-subed-skim-msecs 1000 "Number of milliseconds to play when skimming.")
(defun my-subed-skim-starts ()
  (interactive)
  (subed-mpv-unpause)
  (subed-disable-loop-over-current-subtitle)
  (catch 'done
    (while (not (eobp))
      (subed-mpv-jump-to-current-subtitle)
      (let ((ch
             (read-char "(q)uit? " nil (/ my-subed-skim-msecs 1000.0))))
        (when ch
          (throw 'done t)))
      (subed-forward-subtitle-time-start)
      (when (and subed-waveform-minor-mode
                 (not subed-waveform-show-all))
        (subed-waveform-refresh))
      (recenter)))
  (subed-mpv-pause))

Now I can read the lines as the subtitles play, and I can press any key to stop so that I can fix timestamps.

Skimming with Javascript

I also want to check the times on the Web in case there have been caching issues. Here's some Javascript to skim the first second of each cue in the first text track for a video, with some code to make it easy to process the first video in the visible area.

function getVisibleVideo() {
  const videos = document.querySelectorAll('video');
  for (const video of videos) {
    const rect = video.getBoundingClientRect();
    if (
      rect.top >= 0 &&
      rect.left >= 0 &&
      rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
      rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    ) {
      return video;
    }
  }
  return null;
}

async function skimVideo(video=getVisibleVideo(), msecs=1000) {
  // Get the first text track (assumed to be captions/subtitles)
  const textTrack = video.textTracks[0];
  if (!textTrack) return;
  const remaining = [...textTrack.cues].filter((cue) => cue.endTime >= video.currentTime);
  video.play();
  // Play the first 1 second of each visible subtitle
  for (let i = 0; i < remaining.length && !video.paused; i++) {
    video.currentTime = remaining[i].startTime;
    await new Promise((resolve) => setTimeout(resolve, msecs));
  }
}

Then I can call it with skimVideo();. Actually, in our backstage area, it might be useful to add a Skim button so that I can skim things from my phone.

function handleSkimButton(event) {
   const vid = event.target.closest('.vid').querySelector('video');
   skimVideo(vid);
 }

document.querySelectorAll('video').forEach((vid) => {
   const div = document.createElement('div');
   const skim = document.createElement('button');
   skim.textContent = 'Skim';
   div.appendChild(skim);
   vid.parentNode.insertBefore(div, vid.nextSibling);
   skim.addEventListener('click', handleSkimButton);
});

Results

How much faster is it this way?

Some code to help figure out the speedup

(-let* ((files (directory-files "~/proj/emacsconf/2024/cache" t "--main\\.vtt"))
        ((count-subs sum-seconds)
         (-unzip (mapcar
                  (lambda (file)
                    (list
                     (length (subed-parse-file file))
                     (/ (compile-media-get-file-duration-ms
                         (concat (file-name-sans-extension file) ".webm")) 1000.0)))
                  files)))
        (total-seconds (-reduce #'+ sum-seconds))
        (total-subs (-reduce #'+ count-subs)))
  (format "%d files, %.1f hours, %d total captions, speed up of %.1f"
          (length files)
          (/ total-seconds 3600.0)
          total-subs
          (/ total-seconds total-subs)))

It looks like for EmacsConf talks where we typically format captions to be one long line each (< 60 characters), this can be a speed-up of about 4x compared to listening to the video at normal speed. More usefully, it's different enough to get my brain to do it instead of putting it off.

Most of the automatically-generated timestamps are fine. It's just a few that might need tweaking. It's nice to be able to skim them with fewer keystrokes.

View org source for this post

You can e-mail me at sacha@sachachua.com.

Yay Emacs 7: Using word timing in caption editing with subed-word-data

Posted: Nov 16, 2024 - Modified: Aug 8, 2025| yay-emacs, subed

When I work with video captions, I often want to split long captions using subed-split-subtitle. If my player is somewhere in the current subtitle, it'll use that timestamp. If not, it'll make a reasonable guess based on character position.

I can use subed-word-data.el to load word-level times from WhisperX JSON or from Youtube SRV2 files. This allows me to split a subtitle using the timestamp for that word.

Because subed-word-data colours words based on transcription confidence, I can see where something might need to be closely examined, like when there's no timing information for the words at the start or end.

If I combine that with subed-waveform, I can see silences. Then I can tweak start times by shift-left-clicking on the waveform. This automatically adjusts the end time of the previous subtitle too.

I like how Emacs makes it easy to use word timing data when editing captions. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Note: Sometimes WhisperX gives me overlapping timestamps for captions, so I use M-x subed-align to get the aeneas forced alignment tool to give me subtitle-level timestamps. Then I use the word-level data from WhisperX for further splitting.

Links:

Aside: I was trying to find some kind of value-to-color translator for Emacs Lisp for easier visualization, like the way the d3 Javascript library makes it easy to translate a range of numbers (say, linear 0.0 to 1.0) to colors (ex: red-yellow-green). I found color-hsl-to-rgb and also the range of colours defined by the faces calendar-scale-1 to calendar-scale-10. There's also prism, which colours code by depth and allows people to specify the colour transformations (saturation, lightness, etc.). I wonder if someone's already written a general-purpose data-to-fg/bg-color Elisp library that supports numerical and categorical data…

View org source for this post

You can e-mail me at sacha@sachachua.com.