Automatically correcting phrasing and misrecognized words in speech-to-text captions by using a script

| speechtotext, subed, emacs

I usually write my scripts with phrases that could be turned into the subtitles. I figured I might as well combine that information with the WhisperX transcripts which I use to cut out my false starts and oopses. To do that, I use the string-distance function, which calculates how similar strings are, based on the Levenshtein distance algorithm. If I take each line of the script and compare it with the list of words in the transcription, I can add one transcribed word at a time, until I find the number with the minimum distance from my current script phrase. This lets me approximately match strings despite misrecognized words. I use oopses to signal mistakes. When I detect those, I look for the previous script line that is closest to the words I restart with. I can then skip the previous lines automatically. When the script and the transcript are close, I can automatically correct the words. If not, I can use comments to easily compare them at that point. Even though I haven't optimized anything, it runs well enough for my short videos. With these subtitles as a base, I can get timestamps with subed-align and then there's just the matter of tweaking the times and adding the visuals.

Text from sketch

Matching a script with a transcript 2025-01-09-01

  • script
  • record on my phone
  • WhisperX transcript (with false starts and recognition errors)

My current implementation is totally unoptimized (n²) but it's fine for short videos.

Process:

  • While there are transcript words to process
    • Find the script line that has the minimum distance to the words left in the transcript. restart after oopses
  • Script
  • Transcript: min. distance between script phrase & transcript
  • Restarting after oops: find script phrase with minimum distance
  • Ex. script phrase: The Emacs text editor
  • Transcript: The Emax text editor is a…
  • Bar graph of distance decreasing, and then increasing again
  • Minimum distance
  • Oops?
    • N: Use transcript words, or diff > threshold?
      • Y: Add script words as comment
      • N: Correct minor errors
    • Y: Mark caption for skipping and look for the previous script line with minimum distance.

Result:

  • Untimed captions with comments
  • Aeneas
  • Timed captions for editing

This means I can edit a nicely-split, mostly-corrected file.
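As a rough illustration of the matching step in the sketch above, here is a minimal Emacs Lisp example. It is not the code from subed-word-data.el: the function name my-sketch-closest-prefix and the choice to normalize by the longer string's length are assumptions made for this example. It adds one transcript word at a time and remembers the word count with the smallest normalized string-distance. Like the approach described above, it is unoptimized and scans every word instead of stopping once the distance starts rising.

```elisp
;; Hypothetical helper for illustration; not part of subed.
(defun my-sketch-closest-prefix (phrase words)
  "Return (COUNT . DISTANCE) for the prefix of WORDS closest to PHRASE.
WORDS is a list of transcript words.  DISTANCE is the Levenshtein
distance divided by the length of the longer string, so 0.0 means
an exact match.  Return (0 . nil) when WORDS is empty."
  (let ((target (downcase phrase))
        (candidate "")
        (count 0)
        (best-count 0)
        (best-distance nil))
    (dolist (word words)
      ;; Grow the candidate by one transcript word.
      (setq count (1+ count)
            candidate (if (string= candidate "")
                          (downcase word)
                        (concat candidate " " (downcase word))))
      (let ((distance (/ (float (string-distance target candidate))
                         (max (length target) (length candidate) 1))))
        ;; Remember the prefix with the smallest distance so far.
        (when (or (null best-distance) (< distance best-distance))
          (setq best-count count
                best-distance distance))))
    (cons best-count best-distance)))
```

With the example from the sketch, (my-sketch-closest-prefix "The Emacs text editor" '("The" "Emax" "text" "editor" "is" "a")) selects the first four words, because adding "is" makes the distance go back up.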

I've included the links to various files below so you can get a sense of how it works. Let's focus on an excerpt from the middle of my script file.

it runs well enough for my short videos.
With these subtitles as a base,
I can get timestamps with subed-align

When I call WhisperX with large-v2 as the model and --max_line_width 50 --segment_resolution chunk --max_line_count 1 as the options, it produces these captions corresponding to that part of the script.

01:25.087 --> 01:29.069
runs well enough for my short videos. With these subtitles

01:29.649 --> 01:32.431
as a base, I can get... Oops. With these subtitles as a base, I

01:33.939 --> 01:41.205
can get timestamps with subedeline, and then there's just

Running subed-word-data-use-script-file results in a VTT file containing this excerpt:

00:00:00.000 --> 00:00:00.000
it runs well enough for my short videos.

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
I can get... Oops.

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE
#+TRANSCRIPT: I can get timestamps with subedeline,
#+DISTANCE: 0.14

00:00:00.000 --> 00:00:00.000
I can get timestamps with subed-align

There are no timestamps yet, but subed-align can add them. Because subed-align uses the Aeneas forced alignment tool to figure out timestamps by lining up waveforms for speech-synthesized text with the recorded audio, it's important to keep the false starts in the subtitle file. Once subed-align has filled in the timestamps and I've tweaked the timestamps by using the waveforms, I can use subed-record to create an audio file that omits the subtitles that have #+SKIP comments.

The code is available as subed-word-data-use-script-file in subed-word-data.el. I haven't released a new version of subed.el yet, but you can get it from the repository.

In addition to making my editing workflow a little more convenient, I think it might also come in handy for applying the segmentation from tools like sub-seg or lachesis to captions that might already have been edited by volunteers. (I got sub-seg working on my system, but I haven't figured out lachesis.) If I call subed-word-data-use-script-file with the universal prefix arg C-u, it should set keep-transcript-words to true and keep any corrections we've already made to the caption text while still approximately matching and using the other file's segments. Neatly-segmented captions might be more pleasant to read and may require less cognitive load.

There's probably some kind of fancy Python project that already does this kind of false start identification and script reconciliation. I just did it in Emacs Lisp because that was handy and because that way, I can make it part of subed. If you know of a more robust or full-featured approach, please let me know!

Org Babel, Mermaid JS, and fixing "Failed to launch the browser process" on Ubuntu 24

| emacs, org

Mermaid makes pretty diagrams from text. It's JavaScript-based, so the command-line tool (mmdc) uses Puppeteer to get the results of evaluating the diagram in the browser. I was running into some errors trying to get it to work from Org Mode through ob-mermaid on Ubuntu 24, since apparently AppArmor restricts Puppeteer. (See "Error: Failed to launch the browser process! · Issue #730 · mermaid-js/mermaid-cli".)

I put together a pull request to modify ob-mermaid-cli-path so that it doesn't get quoted and can therefore have the aa-exec command needed to work around that. With that modified org-babel-execute:mermaid, I can then configure ob-mermaid like this:

(use-package ob-mermaid
  :load-path "~/vendor/ob-mermaid")
;; I need to override this so that the executable isn't quoted
(setq ob-mermaid-cli-path "aa-exec --profile chrome mmdc -c ~/.config/mermaid/config.json")
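If the same configuration is shared with a machine that doesn't have aa-exec, the command string could be built conditionally. This is only a sketch: my-sketch-mermaid-cli-command is a hypothetical helper, and it still assumes the modified ob-mermaid that does not quote ob-mermaid-cli-path.

```elisp
;; Hypothetical helper for illustration; not part of ob-mermaid.
(defun my-sketch-mermaid-cli-command (&optional config-file)
  "Return an mmdc command string, wrapped in aa-exec when it is installed.
CONFIG-FILE, if non-nil, is passed to mmdc with -c."
  (let ((base (concat "mmdc"
                      (when config-file
                        (concat " -c "
                                (shell-quote-argument
                                 (expand-file-name config-file)))))))
    (if (executable-find "aa-exec")
        ;; Same AppArmor workaround as in the configuration above.
        (concat "aa-exec --profile chrome " base)
      base)))
```

It could then be used as (setq ob-mermaid-cli-path (my-sketch-mermaid-cli-command "~/.config/mermaid/config.json")).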

I also ran into a problem where the library that Emacs uses to display SVGs could not handle the foreignObject elements used for the labels. (See "mermaid missing text in svg · Issue #112 · mermaid-js/mermaid-cli".) Using the following ~/.config/mermaid/config.json fixed it, and I put the option in the ob-mermaid-cli-path above so that it always gets loaded.

{
  "flowchart": {
    "useMaxWidth": false,
    "htmlLabels": false
  }
}

Here's sample Mermaid markup and the file it creates:

mindmap
  root((test))
    Node 1
      Node 1A
      Node 1B
    Node 2
    Node 3
[Image: rendered mindmap with root "test" and branches Node 1 (with Node 1A and Node 1B), Node 2, and Node 3]

Now I can see the labeled diagrams inside Emacs, too.

This is part of my Emacs configuration.

#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time

| audio, subed, yay-emacs, emacs, video

New in this video: subed-record-sum-time, #+PAD_LEFT and #+PAD_RIGHT

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align them with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Play by play:

  • I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be.
    • subed-record uses subtitles and directives in comments in a VTT subtitle file to edit audio and video. subed-record-sum-time calculates the resulting duration and displays it in the minibuffer.
  • First, I split the subtitles, align them with the audio to fix the timestamps, and double check the times.
    • I'm experimenting with an algorithmic way to combine the breaks from my script with the text from the transcript. subed-align calls the aeneas forced alignment tool to match up the text with the timestamps. I use subed-waveform-show-all to show all the waveforms.
  • Then I can skip my oopses.
    • Adding a NOTE #+SKIP comment before a subtitle makes subed-record-compile-video and subed-record-compile-flow skip that part of the audio.
  • Sometimes WhisperX doesn't catch them,
    • WhisperX sometimes doesn't transcribe my false starts if I repeat things quickly.
  • so I also look at waveforms
    • subed-waveform-show-all adds waveforms for all the subtitles. If I notice there's a pause or a repeated shape in the waveform, or if I listen and notice the repetition, I can confirm by middle-clicking on the waveform to sample part of it.
  • and characters per second.
    • Low characters per second is sometimes a sign that the timestamps are incorrect or there's a repetition that wasn't transcribed.
  • I already talk quickly, so I'm not going to speed that up
    • Also, I already sound like a chipmunk; mechanically speeding up my recording to fit in a certain time will make that worse =)
  • but I can trim the pauses in between phrases which is easy to do with waveforms.
    • Left-click sets the start and right-click sets the stop. If I want to adjust the previous/next one at the same time, I would use shift-left-click or shift-right-click, but here I want to skip the gaps between phrases, so I adjust the current subtitle without making the previous/next one longer.
  • Sometimes, after reviewing a draft, I realize I need a little more time.
    • I can specify visuals like a video, animated GIF, or an image by adding a [[file:...]] link in the comment for a subtitle. That visual will be used until the next visual is specified in a comment on a different subtitle. subed-record-compile-video can automatically speed up video clips to fit in the time for the current audio segment, which is the set of subtitles before the next visual is defined. After I compile and review the video, sometimes I notice that something goes by too quickly.
  • If the original audio has some silence, I can just copy and paste it.
    • This can sometimes feel more natural than adding in complete silence.
  • If not, I can pad left or pad right to add some silence.
    • I added a new feature so that I could specify something like #+PAD_RIGHT: 1.5 in a comment to add 1.5 seconds of silence after the audio specified by that subtitle.
  • I can try the flow of some sections
    • I can select a region and then use M-x subed-record-compile-try-flow to play the audio or C-u M-x subed-record-compile-try-flow to play the audio+video for that region.
  • and compile the video when I'm ready.
    • subed-record-compile-video compiles the video to the file specified in #+OUTPUT: filename. ffmpeg is very arcane, so I'm glad I can simplify my use of it with Emacs Lisp.
  • Emacs can do almost anything. Yay Emacs!
    • Non-linear audio and video editing is actually pretty fun in a text editor, especially when I can just use M-x vundo to navigate my undo history.
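
To make the duration check concrete, here is a small sketch of the kind of arithmetic involved. It is not the implementation of subed-record-sum-time: the function names, the (START-MS STOP-MS COMMENT) list format, and the decision to ignore padding on skipped subtitles are all assumptions made for this example.

```elisp
;; Hypothetical helpers for illustration; not part of subed-record.
(defun my-sketch-padding-ms (comment)
  "Return the total milliseconds of #+PAD_LEFT and #+PAD_RIGHT in COMMENT."
  (let ((case-fold-search nil)
        (pos 0)
        (total 0))
    (while (string-match "#\\+PAD_\\(?:LEFT\\|RIGHT\\): *\\([0-9.]+\\)"
                         comment pos)
      (setq total (+ total (* 1000 (string-to-number (match-string 1 comment))))
            pos (match-end 0)))
    total))

(defun my-sketch-sum-time (subtitles)
  "Return the expected duration in milliseconds for SUBTITLES.
SUBTITLES is a list of (START-MS STOP-MS COMMENT) lists, where
COMMENT is a string or nil.  Subtitles whose comment contains
#+SKIP are left out; padding directives add silence."
  (let ((case-fold-search nil)
        (total 0))
    (dolist (sub subtitles total)
      (let ((start (nth 0 sub))
            (stop (nth 1 sub))
            (comment (or (nth 2 sub) "")))
        (unless (string-match-p "#\\+SKIP" comment)
          (setq total (+ total (- stop start)
                         (my-sketch-padding-ms comment))))))))
```

For example, a skipped subtitle followed by one from 16680 to 19399 and one from 19400 to 23039 with #+PAD_RIGHT: 1.5 adds up to 2719 + 3639 + 1500 milliseconds.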

2025-01-06 Emacs news

| emacs, emacs-news

[2025-01-06 Mon]: Fix typo in exitter's name

Links from reddit.com/r/emacs, r/orgmode, r/spacemacs, r/planetemacs, Mastodon #emacs, Bluesky #emacs, Hacker News, lobste.rs, programming.dev, lemmy.world, lemmy.ml, communick.news, planet.emacslife.com, YouTube, the Emacs NEWS file, Emacs Calendar, and emacs-devel. Thanks to Andrés Ramírez for emacs-devel links. Do you have an Emacs-related link or announcement? Please e-mail me at sacha@sachachua.com. Thank you!

Yay Emacs 8: which-key-replacement-alist

| yay-emacs, emacs

[2025-01-07 Tue]: I like the way Karthik's blog posts have play-by-plays for the videos (ex: Fringe Matters: Finding the Right Difference), so I've added an experimental play by play at the bottom of this post.

I wish I could memorize all these keyboard shortcuts in Emacs, but I just don't use some commands often enough to build the muscle memory for them, especially since I have so many context-sensitive shortcuts thanks to Embark. That's why which-key is great. It used to be a separate package, but it's now part of Emacs 30. You just have to turn on which-key mode, and then, if you hesitate in the middle of a keyboard shortcut, Emacs will show you what you can do. It turns out that you can customize this menu, so here are some tweaks I'm trying. First, I want to change which-key-sort-order to the one that sorts by description. This groups similar functions together. Then I want to change the labels using which-key-replacement-alist. Let's try it without the dashes between words. Now let's turn on multiple replacements and make things even simpler. I can even use Unicode to make it easier to spot certain commands. And of course there's an Elisp way to customize all of this. I want to see if these little tweaks can help me use Emacs better. If you use which-key, maybe you'll also want to customize your menu. Let me know what you figure out!

You can watch this on YouTube, download the video, or download the audio.

Things I'm trying:

(with-eval-after-load 'which-key
  (setopt which-key-allow-multiple-replacements t
        which-key-sort-order 'which-key-description-order
        which-key-replacement-alist
        (seq-map
         (lambda (rep)
           `((nil . ,(elt rep 0))
             . (nil . ,(elt rep 1))))
         '(("my-subed-set-timestamp-to-mpv-position" "set to MPV")
           ("my-embark-org-copy-exported-url" "⭐🗐🔗 copy exported URL")
           ("my-subed-copy-timestamp-dwim" "copy")
           ("my-sketch-insert-file-as-link" "insert")
           ("my-geeqie-view" "geeqie")
           ("my-journal-edit" "edit")
           ("my-org-link-qr" "qr")
           ("my-image-open-in-" "")
           ("org-babel-" "ob-")
           ("next" "🠆")
           ("previous" "🠄")
           ("my-image-" "")
           ("my-embark-org-blog-" "")
           ("embark-collect" "⇶ collect")
           ("my-embark-org-" "")
           ("my-" "")
           ("embark-" "")
           ("embark-act-all" "all")
           ("embark-become" "become")
           ("embark-collect" "collect")
           ("-" " ")))))
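
When experimenting with the order of the replacements, it can help to preview what a command name might turn into. The following sketch only approximates the behaviour by applying each regexp in turn with replace-regexp-in-string. It does not call which-key, and my-sketch-preview-label is a hypothetical name.

```elisp
;; Hypothetical helper for illustration; it does not use which-key itself.
(defun my-sketch-preview-label (name replacements)
  "Return NAME after applying REPLACEMENTS in order.
REPLACEMENTS is a list of (REGEXP REPLACEMENT) lists, in the same
shape as the quoted list in the configuration above."
  (dolist (rep replacements name)
    ;; FIXEDCASE and LITERAL are both t, so the replacement text is
    ;; inserted exactly as written.
    (setq name (replace-regexp-in-string (car rep) (cadr rep) name t t))))
```

For instance, applying ("my-" ""), ("embark-" ""), and ("-" " ") to "my-embark-org-copy" strips the two prefixes and then turns the remaining dash into a space.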

To use which-key as part of Embark, see the Embark wiki's suggested configuration.

Play by play for the video:

  • I wish I could memorize all these keyboard shortcuts in Emacs, but I just don't use some commands often enough to build the muscle memory for them, especially since I have so many context-sensitive shortcuts thanks to Embark.
    • Showing the Embark context menu with C-. (embark-act) and then cycling through them with C-. (embark-cycle). In this case, I showed the Embark menus for a link, so it cycled through embark-url-map, embark-org-link-map, embark-identifier-map, and embark-expression-map. The active keymap and the value that will be passed as the first argument to functions are displayed in the echo area (ex: Act on identifier 'Embark').
  • That's why which-key is great. It used to be a separate package, but it's now part of Emacs 30. You just have to turn on which-key mode,
    • M-x customize-variable which-key-mode - toggle it on and save it.
  • and then, if you hesitate in the middle of a keyboard shortcut, Emacs will show you what you can do.
    • Here I started with C-c and then paused. Emacs displays the keyboard shortcuts that follow C-c.
  • It turns out that you can customize this menu, so here are some tweaks I'm trying. First, I want to change which-key-sort-order to the one that sorts by description.
    • M-x customize-variable which-key-sort-order, change it to which-key-description-order, and save.
  • This groups similar functions together.
    • The which-key screen now shows the embark- functions grouped together, the org- functions grouped together, etc.
  • Then I want to change the labels using which-key-replacement-alist. Let's try it without the dashes between words.
    • M-x customize-variable which-key-replacement-alist. I used INS to add a new entry. I clicked the first Value Menu to change it to nil, set the next value's regexp to "-", clicked the first Value Menu in the second cons cell to change it to nil, and set the next value's string to " ". The which-key menu now replaces most dashes with spaces, making things a little neater to look at.
  • Now let's turn on multiple replacements and make things even simpler.
    • M-x customize-variable which-key-allow-multiple-replacements, toggle it on, and save it.
    • I also added a bunch of other replacements, like removing the "embark " and "org " prefixes.
  • I can even use Unicode to make it easier to spot certain commands.
    • I added a replacement with some Unicode to make "⭐🗐🔗 copy exported URL" stand out in the menu.
  • And of course there's an Elisp way to customize all of this.
    • I showed the code snippet above.
  • I want to see if these little tweaks can help me use Emacs better. If you use which-key, maybe you'll also want to customize your menu. Let me know what you figure out!
    • We end with the screenshot of the which-key menu.

Using image-dired to browse the latest screenshots from multiple directories

| emacs, image, org, link

[2025-01-06 Mon]: Patch in progress, Stefan Kangas is looking into it.

Since A+ and I play lots of Minecraft together, I figured it's a good opportunity to slowly get her into the idea of documenting learning. Besides, I can always practise it myself. Screenshots are handy for that. In Minecraft Java, F1 hides the usual heads-up display, and F2 takes the screenshot. Usually, when I start taking screenshots, A+ starts taking screenshots too. I want to build on her enthusiasm by including her screenshots in notes. To make it easy to incorporate her pictures into our notes, I've shared her GDLauncher folder and her Videos folder with my computer using Syncthing so that I can grab any screenshots or videos that she's taken.

In Emacs, image-dired makes it easy to see thumbnails. The neat thing is that it doesn't just work with a single directory. Just like Dired, you can give it a cons cell with a directory in the first part and a list of files in the second part as the first argument to the function, and it will display those files. This means I can use directory-files-recursively to make a list of files, sort it to show most recent screenshots first, limit it to the most recent items, and then display a buffer with thumbnails. image-dired-show-all-from-dir reports a small error when you do this (I need to send a patch upstream), so we hush it with condition-case in my-show-combined-screenshots.

(defvar my-screenshot-dirs
  '("~/recordings"
    "~/.var/app/org.prismlauncher.PrismLauncher/data/PrismLauncher/instances/"
    "~/sync/gdlauncher-instances/"))
(defvar my-recent-screenshot-limit 50)

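(defun my-combined-screenshots (&optional limit)
  "Return the most recent screenshots from `my-screenshot-dirs'.
Files are sorted by base name, newest first, since the names start
with a timestamp.  LIMIT defaults to `my-recent-screenshot-limit'."
  (seq-take
   ;; The keyword arguments to `sort' need Emacs 30 or later.
   (sort
    (seq-mapcat (lambda (dir)
                  ;; \\' anchors the extension at the end of the file name.
                  (directory-files-recursively dir "^[0-9][0-9][0-9][0-9]-.*\\.\\(png\\|webm\\)\\'"))
                my-screenshot-dirs)
    :key #'file-name-base
    :lessp #'string<
    :reverse t)
   (or limit my-recent-screenshot-limit)))

(defun my-latest-screenshot ()
  "Return the most recent screenshot from `my-screenshot-dirs'."
  (car (my-combined-screenshots)))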

(defun my-show-combined-screenshots (&optional limit)
  "Show thumbnails for combined screenshots."
  (interactive (list (when current-prefix-arg (read-number "Limit: "))))
  (condition-case nil
      ;; ignore errors from image-dired trying to set default-directory
      (image-dired-show-all-from-dir
       (cons (car my-screenshot-dirs) (my-combined-screenshots limit)))
    (error nil)))
2025-01-02_08-19-32.png
Figure 1: The result of my-show-combined-screenshots

In the *image-dired* buffer created by my-show-combined-screenshots, I can use m (image-dired-mark-thumb-original-file) to mark images and C-u w (image-dired-copy-filename-as-kill) to copy their absolute paths.

To make it easier to create links to a file by using org-store-link (which I've bound to C-c l) and org-insert-link (C-c C-l in an Org buffer), I can define a link-storing function that takes the original filename:

(defun my-org-image-dired-store-link ()
  (when (and (derived-mode-p 'image-dired-thumbnail-mode)
             (get-text-property (point) 'original-file-name))
    (org-link-store-props
     :link (concat "file:" (get-text-property (point) 'original-file-name)))))

(with-eval-after-load 'org
  (org-link-set-parameters
   "image-dired"
   :store #'my-org-image-dired-store-link))

To make it easier to insert the marked links that have been copied as absolute paths:

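(defun my-org-yank-file-links-from-kill-ring ()
  "Insert an Org file link for each filename in the latest kill.
The kill is read as a Lisp list, so names may be bare or double-quoted."
  (interactive)
  (dolist (file (read (concat "(" (current-kill 0) ")")))
    ;; `read' returns symbols for bare names and strings for quoted ones;
    ;; `format' with %s turns either into a plain string.
    (insert (org-link-make-string (concat "file:" (format "%s" file))) "\n")))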

I usually want to copy those files to another directory anyway. I have a my-org-copy-linked-files function in Copy linked file and change link that copies the files and rewrites the Org links. This means that I can copy my notes to an index.org in a directory I share with A+, save the images to an images subdirectory, and export the index.html so that she can read the notes any time she likes.

This is part of my Emacs configuration.

Editing videos with Emacs and subed-record.el

| emacs, subed, video

I want to document more of my Minecraft adventures with A+. Video is a natural way to do this. It builds on her familiarity with the tutorials and streams she enjoys watching. I set up OBS on her laptop and plugged in my Blue Yeti microphone. We did our first interview yesterday. I edited and subtitled it (because why not!), uploaded it as an unlisted YouTube video, and shared it with her dad, sister, and cousins.

I did the video editing in Emacs with subed-record. First, I used WhisperX to transcribe the video, and I used subed-align to fix the timestamps with aeneas. I normalized the audio with Audacity and I exported the .opus file for use in subed-record.el. Then I added NOTE #+SKIP before times I wanted to remove, like when she asked for a retake. Here's what that subtitle markup looks like:

WEBVTT

NOTE #+SKIP

00:00:00.000 --> 00:00:16.679
And then I'll record in my side also
and we'll just put it in somehow.
Somehow. Okay. We can edit that, right?
Yeah, we'll learn how to edit things.
It'll be great.

NOTE
Introduction
#+AUDIO: cuberventures-001.opus
[[file:intro.webm]]
#+OUTPUT: cuberventures-001-fxnt-create-2-windmill-home-cafe-trains-hotel-half-underwater.webm

00:00:16.680 --> 00:00:19.399
Okay, so now we're here with <username>.

00:00:19.400 --> 00:00:23.039
I want to find out what you like about Minecraft and

00:00:23.040 --> 00:00:26.079
all the cool things that you have been building lately.

This was a little different from my usual video creation workflow, where I record the audio and the video separately. When I wrote subed-record.el, I assumed I'd edit the audio first, choose images/GIFs/videos that were already ready to go, and then combine those visuals with sections of audio, speeding things up or slowing things down as needed. Now I wanted to apply the same edits to the video as I did to the audio. A+ did a great job of looking at stuff in Minecraft while talking about it, so I wanted to keep her narration in sync. I added some code to allow me to specify a same-edits keyword for the visuals. That meant that I would use the same selection list that I used for cutting the audio. Here's what that subtitle markup looks like:

NOTE
[[file:2024-12-31 10-35-14.mkv]]
#+OPTIONS: same-edits

00:00:43.860 --> 00:00:45.941
Shall we take a tour of my world?

00:00:45.942 --> 00:00:50.079
Sure, let's tell people which mod pack this is.

00:00:50.080 --> 00:00:55.639
This is FXNT Create 2, also known as FoxyNoTail Create 2.

NOTE Windmill

00:00:55.640 --> 00:00:58.239
I've got this little bit of path leading to the interview

00:00:58.240 --> 00:01:01.839
room. This is my unfinished windmill. I've been meaning to

This workflow lets me cut out segments in the middle of the video, like this:

00:17:30.200 --> 00:17:33.119
great start for a tour. I'm looking forward to seeing what

00:17:33.120 --> 00:17:34.112
you will build next.

NOTE #+SKIP

00:17:34.113 --> 00:18:02.379
Do you have any last words before
we try to figure out this video editing thing?
Yeah. We'll cut that last part out.
Let's just do a retake on that last part.
Someday. Out here. Okay. There you go.
This is a beautiful view.

00:18:02.380 --> 00:18:08.119
The last things I want to say about this world is there'll be

I also wanted to start the video with a segment from my recording, so we could see her avatar on screen during the introduction. She kept her computer on first-person POV instead of changing the camera. I used mpv to figure out the timestamps for the start and end of the part that I wanted to use, then I used ffmpeg to cut that clip. I added a comment with a link to that video in order to use it before the main video. That's the [[file:intro.webm]] in the first section's comments.
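
For reference, cutting a clip like that could be scripted from Emacs along these lines. This is a sketch, not the command I actually ran: the function names are hypothetical, the timestamps are passed as strings such as "00:00:05", and placing -ss and -to after -i makes ffmpeg re-encode the selected range.

```elisp
;; Hypothetical helpers for illustration.
(defun my-sketch-ffmpeg-clip-args (input start end output)
  "Return ffmpeg arguments for cutting INPUT from START to END into OUTPUT.
START and END are timestamp strings that ffmpeg understands."
  (list "-i" (expand-file-name input)
        "-ss" start
        "-to" end
        (expand-file-name output)))

(defun my-sketch-ffmpeg-clip (input start end output)
  "Run ffmpeg to cut INPUT from START to END into OUTPUT.
Log messages go to the *ffmpeg clip* buffer."
  (apply #'call-process "ffmpeg" nil "*ffmpeg clip*" nil
         (my-sketch-ffmpeg-clip-args input start end output)))
```

The start and end values would be the timestamps noted while watching the recording in mpv.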

After testing a small section of the transcript by selecting a region and using subed-record-compile-video, I deselected the region and used subed-record-compile-video to produce the whole video.

I also modified subed-record-compile-subtitles to include the other non-directive comments, so I can include the section headings in the raw VTT file and have them turn up in the exported version. Then I can use the new subed-section-comments-as-chapters command to copy those as chapters for the YouTube description.
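
To show what the chapter text for a YouTube description looks like, here is a small formatting sketch. It is not the code behind subed-section-comments-as-chapters: my-sketch-format-chapters is a hypothetical name, and it assumes the section start times (in milliseconds, already adjusted for the edited video) and titles have been collected into a list.

```elisp
;; Hypothetical helper for illustration; not part of subed.
(defun my-sketch-format-chapters (sections)
  "Return chapter lines for SECTIONS, a list of (START-MS . TITLE).
Each line looks like \"M:SS Title\"."
  (mapconcat
   (lambda (section)
     (let* ((seconds (/ (car section) 1000))
            (minutes (/ seconds 60)))
       (format "%d:%02d %s" minutes (% seconds 60) (cdr section))))
   sections
   "\n"))
```

For example, sections starting at 0 ms ("Introduction") and 38960 ms ("Windmill") become the lines "0:00 Introduction" and "0:38 Windmill".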

We're not going to share that particular video yet, but I'm looking forward to trying that technique with videos about stuff I'm figuring out in Minecraft or Emacs. It's also tempting me to think about ways to specify transitions like crossfades and other fancy effects like overlays.

I like using the transcript as the starting point for video editing. It just makes sense to me to work with it as text. I also like this experiment with documenting more of our Minecraft experimentation. It seems to get her talking and encourages her to build more. I'm looking forward to learning more about Minecraft and making videos too.

We did another video today using the new shortcuts I've just set up for toggling OBS recording. This time we didn't even need to do any editing. I used Org Export to make her a little HTML file that had the two videos on it, so she can review it any time. Onward!
