Yay Emacs 5: Tweaking my video workflow with WhisperX and subed-record

| speechtotext, emacs, subed, yay-emacs

I'm tweaking my video workflow. I use Orgzly Revived on my Android phone to write the text, and I use Easy Voice Recorder to record it. Syncthing automatically copies both to my laptop. I use WhisperX to transcribe my recording, and I use a little bit of Emacs Lisp to figure out timestamps for each word. I edit this to fix errors. I can even rearrange things and get rid of umms or ahs or anything I don't want. Then I use subed-convert to turn it into a VTT file. I can tweak the start and end times by looking at the waveforms. Then I add comments with the visuals I want. I can add images, animated GIFs, or videos, and they're automatically squeezed or stretched to fit. I can also have them play at original speed. Then I set up open captions and use subed-record-compile-video. Tada!

Links:

You can watch this on YouTube, download the video, or download the audio.

View org source for this post

Using WhisperX to get word-level timestamps for audio editing with Emacs and subed-record

| audio, emacs, subed
  • [2024-12-09 Mon]: Split whisperx to pass one file per call.
  • [2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
  • [2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.

I'm gradually shifting more things to this Lenovo P52 to take advantage of its newer processor, 64 GB of RAM, and 2 TB drive. (Whee!) One of the things I'm curious about is how I can make better use of multimedia. I couldn't get whisper.cpp to work on my Lenovo X230T, so I mostly relied on the automatic transcripts from Google Recorder (with timestamps generated by aeneas) or cloud-based transcription services like Deepgram.

I have a lot of silences in my voice notes when I think out loud. whisper.cpp got stuck in loops during silent parts, but WhisperX handles them perfectly. WhisperX is also fast enough for me to handle audio files locally instead of relying on Deepgram. With the default model, I can process the files faster than real-time:

File length    Transcription time
42s            17s
7m48s          1m41s
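
To put a number on "faster than real-time", here's a small sketch that converts the durations in the table to seconds and divides the audio length by the transcription time. The helper names are made up for this example and aren't part of WhisperX or subed.

```elisp
;; Hypothetical helpers for illustration only.
(defun my-example-duration-to-seconds (duration)
  "Convert a DURATION string like \"7m48s\" or \"42s\" to seconds."
  (if (string-match "\\`\\(?:\\([0-9]+\\)m\\)?\\([0-9]+\\)s\\'" duration)
      (+ (* 60 (string-to-number (or (match-string 1 duration) "0")))
         (string-to-number (match-string 2 duration)))
    (error "Unrecognized duration: %s" duration)))

(defun my-example-speedup (audio-length transcription-time)
  "Return AUDIO-LENGTH divided by TRANSCRIPTION-TIME as a float."
  (/ (float (my-example-duration-to-seconds audio-length))
     (my-example-duration-to-seconds transcription-time)))

(my-example-speedup "42s" "17s")      ; roughly 2.5 times real-time
(my-example-speedup "7m48s" "1m41s")  ; roughly 4.6 times real-time
```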

I used this command to get word-level timing data. (Experimenting with options from this post)

MAX_LINE_WIDTH="${MAX_LINE_WIDTH:-50}"
MODEL="${MODEL:-large-v2}"
for FILE in "$@"; do
    ~/vendor/whisperx/.venv/bin/whisperx --model "$MODEL" --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --compute_type int8 --print_progress True --max_line_width "$MAX_LINE_WIDTH" --segment_resolution chunk --max_line_count 1 --language en "$FILE"
done

Among other things, it makes a text file that looks like this:

I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.

and a JSON file that looks like this:

{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}

Sometimes I just want the text so that I can use an audio braindump as the starting point for a blog post or for notes. WhisperX is way more accurate than Google Recorder, so that will probably be easier once I update my workflow for that.

Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:

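(defun my-subed-word-tsv-from-whisperx-json (file)
  "Convert the WhisperX JSON in FILE to a TSV file of word-level cues.
The TSV file is created next to FILE and then opened."
  (interactive "FJSON: ")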
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
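    ;; Numbers at the end of a sentence sometimes don't end up with times,
    ;; so fill them in from the neighbouring words.
    (while current
      (unless (elt (car current) 1)           ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2)           ; end
        (let ((next-start (nth 1 (cadr current))))
          (setf (elt (car current) 2)
                (if next-start
                    (1- next-start)
                  ;; There is no next word, or it has no start time either,
                  ;; so reuse this word's start time.
                  (elt (car current) 1)))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))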
    (subed-create-file
     filename
     base
     t
     'subed-tsv-mode)
    (find-file filename)))

Here's my old code for parsing the highlighted VTT or SRT files that underline each word:

(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))

I like to use the TSV format for this one because it's easy to scan down the right side. Incidentally, this format is compatible with Audacity labels, so I could import it there if I wanted. I like Emacs much more, though. I'm used to having all my keyboard shortcuts at hand.

0.427000	0.507000	I
0.587000	0.887000	often
0.987000	1.227000	need
1.267000	1.508000	to...
4.329000	4.429000	I
4.469000	4.869000	sometimes
4.950000	5.170000	need
5.210000	5.410000	to
5.530000	6.090000	replace
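
Each line is just the start and end in seconds plus the text, separated by tabs. Here's a minimal sketch of formatting one cue that way, assuming cues are lists shaped like the ones built from the JSON (ID, start in msecs, end in msecs, text). The function name is made up for this example, and subed's own TSV writer may differ in details.

```elisp
;; Hypothetical helper, not part of subed.
(defun my-example-cue-to-label-line (cue)
  "Format CUE as a tab-separated label line.
CUE is a list of the form (ID START-MS STOP-MS TEXT)."
  (format "%.6f\t%.6f\t%s"
          (/ (nth 1 cue) 1000.0)   ; msecs to seconds
          (/ (nth 2 cue) 1000.0)
          (nth 3 cue)))

(my-example-cue-to-label-line '(nil 427 507 "I"))
;; => "0.427000\t0.507000\tI"
```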

Once I've deleted the words I don't want to include, I can merge subtitles for phrases so that I can keep the pauses between words. A quick heuristic is to merge subtitles if they don't have much of a pause between them.

(defvar my-subed-merge-close-subtitles-threshold 500)
(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-start)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))
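
The same heuristic can be tried on plain lists outside a subed buffer. This is only a sketch under the assumption that each cue is a (START-MS STOP-MS TEXT) list; the function name is made up, and it returns a new list rather than editing subtitles in place.

```elisp
;; Hypothetical list-based version of the merge heuristic.
(defun my-example-merge-close-cues (cues threshold)
  "Merge adjacent CUES separated by less than THRESHOLD msecs.
Each cue is a list (START-MS STOP-MS TEXT).  Return a new list."
  (let (result)
    (dolist (cue cues)
      (let ((prev (car result)))
        (if (and prev (< (- (nth 0 cue) (nth 1 prev)) threshold))
            ;; Small gap: extend the previous cue and append the text.
            (setcar result
                    (list (nth 0 prev)
                          (nth 1 cue)
                          (concat (nth 2 prev) " " (nth 2 cue))))
          ;; Large gap (or first cue): start a new phrase.
          (push cue result))))
    (nreverse result)))

(my-example-merge-close-cues
 '((427 507 "I") (587 887 "often") (987 1227 "need")
   (1267 1508 "to...") (4329 4429 "I") (4469 4869 "sometimes"))
 500)
;; => ((427 1508 "I often need to...") (4329 4869 "I sometimes"))
```

The long pause after "to..." is more than 500 msecs, so it stays as a boundary between the two phrases.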

Then I can use subed-waveform-show-all to tweak the start and end timestamps. Here I switch to another file I've been editing…

Figure 1: Screenshot of subed-waveform

After that, I can use subed-record to compile the audio into an .opus file that sounds reasonably smooth.

I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.

I want to make lots of quick audio notes that I can shuffle and listen to in order to remember things I'm learning about Emacs (might even come up with some kind of spaced repetition system), and I'd like to make more videos someday too. I think WhisperX, subed, and Org Mode will be fun parts of my workflow.

This is part of my Emacs configuration.

EmacsConf backstage: making lots of intro videos with subed-record

| emacsconf, subed, emacs

Summary (735 words): Emacs is a handy audio/video editor. subed-record can combine multiple audio files and images to create multiple output videos.

Watch on YouTube

It's nice to feel like you're saying someone's name correctly. We ask EmacsConf speakers to introduce themselves in the first few seconds of their video, but people often forget to do that, and that's okay. We started recording introductions for EmacsConf 2022 so that stream hosts don't have to worry about figuring out pronunciation while they're live. Here's how I used subed-record to turn my recordings into lots of little videos.

First, I generated the title images by using Emacs Lisp to replace text in a template SVG and then using Inkscape to convert the SVG into a PNG. Each image showed information for the previous talk as well as the upcoming talk. (emacsconf-stream-generate-in-between-pages)

Figure 1: Sample title image

Then I generated the text for each talk based on the title, the speaker names, pronunciation notes, pronouns, and type of Q&A. Each introduction generally followed the pattern, "Next we have title by speakers. Details about Q&A." (emacsconf-pad-expand-intro and emacsconf-subed-intro-subtitles below)

00:00:00.000 --> 00:00:00.999
#+OUTPUT: sat-open.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/sat-open.svg.png]]
Next, we have "Saturday opening remarks".

00:00:05.000 --> 00:00:04.999
#+OUTPUT: adventure.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/adventure.svg.png]]
Next, we have "An Org-Mode based text adventure game for learning the basics of Emacs, inside Emacs, written in Emacs Lisp", by Chung-hong Chan. He will answer questions via Etherpad.

I copied the text into an Org note in my inbox, which Syncthing copied over to the Orgzly Revived app on my Android phone. I used Google Recorder to record the audio. I exported the m4a audio file and a rough transcript, copied them back via Syncthing, and used subed-record to edit the audio into a clean audio file without oopses.

Each intro had a set of captions that started with a NOTE comment. The NOTE comment specified the following:

  • #+AUDIO:: the audio source to use for the timestamped captions that follow
  • [[file:...]]: the title image I generated for each talk. When subed-record-compile-video sees a comment with a link to an image, video, or animated GIF, it takes that visual and uses it for the span of time until the next visual.
  • #+OUTPUT:: the file to create.

NOTE #+OUTPUT: hyperdrive.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/hyperdrive.svg.png]]
#+AUDIO: intros-2023-11-21-cleaned.opus

00:00:15.680 --> 00:00:17.599
Next, we have "hyperdrive.el:

00:00:17.600 --> 00:00:21.879
Peer-to-peer filesystem in Emacs", by Joseph Turner

00:00:21.880 --> 00:00:25.279
and Protesilaos Stavrou (also known as Prot).

00:00:25.280 --> 00:00:27.979
Joseph will answer questions via BigBlueButton,

00:00:27.980 --> 00:00:31.080
and Prot might be able to join depending on the weather.

00:00:31.081 --> 00:00:33.439
You can join using the URL from the talk page

00:00:33.440 --> 00:00:36.320
or ask questions through Etherpad or IRC.

NOTE
#+OUTPUT: steno.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/steno.svg.png]]
#+AUDIO: intros-2023-11-19-cleaned.opus

00:03:23.260 --> 00:03:25.480
Next, we have "Programming with steno",

00:03:25.481 --> 00:03:27.700
by Daniel Alejandro Tapia.

NOTE
#+AUDIO: intro-2023-11-29-cleaned.opus

00:00:13.620 --> 00:00:16.580
You can ask your questions via Etherpad and IRC.

00:00:16.581 --> 00:00:18.079
We'll send them to the speaker

00:00:18.080 --> 00:00:19.919
and post the answers in the talk page

00:00:19.920 --> 00:00:21.320
after the conference.
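
To make the comment format above more concrete, here's a rough sketch of picking the directives out of a NOTE comment. It is not subed-record's actual parser; the function name, the regexps, and the plist keys are assumptions for illustration.

```elisp
;; Hypothetical parser for illustration only.
(defun my-example-parse-note (comment)
  "Return a plist of the directives found in COMMENT.
COMMENT is the text of a NOTE block without the leading \"NOTE\"."
  (let ((case-fold-search nil)
        result)
    (dolist (spec '((:audio  . "#\\+AUDIO: *\\(.+\\)")
                    (:output . "#\\+OUTPUT: *\\(.+\\)")
                    (:visual . "\\[\\[file:\\([^]]+\\)\\]\\]")))
      (when (string-match (cdr spec) comment)
        (setq result (plist-put result (car spec)
                                (match-string 1 comment)))))
    result))

(let ((note (my-example-parse-note
             "#+OUTPUT: steno.webm
[[file:/home/sacha/proj/emacsconf/2023/assets/in-between/steno.svg.png]]
#+AUDIO: intros-2023-11-19-cleaned.opus")))
  (list (plist-get note :output)
        (plist-get note :audio)))
;; => ("steno.webm" "intros-2023-11-19-cleaned.opus")
```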

I could then call subed-record-compile-video to create the videos for all the intros, or mark a region with C-SPC and then call subed-record-compile-video to compile only the intros inside that region.

Sample intro

Using Emacs to edit the audio and compile videos worked out really well because it made it easy to change things.

  • Changing pronunciation or titles: For EmacsConf 2023, I got the recordings sorted out in time for the speakers to correct my pronunciation if they wanted to. Some speakers also changed their talk titles midway. If I wanted to redo an intro, I just had to rerecord that part, run it through my subed-record audio cleaning process, add an #+AUDIO: comment specifying which file I want to take the audio from, paste it into my main intros.vtt, and recompile the video.
  • Cancelling talks: One of the talks got cancelled, so I needed to update the images for the talk before it and the talk after it. I regenerated the title images and recompiled the videos. I didn't even need to figure out which talk needed to be updated - it was easy enough to just recompile all of them.
  • Changing type of Q&A: For example, some speakers needed to switch from answering questions live to answering them after the conference. I could just delete the old instructions, paste in the instructions from elsewhere in my intros.vtt (making sure to set #+AUDIO to the file if it came from a different take), and recompile the video.

And of course, all the videos were captioned. Bonus!

So that's how using Emacs to edit and compile simple videos saved me a lot of time. I don't know how I'd handle this otherwise. 47 video projects that might all need to be updated if, say, I changed the template? Yikes. Much better to work with text. Here are the technical details.

Generating the title images

I used Inkscape to add IDs to our template SVG so that I could edit them with Emacs Lisp. From emacsconf-stream.el:

emacsconf-stream-generate-in-between-pages: Generate the title images.
(defun emacsconf-stream-generate-in-between-pages (&optional info)
  "Generate the title images."
  (interactive)
  (setq info (or emacsconf-schedule-draft (emacsconf-publish-prepare-for-display (emacsconf-filter-talks (or info (emacsconf-get-talk-info))))))
  (let* ((by-track (seq-group-by (lambda (o) (plist-get o :track)) info))
         (dir (expand-file-name "in-between" emacsconf-stream-asset-dir))
         (template (expand-file-name "template.svg" dir)))
    (unless (file-directory-p dir)
      (make-directory dir t))
    (mapc (lambda (track)
            (let (prev)
              (mapc (lambda (talk)
                      (let ((dom (xml-parse-file template)))
                        (mapc (lambda (entry)
                                (let ((prefix (car entry)))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "title")
                                                 (plist-get (cdr entry) :title))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "speakers")
                                                 (plist-get (cdr entry) :speakers))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "url")
                                                 (and (cdr entry) (concat emacsconf-base-url (plist-get (cdr entry) :url))))
                                  (emacsconf-stream-svg-set-text
                                   dom
                                   (concat prefix "qa")
                                   (pcase (plist-get (cdr entry) :q-and-a)
                                     ((rx "live") "Live Q&A after talk")
                                     ((rx "pad") "Etherpad")
                                     ((rx "IRC") "IRC Q&A after talk")
                                     (_ "")))))
                              (list (cons "previous-" prev)
                                    (cons "current-" talk)))
                        (with-temp-file (expand-file-name (concat (plist-get talk :slug) ".svg") dir)
                          (dom-print dom))
                        (shell-command
                         (concat "inkscape --export-type=png -w 1280 -h 720 --export-background-opacity=0 "
                                 (shell-quote-argument (expand-file-name (concat (plist-get talk :slug) ".svg")
                                                                         dir)))))
                      (setq prev talk))
                    (emacsconf-filter-talks (cdr track)))))
          by-track)))

emacsconf-stream-svg-set-text: Update DOM to set the tspan in the element with ID to TEXT.
(defun emacsconf-stream-svg-set-text (dom id text)
  "Update DOM to set the tspan in the element with ID to TEXT.
If the element doesn't have a tspan child, use the element itself."
  (if (or (null text) (string= text ""))
      (let ((node (dom-by-id dom id)))
        (when node
          (dom-set-attribute node 'style "visibility: hidden")
          (dom-set-attribute (dom-child-by-tag node 'tspan) 'style "fill: none; stroke: none")))
    (setq text (svg--encode-text text))
    (let ((node (or (dom-child-by-tag
                     (car (dom-by-id dom id))
                     'tspan)
                    (dom-by-id dom id))))
      (cond
       ((null node)
        (error "Could not find node %s" id))                      ; skip
       ((= (length node) 2)
        (nconc node (list text)))
       (t (setf (elt node 2) text))))))

Generating the script

From emacsconf-pad.el:

emacsconf-pad-expand-intro: Make an intro for TALK.
(defun emacsconf-pad-expand-intro (talk)
  "Make an intro for TALK."
  (cond
   ((null (plist-get talk :speakers))
    (format "Next, we have \"%s\"." (plist-get talk :title)))
   ((plist-get talk :intro-note)
    (plist-get talk :intro-note))
   (t
    (let ((pronoun (pcase (plist-get talk :pronouns)
                     ((rx "she") "She")
                     ((rx "\"ou\"") "Ou")
                     ((or 'nil "nil" (rx string-start "he") (rx "him")) "He")
                     ((rx "they") "They")
                     (_ (or (plist-get talk :pronouns) "")))))
      (format "Next, we have \"%s\", by %s%s.%s"
              (plist-get talk :title)
              (replace-regexp-in-string ", \\([^,]+\\)$"
                                        ", and \\1"
                                        (plist-get talk :speakers))
              (emacsconf-surround " (" (plist-get talk :pronunciation) ")" "")
              (pcase (plist-get talk :q-and-a)
                ((or 'nil "") "")
                ((rx "after") " You can ask questions via Etherpad and IRC. We'll send them to the speaker, and we'll post the answers on the talk page afterwards.")
                ((rx "live")
                 (format " %s will answer questions via BigBlueButton. You can join using the URL from the talk page or ask questions through Etherpad or IRC."
                         pronoun
                         ))
                ((rx "pad")
                 (format " %s will answer questions via Etherpad."
                         pronoun
                         ))
                ((rx "IRC")
                 (format " %s will answer questions via IRC in the #%s channel."
                         pronoun
                         (plist-get talk :channel)))))))))

And from emacsconf-subed.el:

emacsconf-subed-intro-subtitles: Create the introduction as subtitles.
(defun emacsconf-subed-intro-subtitles ()
  "Create the introduction as subtitles."
  (interactive)
  (subed-auto-insert)
  (let ((emacsconf-publishing-phase 'conference))
    (mapc
     (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
     (seq-map-indexed
      (lambda (talk i)
        (list
         nil
         (* i 5000)
         (1- (* i 5000))
         (format "#+OUTPUT: %s.webm\n[[file:%s]]\n%s"
                 (plist-get talk :slug)
                 (expand-file-name
                  (concat (plist-get talk :slug) ".svg.png")
                  (expand-file-name "in-between" emacsconf-stream-asset-dir))
                 (emacsconf-pad-expand-intro talk))))
      (emacsconf-publish-prepare-for-display (emacsconf-get-talk-info))))))


Using subed-record in Emacs to edit audio and clean up oopses

| emacs, subed

Finding enough quiet focused time to record audio is a challenge. I often have to re-record segments in order to correct brain hiccups or to restart after interruptions. It's also hard for me to sit still and listen to my recordings looking for mistakes to edit out. I'm not familiar enough with Audacity to zip around with keyboard shortcuts, and I don't like listening to myself again and again in order to find my way around an audio file.

Sure, I could take the transcript, align it with subed-align and Aeneas to get the timestamps, and then use subed-convert to get a CSV (actually a TSV since it uses tabs) that I can import into Audacity as labels, but it still feels a little awkward to navigate. I have to zoom in a lot for the text to be readable.

Figure 1: Audacity labels

So here's a workflow I've been experimenting with for cleaning up my recorded audio.

Just like with my audio braindumps, I use Google Recorder on my phone because I can get the audio file and a rough transcript, and because the microphone on it is better than on my laptop. For narration recordings, I hide in the closet because the clothes muffle echoes. I don't feel as self-conscious there as I might be if I recorded in the kitchen, where my computer usually is. I used to record in Emacs using subed-record by pressing left to redo a segment and right to move on to the next one, but using my phone means I don't have to deal with the computer's noises or get the good mic from downstairs.

I start the recorder on my phone and then switch to my Org file in Orgzly Revived, where I've added my script. I read it as far as I can go. If I want to redo a segment, I say "Oops" and then just redo the last phrase or so.

Screenshot of Google Recorder on my phone

I export the transcript and the M4A audio file using Syncthing, which copies them to my computer. I have a function that copies the latest recording and even sets things up for removing oops segments (my-subed-copy-latest-phone-recording, which calls my-split-oops). If I want to process several files, I can copy them over with my-subed-copy-recording.

my-subed-copy-latest-phone-recording: Copy the latest recording transcript and audio to DESTINATION.
(defun my-subed-copy-latest-phone-recording (destination)
  "Copy the latest recording transcript and audio to DESTINATION."
  (interactive
   (list
    (file-name-directory
     (read-file-name (format "Move %s to: "
                             (file-name-base (my-latest-file my-phone-recording-dir ".txt")))
                     nil nil nil nil #'file-directory-p))))
  (let ((base (file-name-base (my-latest-file my-phone-recording-dir ".txt"))))
    (rename-file (expand-file-name (concat base ".txt") my-phone-recording-dir)
                 destination)
    (rename-file (expand-file-name (concat base ".m4a") my-phone-recording-dir)
                 destination)
    (find-file (expand-file-name (concat base ".txt") destination))
    (save-excursion (my-split-oops))
    (goto-char (point-min))
    (flush-lines "^$")
    (goto-char (point-min))
    (subed-forward-subtitle-id)
    (subed-set-subtitle-comment
     (concat "#+OUTPUT: "
             (file-name-base (buffer-file-name))
             "-cleaned.opus"))))

my-subed-copy-recording
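(defun my-subed-copy-recording (filename destination)
  "Copy the files that share FILENAME's base name to DESTINATION.
This covers the m4a, txt, json, and vtt files, if they exist."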
  (interactive
   (list
    (buffer-file-name)
    (file-name-directory
     (read-file-name (format "Copy %s to: "
                             (file-name-base (buffer-file-name)))
                     nil nil nil nil #'file-directory-p))))
  (dolist (ext '("m4a" "txt" "json" "vtt"))
    (when (file-exists-p (concat (file-name-sans-extension filename) "." ext))
      (copy-file (concat (file-name-sans-extension filename) "." ext)
                 destination t)))
  (when (get-file-buffer filename)
    (kill-buffer (get-file-buffer filename))
    (dired destination)))

I'll use Aeneas to get the timestamps for each line of text, so a little bit of text processing will let me identify the segments that I want to remove. The way my-split-oops works is that it looks for "oops" in the transcript. Whenever it finds "oops", it adds a newline afterwards. Then it takes the next five words and sees if it can search backward for them within 300 characters. If it finds the words, then that's the start of my repeated segment, and we can add a newline before that. If it doesn't find the words, we try again with four words, then three, then two, then one. I can also manually review the file and see if the oopses are well lined up. When they're detected properly, I should see partially duplicated lines.

I used to record using sub-record by using by. Oops,
I used to record. Oops,
I used to record an emacs using subhead record, by pressing left to reduce segment, and write to move on to the next one.
But using my phone means, I don't have to deal with them. Oops.
But using my phone means, I don't have to deal with the computer's noises or get the good mic from downstairs. I started recorder on my phone

my-split-oops: Look for oops and make it easier to split.
(defun my-split-oops ()
  "Look for oops and make it easier to split."
  (interactive)
  (let ((scan-window 300))
    (while (re-search-forward "oops[,\.]?[ \n]+" nil t)
      (let ((start (min (line-beginning-position) (- (point) scan-window)))
            start-search
            found
            search-for)
        (if (bolp)
            (progn
              (backward-char)
              (setq start (min (line-beginning-position) (- (point) scan-window))))
          (insert "\n"))
        (save-excursion
          (setq start-search (point))
          ;; look for 1..5 words back
          (goto-char
           (or
            (cl-loop
             for n downfrom 5 downto 1
             do
             (save-excursion
               (dotimes (_ n) (forward-word))
               (setq search-for (downcase (string-trim (buffer-substring start-search (point)))))
               (goto-char start-search)
               (when (re-search-backward (regexp-quote search-for) start t)
                 (goto-char (match-beginning 0))
                 (cl-return (point)))))
            (and (call-interactively 'isearch-backward) (point))))
          (insert "\n"))))))

Once the lines are split up, I use subed-align and get a VTT file. The oops segments will be in their own subtitles.

Figure 2: Subtitles and waveforms

The timestamps still need a bit of tweaking sometimes, so I use subed-waveform-show-current or subed-waveform-show-all. I can use the following bindings:

  • middle-click to play a sample
  • M-left-click to set the start and copy to the previous subtitle
  • left-click to set the start without changing the previous one
  • M-right-click to set the end and copy to the next subtitle
  • right-click to set the end without changing the next one
  • M-j to jump to the current subtitle and play it again in MPV
  • M-J to jump to close to the end of the current subtitle and play it in MPV

I use my-subed-delete-oops to delete the oops segments. I can also just mark them for skipping by calling C-u M-x my-subed-delete-oops instead.

Then I add a #+OUTPUT: filename-cleaned.opus comment under a NOTE near the beginning of the file. This tells subed-record-compile-audio where to put the output.

WEBVTT

NOTE #+SKIP

00:00:00.000 --> 00:00:10.319
Finding enough. Oops.

NOTE
#+OUTPUT: 2023-12-subed-record-cleaned.opus

00:00:10.320 --> 00:00:36.319
Finding enough quiet Focused. Time to record. Audio is a challenge. I often have to re-record segments in order to correct brain hiccups, or to restart after interruptions.

I can test short segments by marking the region with C-SPC and using subed-record-compile-try-flow. This lets me check if the transitions between segments make sense.

When I'm happy with everything, I can use subed-record-compile-audio to extract the segments specified by the start and end times of each subtitle and concatenate them one after the other in the audio file specified by the output. The result should be a clean audio file.
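
One way to picture "extract the segments and concatenate them" is as an ffmpeg filter graph that trims each segment and joins the pieces. The sketch below only builds that filter string. It is an illustration of the idea, not necessarily how subed-record does it internally; the function name is made up, and the second segment in the example is invented.

```elisp
;; Hypothetical helper; builds an ffmpeg -filter_complex argument.
(defun my-example-concat-filter (segments)
  "Return an ffmpeg filter graph that trims and joins SEGMENTS.
SEGMENTS is a list of (START-MS . STOP-MS) pairs from one input file."
  (let ((index -1))
    (concat
     ;; One atrim per segment, with timestamps reset so they can be joined.
     (mapconcat
      (lambda (segment)
        (setq index (1+ index))
        (format "[0:a]atrim=start=%.3f:end=%.3f,asetpts=PTS-STARTPTS[a%d]"
                (/ (car segment) 1000.0)
                (/ (cdr segment) 1000.0)
                index))
      segments ";")
     ";"
     ;; Feed all the trimmed pieces into the concat filter, in order.
     (mapconcat (lambda (i) (format "[a%d]" i))
                (number-sequence 0 index) "")
     (format "concat=n=%d:v=0:a=1[out]" (length segments)))))

(my-example-concat-filter '((10320 . 36319) (40000 . 45500)))
;; Returns one string (wrapped here for readability):
;; [0:a]atrim=start=10.320:end=36.319,asetpts=PTS-STARTPTS[a0];
;; [0:a]atrim=start=40.000:end=45.500,asetpts=PTS-STARTPTS[a1];
;; [a0][a1]concat=n=2:v=0:a=1[out]
```

A string like this could be passed to something like ffmpeg -i input.m4a -filter_complex FILTER -map "[out]" output.opus.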

If I need to compile an audio file from several takes, I process each take separately. Once I've adjusted the timestamps and deleted or skipped the oops segments, I add #+AUDIO: input-filename.opus to a NOTE at the beginning of the file. subed-record-insert-audio-source-note makes this easier. Then I copy the file's subtitles into my main file. subed-record-compile-audio will take the audio from whichever file was specified by the #+AUDIO: comment, so I can use audio from different files.

Example VTT segment with multiple audio files
NOTE
#+AUDIO: 2023-11-11-emacsconf.m4a

00:10:55.617 --> 00:10:58.136
Sometimes we send emails one at a time.

NOTE
#+AUDIO: 2023-11-15-emacsconf.m4a

00:10:55.625 --> 00:11:03.539
Like when you let a speaker know that we've received a proposal That's mostly a matter of plugging the talks properties into the right places in the template.
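
The rule that each cue uses the most recent #+AUDIO: comment can be sketched with plain lists. This is a simplified illustration rather than subed-record's code: the function name and the input shape (file name strings interleaved with cue lists) are assumptions.

```elisp
;; Hypothetical helper for illustration only.
(defun my-example-assign-audio (entries)
  "Pair every cue in ENTRIES with the audio file that applies to it.
ENTRIES is a list of strings (file names from #+AUDIO: comments)
and (START-MS STOP-MS) lists (cues), in file order."
  (let (current result)
    (dolist (entry entries)
      (if (stringp entry)
          ;; A new #+AUDIO: comment applies to the cues that follow it.
          (setq current entry)
        (push (cons current entry) result)))
    (nreverse result)))

(my-example-assign-audio
 '("2023-11-11-emacsconf.m4a" (655617 658136)
   "2023-11-15-emacsconf.m4a" (655625 663539)))
;; => (("2023-11-11-emacsconf.m4a" 655617 658136)
;;     ("2023-11-15-emacsconf.m4a" 655625 663539))
```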

Now I have a clean audio file that corresponds to my script. I can use subed-align on my script to get the timestamps for each line using the cleaned audio. Once I have a subtitle file, I can use emacsconf-subed-split (in emacsconf-subed.el - which I probably should add to subed-mode sometime) to quickly split the captions up to fit the line lengths. Then I redo the timestamps with subed-align and adjust timestamps with subed-waveform-show-current.

So that's how I go from rough recordings with stutters and oopses to a clean audio file with captions based on my script. People can probably edit faster with Audacity wizardry or the AI audio editors that are in vogue these days, but this little workflow gets around my impatience with audio by turning it into (mostly) text, so that's cool. Let's see if I can make more presentations now that I've gotten the audio side figured out!


#EmacsConf backstage: autopilot with crontab

| emacs, emacsconf, subed

[2023-10-26 Thu]: updated handle-session and added talk

I figured out multi-track streaming so close to EmacsConf 2022 that there wasn't enough time to get other volunteers used to working with the setup, especially since I was still scrambling to figure out more infrastructure as the conference approached. We decided I'd run both streams myself, which meant I needed to make things as automatic as possible so that I wouldn't go crazy. I wanted a lot of things to happen automatically: playing recorded intros and videos, browsing to the right URLs depending on the type of Q&A, publishing updates to the wiki, and so on.

I used timers and TODO state changes to execute commands via TRAMP, which was pretty cool for the most part. But it turned out TRAMP doesn't like being called when it's already running, like when it's being called from two timers going off at the same time. It gives a "Forbidden reentrant call of TRAMP" error. We found a couple of quick workarounds: I could reschedule the talks to be a minute apart, or I could cancel the conflicting timer and just start them with the shell scripts.

Last year, we had a shell script that played the intro and the main talk, and other scripts to handle the Q&A by opening BigBlueButton, Etherpad, or the IRC channel. Much of the logic was in Emacs Lisp because it was easy to write it that way. For this year, I wanted to write a script that handled the intro, video, and Q&A portions. This is now in roles/obs/templates/handle-session.

handle-session
#!/bin/bash
# 
#
# Handle the intro/talk/Q&A for a session
# Usage: handle-session $SLUG

YEAR=""
BASE_DIR=""
FIREFOX_NAME=firefox-esr
SLUG=$1

# Kill background music if playing
if screen -list | grep -q background; then
    screen -S background -X quit
fi

# Update the status
sudo -u  talk $SLUG PLAYING &

# Update the overlay
overlay $SLUG

# Play the intro if it exists. If it doesn't exist, switch to the intro slide and stop processing.

if [[ -f $BASE_DIR/assets/intros/$SLUG.webm ]]; then
  killall -s TERM $FIREFOX_NAME
  mpv $BASE_DIR/assets/intros/$SLUG.webm
else
  firefox --kiosk $BASE_DIR/assets/in-between/$SLUG.png
  exit 0
fi

# Play the video if it exists. If it doesn't exist, switch to the BBB room and stop processing.
if [ "x$TEST_MODE" = "x" ]; then
  LIST=($BASE_DIR/assets/stream/--$SLUG*--main.webm)
else
  LIST=($BASE_DIR/assets/test/--$SLUG*--main.webm)
fi
FILE="${LIST[0]}"
if [ ! -f "$FILE" ]; then
    # Is there an original file?
    LIST=($BASE_DIR/assets/stream/--$SLUG*--original.{webm,mp4,mov})
    FILE="${LIST[0]}"
fi

if [[ -f $FILE ]]; then
  killall -s TERM $FIREFOX_NAME
  mpv $FILE
else
  /usr/local/bin/bbb $SLUG
  exit 0
fi

sudo -u  talk $SLUG CLOSED_Q &

# Open the appropriate Q&A URL
QA=$(jq -r '.talks[] | select(.slug=="'$SLUG'")["qa-backstage-url"]' < $BASE_DIR/talks.json)
QA_TYPE=$(jq -r '.talks[] | select(.slug=="'$SLUG'")["qa-type"]' < $BASE_DIR/talks.json)
echo "QA_TYPE $QA_TYPE QA $QA"
if [ "$QA_TYPE" = "live" ]; then
  /usr/local/bin/bbb $SLUG
elif [ "$QA" != "null" ]; then
  /usr/local/bin/music &
  /usr/bin/firefox $QA
  # i3-msg 'layout splith'
fi
wait

It builds on roles/obs/templates/bbb, roles/obs/templates/overlay, and roles/obs/templates/music. I also have a roles/prerec/templates/talk script that uses emacsclient to update the status of the talk.
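The Q&A branch at the end of handle-session is the part that's easiest to get wrong, so here is a rough Python sketch of the same decision, useful for checking talks.json before the event. It assumes the structure implied by the jq queries (a top-level talks list with slug, qa-type, and qa-backstage-url keys). The function name qa_action and the sample data are made up for illustration.

```python
import json
from typing import Optional, Tuple


def qa_action(talks_json: str, slug: str) -> Tuple[str, Optional[str]]:
    """Return (action, url) for SLUG, mirroring the branches in handle-session."""
    talks = json.loads(talks_json)["talks"]
    talk = next((t for t in talks if t.get("slug") == slug), None)
    if talk is None:
        return ("none", None)
    if talk.get("qa-type") == "live":
        # The shell script runs /usr/local/bin/bbb $SLUG here
        return ("bbb", None)
    url = talk.get("qa-backstage-url")
    if url is not None:
        # The shell script starts the music and opens Firefox on this URL
        return ("browser", url)
    # jq would print "null", so the script opens nothing
    return ("none", None)


# Hypothetical sample data, not the real talks.json
sample = json.dumps({"talks": [
    {"slug": "uni", "qa-type": "live", "qa-backstage-url": None},
    {"slug": "adventure", "qa-type": "pad",
     "qa-backstage-url": "https://example.org/pad/adventure"},
    {"slug": "sat-open", "qa-type": "none"},
]})

for slug in ("uni", "adventure", "sat-open"):
    print(slug, qa_action(sample, slug))
```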

I wrote some Tampermonkey scripts to automate joining the web conference and the IRC channel.

Now that we have a script that handles all the different things related to a session, it's easier to schedule the execution of that script. Instead of using Emacs timers and running into that problem with TRAMP, I want to try using cron. Cron is a standard UNIX and Linux tool for scheduling things to run at certain times. You make a plain text file where each line has five schedule fields (minute, hour, day of month, month, day of week) followed by the command, and then you tell cron to use that file with something like crontab your-file. Since it's plain text, we can generate it with Emacs Lisp and format-time-string, save it with TRAMP, and install it with ssh. Each track has its own user account for streaming, so each track can have its own file.
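To make the five-field format concrete, here is a minimal Python sketch that turns a start time into one crontab line. It is only an illustration of the format, not the code used for the conference. The helper names cron_fields and cron_line are hypothetical.

```python
from datetime import datetime


def cron_fields(start: datetime) -> str:
    """Format START as: minute hour day-of-month month day-of-week."""
    # "*" in the last field means "any day of the week"
    return f"{start.minute} {start.hour} {start.day} {start.month} *"


def cron_line(start: datetime, command: str) -> str:
    """Return one crontab entry that runs COMMAND at START."""
    return f"{cron_fields(start)} {command}"


line = cron_line(datetime(2023, 10, 26, 11, 35),
                 "/usr/local/bin/handle-session sat-open")
print(line)  # 35 11 26 10 * /usr/local/bin/handle-session sat-open
```

Cron has no year field, so an entry like this would fire again a year later if the crontab were left installed.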

emacsconf-stream-format-crontab: Return crontab entries for TALKS.
(defun emacsconf-stream-format-crontab (track talks &optional test-mode)
  "Return crontab entries for TALKS.
Use the display specified in TRACK.
If TEST-MODE is non-nil, load the videos from the test directory."
  (concat
   (format
    "PATH=/usr/local/bin:/usr/bin
MAILTO=\"\"
XDG_RUNTIME_DIR=\"/run/user/%d\"
" (plist-get track :uid))
   (mapconcat
    (lambda (talk)
      (format "%s /usr/bin/screen -dmS play-%s bash -c \"DISPLAY=%s TEST_MODE=%s /usr/local/bin/handle-session %s | tee -a ~/track.log\"\n"
              ;; cron reads these fields in the server's timezone, so it needs to match the timezone Emacs formats them in
              (format-time-string "%-M %-H %-d %m *" (plist-get talk :start-time))
              (plist-get talk :slug)
              (plist-get track :vnc-display)
              (if test-mode "1" "")
              (plist-get talk :slug)))
    (emacsconf-filter-talks talks))))

emacsconf-stream-crontabs: Write the streaming users’ crontab files.
(defun emacsconf-stream-crontabs (&optional test-mode info)
  "Write the streaming users' crontab files.
If TEST-MODE is non-nil, use the videos in the test directory.
If INFO is non-nil, use that as the schedule instead."
  (interactive)
  (let ((emacsconf-publishing-phase 'conference))
    (setq info (or info (emacsconf-publish-prepare-for-display (emacsconf-get-talk-info))))
    (dolist (track emacsconf-tracks)
      (let ((talks (seq-filter (lambda (talk)
                                 (string= (plist-get talk :track)
                                          (plist-get track :name)))
                               info))
            (crontab (expand-file-name (concat (plist-get track :id) ".crontab")
                                       (concat (plist-get track :tramp) "~"))))
        (with-temp-file crontab
          (when (plist-get track :autopilot)
            (insert (emacsconf-stream-format-crontab track talks test-mode))))
        (emacsconf-stream-track-ssh track (concat "crontab ~/" (plist-get track :id) ".crontab"))))))

I want to test the whole setup before the conference, of course. First, I needed test videos. This generates test videos and subtitles following our naming convention.

emacsconf-stream-generate-test-videos
(defun emacsconf-stream-generate-test-videos (&optional info)
  "Generate 1-minute test videos for INFO."
  (interactive)
  (setq info (or info (emacsconf-publish-prepare-for-display (emacsconf-get-talk-info))))
  (let* ((dir (expand-file-name "test" emacsconf-stream-asset-dir))
         (default-directory dir)
         (subed-default-subtitle-length 1000)
         (test-length 60))
    (unless (file-directory-p dir)
      (make-directory dir t))
    (shell-command
     (format "ffmpeg -y -f lavfi -i testsrc=duration=%d:size=1280x720:rate=10 -i background-music.opus -shortest %s "
             test-length (expand-file-name "template.webm" dir)))
    (dolist (talk info)
      (with-temp-file (expand-file-name (concat (plist-get talk :file-prefix) "--main.vtt") dir)
        (subed-vtt-mode)
        (subed-auto-insert)
        (dotimes (i test-length)
          (subed-append-subtitle
           nil
           (* i 1000)
           (1- (* (1+ i) 1000))
           (format "%s %02d %s"
                   (plist-get talk :slug)
                   i
                   (substring "123456789 123456789 123456789 123456789 123456789 123456789 "
                              (1+ (length (format "%s %02d" (plist-get talk :slug) i))))))))
      (copy-file
       (expand-file-name "template.webm" dir)
       (expand-file-name (concat (plist-get talk :file-prefix) "--main.webm") dir)
       t))))
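For comparison, here is a small Python sketch of the subtitle half of that idea: one cue per second, each lasting just under a second, labelled with the talk slug and a counter. It writes plain WebVTT text rather than going through subed. The names ms_to_vtt and make_test_vtt are hypothetical, and the filler text from the Emacs Lisp version is left out.

```python
def ms_to_vtt(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def make_test_vtt(slug: str, length_seconds: int = 60) -> str:
    """Return WebVTT text with one cue per second for LENGTH_SECONDS."""
    lines = ["WEBVTT", ""]
    for i in range(length_seconds):
        start = i * 1000
        end = (i + 1) * 1000 - 1  # stop 1 ms before the next cue starts
        lines.append(f"{ms_to_vtt(start)} --> {ms_to_vtt(end)}")
        lines.append(f"{slug} {i:02d}")
        lines.append("")
    return "\n".join(lines)


print(make_test_vtt("uni", 2))
```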

Then I needed to write a crontab based on a different schedule. This code sets up a series of test videos to start about a minute after I run the code, with the dev stream set up to start a minute after the gen stream.

(let* ((offset-seconds 60)
       (start-time (time-add (current-time) offset-seconds))
       (emacsconf-schedule-validation-functions nil)
       (emacsconf-schedule-default-buffer-minutes 1)
       (emacsconf-schedule-default-buffer-minutes-for-live-q-and-a 1)
       (emacsconf-schedule-strategies '(emacsconf-schedule-allocate-buffer-time
                                        emacsconf-schedule-copy-previous-track))
       (schedule (emacsconf-schedule-prepare
                  (emacsconf-schedule-inflate-sexp
                   `(("GEN"
                      :start ,(format-time-string "%Y-%m-%d %H:%M" start-time)
                      :set-track "General")
                     (sat-open :time 1)
                     (uni :time 1) ; live Q&A
                     (adventure :time 1) ; pad Q&A
                     ("DEV"
                      :start
                      ,(format-time-string "%Y-%m-%d %H:%M" (time-add start-time 60))
                      :set-track "Development")
                     (repl :time 1) ; IRC
                     (matplotllm :time 1) ; pad
                     (voice :time 1) ; live
                     )))))
  (emacsconf-stream-crontabs t schedule))

That generates gen.crontab and dev.crontab. This is what gen.crontab looks like for testing:

PATH=/usr/local/bin:/usr/bin
MAILTO=""
XDG_RUNTIME_DIR="/run/user/2002"
35 11 26 10 * /usr/bin/screen -dmS play-sat-open bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session sat-open | tee -a ~/track.log"
36 11 26 10 * /usr/bin/screen -dmS play-uni bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session uni | tee -a ~/track.log"
38 11 26 10 * /usr/bin/screen -dmS play-adventure bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session adventure | tee -a ~/track.log"
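A generated file like this is easy to sanity-check before installing it. The following Python sketch splits a crontab into environment settings and schedule entries so that the times can be eyeballed or compared against the schedule. It only handles the simple subset shown above (numeric fields, *, and the ,-/ operators). The name parse_crontab is hypothetical.

```python
CRON_FIELD_CHARS = set("0123456789*,-/")


def parse_crontab(text):
    """Return (env, entries) parsed from crontab TEXT."""
    env, entries = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 5)
        # Five schedule fields followed by the command
        if len(parts) == 6 and all(set(p) <= CRON_FIELD_CHARS for p in parts[:5]):
            minute, hour, day, month, weekday, command = parts
            entries.append({"minute": minute, "hour": hour, "day": day,
                            "month": month, "weekday": weekday,
                            "command": command})
        elif "=" in line:
            # Environment setting such as PATH=... or MAILTO=""
            name, value = line.split("=", 1)
            env[name.strip()] = value.strip().strip('"')
    return env, entries


sample = '''PATH=/usr/local/bin:/usr/bin
MAILTO=""
XDG_RUNTIME_DIR="/run/user/2002"
35 11 26 10 * /usr/bin/screen -dmS play-sat-open bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session sat-open | tee -a ~/track.log"
36 11 26 10 * /usr/bin/screen -dmS play-uni bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session uni | tee -a ~/track.log"
38 11 26 10 * /usr/bin/screen -dmS play-adventure bash -c "DISPLAY=:5 TEST_MODE=1 /usr/local/bin/handle-session adventure | tee -a ~/track.log"
'''

env, entries = parse_crontab(sample)
for entry in entries:
    print(f'{entry["hour"]}:{int(entry["minute"]):02d}', entry["command"].split()[2])
```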

The result: for both tracks, the intro videos play, the test videos play, and web browsers go to the right places for the Q&A.

In case I need to resume manual control:

emacsconf-stream-cancel-crontab: Remove crontab for TRACK.
(defun emacsconf-stream-cancel-crontab (track)
  "Remove crontab for TRACK."
  (interactive (list (emacsconf-complete-track)))
  (plist-put track :autopilot nil)
  (emacsconf-stream-track-ssh track "crontab -r"))

emacsconf-stream-cancel-all-crontabs: Remove crontabs.
(defun emacsconf-stream-cancel-all-crontabs ()
  "Remove crontabs."
  (interactive)
  (dolist (track emacsconf-tracks)
    (plist-put track :autopilot nil)
    (emacsconf-stream-track-ssh track "crontab -r")))

Here are some things I learned along the way:

  • I needed to run timedatectl set-timezone America/Toronto on the server so that cron would interpret the crontab times in the same timezone as the schedule.

    In Ansible terms, that's:

      - name: Set system timezone
        tags: tz
        community.general.timezone:
          name: ""
      - name: Restart cron
        tags: tz
        ansible.builtin.service:
          name: cron
          state: restarted
    
  • I also needed to specify the PATH so that I didn't need to add the absolute paths in all the other shell scripts, XDG_RUNTIME_DIR to get audio working, and DISPLAY so that windows showed up in the right place.
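The timezone point is easy to see with a small Python sketch: the same instant produces different cron fields depending on which timezone the server uses. This uses fixed UTC offsets to stay self-contained. The -4 hour offset for Toronto during daylight saving time is an assumption baked into the example, and the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cron_fields_in_zone(moment: datetime, server_tz: timezone) -> str:
    """Return the cron schedule fields for MOMENT as seen in SERVER_TZ."""
    local = moment.astimezone(server_tz)
    return f"{local.minute} {local.hour} {local.day} {local.month} *"


# Assumption: Toronto is on daylight saving time (UTC-4) on this date
toronto = timezone(timedelta(hours=-4))
start = datetime(2023, 10, 26, 11, 35, tzinfo=toronto)

print(cron_fields_in_zone(start, toronto))       # 35 11 26 10 *
print(cron_fields_in_zone(start, timezone.utc))  # 35 15 26 10 *
```

If the crontab is written with the first set of fields but the server runs on UTC, the talk would start four hours early.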

I think this will let me run both tracks for EmacsConf with more ease and less frantic juggling. We'll see!

Using Emacs and Python to record an animation and synchronize it with audio

| emacs, emacsconf, python, subed, video

[2023-01-14 Sat]: Removed my fork since upstream now has the :eval function.

The Q&A session for Things I'd like to see in Emacs (Richard Stallman) from EmacsConf 2022 was done over Mumble. Amin pasted the questions into the Mumble chat buffer and I copied them into a larger buffer as the speaker answered them, but I didn't do it consistently. I figured it might be worth making another video with easier-to-read visuals. At first, I thought about using LaTeX to create Beamer slides with the question text, which I could then turn into a video using ffmpeg. Then I decided to figure out how to animate the text in Emacs, because why not? I figured a straightforward typing animation would probably be less distracting than animate-string, and emacs-director seems to handle that nicely. I forked it to add a few things I wanted, like variables to make the typing speed slower (so that it could more reliably type things on my old laptop, since sometimes the timers seemed to have hiccups) and an :eval step for running things without needing to log them. (2023-01-14: Upstream has the :eval feature now.)

To make it easy to synchronize the resulting animation with the chapter markers I derived from the transcript of the audio file, I decided to beep between scenes. First step: make a beep file.

ffmpeg -y -f lavfi -i 'sine=frequency=1000:duration=0.1' beep.wav

Next, I animated the text, with a beep between scenes. I used subed-parse-file to read the question text directly from the chapter markers, and I used simplescreenrecorder to set up the recording settings (including audio).

(defun my-beep ()
  (interactive)
  (save-window-excursion
    (shell-command "aplay ~/recordings/beep.wav &" nil nil)))

(require 'director)
(defvar emacsconf-recording-process nil)
(shell-command "xdotool getwindowfocus windowsize 1282 720")
(progn
  (switch-to-buffer (get-buffer-create "*Questions*"))
  (erase-buffer)
  (org-mode)
  (face-remap-add-relative 'default :height 300)
  (setq-local mode-line-format "   Q&A for EmacsConf 2022: What I'd like to see in Emacs (Richard M. Stallman) - emacsconf.org/2022/talks/rms")
  (sit-for 3)
  (delete-other-windows)
  (hl-line-mode -1)
  (when (process-live-p emacsconf-recording-process) (kill-process emacsconf-recording-process))
  (setq emacsconf-recording-process (start-process "ssr" (get-buffer-create "*ssr*")
                                                   "simplescreenrecorder"
                                                   "--start-recording"
                                                   "--start-hidden"))
  (sit-for 3)
  (director-run
   :version 1
   :log-target '(file . "/tmp/director.log")
   :before-start
   (lambda ()
     (switch-to-buffer (get-buffer-create "*Questions*"))
     (delete-other-windows))
   :steps
   (let ((subtitles (subed-parse-file "~/proj/emacsconf/rms/emacsconf-2022-rms--what-id-like-to-see-in-emacs--answers--chapters.vtt")))
     (apply #'append
            (list
             (list :eval '(my-beep))
             (list :type "* Q&A for Richard Stallman's EmacsConf 2022 talk: What I'd like to see in Emacs\nhttps://emacsconf.org/2022/talks/rms\n\n"))
            (mapcar
             (lambda (sub)
               (list
                (list :log (elt sub 3))
                (list :eval '(progn (org-end-of-subtree)
                                    (unless (bolp) (insert "\n"))))
                (list :type (concat "** " (elt sub 3) "\n\n"))
                (list :eval '(org-back-to-heading))
                (list :wait 5)
                (list :eval '(my-beep))))
             subtitles)))
   :typing-style 'human
   :delay-between-steps 0
   :after-end (lambda ()
                (process-send-string emacsconf-recording-process "record-save\nwindow-show\nquit\n"))
   :on-failure (lambda ()
                 (process-send-string emacsconf-recording-process "record-save\nwindow-show\nquit\n"))
   :on-error (lambda ()
               (process-send-string emacsconf-recording-process "record-save\nwindow-show\nquit\n"))))

I used the following code to copy the latest recording to animation.webm and extract the audio to animation.wav. my-latest-file and my-recordings-dir are in my Emacs config.

(let ((name "animation.webm"))
  (copy-file (my-latest-file my-recordings-dir) name t)
  (shell-command
   (format "ffmpeg -y -i %s -ar 8000 -ac 1 %s.wav"
           (shell-quote-argument name)
           (shell-quote-argument (file-name-sans-extension name)))))

Then I needed to get the timestamps of the beeps in the recording. I subtracted a little bit (0.82 seconds) based on comparing the waveform with the results.

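import numpy as np
from scipy import signal
from scipy.io import wavfile

filename = "animation.wav"
rate, source = wavfile.read(filename)
# Each beep shows up as samples louder than 1000; peaks must be at least 1000 samples apart
peaks = signal.find_peaks(source, height=1000, distance=1000)
# peaks[0] holds sample indices: convert to seconds, then subtract the 0.82 s correction
base_times = (peaks[0] / rate) - 0.82
print(base_times)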
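To experiment with beep detection without a recording (or without SciPy), here is a self-contained Python sketch. It synthesizes silence with short 1000 Hz beeps and finds where each one starts using a plain amplitude threshold. That is a simpler stand-in for find_peaks, not the method used above, and it does not apply the 0.82-second correction. The function names, amplitude, and minimum gap are all assumptions.

```python
import math


def synth(rate, duration, beep_starts, beep_length=0.1, freq=1000, amplitude=8000):
    """Return integer samples: silence with sine beeps at BEEP_STARTS (seconds)."""
    samples = [0] * int(rate * duration)
    for start in beep_starts:
        first = int(start * rate)
        for n in range(int(beep_length * rate)):
            samples[first + n] = int(amplitude * math.sin(2 * math.pi * freq * n / rate))
    return samples


def beep_onsets(samples, rate, threshold=1000, min_gap=0.5):
    """Return the times (seconds) where a loud stretch begins after MIN_GAP of quiet."""
    onsets = []
    last_loud = None  # index of the most recent sample above the threshold
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            if last_loud is None or (index - last_loud) / rate >= min_gap:
                onsets.append(index / rate)
            last_loud = index
    return onsets


rate = 8000  # same sample rate as the extracted animation.wav
samples = synth(rate, duration=5, beep_starts=[1.0, 3.5])
print([round(t, 3) for t in beep_onsets(samples, rate)])
```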

I noticed that the first question didn't seem to get beeped properly, so I tweaked the times. Then I wrote some code to generate a very long ffmpeg command that used trim and tpad to select the segments and extend them to the right durations. There was some drift when I did it without the audio track, but the timestamps seemed to work right when I included the Q&A audio track as well.

import webvtt
import subprocess
chapters_filename =  "emacsconf-2022-rms--what-id-like-to-see-in-emacs--answers--chapters.vtt"
answers_filename = "answers.wav"
animation_filename = "animation.webm"
def get_length(filename):
    result = subprocess.run(["ffprobe", "-v", "error", "-show_entries",
                             "format=duration", "-of",
                             "default=noprint_wrappers=1:nokey=1", filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    return float(result.stdout)

def get_frames(filename):
    result = subprocess.run(["ffprobe", "-v", "error", "-select_streams", "v:0", "-count_packets",
                             "-show_entries", "stream=nb_read_packets", "-of",
                             "csv=p=0", filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    return float(result.stdout)

answers_length = get_length(answers_filename)
# override base_times
times = np.asarray([  1.515875,  13.50, 52.32125 ,  81.368625, 116.66625 , 146.023125,
       161.904875, 182.820875, 209.92125 , 226.51525 , 247.93875 ,
       260.971   , 270.87375 , 278.23325 , 303.166875, 327.44925 ,
       351.616375, 372.39525 , 394.246625, 409.36325 , 420.527875,
       431.854   , 440.608625, 473.86825 , 488.539   , 518.751875,
       544.1515  , 555.006   , 576.89225 , 598.157375, 627.795125,
       647.187125, 661.10875 , 695.87175 , 709.750125, 717.359875])
fps = 30.0
times = np.append(times, get_length(animation_filename))
anim_spans = list(zip(times[:-1], times[1:]))
chapters = webvtt.read(chapters_filename)
if chapters[0].start_in_seconds == 0:
    vtt_times = [[c.start_in_seconds, c.text] for c in chapters]
else:
    vtt_times = [[0, "Introduction"]] + [[c.start_in_seconds, c.text] for c in chapters] 
vtt_times = vtt_times + [[answers_length, "End"]]
# Add ending timestamps
vtt_times = [[x[0][0], x[1][0], x[0][1]] for x in zip(vtt_times[:-1], vtt_times[1:])]
test_rate = 1.0

groups = list(zip(anim_spans, vtt_times))
import ffmpeg
animation = ffmpeg.input('animation.webm').video
audio = ffmpeg.input('rms.opus')

for_overlay = ffmpeg.input('color=color=black:size=1280x720:d=%f' % answers_length, f='lavfi')
# Low-quality settings for a quick test of the first segment only:
# params = {"b:v": "1k", "vcodec": "libvpx", "r": "30", "crf": "63"}
# test_limit = 1
# Settings for the full render (test_limit = 0 keeps every segment):
params = {"vcodec": "libvpx", "r": "30", "copyts": None, "b:v": "1M", "crf": 24}
test_limit = 0
anim_rate = 1
import math
cursor = 0
if test_limit > 0:
    groups = groups[0:test_limit]
clips = []

# cursor is the current time in the output video
for anim, vtt in groups:
    anim_duration = (anim[1] - anim[0]) / anim_rate
    padding = vtt[1] - cursor - anim_duration
    clip = animation.trim(start=anim[0], end=anim[1]).setpts('PTS-STARTPTS')
    if padding < 0:
        # The animation is longer than its slot; it is kept at full length, not sped up
        print("Squeezing", math.floor(anim_duration), 'into', vtt[1] - cursor, padding)
    elif padding > 0:
        print("%f to %f: Padding %f into %f - pad: %f" % (cursor, vtt[1], anim_duration, vtt[1] - cursor, padding))
        # Hold the last frame until the next chapter starts
        clip = clip.filter('tpad', stop_mode="clone", stop_duration=padding)
    clips.append(clip)
    # Advance by the video emitted for this segment, whether or not it was padded
    cursor = cursor + anim_duration + max(padding, 0)
    for_overlay = for_overlay.overlay(animation.trim(start=anim[0], end=anim[1]).setpts('PTS-STARTPTS+%f' % vtt[0]))
    clips.append(audio.filter('atrim', start=vtt[0], end=vtt[1]).filter('asetpts', 'PTS-STARTPTS'))
args = ffmpeg.concat(*clips, v=1, a=1).output('output.webm', **params).overwrite_output().compile()
print(' '.join(f'"{item}"' for item in args))
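The padding arithmetic is the fiddly part, so here is a small Python sketch of just that bookkeeping, with no ffmpeg involved. Each animation clip is held on its last frame until its chapter ends, and a clip that is already longer than its slot gets no padding. It assumes the cursor should advance by the clip length even when no padding is needed. The function name plan_padding and the sample numbers are made up.

```python
def plan_padding(anim_spans, chapter_ends):
    """Return a (clip_duration, padding) pair for each animation clip.

    anim_spans: (start, end) times of each clip in the animation recording.
    chapter_ends: time in the final video at which each clip's chapter ends.
    """
    cursor = 0.0  # current position in the final video
    plan = []
    for (start, end), chapter_end in zip(anim_spans, chapter_ends):
        duration = end - start
        # Freeze the last frame long enough to reach the end of the chapter
        padding = max(chapter_end - cursor - duration, 0.0)
        plan.append((duration, padding))
        cursor += duration + padding
    return plan


# Hypothetical numbers: the second clip is longer than its chapter
spans = [(1.5, 13.5), (13.5, 52.0)]
chapter_ends = [30.0, 60.0]
print(plan_padding(spans, chapter_ends))  # [(12.0, 18.0), (38.5, 0.0)]
```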

Anyway, it's here for future reference. =)


subed.el: Word-level timing improvements, TSV support

| emacs, subed

I figured out how to align the subtitles to get word-level timestamps and generate SRV2 files, so now I'm working on improving the support in subed.el so that it can work with those timestamps.

The subed-word-data-load-from-file function in subed-word-data.el should load the word data from the SRV2 file and attempt to match it up with the text, colouring words if they were successfully matched.

Screenshot_2022-10-26_13-46-31.png

Figure 1: After subed-word-data-load-from-file

I also updated and committed code for working with TSV files like the label export from the Audacity audio editor. The concise format might make editing and reviewing easier. The files look like this:

Screenshot_2022-10-26_13-49-00.png

Figure 2: Tab-separated values
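For working with these files outside Emacs, here is a minimal Python sketch that reads label lines into (start, end, text) tuples with times in milliseconds. It assumes the layout Audacity uses for label exports: start time, end time, and text separated by tabs, with times in seconds. The function name read_labels and the sample text are hypothetical.

```python
def read_labels(text):
    """Parse tab-separated label lines into (start_ms, end_ms, text) tuples."""
    cues = []
    for line in text.splitlines():
        fields = line.split("\t", 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end, label = fields
        cues.append((int(round(float(start) * 1000)),
                     int(round(float(end) * 1000)),
                     label))
    return cues


sample = "0.000000\t1.500000\tHello\n1.500000\t3.250000\tworld\n"
for cue in read_labels(sample):
    print(cue)
```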

To convert an existing file, use subed-convert (from subed-common.el). You can also manually turn on subed-tsv-mode from subed-tsv.el when you're visiting a TSV subtitle/label file. Tab-separated values can be in any sort of text file and tsv is a common file extension, so I don't automatically add it to auto-mode-alist.

The changes should be in 1.0.16 or the latest version from the Git repository at https://github.com/sachac/subed .