subed.el: Tweaking subtitle times

| emacs, subed

When subtitle times are too far off from the video or audio, people start worrying if their video has frozen or jumped ahead. It's good to keep subtitles roughly in time with the audio.

For EmacsConf, we can get timing information from two places. WhisperX produces a JSON file with word data in the process of doing the speech recognition, and the aeneas forced alignment tool can use synthesized text-to-speech to figure out the timestamps for each line of text compared to a media file.

Aeneas timestamps are more helpful once we start editing, but Aeneas can be confused by long silences, extraneous noises, multiple speakers, and inaccurate transcripts (words added or removed).

When I combine the WhisperX word data with subtitles, I can see where the times might need a closer look because matching words weren't found.

Figure 1: Screenshot with word data loaded

Loading word data requires a pretty close match at the moment, but since we change only about 4% of the subtitle text when editing, those cues are still helpful. (I measured this by the Levenshtein distance between the combined cue texts of edited subtitles versus the original WhisperX transcripts, using string-distance to approximate the editing percentage.)

Calculating how much we edited
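(let ((sum-original 0)
      (sum-dist 0))
  (mapc
   (lambda (talk)
     (when (and (emacsconf-talk-file talk "--main.vtt")
                (emacsconf-talk-file talk "--reencoded.json"))
       (let* ((json-object-type 'alist)
              (json-array-type 'list)
              ;; Text of the edited subtitles, joined with spaces.
              (edited-text
               (mapconcat (lambda (sub) (elt sub 3))
                          (subed-parse-file (emacsconf-talk-file talk "--main.vtt"))
                          " "))
              ;; Text of the original WhisperX words, joined with spaces.
              ;; Assumption: the words come from each segment's `words' list,
              ;; following the WhisperX JSON layout.
              (original-text
               (mapconcat
                (lambda (word)
                  (assoc-default 'word word))
                (seq-mapcat
                 (lambda (segment) (alist-get 'words segment))
                 (alist-get 'segments
                            (json-read-file (emacsconf-talk-file talk "--reencoded.json"))))
                " "))
              (dist (string-distance original-text edited-text)))
         (setq sum-original (+ sum-original (length original-text)))
         (setq sum-dist (+ sum-dist dist))
         ;; Per-talk figures, handy when evaluating this for a single talk.
         (list (length original-text)
               (length edited-text)
               dist))))
   ;; Assumption: check every talk in the conference.
   (emacsconf-get-talk-info))
  (format "%d%%" (/ (* 100.0 sum-dist) sum-original)))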

To make it easier to correct subtitle timing, I added a few ways to tweak subtitle timing for a region of subtitles.

WhisperX: subed-word-data-fix-subtitle-timing in subed-word-data.el tries to match the word data from WhisperX against the text of the current subtitle, using string-distance for approximate matches. I start at about two words shorter than what's in the subtitle, and then increase the number of words taken from the data while the string distance decreases. I skip the data for words before the beginning of the first subtitle in the region.
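The matching loop described above can be sketched in Python. This is only an illustration of the idea (start a couple of words short, then keep taking words while the distance shrinks), not the code in subed-word-data.el. The function names, the plain list of words, and the sample sentence are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_word_count(subtitle_text, words, start_index=0):
    """Return (count, distance) for the run of words that best matches the subtitle.

    Starts about two words shorter than the subtitle and takes one more
    word at a time for as long as the string distance keeps decreasing.
    """
    count = max(1, len(subtitle_text.split()) - 2)
    best = levenshtein(subtitle_text, " ".join(words[start_index:start_index + count]))
    while start_index + count < len(words):
        candidate = " ".join(words[start_index:start_index + count + 1])
        distance = levenshtein(subtitle_text, candidate)
        if distance >= best:
            break  # adding another word made the match worse, so stop
        best = distance
        count += 1
    return count, best


# Hypothetical word list, as it might come from word-level timing data.
words = ["I", "often", "need", "to", "replace", "or", "navigate"]
print(match_word_count("I often need to", words))
```

Once the best run of words is known, the subtitle's start and end times can be taken from the first and last word in that run.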

Screencast of subed-word-data-fix-subtitle-timing

Aeneas: subed-align-region uses Aeneas to realign the subtitles in the region, using the section of the media file between the start of the first subtitle and the end of the last subtitle in the region. When I notice that the times are off, I skim the subtitles (or just scan them visually) to find the last well-timed subtitle. Then I pick a subtitle that's in the incorrectly-timed section. I use subed-mpv-jump-to-current-subtitle (M-j) to jump to that position, and I play back that subtitle. The audio usually belongs to some text further down, so I reset to that position with M-j, set my mark before the previous correctly-timed subtitle with C-SPC, go to the subtitle that matches the audio at that time, and use subed-copy-player-pos-to-start-time (C-c [) to set the proper timestamp. Then I can go to the previous incorrectly-timed subtitle and use M-x subed-align-region. This runs the Aeneas forced alignment tool using just the subtitle text in the region, the starting timestamp of the first subtitle, and the ending timestamp of the last subtitle, making it easy to adjust that section. subed-align-region is in subed-align.el.

Retiming by pressing SPC after each subtitle: As an experiment, I've also added a subed-retime-subtitles command that plays through the subtitles so that I can press SPC when the next subtitle starts. It begins with the current subtitle and stops when you press a key that's not in its keymap.

Screencast with audio: subed-retime-subtitles

Manual adjustments: For fine-tuning timestamps, I usually turn on subed-waveform-show-all and shift-left-click (subed-waveform-set-start-and-copy-to-previous) or shift-right-click (subed-waveform-set-stop-and-copy-to-next) on the waveforms because it's easy to see where the words and pauses are. When I'm not sure, I can use middle-click (subed-waveform-play-sample) to play part of the file without changing the subtitle start/stop or the MPV playback position.

Screencast with audio of using the waveforms

I'm experimenting with adding repeating keybindings. There was a subed-mpv-frame-step-map that was bound to C-c C-f, so I've renamed it to subed-mpv-control, added a whole bunch of keybindings to the subed-mpv-control-map based on MPV and Aegisub shortcuts, and made it a repeating transient map.

Screencast with audio, experimenting with the mpv control map

Ideas for next steps:

Gotta get the hang of all these new capabilities through practice! =)

To make my subed-align-region workflow even more convenient, I could use completing-read to let me select a future subtitle with completion, and then Emacs could automatically fix the subtitle start time, go to the previous subtitle, and realign the region.

Also, I think switching the waveforms from overlays to text properties could be a good idea. When I cut text, the overlays get left behind, but I want the waveforms to go away too.

While writing this post and fiddling with subed, I ended up adding a bunch of keybindings and a menu. I figured this was as good a time as any to stop tweaking it and finally publish. (But it's fun! Just one more idea…)

Checking caption timing by skimming with Emacs Lisp or JS

| js, emacs, subed

Sometimes automatic subtitle timing tools like Aeneas can get confused by silences, extraneous sounds, filler words, mis-starts, and text that I've edited out of the raw captions for easier readability. It's good to quickly check each caption. I used to listen to captions at 1.5x speed, watching carefully as each caption displayed. This took a fair bit of time and focus, so… it usually didn't happen. Sampling the first second of each caption is faster and requires a little less attention.

Skimming with subed.el

Here's a function that I wrote to play the first second of each subtitle.

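(defvar my-subed-skim-msecs 1000 "Number of milliseconds to play when skimming.")
(defun my-subed-skim-starts ()
  "Play the first `my-subed-skim-msecs' of each subtitle from point onwards.
Press any key to stop."
  (interactive)
  (subed-mpv-unpause)
  (catch 'done
    (while (not (eobp))
      ;; Start playback at the beginning of the current subtitle.
      (subed-mpv-jump-to-current-subtitle)
      (let ((ch
             (read-char "(q)uit? " nil (/ my-subed-skim-msecs 1000.0))))
        (when ch
          (throw 'done t)))
      ;; Move on; stop at the last subtitle instead of looping forever.
      (unless (subed-forward-subtitle-text)
        (throw 'done t))
      (when (and subed-waveform-minor-mode
                 (not subed-waveform-show-all))
        ;; The waveform refresh call for the new subtitle is missing
        ;; from this listing.
        )))
  (subed-mpv-pause))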

Now I can read the lines as the subtitles play, and I can press any key to stop so that I can fix timestamps.

Skimming with Javascript

I also want to check the times on the Web in case there have been caching issues. Here's some Javascript to skim the first second of each cue in the first text track for a video, with some code to make it easy to process the first video in the visible area.

function getVisibleVideo() {
  const videos = document.querySelectorAll('video');
  for (const video of videos) {
    const rect = video.getBoundingClientRect();
    // Only accept a video that is entirely inside the viewport
    if (
      rect.top >= 0 &&
      rect.left >= 0 &&
      rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
      rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    ) {
      return video;
    }
  }
  return null;
}

async function skimVideo(video=getVisibleVideo(), msecs=1000) {
  // Get the first text track (assumed to be captions/subtitles)
  const textTrack = video.textTracks[0];
  if (!textTrack) return;
  const remaining = [...textTrack.cues].filter((cue) => cue.endTime >= video.currentTime);
  // Start playback; pausing the video stops the loop below
  video.play();
  // Play the first msecs milliseconds of each remaining cue
  for (let i = 0; i < remaining.length && !video.paused; i++) {
    video.currentTime = remaining[i].startTime;
    await new Promise((resolve) => setTimeout(resolve, msecs));
  }
}

Then I can call it with skimVideo();. Actually, in our backstage area, it might be useful to add a Skim button so that I can skim things from my phone.

function handleSkimButton(event) {
   // Assumes each video sits inside a container with the class "vid"
   const vid = event.target.closest('.vid').querySelector('video');
   skimVideo(vid);
}

document.querySelectorAll('video').forEach((vid) => {
   const div = document.createElement('div');
   const skim = document.createElement('button');
   skim.textContent = 'Skim';
   div.appendChild(skim);
   vid.parentNode.insertBefore(div, vid.nextSibling);
   skim.addEventListener('click', handleSkimButton);
});


How much faster is it this way?

Some code to help figure out the speedup
(-let* ((files (directory-files "~/proj/emacsconf/2024/cache" t "--main\\.vtt"))
        ((count-subs sum-seconds)
         (-unzip (mapcar
                  (lambda (file)
                    ;; (number of captions, video duration in seconds)
                    (list
                     (length (subed-parse-file file))
                     (/ (compile-media-get-file-duration-ms
                         (concat (file-name-sans-extension file) ".webm")) 1000.0)))
                  files)))
        (total-seconds (-reduce #'+ sum-seconds))
        (total-subs (-reduce #'+ count-subs)))
  ;; Skimming takes about one second per caption, so the speed-up is
  ;; total video seconds divided by the number of captions.
  (format "%d files, %.1f hours, %d total captions, speed up of %.1f"
          (length files)
          (/ total-seconds 3600.0)
          total-subs
          (/ total-seconds total-subs)))

It looks like for EmacsConf talks where we typically format captions to be one long line each (< 60 characters), this can be a speed-up of about 4x compared to listening to the video at normal speed. More usefully, it's different enough to get my brain to do it instead of putting it off.

Most of the automatically-generated timestamps are fine. It's just a few that might need tweaking. It's nice to be able to skim them with fewer keystrokes.

Yay Emacs 7: Using word timing in caption editing with subed-word-data

| emacs, yay-emacs, subed

When I work with video captions, I often want to split long captions using subed-split-subtitle. If my player is somewhere in the current subtitle, it'll use that timestamp. If not, it'll make a reasonable guess based on character position.

I can use subed-word-data.el to load word-level times from WhisperX JSON or from Youtube SRV2 files. This allows me to split a subtitle using the timestamp for that word.
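To show why word-level times help when splitting, here is a small Python sketch comparing the two approaches. It assumes a guess based on character position means linear interpolation over the text, which may not be exactly what subed-split-subtitle does. The function names are made up for the example, and the sample numbers are taken from the WhisperX output quoted in the WhisperX post further down.

```python
def guess_split_ms(start_ms, stop_ms, text, split_pos):
    """Estimate a split time by assuming every character takes equal time."""
    fraction = split_pos / len(text)
    return round(start_ms + (stop_ms - start_ms) * fraction)


def word_split_ms(words, word_index):
    """Use the recorded start time of the word that begins the second half."""
    return round(words[word_index]["start"] * 1000)


text = "I often need to... I sometimes need to replace or navigate by symbols."
words = [
    {"word": "I", "start": 0.427, "end": 0.507},
    {"word": "often", "start": 0.587, "end": 0.887},
    {"word": "need", "start": 0.987, "end": 1.227},
    {"word": "to...", "start": 1.267, "end": 1.508},
    {"word": "I", "start": 4.329, "end": 4.429},
]

# Split just before the second "I", which follows a long pause.
split_pos = text.index("I sometimes")
print("character-based guess:", guess_split_ms(427, 7751, text, split_pos))
print("word-based time:      ", word_split_ms(words, 4))
```

The character-based guess lands in the middle of the pause, while the word data gives the moment the speaker actually resumes.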

Because subed-word-data colours words based on transcription confidence, I can see where something might need to be closely examined, like when there's no timing information for the words at the start or end.

If I combine that with subed-waveform, I can see silences. Then I can tweak start times by shift-left-clicking on the waveform. This automatically adjusts the end time of the previous subtitle too.

I like how Emacs makes it easy to use word timing data when editing captions. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Note: Sometimes WhisperX gives me overlapping timestamps for captions, so I use M-x subed-align to get the aeneas forced alignment tool to give me subtitle-level timestamps. Then I use the word-level data from WhisperX for further splitting.


Aside: I was trying to find some kind of value-to-color translator for Emacs Lisp for easier visualization, like the way the d3 Javascript library makes it easy to translate a range of numbers (say, linear 0.0 to 1.0) to colors (ex: red-yellow-green). I found color-hsl-to-rgb and also the range of colours defined by the faces calendar-scale-1 to calendar-scale-10. There's also prism, which colours code by depth and allows people to specify the colour transformations (saturation, lightness, etc.). I wonder if someone's already written a general-purpose data-to-fg/bg-color Elisp library that supports numerical and categorical data…
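For comparison, the kind of value-to-colour mapping described in this aside takes only a few lines in Python with the standard colorsys module. This is just a sketch of the idea (a linear red-yellow-green scale), not an existing Emacs Lisp library; the function name is made up.

```python
import colorsys


def score_to_hex(score):
    """Map a score in [0.0, 1.0] to a hex colour from red (0) through yellow to green (1)."""
    clamped = min(1.0, max(0.0, score))
    hue = clamped / 3.0  # 0 is red and 1/3 is green on the HLS colour wheel
    red, green, blue = colorsys.hls_to_rgb(hue, 0.5, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255))


# Confidence scores like the ones WhisperX reports for each word.
for score in (0.994, 0.856, 0.738):
    print(score, score_to_hex(score))
```

The same interpolation could be written in Emacs Lisp on top of color-hsl-to-rgb.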

A git post-commit hook for tagging my subed.el release version

| git, emacs, subed

[2024-12-09 Mon]: It looks like I also need to use P t to push the tags from Magit, or call git push --tags from the command-line. Since I'm using a postcommit hook, I'm not sure followTags will kick in for that. Hmm…

Debian uses Git repository tags to notice when to update packages. I kept forgetting to tag subed's versions, so now I made a git post-commit hook which I think will do the trick. I based it on, just updated for Python 3 and tweaked to work with how I do versions in subed.el. I've also added it to my


#!/usr/bin/env python3
# place in .git/hooks/post-commit
# Based on

import subprocess
import re

print("checking for version change...",)

output = subprocess.check_output(['git', 'diff', 'HEAD^', 'HEAD', '-U0']).decode("utf-8")

# Matches added lines such as "+;; Version: 1.2.3"
rg = re.compile(r'\+(?:;;\s+)?Version:\s+(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\.(?P<rev>[0-9]+)')

version_info = None
for d in output.split("\n"):
    m = rg.search(d)
    if m:
        version_info = m.groupdict()

if version_info:
    tag = "v%s.%s.%s" % (version_info['major'], version_info['minor'], version_info['rev'])
    existing = subprocess.check_output(['git', 'tag']).decode("utf-8").split("\n")
    if tag in existing:
        print("%s is already tagged, not updating" % tag)
    else:
        result = subprocess.run(['git', 'tag', '-f', tag], capture_output=True)
        if result.returncode:
            raise Exception('tagging not successful: %s %s' % (result.stdout, result.returncode))
        print("tagged revision: %s" % tag)
else:
    print("none found.")
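
To check the version pattern without making a commit, the matching step can be run on a pasted diff. This sketch reuses the regular expression from the hook; the helper name and the sample diff text are made up for the example.

```python
import re

VERSION_LINE = re.compile(
    r'\+(?:;;\s+)?Version:\s+(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\.(?P<rev>[0-9]+)')


def tag_for_diff(diff_text):
    """Return a tag such as 'v1.2.3' for the last added Version: line, or None."""
    tag = None
    for line in diff_text.split("\n"):
        match = VERSION_LINE.search(line)
        if match:
            tag = "v{major}.{minor}.{rev}".format(**match.groupdict())
    return tag


# Hypothetical output of `git diff HEAD^ HEAD -U0` for a version bump.
sample_diff = "-;; Version: 1.2.2\n+;; Version: 1.2.3\n"
print(tag_for_diff(sample_diff))
```

Only added lines (the ones starting with +) count, so the removed old version number is ignored.
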
Yay Emacs 5: Tweaking my video workflow with WhisperX and subed-record

| speechtotext, emacs, subed, yay-emacs

I'm tweaking my video workflow. I use Orgzly Revived on my Android phone to write the text, and I use Easy Voice Recorder to record it. Syncthing automatically copies both to my laptop. I use WhisperX to transcribe my recording, and I use a little bit of Emacs Lisp to figure out timestamps for each word. I edit this to fix errors. I can even rearrange things and get rid of umms or ahs or anything I don't want. Then I use subed-convert to turn it into a VTT file. I can tweak the start and end times by looking at the waveforms. Then I add comments with the visuals I want. I can add images, animated GIFs, or videos, and they're automatically squeezed or stretched to fit. I can also have them play at original speed. Then I set up open captions and use subed-record-compile-video. Tada!


You can watch this on YouTube, download the video, or download the audio.

Using WhisperX to get word-level timestamps for audio editing with Emacs and subed-record

| audio, emacs, subed
  • [2024-12-09 Mon]: Split whisperx to pass one file per call.
  • [2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
  • [2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.

I'm gradually shifting more things to this Lenovo P52 to take advantage of its newer processor, 64 GB of RAM, and 2 TB drive. (Whee!) One of the things I'm curious about is how I can make better use of multimedia. I couldn't get whisper.cpp to work on my Lenovo X230T, so I mostly relied on the automatic transcripts from Google Recorder (with timestamps generated by aeneas) or cloud-based transcription services like Deepgram.

I have a lot of silences in my voice notes when I think out loud. whisper.cpp got stuck in loops during silent parts, but WhisperX handles them perfectly. WhisperX is also fast enough for me to handle audio files locally instead of relying on Deepgram. With the default model, I can process the files faster than real-time:

File length    Transcription time
42s            17s
7m48s          1m41s

I used this command to get word-level timing data. (Experimenting with options from this post)

# MODEL and MAX_LINE_WIDTH are expected to be set earlier in the script or in the environment.
for FILE in "$@"; do
    ~/vendor/whisperx/.venv/bin/whisperx --model "$MODEL" --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --compute_type int8 --print_progress True --max_line_width "$MAX_LINE_WIDTH" --segment_resolution chunk --max_line_count 1 --language en "$FILE"
done

Among other things, it makes a text file that looks like this:

I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.

and a JSON file that looks like this:

{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}

Sometimes I just want the text so that I can use an audio braindump as the starting point for a blog post or for notes. WhisperX is way more accurate than Google Recorder, so that will probably be easier once I update my workflow for that.

Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:

(defun my-subed-word-tsv-from-whisperx-json (file)
  "Write the word timestamps from the WhisperX JSON FILE to a TSV file and open it."
  (interactive "FJSON: ")
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         ;; One (ID START-MS STOP-MS TEXT) entry per word.
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
     ;; numbers at the end of a sentence sometimes don't end up with times
     ;; so we need to fix them
    (while current
      (unless (elt (car current) 1)           ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2)           ; end
        (setf (elt (car current) 2) (1- (elt (cadr current) 1))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))
    ;; Assumption: `subed-create-file' writes the cues in BASE to FILENAME
    ;; in the format implied by its extension.
    (subed-create-file filename base)
    (find-file filename)))
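
The same extraction can be sketched outside Emacs. This Python version follows the JSON layout shown above and writes rows in the seconds-based TSV layout shown below. Unlike the Emacs Lisp function, it falls back to the word's own start time when the following word has no start time either; that fallback and the function names are choices made for this example.

```python
def word_rows(whisperx):
    """Flatten WhisperX segments into [start_ms, end_ms, word] rows, filling missing times."""
    rows = []
    for segment in whisperx["segments"]:
        for word in segment.get("words", []):
            start = word.get("start")
            end = word.get("end")
            rows.append([
                None if start is None else round(start * 1000),
                None if end is None else round(end * 1000),
                word["word"],
            ])
    last_end = 0
    for i, row in enumerate(rows):
        if row[0] is None:
            row[0] = last_end + 1  # begin right after the previous word
        if row[1] is None:
            next_start = rows[i + 1][0] if i + 1 < len(rows) else None
            # Finish just before the next word; otherwise use the start time.
            row[1] = next_start - 1 if next_start is not None else row[0]
        last_end = row[1]
    return rows


def to_tsv(rows):
    """Format rows as start, end, and text columns, with times in seconds."""
    return "\n".join(
        "{:.6f}\t{:.6f}\t{}".format(start / 1000, end / 1000, text)
        for start, end, text in rows)


# Hypothetical data: the middle word has no timing information.
sample = {"segments": [{"words": [
    {"word": "I", "start": 0.427, "end": 0.507},
    {"word": "2024"},
    {"word": "often", "start": 0.587, "end": 0.887},
]}]}
print(to_tsv(word_rows(sample)))
```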

Here's my old code for parsing the highlighted VTT or SRT files that underline each word:

(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                ;; Keep only the underlined word as the cue text.
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))

I like to use the TSV format for this one because it's easy to scan down the right side. Incidentally, this format is compatible with Audacity labels, so I could import that there if I wanted. I like Emacs much more, though. I'm used to having all my keyboard shortcuts at hand.

0.427000	0.507000	I
0.587000	0.887000	often
0.987000	1.227000	need
1.267000	1.508000	to...
4.329000	4.429000	I
4.469000	4.869000	sometimes
4.950000	5.170000	need
5.210000	5.410000	to
5.530000	6.090000	replace

Once I've deleted the words I don't want to include, I can merge subtitles for phrases so that I can keep the pauses between words. A quick heuristic is to merge subtitles if they don't have much of a pause between them.

(defvar my-subed-merge-close-subtitles-threshold 500)
(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-start)))))
      (if (and end next-start (< (- next-start end) threshold))
          ;; Small gap: merge and check the merged subtitle again.
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))
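
The heuristic is easy to try on plain data too. Here is a minimal Python sketch of the same idea, using the word times from the TSV sample above, converted to milliseconds; the function name and the tuple layout are made up for the example.

```python
def merge_close(cues, threshold_ms=500):
    """Merge each cue into the previous one when the gap between them is under threshold_ms.

    cues is a list of (start_ms, stop_ms, text) tuples sorted by start time.
    """
    merged = []
    for start, stop, text in cues:
        if merged and start - merged[-1][1] < threshold_ms:
            previous_start, _, previous_text = merged[-1]
            merged[-1] = (previous_start, stop, previous_text + " " + text)
        else:
            merged.append((start, stop, text))
    return merged


words = [
    (427, 507, "I"), (587, 887, "often"), (987, 1227, "need"),
    (1267, 1508, "to..."), (4329, 4429, "I"), (4469, 4869, "sometimes"),
]
for cue in merge_close(words):
    print(cue)
```

The long pause before the second "I" is above the threshold, so the words end up as two phrases with the pause preserved between them.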

Then I can use subed-waveform-show-all to tweak the start and end timestamps. Here I switch to another file I've been editing…

Figure 1: Screenshot of subed-waveform

After that, I can use subed-record to compile the audio into an .opus file that sounds reasonably smooth.

I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.

I want to make lots of quick audio notes that I can shuffle and listen to in order to remember things I'm learning about Emacs (might even come up with some kind of spaced repetition system), and I'd like to make more videos someday too. I think WhisperX, subed, and Org Mode will be fun parts of my workflow.

This is part of my Emacs configuration.
EmacsConf backstage: making lots of intro videos with subed-record

| emacsconf, subed, emacs

Summary (735 words): Emacs is a handy audio/video editor. subed-record can combine multiple audio files and images to create multiple output videos.

Watch on YouTube

It's nice to feel like you're saying someone's name correctly. We ask EmacsConf speakers to introduce themselves in the first few seconds of their video, but people often forget to do that, which is okay. We started recording introductions for EmacsConf 2022 so that stream hosts don't have to worry about figuring out pronunciation while they're live. Here's how I used subed-record to turn my recordings into lots of little videos.

First, I generated the title images by using Emacs Lisp to replace text in a template SVG and then using Inkscape to convert the SVG into a PNG. Each image showed information for the previous talk as well as the upcoming talk. (emacsconf-stream-generate-in-between-pages)

Figure 1: Sample title image

Then I generated the text for each talk based on the title, the speaker names, pronunciation notes, pronouns, and type of Q&A. Each introduction generally followed the pattern, "Next we have title by speakers. Details about Q&A." (emacsconf-pad-expand-intro and emacsconf-subed-intro-subtitles below)

00:00:00.000 --> 00:00:00.999
#+OUTPUT: sat-open.webm
Next, we have "Saturday opening remarks".

00:00:05.000 --> 00:00:04.999
#+OUTPUT: adventure.webm
Next, we have "An Org-Mode based text adventure game for learning the basics of Emacs, inside Emacs, written in Emacs Lisp", by Chung-hong Chan. He will answer questions via Etherpad.

I copied the text into an Org note in my inbox, which Syncthing copied over to the Orgzly Revived app on my Android phone. I used Google Recorder to record the audio. I exported the m4a audio file and a rough transcript, copied them back via Syncthing, and used subed-record to edit the audio into a clean audio file without oopses.

Each intro had a set of captions that started with a NOTE comment. The NOTE comment specified the following:

  • #+AUDIO:: the audio source to use for the timestamped captions that follow
  • [[file:...]]: the title image I generated for each talk. When subed-record-compile-video sees a comment with a link to an image, video, or animated GIF, it takes that visual and uses it for the span of time until the next visual.
  • #+OUTPUT: the file to create.
NOTE #+OUTPUT: hyperdrive.webm
#+AUDIO: intros-2023-11-21-cleaned.opus

00:00:15.680 --> 00:00:17.599
Next, we have "hyperdrive.el:

00:00:17.600 --> 00:00:21.879
Peer-to-peer filesystem in Emacs", by Joseph Turner

00:00:21.880 --> 00:00:25.279
and Protesilaos Stavrou (also known as Prot).

00:00:25.280 --> 00:00:27.979
Joseph will answer questions via BigBlueButton,

00:00:27.980 --> 00:00:31.080
and Prot might be able to join depending on the weather.

00:00:31.081 --> 00:00:33.439
You can join using the URL from the talk page

00:00:33.440 --> 00:00:36.320
or ask questions through Etherpad or IRC.

#+OUTPUT: steno.webm
#+AUDIO: intros-2023-11-19-cleaned.opus

00:03:23.260 --> 00:03:25.480
Next, we have "Programming with steno",

00:03:25.481 --> 00:03:27.700
by Daniel Alejandro Tapia.

#+AUDIO: intro-2023-11-29-cleaned.opus

00:00:13.620 --> 00:00:16.580
You can ask your questions via Etherpad and IRC.

00:00:16.581 --> 00:00:18.079
We'll send them to the speaker

00:00:18.080 --> 00:00:19.919
and post the answers in the talk page

00:00:19.920 --> 00:00:21.320
after the conference.
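
To make the structure of that file easier to see, here is a rough Python sketch that groups cues by their #+OUTPUT file and remembers which #+AUDIO file was in effect for each cue. It only illustrates the directive idea; it is not how subed-record parses its input, and the function name and regular expressions are made up for the example.

```python
import re

CUE_TIME = re.compile(r"(\d+:\d+:\d+\.\d+) --> (\d+:\d+:\d+\.\d+)")
DIRECTIVE = re.compile(r"#\+(OUTPUT|AUDIO):\s*(\S+)")


def group_by_output(vtt_text):
    """Return {output_file: [[audio_file, start, stop, text], ...]}."""
    outputs = {}
    output = audio = None
    pending = None  # the cue that is currently collecting text lines
    for line in vtt_text.splitlines():
        directive = DIRECTIVE.search(line)
        timing = CUE_TIME.match(line)
        if directive:
            key, value = directive.groups()
            if key == "OUTPUT":
                output = value
                outputs.setdefault(output, [])
            else:
                audio = value
        elif timing and output is not None:
            pending = [audio, timing.group(1), timing.group(2), ""]
            outputs[output].append(pending)
        elif not line.strip():
            pending = None
        elif pending is not None:
            pending[3] = (pending[3] + " " + line).strip()
    return outputs


sample = """NOTE #+OUTPUT: steno.webm
#+AUDIO: intros-2023-11-19-cleaned.opus

00:03:23.260 --> 00:03:25.480
Next, we have "Programming with steno",

00:03:25.481 --> 00:03:27.700
by Daniel Alejandro Tapia.

NOTE #+AUDIO: intro-2023-11-29-cleaned.opus

00:00:13.620 --> 00:00:16.580
You can ask your questions via Etherpad and IRC.
"""
for output_file, cues in group_by_output(sample).items():
    print(output_file, len(cues), "cues")
```

Each output video collects cues from whichever audio file was named most recently, which is what makes it possible to splice in a phrase from a different take.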

I could then call subed-record-compile-video to create the videos for all the intros, or mark a region with C-SPC and then subed-record-compile-video only the intros inside that region.

Sample intro

Using Emacs to edit the audio and compile videos worked out really well because it made it easy to change things.

  • Changing pronunciation or titles: For EmacsConf 2023, I got the recordings sorted out in time for the speakers to correct my pronunciation if they wanted to. Some speakers also changed their talk titles midway. If I wanted to redo an intro, I just had to rerecord that part, run it through my subed-record audio cleaning process, add an #+AUDIO: comment specifying which file I want to take the audio from, paste it into my main intros.vtt, and recompile the video.
  • Cancelling talks: One of the talks got cancelled, so I needed to update the images for the talk before it and the talk after it. I regenerated the title images and recompiled the videos. I didn't even need to figure out which talk needed to be updated - it was easy enough to just recompile all of them.
  • Changing type of Q&A: For example, some speakers needed to switch from answering questions live to answering them after the conference. I could just delete the old instructions, paste in the instructions from elsewhere in my intros.vtt (making sure to set #+AUDIO to the file if it came from a different take), and recompile the video.

And of course, all the videos were captioned. Bonus!

So that's how using Emacs to edit and compile simple videos saved me a lot of time. I don't know how I'd handle this otherwise. 47 video projects that might all need to be updated if, say, I changed the template? Yikes. Much better to work with text. Here are the technical details.

Generating the title images

I used Inkscape to add IDs to our template SVG so that I could edit them with Emacs Lisp. From emacsconf-stream.el:

emacsconf-stream-generate-in-between-pages: Generate the title images.
(defun emacsconf-stream-generate-in-between-pages (&optional info)
  "Generate the title images."
  (setq info (or emacsconf-schedule-draft (emacsconf-publish-prepare-for-display (emacsconf-filter-talks (or info (emacsconf-get-talk-info))))))
  (let* ((by-track (seq-group-by (lambda (o) (plist-get o :track)) info))
         (dir (expand-file-name "in-between" emacsconf-stream-asset-dir))
         (template (expand-file-name "template.svg" dir)))
    (unless (file-directory-p dir)
      (make-directory dir t))
    (mapc (lambda (track)
            (let (prev)
              (mapc (lambda (talk)
                      (let ((dom (xml-parse-file template)))
                        (mapc (lambda (entry)
                                (let ((prefix (car entry)))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "title")
                                                 (plist-get (cdr entry) :title))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "speakers")
                                                 (plist-get (cdr entry) :speakers))
                                  (emacsconf-stream-svg-set-text dom (concat prefix "url")
                                                 (and (cdr entry) (concat emacsconf-base-url (plist-get (cdr entry) :url))))
                                  (emacsconf-stream-svg-set-text
                                   dom
                                   (concat prefix "qa")
                                   (pcase (plist-get (cdr entry) :q-and-a)
                                     ((rx "live") "Live Q&A after talk")
                                     ((rx "pad") "Etherpad")
                                     ((rx "IRC") "IRC Q&A after talk")
                                     (_ "")))))
                              (list (cons "previous-" prev)
                                    (cons "current-" talk)))
                        (with-temp-file (expand-file-name (concat (plist-get talk :slug) ".svg") dir)
                          (dom-print dom))
                        ;; Convert the SVG to a PNG.  Assumption: the command
                        ;; is run with `shell-command'.
                        (shell-command
                         (concat "inkscape --export-type=png -w 1280 -h 720 --export-background-opacity=0 "
                                 (shell-quote-argument (expand-file-name (concat (plist-get talk :slug) ".svg")
                                                                         dir)))))
                      (setq prev talk))
                    (emacsconf-filter-talks (cdr track)))))
          by-track)))

emacsconf-stream-svg-set-text: Update DOM to set the tspan in the element with ID to TEXT.
(defun emacsconf-stream-svg-set-text (dom id text)
  "Update DOM to set the tspan in the element with ID to TEXT.
If the element doesn't have a tspan child, use the element itself."
  (if (or (null text) (string= text ""))
      (let ((node (dom-by-id dom id)))
        (when node
          (dom-set-attribute node 'style "visibility: hidden")
          (dom-set-attribute (dom-child-by-tag node 'tspan) 'style "fill: none; stroke: none")))
    (setq text (svg--encode-text text))
    (let ((node (or (dom-child-by-tag
                     (car (dom-by-id dom id))
                     'tspan)
                    (dom-by-id dom id))))
      (cond
       ((null node)
        (error "Could not find node %s" id))                      ; skip
       ((= (length node) 2)
        (nconc node (list text)))
       (t (setf (elt node 2) text))))))
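
The same template-filling idea can be tried without Emacs. This Python sketch uses the standard xml.etree.ElementTree module to set the text of an element (or of its tspan child) by ID and to hide the element when there is no text. The tiny inline template stands in for the real template.svg, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def set_text(root, element_id, text):
    """Set the text of the element with ID element_id, preferring its tspan child.

    Empty or missing text hides the element instead.
    """
    node = next((el for el in root.iter() if el.get("id") == element_id), None)
    if node is None:
        raise KeyError("Could not find node %s" % element_id)
    if not text:
        node.set("style", "visibility: hidden")
        return
    tspan = node.find("{%s}tspan" % SVG_NS)
    target = tspan if tspan is not None else node
    target.text = text  # ElementTree escapes special characters when serializing


template = (
    '<svg xmlns="%s">'
    '<text id="current-title"><tspan>TITLE</tspan></text>'
    '<text id="current-qa">QA</text>'
    '</svg>' % SVG_NS)
root = ET.fromstring(template)
set_text(root, "current-title", "Programming with steno")
set_text(root, "current-qa", "")
print(ET.tostring(root, encoding="unicode"))
```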

Generating the script

From emacsconf-pad.el:

emacsconf-pad-expand-intro: Make an intro for TALK.
(defun emacsconf-pad-expand-intro (talk)
  "Make an intro for TALK."
  (cond
   ((null (plist-get talk :speakers))
    (format "Next, we have \"%s\"." (plist-get talk :title)))
   ((plist-get talk :intro-note)
    (plist-get talk :intro-note))
   (t
    (let ((pronoun (pcase (plist-get talk :pronouns)
                     ((rx "she") "She")
                     ((rx "\"ou\"") "Ou")
                     ((or 'nil "nil" (rx string-start "he") (rx "him")) "He")
                     ((rx "they") "They")
                     (_ (or (plist-get talk :pronouns) "")))))
      (format "Next, we have \"%s\", by %s%s.%s"
              (plist-get talk :title)
              ;; "A, B, C" becomes "A, B, and C"
              (replace-regexp-in-string ", \\([^,]+\\)$"
                                        ", and \\1"
                                        (plist-get talk :speakers))
              (emacsconf-surround " (" (plist-get talk :pronunciation) ")" "")
              (pcase (plist-get talk :q-and-a)
                ((or 'nil "") "")
                ((rx "after") " You can ask questions via Etherpad and IRC. We'll send them to the speaker, and we'll post the answers on the talk page afterwards.")
                ((rx "live")
                 (format " %s will answer questions via BigBlueButton. You can join using the URL from the talk page or ask questions through Etherpad or IRC."
                         pronoun))
                ((rx "pad")
                 (format " %s will answer questions via Etherpad."
                         pronoun))
                ((rx "IRC")
                 (format " %s will answer questions via IRC in the #%s channel."
                         pronoun
                         (plist-get talk :channel)))))))))
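
The speaker-list step is the least obvious part of this function, so here is a small Python sketch of just that piece: putting "and" before the last comma-separated name and building the opening sentence. The regular expression mirrors the one in emacsconf-pad-expand-intro; the function names and the sample names are made up for the example.

```python
import re


def join_speakers(speakers):
    """Insert 'and' before the last comma-separated name: 'A, B, C' -> 'A, B, and C'."""
    return re.sub(r", ([^,]+)$", r", and \1", speakers)


def intro(title, speakers=None, qa_sentence=""):
    """Build the opening sentence for a talk; qa_sentence is appended as-is."""
    if not speakers:
        return 'Next, we have "%s".' % title
    return 'Next, we have "%s", by %s.%s' % (title, join_speakers(speakers), qa_sentence)


print(intro("Saturday opening remarks"))
print(intro("Example talk", "Alice, Bob, Carol", " They will answer questions via Etherpad."))
```

A single name, or a list with no comma, passes through unchanged because the pattern only matches the last comma in the string.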

And from emacsconf-subed.el:

emacsconf-subed-intro-subtitles: Create the introduction as subtitles.
(defun emacsconf-subed-intro-subtitles ()
  "Create the introduction as subtitles."
  (let ((emacsconf-publishing-phase 'conference))
    (mapc
     (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
     ;; One placeholder cue per talk: (ID START-MS STOP-MS TEXT).
     ;; Assumption: the talks are numbered with `seq-map-indexed'.
     (seq-map-indexed
      (lambda (talk i)
        (list
         nil
         (* i 5000)
         (1- (* i 5000))
         (format "#+OUTPUT: %s.webm\n[[file:%s]]\n%s"
                 (plist-get talk :slug)
                 (expand-file-name
                  (concat (plist-get talk :slug) ".svg.png")
                  (expand-file-name "in-between" emacsconf-stream-asset-dir))
                 (emacsconf-pad-expand-intro talk))))
      (emacsconf-publish-prepare-for-display (emacsconf-get-talk-info))))))

