Using word-level timing information when editing subtitles or captions in Emacs

| emacs

2022-10-26: Merged word-level timing support into subed.el, so I don't need my old caption functions.

2022-04-18: Switched to using yt-dlp.

I like to split captions at logical points, such as at the end of a phrase or sentence. At first, I used subed.el to play the video for the caption, pausing it at the appropriate point and then calling subed-split-subtitle to split at the playback position. Then I modified subed-split-subtitle to split at the video position that's proportional to the text position, so that it's roughly in the right spot even if I'm not currently listening. That got me most of the way to being able to quickly edit subtitles.

It turns out that word-level timing is actually available from YouTube if I download the autogenerated SRV2 file using yt-dlp, which I can do with the following function:

(defun my-caption-download-srv2 (id)
  (interactive "MID: ")
  (require 'subed-word-data)
  (when (string-match "v=\\([^&]+\\)" id) (setq id (match-string 1 id)))
  (let ((default-directory "/tmp"))
    (call-process "yt-dlp" nil nil nil "--write-auto-sub" "--write-sub" "--no-warnings" "--sub-lang" "en" "--skip-download" "--sub-format" "srv2"
                  (concat "" id))
    (subed-word-data-load-from-file (my-latest-file "/tmp" "\\.srv2\\'"))))

2022-10-26: I can also generate a SRV2-ish file using torchaudio, which I can then load with subed-word-data-load-from-file.

(defun my-caption-fix-common-errors (data)
  (mapc (lambda (o)
          (mapc (lambda (e)
                  (when (string-match (concat "\\<" (regexp-opt (if (listp e) (seq-remove (lambda (s) (string= "" s)) e)
                                                                  (list e)))
                                      (alist-get 'text o))
                    (map-put! o 'text (replace-match (car (if (listp e) e (list e))) t t (alist-get 'text o)))))

Assuming I start editing from the beginning of the file, then the part of the captions file after point is mostly unedited. That means I can match the remainder of the current caption with the word-level timing to try to figure out the time to use when splitting the subtitle, falling back to the proportional method if the data is not available.

(defun subed-avy-set-up-actions ()
  (make-local-variable 'avy-dispatch-alist)
   (cons ?, 'subed-split-subtitle)))

(use-package subed
  :if my-laptop-p
  :load-path "~/vendor/subed/subed"
  (subed-mode . display-fill-column-indicator-mode)
  (subed-mode . subed-avy-set-up-actions)
  (:map subed-mode-map
        ("M-," . subed-split-subtitle)
        ("M-." . subed-merge-with-next)
        ("M-p" . avy-goto-char-timer)
        ("M-e" . avy-goto-char-timer)))

That way, I can use the word-level timing information for most of the reformatting, but I can easily replay segments of the video if I'm unsure about a word that needs to be changed.

If I want to generate a VTT based on the caption data, breaking it at certain words, these functions help:

(defvar my-caption-breaks
  '("the" "this" "we" "we're" "I" "finally" "but" "and" "when")
  "List of words to try to break at.")
(defun my-caption-make-groups (list &optional threshold)
  (let (result
        (current-length 0)
        (limit (or threshold 70))
        (lower-limit 30)
        (break-regexp (concat "\\<" (regexp-opt my-caption-breaks) "\\>")))
    (while list
       ((null (car list)))
       ((string-match "^\n*$" (alist-get 'text (car list)))
        (push (cons '(text . " ") (car list)) current-item)
        (setq current-length (1+ current-length)))
       ((< (+ current-length (length (alist-get 'text (car list)))) limit)
        (setq current-item (cons (car list) current-item)
              current-length (+ current-length (length (alist-get 'text (car list))) 1)))
       (t (setq done nil)
          (while (not done)
           ((< current-length lower-limit)
            (setq done t))
           ((and (string-match break-regexp (alist-get 'text (car current-item)))
                 (not (string-match break-regexp (alist-get 'text (cadr current-item)))))
            (setq current-length (- current-length (length (alist-get 'text (car current-item)))))
            (push (pop current-item) list)
            (setq done t))
            (setq current-length (- current-length (length (alist-get 'text (car current-item)))))
            (push (pop current-item) list))))
          (push nil list)
          (setq result (cons (reverse current-item) result) current-item nil current-length 0)))
      (setq list (cdr list)))
    (reverse result)))

(defun my-caption-format-as-subtitle (list &optional word-timing)
  "Turn a LIST of the form (((start . ms) (end . ms) (text . s)) ...) into VTT.
If WORD-TIMING is non-nil, include word-level timestamps."
  (format "%s --> %s\n%s\n\n"
          (subed-vtt--msecs-to-timestamp (alist-get 'start (car list)))
          (subed-vtt--msecs-to-timestamp (alist-get 'end (car (last list))))
          (s-trim (mapconcat (lambda (entry)
                               (if word-timing
                                   (format " <%s>%s"
                                           (subed-vtt--msecs-to-timestamp (alist-get 'start entry))
                                           (string-trim (alist-get 'text entry)))
                                 (alist-get 'text entry)))
                             list ""))))

(defun my-caption-to-vtt (&optional data)
  (with-temp-file "captions.vtt"
    (insert "WEBVTT\n\n"
             (lambda (entry) (my-caption-format-as-subtitle entry))
              (or data (my-caption-fix-common-errors subed-word-data--cache)))
This is part of my Emacs configuration.
You can comment with Disqus or you can e-mail me at