Org Mode: Insert YouTube video with separate captions

| emacs

I’m playing around with some ideas for making it easier to post a video with its captions on a webpage or in an Org file so that it’s easier to skim or search.

This requires the youtube-dl command. I’m also learning how to use dash.el’s threading macro, so you’ll need to install that as well if you want to run it.

(require 'dash)
(require 'dom)
(require 'xml)

(defun my/msecs-to-timestamp (msecs)
  "Convert MSECS to string in the format HH:MM:SS.MS."
  (concat (format-seconds "%02h:%02m:%02s" (/ msecs 1000))
          "." (format "%03d" (mod msecs 1000))))

(defun my/org-insert-youtube-video-with-transcript (url)
  "Insert an embedded player for the YouTube video at URL, plus a table of its captions."
  (interactive "MURL: ")
  (let* ((id (if (string-match "v=\\([^&]+\\)" url) (match-string 1 url) url))
         (temp-file (make-temp-name "org-youtube-"))
         (temp-file-name (concat temp-file ".en.srv1")))
    (when (and (call-process "youtube-dl" nil nil nil
                             "--write-sub" "--write-auto-sub"  "--no-warnings" "--sub-lang" "en" "--skip-download" "--sub-format" "srv1"
                             "-o" temp-file
                             (format "https://www.youtube.com/watch?v=%s" id))
               (file-exists-p temp-file-name))
      (insert
       (format "#+begin_export html
<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/%s\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>\n#+end_export\n" id)
       "\n"
       ;; One table row per caption: a timestamp linking to that position, then the text.
       (mapconcat (lambda (o)
                    (format "| [[https://www.youtube.com/watch?v=%s&t=%ss][%s]] | %s |\n"
                            id
                            (dom-attr o 'start)
                            (my/msecs-to-timestamp (* 1000 (string-to-number (dom-attr o 'start))))
                            (->> (dom-text o)
                                 (replace-regexp-in-string "[ \n]+" " ")
                                 (replace-regexp-in-string "&#39;" "'")
                                 (replace-regexp-in-string "&quot;" "\""))))
                  (dom-by-tag (xml-parse-file temp-file-name) 'text)
                  ""))
      (delete-file temp-file-name))))

It makes an embedded YouTube video and a table with captions below it. The Org file doesn’t look too bad, either.


I decided to stick to standard Org syntax so that I can read it in Emacs too. With the current implementation, clicking on a timestamp jumps to that position in the video, but on the YouTube website rather than in the embedded player. I haven’t coded anything fancy like keeping the embedded video at a fixed position, controlling it from the clicks, or highlighting the current position. It’s a start, though!
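Here is a rough sketch of how a timestamp link like that can be built. YouTube watch URLs accept a `t` parameter with the start offset in seconds. The name `my/sketch-youtube-link-at` is made up for illustration and is not part of the function above.

```elisp
;; Hypothetical helper: build an Org link that opens video ID
;; on the YouTube website, starting at SECONDS.
(defun my/sketch-youtube-link-at (id seconds label)
  "Return an Org link to YouTube video ID starting at SECONDS, shown as LABEL."
  (format "[[https://www.youtube.com/watch?v=%s&t=%ds][%s]]"
          id (floor seconds) label))

(my/sketch-youtube-link-at "VIDEO_ID" 4.2 "00:00:04.200")
;; => "[[https://www.youtube.com/watch?v=VIDEO_ID&t=4s][00:00:04.200]]"
```

Rounding down to whole seconds keeps the URL simple, at the cost of starting playback slightly before the caption.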

Here’s the output of running it with my talk from the last EmacsConf.

00:00:00.000 I’m Sacha Chua, and welcome to EmacsConf 2020.
00:00:04.000 To kick things off, here are ten cool things
00:00:07.000 that people have been working on
00:00:08.000 since the conference last year.
00:00:10.000 If you want to follow the links
00:00:11.000 or if you’d like to add something I’ve missed,
00:00:14.000 add them to the collaborative pad
00:00:16.000 if you’re watching this live
00:00:17.000 or check out the EmacsConf wiki page for this talk.

… (omitted for brevity)

This is part of my Emacs configuration.

2021-03-29 Emacs news

| emacs, emacs-news

Links from r/orgmode, r/spacemacs, r/planetemacs, Hacker News, YouTube, the Emacs NEWS file and emacs-devel.


Weekly review: Week ending March 26, 2021

| review, weekly
  • Emacs:
    • I added the ability to expand all, collapse all, and toggle visibility of headings in my exported HTML.
    • I learned how to use org-special-blocks’ defblock for Org special blocks.
    • I added a preview to my consult function for reading a sketch filename.
  • Other tech:
    • I removed the search form and sidebar from my blog. I added night mode.
    • It turns out my laptop has 8 GB of RAM. That might have something to do with the CPU load with OBS – maybe it’s swapping? I looked at the prices for 16GB RAM kits, but I might still want to upgrade my laptop in order to have more screen choices and CPU power.
    • I started looking into the GEDCOM export from Geni, and I noticed that it was incomplete. I may have to manually re-enter the ones that other people put in.
  • Gardening:
    • W- and I moved the garden cage to the deck. I planted peas, calendula, and radishes in the planter outside. I started a small pot of pumpkin seeds.
    • I made sliding doors based on the LEGO Technic idea book we borrowed from the library.
  • Drawing:
    • I tried sketching in both Concepts and Procreate. They’re both nice.
    • I practised sketching plants and insects following “Illustration School: Let’s Draw Plants and Small Creatures.”
    • I modified my sketch viewer to handle SVGs. I figured out how to sketch with a dark background and change the colours for posting.
  • I sewed two long dresses and a bonnet for A-. I also sewed a pair of pajama pants for myself.

Blog posts



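| Category                   | The other week % | Last week % | Diff % | h/wk | Diff h/wk |
|----------------------------+------------------+-------------+--------+------+-----------|
| A-                         |             41.7 |        48.3 |    6.6 | 81.6 |      11.1 |
| Discretionary – Play       |              0.0 |         1.2 |    1.2 |  2.0 |       2.0 |
| Unpaid work                |              2.4 |         3.2 |    0.8 |  5.4 |       1.3 |
| Discretionary – Family     |              0.0 |         0.2 |    0.2 |  0.3 |       0.3 |
| Discretionary – Productive |             11.6 |        10.9 |   -0.7 | 18.4 |      -1.2 |
| Personal                   |              5.6 |         4.2 |   -1.3 |  7.1 |      -2.2 |
| Sleep                      |             38.8 |        32.0 |   -6.7 | 54.1 |     -11.3 |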

Add a note to the bottom of blog posts exported from my config file

| emacs, org

Update: 2021-04-18: Tweaked the code so that I could add it to the main org-export-filter-body-functions list now that I'm using Eleventy and ox-11ty.el instead of WordPress and org2blog.

I occasionally post snippets from my Emacs configuration file, drafting the notes directly in my literate config and posting them via org2blog. I figured it might be a good idea to include a link to my config at the end of the posts, but I didn't want to scatter redundant links in my config file itself. Wouldn't it be cool if the link could be automatically added whenever I use org2blog to post a subtree from my config file? I think the code below accomplishes that.

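(defun my/org-export-filter-body-add-emacs-configuration-link (string backend info)
  "Append a link to my Emacs configuration to STRING when exporting from my config.
BACKEND is unused; INFO is the export plist."
  (when (and (plist-get info :input-file) (string-match "\\.emacs\\.d/Sacha\\.org" (plist-get info :input-file)))
    (concat string
            (let ((id (org-entry-get-with-inheritance "CUSTOM_ID")))
              ;; %s becomes "#CUSTOM_ID" when the subtree has one, so the link
              ;; can point at the matching section.
              (format
               "\n<div class=\"note\">This is part of my <a href=\"%s\">Emacs configuration.</a></div>"
               (if id (concat "#" id) ""))))))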

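(use-package org
  :config
  ;; The filter list is defined in ox.el, so wait until it has loaded.
  (with-eval-after-load 'ox
    (add-to-list 'org-export-filter-body-functions #'my/org-export-filter-body-add-emacs-configuration-link)))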
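To show the shape of a body filter without running a full export, here is a minimal sketch that can be called by hand. Both `my/sketch-note-file-regexp` and `my/sketch-append-note-filter` are made-up names, and the note text is a placeholder rather than what my config inserts. A filter receives the exported body, the backend, and the export plist; returning nil leaves the body unchanged.

```elisp
;; Hypothetical names throughout; this only illustrates the filter contract.
(defvar my/sketch-note-file-regexp "config\\.org\\'"
  "Regexp matching the file whose exports should get a note.")

(defun my/sketch-append-note-filter (body _backend info)
  "Append a note to BODY when INFO says the export came from a matching file.
Return nil otherwise, which tells Org to leave BODY unchanged."
  (let ((file (plist-get info :input-file)))
    (when (and file (string-match-p my/sketch-note-file-regexp file))
      (concat body "\n<div class=\"note\">Exported from my config.</div>"))))

;; Trying it by hand with a fake export plist:
(my/sketch-append-note-filter "<p>Hi</p>" 'html '(:input-file "/tmp/config.org"))
;; => "<p>Hi</p>\n<div class=\"note\">Exported from my config.</div>"
(my/sketch-append-note-filter "<p>Hi</p>" 'html '(:input-file "/tmp/other.org"))
;; => nil
```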
This is part of my Emacs configuration.

2021-03-22 Emacs news

| emacs, emacs-news

Links from r/orgmode, r/spacemacs, r/planetemacs, Hacker News, YouTube, the Emacs NEWS file and emacs-devel.


Weekly review: Week ending March 19, 2021

| review, weekly
  • I pulled the other meetups’ iCal feeds in automatically.
  • I added announcement timers for upcoming Emacs events.
  • I listened to the EmacsSF meetup on retro Emacs.
  • I experimented with fast ffmpeg cuts in Emacs, compensating for the distance between keyframes.
  • I submitted some tests for subed.el.
  • I tried streaming again, but I think OBS was taking up too much CPU and it made my computer a little too unresponsive. I’ll try streaming from OBS to Twitch instead of using ffmpeg to multicast next time.
  • I wrote about my word-level timing code.
  • I edited a few more subtitles.

Blog posts



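| Category                   | The other week % | Last week % | Diff % | h/wk | Diff h/wk |
|----------------------------+------------------+-------------+--------+------+-----------|
| Sleep                      |             33.7 |        38.8 |    5.1 | 64.7 |       8.5 |
| A-                         |             41.2 |        41.7 |    0.4 | 69.6 |       0.7 |
| Personal                   |              5.3 |         5.6 |    0.2 |  9.3 |       0.4 |
| Business                   |              1.4 |         0.0 |   -1.4 |  0.0 |      -2.3 |
| Discretionary – Productive |             13.7 |        11.6 |   -2.1 | 19.4 |      -3.5 |
| Unpaid work                |              4.7 |         2.4 |   -2.3 |  4.0 |      -3.9 |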

Using word-level timing information when editing subtitles or captions in Emacs

| emacs

I like to split captions at logical points, such as at the end of a phrase or sentence. At first, I used subed.el to play the video for the caption, pausing it at the appropriate point and then calling subed-split-subtitle to split at the playback position. Then I modified subed-split-subtitle to split at the video position that’s proportional to the text position, so that it’s roughly in the right spot even if I’m not currently listening. That got me most of the way to being able to quickly edit subtitles.
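To make the proportional idea concrete, here is a minimal sketch of the arithmetic. `my/sketch-proportional-ms` is a made-up name for illustration, not the actual change I made to subed-split-subtitle, and it assumes the caption text is not empty.

```elisp
;; Hypothetical helper: estimate a split time from the position in the text.
(defun my/sketch-proportional-ms (start-ms end-ms text-pos text-length)
  "Estimate the time TEXT-POS characters into a caption of TEXT-LENGTH characters.
The caption runs from START-MS to END-MS, and TEXT-LENGTH must be positive."
  (+ start-ms
     (round (* (- end-ms start-ms)
               (/ (float text-pos) text-length)))))

;; A caption from 10s to 14s, split 15 characters into 60 characters of text:
(my/sketch-proportional-ms 10000 14000 15 60) ; => 11000
```

This assumes speech is spread evenly over the caption, which is why it only lands roughly in the right spot.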

It turns out that word-level timing is actually available from YouTube if I download the autogenerated SRV2 file using youtube-dl, which I can do with the following function:

(defun my/caption-download-srv2 (id)
  "Download the autogenerated English SRV2 captions for video ID and load them."
  (interactive "MID: ")
  (when (string-match "v=\\([^&]+\\)" id) (setq id (match-string 1 id)))
  (call-process "youtube-dl" nil nil nil "--write-auto-sub" "--sub-lang" "en" "--skip-download" "--sub-format" "srv2"
                (concat "https://www.youtube.com/watch?v=" id))
  (my/caption-load-word-data (my/latest-file "." "\\.srv2\\'")))

I started parsing JSON files, but SRV2 seemed to be more reliably available, so here are the parsing functions for both. I also fix common recognition errors along the way, using the my/subed-common-edits variable defined in my config for subtitles. To fix those in the VTT file I’m editing, I use my/subed-fix-common-errors, also defined elsewhere.

(defvar-local my/caption-cache nil "Word-level timing: a list of entries of the form ((start . ms) (end . ms) (text . string)).")
(defun my/caption-json-time-to-ms (json)
  (+ (* 1000 (string-to-number (alist-get 'seconds json)))
     (/ (alist-get 'nanos json) 1000000)))

(defun my/caption-extract-words-from-json3 ()
  "Return word-level timing parsed from the JSON in the current buffer."
  (let* ((data (progn (goto-char (point-min)) (json-read)))
         (json3-p (alist-get 'events data))
         (reversed (reverse
                    (or (alist-get 'events data)
                        (cl-loop for seg in (car (alist-get 'results data))
                                 nconc (alist-get 'words (car (alist-get 'alternatives seg)))))))
         (last-event (seq-first reversed))
         (last-ms (if json3-p
                      (+ (alist-get 'tStartMs last-event)
                         (alist-get 'dDurationMs last-event)))))
    ;; Walk backwards so that each word can end where the following word starts.
    (reverse
     (cl-loop for e across reversed append
              (if json3-p
                  (mapcar
                   (lambda (seg)
                     (let ((rec
                            `((start . ,(+ (alist-get 'tStartMs e)
                                           (or (alist-get 'tOffsetMs seg) 0)))
                              (end . ,(min last-ms
                                           (+ (alist-get 'tStartMs e)
                                              (or (alist-get 'dDurationMs e) 0))))
                              (text . ,(alist-get 'utf8 seg)))))
                       (setq last-ms (alist-get 'start rec))
                       rec))
                   (reverse (alist-get 'segs e)))
                (list
                 `((start . ,(my/caption-json-time-to-ms (alist-get 'startTime e)))
                   (end . ,(my/caption-json-time-to-ms (alist-get 'endTime e)))
                   (text . ,(alist-get 'word e)))))))))

(defun my/caption-extract-words-from-srv2 ()
  "Return word-level timing parsed from the SRV2 XML in the current buffer."
  (let* ((data (xml-parse-region))
         (text-elements (reverse (dom-by-tag data 'text)))
         (last-start (+ (string-to-number
                         (alist-get 't (xml-node-attributes (car text-elements))))
                        (string-to-number (alist-get 'd (xml-node-attributes (car text-elements)))))))
    ;; Walk backwards so that each word ends where the following word starts.
    (reverse
     (mapcar #'(lambda (element)
                 (let ((rec (list (cons 'start (string-to-number (alist-get 't (xml-node-attributes element))))
                                  (cons 'end last-start)
                                  (cons 'text (car (xml-node-children element))))))
                   (setq last-start (alist-get 'start rec))
                   rec))
             text-elements))))

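(require 'map)
(defun my/caption-fix-common-errors (data)
  "Apply `my/subed-common-edits' to the text of each entry in DATA and return DATA."
  (mapc (lambda (o)
          (mapc (lambda (e)
                  (when (string-match (concat "\\<" (car e) "\\>") (alist-get 'text o))
                    (map-put! o 'text (replace-match (cadr e) t t (alist-get 'text o)))))
                my/subed-common-edits))
        data))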

(defun my/caption-load-word-data (file)
  "Load word-level timing from FILE."
  (interactive "fFile: ")
  (let (data)
    (with-current-buffer (find-file-noselect file)
      (cond
       ((string-match "\\.json" file)
        (setq data (my/caption-extract-words-from-json3)))
       ((string-match "\\.srv2\\'" file)
        (setq data (my/caption-extract-words-from-srv2)))
       (t (error "Unknown format."))))
    ;; Set in the buffer this command was called from, not in FILE's buffer.
    (setq-local my/caption-cache
                (mapcar (lambda (entry)
                          (setf (alist-get 'text entry)
                                (replace-regexp-in-string "&#39;" "'" (alist-get 'text entry)))
                          entry)
                        (my/caption-fix-common-errors data)))))
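Here is a small, self-contained sketch of the "each word ends where the next one starts" idea used by the SRV2 parser above. The sample XML is made up, modelled only on the `t` (start) and `d` (duration) attributes that the code reads, and `my/sketch-srv2-words` is a hypothetical name.

```elisp
(require 'xml)
(require 'dom)

;; Hypothetical helper working on a string instead of a buffer.
(defun my/sketch-srv2-words (xml-string)
  "Parse XML-STRING and return a list of ((start . ms) (end . ms) (text . word))."
  (let ((nodes (with-temp-buffer
                 (insert xml-string)
                 (dom-by-tag (xml-parse-region (point-min) (point-max)) 'text)))
        result
        next-start)
    ;; Walk backwards so each word can end where the following word starts.
    (dolist (node (reverse nodes))
      (let ((start (string-to-number (dom-attr node 't)))
            (duration (string-to-number (or (dom-attr node 'd) "0"))))
        (push `((start . ,start)
                (end . ,(or next-start (+ start duration)))
                (text . ,(car (dom-children node))))
              result)
        (setq next-start start)))
    result))

(my/sketch-srv2-words
 "<timedtext><text t=\"0\" d=\"900\">hello</text><text t=\"400\" d=\"500\">world</text></timedtext>")
;; => (((start . 0) (end . 400) (text . "hello"))
;;     ((start . 400) (end . 900) (text . "world")))
```

Only the last word uses its own duration. Every earlier word is clipped to the start of the word after it, so overlapping durations don't produce overlapping words.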

Assuming I start editing from the beginning of the file, then the part of the captions file after point is mostly unedited. That means I can match the remainder of the current caption with the word-level timing to try to figure out the time to use when splitting the subtitle, falling back to the proportional method if the data is not available.

(defun my/caption-look-up-word ()
  "Return the word-level timing entry that best matches the word after point."
  (save-excursion
    (let* ((end (subed-subtitle-msecs-stop))
           (start (subed-subtitle-msecs-start))
           (remaining-words (split-string (buffer-substring (point) (subed-jump-to-subtitle-end))))
           (words (reverse (seq-filter (lambda (o)
                                         (and (<= (alist-get 'end o) end)
                                              (>= (alist-get 'start o) start)
                                              (not (string-match "^\n*$" (alist-get 'text o)))))
                                       my/caption-cache)))
           (offset 0)
           candidate done)
      ;; Count back from the end of the caption, then try nearby positions
      ;; (0, 1, -1, 2, -2, ...) until the word matches.
      (while (not done)
        (setq candidate (elt words (+ (1- (length remaining-words)) offset)))
        (cond
         ((and candidate (string-match (concat "\\<" (car remaining-words) "\\>") (alist-get 'text candidate)))
          (setq done t))
         ((> offset (length words)) (setq done t))
         ((> offset 0) (setq offset (- offset)))
         (t (setq offset (1+ (- offset))))))
      candidate)))

(defun my/caption-unwrap ()
  "Replace the newlines between point and the end of the subtitle with spaces."
  (interactive)
  (let ((limit (save-excursion (or (subed-jump-to-subtitle-end) (point)))))
    (while (re-search-forward "\n" limit t)
      (replace-match " "))))
(defun my/caption-split ()
  "Split the current subtitle based on word-level timing if available."
  (interactive)
  (save-excursion
    (let ((data (my/caption-look-up-word)))
      (prin1 data)
      (subed-split-subtitle (and data (- (alist-get 'start data) (subed-subtitle-msecs-start)))))))
;; Bodies of the next two commands are not shown here.
(defun my/caption-split-and-merge-with-next ()
  (interactive))
(defun my/caption-split-and-merge-with-previous ()
  (interactive))
(use-package subed
  :if my/laptop-p
  :load-path "~/vendor/subed/subed"
  :bind
  (:map subed-mode-map
        ("M-'" . my/caption-split)
        ("M-," . my/caption-split-and-merge-with-previous)
        ("M-q" . my/caption-unwrap)
        ("M-." . my/caption-split-and-merge-with-next)))

That way, I can use the word-level timing information for most of the reformatting, but I can easily replay segments of the video if I’m unsure about a word that needs to be changed.
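The word lookup doesn't assume the expected position is exactly right. It starts there and then checks neighbouring words, alternating forward and backward. This sketch just lists the offsets in the order they would be tried; `my/sketch-offset-order` is a made-up name for illustration.

```elisp
;; Hypothetical helper: list the offsets an alternating search visits.
(defun my/sketch-offset-order (steps)
  "Return the first STEPS offsets tried when searching around an expected position."
  (let ((offset 0) result)
    (dotimes (_ steps)
      (push offset result)
      ;; A positive offset flips to its negative; otherwise move one step further out.
      (setq offset (if (> offset 0)
                       (- offset)
                     (1+ (- offset)))))
    (nreverse result)))

(my/sketch-offset-order 7) ; => (0 1 -1 2 -2 3 -3)
```

Checking the closest positions first means a small mismatch between the edited text and the recognized words is usually resolved within a step or two.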

If I want to generate a VTT based on the caption data, breaking it at certain words, these functions help:

(defvar my/caption-breaks
  '("the" "this" "we" "we're" "I" "finally" "but" "and" "when")
  "List of words to try to break at.")
(defun my/caption-make-groups (list)
  "Group the word entries in LIST into captions of roughly 30 to 70 characters.
Prefer to start a new caption at one of the words in `my/caption-breaks'."
  (let (result
        current-item
        done
        (current-length 0)
        (limit 70)
        (lower-limit 30)
        (break-regexp (concat "\\<" (regexp-opt my/caption-breaks) "\\>")))
    (while list
      (cond
       ((null (car list)))
       ((string-match "^\n*$" (alist-get 'text (car list)))
        (push (cons '(text . " ") (car list)) current-item)
        (setq current-length (1+ current-length)))
       ((< (+ current-length (length (alist-get 'text (car list)))) limit)
        (setq current-item (cons (car list) current-item)
              current-length (+ current-length (length (alist-get 'text (car list))) 1)))
       (t (setq done nil)
          ;; The caption is full: back up to a break word if there is one
          ;; before the caption gets too short.
          (while (not done)
            (cond
             ((< current-length lower-limit)
              (setq done t))
             ((and (string-match break-regexp (alist-get 'text (car current-item)))
                   (not (string-match break-regexp (alist-get 'text (cadr current-item)))))
              (setq current-length (- current-length (length (alist-get 'text (car current-item)))))
              (push (pop current-item) list)
              (setq done t))
             (t
              (setq current-length (- current-length (length (alist-get 'text (car current-item)))))
              (push (pop current-item) list))))
          (push nil list)
          (setq result (cons (reverse current-item) result) current-item nil current-length 0)))
      (setq list (cdr list)))
    ;; Keep the words left over after the last full caption.
    (when current-item
      (push (reverse current-item) result))
    (reverse result)))

(defun my/caption-format-as-subtitle (list &optional word-timing)
  "Turn a LIST of the form (((start . ms) (end . ms) (text . s)) ...) into VTT.
If WORD-TIMING is non-nil, include word-level timestamps."
  (format "%s --> %s\n%s\n\n"
          (subed-vtt--msecs-to-timestamp (alist-get 'start (car list)))
          (subed-vtt--msecs-to-timestamp (alist-get 'end (car (last list))))
          (s-trim (mapconcat (lambda (entry)
                               (if word-timing
                                   (format " <%s>%s"
                                           (subed-vtt--msecs-to-timestamp (alist-get 'start entry))
                                           (string-trim (alist-get 'text entry)))
                                 (alist-get 'text entry)))
                             list ""))))

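(defun my/caption-to-vtt (&optional data)
  "Write DATA (or the cached word-level timing) to captions.vtt as grouped cues."
  (interactive)
  ;; Read the buffer-local cache before `with-temp-file' switches buffers.
  (let ((words (or data (my/caption-fix-common-errors my/caption-cache))))
    (with-temp-file "captions.vtt"
      (insert "WEBVTT\n\n"
              (mapconcat
               (lambda (entry) (my/caption-format-as-subtitle entry))
               (my/caption-make-groups words)
               "")))))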
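For a feel of the whole pipeline without subed installed, here is a simplified, self-contained sketch: it groups words greedily by length and formats each group as a VTT cue. All of the `my/sketch-` names are made up for illustration, and the grouping skips the break-word backtracking that my/caption-make-groups does.

```elisp
;; Hypothetical helpers; a simplified stand-in for the functions above.
(defun my/sketch-vtt-timestamp (msecs)
  "Format integer MSECS as HH:MM:SS.mmm."
  (format "%02d:%02d:%02d.%03d"
          (/ msecs 3600000)
          (% (/ msecs 60000) 60)
          (% (/ msecs 1000) 60)
          (% msecs 1000)))

(defun my/sketch-group-words (words limit)
  "Greedily group WORDS into lists whose text stays under LIMIT characters.
WORDS is a list of ((start . ms) (end . ms) (text . string))."
  (let (groups current (len 0))
    (dolist (word words)
      (let ((word-length (length (alist-get 'text word))))
        ;; Start a new group when this word would not fit.
        (when (and current (>= (+ len word-length) limit))
          (push (nreverse current) groups)
          (setq current nil len 0))
        (push word current)
        (setq len (+ len word-length 1))))
    (when current (push (nreverse current) groups))
    (nreverse groups)))

(defun my/sketch-vtt-cue (group)
  "Format GROUP, a list of word entries, as one VTT cue."
  (format "%s --> %s\n%s\n\n"
          (my/sketch-vtt-timestamp (alist-get 'start (car group)))
          (my/sketch-vtt-timestamp (alist-get 'end (car (last group))))
          (mapconcat (lambda (w) (alist-get 'text w)) group " ")))

(defun my/sketch-words-to-vtt (words limit)
  "Return WORDS as a VTT string with cues of fewer than LIMIT characters."
  (concat "WEBVTT\n\n"
          (mapconcat #'my/sketch-vtt-cue (my/sketch-group-words words limit) "")))

(my/sketch-words-to-vtt
 '(((start . 0) (end . 400) (text . "hello"))
   ((start . 400) (end . 900) (text . "world"))
   ((start . 900) (end . 1500) (text . "again")))
 12)
;; => "WEBVTT\n\n00:00:00.000 --> 00:00:00.900\nhello world\n\n00:00:00.900 --> 00:00:01.500\nagain\n\n"
```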