
Audio braindump workflow tweaks: Adding Org Mode hyperlinks to recordings based on keywords

| speech, speechtotext, emacs

[2023-12-24 Sun] Added a quick video!

Summary sketch

Audio recording is handy for capturing thoughts as I wait, walk around, or do chores. But my wireless earbuds don't have a good mic, I rarely get around to reviewing the wall of text, and I don't trust speech recognition to catch all my words.

Here's a new brain-dumping workflow that I've been experimenting with, though. I use a lapel mic to record on my phone. Google Recorder gives me an audio file as well as a rough transcript right away.

Animated GIF showing Google Recorder's real-time transcript
braindump-recording.gif

I copy those with Syncthing.

If I use keywords like "start" or "stop" along with things like "topic", "reminder", or "summary", then I can put those on separate lines automatically (my-transcript-prepare-alignment-breaks).

...
News. Miscellaneous little tasks that he doing. I do want to
finish that blog post about the playlist Just so that it's out.
Something else that people can, you know, refer to or that I can refer
to. Uh, And at some point I want to think about, This second brain
stuff.
So, right now, What's my current state? Uh,
START CHAPTER second brain STOP CHAPTER
Right now, I dumped everything into originally. In my inbox, if I come
across an interesting website. As usually in my phone. So then I share
it. As. Something links at those or four none. Uh, into my inbox.
...

I use subed-align to get the timestamps, and add the headings.

00:20:18.680 --> 00:20:24.679
So, right now, What's my current state? Uh,

NOTE CHAPTER: second brain

00:20:24.680 --> 00:20:30.719
START CHAPTER second brain STOP CHAPTER

I can then create an Org Mode TODO item with a quick hyperlinked summary as well as my transcript.

braindump-summary.svg
Figure 1: Summary with headings and links

I can jump to the audio if there are misrecognized words.

Screenshot of jumping to the audio
braindump-vtt.png
Figure 2: Following the link to the chapter in the VTT file

I can use subed-waveform to tweak the start and end times. (subed-waveform-show-current, then left-clicking to set the start or right-clicking to set the end, or using keybindings to adjust the start/stop).

Someday I'll write code to send sections to a better speech recognition engine or to AI. In the meantime, this is pretty good.

Here's how the code works:

Recognizing keyword phrases

There are several things I want to do while dictating.

  • I want to mark different topics so that it's easy to find the section where I was talking about something.
  • I might want to set tags or priorities, or even schedule something (today, tomorrow, next week, next month).
  • I can also use commands to trigger different things, like sending the section to a better speech recognition engine.

By analyzing the text, I might be able to make my own command system.

So far, for starting keywords, I can use "start", "begin", or "open". I pair that with one of these part keywords:

  • "section", "chapter", "topic", "summary": I use these pretty interchangeably at the moment. I want them to make a new Org heading.
  • "next steps": could be handy for being able to quickly see what to do next
  • "reminder":
  • "interruption": don't know what I'll use this for yet, but it might be useful to note this.
  • "tag", "keyword": maybe use this to add tags to the current section?

Then the code can extract the text until the matching "stop/close/end <part>", assuming it happens within 50 words or so. (my-transcript-close-keyword-distance-words)

Sometimes keywords get misrecognized. "Begin summary" sometimes becomes "again summary" or "the game summary". I could try "open" and "close". Commercial dictation programs like Dragon NaturallySpeaking use "open" and "close" for punctuation, so that would probably work fine. "Start" works well, but "end" doesn't because it can be confused with "and".

Sometimes an extra word sneaks in, either because I say it or because the speech recognition tries too hard to guess. "Begin reminder" ends up as "Begin a reminder." I changed from using regular expressions that searched for just start-keyword + part-keyword to one that looked for the start of the keyword phrase and then looked for the next keyword within the next X words. (my-transcript-scan-for-part-keyword)

Recognizing phrases
(defvar my-transcript-open-keywords '("start" "begin" "open"))
(defvar my-transcript-close-keywords '("stop" "end" "close"))
(defvar my-transcript-part-keywords '("summary" "chapter" "topic" "section"
                                      "action" "idea" "journal" "reminder"
                                      "command" "interruption" "note"
                                      "next step" "next steps"
                                      "tags" "tag" "keywords" "keyword"))

(defvar my-transcript-part-keyword-distance-words 2 "Number of words to scan for part keyword.")
(defvar my-transcript-close-keyword-distance-words 50 "Number of words to scan for stop keyword.
Put the keywords on the same line if found.")
(defun my-transcript-scan-for-part-keyword (before-part &optional part-keywords within-distance before-distance)
  "Look for BEFORE-PART followed by PART-KEYWORDS.
There might be WITHIN-DISTANCE words between BEFORE-PART and PART-KEYWORDS,
and the pair might be within BEFORE-DISTANCE from point.
Distances are in words.
Return (start end before-part part) if found, nil otherwise."
  (setq before-part (pcase before-part
                      ('start my-transcript-open-keywords)
                      ('stop my-transcript-close-keywords)
                      ('nil (append my-transcript-open-keywords my-transcript-close-keywords))
                      (_ before-part)))
  (setq part-keywords (or part-keywords my-transcript-part-keywords))
  (when (stringp part-keywords) (setq part-keywords (list part-keywords)))
  (setq within-distance (or within-distance my-transcript-part-keyword-distance-words))
  (setq before-distance (if (eq before-distance t)
                            (point-max)
                          (or before-distance my-transcript-close-keyword-distance-words)))
  (let (result
        start end
        (before-point (save-excursion (forward-word before-distance) (point)))
        before-word
        part-word)
    (save-excursion
      (when (looking-at (regexp-opt before-part))
        (setq before-word (match-string 0) start (match-beginning 0))
        (when (re-search-forward (regexp-opt part-keywords) (save-excursion (forward-word within-distance) (point)) t)
          (setq result (list start (match-end 0) before-word (match-string 0)))))
      (while (and (not result)
                  (re-search-forward (regexp-opt before-part) before-point t))
        (setq before-word (match-string 0) start (match-beginning 0))
        (when (re-search-forward (regexp-opt part-keywords) (save-excursion (forward-word within-distance) (point)) t)
          (setq result (list start (match-end 0) before-word (match-string 0)))))
      (when result (goto-char (elt result 1)))
      result)))

(ert-deftest my-transcript-scan-for-part-keyword ()
  (with-temp-buffer
    (insert "some text start a reminder hello world stop there and do something stop reminder more text")
    (goto-char (point-min))
    (let ((result (my-transcript-scan-for-part-keyword 'start nil)))
      (expect (elt result 2) :to-equal "start")
      (expect (elt result 3) :to-equal "reminder"))
    (let ((result (my-transcript-scan-for-part-keyword 'stop "reminder")))
      (expect (elt result 2) :to-equal "stop")
      (expect (elt result 3) :to-equal "reminder"))))

Splitting the lines based on keywords and oopses

Now I can use that to scan through the text. I want to put commands on their own lines so that subed-align will get the timestamp for that segment and so that the commands are easier to parse.

I also want to detect "oops" and split things up so that the start of that line matches my correction after the "oops". I use my-split-oops for that, which I should write about in another post. By putting the oops fragment on its own line, I can use subed-align to get a timestamp for just that segment. Then I can use flush-lines to get rid of anything with "oops" in it, or even remove the subtitle and use subed-record-compile-media to compile audio/video without that segment if I want to use the audio without rerecording it.

And the way I can help is by jotting words down in a mind map,
typing her sentences. Oops
typing, her sentences And generating, follow-up questions.
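
my-split-oops itself isn't included in this post. As a rough sketch of the idea (hypothetical, simplified code rather than the real function), something like this puts each "oops" at the end of its own line and tries to break the previous line at the word that starts the correction; the real function is presumably smarter about picking the break point:

(defun my-split-oops-sketch ()
  "Rough, simplified sketch of the idea behind `my-split-oops'.
Put each \"oops\" at the end of its own line and try to start the
previous line near the word that begins the correction."
  (interactive)
  (let ((case-fold-search t))
    (goto-char (point-min))
    (while (re-search-forward "\\boops\\b[.,]?" nil t)
      (let ((oops-end (point)))
        ;; end the flubbed fragment right after "oops"
        (unless (eolp) (insert "\n"))
        ;; peek at the first word of the correction
        (let ((correction-word
               (save-excursion
                 (goto-char oops-end)
                 (skip-chars-forward " \n")
                 (when (looking-at "\\w+") (match-string 0)))))
          ;; if that word also appears in the flubbed fragment,
          ;; break before it so both lines start roughly the same way
          (when correction-word
            (save-excursion
              (goto-char oops-end)
              (beginning-of-line)
              (when (re-search-forward
                     (concat "\\b" (regexp-quote correction-word) "\\b")
                     oops-end t)
                (goto-char (match-beginning 0))
                (unless (bolp) (insert "\n"))))))))))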

I also all-caps the keyword phrases so that they're easier to see when skimming the text file.

Alignment breaks
(defun my-transcript-prepare-alignment-breaks ()
  "Split lines in preparation for forced alignment with aeneas.

Split \"oops\" so that it's at the end of the line and the
previous line starts with roughly the same words as the next
line, for easier removal.

Add a linebreak before \"begin/start\" followed by
`my-transcript-part-keywords'.

Add a linebreak after \"stop\" followed by
`my-transcript-part-keywords'.

Look for begin keyword ... stop keyword with at most
`my-transcript-part-keyword-distance-words' between them and put them on one
line."
  (interactive)
  (let ((case-fold-search t) result stop-result)
    (my-split-oops)
    ;; break "begin/start keyword"
    (goto-char (point-min))
    (while (setq result (my-transcript-scan-for-part-keyword 'start nil nil t))
      (goto-char (car result))
      (delete-region (car result) (elt result 1))
      (insert "\n" (upcase (concat (elt result 2) " " (elt result 3))) "\n"))
    ;; break stop
    (goto-char (point-min))
    (while (setq result (my-transcript-scan-for-part-keyword 'stop nil nil t))
      (goto-char (car result))
      (delete-region (car result) (elt result 1))
      (insert (upcase (concat (elt result 2) " " (elt result 3))) "\n"))
    ;; try to get start and end sections on one line
    (goto-char (point-min))
    (while (setq result (my-transcript-scan-for-part-keyword 'start nil nil t))
      (goto-char (elt result 1))
      (setq stop-result (my-transcript-scan-for-part-keyword 'stop (elt result 3)))
      (if stop-result
          (progn
            (goto-char (car stop-result))
            (while (re-search-backward " *\n+ *" (car result) t)
              (replace-match " ")))
        ;; no stop keyword; are we on an empty line? If so, just merge it with the next one
        (when (looking-at "\n+ *")
          (replace-match " "))))
    ;; remove empty lines
    (goto-char (point-min))
    (when (looking-at "\n+") (replace-match ""))
    (while (re-search-forward "\n\n+" nil t)
      (replace-match "\n"))
    (goto-char (point-min))
    (while (re-search-forward " *\n *" nil t)
      (replace-match "\n"))))

(ert-deftest my-transcript-prepare-alignment-breaks ()
  (with-temp-buffer
    (insert "some text start a reminder hello world stop there and do something stop reminder more text")
    (goto-char (point-min))
    (my-transcript-prepare-alignment-breaks)
    (expect (buffer-string) :to-equal
            "some text
START REMINDER hello world stop there and do something STOP REMINDER
more text")))

Preparing the VTT subtitles

subed-align gives me a VTT subtitle file with timestamps and text. I add NOTE comments with the keywords and make subed: links to the timestamps using the ol-subed.el that I just added.
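
ol-subed.el itself isn't shown here. As a hedged sketch (hypothetical code, not the actual library), a minimal subed: link type that opens the file and jumps to the timestamp in a "subed:FILE::TIMESTAMP" link could look something like this:

(require 'ol)

(defun my-subed-link-follow-sketch (path &optional _arg)
  "Follow a link of the form \"subed:FILE::TIMESTAMP\".
Open FILE and move point to the cue that starts at TIMESTAMP."
  (let* ((parts (split-string path "::"))
         (file (nth 0 parts))
         (timestamp (nth 1 parts)))
    (find-file file)
    (goto-char (point-min))
    (when timestamp
      (re-search-forward (regexp-quote timestamp) nil t)
      (beginning-of-line))))

(org-link-set-parameters "subed" :follow #'my-subed-link-follow-sketch)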

Putting keyword phrases in comments
(defun my-transcript-get-subtitle-note-based-on-keywords (sub-text)
  (let ((case-fold-search t))
    (when (string-match (concat "^"
                                (regexp-opt my-transcript-open-keywords)
                                " \\(" (regexp-opt my-transcript-part-keywords) "\\) \\(.+?\\)\\( "
                                (regexp-opt my-transcript-close-keywords) " "
                                (regexp-opt my-transcript-part-keywords) "\\)?$")
                        sub-text)
      (concat (match-string 1 sub-text) ": " (match-string 2 sub-text)))))
(ert-deftest my-transcript-get-subtitle-note-based-on-keywords ()
  (expect (my-transcript-get-subtitle-note-based-on-keywords "BEGIN NEXT STEPS . Think about how dictation helps me practice slower speed. CLOSE NEXT STEPS")
          :to-equal "NEXT STEPS: . Think about how dictation helps me practice slower speed.")
  (expect (my-transcript-get-subtitle-note-based-on-keywords "START SUMMARY hello world STOP SUMMARY")
          :to-equal "SUMMARY: hello world")
  (expect (my-transcript-get-subtitle-note-based-on-keywords "START CHAPTER hello world again")
          :to-equal "CHAPTER: hello world again")
  )

Formatting the subtitles into Org Mode subtrees

The last step is to take the list of subtitles and format them into an Org subtree.

Formatting the subtree
;; todo: sort the completion? https://emacs.stackexchange.com/questions/55502/list-files-in-directory-in-reverse-order-of-date
;;
(defun my-transcript-insert-subtitles-as-org-tree (vtt-filename)
  (interactive (list (read-file-name "VTT: " (expand-file-name "./" my-phone-recording-dir) nil t nil
                                     (lambda (s) (string-match "\\.vtt$" s)))))
  (let* ((subtitles
          (mapcar (lambda (sub)
                    (unless (elt sub 4)
                      (setf (elt sub 4)
                            (my-transcript-get-subtitle-note-based-on-keywords (elt sub 3))))
                    sub)
                  (subed-parse-file vtt-filename)))
         (start-date (my-transcript-get-file-start-time vtt-filename))
         chapters tags
         start-of-entry)
    (setq start-of-entry (point))
    (insert (format "* TODO Review braindump from %s  :braindump:\n\n" (file-name-base vtt-filename)))
    (org-entry-put (point) "CREATED"
                   (concat "[" (format-time-string
                                (cdr org-timestamp-formats)
                                (my-transcript-get-file-start-time
                                 (file-name-nondirectory vtt-filename))) "]"))
    (insert
     (format "%s - %s - %s\n"
             (org-link-make-string (concat "file:" (file-name-sans-extension vtt-filename) ".vtt")
                                   "VTT")
             (org-link-make-string (concat "file:" (file-name-sans-extension vtt-filename) ".txt")
                                   "Text")
             (org-link-make-string (concat "file:" (file-name-sans-extension vtt-filename) ".m4a")
                                   "Audio")))
    (save-excursion
      (insert "** Transcript\n")
      ;; add each subtitle; add an ID in case we change the title
      (mapc
       (lambda (sub)
         (when (elt sub 4)
           (let ((note (my-transcript-get-subtitle-note-based-on-keywords (elt sub 3))))
             (insert (concat "*** "
                             note " "
                             (org-link-make-string
                              (format "subed:%s::%s"
                                      vtt-filename
                                      (my-msecs-to-timestamp (elt sub 1)))
                              "VTT")
                             "\n\n"))
             (org-entry-put (point) "CREATED"
                            (concat "["
                                    (format-time-string
                                     (cdr org-timestamp-formats)
                                     (time-add start-date
                                               (seconds-to-time (/ (elt sub 1) 1000.0))))
                                    "]"))
             (org-entry-put (point) "START" (my-msecs-to-timestamp (elt sub 2)))
             (when (elt sub 4)
               (when (string-match "command: .*recognize" (elt sub 4))
                 (save-excursion
                   ;; TODO: scope this to just the section someday
                   (goto-char start-of-entry)
                   (org-set-tags (append (list "recognize") (org-get-tags)))))
               (when (string-match "command: .*outline" (elt sub 4))
                 (save-excursion
                   (goto-char start-of-entry)
                   (org-set-tags (append (list "outline") (org-get-tags)))))
               (when (string-match "^time" (elt sub 4))
                 (insert "[" (org-format-time-string (cdr org-timestamp-formats)
                                                     (time-add start-date (seconds-to-time (/ (elt sub 1) 1000))))
                         "]\n"))
               (when (string-match "command: .+\\(high\\|low\\)" (elt sub 4))
                 (save-excursion
                   (goto-char start-of-entry)
                   (org-priority (if (string= (downcase (match-string 1)) "high") ?A ?C))))
               (when (string-match "\\(?:tags?\\|keywords?\\): \\(.+\\)" (elt sub 4))
                 (save-excursion
                   (goto-char start-of-entry)
                   (org-set-tags (append (split-string (match-string 1) " ") (org-get-tags))))))
             (add-to-list 'chapters
                          (format "- %s (%s)"
                                  (org-link-make-string (concat "id:" (org-id-get-create))
                                                        note)
                                  (org-link-make-string
                                   (format "subed:%s::%s"
                                           vtt-filename
                                           (my-msecs-to-timestamp (elt sub 1)))
                                   "VTT")))))
         (insert (elt sub 3) "\n"))
       subtitles))
    (when chapters
      (insert (string-join (nreverse chapters) "\n") "\n"))))

(defun my-transcript-get-file-start-time (filename)
  (setq filename (file-name-base filename))
  (cond
   ((string-match "^\\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]\\.[0-9][0-9]\\)" filename)
    (date-to-time (replace-regexp-in-string "\\." ":" (match-string 0 filename))))
   ((string-match "^\\(?:Copy of \\)?\\([^ ][^ ][^ ]\\)[^ ]+ at \\([0-9]+\\)-\\([0-9]+\\)" filename)
    (let* ((day (match-string 1 filename))
           (hour (match-string 2 filename))
           (min (match-string 3 filename))
           (changed-time (file-attribute-modification-time
                          (file-attributes filename)))
           (decoded-time (decode-time changed-time)))
      ;; get the day on or before changed-time
      (if (string= (format-time-string "%a" changed-time) day)
          (encode-time (append
                        (list
                         0
                         (string-to-number min)
                         (string-to-number hour))
                        (seq-drop decoded-time 3)))
        ;; synchronized maybe within the week after
        (org-read-date t t (concat "-" day " " hour ":" min)))))))

Process a single transcript from the raw text file

So now we put that all together: rename the file using the calculated start time, prepare the alignment breaks, align the file to get the timestamps, and add the subtree to an Org file.

Making the TODO
(defvar my-transcript-braindump-file "~/sync/orgzly/braindump.org")

(defun my-transcript-make-todo (text-file &optional force)
  "Add TEXT-FILE as a TODO."
  (interactive (list (buffer-file-name) current-prefix-arg))
  ;; rename the files to use the timestamps
  (unless (string-match "^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"
                        (file-name-base text-file))
    (setq text-file (my-transcript-rename-files-based-on-time text-file)))
  (let* ((recording (concat (file-name-sans-extension text-file) ".m4a"))
         (start (my-transcript-get-file-start-time text-file))
         (vtt (concat (file-name-sans-extension text-file) ".vtt"))
         chapters
         (title (format "Review braindump from %s" (file-name-base text-file)))
         existing)
    ;; check if already exists
    (with-current-buffer (find-file-noselect my-transcript-braindump-file)
      (save-excursion
        (goto-char (point-min))
        (setq existing (org-find-exact-headline-in-buffer title))))
    (if (and existing (not force))
        (progn
          (message "Going to existing heading")
          (org-goto-marker-or-bmk existing))
      (if (or (null my-transcript-last-processed-time)
              (time-less-p my-transcript-last-processed-time start))
          (customize-save-variable 'my-transcript-last-processed-time start))
      (find-file text-file)
      (my-transcript-prepare-alignment-breaks)
      (save-buffer)
      (when (file-exists-p vtt) (delete-file vtt))
      (when (get-file-buffer vtt) (kill-buffer (get-file-buffer vtt)))
      (subed-align recording text-file "VTT")
      (when (get-file-buffer vtt) (kill-buffer (get-file-buffer vtt)))
      (find-file my-transcript-braindump-file)
      (goto-char (point-min))
      (if existing
          (progn
            (org-goto-marker-or-bmk existing)
            (delete-region (point) (org-end-of-subtree)))
        (org-next-visible-heading 1))
      (my-transcript-insert-subtitles-as-org-tree vtt))))

Process multiple files

I want to process multiple files in one batch.

(defun my-transcript-process (files &optional force)
  (interactive (list (cond
                      ((and (derived-mode-p 'dired-mode)
                            (dired-get-marked-files))
                       (dired-get-marked-files))
                      ((derived-mode-p 'dired-mode)
                       (list (dired-get-filename)))
                      ((string-match "\\.txt$" (buffer-file-name))
                       (list (buffer-file-name)))
                      (t (list (read-file-name "Transcript: "))))
                     current-prefix-arg))
  (mapc (lambda (f)
          (when (string-match "txt" f)
            (my-transcript-make-todo f force))) files))

It would be nice to have it automatically keep track of the latest one that's been processed, maybe via customize-save-variable. This still needs some tinkering.

Processing new files
(defcustom my-transcript-last-processed-time nil
  "The timestamp of the last processed transcript."
  :group 'sacha
  :type '(repeat integer))

(defun my-transcript-process-since-last ()
  (interactive)
  (let ((files
         (seq-filter
          (lambda (f)
            (or (null my-transcript-last-processed-time)
                (time-less-p my-transcript-last-processed-time
                             (my-transcript-get-file-start-time f))))
          (directory-files my-phone-recording-dir 'full " at [0-9][0-9]-[0-9][0-9]\\.txt\\|^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]\\.[0-9][0-9]\\.txt"))))
    (mapc (lambda (f)
            (my-transcript-make-todo f)
            (let ((start (my-transcript-get-file-start-time f)))
              (when (or (null my-transcript-last-processed-time)
                        (time-less-p my-transcript-last-processed-time start))
                (setq my-transcript-last-processed-time start))))
          files))
  (customize-save-variable 'my-transcript-last-processed-time my-transcript-last-processed-time))

(defun my-transcript-rename-files-based-on-time (text-file)
  "Rename TEXT-FILE based on date. Return the new text file."
  (interactive (list (if (derived-mode-p 'dired-mode) (dired-get-filename)
                       (buffer-file-name))))
  (if (string-match "^[0-9][0-9][0-9][0-9]" (file-name-base text-file))
      text-file                         ; no change, already uses date
    (let* ((start (my-transcript-get-file-start-time (file-name-base text-file)))
           (new-base (format-time-string "%Y-%m-%dT%H.%M" start)))
      (if (file-exists-p (expand-file-name (concat new-base ".txt")
                                           (file-name-directory text-file)))
          (error "%s already exists" new-base)
        (dolist (ext '(".txt" ".m4a" ".vtt"))
          (if (file-exists-p (concat (file-name-sans-extension text-file) ext))
              (rename-file (concat (file-name-sans-extension text-file) ext)
                           (expand-file-name (concat new-base ext)
                                             (file-name-directory text-file)))))
        (expand-file-name (concat new-base ".txt")
                          (file-name-directory text-file))))))

Ideas for next steps

  • Make the commands process things even more automatically.
  • Experiment with just sending everything to OpenAI Whisper instead of conditionally sending it based on the keywords (which might not be recognized).
  • See if I want to reuse more sentences or move them around.
  • Find out where people who have thought about dictation keywords have their notes; I probably don't have to reinvent the wheel here.
This is part of my Emacs configuration.

Getting live speech into Emacs with Deepgram's streaming API

| speechtotext, emacs, speech, streaming

This is a quick demonstration of using Deepgram's streaming API to do speech recognition live. It isn't as accurate as OpenAI Whisper but since Whisper doesn't have a streaming API, it'll do for now. I can correct misrecognized words manually. I tend to talk really quickly, so it displays the words per minute in my modeline. I put the words into an Org Mode buffer so I can toggle headings with avy and cycle visibility. When I'm done, it saves the text, JSON, and WAV for further processing. I think it'll be handy to have a quick way to take live notes during interviews or when I'm thinking out loud. Could be fun!

I'm still getting some weirdness when the mode turns on when I don't expect it, so that's something to look into. Maybe I won't use it as a mode for now. I'll just use my-live-speech-start and my-live-speech-stop.

General code

(defvar my-live-speech-buffer "*Speech*")
(defvar my-live-speech-process nil)
(defvar my-live-speech-output-buffer "*Speech JSON*")

(defvar my-live-speech-functions
  '(my-live-speech-display-in-speech-buffer
    my-live-speech-display-wpm
    my-live-speech-append-to-etherpad)
  "Functions to call with one argument, the recognition results.")

(defun my-live-speech-start ()
  "Turn on live captions."
  (interactive)
  (with-current-buffer (get-buffer-create my-live-speech-buffer)
    (unless (process-live-p my-live-speech-process)
      (let ((default-directory "~/proj/deepgram-live"))
        (message "%s" default-directory)
        (with-current-buffer (get-buffer-create my-live-speech-output-buffer)
          (erase-buffer))
        (setq my-live-speech-recent-words nil
              my-live-speech-wpm-string "READY ")
        (setq my-live-speech-process
              (make-process
               :command '("bash" "run.sh")
               :name "speech"
               :filter 'my-live-speech-json-filter
               :sentinel #'my-live-speech-process-sentinel
               :buffer my-live-speech-output-buffer)))
      (org-mode))
    (display-buffer (current-buffer))))

(defun my-live-speech-stop ()
  (interactive)
  (if (process-live-p my-live-speech-process)
      (kill-process my-live-speech-process))
  (setq my-live-speech-wpm-string nil))

;; (define-minor-mode my-live-speech-mode
;;  "Show live speech and display WPM.
;; Need to check how to reliably turn this on and off."
;;  :global t :group 'sachac
;;  (if my-live-speech-mode
;;      (my-live-speech-start)
;;    (my-live-speech-stop)
;;    (setq my-live-speech-wpm-string nil)))

;; based on subed-mpv::client-filter
(defun my-live-speech-handle-json (line)
  "Process the JSON object in LINE."
  (run-hook-with-args 'my-live-speech-functions (json-parse-string line :object-type 'alist)))

(defun my-live-speech-process-sentinel (proc event)
  (when (string-match "finished" event)
    (my-live-speech-stop)
    ;(my-live-speech-mode -1)
    ))

(defun my-live-speech-json-filter (proc string)
  (when (buffer-live-p (process-buffer proc))
    (with-current-buffer (process-buffer proc)
      (let* ((proc-mark (process-mark proc))
             (moving (= (point) proc-mark)))
        ;;  insert the output
        (save-excursion
          (goto-char proc-mark)
          (insert string)
          (set-marker proc-mark (point)))
        (if moving (goto-char proc-mark))
        ;; process and remove all complete lines of JSON (lines are complete if ending with \n)
        (let ((pos (point-min)))
          (while (progn (goto-char pos)
                        (end-of-line)
                        (equal (following-char) ?\n))
            (let* ((end (point))
                   (line (buffer-substring pos end)))
              (delete-region pos (+ end 1))
              (with-current-buffer (get-buffer my-live-speech-buffer)
                (my-live-speech-handle-json line)))))))))

Python code based on the Deepgram streaming test suite:

Very rough app.py
# Based on streaming-test-suite
# https://developers.deepgram.com/docs/getting-started-with-the-streaming-test-suite

import pyaudio
import asyncio
import json
import os
import websockets
from datetime import datetime
import wave
import sys

startTime = datetime.now()
key = os.environ['DEEPGRAM_API_KEY']
live_json = os.environ.get('LIVE_CAPTIONS_JSON', True)
all_mic_data = []
all_transcripts = []
all_words = []
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 8000

audio_queue = asyncio.Queue()
REALTIME_RESOLUTION = 0.250
SAMPLE_SIZE = 0

def save_info():
    global SAMPLE_SIZE
    base = startTime.strftime('%Y%m%d%H%M')
    wave_file_path = os.path.abspath(f"{base}.wav")
    wave_file = wave.open(wave_file_path, "wb")
    wave_file.setnchannels(CHANNELS)
    wave_file.setsampwidth(SAMPLE_SIZE)
    wave_file.setframerate(RATE)
    wave_file.writeframes(b"".join(all_mic_data))
    wave_file.close()
    with open(f"{base}.txt", "w") as f:
        f.write("\n".join(all_transcripts))
    with open(f"{base}.json", "w") as f:
        f.write(json.dumps(all_words))
    if live_json:
        print(f'{{"msg": "🟢 Saved to {base}.txt , {base}.json , {base}.wav", "base": "{base}"}}')
    else:
        print(f"🟢 Saved to {base}.txt , {base}.json , {base}.wav")

# Used for microphone streaming only.
def mic_callback(input_data, frame_count, time_info, status_flag):
    audio_queue.put_nowait(input_data)
    return (input_data, pyaudio.paContinue)

async def run(key, method="mic", format="text", **kwargs):
    deepgram_url = f'wss://api.deepgram.com/v1/listen?punctuate=true&smart_format=true&utterances=true&encoding=linear16&sample_rate=16000'
    async with websockets.connect(
        deepgram_url, extra_headers={"Authorization": "Token {}".format(key)}
    ) as ws:
        async def sender(ws):
            try:
                while True:
                    mic_data = await audio_queue.get()
                    all_mic_data.append(mic_data)
                    await ws.send(mic_data)
            except websockets.exceptions.ConnectionClosedOK:
                await ws.send(json.dumps({"type": "CloseStream"}))
                if live_json:
                    print('{"msg": "Closed."}')
                else:
                    print("Closed.")
        async def receiver(ws):
            global all_words
            """Print out the messages received from the server."""
            first_message = True
            first_transcript = True
            transcript = ""
            async for msg in ws:
                res = json.loads(msg)
                if first_message:
                    first_message = False
                try:
                    # handle local server messages
                    if res.get("msg"):
                        if live_json:
                            print(json.dumps(res))
                        else:
                            print(res["msg"])
                    if res.get("is_final"):
                        transcript = (
                            res.get("channel", {})
                            .get("alternatives", [{}])[0]
                            .get("transcript", "")
                        )
                        if transcript != "":
                            if first_transcript:
                                first_transcript = False
                            if live_json:
                                print(json.dumps(res.get("channel", {}).get("alternatives", [{}])[0]))
                            else:
                                print(transcript)
                            all_transcripts.append(transcript)
                            all_words = all_words + res.get("channel", {}).get("alternatives", [{}])[0].get("words", [])
                        # if using the microphone, close stream if user says "goodbye"
                        if method == "mic" and "goodbye" in transcript.lower():
                            await ws.send(json.dumps({"type": "CloseStream"}))
                            if live_json:
                                print('{"msg": "Done."}')
                            else:
                                print("Done.")
                    # handle end of stream
                    if res.get("created"):
                        save_info()
                except KeyError:
                    print(f"🔴 ERROR: Received unexpected API response! {msg}")

        # Set up microphone if streaming from mic
        async def microphone():
            audio = pyaudio.PyAudio()
            stream = audio.open(
                format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK,
                stream_callback=mic_callback,
            )

            stream.start_stream()

            global SAMPLE_SIZE
            SAMPLE_SIZE = audio.get_sample_size(FORMAT)

            while stream.is_active():
                await asyncio.sleep(0.1)

            stream.stop_stream()
            stream.close()

        functions = [
            asyncio.ensure_future(sender(ws)),
            asyncio.ensure_future(receiver(ws)),
        ]

        functions.append(asyncio.ensure_future(microphone()))
        if live_json:
            print('{"msg": "Ready."}')
        else:
            print("🟢 Ready.")
        await asyncio.gather(*functions)

def main():
    """Entrypoint for the example."""
    # Parse the command-line arguments.
    try:
        asyncio.run(run(key, "mic", "text"))
    except websockets.exceptions.InvalidStatusCode as e:
        print(f'🔴 ERROR: Could not connect to Deepgram! {e.headers.get("dg-error")}')
        print(
            f'🔴 Please contact Deepgram Support (developers@deepgram.com) with request ID {e.headers.get("dg-request-id")}'
        )
        return
    except websockets.exceptions.ConnectionClosedError as e:
        error_description = f"Unknown websocket error."
        print(
            f"🔴 ERROR: Deepgram connection unexpectedly closed with code {e.code} and payload {e.reason}"
        )

        if e.reason == "DATA-0000":
            error_description = "The payload cannot be decoded as audio. It is either not audio data or is a codec unsupported by Deepgram."
        elif e.reason == "NET-0000":
            error_description = "The service has not transmitted a Text frame to the client within the timeout window. This may indicate an issue internally in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame."
        elif e.reason == "NET-0001":
            error_description = "The service has not received a Binary frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them."

        print(f"🔴 {error_description}")
        # TODO: update with link to streaming troubleshooting page once available
        # print(f'🔴 Refer to our troubleshooting suggestions: ')
        print(
            f"🔴 Please contact Deepgram Support (developers@deepgram.com) with the request ID listed above."
        )
        return

    except websockets.exceptions.ConnectionClosedOK:
        return

    except Exception as e:
        print(f"🔴 ERROR: Something went wrong! {e}")
        save_info()
        return


if __name__ == "__main__":
    sys.exit(main() or 0)

The Python script sends the microphone stream to Deepgram and prints out the JSON output. The Emacs Lisp code starts an asynchronous process and reads the JSON output, displaying the transcript and calculating the WPM based on the words. run.sh just loads the venv for this project (requirements.txt based on the streaming test suite) and then runs app.py, since some of the Python library versions conflict with other things I want to experiment with.

I also added my-live-speech-wpm-string to my mode-line-format manually using Customize, since I wanted it displayed on the left side instead of getting lost when I turn keycast-mode on.
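
A Lisp equivalent might look something like this sketch (marking the variable as risky so the face from propertize shows up in the mode line):

(put 'my-live-speech-wpm-string 'risky-local-variable t)
(unless (memq 'my-live-speech-wpm-string (default-value 'mode-line-format))
  (setq-default mode-line-format
                (cons 'my-live-speech-wpm-string
                      (default-value 'mode-line-format))))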

I'm still a little anxious about accidentally leaving a process running, so I check with ps aux | grep python3. Eventually I'll figure out how to make sure everything gets properly stopped when I'm done.
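
A little helper along these lines could do that check from inside Emacs (just a sketch; it assumes pgrep is available):

(defun my-live-speech-list-leftover-processes ()
  "Show any python3 processes that might still be running."
  (interactive)
  (let ((lines (ignore-errors (process-lines "pgrep" "-af" "python3"))))
    (if lines
        (message "Still running:\n%s" (mapconcat #'identity lines "\n"))
      (message "No python3 processes found."))))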

Anyway, there it is!

Display in speech buffer

(defun my-live-speech-display-in-speech-buffer (recognition-results)
  (with-current-buffer (get-buffer-create my-live-speech-buffer)
    (let-alist recognition-results
      (let* ((pos (point))
             (at-end (eobp)))
        (goto-char (point-max))
        (unless (eolp) (insert "\n"))
        (when .msg
          (insert .msg "\n"))
        (when .transcript
          (insert .transcript "\n"))
        ;; scroll to the bottom if being displayed
        (if at-end
            (when (get-buffer-window (current-buffer))
              (set-window-point (get-buffer-window (current-buffer)) (point)))
          (goto-char pos))))))

(defun my-live-speech-toggle-heading ()
  "Toggle a line as a heading."
  (interactive)
  (with-current-buffer (get-buffer my-live-speech-buffer)
    (display-buffer (current-buffer))
    (with-selected-window (get-buffer-window (get-buffer my-live-speech-buffer))
      (let ((avy-all-windows nil))
        (avy-goto-line 1))
      (org-toggle-heading 1))))

(defun my-live-speech-cycle-visibility ()
  "Get a quick overview."
  (interactive)
  (with-current-buffer (get-buffer my-live-speech-buffer)
    (display-buffer (current-buffer))
    (if (eq org-cycle-global-status 'contents)
        (progn
          (run-hook-with-args 'org-cycle-pre-hook 'all)
          (org-fold-show-all '(headings blocks))
          (setq org-cycle-global-status 'all)
          (run-hook-with-args 'org-cycle-hook 'all))
      (run-hook-with-args 'org-cycle-pre-hook 'contents)
      (org-cycle-content)
      (setq org-cycle-global-status 'contents)
      (run-hook-with-args 'org-cycle-hook 'contents))))

Display words per minute

(defvar my-live-speech-wpm-window-seconds 15 "How many seconds to calculate WPM for.")
(defvar my-live-speech-recent-words nil "Words spoken in `my-live-speech-wpm-window-seconds'.")
(defvar my-live-speech-wpm nil "Current WPM.")
(defvar my-live-speech-wpm-colors  ; haven't figured out how to make these work yet
  '((180 :foreground "red")
    (170 :foreground "yellow")
    (160 :foreground "green")))
(defvar my-live-speech-wpm-string nil "Add this somewhere in `mode-line-format'.")
(defun my-live-speech-wpm-string ()
  (propertize
   (format "%d WPM " my-live-speech-wpm)
   'face
   (cdr (seq-find (lambda (row) (> my-live-speech-wpm (car row))) my-live-speech-wpm-colors))))

(defun my-live-speech-display-wpm (recognition-results)
  (let-alist recognition-results
    (when .words
      ;; calculate WPM
      (setq my-live-speech-recent-words
            (append my-live-speech-recent-words .words nil))
      (let ((threshold (- (assoc-default 'end (aref .words (1- (length .words))))
                          my-live-speech-wpm-window-seconds)))
        (setq my-live-speech-recent-words
              (seq-filter
               (lambda (o)
                 (>= (assoc-default 'start o)
                     threshold))
               my-live-speech-recent-words))
        (setq my-live-speech-wpm
              (/
               (length my-live-speech-recent-words)
               (/ (- (assoc-default 'end (aref .words (1- (length .words))))
                     (assoc-default 'start (car my-live-speech-recent-words)))
                  60.0)))
        (setq my-live-speech-wpm-string (my-live-speech-wpm-string))))))

Append to EmacsConf Etherpad

(defvar my-live-speech-etherpad-id nil)
(defun my-live-speech-append-to-etherpad (recognition-results)
  (when my-live-speech-etherpad-id
    (emacsconf-pad-append-text my-live-speech-etherpad-id (concat " " (assoc-default 'transcript recognition-results)))))
This is part of my Emacs configuration.

Late-night braindumps by talking to myself

| geek, speechtotext

At my most recent eye exam, the optometrist recommended that I use a warm compress and massage my eyes afterwards to help with dry eyes, so I ordered a USB-powered eye mask. I like how I don't have to fuss around with figuring out the temperature and duration, especially now that things are colder. It's just there by my bed. It turns on. It has a timer, so I can set it to automatically turn off at 20 minutes. The medium heat setting is warm enough to be warm but not uncomfortable, even in winter.

Since I can't see anything while I have the compress on, I've taken to recording a quick braindump of the day. It's a nice way to wind down. I end up clearing my brain, and I usually just sleep afterwards instead of staying up. W- keeps later hours than I do, so I can talk without disturbing anyone.

Google Recorder automatically transcribes my recording on the fly with varying levels of accuracy. It's enough to mostly recognize my thoughts the next day, and I can tap on a word on my phone to jump back to that part of the audio. If I really want to, I can use aeneas to align the text to the recording so that I can do that sort of verification on my computer. For the most part, I've been able to understand things from context.

I've been experimenting with using special keywords to set off parts of the text that I might want to pay special attention to. At first, I tried phrases like "begin elephant" and "end elephant", but "end" and "and" are easily confused by the speech-to-text recognizer, and "elephant" gets misrecognized sometimes too. "Hello" and "goodbye" seemed to work better. "Hello computer"/"Goodbye computer" and "Hello notebook"/"Goodbye notebook" seem to work okay. I might go with "Hello notebook", since it has fewer syllables.

My goal for that time is to do a braindump of the different things I need to think about and capture, and then make it easier to follow up. It might also be a good time to review the day and plan tomorrow. Sometimes I might talk my way through the idea for a blog post or a decision I'm considering. I still tend to stutter and wander all over the place, but it's useful to have notes as a starting point.

How could I improve this workflow? I would like to record my thoughts and, ideally, have the transcript automatically show up in a directory on my computer. To reduce the friction there, I might need to do the transcription on my computer, because Google Recorder requires me to tap a few buttons in order to share the transcript unless I root my phone, which is a bit of an involved process that might need to wait until after EmacsConf. If I use another recording app that saves the audio to a directory that is synchronized with Syncthing to my server, then I can transcribe it with OpenAI Whisper and align it with aeneas to get the timestamps. My laptop is pretty slow, but I'll be asleep anyway, so maybe the laptop can work on it overnight.

I would ideally like it to create Org Mode entries, maybe putting those entries in a separate inbox if I'm worried about messing up my main inbox. It can pull out the text between "Hello, notebook" and "Goodbye, notebook" sections along with links to the files where they came from so that I can review the context. Org Mode links can jump to text or lines in a text file, or I can add support for easily capturing and linking to subtitles in subed.el (maybe a subed-org.el). Who knows? I might even be able to start extracting audio snippets in order to create a quick presentation using something like compile-media. (I'm recording in a relatively quiet room and no one's excitedly trying to interrupt me, so I might as well take advantage of that!) Oh, maybe I should also add a command to copy the subtitles in the region as plain text. If I figure out a grammar, I might even be able to automate or partially automate common commands.
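
For that copy-as-plain-text idea, a sketch might look something like this (hypothetical; it just drops WebVTT cue numbers, timestamp lines, NOTE lines, and blank lines from the region):

(defun my-subtitles-copy-region-as-text (beg end)
  "Copy the subtitle text between BEG and END without timestamps."
  (interactive "r")
  (let ((text (buffer-substring-no-properties beg end)))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      (flush-lines "^WEBVTT\\|^NOTE\\b\\|^[0-9]+$\\|-->")
      (goto-char (point-min))
      (flush-lines "^[ \t]*$")
      (kill-new (buffer-string))
      (message "Copied %d characters of subtitle text" (length (buffer-string))))))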

I don't need to record everything 24x7 (although someone's experimented with that), but these short snippets might be a good start. It could lead to getting more stuff out of my brain, and that can be handy too.

Related: Audio braindump workflow tweaks: Adding Org Mode hyperlinks to recordings based on keywords
