Categories: speechtotext

RSS - Atom - Subscribe via email

Updating my audio braindump workflow to take advantage of WhisperX

| emacs, speechtotext, org

I get word timestamps for free when I transcribe with WhisperX, so I can skip the Aeneas alignment step. That means I can update my previous code for handling audio braindumps . Breaking the transcript up into sections Also, I recently updated subed-word-data to colour words based on their transcription score, which draws my attention to things that might be uncertain.

Here's what it looks like when I have the post, the transcript, and the annotated PDF.

2024-11-17_20-44-30.png
Figure 1: Screenshot of draft, transcript, and PDF

Here's what I needed to implement my-audio-braindump-from-whisperx-json (plus some code from my previous audio braindump workflow):

(defun my-whisperx-word-list (file)
  (let* ((json-object-type 'alist)
         (json-array-type 'list))
    (seq-mapcat (lambda (seg)
                  (alist-get 'words seg))
                (alist-get 'segments (json-read-file file)))))

;; (seq-take (my-whisperx-word-list (my-latest-file "~/sync/recordings" "\\.json")) 10)
(defun my-whisperx-insert-word-list (words)
  "Inserts WORDS with text properties."
  (require 'subed-word-data)
  (mapc (lambda (word)
            (let ((start (point)))
              (insert
               (alist-get 'word word))
              (subed-word-data--add-word-properties start (point) word)
              (insert " ")))
        words))

(defun my-audio-braindump-turn-sections-into-headings ()
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "START SECTION \\(.+?\\) STOP SECTION" nil t)
    (replace-match
     (save-match-data
       (format
        "\n*** %s\n"
        (save-match-data (string-trim (replace-regexp-in-string "^[,\\.]\\|[,\\.]$" "" (match-string 1))))))
     nil t)
    (let ((prop-match (save-excursion (text-property-search-forward 'subed-word-data-start))))
      (when prop-match
        (org-entry-put (point) "START" (format-seconds "%02h:%02m:%02s" (prop-match-value prop-match)))))))

(defun my-audio-braindump-split-sentences ()
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "[a-z]\\. " nil t)
    (replace-match (concat (string-trim (match-string 0)) "\n") )))

(defun my-audio-braindump-restructure ()
  (interactive)
  (goto-char (point-min))
  (my-subed-fix-common-errors)
  (org-mode)
  (my-audio-braindump-prepare-alignment-breaks)
  (my-audio-braindump-turn-sections-into-headings)
  (my-audio-braindump-split-sentences)
  (goto-char (point-min))
  (my-remove-filler-words-at-start))

(defun my-audio-braindump-from-whisperx-json (file)
  (interactive (list (read-file-name "JSON: " "~/sync/recordings/" nil nil nil (lambda (f) (string-match "\\.json\\'" f)))))
  ;; put them all into a buffer
  (with-current-buffer (get-buffer-create "*Words*")
    (erase-buffer)
    (fundamental-mode)
    (my-whisperx-insert-word-list (my-whisperx-word-list file))
    (my-audio-braindump-restructure)
    (goto-char (point-min))
    (switch-to-buffer (current-buffer))))

(defun my-audio-braindump-process-text (file)
  (interactive (list (read-file-name "Text: " "~/sync/recordings/" nil nil nil (lambda (f) (string-match "\\.txt\\'" f)))))
  (with-current-buffer (find-file-noselect file)
    (my-audio-braindump-restructure)
    (save-buffer)))
;; (my-audio-braindump-from-whisperx-json (my-latest-file "~/sync/recordings" "\\.json"))

Ideas for next steps:

  • I can change my processing script to split up the Whisper TXT into sections and automatically make the PDF with nice sections.
  • I can add reminders and other callouts. I can style them, and I can copy reminders into a different section for easier processing.
  • I can look into extracting PDF annotations so that I can jump to the next highlight or copy highlighted text.
This is part of my Emacs configuration.
View org source for this post

Yay Emacs 5: Tweaking my video workflow with WhisperX and subed-record

| speechtotext, emacs, subed, yay-emacs

I'm tweaking my video workflow. I use Orgzly Revived on my Android phone to write the text, and I use Easy Voice Recorder to record it. Syncthing automatically copies both to my laptop. I use WhisperX to transcribe my recording, and I use a little bit of Emacs Lisp to figure out timestamps for each word. I edit this to fix errors. I can even rearrange things and get rid of umms or ahs or anything I don't want.Then I use subed-convert to turn it into a VTT file. I can tweak the start and end times by looking at the waveforms. Then I add comments with the visuals I want. I can add images, animated GIFs, or videos, and they're automatically squeezed or stretched to fit. I can also have them play at original speed. Then I set up open captions and use subed-record-compile-video. Tada!

Links:

You can watch this on YouTube, download the video, or download the audio.

View org source for this post

Using WhisperX to get word-level timestamps for audio editing with Emacs and subed-record

Posted: - Modified: | speechtotext, audio, emacs, subed
  • [2024-11-16 Sat]: Removed highlight_words, made max_line_width use the environment variable if specified.
  • [2024-10-14 Mon]: Actually, WhisperX makes a JSON with word-level timing data, so let's use that instead.

I'm gradually shifting more things to this Lenovo P52 to take advantage of its newer processor, 64 GB of RAM, and 2 TB drive. (Whee!) One of the things I'm curious about is how I can make better use of multimedia. I couldn't get whisper.cpp to work on my Lenovo X230T, so I mostly relied on the automatic transcripts from Google Recorder (with timestamps generated by aeneas) or cloud-based transcription services like Deepgram.

I have a lot of silences in my voice notes when I think out loud. whisper.cpp got stuck in loops during silent parts, but WhisperX handles them perfectly. WhisperX is also fast enough for me to handle audio files locally instead of relying on Deepgram. With the default model, I can process the files faster than real-time:

File length Transcription time
42s 17s
7m48s 1m41s

I used this command to get word-level timing data. (Experimenting with options from this post)

MAX_LINE_WIDTH="${MAX_LINE_WIDTH:-50}"
~/vendor/whisperx/.venv/bin/whisperx --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --compute_type int8 --print_progress True --max_line_width $MAX_LINE_WIDTH --segment_resolution chunk --max_line_count 1 --language en "$@"

Among other things, it makes a text file that looks like this:

I often need to... I sometimes need to replace or navigate by symbols.
Casual symbol overlays a new package that adds those shortcuts so that I don't have to remember the other keywords for them.

and a JSON file that looks like this:

{"segments": [{"start": 0.427, "end": 7.751, "text": " I often need to... I sometimes need to replace or navigate by symbols.", "words": [{"word": "I", "start": 0.427, "end": 0.507, "score": 0.994}, {"word": "often", "start": 0.587, "end": 0.887, "score": 0.856}, {"word": "need", "start": 0.987, "end": 1.227, "score": 0.851}, {"word": "to...", "start": 1.267, "end": 1.508, "score": 0.738}, {"word": "I", "start": 4.329, "end": 4.429, "score": 0.778}, ...]}, ...]}

Sometimes I just want the text so that I can use an audio braindump as the starting point for a blog post or for notes. WhisperX is way more accurate than Google Recorder, so that will probably be easier once I update my workflow for that.

Sometimes I want to make an edited audio file that sounds smooth so that I can use it in a podcast, a video, or some audio notes. For that, I'd like word-level timing data so that I can cut out words or sections. Aeneas didn't give me word-level timestamps, but WhisperX does, so I can get the time information before I start editing. I can extract the word timestamps from the JSON like this:

(defun my-subed-word-tsv-from-whisperx-json (file)
  (interactive "FJSON: ")
  (let* ((json-array-type 'list)
         (json-object-type 'alist)
         (data (json-read-file file))
         (filename (concat (file-name-sans-extension file) ".tsv"))
         (base (seq-mapcat
                (lambda (segment)
                  (seq-map (lambda (word)
                             (let-alist word
                               (list nil
                                     (and .start (* 1000 .start))
                                     (and .end (* 1000 .end))
                                     .word)))
                           (alist-get 'words segment)))
                (alist-get 'segments data)))
         (current base)
         (last-end 0))
     ;; numbers at the end of a sentence sometimes don't end up with times
     ;; so we need to fix them
    (while current
      (unless (elt (car current) 1)           ; start
        (setf (elt (car current) 1) (1+ last-end)))
      (unless (elt (car current) 2)
        (setf (elt (car current) 2) (1- (elt (cadr current) 1))))
      (setq
       last-end (elt (car current) 2)
       current (cdr current)))
    (subed-create-file
     filename
     base
     t
     'subed-tsv-mode)
    (find-file filename)))

Here's my old code for parsing the highlighted VTT or SRT files that underline each word:

(defun my-subed-load-word-data-from-whisperx-highlights (file)
  "Return a list of word cues from FILE.
FILE should be a VTT or SRT file produced by whisperx with the
--highlight_words True option."
  (seq-keep (lambda (sub)
              (when (string-match "<u>\\(.+?\\)</u>" (elt sub 3))
                (setf (elt sub 3) (match-string 1 (elt sub 3)))
                sub))
            (subed-parse-file file)))

(defun my-subed-word-tsv-from-whisperx-highlights (file)
  (interactive "FVTT: ")
  (with-current-buffer (find-file-noselect (concat (file-name-nondirectory file) ".tsv"))
    (erase-buffer)
    (subed-tsv-mode)
    (subed-auto-insert)
    (mapc (lambda (sub) (apply #'subed-append-subtitle nil (cdr sub)))
          (my-subed-load-word-data-from-whisperx-highlights file))
    (switch-to-buffer (current-buffer))))

I like to use the TSV format for this one because it's easy to scan down the right side. Incidentally, this format is compatible with Audacity labels, so I could import that there if I wanted. I like Emacs much more, though. I'm used to having all my keyboard shortcuts at hand.

0.427000	0.507000	I
0.587000	0.887000	often
0.987000	1.227000	need
1.267000	1.508000	to...
4.329000	4.429000	I
4.469000	4.869000	sometimes
4.950000	5.170000	need
5.210000	5.410000	to
5.530000	6.090000	replace

Once I've deleted the words I don't want to include, I can merge subtitles for phrases so that I can keep the pauses between words. A quick heuristic is to merge subtitles if they don't have much of a pause between them.

(defvar my-subed-merge-close-subtitles-threshold 500)
(defun my-subed-merge-close-subtitles (threshold)
  "Merge subtitles with the following one if there is less than THRESHOLD msecs gap between them."
  (interactive (list (read-number "Threshold in msecs: " my-subed-merge-close-subtitles-threshold)))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((end (subed-subtitle-msecs-stop))
          (next-start (save-excursion
                        (and (subed-forward-subtitle-time-start)
                             (subed-subtitle-msecs-stop)))))
      (if (and end next-start (< (- next-start end) threshold))
          (subed-merge-with-next)
        (or (subed-forward-subtitle-end) (goto-char (point-max)))))))

Then I can use subed-waveform-show-all to tweak the start and end timestamps. Here I switch to another file I've been editing…

2024-09-17-12-06-12.svg
Figure 1: Screenshot of subed-waveform

After that, I can use subed-record to compile the audio into an .opus file that sounds reasonably smooth.

I sometimes need to replace or navigate by symbols. casual-symbol-overlay is a package that adds a transient menu so that I don't have to remember the keyboard shortcuts for them. I've added it to my embark-symbol-keymap so I can call it with embark-act. That way it's just a C-. z away.

I want to make lots of quick audio notes that I can shuffle and listen to in order to remember things I'm learning about Emacs (might even come up with some kind of spaced repetition system), and I'd like to make more videos someday too. I think WhisperX, subed, and Org Mode will be fun parts of my workflow.

This is part of my Emacs configuration.
View org source for this post

Working with the flow of ideas

| speechtotext, metaphor, life, blogging, writing, kaizen

Text from sketch

2023-12-25-07

Flow of ideas

What can I learn from thinking about the flow rate?

input > output, and that's okay

Parts:

  • idea: agenda/review?
  • capture: refile to tags
  • toot: use this more, get stuff out
  • braindump: use transcripts or outline
  • sketch: bedtime
  • post: cut off earlier, can follow up
  • video: workflow tweaks

Thoughts:

  • more input is not always better; already plenty, not limiting factor
  • prioritize, review
  • overflow: add notes and pass it along, if poss.
  • can add things later (results, sketches, posts, videos)
  • manage expectations; minimize commitments
  • favour small things that flow easily
  • collect things in a container
    • tags, outlines
    • posts, videos
  • minimize filing, but still find related notes
  • become more efficient and effective

The heap:

  • Org dates have been working for time-sensitive/urgent things
  • Lots of discretionary things get lost in the shuffle
    • waste info collected but forgotten
    • half-finished posts that have gone stale
    • redoing things
    • late replies to conversations
    • things that are just in my config - some people still find them, so that's fine

Next: toot more experiment with braindumping, video

I come up with way more ideas than I can work on, and that's okay. That's good. It means I can always skim the top for interesting things, and it's fine if things overflow as long as the important stuff stays in the funnel. I'm experimenting with more ways to keep things flowing.

I usually come up with lots of ideas and then revisit my priorities to see if I can figure out 1-3 things I'd like to work on for my next focused time sessions. These priorities are actually pretty stable for the most part, but sometimes an idea jumps the queue and that's okay.

There's a loose net of projects/tasks that I'm currently working on and things I'm currently interested in, so I want to connect ideas and resources to those if I can. If they aren't connected, or if they're low-priority and I probably won't get to them any time soon, it can make a lot of sense to add quick notes and pass it along.

For things I want to think about some more, my audio braindumping workflow seems to be working out as a way to capture lots of text even when I'm away from my computer. I also have a bit more time to sketch while waiting for the kiddo to get ready for bed. I can use the sketchnotes as outlines to talk through while I braindump, and I can take my braindumps and distill them into sketches. Then I can take those and put them into blog posts. Instead of getting tempted to add more and more to a blog post (just one more idea, really!), I can try wrapping up earlier since I can always add a follow-up post. For some things, making a video might be worthwhile, so smoothing out my workflow for creating a video could be useful. I don't want to spend a lot of time filing but I still want to be able to find related notes, so automatically refiling based on tags (or possibly suggesting refile targets based on vector similarity?) might help me shift things out of my inbox.

I'm generally not bothered by the waste of coming up with ideas that I don't get around to, since it's more like daydreaming or fun. I sometimes get a little frustrated when I want to find an interesting resource I remember coming across some time ago and I can't find it with the words I'm looking for. Building more of a habit of capturing interesting resources in my Org files and using my own words in the notes will help while I wait for personal search engines to get better. I'm a little slow when it comes to e-mails because I tend to wait until I'm at my computer–and then when I'm at my computer, I prefer to tinker or write. I occasionally redo things because I didn't have notes from the previous solution or I couldn't find my notes. That's fine too. I can get better at taking notes and finding them.

So I think some next steps for me are:

  • Post more toots on @sachac@emacs.ch; might be useful as a firehose for ideas. Share them back to my Org file so I have a link to the discussion (if any). Could be a quick way to see if anyone already knows of related packages/code or if anyone might have the same itch.
  • See if I can improve my braindumping/sketch workflow so that I can flesh out more ideas
  • Tweak my video process gradually so that I can include more screenshots and maybe eventually longer explanations

Yay, I can get live speech recognition results from Emacs to Etherpad

| speechtotext, emacs, speech

I want to see if we can caption EmacsConf live presentations and Q&A sessions, even if the automated captions need help with misrecognized words. Now that I can get live speech into Emacs using the Deepgram streaming API, I can process that information and send it to other places. Here's a quick demonstration of appending live speech captions to Etherpad:

I added an emacsconf-pad-append-text function to emacsconf-pad.el that uses the appendText function.

Getting live speech into Emacs with Deepgram's streaming API

| speechtotext, emacs, speech, streaming

This is a quick demonstration of using Deepgram's streaming API to do speech recognition live. It isn't as accurate as OpenAI Whisper but since Whisper doesn't have a streaming API, it'll do for now. I can correct misrecognized words manually. I tend to talk really quickly, so it displays the words per minute in my modeline. I put the words into an Org Mode buffer so I can toggle headings with avy and cycle visibility. When I'm done, it saves the text, JSON, and WAV for further processing. I think it'll be handy to have a quick way to take live notes during interviews or when I'm thinking out loud. Could be fun!

I'm still getting some weirdness when the mode turns on when I don't expect it, so that's something to look into. Maybe I won't use it as a mode for now. I'll just use my-live-speech-start and my-live-speech-stop.

General code

(defvar my-live-speech-buffer "*Speech*")
(defvar my-live-speech-process nil)
(defvar my-live-speech-output-buffer "*Speech JSON*")

(defvar my-live-speech-functions
  '(my-live-speech-display-in-speech-buffer
    my-live-speech-display-wpm
    my-live-speech-append-to-etherpad)
  "Functions to call with one argument, the recognition results.")

(defun my-live-speech-start ()
  "Turn on live captions."
  (interactive)
  (with-current-buffer (get-buffer-create my-live-speech-buffer)
    (unless (process-live-p my-live-speech-process)
      (let ((default-directory "~/proj/deepgram-live"))
        (message "%s" default-directory)
        (with-current-buffer (get-buffer-create my-live-speech-output-buffer)
          (erase-buffer))
        (setq my-live-speech-recent-words nil
              my-live-speech-wpm-string "READY ")
        (setq my-deepgram-process
              (make-process
               :command '("bash" "run.sh")
               :name "speech"
               :filter 'my-live-speech-json-filter
               :sentinel #'my-live-speech-process-sentinel
               :buffer my-live-speech-output-buffer)))
      (org-mode))
    (display-buffer (current-buffer))))

(defun my-live-speech-stop ()
  (interactive)
  (if (process-live-p my-live-speech-process)
      (kill-process my-live-speech-process))
  (setq my-live-speech-wpm-string nil))

;; (define-minor-mode my-live-speech-mode
;;  "Show live speech and display WPM.
;; Need to check how to reliably turn this on and off."
;;  :global t :group 'sachac
;;  (if my-live-speech-mode
;;      (my-live-speech-start)
;;    (my-live-speech-stop)
;;    (setq my-live-speech-wpm-string nil)))

;; based on subed-mpv::client-filter
(defun my-live-speech-handle-json (line-object)
  "Process the JSON object in LINE."
  (run-hook-with-args 'my-live-speech-functions (json-parse-string line :object-type 'alist)))

(defun my-live-speech-process-sentinel (proc event)
  (when (string-match "finished" event)
    (my-live-speech-stop)
    ;(my-live-speech-mode -1)
    ))

(defun my-live-speech-json-filter (proc string)
  (when (buffer-live-p (process-buffer proc))
    (with-current-buffer (process-buffer proc)
      (let* ((proc-mark (process-mark proc))
             (moving (= (point) proc-mark)))
        ;;  insert the output
        (save-excursion
          (goto-char proc-mark)
          (insert string)
          (set-marker proc-mark (point)))
        (if moving (goto-char proc-mark))
        ;; process and remove all complete lines of JSON (lines are complete if ending with \n)
        (let ((pos (point-min)))
          (while (progn (goto-char pos)
                        (end-of-line)
                        (equal (following-char) ?\n))
            (let* ((end (point))
                   (line (buffer-substring pos end)))
              (delete-region pos (+ end 1))
              (with-current-buffer (get-buffer my-live-speech-buffer)
                (my-live-speech-handle-json line)))))))))

Python code based on the Deepgram streaming test suite:

Very rough app.py
# Based on streaming-test-suite
# https://developers.deepgram.com/docs/getting-started-with-the-streaming-test-suite

import pyaudio
import asyncio
import json
import os
import websockets
from datetime import datetime
import wave
import sys

startTime = datetime.now()
key = os.environ['DEEPGRAM_API_KEY']
live_json = os.environ.get('LIVE_CAPTIONS_JSON', True)
all_mic_data = []
all_transcripts = []
all_words = []
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 8000

audio_queue = asyncio.Queue()
REALTIME_RESOLUTION = 0.250
SAMPLE_SIZE = 0

def save_info():
    global SAMPLE_SIZE
    base = startTime.strftime('%Y%m%d%H%M')
    wave_file_path = os.path.abspath(f"{base}.wav")
    wave_file = wave.open(wave_file_path, "wb")
    wave_file.setnchannels(CHANNELS)
    wave_file.setsampwidth(SAMPLE_SIZE)
    wave_file.setframerate(RATE)
    wave_file.writeframes(b"".join(all_mic_data))
    wave_file.close()
    with open(f"{base}.txt", "w") as f:
        f.write("\n".join(all_transcripts))
    with open(f"{base}.json", "w") as f:
        f.write(json.dumps(all_words))
    if live_json:
        print(f'{{"msg": "🟢 Saved to {base}.txt , {base}.json , {base}.wav", "base": "{base}"}}')
    else:
        print(f"🟢 Saved to {base}.txt , {base}.json , {base}.wav")

# Used for microphone streaming only.
def mic_callback(input_data, frame_count, time_info, status_flag):
    audio_queue.put_nowait(input_data)
    return (input_data, pyaudio.paContinue)

async def run(key, method="mic", format="text", **kwargs):
    deepgram_url = f'wss://api.deepgram.com/v1/listen?punctuate=true&smart_format=true&utterances=true&encoding=linear16&sample_rate=16000'
    async with websockets.connect(
        deepgram_url, extra_headers={"Authorization": "Token {}".format(key)}
    ) as ws:
        async def sender(ws):
            try:
                while True:
                    mic_data = await audio_queue.get()
                    all_mic_data.append(mic_data)
                    await ws.send(mic_data)
            except websockets.exceptions.ConnectionClosedOK:
                await ws.send(json.dumps({"type": "CloseStream"}))
                if live_json:
                    print('{"msg": "Closed."}')
                else:
                    print("Closed.")
        async def receiver(ws):
            global all_words
            """Print out the messages received from the server."""
            first_message = True
            first_transcript = True
            transcript = ""
            async for msg in ws:
                res = json.loads(msg)
                if first_message:
                    first_message = False
                try:
                    # handle local server messages
                    if res.get("msg"):
                        if live_json:
                            print(json.dumps(res))
                        else:
                            print(res["msg"])
                    if res.get("is_final"):
                        transcript = (
                            res.get("channel", {})
                            .get("alternatives", [{}])[0]
                            .get("transcript", "")
                        )
                        if transcript != "":
                            if first_transcript:
                                first_transcript = False
                            if live_json:
                                print(json.dumps(res.get("channel", {}).get("alternatives", [{}])[0]))
                            else:
                                print(transcript)
                            all_transcripts.append(transcript)
                            all_words = all_words + res.get("channel", {}).get("alternatives", [{}])[0].get("words", [])
                        # if using the microphone, close stream if user says "goodbye"
                        if method == "mic" and "goodbye" in transcript.lower():
                            await ws.send(json.dumps({"type": "CloseStream"}))
                            if live_json:
                                print('{"msg": "Done."}')
                            else:
                                print("Done.")
                    # handle end of stream
                    if res.get("created"):
                        save_info()
                except KeyError:
                    print(f"🔴 ERROR: Received unexpected API response! {msg}")

        # Set up microphone if streaming from mic
        async def microphone():
            audio = pyaudio.PyAudio()
            stream = audio.open(
                format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK,
                stream_callback=mic_callback,
            )

            stream.start_stream()

            global SAMPLE_SIZE
            SAMPLE_SIZE = audio.get_sample_size(FORMAT)

            while stream.is_active():
                await asyncio.sleep(0.1)

            stream.stop_stream()
            stream.close()

        functions = [
            asyncio.ensure_future(sender(ws)),
            asyncio.ensure_future(receiver(ws)),
        ]

        functions.append(asyncio.ensure_future(microphone()))
        if live_json:
            print('{"msg": "Ready."}')
        else:
            print("🟢 Ready.")
        await asyncio.gather(*functions)

def main():
    """Entrypoint for the example."""
    # Parse the command-line arguments.
    try:
        asyncio.run(run(key, "mic", "text"))
    except websockets.exceptions.InvalidStatusCode as e:
        print(f'🔴 ERROR: Could not connect to Deepgram! {e.headers.get("dg-error")}')
        print(
            f'🔴 Please contact Deepgram Support (developers@deepgram.com) with request ID {e.headers.get("dg-request-id")}'
        )
        return
    except websockets.exceptions.ConnectionClosedError as e:
        error_description = f"Unknown websocket error."
        print(
            f"🔴 ERROR: Deepgram connection unexpectedly closed with code {e.code} and payload {e.reason}"
        )

        if e.reason == "DATA-0000":
            error_description = "The payload cannot be decoded as audio. It is either not audio data or is a codec unsupported by Deepgram."
        elif e.reason == "NET-0000":
            error_description = "The service has not transmitted a Text frame to the client within the timeout window. This may indicate an issue internally in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame."
        elif e.reason == "NET-0001":
            error_description = "The service has not received a Binary frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them."

        print(f"🔴 {error_description}")
        # TODO: update with link to streaming troubleshooting page once available
        # print(f'🔴 Refer to our troubleshooting suggestions: ')
        print(
            f"🔴 Please contact Deepgram Support (developers@deepgram.com) with the request ID listed above."
        )
        return

    except websockets.exceptions.ConnectionClosedOK:
        return

    except Exception as e:
        print(f"🔴 ERROR: Something went wrong! {e}")
        save_info()
        return


if __name__ == "__main__":
    sys.exit(main() or 0)

The Python script sends the microphone stream to Deepgram and prints out the JSON output. The Emacs Lisp code starts an asynchronous process and reads the JSON output, displaying the transcript and calculating the WPM based on the words. run.sh just loads the venv for this project (requirements.txt based on the streaming text suite) and then runs app.py, since some of the Python library versions conflict with other things I want to experiment with.

I also added my-live-speech-wpm-string to my mode-line-format manually using Customize, since I wanted it displayed on the left side instead of getting lost when I turn keycast-mode on.

I'm still a little anxious about accidentally leaving a process running, so I check with ps aux | grep python3. Eventually I'll figure out how to make sure everything gets properly stopped when I'm done.

Anyway, there it is!

Display in speech buffer

(defun my-live-speech-display-in-speech-buffer (recognition-results)
  (with-current-buffer (get-buffer-create my-live-speech-buffer)
    (let-alist recognition-results
      (let* ((pos (point))
             (at-end (eobp)))
        (goto-char (point-max))
        (unless (eolp) (insert "\n"))
        (when .msg
          (insert .msg "\n"))
        (when .transcript
          (insert .transcript "\n"))
        ;; scroll to the bottom if being displayed
        (if at-end
            (when (get-buffer-window (current-buffer))
              (set-window-point (get-buffer-window (current-buffer)) (point)))
          (goto-char pos))))))

(defun my-live-speech-toggle-heading ()
  "Toggle a line as a heading."
  (interactive)
  (with-current-buffer (get-buffer my-live-speech-buffer)
    (display-buffer (current-buffer))
    (with-selected-window (get-buffer-window (get-buffer my-live-speech-buffer))
      (let ((avy-all-windows nil))
        (avy-goto-line 1))
      (org-toggle-heading 1))))

(defun my-live-speech-cycle-visibility ()
  "Get a quick overview."
  (interactive)
  (with-current-buffer (get-buffer my-live-speech-buffer)
    (display-buffer (current-buffer))
    (if (eq org-cycle-global-status 'contents)
        (progn
          (run-hook-with-args 'org-cycle-pre-hook 'all)
          (org-fold-show-all '(headings blocks))
          (setq org-cycle-global-status 'all)
          (run-hook-with-args 'org-cycle-hook 'all))
      (run-hook-with-args 'org-cycle-pre-hook 'contents)
      (org-cycle-content)
      (setq org-cycle-global-status 'contents)
      (run-hook-with-args 'org-cycle-hook 'contents))))

Display words per minute

(defvar my-live-speech-wpm-window-seconds 15 "How many seconds to calculate WPM for.")
(defvar my-live-speech-recent-words nil "Words spoken in `my-live-speech-wpm-window-minutes'.")
(defvar my-live-speech-wpm nil "Current WPM.")
(defvar my-live-speech-wpm-colors  ; haven't figured out how to make these work yet
  '((180 :foreground "red")
    (170 :foreground "yellow")
    (160 :foreground "green")))
(defvar my-live-speech-wpm-string nil "Add this somewhere in `mode-line-format'.")
(defun my-live-speech-wpm-string ()
  (propertize
   (format "%d WPM " my-live-speech-wpm)
   'face
   (cdr (seq-find (lambda (row) (> my-live-speech-wpm (car row))) my-live-speech-wpm-colors))))

(defun my-live-speech-display-wpm (recognition-results)
  (let-alist recognition-results
    (when .words
      ;; calculate WPM
      (setq my-live-speech-recent-words
            (append my-live-speech-recent-words .words nil))
      (let ((threshold (- (assoc-default 'end (aref .words (1- (length .words))))
                          my-live-speech-wpm-window-seconds)))
        (setq my-live-speech-recent-words
              (seq-filter
               (lambda (o)
                 (>= (assoc-default 'start o)
                     threshold))
               my-live-speech-recent-words))
        (setq my-live-speech-wpm
              (/
               (length my-live-speech-recent-words)
               (/ (- (assoc-default 'end (aref .words (1- (length .words))))
                     (assoc-default 'start (car my-live-speech-recent-words)))
                  60.0)))
        (setq my-live-speech-wpm-string (my-live-speech-wpm-string))))))

Append to EmacsConf Etherpad

(defvar my-live-speech-etherpad-id nil)
(defun my-live-speech-append-to-etherpad (recognition-results)
  (when my-live-speech-etherpad-id
    (emacsconf-pad-append-text my-live-speech-etherpad-id (concat " " (assoc-default 'transcript recognition-results)))))
This is part of my Emacs configuration.

Late-night braindumps by talking to myself

| geek, speechtotext

At my most recent eye exam, the optometrist recommended that I use a warm compress and massage my eyes afterwards to help with dry eyes, so I ordered a USB-powered eye mask. I like how I don't have to fuss around with figuring out the temperature and duration, especially now that things are colder. It's just there by my bed. It turns on. It has a timer, so I can set it to automatically turn off at 20 minutes. The medium heat setting is warm enough to be warm but not uncomfortable, even in winter.

Since I can't see anything while I have the compress on, I've taken to recording a quick braindump of the day. It's a nice way to wind down. I end up clearing my brain, and I usually just sleep afterwards instead of staying up. W- keeps later hours than I do, so I can talk without disturbing anyone.

Google Recorder automatically transcribes my recording on the fly with varying levels of accuracy. It's enough to mostly recognize my thoughts next day, and I can tap on a word on my phone to jump back to that part of the audio. If I really want to, I can use aeneas to align the text to the recording so that I can do that sort of verification on my computer. For the most part, I've been able to understand things from context.

I've been experimenting with using special keywords to set off parts of the text that I might want to pay special attention to. At first, I tried phrases like "begin elephant" and "end elephant", but "end" and "and" are easily confused by the speech-to-text recognizer, and "elephant" gets misrecognized sometimes too. "Hello" and "goodbye" seemed to work better. "Hello computer" / "Goodbye computer" and "Hello notebook" and "Goodbye notebook" seem to work okay. Might go with "Hello notebook", since it has fewer syllables.

My goal for that time is to do a braindump of the different things I need to think about and capture, and then make it easier to follow up. It might also be a good time to review the day and plan tomorrow. Sometimes I might talk my way through the idea for a blog post or a decision I'm considering. I still tend to stutter and wander all over the place, but it's useful to have notes as a starting point.

How could I improve this workflow? I would like to record my thoughts and ideally, have the transcript automatically show up in a folder in a directory and my computer. To reduce the friction in that, I might need to do the transcription on my computer, because Google Recorder requires me to tap a few buttons in order to share the transcript unless I root my phone, which is a bit of an involved process that might need to wait until after EmacsConf. If I use another recording app that saves the audio to a directory that is synchronized to sync thing to my server, then I can transcribe it with OpenAI Whisper and align it with aeneas to get the timestamps. My laptop is pretty slow, but I'll be asleep anyway, so maybe the laptop can work on it overnight.

I would ideally like it to create Org Mode entries, maybe putting those entries in a separate inbox if I'm worried about messing up my main inbox. It can pull out the text between "Hello, notebook" and "Goodbye, notebook" sections along with links to the files where they came from so that I can review the context. Org Mode links can jump to text or lines in a text file, or I can add support for easily capturing and linking to subtitles in subed.el (maybe a subed-org.el). Who knows? I might even be able to start extracting audio snippets in order to create a quick presentation using something like compile-media. (I'm recording in a relatively quiet room and no one's excitedly trying to interrupt me, so I might as well take advantage of that!) Oh, maybe I should also add a command to copy the subtitles in the region as plain text. If I figure out a grammar, I might even be able to automate or partially automate common commands.

I don't need to record everything 24x7 (although someone's experimented with that), but these short snippets might be a good start. It could lead to getting more stuff out of my brain, and that can be handy too.

Related: Audio braindump workflow tweaks: Adding Org Mode hyperlinks to recordings based on keywords

View org source for this post