Using Silero voice activity detection to automatically queue multiple transcriptions with natrys/whisper.el

| audio, speech-recognition, emacs

I can queue multiple transcriptions with whisper.el so that they get processed sequentially, with backup audio saved along the way. The queue catches up when I pause to think. Now I want to use Silero voice activity detection to do that kind of segmentation for me automatically.

First, I need a Python script that can print out events when it notices the start or stop of a speech segment. If I print out the timestamps, I might be able to cross-reference them someday with other interesting things. For now, even just paying attention to the end of a segment is enough for what I want to do.

Python script for printing out events
import sounddevice as sd
import numpy as np
import torch
import sys
from datetime import datetime, timedelta

# Silero VAD expects 16 kHz audio delivered in 512-sample chunks.
SAMPLING_RATE = 16000
CHUNK_SIZE = 512
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=False)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils
# Treat a pause of at least one second as the end of a speech segment.
vad_iterator = VADIterator(model, threshold=0.5, min_silence_duration_ms=1000)

stream_start_time = None

def format_iso_with_offset(offset_seconds):
    if stream_start_time is None:
        return "PENDING"
    event_time = stream_start_time + timedelta(seconds=offset_seconds)
    return event_time.astimezone().isoformat(timespec='milliseconds')

# Called by sounddevice for every block of CHUNK_SIZE samples.
def audio_callback(indata, frames, time, status):
    global stream_start_time
    if status:
        print(status, file=sys.stderr)
    if stream_start_time is None:
        # Anchor timestamps to the moment audio actually starts flowing.
        stream_start_time = datetime.now()
    tensor_input = torch.from_numpy(indata.copy()).flatten()
    speech_dict = vad_iterator(tensor_input, return_seconds=True)
    if speech_dict:
        if "start" in speech_dict:
            print(f"START {format_iso_with_offset(speech_dict['start'])}", flush=True)
        if "end" in speech_dict:
            print(f"END {format_iso_with_offset(speech_dict['end'])}", flush=True)

try:
    with sd.InputStream(samplerate=SAMPLING_RATE,
                        channels=1,
                        callback=audio_callback,
                        blocksize=CHUNK_SIZE):
        # All the work happens in the callback; just keep the main thread alive.
        while True:
            sd.sleep(1000)
except KeyboardInterrupt:
    print("\nStopping...")

Then I can start this process from Emacs:

(defvar my-vad-events-process nil)
(defvar my-vad-events-dir "~/proj/speech/")
(defvar my-vad-events-command `(,(expand-file-name ".venv/bin/python" my-vad-events-dir)
                                "vad-events.py"))

(defun my-vad-events-filter (proc string)
  "Call `my-whisper-continue' when the VAD reports an END event while recording."
  (when (and (string-match "^END" string)
             (process-live-p whisper--recording-process))
    (message "Noticed speech turn...")
    (my-whisper-continue)))

(defun my-vad-events-ensure ()
  "Start the process if it's not already running."
  (interactive)
  (unless (process-live-p my-vad-events-process)
    (let ((default-directory my-vad-events-dir)
          (process-environment
           (cons
            (format
             "PULSE_PROP=node.description='%s' media.name='%s' node.name='%s'"
             "vad" "vad" "vad")
            process-environment)))
      (setq my-vad-events-process
            (make-process
             :name "vad-events"
             :command my-vad-events-command
             :buffer (get-buffer-create "*vad-events*")
             :stderr (get-buffer-create "*vad-events-err*")
             :filter #'my-vad-events-filter)))))

Because I added Pulse properties to the process environment, I can easily use qpwgraph to rewire the input so that it comes from my VirtualMicSink instead of the default system audio device. (Someday I'll figure out how to specify that as the input automatically.)
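
As a sketch of what that might look like: since the PULSE_PROP label is already being picked up, the audio seems to go through the Pulse compatibility layer, so it might also honour the standard PULSE_SOURCE client variable. The helper below is hypothetical, and "VirtualMicSink.monitor" is a guess at the source name (pactl list short sources shows the real ones):

(defvar my-vad-events-pulse-source "VirtualMicSink.monitor"
  "Name of the PulseAudio/PipeWire source the VAD process should record from.")

(defun my-vad-events-environment ()
  "Return `process-environment' with the VAD client labelled and pointed at a source.
PULSE_PROP names the client so it is easy to spot in the patchbay;
PULSE_SOURCE asks the Pulse layer to record from a specific source."
  (append
   (list (format "PULSE_SOURCE=%s" my-vad-events-pulse-source)
         (format "PULSE_PROP=node.description='%s' media.name='%s' node.name='%s'"
                 "vad" "vad" "vad"))
   process-environment))

Then the let in my-vad-events-ensure could bind process-environment to (my-vad-events-environment) instead of building the list inline.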

Now I can press my shortcut for my-whisper-continue to start recording. As I keep talking, it continues to record. When I pause for more than a second between sentences, it sends that chunk to the server for transcription without me having to press another button, while still listening for more speech.
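
To tie the two pieces together, a small wrapper command could make sure the VAD events process is running before recording starts. This is just a sketch; the command name and key binding are arbitrary:

(defun my-whisper-start-talking ()
  "Ensure the VAD events process is running, then start or continue recording."
  (interactive)
  (my-vad-events-ensure)
  (my-whisper-continue))

(global-set-key (kbd "<f8>") #'my-whisper-start-talking)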

How is this different from the streaming approach that many real-time speech recognition services offer? I think this gives me a bit more visibility into and control of the process. For my personal use, I don't need to have everything processed as quickly as possible, and I'm not trying to replicate live captions. I just want to be able to look back over the last five minutes to try to remember what I was talking about. I usually have a lot of quiet time as I think through my next steps, and it's fine to have it catch up then. I also like that I can save time-stamped audio files for later processing, divided according to the speech segments. Those might be a little bit easier to work with when I get around to compositing them into a video.

This is part of my Emacs configuration.
View org source for this post