Topic - Speech recognition

I'm currently using the Speaches server with Systran/faster-whisper-base to transcribe speech into Emacs. I use my fork of natrys/whisper.el (upstream) so that I can continuously queue transcription, run the output through various functions, and save the audio.
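
For readers who want to try the server outside Emacs, here is a minimal Python sketch of sending one recording to it. It assumes the server is listening on http://localhost:8000 and exposes an OpenAI-compatible /v1/audio/transcriptions endpoint that returns JSON with a "text" field. The address, the function names, and the file name are illustrative assumptions, not part of the setup described above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path,
               base_url="http://localhost:8000",  # assumed server address
               model="Systran/faster-whisper-base"):
    """Upload an audio file and return the transcribed text."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": model}, "file", os.path.basename(path), audio
    )
    request = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",  # assumed OpenAI-style route
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Calling transcribe("note.wav") with a recording on disk should return the text as a string, provided the server is running and the assumptions above hold.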

Why I'm interested in speech recognition:

  • Sometimes I have a hard time remembering my train of thought. (Limitations of working memory!) It's useful to capture my thoughts faster than I can type, even if the transcription has errors, and I can replay the recording if needed.
  • Speech would be a great interface for mobile computing.
  • Any number of reasons can make it difficult to type, either temporarily or permanently, such as accidents, disease, or physical decline. My mom has Parkinson's and her tremors make it hard for her to type most of the time. I know other people who've broken bones or gotten RSI. I enjoy writing and programming, so it makes sense to figure out alternative ways of input before I really need them.
  • I'd like to be able to use my computer while my hands are busy with sewing or knitting.
  • A voice interface could open up interesting possibilities. I'm not quite brave enough to give AI agents full access to my Emacs, but maybe someday.

Here's what I've figured out so far:

Some things I'd like to be able to do by voice:

  • Support multiple wake words in addition to "Okay, …"
  • Scroll up and down in Emacs and other applications.
  • Cut, copy, and paste.
  • Insert or navigate to links based on my bookmarks.
  • Select and act on different logical elements such as sentences or paragraphs.
  • Select a word or a range of text.
  • Press keyboard shortcuts. I might need to use xdotool to simulate keypresses.
  • Run M-x commands.
  • Insert text into the minibuffer and potentially press Enter.
  • Insert names of symbols, such as with-current-buffer.
  • Answer yes or no prompts. This is a little tricky because y-or-n-p is a blocking function and Emacs is single-threaded, so I might need to have an external process that uses xdotool to simulate keystrokes.
  • Select an option by number or letter. This is similar to the y-or-n-p problem, just with a wider range of choices.
  • Monitor simultaneous EmacsConf BigBlueButton web conferences for keywords so I can tell when a speaker needs my help.
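
Several of the items above (wake words, scrolling, yes/no prompts, choosing an option by number or letter) come down to turning a transcript into a keypress that an external process sends with xdotool. Here is a rough Python sketch of that idea. The wake words, the phrase table, and the function names are all hypothetical, and the key chords are generic desktop ones rather than Emacs bindings. The code only builds the xdotool command line; running it is left to the caller.

```python
import re

# Hypothetical wake words; "okay" mirrors the "Okay, ..." prefix mentioned above.
WAKE_WORDS = ("okay", "computer")

# Assumed mapping from spoken phrases to xdotool key names.
# Next and Prior are the X keysyms for Page Down and Page Up.
KEY_COMMANDS = {
    "scroll down": "Next",
    "scroll up": "Prior",
    "copy": "ctrl+c",
    "paste": "ctrl+v",
    "yes": "y",
    "no": "n",
}


def parse_command(transcript):
    """Return the text after a wake word, or None if there is no wake word."""
    # Whisper output tends to include capitals and punctuation, so normalize.
    words = re.sub(r"[^\w\s]", " ", transcript.lower()).split()
    if not words or words[0] not in WAKE_WORDS:
        return None
    return " ".join(words[1:])


def to_xdotool(command):
    """Translate a parsed command into an xdotool argument list, or None."""
    if command is None:
        return None
    if command in KEY_COMMANDS:
        return ["xdotool", "key", KEY_COMMANDS[command]]
    # "option 3" or "select b": press a single digit or letter.
    match = re.fullmatch(r"(?:select|option) ([0-9a-z])", command)
    if match:
        return ["xdotool", "key", match.group(1)]
    return None
```

A caller could pass the resulting list to subprocess.run from outside Emacs, which sidesteps the blocking y-or-n-p problem because the keypress arrives like ordinary keyboard input.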
You can e-mail me at sacha@sachachua.com.