Small steps towards using OpenAI-compatible text-to-speech services with speechd-el or emacspeak
Speech synthesis has come a long way since I first tried out Emacspeak in 2002. Kokoro TTS and Piper offer more natural-sounding voices now, although the initial delay in loading the models and generating speech mean that they aren't quite ready to completely replace espeak, which is faster but more robotic. I've been using the Kokoro FastAPI through my own functions for working with various speech systems. I wanted to see if I could get Kokoro and other OpenAI-compatible text-to-speech services to work with either speechd-el or Emacspeak just in case I could take advantage of the rich functionality either provides for speech-synthesized Emacs use. speechd-el is easier to layer on top of an existing Emacs if you only want occasional speech, while emacspeak voice-enables many packages to an extent beyond speaking simply what's on the screen.
Speech synthesis is particularly helpful when I'm learning French because I can use it as a reference for what a paragraph or sentence should sound like. It's not perfect. Sometimes it uses liaisons that my tutor and Google Translate don't use. But it's a decent enough starting point. I also used it before to read out IRC mentions and compile notifications so that I could hear them even if I was paying attention to a different activity.
Here's a demonstration of speechd reading out the following lines using the code I've just uploaded to https://codeberg.org/sachac/speechd-ai:
- The quick brown fox jumps over the lazy dog.
- Now let's set the language to French so we can read the next line.
- Bonjour, je m'appelle Emacs.
There's about a 2-second delay between the command and the start of the audio for the sentence.
Note that speechd-speak-read-sentence fails in some cases where (forward-sentence 1) isn't the same place as (backward-sentence 1) (forward-sentence 1), which can happen when you're in an Org Mode list. I've submitted a patch upstream.
Aside from that, speechd-speak-set-language, speechd-speak-read-paragraph and
speechd-speak-read-region are also useful
commands. I think the latency makes this best-suited for reading paragraphs, or for shadowing sentences for language learning.
I'm still trying to figure out how to get speechd-speak to work as smoothly as I'd like. I think I've got it set up so that the server falls back to espeak for short texts so that it can handle words or characters better, and uses the specified server for longer ones. I'd like to get to the point where it can handle all the things that speechd usually does, like saying lines as I navigate through them or giving me feedback as I'm typing. Maybe it can use espeak for fast feedback character by character and word by word, and then use Kokoro TTS for the full sentence when I finish. Then it will be possible to use it to type things without looking at the screen.
After putting this together, I still find myself leaning towards my own functions because they make it easy to see the generated speech output to a file, which is handy for saving reference audio that I can play on my phone and for making replays almost instant. That could also be useful for pre-generating the next paragraph to make it flow more smoothly. Still, it was interesting making something that is compatible with existing protocols and libraries.
Posting it in case anyone else wants to use it as a starting point. The repository also contains the starting point for an Emacspeak-compatible speech server. See See speechd-ai/README.org for more details.

