<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/assets/atom.xsl" type="text/xsl"?><feed
	xmlns="http://www.w3.org/2005/Atom"
	xmlns:thr="http://purl.org/syndication/thread/1.0"
	xml:lang="en-US"
	><title>Sacha Chua - category - audio</title>
	<subtitle>Emacs, sketches, and life</subtitle>
	<link rel="self" type="application/atom+xml" href="https://sachachua.com/blog/category/audio/feed/atom/index.xml" />
  <link rel="alternate" type="text/html" href="https://sachachua.com/blog/category/audio" />
  <id>https://sachachua.com/blog/category/audio/feed/atom/index.xml</id>
  <generator uri="https://11ty.dev">11ty</generator>
	<updated>2026-03-04T16:17:40Z</updated>
<entry>
		<title type="html">Expanding yasnippets by voice in Emacs and other applications</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-03-04T16:17:40Z</updated>
    <published>2026-03-04T16:17:40Z</published>
    <category term="emacs" />
<category term="audio" />
<category term="speech-recognition" />
		<id>https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/</id>
		<content type="html"><![CDATA[<p>
<a href="https://github.com/joaotavora/yasnippet">Yasnippet</a> is a template system for Emacs. I want to use it by voice. I'd like to be able to say things like "Okay, define interactive function" and have that expand to a matching snippet in Emacs or other applications. Here's a quick demonstration of expanding simple snippets:
</p>

<div class="media-post" id="org7591a42">
<p>
</p><figure><video controls="1" src="https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/2026-03-04_12.33.19.webm" type="video/webm"><track kind="subtitles" label="Captions" src="https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/2026-03-04_12.33.19.vtt" srclang="en" default=""><a href="https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/2026-03-04_12.33.19.webm">Download the video</a></video><figcaption><div>Screencast of expanding snippets by voice in Emacs and in other applications</div></figcaption></figure>
<p></p>

<details class="code-details" style="padding: 1em;
                 border-radius: 15px;
                 font-size: 0.9em;
                 box-shadow: 0.05em 0.1em 5px 0.01em  #00000057;">
                  <summary><strong>Transcript</strong></summary>
<ul class="org-ul">
<li><span class="media-time" data-start="0.000">00:00</span> So I've defined some yasnippets with names that I can say. Here, for example, in this menu, you can see I've got "define interactive function" and "with a buffer that I'll display." And in fundamental mode, I have some other things too. Let's give it a try.</li>
<li><span class="media-time" data-start="19.000">00:19</span> I press my shortcut. "Okay, define an interactive function." You can see that this is a yasnippet. Tab navigation still works.</li>
<li><span class="media-time" data-start="33.000">00:33</span> I can say, "OK, with a buffer that I'll display," and it expands that also.</li>
<li><span class="media-time" data-start="45.000">00:45</span> I can expand snippets in other applications as well, thanks to a global keyboard shortcut.</li>
<li><span class="media-time" data-start="50.000">00:50</span> Here, for example, I can say, "OK, my email." It inserts my email address.</li>
<li><span class="media-time" data-start="62.000">01:02</span> Yasnippet definitions can also execute Emacs Lisp. So I can say, "OK, date today," and have that evaluated to the actual date.</li>
<li><span class="media-time" data-start="81.000">01:21</span> So that's an example of using voice to expand snippets.</li>
</ul>


</details>

</div>

<p>
This is handled by the following code:
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-maybe-expand-snippet</span> (text)
  <span class="org-doc">"Add to `</span><span class="org-doc"><span class="org-constant">whisper-insert-text-at-point</span></span><span class="org-doc">'."</span>
  (<span class="org-keyword">if</span> (<span class="org-keyword">and</span> text
           (string-match
            <span class="org-string">"^ok</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">(?:</span></span><span class="org-string">ay</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">)</span></span><span class="org-string">?[,\\.]? </span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">(</span></span><span class="org-string">.+</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">)</span></span><span class="org-string">"</span> text))
    (<span class="org-keyword">let*</span> ((name
            (downcase
             (string-trim
              (replace-regexp-in-string <span class="org-string">"[,\\.]"</span> <span class="org-string">""</span> (match-string 1 text)))))
           (matching
            (seq-find (<span class="org-keyword">lambda</span> (o)
                        (subed-word-data-compare-normalized-string-distance
                         name
                         (downcase (yas&#45;&#45;template-name o))))
                      (yas&#45;&#45;all-templates (yas&#45;&#45;get-snippet-tables)))))
      (<span class="org-keyword">if</span> matching
          (<span class="org-keyword">progn</span>
            (<span class="org-keyword">if</span> (frame-focus-state)
                (<span class="org-keyword">progn</span>
                  (yas-expand-snippet matching)
                  nil)
              <span class="org-comment-delimiter">;; </span><span class="org-comment">In another application</span>
              (<span class="org-keyword">with-temp-buffer</span>
                (yas-minor-mode)
                (yas-expand-snippet matching)
                (buffer-string))))
        text))
    text))
</code></pre>
</div>


<p>
This code relies on <a href="https://github.com/sachac/whisper.el/">my fork of whisper.el</a>, which lets me specify a list of functions for <code>whisper-insert-text-at-point</code>. (I haven't asked for upstream review yet because I'm still testing things, and I don't know if it actually works for anyone else yet.) It does approximate matching on the snippet name using a function from <a href="https://github.com/sachac/subed/blob/main/subed/subed-word-data.el">subed-word-data.el</a> which just uses <code>string-distance</code>. I could probably duplicate the function in my config, but then I'd have to update it in two places if I come up with more ideas.
</p>
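
<p>
To make the matching idea concrete, here is a minimal sketch in Python rather than the Emacs Lisp above. It strips the leading "ok" or "okay", normalizes the rest, and picks the closest snippet name. It swaps in <code>difflib.get_close_matches</code> from the Python standard library in place of <code>string-distance</code>. The sample snippet names, the function names, and the 0.6 cutoff are all assumptions for illustration.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>import difflib
import re

# Sample snippet names, standing in for yasnippet template names.
SNIPPET_NAMES = [
    "define interactive function",
    "with a buffer that I'll display",
    "my email",
    "date today",
]

# Same shape as the Emacs Lisp regexp: "ok" or "okay", optional punctuation, then the phrase.
COMMAND_RE = re.compile(r"^ok(?:ay)?[,.]? (.+)", re.IGNORECASE)


def spoken_snippet_name(text):
    """Return the normalized phrase after "ok"/"okay", or None if TEXT is not a command."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        return None
    return re.sub(r"[,.]", "", match.group(1)).strip().lower()


def find_snippet(text, names=SNIPPET_NAMES, cutoff=0.6):
    """Return the snippet name closest to the spoken phrase, or None."""
    name = spoken_snippet_name(text)
    if name is None:
        return None
    lowered = {n.lower(): n for n in names}
    # get_close_matches returns the best matches whose similarity ratio meets the cutoff.
    close = difflib.get_close_matches(name, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None
</code></pre>
</div>

<p>
With these sample names, <code>find_snippet("Okay, define an interactive function.")</code> resolves to "define interactive function" even though the spoken phrase has an extra word.
</p>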

<p>
The code for inserting into other applications is defined in <a href="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/#multimedia-whisper">my-whisper-maybe-type</a>, which is very simple:
</p>

<p>

</p><div class="org-src-container">
<pre class="src src-emacs-lisp" id="org3c4a075"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-maybe-type</span> (text)
  <span class="org-doc">"If Emacs is not the focused app, simulate typing TEXT.</span>
<span class="org-doc">Add this function to `</span><span class="org-doc"><span class="org-constant">whisper-insert-text-at-point</span></span><span class="org-doc">'."</span>
  (<span class="org-keyword">when</span> text
    (<span class="org-keyword">if</span> (frame-focus-state)
        text
      (make-process <span class="org-builtin">:name</span> <span class="org-string">"xdotool"</span> <span class="org-builtin">:command</span>
                    (list <span class="org-string">"xdotool"</span> <span class="org-string">"type"</span>
                          text))
      nil)))
</code></pre>
</div>
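
<p>
Both functions above follow the same pattern: take the text, then either return a string to pass along or return nil once the text has been dealt with. As a rough illustration of chaining handlers like that, here is a Python sketch. It is not the whisper.el code, and <code>run_handlers</code>, <code>expand_email</code>, and <code>make_typer</code> are hypothetical names made up for this example.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>def run_handlers(text, handlers):
    """Feed TEXT through HANDLERS in order.

    Each handler returns a (possibly changed) string to pass along,
    or None to signal that it already dealt with the text.
    """
    for handler in handlers:
        if text is None:
            break
        text = handler(text)
    return text


def expand_email(text):
    """Hypothetical handler: replace one spoken phrase with an expansion."""
    normalized = text.strip().lower().replace(",", "").rstrip(".")
    if normalized == "ok my email":
        return "me@example.com"
    return text


def make_typer(emacs_focused, typed):
    """Build a hypothetical handler that "types" text when Emacs is not focused."""
    def maybe_type(text):
        if emacs_focused:
            return text
        typed.append(text)  # stand-in for running "xdotool type"
        return None
    return maybe_type
</code></pre>
</div>

<p>
When the simulated focus is on Emacs, <code>run_handlers</code> returns the expanded text for insertion. Otherwise the last handler consumes it and the result is <code>None</code>.
</p>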


<p></p>

<p>
Someday I'd like to provide alternative names for snippets. I also want to make it easy to fill in snippet fields by voice. I'd love to be able to answer minibuffer questions from <code>yas-choose-value</code>, <code>yas-completing-read</code>, and other functions by voice too. Could be fun!
</p>

<p>
Related:
</p>

<ul class="org-ul">
<li><a href="https://github.com/sachac/snippets">My snippets repository</a></li>
<li><a href="https://sachachua.com/topic/speech-recognition/">My topic page for speech recognition</a></li>
</ul>

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-expanding-yasnippet-by-voice">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/03/expanding-yasnippet-by-voice/index.org">View Org source for this post</a></div><p>You can <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F03%2Fexpanding-yasnippet-by-voice%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">Using speech recognition for on-the-fly translations in Emacs and faking in-buffer completion for the results</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/02/using-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-03-31T16:31:15Z</updated>
    <published>2026-02-27T20:11:58Z</published>
    <category term="audio" />
<category term="speech-recognition" />
<category term="emacs" />
<category term="speech" />
		<id>https://sachachua.com/blog/2026/02/using-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results/</id>
		<content type="html"><![CDATA[<div class="update" id="orgeabb215">
<p>
<span class="timestamp-wrapper"><time class="timestamp" datetime="2026-03-31">[2026-03-31 Tue]</time></span>: I added the recognized text so that I can confirm what was translated. I also moved my-type-with-hint to <a href="https://codeberg.org/sachac/learn-lang/src/branch/main/learn-lang-type.el">learn-lang-type-with-hint</a>.
</p>

</div>

<p>
When I'm writing a journal entry in French, I
sometimes want to translate a phrase that I can't
look up word by word using a dictionary.
Instead of switching to a browser, I can use an
Emacs function to prompt me for text and either
insert or display the translation.
The <a href="https://github.com/alphapapa/plz.el">plz</a> library makes HTTP requests slightly
neater.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-lang-en-to-fr</span> (text <span class="org-type">&amp;optional</span> display-only)
  (<span class="org-keyword">interactive</span> (list (read-string <span class="org-string">"Text: "</span>) current-prefix-arg))
  (<span class="org-keyword">let*</span> ((url <span class="org-string">"https://translation.googleapis.com/language/translate/v2"</span>)
         (params <span class="org-highlight-quoted-quote">`</span>((<span class="org-string">"key"</span> . ,(getenv <span class="org-string">"GOOGLE_API_KEY"</span>))
                   (<span class="org-string">"q"</span> . ,text)
                   (<span class="org-string">"source"</span> . <span class="org-string">"en"</span>)
                   (<span class="org-string">"target"</span> . <span class="org-string">"fr"</span>)
                   (<span class="org-string">"format"</span> . <span class="org-string">"text"</span>)))
         (query-string (mapconcat
                        (<span class="org-keyword">lambda</span> (pair)
                          (format <span class="org-string">"%s=%s"</span>
                                  (url-hexify-string (car pair))
                                  (url-hexify-string (cdr pair))))
                        params
                        <span class="org-string">"&amp;"</span>))
         (full-url (concat url <span class="org-string">"?"</span> query-string)))
    (<span class="org-keyword">let*</span> ((response (plz <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">get</span> full-url <span class="org-builtin">:as</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">json-read</span>))
           (data (alist-get <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">data</span> response))
           (translations (alist-get <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">translations</span> data))
           (first-translation (car translations))
           (translated-text (alist-get <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">translatedText</span> first-translation)))
      (<span class="org-keyword">when</span> (called-interactively-p <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">any</span>)
        (<span class="org-keyword">if</span> display-only
            (message <span class="org-string">"%s"</span> translated-text)
          (insert translated-text)))
      translated-text)))
</code></pre>
</div>
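
<p>
For comparison, here is a small Python sketch of the same two steps: building the query string and digging the translated text out of the JSON response. It makes no network request. The parameter names and the response shape are taken from the function above, the helper names are made up for this example, and the sample response in the usage note is hand-written rather than real API output.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>import json
from urllib.parse import quote, urlencode

BASE_URL = "https://translation.googleapis.com/language/translate/v2"


def build_translate_url(text, api_key, source="en", target="fr"):
    """Return the request URL with every parameter percent-encoded."""
    params = {
        "key": api_key,
        "q": text,
        "source": source,
        "target": target,
        "format": "text",
    }
    # quote_via=quote encodes spaces as %20, similar to url-hexify-string.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def first_translation(response_text):
    """Extract data.translations[0].translatedText from a JSON response body."""
    data = json.loads(response_text)
    return data["data"]["translations"][0]["translatedText"]
</code></pre>
</div>

<p>
For example, <code>first_translation('{"data": {"translations": [{"translatedText": "ceci est un test"}]}}')</code> returns "ceci est un test".
</p>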


<p>
I think it would be even nicer if I could use speech recognition, so I can keep it a little more separate from my typing thoughts. I want to be able to say "Okay, translate &hellip;" or "Okay, &hellip; in French" to get a translation. I've been using <a href="https://github.com/sachac/whisper.el/">my fork of natrys/whisper.el</a> for <a href="https://sachachua.com/topic/speech-recognition/">speech recognition</a> in English, and I like it a lot. By adding a function to <code>whisper-after-transcription-hook</code>, I can modify the intermediate results before they're inserted into the buffer.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-translate</span> ()
  (goto-char (point-min))
  (<span class="org-keyword">let</span> ((case-fold-search t))
    (<span class="org-keyword">when</span> (re-search-forward <span class="org-string">"okay[,\\.]? translate[,\\.]? </span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">(</span></span><span class="org-string">.+</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">)</span></span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">|</span></span><span class="org-string">okay[,\\.]? </span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">(</span></span><span class="org-string">.+?</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">)</span></span><span class="org-string"> in French"</span> nil t)
      (<span class="org-keyword">let*</span> ((s (<span class="org-keyword">or</span> (match-string 1) (match-string 2)))
             (translation (<span class="org-keyword">save-match-data</span> (my-lang-en-to-fr s))))
        (replace-match
         (propertize translation
                     <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">type-hint</span> translation
                     <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">type-original</span> s
                     <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">help-echo</span> s))))))

(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper</span>
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-after-transcription-hook</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">my-whisper-translate</span> 70))
</code></pre>
</div>
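
<p>
The regular expression has two alternatives, and only one of the two groups is set for any given match. Here is a Python sketch of the same pattern to show which group captures what. The function name is made up for this example.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>import re

# Alternative 1: "okay translate PHRASE"; alternative 2: "okay PHRASE in French".
TRANSLATE_RE = re.compile(
    r"okay[,.]? translate[,.]? (.+)|okay[,.]? (.+?) in French",
    re.IGNORECASE,
)


def phrase_to_translate(text):
    """Return the phrase to translate, or None if TEXT is not a translation request."""
    match = TRANSLATE_RE.search(text)
    if match is None:
        return None
    # Only one of the two groups participates in a match.
    return match.group(1) or match.group(2)
</code></pre>
</div>

<p>
Note that <code>phrase_to_translate("Okay, this is a test, in French.")</code> returns "this is a test," with the trailing comma, because the second group captures everything up to " in French".
</p>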


<p>
But that's too easy. I want to actually type things myself so that I get more practice. Something like an autocomplete suggestion would be handy as a way of showing me a hint at the cursor. The usual completion-at-point functions are too eager to insert things if there's only one candidate, so we'll just fake it with an overlay. This code works only with my whisper.el fork because it supports using a list of functions for <code>whisper-insert-text-at-point</code>.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-maybe-type-with-hints</span> (text)
  <span class="org-doc">"Add this function to `</span><span class="org-doc"><span class="org-constant">whisper-insert-text-at-point</span></span><span class="org-doc">'."</span>
  (<span class="org-keyword">let*</span> ((hint (<span class="org-keyword">and</span> text (org-find-text-property-in-string <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">type-hint</span> text)))
         (original (<span class="org-keyword">and</span> text (org-find-text-property-in-string <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">type-original</span> text))))
    (<span class="org-keyword">if</span> hint
        (<span class="org-keyword">progn</span>
          (learn-lang-type-with-hint hint original)
          nil)
      text)))
</code></pre>
</div>


<p>
Here's a demonstration of me saying "Okay, this is a test, in French.":
</p>

<div class="media-post" id="orgdf2cbf2">
<p>
</p><figure><video controls="1" src="https://sachachua.com/blog/2026/02/using-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results/2026-02-27-translate-and-type-with-hints.webm" type="video/webm"><a href="https://sachachua.com/blog/2026/02/using-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results/2026-02-27-translate-and-type-with-hints.webm">Download the video</a></video><figcaption><div>Screencast of using speech recognition to translate into French and provide a hint when typing</div></figcaption></figure>
<p></p>

</div>

<p>
Since we're faking in-buffer completion here, maybe we can still get away with considering this as an entry for <a href="https://sachachua.com/blog/2026/01/emacs-carnival-february-2026-completion/">Emacs Carnival February 2026: Completion</a>? =)</p>

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-using-speech-recognition-for-translations-in-emacs-and-faking-in-buffer-completion-for-the-results">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/02/using-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results/index.org">View Org source for this post</a></div><p>You can <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F02%2Fusing-speech-recognition-for-on-the-fly-translations-in-emacs-and-faking-in-buffer-completion-for-the-results%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">Using Silero voice activity detection to automatically queue multiple transcriptions with natrys/whisper.el</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/01/using-silero-voice-activity-detection-to-automatically-queue-multiple-transcriptions-with-natrys-whisper-el/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-03-27T23:03:01Z</updated>
    <published>2026-01-29T16:24:46Z</published>
    <category term="audio" />
<category term="speech-recognition" />
<category term="emacs" />
		<id>https://sachachua.com/blog/2026/01/using-silero-voice-activity-detection-to-automatically-queue-multiple-transcriptions-with-natrys-whisper-el/</id>
		<content type="html"><![CDATA[<div class="update" id="org67634b7">
<p>
<span class="timestamp-wrapper"><time class="timestamp" datetime="2026-03-27">[2026-03-27 Fri]</time></span>: Moved code to <a href="https://codeberg.org/sachac/speech-input">sachac/speech-input</a>.
</p>

</div>

<p>
I can <a href="https://sachachua.com/blog/2026/01/queue-multiple-transcriptions-with-whisper-el-speech-recognition/">queue multiple transcriptions with whisper.el</a> so that they get processed sequentially with backup audio. It catches up when I pause to think. Now I want to use <a href="https://pytorch.org/hub/snakers4_silero-vad_vad/">Silero voice activity detection</a> to do that kind of segmentation for me automatically.
</p>

<p>
First, I need a Python server that can print out events when it notices the start or stop of a speech segment. If I print out the timestamps, I might be able to cross-reference them someday with interesting things. For now, even just paying attention to the end of a segment is enough for what I want to do.
</p>

<details class="code-details" style="padding: 1em;
                 border-radius: 15px;
                 font-size: 0.9em;
                 box-shadow: 0.05em 0.1em 5px 0.01em  #00000057;">
                  <summary><strong>Python script for printing out events</strong></summary>

<div class="org-src-container">
<pre class="src src-python"><code><span class="org-keyword">import</span> sounddevice <span class="org-keyword">as</span> sd
<span class="org-keyword">import</span> numpy <span class="org-keyword">as</span> np
<span class="org-keyword">import</span> torch
<span class="org-keyword">import</span> sys
<span class="org-keyword">from</span> datetime <span class="org-keyword">import</span> datetime, timedelta

<span class="org-variable-name">SILENCE_DURATION</span> <span class="org-operator">=</span> 500
<span class="org-variable-name">SAMPLING_RATE</span> <span class="org-operator">=</span> 16000
<span class="org-variable-name">CHUNK_SIZE</span> <span class="org-operator">=</span> 512
<span class="org-variable-name">model</span>, <span class="org-variable-name">utils</span> <span class="org-operator">=</span> torch.hub.load(repo_or_dir<span class="org-operator">=</span><span class="org-string">'snakers4/silero-vad'</span>,
                              model<span class="org-operator">=</span><span class="org-string">'silero_vad'</span>,
                              force_reload<span class="org-operator">=</span><span class="org-constant">False</span>)

(<span class="org-variable-name">get_speech_timestamps</span>, <span class="org-variable-name">save_audio</span>, <span class="org-variable-name">read_audio</span>, <span class="org-variable-name">VADIterator</span>, <span class="org-variable-name">collect_chunks</span>) <span class="org-operator">=</span> utils
<span class="org-variable-name">vad_iterator</span> <span class="org-operator">=</span> VADIterator(model, threshold<span class="org-operator">=</span>0.5, min_silence_duration_ms<span class="org-operator">=</span>SILENCE_DURATION)

<span class="org-variable-name">stream_start_time</span> <span class="org-operator">=</span> <span class="org-constant">None</span>

<span class="org-keyword">def</span> <span class="org-function-name">format_iso_with_offset</span>(offset_seconds):
    <span class="org-keyword">if</span> stream_start_time <span class="org-keyword">is</span> <span class="org-constant">None</span>:
        <span class="org-keyword">return</span> <span class="org-string">"PENDING"</span>
    <span class="org-variable-name">event_time</span> <span class="org-operator">=</span> stream_start_time <span class="org-operator">+</span> timedelta(seconds<span class="org-operator">=</span>offset_seconds)
    <span class="org-keyword">return</span> event_time.astimezone().isoformat(timespec<span class="org-operator">=</span><span class="org-string">'milliseconds'</span>)

<span class="org-keyword">def</span> <span class="org-function-name">audio_callback</span>(indata, frames, time, status):
    <span class="org-keyword">global</span> stream_start_time
    <span class="org-keyword">if</span> status:
        <span class="org-builtin">print</span>(status, file<span class="org-operator">=</span>sys.stderr)
    <span class="org-keyword">if</span> stream_start_time <span class="org-keyword">is</span> <span class="org-constant">None</span>:
        <span class="org-variable-name">stream_start_time</span> <span class="org-operator">=</span> datetime.now()
    <span class="org-variable-name">tensor_input</span> <span class="org-operator">=</span> torch.from_numpy(indata.copy()).flatten()
    <span class="org-variable-name">speech_dict</span> <span class="org-operator">=</span> vad_iterator(tensor_input, return_seconds<span class="org-operator">=</span><span class="org-constant">True</span>)
    <span class="org-keyword">if</span> speech_dict:
        <span class="org-keyword">if</span> <span class="org-string">"start"</span> <span class="org-keyword">in</span> speech_dict:
            <span class="org-builtin">print</span>(f<span class="org-string">"START </span>{format_iso_with_offset(speech_dict['start'])}<span class="org-string">"</span>, flush<span class="org-operator">=</span><span class="org-constant">True</span>)
        <span class="org-keyword">if</span> <span class="org-string">"end"</span> <span class="org-keyword">in</span> speech_dict:
            <span class="org-builtin">print</span>(f<span class="org-string">"END </span>{format_iso_with_offset(speech_dict['end'])}<span class="org-string">"</span>, flush<span class="org-operator">=</span><span class="org-constant">True</span>)
<span class="org-keyword">try</span>:
    <span class="org-keyword">with</span> sd.InputStream(samplerate<span class="org-operator">=</span>SAMPLING_RATE,
                        channels<span class="org-operator">=</span>1,
                        callback<span class="org-operator">=</span>audio_callback,
                        blocksize<span class="org-operator">=</span>CHUNK_SIZE):
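        <span class="org-keyword">while</span> <span class="org-constant">True</span>:
            sd.sleep(1000)  <span class="org-comment-delimiter"># </span><span class="org-comment">sleep between wakeups instead of spinning the CPU</span>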
<span class="org-keyword">except</span> <span class="org-type">KeyboardInterrupt</span>:
    <span class="org-builtin">print</span>(<span class="org-string">"</span><span class="org-constant">\n</span><span class="org-string">Stopping..."</span>)
</code></pre>
</div>



</details>
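
<p>
The script prints lines such as <code>START 2026-01-29T11:24:46.000-05:00</code> and <code>END &hellip;</code>. As a sketch of what a consumer of that output could do with the timestamps, here is a small Python parser that pairs up events and reports segment lengths. The function names are made up for this example, and this is not how speech-input-vad.el reads the output.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>from datetime import datetime


def parse_event(line):
    """Split an event line into (kind, time), or return None for other output."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2 or parts[0] not in ("START", "END"):
        return None
    kind, stamp = parts
    if stamp == "PENDING":
        # The script prints PENDING when the stream start time is not known yet.
        return kind, None
    return kind, datetime.fromisoformat(stamp)


def segment_lengths(lines):
    """Return the length in seconds of each START/END pair, in order."""
    lengths = []
    started = None
    for line in lines:
        event = parse_event(line)
        if event is None or event[1] is None:
            continue
        kind, when = event
        if kind == "START":
            started = when
        elif started is not None:
            lengths.append((when - started).total_seconds())
            started = None
    return lengths
</code></pre>
</div>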

<p>
I've moved the code to <a href="https://codeberg.org/sachac/speech-input/src/branch/main/speech-input-vad.el">speech-input-vad.el</a>, so now I can use <code>speech-input-vad-start</code> to start monitoring speech events.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">use-package</span> speech-input <span class="org-builtin">:load-path</span> <span class="org-string">"~/proj/speech-input/"</span>)
</code></pre>
</div>


<p>
Because I added Pulse properties to the process environment, I can easily use <a href="https://sachachua.com/blog/2026/01/visualizing-and-managing-pipewire-audio-graphs-from-emacs/">epwgraph</a> to rewire the input so that it gets the input from my VirtualMicSink instead of the default system audio device. (Someday I'll figure out how to specify that as the input automatically.)
</p>

<p>
Then I can start this process from Emacs:
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">speech-input-vad</span>
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">speech-input-vad-on-end-functions</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-maybe-continue</span>))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-maybe-continue</span> ()
  (<span class="org-keyword">when</span> (process-live-p whisper&#45;&#45;recording-process)
    (my-whisper-continue)))
</code></pre>
</div>


<p>
Now I can press my shortcut for <code>my-whisper-continue</code> to start the process. As I keep talking, it will continue to record. When I pause long enough between sentences (<code>SILENCE_DURATION</code> in the script above is 500 ms), it sends that chunk to the server for transcription without me having to press another button, while still listening for more speech.
</p>
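
<p>
To show why a short pause doesn't split a segment but a longer one does, here is a simplified Python model of end-of-segment detection. It is not how Silero's <code>VADIterator</code> is implemented. It only assumes a stream of per-chunk speech/no-speech flags and uses the numbers from the script above: 512-sample chunks at 16000 Hz are 32 ms each, so 500 ms of silence is 16 chunks in a row.
</p>

<div class="org-src-container">
<pre class="src src-python"><code>import math


def segment_ends(speech_flags, chunk_ms=32.0, min_silence_ms=500):
    """Return the chunk index where each speech segment ends.

    A segment ends at the first silent chunk of a silent run that lasts
    at least MIN_SILENCE_MS. Shorter pauses are treated as part of the segment.
    """
    needed = math.ceil(min_silence_ms / chunk_ms)
    ends = []
    in_speech = False
    quiet = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            in_speech = True
            quiet = 0
        elif in_speech:
            quiet += 1
            if quiet == needed:
                ends.append(index - needed + 1)  # first chunk of the silent run
                in_speech = False
                quiet = 0
    return ends
</code></pre>
</div>

<p>
For example, ten chunks of speech, a five-chunk pause, five more chunks of speech, and then twenty silent chunks produce a single segment end at chunk 20, because the five-chunk pause (160 ms) is too short to count.
</p>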

<p>
How is this different from the streaming approach that many real-time speech recognition services offer? I think this gives me a bit more visibility into and control of the process. For my personal use, I don't need to have everything processed as quickly as possible, and I'm not trying to replicate live captions. I just want to be able to look back over the last five minutes to try to remember what I was talking about. I usually have a lot of quiet time as I think through my next steps, and it's fine to have it catch up then. I also like that I can save time-stamped audio files for later processing, divided according to the speech segments. Those might be a little bit easier to work with when I get around to compositing them into a video.
</p>

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-using-silero-voice-activity-detection-to-automatically-queue-multiple-transcriptions-with-natrys-whisper-el">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/01/using-silero-voice-activity-detection-to-automatically-queue-multiple-transcriptions-with-natrys-whisper-el/index.org">View Org source for this post</a></div><p>You can <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F01%2Fusing-silero-voice-activity-detection-to-automatically-queue-multiple-transcriptions-with-natrys-whisper-el%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">Queuing multiple transcriptions with whisper.el speech recognition</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/01/queue-multiple-transcriptions-with-whisper-el-speech-recognition/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-01-23T19:50:37Z</updated>
    <published>2026-01-23T19:50:37Z</published>
    <category term="audio" />
<category term="speech-recognition" />
<category term="emacs" />
		<id>https://sachachua.com/blog/2026/01/queue-multiple-transcriptions-with-whisper-el-speech-recognition/</id>
		<content type="html"><![CDATA[<p>
I want to be able to talk out loud and have the ideas go into Emacs. I can do this in a number of different ways:
</p>

<ol class="org-ol">
<li>I briefly demonstrated a step-by-step approach with <a href="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/">natrys/whisper.el with a single file</a>. I press a keyboard shortcut to start the recording, another shortcut to stop the recording, and it transcribes it in the background. But the way whisper.el is set up is that if I press the keyboard shortcut to start recording again it will offer to interrupt the transcription process, which is not what I want. I want to just keep talking and have it process results as things come in.</li>
<li>I'm also experimenting with <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-streaming-speech-recognition-into-emacs-using-google-chrome-web-speech-api">Google Chrome's web speech API to do continuous speech recognition</a>, which I can get into Emacs using a web socket.</li>
<li>What I've just figured out is how to layer a semi-continuous interface for speech recognition on top of whisper.el. While it's processing in the background, I can press a keyboard shortcut (I'm using numpad 9 to call <code>my-whisper-continue</code>) to stop the previous recording, queue it for processing, and start the next recording. If I use this keyboard shortcut to separate my thoughts, then Whisper has a much easier time making sense of the whole sentence or paragraph, instead of relying on the sliding 30-second context window that many streaming approaches to speech recognition use.</li>
</ol>

<p>
Question: Did you fix the keyboard delay you've got while speech catches what you're saying?
</p>

<p>
Sometimes, when the speech recognition kicks in, my computer gets busy. When my computer gets really busy, it doesn't process my keystrokes in the right order, which is very annoying because then I have to delete the previous word and retype it. I haven't sorted that out yet, but it seems like I probably have to lower the priority on different processes. On the plus side, as I mentioned, if I dictate things instead of typing them, then I don't run into that problem at all.
</p>

<p>
Also, other notes on delays: The continuous speech recognition via Google Chrome shows up fairly quickly, but it's not very precise, and it doesn't have punctuation. As long as I press the <code>my-whisper-continue</code> shortcut after each thought, I can get that text into my Emacs buffer using the nicer transcription from my selected model. There is going to be a bit of a delay for that one because it gets processed at the end of the thought. Also, I need to start thinking in complete sentences instead of just adding one clause after the other as my brain goes on all of these tangents. I think it's pretty promising. There's the continuous speech recognition via Google Chrome if I don't mind the lower accuracy and lack of punctuation, and I can still get the pretty version on the other side.
</p>

<p>
Why talk out loud? I liked the <a href="https://emacsconf.org/2025/talks/bookclub-tapas">Bookclub Tapas</a> presentation that Maddie Sullivan did at EmacsConf 2025. Talking out loud helps me be a lot more verbose about what I'm saying, compared to typing things out or even having to switch to my notes or interrupting my screen with an Org capture buffer. Of course I want to clean that up before putting it into a blog post, but given that my life still sometimes has random interruptions from a kiddo who must have my attention at that very minute, having that kind of record that I can at least try to reread afterwards to reconstruct what I was thinking about sounds like it might be helpful.
</p>

<p>
Still, making sense out loud is hard. I'm not actually used to talking to people that much now. This is probably a good reason for me to experiment with streaming more. Then I get the practice in talking out loud, there are backup recordings, and people can ask questions when things are unclear.
</p>

<p>
Of course, sometimes the text doesn't quite make sense because of the speech recognition errors. I can usually figure it out from the context. I save the audio as well so that I can go back and listen to it again if I really need to.
</p>

<p>
Anyway, here's the code for sending the current recording to Whisper in the background and starting another recording. It assumes a lot about how things are set up. For example, I'm only testing this with a local speaches server instead of whisper.cpp. You might need to look at my other speech-related configuration blog posts and sections in order to make sense of it.
</p>

<details><summary>Code for queuing whisper.el requests to a local server</summary>
<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper&#45;&#45;queue</span> nil)
(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-continue</span> (<span class="org-type">&amp;optional</span> arg)
  <span class="org-doc">"Send what we've got so far for transcription and then continue recording.</span>
<span class="org-doc">Call with \\[</span><span class="org-doc"><span class="org-constant">universal-argument</span></span><span class="org-doc">] to signal that we can stop."</span>
  (<span class="org-keyword">interactive</span> <span class="org-string">"P"</span>)
  (<span class="org-keyword">require</span> <span class="org-highlight-quoted-quote">'</span><span class="org-constant">whisper</span>)
  (<span class="org-keyword">if</span> arg
      (my-whisper-done)
    (<span class="org-keyword">setq</span> whisper&#45;&#45;marker (point-marker) whisper&#45;&#45;point-buffer (current-buffer))
    (<span class="org-keyword">when</span> (process-live-p whisper&#45;&#45;recording-process)
      <span class="org-comment-delimiter">;; </span><span class="org-comment">queue only if the last one is not asking for the same file</span>
      (<span class="org-keyword">unless</span>
          (string=
           (plist-get
            (car
             (last my-whisper&#45;&#45;queue))
            <span class="org-builtin">:file</span>)
           whisper&#45;&#45;temp-file)
        (add-to-list
         <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">my-whisper&#45;&#45;queue</span>
         (list <span class="org-builtin">:file</span> whisper&#45;&#45;temp-file
               <span class="org-builtin">:buffer</span>
               (format <span class="org-string">"*result: %s*"</span> (file-name-base whisper&#45;&#45;temp-file)))
         t))
      <span class="org-comment-delimiter">;; </span><span class="org-comment">Remove the sentinel; handle results ourselves</span>
      (set-process-sentinel whisper&#45;&#45;recording-process
                            (<span class="org-keyword">lambda</span> (process event)
                              (my-whisper-process-queue)))
      (interrupt-process whisper&#45;&#45;recording-process))
    (run-hooks <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-before-transcription-hook</span>)
    (whisper&#45;&#45;setup-mode-line <span class="org-builtin">:show</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">recording</span>)
    (whisper&#45;&#45;record-audio)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-discard</span> ()
 <span class="org-doc">"Ignore the previous recording."</span>
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">when</span> (process-live-p whisper&#45;&#45;recording-process)
    <span class="org-comment-delimiter">;; </span><span class="org-comment">Remove the sentinel; handle results ourselves</span>
    (set-process-sentinel whisper&#45;&#45;recording-process
                          (<span class="org-keyword">lambda</span> (process event)
                            (<span class="org-keyword">when</span> (file-exists-p whisper&#45;&#45;temp-file)
                              (delete-file whisper&#45;&#45;temp-file))
                            (my-whisper-process-queue)))
    (interrupt-process whisper&#45;&#45;recording-process)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-discard-and-continue</span> ()
 <span class="org-doc">"Ignore the previous recording and continue."</span>
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">if</span> (process-live-p whisper&#45;&#45;recording-process)
      (<span class="org-keyword">progn</span>
        <span class="org-comment-delimiter">;; </span><span class="org-comment">Remove the sentinel; handle results ourselves</span>
        (set-process-sentinel whisper&#45;&#45;recording-process
                              (<span class="org-keyword">lambda</span> (process event)
                                (my-whisper-process-queue)
                                (my-whisper-continue)))
        (interrupt-process whisper&#45;&#45;recording-process))
    (my-whisper-continue)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-done</span> ()
  <span class="org-doc">"Queue the current recording for transcription and stop recording."</span>
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">when</span> (process-live-p whisper&#45;&#45;recording-process)
    (add-to-list
     <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">my-whisper&#45;&#45;queue</span>
     (list <span class="org-builtin">:file</span> whisper&#45;&#45;temp-file
           <span class="org-builtin">:buffer</span>
           (format <span class="org-string">"*result: %s*"</span> (file-name-base whisper&#45;&#45;temp-file)))
     t)
    <span class="org-comment-delimiter">;; </span><span class="org-comment">Remove the sentinel; handle results ourselves</span>
    (set-process-sentinel whisper&#45;&#45;recording-process
                          (<span class="org-keyword">lambda</span> (process event)
                            (my-whisper-process-queue)))
    (whisper&#45;&#45;setup-mode-line <span class="org-builtin">:hide</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">recording</span>)
    (interrupt-process whisper&#45;&#45;recording-process)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-process-queue-result</span> ()
  <span class="org-doc">"Process the first part of the queue that already has results."</span>
  (<span class="org-keyword">while</span> (plist-get (car my-whisper&#45;&#45;queue) <span class="org-builtin">:results</span>)
    (<span class="org-keyword">let</span> ((o (<span class="org-keyword">pop</span> my-whisper&#45;&#45;queue)))
      (<span class="org-keyword">unless</span> my-whisper-target-markers
        (<span class="org-keyword">setq</span> whisper&#45;&#45;marker (point-marker)
              whisper&#45;&#45;point-buffer (current-buffer)))
      (<span class="org-keyword">with-current-buffer</span> (plist-get o <span class="org-builtin">:buffer</span>)
        (erase-buffer)
        (insert (plist-get o <span class="org-builtin">:results</span>)))
      <span class="org-comment-delimiter">;; </span><span class="org-comment">Only works with my fork: https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function</span>
      (whisper&#45;&#45;handle-transcription-output nil (plist-get o <span class="org-builtin">:buffer</span>)))))

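(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-process-queue</span> ()
  <span class="org-doc">"Start a curl request for each queued recording that has not been sent yet."</span>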
  (<span class="org-keyword">let</span> (o)
    (<span class="org-keyword">while</span> (<span class="org-keyword">setq</span> o (seq-find (<span class="org-keyword">lambda</span> (o) (<span class="org-keyword">and</span> (plist-get o <span class="org-builtin">:file</span>)
                                              (not (plist-get o <span class="org-builtin">:process</span>))
                                              (not (plist-get o <span class="org-builtin">:results</span>))))
                             my-whisper&#45;&#45;queue))
      (<span class="org-keyword">let*</span> ((headers (list <span class="org-string">"Content-Type: multipart/form-data"</span>))
             (params (list (concat <span class="org-string">"file=@"</span>
                                   (plist-get o <span class="org-builtin">:file</span>))
                           <span class="org-string">"temperature=0.0"</span>
                           <span class="org-string">"temperature_inc=0.2"</span>
                           <span class="org-string">"response_format=json"</span>
                           (concat <span class="org-string">"model="</span> whisper-model)
                           (concat <span class="org-string">"language="</span> whisper-language)))
             (url (format my-whisper-url-format whisper-server-host whisper-server-port))
             (command <span class="org-highlight-quoted-quote">`</span>(<span class="org-string">"curl"</span> <span class="org-string">"-s"</span>
                        ,url
                        ,@(mapcan (<span class="org-keyword">lambda</span> (h) (list <span class="org-string">"-H"</span> h)) headers)
                        ,@(mapcan (<span class="org-keyword">lambda</span> (p) (list <span class="org-string">"-F"</span> p)) params))))
        (<span class="org-keyword">with-current-buffer</span> (get-buffer-create (plist-get o <span class="org-builtin">:buffer</span>))
          (erase-buffer))
        (plist-put
         o <span class="org-builtin">:process</span>
         (make-process
          <span class="org-builtin">:name</span> <span class="org-string">"whisper-curl"</span>
          <span class="org-builtin">:command</span> command
          <span class="org-builtin">:buffer</span> (plist-get o <span class="org-builtin">:buffer</span>)
          <span class="org-builtin">:coding</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">utf-8</span>
          <span class="org-builtin">:sentinel</span>
          (<span class="org-keyword">lambda</span> (process event)
            (<span class="org-keyword">with-current-buffer</span> (process-buffer process)
              (<span class="org-keyword">let</span> ((current my-whisper&#45;&#45;queue-item))
                (<span class="org-keyword">when</span> (<span class="org-keyword">and</span> (get-buffer (plist-get current <span class="org-builtin">:buffer</span>))
                           (string-equal <span class="org-string">"finished\n"</span> event))
                  (<span class="org-keyword">with-current-buffer</span> (plist-get current <span class="org-builtin">:buffer</span>)
                    (goto-char (point-min))
                    (plist-put current <span class="org-builtin">:results</span>
                               (<span class="org-keyword">or</span>
                                (<span class="org-keyword">condition-case</span> nil
                                    (gethash <span class="org-string">"text"</span> (json-parse-buffer))
                                  (<span class="org-warning">error</span> <span class="org-string">""</span>))
                                <span class="org-string">"(error)"</span>))))))
            (my-whisper-process-queue-result))))
        (plist-put o <span class="org-builtin">:command</span> (string-join command <span class="org-string">" "</span>))
        (<span class="org-keyword">with-current-buffer</span> (process-buffer (plist-get o <span class="org-builtin">:process</span>))
          (<span class="org-keyword">setq-local</span> my-whisper&#45;&#45;queue-item o))))))
(<span class="org-keyword">defvar-local</span> <span class="org-variable-name">my-whisper&#45;&#45;queue-item</span> nil)

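(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-reprocess-queue</span> ()
  <span class="org-doc">"Discard in-progress requests and results, then resend every queued recording."</span>
  (<span class="org-keyword">interactive</span>)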
  (<span class="org-keyword">setq</span> whisper&#45;&#45;marker (point-marker) whisper&#45;&#45;point-buffer (current-buffer))
  (mapc (<span class="org-keyword">lambda</span> (o)
          (<span class="org-keyword">when</span> (process-live-p (plist-get o <span class="org-builtin">:process</span>))
            (kill-process (plist-get o <span class="org-builtin">:process</span>)))
          (<span class="org-keyword">when</span> (get-buffer (plist-get o <span class="org-builtin">:buffer</span>))
            (kill-buffer (plist-get o <span class="org-builtin">:buffer</span>)))
          (plist-put o <span class="org-builtin">:process</span> nil)
          (plist-put o <span class="org-builtin">:results</span> nil))
        my-whisper&#45;&#45;queue)
  (my-whisper-process-queue))

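(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-clear-queue</span> ()
  <span class="org-doc">"Kill in-progress requests and their result buffers, then empty the queue."</span>
  (<span class="org-keyword">interactive</span>)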
  (mapc (<span class="org-keyword">lambda</span> (o)
          (<span class="org-keyword">when</span> (process-live-p (plist-get o <span class="org-builtin">:process</span>))
            (kill-process (plist-get o <span class="org-builtin">:process</span>)))
          (<span class="org-keyword">when</span> (get-buffer (plist-get o <span class="org-builtin">:buffer</span>))
            (kill-buffer (plist-get o <span class="org-builtin">:buffer</span>)))
          (plist-put o <span class="org-builtin">:process</span> nil)
          (plist-put o <span class="org-builtin">:results</span> nil))
        my-whisper&#45;&#45;queue)
  (<span class="org-keyword">setq</span> my-whisper&#45;&#45;queue nil))

(keymap-global-set <span class="org-string">"&lt;kp-9&gt;"</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-continue</span>)
(keymap-global-set <span class="org-string">"&lt;kp-8&gt;"</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-discard-and-continue</span>)
(keymap-global-set <span class="org-string">"C-&lt;kp-9&gt;"</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-done</span>)
</code></pre>
</div>

</details>
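
<p>
To show just the ordering idea without needing whisper.el or a server, here's a minimal sketch. The <code>my-demo-</code> names are made up for this example and aren't part of my config or whisper.el. Results can finish in any order, but they only get delivered from the front of the queue, so the text should come out in the order I said it.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>;; Hypothetical names for illustration only.
(require 'seq)

(defvar my-demo&#45;&#45;queue nil
  "List of plists like (:id 1 :results nil), oldest first.")

(defvar my-demo&#45;&#45;output nil
  "Results delivered so far, oldest first.")

(defun my-demo-enqueue (id)
  "Add a pending item with ID to the end of the queue."
  (setq my-demo&#45;&#45;queue
        (append my-demo&#45;&#45;queue (list (list :id id :results nil)))))

(defun my-demo-flush ()
  "Deliver results from the front of the queue.
Stop at the first item that is still pending."
  (while (plist-get (car my-demo&#45;&#45;queue) :results)
    (let ((o (pop my-demo&#45;&#45;queue)))
      (setq my-demo&#45;&#45;output
            (append my-demo&#45;&#45;output (list (plist-get o :results)))))))

(defun my-demo-finish (id text)
  "Record TEXT as the result for ID, then deliver whatever is ready."
  (let ((item (seq-find (lambda (o) (eql id (plist-get o :id)))
                        my-demo&#45;&#45;queue)))
    (when item
      ;; :results is already a key in the plist, so this updates it in place.
      (plist-put item :results text)))
  (my-demo-flush))

(my-demo-enqueue 1)
(my-demo-enqueue 2)
(my-demo-finish 2 "second") ; nothing delivered yet; item 1 is still pending
(my-demo-finish 1 "first")  ; now both are delivered, in order
my-demo&#45;&#45;output             ; =&gt; ("first" "second")
</code></pre>
</div>

<p>
The real code above does the same kind of thing in <code>my-whisper-process-queue-result</code>, except that delivering a result means handing the buffer to whisper.el so that it can insert the text.
</p>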

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-queue-multiple-transcriptions-with-whisper-el-speech-recognition">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/01/queue-multiple-transcriptions-with-whisper-el-speech-recognition/index.org">View org source for this post</a></div><p>You can <a href="https://social.sachachua.com/@sacha/statuses/01KFP6RDJ52ERX664SPP6CN7S9" target="_blank" rel="noopener noreferrer">comment on Mastodon</a> or <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F01%2Fqueue-multiple-transcriptions-with-whisper-el-speech-recognition%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">Emacs and whisper.el: Trying out different speech-to-text backends and models</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-01-20T19:21:46Z</updated>
    <published>2026-01-20T19:21:46Z</published>
    <category term="audio" />
<category term="emacs" />
<category term="speech-recognition" />
		<id>https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/</id>
		<content type="html"><![CDATA[<p>
I was curious about <a href="https://parakeettdt.com/">parakeet</a> because I heard that it was faster than Whisper on the <a href="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard">HuggingFace leaderboard</a>. When I installed it and got it running on my laptop (CPU only, no GPU), it seemed like my results were a little faster than whisper.cpp with the large model, but much slower than whisper.cpp with the base model. The base model is decent for quick dictation, so I got curious about other backends and other models.
</p>

<p>
In order to try <a href="https://github.com/natrys/whisper.el">natrys/whisper.el</a> with other backends, I needed to work around how whisper.el validates the model names and sends requests to the servers. Here's the quick and dirty code for doing so, in case you want to try it out for yourself.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-url-format</span> <span class="org-string">"http://%s:%d/transcribe"</span>)
<span class="org-comment-delimiter">;;;</span><span class="org-comment">###</span><span class="org-comment"><span class="org-warning">autoload</span></span>
(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper&#45;&#45;transcribe-via-local-server</span> ()
  <span class="org-doc">"Transcribe audio using the local whisper server."</span>
  (message <span class="org-string">"[-] Transcribing via local server"</span>)
  (whisper&#45;&#45;setup-mode-line <span class="org-builtin">:show</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">transcribing</span>)
  (whisper&#45;&#45;ensure-server)
  (<span class="org-keyword">setq</span> whisper&#45;&#45;transcribing-process
        (whisper&#45;&#45;process-curl-request
         (format my-whisper-url-format whisper-server-host whisper-server-port)
         (list <span class="org-string">"Content-Type: multipart/form-data"</span>)
         (list (concat <span class="org-string">"file=@"</span> whisper&#45;&#45;temp-file)
               <span class="org-string">"temperature=0.0"</span>
               <span class="org-string">"temperature_inc=0.2"</span>
               <span class="org-string">"response_format=json"</span>
               (concat <span class="org-string">"model="</span> whisper-model)
               (concat <span class="org-string">"language="</span> whisper-language)))))
<span class="org-comment-delimiter">;;;</span><span class="org-comment">###</span><span class="org-comment"><span class="org-warning">autoload</span></span>
(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper&#45;&#45;check-model-consistency</span> () t)
</code></pre>
</div>


<p>
I moved my server code to <a href="https://codeberg.org/sachac/speech-input/src/branch/main/speech-input-speaches.el">speech-input-speaches.el</a>.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper</span>
  (advice-add <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper&#45;&#45;transcribe-via-local-server</span> <span class="org-builtin">:override</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper&#45;&#45;transcribe-via-local-server</span>)
  (advice-add <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper&#45;&#45;check-model-consistency</span> <span class="org-builtin">:override</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper&#45;&#45;check-model-consistency</span>)
  (advice-add <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper&#45;&#45;ensure-server</span> <span class="org-builtin">:override</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">speech-input-speaches-server-start</span>)
  )
</code></pre>
</div>


<p>
Then I have this function for trying things out.
</p>


<div class="org-src-container">
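<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-test-whisper-api</span> (url <span class="org-type">&amp;optional</span> args)
  <span class="org-doc">"Send the current recording to URL with curl and show the response.</span>
<span class="org-doc">ARGS is a list of extra arguments for curl."</span>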
  (<span class="org-keyword">with-temp-buffer</span>
    (apply <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">call-process</span> <span class="org-string">"curl"</span> nil t nil <span class="org-string">"-s"</span>
           url
         (append (mapcan
                  (<span class="org-keyword">lambda</span> (h) (list <span class="org-string">"-H"</span> h))
                  (list <span class="org-string">"Content-Type: multipart/form-data"</span>))
                 (mapcan
                  (<span class="org-keyword">lambda</span> (h) (list <span class="org-string">"-F"</span> h))
                  (list (concat <span class="org-string">"file=@"</span> whisper&#45;&#45;temp-file)
                        <span class="org-string">"temperature=0.0"</span>
                        <span class="org-string">"temperature_inc=0.2"</span>
                        <span class="org-string">"response_format=verbose_json"</span>
                        (concat <span class="org-string">"language="</span> whisper-language)))
                 args))
    (message <span class="org-string">"%s %s"</span> (buffer-string) url)))
</code></pre>
</div>


<p>
Here's the audio file. It is around 10 seconds long. I run the benchmark 3 times and report the average time.
</p>

<p>
</p><div class="audio"><audio controls="1" preload="metadata" src="https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/2026-01-19-14-17-53.wav?2026-04-05" type="audio/x-wav"><a href="https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/2026-01-19-14-17-53.wav">Download the audio</a></audio></div>
<a href="https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/2026-01-19-14-17-53.wav">Download</a>
<p></p>

<details><summary>Code for running the benchmarks</summary>
<div class="org-src-container">
<pre class="src src-emacs-lisp" id="orgb84bc16"><code>(<span class="org-keyword">let</span> ((times 3))
(mapcar
 (<span class="org-keyword">lambda</span> (group)
   (<span class="org-keyword">let</span> ((whisper&#45;&#45;temp-file <span class="org-string">"/home/sacha/recordings/whisper/2026-01-19-14-17-53.wav"</span>))
     <span class="org-comment-delimiter">;; </span><span class="org-comment">warm up the model</span>
     (eval (cadr group))
     (list
      (format <span class="org-string">"%.3f"</span>
              (/ (car
                  (benchmark-call (<span class="org-keyword">lambda</span> () (eval (cadr group))) times))
                 times))
      (car group))))
 <span class="org-highlight-quoted-quote">'</span>(
   (<span class="org-string">"parakeet"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 5092)))
   (<span class="org-string">"whisper.cpp base-q4_0"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/inference"</span> whisper-server-host 8642)))
   (<span class="org-string">"speaches whisper-base"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 8001)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-base"</span>)))
   (<span class="org-string">"speaches whisper-base.en"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 8001)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-base.en"</span>)))
   (<span class="org-string">"speaches whisper-small"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 8001)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-small"</span>)))
   (<span class="org-string">"speaches whisper-small.en"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 8001)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-small.en"</span>)))
   (<span class="org-string">"speaches lorneluo/whisper-small-ct2-int8"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> whisper-server-host 8001)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=lorneluo/whisper-small-ct2-int8"</span>)))
   <span class="org-comment-delimiter">;; </span><span class="org-comment">needed export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1</span>
   (<span class="org-string">"whisperx-server Systran/faster-whisper-small"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/transcribe"</span> whisper-server-host 8002))))))
</code></pre>
</div>

</details>

<table>


<colgroup>
<col class="org-right">

<col class="org-left">
</colgroup>
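<thead>
<tr>
<th scope="col" class="org-right">Average time (s)</th>
<th scope="col" class="org-left">Backend and model</th>
</tr>
</thead>
<tbody>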
<tr>
<td class="org-right">3.694</td>
<td class="org-left">parakeet</td>
</tr>

<tr>
<td class="org-right">2.484</td>
<td class="org-left">whisper.cpp base-q4_0</td>
</tr>

<tr>
<td class="org-right">1.547</td>
<td class="org-left">speaches whisper-base</td>
</tr>

<tr>
<td class="org-right">1.425</td>
<td class="org-left">speaches whisper-base.en</td>
</tr>

<tr>
<td class="org-right">4.076</td>
<td class="org-left">speaches whisper-small</td>
</tr>

<tr>
<td class="org-right">3.735</td>
<td class="org-left">speaches whisper-small.en</td>
</tr>

<tr>
<td class="org-right">2.870</td>
<td class="org-left">speaches lorneluo/whisper-small-ct2-int8</td>
</tr>

<tr>
<td class="org-right">4.537</td>
<td class="org-left">whisperx-server Systran/faster-whisper-small</td>
</tr>
</tbody>
</table>

<p>
All the results above were CPU-only, no GPU acceleration.
</p>

<p>
I tried it with:
</p>
<ul class="org-ul">
<li><a href="https://parakeettdt.com/">parakeet</a></li>
<li><a href="https://huggingface.co/ggerganov/whisper.cpp">whisper.cpp</a> (as <a href="https://github.com/natrys/whisper.el">whisper.el</a> sets it up)</li>
<li><a href="https://github.com/speaches-ai/speaches">speaches</a>, which is a front-end for <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a>, and</li>
<li><a href="https://codeberg.org/aryak/whisperx-server">whisperx-server</a>, which is a front-end for <a href="https://github.com/m-bain/whisperX">whisperx</a></li>
</ul>

<div class="update" id="org9c6e32c">
<p>
<span class="timestamp-wrapper"><time class="timestamp" datetime="2026-04-05">[2026-04-05 Sun]</time></span> Update: After a lot of rebooting, I figured out how to get the Nvidia P1000 GPU on my Lenovo P52 to work for up to the Whisper medium models. Large models don't fit in the 4GB of VRAM it has. Here are the benchmarking results:
</p>

<details><summary>Benchmarking code</summary>
<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">let</span> ((times 3))
(mapcar
 (<span class="org-keyword">lambda</span> (group)
   (<span class="org-keyword">let</span> ((whisper&#45;&#45;temp-file <span class="org-string">"/tmp/2026-01-19-14-17-53.wav"</span>))
     <span class="org-comment-delimiter">;; </span><span class="org-comment">warm up the model</span>
     (eval (cadr group))
     (list
      (format <span class="org-string">"%.3f"</span>
              (/ (car
                  (benchmark-call (<span class="org-keyword">lambda</span> () (eval (cadr group))) times))
                 times))
      (car group))
     ))
 <span class="org-highlight-quoted-quote">'</span>(
   (<span class="org-string">"speaches whisper-tiny"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> <span class="org-string">"localhost"</span> 8000)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-tiny.en"</span>)))
   (<span class="org-string">"speaches whisper-base"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> <span class="org-string">"localhost"</span> 8000)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-base.en"</span>)))
   (<span class="org-string">"speaches whisper-small"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> <span class="org-string">"localhost"</span> 8000)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-small.en"</span>)))
   (<span class="org-string">"speaches whisper-medium"</span>
    (my-test-whisper-api
     (format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span> <span class="org-string">"localhost"</span> 8000)
     (list <span class="org-string">"-F"</span> <span class="org-string">"model=Systran/faster-whisper-medium.en"</span>))))))
</code></pre>
</div>

</details>

<table>


<colgroup>
<col class="org-right">

<col class="org-left">
</colgroup>
<thead>
<tr>
<th scope="col" class="org-right">Average time (seconds)</th>
<th scope="col" class="org-left">Backend and model</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-right">0.596</td>
<td class="org-left">speaches whisper-tiny</td>
</tr>

<tr>
<td class="org-right">0.940</td>
<td class="org-left">speaches whisper-base</td>
</tr>

<tr>
<td class="org-right">2.909</td>
<td class="org-left">speaches whisper-small</td>
</tr>

<tr>
<td class="org-right">8.740</td>
<td class="org-left">speaches whisper-medium</td>
</tr>
</tbody>
</table>

<p>
That probably means I can upgrade to using the small model on a regular basis. For large models, I can use the command-line tool, since the overhead of loading the model is probably small compared to the processing time. I could also run two Speaches servers (one on CPU and one on GPU), but I don't think I'll need that level yet.
</p>

</div>

<p>
I like how speaches lets me switch models on the fly, so maybe I can use small.en generally and switch to a different model when I want to try dictating in French. Here's how I've set it up to use the server.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">setq</span> whisper-server-port 8000
      whisper-model <span class="org-string">"Systran/faster-whisper-small.en"</span>
      my-whisper-url-format <span class="org-string">"http://%s:%d/v1/audio/transcriptions"</span>)
</code></pre>
</div>
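
<p>
Here's a minimal sketch of why switching models is cheap: the model name travels with each request as a form field, so the same URL can serve any model. The helper name <code>my-sketch-whisper-curl-command</code> and its defaults are made up for illustration, and it assumes the server accepts OpenAI-style <code>file</code> and <code>model</code> fields sent with <code>curl -F</code>. It only builds the command; it doesn't run it.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>;; Hypothetical helper, not part of whisper.el or speaches.
;; Assumes an OpenAI-style endpoint that takes "file" and "model" form fields.
(defun my-sketch-whisper-curl-command (file &amp;optional model host port)
  "Return a curl command, as a list of strings, to transcribe FILE.
MODEL, HOST, and PORT default to the values used in this post."
  (list "curl" "-s"
        (format "http://%s:%d/v1/audio/transcriptions"
                (or host "localhost") (or port 8000))
        "-F" (concat "file=@" (expand-file-name file))
        "-F" (concat "model=" (or model "Systran/faster-whisper-small.en"))))

;; Same server, different model for one request:
(my-sketch-whisper-curl-command "/tmp/note.wav" "Systran/faster-whisper-small")
</code></pre>
</div>

<p>
To actually send the request, something like <code>(apply #'call-process (car cmd) nil t nil (cdr cmd))</code> would insert the server's response into the current buffer.
</p>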


<p>
Benchmark notes: I have a Lenovo P52 laptop (released 2018) with an Intel Core i7-8850H (6 cores, 12 threads; 2.6 GHz base / 4.3 GHz turbo) with 64GB RAM and an SSD. It has a soldered-on Nvidia P1000 with 4GB of VRAM.
</p>

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#writing-and-editing-speech-recognition-emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/01/emacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models/index.org">View Org source for this post</a></div><p>You can <a href="https://social.sachachua.com/@sacha/statuses/01KFERY839GS115F3J8Z6BNT1K" target="_blank" rel="noopener noreferrer">comment on Mastodon</a> or <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F01%2Femacs-and-whisper-el-trying-out-different-speech-to-text-backends-and-models%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2026-01-13T18:38:54Z</updated>
    <published>2026-01-04T01:23:57Z</published>
    <category term="emacs" />
<category term="audio" />
<category term="speech-recognition" />
<category term="speech" />
		<id>https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/</id>
		<content type="html"><![CDATA[<div class="update" id="org09f9e72">
<ul class="org-ul">
<li><span class="timestamp-wrapper"><time class="timestamp" datetime="2026-01-30">[2026-01-30 Fri]</time></span>: Major change: I switched to <a href="https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function">my fork</a> of natrys/whisper.el so that I can specify functions that change the window configuration etc.</li>
<li><span class="timestamp-wrapper"><time class="timestamp" datetime="2026-01-13">[2026-01-13 Tue]</time></span>: Change main function to <code>my-whisper-run</code>, use seq-reduce to go through the functions.</li>
<li><span class="timestamp-wrapper"><time class="timestamp" datetime="2026-01-09">[2026-01-09 Fri]</time></span>: Added code for automatically capturing screenshots, saving text, working with a list of functions.</li>
<li><span class="timestamp-wrapper"><time class="timestamp" datetime="2026-01-08">[2026-01-08 Thu]</time></span>: Added demo, fixed some bugs.</li>
<li><span class="timestamp-wrapper"><time class="timestamp" datetime="2026-01-04">[2026-01-04 Sun]</time></span>: Added note about difference from MELPA package, fixed :vc</li>
</ul>

</div>

<p>
I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. <a href="https://github.com/openai/whisper">OpenAI Whisper</a> is reasonably good at recognizing my speech now and <a href="https://github.com/natrys/whisper.el">whisper.el</a> gives me a convenient way to call <a href="https://github.com/ggml-org/whisper.cpp">whisper.cpp</a> from Emacs with a single keybinding. (Note: This is not the same <a href="https://melpa.org/#/whisper">whisper</a> package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.
</p>

<p>
I've bound <code>&lt;f9&gt;</code> to the command <code>whisper-run</code>. I press <code>&lt;f9&gt;</code> to start recording, talk, and then press <code>&lt;f9&gt;</code> to stop recording. By default, it inserts the text into the buffer at the current point. I've set <code>whisper-return-cursor-to-start</code> to <code>nil</code> so that I can keep going.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">use-package</span> whisper
  <span class="org-builtin">:vc</span> (<span class="org-builtin">:url</span> <span class="org-string">"https://github.com/natrys/whisper.el"</span>)
  <span class="org-builtin">:load-path</span> <span class="org-string">"~/vendor/whisper.el"</span>
  <span class="org-builtin">:config</span>
  (<span class="org-keyword">setq</span> whisper&#45;&#45;mode-line-recording-indicator <span class="org-string">"&#9210;"</span>)
  (<span class="org-keyword">setq</span> whisper-quantize <span class="org-string">"q4_0"</span>)
  (<span class="org-keyword">setq</span> whisper-install-directory <span class="org-string">"~/vendor"</span>)
  (<span class="org-keyword">setq</span> whisper&#45;&#45;install-path (concat
     (expand-file-name (file-name-as-directory whisper-install-directory))
     <span class="org-string">"whisper.cpp/"</span>))
  <span class="org-comment-delimiter">;; </span><span class="org-comment">Get it running with whisper-server-mode set to nil first before you switch to 'local.</span>
  <span class="org-comment-delimiter">;; </span><span class="org-comment">If you change models, evaluate:</span>
  <span class="org-comment-delimiter">;; </span><span class="org-comment">(whisper-install-whispercpp (whisper&#45;&#45;check-install-and-run nil "whisper-start"))</span>
  (<span class="org-keyword">setq</span> whisper-server-mode <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">local</span>)
  (<span class="org-keyword">setq</span> whisper-model <span class="org-string">"base"</span>)
  (<span class="org-keyword">setq</span> whisper-return-cursor-to-start nil)
  <span class="org-comment-delimiter">;</span><span class="org-comment">(setq whisper&#45;&#45;ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")</span>
  (<span class="org-keyword">setq</span> whisper&#45;&#45;ffmpeg-input-device <span class="org-string">"VirtualMicSink.monitor"</span>)
  (<span class="org-keyword">setq</span> whisper-language <span class="org-string">"en"</span>)
  (<span class="org-keyword">setq</span> whisper-recording-timeout 3000)
  (<span class="org-keyword">setq</span> whisper-before-transcription-hook nil)
  (<span class="org-keyword">setq</span> whisper-use-threads (1- (num-processors)))
  (<span class="org-keyword">setq</span> whisper-transcription-buffer-name-function <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper&#45;&#45;simple-transcription-buffer-name</span>)
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-after-transcription-hook</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">my-subed-fix-common-errors-from-start</span> -100)
  <span class="org-builtin">:bind</span>
  ((<span class="org-string">"&lt;f9&gt;"</span> . whisper-run)
   (<span class="org-string">"C-&lt;f9&gt;"</span> . my-whisper-run)
   (<span class="org-string">"S-&lt;f9&gt;"</span> . my-whisper-replay)
   (<span class="org-string">"M-&lt;f9&gt;"</span> . my-whisper-toggle-language)))
</code></pre>
</div>


<p>
Let's see if we can process "Computer remind me to&hellip;":
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-org-reminder-template</span> <span class="org-string">"t"</span>)

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-org-process-reminder</span> ()
  (<span class="org-keyword">let</span> ((text (buffer-string))
        reminder)
    (<span class="org-keyword">when</span> (string-match <span class="org-string">"computer[,.]? reminds? me to </span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">(</span></span><span class="org-string">.+</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">)</span></span><span class="org-string">"</span> text)
      (<span class="org-keyword">setq</span> reminder (match-string 1 text))
      (<span class="org-keyword">save-window-excursion</span>
        (<span class="org-keyword">with-current-buffer</span> (<span class="org-keyword">if</span> (markerp whisper&#45;&#45;marker) (marker-buffer whisper&#45;&#45;marker) (current-buffer))
          (<span class="org-keyword">when</span> (markerp whisper&#45;&#45;marker) (goto-char whisper&#45;&#45;marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper</span>
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-after-transcription-hook</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">my-whisper-org-process-reminder</span> 50))
</code></pre>
</div>
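
<p>
To see what that pattern captures, here's a small sketch that can be evaluated without recording anything. The function name <code>my-sketch-extract-reminder</code> is made up for this example. It binds <code>case-fold-search</code> explicitly so that a capitalized "Computer" matches regardless of the current buffer's setting.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>;; Hypothetical helper for trying out the pattern outside of whisper.el.
(defun my-sketch-extract-reminder (text)
  "Return the reminder in TEXT, or nil if TEXT isn't a reminder command."
  (let ((case-fold-search t))           ; match "Computer" as well as "computer"
    (when (string-match "computer[,.]? reminds? me to \\(.+\\)" text)
      (match-string 1 text))))

(my-sketch-extract-reminder "Computer, remind me to water the plants")
;; =&gt; "water the plants"
(my-sketch-extract-reminder "Remind me to water the plants")
;; =&gt; nil
</code></pre>
</div>

<p>
The capture group runs to the end of the line, so any trailing punctuation from the transcription (such as a final period) stays in the reminder.
</p>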


<p>
Disk space is inexpensive and backups are great, so let's save each file using the timestamp.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-dir</span> <span class="org-string">"~/recordings/whisper/"</span>)
(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-set-temp-filename</span> ()
  (<span class="org-keyword">setq</span> whisper&#45;&#45;temp-file (expand-file-name
                            (format-time-string <span class="org-string">"%Y-%m-%d-%H-%M-%S.wav"</span>)
                            my-whisper-dir)))

(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper</span>
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-before-transcription-hook</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-set-temp-filename</span>))
</code></pre>
</div>
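
<p>
As a quick illustration of the naming scheme, this sketch shows the file names one recording would get, including the <code>.txt</code> and <code>.png</code> siblings that later functions in this post derive with <code>file-name-sans-extension</code>. The helper <code>my-sketch-recording-files</code> is hypothetical, and the example passes a fixed UTC time so the result is predictable.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>;; Hypothetical helper: shows the file names that one recording would get.
(defun my-sketch-recording-files (dir &amp;optional time zone)
  "Return a plist of the file names for a recording made at TIME in DIR.
TIME and ZONE are passed to `format-time-string'; TIME defaults to now."
  (let ((wav (expand-file-name
              (format-time-string "%Y-%m-%d-%H-%M-%S.wav" time zone)
              dir)))
    (list :wav wav
          :txt (concat (file-name-sans-extension wav) ".txt")
          :png (concat (file-name-sans-extension wav) ".png"))))

;; A fixed UTC time, so the result is predictable:
(my-sketch-recording-files
 "/tmp/whisper/"
 (encode-time '(53 17 14 19 1 2026 nil nil t))
 t)
;; =&gt; (:wav "/tmp/whisper/2026-01-19-14-17-53.wav"
;;     :txt "/tmp/whisper/2026-01-19-14-17-53.txt"
;;     :png "/tmp/whisper/2026-01-19-14-17-53.png")
</code></pre>
</div>

<p>
Because the fields go from year down to second, sorting the file names alphabetically also sorts them chronologically. The directory presumably needs to exist before anything can be written to it; <code>(make-directory dir t)</code> creates it if needed.
</p>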


<p>
The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-replay</span> (<span class="org-type">&amp;optional</span> file)
  <span class="org-doc">"Replay the last temporary recording."</span>
  (<span class="org-keyword">interactive</span> (list
                (<span class="org-keyword">when</span> current-prefix-arg
                  (read-file-name <span class="org-string">"File: "</span> my-whisper-dir))))
  (<span class="org-keyword">setq</span> whisper&#45;&#45;temp-file (<span class="org-keyword">or</span> file whisper&#45;&#45;temp-file))
  (mpv-play whisper&#45;&#45;temp-file))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-insert-retry</span> (<span class="org-type">&amp;optional</span> file)
  <span class="org-doc">"Transcribe FILE (or the last recording) again and insert the result at point."</span>
  (<span class="org-keyword">interactive</span> (list
                (<span class="org-keyword">when</span> current-prefix-arg
                  (read-file-name <span class="org-string">"File: "</span> my-whisper-dir))))
  (whisper&#45;&#45;cleanup-transcription)
  (<span class="org-keyword">setq</span> whisper&#45;&#45;marker (point-marker)
        whisper&#45;&#45;temp-file (<span class="org-keyword">or</span> file whisper&#45;&#45;temp-file))
  (whisper&#45;&#45;transcribe-audio))
</code></pre>
</div>


<p>
It can also understand French.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-toggle-language</span> ()
  <span class="org-doc">"Set the language explicitly, since sometimes auto doesn't figure out the right one."</span>
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">setq</span> whisper-language (<span class="org-keyword">if</span> (string= whisper-language <span class="org-string">"en"</span>) <span class="org-string">"fr"</span> <span class="org-string">"en"</span>))
  <span class="org-comment-delimiter">;; </span><span class="org-comment">If using a server, stop it so it can be restarted with the new language</span>
  (<span class="org-keyword">when</span> (process-live-p whisper&#45;&#45;server-process) (kill-process whisper&#45;&#45;server-process))
  (message <span class="org-string">"%s"</span> whisper-language))
</code></pre>
</div>


<p>
I could use this with <code>org-capture</code>, but that takes a lot of keystrokes: <code>C-c r</code> (my shortcut for <code>org-capture</code>), at least one key to pick the template, <code>&lt;f9&gt;</code> to start recording, <code>&lt;f9&gt;</code> to stop recording, and <code>C-c C-c</code> to save it. I'd rather capture notes to my currently clocked-in task without an Org capture buffer interrupting my display.
</p>

<p>
To clock in, I can use <code>C-c C-x i</code> or my <code>!</code> <a href="https://sachachua.com/dotemacs#org-mode-keyboard-shortcuts-other-speed-commands">speed command</a>. Bonus: the modeline displays the current task to keep me on track, and I can use <code>org-clock-goto</code> (which I've bound to <code>C-c j</code>) to jump to it.
</p>

<p>
Then, when I'm looking at something else and I want to record a note, I can press <code>&lt;f9&gt;</code> to start the recording, and then <code>C-&lt;f9&gt;</code> to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-reset</span> (text)
  (<span class="org-keyword">setq</span> my-whisper-skip-annotation nil)
  (remove-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-insert-text-at-point</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-org-save-to-clocked-task</span>)
  text)
</code></pre>
</div>



<div class="org-src-container">
<pre class="src src-emacs-lisp" id="org513fed4"><code><span class="org-comment-delimiter">;; </span><span class="org-comment">Only works with my tweaks to whisper.el</span>
<span class="org-comment-delimiter">;; </span><span class="org-comment">https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function</span>
(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper</span>
  (<span class="org-keyword">setq</span> whisper-insert-text-at-point
        <span class="org-highlight-quoted-quote">'</span>(my-whisper-handle-commands
          my-whisper-save-text
          my-whisper-save-to-file
          my-whisper-maybe-expand-snippet
          my-whisper-maybe-type
          my-whisper-maybe-type-with-hints
          my-whisper-insert
          my-whisper-reset)))
</code></pre>
</div>
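
<p>
The functions in that list share a convention: each one takes the recognized text and returns either text for the next function or <code>nil</code> once the text has been handled. Here's a minimal sketch of how such a list could be chained with <code>seq-reduce</code>. This is an assumption about the general shape, not the fork's actual code, and every <code>my-sketch-</code> name is made up for the example.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(require 'seq)
(require 'subr-x)                       ; for `string-trim'

;; Hypothetical stand-in for whatever the fork does with the list.
(defun my-sketch-run-text-pipeline (functions text)
  "Call FUNCTIONS in order, passing each one the previous result.
Start with TEXT.  A function returns nil to signal that it has
already handled the text."
  (seq-reduce (lambda (result fn) (funcall fn result))
              functions
              text))

;; Made-up handlers that follow the same convention as the ones in this post.
(defun my-sketch-handle-command (text)
  "Swallow TEXT if it starts with \"computer\"; otherwise pass it on."
  (unless (and text (string-prefix-p "computer" text t))
    text))

(defun my-sketch-tidy (text)
  "Trim whitespace from TEXT, doing nothing if TEXT is nil."
  (when text (string-trim text)))

(my-sketch-run-text-pipeline
 '(my-sketch-handle-command my-sketch-tidy) "  hello there \n")
;; =&gt; "hello there"
(my-sketch-run-text-pipeline
 '(my-sketch-handle-command my-sketch-tidy) "Computer, stop")
;; =&gt; nil
</code></pre>
</div>

<p>
Every function still gets called after one returns <code>nil</code>, which is why each handler in this post starts with <code>(when text ...)</code>. It also means a cleanup step at the end of the list, like <code>my-whisper-reset</code>, always runs.
</p>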



<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-last-annotation</span> nil <span class="org-doc">"Last annotation so we can skip duplicates."</span>)
(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-skip-annotation</span> nil)
(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-target-markers</span> nil <span class="org-doc">"List of markers to send text to."</span>)

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-insert</span> (text)
  (<span class="org-keyword">let</span> ((markers
         (<span class="org-keyword">cond</span>
          ((null my-whisper-target-markers)
           (list whisper&#45;&#45;marker)) <span class="org-comment-delimiter">; </span><span class="org-comment">current point where whisper was started</span>
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (<span class="org-keyword">when</span> text
      (mapcar (<span class="org-keyword">lambda</span> (marker)
                (<span class="org-keyword">with-current-buffer</span> (marker-buffer marker)
                  (<span class="org-keyword">save-restriction</span>
                    (widen)
                    (<span class="org-keyword">when</span> (markerp marker) (goto-char marker))
                    (<span class="org-keyword">when</span> (<span class="org-keyword">and</span> (derived-mode-p <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">org-mode</span>) (org-at-drawer-p))
                      (insert <span class="org-string">"\n"</span>))
                    (whisper&#45;&#45;insert-text
                     (concat
                      (<span class="org-keyword">if</span> (looking-back <span class="org-string">"[ \t\n]</span><span class="org-string"><span class="org-regexp-grouping-backslash">\\</span></span><span class="org-string"><span class="org-regexp-grouping-construct">|</span></span><span class="org-string">^"</span>)
                          <span class="org-string">""</span>
                        <span class="org-string">" "</span>)
                      (string-trim text)))
                    <span class="org-comment-delimiter">;; </span><span class="org-comment">Move the marker forward here</span>
                    (move-marker marker (point)))))
              markers)
      (<span class="org-keyword">when</span> my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-maybe-type</span> (text)
  (<span class="org-keyword">when</span> text
    (<span class="org-keyword">if</span> (frame-focus-state)
        text
      (make-process <span class="org-builtin">:name</span> <span class="org-string">"xdotool"</span> <span class="org-builtin">:command</span>
                    (list <span class="org-string">"xdotool"</span> <span class="org-string">"type"</span>
                          text))
      nil)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-clear-markers</span> ()
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">setq</span> my-whisper-target-markers nil))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-use-current-point</span> (<span class="org-type">&amp;optional</span> add)
  (<span class="org-keyword">interactive</span> (list current-prefix-arg))
  (<span class="org-keyword">if</span> add
      (<span class="org-keyword">push</span> (point-marker) my-whisper-target-markers)
    (<span class="org-keyword">setq</span> my-whisper-target-markers (list (point-marker)))))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-run-at-point</span> (<span class="org-type">&amp;optional</span> add)
  (<span class="org-keyword">interactive</span> (list current-prefix-arg))
  (my-whisper-clear-markers)
  (whisper-run))

(keymap-global-set <span class="org-string">"&lt;f9&gt;"</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-run-at-point</span>)
(keymap-global-set <span class="org-string">"&lt;kp-1&gt;"</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">whisper-run</span>)

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-jump-to-marker</span> ()
  (<span class="org-keyword">interactive</span>)
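  <span class="org-comment-delimiter">;; </span><span class="org-comment">Show the marker's buffer; `with-current-buffer' alone would not move the visible point</span>
  (pop-to-buffer (marker-buffer (car my-whisper-target-markers)))
  (goto-char (car my-whisper-target-markers)))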

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-use-currently-clocked-task</span> (<span class="org-type">&amp;optional</span> add)
  (<span class="org-keyword">interactive</span> (list current-prefix-arg))
  (<span class="org-keyword">save-window-excursion</span>
    (<span class="org-keyword">save-restriction</span>
      (<span class="org-keyword">save-excursion</span>
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (<span class="org-keyword">if</span> add
            (<span class="org-keyword">push</span> (point-marker) my-whisper-target-markers)
          (<span class="org-keyword">setq</span> my-whisper-target-markers (list (point-marker))))))))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-run</span> (<span class="org-type">&amp;optional</span> skip-annotation)
  (<span class="org-keyword">interactive</span> (list current-prefix-arg))
  (<span class="org-keyword">require</span> <span class="org-highlight-quoted-quote">'</span><span class="org-constant">whisper</span>)
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">whisper-insert-text-at-point</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-org-save-to-clocked-task</span> -10)
  (whisper-run)
  (<span class="org-keyword">when</span> skip-annotation
    (<span class="org-keyword">setq</span> my-whisper-skip-annotation t)))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-save-text</span> (text)
  <span class="org-doc">"Save TEXT beside `</span><span class="org-doc"><span class="org-constant">whisper&#45;&#45;temp-file</span></span><span class="org-doc">'."</span>
  (<span class="org-keyword">when</span> text
    (<span class="org-keyword">let</span> ((link (org-store-link nil)))
      (<span class="org-keyword">with-temp-file</span> (concat (file-name-sans-extension whisper&#45;&#45;temp-file) <span class="org-string">".txt"</span>)
        (<span class="org-keyword">when</span> link
          (insert link <span class="org-string">"\n"</span>))
        (insert text)))
    text))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-org-save-to-clocked-task</span> (text)
  (<span class="org-keyword">when</span> text
    (<span class="org-keyword">save-window-excursion</span>
      (<span class="org-keyword">with-current-buffer</span> (<span class="org-keyword">if</span> (markerp whisper&#45;&#45;marker) (marker-buffer whisper&#45;&#45;marker) (current-buffer))
        (<span class="org-keyword">when</span> (markerp whisper&#45;&#45;marker) (goto-char whisper&#45;&#45;marker))
        <span class="org-comment-delimiter">;; </span><span class="org-comment">Take a screenshot maybe</span>
        (<span class="org-keyword">let*</span> ((link (<span class="org-keyword">and</span> (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (<span class="org-keyword">and</span> (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (<span class="org-keyword">when</span> (<span class="org-keyword">or</span>
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) <span class="org-comment-delimiter">; </span><span class="org-comment">not in focus, take a screenshot</span>
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper&#45;&#45;temp-file) <span class="org-string">".png"</span>)))))
          (<span class="org-keyword">if</span> (org-clocking-p)
              (<span class="org-keyword">save-window-excursion</span>
                (<span class="org-keyword">save-restriction</span>
                  (<span class="org-keyword">save-excursion</span>
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (<span class="org-keyword">unless</span> (bolp)
                      (insert <span class="org-string">"\n"</span>))
                    (insert <span class="org-string">"\n"</span>)
                    (<span class="org-keyword">if</span> (<span class="org-keyword">and</span> link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (<span class="org-keyword">if</span> screenshot-filename
                             (concat <span class="org-string">"("</span> (org-link-make-string
                                          (concat <span class="org-string">"file:"</span> screenshot-filename)
                                          <span class="org-string">"screenshot"</span>) <span class="org-string"><span class="org-warning">") "</span></span><span class="org-warning">)</span>
                           <span class="org-string">""</span>)
                         link
                         <span class="org-string">"\n"</span>)
                      (<span class="org-keyword">when</span> screenshot-filename
                        (insert (org-link-make-string
                                 (concat <span class="org-string">"file:"</span> screenshot-filename)
                                 <span class="org-string">"screenshot"</span>)
                                <span class="org-string">"\n"</span>)))
                    (<span class="org-keyword">when</span> region
                      (insert <span class="org-string">"#+begin_example\n"</span> region <span class="org-string">"\n#+end_example\n"</span>))
                    (insert text <span class="org-string">"\n"</span>)
                    (<span class="org-keyword">setq</span> my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (<span class="org-keyword">lambda</span> (text) (message <span class="org-string">"Added clock note: %s"</span> text)) text))
            <span class="org-comment-delimiter">;; </span><span class="org-comment">No clocked task, prompt for a place to capture it</span>
            (kill-new text)
            (<span class="org-keyword">setq</span> org-capture-initial text)
            (call-interactively <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">org-capture</span>)
            <span class="org-comment-delimiter">;; </span><span class="org-comment">Delay the window configuration</span>
            (<span class="org-keyword">let</span> ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (<span class="org-keyword">lambda</span> (text config)
                             (set-window-configuration config)
                             (message <span class="org-string">"Copied: %s"</span> text))
                           text config))))))))

(<span class="org-keyword">with-eval-after-load</span> <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">org</span>
  (add-hook <span class="org-highlight-quoted-quote">'</span><span class="org-highlight-quoted-symbol">org-clock-in-hook</span> <span class="org-highlight-quoted-quote">#'</span><span class="org-highlight-quoted-symbol">my-whisper-org-clear-saved-annotation</span>))

(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-org-clear-saved-annotation</span> ()
  (<span class="org-keyword">setq</span> my-whisper-last-annotation nil))
</code></pre>
</div>
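
<p>
The explicit <code>move-marker</code> call in <code>my-whisper-insert</code> matters because, by default, a marker stays where it is when text is inserted exactly at its position. Here's a small self-contained sketch of that behaviour in a temporary buffer; the function name <code>my-sketch-dictate-at-marker</code> is made up for the example.
</p>

<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>;; Hypothetical demo of the marker handling, using a temporary buffer.
(defun my-sketch-dictate-at-marker (texts)
  "Insert each string in TEXTS at a marker, advancing it each time.
Return the resulting buffer contents."
  (with-temp-buffer
    (insert "Notes:")
    (let ((marker (point-marker)))
      (dolist (text texts)
        (goto-char marker)
        (insert " " text)
        ;; By default a marker stays put when text is inserted at its
        ;; position, so move it past the new text.
        (move-marker marker (point)))
      (buffer-string))))

(my-sketch-dictate-at-marker '("first thought" "second thought"))
;; =&gt; "Notes: first thought second thought"
</code></pre>
</div>

<p>
Without the <code>move-marker</code> call, the second piece of text would land in front of the first one. Another option would be <code>set-marker-insertion-type</code>, which makes a marker advance on its own when text is inserted at its position.
</p>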


<p>
Here's an idea for a function that saves the recognized text with a timestamp.
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defvar</span> <span class="org-variable-name">my-whisper-notes</span> <span class="org-string">"~/sync/stream/narration.org"</span>)
(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-save-to-file</span> (text)
  (<span class="org-keyword">when</span> text
    (<span class="org-keyword">let</span> ((link (org-store-link nil)))
      (<span class="org-keyword">with-current-buffer</span> (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert <span class="org-string">"\n\n"</span> (format-time-string <span class="org-string">"%H:%M "</span>) text <span class="org-string">"\n"</span> (<span class="org-keyword">if</span> link (concat link <span class="org-string">"\n"</span>) <span class="org-string">""</span>))
        (save-buffer)
        (run-at-time 0.5 nil (<span class="org-keyword">lambda</span> (text) (message <span class="org-string">"Saved to file: %s"</span> text)) text)))
    text))
</code></pre>
</div>


<p>
And now I can redo things if needed:
</p>


<div class="org-src-container">
<pre class="src src-emacs-lisp"><code>(<span class="org-keyword">defun</span> <span class="org-function-name">my-whisper-redo</span> ()
  (<span class="org-keyword">interactive</span>)
  (<span class="org-keyword">setq</span> whisper&#45;&#45;marker (point-marker))
  (whisper&#45;&#45;transcribe-audio))
</code></pre>
</div>


<p>
I think I've just figured out my PipeWire setup so
that I can record audio in OBS while also being
able to do speech to text, without the audio
stuttering. <a href="https://github.com/rncbc/qpwgraph">qpwgraph</a> was super helpful
for visualizing the PipeWire connections and fixing them.
</p>


<div class="org-src-container">
<pre class="src src-sh"><code>systemctl &#45;&#45;user restart pipewire
sleep 2
pactl load-module module-null-sink <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink_name</span>=<span class="org-string">"VirtualMicSink"</span> <span class="org-variable-name">sink_properties</span>=device.description=VirtualMicSink
pactl load-module module-null-sink <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink_name</span>=<span class="org-string">"CombinedSink"</span> <span class="org-variable-name">sink_properties</span>=device.description=CombinedSink
<span class="org-keyword">if</span> pactl list short sources | grep -i pci-0000; <span class="org-keyword">then</span>
  pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source</span>=<span class="org-string">"alsa_input.pci-0000_00_1f.3.analog-stereo"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink</span>=<span class="org-string">"VirtualMicSink"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">latency_msec</span>=100 <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">adjust_time</span>=1 <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='SysToVMic' node.name='SysToVMic' media.name='SysToVMic'"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"</span>
  pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source</span>=<span class="org-string">"alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink</span>=<span class="org-string">"CombinedSink"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">node_name</span>=<span class="org-string">"SystemOutToCombined"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='SysOutToCombined' node.name='SysOutToCombined'"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">latency_msec</span>=100 <span class="org-variable-name">adjust_time</span>=1
<span class="org-keyword">fi</span>
<span class="org-keyword">if</span> pactl list short sources | grep -i yeti; <span class="org-keyword">then</span>
  pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source</span>=<span class="org-string">"alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink</span>=<span class="org-string">"VirtualMicSink"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">latency_msec</span>=100 <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">adjust_time</span>=1 <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"</span>
  pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source</span>=<span class="org-string">"alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink</span>=<span class="org-string">"CombinedSink"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' "</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'"</span> <span class="org-sh-escaped-newline">\</span>
    <span class="org-variable-name">latency_msec</span>=100 <span class="org-variable-name">adjust_time</span>=1
<span class="org-keyword">fi</span>
pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">source</span>=<span class="org-string">"VirtualMicSink.monitor"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink</span>=<span class="org-string">"CombinedSink"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">latency_msec</span>=100 <span class="org-variable-name">adjust_time</span>=1

pactl load-module module-null-sink <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink_name</span>=<span class="org-string">"ExtraSink1"</span> <span class="org-variable-name">sink_properties</span>=device.description=ExtraSink1

pactl load-module module-loopback <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">source</span>=<span class="org-string">"ExtraSink1.monitor"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink</span>=<span class="org-string">"CombinedSink"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">source_output_properties</span>=<span class="org-string">"node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">sink_input_properties</span>=<span class="org-string">"node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'"</span> <span class="org-sh-escaped-newline">\</span>
  <span class="org-variable-name">latency_msec</span>=100 <span class="org-variable-name">adjust_time</span>=1
</code></pre>
</div>
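

<p>
One way to confirm that the virtual sinks loaded is to scan the output of <code>pactl list short sinks</code>. This is only a sketch: <code>check_sinks</code> is a hypothetical helper, not part of pactl, and it assumes the sink name is the second column of that output.
</p>


<div class="org-src-container">
<pre class="src src-sh"><code># Hypothetical helper: read "pactl list short sinks" output on stdin and
# report whether every sink named in the arguments is present.
# Assumes the sink name is the second whitespace-separated column.
check_sinks() {
  sink_list=$(cat)
  for name in "$@"; do
    printf '%s\n' "$sink_list" |
      awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' || {
        echo "missing sink: $name"
        return 1
      }
  done
  echo "all sinks present"
}
</code></pre>
</div>


<p>
It could be run as <code>pactl list short sinks | check_sinks VirtualMicSink CombinedSink ExtraSink1</code>. Reading the list from standard input keeps the helper easy to try against saved output.
</p>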


<p>
Here's a demo:
</p>

<div class="media-post" id="org4ea7b1e">
<p>
</p><figure><video controls="1" src="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/2026-01-08_11.17.22.webm" type="video/webm"><track kind="subtitles" label="Captions" src="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/2026-01-08_11.17.22.vtt" srclang="en" default=""><a href="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/2026-01-08_11.17.22.webm">Download the video</a></video><figcaption><div>Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function</div></figcaption></figure>
<p></p>

<details class="code-details" style="padding: 1em;
                 border-radius: 15px;
                 font-size: 0.9em;
                 box-shadow: 0.05em 0.1em 5px 0.01em  #00000057;" open="">
                  <summary><strong>Transcript</strong></summary>
<p>
</p><div class="transcript-heading"><span class="audio-time" data-start="0.000000">00:00:00</span> <strong>Inserting into the current buffer</strong></div><span class="audio-time caption" data-start="0.000000" data-stop="5.239000">Here's a quick demonstration</span> <span class="audio-time caption" data-start="5.240000" data-stop="13.799000">of using whisper.el to log notes.</span> <div class="transcript-heading"><span class="audio-time" data-start="13.000000">00:00:13</span> <strong>Inserting text and moving on</strong></div><span class="audio-time caption" data-start="13.800000" data-stop="16.279000">I can insert text into the current buffer</span> <span class="audio-time caption" data-start="16.280000" data-stop="31.559000">one after the other.</span> <div class="transcript-heading"><span class="audio-time" data-start="31.000000">00:00:31</span> <strong>Clocking in</strong></div><span class="audio-time caption" data-start="31.560000" data-stop="33.159000">If I clock into a task,</span> <span class="audio-time caption" data-start="33.160000" data-stop="35.319000">I can add to the end of that clocked in task</span> <span class="audio-time caption" data-start="35.320000" data-stop="36.879000">using my custom code</span> <span class="audio-time caption" data-start="36.880000" data-stop="38.399000">by pressing C-&lt;f9&gt;</span> <span class="audio-time caption" data-start="38.400000" data-stop="52.999000">or whatever my shortcut was.</span> <span class="audio-time caption" data-start="53.000000" data-stop="65.839000">I can do that multiple times.</span> <div class="transcript-heading"><span class="audio-time" data-start="65.000000">00:01:05</span> <strong>Logging a note from a different file</strong></div><span class="audio-time caption" data-start="65.840000" data-stop="75.319000">I can do that while looking at a different file.</span> <div class="transcript-heading"><span class="audio-time" data-start="75.000000">00:01:15</span> <strong>I can look at an info 
page</strong></div><span class="audio-time caption" data-start="75.320000" data-stop="76.799000">I can do it looking at an info page, for example,</span> <span class="audio-time caption" data-start="76.800000" data-stop="78.599000">and annotations will include a link</span> <span class="audio-time caption" data-start="78.600000" data-stop="93.959000">back to whatever I was looking at.</span> <div class="transcript-heading"><span class="audio-time" data-start="93.000000">00:01:33</span> <strong>Adding without an annotation (C-u)</strong></div><span class="audio-time caption" data-start="93.960000" data-stop="96.479000">I just added an optional argument</span> <span class="audio-time caption" data-start="96.480000" data-stop="99.999000">so that I can also capture a note</span> <span class="audio-time caption" data-start="100.000000" data-stop="114.359000">without saving an annotation.</span> <span class="audio-time caption" data-start="114.360000" data-stop="117.319000">That way, if I'm going to say a lot of things</span> <span class="audio-time caption" data-start="117.320000" data-stop="118.839000">about the same buffer,</span> <span class="audio-time caption" data-start="118.840000" data-stop="120.519000">I don't have to have a lot of links</span> <span class="audio-time caption" data-start="120.520000" data-stop="162.719000">that I need to edit out.</span> <div class="transcript-heading"><span class="audio-time" data-start="162.000000">00:02:42</span> <strong>Saving to a different function</strong></div><span class="audio-time caption" data-start="162.720000" data-stop="177.680000">I can also have it save to a different function.</span>
<p></p>


</details>

</div>

<p>
And then I define a global shortcut in KDE that runs:
</p>


<div class="org-src-container">
<pre class="src src-sh"><code>/home/sacha/bin/xdotool-emacs key &#45;&#45;clearmodifiers F9
</code></pre>
</div>


<p>
So now I can dictate into other applications or save into Emacs.
That suggests, of course, that I should get it working with C-f9 as well, if I can avoid the keyboard shortcut loop&hellip;
</p>

<div class="note">This is part of my <a href="https://sachachua.com/dotemacs#multimedia-whisper">Emacs configuration.</a></div><div><a href="https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/index.org">View Org source for this post</a></div><p>You can <a href="https://social.sachachua.com/@sacha/statuses/01KE3D9XENS7FKCEJHF7SKR3TX" target="_blank" rel="noopener noreferrer">comment on Mastodon</a> or <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2026%2F01%2Fusing-whisper-el-to-capture-text-to-speech-in-emacs%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry><entry>
		<title type="html">#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time</title>
		<link rel="alternate" type="text/html" href="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/"/>
		<author><name><![CDATA[Sacha Chua]]></name></author>
		<updated>2025-01-09T15:24:29Z</updated>
    <published>2025-01-09T15:24:29Z</published>
    <category term="audio" />
<category term="subed" />
<category term="yay-emacs" />
<category term="emacs" />
<category term="video" />
		<id>https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/</id>
		<content type="html"><![CDATA[<p>
New in this video: subed-record-sum-time, <code>#+PAD_LEFT</code> and <code>#+PAD_RIGHT</code>
</p>

<div class="row"><div class="columns"><div style="width: 400px"><video controls="1" src="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/subed-record-sum-time.webm" poster="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/2025-01-09_10-23-18.png" type="video/webm"><div>Video not supported. Thumbnail:<br><img src="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/2025-01-09_10-23-18.png" alt="Thumbnail"></div><a href="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/subed-record-sum-time.webm">Download the video</a></video></div></div><div class="columns">
<p>
I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align them with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!
</p>

<p>You can <a href="https://youtube.com/shorts/scyAC4yrWbE">watch this on YouTube</a>, <a href="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/subed-record-sum-time.webm">download the video</a>, or <a href="https://sachachua.com/blog/feed/atom/media/subed-record-sum-time.opus">download the audio</a>.</p></div></div>

<p>
Play by play:
</p>

<ul class="org-ul">
<li>I like the constraints of a one-minute video, so I added a <code>subed-record-sum-time</code> command. That way, when I edit the video using Emacs, I can check how long the result will be.
<ul class="org-ul">
<li><a href="https://github.com/sachac/subed-record">subed-record</a> uses subtitles and directives in
comments in a VTT subtitle file to edit audio
and video. <code>subed-record-sum-time</code> calculates
the resulting duration and displays it in the
minibuffer.</li>
</ul></li>
<li>First, I split the subtitles, align them with the audio to fix the timestamps, and double check the times.
<ul class="org-ul">
<li>I'm experimenting with an algorithmic way to
combine the breaks from my script with the
text from the transcript. <code>subed-align</code> calls
the <a href="https://github.com/readbeyond/aeneas">aeneas forced alignment tool</a> to match up
the text with the timestamps. I use
<code>subed-waveform-show-all</code> to show all the
waveforms.</li>
</ul></li>
<li>Then I can skip my oopses.
<ul class="org-ul">
<li>Adding a <code>NOTE #+SKIP</code> comment before a
subtitle makes <code>subed-record-compile-video</code>
and <code>subed-record-compile-flow</code> skip that part
of the audio.</li>
</ul></li>
<li>Sometimes WhisperX doesn't catch them,
<ul class="org-ul">
<li><a href="https://github.com/m-bain/whisperX">WhisperX</a> sometimes doesn't transcribe my false starts if I repeat things quickly.</li>
</ul></li>
<li>so I also look at waveforms
<ul class="org-ul">
<li><code>subed-waveform-show-all</code> adds waveforms for
all the subtitles. If I notice there's a pause
or a repeated shape in the waveform, or if I
listen and notice the repetition, I can
confirm by middle-clicking on the waveform to
sample part of it.</li>
</ul></li>
<li>and characters per second.
<ul class="org-ul">
<li>Low characters per second is sometimes a sign
that the timestamps are incorrect or there's a
repetition that wasn't transcribed.</li>
</ul></li>
<li>I already talk quickly, so I'm not going to speed that up
<ul class="org-ul">
<li>Also, I already sound like a chipmunk;
mechanically speeding up my recording to fit
in a certain time will make that worse =)</li>
</ul></li>
<li>but I can trim the pauses in between phrases which is easy to do with waveforms.
<ul class="org-ul">
<li>left-click to set the start, right-click to
set the stop. If I want to adjust the
previous/next one at the same time, I would
use shift-left-click or shift-right-click, but
here I want to skip the gaps between phrases,
so I adjust the current subtitle without
making the previous/next one longer.</li>
</ul></li>
<li>Sometimes, after reviewing a draft, I realize I need a little more time.
<ul class="org-ul">
<li>I can specify visuals like a video, animated
GIF, or an image by adding a <code>[[file:...]]</code>
link in the comment for a subtitle. That
visual will be used until the next visual is
specified in a comment on a different
subtitle. <code>subed-record-compile-video</code> can
automatically speed up video clips to fit in
the time for the current audio segment, which
is the set of subtitles before the next visual
is defined. After I compile and review the
video, sometimes I notice that something goes by too quickly.</li>
</ul></li>
<li>If the original audio has some silence, I can just copy and paste it.
<ul class="org-ul">
<li>This can sometimes feel more natural than adding in complete silence.</li>
</ul></li>
<li>If not, I can pad left or pad right to add some silence.
<ul class="org-ul">
<li>I added a new feature so that I could specify
something like <code>#+PAD_RIGHT: 1.5</code> in a comment
to add 1.5 seconds of silence after the audio
specified by that subtitle.</li>
</ul></li>
<li>I can try the flow of some sections
<ul class="org-ul">
<li>I can select a region and then use <code>M-x
    subed-record-compile-try-flow</code> to play the
audio or <code>C-u M-x
    subed-record-compile-try-flow</code> to play the
audio+video for that region.</li>
</ul></li>
<li>and compile the video when I'm ready.
<ul class="org-ul">
<li><code>subed-record-compile-video</code> compiles the
video to the file specified in <code>#+OUTPUT:
    filename</code>. ffmpeg is very arcane, so I'm glad
I can simplify my use of it with Emacs Lisp.</li>
</ul></li>
<li>Emacs can do almost anything. Yay Emacs!
<ul class="org-ul">
<li>Non-linear audio and video editing is actually
pretty fun in a text editor, especially when I
can just use <code>M-x vundo</code> to navigate my undo
history.</li>
</ul></li>
</ul>
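
<p>
To make the arithmetic behind the duration check concrete, here is a rough shell sketch. It is not how subed-record computes the total: <code>vtt_sum_time</code> is a hypothetical helper that assumes HH:MM:SS.mmm timestamps, that a directive in a <code>NOTE</code> comment applies to the next subtitle, and that padding counts toward the total.
</p>


<div class="org-src-container">
<pre class="src src-sh"><code># Hypothetical sketch, not how subed-record does it: estimate the
# compiled duration (in seconds) of a VTT file read on stdin.
# Assumes directives in a NOTE comment apply to the next cue.
vtt_sum_time() {
  awk '
    # Convert HH:MM:SS.mmm to seconds.
    function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    /^NOTE/ { in_note = 1 }
    /^$/ { in_note = 0 }
    in_note {
      # Remember directives until the next cue shows up.
      if ($0 ~ /#[+]SKIP/) skip = 1
      if (match($0, /#[+]PAD_(LEFT|RIGHT): *[0-9.]+/)) {
        value = substr($0, RSTART, RLENGTH)
        sub(/^[^:]*: */, "", value)
        pad += value
      }
      next
    }
    # Cue timing line: start, arrow, stop.
    NF >= 3 {
      if ($1 ~ /^[0-9]+:[0-9]+:[0-9.]+$/) {
        if ($3 ~ /^[0-9]+:[0-9]+:[0-9.]+$/) {
          if (!skip) total += secs($3) - secs($1) + pad
          skip = 0; pad = 0
        }
      }
    }
    END { printf "%.3f\n", total }
  '
}
</code></pre>
</div>


<p>
For example, <code>cat video.vtt | vtt_sum_time</code> (with <code>video.vtt</code> as a placeholder filename) prints the estimated length in seconds, leaving out skipped subtitles and counting any padding.
</p>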

<p>
Links:
</p>

<ul class="org-ul">
<li><a href="https://github.com/sachac/subed">sachac/subed: subed is a subtitle editor for Emacs</a></li>
<li><a href="https://github.com/sachac/subed-record">sachac/subed-record: Record audio in segments and compile it into a file</a></li>
<li><a href="https://github.com/m-bain/whisperX">m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (&amp; Diarization)</a></li>
<li><a href="https://github.com/readbeyond/aeneas">readbeyond/aeneas: aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)</a></li>
</ul>

<p>
Related:
</p>

<ul class="org-ul">
<li><a href="https://sachachua.com/blog/2025/01/editing-videos-with-emacs-and-subed-record-el/">Editing videos with Emacs and subed-record.el</a></li>
<li><a href="https://sachachua.com/blog/2024/10/yay-emacs-tweaking-my-video-workflow-with-whisperx-and-subed-record/">Yay Emacs 5: Tweaking my video workflow with WhisperX and subed-record</a></li>
</ul>
<div><a href="https://sachachua.com/blog/2025/01/yayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time/index.org">View org source for this post</a></div><p>You can <a href="mailto:sacha@sachachua.com?subject=Comment%20on%20https%3A%2F%2Fsachachua.com%2Fblog%2F2025%2F01%2Fyayemacs-9-trimming-adding-silences-to-get-to-a-target-subed-record-sum-time%2F&body=Name%20you%20want%20to%20be%20credited%20by%20(if%20any)%3A%20%0AMessage%3A%20%0ACan%20I%20share%20your%20comment%20so%20other%20people%20can%20learn%20from%20it%3F%20Yes%2FNo%0A">e-mail me at sacha@sachachua.com</a>.</p>]]></content>
		</entry>
</feed>