French pronunciation assessment with Azure and my Emacs setup

I'm working on learning French because it's fun and because I want to help my daughter with her French classes. Some sounds are particularly difficult because they don't exist in English, so they involve moving my tongue and my lips in ways that I'm not used to. I've been working on some sentences my tutor assigned me to help me with various sounds. I practice:

  • during the 45-minute virtual sessions with my tutor twice a week: I record this with OBS
  • recording on my phone when I find myself with some spare time away from my computer, like when I'm waiting for my daughter
  • recording on my computer with Emacs, subed-record, and ffmpeg
  • just out loud, like when I'm skating or doing the dishes

I've been working on my processes for reviewing reference examples either from text-to-speech engines or my meeting audio, recording and reviewing my attempts, and comparing them.

I want to be able to quickly listen to my best attempts from a session, compare different versions, and keep track of my progress over time. I also want to figure out how to practice pronunciation in between sessions with my tutor while minimizing the risk of reinforcing mistakes. To segment the audio for review, I use WhisperX to get the transcript and word timing data with speaker identification (diarization). I use subed-record in Emacs to correct the transcript, cut/split/trim audio segments, and compile them. I want to make this process even easier.

Some of the sounds I'm working on are:

  • /y/ as in mule and bu
  • /ʁ/ as in trottoir: making it with less air, but without it feeling like an "h"; also, transitioning to /y/ as in brume
  • /œ/ as in cœur
  • distinguishing between roue and rue

Microsoft Azure pronunciation assessment

I've been thinking about how to practice more effectively in between my twice-weekly tutoring sessions. There's been a fair amount of research into computer-aided language training and pronunciation assessment, and I wonder how I can tweak my processes and interfaces to take advantage of what other people have learned. I think many apps use the confidence scores of Whisper-based speech recognition engines. Other services try to be more detailed. For example, Microsoft Azure offers a pronunciation assessment service that scores audio samples on:

  • accuracy: how closely the phonemes match a native speaker's pronunciation
  • fluency: how closely the silent breaks between words match a native speaker's
  • completeness: whether all the words were said
  • pronunciation: the overall score
  • confidence: the speech recognizer's confidence in the result

Azure's syllable and phoneme analyses only work for the en-US locale, so I can't use them for fr-FR. That's okay; LLM pronunciation evaluation might work better for sentences than for words or phonemes anyway.

I'll share my results first, and then I'll describe my workflow.

For example, here are some Azure scores for "Trois très grands trains traversent trois trop grandes rues": my March 3 attempt, an intentionally bad attempt, a Kokoro TTS sample, and an audio segment of my tutor speaking (audio not included).

|              | Mar 3 | Bad | Kokoro | Tutor |
|--------------+-------+-----+--------+-------|
| Overall      |    94 |  44 |     97 |    93 |
| Accuracy     |    93 |  63 |     96 |   100 |
| Fluency      |    93 |  62 |    100 |    89 |
| Completeness |   100 |  33 |    100 |   100 |
| Confidence   |    90 |  80 |     92 |    02 |

I'm not entirely sure how useful these numbers will be. It does distinguish between my current attempts and intentionally bad pronunciation, but I'm not sure how much I can trust it yet, or how useful it will be for guiding my attempts in between tutoring sessions. I can imagine a workflow where I listen to a reference, record my attempt, and replay the reference and the recording while displaying the scores, a waveform, and a spectrogram.
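
I haven't built that yet, but here's a rough sketch of the listen-record-replay loop. This is a minimal sketch, assuming mpv and ffmpeg are on my PATH and that ffmpeg can record from the default PulseAudio source; the file names are placeholders.

(defun my-practice-compare (reference attempt &optional seconds)
  "Play REFERENCE, record SECONDS of audio into ATTEMPT, then replay both."
  (interactive "fReference: \nFSave attempt to: ")
  (setq seconds (or seconds 5))
  (call-process "mpv" nil nil nil "--really-quiet" (expand-file-name reference))
  ;; Record from the default PulseAudio source; adjust -f/-i for other systems.
  (call-process "ffmpeg" nil nil nil "-y" "-f" "pulse" "-i" "default"
                "-t" (number-to-string seconds) (expand-file-name attempt))
  (call-process "mpv" nil nil nil "--really-quiet" (expand-file-name reference))
  (call-process "mpv" nil nil nil "--really-quiet" (expand-file-name attempt)))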

Let's look at another example of the scores. The tongue-twister "La mule sûre court plus vite que le loup fou." might offer a more useful comparison because I have a hard time with the u sound in "mule".

  Mar 3 with tutor Kokoro Tutor
Overall 84 94 100 95
Accuracy 96 95 100 100
Fluency 75 92 100 93
Completeness 100 100 100 100
Confidence 84 85 89 87

Phonemes

What about trying to get the IPA and then doing some kind of comparison? I tried using Allosaurus and Montreal Forced Aligner on my audio samples of "La mule sûre court plus vite que le loup fou." I've also included IPA output from Kokoro TTS and espeak, which generate it from text, although I've removed the stress marks for easier comparison.

| Mar 3                   | naɛmis̪yʁkɔpluzitɑləkœnədufu              |
| with tutor (Allosaurus) | lɛns̪ikɑlovitkɛlədufu                     |
| with tutor (MFA)        | lamylsyʁkuʁplyvitkəlølufu                 |
| Tutor                   | lamilks̪yuəs̪ikətəlufu                     |
| Kokoro TTS reference    | la myl syʁ kuʁ ply vit kə lə lu fu.       |
| Espeak TTS reference    | la mjul suɹə kɔt plʌs vaɪt kwɛ lə lup fu. |

The IPA for a sentence is hard to read and compare. Maybe I can get Allosaurus to do word breaks, or maybe I can try a different tool. The phonemes from Montreal Forced Aligner look like they might be a little more manageable; comparing them with the Kokoro reference, the main difference is that I said /ø/ instead of /ə/ in "le". I can get timing data from MFA, so I might be able to use that to break it up into words. It would be nice to get confidence data, though, since I'm pretty sure that my /y/ isn't solid yet.
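
In the meantime, plain edit distance gives me a blunt way to put a number on how far apart two IPA strings are. Here's a small sketch using Emacs's built-in string-distance, with a helper that drops stress marks, length marks, spaces, and periods first; the example compares the MFA transcription of my attempt against the Kokoro reference from the table above. It treats all substitutions as equally bad, which isn't really true for phonemes.

(defun my-ipa-normalize (s)
  "Remove stress marks, length marks, spaces, and periods from IPA string S."
  (replace-regexp-in-string "[ˈˌː. ]" "" s))

(defun my-ipa-distance (a b)
  "Return the Levenshtein distance between normalized IPA strings A and B."
  (string-distance (my-ipa-normalize a) (my-ipa-normalize b)))

(my-ipa-distance "lamylsyʁkuʁplyvitkəlølufu"
                 "la myl syʁ kuʁ ply vit kə lə lu fu.")
;; => 1, the single ø/ə substitution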

Analyzing multiple tongue-twisters from a single session

Here are some tongue-twister attempts from my March 6 session, annotated with the comments from my tutor. First, the list of tongue-twisters in this analysis.

  1. Maman peint un grand lapin blanc.
  2. Un enfant intelligent mange lentement.
  3. Le roi croit voir trois noix.
  4. Il est loin mais moins loin que ce coin.
  5. Le témoin voit le chemin loin.
  6. Moins de foin au loin ce matin.
  7. La laine beige sèche près du collège.
  8. La croquette sèche dans l’assiette.
  9. Elle mène son frère à l’hôtel.
  10. Le verre vert est très clair.
  11. Elle aimait manger et rêver.
  12. Le cœur seul pleure doucement.
  13. Le beurre fond dans le cœur chaud.
  14. Tu es sûr du futur ?
  15. Un mur dur bloque la rue.

Then I can extract audio segments, transcribe the IPA, and send them to Azure for pronunciation analysis, all in one go.

(my-subed-record-analyze-file-with-azure "~/proj/french/analysis/virelangues/2026-03-06-raphael-script.vtt")
| File | ID    | Comments           | All | Acc | Flu | Comp | Conf | Phonemes                     |
|------+-------+--------------------+-----+-----+-----+------+------+------------------------------|
|      | 01-01 |                    |  85 |  92 |  78 |  100 |   88 | lɒmɒaɒʁɔnlat̪ɒlɒ              |
|      | 01-02 | Mm hmm             |  88 |  86 |  86 |  100 |   87 | mamakɔaɒkul̪ɒlatalɒ           |
|      | 01-03 | Ouais              |  68 |  73 |  69 |   67 |   86 | lɒmɒaɒpzɒlɒ                  |
|      | 01-04 | Ouais              |  81 |  78 |  89 |   83 |   86 | ɒenɒmɒt̪ɒʁɔlɒt̪alɒŋ            |
|      | 02-01 | Ouais, c'est bien  |  92 |  93 |  90 |  100 |   90 | ɒjnɒwsɒnɛnt̪eleʒɔnmomuʃlɔns̪mɒ |
|      | 03-01 | Uh huh             |  98 | 100 |  97 |  100 |   88 | jal̪aʁwaɒll̪awandwa            |
|      | 04-01 | Ouais, parfait     |  91 |  92 | 100 |   89 |   89 | ijelwamenmwdoakesekwaŋ       |
|      | 05-01 | X: témoin          |  83 |  82 |  90 |   83 |   88 | etemwawaəʃemɒdwaŋ            |
|      | 05-02 | Ouais              |  92 |  88 |  96 |  100 |   89 | et̪timwawaeʃemɒlwa            |
|      | 06-01 | Mm hmm, parfait    |  89 |  93 |  94 |   86 |   89 | wɒt̪iəwozwasənmɒtɒŋ           |
|      | 07-01 | X: près du collège |  78 |  79 |  99 |   71 |   87 | ɒmnteʁalol̪e                  |
|      | 07-02 | X: près            |  85 |  86 |  85 |   86 |   88 | ɒjteʁatkœl̪es                 |
|      | 07-03 | Mm hmm             |  90 |  93 |  99 |   86 |   89 | ɒl̪mnteadəkol̪en               |
|      | 08-01 | Ouais, c'est mieux |  99 |  99 |  99 |  100 |   90 | laokes̪estɒnlas̪iə             |
|      | 08-02 | Mm hmm             |  97 |  96 | 100 |  100 |   90 | laokes̪est̪ɒvɒnləs̪iən          |
|      | 09-01 | Ouais, c'est bien  |  99 |  99 | 100 |  100 |   89 | ɛnmɛnsɑnsʁɛajlot̪ɛn           |
|      | 10-01 | Mm hmm             |  87 |  88 |  99 |   83 |   87 | nɛzaʁbɛʁejklə                |
|      | 11-01 | Mm hmm             | 100 | 100 | 100 |  100 |   89 | elɒnmimaŋzeʁɒze              |
|      | 12-01 | X: doucement       |  81 |  86 |  81 |   80 |   87 | ikaʁapjobil̪ədimɒ             |
|      | 12-02 | Ouais              |  82 |  83 |  76 |  100 |   86 | ikɑsuntius̪əmɑn               |
|      | 12-03 | Ouais, c'est mieux |  70 |  75 |  65 |   80 |   84 | lɛkɑsødius̪əmən               |
|      | 13-01 | Ouais              |  85 |  85 |  85 |   86 |   84 | lidəfɔdɑŋlikɑʁʃəl            |
|      | 14-01 | Mm hmm             |  97 |  96 |  98 |  100 |   87 | kel̪eid̪jefit̪joəʁ              |
|      | 15-01 | X: rue             |  84 |  90 |  78 |  100 |   84 | ɒwl̪midijəlɔʁlɒʁs̪u            |
|      | 15-02 | X: rue             |  84 |  87 |  88 |   83 |   83 | ɒwmjd̪il̪ɔʁnɒku                |

The play buttons even work in Org Mode because I use a custom link type for audio links.
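
My actual link definition has a few more features, but a minimal version of an audio: link type might look something like this, assuming mpv is installed; it just strips the query string and plays the file.

(require 'org)
(org-link-set-parameters
 "audio"
 :follow (lambda (path _)
           ;; Drop query parameters like ?icon=t and play the file.
           (let ((file (car (split-string path "\\?"))))
             (start-process "audio-link" nil "mpv" "--really-quiet"
                            (expand-file-name file)))))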

It would be useful to analyze one tongue-twister across multiple sessions to get a sense of my progress.

Focusing on one word

Here's a deep-dive on the word "mule" from my March 3 session. I exported the words using the WhisperX JSON. Both Azure speech recognition results and Allosaurus phonemes are all over the place when I try to run them on audio segments with individual words.

(my-subed-record-extract-words "mule"  "/home/sacha/sync/recordings/processed/2026-03-03-raphael.json" "/home/sacha/proj/french/analysis/mule/index.vtt")

I manually adjusted some timestamps, removed some segments, and added a reference sample from Wiktionary (source, public domain). Here's the WebVTT file with the directives: file:///home/sacha/proj/french/analysis/mule/index.vtt

(my-subed-record-analyze-file-with-azure "~/proj/french/analysis/mule/index.vtt")
| File | ID | Comments            | WhisperX | All | Acc | Flu | Comp | Conf | Phonemes |
|------+----+---------------------+----------+-----+-----+-----+------+------+----------|
|      | 01 | Wiktionary - France |          |  97 |  95 | 100 |  100 |   77 | mel      |
|      | 02 |                     |       92 |  80 |  68 | 100 |  100 |   78 | mi       |
|      | 03 |                     |       91 |  80 |  68 | 100 |  100 |   79 | nol̪ɒj    |
|      | 04 | Bit of a y          |       90 |  95 |  93 | 100 |  100 |   88 | nyjɔl̪ɒ   |
|      | 05 | Bit of a y          |       89 |  88 |  80 | 100 |  100 |   83 | j        |
|      | 06 |                     |       89 |  88 |  80 | 100 |  100 |   82 | muə      |
|      | 07 | Bit of a y          |       88 |  10 |  50 |   0 |    0 |   78 | niə      |
|      | 08 |                     |       88 |  76 |  60 | 100 |  100 |   81 | mio      |
|      | 09 | Bit of a y          |       84 |  91 |  86 | 100 |  100 |   84 | mj       |
|      | 10 | Bit of a y          |       84 |  11 |  56 |   0 |    0 |   77 | nj       |
|      | 11 | Bit of a y          |       83 |   6 |  31 |   0 |    0 |   79 | jn       |
|      | 12 |                     |       83 |  80 |  68 | 100 |  100 |   80 | mijə     |
|      | 13 |                     |       82 |   0 |   0 |   0 |    0 |   88 | ɒ        |
|      | 14 | Got a small "oui"   |       80 |   1 |   5 |   0 |    0 |   84 | ija      |
|      | 15 |                     |       78 |   4 |  21 |   0 |    0 |   57 | mia      |
|      | 16 |                     |       78 |  10 |  53 |   0 |    0 |   80 | miə      |
|      | 17 |                     |       75 |  10 |  52 |   0 |    0 |   77 | we       |
|      | 18 | Got a "non!"        |       75 |   8 |  44 |   0 |    0 |   85 | mjən     |
|      | 19 |                     |       72 |  79 |  66 | 100 |  100 |   78 |          |
|      | 20 | Bit of a y          |       72 |  79 |  66 | 100 |  100 |   79 | miə      |

At the word level, the Azure pronunciation scores are all over the place, and so are the Allosaurus phonemes. For now, I think it might be better to tweak my interface so that I can more easily refer to the samples (text to speech or recordings extracted from my tutoring session) and compare new recordings, maybe with waveforms, spectrograms, and other plots.

I like the progress I've made on a workflow for extracting and evaluating sentences or words. I'm looking forward to seeing what I can do once I understand a bit more of the research. After one more cross-session comparison, I'll describe the workflow I have so far.

Analyzing one tongue-twister across multiple sessions

I can also compile different versions of one tongue-twister into one file to get a sense of my progress.

(my-subed-record-analyze-file-with-azure
 (my-subed-record-collect-matching-subtitles
  "Le roi croit voir trois noix"
  '(("~/sync/recordings/processed/2026-02-20-raphael-tongue-twisters.vtt" . "Feb 20")
    ("~/sync/recordings/processed/2026-02-22-virelangues-single.vtt" . "Feb 22")
    ("~/proj/french/recordings/2026-02-26-virelangues-script.vtt" . "Feb 26")
    ("~/proj/french/recordings/2026-02-27-virelangues-script.vtt" . "Feb 27")
    ("~/proj/french/recordings/2026-03-03-virelangues.vtt" . "Mar 3")
    ("~/proj/french/analysis/virelangues/2026-03-06-raphael-script.vtt" . "Mar 6"))
  "~/proj/french/analysis/virelangues/Le-roi-croit-voir-trois-noix.vtt"
  nil 'my-subed-simplify))
| File | ID | Comments      | All | Acc | Flu | Comp | Conf | Phonemes          |
|------+----+---------------+-----+-----+-----+------+------+-------------------|
|      | 01 | Feb 20        |  76 |  78 |  73 |   83 |   85 | ɛmiwəkɔwɑtwɑsmɑs  |
|      | 02 | Feb 22        |  97 |  98 |  97 |  100 |   88 | jusɒl̪vwatlɒnwa    |
|      | 03 | Feb 26        |  97 | 100 |  96 |  100 |   89 | əkwɒwɑtwanwɑ      |
|      | 04 | Feb 27        |  96 |  95 |  97 |  100 |   87 | jɛʀwɑl̪ɛvwɑtwɛnwɑ  |
|      | 05 | Mar 3         |  94 |  96 |  92 |  100 |   88 | juwakwɒl̪dwɑwanwɑ  |
|      | 06 | Mar 6: Uh huh |  98 | 100 |  97 |  100 |   88 | jal̪aʁwaɒll̪awandwa |
Emacs Lisp code for collecting matching subtitles
(defun my-subed-record-collect-matching-subtitles (text files output-file &optional match-fn transform-fn)
  "Find subtitles that match TEXT in FILES and write to OUTPUT-FILE.
FILES can be a list of filenames or a list of (FILE . NOTE) pairs.
By default, TEXT is an approximate match based on
`subed-word-data-compare-normalized-string-distance'.
If TRANSFORM-FN is specified, use that on SUBTITLE-TEXT before comparing.
If MATCH-FN is specified, use that to match instead. It will be called
with the arguments input-text and subtitle-text.
Return OUTPUT-FILE."
  (subed-create-file
   output-file
   (seq-mapcat
    (lambda (o)
      (let ((media-file (subed-guess-media-file nil (expand-file-name (if (consp o) (car o) o)))))
        (seq-keep
         (lambda (sub)
           (when transform-fn
             (setf (elt sub 3) (funcall transform-fn (elt sub 3))))
           (when (funcall (or match-fn 'subed-word-data-compare-normalized-string-distance) text (elt sub 3))
             ;; Add audio note if missing
             (unless (or (subed-record-get-directive "#+AUDIO" (elt sub 4))
                         (null media-file))
               (setf (elt sub 4)
                     (subed-record-set-directive "#+AUDIO" media-file (or (elt sub 4) ""))))
             ;; Prepend note if specified
             (when (and (consp o) (cdr o))
               (let ((current-note (subed-record-get-directive "#+NOTE" (elt sub 4))))
                 (setf (elt sub 4)
                       (subed-record-set-directive
                        "#+NOTE"
                        (if current-note
                            (concat (cdr o) ": " current-note)
                          (cdr o))
                        (or (elt sub 4) "")))))
             sub))
         ;; FILES entries can be plain filenames or (FILE . NOTE) pairs.
         (subed-parse-file (if (consp o) (car o) o)))))
    files)
   t)
  output-file)

Workflow

Extracting parts of the recording

I use OBS to record both my microphone and the tutor's voice. I use WhisperX to transcribe my recording with speaker diarization.

Shell script
#!/bin/zsh
# Extra whisperx flags can be passed in via the WHISPER_FLAGS environment variable.
WHISPER_ARGS=(${(z)WHISPER_FLAGS})
MAX_LINE_WIDTH="${MAX_LINE_WIDTH:-50}"
MODEL="${MODEL:-large-v2}"
for FILE in "$@"; do
    text="${FILE%.*}.txt"
    if [ -f "$text" ]; then
        echo "Skipping $FILE as it's already been transcribed."
    else
        ~/vendor/whisperx/.venv/bin/whisperx --model "$MODEL" --diarize --hf_token "$HUGGING_FACE_API_KEY" --language fr --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --compute_type int8 --print_progress True --max_line_width "$MAX_LINE_WIDTH" --segment_resolution chunk --max_line_count 1 --initial_prompt "Emacs et Org Mode sont d'excellents outils. J'utilise Org-roam pour prendre des notes. Today I am recording a braindump about technical setups. C'est vraiment utile pour la productivité." "$FILE" "${WHISPER_ARGS[@]}"
        rm -f "${FILE%.*}.srt"
    fi
done
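
To kick that off without leaving Emacs, I can wrap the script in a small command. This is a sketch; the ~/bin/transcribe path is a placeholder for wherever I save the script above.

(defun my-transcribe-recording (file)
  "Transcribe FILE with WhisperX by running my shell script asynchronously."
  (interactive "fRecording: ")
  (async-shell-command
   (format "~/bin/transcribe %s"
           (shell-quote-argument (expand-file-name file)))
   "*whisperx*"))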

I can review the VTT manually, but it's also useful to be able to quickly extract different attempts at the phrases or words. I added subed-record-extract-all-approximately-matching-phrases to subed-record.el so that I can generate a starting point with something like this:

(subed-record-extract-all-approximately-matching-phrases
 phrases
 "/home/sacha/sync/recordings/processed/2026-03-06-raphael.json"
 "/home/sacha/proj/french/analysis/virelangues/2026-03-06-raphael-script.vtt")

Ideas:

  • Group by speaker ID to make it possible to extract phrases even with interstitial corrections.
  • Add an optional parameter that lets me append to an existing file.

Here's a copy of its output: file:///home/sacha/proj/french/analysis/virelangues/2026-03-06-raphael-script-original.vtt

Figure 1: Screenshot with subed-waveform showing the waveforms for each segment

Based on the waveforms, I can see that some timestamps need to be adjusted, and some phrases may need to be duplicated, trimmed, or split. This is the file that I ended up with.

file:///home/sacha/proj/french/analysis/virelangues/2026-03-06-raphael-script.vtt

Sometimes I want to do a deep dive on a specific word. Here's another function that uses the WhisperX JSON data to extract just single words, like in my analysis of "mule".

Extracting words
(defun my-subed-record-extract-words (word word-data-file output-file)
  "Extract cues approximately matching WORD into OUTPUT-FILE.
WORD-DATA-FILE is the WhisperX JSON file with word-level timing data.
Sort the cues by descending WhisperX confidence score."
  (let ((media-file (subed-guess-media-file nil word-data-file)))
    (subed-create-file
     output-file
     (mapcar (lambda (o)
               (list nil
                     (alist-get 'start o)
                     (alist-get 'end o)
                     (alist-get 'text o)
                     (format "#+AUDIO: %s\n#+WHISPER_SCORE: %d\n#+SPEAKER: %s"
                             media-file
                             (* (alist-get 'score o) 100)
                             (or (alist-get 'speaker o) ""))))
             (sort
              (seq-filter
               (lambda (o)
                 (subed-word-data-compare-normalized-string-distance
                   word
                   (alist-get 'text o)))
               (subed-word-data-parse-file
                word-data-file))
              :key (lambda (o) (alist-get 'score o))
              :reverse t))
     t)))

Azure pronunciation assessment

I can use the following code from any subtitle of 30 seconds or less to automatically extract the audio for that subtitle and add a comment with the scores from Microsoft Azure's pronunciation assessment. It needs an API key and region. I think the free tier includes 5 hours of speech each month; additional hours are priced at USD 0.66 per hour for short files under 30 seconds (billed in 1-second increments). There's another API that can handle longer segments for USD 1.32 per hour (also covered by the 5 free hours), but using it would probably require a Python or NodeJS program. Since I'm working with words and short sentences for now, I can use the REST API.

(defvar my-subed-record-azure-assessment nil)
(defvar my-subed-record-azure-assess-pronunciation-lang "fr-FR")
;; (my-azure-assess-pronunciation "~/proj/french/analysis/mule/ref-france-sample.wav" "mule")
(defun my-azure-assess-pronunciation (audio-file reference-text)
  "Send AUDIO-FILE to Azure for pronunciation assessment against REFERENCE-TEXT.
Needs the AZURE_SPEECH_REGION and AZURE_SPEECH_KEY environment variables."
  (interactive (list (read-file-name "Audio: ")
                     (read-string "Text: ")))
  (let* ((url (format "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s&format=detailed"
                      (getenv "AZURE_SPEECH_REGION")
                      (url-hexify-string my-subed-record-azure-assess-pronunciation-lang)))
         ;; 1. Create the Configuration JSON
         (config-json (json-encode `(("ReferenceText" . ,reference-text)
                                     ("GradingSystem" . "HundredMark")
                                     ("Granularity"   . "Phoneme")
                                     ("Dimension"     . "Comprehensive"))))
         ;; 2. Base64 encode the config
         (config-base64 (base64-encode-string (encode-coding-string config-json 'utf-8) t))
         ;; 3. Prepare headers
         (headers `(("Ocp-Apim-Subscription-Key" . ,(getenv "AZURE_SPEECH_KEY"))
                    ("Pronunciation-Assessment"  . ,config-base64)
                    ("Content-Type"              . "audio/wav; codecs=audio/pcm; samplerate=16000")))
         (data (plz 'post url
                 :headers headers
                 :body `(file ,audio-file)
                 :as #'json-read)))
    (when (called-interactively-p 'any)
      (kill-new (my-subed-record-azure-format-assessment data))
      (message "%s" (my-subed-record-azure-format-assessment data)))
    data))

(defun my-subed-record-azure-assess-pronunciation (&optional beg end)
  "Assess the pronunciation of the current cue."
  (interactive (if (region-active-p)
                   (list (region-beginning)
                         (region-end))
                 (list (subed-subtitle-start-pos)
                       (save-excursion
                         (subed-jump-to-subtitle-end)
                         (point)))))
  (subed-for-each-subtitle beg end t
    (let* ((temp-file (make-temp-file "subed-record-azure" nil ".wav"))
           (text (my-subed-simplify (subed-subtitle-text)))
           data
           score)
      (subed-record-extract-audio-for-current-subtitle-to-file temp-file)
      (setq data
            (my-azure-assess-pronunciation
             temp-file
             text))
      (when data
        (subed-record-set-directive
         "#+SCORE"
         (my-subed-record-azure-format-assessment data))
        (setq my-subed-record-azure-assessment data))
      (delete-file temp-file)
      data)))

(defun my-subed-record-azure-format-assessment (&optional data)
  "Format the Azure assessment DATA as a one-line summary."
  (setq data (or data my-subed-record-azure-assessment))
  ;; NBest comes back as a vector from `json-read', so use `seq-first'.
  (let-alist (seq-first (alist-get 'NBest data))
    (format "%d; A: %d, F: %d, C: %d, Conf: %d"
            (or .PronunciationAssessment.PronScore .PronScore)
            (or .PronunciationAssessment.AccuracyScore .AccuracyScore)
            (or .PronunciationAssessment.FluencyScore .FluencyScore)
            (or .PronunciationAssessment.CompletenessScore .CompletenessScore)
            (* 100.0 (or .PronunciationAssessment.Confidence
                         .Confidence)))))
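
Putting those together, I can score a single clip from Lisp and get the one-line summary. For example, scoring the Wiktionary reference sample for "mule" produces the numbers from the first row of the table above (actual scores can vary a bit from run to run):

(my-subed-record-azure-format-assessment
 (my-azure-assess-pronunciation
  "~/proj/french/analysis/mule/ref-france-sample.wav"
  "mule"))
;; => "97; A: 95, F: 100, C: 100, Conf: 77"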

Phonemes with Allosaurus, Espeak, or Kokoro FastAPI

(defvar my-subed-record-allosaurus-command '("/home/sacha/proj/french/.venv/bin/python3" "-m" "allosaurus.run" "--lang" "fra" "-i"))
(defun my-subed-record-allosaurus-phonemes (&optional beg end)
  "Add an #+PHONEMES directive with Allosaurus output to cues from BEG to END."
  (interactive (if (region-active-p)
                   (list (region-beginning)
                         (region-end))
                 (list (subed-subtitle-start-pos)
                       (save-excursion
                         (subed-jump-to-subtitle-end)
                         (point)))))
  (subed-for-each-subtitle beg end t
    (let* ((temp-file (make-temp-file "subed-record-allosaurus" nil ".wav")))
      (subed-record-extract-audio-for-current-subtitle-to-file temp-file)
      (subed-record-set-directive
       "#+PHONEMES"
       (with-temp-buffer
         (apply #'call-process (car my-subed-record-allosaurus-command)
                nil t nil
                (append
                 (cdr my-subed-record-allosaurus-command)
                 (list temp-file)))
         (delete-file temp-file)
         ;; Trim the trailing newline so the directive stays on one line.
         (string-trim (replace-regexp-in-string " " "" (buffer-string))))))))

(defun my-lang-espeak-ng-phonemes (text)
  "Return the IPA phonemes for TEXT using espeak."
  (interactive "MText: ")
  (let ((data
         (with-temp-buffer
           (call-process "espeak" nil t nil "-q" "--ipa" text)
           (string-trim (buffer-string)))))
    (when (called-interactively-p 'any)
      (kill-new data)
      (message "%s" data))
    data))

(defun my-french-kokoro-fastapi-phonemes (s)
  "Return the phonemes for S from the local Kokoro FastAPI server."
  (interactive "MText: ")
  (my-kokoro-fastapi-ensure)
  (let ((data (alist-get 'phonemes
                         (plz 'post "http://localhost:8880/dev/phonemize"
                           :headers '(("Content-Type" . "application/json"))
                           :body (json-encode `((text . ,s)
                                                (language . "fr-fr")))
                           :as #'json-read))))
    (when (called-interactively-p 'any)
      (kill-new data)
      (message "%s" data))
    data))

Splitting into segments and making a table

(defun my-subed-record-make-groups (subtitles)
  "Come up with a good ID for attempts, grouping by cue text."
  (let* ((group-num 0)
         (sub-num 0)
         (grouped ; ((text . (start start start)) ...)
          (mapcar
           (lambda (o)
             (cons (car o)
                   (mapcar
                    (lambda (sub) (elt sub 1))
                    (cdr o))))
           (seq-group-by
            (lambda (o) (elt o 3))
            (seq-remove (lambda (o)
                          (and (elt o 4)
                               (string-match "#\\+SKIP" (elt o 4)))) subtitles))))
         (one-group (= (length grouped) 1)))
    (seq-mapcat
     (lambda (o)
       (setq group-num (1+ group-num))
       (unless one-group (setq sub-num 0))
       (seq-map
        (lambda (sub-start)
          (setq sub-num (1+ sub-num))
          (cons sub-start (if one-group
                              (format "%02d" sub-num)
                            (format "%02d-%02d" group-num sub-num))))
        (cdr o)))
     grouped)))

(defun my-subed-record-analyze-file-with-azure (vtt &optional always-create filter)
  "Score each cue in VTT with Azure and Allosaurus and return an Org table.
Extract audio for each cue unless the file exists or ALWAYS-CREATE is
non-nil. If FILTER is specified, only analyze cues whose text matches it."
  (with-current-buffer (find-file-noselect vtt)
    (let* (results
           filename
           (ids (my-subed-record-make-groups
                 (seq-filter
                  (lambda (o)
                    (or (null filter)
                        (string-match filter (elt o 3))))
                  (subed-subtitle-list))))
           id)
      (subed-for-each-subtitle (point-min) (point-max) t
        ;; Skip cues marked #+SKIP and cues that don't match FILTER.
        (unless (or (and (subed-subtitle-comment)
                         (string-match "#\\+SKIP" (subed-subtitle-comment)))
                    (and filter
                         (not (string-match filter (subed-subtitle-text)))))
          (setq id (alist-get (subed-subtitle-msecs-start) ids))
          (setq filename (expand-file-name
                          (format "%s-%s.opus"
                                  (file-name-base vtt)
                                  id)
                          (file-name-directory vtt)))
          (when (or always-create (not (file-exists-p filename))) (subed-record-extract-audio-for-current-subtitle-to-file filename))
          (unless (string-match (regexp-quote "#+SCORE") (subed-subtitle-comment))
            (my-subed-record-azure-assess-pronunciation))
          (unless (string-match (regexp-quote "#+PHONEMES") (subed-subtitle-comment))
            (my-subed-record-allosaurus-phonemes))
          (let* ((comment (subed-subtitle-comment))
                 (scores (mapcar (lambda (o)
                                   (if (string-match (concat (regexp-quote o) ": \\([0-9]+\\)") comment)
                                       (match-string 1 comment)
                                     ""))
                                 '("#+WHISPER_SCORE" "#+SCORE" "A" "F" "C" "Conf")))
                 (phonemes (when (string-match "#\\+PHONEMES: \\(.+\\)" comment)
                             (match-string 1 comment)))
                 (text (subed-subtitle-text))
                 (extra-comment (when (string-match "#\\+NOTE: \\(.+\\)" comment) (match-string 1 comment))))
            (push
             (append
              (list (org-link-make-string (format "audio:%s?icon=t" filename) "▶️")
                    id
                    (or extra-comment ""))
              scores
              (list
               (org-link-make-string (concat "abbr:" text) phonemes)))
             results))))
      ;; Rows were pushed in reverse, so restore file order before adding the header.
      (setq results
            (cons
             '("File" "ID" "Comments" "WhisperX" "All" "Acc" "Flu" "Comp" "Conf" "Phonemes")
             (nreverse results)))
      (my-org-table-remove-blank-columns results t))))

(defun my-org-table-remove-blank-columns (data &optional has-header)
  "Remove blank columns from DATA.
Skip the first line if HAS-HEADER is non-nil."
  (cl-loop
   for i from (1- (length (car data))) downto 1
   do (unless (delq nil
                    (mapcar
                     (lambda (o)
                       (and (elt o i)
                            (not (string= (elt o i) ""))))
                     (if has-header
                         (cdr data)
                       data)))
        (setq data
              (mapcar
               (lambda (o)
                 (seq-remove-at-position o i))
               data))))
  data)
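
For example, with a header row and a Comments column that's blank in every data row (made-up data), the blank column gets dropped:

(my-org-table-remove-blank-columns
 '(("ID" "Comments" "Score")
   ("01" "" "85")
   ("02" "" "92"))
 t)
;; => (("ID" "Score") ("01" "85") ("02" "92"))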
