I usually write my scripts with phrases that could be turned into subtitles. I figured I might as well combine that information with the WhisperX transcripts, which I use to cut out my false starts and oopses. To do that, I use the string-distance function, which calculates how similar two strings are based on the Levenshtein algorithm. I take each line of the script and compare it with the list of words in the transcription, adding one transcribed word at a time until I find the number of words with the minimum distance from my current script phrase. This lets me approximately match strings despite misrecognized words.

I use oopses to signal mistakes. When I detect one, I look for the previous script line that is closest to the words I restart with, and then I can automatically skip the lines from the false start. When the script and the transcript are close, I can automatically correct the words; if not, I add comments so that I can easily compare the two at that point.

Even though I haven't optimized anything, it runs well enough for my short videos. With these subtitles as a base, I can get timestamps with subed-align, and then it's just a matter of tweaking the times and adding the visuals.
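To make the matching step concrete, here's a minimal sketch in Emacs Lisp. It isn't my actual code: my-match-phrase is a made-up name, and I'm assuming the transcript has already been split into a list of word strings.

```emacs-lisp
(require 'seq)

(defun my-match-phrase (phrase words)
  "Greedily match the start of WORDS against PHRASE.
PHRASE is one line of the script; WORDS is a list of transcribed
word strings.  Try taking 1, 2, 3, ... words and keep the count
whose concatenation has the smallest `string-distance' from
PHRASE.  Return a list (MATCHED-TEXT DISTANCE REMAINING-WORDS)."
  (let* ((limit (min (length words)
                     ;; No point scanning far past the phrase's own length.
                     (+ 5 (length (split-string phrase)))))
         (best-count 1)
         (best-distance most-positive-fixnum))
    (dotimes (i limit)
      (let* ((candidate (mapconcat #'identity (seq-take words (1+ i)) " "))
             (distance (string-distance (downcase phrase)
                                        (downcase candidate))))
        (when (< distance best-distance)
          (setq best-distance distance
                best-count (1+ i)))))
    (list (mapconcat #'identity (seq-take words best-count) " ")
          best-distance
          (seq-drop words best-count))))

;; Example: the transcript splits "string-distance" into two words,
;; but the Levenshtein distance is still small, so the match works.
(my-match-phrase "use the string-distance function"
                 '("use" "the" "string" "distance" "function" "which"))
;; => ("use the string distance function" 1 ("which"))
```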
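The oops handling can reuse the same distance measure in reverse. Here's a sketch of that idea, again with made-up names; comparing the restart words against the start of each candidate line is an assumption on my part, since the restart may only cover part of a line.

```emacs-lisp
(defun my-find-restart-line (restart-words recent-lines)
  "Guess which of RECENT-LINES I restarted from after an oops.
RESTART-WORDS is the list of words spoken right after the oops;
RECENT-LINES is a list of already-matched script lines, oldest
first.  Return the index of the closest line so that everything
from there up to the oops can be skipped.  Hypothetical helper."
  (let ((restart (downcase (mapconcat #'identity restart-words " ")))
        (best-index 0)
        (best-distance most-positive-fixnum))
    (dotimes (i (length recent-lines))
      (let* ((line (downcase (nth i recent-lines)))
             ;; Compare against the start of the line, since the
             ;; restart words may only cover part of it.
             (head (substring line 0 (min (length line) (length restart))))
             (distance (string-distance restart head)))
        (when (< distance best-distance)
          (setq best-distance distance
                best-index i))))
    best-index))

;; If I flubbed the second line and restarted it, the words after
;; the oops should be closest to that line:
(my-find-restart-line '("then" "I" "look" "for")
                      '("I use oopses to signal mistakes."
                        "Then I look for the previous script line."))
;; => 1
```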
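Deciding whether to auto-correct or to fall back to comments can be a simple threshold on the distance relative to the length of the script phrase. In this sketch, both the threshold value and the comment format are things I made up for illustration:

```emacs-lisp
(defvar my-subtitle-close-enough 0.3
  "Hypothetical threshold: max distance per script character to auto-correct.")

(defun my-correct-or-annotate (script-text transcript-text)
  "Return SCRIPT-TEXT when TRANSCRIPT-TEXT is close enough to it.
Otherwise return TRANSCRIPT-TEXT with the script line attached as
a comment so that the two are easy to compare by hand."
  (let ((distance (string-distance (downcase script-text)
                                   (downcase transcript-text))))
    (if (<= distance (* my-subtitle-close-enough (length script-text)))
        script-text
      (concat transcript-text "\nNOTE script: " script-text))))
```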