Converting Youtube TimedText (xml/vtt) into SubRip (srt)
Hey everyone!
So I have spent some time dealing with the fact that Youtube changed their caption file format (again....) and when you use Invidious or youtube-dl to download captions what you get now is a weirdly formatted file that most players can't read.
I created a script that converts it all into standard SubRip format.
It's not very elegant, but it involved a lot of trial and error. Overall, I think it works well (I tried with about 80 different files now, and all of them are working ok, bugs were corrected). But certainly it could be improved and made more efficient. I will leave that to someone who cares about that :P
Anyway, I'm happy to share and help more people have access to captioned/subtitled content and information!
@MagicBanana feel free to show us how awk could have done it better! Hopefully I won't feel too embarrassed. Thanks! ;)
Attachment | Size |
---|---|
xml2srt.txt | 917 bytes |
@MagicBanana feel free to show us how awk could have done it better!
I promise nothing, but could you give an input?
Sure!
Here, attached are two files that I used for testing. I shortened these to be smaller and faster during testing, but the structure is the same as in the original downloaded file.
These are two slightly different files, since Youtube apparently can spit out different variations in each video. I had to modify my script to make it work with both.
Have fun! ;)
Attachment | Size |
---|---|
inputfile.txt | 3.53 KB |
inputfile2.txt | 409 bytes |
A much simpler version of your script is attached. It is also certainly much more efficient: three commands running in parallel and processing all the subtitles vs. two dozen commands (or so) called on every subtitle.
You may have to substitute far more special characters in the sed program (but there certainly exist ready-to-use programs to do so): https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
Attachment | Size |
---|---|
xml2srt.txt | 493 bytes |
You never disappoint MagicBanana!
Ahahah, I was quite sure my script wasn't the best possible solution, but I didn't expect such a large simplification to be possible. Congrats, you did great!
I am still happy that I was able to at least work it out on my own anyway :)
As for my other problem below, I did find a solution:
cat file.txt | cut -d" " -f"$a"-"$b"
where a is the starting point to print and b is the final point to print. So the printing part is done!
However, I was wrong in the calculation method. I have not been able to fit a larger amount of text in the same number of lines. Basically, if I have 100 words in the original file and I want to replace them with 120, my code is not adding enough words per line to make it to the 120.
I am trying to hack my way through, but I was wondering whether there is some way to have awk count all the words and make sure ALL of them are distributed across the lines (heuristically, similarly to the original distribution)?
Thanks for any help you might provide!
Below, I counted characters, not words. Is it OK?
I wish to better understand your code, if you don't mind me asking MagicBanana.
1. I understand function print_time is a simple method of calculating and printing the timestamps based on the milliseconds present in the original xml file. I do however ask, why is that better than dividing by 1000 and using the existing date command?
2. I had sed and tr at the start of the script, cleaning out the non-subtitle portions of the file, and then sed at the end replacing special characters. I suppose you put those all together at the start so that those commands are run only once over the entire file instead of being run at every loop, is that correct?
3. Also, is it better to do "sed -e xxx -e xxx -e xxx" than doing "sed xxx | sed xxx | sed xxx" ? As in, is one necessarily faster than the other?
4. /^t=/
I don't understand this one line... I understand it's where the whole printing comes out together (lines below), but I don't understand the line itself.
Thanks for your help once again, and for any clarification you might provide. Thanks MagicBanana!
As I wrote in the post with that script:
three commands running in parallel and processing all the subtitles vs. two dozen commands (or so) called on every subtitle.
Spawning processes takes time: the kernel must allocate memory to copy the executable read from the disk (47 kB for cut, 15 kB for rev, 101 kB for bc, 163 kB for mawk, 109 kB for date, etc.), for the call stack, the heap, etc., it must fill the file descriptor table (describing what files the process reads/writes, /dev/stdin and /dev/stdout included), manage security attributes (who owns the process, what is it allowed to do), etc. Look at the content of some /proc/[pid] to see everything Linux creates to manage a process.
Calling awk (twice), cut (twice), rev (twice), bc and date every time a timestamp (twice per subtitle) is to be converted is far more time-consuming than executing a function on a number in one single AWK program: the overhead to spawn all those processes greatly exceeds the time to actually convert the timestamp! The same goes for the sed call in your loop: spawning a sed process at every line to be processed certainly takes far more time than the actual substitution in that single line.
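To illustrate the point about converting timestamps inside one AWK process, here is a minimal sketch. It is not the print_time function from the attached script (which is not reproduced in this thread): the names ms2srt and srt_time are hypothetical, and it assumes one millisecond value per input line. Unlike the outputs shown later in this thread, it zero-pads every field.

```shell
#!/bin/sh
# Hypothetical helper, not the print_time function from the attachment:
# reads one millisecond value per line and prints HH:MM:SS,mmm.
ms2srt() {
    awk '
    function srt_time(ms,    h, m, s) {   # h, m, s are local variables
        h = int(ms / 3600000); ms -= h * 3600000
        m = int(ms / 60000);   ms -= m * 60000
        s = int(ms / 1000);    ms -= s * 1000
        return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
    }
    { print srt_time($1) }'
}

# One awk process converts every timestamp of the input.
printf '%s\n' 1000 15250 3723004 | ms2srt
```

However many timestamps the input contains, only one process is spawned, which is where the time is saved compared with calling bc and date for each value.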
By default, all POSIX text-processing commands treat multiple lines (multiple records for AWK, by default lines, but RS can be redefined), doing every time the same thing: using them in that way is more efficient.
As for using grep/sed's -e/-f option instead of multiple calls of grep/sed, fewer processes are spawned, again. Nevertheless, that is a minor gain of time if many lines are processed. I believe more time is gained by avoiding I/O: reading lines from a pipe and writing lines to a pipe takes some time. Notice in particular that every line is, along its treatment, sequentially copied into the memory spaces allocated to each of the multiple processes. That is why I believe it is usually faster for a single process to sequentially apply all the selections/editions to a single string stored once in main memory. On the other hand, with multicore processors, the multiple processes truly work in parallel, which may save time (probably not energy, though). In the end, what is faster must depend on how long the input is (and on its reduction along the successive selections/editions: if the first one reduces the input a lot, it dominates the time requirements and having it in a separate process may save time) and on how time-consuming the selections/editions are (i.e., how complex the regular expressions are). You can time both variants to know for a specific case, if you really care.
EDIT: being less assertive in the above paragraph.
At least that is my understanding. I am not an expert in operating systems.
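For anyone who wants to measure rather than guess, here is a small sketch of the two variants being compared. The three substitutions are only an illustration (a few HTML entities), and the function names are made up for this example. Note that "&amp;" is decoded last, so that an input such as "&amp;lt;" is not decoded twice.

```shell
#!/bin/sh
# Variant 1: a single sed process applies the three substitutions.
one_sed() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Variant 2: three sed processes connected by pipes, one substitution each.
three_seds() {
    sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed 's/&amp;/\&/g'
}

# Both variants print the same text.
printf 'a &lt; b &amp;&amp; c &gt; d\n' | one_sed
printf 'a &lt; b &amp;&amp; c &gt; d\n' | three_seds
```

To compare them, put each variant in its own small script and run both on the same large file under the time command, redirecting the output to /dev/null. Which one wins depends on the input and on the machine, as explained above.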
As for /^t=/, it is a regular expression. It means "starts with 't='". Leaving aside the definition of functions, AWK programs are sequences of condition-action pairs. A condition that is only a regular expression is applied to the whole input record (here, /^t=/ equates to $0 ~ /^t=/). The subsequent action is executed if and only if the condition is satisfied (here, if the input record, $0, starts with "t="). In the present case, the AWK program is a single condition-action pair, and the input records are lines (the default, RS not being redefined) that are processed if and only if they start with "t=". The same could be achieved without the condition (an action not preceded by a condition is executed on every input record) if grep '^t=' preceded awk.
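A tiny illustration of that equivalence, on made-up input (the function names and the three sample lines are hypothetical, not taken from the attached files):

```shell
#!/bin/sh
# Condition-action pair: the action only runs on records starting with "t=".
only_t() {
    awk '/^t=/ { print substr($0, 3) }'
}

# Same result: grep selects the lines, awk runs an unconditional action.
only_t_grep() {
    grep '^t=' | awk '{ print substr($0, 3) }'
}

# The second line contains "t=" but does not start with it: it is skipped.
printf '%s\n' 't=1000' 'x t=5' 't=2500' | only_t
```

The first form spawns one process fewer, which is the same trade-off as discussed above for sed.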
Thank you for the detailed explanation. I think I still don't fully understand your code, but I see now how I can improve some of my other scripts (as well as I could have made mine here better, simply by using sed in a different way for example).
I also now understand that it's better to use one single tool to do multiple things at once than calling several different ones.
Thanks!
Two more questions if you don't mind:
1. t /= 1000
Does this mean "variable t becomes its own value divided by 1000"? I tried running it in a simple terminal window and it gave an error (like this: t=80000; t/=1000; printf "$t"). Does it only work inside functions? Or am I misreading that line?
2. finalfile=$(echo "$1" | rev | cut -c 5- | rev ).srt
In my code I had this line, so that the script will automatically output to a similar named file with srt extension. I am, again, using rev twice and cut. Given that different files will have different names (thus different filename lengths) how would you suggest we improve on this one?
EDIT: An additional question
3. With my script I was able to run "script.sh *.vtt" and it would process every vtt inside that folder. I tried adding my "finalfile=$(echo "$1" | rev | cut -c 5- | rev ).srt" line at the start of yours and "> "$finalfile" " at the end, and it still only processes one file. What else should I look into changing? Thanks
Thanks once again!
I also now understand that it's better to use one single tool to do multiple things at once than calling several different ones.
With pipes between them, not necessarily, because the commands run in parallel. I tried to explain that in the longest paragraph of my previous post.
1. t /= 1000
Does this mean "variable t becomes its own value divided by 1000"? I tried running it in a simple terminal window and it gave an error (like this: t=80000; t/=1000; printf "$t").
Yes, t /= 1000 divides t by 1000 and stores the result in t. As in C. AWK is essentially a simplified C (no declaration of variables, which are automatically initialized and transparently converted between strings and floats, no pointers, arrays are associative, etc.) mixed with sed (for the structure of the program, the handling of regular expressions and the functions sub and gsub) and with many variables that are automatically defined (NR, FNR, NF, $1 to $NF, $0, etc.). To access the content of the t variable, you just write t, as in C:
$ awk '{ t = 80000; t /= 1000; printf t }'
80
(I typed [Enter] to end a record; Ctrl+D to end the input.)
If you add a dollar sign before the last t, it would print $80, the 80th field of the record (because t is 80). If you further add double quotes around $t, then it is the string "$t" that is printed (literally "$t", but without the quotes). I have no idea how you ended up with an error.
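The three cases can be seen side by side in the sketch below, which uses t = 2 instead of 80 so that the field exists in a short record (the function name show_t is made up). The first lines also show a likely cause of the error in the question: typed directly in a terminal, the line is interpreted by the shell, not by AWK, and sh has no "t /= 1000" statement; the shell equivalent is an arithmetic expansion.

```shell
#!/bin/sh
# In the shell itself, use an arithmetic expansion to divide:
t=80000
t=$((t / 1000))
printf '%s\n' "$t"

# Inside AWK: t is the variable, $t is field number t, "$t" is a string.
show_t() {
    awk '{ t = 2; print t; print $t; print "$t" }'
}

printf 'alpha beta gamma\n' | show_t
```

The AWK part prints 2, then beta (the second field), then the literal string $t.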
2. finalfile=$(echo "$1" | rev | cut -c 5- | rev ).srt
(...) how would you suggest we improve on this one?
Well, I would write on the standard output, as I did. In this way you can redirect the output with > (or >> to append) or you can further process it through a pipe.
Nevertheless, to process several subtitle files given as arguments to the script (your last question), you do need to define the output file names. I would wrap the script ("(...)" below, where "$1" must become "$xml") in this way:
for xml in "$@"
do
srt=$(printf '%s' "$xml" | sed 's/\.[^.]*$//').srt
(...) > "$srt"
done
sed 's/\.[^.]*$//' deletes everything from the last dot. The shell appends ".srt". I do not substitute with ".srt" in sed to have ".srt" appended even if the file name contains no dot (otherwise the output file would be the input file). Your solution always substitutes the last four characters. It only works for file names with three-letter extensions. I am aware that \.[^.]*$ looks arcane... but to seriously use grep, sed or awk, learning POSIX regular expressions is required.
Notice that performance does not matter here: processing the file names is far less work than processing the files themselves.
"$@" are all the arguments of the shell script. My variable named "xml" successively contains them. Those arguments being file names here, they can contain spaces. That is why I write to "$xml", with the quotes, and not to $xml, without the quotes. In your script, such quotes are missing around the first occurrence of $1.
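As a side note, the shell can strip the extension without calling sed at all. The sketch below is an alternative to the sed call above, not what the wrapped script uses; srt_name is a hypothetical helper name. The POSIX parameter expansion ${1%.*} removes the shortest suffix matching ".*", i.e., everything from the last dot, and leaves the name untouched if it contains no dot.

```shell
#!/bin/sh
# Hypothetical helper: print the output file name for one input file name.
srt_name() {
    printf '%s.srt\n' "${1%.*}"
}

srt_name 'my video.en.vtt'   # only the last extension is replaced
srt_name 'captions'          # no dot: ".srt" is simply appended
```

In the for loop, that would read srt=${xml%.*}.srt. As said above, performance does not matter for this step, so it is only a matter of taste.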
EDIT (I forgot to explain the problem with the instruction at the end of your post): For the shell script, "$1" is its first argument. If you prefer, you can actually have "$1" successively be every argument, thanks to shift:
while [ -n "$1" ]
do
srt=$(printf '%s' "$1" | sed 's/\.[^.]*$//').srt
(...) > "$srt"
shift
done
Thank you so much!
That was extremely helpful and insightful!
And yes, there is a heck of a difference now... Running your script on 5 files at once (using the while loop above) takes less than a second. My original script would usually take between 5 and 10 seconds for each file.
I will try to use what you taught to improve my coding in the future. Thanks so much for the help! :D
Hey MagicBanana, if you feel like trying your hand at a harder issue, I have been struggling since yesterday with something different. I feel totally lost and could use some help.
Essentially I have a file like this:
EMPTYLINE
TEXTLINE
EMPTYLINE
TEXTLINE
EMPTYLINE
TEXTLINE
All the text lines have been condensed in another file, in full form, with some slight variations. The main difference is some words are paraphrased and some punctuation marks were added, but that was the intention.
What I need to do now, is heuristically put the new text back into the original line by line format, with the same % of text being included. Like this:
EMPTYLINE
TEXTLINE (3 words amounting to 5% of entire original text)
EMPTYLINE
TEXTLINE (6 words amounting to 7% of entire original text)
EMPTYLINE
So the new file would be
EMPTYLINE
NEWTEXTLINE (5% of new text, regardless of number of words)
EMPTYLINE
NEWTEXTLINE (7% of new text, regardless of number of words)
So on and so on. I have attached two files as examples, it might be clearer with those.
I have been able to make the calculations work correctly, but can't get the words to print out without errors... The important part is that total number of lines in ORIGINALLINES and POSSIBLENEWLINES must be the same, and the size of each line should be approximately the same (regarding % of the entire text).
Any help is greatly appreciated! I have a semi-working calculation, but if you feel awk can also do that better, it's probably not a bad idea either. Thanks for any help you might provide!
Attachment | Size |
---|---|
newfulltext.txt | 466 bytes |
originallines.txt | 467 bytes |
possiblenewlines.txt | 484 bytes |
possiblenewlines.txt is not a good solution. At the end of its 13th line only ~57% of the whole text is printed, whereas, at the end of originallines.txt's 13th line, ~81% of the text is printed.
Here is a solution that works better:
#!/bin/sh
awk '
/./ {
l += length
len[++i] = ++l }
END {
for (j = 1; j <= i; ++j)
print len[j] / l }' "$1" |
awk -v RS='[ \n]' -v total=$(wc -m < "$2") '
{
if (FILENAME == "-")
proportion[NR] = $0
else {
l += length
if (++l > total * proportion[i]) {
printf "\n\n%s", $0
++i }
else
printf " %s", $0 } }' - "$2" |
tail -n +2
The first AWK program computes the proportions of the whole original text at the end of each of its lines (including the newlines). The second AWK program stores those proportions and, processing the new text word by word, starts a new line if a space followed by the additional word would exceed the proportion at the end of the line; otherwise it prints the word on the current line. To compute the proportions of the new text, the second AWK program is given its total number of characters, which wc computes.
EDIT: clarifying my explanations.
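To see what the first AWK program emits, here is the same logic wrapped in a function (the name cumulative is made up) and run on a made-up three-line text whose lines are 1, 3 and 1 characters long. Counting one extra character per text line, the running totals are 2, 6 and 8.

```shell
#!/bin/sh
# Same logic as the first AWK program above: for every non-empty line,
# print the proportion of the text reached at the end of that line.
cumulative() {
    awk '
    /./ {
        l += length
        len[++i] = ++l }
    END {
        for (j = 1; j <= i; ++j)
            print len[j] / l }'
}

# Prints 0.25, 0.75 and 1, one value per line.
printf 'a\n\nabc\n\na\n' | cumulative
```

The second AWK program then starts a new output line whenever the new text has consumed more than the current proportion of its own total.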
That works better yes. I knew my calculation was not perfect for some reason but couldn't figure out exactly why. Thanks a lot! Yours works much better!
In fact I wonder if my previous step could be run together with this. What I am doing now is starting with an SRT file (1.txt), removing linenumber+timestamp (which creates 2.txt), then creating a file that holds the entire new text with the changes (3.txt), running the script to separate that entire new text into lines based on the proportions of 2 (4.txt), and finally importing the linenumber+timestamp from 1 into 4 again, creating a new subtitle file (5.txt). Any changes and corrections that are needed I make manually later on.
Your script could take into account the linenumber+timestamp in 1.txt and just take the text from 3.txt to create 5.txt, correct?
I attached the relevant files as named in the comment, to better illustrate my point, in case you think it's indeed possible. Thanks for any help you might provide!
Attachment | Size |
---|---|
1.txt | 1021 bytes |
2.txt | 467 bytes |
3.txt | 466 bytes |
4.txt | 482 bytes |
5.txt | 1.01 KB |
Your script could take into account the linenumber+timestamp in 1.txt and just take the text from 3.txt to create 5.txt, correct?
Correct. One additional 4-line AWK program can combine the cues in the original SRT with what the above script output:
#!/bin/sh
awk -v RS='\n\n' -F '\n' '
{
l += length($3)
len[++i] = ++l }
END {
for (j = 1; j <= i; ++j)
print len[j] / l }' "$1" |
awk -v RS='[ \n]' -v total=$(wc -m < "$2") '
{
if (FILENAME == "-")
proportion[NR] = $0
else {
l += length
if (++l > total * proportion[i]) {
printf "\n\n%s", $0
++i }
else
printf " %s", $0 } }' - "$2" |
tail -n +3 |
awk -v RS='\n\n' -F '\n' '
{
print "\n" $1 "\n" $2
getline < "/dev/stdin"
print }' "$1"
I adapted as well the first AWK program (redefining the record and field separators) so that it processes the original subtitles and not only its text.
It works great!... when it works.
I must be doing something wrong here...
Attached are some input files. If we run
script.sh 1.txt 3.txt
It works fine.
script.sh 1.txt t3.txt
It also works fine (3.txt and t3.txt are only slightly different and they both work with 1.txt).
script.sh t1.txt 3.txt (or t3.txt)
this doesn't work. And to my eyes t1.txt and 1.txt should be the same. Subtitle counting is different yes, but the same number of lines occur. Yet, when I try to use t1.txt, what I get is all the text inside the same line.
Is there any safeguard that could be used in the code to make sure it works with both (if you see any difference) or is there something I need to correct on my end? Thanks again for the help!
Attachment | Size |
---|---|
1.txt | 1021 bytes |
3.txt | 466 bytes |
t1.txt | 1.06 KB |
t3.txt | 450 bytes |
Subtitle counting is different
If you want the last AWK program to count the subtitles from 1 instead of copying the numbers from the original subtitles, just replace in it:
print "\n" $1 "\n" $2
with
print "\n" ++nb "\n" $2
Is there any safeguard that could be used in the code to make sure it works with both (if you see any difference)
The difference is that:
- 1.txt was written from an operating system that ends lines with "\n":
$ head -1 1.txt | hexdump -c
0000000 4 \n
0000002
- t1.txt was written from an operating system (most probably Windows) that ends lines with "\r\n" (hence a waste of one character per line):
$ head -1 t1.txt | hexdump -c
0000000 1 \r \n
0000003
Below, I added an optional '\r' character before every '\n' in the inputs:
#!/bin/sh
awk -v RS='\r?\n\r?\n' -F '\r?\n' '
{
l += length($3)
len[++i] = ++l }
END {
for (j = 1; j <= i; ++j)
print len[j] / l }' "$1" |
awk -v RS='[ \r\n]+' -v total=$(awk -v RS='[ \r\n]+' '{ l += length } END { print l + NR }' "$2") '
{
if (FILENAME == "-")
proportion[NR] = $0
else {
l += length
if (++l > total * proportion[i]) {
printf "\n\n%s", $0
++i }
else
printf " %s", $0 } }' - "$2" |
tail -n +3 |
awk -v RS='\r?\n\r?\n' -F '\r?\n' '
{
print "\n" $1 "\n" $2
getline < "/dev/stdin"
print }' "$1"
No '\r' remains in the output.
EDIT: counting the total number of characters in the second file with awk (wc counts multibyte characters as several) so that it is certain the total will be reached; in that second file, sequences of ' ', '\r' and '\n' make no difference (they are treated as one single space): who modifies the text can freely use those separators.
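A different way to get the same robustness, shown here only as an alternative sketch and not as what the script above does, is to strip the carriage returns before any other processing. That keeps the record separators simple, which may matter with an AWK that does not treat a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves it unspecified). The function name strip_cr is made up.

```shell
#!/bin/sh
# Remove every carriage return, turning "\r\n" line ends into "\n".
strip_cr() {
    tr -d '\r'
}

# A two-line CRLF sample: the output ends its lines with "\n" only.
printf '1\r\nfoo\r\n' | strip_cr | od -c
```

With that helper, one could run strip_cr < t1.txt > t1.unix.txt once (t1.unix.txt being a hypothetical name) and give the result to the earlier version of the script.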
Thanks! That worked perfectly well!
It will be a great help, thank you so much again!
I do have some questions about your first script (the simplified version of my own original one). I will post them in the comment above, if you will be so kind as to explain those to me (and others who may wish to learn).
Thank you once again!
Hey MagicBanana, I noticed a possible improvement to be made in the script srtfold, which we previously worked on in another thread. That thread is now locked, so I thought you wouldn't mind if we discuss it here.
After using that srtfold version I noticed that it would help with readability if a single vowel wouldn't be left alone at the end of a line (tends to happen in more cases than I expected, and it's something that actually distracts from a seamless reading).
I attached an example test.txt, which I ran against the latest version of your script ( https://trisquel.info/files/srtfold2_1.txt ) :
srtfold2_1.sh 28 test.txt
It outputs like this:
1
00:00:1,000 --> 00:00:15,250
If this is everything that I
would write down in a line,
2
00:00:15,250 --> 00:00:24,500
I would be
getting whatever it was I
3
00:00:24,500 --> 00:00:30,000
wanted for Christmas.
Notice the "I" appearing alone at the end of some lines? The same happens in other languages with other vowels which should rarely be so. The best option would be to have the single vowel be used in the next line.
Do you think this is something that could be changed in srtfold2_1.sh ?
Thanks again for your help!!
Attachment | Size |
---|---|
test.txt | 148 bytes |
The attached script is the one you linked to with, additionally, the following substitution initially applied to the input file(s):
sed 's/\b\([aeiouyAEIOUY]\) /\1 /g'
... which apparently does nothing! But it does, because the space at the end of "\b\([aeiouyAEIOUY]\) " (which matches a single vowel letter followed by a space) is a "regular space", whereas the space after "\1" (which repeats the vowel letter) is a non-breaking space: https://en.wikipedia.org/wiki/Non-breaking_space
In this way, the rest of the script sees every single vowel letter and its subsequent word as one single word. The output of ./srtfold 28 test.txt becomes:
1
00:00:1,000 --> 00:00:9,700
If this is
everything that I would
2
00:00:9,700 --> 00:00:18,158
write down in a line,
I would be
3
00:00:18,158 --> 00:00:30,000
getting whatever it was
I wanted for Christmas.
Hopefully, the application reading the subtitles properly renders the non-breaking spaces. If not, replace them with regular spaces at the end of the script (same substitution but with the two kinds of spaces swapped).
Attachment | Size |
---|---|
srtfold.txt | 3.55 KB |
Hey!
Thanks for the help once again MagicBanana!
However I think this is not working... I ran the new srtfold against our earlier example as:
./srtfold.sh 28 test.txt
and it produced the result below:
1
00:00:1,000 --> 00:00:9,653
If this is
everything that IÂ would
2
00:00:9,653 --> 00:00:18,306
write down in a line,
IÂ would be
3
00:00:18,306 --> 00:00:30,000
getting whatever it was IÂ
wanted for Christmas.
I suppose the "Â" symbol must be replaced at the end of the script with another sed like you mentioned, which is ok. However, that still leaves an isolated "I" in the last subtitle line.
Also, I tried another file for testing, this one is in Portuguese language which uses different isolated vowels. Running
./srtfold.sh 32 1.txt > 2.txt
And as you will see in the attached files, the result was also incorrect.
Perhaps the problem is that including the isolated vowels in the next word might appear as a "too long word" to be processed? It doesn't seem like that, not in my examples, but still... Just a thought. I would love to know your thoughts on the matter. Thanks again, I am sure we will find a way to fix this ;)
And a happy new year!
Attachment | Size |
---|---|
1.txt | 136 bytes |
2.txt | 170 bytes |
It is only an encoding problem for the script. Does the file command output the following when you give it the script to classify?
$ file srtfold.txt
srtfold.txt: POSIX shell script, UTF-8 Unicode text executable
If not, you must change the encoding. I do that in Emacs, but you can use any tool you wish.
Here, executing that file with 32 and 1.txt as arguments gives:
1
00:00:1,000 --> 00:00:2,745
Publicou
numerosos livros e artigos
2
00:00:2,745 --> 00:00:4,821
sobre geografia mundial,
incluindo "O Novo
3
00:00:4,821 --> 00:00:6,000
Atlas Gigante do Mundo".
(Notice that "e", which is Portuguese for "and", being now inseparable from the word that follows it, "artigos", they go together to a same line, which cannot include "Publicou" without exceeding the 32-character maximum, hence the additional first line with only "Publicou".)
Running that check gives me the same:
POSIX shell script, UTF-8 Unicode text executable
Could it be sed's version? I remember we were having trouble with awk (mawk vs gawk) back when we started. Could it be the same here?
EDIT: My sed --version gives:
sed (GNU sed) 4.4
I am pretty sure it is only a problem with the encoding of the script. Abrowser suffers from that problem for instance. When you click on https://trisquel.info/files/srtfold.txt, Abrowser shows that substitution:
sed 's/\b\([aeiouyAEIOUY]\) /\1Â /g'
If you "Repair Text Encoding" via the "View" menu, as https://support.mozilla.org/en-US/kb/text-encoding-no-longer-available-firefox-menu explains, it becomes:
sed 's/\b\([aeiouyAEIOUY]\) /\1 /g'
You can also replace "Â " with the non-breaking space. With GNU Emacs, you type it with C-x 8 space. Or you can use another character that is supposed to never appear in subtitles (but I believe any character can theoretically appear!), say "~", and have sed use that character:
sed 's/\b\([aeiouyAEIOUY]\) /\1~/g'
You then need to replace that character with a space at the end of the script:
sed 's/\b\([aeiouyAEIOUY]\)~/\1 /g'
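Another possibility, sketched here under the assumption that the subtitles are UTF-8, is to never type the non-breaking space in the script: printf can build it from its two UTF-8 bytes (octal 302 240), so the script stays pure ASCII and cannot be damaged by a wrong encoding guess. The function name glue_vowels is made up, and \b is a GNU sed extension, as in the substitutions above.

```shell
#!/bin/sh
# The non-breaking space (U+00A0), built from its UTF-8 bytes.
nbsp=$(printf '\302\240')

# Same substitution as above, with the non-breaking space taken from $nbsp.
# Inside double quotes, the shell leaves \b, \( \) and \1 untouched.
glue_vowels() {
    sed "s/\b\([aeiouyAEIOUY]\) /\1${nbsp}/g"
}

# od shows the bytes 302 240 after "I" instead of a regular space.
printf 'I would be\n' | glue_vowels | od -c
```

If the player does not render non-breaking spaces properly, the reverse substitution (with $nbsp on the left-hand side and a regular space on the right) can be applied at the end of the script.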
OK, the only solution that seems to work properly is the last suggestion of using ~ as a replaceable character for space. I wonder how much trouble it may cause, since it's an unusual character to appear in regular subtitles (though a documentary about languages for example may use it as a separate character, I guess it will have to be taken some special care in such events).
I will need time to test it further, and see if there is any additional care to be taken. I noticed it differentiates "e" from "é" for example, which I think will be a good thing, but like I said, I will need to test it further, different example files and report back to you if I find any bugs.
So far it seems to produce already better results than before!
Thanks again for the help! Much appreciated!
I guess it will have to be taken some special care
sed 's/\b\([aeiouyAEIOUY]\)~/\1 /g' only replaces "~" with a space if it is directly preceded by a single vowel: I guess there is really little to worry about with real-world subtitles. Anyway, it is kind of sad to not be able to use the character that exists to specify a non-breaking space.
I noticed it differentiates "e" from "é"
[aeiouyAEIOUY] matches any single character between the brackets: 'a' or 'e' or 'i' or 'o' or 'u' or 'y' or 'A' or 'E' or 'I' or 'O' or 'U' or 'Y'. You can add "é" there if you want. You can even more deeply modify the regular expression between "\(" and "\)" to match more strings whose subsequent spaces should be non-breaking. As an exercise, you could for instance try to never break after "the" or "The".
echo "hello a the world" | sed -e 's/\b\([Tt]he\) /\1~/g' -e 's/\b\([aeiouyAEIOUY]\) /\1~/g'
This seems to do the trick. See anything I might have missed?
That is perfect.
Thanks! :D
As for the issue I mentioned below, any luck with why those words were left on their own?
Apparently either the character counting is not done properly in those instances (causing the first word in the line to already exceed the limit), or the "cut point" is misplaced even though properly calculated. But I admit I can't find that in the code itself. Hope you can help.
Thanks again!
Following up on the exercise you proposed, I came up with the following which I think works very well for the Portuguese language:
sed -e 's/\b\([nd][eao]\) /\1~/g' -e 's/\b\([aeiouyAEIOUY]\) /\1~/g'
However I noticed a weird issue happening (I had noticed it before in other subtitle files, but never got to focus on it too much).
Running the attached srtfold against the attached test file:
./srtfold.sh 32 test1.txt
Leaves the words "apesar" and "pensei" alone in a line, even though more words could be put together and still respect the 32 limit. Do you think this is something that can be improved (without breaking everything else that is already working perfectly)?
Thanks for the help again!
Attachment | Size |
---|---|
srtfold.txt | 3.66 KB |
test1.txt | 253 bytes |
That has nothing to do with the non-breaking spaces we have just added. Let me forget that (for simplicity) and recap what the script does.
The first AWK program breaks original subtitles on some punctuation marks, trying to have lines smaller than the maximum number of characters. For the first subtitle, that gives:
1
00:01:10,000 --> 00:01:20,000
No entanto,
apesar da eficácia de seu trabalho,
ele se encontrou enfrentando o que ele descreveu como um dilema crescente.
The second and third lines exceed the 32-character maximum, but there is no punctuation in them to break on.
The second AWK program keeps the same line breaks and adds some. It breaks the too-long lines on spaces and still tries to avoid exceeding the 32-character maximum (which is always possible unless some word has more than 32 characters):
1
00:01:10,000 --> 00:01:20,000
No entanto,
apesar da
eficácia de seu trabalho,
ele se encontrou
enfrentando o que ele descreveu
como um dilema crescente.
Finally, the third AWK program splits the cues so that every output subtitle is two lines long (except for the last output subtitle that is part of an input subtitle, if there is an odd number of lines). In the example, the first output subtitle is:
1
00:01:10,000 --> 00:01:11,774
No entanto,
apesar da
And, indeed, those two lines could be merged without exceeding the 32-character maximum. The third AWK program could merge any two consecutive lines if their total number of characters does not exceed 32 and keeps outputting lines two by two.
Nevertheless, you also wanted the first line of every output subtitle to be usually smaller than the second line. That is the reason for the soft_min variables in the first two AWK programs. As a consequence, in the attached script, when the third AWK program now merges two lines, the resulting line is either the second line of a subtitle or its only line. There may be a better way to merge more lines while keeping the first one usually shorter than the second one, but that would require understanding why I defined soft_min the way I did...
"pensei" remains on a single line. If moved to the previous/next line, counting the additional space, the 32-character maximum is exceeded.
Attachment | Size |
---|---|
srtfold.txt | 4.01 KB |
First, thanks again for helping out!
Yes I know this is not an issue related to the "new feature", I had noticed it before already. I just wanted to let you know this was happening indeed.
As for the new proposed solution, I must say it's not an improvement. As we had discussed before in the original thread, breaking longer sentences at the commas is the best solution for ensuring as much readability as possible (since commas are, by the nature of most languages, semi-breaks themselves). So, as much as possible, we want to keep those as the break between line 1 and line 2 of each subtitle, or between line 2 and line 1 of the next subtitle.
The latest srtfold you attached, results in this:
1
00:01:10,000 --> 00:01:11,532
No entanto, apesar
2
00:01:11,532 --> 00:01:15,323
da eficácia de seu trabalho,
ele se encontrou
3
00:01:15,323 --> 00:01:20,000
enfrentando o que ele descreveu
como um dilema crescente.
4
00:01:20,000 --> 00:01:22,734
Quando voltei para Chicago,
pensei
5
00:01:22,734 --> 00:01:25,000
na história dele até casa.
A much better output (which should mathematically be possible and reasonable to achieve) would be:
[No entanto,] (11 characters, less than 32, breaks on comma)
[apesar da eficácia de seu] (25 characters, can't add next word without exceeding 32)
[trabalho,] (9 characters, not perfect but breaks on comma)
[ele se encontrou enfrentando] (29 characters, can't add next word without exceeding 32, since "o que" functions as a single word now)
[o que ele descreveu como um] (28 characters, same as above)
[dilema crescente.] (18 characters, ends on the period)
What do you think, can this way of counting be applied within your script?
I suppose a secondary step would probably avoid the "trabalho," orphan word, by breaking into
[apesar da eficácia]
[de seu trabalho,]
but I am not sure if I can translate this into an algorithm that you can use to write the code. So, I would be happy if we instead got the alternative I suggested above. If you see this as a possibility of course.
Thanks again for the help!
What do you think, can this way of counting be applied within your script?
As far as I understand, you propose to keep the first and third AWK program as they were and to have the second AWK program pack as many words as possible in every output line (within the maximum number of characters). That is essentially what it does, but from the end of the processed line to its beginning, so that the first line is usually smaller: you wanted that.
In the example, when "apesar da eficácia de seu trabalho," is processed, packing as many last words as possible gives "da eficácia de seu trabalho," (28 characters; would be 35 > 32 prepending "apesar ") and only "apesar" remains. When "ele se encontrou enfrentando o que ele descreveu como um dilema crescente." is processed, that gives "como um dilema crescente." (25 characters; would be 35 > 32 prepending "descreveu "), "enfrentando o que ele descreveu" (31 characters; would be 41 > 32 prepending "encontrou ") and "ele se encontrou" remains.
Additionally, there is the soft_min variable that tries to balance the lengths of the output lines. Nevertheless, it plays no role in this example.
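The backward packing described above can be sketched as follows. This is only a minimal illustration, not the code of srtfold: the function name pack_backwards is made up for the example, there is no soft_min balancing, and no special handling of punctuation or non-breaking spaces.

```shell
# Hypothetical helper (not part of srtfold): fold each input line into
# lines of at most $1 characters, filling them from the END of the text.
pack_backwards() {
    awk -v max="$1" '{
        n = split($0, w, " ")
        count = 0
        line = ""
        # Walk the words from last to first, prepending while the line fits.
        for (i = n; i >= 1; --i) {
            if (line == "")
                line = w[i]
            else if (length(w[i]) + 1 + length(line) <= max)
                line = w[i] " " line
            else {
                out[++count] = line   # line is full: store it, start a new one
                line = w[i]
            }
        }
        if (line != "")
            out[++count] = line
        # The lines were built last-to-first: print them in reading order.
        for (i = count; i >= 1; --i)
            print out[i]
    }'
}

printf '%s\n' 'ele se encontrou enfrentando o que ele descreveu como um dilema crescente.' |
    pack_backwards 32
```

With a maximum of 32, the sample line is folded into "ele se encontrou", "enfrentando o que ele descreveu" and "como um dilema crescente.", as in the walkthrough above. Note that whether length() counts characters or bytes for accented letters depends on the AWK implementation and the locale.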
Yes, I think I see the issue more clearly now.
I think what could work better would be, keep the first awk part as it is, but add an intermediary step after. After breaking on punctuation, check if each "break group" can fit into its own subtitle. I mean this:
[No entanto,] (this fits into a single 32 characters line, so it becomes its own, one single line subtitle)
[apesar da eficácia de seu trabalho,] (this group fits into two lines each below 32 characters, so instead of being grouped with the previous subtitle, it becomes its own subtitle with two lines)
[ele se encontrou enfrentando o que ele descreveu como um dilema crescente.] (this group does not fit into a single 32 characters by line subtitle, therefore it will be broken into two subtitle groups, one with two lines and another with one line, all of which are below 32 characters).
This would be the possible output:
No entanto,
apesar da eficácia
de seu trabalho,
ele se encontrou
enfrentando o que ele descreveu
como um dilema crescente.
Essentially we just want to try and keep each punctuation separated group as together as possible. If possible, I don't see any harm in joining two subtitle groups together if they are both one line only each and both end with punctuation, which would allow for having two lines on the screen as much as possible (here is a slight different text as an example):
No entanto,
apesar do seu trabalho,
ele estava desesperando,
e ficava muito triste
com o que acontecia à sua volta.
Can we give this a try?
I am not sure if my idea was very clear, feel free to question me :)
Thanks again for the help!
In the attached script, I added between what were the first and the second AWK program, a copy of what was the third AWK program that I modified in the way I understood you wanted it, so that punctuation ends the subtitle whenever possible. './srtfold.txt 32 test1.txt' now gives:
1
00:01:10,000 --> 00:01:10,968
No entanto,
2
00:01:10,968 --> 00:01:13,952
apesar
da eficácia de seu trabalho,
3
00:01:13,952 --> 00:01:17,903
ele se encontrou
enfrentando o que ele descreveu
4
00:01:17,903 --> 00:01:20,000
como um dilema crescente.
5
00:01:20,000 --> 00:01:22,188
Quando voltei para Chicago,
6
00:01:22,188 --> 00:01:25,000
pensei
na história dele até casa.
Attachment | Size |
---|---|
srtfold.txt | 4.91 KB |
After a few tests, I confirm this new version is indeed an improvement over the original one! Much better readability in most subtitles I tried it on. Thanks for the help!
One question, I remember in the original srtfold you added a rule to try and avoid orphan words in a line unless the word was already orphan in the original subtitle text. Do you think that could also be applied here to try and prevent instances like "pensei" and "apesar" in the above output?
If I understand it correctly, it happens because the line is being filled from end to beginning, which means sometimes only a single word is left out (especially now that we have non-breaking spaces happening at our will in possibly random places of the text lines). However, if such a rule could be put in place it would be nice. Is it better to have the first line actually being longer than the second one if many words are followed by what we determined as non-breaking spaces? I don't know, but I would probably think so... A list of priority could be:
1. Even distribution with second line being longer than the first one;
2. If an orphan word happens in the first line, add one more word from the second line, keeping second line longer than first one;
3. If adding said word (which could be a series of words with non-breaking spaces) results in the first line being longer than the second, accept that as an exception;
4. If in the end, there is a resulting orphan word in the second line (which means that particular subtitle will ALWAYS have an orphan word, either in the first or second line) revert back to orphan word in the first line (since we have preference for the first line being shorter than the second as much as possible);
Do you concur with my reasoning here? Do you think it's possible to implement?
Thanks again for the help, and also thanks for teaching me. As a further exploration of the earlier exercise, I discovered that sed can take "optional" characters, for example an optional "s" for plural words:
sed -e 's/\b\(laptop[s]\{0,1\}\) /\1~/g'
Will work for both "laptop" and "laptops". Either 0 or 1 character from the previous bracket will work as a match. Nice... Even more evolved forms are possible (like requiring more characters, and having different combinations) and I have also been exploring those. Thanks again! :D
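For comparison, here is a sketch of the same rule written as an extended regular expression, where "s?" plays the role of "[s]\{0,1\}". The function name glue_laptop is only illustrative, and \b (word boundary) is assumed to be available, which is the case in GNU sed.

```shell
# Illustrative only: turn the space after "laptop" or "laptops" into "~".
# -E selects extended regular expressions, where "?" means "zero or one".
# \b is a GNU sed extension matching a word boundary.
glue_laptop() {
    sed -E 's/\b(laptops?) /\1~/g'
}

printf '%s\n' 'my laptop is old but laptops are useful' | glue_laptop
```

Both the singular and the plural are matched by the single rule, so the sample line comes out as "my laptop~is old but laptops~are useful".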
One question, I remember in the original srtfold you added a rule to try and avoid orphan words in a line unless the word was already orphan in the original subtitle text.
Unfortunately, if that was done, it is not really a rule, but something in the middle of the code... and I do not really (want to spend the time to) understand the whole code anymore: I did not expect it would grow that much, have left no comment, etc.
EDIT: after writing that, I actually took a look at the first AWK program, because it was initially an adaptation of the program I had just modified (it breaks on punctuation instead of spaces). I found a bug, which often produced unbalanced lines. I fixed it. I also completed the help message and briefly commented on what each AWK program does.
Do you think that could also be applied here to try and prevent instances like "pensei" and "apesar" in the above output?
The attached script tries to avoid orphan words. To do so, I have modified the third AWK program, which breaks the lines on spaces. I did not follow the algorithm you gave, so that the maximal number of characters per line remains a hard constraint (you forgot it) and to keep working backwards (I define the output lines from the end of a subtitle to its beginning). The output of ./srtfold.txt 32 test1.txt looks pretty good to me:
1
00:01:10,000 --> 00:01:10,968
No entanto,
2
00:01:10,968 --> 00:01:13,952
apesar da eficácia
de seu trabalho,
3
00:01:13,952 --> 00:01:17,903
ele se encontrou
enfrentando o que ele descreveu
4
00:01:17,903 --> 00:01:20,000
como um dilema crescente.
5
00:01:20,000 --> 00:01:22,188
Quando voltei para Chicago,
6
00:01:22,188 --> 00:01:25,000
pensei na história
dele até casa.
Notice that a sequence of words separated by non-breaking spaces is seen as a single word that the script would never put alone on a line (it would be orphan), unless the whole sequence plus a space and the previous word have more characters than allowed per line. I hope that is OK because it looks complicated to consider that words separated by non-breaking spaces are not orphans.
Attachment | Size |
---|---|
srtfold.txt | 6.37 KB |
Hey this looks really good!
I think we got a major upgrade over the quality of the original srtfold, these new outputs are actually much better in terms of readability!
Now it's mostly a matter of carefully choosing (based on the language of any given text) which non-breaking spaces should be set. A balance is needed to avoid breaking lines in places that should never be broken, without creating too many (and too long) sequences of connected words. English and Portuguese have different rules, and I expect Chinese, Danish, Russian, Bulgarian, etc etc, to each need its own set of sed rules before running the folding commands. I'll leave that up to each person at each moment to handle, since they are also pretty easy to manipulate in the script. I don't even speak some of those languages, so I can't really help there, I trust each person will adapt the rules to their own needs.
Anyway, I tried running this new version against some previous subtitles and indeed, the results are much more pleasant. Especially when punctuation is good (commas in the right places if it's a long speech) the result is a very natural flow of the words, making it easier to read and understand. The timing is also good, as it was before.
I don't see it as problematic that a sequence of words separated by non-breaking spaces is treated as a single word. That occasionally leads to some poorer breaks here and there, but it's nothing glaring enough to be worth the time and hard work it would take to fix, I think. An example below:
1
00:00:01,000 --> 00:00:10,000
de facto, e eu também chamaria aos meus vizinhos o Paulo e o Mariano, de amigos.
This breaks into:
1
00:00:1,000 --> 00:00:6,444
de facto, e eu também
chamaria aos meus vizinhos
2
00:00:6,444 --> 00:00:10,000
o Paulo e o Mariano, de amigos.
Which at first glance seems to not respect the rules we established, but in fact it does! Because "de facto" is counted as a single word, and "o Paulo e o Mariano" is also treated in the same way. So, yeah, in this example the subtitle ends up with a slightly strange break, but in the tests I ran with longer subtitles, this doesn't happen very often. It's rare in fact. So, if it's not as simple as introducing a test to check whether the "orphan word" is actually made up of words joined by non-breaking spaces (in which case it would be treated accordingly), I don't think it's worth investing time into fixing this.
Thanks so much for the help MagicBanana! I hope more people use this, I have been using this for a while now and it is such an enormous help. I also learned much from these with you. Thanks a lot!! Awesome work! :D
I'll leave that up to each person at each moment to handle, since they are also pretty easy to manipulate in the script.
Well, it requires learning regular expressions...
I hope more people use this, I have been using this for a while now and it is such an enormous help.
I updated https://dcc.ufmg.br/~lcerf/en/utilities.html#srtfold (the description and the script itself).
Well, it's true, it does. But it's easier to learn regular expressions than to master a ton of different languages to know which rules to apply to each one :P
If anyone wants, I can try and help with English and Portuguese.
Good thing, again I hope more people use this to their advantage. I will also try and find the time to update my translator script with this new srtfold. Once it's done, I will share it here.
Thank you so much for helping in making this possible MagicBanana! :D
EDIT: A question, instead of having a similar number of sed being executed with a lot of different rules at the end (replacing ~ with space) it should be faster and better to only have one, since that character is always being used for the non-breaking space, correct?
If "~" never occurs in the input subtitles, it is correct. Nevertheless, you wrote: "a documentary about languages for example may use it as a separate character". Reverting every specific substitution would avoid replacing most "~" occurring in such a documentary. A better single rule than 's/~/ /g' would check the absence of space right before or right after "~":
sed 's/\([^ ]\)~\([^ ]\)/\1 \2/g'
It is anyway riskier than reverting every rule. The best would be to actually use non-breaking spaces instead of "~", as commented in https://dcc.ufmg.br/~lcerf/utilities/srtfold ... a URL that exemplifies the risk: the above substitution would replace its "~" with a space. Have you tried to download the script from my website? Here, Emacs correctly shows me the non-breaking spaces in the downloaded file (but the same holds for https://trisquel.info/files/srtfold.txt which was problematic for you).
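One caveat with the single rule: with the g flag, sed only replaces non-overlapping matches, so when single-letter words are chained (as in "o~Paulo~e~o~Mariano") one pass leaves a "~" behind ("o Paulo e~o Mariano"). A possible workaround, sketched below under the assumption that "~" is the placeholder, is to repeat the substitution until nothing changes. The function name revert_tildes is only illustrative.

```shell
# Illustrative helper: turn every "~" that has a non-space character on both
# sides back into a space. The substitution is repeated (label + t branch)
# because each match consumes the characters on both sides of the "~".
revert_tildes() {
    sed -e ':again' -e 's/\([^ ]\)~\([^ ]\)/\1 \2/' -e 't again'
}

printf '%s\n' 'o~Paulo~e~o~Mariano, de~amigos. a ~ b' | revert_tildes
```

The isolated "~" in "a ~ b" is left alone because it has a space on each side, which is the point of the check. The risk mentioned above remains: a "~" inside a URL or any other unspaced text is still replaced.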
That script now starts with a substitution of any sequence of spaces for a single space. In this way:
- the counts of the number of characters are correct even if there are such supernumerary spaces;
- the substitutions turning spaces non-breaking need not worry about that.
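A minimal sketch of such a normalization step, assuming plain ASCII spaces only (the exact command used in srtfold may differ, and the function name squeeze_spaces is made up for the example):

```shell
# Illustrative helper: replace every run of spaces with a single space,
# so that later character counts are not inflated by extra spaces.
squeeze_spaces() {
    sed 's/  */ /g'
}

printf '%s\n' 'de  facto,   e eu    também' | squeeze_spaces
```

The pattern is two characters long on purpose: one literal space followed by "space, zero or more times", so it never matches the empty string.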
I have also found and removed a useless test in the first AWK program.
I think I might adapt mine with using a single rule, like the one you suggested (checking for spaces before and after ~ is also a great idea!), since it's usually something that I can check before running the script.
Indeed, we could even have a test run against the subtitle, right at the start... If the ~ character was not present in the text, it would use it. Otherwise, if the character was already in there, it would revert to using another optional character. I suspect with a couple different possible characters it could be fully automated and secure enough.
This is also, in part, a reply to your question... When I open the srtfold in your website, the non breaking space appears as Â
Which is, in my opinion, a pretty regular character to appear, riskier than ~
I think that was not the correct expected appearance, but it is what appears. Same as with a downloaded preview.
As for the supernumerary spaces, it's rather interesting, after we started working on the original version of srtfold, I came across subtitles that used those to place the subtitles in opposite sides of the screen (two people were speaking each in a side of the screen, and the person making the subtitles decided that would be a good graphical representation of who was saying what... Which in my opinion only renders everything hard to read, since you have to move your eyes from one side of the screen to the other very fast, it was a very fast speaking dialogue... I didn't like it...)
Anyway, I decided at the time to have a sed that would substitute supernumerary spaces for a newline, since it made for a better reading. I have kept it since. Never thought I would find supernumerary spaces inside the same sentence... I don't think I ever found such an occurrence. Have you?
Below is the rule I used, it "accepts" two spaces but if more spaces are present, they become a newline. I am not sure if it's the best solution or not, but it works.
sed 's/ */\n/g'
I have this right at the start of my script, so srtfold never had to deal with supernumerary spaces. Still, it might be a good alternative to have it your way.
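Since the spaces inside a pattern are easy to lose when it is pasted somewhere, the described behaviour (two spaces are kept, three or more become a newline) can also be sketched with an explicit repetition count. This is an assumption about the intent, not the exact rule from the script. The function name is illustrative, and "\n" in the replacement is assumed to produce a newline, as it does in GNU sed.

```shell
# Illustrative helper: a run of three or more spaces becomes a line break;
# one or two spaces are left untouched.
spaces_to_newline() {
    sed 's/ \{3,\}/\n/g'
}

printf '%s\n' 'Where are you?     Over here!' 'two  spaces stay' | spaces_to_newline
```

Writing the count as \{3,\} keeps the rule readable even if whitespace gets collapsed in transit, because only a single literal space appears in the pattern.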
As always, thanks for the help and input :D
Indeed, we could even have a test run against the subtitle, right at the start... If the ~ character was not present in the text, it would use it. Otherwise, if the character was already in there, it would revert to using another optional character.
I still think it is sad not to use the character made for that. Anyway, if you want a shell variable (char below) to be the first character of a list ('~' and '|' below) to not occur in the subtitles (or empty if they all occur), you can add that to srtfold, after the shift:
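# Pick the first candidate character that does not occur in the input files.
for char in '~' '|'
do
    # tr -dc deletes everything except "$char": empty output means it is absent.
    if [ -z "$(cat "$@" | tr -dc "$char")" ]
    then
        break
    fi
    char=    # this candidate occurs in the subtitles: reset and try the next one
done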
I think that was not the correct expected appearance, but it is what appears. Same as with a downloaded preview.
Abrowser chooses the incorrect encoding for me too, as I wrote earlier in https://trisquel.info/forum/converting-youtube-timedtext-xmlvtt-subrip-srt#comment-170089
Emacs and Pluma (Trisquel's default text editor) properly display https://dcc.ufmg.br/~lcerf/utilities/srtfold though. Using one of those editors, if I uncomment one of the substitution for non-breaking spaces and execute the script, it works.
Well, I have actually discovered VLC does not properly render them. That is why srtfold now ends with the substitution of every non-breaking space for a regular space. Also, the script now removes every \r and every HTML tag. They were altering the character counts and maintaining the tags when the subtitle is split would require quite some work. I fixed the padding of the seconds in the cues too.
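To illustrate the kind of cleanup mentioned above, here is a rough sketch, not the code from srtfold. The two helper names are made up for the example: srt_time formats a number of milliseconds as a zero-padded SRT timestamp, and clean_text drops carriage returns and anything that looks like an HTML tag.

```shell
# Illustrative helper: milliseconds -> HH:MM:SS,mmm with zero padding.
srt_time() {
    awk -v ms="$1" 'BEGIN {
        h = int(ms / 3600000)
        m = int(ms % 3600000 / 60000)
        s = int(ms % 60000 / 1000)
        printf "%02d:%02d:%02d,%03d\n", h, m, s, ms % 1000
    }'
}

# Illustrative helper: remove every \r and every <...> tag from the text.
clean_text() {
    tr -d '\r' | sed 's/<[^>]*>//g'
}

srt_time 6444
printf '<i>Hello</i> world\r\n' | clean_text
```

With padding, 6444 ms is written as 00:00:06,444 rather than 00:00:6,444. Removing the tags before counting is a simplification: it loses italics and similar markup, which is the trade-off described above.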
Never thought I would find supernumerary spaces inside the same sentence... I don't think I ever found such an occurrence. Have you?
With manual editing (as human translators do), I believe such a typo is not rare. People accustomed to LaTeX or programming may even pay less attention to leaving sequences of spaces, because they are usually equivalent to a single space.
sed 's/ */\n/g'
Ironically, the forum replaced the sequence of three spaces with a single space!
I am not sure if it's the best solution or not, but it works.
It looks perfect to me.
I am not using Abrowser, but yes, all Firefox-based browsers do seem to use the incorrect encoding.
Anyway, either solution works.
Yes, Trisquel's forum did replace those spaces! The damn thing, I didn't even notice :P
You still caught my idea, and I see you included it in the new srtfold in your website. Cool!
In my experience supernumerary spaces were better replaced by a \n (that's a new line symbol in case Trisquel forum changes things again). But either way is good I think.
Thanks again for all the help MagicBanana, srtfold now looks better than ever! :D
GNUser, I envy your scripting skills ;)
Usually, I use Gaupol to convert VTT into SRT. Gaupol is available through the Trisquel standard repository.
Open *.vtt file in Gaupol, then save as SubRip (*.srt).
Has worked for me so far.
Cheers
(I am not sure why, but my previous reply to you didn't appear).
Don't envy me, I am only able to get around very simple problems. Thank you anyway :)
I am not sure if Gaupol will convert these weird xml files to SRT or not. May only work with simple VTT files. Anyway, if anyone ever needs it, it's here. Happy to share!
FFmpeg supports both VTT & SRT formats and can do this conversion. Just FYI.
I did try ffmpeg first, and it didn't work. It gave a "misdetection possible" warning. I think these are not "real" VTT files, at least not in their structure, since VTT and SRT are usually very similar. These files (as you can see in the attached files in my reply to MagicBanana) are weirdly formatted, nothing similar to SRT or VTT.
Anyway, ffmpeg is very good for a lot of conversions too :)