Help for text editing
I am not sure I understand what you are doing. I have not written any code that would take the translated text and the file "cues" to create a .srt file. Have you?
I repeat that the idea was to put, in every cue, the same number of translated sentences as there were input sentences (hence the number of sentences in the third column of "cues"). If the number of sentences is not preserved, that does not work. I gave you a command to count the input sentences:
$ awk '{ sum += $3 } END { print sum }' cues
To count how many sentences there are in the translated text:
$ awk -F '[.?!] ' '{ sum += NF } END { print sum }'
Are the two numbers equal?
We need to identify an invariant in the translation. Does it always preserve the blank lines, as you were suggesting? Even if the paragraphs are not whole sentences (paragraphs starting/ending in the middle of a sentence)? In that context, is the quality of the translation as good as when the paragraphs were whole sentences? If so, we will simply give the text in the .srt files with blank lines between the subtitles. If having full sentences in every paragraph is better (or necessary to preserve the blank lines), we will give paragraphs with full sentences to the translator. If the sentences are usually not aligned with the cues, we will first modify the original .srt files to have whole sentences in every cue. Now if neither the number of sentences nor the blank lines are preserved, I have no idea what to do.
I read your two replies, honestly I got a little confused, apologies. I decided to show in detail what I am doing, maybe it will be easier this way.
I have uploaded the files as follows:
I ran your code on the ORIGINALSUBTITLE file, which generated CUES and TEXT;
I then translated TEXT into Portuguese (a language I believe you speak), producing ENTIRETRANSLATION;
To compare quality, I also translated the ORIGINALSUBTITLE file directly, producing TRANSLATEDSUBTITLE. This way you will see why I say context is so important in machine translation: the difference in quality between ENTIRETRANSLATION and TRANSLATEDSUBTITLE is quite noticeable.
Now, what do you think would be the best way to bring the text from ENTIRETRANSLATION back into subtitle format?
Thank you again for the help, sorry if I got too confused.
Attachment | Size |
---|---|
ORIGINALSUBTITLE.txt | 30.49 KB |
CUES.txt | 371 bytes |
TEXT.txt | 21.28 KB |
ENTIRETRANSLATION.txt | 21.49 KB |
TRANSLATEDSUBTITLE.txt | 31.44 KB |
Hey, I think this might be helpful for the purpose of comparison. I tested another subtitle, this time resulting in a cues file consisting mostly of one-sentence cues, with some exceptions of two sentences. I thought this could be helpful to try out a "simpler" solution before diving into something harder like the earlier example. Again, the files are:
SUBTITLE1.txt - the original subtitle
CUES1.txt - the cues file generated by your script
TEXT1.txt - the text file generated by your script
TRANSLATED1.txt - the translation of the TEXT1 file
Don't know if this helps, hope it does :)
The issue here is a simpler one, with most sentences being divided across two or three subtitle timestamps, and the script mostly managing to separate them one by one into different cues. This should be slightly easier to bring back into the subtitle format, right?
(the same warning as earlier applies, some people might object to the content of the text files, discretion is advised)
Attachment | Size |
---|---|
SUBTITLE1.txt | 164.87 KB |
CUES1.txt | 40.41 KB |
TEXT1.txt | 84.55 KB |
TRANSLATED1.txt | 92.39 KB |
This should be slightly easier to bring back into the subtitle format, right?
That should work well indeed.
As far as I understand, the blank lines are preserved. Is it still the case and is the translation quality as good if the paragraphs can start/end in the middle of sentences? That would be going back to what I wrote ten days ago in https://trisquel.info/forum/help-text-editing#comment-166405 but with one additional "\n" after each subtitle:
$ awk -v RS='\n\n+' -F \\n '{ print $2 > "cues"; printf "%s", $3 > "text"; for (i = 4; i <= NF; ++i) printf " %s", $i > "text"; print "\n" > "text" }'
If the blank lines are preserved and the translation quality is as good, there is no reason to identify sentences and no need to ever merge cues. Otherwise, we will keep on merging cues until a subtitle ends with ".", "?" or "!". Either way, there is no need to count sentences.
(the same warning as earlier applies, some people might object to the content of the text files, discretion is advised)
I do object. I hope I am not only helping to propagate such anti-science bullshit...
I will start by replying to your last paragraph.
No, you are not helping to propagate anything in particular ;)
I do not seek to build fences but bridges.
If you remember what I wrote earlier (https://trisquel.info/en/forum/help-text-editing#comment-166383) the purpose of this work is to give people access to information that THEY choose to watch. That's why I am trying to make this process 100% automatic. The way I see it, if a person wants to watch content X (let it be a speech from a political figure, a presentation by a scientist, a sermon by a religious figure, a news broadcast from another country, etc), they should be able to run a couple scripts and get the content available in their own language.
Incidentally, the two examples I got when testing were both on religious matters, but I thought they were good examples of how differently the script was working in each case. Hence my warning, so as not to offend anyone.
Thank you for helping me build those bridges! If you wish I will delete the files from my comments (I believe the edit option allows for that) once we are done ;)
Back on topic, I think we would do well tackling one problem at a time. So maybe start by putting text back into subtitle format, with the proper timestamps, in the cases of one or two sentences like the last example I provided. I think it's good to keep the sentence counting because we may sometimes get two sentences in one cue. But let's consider the easier path for now: what if we treat even those as one? Shall we try that and see what happens?
Thanks!
That code does break sentences in some cases, whereas the previous one didn't. Again, if I had to eyeball it, I would say: keep the sentence counting (using punctuation as a breaker) but, for now, treat two sentences as one for testing purposes.
Since the translation always keeps the blank lines (correct? You never clearly confirmed that point, which is absolutely essential), there is no need to count sentences. We will either:
- take every input cue, put every associated text (usually mere *pieces* of sentences) in paragraphs, translate them, and put them back in the corresponding input cues;
- or, if that improves the quality of the translation and does not worsen too much the synchronization with the audio, merge the original cues so that the concatenated text in them are *whole* sentences, put them into paragraphs (which therefore never start/end in the middle of a sentence), translate them, and put them back in the corresponding merged cues.
Here is how the two pieces of code differ:
Solution 1, with *pieces* of sentences:
$ awk -v RS='\n\n+' -F \\n '{ print $2 > "cues"; printf "%s", $3 > "text"; for (i = 4; i <= NF; ++i) printf " %s", $i > "text"; print "\n" > "text" }'
Solution 2 with *whole* sentences:
$ awk -v RS='\n\n+' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); sub(/ *$/, ""); out = out " " $0 } $NF ~ /[.?!]$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }'
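Since that one-liner is dense, here is the exact same program laid out with comments (the behavior is unchanged):
$ awk -v RS='\n\n+' '
out == "" { begin = $2 }              # first subtitle of a merged cue: keep its start time
{
  end = $4                            # always remember the latest end time
  sub(/^[^\n]*\n[^\n]*\n/, "")        # delete the number and the timestamp lines
  gsub(/\n/, " "); sub(/ *$/, "")     # join the text lines; trim trailing spaces
  out = out " " $0                    # accumulate the text of the merged cue
}
$NF ~ /[.?!]$/ {                      # the subtitle ends a sentence: flush
  print begin " --> " end > "cues"
  print substr(out, 2) "\n" > "text"  # substr removes the leading space
  out = ""
}'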
In both cases:
- Give .srt subtitles to the AWK program (either Solution 1 or Solution 2);
- Translate the output file "text";
- Give the translation to that AWK program:
$ awk -v cues=cues 'NF { text = $0; getline < cues; print ++nb "\n" $0 "\n" text "\n" }'
- Execute srtfold (with the desired maximal number of characters per line) on the output: https://dcc.ufmg.br/~lcerf/en/utilities.html#srtfold
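To make the order of operations concrete, the whole pipeline could look as follows. The file names are placeholders, and I am assuming srtfold takes the maximal line length and the file as arguments; adapt if its interface differs:
$ # Solution 1 or Solution 2 writes the files "cues" and "text"
$ awk -v RS='\n\n+' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); sub(/ *$/, ""); out = out " " $0 } $NF ~ /[.?!]$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }' original.srt
$ # translate "text" by whatever means, saving the result as "translation"
$ awk -v cues=cues 'NF { text = $0; getline < cues; print ++nb "\n" $0 "\n" text "\n" }' translation > rebuilt.srt
$ ./srtfold 45 rebuilt.srt > final.srt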
It worked wonderfully well!
I ran Solution 2 on SUBTITLE1.txt (uploaded in my previous comment) and, after translating the text (which, yes, always keeps the blank lines; sorry, I thought I had confirmed that earlier), turned it back into a subtitle file using the next piece of code, ran srtfold, and it gave an excellent, 100% automatically translated and synchronized subtitle!
As for Solution 1, it doesn't really offer any improvement over what we had before. I was already converting subtitles to single lines by running this:
sed -i 's|\r||' "$1"                  # delete the carriage returns
awk 'BEGIN { RS = ""; FS = "\n" }     # one record per blank-line-separated subtitle
NR > 1 { print "" }                   # blank line between subtitles
{ print $1; print $2;                 # keep the number and the timestamps
  for (i = 3; i < NF; ++i) printf "%s ", $i;
  print $NF;                          # join all the text lines into one
}' "$1" > "new$1"
(Btw, how can I put the code appearing in the yellow box like you do? I am running my browser without JS).
This doesn't help to make each sentence "whole" for translation, or better for reading, since the synchronization is sometimes off by itself (by a word or two).
What I think we can try is what we were trying to do earlier on: assume ".", "!", "?", ":" and ";" should always break a sentence, but add exceptions. In a list form, easy to edit in a text editor, we could have
"
Dr.
Mr.
St.
etc"
This way we can break the subtitle into much more usable sentences, avoiding most issues with "periods".
A percentage of the subtitle duration would be attributed to each character, making it easy to divide the subtitle from this:
00:01 - 00:08
this is it. I will share
even more
into this:
00:01 - 00:03
this is it.
00:04 - 00:08
I will share even more
What do you think?
I ran Solution 2 on SUBTITLE1.txt (...) and it gave an excellent, 100% automatically translated and synchronized subtitle!
SUBTITLE1.txt's sentences are short and almost every period is at the end of a subtitle (as I would expect from human-made subtitles). With ORIGINALSUBTITLE.txt, the synchronization is probably far worse.
As for Solution 1, it doesn't really offer any improvement over what we had before.
So, the translator does not simply ignore the blank lines. They really harm the quality of the translation, right?
RS = ""
You taught me something here! Well, https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html taught it to me after I was intrigued by that definition of RS. It is indeed better than my definition as "\n\n+". As a consequence, here are improved "Solution 1" and "Solution 2":
$ awk -v RS='' -F \\n '{ print $2 > "cues"; printf "%s", $3 > "text"; for (i = 4; i <= NF; ++i) printf " %s", $i > "text"; print "\n" > "text" }'
$ awk -v RS='' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); out = out " " $0 } $NF ~ /[.?!] *$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }'
Btw, how can I put the code appearing in the yellow box like you do?
With the HTML tag "code". See https://trisquel.info/en/filter/tips for all the tags Trisquel's forum and wiki support.
This doesn't help to make each sentence "whole" for translation
I am confused. Does having whole sentences in the paragraphs help the translation, quality-wise? If not, why not use "Solution 1"?
What do you think?
As far as I understand, you want to divide any input subtitle at any punctuation but for a list of exceptions... which will never be complete. Why not "respect" the input divisions? If decided by a human (or even by a speech-to-text program, which probably knows more about grammar than punctuation), they are probably better than what we would achieve.
Well, I'm glad to know I helped you learn something new with all this, even if unintentionally :)
As you are probably aware, that code wasn't written by me; it was something I found online in a forum where someone else was trying to play around with subtitles as well. The two lines were actually written by different people, but I discovered I needed both to work properly. Still, I'm glad to know it was beneficial to you too!
As for the subtitles, I think it's easier to show. Yes, the blank lines are PRESERVED, as in they stay in the exact same place within the text, but they are not IGNORED, since they affect the translation quality. I attached some files with a small portion of the text of the earlier subtitle that was generating those 35 sentences in a cue. Namely:
1ORIGINALSUBTITLE.txt - The subtitle as originally made, two lines in each subtitle, with improper breaks.
2ORIGINALTRANSLATED.txt - The same subtitle translated exactly as it was. Notice the way the text is incorrectly translated at some points.
3SINGLELINESUBTITLE.txt - The subtitle file after turning every two lines into one, no changes made to breaking or timing.
4TRANSLATIONSINGLELINE.txt - The translation of the previous file; you will notice an improvement in translation quality compared to 2ORIGINALTRANSLATED.txt. However, notice subtitles number 20 through 26: some of those sentences lose meaning and context, singular and plural are mixed, and masculine and feminine terms are mixed together.
5SINGLEPARAGRAPH.txt - Running your previous Solution 1 produces this block of text, which in its entirety is translated into the next file...
6TRANSLATEDPARAGRAPH.txt - Notice that here the translation is nearly perfect, all the errors I mentioned before are gone and there is a sense of context and meaning around all the text.
So, yes, putting the text together will help improve translation quality A LOT!
Now, I fully agree with you: we won't be doing a good job trying to guess where punctuation should break the subtitle. If the original subtitle was not properly broken, we shall not attempt to fix it. You are right; I think we should try a different approach, respecting the original divisions/breaks: turn ORIGINALSUBTITLE's content into ENTIRETEXT's content, keeping in the CUES file not only the starting and ending points of the entire paragraph but also all the intermediate subtitle times. Then we heuristically distribute the translated text over the old timestamps (the word count might differ but, as long as the same percentage of words is distributed evenly, it should do a better job than relying solely on srtfold for that task).
It should be doable like this, right? We are not trying to improve the text breaks; we only want to translate the text as a whole and distribute the words in a close percentage relation to how they were originally distributed. Do we concur? ;)
Thanks again for all the help, I am learning a lot with you!
Attachment | Size |
---|---|
1ORIGINALSUBTITLE.txt | 4.74 KB |
2ORIGINALTRANSLATED.txt | 4.91 KB |
3SINGLELINESUBTITLE.txt | 4.74 KB |
4TRANSLATIONSINGLELINE.txt | 4.94 KB |
5SINGLEPARAGRAPH.txt | 3.31 KB |
6TRANSLATEDPARAGRAPH.txt | 3.42 KB |
as long as the same percentage of words is distributed evenly, it should do a better job
The result would be very bad whenever there are long pauses (without subtitles) in the movie: the viewer would often read, before the pause, the beginning of what is pronounced after it or, on the contrary, would have to wait for the end of the pause to read the end of what was pronounced before it.
For instance, let us imagine two subtitles that would respectively contain the third-to-last and the second-to-last sentences of 5SINGLEPARAGRAPH.txt. In 5SINGLEPARAGRAPH.txt, those sentences are 36 words long ("I think it's significant ... history.") and 67 words long ("But in the midst ... of his ministry."). In 6TRANSLATEDPARAGRAPH.txt, they are 39 words long ("Penso que é significativo ... história") and 59 words long ("Mas no meio ... do seu ministério"). After translation, distributing the words proportionally, we would have (39 + 59) * 36 / (36 + 67) = 34 translated words in the first subtitle and (39 + 59) * 67 / (36 + 67) = 64 translated words in the second subtitle: the five words ending the first translated sentence would be in the second subtitle, which may appear much later.
We could try to map the original sentences to the translated ones so that the latter can be put in the original (merged) cues. That was our first idea. Here, searching for ".", "?" and "!" would work. Indeed, 5SINGLEPARAGRAPH.txt and 6TRANSLATEDPARAGRAPH.txt both have sixteen periods, question marks and exclamation marks:
$ tr -dc '.?!' < 5SINGLEPARAGRAPH.txt | wc -c
16
$ tr -dc '.?!' < 6TRANSLATEDPARAGRAPH.txt | wc -c
16
But those are only sixteen sentences. With longer .srt files, I am pretty sure we would rarely get the same counts, because of complications we have already mentioned (numbers, abbreviations, acronyms, "..." that may become a single character, etc.). That is why I started to make the definition of a sentence more complex, affirming there should be a space/newline after ".", "?" or "!". Nevertheless, doing so with 5SINGLEPARAGRAPH.txt and 6TRANSLATEDPARAGRAPH.txt, they end up not having the same number of sentences, because "... they said this: "Someone has shot the President." You can imagine..." (no space right after the period) is translated to "eles disseram isto: "Alguém alvejou o Presidente". Pode imaginar" (a space right after the period). So, well, I do not think we should go that way.
Another idea (that we would probably name "Solution 3") may be to have long-enough pauses between the subtitles define the paragraphs. But I am not certain it would be much better than Solution 2. A too-small threshold may lead to breaking sentences with hesitations (paragraphs starting/ending in the middle of a sentence). A too-large threshold, or simply a long monologue (such as what you attached), may create huge paragraphs and the synchronization with the audio would become bad.
I have been doing some testing and essentially I think you are right: I don't see any good way to make sure a poorly structured subtitle can be improved, even less when there is translation involved. Luckily, I don't think we will find many of those (I took some time looking at subtitles from several different sources and this was just a really poor example, though a useful one for considering several aspects that affect the output quality).
For now, let's put that aside and consider what we already have. I am trying to make this as automatic as possible, but there are still some things to iron out.
I am currently using DeepL for most of my translations, mostly because the quality is truly above all other systems I have tested (I would say DeepL, Reverso, Google, LibreTranslate, in that order).
Regardless, all these systems have a limit on how many words/characters you can translate at a time, as well as some limitations on file extensions. DeepL requires DOCX or PDF and has a limit of 10,000 words in a single file. Currently, what I do is this:
1. Run Solution 2 on the original subtitle;
2. Run the PERLCODE below to divide the text into chunks of X words each (unfortunately, it breaks sentences midway, something I wish we could improve);
3. Manually correct the broken sentence at the end/start of each chunk;
4. Use "soffice --convert-to docx" to convert the chunk files into DOCX format;
5. Using the browser, go to DeepL and translate the DOCX files into the desired language;
6. Use "soffice --convert-to txt" to convert the newly translated files back into TXT files;
7. Manually put the chunks together (I believe this could be done with "cat", but I still haven't tried it);
8. Run the awk command to put the cues and translated text back together;
9. Run srtfold and (finally) get the subtitle file as intended!
So... this certainly has a lot of room for improvement. Mainly, I believe awk could probably do a better job dividing the TEXT file into smaller parts while respecting whole sentences. I guess I could make it search for a period at the end of a line but, if it has already exceeded the limit of words, how to make it go back? I could really use your help with this! Btw, here is the PERLCODE I found online to do the division:
perl -e '
undef $/;                                # slurp the whole input at once
$file=<>;
while($file=~ /\G((\S+\s+){10000})/gc)   # grab the next 10000 whitespace-terminated words
{
$i++;
open A,">","chunk-$i.txt";               # write them to chunk-1.txt, chunk-2.txt, ...
print A $1;
close A;
}
$i++;
if($file=~ /\G(.+)\Z/sg)                 # whatever remains (fewer than 10000 words)
{
open A,">","chunk-$i.txt";
print A $1;
}
'
There is also another line of code that I use when necessary. For example, I have found subtitles online that have proper punctuation and grammar, but the words are either ALL CAPS or all lower-case. In those cases, I do this:
awk '{ print tolower($0) }' subtitle.srt > newsubtitle.srt
awk '{ $0 = toupper(substr($0, 1, 1)) substr($0, 2); print }' newsubtitle.srt > yetnewsubtitle.srt
If it's an English subtitle I also run this:
sed -i "s/ i / I /g;s/i'/I'/g"
This capitalizes "I" where necessary. It helps make the text more visually correct. Of course, names and other things are still wrong, but I see this as an improvement over HAVING AN ALL CAPS SUBTITLE. ;)
So, to sum it all up, I would like a way for your Solution 2 to provide output in TEXT files of at most X words each or, if necessary, maybe we could integrate that PERLCODE into it. What do you think would be easier?
Also, I need a more "automatic" way to get the TEXT files back into one single file that can be merged with the CUES file. I think cat would do it, but we would need to integrate that at the beginning of the awk command. Would that work (maybe as a pipe)?
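I imagine something like this, though I have not tested it (the two globs keep chunk-2.txt before chunk-10.txt, since a single * would sort them lexicographically):
$ cat chunk-?.txt chunk-??.txt | awk -v cues=cues 'NF { text = $0; getline < cues; print ++nb "\n" $0 "\n" text "\n" }' > rebuilt.srt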
Thanks again for the help, and apologies if this comment got a little confusing xD
Regardless, all these systems have a limit on how many words/characters you can translate at a time, as well as some limitations on file extensions. DeepL requires DOCX or PDF and has a limit of 10,000 words in a single file.
The consequences of SaaSS: https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
You should use Apertium on your own machine. You will not suffer from any such limit.
True. I have not tried Apertium on this specific task, but I did play around with it for a while some time ago and, I have to say, I was unsatisfied with the results. The translation quality was below that of DeepL and Reverso. I would say it was maybe similar to Google's, which I dislike in general. Truth be told, I got spoiled by DeepL's fluent and natural-sounding translations :P
I might give it a shot again, having such complete sentences might improve the output of Apertium.
Apart from that, what do you think of those other "improvements" I am trying to make to subtitles?
Instead of merely uppercasing the first letter of a sentence, I have been trying to uppercase every first letter after a period (or an exclamation or question mark). I got myself as far as this line of sed:
sed -E 's/(^[a]|\. [a-z])/\U&\E/g'
But this only sees spaces, not newlines (unlike awk). Any suggestion here?
sed -E '/./{H;$!d} ; x ; s/[.?!]\s+[a-z]/\U&\E/g'
I found the "construct" on https://www.gnu.org/software/sed/manual/html_node/Multiline-techniques.html
I played around with some examples people were using for other purposes, and eventually came down to this:
sed -E 's/(^[a]|\. [a-z]|\! [a-z])/\U&\E/g' SUBTITLE.srt | sed 's/^\(.\)/\U\1/'
This is still not 100% complete, but I think it's already doing a good job (on the single text file I am testing it on, it produced good results; I have to test more files).
This is of course only necessary when subtitles are ALL CAPS or all lower-case. Still, I have found some of those, so this might be useful not only for me but also for a lot of other people.
In any event, do you think an awk script could be used to split a file into X words without breaking sentences? It could come in handy for other purposes as well. Thanks!
This is still not 100% complete, but I think it's already doing a good job (on the single text file I am testing it on, it produced good results; I have to test more files).
Have you seen my previous post?
In any event, do you think an awk script could be used to split a file into X words without breaking sentences?
With a better specification, probably. Do you want to reformat the input into paragraphs with at most a user-defined number of words? Do you want to split the input into several files? What to do if a single sentence exceeds the user-defined maximal number of words? Violate the maximum?
What do you mean?
The output would be files with X number of words each; a variable would be the best choice to make it adaptable. For the sake of example, let's say 5000 words (not characters, full words). I don't believe we will ever have a single sentence that long ;)
The original file would be broken into as many files as needed to make sure that each never has more than 5000 words, nor breaks a sentence in half (sentences as defined by Solution 2 in the previous code). If a choice had to be made between having a file with 4999 words or 5001, the 4999 option would be the choice.
The PERLCODE I posted does this, but it doesn't preserve full sentences as we need.
That should do it:
$ awk -v max=5000 '{ text = text $0 "\n" } END { n = split(text, a, /[[:space:]]+/, seps); for (i = 1; i <= n; ++i) { sentence = sentence a[i] seps[i]; if (a[i] ~ /[.?!]$/) { piece = piece sentence; sentence = ""; j = i }; if (++k == max) { k = i - j; printf "%s", piece > FILENAME "." ++nb; piece = "" } }; if (k) printf "%s", piece sentence > FILENAME "." ++nb }'
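In case the one-liner is hard to follow, here is the same program laid out with comments (identical behavior):
$ awk -v max=5000 '
{ text = text $0 "\n" }                       # slurp the whole input
END {
  n = split(text, a, /[[:space:]]+/, seps)    # words in a[], separators in seps[] (a gawk extension)
  for (i = 1; i <= n; ++i) {
    sentence = sentence a[i] seps[i]          # grow the current sentence
    if (a[i] ~ /[.?!]$/) {                    # the word ends a sentence:
      piece = piece sentence                  # move it into the current piece
      sentence = ""
      j = i                                   # remember where the last whole sentence ended
    }
    if (++k == max) {                         # word limit reached:
      k = i - j                               # carry over the words of the unfinished sentence
      printf "%s", piece > FILENAME "." ++nb  # write the whole sentences to FILE.1, FILE.2, etc.
      piece = ""
    }
  }
  if (k) printf "%s", piece sentence > FILENAME "." ++nb   # write what remains
}'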
Running that command gives me an error:
awk: line 1: syntax error at or near ,
I tried changing the syntax near the "," but it kept giving errors. Could it be due to my version of awk?
Yes, it is: having the separators in a fourth argument given to the split function is a GNU extension. Without it (split(text, a, /[[:space:]]+/)), you can substitute " " for seps[i]:
awk -v max=5000 '{ text = text $0 "\n" } END { n = split(text, a, /[[:space:]]+/); for (i = 1; i <= n; ++i) { sentence = sentence a[i] " "; if (a[i] ~ /[.?!]$/) { piece = piece sentence; sentence = ""; j = i }; if (++k == max) { k = i - j; printf "%s", piece > FILENAME "." ++nb; piece = "" } }; if (k) printf "%s", piece sentence > FILENAME "." ++nb }'
However, you will then have the text in every file on a single line. Don't you want to install gawk?
I could do that, yes. Is it necessarily better, or might it break some other code I already have in use?
mawk is faster than gawk, but gawk has useful extensions. As far as I know, gawk can interpret everything mawk can.
Thanks. Following your suggestion I installed gawk and the above command now works, producing the files up to the maximum number of words allowed. Thank you!
Btw, I am having trouble making Solution 2 work on my files. I am almost always forced to open them in the Pluma text editor (gedit doesn't work for this, I don't know why) and save them as ISO 8859-15. Otherwise, it doesn't create the CUES and TEXT files. iconv doesn't seem to help: even after running it, Solution 2 still doesn't work, and I can't find another command to automate the conversion. Do you have any suggestion? If necessary, I can send some example files. Thanks.
Maybe that is because the subtitle was written on Windows, which uses two characters, "\r\n", for a newline. Let us add "\r":
$ awk -v RS='' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); out = out " " $0 } $NF ~ /[.?!] *\r*$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }'
This improved the situation. I have run that code on some subtitles that were previously not producing TEXT and CUES files, and now they do. However, I ran it against one file that still doesn't work; maybe it's because of some of the characters it has inside (those musical cues and such). I am attaching the file here so you can have a look; maybe you will notice something that needs changing.
Attachment | Size |
---|---|
notworking.txt | 121.08 KB |
The carriage returns are the problem. Just delete them beforehand:
$ tr -d \\r < notworking.txt | awk -v RS='' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); out = out " " $0 } $NF ~ /[.?!] *$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }'
That seems to do the trick!
I have started to compile what we have until now in the script below. Let me know what you think.
#!/bin/bash
printf "This will start by correcting the subtitle file and output the CUES and TEXT files.\nDo you also want to correct capital letters in the TEXT file? If so, type yes.\n"
read -r answer
# Delete the carriage returns, turn runs of blanks (used for positioning) into newlines,
# then merge cues until a subtitle ends with ".", "?" or "!" (Solution 2)
tr -d \\r < "$1" | sed 's/[[:blank:]]\{2,\}/\n/g' | awk -v RS='' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); out = out " " $0 } $NF ~ /[.?!] *$/ { print begin " --> " end > "cues"; print substr(out, 2) "\n" > "text"; out = "" }'
if [ "$answer" = "yes" ]; then
cp text backuptext
# Lowercase everything, then uppercase sentence starts and the pronoun "I"
sed 's/[A-Z]/\L&/g' text | sed -E 's/(^[a]|\. [a-z]|\! [a-z]|\? [a-z])/\U&\E/g' | sed 's/^\(.\)/\U\1/' | sed "s/ i / I /g;s/i'/I'/g" > tmp && mv tmp text
printf "CUES and TEXT files were created. Capital letters were corrected; check backuptext for comparison.\n"
else
printf "CUES and TEXT files were created.\n"
fi
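Saved as, say, prepare.sh (the name is of course arbitrary) and made executable, it is run with the subtitle file as its argument:
$ ./prepare.sh subtitle.srt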
One minor thing that would be useful to correct: Solution 2 doesn't recognize the end of a sentence if, by mistake, there is a space after the final period. I have gotten some test files where that was happening. As an example:
00:01 - 00:02
This is a subtitle
with a period at the end.[ ]
00:02 - 00:03
This is another line.
That [ ] represents a space after the period.
In this case, Solution 2 merges this subtitle with the next one:
00:01 - 00:03
This is a subtitle with a period at the end. This is another line.
Can this be avoided?
Another issue: like I said earlier, I still think it would be an improvement if punctuation marks were checked at the end of both the first and second lines. It will involve some calculation to divide the subtitle duration, but that shouldn't be impossible to achieve. We could try that if you are willing to.
Thanks again for all the help!
The condition $NF ~ /[.?!] *$/ should catch the lines ending with ".", "?" or "!" and any number (including zero) of spaces... and it is in "Solution 2". It works here, apparently.
EDIT: $NF ~ /[.?!]$/ works as well because, since the field separator is not being redefined, no field ends with a space.
That said, to never have to think about trailing spaces, I would remove them at the beginning, along with the carriage returns. In other terms, in the code I last sent you (to divide the subtitles whenever a line ends with ".", "?" or "!"), I would substitute the following for tr -d \\r, which only deletes the carriage returns:
sed 's/ *\r*$//'
Indeed, that seems to solve the issue!
I will incorporate that into the final version of the script.
As for translation, you are indeed right: it would be so much better if we could use Apertium. A completely offline, local solution, much more private and fully FLOSS.
However, there are only a few language pairs available and, quality-wise, it's very poor. I cannot find any other local, offline, fully FLOSS translation software that does a better job. I guess for the time being I will incorporate the preparation of files for online translation.
I have been testing some files and found an interesting example.
00:01 - 00:03
Hello there.
Hey how are you?
The above example produces a single line when running Solution 2. Shouldn't the "." at the end of the top line be considered a marker, just like the "?" at the end of the second line? I mean, it's not the same as having
00:01 - 00:03
Hello there Mr. Darcy,
how are you?
In this second example, the "." is in the middle of the line, not at the end of it.
This example could be arranged as
Hello there Mr. Darcy, how are you?
Because the first and second lines have no ".", "?" or "!" to indicate that those are two different sentences. However, the first example I posted indicates that those two phrases might be translated separately; they might even belong to different speakers (or not, but that's irrelevant, since the meaning of each phrase is self-contained).
Do you agree?
The above example produces a single line when running Solution 2.
Yes, Solution 2 only merges subtitles. It never splits a subtitle, which would require computing the division, probably proportionally to the number of characters. But should we really divide a subtitle if the one who originally wrote it decided to put the two sentences in a single subtitle (probably because they are rapidly spoken)?
I think it is the best alternative because, for subtitles with dialogue (where two or more persons speak to one another), the standard is to have each person's phrase on a different line.
Of course, if we could preserve the separate lines inside the same subtitle, it would be ideal, but I'm afraid that's not possible with the method we are using.
In the example I provided above, the line break is what indicates that different persons are speaking, without which it becomes confusing (not in every case, but still).
"Solution 2" outputs the (merged) cues and the text in paragraphs that you wanted larger for a better translation quality. Have you completely changed your mind and now want paragraphs that are smaller than a subtitle?
Not exactly.
If you remember, the original idea was to use punctuation marks as pointers to where a sentence ended and another began. We decided against that because periods might appear in the middle of a sentence in a number of different situations (Mr., Dr., etc.). So we decided to use punctuation marks as pointers only if they were at the end of a subtitle's last line (see below):
00:01 - 00:02
This is the
end of a sentence.!?
However, I think it makes sense for the same principle to be applied to the first line of the subtitle, in case it looks like this:
00:01 - 00:02
This is a sentence.!?
This is another sentence.!?
In these situations, the context of one sentence is usually not necessary for a proper translation of the other.
You are right that we should have small subtitles that are rapidly spoken appear together if possible. But unless the original maker of the subtitle included a "-" at the beginning of each spoken sentence, I don't see how we can do that. I guess we could first run a test and, if the first line was a full sentence (ending with ".", "!" or "?"), add the "-" at the beginning of the second line, which could serve as a pointer when we bring the text back together. That could also lead to false positives, though: two lines are not ALWAYS spoken by different persons. So I think that would be too much work for nothing; I would rather go with the simpler solution of checking for a ".", "!" or "?" at the end of the first line, as we are already doing for the second line.
I modified the second AWK program of srtfold (everything but the loop is copy-pasted) to divide the subtitles whenever a line ends with ".", "?" or "!":
#!/bin/sh
cat "$@" | tr -d \\r | LC_ALL=C awk -F \\n -v RS='' '
function to_sec(t) {
n = split(t, hms, /:/)
sub(/,/, ".", hms[n])
return hms[n] + 60 * hms[--n] + 3600 * hms[--n] }
function print_time() {
h = int(time / 3600)
m = int((time - 3600 * h) / 60)
s = sprintf("%02.3f", time - 3600 * h - 60 * m)
sub(/\./, ",", s)
printf "%02d:%02d:%s", h, m, s }
function print_cue(duration) {
print ++nb
print_time()
printf " --> "
time += duration
print_time() }
{
for (; $NF == ""; --NF);
split($2, interval, /-->/)
time = to_sec(interval[1])
duration = (to_sec(interval[2]) - time) / (length - length($1) - length($2) - 1)
for (i = 3; i <= NF; ++i) {
sentence = sentence "\n" $i
if (i == NF || $i ~ /[.?!] *$/) {
print_cue(length(sentence) * duration)
print sentence "\n"
sentence = "" } } }'
Well, I also added the deletion of the carriage returns we have just discussed. You would run the above script before anything else.
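For instance (the script name below is only a placeholder):
$ ./divide.sh subtitle.srt > divided.srt
$ # then apply Solution 2, the translation and srtfold to divided.srt, as before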
I think we both replied at the same time.
This works... But not entirely.
The division should be done when creating the TEXT and CUES files, for the sake of usability in multiple scenarios.
A good reason for that is that some of the test files I have (which called my attention to this) have line positioning applied to help hearing-impaired people better identify who is speaking. That is done using multiple spaces or tabs which, as you can see in my code above, I correct to a newline with \n.
For hearing people that is not necessary, since by hearing who is speaking we can identify who is saying what. Therefore, we should turn those into separate lines, as is common practice when subtitling a dialogue between two or more speakers.
For that to work, we must do the division at Solution 2 and not at srtfold.
My apologies, I totally misunderstood what you meant!
I have now tested it and I understand it is not code to be run AFTER srtfold but rather at the beginning of the processing of the subtitle file! You only mentioned that you had used some of the srtfold code. Apologies, I misunderstood you!
I will test further, but I believe this should be what we needed to make it work! :D
Thanks for the help!
As I wrote:
You would run the above script before anything else.
And, as I wrote even more recently:
I would substitute the following for tr -d \\r, which only deletes the carriage returns:
sed 's/ *\r*$//'
Like I said, it was a misunderstanding on my part, apologies for my mistake.
And yes, that solved the issue. Thanks again for the help!
I think we can do both at once, am I wrong?
You are right.
awk -v RS='\n\n\n*' -F \\n '{ printf "%s ", $3; for (i = 4; i <= NF; ++i) printf " %s ", $i; }'
That prints supernumerary spaces. Only the first space in the second printf should be present.
Also, there is no newline. Are newlines really forbidden when running the punctuation/grammar-improving programs? If those programs never add/remove words, we can save the number of words along with the cues and then be able to get the improved text back into its time interval. If they can change the number of words (for instance, transforming "can not" into "cannot"), we would really want the newlines (and those programs should keep them).
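For instance, if every line of "cues" additionally stored the number of words of the related paragraph (say, in a hypothetical "START --> END NWORDS" format we have not settled on), an untested sketch to get the improved text back into the cues could be:
$ awk -v cues=cues '
{ for (i = 1; i <= NF; ++i) word[++n] = $i }   # collect every word of the improved text
END {
  while ((getline line < cues) > 0) {
    split(line, f)                             # f[1] = start, f[3] = end, f[4] = number of words
    out = ""
    for (i = 0; i < f[4]; ++i) out = out " " word[++w]
    print ++nb "\n" f[1] " --> " f[3] "\n" substr(out, 2) "\n"
  }
}'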
Hi.
Sorry I will have to be so brief.
The punctuator tool (http://bark.phon.ioc.ee/punctuator) can receive any input, but it will only output a single block of text. So this:
"This is my first line
of text
and this is already a third line"
Will be processed and outputted as:
"This is my first line of text, and this is already a third line."
As far as I can tell, it does not add words, but it will connect "can not" into "cannot". Words that require ' don't seem to be connected ("does not" won't be converted to "doesn't", for example).
The punctuator tool is the one I am more interested in using, whereas the grammar corrector not so much (incidentally, the grammar corrector does change the words a bit, so let's leave that one aside for now).
I will try to run more texts through the punctuator tool to know exactly which words are changed; it seems to be very few. We can make a corrector for those if necessary.
I will provide more data as soon as I can. Thanks for the help again!
I can confirm that no other contractions happen: I have tried all the examples on this webpage[1] and none altered the text. I think we can just have a filter for "cannot" and be done with it (which of course might be written as "cannot" in the original subtitle text, so maybe we should start by converting it into two separate words and always work with that? See the sketch after the link below.)
[1] - https://vocabularypoint.com/complete-list-of-contractions-in-english/
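For that pre-conversion, something as simple as this sed substitution might do (untested):
$ sed 's/\<[Cc]annot\>/can not/g' text > tmp && mv tmp text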
I don't know if this can help, but here you can see the format of .srt subtitles:
1
00:05:00,400 --> 00:05:15,300
This is an example of
a subtitle.
2
00:05:16,400 --> 00:05:25,300
This is an example of
a subtitle - 2nd subtitle.
Here is an example of a subtitle file I made for https://video.hardlimit.com/w/9r5KbaChhdyrpumCit2vEb?subtitle=es
00:00:01.000 --> 00:00:05.000
Copiar no es Robar.
00:00:06.000 --> 00:00:08.000
Copiar no es Robar.
00:00:09.000 --> 00:00:12.000
Robar una cosa deja a otro sin ella
00:00:12.000 --> 00:00:14.000
Copiar algo es hacer uno más
00:00:14.000 --> 00:00:16.000
para eso sirve copiar.
00:00:17.000 --> 00:00:19.000
Copiar no es Robar.
00:00:19.000 --> 00:00:22.000
Si copio lo tuyo tu también lo tienes
00:00:22.000 --> 00:00:25.000
Uno para mi y otro para ti
00:00:25.000 --> 00:00:27.000
Eso es lo que pueden hacer las copias.
00:00:28.000 --> 00:00:30.000
Si robo tu bici
00:00:30.000 --> 00:00:32.000
has de coger el bus
00:00:32.000 --> 00:00:35.000
Pero si solo la copio
00:00:35.000 --> 00:00:37.000
¡Hay una para cada uno!
00:00:38.000 --> 00:00:40.000
hacer más de una cosa,
00:00:40.000 --> 00:00:43.000
eso es lo que llamamos "COPIAR".
00:00:44.000 --> 00:00:46.000
Compartir ideas con todo el mundo
00:00:46.000 --> 00:00:48.000
Por eso copiar es...
00:00:48.000 --> 00:00:50.000
¡DIVERTIDO!
I don't know if this helps you, but you can download it from the PeerTube video and take it as an example. I have a lot of videos with subtitles, and one in English too (about DRM).
None of those lines contains more than 45 characters. Lowering the maximum number of characters per line to 29 (that number could be an argument of a shell script calling the awk program; below 29, the line with the timestamps is broken, but testing $0 ~ /^[0-9:.]* --> [0-9:.]*$/ would fix that use case if you deem it useful), here is the output:
00:00:01.000 --> 00:00:05.000
Copiar no es Robar.
00:00:06.000 --> 00:00:08.000
Copiar no es Robar.
00:00:09.000 --> 00:00:12.000
Robar
una cosa deja a otro sin ella
00:00:12.000 --> 00:00:14.000
Copiar algo es hacer uno más
00:00:14.000 --> 00:00:16.000
para eso sirve copiar.
00:00:17.000 --> 00:00:19.000
Copiar no es Robar.
00:00:19.000 --> 00:00:22.000
Si copio
lo tuyo tu también lo tienes
00:00:22.000 --> 00:00:25.000
Uno para mi y otro para ti
00:00:25.000 --> 00:00:27.000
Eso es lo
que pueden hacer las copias.
00:00:28.000 --> 00:00:30.000
Si robo tu bici
00:00:30.000 --> 00:00:32.000
has de coger el bus
00:00:32.000 --> 00:00:35.000
Pero si solo la copio
00:00:35.000 --> 00:00:37.000
¡Hay una para cada uno!
00:00:38.000 --> 00:00:40.000
hacer más de una cosa,
00:00:40.000 --> 00:00:43.000
eso
es lo que llamamos "COPIAR".
00:00:44.000 --> 00:00:46.000
Compartir
ideas con todo el mundo
00:00:46.000 --> 00:00:48.000
Por eso copiar es...
00:00:48.000 --> 00:00:50.000
¡DIVERTIDO!
I tried running
./script.sh 29 subtitle.srt
and got this:
awk: line 11: runaway regular expression / soft_min ...
Something I did wrong?
You would have better luck with Perl.
What do you mean?
I don't know any Perl, so I am not sure what it can and cannot do. But so far Magic Banana's awk scripts have accomplished all the tasks we set forth. Why would Perl be better than awk?
Perl can do tr-, awk- and sed-like operations in a single place and, I think, in easier ways.
$ perldoc perlintro
is enough to start with (install perl-doc under Trisquel), but otherwise:
https://docstore.mik.ua/orelly/perl4/index.htm.
And I like sed and awk, though. But for "big" tools Perl is fine, and its syntax is closely related to sed/awk and the Unix shells.