Help for text editing

GNUser
Offline
Joined: 07/17/2013

Hey everyone,

I have been doing some subtitle editing and translating lately (I discovered some neat tricks; if anyone is interested, I can share them). However, I need a little help to finish what I am doing.

I have my subtitles in a single line like this:

00:01 -- 00:02
bla bla bla bla bla

00:02 -- 00:03
bla bla bla bla bla

I can use the "fold" command and give it a certain number of characters. However, it will sometimes give me this:
00:01 -- 00:02
bla bla
bla bla
bla

00:02 -- 00:03
bla bla
bla bla
bla

What I would prefer would be to have this:

00:01 -- 00:02
bla bla
bla bla bla

00:02 -- 00:03
bla bla
bla bla bla

Basically the first line should only go up to 32 characters or so, while the second line would go up to 45 before being split. How can I achieve it?

Thanks!

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

I can use the "fold" command and give it a certain number of characters. However it will some times give me this: (...) What I would prefer would be to have this: (...)

You can use the tac command before and after fold, with pipes to connect them.

Basically the first line should only go up to 32 characters or so, while the second line would go up to 45 before being split. How can I achieve it?

fold cannot do that. AWK can:
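#!/usr/bin/awk -f
# Print the words starting at field i, as long as the next word still fits in max characters.
function print_at_most(max) {
    printf "%s", $i
    l = length($i)
    for (l += length($++i); $i != "" && ++l <= max; l += length($++i))
        printf " %s", $i
    print ""
}
{
    i = 1
    print_at_most(32)      # first line: at most 32 characters
    while ($i != "")
        print_at_most(45)  # following lines: at most 45 characters
}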

Any sequence of spaces/tabs will become one single space, but I guess that is OK. What may not be OK is your specification of the problem: if there are a little more than 32 characters, do you really want most of them on the first line, or would you actually prefer the second line to be the longer one?

GNUser

Hey there.

Thanks for the reply.
I must have done something wrong: I copied your script into gedit, saved it, and gave it permission to run as a program, but it spits out a couple of errors about awk.

Regarding the question you posed, you are indeed correct. The most beneficial way would be to count words up to a certain number and after that place them in the first line, making it so that the first line is used for as little text as possible.
Do you think awk is the right tool for that?

Magic Banana


it spits a couple errors about awk.

What are those errors?

so that the first line is used for as little text as possible.

Have you tried my first suggestion, tac | fold -sw 45 | tac?

GNUser

Sorry, didn't post all the information.

Running tac, fold, tac as you suggested inverted the order of the lines (as in, the last subtitle of the file became the first and the first became the last).

The errors were at first that it didn't recognize /bin/awk and when I deleted that first commented line, these appeared:

syntax error near unexpected token `max'
`function print_at_most(max) {'

But I know I do have awk installed because I have been using it in other commands.

Magic Banana


the last subtitle of the file became the first and the first became the last

The second tac should solve that... but the lines together on the screen would be in the reverse order. I will adapt the AWK program.

The errors were at first that it didn't recognize /bin/awk

It should indeed be /usr/bin/awk. I corrected the problem above.

syntax error near unexpected token `max'
`function print_at_most(max) {'

Weird. Neither gawk nor mawk gives me that error. What version of AWK do you have? awk -V should tell (that is gawk's option; mawk answers to awk -W version instead).
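
A side note on the error quoted above: "syntax error near unexpected token" is how bash words its own parse errors, so it is likely that, once the #! line was deleted, the file was being read by the shell instead of by awk. A minimal sketch of a way around that, assuming nothing about the actual script: the program below is a made-up toy (the function shout and the temporary file are hypothetical), and it is handed to the interpreter explicitly with -f.

```shell
# Hypothetical toy program (not the script from this thread), saved without a #! line.
prog=$(mktemp)
cat > "$prog" <<'EOF'
function shout(s) { return toupper(s) "!" }
{ print shout($1) }
EOF

# Name the interpreter explicitly: the shell never tries to parse the AWK code.
printf 'hello world\n' | awk -f "$prog"
```

Run this way, the file needs neither a #! line nor the execute permission; the same form, awk -f program_file subtitle_file, should work for the script posted earlier.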

GNUser

awk -V doesn't give any version information. But on apt search I found this:

mawk/etiona,now 1.3.3-17ubuntu3 amd64

GNUser

Thanks! It seems to be working now, lines are in the correct order and indeed the first line goes up to 32, and the second up to 45. I can easily play around with the exact number of characters if I need (instead of 32 it can be 34 or something else).
There is the issue that you mentioned, sometimes I get this:

00:01 - 00:02
bla bla
bla

Which would be better if it were the other way around... This solution works very well in long lines, but not as well in short lines. If I have this:

00:01 - 00:02
bla bla
bla bla bla

It is a good solution. But if the line is shorter, it should make the second line longer than the first one. I wonder if that's too much for awk (since it cannot differentiate between the two)?

Still, it's a better solution than what I had before, so thank you a lot for the help you already provided!

Magic Banana


It is a good solution. But if the line is shorter, it should make the second line larger compared to the first one. I wonder if that's too much for awk (since it cannot differentiate between the two??)

The code below fills the lines in the reverse order with up to 45 characters (the first line is usually smaller):
#!/usr/bin/awk -f
{
    # Build the output from the last word backwards, so that the last line is filled first.
    for (out = $NF; NF; ) {
        for (l = length($NF) + length($--NF); NF && ++l <= 45; l += length($--NF))
            out = $NF " " out
        if (NF)
            out = $NF "\n" out
    }
    print out
}

GNUser

Thank you once again, that again shows an improvement. I did get some funny results like the one below

00:01 -- 00:02
I
never realised how far they were going until

00:02 -- 00:03
everything came down
crashing which was not a pleasant thing.

That "I" shouldn't be left alone, but I realise there is only one way to make it work. The line would have to be measured first.
If it's shorter than, say, 40 characters, it remains on one single line.
If it's between, say, 40 and 55 characters, it breaks into two even lines (each with 26 characters, for example).
If it's longer than 55 characters, it will break into as many lines of 40 characters as necessary.
The only way I see it working is with a series of "if" tests, which I am sure you will think is unnecessary. I guess awk will have a better way to do it, right? :P

Thanks again for your help, and thanks also to iShareFreedom for trying to help with some examples!

EDIT: Of course, there could still be lines so long that it would again produce an isolated "I" in the first line and two very long lines below, but I think at that point those should be split between two separate subtitle lines. That would be a topic for another program, I think...
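
The three rules listed above can be sketched quite literally as a series of tests. This is only an illustration under assumptions, not the approach taken in the replies: the names wrap_by_length, one_max, two_max, greedy and halve are made up, the thresholds are the 40 and 55 from the post, and cue lines are recognised merely by containing "-->".

```shell
wrap_by_length() {
  # one_max and two_max are the thresholds from the post (40 and 55); change to taste.
  awk -v one_max=40 -v two_max=55 '
    # Greedy wrap: start a new line when the next word would not fit in width.
    function greedy(width,    i, line, out) {
      line = $1
      for (i = 2; i <= NF; ++i) {
        if (length(line) + 1 + length($i) <= width)
          line = line " " $i
        else {
          out = out line "\n"
          line = $i
        }
      }
      return out line
    }
    # Two lines: break at the space that leaves both halves closest in length.
    function halve(    i, left, best, diff, bestdiff) {
      bestdiff = length($0)
      for (i = 1; i < NF; ++i) {
        left = (i == 1) ? $1 : left " " $i
        diff = length($0) - 1 - 2 * length(left)
        if (diff < 0)
          diff = -diff
        if (diff < bestdiff) {
          bestdiff = diff
          best = left
        }
      }
      return best "\n" substr($0, length(best) + 2)
    }
    /-->/ || NF < 2 { print; next }      # cues, blank lines, single words: unchanged
    { $1 = $1 }                          # squeeze runs of blanks into single spaces
    length($0) < one_max { print; next }
    length($0) <= two_max { print halve(); next }
    { print greedy(one_max) }
  ' "$@"
}
```

Called as wrap_by_length file.srt, it leaves short lines alone, halves the medium ones, and wraps the long ones greedily, which means the last line of a long subtitle can still end up with a single word.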

Magic Banana


Try that, calling the script with the maximal number of characters per line (40 or so):
#!/bin/sh
if [ "$1" -ge 29 ]
then
max=$1
shift
awk -v max=$max '{
for (out = $NF; NF; ) {
soft_min = (length * (max - 1) / max + 1) / max
if (soft_min != int(soft_min))
soft_min = int(soft_min) + 1
soft_min = length / soft_min
for (l = length($NF) + length($--NF); NF && ++l <= soft_min; l += length($--NF))
out = $NF " " out
if (NF) {
if (l <= max) {
out = $NF " " out
--NF }
out = $NF "\n" out } }
print out }' "$@"
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
max_char_per_line must be at least 29.
"

It is not perfect (I suspect the exact problem of balancing the line lengths without cutting words is NP-hard) but may be good enough.
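
To see what the soft_min lines compute, the arithmetic can be run on its own. This is a sketch based on my reading of the script: the first value appears to be the number of output lines (rounded up) and the second the target length of each line. The function name target_length and the variables len and n are made up for the illustration.

```shell
target_length() {
  # $1: length of the subtitle text, $2: maximal number of characters per line
  awk -v len="$1" -v max="$2" 'BEGIN {
    n = (len * (max - 1) / max + 1) / max   # same expression as in the script
    if (n != int(n))
      n = int(n) + 1                        # round up to a whole number of lines
    print n, len / n                        # number of lines, target length per line
  }'
}

target_length 60 40
```

For a 60-character subtitle and a maximum of 40, this prints "2 30": two lines of about 30 characters each, rather than one line of 40 followed by one of 20.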

GNUser

Sorry for double post, I meant to reply to your comment...

Tried running

./script.sh 29 subtitle.srt

and got this:

awk: line 11: runaway regular expression / soft_min ...

Something I did wrong?

Magic Banana


Compared to gawk, mawk requires additional parentheses. That should be portable:
#!/bin/sh
if [ "$1" -ge 29 ]
then
max=$1
shift
awk -v max=$max '{
for (out = $NF; NF; ) {
soft_min = (length * (max - 1) / max + 1) / max
if (soft_min != int(soft_min))
soft_min = int(soft_min) + 1
soft_min = length() / soft_min
for (l = length($NF) + length($--NF); NF && ++l <= soft_min; l += length($--NF))
out = $NF " " out
if (NF) {
if (l <= max) {
out = $NF " " out
--NF }
out = $NF "\n" out } }
print out }' "$@"
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
max_char_per_line must be at least 29.
"

GNUser

It's working now.
And I have to say, the results are far better than what I had before! Thank you so much!
Like you said, knowing the perfect place to break a line of a subtitle is a very complicated issue: it would require analyzing punctuation, the meaning of the phrase, etc. But... at least now the divisions are better balanced and there are no "orphan words" left behind on a single line, while at the same time allowing for a longer line when necessary. A perfect solution for quick automatic editing!

Thank you so much Magic Banana, don't know how to thank you enough!

GNUser

One more question: do you know of any simple way to find out how many characters the longest line in a file has?

That way if the longest line is below 80 characters, I could use 40, but if it's longer than that I would use 45 or 50 (to avoid having 3 lines as much as possible).

Ideally subtitles shouldn't be longer than 32 characters, but 40 is still acceptable for most people. Between 40 and 50 should only be used when it's absolutely necessary to avoid 3 lines instead of 2 (blocking too much of the picture and making it harder to read).
I don't advise going longer than 50, most TV players won't show all the characters that way (software players like VLC won't have a problem of course).

Magic Banana


In AWK, that would be:
$ awk 'length > max { max = length } END { print max }'
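
The one-liner can also feed the choice described in the question. A rough sketch, with assumptions spelled out: pick_max is a made-up name, cue lines are skipped by looking for "-->", and the thresholds (40 when the longest line is below 80 characters, 50 otherwise) are one reading of the numbers given in the question.

```shell
pick_max() {
  awk '
    /-->/ { next }                                   # skip cue lines
    length($0) > longest { longest = length($0) }    # remember the longest line
    END { print (longest < 80 ? 40 : 50) }           # assumed thresholds, see above
  ' "$@"
}
```

The result can then be passed on as the first argument, for instance ./script.sh "$(pick_max subtitle.srt)" subtitle.srt.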

But here is a simpler version of the script to never have more than two lines:
#!/bin/sh
if [ "$1" -ge 29 ]
then
max=$1
shift
awk -v max=$max '{
if (length > max) {
out = $NF
min = length() / 2
for (l = length($NF) + length($--NF); ++l < min; l += length($--NF))
out = $NF " " out
out = $NF " " out
--NF
print $0 "\n" out }
else
print }' "$@"
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
max_char_per_line must be at least 29.
"

GNUser

Thanks once again!
I think this can be useful, but it still gives us lines longer than 50 characters in order not to go beyond two lines. Many TV players will have issues with that. This only happens when a line is above 100 characters, and unfortunately I do have such a file. I just did some testing, and the conclusion is that the best way to deal with these longer lines would be to calculate the duration of the subtitle (end time - start time = duration of line), split the subtitle line in two, and process each half individually up to a certain number of characters. As follows:

00:01 -- 00:06
This is such a long line of text that even with your script
each line is still longer than 50 characters in total.

Would become this:

00:01 -- 00:03
This is such a long line of
text that even with your script

00:03 -- 00:06
each line is still longer
than 50 characters in total.

This would require messing with the timestamp of the subtitles as well as the total numbering. I think I have seen people use "sed" for changing the timestamps format, do you think awk could do it as well?

Magic Banana


AWK is far more appropriate than sed to do arithmetic (here to compute the additional timestamps). Post-processing the output of the script that may produce more than two lines:
#!/bin/sh
if [ "$1" -ge 29 ]
then
max=$1
shift
awk -v max=$max '{
for (out = $NF; NF; ) {
soft_min = (length * (1 - 1 / max) + 1) / max
if (soft_min != int(soft_min))
soft_min = int(soft_min) + 1
soft_min = length() / soft_min
for (l = length($NF) + length($--NF); NF && ++l <= soft_min; l += length($--NF))
out = $NF " " out
if (NF) {
if (l <= max) {
out = $NF " " out
--NF }
out = $NF "\n" out } }
print out }' "$@" | awk -F \\n -v RS='\n\n' '
function to_sec(t) {
s = 0
n = split(t, hms, /:/)
for (c = 1; n; c *= 60)
s += c * hms[n--]
return s }
function print_time() {
h = int(time / 3600)
m = int((time - 3600 * h) / 60)
printf "%02d:%02d:%02.3f", h, m, time - 3600 * h - 60 * m }
{
split($1, interval, / --> /)
time = to_sec(interval[1])
per_pair = 2 * (to_sec(interval[2]) - time) / (NF - 1)
for (i = 2; i < NF; ++i) {
print_time()
printf " --> "
time += per_pair
print_time()
print "\n" $i "\n" $++i "\n" } }
i == NF {
print_time()
print " --> " interval[2] "\n" $i "\n" }'
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
max_char_per_line must be at least 29.
"

GNUser

This version of the script is not working correctly, but I believe I know why. The problem is related to the usual syntax of an SRT file, which has numbering in each line and a , instead of a . between seconds and milliseconds (which I think you didn't think was important to include, but could eventually lead to overlapping subtitles with the previous line). I have an example below:

"
1
00:06:13,950 --> 00:06:21,150
This is the first line, as you can see the lines are numbered and there are milliseconds, which means the time timestamp is in the format "HH:MM:SS,MSS".

2
00:06:21,400 --> 00:06:26,300
It's very important to notice that between SS and MSS we have a , and not a . or a :

3
00:06:26,350 --> 00:06:28,550
Now let's see what happens when we run the script.
"

After running the script, it becomes:

"
00:00:1.000 --> 00:00:0.667
00:06:13,950 --> 00:06:21,150
This is the first line, as you

00:00:0.667 --> 00:00:0.333
can see the lines are numbered
and there are milliseconds,

00:00:0.333 --> 00:00:0.000
which means the time timestamp
is in the format "HH:MM:SS,MSS".

00:00:2.000 --> 00:00:1.000
00:06:21,400 --> 00:06:26,300
It's very important to

00:00:1.000 --> 00:00:0.000
notice that between SS and MSS
we have a , and not a . or a :

00:00:3.000 --> 00:00:1.000
00:06:26,350 --> 00:06:28,550
Now let's see what

00:00:1.000 -->
happens when we run the script.
"

The numbering before the times is messing up your script, and the milliseconds are gone (along with the very important , that is unique to the SRT syntax).

If we process the original file to look like below:

"
00:06:13,950 --> 00:06:21,150
This is the first line, as you can see the lines are numbered and there are miliseconds, which means the time timestamp is in the format "HH:MM:SS,MSS".

00:06:21,400 --> 00:06:26,300
It's very important to notice that between SS and MSS we have a , and not a . or a :

00:06:26,350 --> 00:06:28,550
Now let's see what happens when we run the script.

"

We get the perfectly processed output below (though the milliseconds are still missing, not very troublesome but since awk can calculate the entire timestamp I believe including the milliseconds should not be the biggest problem).

"
00:06:13.000 --> 00:06:16.200
This is the first line, as you
can see the lines are numbered

00:06:16.200 --> 00:06:19.400
and there are miliseconds,
which means the time timestamp

00:06:19.400 --> 00:06:21,150
is in the format "HH:MM:SS,MSS".

00:06:21.000 --> 00:06:24.333
It's very important to
notice that between SS and MSS

00:06:24.333 --> 00:06:26,300
we have a , and not a . or a :

00:06:26.000 --> 00:06:28.000
Now let's see what
happens when we run the script.

"

Think this is something easy to fix?
I really find the output to be a high quality solution, but it breaks the necessary syntax for many TV players (VLC again doesn't have any issue playing this, given how superior software solutions seem to be). Thanks for all your help!

Magic Banana


Let us try that:
#!/bin/sh
if [ "$1" -ge 29 ]
then
max=$1
shift
awk -v max=$max '{
for (out = $NF; NF; ) {
soft_min = (length * (1 - 1 / max) + 1) / max
if (soft_min != int(soft_min))
soft_min = int(soft_min) + 1
soft_min = length() / soft_min
for (l = length($NF) + length($--NF); NF && ++l <= soft_min; l += length($--NF))
out = $NF " " out
if (NF) {
if (l <= max) {
out = $NF " " out
--NF }
out = $NF "\n" out } }
print out }' "$@" | awk -F \\n -v RS='\n\n' '
function to_sec(t) {
s = 0
n = split(t, hms, /:/)
for (c = 1; n; c *= 60)
s += c * hms[n--]
return s }
function print_time() {
h = int(time / 3600)
m = int((time - 3600 * h) / 60)
printf "%02d:%02d:%02.3f", h, m, time - 3600 * h - 60 * m }
{
for (; $NF == ""; --NF);
gsub(/,/, ".", $2)
split($2, interval, / --> /)
time = to_sec(interval[1])
per_pair = 2 * (to_sec(interval[2]) - time) / (NF - 2)
for (i = 3; i < NF; ++i) {
print ++id
print_time()
printf " --> "
time += per_pair
print_time()
print "\n" $i "\n" $++i "\n" } }
i == NF {
print ++id
print_time()
print " --> " interval[2] "\n" $i "\n" }'
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
max_char_per_line must be at least 29.
"

GNUser

You know, maybe you should try solving cancer, Covid, and the Russia-Ukraine war with an awk script... and throw in the meaning of life while you are at it (42) :D

That script works beautifully well, it completely rearranged all the files I threw at it with a remarkable quality result! Sure, a human could do better... but also a lot of humans do worse everyday, so that script is in my opinion a work of art!

One minor thing though, the . between seconds and milliseconds should really be a ,
Just so all players will accept the syntax.

I tried changing the line
printf "%02d:%02d:%02.3f"

to

printf "%02d:%02d:%02,3f"

But it gives an error:

awk: run time error: improper conversion(number 3) in printf("%02d:%02d:%02,3f")
FILENAME="-" FNR=1 NR=1

As you have noticed I don't understand awk, would you mind showing the correct way? Thanks!

Magic Banana


Besides that fix (I used sprintf and substituted /\./ for "," in the returned string):

  • the argument can now be as low as 2, because the first AWK program now detects the cues and outputs them unmodified;
  • the syntax of those cues can have additional/missing spaces, nothing instead of the "00" hour, dots instead of commas separating seconds from milliseconds: the second AWK program corrects all those issues;
  • possible additional blank lines around the subtitles are removed (newlines in the actual subtitles are always kept);
  • for internal coherence of the math, the "soft_min" number of characters per line may now be one character smaller than what it used to be;
  • divided subtitles (because they are three or more lines long) have a duration that is proportional to the number of characters;
  • I did some cleanup in the second AWK program;
  • a 3-line documentation.
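
The sprintf-and-substitute fix mentioned at the top of this list can be tried on its own. A minimal sketch, with a made-up function name (srt_time) and a format chosen for the illustration: the seconds are written with %06.3f, that is six characters in total, zero-padded, with three decimals, so that values under ten seconds keep their leading zero; the dot is then replaced by the comma that SRT expects.

```shell
srt_time() {
  # $1: a number of seconds, e.g. 373.95
  LC_ALL=C awk -v time="$1" 'BEGIN {
    h = int(time / 3600)
    m = int((time - 3600 * h) / 60)
    # %06.3f: six characters in total, zero-padded, three decimals (e.g. 03.950).
    s = sprintf("%06.3f", time - 3600 * h - 60 * m)
    sub(/\./, ",", s)        # SRT wants a comma before the milliseconds
    printf "%02d:%02d:%s\n", h, m, s
  }'
}

srt_time 373.95
```

The call above prints 00:06:13,950. LC_ALL=C is there so that awk always writes a dot as the decimal separator before the substitution, whatever the locale.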

Here is the result:
#!/bin/sh
if [ "$1" -gt 1 ]
then
max=$1
shift
awk -v max=$max '{
if ($0 ~ /^ *[0-9:,.]* *-->/)
print
else {
for (out = $NF; NF; ) {
soft_min = (length * (1 - 1 / max) + 1) / max
if (soft_min != int(soft_min))
soft_min = int(soft_min) + 1
soft_min = (length + 1) / soft_min - 1
for (l = length($NF) + length($--NF); NF && ++l <= soft_min; l += length($--NF))
out = $NF " " out
if (NF) {
if (l <= max) {
out = $NF " " out
--NF }
out = $NF "\n" out } }
print out } }' "$@" | LC_ALL=C awk -F \\n -v RS='\n\n\n*' '
function to_sec(t) {
n = split(t, hms, /:/)
sub(/,/, ".", hms[n])
return hms[n] + 60 * hms[--n] + 3600 * hms[--n] }
function print_time() {
h = int(time / 3600)
m = int((time - 3600 * h) / 60)
s = sprintf("%06.3f", time - 3600 * h - 60 * m)
sub(/\./, ",", s)
printf "%02d:%02d:%s", h, m, s }
function print_cue(duration) {
print ++nb
print_time()
printf " --> "
time += duration
print_time() }
{
for (; $NF == ""; --NF);
split($2, interval, /-->/)
time = to_sec(interval[1])
duration = (to_sec(interval[2]) - time) / (length - length($1) - length($2) - 1)
for (i = 3; i < NF; ++i) {
print_cue((length($i) + length($(i + 1)) + 2) * duration)
print "\n" $i "\n" $++i "\n" } }
i == NF {
print_cue((length($i) + 1) * duration)
print "\n" $i "\n" }'
exit
fi
printf "Usage: $0 max_char_per_line [file.srt]...
Approximately-evenly break too-long lines in .srt subtitles into lines
of at most max_char_per_line characters (except for single words) and
always have at most two lines on the screen.
"

GNUser

Damn... I am amazed, this is marvelous!
The script did a wonderful job with everything I threw at it! I even did some testing seeing if I could "fool" it to make a mistake, and it passed gracefully! All the files I threw at it were perfectly converted into extremely useful, newly timed, gracefully lined, new files.
Why, you even went a step further and perfected the line splitting with time vs character count! That is so important for some people! (a question, would it be better to use word counting instead of character counting? I will further investigate this, not sure what is easier for people to read)
Thank you so much for your help, you deserve all the thumbs up in the world!

You know, I started meddling with subtitles because I am trying to help people around me to learn new things by themselves. Watching documentaries, speeches, presentations, interviews, news broadcasts from other countries, these are all great ways to learn about topics that one might be interested in, but without subtitles many people just can't. Either they don't speak the language (so many friends and family of mine don't even speak English) or they are hearing impaired (fully or partially deaf); so many people around the world depend on captions/subtitles.

And there are even people (an old friend of mine is such an example) who have not only a hearing problem, but also a seeing problem. So being able to fiddle a bit with the output configuration of the subtitle file is very important. Some people read better with long lines. Others prefer short sentences with just a couple of words at a time.

Your script actually makes it a lot easier to prepare subtitles for people, and I can even teach some of them to do things themselves! Maybe a good way to show them how powerful this "Free Software thing" really is ;)

I still have some ideas that I am trying to implement regarding subtitle/captions editing. I have been trying to get by with tools that I could find online, but none are working fully.
Would you mind giving me a hand with some of these? I will show everything that I have gathered so far and try my best to help (even though awk scripting is really far out of my comfort zone!). In other words, I will carry my own weight but I could really use your expertise with awk for the text editing part ;)

Either way, thank you so much for all the help you already provided!!

Magic Banana


I published the script on https://dcc.ufmg.br/~lcerf/en/utilities.html#srtfold

Would you mind giving me a hand with some of these?

If I can, probably.

awk scripting is really far out of my comfort zone!

AWK is really a simple language. Much easier than general-purpose programming languages (although AWK is Turing-complete, you only want to consider it to process input text and to output text). If you already know regular expressions (even better: sed substitutions) and the basics of the C syntax (tests, loops, Boolean operators, defining variables, printf, etc.) and avoid (for the moment) AWK's more complicated features (getline, system calls, etc.), you can learn most of AWK in a few hours (practicing as long as you learn). Here are my slides to do so: https://dcc.ufmg.br/~lcerf/slides/mda6.pdf

GNUser

(your website gives a warning on SSL certificate, not sure if you were aware)

Thanks!
I have to say while I have made some scripts of my own (even shared some here on the Forum) I have always looked at awk syntax and got lost. But after seeing what you could do with awk I am again tempted to learn it, when I find the time for it. Thanks for sharing the slides :)
Cool that you decided to share the script in a proper manner, that's very important and a positive thing to do!

Back to the subtitles topic: as I have stated previously, I started by translating already existing subtitles into other languages. I found out this works MUCH better if you translate a single line instead of two (as is usual in subtitles), because context is applied to the entire sentence, which makes quite an improvement. I found a way to easily create single-line subtitles from subtitles that had two or more lines under each timestamp. After that (and especially with translation sometimes adding more words), I needed a way to break those long lines back into the more standard length, which your script does an excellent job at.
With current tools, we can now easily have an improved translation with decent line length (even adjust it to each person's specific needs, as stated previously), which is already good.

Next step is working on videos that have never been subtitled and using speech-to-text to provide subtitles. These are not highly accurate usually, but provided certain conditions are met (single, native speaker, calm pace, silent environment) the results are very usable to get something good. Right now I am using youtube auto-captions to test things, but in the future may look into other options.

What do we need to turn youtube auto-captions into proper subtitles?
Here is what I have done manually and found to work quite well. The only thing we need now is to make it automatic using a script (awk seems to be the best way to do it!):

1. Turn the captions into a single block of text. I have found some tools that do it quite well, so that part is covered. We will still need the original timings later on.
2. Improve the single block of text with automatic punctuation and grammar correction. I found an open source solution for punctuation [http://bark.phon.ioc.ee/punctuator] and have tried an online tool for grammar [www.online-spellcheck.com]. These are manual steps, but I believe I can automate them using curl and bash. I still haven't tried it though. Other languages might need other tools; I am only working with English texts right now.
3. Put the improved text back into the original timestamps. This is where I need some help. I think we can achieve it by counting words back into place (the first line had 4 words, so it receives 4 words again) and taking punctuation along. These .,!?:; will always be counted as part of the word before them, whereas - should be counted as part of the word after it.
4. We now have a couple of words together without proper sentence structure. The way to get that structure is to start joining different lines and adding their durations until one of two things occurs: either the line reaches a certain duration (5 seconds? 10 seconds? It should be a variable that we can adjust to each video) or the line contains a punctuation breaker, which could be any of .,!?:;-
If such a condition is met, a new line is created and the process starts again with the next word as the first on the new created line and duration being again calculated from scratch. If the break happens in the middle of a line, we can use the same heuristics you used before to give each word a certain amount of time and divide the sentence into words having different durations. Lines that become too long will not be a problem because we can then use your previous script to break them into smaller pieces. AFTER we translate the text into other languages, because as I said translating larger sentences works better than shorter ones.

Below I provide an example of what happens after I manually do this to the auto-captions of a video where RMS explains the origins of GNU/Linux.

Original:
1
00:00:11,920 --> 00:00:20,169
well what's GNU plus Linux in 1984 I

2
00:00:16,970 --> 00:00:24,619
began developing an operating system

3
00:00:20,169 --> 00:00:30,619
which is a free software replacement for

4
00:00:24,619 --> 00:00:36,080
Unix now UNIX in 1984 had hundreds of

5
00:00:30,619 --> 00:00:38,750
components so developing a replacement

6
00:00:36,080 --> 00:00:42,079
mean meant developing a free replacement

7
00:00:38,750 --> 00:00:48,620
for every one of those components except

8
00:00:42,079 --> 00:00:51,770
a few we could do without so in 1992 we

9
00:00:48,620 --> 00:00:54,440
had almost the entire GNU system but

10
00:00:51,770 --> 00:00:56,960
one essential important component was

11
00:00:54,440 --> 00:00:59,660
missing that component is the kernel

---

What it can become (more or less):
1
00:00:11,920 --> 00:00:20,169
Well, what's GNU plus Linux,

2
00:00:16,970 --> 00:00:24,619
in 1984, I began developing an operating system,

3
00:00:20,169 --> 00:00:30,619
which is a free software replacement for Unix.

4
00:00:24,619 --> 00:00:36,080
Now UNIX in 1984 had hundreds of components,

5
00:00:30,619 --> 00:00:38,750
so developing a replacement mean meant

6
00:00:36,080 --> 00:00:42,079
developing a free replacement for every one of those components

7
00:00:38,750 --> 00:00:48,620
except a few we could do without so.

8
00:00:42,079 --> 00:00:51,770
In 1992,

9
00:00:48,620 --> 00:00:54,440
we had almost the entire GNU system,

10
00:00:51,770 --> 00:00:56,960
but one essential important component was missing.

11
00:00:54,440 --> 00:00:59,660
That component is the kernel.

Of course I did this manually in a few minutes, a script would do a slightly different job and be able to do an entire video in seconds, but still the idea here is that we can get a grammatically correct text that anyone can understand and be easily translated to other languages allowing more people to access information freely.

So, what I need help with is taking the words back from being a single block of text into being subtitles again. After that we can work on getting sentences together that make sense. What do you think? Does awk seem to be the proper tool for this editing?

Thanks in advance for any help you might provide!

Magic Banana


your website gives a warning on SSL certificate, not sure if you were aware

I only have control over https://dcc.ufmg.br/~lcerf/ and not over https://dcc.ufmg.br, hence not over the certificate.

1. Turn the captions into a single block of text.

That is a rather good first problem to solve in AWK:
$ awk -v RS='\n\n\n*' -F \\n '{ printf "%s\n%s\n%s", $1, $2, $3; for (i = 4; i <= NF; ++i) printf " %s", $i; print "\n" }'

Let me teach you through that example:

  • AWK processes "records" that are split into "fields". By default, a record is a line and sequences of spaces/tabs separate the fields. Those defaults are not good here. We want a record to be a subtitle and blank lines separate them. We want a field to be a line. To have that, we redefine the record separator, in the variable RS, and the field separator, in the variable FS. Those variables can contain any regular expression (for RS, that is an extension that gawk and mawk provide; POSIX only guarantees a single character). One way to (re)define a variable is through option -v. That is what I used to redefine RS as "\n\n\n*": at least one blank line (two newlines). I could have written "-v FS=\\n" too. However, to write less (it is common to redefine the field separator), there is option -F.
  • An AWK program is a series of pairs condition-action. For each record satisfying the condition, the action, between braces, is executed. If the condition is omitted, as in the program above, the action is applied on each record (if, instead, the action is omitted, the default action is to output the record unchanged).
  • Any action is a sequence of instructions, separated by newlines or semicolons.
  • Several variables are automatically defined right after a new record is read. In particular, NF is its number of fields, $1 is its first field, $2 its second field, ..., $NF (we can literally write that) its last field.
  • Here, the first instruction prints the first three fields (the number, the cue and the first line of subtitle) with a newline between them (and no newline after the third field). printf is used, much like C's.
  • Then, a for loop, again with a C-like syntax, enumerates the integers from 4 to NF and the related field is output with a space (instead of a newline, in the input) before. Notice that no iteration happens if NF == 3 (one single line of subtitle in the input).
  • At the end of the treatment of each record, print "\n" actually prints two newlines because print (not printf) appends the output record separator, ORS, which is by default "\n". Also, in a print statement, a comma concatenates the output field separator, OFS, which is a space by default, but that program does not need that.
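
To see the one-liner at work, here is a minimal sketch on a made-up input with two subtitles. The sample text and the shell variable out are only for illustration, and a regular expression in RS assumes gawk, mawk or another AWK that supports it. The input ends with a blank line, so the last record has no trailing newline:

```shell
# Hypothetical sample: two numbered subtitles, each with two lines of text.
out=$(printf '%s\n' \
    1 '00:00:01,000 --> 00:00:02,000' 'This is a' 'multi line subtitle' '' \
    2 '00:00:02,000 --> 00:00:03,000' 'with two lines' 'in each subtitle' '' |
  awk -v RS='\n\n\n*' -F '\n' '{
    printf "%s\n%s\n%s", $1, $2, $3               # number, cue, first line of text
    for (i = 4; i <= NF; ++i) printf " %s", $i   # remaining lines, each after a space
    print "\n"                                    # "\n" plus ORS: ends the line, adds a blank line
  }')
printf '%s\n' "$out"
```

The third line of the output should read "This is a multi line subtitle": the two lines of text of the first subtitle, joined with a space.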

So, what I need help with is taking the words back from being a single block of text into being subtitles again.

As far as I understand, you first want to separate the input into two files, one with the cues and another one with the actual subtitles. Besides what I taught above, I believe you only need to know that the output of print or printf can be redirected to a file, Shell's way:
print "blablabla" > "file"
If the file named "file" exists, the first line written to it overwrites the file (to append, >> must be used instead of >). The subsequent redirections with > append (unlike in Shell). Try to separate the cues from the actual subtitles by yourself! :-)
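
The redirection behavior described above can be checked with a small experiment. This is only a sketch: the file name demo.txt and the scratch directory are placeholders.

```shell
cd "$(mktemp -d)"                   # scratch directory, so that nothing real is overwritten
printf 'old content\n' > demo.txt   # a file that already exists
awk 'BEGIN {
    print "first"  > "demo.txt"     # first redirection: truncates demo.txt
    print "second" > "demo.txt"     # same file, still open: appends
}'
cat demo.txt
```

The file ends up with the two lines "first" and "second"; "old content" is gone.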

GNUser
Offline
Joined: 07/17/2013

First thank you for all the explanation, very clear and helpful!
Now on to the problem at hand itself:

1. Turn the captions into a single block of text.

What the awk code you wrote does is convert this:

00:01 - 00:02
This is a
multi line subtitle

00:02 - 00:03
with two lines
in each subtitle

into this:

00:01 - 00:02
This is a multi line subtitle

00:02 - 00:03
with two lines in each subtitle

Which is helpful but I already have a tool for that ;)
My problem starts when I need to put things back together!

Still it was indeed a good example to teach from, and I thank you for that.
However, the idea is to remove the timestamps and get the entire text together like this:

"
This is a multi line subtitle with two lines in each subtitle
"

I think we can do both at once, am I wrong?

Anyway, here is my attempt at removing the timestamps and compiling the entire text into a single block:

awk -v RS='\n\n\n*' -F \\n '{ printf "%s ", $3; for (i = 4; i <= NF; ++i) printf " %s ", $i; }'

It's working on my testing, but let me know if it's correct or not. Thanks!

GNUser
Offline
Joined: 07/17/2013

Wait I think I got it!
I was getting ahead of myself... we need to generate two files first, one with only the timestamps!

That would mean that after running your script, we would get:

line1: number
line2: timestamp
line3: text

So after that we need to run:

awk -v RS='\n\n\n*' -F \\n '{ printf "%s\n", $2; for (i = 4; i <= NF; ++i) printf " %s", $i; print "\n" }'

On top of the file we created before! It worked! :)

EDIT: Can't we do what we do in bash and do "script_code1 | script_code2" and the output of script_code1 serves as the input for script_code2 ?

Magic Banana

Offline
Joined: 07/24/2010

awk -v RS='\n\n\n*' -F \\n '{ printf "%s\n", $2; for (i = 4; i <= NF; ++i) printf " %s", $i; print "\n" }'

If there are only three fields, the action can simply be '{ print $2 "\n" }'. The space between the two strings, $2 and "\n", concatenates them.
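
For instance, on a hypothetical input where every subtitle has exactly three lines (number, cue, text), that shorter action could be tried as follows. The sample and the shell variable cues are made up for the illustration:

```shell
cues=$(printf '%s\n' \
    1 '00:00:01,000 --> 00:00:02,000' 'This is a multi line subtitle' '' \
    2 '00:00:02,000 --> 00:00:03,000' 'with two lines in each subtitle' '' |
  awk -v RS='\n\n\n*' -F '\n' '{ print $2 "\n" }')   # $2 "\n" is one string: the cue, then a newline
printf '%s\n' "$cues"
```

Each cue is followed by a blank line, because print appends ORS after the concatenated string.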

My idea was to do everything reading the input once:

  • in one file (named "text" below), have on single lines the text in each time interval;
  • in another file (named "cues" below), with the same number of lines, have the related cues.

Something like that:
$ awk -v RS='\n\n+' -F \\n '{ print $2 > "cues"; printf "%s", $3 > "text"; for (i = 4; i <= NF; ++i) printf " %s", $i > "text"; print "" > "text" }'
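
A sketch of a test run, on a made-up sample and in a scratch directory (both are assumptions for the illustration, as is the use of paste to show the two files side by side):

```shell
cd "$(mktemp -d)"   # scratch directory for the two output files
printf '%s\n' \
    1 '00:00:01,000 --> 00:00:02,000' 'This is a' 'multi line subtitle' '' \
    2 '00:00:02,000 --> 00:00:03,000' 'with two lines' 'in each subtitle' '' |
  awk -v RS='\n\n+' -F '\n' '{
    print $2 > "cues"                                     # one cue per line
    printf "%s", $3 > "text"                              # first line of text
    for (i = 4; i <= NF; ++i) printf " %s", $i > "text"   # other lines, each after a space
    print "" > "text"                                     # end the line in "text"
  }'
paste -d '|' cues text   # line n of "cues" goes with line n of "text"
```

Both files have the same number of lines, which is what makes it possible to pair them again later.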

Can the punctuation/grammar-improving program properly process "text" despite the newlines? Do they keep those newlines in their outputs?

Can't we do what we do in bash and do "script_code1 | script_code2" and the output of script_code1 serves as the input for script_code2 ?

I am not sure what you are suggesting but, yes, you can pipe in Bash. I did so in srtfold. You can pipe in AWK too, but that is advanced usage.

Magic Banana

Offline
Joined: 07/24/2010

I think we can do both at once, am I wrong?

You are right.

awk -v RS='\n\n\n*' -F \\n '{ printf "%s ", $3; for (i = 4; i <= NF; ++i) printf " %s ", $i; }'

That prints supernumerary spaces. Only the first space in the second printf should be present.

Also, there is no newline. Are newlines really forbidden to run the punctuation/grammar-improving programs? If those programs never add/remove words, we can write the number of words along with the cues and then be able to get the improved text back into its time interval. If they can change the number of words (for instance transform "can not" into "cannot"), we would really like newlines (and those programs should keep them).
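
Coming back to the spaces in the single-block one-liner: one possible way to get exactly one space between words, within and across subtitles, is to print a separator before each line of text rather than after it. This is only a sketch. The variable names sep and block are made up, and the final newline is my own addition:

```shell
block=$(printf '%s\n' \
    1 '00:00:01,000 --> 00:00:02,000' 'This is a' 'multi line subtitle' '' \
    2 '00:00:02,000 --> 00:00:03,000' 'with two lines' 'in each subtitle' '' |
  awk -v RS='\n\n+' -F '\n' '
    # sep is empty before the very first line of text and a space afterwards
    { for (i = 3; i <= NF; ++i) { printf "%s%s", sep, $i; sep = " " } }
    END { print "" }   # one final newline to end the block
  ')
printf '%s\n' "$block"
```

On that sample, the result is the single line "This is a multi line subtitle with two lines in each subtitle".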

GNUser
Offline
Joined: 07/17/2013

I noticed the lack of a space between lines the first time I ran the script, and added the spaces, didn't notice I was doubling them. Thanks for the correction.
As I said in another comment earlier, the punctuator will always output a single block of text, no matter what I feed into it. Kinda makes sense because what divides sentences will be punctuation and not the line break.
Ideally we should be able to put text back into the subtitles according to punctuation because that will allow us to translate the entire sentence to a different language and still put it back to the right timestamps.

"I am not your son. I wish I was your son."
"No soy tu hijo. Ojalá fuera tu hijo."

This was translated using LibreTranslate. As we can see the number of words is different but if we just use punctuation as a marker, sentences should be in the right timestamps regardless.

We are starting with simply improving English texts in the original English subtitles, but I would like to be able to build on this work in the future to translate for other people as well. Still, the main ideas are the same.

GNUser
Offline
Joined: 07/17/2013

A better example:

"I am not your son. I wish I was.
I am not your son. I wish I was your son.
I am not at home. I wish I was."

"No soy tu hijo. Ojalá lo fuera.
No soy tu hijo. Me gustaría ser tu hijo.
No estoy en casa. Desearía estarlo."

This time I translated with DeepL
As we can see, after adding the punctuation we can always use those as a marker, instead of the number of words per se.

Of course, as a starting point we might want to start with simply counting the words, but a second step would be to use punctuation as a breaker for sentences. What do you think?

Thanks for help.

Magic Banana

Offline
Joined: 07/24/2010

As we can see, after adding the punctuation we can always use those as a marker, instead of the number of words per se.

By "the punctuation", you only mean "the periods", right? The use of commas, for instance, may vary a lot between languages: in English (but not in French), there is usually a comma before "and" between the last two items of an enumeration, Portuguese (but not French) has strict grammatical rules for using commas, etc. I do not speak Spanish.

How do you want to process long sentences that take several screens to display? After the translation, we would not know where to break them to reuse the existing cues. Nevertheless, we could:

  1. Before the translation, append the parts of the sentences and the cues so that a subtitle is always a whole number of sentences;
  2. After the translation, put back the same whole number of sentences (for that to work, the translation must *never* break/merge sentences!) in the merged cue;
  3. If the long text requires more than two lines, srtfold will finally redivide it into cues whose duration is proportional to the number of characters (and not necessarily synchronized with the audio).

GNUser
Offline
Joined: 07/17/2013

Hum... You are probably right, I have to admit I am hardly good enough at any language to evaluate its proper punctuation structure. I have been paying attention to the fact that DeepL 99% of the time will respect (and keep in place) the original punctuation of the text that I provide it with. That led me to the impression that we could use commas, periods, colons, etc. to guide us in the text alignment within the timestamps.
Maybe it would work, maybe it would not; the truth is I have found a couple of texts in which some commas were indeed moved or removed. So I agree with you this is probably not a good plan moving forward.

From what you said, if we only use periods as "breakers" we could do something different:

00:01 - 00:03 (duration 2)
This is a line of
text and

00:03 - 00:04 (duration 1)
this is
another line.

00:04 - 00:07 (duration 3)
Now, I have
inserted a period before

00:07 - 00:10 (duration 3)
and made a break.

The subtitle above would be broken down into:

This is a line of text and this is another line. (duration 2 + 1 = 3)
Now, I have inserted a period before and made a break. (duration 3 + 3 = 6)

The first line will be translated and broken into 2 subtitles that will have a full duration of 3 seconds, dividing words in a proportion of 66% + 33%.
The second line will also be translated and broken into 2 subtitles with a full duration of 6 seconds and a words division of 50% + 50%
Of course srtfold will still be used to break lines that are too large into smaller ones.
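
The arithmetic for the first line can be sketched in AWK. This only illustrates the proportional split described above, not how srtfold works. The durations are passed by hand in the made-up variables d1 and d2, and the rounding rule is an assumption:

```shell
split=$(echo 'This is a line of text and this is another line.' |
  awk -v d1=2 -v d2=1 '{
    k = int(NF * d1 / (d1 + d2) + 0.5)   # words for the first subtitle, rounded to nearest
    for (i = 1; i <= NF; ++i)            # newline after word k and after the last word
        printf "%s%s", $i, (i == k || i == NF ? "\n" : " ")
  }')
printf '%s\n' "$split"
```

With 11 words and durations 2 and 1, k is 7, so the first subtitle gets "This is a line of text and" and the second gets "this is another line.".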

It seems at first glance that this would work, what do you think? And more importantly, do you think we can make it work with awk?

Magic Banana

Offline
Joined: 07/24/2010

That is essentially what I was suggesting, although I do not even want to reuse the initial cues (let srtfold redefine them).

The following program exemplifies the use of some of AWK's text-processing functions (sub, gsub, length, substr, and the ~ operator) and stresses the importance of mastering regexps (normally between slashes in AWK). It takes .srt subtitles at input and outputs the text in a file named "text" and the related time intervals (possible amalgamations of the input cues) in a file named "cues", along with how many sentences every interval contains:
$ awk -v RS='\n\n+' 'out == "" { begin = $2 } { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); sub(/ *$/, ""); out = out " " $0 } $NF ~ /[.?!]$/ { print begin, end, gsub(/[.?!] +[A-Z0-9]/, "&", out) + 1 > "cues"; print substr(out, 2) > "text"; out = "" }'

I considered that a period, a question mark or an exclamation mark ends a sentence if either nothing follows or at least one space and a capital letter or a digit ([.?!] +[A-Z0-9]). That definition will probably avoid issues with numbers (a dot separates the units from the decimals in English, a comma is used instead in Latin languages, commas group the digits by three in English, dots are used instead in Latin languages, Chinese people group them by four, etc.). Anyway, I foresee other problems, in particular with acronyms (for instance, "U.S.A." may be translated to "USA" or the opposite)...

GNUser
Offline
Joined: 07/17/2013

Hum... I see what you mean. Yes, there will be countless unexpected issues with the punctuation (Mr. and Mrs. for example in english, in spanish there is an inverted ? at the beginning of a question, etc).
If we add to that the fact that the punctuator will make some mistakes, and the fact that the speech to text is not perfect (rendering translations sometimes useless)... I am starting to doubt if we can actually make this work.

In any case the way you and I were approaching the issue was in essence the same but with one fundamental difference: I wanted to preserve at least the starting and ending point of the original timestamps, whereas I feel you wanted to just distribute words in a "heuristically balanced way", am I wrong?

I suppose I will have to think about what we can actually do with these subtitles because I fear the end result will not be usable in any meaningful way. At the same time I think this could be used to further improve the translation quality of already existing subtitles (where the punctuation is 100% correct in the original language).
Thanks again for your help Magic Banana, let me know your thoughts on this please.

Magic Banana

Offline
Joined: 07/24/2010

Mr. and Mrs. for example in english

Good example: the translations of "Mr." and "Mrs." may not have a dot and the next word (the surname) certainly starts with a capital letter.

in spanish there is an inverted ? at the beginning of a question

That should not be a big issue: the character is only to be added into the square brackets [A-Z0-9].

Anyway, like you, I believe there must be other issues with the way I defined a sentence. A solution is to translate one sentence at a time. Nevertheless, I imagine that, doing so, the quality of the translation must be significantly lower because the translator lacks context.

I wanted to preserve at least the starting and ending point of the original timestamps, whereas I feel you wanted to just distribute words in a "heuristically balanced way", am I wrong?

The program in my previous post keeps the very first timestamp as a starting point. It then searches for a subtitle ending with ".", "?" or "!". Its end point is the end point of the first merged subtitle. The beginning point of the next subtitle is the beginning point of the next merged subtitle whose ending point is the ending point of the first subsequent subtitle ending with ".", "?" or "!". And so on. After translation, srtfold would approximately balance the number of characters per line in each merged time interval.

Let me teach you the string-processing functions I used. For clarity, here is the program with one instruction per line:
awk -v RS='\n\n+' '
out == "" {
    begin = $2 }
{
    end = $4
    sub(/^[^\n]*\n[^\n]*\n/, "")
    gsub(/\n/, " ")
    sub(/ *$/, "")
    out = out " " $0 }
$NF ~ /[.?!]$/ {
    print begin, end, gsub(/[.?!] +[A-Z0-9]/, "&", out) + 1 > "cues"
    print substr(out, 2) > "text"
    out = "" }'

  • RS being defined as \n\n+ (AWK uses extended regular expressions: it understands "+"), that program uses "at least one blank line" to separate the records.
  • The fields are here blank-separated words, the default.
  • The program is three pairs condition-action.
  • out is a variable (automatically initialized at "", the empty string) that will contain the text in a merged subtitle.
  • When the record is the first subtitle in a merged subtitle, out == "" (C-like equality). That is the first condition of the program.
  • It conditions the definition of the variable begin as $2, the beginning timestamp ($1 is the number of the subtitle).
  • The second pair condition-action has no condition: the action is executed on every record.
  • It first (re)defines the variable end as $4, the end timestamp ($3 is "-->").
  • Then, sub(/^[^\n]*\n[^\n]*\n/, "") substitutes the first two lines (what the regular expression ^[^\n]*\n[^\n]*\n matches) with "". In other terms, it deletes the first two lines of the record, the number and the cue. sub is a sed-like substitution.
  • gsub is a sed-like substitution too, but with the "g" flag, to substitute every match, not only the first one. gsub(/\n/, " ") therefore substitutes every newline with a space.
  • sub(/ *$/, "") deletes the trailing spaces (that would be problematic when testing whether the last character is ".", "?" or "!").
  • After those substitutions, the record, $0, is therefore the text on a single line and with no trailing space.
  • out = out " " $0 appends the (edited) record to out, with a space in between. Indeed, in AWK, a space concatenates. Here, out " " $0 is the concatenation of the strings out, " ", and $0.
  • The last action of the program is only executed if the last word, $NF, ends with ".", "?" or "!". The ~ operator indeed tests whether the string at its left matches the regexp at its right.
  • If so, [begin, end] is a merged time interval that print outputs in the file named "cues", along with how many sentences it contains.
  • gsub counts that number of sentences in out, its third argument (if omitted, as in all my previous uses of sub and gsub, the record, $0, is used). More precisely, gsub counts how many times ".", "?" or "!" is followed by at least one space and a capital letter or a digit, and 1 is added (for the last sentence, with nothing after the punctuation). gsub indeed returns how many substitutions it performs. Here, it substitutes the matches with themselves (&, as in sed), effectively keeping out unaltered.
  • print then outputs out without the leading space to the file named "text". That leading space is always present given how out is built by concatenation (out " " $0). substr(out, 2) is used. It returns the string out starting from its second character.
  • Finally, out is reset to "". In this way, the next record will satisfy the first condition of the program and define the beginning of the next merged subtitle. On the contrary, if the current record does not end with ".", "?" or "!", the last action is not executed and the next record will complement the merged subtitle that has been started.
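
To make the walkthrough concrete, here is a sketch that runs the same program on the four-subtitle example from earlier in the thread, rewritten as hypothetical .srt input (numbers added, "-->" in the cues). The scratch directory and the final paste are my additions:

```shell
cd "$(mktemp -d)"   # scratch directory: the program writes "cues" and "text" here
printf '%s\n' \
    1 '00:00:01,000 --> 00:00:03,000' 'This is a line of' 'text and' '' \
    2 '00:00:03,000 --> 00:00:04,000' 'this is' 'another line.' '' \
    3 '00:00:04,000 --> 00:00:07,000' 'Now, I have' 'inserted a period before' '' \
    4 '00:00:07,000 --> 00:00:10,000' 'and made a break.' '' |
  awk -v RS='\n\n+' '
    out == "" { begin = $2 }
    { end = $4; sub(/^[^\n]*\n[^\n]*\n/, ""); gsub(/\n/, " "); sub(/ *$/, ""); out = out " " $0 }
    $NF ~ /[.?!]$/ {
        print begin, end, gsub(/[.?!] +[A-Z0-9]/, "&", out) + 1 > "cues"
        print substr(out, 2) > "text"
        out = "" }'
paste -d '|' cues text   # merged interval and sentence count, then the merged text
```

Subtitles 1 and 2 are merged into the interval from 00:00:01,000 to 00:00:04,000, and subtitles 3 and 4 into the interval from 00:00:04,000 to 00:00:10,000. Each merged interval contains one sentence.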

GNUser
Offline
Joined: 07/17/2013

Hum... I think I understood most of it... That is until I tried running the command "awk -v...." file.srt
And nothing happened. Am I misunderstanding what it should do? No new files were created...

Magic Banana

Offline
Joined: 07/24/2010

"awk -v...." file.srt
And nothing happened.

If you did not put the quotes, that is weird. The program should create two files in the working directory: "cues" and "text".

What do you mean by "nothing": does the shell give you the prompt back or not? Are you executing the one-liner? The multiline version should be in an executable file, and you can add a space and "$@" (including the quotes) at the end of the last line if you want to be able to pass the .srt file as an argument.

GNUser
Offline
Joined: 07/17/2013

I didn't put the quotes, no...

The multi-line inside a script with that "$@" runs, no error is shown, and I get the prompt line below again but no files are created.
Should I put some #/bin stuff on the first line?

As for the single-line... Pretty much the same thing. I run it, no error is shown, the command line appears again below ready for a new command to be run. But no files are created.

Magic Banana

Offline
Joined: 07/24/2010

Should I put some #/bin stuff on the first line?

You can add that first line, if you wish:
#!/bin/sh
Nevertheless, that will not change anything. Could you attach the input file so that I can test? If none of its subtitles ends with ".", "?" or "!" (and possible trailing spaces, which are removed), then, indeed, nothing is output. Is it the case?

GNUser
Offline
Joined: 07/17/2013

I think I found the reason...
Running "file -i file.srt" returned:

text/plain; charset=us-ascii

I tried another file which returns:

text/plain; charset=utf-8

This one gets the output as desired (text + cues).
The weird thing is I tried running

iconv -f US-ASCII -t ISO-8859-1//TRANSLIT file.srt > newfile.srt

and didn't get the newly created file to actually return another format. So, in these cases I would simply create a new file and copy+paste the text into it...
I usually run iconv from UTF to ISO, because my player is picky about formats, so I expected this to work... it didn't, so... I would resort to copy+paste.

Either way, I now get the text+cues :D
So... next step would be to make the distribution of the newly translated text into the timestamps, heuristically dividing the words per time, correct?

Magic Banana

Offline
Joined: 07/24/2010

Putting the same number of sentences in the (merged) cues... but this is when I realize there is a problem in my definition of a sentence. We defined a sentence as ending with ".", "?" or "!" (within a subtitle) then at least one space then a capital letter or a number OR ending a subtitle with ".", "?" or "!" (and no guarantee that the next subtitle starts with a capital letter or a number). Since the translated text is a single block:

  • if we consider a sentence as ending with ".", "?" or "!" then at least one space then a capital letter or a number, we will have a problem whenever we have a subtitle ending with ".", "?" or "!" and the next sentence not starting with a capital letter or a number (it was counted as a sentence in "cues" but is not a sentence anymore in the translated text);
  • if we consider a sentence as ending with ".", "?" or "!" and at least one space (whatever comes after), it is easy to adapt the script (gsub(/[.?!] +[A-Z0-9]/, "&", out) becomes gsub(/[.?!] /, "&", out)) but I fear we will face even more issues with acronyms and so on.

We must choose a consistent definition for a sentence. And before putting the translated sentences in the (merged) cues, we should first check that, with our definition of "sentence" we have the same number of sentences in the translated text and according to the file named "cues". That latter number is:
$ awk '{ sum += $3 } END { print sum }' cues
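
A sketch of that check, using the second definition of a sentence (".", "?" or "!" followed by a space, plus 1 for the end of the block). The contents of "cues", the file name translated.txt and its text are made up for the illustration:

```shell
cd "$(mktemp -d)"   # scratch directory
# Hypothetical "cues" (begin, end, number of sentences) and a translated single block.
printf '%s\n' '00:00:01,000 00:00:04,000 2' '00:00:04,000 00:00:10,000 1' > cues
printf '%s\n' 'Uno. Dos. Tres.' > translated.txt
expected=$(awk '{ sum += $3 } END { print sum }' cues)
# gsub returns the number of matches; "&" puts each match back unchanged.
found=$(awk '{ sum += gsub(/[.?!] /, "&") + 1 } END { print sum }' translated.txt)
echo "cues: $expected sentences; translated text: $found sentences"
```

If the two numbers differ, the translated sentences cannot be put back into the merged cues one by one, whatever the rest of the process looks like.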

GNUser
Offline
Joined: 07/17/2013

Hum... I understand what you mean, and that is indeed a problematic issue...
I will have to try and look into it, but I will only be able to do so during the weekend. Maybe I can come up with something to help solving that.
I will let you know when I get the time to pick this up again ;)

Thanks!

GNUser
Offline
Joined: 07/17/2013

Hey there!
So... I did some testing.

First, I found a video that had human-made subtitles, made only with text (meaning the words were 100% correct to what was being said in the video, but there was no punctuation, no capital letters, nothing, only the pure text of what was being said without any mistakes made by speech-to-text software). I ran the punctuator on that, and the results were... less than satisfying. If it were a pre-processing step that I intended to rework manually afterwards, it would be a good start. Some of the commas and periods were actually spot on. But I would say only 50% of the added punctuation was actually usable. Since I am looking for a 100% automatic process, this will be put aside for the time being. Maybe in the future we will have a better model that can use speech-to-text combined with the already existing text to create appropriate punctuation. For now, I will not be using this.

Second, I found a video that has subtitles with proper punctuation and all the correct words in the original English. I ran your script, and got the TEXT and CUES files. The sentence counts in the CUES file made from the original subtitle gave a total of 150 (running your command).
I then translated both the original subtitle file and the TEXT file produced by the script. The translation of the TEXT file was vastly superior to the one translated from the subtitle file (because the text was much more in context).
However, running the script on the translated subtitle produced a new CUES file which unfortunately has a total of 175.

I am unsure as to what step to take next. The translated TEXT file is indeed a superior translation, I would love to use it. And the sentences (or rather, the paragraphs) are the same as in the original TEXT file. Maybe we could try to bring the text back into place using paragraphs as separators?
Like...
1. Run script in ORIGINALSUBTITLE, to produce TEXT and CUES files
2. Translate the TEXT file, retaining the paragraphs as defined by the script.
3. Put the translated text back into the subtitle format using translated-TEXT file and original CUES file.

I am not sure if this will work, but we can give it a go if you could make the script use paragraphs as breakers when bringing the text back. Fingers crossed!
Thank you for the help ;)

GNUser
Offline
Joined: 07/17/2013

I noticed something that might be useful. The difference between the CUES files is only in the number of sentences. The ending times are the same (only in some cases divided over two lines, but those could be merged, or the division could be used to help in the heuristic process of rearranging words per time). I will attach the original and translated CUES.

Attachments:
ORIGINALCUES.txt (371 bytes)
TRANSLATEDCUES.txt (428 bytes)

Magic Banana

Offline
Joined: 07/24/2010

Those "cues" files are unexpected. For instance the fourth merged cue in ORIGINALCUES.txt contains 34 sentences! I would actually assume most merged cues would contain a single sentence.

Could you attach the original .srt file?

GNUser
Offline
Joined: 07/17/2013

I think the reason is that the subtitle is of a presentation: a single person speaking in a continuous line of thought explains the uninterrupted cues. I am attaching the files so you can examine them. Since I'm not sure about the Forum's rules on this topic, I leave here a warning that it's a text of religious nature (starts with a prayer, followed by teaching). It's in any case a great example (the best one I could find) where the subtitle is well done in terms of punctuation, but paragraphs are not properly broken down. It also has subtitles in other languages, to which I can compare the software-made translations. This is an example I think will be greatly improved by putting the text back together to better translate it. Here is also a link to the video, in case you find it helpful to determine if times are properly calculated or not: https://invidious.fdn.fr/watch?v=eIGAjoqBhhU

Attachments:
ORIGINALCUES.txt (371 bytes)
ORIGINALSUBTITLE.txt (30.49 KB)
ORIGINALTEXT.txt (21.27 KB)

Magic Banana

Offline
Joined: 07/24/2010

It does not look like a human divided those subtitles. For instance, consider the fourteenth subtitle:
00:01:51,550 --> 00:01:57,040
morgue. No one is ever allowed to talk or
chatter. It's a hushed silence. When

Wouldn't a human creating the subtitles put "morgue." at the end of the thirteenth subtitle, "When" at the beginning of the fifteenth and define the cues accordingly? Searching for ".", "?" or "!" at the end of a subtitle, I assumed that would be so: one merged cue would end at "morgue.", another one would relate to the two sentences "No one is ever allowed to talk or chatter. It's a hushed silence." and yet another one would start at "When".

If none of your subtitles are "aligned" with sentences, merging subtitles to always have a whole number of sentences in them is a bad idea. We end up with very long time intervals (more than seven and a half minutes for the 35 sentences my previous post referred to). srtfold will divide such an interval proportionally to the number of characters. Over seven and a half minutes, that imperfect heuristic will certainly produce screens showing text that was pronounced several seconds earlier or that will be pronounced several seconds later.

Magic Banana

Offline
Joined: 07/24/2010

Maybe we could try to bring the text back into place using paragraphs as separators?

If I properly understand, you propose to append a newline after every line there was in "text". I concatenated that character, "\n", in the script below, where a sentence is, as I suggested earlier, now simply defined as ending with ".", "?" or "!" followed by either a space or a newline (end of the subtitle):
#!/bin/sh
awk -v RS='\n\n+' '
out == "" {
    begin = $2 }
{
    end = $4
    sub(/^[^\n]*\n[^\n]*\n/, "")
    gsub(/\n/, " ")
    sub(/ *$/, "")
    out = out " " $0 }
$NF ~ /[.?!]$/ {
    print begin, end, gsub(/[.?!] /, "&", out) + 1 > "cues"
    print substr(out, 2) "\n" > "text"
    out = "" }' "$@"

3. Put the translated text back into the subtitle format using translated-TEXT file and original CUES file.

The above code still merges cues so that a sentence starts at the beginning time and a sentence ends at the ending time. Without that, "text" would contain paragraphs that would start/end in the middle of a sentence and, I imagine, the translation would not be good (or maybe, the translator would merge such paragraphs).
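
Before attempting step 3, one can at least check that the translation kept the paragraphs, that is, that the translated file has as many paragraphs as "cues" has lines. The following is only a sketch: the file name translated.txt and the sample contents are hypothetical, and it relies on AWK's paragraph mode (an empty RS, where blank lines separate the records):

```shell
cd "$(mktemp -d)"   # scratch directory
# Hypothetical files: two merged intervals, two translated paragraphs.
printf '%s\n' '00:00:01,000 00:00:04,000 1' '00:00:04,000 00:00:10,000 1' > cues
printf '%s\n\n' 'First translated paragraph.' 'Second translated paragraph.' > translated.txt
paragraphs=$(awk -v RS='' 'END { print NR }' translated.txt)   # records are paragraphs
intervals=$(awk 'END { print NR }' cues)                        # records are lines
if [ "$paragraphs" -eq "$intervals" ]; then
    echo "OK: $paragraphs paragraphs for $intervals merged cues"
else
    echo "Mismatch: $paragraphs paragraphs but $intervals merged cues"
fi
```

If the counts match, paragraph n of the translation can be paired with line n of "cues".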

GNUser
Offline
Joined: 07/17/2013

That code still produces a different CUES file for the original text than it does for the translated texts.
But again the timings are usable in some manner.