Help for text editing

129 replies
Magic Banana

I am a member!

Offline
Joined: 07/24/2010

I have never had any interest in learning Perl. My main reasons are that Perl is grammatically one of the most complicated languages (hence hard to learn and master), one of the slowest to interpret (it is often last in benchmarks), and that enthusiasts praise the fact that there are tens of ways to do the same thing (which certainly means that it is hard to understand code written by somebody else).

I do not like sed either, this time for being too low-level (as the link I gave in https://trisquel.info/forum/help-text-editing?page=1#comment-166618 shows: I searched for the "construct" but have not really tried to understand it), cryptic (one-letter commands do not help) and too focused on text editing (although, like AWK and Perl, it is theoretically Turing-complete). For a few substitutions, it is fine (the same usually takes a little more typing in AWK: a series of sub/gsub calls and a print).
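
As a minimal illustration of that remark (the input line and the two substitutions are made up for the example), here is the same pair of substitutions written for sed and for AWK:

```sh
# Two substitutions on every line: normalise a spelling, then strip leading spaces.
# The input line is invented for the sake of the example.
printf '  colour and colour\n' | sed 's/colour/color/g; s/^ *//'

# The AWK counterpart: one gsub/sub call per substitution, then an explicit print.
printf '  colour and colour\n' | awk '{ gsub(/colour/, "color"); sub(/^ */, ""); print }'
```

Both commands print the same line. The AWK version is only slightly longer, but every operation is spelled out.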

AWK appears to me as a perfect compromise for text processing (especially plain-text datasets). It is, to my knowledge, one of the easiest languages to learn. I literally take ~5 hours to teach almost all of it (I leave aside getline, calling system functions and things that are rarely needed such as time functions and random number generation) to students who have already had a course on C. It is very rewarding to quickly implement in a few dozen lines of AWK something such as srtfold: https://dcc.ufmg.br/~lcerf/en/utilities.html#srtfold

For more complex tasks, I directly go to C++. If I wanted to learn an intermediate language (for tasks that do not require C++'s performance), I would probably choose Python.

But just to be clear: I have no animosity against Perl. Even less against its enthusiasts. We have different tastes. That is all.

andermetalsh
Offline
Joined: 01/04/2013

Perl's ugly syntax is a bit of FUD coming mostly from newbies who never used Perl and began with Linux in the late 00's, when their only knowledge of Perl came from one-liners.
Properly written Perl is like the code in O'Reilly's free CD bookshelf: easy and clean.
On slowness, lots of core utilities for Trisquel related to system config, apt and dpkg are written in Perl: package building scripts, building helpers, debconf...
Back in the day apt-get was written in Perl (I think it is now written in C) and it performed fast enough to handle zillions of CPU-intensive cases such as resolving dependencies.
If you had said Ruby, yes, it's true: it's dog slow.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

> Perl's ugly syntax is a bit of FUD coming mostly from newbies

I was not talking about Perl's syntax. I was talking about its grammar. https://everything2.com/title/Only+perl+can+parse+Perl for instance refers to "experienced, respected Perl programmers, not clueless newbies" and is plain scary to me.

As for the "tens of ways to do the same thing", it is by design of its inventor ("[Perl] doesn’t try to tell the programmer how to program. It lets the programmer decide what rules today, and what sucks. It doesn’t have any theoretical axes to grind. And where it has theoretical axes, it doesn’t grind them."): https://www.perl.com/pub/1999/03/pm.html/#jump6

> Back in the day apt-get was written in Perl (I think it is now written in C) and it performed fast enough to handle zillions of CPU-intensive cases such as resolving dependencies.

Wasn't performance a reason for the rewrite? Anyway, I was actually talking about simple tasks (I doubt anybody sensible would consider AWK to implement apt-get) on large inputs. For instance summing the values in the fifth column of a dataset: https://www.libertysys.com.au/2011/03/an-interesting-performance-difference-between-perl-and-awk/

I use AWK daily for such tasks. Since the comparison behind the link above is 11 years old, I decided to reproduce it on an even larger input (a 14+ GB dataset) compressed with zstd (it then weighs 687 MB). It has 15+ million lines with 26 tab-separated values each. I sum the integers in the second column:
$ zstdcat data.zstd | time awk '{ sum += $2 } END { print sum }'
286945428317071
6.90user 3.07system 0:11.41elapsed 87%CPU (0avgtext+0avgdata 4064maxresident)k
0inputs+0outputs (0major+212minor)pagefaults 0swaps
$ zstdcat data.zstd | time perl -we 'my $sum = 0; while (<>) { my @F = split; $sum += $F[1]; } printf "%d\n", $sum; '
286945428317071
43.19user 2.42system 0:45.66elapsed 99%CPU (0avgtext+0avgdata 5408maxresident)k
0inputs+0outputs (0major+247minor)pagefaults 0swaps

Here, Perl is four times slower than GNU AWK in elapsed time (45.66 s against 11.41 s), and AWK actually spends part of that time waiting for input data (87% CPU). The AWK program looks far simpler than the Perl one, which I could at least adapt to use the second column instead of the fifth. That is not the case for the second Perl program on https://www.libertysys.com.au/2011/03/an-interesting-performance-difference-between-perl-and-awk/

Writing "my ($size) = /\d+[^\d]+(\d+)/;" to define size as the value in the fifth column does not look reasonable to me. Anyway, the performance gain was not satisfactory: "nearly 3 times slower than the awk version".

lanun
Offline
Joined: 04/01/2021

> I directly go to C++.

You mean Rust? ;)

C++ code looking a bit like hairy Python code, Python code looking like shaved C++ code, so hair growth lotion and a shaver should be the only requirements to switch from one to the other. Although I always have an uneasy feeling that an interpreter might run into funny things that could have been spotted at compile time.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

I will probably never learn Rust, because it takes a long time. I spent (and keep spending) that time on learning C++, which fulfills the same needs (efficient, unlike Python, and with useful abstractions, unlike C). C++ and Rust are "hairy" indeed. But they bring useful abstractions (object orientation, generic programming, functional programming, etc.) while preserving the performance of C. On the other hand, Perl is at the same time very complex (more than C++, as far as I understand) and among the slowest languages.

If Rust had existed in the mid-2000s, I might have learnt it instead of C++. C++ lacks Rust's memory/concurrency-safety guarantees, which are very useful.

lanun
Offline
Joined: 04/01/2021

On the topic of Perl, they do fancy Perl poetry, although I dare not link to the Black Perl poem in such a family friendly forum.

GNUser
Offline
Joined: 07/17/2013

Hello everyone!
After a long learning process (for me) and a lot of patience and help from Magic Banana, we have finally arrived at what I initially wanted: the most automatic way possible to quickly create translated subtitles, so that anyone can access any content in the world!

This could be further improved into doing everything in one single step and using Apertium for offline translation, but as I said before, Apertium doesn't meet my requirements for translations (few language pairs, poor translation quality, etc).

So, I tried to make it usable in different situations but also tailored it to the online translator I would suggest right now (DeepL). I do hope to revisit this project in the future, when we have a better FLOSS solution for offline translations, which will in any case greatly benefit from all the work done here.

This would not have been possible without the immense assistance and teachings of Magic Banana. He really went beyond any help I was hoping to get when I first started this, and I am profoundly thankful for it.
All hail Magic Banana!

That being said, it is now my hope that people will seek knowledge more and more, in any topic they might choose, without language barriers. Like I said before, I do not wish to tell you what you should learn, or to build a fence around what you may or may not learn, but instead to build bridges that allow people to access more and more information, helping everyone to go forward.

The two scripts are attached to this post, and they are called "starter.sh" and "finalizer.sh"
I am not really good with names :P

Anyway, just put your SRT file in an empty folder with the starter script, and run:

./starter.sh subtitle.srt

After getting the files translated, put them into another empty folder along with cues and finalizer.sh and run:

./finalizer.sh

You will end up with a new SRT file that shall be usable for most cases. If you so desire, you might also process it further manually, though that was not the idea behind this project.

As someone else mentioned, this could also be transformed into other programming languages, if anyone so desires :)
For now, we have a working solution, and that's what matters. Thank you Magic Banana for all your help, once again.

Like the rabbit says, that's all folks!

Attachment    Size
starter.sh    3.05 KB
finalizer.sh  2.99 KB

GNUser
Offline
Joined: 07/17/2013

As a possible alternative for translation, does anyone know if LibreTranslate instances can be accessed from a terminal for file translation? It is not the best quality, but I understand some people might prefer a FLOSS solution, and I would modify the script if that were possible.
I am not looking for a local installation of LibreTranslate at this time, because I want to make this usable for other people who may not be able to install it anyway.

GNUser
Offline
Joined: 07/17/2013

I found this, but I am not sure if it's possible to upload files through it...
https://github.com/argosopentech/LibreTranslate-sh

Again, LibreTranslate would not be my first choice, but I would like to provide a 100% FLOSS solution (even though it relies on an online service, which I am sure some people will also object to).

GNUser
Offline
Joined: 07/17/2013

Another option would be to send one line at a time, or to have a way to send the entire text and receive it back with the original line separation (as we had in the original TEXT file). Does anyone have any idea how we can do it? I think it would be better to make only one request with the entire text instead of sending one line at a time but... I am not sure.

GNUser
Offline
Joined: 07/17/2013

Well, I think I found something to work with.
stranslate (https://codeberg.org/justwolf/stranslate) seems to allow file uploading, using SimplyTranslate as a gateway to LibreTranslate. I think we can make it work!

This leaves us with a different limitation than before, as LibreTranslate won't accept files longer than 10000 characters (at least on the instances that I tested).

Magic Banana, do you think the code you wrote earlier could be modified to break TEXT into files no longer than X characters without breaking sentences?
If it is too much trouble, I guess a calculation could be done instead: divide the total number of characters in the file by the total number of words, and use that average to calculate the number of words to pass to the already existing code. In any case, let us know!
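
The calculation suggested above could be sketched as follows. This is only an illustration under assumptions: words_per_chunk is a name made up here, the text is read from standard input, and each end of line counts as one character. Since it relies on an average, a chunk made of long words can still exceed the limit, so some margin on the maximum is advisable.

```sh
# words_per_chunk MAX_CHARS: read a text on standard input and print how many
# words fit, on average, in MAX_CHARS characters (hypothetical helper).
words_per_chunk() {
    awk -v max="$1" '
    { chars += length($0) + 1; words += NF }   # + 1 for the end of line
    END {
        # average characters per word = chars / words, hence
        # words per chunk = max / (chars / words) = max * words / chars
        if (words) print int(max * words / chars)
    }'
}

# Example: a 10-character limit on a tiny made-up text.
printf 'one two three\nfour five\n' | words_per_chunk 10
```

On this input (24 characters, 5 words), the command prints 2.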

I have to again say that LibreTranslate is not on the same level as DeepL. Still, I think it will make for an acceptable choice for many people, so I will maybe re-work the script to use it.

GNUser
Offline
Joined: 07/17/2013

Okay, here is an alternative script that does all the work in one single step. It now uses the FLOSS LibreTranslate for the translation.
There is still the need to divide the TEXT file if it grows beyond 10000 characters (I tried this with smaller files and it worked fine). I tried changing Magic Banana's original division code to count characters instead of words, but couldn't get it to work properly. If you can give us a hand once again I would be very grateful MB! ;-)

Once that is done, I will create the necessary code to put everything together again and delete the unnecessary files (for now I am keeping those in case we need to debug).

I'm sure this could be easily improved for more options, but it is at least functional for now.

Attachment  Size
script.sh   3.84 KB

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

> If you can give us a hand once again I would be very grateful MB!

$ gawk -v max=10000 '{ text = text $0 "\n" } END { nb = 1; n = split(text, a, /[.?!][[:space:]]+/, seps); for (i = 1; i <= n; ++i) { l = length(a[i] seps[i]); if (l + tot > max) { tot = l; ++nb } else { tot += l }; printf "%s", a[i] seps[i] > (FILENAME "." nb) } }'

As with the version counting words, GNU AWK must be used, for split to accept a fourth argument (an array receiving the separators). The name of the file to split goes at the end of the command; the pieces are written to files with that name followed by .1, .2, etc.

GNUser
Offline
Joined: 07/17/2013

Thanks!
Though, I think we will not need that code for the version of the translator I am uploading today :)

I took some time to try and learn how SimplyTranslate and LibreTranslate actually work, and I was able to make our little tool interact directly with LibreTranslate, without the need for cutting the original TEXT file!

Your code will still be valuable as I will probably be faced with situations when I really want to use another translation service instead of LibreTranslate, so being able to cut a file properly is important (I actually tested your code and it worked perfectly!)

And now I upload a new script, called final.sh (like... I think it will be the final version, lol, I don't know, I just thought it was a fitting name)

Just put final.sh and your srt file in the same folder and run:

./final.sh subtitle.srt

Answer the questions you are asked and, when it ends, you will have a file called finalsubtitle.srt. If all goes well, it will be properly processed and translated :D

Thanks once again Magic Banana, I couldn't have done it without you! I really hope more people here will use it to go after knowledge! I know I will :D

And I also learned a lot in the process, so it was really a win-win situation. I also remember you mentioned learning something new with our little project here, so I am really glad for that. Thanks for the help!

Attachment  Size
final.sh    5.31 KB

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

The attachment cannot be read. Replacing the .sh extension with .txt will probably solve that issue.

GNUser
Offline
Joined: 07/17/2013

That's odd, was it also happening in the previous scripts I uploaded?
Anyway, I did what you said, hope it works now!

Attachment  Size
final.txt   5.31 KB

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

> That's odd, was it also happening in the previous scripts I uploaded?

Yes, it was.

> Anyway, I did what you said, hope it works now!

It does. Thank you.

GNUser
Offline
Joined: 07/17/2013

You could have told me sooner :P
I will have to re-upload the previous scripts in txt format, in case someone prefers to use that method.

You know, I have been thinking about (and testing) the srtfold script: we could improve about 50% of the line breaks if we could detect commas and break at those, since those are usually where a sentence has a "pause". Basically, it would mean having srtfold count characters (let's say 32, as per the user's input) and arrange the lines heuristically as it already does, but then inspect whether there is a comma in the line and break there, which would mean the next line would be computed anew. Not sure if I am making much sense lol

Here is a simple example:

00:00 - 00:05
This is a simple example
of what I mean and,

00:05 - 00:10
as you can see, there is
a weird break between lines,

00:10 - 00:15
which is to be expected, we
made a simple calculation

00:15 - 00:20
based on characters
and not on punctuation.

Now, if there was that second step I am trying to illustrate:

00:00 - 00:05
This is a simple example
of what I mean and,

00:05 - 00:10
as you can see, there is
a weird break between lines,

00:10 - 00:15
which is to be expected, we
made a simple calculation

00:15 - 00:20
based on characters
and not on punctuation.

It's not feasible to expect perfect breaking every time (notice that "expected, we") but I think an improvement could be made (overall, the example above provides a much more natural read than the previous one) by looking for the comma character and using it as a breaking point in srtfold.

Do you think this could be achieved?
Thanks!

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

The attached alternative to srtfold adds a first step, which breaks the lines on punctuation. In fact, that first step is srtfold's first step but, instead of taking the words one by one, it takes the sequences of words between punctuation marks: I replaced every --NF with a call to trunc_after_last_punct and every $NF with a call to after_last_punct (two functions I defined).
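
To give an idea of what such a pair of functions can look like, here is a hypothetical sketch. It is not the code of the attached script, which may well differ. It assumes that a punctuation mark is one of , ; : . ? ! and only counts when a space follows it, so that the final period of a sentence is ignored. The shell wrapper split_last_punct is made up for the demonstration.

```sh
# split_last_punct TEXT: print what precedes and what follows the last
# punctuation mark of TEXT, on two lines (demonstration wrapper).
split_last_punct() {
    printf '%s\n' "$1" | awk '
    # Position in s of the last punctuation mark followed by a space (0 if none).
    function last_punct(s,    off) {
        off = 0
        while (match(substr(s, off + 1), /[,;:.?!] /))
            off += RSTART
        return off
    }
    # The text up to and including the last punctuation mark (the analogue of --NF).
    function trunc_after_last_punct(s) {
        return substr(s, 1, last_punct(s))
    }
    # The words after the last punctuation mark (the analogue of $NF).
    function after_last_punct(s,    p) {
        p = last_punct(s)
        return p ? substr(s, p + 2) : s
    }
    { print trunc_after_last_punct($0); print after_last_punct($0) }'
}

split_last_punct "Well, the exact number has changed over the years."
```

With this sketch, a text without any such punctuation mark is returned whole by after_last_punct, while trunc_after_last_punct returns an empty string.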

Attachment    Size
srtfold2.txt  3.17 KB

GNUser
Offline
Joined: 07/17/2013

Hi there!
Thanks for the updated srtfold.
I tested it and it certainly is an improvement: in many instances, the text flows and breaks much more naturally. Awesome!
I did notice a possible bug, however. I say possible because I am not sure how to express it in terms a machine understands.

Running srtfold2 40 subtitle.srt breaks this line:

00:01 - 00:05
Well, the exact number has changed over the years.

like this:
00:01 - 00:03
Well,
the exact number has

00:03 - 00:05
changed over the years.

I don't understand why the script decided not to add any more words after the first comma. I know, we are trying to break at punctuation marks, but it could have added more words afterwards, seeing as that would not interfere with any other punctuation mark (there was only the period at the end). A possibly more correct break would notice the difference in length between the first and second lines and not break there. Something like this:

Well, the exact number
has changed over the years.

OR

Well, the exact number has
changed over the years.

The first is a better example, since the first line is shorter than the second, but if the script needs to produce the second example I still think it is more reasonable than breaking into 2 separate subtitles.

What do you think? I hope I am making enough sense, lol.
Thanks for the help again MB!!

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

I wrote something simple. Given max_char_per_line (the only argument), srtfold2 proceeds in three steps:

  1. Break the too-long lines of the .srt subtitles on punctuation, approximately evenly, into lines of at most max_char_per_line characters, except when more than max_char_per_line characters lie between two punctuation marks;
  2. In that latter case, break the still-too-long lines on spaces, approximately evenly, into lines of at most max_char_per_line characters, except when a single word has more than max_char_per_line characters (which never happens with a reasonably large value for max_char_per_line);
  3. Keep at most two lines per screen, dividing the display time proportionally to the number of characters on each screen.
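
The "approximately even" breaking on spaces of step 2 can be sketched as below. This is not the code of srtfold2: fold_even and wrap are names made up for the illustration, MAX is assumed to be a positive integer, and the strategy is only one possible way to balance the lines. It wraps the words greedily at the smallest width, between the ideal line length and MAX, for which no more than ceil(total / MAX) lines are needed, and falls back to MAX if there is no such width.

```sh
# fold_even MAX: fold every line read on standard input into lines of
# approximately the same length, each of at most MAX characters
# (a single word longer than MAX is left alone on its line).
fold_even() {
    awk -v max="$1" '
    # Greedily wrap the words w[1..nw] at the given width.
    # The lines are stored in out[]; their number is returned.
    function wrap(width,    i, n) {
        n = 1
        out[1] = w[1]
        for (i = 2; i <= nw; ++i)
            if (length(out[n]) + 1 + length(w[i]) <= width)
                out[n] = out[n] " " w[i]
            else
                out[++n] = w[i]
        return n
    }
    {
        nw = split($0, w)
        if (nw == 0) { print ""; next }
        total = nw - 1                            # the spaces between the words
        for (i = 1; i <= nw; ++i) total += length(w[i])
        least = int((total + max - 1) / max)      # lower bound on the number of lines
        width = int((total + least - 1) / least)  # ideal length of a line
        # Widen the lines until the greedy wrapping needs no more than
        # "least" lines or the maximal width is reached.
        while (width < max && wrap(width) > least)
            ++width
        n = wrap(width)
        for (i = 1; i <= n; ++i) print out[i]
    }'
}

printf '%s\n' "Well, the exact number has changed over the years." | fold_even 40
```

On that 50-character sentence, the sketch settles on a width of 26 and prints "Well, the exact number has" followed by "changed over the years.".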

What you want apparently requires an approach somehow integrating all three steps into one. If you come up with a formalization (an algorithm) of such an approach, I may implement it. But that looks complicated...

GNUser
Offline
Joined: 07/17/2013

Hey there!
Sorry it took me so long to write back, these last couple days have been kinda rough on my end.

I understand better now. I think we could do things the other way around; it might actually be as simple as what you wrote. srtfold would run exactly as before, with only a minor difference: the check in your step 1 would become a final step in each division.

1. Approximately-evenly break lines at max_char_per_line, assigning the timestamp of the subtitle;
2. Check whether each subtitle is a single-line or a two-line subtitle;
3. For a single-line subtitle, search for a comma and, if there is one, break the line there, into two lines. For a two-line subtitle, only perform the check on the second line and leave the first line as it is;
4. If there is a comma in the checked line, break it there, with the text after the comma becoming part of the next line, which is then broken and divided anew. The subtitle duration is again calculated based on the characters that were moved from the previous subtitle to the new line.
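
The duration computation mentioned in point 4 (sharing the display time proportionally to the number of characters) could look like the sketch below. It is an illustration under assumptions: split_time is a made-up name, the times are given in milliseconds rather than in the timestamp format of SRT files, and the character counts are passed as a space-separated list whose sum is not zero.

```sh
# split_time START_MS STOP_MS "LEN1 LEN2 ...": share the interval between
# START_MS and STOP_MS among consecutive subtitles, proportionally to their
# numbers of characters, and print one "from to" pair per subtitle.
split_time() {
    awk -v start="$1" -v stop="$2" -v lens="$3" 'BEGIN {
        n = split(lens, len, " ")
        for (i = 1; i <= n; ++i) total += len[i]
        from = start
        for (i = 1; i <= n; ++i) {
            seen += len[i]
            # The last boundary is exactly stop; the others are rounded
            # to the nearest millisecond.
            to = (i == n) ? stop : start + int((stop - start) * seen / total + 0.5)
            printf "%d %d\n", from, to
            from = to
        }
    }'
}

# A subtitle shown from 1 s to 5 s, split into parts of 26 and 23 characters.
split_time 1000 5000 "26 23"
```

Here the first part gets 26/49 of the four seconds, so the command prints "1000 3122" and then "3122 5000".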

I think we should start by only adding breaks on commas, since periods and other marks are usually at the end of a line and not in the middle (given the way we built all this process). Also, let's only do it near the end of each subtitle: if it's a long subtitle, there should be no need to do it at every comma.
Not sure if I was very clear in my thought process, hope you get what I mean ;)
If not let me know!
Thanks.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

> 3. For a single-line subtitle, search for a comma and, if there is one, break the line there, into two lines. For a two-line subtitle, only perform the check on the second line and leave the first line as it is;

I do not see how that would avoid having a single word and a punctuation mark (such as "Well,") alone on a line. Avoiding that is what you wanted, as far as I understood.

GNUser
Offline
Joined: 07/17/2013

You are right. I need to think this through again. Sorry, like I said, these last days were kinda rough.
I will let you know when I think of a better way to handle punctuation breaks.
Thanks!

GNUser
Offline
Joined: 07/17/2013

Maybe we could try only running this extra step in two-lines subtitles?
Running srtfold with max_char_per_line set to 32:

Example1:
00:01 - 00:10
This, as I see, will remain the same.

Example2:
00:01 - 00:10
This, on the other hand, will end up being, kinda like broken down.

Turns into:

Example1:
00:01 - 00:10
This, as
I see, will remain the same.

Example2:
00:01 - 00:07
This, on the other hand, will
end up being, {kinda like broken}

00:07 - 00:10
kinda like broken down.

Example 1 is broken normally, as is already happening: not perfect but very good.
Example 2 is treated differently. The text between {} is moved to the second subtitle, with a new calculation being made from scratch as would happen normally.
An improvement would be to reconstruct the two-line subtitle to make the second line longer than the first one, like this:

00:01 - 00:07
This, on the
other hand, will end up being,

While I assume it's not perfect (commas in the first line are not treated as break points), I still think it will actually produce better results while avoiding orphan words at the beginning of a sentence.
It's important that this is only run once and not again after reconstructing the line, otherwise we might end up breaking a sentence too much. As per the example above:

00:01 - 00:07
This, on the
other hand, {will end up being,}

This time around we don't want to run a second step and remove the text between {} because we already adjusted the subtitle to end a line in a comma, which improved over the original breaking.

How does it look?

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

Well, I have not really understood what you proposed, but I understood you want to "avoid orphan words". In the attached script, I modified "my" first step, which breaks on punctuation, to never leave a single word alone on a line, except if the word is already an orphan in the input subtitles. See if that satisfies you.

Attachment    Size
srtfold2.txt  3.51 KB

GNUser
Offline
Joined: 07/17/2013

Sorry, I guess I didn't explain myself clearly enough, even though I tried xD
I ran your new script on a single test file, with only a couple of lines, and it seems to do a better job than before. Great!
It seems to be a simpler approach than mine, but I guess it ends up doing the trick.
I will have to test it more extensively and will get back to you when I have more useful data :)

Thanks again for the help!

GNUser
Offline
Joined: 07/17/2013

Confirmed!
I tested it against a couple different subtitles, different styles (more dialogue, less dialogue, longer sentences, shorter sentences) and it looks great!
The new way to break sentences really adds to a more natural reading. Thanks, great work!
I am sure a lot of people will benefit from this, thank you!