A sed script to replace one HTML string with a different one

amenex
Joined: 01/03/2015

Whereas Magic Banana's admonition to write HTML with a text editor is encouraging me
actually to do so, I've reached a degree of exasperation with my lack of geek-like
resources even though I've managed to produce a mostly working HTML page containing
the essence of my results.

That said, I'm not quite at the middle of another stage of the process, one in which
I'm adding links to another series of files containing additional pertinent data. I've
managed to replace about 400 of 900+ strings that display the multi-addressed PTRs in
the database with another (much longer) set of strings that contain the links to those
data files that contain additional detail about the domains that those PTRs are visiting.

The good news is that I created a set of scripts that extract the data from the master
database (which would fill the full width of the display and be 3500 rows deep) and place
it in a subdirectory. That task is complete and took just a few seconds after a day or
so of script-writing.

The present task runs afoul of the control strings necessary for HTML.

Here's the script that creates the actual string-replacement script:
awk '{print "cat "$5" sed '\''s/"$1$2$3"/"$4"/g'\'' '\>' "$6}' 'Source-04.txt' > Outcome-07.txt

where Source-04.txt is: 'pea dhcp-163.net1.bg slashpea' 'Success' Target-0802A.txt Target-0803A.txt

and Target-0802A.txt is: peadhcp-163.net1.bgslashpea

In order to comprehend this, you'll have to imagine "less than" p "greater than" wherever
you read "pea" and "forward slash" "less than" p "greater than" wherever you read "slashpea."

The script ensuing from the awk command as Outcome-07.txt is:
cat Target-0802A.txt sed 's/'peadhcp-163.net1.bgslashpea'/'Success'/g' > Target-0803A.txt

Which elicits the following complaint and fails to produce a modified Target-0803A.txt:
bash: p: No such file or directory

The files Target-0802A.txt and Target-0803A.txt both exist; Target-0803A.txt is empty in
order to protect Target-0802A.txt from obliteration by a non-functional script Outcome-07.txt.

The script Outcome-07.txt appears to be looking for an imaginary file ...
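
One possible reading of that complaint, assuming the real pattern contains the literal tags described above (less-than, p, greater-than): in the generated command the inner single quotes close the quoting right after "s/", so bash sees a bare less-than sign followed by p and treats it as "read standard input from a file named p". The sketch below reproduces this in a scratch directory; the variable names and the shortened pattern are illustrative only.

```shell
#!/bin/bash
# Work in an empty scratch directory so that no file named "p" exists.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Unquoted, "<p" is an input redirection from the file "p".
# bash -c runs the line in a child shell so its error message can be captured.
unquoted_error=$(LC_ALL=C bash -c 'echo s/<p>dhcp/' 2>&1)
echo "$unquoted_error"

# Inside one unbroken pair of single quotes the same characters are plain text.
quoted_output=$(bash -c "echo 's/<p>dhcp/'" 2>&1)
echo "$quoted_output"
```

If this reading is right, keeping the whole sed expression inside a single pair of quotes, with no quotes reopening in the middle, is what stops bash from looking for the imaginary file.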

Once the script is satisfied, Target-0803A.txt should read "Success" but by then I can create
the true successor script by using awk to print all its component parts just like Source-04.txt,
but with more (and longer) components.

The last step is to replace $3 in the new Source-05.txt with the individual PTRs read from a
list of the actual remaining 500+ PTRs that are to be linked. That's a future scripting task
which can be trivial, as it has been before.

I've made a copy of a portion of the developing webpage, which is just a table at this stage,
and that's the first attachment, presented as a text file. Then there are 33 more files with
the actual data that belong in a subdirectory named "MatchFilesMay2020". These are the ones
that all should already be linked in the table. Another set of 33 files with the quantity
portion of their filenames removed is also attached; they are a different set of data. They
belong in a different subfolder, "MatchedPTRs". The forum software has renamed all the domain-like
files with underscores; those may have to be reconciled with the webpage file to complete the
links.

George Langford

Attachment Size
Table-html-excerpt.txt 17.65 KB
ip-66-70-185.eu_.3.txt 89 bytes
default-rdns.vocus_.co_.nz_.18099.txt 696.82 KB
125.mtsnet.ru_.12.txt 345 bytes
ip-54-39-190.eu_.8.txt 239 bytes
dedic-center.ru_.116.txt 3.5 KB
123-51-215-0.ll_.static.sparqnet.net_.4.txt 201 bytes
ip-54-39-184.eu_.9.txt 267 bytes
dedicated.vsys_.host_.100.txt 3.34 KB
121.dhcp_.apogeetelecom.com_.7.txt 290 bytes
ip-54-39-179.eu_.3.txt 87 bytes
dedicated-assignments-only.fuse_.net_.27.txt 1.34 KB
120.mtsnet.ru_.6.txt 173 bytes
ip-54-39-178.eu_.5.txt 145 bytes
dc113.kdata_.vn_.9.txt 277 bytes
113.mtsnet.ru_.3.txt 87 bytes
ip-54-38-90.eu_.6.txt 164 bytes
dallas-tx-datacenter.serverpoint.com_.6.txt 302 bytes
111.mtsnet.ru_.6.txt 174 bytes
ip-54-38-43.eu_.6.txt 166 bytes
daimon.alastyr.com_.2.txt 60 bytes
111.14.103.jeruk1_.ats-com.net_.4.txt 179 bytes
ip-54-38-42.eu_.7.txt 194 bytes
dailytopoffer.com_.4.txt 136 bytes
109-198-197-x.dynamic.b-domolink.net_.16.txt 837 bytes
ip-54-38-41.eu_.6.txt 166 bytes
cust.uvtnet.cz_.123.txt 3.36 KB
109-198-192-x.dynamic.b-domolink.net_.6.txt 312 bytes
ip-54-38-40.eu_.8.txt 221 bytes
customer.vivid-hosting.net_.215.txt 8.92 KB
100.mtsnet.ru_.3.txt 84 bytes
ip-66-70-185.eu_.txt 515 bytes
default-rdns.vocus_.co_.nz_.txt 1.03 KB
125.mtsnet.ru_.txt 44 bytes
ip-54-39-190.eu_.txt 42 bytes
dedic-center.ru_.txt 84 bytes
123-51-215-0.ll_.static.sparqnet.net_.txt 126 bytes
ip-54-39-184.eu_.txt 41 bytes
dedicated.vsys_.host_.txt 971 bytes
121.dhcp_.apogeetelecom.com_.txt 57 bytes
ip-54-39-179.eu_.txt 44 bytes
dedicated-assignments-only.fuse_.net_.txt 61 bytes
120.mtsnet.ru_.txt 40 bytes
ip-54-39-178.eu_.txt 44 bytes
dc113.kdata_.vn_.txt 380 bytes
113.mtsnet.ru_.txt 39 bytes
ip-54-38-90.eu_.txt 587 bytes
dallas-tx-datacenter.serverpoint.com_.txt 63 bytes
111.mtsnet.ru_.txt 45 bytes
ip-54-38-43.eu_.txt 382 bytes
daimon.alastyr.com_.txt 136 bytes
111.14.103.jeruk1_.ats-com.net_.txt 55 bytes
ip-54-38-42.eu_.txt 461 bytes
dailytopoffer.com_.txt 129 bytes
109-198-197-x.dynamic.b-domolink.net_.txt 68 bytes
ip-54-38-41.eu_.txt 165 bytes
cust.uvtnet.cz_.txt 80 bytes
109-198-192-x.dynamic.b-domolink.net_.txt 62 bytes
ip-54-38-40.eu_.txt 255 bytes
customer.worldstream.nl_.txt 573 bytes
103.140.104-static.rdns_.serverhub.com_.txt 1.15 KB
ip-54-38-38.eu_.txt 245 bytes
customer.vivid-hosting.net_.txt 106 bytes
100.mtsnet.ru_.txt 40 bytes

Ignacio Agulló
Joined: 07/30/2019

Was it necessary to attach dozens of files, about one megabyte of size?

amenex
Joined: 01/03/2015

Ignacio Agulló inquired:
Was it necessary to attach dozens of files, about one megabyte of size?

Yes; they're all different, with differing goals, impacts, patterns, and the like.

I'd also like to encourage others to attempt similar analyses. It's taking me a
couple of months to gather the data and put it into an order which can be examined
to find out why and how so many attacks are being made by servers located at
addresses which cannot be traced. These results show that they can be examined for
country of origin, degree of obfuscation, location of additional addresses, etc.

There are other months in the year; one person cannot possibly keep up with the task;
yet there are hundreds of folks picking up the traces left behind in the headers of
malicious messages; you can find out for yourselves by putting one of the PTR records
(a.k.a. hostnames) in an Internet search engine, enclosed in quotation marks, and then
gathering the IP addresses gleaned from malicious Internet traffic by the many folks
who monitor such traffic. That's another webpage like this excerpt that can be generated.

George Langford

amenex
Joined: 01/03/2015

Regarding the Table-html-excerpt.txt file:

After converting it back to HTML and trying out the links, I discovered an easily corrected
error in about two-thirds of them. In Leafpad the correction is to search for the string
"../../ScoreCards" and replace it with "../ScoreCards".

That should fix all the broken links.
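
For anyone who prefers the command line to Leafpad, the same search-and-replace can be sketched with sed. The sample line is a made-up stand-in for one table cell (the real markup in Table-html-excerpt.txt may differ), and "|" is used as the sed delimiter so the slashes in the path need no escaping.

```shell
#!/bin/bash
# Hypothetical link, only to illustrate the path correction.
broken='<a href="../../ScoreCards/example.txt">example</a>'

# Dots are escaped so they match literal dots; "|" delimits the s command.
fixed=$(printf '%s\n' "$broken" | sed 's|"\.\./\.\./ScoreCards|"../ScoreCards|g')
echo "$fixed"
```

Applied to a whole file, the same expression would be given the file name as an argument, with the output redirected to a new file rather than over the original.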

By the way: The linked files are all plain text without any scripts and contain no more
links to anywhere else. That said, you can test the names with "dig hostname" and the
IP addresses with "dig -x IPaddress". Many of the hostnames resolve to the server
"92.242.140.21", which is a catchall address used by a fellow who maintains a site called
"barefruit error handling" or "unallocated.barefruit.co.uk", but that's not where these
ofttimes malicious servers are. "whois IPaddress" will tell you where they are located
and what their autonomous system number (ASN) is.

George Langford

Magic Banana
Joined: 07/24/2010

These days, I do not have time to decipher your posts. Anyway, a pipe is missing in:
cat Target-0802A.txt sed 's/'peadhcp-163.net1.bgslashpea'/'Success'/g' > Target-0803A.txt
And it is equivalent to, simply:
sed 's/peadhcp-163.net1.bgslashpea/Success/g' Target-0802A.txt > Target-0803A.txt
And, as I have told you many times, a non-escaped dot in a regular expression means "any single character". As a consequence, it should probably be:
sed 's/peadhcp-163\.net1\.bgslashpea/Success/g' Target-0802A.txt > Target-0803A.txt
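
A small sketch of the difference the escaped dots make. The two sample lines are invented for the demonstration: the second has other characters where the dots would be.

```shell
#!/bin/bash
# Two sample lines: a real hostname and a look-alike without dots.
sample=$(printf '%s\n' 'dhcp-163.net1.bg' 'dhcp-163Xnet1Ybg')

# Unescaped dots: each "." matches any single character, so both lines change.
loose=$(printf '%s\n' "$sample" | sed 's/dhcp-163.net1.bg/Success/g')

# Escaped dots: only the literal hostname changes.
strict=$(printf '%s\n' "$sample" | sed 's/dhcp-163\.net1\.bg/Success/g')

printf 'unescaped:\n%s\nescaped:\n%s\n' "$loose" "$strict"
```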

Also, there is no way you got that command line returning:
bash: p: No such file or directory
Bash tried here to execute a command named "p". Such a single "p" was certainly right after the prompt.

amenex
Joined: 01/03/2015

After learning how to count characters in bash:
https://linuxhint.com/length_of_string_bash/

I found out that the offending character in the sed expression is the forward slash preceding
the second "p" in the string to be replaced. The attached text file demonstrates this result.
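
A minimal sketch of two common ways around that slash, assuming the real string is the hostname wrapped in paragraph tags as described at the top of the thread. sed accepts other characters than "/" as the delimiter of the s command, or the slash inside the pattern can be escaped.

```shell
#!/bin/bash
# Hypothetical input line, assuming the real markup is <p>...</p>.
line='<p>dhcp-163.net1.bg</p>'

# With "|" as the delimiter, the "/" in "</p>" needs no escaping.
with_pipe=$(printf '%s\n' "$line" | sed 's|<p>dhcp-163\.net1\.bg</p>|Success|g')

# With the usual "/" delimiter, the slash in the pattern must be written "\/".
with_slash=$(printf '%s\n' "$line" | sed 's/<p>dhcp-163\.net1\.bg<\/p>/Success/g')

echo "$with_pipe"
echo "$with_slash"
```

The first form is usually easier to read when the pattern or the replacement contains paths or closing tags, provided the chosen delimiter does not itself occur in them.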

I have yet to see whether or not my substitution will work in the real world.

Your corrections were a key factor in this small accomplishment; thanks again !

After a lot of tries, I decided simply to bypass sed's difficulty with the pesky slashpea code,
supplied as ExemplarScripts29.txt

I constructed this with a series of awk commands followed by the paste command, plus some editing
in Leafpad to replace troublesome stuff like "'s" and "g'" wherein I filled in the leading and
trailing "'" characters, and a space between the "a and href" that I plugged with "a--href".

Most troublesome is the "IPv4" which even "3334" hasn't fixed. Everything appears to be an
"unknown option to s", which is where it stands now.

George Langford

Attachment Size
MBdemonstration.txt 151 bytes
ExemplarScripts29.txt.txt 698 bytes

amenex
Joined: 01/03/2015

Trying another tack with awk ...

see: https://stackoverflow.com/questions/50244876/how-to-use-gsub-in-awk-to-find-and-replace-and-txt-characters-within
where it's said:
echo "./file_name.txt|1230" | awk '{gsub(/\.\/|\.txt/,"")}1'
which prints:
file_name|1230

In the present task, taking just one exemplar PTR, see the attached file, which also shows bash's response.

The character that's flagged is the end-parenthesis, but that's part of the standard gsub syntax.
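
The attached script is not reproduced in the thread, so the following is only a sketch of how gsub could be applied to one exemplar PTR, assuming the string is the hostname wrapped in paragraph tags. The whole awk program sits inside one pair of single quotes with no single quotes inside it, and the slash of the closing tag is escaped because "/" also delimits the regular expression.

```shell
#!/bin/bash
# Hypothetical input line, assuming the real markup is <p>...</p>.
line='<p>dhcp-163.net1.bg</p>'

# gsub replaces every match in the record; the trailing "1" prints the record.
result=$(printf '%s\n' "$line" |
  awk '{ gsub(/<p>dhcp-163\.net1\.bg<\/p>/, "Success") } 1')
echo "$result"
```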

George Langford

Attachment Size
exemplar-script-awk-gsub.txt 239 bytes

Magic Banana
Joined: 07/24/2010

There are three non-escaped single quotes on this command line.
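
For reference, a short sketch of the usual ways to get a literal single quote past bash. Inside single quotes nothing can be escaped, not even a single quote, so an odd number of them leaves the shell waiting for the closing one. The sample string is arbitrary.

```shell
#!/bin/bash
# Close the quotes, add an escaped quote, and reopen: '\''
with_idiom='it'\''s'

# Inside double quotes a single quote is an ordinary character.
with_double="it's"

# Within an awk program, the octal escape \047 stands for a single quote.
from_awk=$(awk 'BEGIN { print "it\047s" }')

printf '%s\n' "$with_idiom" "$with_double" "$from_awk"
```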

amenex
Joined: 01/03/2015

In my forays into the mysteries of Fortran fifty years ago, mistakes in the coding often generated
error messages that had no discernible relation to the mistakes, but instead pointed to their
consequences. That's what I suspect is happening as a consequence of attempting to use the command
line to edit HTML.

The attached file has five versions of the offending script and the bash responses:
(1) Is the script that was the first that I tried, with escaped dots; ")" was flagged.
(2) All the dots are escaped; also the single quotes; "(" was flagged.
(3) Target file was altered to eliminate the p's; so was the script; ")" was flagged.
(4) Target file still altered to eliminate the p's; single quotes escaped; "(" was flagged.
(5) All the HTML-sensitive characters were translated; `/\&lt' (That's "<" - GL) was flagged.

Ref: https://stackoverflow.com/questions/12873682/short-way-to-escape-html-in-bash
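
A sketch of translating the HTML-sensitive characters with sed, in the spirit of the reference above. It is not the script from the attachment, and the sample line is invented. Two details matter: in the replacement part of an s command a bare "&" means "the text that matched", so a literal ampersand is written "\&"; and the ampersand is translated first so that the entities introduced afterwards are not escaped a second time.

```shell
#!/bin/bash
# Hypothetical line containing the three characters to be translated.
html='<p>dhcp-163.net1.bg</p> & more'

escaped=$(printf '%s\n' "$html" |
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
echo "$escaped"
```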

George Langford

Attachment Size
TrisquelScript-08072020.txt 1.28 KB