Difficulty in printing the first line of a tab-delimited file

5 replies
amenex
Offline
Joined: 01/03/2015

The following code demonstrates the issue at hand:
awk '(NR>1) {print $0}' Hairball-file.txt | awk 'FS="\t" {print $4}'
The first line of the output is the 4th field of the space-delimited version of what
ought to have been a tab-delimited line of seven fields.
The first line of Hairball-file.txt is a remnant of the header of the original
Hairball of ca. 14 MB, which is skipped over by the script's "(NR>1)" control.

Can that first line be reclaimed other than by referring to the original file?

Attachment: Hairball-file.txt (24.08 KB)
Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

To output the first line of Hairball-file.txt:
$ head -1 Hairball-file.txt
0 # 1:Content analysis 0

The command you gave outputs the fourth field starting from the second line. It is perfectly equivalent to, but more complicated and slower than:
$ tail +2 Hairball-file.txt | cut -f 4
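As a rough illustration of that equivalence, here is a minimal sketch on a made-up three-line sample (the data is hypothetical, not taken from Hairball-file.txt). It uses tail -n +2, the POSIX spelling of tail +2, and compares the result with a single awk process that sets the separator through -F:

```shell
# Hypothetical three-line sample; fields are separated by single tabs.
sample=$(printf 'h1\th2\th3\th4\nA\tB\tC\tD\nE\tF\tG\tH\n')

# Skip the first line, then keep the fourth tab-delimited field.
with_cut=$(printf '%s\n' "$sample" | tail -n +2 | cut -f 4)

# Same selection with one awk process: -F sets FS before any record is read.
with_awk=$(printf '%s\n' "$sample" | awk -F '\t' 'NR > 1 { print $4 }')

printf '%s\n' "$with_cut"   # prints D, then H
```

Both variables should hold the same two lines. The cut pipeline is the simpler choice when nothing beyond selecting a column is needed.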

what ought to have been a tab-delimited line of seven fields.

Every line in Hairball-file.txt is "a tab-delimited line of seven fields". Indeed, the following command, which prints only the lines whose number of fields differs from 7, outputs nothing:
$ awk -F \\t 'NF - 7' Hairball-file.txt
Notice that each and every tab is here considered a delimiter: fields can be empty. In the case of the first line, shown above, the first field and the last field are empty.
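To make the point about empty fields concrete, the sketch below counts the fields of a hypothetical line that starts and ends with a tab (it only imitates the shape of the first line, it is not copied from the attachment). With a tab as FS, the empty first and last fields are counted. With the default FS, they vanish and the space inside a field creates an extra split:

```shell
# Hypothetical line: 7 tab-separated fields, the first and last ones empty.
line=$(printf '\t0\t#\t1:Content analysis\t0\tx\t')

# Every tab is a delimiter, so empty fields count.
tab_nf=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')

# Default FS: runs of blanks delimit, leading/trailing blanks are ignored.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

echo "tab: $tab_nf, default: $default_nf"   # prints "tab: 7, default: 6"
```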

amenex
Offline
Joined: 01/03/2015

That second script was a puzzle until I looked online and discovered that "tail +1" means
to start at the first line. Now it's abundantly clear. Thanks!

It would seem that awk detects the tabs from 'FS="\t" {print ...}' after reading each record;
how does that feature not produce an offset between the first and second records?

Magic Banana

Offline
Joined: 07/24/2010

I do not understand your question. awk indeed reads, one by one, every record and splits it into fields ($1, $2, ..., $NF) using FS as the delimiter. FS, which can be specified after option -F, can be any regular expression. Here it is simply a single tabulation.

amenex
Offline
Joined: 01/03/2015

Let's attack the dilemma another way:
Following MB's use of tail ==>
tail +2 Hairball-file.txt | awk 'FS="\t" {print $4}'
Outputs the 4th "field" of the 2nd line of Hairball-file.txt which is ==>
UID25098-1465998636 39 (3.9 points, 5.0 required) Sat, ...
where the 4th "field" in the pertinent portion
(UID25098-1465998636 39 (3.9 points,)
is "points," where the tabs are interpreted as white space and are therefore consolidated.
As "proof" try this:
tail +2 Hairball-file.txt | awk 'FS="\t" {print $14}' | grep "\S"
which outputs http://www.askrty.casa/3ef5l239T5fM8w612Ek387obzcc0Q18sDiIrI7fDiIrI7EGsi10MQxooKQoLe7tNY1R0D8tT11Ht91/Rachel-cathode,http://www.askrty.casa/51b6W2lA395r7wXa12ts3s87cYcc0t18JDiIrI7fDiIrI7EGsi10PQxooKQoLe7sX1iIR07h,http://www.askrty.casa/5756Q2g3o95VT8J612oL387qaPcc0R18fDiIrI7fDiIrI7EGsi10hQxooKQoLe7zqB1zH07Sl3BHt9/Durkee-repealed,http://www.askrty.casa/bab6C2n39O5zDA8612oO3879Frcc0r18QDiIrI7fDiIrI7EGsi10zQxooKQoLe6zni10r8D,http://www.askrty.casa/cornered-lessens/daa4U2395yjl7a10A387dhcc0j18FDiIrI7fDiIrI7EGsi10PQxooKQoLe7MKL1N0Z8ylHyt9yp,http://www.askrty.casa/stories-conditionals/22e5h239t5c8jJ511X387UeLcc0h18DDiIrI7fDiIrI7EGsi10RQxooKQoLe7Rk1MWn08W3ABAHt9,http://www.askrty.casa/stories-conditionals/c7e6nm2B395vIk8612js38h79ncc0w18fDiIrI7fDiIrI7EGsi10wQxooKQoLe7PTjkW107SXl2Ht9 and which has seven URL's.

Magic Banana

Offline
Joined: 07/24/2010

Outputs the 4th "field" of the 2nd line of Hairball-file.txt

No, it does not. First of all, tail +2 outputs every line starting from the second one. Then, I am not sure what your AWK program does. It certainly does not do what you expect: conditions normally precede actions (between braces) and FS="\t" is not really a condition (FS == "\t" would be a condition testing whether FS is a single tabulation, which would always be false here).

What you apparently wanted to write is:
$ awk -F \\t '{ print $4 }'
Nevertheless, I repeat, that is exactly the same as cut -f 4 but more complicated and slower.
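A likely explanation for the odd first output line, sketched on made-up data: in a POSIX-conforming awk, an assignment to FS only affects records read after it. When FS="\t" sits in the pattern position, it is executed once the first record has already been split with the default FS; the assigned value is a non-empty string, so the pattern is true and the action runs on every record. The sample below is hypothetical and the behaviour shown assumes a conforming implementation such as gawk or mawk:

```shell
# Two identical hypothetical lines: 4 tab-separated fields, each containing a space.
sample=$(printf 'a b\tc d\te f\tg h\na b\tc d\te f\tg h\n')

# FS is assigned after record 1 was split on blanks, so only record 2 uses tabs.
mixed=$(printf '%s\n' "$sample" | awk 'FS="\t" {print $4}')

# -F sets FS before the first record is read: every record is split on tabs.
fixed=$(printf '%s\n' "$sample" | awk -F '\t' '{ print $4 }')

printf 'mixed:\n%s\nfixed:\n%s\n' "$mixed" "$fixed"
```

Here mixed should come out as "d" followed by "g h", while fixed should be "g h" twice, which matches the symptom of a first line that looks space-delimited.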

where the 4th "field" in the pertinent portion is "points," where the tabs are interpreted as white space and are therefore consolidated.

They are not. If you define FS (for instance through option -F, as I did above) as "\t", a single tabulation separates the fields and the spaces are in the fields. If you rather want FS to be either one single space or one single tabulation, write:
$ awk -F '[ \t]' '... your program here...'
If you want FS to be any sequence of those two characters, you can write:
$ awk -F '[ \t]+' '... your program here...'
Nevertheless, in the latter case, you need not even define FS, because splitting on runs of spaces and tabs is what its default value (a single space) already does; the default additionally ignores leading and trailing blanks.
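The difference between these separators can be checked by counting fields on two tiny hypothetical inputs. This is only a sketch; the last two lines show one case where an explicit regular expression and the default FS do not agree, namely a record that starts with a blank:

```shell
# Hypothetical record: "a", then a space and a tab, then "b".
line=$(printf 'a \tb')

single=$(printf '%s\n' "$line" | awk -F '[ \t]' '{ print NF }')   # each blank delimits
runs=$(printf '%s\n' "$line" | awk -F '[ \t]+' '{ print NF }')    # a run of blanks delimits
default=$(printf '%s\n' "$line" | awk '{ print NF }')             # default FS

# Leading blank: the regex yields an empty first field, the default FS does not.
lead_regex=$(printf ' a b\n' | awk -F '[ \t]+' '{ print NF }')
lead_default=$(printf ' a b\n' | awk '{ print NF }')

echo "$single $runs $default $lead_regex $lead_default"   # prints "3 2 2 3 2"
```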

In an AWK script where the user should not have to specify the field delimiter, you may want to define FS at the beginning of the execution, writing (here to have a single tabulation separating the fields):
BEGIN { FS = "\t" }
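As a minimal sketch of such a script, the snippet below writes a two-rule AWK program to a temporary file and runs it with awk -f. The file is created with mktemp and the input is a made-up two-line sample; the NR > 1 rule is borrowed from the first post only for illustration:

```shell
# Write a small AWK script to a temporary file (name chosen by mktemp).
script=$(mktemp)
cat > "$script" <<'EOF'
# Set the field separator once, before the first record is read.
BEGIN { FS = "\t" }
# Print the fourth tab-delimited field of every line after the first.
NR > 1 { print $4 }
EOF

# Hypothetical input: a header line followed by one data line.
result=$(printf 'h1\th2\th3\th4\nA\tB\tC d\tE f\n' | awk -f "$script")
rm -f "$script"

printf '%s\n' "$result"   # prints "E f"
```

Because the BEGIN rule runs before any input is read, the first record is already split on tabs, and the caller does not need to pass -F.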