CSV parser program

6 réponses [Dernière contribution]
boi
boi
Hors ligne
A rejoint: 04/01/2021

Hello, everyone.

I wanted to parse a CSV file with format: Item,Manufacturer,...,Price - in such way to split it into 3 different files, first - all items with price 249.99 and less, second - 250 to 549.99, third - 550 and more. The files need to be sorted by Manufacturer, then descending by Price. What program can parse CSV files that way?

J.B. Nicholson-Owens
Hors ligne
A rejoint: 06/09/2014

name at domain wrote:
> I wanted to parse a CSV file with format: Item,Manufacturer,...,Price - in such way
> to split it into 3 different files, first - all items with price 249.99 and less,
> second - 250 to 549.99, third - 550 and more. The files need to be sorted by
> Manufacturer, then descending by Price. What program can parse CSV files that way?

Spreadsheets (such as LibreOffice Calc and Gnumeric) can help. But typically there
are size limits that might get in the way with a large dataset. However this gives a
UI and graphing options that can make working with the data quite convenient and
promote experimentation.

Perhaps csvtool (https://github.com/janloeffler/csvtool) might help in combination
with other command-line programs. But this will require parsing CSV rows as text and
constantly dealing with those figures as lines of text rather than values with
meaning in your data (item names, prices, etc.). This option will scale very well as
pipelines can handle arbitrarily large input datasets.

Another approach is to load your CSV file into a database (such as SQLite, the
world's most widely-used database) and issue queries on the table. Depending on your
subsequent sort and search needs, this approach is likely to scale up well and avoid
clumsy shell-based limits with pipelines and parsing text. You could also use RDBMS
front-ends that let you manipulate the database with GUI convenience.

SQLite makes it easy to load a CSV file into a table:

[foo.csv exists on the filesystem and is a 3-column CSV file]
$ sqlite3 foo.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import foo.csv nameAddress
sqlite> .schema nameAddress
CREATE TABLE IF NOT EXISTS "nameAddress"(
"last-name" TEXT,
"first-name" TEXT,
"address" TEXT
);
sqlite> select * from nameAddress;
smith,john,"123 main st"
jones,pat,"456 somewhere ln"
lee,phil,"789 somewhere ln"

Magic Banana

I am a member!

Hors ligne
A rejoint: 07/24/2010

AWK and sort.

First of all, count how many fields there are in the first line of your file (I assume here it is named "file.csv"):
$ head -1 file.csv | awk -F , '{ print NF }'
Let us imagine that command outputs 42 (below, substitute 42 for whatever it actually outputs). Then, before actually doing the processing you want, it is worth checking whether every line indeed has 42 fields:
$ awk -F , 'NF != 42' file.csv
If no line is printed, then go ahead:
$ awk -F , '$42 < 250' file.csv | sort -t , -k 2,2 -k 42nr
$ awk -F , '$42 >= 250 && $42 < 550' file.csv | sort -t , -k 2,2 -k 42nr
$ awk -F , '$42 >= 550' file.csv | sort -t , -k 2,2 -k 42nr

To redirect to a file (rather than to the terminal, by default) the output of any command line, append the > character followed by the output file name (between quotes if it includes a space or any other character the shell interprets; idem for the input file name by the way).

That is a solution I would come up with working interactively. I hope it will motivate you to study 'info sort' and a little of AWK's manual. Do you rather need a script that takes a file (and the thresholds?) at input and does the work whatever the number of fields?

boi
boi
Hors ligne
A rejoint: 04/01/2021

Thanks for the replies, guys.

Magic Banana your scripts are doing exactly what I want.

Magic Banana

I am a member!

Hors ligne
A rejoint: 07/24/2010

As I wrote, it is not a script but interactive usage. If you often want to apply a similar processing, it is worth writing a Shell script, what is not much harder.

boi
boi
Hors ligne
A rejoint: 04/01/2021

Thanks, what will this script look like?

Magic Banana

I am a member!

Hors ligne
A rejoint: 07/24/2010

Well, you specify it as you wish, depending on what differs between your executions. If only the input files and the thresholds (the manufacturer would always be in the second field and the price always in the last field), then it could take those arguments:
$ price-split file.csv 250 550
The output files could have the name of the input file plus a suffix specifying the price interval. Something with options is possible but starts to be more complicated to write (I never remember how to use getopt).