A while back Scott Hanselman posted on how to do some CSV file analysis using PowerShell.
Having been a longtime Linux guy, I didn't like what I saw. It felt complex, ugly, and strange. I felt I could do the same in fewer characters using good ole GNU tools like sed, awk, and friends.
The first thing Scott shows is the Import-Csv cmdlet. This is almost exactly like awk. Now PowerShell lovers everywhere are going to scream "NO IT IS NOT!" right about now. I agree. It is not. The way PowerShell treats objects in its command pipeline is awesome and something that cannot be accomplished with a Unix shell. Please ignore that for the remainder of this demonstration.
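To make the difference concrete, here is a minimal sketch (with made-up data) of what awk's field splitting actually gives you: positional text chunks, where Import-Csv would give you objects with named properties.

```shell
# awk splits each input line on the -F delimiter and exposes the pieces
# as $1, $2, ... -- plain text, not typed objects.
printf '"a","b","c"\n' | awk -F, '{print $2}'
# prints "b" (quotes and all: awk is not really parsing CSV)
```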
So given a CSV format like this:
"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
We can trim the first line using sed '1d' and then display columns using awk's $1, $2, $3. I'm using awk's gsub function to just throw away the quotes. It's not ideal and not as good as Import-Csv, but it works.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"");print $1,$2,$3}'
/hanselminutes_0026_lo.wma 78173 163625808
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063
/hanselminutes_0077.wma 17204 1959963618
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279
/hanselminutes_0078.wma 14832 -1241370004
/hanselminutes_0075.mp3 13685 -1840937989
/hanselminutes_0075.wma 12129 1276597408
/hanselminutes_0078.mp3 11058 -1186433073
Next Scott makes a new “calculated column”. Just like in PowerShell, this is trivial in awk.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"");show=$1;print $1,$2,$3,show}'
/hanselminutes_0026_lo.wma 78173 163625808 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 /hanselminutes_0076_robert_pickering.wma
/hanselminutes_0077.wma 17204 1959963618 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 /hanselminutes_0076_robert_pickering.mp3
/hanselminutes_0078.wma 14832 -1241370004 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 -1840937989 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 1276597408 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 -1186433073 /hanselminutes_0078.mp3
Scott makes some jokes about Regular Expressions, but as an old Perl programmer and longtime regular expression master, I'm confident rather than worried about regexes. (Yes, I said master, and I'm not afraid. Ignite the flames!)
We can actually do much better than the regex Scott whipped up. I’m sure Scott could too, but just didn’t for his simple example.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}'
/hanselminutes_0026_lo.wma 78173 163625808 26
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
/hanselminutes_0077.wma 17204 1959963618 77
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 76
/hanselminutes_0078.wma 14832 -1241370004 78
/hanselminutes_0075.mp3 13685 -1840937989 75
/hanselminutes_0075.wma 12129 1276597408 75
/hanselminutes_0078.mp3 11058 -1186433073 78
Cool, so far we seem to be on par with PowerShell. I've not had to learn anything new. In fact, exercising these old skills is like riding a bike for me.
Next Scott groups to count. No problem; that sounds like sort and uniq. Tell sort to do a numeric reverse sort on the 4th column, and tell uniq to count while skipping the first 3 fields. (Don't get me started on inconsistency within the Unix tools: sort numbers fields from 1 with -k, while uniq's -f counts fields to skip.)
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}' | sort -nr -k 4 | uniq -c -f 3
2 /hanselminutes_0078.wma 14832 -1241370004 78
1 /hanselminutes_0077.wma 17204 1959963618 77
2 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
2 /hanselminutes_0075.wma 12129 1276597408 75
1 /hanselminutes_0026_lo.wma 78173 163625808 26
Looks pretty much like what Scott has with his count and group. If you REALLY want the count first and then the show number, you can rearrange things a little: tell awk to output show first. That definitely simplifies the sort command, which is now just a plain numeric reverse sort, and uniq now only has to compare the first 3 characters. I wish uniq had an option to compare fields instead of characters, but there is always awk instead. We are getting there.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print show, $1, $2, $3}' | sort -nr | uniq -c -w 3
2 78 /hanselminutes_0078.wma 14832 -1241370004
1 77 /hanselminutes_0077.wma 17204 1959963618
2 76 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063
2 75 /hanselminutes_0075.wma 12129 1276597408
1 26 /hanselminutes_0026_lo.wma 78173 163625808
Here is where Scott says the good data is trapped in the group. Well, in this case the grouping actually ate data that is unrecoverable: we told uniq to ignore it, and so it did. For the next step we have to throw away sort/uniq and go back to awk. This is the point at which I think most Unix gurus would stop and break out Perl, or pull everything into one large awk script that does it all for us. That sounds like a good idea; I'm sticking with awk. We want to sum hits, which in our awk is $2.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{for(show in showSums) {print show,showSums[show]}}'
26 78173
75 25814
76 40422
77 17204
78 25890
Scott complains about his ugly headers, and in my case I don't even have headers, but for us it is just a print statement. Let's space things out a bit with the output field separator.
$ cat given | sed '1d' | awk -F, '{gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{OFS="\t";print "Show","Hits";for(show in showSums) {print show,showSums[show]}}'
Show Hits
26 78173
75 25814
76 40422
77 17204
78 25890
Scott then asks the question, "Is this the way you should do things…" and answers "No". This too is a solution to a real-world problem in shell script. Sadly, the awk mess above is actually MORE readable to me, because of my experience, than the PowerShell. But I must admit that if I blur my eyes so I don't recognize some things, the PowerShell looks like maybe it is better. Is it time to learn PowerShell?
A PowerShell function? pfff… A shell script! We can lose sed by putting the whole body in an if (NR != 1) block, turning the one-liner into a standalone awk script.
#!/usr/bin/gawk -f
BEGIN {
    FS = ","
}
{
    if (NR != 1) {
        gsub(/"/, "", $1);
        gsub(/"/, "", $2);
        gsub(/"/, "", $3);
        show = gensub(/.*hanselminutes_0*([0-9]+).*/, "\\1", "g", $1);
        showSums[show] += $2
    }
}
END {
    OFS = "\t";
    print "Show", "Hits";
    for (show in showSums) {
        print show, showSums[show]
    }
}
Name the above file Get-ShowHits if you want.
I don’t care what you say, awk is readable 🙂