Cygwin or PowerShell

A while back Scott Hanselman posted on how to do some CSV file analysis using PowerShell.

Having been a long-time Linux guy, I didn’t like what I saw. It felt complex, ugly and strange. I felt I could do the same in fewer characters using good ole GNU tools like sed, awk and friends.

The first thing Scott shows is the import-csv command. This is almost exactly like awk. Now PowerShell lovers everywhere are going to scream “NO IT IS NOT!” right about now. I agree. It is not. The way PowerShell treats objects in its command pipeline is awesome and something that cannot be accomplished using a Unix shell. Please ignore that for the remainder of this demonstration.

So given a CSV format like this:

"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"

 

We can trim the first line using sed '1d' and then display columns using awk $1,$2,$3. I’m using the awk function gsub to just throw away quotes. It is not ideal and not as good as import-csv, but it works.

$ cat given |sed '1d'| awk -F, '{gsub(/"/,"");print $1,$2,$3}'
/hanselminutes_0026_lo.wma 78173 163625808
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063
/hanselminutes_0077.wma 17204 1959963618
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279
/hanselminutes_0078.wma 14832 -1241370004
/hanselminutes_0075.mp3 13685 -1840937989
/hanselminutes_0075.wma 12129 1276597408
/hanselminutes_0078.mp3 11058 -1186433073
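For comparison, a quote-aware CSV parser avoids the gsub workaround entirely. Here is a minimal sketch in Python, assuming the data is held in a string; the names SAMPLE and read_rows are made up for illustration and come from neither Scott's post nor mine.

```python
import csv
import io

# First rows of the sample data, inlined so the sketch is self-contained.
SAMPLE = '''"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
'''

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

# The csv module strips the quotes and consumes the header for us.
for row in read_rows(SAMPLE):
    print(row["File"], row["Hits"], row["Bandwidth"])
```

The header row plays the role of sed '1d' here: DictReader uses it for the keys instead of treating it as data.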

 

Next Scott makes a new “calculated column”. Just like in PowerShell, this is trivial in awk.

$ cat given |sed '1d'| awk -F, '{gsub(/"/,"");show=$1;print $1,$2,$3,show}'
/hanselminutes_0026_lo.wma 78173 163625808 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 /hanselminutes_0076_robert_pickering.wma
/hanselminutes_0077.wma 17204 1959963618 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 /hanselminutes_0076_robert_pickering.mp3
/hanselminutes_0078.wma 14832 -1241370004 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 -1840937989 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 1276597408 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 -1186433073 /hanselminutes_0078.mp3

 

Scott makes some jokes about Regular Expressions, but as an old Perl programmer and long time regular expression master, I’m confident instead of worried about regex. (Yes, I said master and I’m not afraid. Ignite the flames!)

We can actually do much better than the regex Scott whipped up. I’m sure Scott could too, but just didn’t for his simple example.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}'
/hanselminutes_0026_lo.wma 78173 163625808 26
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
/hanselminutes_0077.wma 17204 1959963618 77
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 76
/hanselminutes_0078.wma 14832 -1241370004 78
/hanselminutes_0075.mp3 13685 -1840937989 75
/hanselminutes_0075.wma 12129 1276597408 75
/hanselminutes_0078.mp3 11058 -1186433073 78
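The same capture-group idea, pulled out of the one-liner so the regex is easier to see. This is only a sketch in Python; the helper name show_number is hypothetical.

```python
import re

# Skip any leading zeros, then capture the remaining digits of the show number.
SHOW_NUMBER = re.compile(r"hanselminutes_0*([0-9]+)")

def show_number(filename):
    """Return the show number as a string, or None if the name does not match."""
    match = SHOW_NUMBER.search(filename)
    return match.group(1) if match else None

print(show_number("/hanselminutes_0026_lo.wma"))                # 26
print(show_number("/hanselminutes_0076_robert_pickering.mp3"))  # 76
```

Unlike the gensub call, which hands back the input unchanged when nothing matches, this version returns None for a file name that does not fit the pattern.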

 

Cool, so far we seem to be on par with PowerShell. I’ve not had to learn anything new. In fact, exercising these old skills is like riding a bike for me.

Next Scott groups to count. No problem, sounds like sort and uniq. Tell sort to do a numeric reverse sort based on the 4th column. Tell uniq to count and ignore the first 3 columns. (Don’t get me started on inconsistency within the Unix tools.)

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}' | sort -nr -k 4| uniq -c -f 3
      2 /hanselminutes_0078.wma 14832 -1241370004 78
      1 /hanselminutes_0077.wma 17204 1959963618 77
      2 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
      2 /hanselminutes_0075.wma 12129 1276597408 75
      1 /hanselminutes_0026_lo.wma 78173 163625808 26

 

Looks pretty much like what Scott has with his count and group. If you REALLY want the count and then the show number, you can change things around a little. Tell awk to output show first. That simplifies the sort command, which now just needs a numeric reverse. The uniq command now only has to compare the first 3 characters. I wish there were a field option, but there is always awk instead. We are getting there.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print show, $1, $2, $3}' | sort -nr | uniq -c -w 3
      2 78 /hanselminutes_0078.wma 14832 -1241370004
      1 77 /hanselminutes_0077.wma 17204 1959963618
      2 76 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063
      2 75 /hanselminutes_0075.wma 12129 1276597408
      1 26 /hanselminutes_0026_lo.wma 78173 163625808

 

Here is where Scott says the good data is trapped in the group. Well, in this case the group actually ate data that is unrecoverable. We told uniq to ignore and so it did. For the next step we have to throw away sort/uniq and go back to awk. This is the point at which I think most Unix gurus would stop and break out perl, or pull things into a large awk script to do it all for us. That sounds like a good idea. I’m sticking with awk. We want to sum hits, which in our awk is $2.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{for(show in showSums) {print show,showSums[show]}}'

26 78173
75 25814
76 40422
77 17204
78 25890

 

Scott complains about his ugly headers, and in my case I don’t even have headers, but for us it is just a print statement. Let’s space things out a bit with the output field separator.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{OFS="\t";print "Show","Hits";for(show in showSums) {print show,showSums[show]}}'
Show    Hits
26      78173
75      25814
76      40422
77      17204
78      25890
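The showSums associative array is just a dictionary keyed by show number. A rough Python equivalent, as a sketch only: the ROWS list and the hits_per_show name are illustrative, and the data is inlined rather than read from the file.

```python
import re
from collections import defaultdict

# (file name, hits) pairs taken from the sample CSV above.
ROWS = [
    ("/hanselminutes_0026_lo.wma", 78173),
    ("/hanselminutes_0076_robert_pickering.wma", 24626),
    ("/hanselminutes_0077.wma", 17204),
    ("/hanselminutes_0076_robert_pickering.mp3", 15796),
    ("/hanselminutes_0078.wma", 14832),
    ("/hanselminutes_0075.mp3", 13685),
    ("/hanselminutes_0075.wma", 12129),
    ("/hanselminutes_0078.mp3", 11058),
]

def hits_per_show(rows):
    """Sum hits per show number, like showSums[show] += $2 in the awk version."""
    totals = defaultdict(int)
    for filename, hits in rows:
        show = re.search(r"hanselminutes_0*([0-9]+)", filename).group(1)
        totals[show] += hits
    return dict(totals)

print("Show\tHits")
for show, hits in sorted(hits_per_show(ROWS).items()):
    print(f"{show}\t{hits}")
```

One difference worth knowing: awk does not promise any particular order for `for (show in showSums)`, so this sketch sorts the keys explicitly.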

 

Scott then asks the question, “Is this the way you should do things…” and answers “No”. This too is a solution to a real world problem in shell script. Sadly, the awk mess above is actually MORE readable to me, because of my experience, than the PowerShell. But I must admit if I blur my eyes so that I don’t recognize some things, the PowerShell looks like maybe it is better. Is it time to learn PowerShell?

A PowerShell function? pfff… A Shell Script! We can lose the use of sed by putting the whole thing in an if(NR!=1) block, and this then becomes a shell script.

#!/usr/bin/gawk -f

# Split on commas; the one-liners above did this with -F,
BEGIN { FS="," }

{
  # Skip the header row
  if(NR!=1) {
    gsub(/"/,"",$1);
    gsub(/"/,"",$2);
    gsub(/"/,"",$3);
    show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);
    showSums[show]+=$2
  }
}

END {
  OFS="\t";
  print "Show","Hits";
  for(show in showSums) {
    print show,showSums[show]
  }
}

Name the above file Get-ShowHits if you want.

I don’t care what you say, awk is readable 🙂

ADO.NET Data Services

When I first saw what ADO.NET Data Services was, I wasn’t impressed. It looked like scaffolding + REST URLs and nothing all that special. A nice tool in the belt but nothing mind blowing.

After watching Mike Taulty’s screencasts I’m a little more impressed. The implementation is pretty complete. Not only is the server side complete but the ASP.NET AJAX library for calling these services from JavaScript looks complete and usable AND the webdatagen.exe tool for creating proxy classes to access the services from .NET code is complete.

If you previously dismissed it, give it another look. It is pretty cool. Just one thing… security?

Fluent Interfaces

I like to embrace Languages Inside of Languages. Jeff says “they are actually an ugly hack, and a terrible substitute for true language integration.”

Jeff totally misses the point. His first example of regular expression does indeed replace one line regular expression with ten lines of objects, methods and named enumerations. Jeff asks if this is progress.

It depends on your goal. What if you make a typo in the one line regular expression? You will never know it until you realize that things don’t work. My goal is to leverage the languages at my disposal to maximum benefit.

A fluent interface gets me a certain degree of type safety. This means I can

  • catch errors at compile time instead of debug time.
  • write fewer unit tests to get the same amount of coverage.
  • utilize tools such as intellisense when writing to an API instead of strings.

 

Sure, LINQ is awesome. VB.NET’s XML Literals are awesome too. Jeff is ignoring what a regular expression fluent interface brings to the table: catching errors in your code.

Next Jeff attacks Subsonic’s fluent interface. It looks a lot to me like what I have used, and loved, in API output from NHibernate Query Generator.

Jeff’s example is great. Let’s look at it again:

SELECT * from Customers WHERE Country = "USA"
ORDER BY CompanyName

Is approximately the SQL generated by Subsonic’s fluent interface, given this code:

CustomerCollection c = new CustomerCollection();
c.Where(Customer.Columns.Country, "USA");
c.OrderByAsc(Customer.Columns.CompanyName);
c.Load();

Jeff says this is harder to write. Jeff is wrong.

By using native C# code instead of embedding a SQL query in some string, we gain

  • type safety
  • use of code completion features like intellisense or intellassist
  • ability to pass around queries and partial queries without modifying them with StringBuilder

That last point should be obvious to anyone who has used DetachedCriteria in Hibernate/NHibernate. If you aren’t sure what I mean, go look at DetachedCriteria or better yet go look at the code that NHQG creates.
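To make that last point concrete, here is a toy query builder that can be passed around and extended without touching a string. It is a sketch only: the Query class and its methods are hypothetical, not SubSonic's or NHibernate's API, and since it is Python it shows the composability benefit rather than compile-time checking.

```python
class Query:
    """Hypothetical immutable query builder, for illustration only."""

    def __init__(self, table, filters=(), order=None):
        self.table = table
        self.filters = tuple(filters)
        self.order = order

    def where(self, column, value):
        # Return a new Query so the original partial query stays reusable.
        return Query(self.table, self.filters + ((column, value),), self.order)

    def order_by(self, column):
        return Query(self.table, self.filters, column)

    def to_sql(self):
        """Render parameterized SQL plus the list of bound values."""
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col, _ in self.filters)
        if self.order:
            sql += f" ORDER BY {self.order}"
        return sql, [value for _, value in self.filters]

usa = Query("Customers").where("Country", "USA")  # a partial query
by_name = usa.order_by("CompanyName")             # extended somewhere else
print(by_name.to_sql())
print(usa.to_sql())                               # the partial query is unchanged
```

Because every call returns a new object, the partial query usa can be handed to several callers and each can add its own ordering or filters.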

So, what do you do as a developer? I encourage you to listen to neither Jeff nor me. Make your own decision given the circumstances. I definitely prefer to leverage the compiler and type safety wherever possible. However, I also write Perl when a task at hand can best be accomplished with that tool. Jeff seems to be overlooking some serious benefits to Fluent Interfaces. Please don’t overlook them yourself.

Cmd.Exe and Which

I love bash.

I hate cmd.exe. This is mostly due to unfamiliarity, but I don’t care, I hate regardless.

Enter me, configuring cc.net, with msbuild complaining about a version 10 (Visual Studio 2008) sln file. This is because the msbuild that cc.net knows about is the 2005 msbuild. So I fire up a 2008 Command Prompt, run msbuild, and everything runs fine.

Now it seems like the “Microsoft way” would be for me to examine the path manually and go trouncing through the directories in this path, looking for msbuild. WTF??? I don’t work that way.

In Linux, there is a command called which, which will tell you which command you are executing if you type it at the prompt. All it does is search your path, in order, to tell you which command would run if you actually typed it. The exception is shell built-ins, but let’s ignore that.

[jrwren@utonium:$] which ls
/bin/ls
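The "search your path, in order" part is simple enough to sketch in a few lines of Python. This assumes Unix-style executables (no PATHEXT handling), and the optional path parameter is there only to make it easy to try out; the standard library's shutil.which does the same job properly.

```python
import os

def which(command, path=None):
    """Return the first executable named `command` found on the search path."""
    if path is None:
        path = os.environ.get("PATH", "")
    # Walk the directories in order; the first hit wins, as in the shell.
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

print(which("ls"))
```

Like the real thing, this knows nothing about shell built-ins or aliases.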

It turns out this is TRIVIAL to write as a batch file. cmd.exe has the goods for you. I’m writing this here because MS keeps changing urls on me and my del.icio.us links don’t work anymore.

Maybe this link won’t change.

which.bat:

echo %~$PATH:1

Pretty slick eh?

IMNSHO this should be in every windows directory on every windows computer everywhere 🙂 It’s 2008. I’ve been living with which in linux/unix since I started using it in 1995. Please step into 13 years ago by copying this batch file to all of your windows computers.

Learning F#: Terminology

I like knowing what to call things. I learned a couple of new definitions.

When you bind a new value to an existing value name this is called outscoping.

let funName a =
    let funName n =
        a+n
    funName 5

funName 1

funName with parameter n outscopes funName with parameter a.


Tuples have a special case for the 2-tuple. Dustin Campbell touched on this, but it seems to me that anytime we know a tuple is a 2-tuple it would be beneficial to refer to that tuple as a pair. A pair is a 2-tuple. Dustin showed that pairs are so special that the fst and snd functions can be used on them. These two functions work on pairs, not all tuples.

open System

let maxProjectileDistanceVector = (Math.Sqrt(2.0)/2.0, Math.Sqrt(2.0)/2.0)

maxProjectileDistanceVector is a pair. It’s a tuple too, but a pair is more specific.

Bane

 

[Screenshot: Visual Studio has stopped, 2008-01-21 14:15:50]

Definitions of bane on the Web:

Ann Arbor Computer Society Meeting On February 6th, 2008

I just got off the email with Joe O’Brien 🙂 I’m really excited for this one.

On Wednesday, February 6th 2008, Joe O’Brien of EdgeCase will be speaking at the Ann Arbor Computer Society about Domain Specific Languages: Molding Ruby.

Ever wondered what all the fuss is about when it comes to DSL’s and Ruby? It seems to be all we hear about. This talk will peel away the onion and look at what it is about Ruby that makes it the perfect candidate for creating your own languages. I will show you, through examples, how you can create your own languages without the need for compilers and parsers. We will also cover some real world examples in areas of Banking and Medicine where DSL’s have been applied.

From EdgeCase:

Joe is a father, speaker, author and developer. Before helping found EdgeCase, LLC, Joe was a developer with ThoughtWorks and spent much of his time working with large J2EE and .NET systems for Fortune 500 companies. He has spent his career as a developer, project manager, and everything in between. Joe is a passionate member of the open source community. He co-founded the Columbus Ruby Brigade and helped organize the Chicago Area Ruby Users Group. His passions are Agile Development in the Enterprise, Ruby, and demonstrating to the Fortune 500 the elegance and power of this incredible language. Joe is currently working on a book for the Pragmatic Programmers on building DSL’s with Ruby.

If you haven’t seen Joe speak, I promise you he is awesome. I’m always left scratching my head wondering why I don’t write Ruby.

The Ann Arbor Computer Society meets at SRT Solutions offices at 206 S. 5th Avenue, Ann Arbor, Michigan at 6pm the first Wednesday of every month.

Sara Ford’s Parents Were Cheap, and So Were My Sisters!

Meeting Sara Ford at CodeMash was fun. I don’t know very many Microsoft Employees and I know even fewer who actually work in Redmond.

Sara was so down to earth and fun to hang out with that I am left with an even better opinion of Microsoft’ies post-codemash.

Sara was joking that her parents were too cheap to pay for the h in her name. I love this. I wish I had thought of it so I could have made fun of my sister while we were growing up. My sister is Sara without an h too!

Sara does a great job of introducing herself in the CodeMash podcast with Chris Woodruff. I was very happy to get to talk with her, Steven Harman, Joe Brinkman, Kevin Devine and Michael Kimsal of webdevradio. Our conversation was totally unprepared. I think the idea was to have an open spaces discussion and when only a few people showed up, Michael started recording.

At times the conversation reminded me of the DotNetRocks open source panel episode, but as a 12+ year member, user and contributor to open source projects, I felt like some major points were missed on the DotNetRocks episode. Since I was part of the conversation I was able to represent these points.

Thanks for waving me over Steven. Thanks for putting the mic in my face Michael.

If this sounds interesting, give it a listen.

Order of Operation Matters Thanks to Rounding Error

My previous post was supposed to be followed up much sooner than this. Oh well.

If that was The Good Stuff Is Hidden, then this is The Hidden Good Stuff Can Hurt You.

Joe Landman of “thanks for the awesome meeting space for MDLUG at SGI” fame had an excellent post on floating point arithmetic.

His post is what gave me the idea to do the same thing with Enumerable methods in the previous post.

I’d like to show that Joe’s assertion that programmers should worry about rounding error is true for .NET developers too, especially in this world of forthcoming Parallel LINQ.

Let’s write our own Range method which takes start, end and increment parameters. You could use Enumerable.Range(start,count) and Enumerable.Reverse(this source), but reversing an Enumerable so big doesn’t sound like a good idea to me.

My brain thinks of Range as start & end, not start & count. I blame Python. The Range implementation is a little long because I try to detect infinite ranges.

public IEnumerable<int> Range(int start, int end, int increment)
{
    int i = start;
    if (start.CompareTo(end) <= 0 && increment > 0)
        while (i.CompareTo(end) <= 0)//(i <= end)
        {
            yield return i;
            i = i + increment;
        }
    else if (start.CompareTo(end) >= 0 && increment < 0)
        while (i.CompareTo(end) >= 0)//(i >= end)
        {
            yield return i;
            i = i + increment;
        }
    else
        // The increment would never reach end: an infinite range
        throw new ArgumentOutOfRangeException();
}
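Since I blamed Python, here is roughly what the same start/end/increment contract looks like there. A sketch only: inclusive_range is a hypothetical helper that leans on the built-in range, which excludes its stop value, so the end is nudged by one in the direction of travel.

```python
def inclusive_range(start, end, increment):
    """Lazy range from start to end, both inclusive, stepping by increment."""
    if start <= end and increment > 0:
        return range(start, end + 1, increment)
    if start >= end and increment < 0:
        return range(start, end - 1, increment)
    # The increment points away from end (or is zero): it would never finish.
    raise ValueError("increment can never reach end")

print(list(inclusive_range(1, 5, 1)))    # [1, 2, 3, 4, 5]
print(list(inclusive_range(5, 1, -2)))   # [5, 3, 1]
```

One behavioral difference from the C# iterator: this version raises as soon as it is called, while an iterator block defers its body, including the throw, until the sequence is first enumerated.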

Now let’s use this method to sum the exact same values twice: once from start to end, and again in the opposite order, from end to start.

float x = 0f, y = 0f;
int N = 100000000;

x = Range(1, N, 1).Aggregate(x, (a, b) => a + 1 / (float)b);
y = Range(N, 1, -1).Aggregate(y, (a, b) => a + 1 / (float)b);

Console.WriteLine("[increasing] sum = {0}", x);
Console.WriteLine("[decreasing] sum = {0}", y);

This outputs the following:

[increasing] sum = 15.40368
[decreasing] sum = 18.80792

Not what you would expect, is it? But before we go there…

I find the F# version of this to be easier and more readable.

let si = {1..100000000}
let f a b = a+1.0f/b
let i = Seq.fold f 0.0f (Seq.map (fun a-> (float32 a)) si)

let sd= StandardRanges.int 100000000 (-1) 1 //{100000000..1}
let d= Seq.fold f 0.0f (Seq.map (fun a-> (float32 a)) sd)

printf "[increasing] sum = %f\n" i
printf "[decreasing] sum = %f\n" d

This outputs the same:

[increasing] sum = 15.40368
[decreasing] sum = 18.80792

Now the question… why aren’t these numbers the same? Each is the result of adding all of the same numbers.

Well if you read Joe Landman’s post that I linked above, then you already know. Order of operations matters! Floating point rounding error accumulates and it adds up big here. See Joe’s post for more.
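The effect is easy to reproduce in miniature. The sketch below uses Python's 64-bit floats and a deliberately lopsided list of my own choosing (one huge value and ten small ones) rather than the harmonic series, so it runs instantly and the outcome is deterministic.

```python
import math

# One huge value followed by ten small ones.
values = [1e16] + [1.0] * 10

# Big value first: each 1.0 is too small to change the running total.
big_first = 0.0
for v in values:
    big_first += v

# Small values first: they add up to 10.0 before meeting the big value.
small_first = 0.0
for v in reversed(values):
    small_first += v

print(big_first == small_first)            # False
print(small_first - big_first)             # 10.0
print(math.fsum(values) == small_first)    # True
```

At 1e16 adjacent doubles are 2 apart, so adding 1.0 rounds straight back to 1e16 every time. Summing the small values first lets them accumulate into something large enough to register, which is also why the decreasing sums above come out bigger. math.fsum tracks the lost low-order bits and returns the correctly rounded total.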

When I read this, the thought that PLINQ might be easy quickly left my mind. Results changing just because the order is different!

Now what is worse, let’s say you try to do the same thing in F#, but rather than using integers and casting like we did in C#, we just use F#’s StandardRanges library to generate floating point values for us.

let n = 100000000.0f

printf "[increasing] sum = %f\n" (Seq.fold (fun a b -> a+1.0f/b) 0.0f (StandardRanges.generate 0.0f (+) 1.0f 1.0f n))

printf "[decreasing] sum = %f\n" (Seq.fold (fun a b -> a+1.0f/b) 0.0f (StandardRanges.float32 n -1.0f 1.0f))

The problem here is that the way the floating point numbers are represented is slightly different from the integer cast version above. The result is disappointingly different:

[increasing] sum = 15.40368
[decreasing] sum = Infinity

Infinity? Maybe I’ll have to look into this some more.
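One thing that can be checked without F# at all: a 32-bit float cannot represent every whole number near 100000000, so stepping by 1.0f from there does not behave like counting with integers. The sketch below emulates float32 rounding in Python with the struct module; to_float32 is a hypothetical helper, and I am not claiming this alone explains the Infinity.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = to_float32(100000000.0)

# Subtracting 1 lands between two representable values and rounds back to n.
print(to_float32(n - 1.0) == n)   # True

# Near 1e8 adjacent 32-bit floats are 8 apart.
print(to_float32(n - 8.0))        # 99999992.0
```

The integer cast version never hits this, because the counting happens in int and only the finished value is converted to a float.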

Some interesting things to note:

  • StandardRanges had some issues in 1.9.2.9. Thanks to Dustin Campbell for trying the same code, which ran fine on his 1.9.3.7, and making me upgrade.
  • F#’s Seq is just an alias for IEnumerable<> but it is augmented with more things.
  • Examining the source in prim-types.fs was awesome. Thanks for the source, Microsoft.
  • Looks like prim-types.fs and StandardRanges have changed a bit from 1.9.2.9 to 1.9.3.7 and there is some float specialty stuff in there now. I need to investigate.
  • float in F# is double in C#. float32 in F# is float in C#. Or, if you remember your SATs: C#float:F#float32::C#double:F#float

Wondering about using a 64bit type instead of a 32bit type? Scroll up and click my link to Joe Landman’s post. It doesn’t make this issue go away. We need to actually think about what we are doing as programmers *gasp*.

BestBuy.com Bad Password Policies

I was spending a Gift Certificate from Christmas 2006 (I’m not joking) and when I registered for BestBuy.com, I got this:

[Screenshot: BestBuy.com password policy message, 2008-01-16]

ARE YOU SERIOUS???

The !@#$%^&*()-=`~[]{}\;:'",<.>/? that I had in my password isn’t allowed?

I opted for a 19 character password with lower and upper case letters and numbers.

BUT STILL! This is not really acceptable IMO.