Using xmllint because xml still sucks

I already mentioned how to get human readable xml from the non-human readable.

Today I had a human readable xml file, but it had many branches, some deep, with lots of information in it. I wanted to know only the names of the nodes which are direct children of the root node.

e.g.

<someRootElement><firstChildOfRoot><someOtherBS><couldBeMoreBS>… </…> …</…></firstChildOfRoot><secondChildOfRoot>…insert tons of crap here</secondChildOfRoot><nextchild…

I could have done this by paging the file, but this would not have scaled. It was just long enough that I didn’t want to do this.

xmllint rescued me again. I started in shell mode

$ xmllint --shell mystupid/xmlfile/thatIhate.xmlhated

Then I am greeted with an xmllint prompt from which I can navigate my xml document much like I do a filesystem, with commands like ls and cd.

/ > ls
---       21 someRootElement
/ > cd someRootElement
someRootElement > ls
ta-        3
---       21 firstChildOfRoot
ta-        3
---       21 secondChildOfRoot

Pretty cool eh?

<3 Tools.
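For the times when xmllint isn’t installed, the same "just show me the root’s direct children" question can be sketched in a few lines of Python. This is a minimal sketch, not what I ran above: the SAMPLE string and the child_names helper are made up for illustration, and the sample stands in for the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: a root element whose children
# contain arbitrarily deep content that we do not want to page through.
SAMPLE = (
    "<someRootElement>"
    "<firstChildOfRoot><someOtherBS><couldBeMoreBS/></someOtherBS></firstChildOfRoot>"
    "<secondChildOfRoot>tons of text</secondChildOfRoot>"
    "</someRootElement>"
)


def child_names(xml_text):
    """Return the tag names of the root's direct children, in document order."""
    root = ET.fromstring(xml_text)
    # Iterating an Element yields only its direct children, not descendants.
    return [child.tag for child in root]


print(child_names(SAMPLE))
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring.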

The Fastest Readable Xml

I hate xml.

I really hate it. No humans should have to read it.

As a developer, if I have to even think about Xml, then it is because some developer before me made the wrong choice.*

That said, sometimes a developer full of contempt toward Xml does need to read xml.

Both feedburner and WordPress output mostly well-spaced xml, but what if I look at rss from blogs.msdn.com? It looks like the webserver is running a whitespace filter. This is not human readable Xml. Now, I COULD copy and paste into Visual Studio, but then I have to open a new xml document, or save what I’m viewing in the browser and open it in Visual Studio. Either way, too many steps.

Cygwin comes with an optional program called xmllint. It is part of the libxml2 package, so be sure to select libxml2 when you run cygwin setup.exe.

$ curl http://blogs.msdn.com/giorgio/rss.xml | xmllint --format - | less

This reformats the xml into a nice 2-space indented by tag display.
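If xmllint isn’t handy, roughly the same re-indenting can be sketched with Python’s standard library. This is an illustration under assumptions, not a replacement for the pipeline above: the COMPACT string is a made-up one-line feed and pretty is a hypothetical helper name.

```python
from xml.dom import minidom

# Hypothetical one-line feed standing in for whitespace-stripped RSS.
COMPACT = (
    "<rss><channel><title>Example</title>"
    "<item><title>Post</title></item></channel></rss>"
)


def pretty(xml_text, indent="  "):
    """Re-indent XML text so each element starts on its own line."""
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


print(pretty(COMPACT))
```

This works cleanly only on input that has no whitespace between tags to begin with, which is exactly the whitespace-filtered case here; already-indented input picks up extra blank lines.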

* This may be a bit extreme, but I will stand by it most of the time.

C# vNext feature request

Eric Maino was at the Ann Arbor Dot Net Developers meeting last night. I don’t think he meant to open up the giant can of worms that is me when he asked “What kind of things would you like to see in the next version of C#?” Here is my list.

  • Static Imports

    yes like java!

    Sometimes extension methods aren’t as readable as just a method call. Ability to import static methods and call them without the class name would be super.

  • Drop var.

    Type inference rules. Can we drop the var? It’s noise.
    var x = 1;
    x = 1;
    Which of these lines is more concise? At least make it optional. I don’t mind if you want to say var. Just don’t make me do it.

  • Better type inference

    Optional type parameters, since the compiler knows what I mean!
    peeps = List(new string[] {"Bob","Dorothy","Jane"});
    This is a List<T>, but I don’t need to say List<string> the type can be inferred via the List<T>(IEnumerable<T>) constructor.

  • Optional Whitespace Significance

    We all use whitespace formatted code anyway. Why not just drop the braces and semicolons and make the whitespace actually mean something? Of course I say “optional” because not everyone will like it. It would be really cool to be able to toggle it throughout a file.
    #pragma whitespace significance enable
    public class Employee
      public Employee()
        name = string.Empty
        hireDate = DateTime.Now
      public string Name
        get
        set
      public DateTime HireDate
        get
        set
    #pragma whitespace significance disable
    public GiveRaise() {
    salary *=1.10;
    }
    #pragma whitespace significance enable
      public Promote()
        CLevel--

    The file would have to end in the same mode in which it was started, or else the compiler would have to figure things out.

  • A hook into the compiler pipeline

    See boo.

  • A hybrid compiler

    At CodeMash, Joe O’Brien and Michael Letterle talked about how nice it would be to use the better suited language for any given task in .NET. We don’t want separate projects. We don’t want separate netmodules.

    I want it right down to partial classes. VB’s XML Literals and late binding are sweet. I want to use those when I need them. For everything else, C# is what I prefer to work with.

    ReadYourFeed.cs:
    public partial class ReadYourFeed {
      void Initialize() {
        rss.FirstChild.AppendChild(GetGenerator());
      }
    }

    ReadYourFeed.vb:
    Partial Public Class ReadYourFeed
      Public Function GetGenerator()
        Dim sig = <generator>SuperReadYourFeed 2000 1.2.3.4.5</generator>
        Return sig
      End Function
    End Class

    * I don’t know VB. Writing that VB took me much googling.

  • A pluggable compiler

    Similar to the above with boo, but almost the inverse. I want an open definition for the AST which the compiler uses to generate IL and I want a great API for building this AST myself.

    Next, I should be able to plug in my own parser/lexer which builds this AST. So the IL generation from the AST is entirely from the (now more than) C# compiler, but the parser is mine.

    This is really an abstraction and opening of the previous step. Instead of just C#/VB, I want it ALL!

  • Easier method currying.

    Dustin Campbell might like writing lots of code to do it in C#. I’m sure he does it all the time, but I’d like to be able to do it in FAR fewer lines of code.

  • F#’s pipeline operator.

    This is more readable
    {1..10} |> Seq.fold (+) 0;;
    than this
    Enumerable.Aggregate(Enumerable.Range(1, 10), (a, b) => a + b);
    or this
    (from i in Enumerable.Range(1, 10) select i).Sum();
    or even this
    Enumerable.Range(1, 10).Sum();
    Ok, well extension methods sure look a lot like the pipeline operator, so never mind on this request 🙂

It seems like I talked about other crazy stuff too, but I don’t recall what things they were.

When you don’t know the language…

internal static short uiShiftRightBy8(short x)
        {
            short iNew;
            iNew = (short)((x & 0x7FFF) / 256);
            if( (x & 0x8000) != 0)
                iNew = (short)(iNew | 0x80);

            return iNew;
        }

 

AHHHHHHH!!!!*

Never mind that this was public even though it was used only once, by the class in which it was defined. Never mind that this was not marked static even though it uses no member functions. (Also never mind that I made it internal instead of private so I could test it and make sure it was as stupid as I thought.)

When you learn a programming language, even when you can get things done in it, it still adds tremendous value to you as a programmer to READ A FRAKKING BOOK on the language. The above code works, absolutely. It does exactly what the writer intended. When the #1 concern in writing software is “do whatever is needed to ship it now” and you have no time to better yourself, you get this code. IMO maintainability should be the #2 concern, directly following “does it work?”

I’ve entirely removed the above code and replaced it with the operator and value it implements: >> 8

Yes it really is Shift Right by 8. Yes, someone didn’t know about the >> operator. Yes I wasted time reading code that uses this method. Yes I don’t want to think about the performance characteristic of this in a tight loop in critical systems.
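One thing worth checking before trusting a swap like this: the method clears the sign bit before dividing and then puts it back at bit 7, which makes it a logical (zero-filling) shift of the 16-bit pattern, while C#’s >> on a signed short sign-extends. The sketch below is a hypothetical Python port, with names invented for illustration, that simulates 16-bit values to compare the two. It suggests the plain >> 8 matches for non-negative inputs, and that negative inputs would need an unsigned cast to behave the same.

```python
def to_int16(value):
    """Wrap an integer into the signed 16-bit range, like a C# (short) cast."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def ui_shift_right_by_8(x):
    """Port of the original method; x is a signed 16-bit value."""
    i_new = (x & 0x7FFF) // 256   # shift bits 8..14 down with the sign bit cleared
    if x & 0x8000:                # sign bit was set: restore it at bit 7
        i_new |= 0x80
    return to_int16(i_new)


def logical_shift_right_8(x):
    """Zero-filling shift of the 16-bit pattern, like (ushort)x >> 8 in C#."""
    return (x & 0xFFFF) >> 8


def arithmetic_shift_right_8(x):
    """Sign-extending shift, like x >> 8 on a signed value."""
    return x >> 8


# Inputs where the sign-extending shift disagrees with the original method.
mismatches = [
    x for x in range(-32768, 32768)
    if ui_shift_right_by_8(x) != arithmetic_shift_right_8(x)
]
print(len(mismatches), "inputs differ; all negative:", all(x < 0 for x in mismatches))
```

If only non-negative values ever reach the method, the simple operator is a drop-in replacement.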

*My coworkers laughed when I said AHHHHHHH!!! out loud. At least bad code provides us with solid entertainment.

Speculations about .NET 4.0

I found this in my Live Writer drafts. I figured I should send it, since it’s been a month and I haven’t thought of anything to add to it.

*insert timewarp to March 18th, 2008*

I was at the Visual Studio 2008 launch in Detroit today. This was my 3rd Microsoft Launch event. The first being Visual Studio 2005 and the second being Vista/Office2k7. As a developer the Vista launch was really the .NET 3.0 launch to me.

One thing I’ve learned that I don’t do well and I need to do better is future think. What is next?

  • ASP.NET 3.5 “extensions” will be known as ASP.NET 4.0
    • ASP.NET MVC (see MonoRail)
    • ASP.NET Dynamic Data (see Rails scaffolding or MonoRail scaffolding)
    • ASP.NET AJAX (browser history, but I expect more here)
    • ADO.NET Entity Framework (see NHibernate)
    • ADO.NET Data Services (many projects to see here, snooze is one)
    • Silverlight Controls for ASP.NET (I expect much more here)
  • Full development support for Silverlight will ship along with this release.
  • WPF finally gets an editable Grid. It won’t be named DataGrid, DataGridView or GridView (which exists in WPF as a ListView mode). It may be named DataViewGrid or GridDataView.
  • A WPF “Dynamic Data” library complements the ASP.NET extension.
  • The System.Linq.Enumerable class (in System.Core) gets an Each extension method. Its name endlessly confuses developers who don’t understand why it wasn’t there in 3.5 and who don’t know why it isn’t ForEach like Array and List<>. Developers start calling Microsoft’s naming difference a “catch up to Ruby”.
  • A handful of new Workflow activities are released but no one knows what they are, what they do, or how to use them.
  • CardSpace continues and adds even easier support for OpenID but no one knows what they are, what they do, or how to use them.
  • Like the 3.0 release the languages and compilers don’t change.
  • a DI/IoC container with a subset of the features of Windsor, Spring.net, StructureMap or even EntLib ObjectBuilder is included!
  • DLR still won’t ship. No IronPython. No IronRuby. Python and Ruby will remain the most common languages used in demos for Silverlight.
  • Visual SourceSafe is still the recommended version control system if you can’t afford Team Foundation Server. WTF?
  • There will be some hidden PerfCounter, WMI, awesomeness. It won’t be documented very well.
  • Hibernate and NHibernate users everywhere are stunned and gape-mouthed when they see Entity Framework. Then they fall out of their chairs laughing. “We hate maintaining our single HBM files, you want us to maintain 3 uglier XML files (CSDL, MSL, SSDL)?” Microsoft answers these cries with “use the designer”, but the power developers want to do things that are only possible in the XML and that the designer cannot do.

I am Speaking at the Central Ohio Day Of Dot Net

I’m excited to be talking about building DSLs with Boo.

Central Ohio Day of .NET

I actually had the pleasure of giving this talk at the Ann Arbor Computer Society last Wednesday. I was very anxious because it seems like a complex topic and I wasn’t sure how well I would be able to deliver it or how well it would be received. I’m happy to report that it was both well delivered (toot-toot) and well received. I’m now much more at ease going into Central Ohio Day Of Dot Net.

I’m looking forward to attending MUG this evening and giving a 10 minute introduction to boo. Initially, my wife vetoed my attendance, but after my name was out in the email, I convinced her that I needed to go.

Google Alerts Reminds Me of Who I Was And Who I Want to Become

Like any good Leo, I have a vanity google alert set up on my name. Leos are supposedly vain and if I am to be a good Leo, then I must vanity search myself often. Today, this link came up on my google alert. It isn’t new. It is old. It is from 2003.

from http://unix.derkeiler.com/Mailing-Lists/Tru64-UNIX-Managers/2003-05/0197.html

Date: Wed, 21 May 2003 14:42:56 -0400
To: tru64-unix-managers@ornl.gov

Hello,

I’m looking for instructions for upgrading firmware on an HSZ50. HP
says it cannot be done, but I thought maybe someone knew otherwise.

I’m also looking for vendors who sell version 5.7Z memory cards of the
HSZ firmware.

Thanks

--
Jay R. Wren

In 2003 I didn’t know that I was a programmer. My title was “systems programmer” and we always joked that the title was a carry over from when any computer operator was a “programmer”. It turns out that much of my Systems Administration work in that role was not too much different from the jobs of many programmers.

I thought that real programmers followed all those software engineering things that I had heard a little bit about when I was in college. I thought that real programmers didn’t just throw code together like I was doing at the time. It turns out that I was wrong.

Of course life is a journey in which the destination is in constant change. I never did become what I wanted to be 5 years ago. I’m sure that my current destination will change over the next five years.

ALT.NET What He Said

ALT.NET What He Said or just ALT.What He Said written alt.whathesaid for the usenet types and abbreviated a.whs for the lazy types.

I’ve written about ALT.NET before. This time, I’m just going to say ME TOO!

Whether you like the “ALT.NET” name or not, I highly recommend you read Jeremy Miller’s post about the need for supplemental .NET leadership. As usual, someone else is far better at putting into words the way that I think and feel about a topic.

I urge anyone who has had the notion to speak at their local user group about a topic in which they are well versed to do so! Even if you feel like your style doesn’t mix or no one wants to hear about your cogs, we do want to hear about how you make your cogs. Why did domain driven development and test stories work so well for you? Why did you not choose a traditional n-tier design? We want to know!

I am now looking forward to hearing about your cog process or your widget method or how your development approach saved Spacely Sprockets time and money.

Cygwin or PowerShell

A while back Scott Hanselman posted on how to do some CSV file analysis using PowerShell.

Having been a long time Linux guy I didn’t like what I saw. It felt complex, ugly and strange. I felt I could do the same in fewer characters using good ole GNU tools like sed, awk and friends.

The first thing Scott shows is the import-csv command. This is almost exactly like awk. Now PowerShell lovers everywhere are going to scream “NO IT IS NOT!” right about now. I agree. It is not. The way PowerShell treats objects in its command pipeline is awesome and something that cannot be accomplished using a Unix shell. Please ignore that for the remainder of this demonstration.

So given a CSV format like this:

"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"

 

We can trim the first line using sed '1d' and then display columns using awk’s $1, $2, $3. I’m using the awk function gsub to just throw away quotes. It is not ideal and not as good as import-csv, but it works.

$ cat given |sed '1d'| awk -F, '{gsub(/"/,"");print $1,$2,$3}'
/hanselminutes_0026_lo.wma 78173 163625808
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063
/hanselminutes_0077.wma 17204 1959963618
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279
/hanselminutes_0078.wma 14832 -1241370004
/hanselminutes_0075.mp3 13685 -1840937989
/hanselminutes_0075.wma 12129 1276597408
/hanselminutes_0078.mp3 11058 -1186433073

 

Next Scott makes a new “calculated column”. Just like in PowerShell, this is trivial in awk.

$ cat given |sed '1d'| awk -F, '{gsub(/"/,"");show=$1;print $1,$2,$3,show}'
/hanselminutes_0026_lo.wma 78173 163625808 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 /hanselminutes_0076_robert_pickering.wma
/hanselminutes_0077.wma 17204 1959963618 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 /hanselminutes_0076_robert_pickering.mp3
/hanselminutes_0078.wma 14832 -1241370004 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 -1840937989 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 1276597408 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 -1186433073 /hanselminutes_0078.mp3

 

Scott makes some jokes about Regular Expressions, but as an old Perl programmer and long time regular expression master, I’m confident instead of worried about regex. (Yes, I said master and I’m not afraid. Ignite the flames!)

We can actually do much better than the regex Scott whipped up. I’m sure Scott could too, but just didn’t for his simple example.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}'
/hanselminutes_0026_lo.wma 78173 163625808 26
/hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
/hanselminutes_0077.wma 17204 1959963618 77
/hanselminutes_0076_robert_pickering.mp3 15796 -55874279 76
/hanselminutes_0078.wma 14832 -1241370004 78
/hanselminutes_0075.mp3 13685 -1840937989 75
/hanselminutes_0075.wma 12129 1276597408 75
/hanselminutes_0078.mp3 11058 -1186433073 78
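For comparison only, here is the same "parse the CSV and pull the show number out of the file name" step sketched in Python. It is not part of Scott’s post or my pipeline: the GIVEN string is an inline copy of the first few rows instead of the file named given, and rows_with_show and SHOW_NUMBER are names I made up for the sketch.

```python
import csv
import io
import re

# Inline copy of the first rows of the CSV shown earlier (an assumption made
# so the sketch is self-contained; the real data lives in the file "given").
GIVEN = '''"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
'''

# Same idea as the gensub pattern: skip leading zeros, capture the digits.
SHOW_NUMBER = re.compile(r"hanselminutes_0*([0-9]+)")


def rows_with_show(text):
    """Yield (file, hits, bandwidth, show) with show parsed from the file name."""
    for row in csv.DictReader(io.StringIO(text)):
        match = SHOW_NUMBER.search(row["File"])
        show = int(match.group(1)) if match else None
        yield row["File"], int(row["Hits"]), int(row["Bandwidth"]), show


for record in rows_with_show(GIVEN):
    print(*record)
```

The csv module handles the header row and the quoting, so there is no equivalent of the sed '1d' and gsub steps.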

 

Cool, so far we seem to be on par with PowerShell. I’ve not had to learn anything new. In fact exercising these old skills is like riding a bike for me.

Next Scott groups to count. No problem, sounds like sort and uniq. Tell sort to do a numeric reverse sort based on the 4th column. Tell uniq to count and ignore the first 3 columns. (Don’t get me started on inconsistency within the Unix tools.)

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print $1, $2, $3, show}' | sort -nr -k 4| uniq -c -f 3
      2 /hanselminutes_0078.wma 14832 -1241370004 78
      1 /hanselminutes_0077.wma 17204 1959963618 77
      2 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063 76
      2 /hanselminutes_0075.wma 12129 1276597408 75
      1 /hanselminutes_0026_lo.wma 78173 163625808 26

 

Looks pretty much like what Scott has with his count and group. If you REALLY want the count and then the show number you can change things around a little. Tell awk to output show first. That definitely simplifies the sort command, which now just needs to be a numeric reverse sort. The uniq command now only has to compare the first 3 characters. I wish there were a field option, but there is always awk instead. We are getting there.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);print show, $1, $2, $3}' | sort -nr | uniq -c -w 3
      2 78 /hanselminutes_0078.wma 14832 -1241370004
      1 77 /hanselminutes_0077.wma 17204 1959963618
      2 76 /hanselminutes_0076_robert_pickering.wma 24626 -1789110063
      2 75 /hanselminutes_0075.wma 12129 1276597408
      1 26 /hanselminutes_0026_lo.wma 78173 163625808
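The sort | uniq -c combination is really just "count rows per key". As a hedged side-by-side, the same count can be sketched in Python with a Counter. The GIVEN string is an inline copy of the sample data and files_per_show is a made-up helper name.

```python
import csv
import io
import re
from collections import Counter

# Inline copy of the sample CSV (assumed here instead of reading "given").
GIVEN = '''"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
'''


def files_per_show(text):
    """Count how many files (rows) belong to each show number."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        match = re.search(r"hanselminutes_0*([0-9]+)", row["File"])
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Highest show number first, like sort -nr.
for show, count in sorted(files_per_show(GIVEN).items(), reverse=True):
    print(count, show)
```

Unlike uniq, the Counter does not need its input sorted first, and it keeps nothing but the key and the count.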

 

Here is where Scott says the good data is trapped in the group. Well in this case the group actually ate data that is unrecoverable. We told uniq to ignore and so it did. For the next step we have to throw away sort/uniq and go back to awk. This is the point at which I think most Unix gurus would stop and break out perl, or pull things into a large awk script to do it all for us. That sounds like a good idea. I’m sticking with awk. We want to sum hits, which in our awk is $2.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{for(show in showSums) {print show,showSums[show]}}'

26 78173
75 25814
76 40422
77 17204
78 25890

 

Scott complains about his ugly headers, and in my case I don’t even have headers, but for us it is just a print statement. Let’s space things out a bit with the output field separator.

$ cat given |sed '1d'| awk -F, ' {gsub(/"/,"",$1);gsub(/"/,"",$2);gsub(/"/,"",$3);show=gensub(/.*hanselminutes_0*([0-9]+).*/,"\\1","g",$1);showSums[show]+=$2}END{OFS="\t";print "Show","Hits";for(show in showSums) {print show,showSums[show]}}'
Show    Hits
26      78173
75      25814
76      40422
77      17204
78      25890
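The showSums array in awk is an associative array keyed by show number. For readers who don’t speak awk, here is a minimal sketch of the same sum in Python. It assumes an inline copy of the sample data in GIVEN, and hits_per_show is a hypothetical helper name.

```python
import csv
import io
import re
from collections import defaultdict

# Inline copy of the sample CSV (assumed here instead of reading "given").
GIVEN = '''"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
'''


def hits_per_show(text):
    """Sum the Hits column per show number, like showSums[show] += $2 in awk."""
    sums = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        match = re.search(r"hanselminutes_0*([0-9]+)", row["File"])
        if match:
            sums[int(match.group(1))] += int(row["Hits"])
    return dict(sums)


print("Show", "Hits", sep="\t")
for show, hits in sorted(hits_per_show(GIVEN).items()):
    print(show, hits, sep="\t")
```

The sorted call makes the row order explicit; awk’s for (show in showSums) makes no promise about order.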

 

Scott then asks the question, “Is this the way you should do things…” and answers “No”. This too is a solution to a real world problem in shell script. Sadly, the awk mess above is actually MORE readable to me, because of my experience, than the PowerShell. But I must admit if I blur my eyes so that I don’t recognize some things, the PowerShell looks like maybe it is better. Is it time to learn PowerShell?

A PowerShell function?  pfff… A Shell Script! We can lose the use of sed by putting the whole thing in an if(NR!=1) block, thus becoming a shell script.

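#!/usr/bin/gawk -f

# Split input on commas, as -F, did on the command line.
BEGIN { FS = "," }

{
  # Skip the header row (this replaces sed '1d').
  if (NR != 1) {
    gsub(/"/, "", $1);
    gsub(/"/, "", $2);
    gsub(/"/, "", $3);
    show = gensub(/.*hanselminutes_0*([0-9]+).*/, "\\1", "g", $1);
    showSums[show] += $2
  }
}

END {
  OFS = "\t";
  print "Show", "Hits";
  for (show in showSums) {
    print show, showSums[show]
  }
}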

Name the above file Get-ShowHits if you want.

I don’t care what you say, awk is readable 🙂

ADO.NET Data Services

When I first saw what ADO.NET Data Services was, I wasn’t impressed. It looked like scaffolding + REST URLs and nothing all that special. A nice tool in the belt but nothing mind blowing.

After watching Mike Taulty’s screencasts I’m a little more impressed. The implementation is pretty complete. Not only is the server side complete but the ASP.NET AJAX library for calling these services from JavaScript looks complete and usable AND the webdatagen.exe tool for creating proxy classes to access the services from .NET code is complete.

If you previously dismissed it, give it another look. It is pretty cool. Just one thing… security?