Wednesday, April 06, 2011

LINQ With DataSets

Today's blog post is going to be about using LINQ with DataSets. As anyone who reads my blog regularly (or various forum posts) might already know, I've been a big fan of Typed DataSets since the very beginning of .NET (all the way back to the pretty buggy 1.0 version). However, LINQ is fairly new to me. LINQ *itself* is not new to me ... it has been around for a few years now and I've certainly been aware of it. But I haven't utilized it much because, until .NET 3.5, there was no real support for Typed DataSets and that's really all I wanted to be able to use if for. Sure, you could still use LINQ with both regular DataSets and Typed DataSets ... as long as you used .AsEnumerable() ... but you couldn't take advantage of the Typed nature of a Typed DataSet, so I didn't see any point in messing around with it.

I've recently begun using it more with Typed DataSets because, surprisingly to me, for some things it seems to be faster. Since .NET 3.5, Typed DataSets have been getting auto-generated slightly different than they have in the past (and I'm not talking about the TableAdapter generation, which I despise ... see my blog post about avoiding that fiasco: http://geek-goddess-bonnie.blogspot.com/2010/04/create-xsd.html).

Prior to 3.5, the DataTable definition in a generated Typed DataSet looked something like this:

public partial class PersonnelDataTable : global::System.Data.DataTable, global::System.Collections.IEnumerable
{
// rest of the code here
}


The minor difference is this:

public partial class PersonnelDataTable : global::System.Data.TypedTableBase<PersonnelRow>
{
// rest of the code here
}


TypedTableBase is derived from DataTable, so really nothing changes. But, it allows the DataTable to be used in LINQ queries without specifying .AsEnumerable(), allows us to use the Typed column name properties of the DataRows in our LINQ queries *and* it's also much faster.

As I already mentioned above, LINQ *can* easily be used with regular DataSets/DataTables (it's use is NOT limited to Typed DataSets), but I am not including examples of that in this blog post. As with the 2.0 Typed DataSets, all that is needed is to use .AsEnumerable() with the DataTable, so the code will be almost identical.

So, let's see a few examples.

First, let's look at a typical use for a DataSet, selecting some rows. Using LINQ with a Typed DataSet ends up being quite a bit faster than using the DataTable.Select() method. In all my tests, I used a DataTable containing 270,000 rows. The Select in this test, selected 30,000 of those rows. I also did comparisons between 2.0 DataTable and 3.5 TypedTableBase, just for the heck of it.

// ds20 is a 2.0 Typed DataSet, which uses plain a plain old DataTable

DataRow[] dsSelect20 = ds20.Personnel.Select("lastname = 'Berent'");

DataRow[] linqSelect = ds20.Personnel.AsEnumerable()
.Where(row => row.Field<string>("lastname") == "Berent")
.Select(row => row).ToArray();


The above are your only two choices when using untyped or 2.0 Typed DataSets. Benchmark timing results show that LINQ is about 6 times faster than the old DataTable.Select() method:

  • Select:  5,625,000 ticks
  • LINQ:      937,500 ticks

If you're using 3.5 Typed DataSets, you have a few more options with LINQ. Generally, your LINQ statement will look like this:

// ds35 is a 3.5 Typed DataSet, which uses the new TypedTableBase class for it's DataTable

DataRow[] linqSelectLastName = ds35.Personnel
.Where(row => row.lastname == "Berent")
.Select(row => row).ToArray();


Note the differences: you don't need Personnel.AsEnumerable() and you can use the typed row column names, row.lastname. There is one caveat though to the above syntax. If any row in the Personnel table contains DBNull.Value in the lastname column, the above syntax will throw an exception. You must also check for the null, so the actual statement will need to be this:

DataRow[] linqSelectLastName = ds35.Personnel
.Where(row => row.IslastnameNull() == false && row.lastname == "Berent")
.Select(row => row).ToArray();


You could also use the untyped syntax of the column and then you don't need to check for DBNull, but I like to take advantage of the typed nature of a Typed DataSet. That is the purpose of them, after all. =0) Also, using the untyped syntax is slightly slower, but probably not significantly so.

DataRow[] linqSelectUntypedLastName = ds35.Personnel
.Where(row => row.Field<string>("lastname") == "Berent")
.Select(row => row).ToArray();


Incidentally, my data *does* contain DBNull.Value in the lastname column, so that my benchmark timing tests are valid. The timing for the DataSet.Select() yields roughly the same results either way, but the LINQ is about 1.5 times faster with the TypedTableBase than with a regular DataTable, making LINQ about 9 times faster than a regular DataSet.Select()!!  Here are the results using a 3.5 Typed DataSet:

  • Select:                            5,937,500 ticks
  • LINQ:                                625,000 ticks
  • LINQ (untyped syntax):      781,250 ticks

So, I think that I'll wind up this post for now. There are plenty of other uses of LINQ that I should compare with old DataSet/DataTable functionality, but I think I'll save that for another post. This one is long enough!

Until next time ... happy coding!