Wednesday, April 06, 2011

LINQ With DataSets

Today's blog post is going to be about using LINQ with DataSets. As anyone who reads my blog regularly (or various forum posts) might already know, I've been a big fan of Typed DataSets since the very beginning of .NET (all the way back to the pretty buggy 1.0 version). However, LINQ is fairly new to me. LINQ *itself* is not new ... it has been around for a few years now and I've certainly been aware of it. But I haven't utilized it much because, until .NET 3.5, there was no real support for Typed DataSets and that's really all I wanted to be able to use it for. Sure, you could still use LINQ with both regular DataSets and Typed DataSets ... as long as you used .AsEnumerable() ... but you couldn't take advantage of the Typed nature of a Typed DataSet, so I didn't see any point in messing around with it.

I've recently begun using it more with Typed DataSets because, surprisingly to me, for some things it seems to be faster. Since .NET 3.5, Typed DataSets have been auto-generated slightly differently than they were in the past (and I'm not talking about the TableAdapter generation, which I despise ... see my blog post about avoiding that fiasco).

Prior to 3.5, the DataTable definition in a generated Typed DataSet looked something like this:

public partial class PersonnelDataTable : global::System.Data.DataTable, global::System.Collections.IEnumerable
// rest of the code here

The minor difference is this:

public partial class PersonnelDataTable : global::System.Data.TypedTableBase<PersonnelRow>
// rest of the code here

TypedTableBase is derived from DataTable, so really nothing changes. But, it allows the DataTable to be used in LINQ queries without specifying .AsEnumerable(), allows us to use the Typed column name properties of the DataRows in our LINQ queries *and* it's also much faster.
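To make the mechanics concrete, here's a hand-rolled miniature of roughly what the 3.5 generator emits: a typed row class plus a table deriving from TypedTableBase&lt;T&gt;. This is a simplified sketch for illustration only (the real generated code contains a lot more), but it shows why no .AsEnumerable() is needed and where the typed column property comes from:

```csharp
using System;
using System.Data;
using System.Linq;

// Illustrative miniature of generated Typed DataSet code (simplified).
public class PersonnelRow : DataRow
{
    internal PersonnelRow(DataRowBuilder builder) : base(builder) { }

    // Typed column property, as the generator would emit it.
    public string lastname
    {
        get { return (string)this["lastname"]; }
        set { this["lastname"] = value; }
    }

    public bool IslastnameNull() { return IsNull("lastname"); }
}

// TypedTableBase<PersonnelRow> derives from DataTable and implements
// IEnumerable<PersonnelRow>, so LINQ works directly on the table.
public class PersonnelDataTable : System.Data.TypedTableBase<PersonnelRow>
{
    public PersonnelDataTable()
    {
        Columns.Add("lastname", typeof(string));
    }

    protected override Type GetRowType() { return typeof(PersonnelRow); }

    protected override DataRow NewRowFromBuilder(DataRowBuilder builder)
    {
        return new PersonnelRow(builder);
    }
}

class Demo
{
    static void Main()
    {
        var personnel = new PersonnelDataTable();
        personnel.Rows.Add("Berent");
        personnel.Rows.Add("Smith");

        // No .AsEnumerable() needed; rows come out already typed.
        string[] names = personnel
            .Where(row => !row.IslastnameNull() && row.lastname == "Berent")
            .Select(row => row.lastname)
            .ToArray();
        Console.WriteLine(string.Join(", ", names));
    }
}
```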

As I already mentioned above, LINQ *can* easily be used with regular DataSets/DataTables (its use is NOT limited to Typed DataSets), but I am not including examples of that in this blog post. As with the 2.0 Typed DataSets, all that is needed is to use .AsEnumerable() with the DataTable, so the code will be almost identical.

So, let's see a few examples.

First, let's look at a typical use for a DataSet: selecting some rows. Using LINQ with a Typed DataSet ends up being quite a bit faster than using the DataTable.Select() method. In all my tests, I used a DataTable containing 270,000 rows. The Select in this test returned 30,000 of those rows. I also did comparisons between the 2.0 DataTable and the 3.5 TypedTableBase, just for the heck of it.

// ds20 is a 2.0 Typed DataSet, which uses a plain old DataTable

DataRow[] dsSelect20 = ds20.Personnel.Select("lastname = 'Berent'");

DataRow[] linqSelect = ds20.Personnel.AsEnumerable()
.Where(row => row.Field<string>("lastname") == "Berent")
.Select(row => row).ToArray();

The above are your only two choices when using untyped or 2.0 Typed DataSets. Benchmark timing results show that LINQ is about 6 times faster than the old DataTable.Select() method:

  • Select:  5,625,000 ticks
  • LINQ:      937,500 ticks

If you're using 3.5 Typed DataSets, you have a few more options with LINQ. Generally, your LINQ statement will look like this:

// ds35 is a 3.5 Typed DataSet, which uses the new TypedTableBase class for its DataTable

DataRow[] linqSelectLastName = ds35.Personnel
.Where(row => row.lastname == "Berent")
.Select(row => row).ToArray();

Note the differences: you don't need Personnel.AsEnumerable() and you can use the typed row column names, row.lastname. There is one caveat to the above syntax, though. If any row in the Personnel table contains DBNull.Value in the lastname column, the above syntax will throw an exception. You must also check for the null, so the actual statement will need to be this:

DataRow[] linqSelectLastName = ds35.Personnel
.Where(row => row.IslastnameNull() == false && row.lastname == "Berent")
.Select(row => row).ToArray();

You could also use the untyped syntax of the column and then you don't need to check for DBNull, but I like to take advantage of the typed nature of a Typed DataSet. That is the purpose of them, after all. =0) Also, using the untyped syntax is slightly slower, but probably not significantly so.

DataRow[] linqSelectUntypedLastName = ds35.Personnel
.Where(row => row.Field<string>("lastname") == "Berent")
.Select(row => row).ToArray();

Incidentally, my data *does* contain DBNull.Value in the lastname column, so that my benchmark timing tests are valid. The timing for DataTable.Select() yields roughly the same results either way, but the LINQ is about 1.5 times faster with the TypedTableBase than with a regular DataTable, making LINQ about 9 times faster than a regular DataTable.Select()!!  Here are the results using a 3.5 Typed DataSet:

  • Select:                  5,937,500 ticks
  • LINQ:                      625,000 ticks
  • LINQ (untyped syntax):     781,250 ticks
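For reference, timings like the ones above can be captured with System.Diagnostics.Stopwatch. This is just a minimal sketch of one way to do it, using an untyped DataTable so it's self-contained (the column name and row counts mirror my tests; the data itself here is made up):

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.Linq;

// Minimal benchmark sketch: DataTable.Select() vs. a LINQ query over the
// same untyped DataTable. Substitute your own Typed DataSet to reproduce
// the typed-syntax timings.
class SelectBenchmark
{
    static void Main()
    {
        DataTable personnel = new DataTable("Personnel");
        personnel.Columns.Add("lastname", typeof(string));
        for (int i = 0; i < 270000; i++)
            personnel.Rows.Add(i % 9 == 0 ? "Berent" : "Other" + i);

        Stopwatch sw = Stopwatch.StartNew();
        DataRow[] viaSelect = personnel.Select("lastname = 'Berent'");
        sw.Stop();
        Console.WriteLine("Select: {0:N0} ticks ({1} rows)",
            sw.ElapsedTicks, viaSelect.Length);

        sw.Reset();
        sw.Start();
        DataRow[] viaLinq = personnel.AsEnumerable()
            .Where(row => row.Field<string>("lastname") == "Berent")
            .ToArray();
        sw.Stop();
        Console.WriteLine("LINQ:   {0:N0} ticks ({1} rows)",
            sw.ElapsedTicks, viaLinq.Length);
    }
}
```

As always with micro-benchmarks, run each query a few times and ignore the first pass (JIT warm-up) before trusting the tick counts.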

So, I think that I'll wind up this post for now. There are plenty of other uses of LINQ that I should compare with old DataSet/DataTable functionality, but I think I'll save that for another post. This one is long enough!

Until next time ... happy coding!


  1. That is very interesting Bonnie. I have noticed on some queries that LINQ will throw an exception when it has nulls. That is why if I think a table may have a null value, I will use the old DataView.

    I seem to still prefer this:

    DataView dv = new DataView(tableOne, "OrderID <> 'XXXX'", "OrderTime desc", DataViewRowState.CurrentRows);

    DataTable newTable = new DataView(tableOne, "OrderID <> 'XXXX'",
    "OrderTime desc", DataViewRowState.CurrentRows).ToTable();

    To this:
    IEnumerable<DataRow> ordered = tableOne.AsEnumerable()
    .Where(i => i.Field<string>("OrderID") != "XXXX")
    .OrderByDescending(i => i.Field<DateTime>("OrderTime"));

    DataTable newTable = tableOne.AsEnumerable()
    .Where(i => i.Field<string>("OrderID") != "XXXX")
    .OrderByDescending(i => i.Field<DateTime>("OrderTime"))
    .CopyToDataTable();

  2. Hi John! How ya been?

    One *does* get used to the look of LINQ after awhile, but most of the time I still use the older DataSet notation/syntax. It's definitely easier on the eyes. ;0)

  3. Hello Bonnie, could you tell me if I'm doing something wrong? It looks like LINQ completely ignores the presence of a primary key column:

    I am comparing two very simple searches over a column declared as primary key

    Dim row = (From uo In ds.UOS Where uo.UO_ID = ID).FirstOrDefault
    Dim row2 = ds.UOS.FindByUO_ID(ID)

    The table is filled by a very simple sequential process, where row 1 has key "ID01", row 2 has "ID02", and so on.

    Comparing the two methods when the table is filled with 1000 rows:
    Searching "ID01":
    LINQ: 140 ticks
    FindByKey: 103 ticks

    Searching a nonexistent element, "ID01x":
    LINQ: 2150 ticks
    FindByKey: 110 ticks

    With 100,000 rows:
    Searching "ID01":
    LINQ: 211 ticks
    FindByKey: 146 ticks

    Searching a nonexistent element, "ID01x":
    LINQ: 199,252 ticks
    FindByKey: 193 ticks

    1. Hi Daniel,
      As you have seen, LINQ is not suitable for everything. I have another blog post with more DataSet vs LINQ comparisons if you're interested:

      The above post is about using DISTINCT, but I clearly show that LINQ is not always the better choice. The thing to learn from these little tests is that it sometimes depends on what your queries are and what your data is.

      As far as I know, I think you are correct that LINQ ignores the presence of a primary key column. So, no, I don't think you're doing anything wrong at all. If your DataTable makes use of a Primary Key, then you should definitely take advantage of that and use the DataTable.Find method. That is obviously faster than LINQ in that scenario!

  4. Bonnie, thank you very much for your comment. I had read your other post, so I knew that LINQ is not always the better choice, as you said. It seems clear that .FindByKEY is optimal in that scenario, but I really thought that LINQ would behave better.

    The reason I was interested in LINQ was that it allows deferred query evaluation and permits executing the query several times against different sets. I have been developing a framework and needed a way to abstract any query over a dataset so that it could be executed again over another (and related) dataset, if necessary.

    It probably could still be useful in other scenarios, or even in this one if the size of the dataset is small, if that way I get that feature (accepting the LINQ overhead in this last case). Anyway, I will continue searching for a better solution to my problem.

    Thanks again

    1. If you have a DataTable with a PrimaryKey defined, you can use the old-style .Find() method rather than the Typed DataSet .FindByKEY. The syntax would look like this:

      DataRow row = ds.UOS.Rows.Find(ID);

      So, in your Framework class, perhaps you'd have a method something like this:

      public DataRow MyFind(DataTable table, object key)
      {
          return table.Rows.Find(key);
      }

      Actually, I'm not sure what using the .Find has to do with what you've written above ("abstract any query over a dataset in some way so that it could be executed again over another (and related) dataset, if necessary"), but since you initially asked about using a Find with LINQ, I thought I'd show you a generic non-LINQ and non-DataSet-specific way to do this. This way, you don't have to use the syntax that utilizes the Typed DataSet generated method (FindByUO_ID in your example), so it will work for any DataTable that has a PrimaryKey. You could even test for that in the method, just in case:

      public DataRow MyFind(DataTable table, object key)
      {
          DataRow row = null;
          if (table.PrimaryKey.Length > 0)
              row = table.Rows.Find(key);
          return row;
      }

  5. Hello
    Yes, I could use .Find, I know, but I have been thinking of something like a write-only property that could receive an IEnumerable(Of DataRow). The framework knows how to manage that query, applying it if necessary on another dataset (assuming it knows the variable/property that has been used to define the query).

    Dim Rows= From r In ds.<..any query..> Select CType(r, DataRow)

    In the framework: no results when there should be? -> ds = AlternativeDataSet, and requery...
    The logic of the query resides outside the framework, and could return any rows of the typed dataset.

    The use of .Find can be a particular case, although most times the query expression will be some kind of .Select, and there LINQ behaves OK. I tried with FindByKey because I thought it could be a reference test, simple but representative. But as I have seen, it is a poor case to compare LINQ against.

    Anyway, it is not a mature idea, Bonnie. Don't worry, your comments have been very useful to me. Thanks.

    (Sorry for my very bad English..)

    Regards, Daniel

  6. Hi Bonnie,
    sorry to bring this up after so much time.
    I am doing some tests to compare the time it takes to find data in a DataSet using a LINQ query versus the traditional Select() method, and I see the opposite behavior: the LINQ query performs way worse than the Select() method.
    I tried to create the DataSet.xsd in the way you suggested (to avoid code that is not necessary), but still got the same result.
    Perhaps it is the way I am doing the searches:
    I populate the DataSet with 270,000 rows, then I create a List with 100 random numbers. I perform a search with LINQ and with the Select() method for each of those numbers. The numbers may or may not be found in the DataSet; I did this on purpose to make sure I get the worst-case scenario of a full scan. I repeat this 10 times.
    In my tests, LINQ performs almost 1000 times slower than Select(). This is weird, since I've seen people saying the opposite.
    If you'd like, I am happy to share the code. Any comment is welcome. Thanks

    1. Hi OsO,
      I think a lot depends on the data and your searches. LINQ is not always the perfect fit for every query, as I mentioned in a few of my comments to others. If I have some critical part of an application where performance has to be as fast as possible, I'll benchmark my queries with data that is representative of the real-world data I expect my app to consume. Then, I'll use whichever methodology is right for the query with that data. If the difference is negligible, then I'll use whichever methodology is easier on the eyes (easier to read and easier to maintain).
