Friday, September 30, 2011

DISTINCT - LINQ versus DataSet

Wow, it’s been way too long since the last time I wrote a blog post. I’ve been busy, but that’s really no excuse. In my last post in April, I wrote about using LINQ with Typed DataSets. I did some benchmark comparisons with Selects and LINQ was considerably faster. Today I’m going to compare the DISTINCT capabilities of LINQ versus DataSets.

As before, I’m using the same set of data … a DataTable containing 270,000 rows. I’ve filled the DataTable such that there’s a random number of duplicates in the rows, but it seems that the DISTINCT values always fall between 4750 and 4800 rows.

There are some interesting results. When selecting DISTINCT values, over all columns in the DataTable, it definitely is best to use the old DataSet methods rather than LINQ. And, also surprising, is that the old-style 2.0 Typed DataSets (and likewise, plain old vanilla DataSets, not Typed) are actually faster than newer 3.5 Typed DataSets (refer to my last post, linked to above, that explains what is the difference between the two).

// ds20 is a 2.0 Typed DataSet (a regular DataTable)
// ds35 is the 3.5 Typed DataSet (a TypedTableBase) 
DataTable dt20 = ds20.Personnel.DefaultView.ToTable(true);
DataTable dt35 = ds35.Personnel.DefaultView.ToTable(true);

// Note that if you MUST use the DataRowComparer.
// If you don't you get all rows returned ... it won't find the duplicates.
DataTable dtLinq20 = ds20.Personnel.AsEnumerable()
    .Select(row => row).Distinct(DataRowComparer.Default)
    .CopyToDataTable();

// note that the only difference in syntax between
// 2.0 and 3.5 is that you don't use .AsEnumerable()
DataTable dtLinq35 = ds35.Personnel
    .Select(row => row).Distinct(DataRowComparer.Default)
    .CopyToDataTable();

Benchmarking results show that the DefaultView.ToTable() is 7 times faster than LINQ for ds20 (and 8.5 times faster for ds35). The DefaultView.ToTable() is only slightly faster for ds20 than ds35 ( not significantly though) and with LINQ, ds20 is about 1.5 times faster than ds35. Most peculiar!

  • ToTable (ds20):    77,937,500 ticks
  • ToTable (ds35):    90,718,750 ticks
  • LINQ (ds20):       554,656,250 ticks
  • LINQ (ds35):       781,218,750 ticks

But, there’s where the superiority of the DefaultView.ToTable() ends. Once you start looking for specific columns to be distinct, LINQ is faster. Let’s look at the one column scenario: First, the DataSet code (since all code for the DataSet methods are exactly the same for ds20 as for ds35, I’ll only show one):

DataTable dt20 = ds20.Personnel.DefaultView.ToTable(true, "firstname");

Pretty straightforward … nothing fancy. The benchmark times for both ds20 and ds35 are almost identical.

But now, the LINQ gets trickier. Unfortunately, you’ve got to create the resulting DataTable (including the column) before you can “fill” it from the LINQ query. If there’s a workaround for this, I sure haven’t found it. Here are the two LINQ queries:

// note that you have to create the DataTable first, 
// AND add the column!
DataTable dt = new DataTable();
dt.Columns.Add("firstname");
DataTable dtLinq20 = ds20.Personnel.AsEnumerable()
    .Select(row =>
    {
        DataRow newRow = dt.NewRow();
        newRow["firstname"] = row.Field<string>("firstname");
        return newRow;
    })
    .Distinct(DataRowComparer.Default).CopyToDataTable();

// note that there are 3 differences in syntax for ds20 & ds35:
// 1) as before, you don't use .AsEnumerable()
// 2) you must check for null
// 3) you can use the typed column syntax (row.firstname)
DataTable dt = new DataTable();
dt.Columns.Add("firstname");
DataTable dtLinq35 = ds35.Personnel
    .Select(row =>
    {
        DataRow newRow = dt.NewRow();
        newRow["firstname"] = row.IsfirstnameNull() ? "" : row.firstname;
        return newRow;
    }).Distinct(DataRowComparer.Default).CopyToDataTable();

Now for the benchmark numbers.  LINQ is more than 5 times faster than the DefaultView.ToTable() … (5.25 times faster for ds20 and 5.5 for ds35).

  • ToTable:     15,406,250 ticks
  • LINQ (ds20):    2,937,500 ticks
  • LINQ (ds35):    2,781,250 ticks

As more columns get added, the LINQ advantage becomes diminished. At some point, the DefaultView.ToTable() methodology becomes preferable. I suspect that that point depends both on the number of columns you wish to have distinct, as well as the number of rows that you are processing. The syntax for the DefaultView.ToTable() becomes slightly different when you have more than one column. Here it is:

DataTable dt20 = ds20.Personnel.DefaultView
    .ToTable(true, new string[] { "firstname", "lastname" });

The LINQ syntax doesn’t change, you just keep adding the columns:

DataTable dt = new DataTable();
dt.Columns.Add("firstname");
dt.Columns.Add("lastname");
DataTable dtLinq35 = ds35.Personnel
    .Select(row =>
    {
        DataRow newRow = dt.NewRow();
        newRow["firstname"] = row.IsfirstnameNull() ? "" : row.firstname;
        newRow["lastname"] = row.IslastnameNull() ? "" : row.lastname;
        return newRow;
    }).Distinct(DataRowComparer.Default).CopyToDataTable();

In the case above, of two columns, LINQ was just slightly under 5 times faster.

So, just for giggles, I thought I’d benchmark half the columns. My DataTable has 32 columns, so I tried a distinct on the first 16 columns in my DataTable. And now, it seemed, that we were back to having DefaultView.ToTable() being superior by about 7 times (turns out I was wrong … keep reading). Now, my curiosity got the better of me … I wasted some more time trying to find “the sweet spot” … the number of columns where the performance was about equal. After all, I was a math major in college and I love to play with numbers!

But, as I played more with this, I discovered something very interesting. The time it takes for LINQ depends on the order of the columns!!! In other words, if the first column in your list doesn’t have a lot of distinct values (mine had 7), but the next one in your list does (mine had 4700+), it will take a lot longer. If you swap the column order of those two columns, BAM! Very fast! Instead of the LINQ query being 7 times slower than the DefaultView.ToTable(), it was now 2.25 times faster!

Curiously, this doesn’t seem to matter with the DefaultView.ToTable() … it benchmarked the same amount of time with the columns in any order!

  • ToTable:                                                        42,750,000
  • LINQ (original order of 16 columns):    305,375,000
  • LINQ (swapped order of columns):         19,125,000

I suspect that my very first test, the one with all the columns,suffered from this anomaly … in other words, the LINQ query was so very much slower because that problem column (the one with only 7 distinct values) happened to be the first column in the DataTable. and the syntax of that query didn’t specify columns.

After discovering this quirk, I went back and re-ran the LINQ query that used only 2 columns, but I did it this time with these two particular columns where I noticed this problem. Sure enough, with only a two column distinct query, if the columns were in the wrong order, the query slowed to a crawl … becoming 14.5 times slower than the DefaultView.ToTable()!!!  Wow!

So, what’s the takeaway from this little experiment? Good question … I guess it’s that the data makes the difference. If you’re pretty sure about the data in the columns not being too lopsided, distinct-wise, then LINQ is clearly faster for probably all scenarios except the one including all columns (I did not test this, it’s only a guess … the most columns I tested were half). However, if you’re not certain of the distinctness of the data in your columns, perhaps it’s best to stick with the tried-and-true DefaultView.ToTable() method.