Full code for this solution available here: https://github.com/scottcha/LINQCountWords
Recently I was interviewing candidates for some positions we had open. When I look at lots of candidates back to back, I like to use the same interview question so I can compare their answers against each other. Over the years there is one technical coding question I’ve gone to more often than others, as it’s fairly easy to describe and there is just enough complexity that a good candidate can code it while leaving time for probing the depth of their understanding. The question is “Return the top 10 most frequently occurring words in a string.” This year a couple of candidates gave me a solution I hadn’t seen in 5+ years of asking this question: they solved it with a couple lines of LINQ. I started coding C# when it was released but have been primarily working in other languages since LINQ was added to it.* I decided to dig deeper into these solutions after the interviews, and I’m writing up what I learned.
It’s not really correct to say that my question was ruined; as with any interview question, I want to understand the candidate’s thought process and how well they understand the ins and outs of their solution. But the solutions did catch me off guard. What I wanted to do next was learn a bit more about these solutions and compare them to solutions which didn’t use LINQ.
In the most typical solution to this problem one would first split out each word (in an interview I would look for the candidate to identify the edge cases involved in identifying a word, but not necessarily ask for a method to handle all of them), hash the words to count them efficiently, and then determine the 10 most frequently occurring by sorting the key/value pairs. A typical C# solution looks like this (you can probably get the complexity down to O(n) instead of doing a full sort, but I think the sort is cleaner and leaves you with an O(n log(n)) solution).
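The original snippet isn't reproduced in this excerpt (the full code is in the linked repo), but a minimal sketch of the dictionary-plus-sort approach might look like the following. The method name and the whitespace-only split are my own simplifications; a real solution would also handle punctuation and casing.

```csharp
using System;
using System.Collections.Generic;

static class WordCounter
{
    public static List<string> Top10Words(string text)
    {
        // Split on whitespace only; word-boundary edge cases are left out.
        var counts = new Dictionary<string, int>();
        foreach (var word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(word, out int n);
            counts[word] = n + 1;
        }

        // Sort the key/value pairs by count, descending: O(n log n).
        var pairs = new List<KeyValuePair<string, int>>(counts);
        pairs.Sort((a, b) => b.Value.CompareTo(a.Value));

        var result = new List<string>();
        for (int i = 0; i < Math.Min(10, pairs.Count); i++)
            result.Add(pairs[i].Key);
        return result;
    }
}
```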
A solution in LINQ, though a bit harder for the untrained eye to read, is quite a bit more succinct.
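Again the original snippet isn't shown here, but a sketch of such a LINQ solution might read as follows (the whitespace-only split and sample text are my own assumptions):

```csharp
using System;
using System.Linq;

string text = "the quick brown fox jumps over the lazy dog the fox";

// Group identical words, count each group, sort by count descending, keep 10.
var top10 = text
    .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(w => w)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .Select(x => x.Word)
    .Take(10)
    .ToList();
```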
At this point I assumed that the performance of the two approaches was equivalent, both being bounded by the sort, so I decided to test that. I guessed LINQ had some overhead but wanted to see what it was. After building some basic tests using MSTest (you can view them in the repo), I decided an interesting approach would be to pull random Wikipedia articles to get an average time as well as a view into the trends of the different approaches (the data contains a third solution which does the grouping using a dictionary and the sort/filter using LINQ). The tests produced the following values, averaged across 1000 random datasets from Wikipedia:
LINQ: .950768 ms
Dictionary with LINQ to sort: .831811 ms
Dictionary no LINQ: .791344 ms
It turns out there is an overhead of approximately 20% in the LINQ implementation compared to the no-LINQ implementation.
Here are all the values charted:
I was curious where that extra 20% came from. My initial guess was that the Select operation was probably the most expensive, so I did a quick performance analysis in Visual Studio to see. Here is what I found:
This result baffled me. I didn’t expect converting 10 items to a list to be that expensive compared to the other operations. I stared at this and figured I must be misunderstanding something about LINQ. Digging into the documentation, I found that most of the LINQ operations use IEnumerable as the return type. It turns out this is lazily evaluated, so in effect my entire LINQ expression wasn’t being evaluated until I converted it using ToList, which made my initial performance evaluation incorrect. I was still curious, so I looked for a way to force the LINQ expression to be fully evaluated so I could see which operation was actually the most expensive.
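Deferred execution is easy to demonstrate in isolation. In this sketch (my own illustration, not from the post), the side effect inside Select doesn't run until the sequence is actually enumerated:

```csharp
using System;
using System.Linq;

var words = new[] { "a", "b", "a" };
int evaluated = 0;

// Select returns a lazy IEnumerable; nothing runs at this point.
var query = words.Select(w => { evaluated++; return w; });
Console.WriteLine($"after building the query, evaluated = {evaluated}");   // still 0

// Enumerating (here via ToList) is what forces the pipeline to execute.
var list = query.ToList();
Console.WriteLine($"after ToList, evaluated = {evaluated}");   // now 3
```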
The built-in performance tools were too opaque to understand what was happening in the LINQ breakdown once I forced the evaluation after every function call (the lambda expressions showed up with temporary identifiers, which made them hard to distinguish). Instead I used the handy Stopwatch class to write out the elapsed time for the different evaluations:
GroupBy: 32.6382 ms
Select Key & value: 2.2283 ms
OrderByDescending: 4.678 ms
Select Name: 1.7164 ms
Take 10: 0.1402 ms
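The per-step timings above can be gathered with a sketch like this one (my own reconstruction): calling ToList() after each operator forces evaluation, so the Stopwatch measures just that stage rather than the whole deferred chain.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

var words = "the quick brown fox the fox".Split(' ');

var sw = Stopwatch.StartNew();
var groups = words.GroupBy(w => w).ToList();                                  // force GroupBy
Console.WriteLine($"GroupBy: {sw.Elapsed.TotalMilliseconds} ms");

sw.Restart();
var counts = groups.Select(g => new { Word = g.Key, Count = g.Count() }).ToList();
Console.WriteLine($"Select Key & value: {sw.Elapsed.TotalMilliseconds} ms");

sw.Restart();
var ordered = counts.OrderByDescending(x => x.Count).ToList();
Console.WriteLine($"OrderByDescending: {sw.Elapsed.TotalMilliseconds} ms");

sw.Restart();
var names = ordered.Select(x => x.Word).ToList();
Console.WriteLine($"Select Name: {sw.Elapsed.TotalMilliseconds} ms");

sw.Restart();
var top10 = names.Take(10).ToList();
Console.WriteLine($"Take 10: {sw.Elapsed.TotalMilliseconds} ms");
```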
GroupBy is the statement that does the heavy lifting, essentially sorting the words and then collating them, so it seems reasonable that it has the highest cost.
At this point I was pretty happy with my improved understanding of LINQ’s capabilities, but I was curious whether I could tune my implementation to be more on par with the implementation that did not employ LINQ. Looking up some info on GroupBy on StackOverflow, I found this gem from Jon Skeet, who suggested that ToLookup is more efficient than GroupBy: http://stackoverflow.com/a/938104/489561
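Swapping the two is a one-line change, since ToLookup takes the same key selector; the difference is that ToLookup builds its lookup eagerly when called rather than deferring. A sketch of the variant (sample text and whitespace split are my own assumptions):

```csharp
using System;
using System.Linq;

string text = "the quick brown fox jumps over the lazy dog the fox";

var top10 = text
    .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
    .ToLookup(w => w)   // eager, unlike the deferred GroupBy
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .Select(x => x.Word)
    .Take(10)
    .ToList();
```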
The results I got were quite a bit better: the LINQ implementation using ToLookup (“LINQ with ToLookup” in the table below) was only ~6% slower than the non-LINQ implementation, versus the ~18% slowdown for the implementation using GroupBy.
Average of time to run 1000 random data sets from Wikipedia:
LINQ: 0.983825 ms
LINQ with ToLookup: 0.885152 ms
Dictionary using LINQ to sort: 0.849821 ms
Dictionary no LINQ: 0.834049 ms
So ended my initial tour of LINQ. I’ve begun using it more in production code, and I’m generally not worried about the overhead, as the increase in code maintainability and simplicity more than outweighs the decrease in performance for the code I typically write. It’s helpful to know there are simple tweaks, like ToLookup versus GroupBy, that can have a large effect on the performance of what I write, and to be on the lookout for them.
And yes, I decided to keep using this interview question, as even with the LINQ implementation there is still enough valuable depth to explore a candidate’s understanding and potential.
*This used to say “Even though I use C# fairly regularly over the last decade I never updated my knowledge when LINQ was added” but I changed the wording to be more accurate.