A Beginner's Guide to LINQ, Part 1
In the tech world, acronyms are rife. There are hardware acronyms: SATA, IC, ACPI. There are software acronyms: SQL, J2EE, ASP. There are even acronyms for certifications of one's knowledge of a particular domain of acronyms: CISSP, MCPD, ISA. Any technology company that has had an impact in the field is sure to have introduced its own set of acronyms to the fray. One relatively "new kid on the block" was introduced by Microsoft circa 2007: LINQ. In this article, I intend to provide an introduction to LINQ and its uses.
While I will attempt to explain the topic in a manner suitable for even a beginner, this article is intended for an audience with some level of programming experience. New programmers may want to hold off on reading the article until they have gained a basic understanding of programming fundamentals. The content of this article will revolve primarily around LINQ-to-Objects, though some of the concepts discussed will apply to LINQ-to-XML and LINQ-to-SQL.
What Is LINQ?
LINQ (Language Integrated Query) is a technology created by Microsoft "to [bridge] the gap between the world of objects and the world of data."(1) It sounds like marketing hype to me. In some aspects, though, LINQ is just that: a bridge between your code and some data source. However, do not take the term "data source" to strictly mean "database." For my purposes, "data" means some piece(s) of information, and "source" means some point of origin of this data. In LINQ-land, data sources can be text files, XML files, objects in memory, and yes... databases.
Aside from being some magical way of joining your code to your source of data, what else should you know about LINQ before diving in? First, it is a feature of the language you develop in. You can write LINQ queries (that's the "Q" in "LINQ" after all) right inside of your regular .NET code. The designers of each .NET language (e.g. C#, VB.NET, F#, etc.) have included specific language keywords which you can use to build your queries. Next, for your introduction to LINQ, think of it as a supercharged foreach (For Each in VB.NET) loop. If you have experience in .NET, then you should be familiar with "for each" loops. Key to understanding how LINQ does what it does is understanding how a "for each" loop works. To understand how a "for each" loop works, you need to understand the concept of an iterator.
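To make the "supercharged foreach" idea concrete, here is a minimal sketch that asks the same question twice: once with a hand-written loop and once with the C# query keywords. The numbers array and the variable names are made up for illustration, and the sketch uses top-level statements (C# 9 or later) only to stay short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data, used only for illustration.
int[] numbers = { 5, 12, 8, 21, 3, 14 };

// The hand-written version: loop, test, collect.
var bigWithLoop = new List<int>();
foreach (int number in numbers)
{
    if (number > 10)
    {
        bigWithLoop.Add(number);
    }
}

// The same question asked with LINQ query keywords (from / where / select).
IEnumerable<int> bigWithLinq = from n in numbers
                               where n > 10
                               select n;

Console.WriteLine(string.Join(", ", bigWithLoop)); // 12, 21, 14
Console.WriteLine(string.Join(", ", bigWithLinq)); // 12, 21, 14
```

Both versions visit the elements one at a time and keep the ones that pass the test; the query simply states the intent without the bookkeeping.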
Iterators
For a new programmer, the term "iterator" may seem daunting. It is really not. An iterator is basically a method which loops over the elements in some collection. As the iterator loops over these elements, it keeps track of its position (2). This is so it knows which elements have already been visited, and which have yet to be. Think of an iterator like counting the number of people in a line. If you were to count the number of people in a line, you might point to each person as you count. If someone interrupted you while you were counting, and nothing caused your pointing hand to move while you turned to talk to that someone, when you looked back at the line, you would still be pointing at the last person you counted. An iterator is equivalent to your pointing hand pointing at the last person you counted. (Note: I made a point not to say "next person to be counted." This is to stay in line with how an iterator works.)
So why on earth would one need to keep track of his position within an arbitrary collection of data? If he is using a "for each" loop to iterate over the whole collection, then he must want to interact with every piece of data in the collection, right? That is where the "each" part of the "for each" comes into play. "Each" in the English language corresponds to the quantification "one." When we use a "for each" loop, we are eventually going to examine every item in the collection (disregard side effects for now). We are going to do so one element at a time--even in the code that hosts the "for each." Having said that, recall that our iterator "remembers" where we are positionally within the collection. The compiler of our chosen language compiles our code in such a way that when we are in "for each" land, when our "for each" advances to the next element, we actually jump back into the code that created the iterator and we advance to the next item in the collection. Let us try another example.
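The "pointing hand" can be seen directly by driving an enumerator by hand. The following is a rough sketch of what a foreach over a list amounts to, not the exact code the compiler emits; the names list and the variable names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample data.
var names = new List<string> { "Ann", "Bob", "Cy" };

// What we normally write.
var viaForeach = new List<string>();
foreach (string name in names)
{
    viaForeach.Add(name);
}

// Roughly what happens behind the scenes.
var viaEnumerator = new List<string>();
using (IEnumerator<string> cursor = names.GetEnumerator())
{
    // MoveNext advances the "pointing hand"; Current is the item it points at.
    while (cursor.MoveNext())
    {
        viaEnumerator.Add(cursor.Current);
    }
}

Console.WriteLine(string.Join(", ", viaForeach));    // Ann, Bob, Cy
Console.WriteLine(string.Join(", ", viaEnumerator)); // Ann, Bob, Cy
```

Each call to MoveNext is the moment the loop "jumps back" into the code that supplies the data, and the enumerator object is what remembers the position between calls.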
Let us say that you are a factory worker. Your job is to take a line of buckets, each containing widgets, and one-by-one place the buckets on a conveyor belt to be used at various points along the assembly line. You are the iterator. The assembly line is the "for each" loop. When the conveyor belt starts, so does your work. You start with the first bucket, and you place it on the conveyor belt. The bucket proceeds through the assembly line. You have strict instructions not to proceed to the next bucket until the bucket you just sent comes back. You have no awareness as to how the bucket is being used on the assembly line; you only know that you cannot proceed to the next bucket until the bucket you just sent returns. As each bucket comes back to you, it arrives crushed, and there is nothing more you can do with a crushed bucket. You toss the unusable bucket aside and move on to the next bucket. This process continues until you exhaust the supply of buckets. This is equivalent to how the iterator works under the hood and in conjunction with the "for each" loop.
Even though as the factory worker you have no idea what the processes along the conveyor belt's path do with each bucket as it arrives, the work to supply new buckets comes back to you. In this same way, the code which creates the iterator has no clue as to what the "for each" code does with the data it supplied; it only knows that once execution returns to it, it should supply the next piece of data. Furthermore, your duties do not include salvaging any unused widgets from the incoming bucket. They do not include trying to recycle any incoming buckets if they were not completely crushed. Your assignment is only to keep the conveyor belt running, and to do so one bucket at a time. So too does an iterator supply data, one element at a time. The iterator's only job is to keep supplying data to the caller as execution returns to it.
So then how does execution return to the iterator? We all know that when a function returns, that is it. There is no resuming where we left off (not without some dirty GOTO statement, but you would never do that, right?). Once a function returns, we do not jump back into it without calling it again. It is the same in mathematical functions. When we say y = x^2 (x-squared), once we get the value of y, is there any way for us to jump back into the function and change the way y is calculated? Of course not. But then how does the iterator circumvent this seemingly illogical roadblock? As previously mentioned, the compiler does a bit of magic itself.
IEnumerable, Meet Yield
Here is an example of what we might consider a standard function definition:
int Add(int x, int y)
{
    int z = x + y;
    return z;
}
That is, take in some parameters (or maybe no parameters), do some logic, and return some result. The key to the above is the return keyword. No matter where we place return in a function, if the logic within the function causes us to hit a return, then we exit the function, possibly returning a value along the way. The compiler structures the code in such a way to ensure this happens. In a function which creates an iterator, however, this is not quite the case. Take the following:
public IEnumerator GetEnumerator()
{
    // Hand each element of _values back to the caller, one per iteration.
    for (int j = 0; j < this._values.Length; j++)
    {
        yield return this._values[j];
    }
}
And I am sure you are saying, "Whoa! What the heck is yield?" Well, yield is a special keyword which lets the compiler know that we intend for this function to return things in an iterative way (3). In other words, this function will return things like a normal function would; however, it will return every single item in the associated collection (_values in this case), one at a time. So am I lying to you? I said earlier that functions return something and then there is no going back without calling the function again. That, my friend, is the magic of the yield keyword (together with an IEnumerable or IEnumerator return type).
As I mentioned previously, the compiler will structure the compiled code in such a way that the runtime will pass whatever yield return returns back to the caller (e.g. a foreach loop), and when that caller is done with the current "iteration", execution will pick up at the next line of the code which creates the iterator (in the above, that would be the closing brace of the for loop). This is the same thing I explained in the conveyor belt example. The iteration of the bucket going off on the conveyor belt, and then eventually resuming with you placing the next bucket on the conveyor belt, exemplifies this behavior.
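The back-and-forth between the iterator and its caller is easy to watch if both sides write to a shared log. This is a small sketch of the conveyor belt story; the Buckets method, the bucket names, and the log list are all invented for the example.

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

// The "factory worker": hands over one bucket at a time.
IEnumerable<string> Buckets()
{
    string[] buckets = { "bucket 1", "bucket 2" };
    for (int j = 0; j < buckets.Length; j++)
    {
        log.Add("worker: sending " + buckets[j]);
        yield return buckets[j];                          // execution leaves here...
        log.Add("worker: " + buckets[j] + " came back");  // ...and resumes here
    }
}

// The "assembly line": the body of the foreach loop.
foreach (string bucket in Buckets())
{
    log.Add("line: using " + bucket);
}

foreach (string entry in log)
{
    Console.WriteLine(entry);
}
// worker: sending bucket 1
// line: using bucket 1
// worker: bucket 1 came back
// worker: sending bucket 2
// line: using bucket 2
// worker: bucket 2 came back
```

The worker never sends the second bucket until the loop body has finished with the first, which is exactly the hand-off described above.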
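You may be wondering what would happen if you didn't use the yield keyword, and you just used return by itself. Well, two things would happen: 1) the code will not compile, because a plain return (in this example) would hand back a single string, but the function's definition expects an enumerator rather than a string; 2) assuming the code did compile, you would not get the results you expect. Remember that a return forces immediate exiting of the function--no going back. In this case, the yield return and the enumerable return type are both required. It may seem strange that one string at a time is being "returned" by the iterator while the method is declared to return an IEnumerable or IEnumerator, but this is a requirement of the iterator: the return type must be one of those interfaces (or their generic forms).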
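The second point can be sketched by giving the plain-return version a return type that does compile. In the example below, FirstOnly and Every are hypothetical names, and the values array stands in for the _values field from the earlier listing.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the _values field used earlier (hypothetical contents).
string[] values = { "alpha", "beta", "gamma" };

// Plain return: the method is over the first time return is reached,
// so the caller only ever sees the first element.
string FirstOnly()
{
    for (int j = 0; j < values.Length; j++)
    {
        return values[j];
    }
    return string.Empty; // only reached when values is empty
}

// yield return: each element is handed to the caller in turn.
IEnumerable<string> Every()
{
    for (int j = 0; j < values.Length; j++)
    {
        yield return values[j];
    }
}

Console.WriteLine(FirstOnly());                // alpha
Console.WriteLine(string.Join(", ", Every())); // alpha, beta, gamma
```

The two loops are identical apart from the keyword, yet only the yield version ever reaches the second element.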
Now that you hopefully have some insight into the workings of iterators, let us examine how this fits together with LINQ.
Iterators And LINQ
I mentioned earlier that LINQ is built into the language. There is still a bit of transformation by the compiler in order to make LINQ code actually run instructions the computer will understand. The compiler will translate your LINQ queries into a series of method calls (4). These would be the same method calls you would see if you imported the System.Linq namespace into your project, and you brought up IntelliSense for a particular collection. Some of these methods include: Where, Select, GroupBy, OrderBy, etc. Each of these methods is an extension method (5). These extension methods use iterators under the hood. Yes, if you were to decompile many of these methods you would see good ol' yield return within their code. When you chain together one or more of these methods, each item returned from the yield return actually passes from one method to the next before the next item is returned from the original collection. (This holds for streaming methods such as Where and Select; sorting and grouping methods such as OrderBy and GroupBy have to read all of their input before they can return their first item.) This is due to the behavior of yield return. This behavior is what gives LINQ so much power--like I said earlier: a supercharged foreach.
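Here is a small sketch of both ideas: a query written with keywords next to the equivalent method chain, with logging added to the lambdas so the item-by-item flow becomes visible. The numbers array and the log list are assumptions made for the example, and the logging is of course not part of what the compiler generates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<string>();
int[] numbers = { 1, 2, 3, 4 };

// Query syntax; the compiler turns this into
// numbers.Where(n => n % 2 == 0).Select(n => n * 10).
IEnumerable<int> viaQuery = from n in numbers
                            where n % 2 == 0
                            select n * 10;

// The same chain written by hand, with logging added to each step.
IEnumerable<int> viaMethods = numbers
    .Where(n => { log.Add("Where sees " + n); return n % 2 == 0; })
    .Select(n => { log.Add("Select sees " + n); return n * 10; });

foreach (int result in viaMethods)
{
    log.Add("foreach gets " + result);
}

Console.WriteLine(string.Join(", ", viaQuery)); // 20, 40
foreach (string entry in log)
{
    Console.WriteLine(entry);
}
// Where sees 1
// Where sees 2
// Select sees 2
// foreach gets 20
// Where sees 3
// Where sees 4
// Select sees 4
// foreach gets 40
```

Notice that the value 2 travels all the way through Where, Select, and the loop body before Where is ever asked about 3.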
When you begin to think of your LINQ queries in this way, they become easier to understand--both in reading and writing such queries. Likewise, if you decide to use extension method syntax, you will understand why your method chains behave the way they do. Thinking of the query as an elaborate foreach loop helps you understand that something like this:
...will loop through each line of the file that was read, ultimately returning only those lines which begin with the string "some text". Unfortunately, in this case the entire file is read before any single line is processed by the query (sort-of defeating the power of yield return), but that is a shortcoming of the ReadAllLines method, not the LINQ query.
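A query of the kind described here might look like the following sketch. The temporary file and its contents are hypothetical, created only so the example can run anywhere. File.ReadLines (available since .NET 4) is shown next to File.ReadAllLines as one way around the shortcoming, since it supplies lines as the query asks for them instead of loading the whole file first.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical file and contents, created here so the sketch is self-contained.
string path = Path.GetTempFileName();
File.WriteAllLines(path, new[] { "some text here", "other text", "some text again" });

// ReadAllLines loads the whole file into a string[] before the query sees a line.
IEnumerable<string> eager = from line in File.ReadAllLines(path)
                            where line.StartsWith("some text", StringComparison.Ordinal)
                            select line;

// ReadLines is itself lazy: lines are read as the query pulls them.
IEnumerable<string> lazy = from line in File.ReadLines(path)
                           where line.StartsWith("some text", StringComparison.Ordinal)
                           select line;

// ToList runs each query to completion, which also closes the file.
List<string> eagerMatches = eager.ToList();
List<string> lazyMatches = lazy.ToList();
File.Delete(path);

Console.WriteLine(string.Join(" | ", eagerMatches)); // some text here | some text again
Console.WriteLine(string.Join(" | ", lazyMatches));  // some text here | some text again
```

Both queries return the same lines; the difference is only in how much of the file has to be in memory at once.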
There is also a good bit of power in using the extension method syntax. A good portion of those methods have an overload which takes a predicate (6), which I will cover in a separate article. In short, a predicate is just a condition. Think of it like a "where" clause, but written in a slightly different way. With predicates, you can greatly affect the execution of your queries by letting the query run behavior you dictate, not just some default behavior coded into the extension method. The predicate is a slave to the yield return of the iterator, but the relationship hinders neither the execution of the extension method nor the evaluation of the predicate.
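As a quick taste ahead of that separate article, here is a minimal sketch of a few System.Linq methods called with and without a predicate. The words array is made up for the example.

```csharp
using System;
using System.Linq;

// Hypothetical sample data.
string[] words = { "apple", "kiwi", "banana", "fig" };

// Without a predicate, Count just counts everything.
int all = words.Count();

// With a predicate, we dictate which elements qualify.
int longWords = words.Count(w => w.Length > 4);
string firstShort = words.First(w => w.Length < 5);
bool anyStartsWithB = words.Any(w => w.StartsWith("b", StringComparison.Ordinal));

Console.WriteLine(all);            // 4
Console.WriteLine(longWords);      // 2
Console.WriteLine(firstShort);     // kiwi
Console.WriteLine(anyStartsWithB); // True
```

Each predicate is evaluated against one element at a time, as the iterator inside the method hands that element over.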
Summary
While I always attempt to keep my explanations "short and sweet," it never seems to work out that way. Congratulations on making it this far. By now, you should have a general understanding of the underlying concepts that make LINQ so powerful and quite useful. While the above descriptions lent themselves more to LINQ-to-Objects, the concepts can apply to LINQ-to-XML and LINQ-to-SQL as well. (Granted there is a bit more going on with LINQ-to-SQL.)
If you wish to dig deeper into the underlying logic, then read up on the yield keyword and its uses. I did not cover yield break anywhere above, but if you understand what the break keyword does in normal loop usage, then you already have a basic understanding of what it does in an iterator (and you should quickly understand why methods like Take and First work the way they do).
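For the curious, here is a rough sketch of how yield break lets an iterator stop early. TakeFirst is a simplified, hypothetical stand-in for Take and is not the library's actual source; Numbers and the visited list exist only to show how far the source was read.

```csharp
using System;
using System.Collections.Generic;

var visited = new List<int>();

// A simplified stand-in for Take (hypothetical, not the library's source).
IEnumerable<int> TakeFirst(IEnumerable<int> source, int count)
{
    if (count <= 0)
    {
        yield break; // nothing requested, so end the sequence immediately
    }

    int taken = 0;
    foreach (int item in source)
    {
        yield return item;
        taken++;
        if (taken == count)
        {
            yield break; // ends the sequence; the rest of source is never pulled
        }
    }
}

// A source that records every element it is asked to produce.
IEnumerable<int> Numbers()
{
    for (int i = 1; i <= 1000; i++)
    {
        visited.Add(i);
        yield return i;
    }
}

var firstThree = new List<int>(TakeFirst(Numbers(), 3));

Console.WriteLine(string.Join(", ", firstThree)); // 1, 2, 3
Console.WriteLine(visited.Count);                 // 3
```

Although Numbers could produce a thousand values, only three are ever generated, because the worker stops being asked for buckets once yield break is reached.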
I did not show examples of the Yield keyword in VB. This keyword should be new in Visual Studio 11. For the VB folks, you will have to implement IEnumerator when you want to create your own iterators as best I can tell.
My articles are usually born out of some interesting or in-depth problem I have answered on the site. I will try to cover LINQ in more detail in future articles. Feel free to post a comment below to inquire about a particular LINQ topic for a future article. In the meantime, thanks for reading, and I hope you have a better understanding of LINQ and iterators and the "magic" you can achieve by using them.
Resources
dotPeek - A .NET decompiler. This can be useful to see how the existing extension methods work.