The
ToArray and
ToList extension methods are a convenient way to quickly convert an enumerated sequence (for example, a Linq query) into an array or list. However, they have something that bothers me: both of these methods are very inefficient if they don’t know the number of elements in the sequence (which almost always happens when you use them in a Linq query). Let's first consider the
ToArray method (
ToList has several differences, but the principle is almost the same).
Basically ,
ToArray takes a sequence and returns an array that contains all the elements of the original sequence. If a sequence implements the
ICollection <T> interface,
ToArray uses the
Count property to put the correct size array in memory and copies the elements of the sequence into it. For example, convert the list of users to an array:
List<User> users = GetUsers(); User[] array = users.ToArray();
In this case,
ToArray is quite effective. Let's change the code so that only their names are retrieved from the list of users and converted into an array:
List<User> users = GetUsers(); string[] array = users.Select(u => u.Name).ToArray();
Now, the argument for
ToArray is the type
IEnumerable <User> returned by the
Select method.
IEnumerable <User> does not implement the
ICollection <User> interface, so
ToArray does not know the number of elements and cannot place an array of appropriate size in memory. He does the following:
')
- First, a small size array is placed in memory (4 elements in the current implementation )
- Elements are copied to the array until it is full.
- If there are no more elements in the source, go to step 7.
- Otherwise, a new array is placed in memory, twice the size of the previous one.
- Elements from the previous array are copied to the new one.
- The transition to item 2 is made and everything repeats.
- If the result is an array larger than the number of elements, it is truncated: a new array is placed in memory with the correct size and elements are copied into it
- The resulting array is returned by the method
If there are only a few elements in the source, then this algorithm is completely painless, but for a very large number of elements it becomes rather inefficient due to the large number of memory allocations and copying of elements.
It is annoying that in most cases we know the original number of elements! In the example above, we use
Select , which does not change the number of elements in the sequence, and we know that the same number remains. But
ToArray does not know about this, because for him this information is not available. Would we have a way to provide him with this on our own ...
Well, in fact, it's pretty easy to do: you just need to write your own extension method, which takes the number of elements as a parameter. It might look like this:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source, int count) { if (source == null) throw new ArgumentNullException("source"); if (count < 0) throw new ArgumentOutOfRangeException("count"); var array = new TSource[count]; int i = 0; foreach (var item in source) { array[i++] = item; }
Note from the translator: PsyHaSTe proposed an alternative method without explicitly specifying the count:
public static TResult[] SelectToArray<T, TResult>(this ICollection<T> source, Func<T, TResult> func) { if (source == null) throw new ArgumentNullException("source"); TResult[] result = new TResult[source.Count]; int i = 0; foreach (var value in source) { result[i++] = func(value); } return result; }
Now we can optimize our example like this:
List<User> users = GetUsers(); string[] array = users.Select(u => u.Name).ToArray(users.Count);
Note that if you pass fewer elements than you actually get, you will get an
IndexOutOfRangeException . You must provide the method with the correct number.
So, what performance will we get now? According to my tests, this version of ToArray is approximately
twice as fast as the standard version for large sequences (tests were run on 1,000,000 items). Pretty good!
Notice that we can improve the
ToList method
in a similar way, using the constructor of the
List <T> class, which allows you to set the initial capacity of the list:
public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source, int count) { if (source == null) throw new ArgumentNullException("source"); if (count < 0) throw new ArgumentOutOfRangeException("count"); var list = new List<TSource>(count); foreach (var item in source) { list.Add(item); } return list; }
In this case, there is not such a large increase in productivity, as in the case of
ToArray (approximately 25% instead of 50%), probably due to the fact that the list does not need to be truncated.
Obviously, the same optimization can be applied to the
ToDictionary method, since the
Dictionary <TKey, TValue> class is also provided by a constructor indicating the initial capacity of the dictionary.
Improved versions of the ToArray and ToList methods are available in my
Linq.Extras library, which also provides many other useful methods for working with sequences and collections.