Optimizing the ToArray and ToList methods by providing the number of elements

The ToArray and ToList extension methods are a convenient way to quickly convert an enumerated sequence (for example, a Linq query) into an array or list. However, they have something that bothers me: both of these methods are very inefficient if they don’t know the number of elements in the sequence (which almost always happens when you use them in a Linq query). Let's first consider the ToArray method ( ToList has several differences, but the principle is almost the same).

Basically , ToArray takes a sequence and returns an array that contains all the elements of the original sequence. If a sequence implements the ICollection <T> interface, ToArray uses the Count property to put the correct size array in memory and copies the elements of the sequence into it. For example, convert the list of users to an array:

List<User> users = GetUsers(); User[] array = users.ToArray();

In this case, ToArray is quite effective. Let's change the code so that only their names are retrieved from the list of users and converted into an array:

 List<User> users = GetUsers(); string[] array = users.Select(u => u.Name).ToArray();

Now, the argument for ToArray is the type IEnumerable <User> returned by the Select method. IEnumerable <User> does not implement the ICollection <User> interface, so ToArray does not know the number of elements and cannot place an array of appropriate size in memory. He does the following:
')

First, a small size array is placed in memory (4 elements in the current implementation )
Elements are copied to the array until it is full.
If there are no more elements in the source, go to step 7.
Otherwise, a new array is placed in memory, twice the size of the previous one.
Elements from the previous array are copied to the new one.
The transition to item 2 is made and everything repeats.
If the result is an array larger than the number of elements, it is truncated: a new array is placed in memory with the correct size and elements are copied into it
The resulting array is returned by the method

If there are only a few elements in the source, then this algorithm is completely painless, but for a very large number of elements it becomes rather inefficient due to the large number of memory allocations and copying of elements.
It is annoying that in most cases we know the original number of elements! In the example above, we use Select , which does not change the number of elements in the sequence, and we know that the same number remains. But ToArray does not know about this, because for him this information is not available. Would we have a way to provide him with this on our own ...
Well, in fact, it's pretty easy to do: you just need to write your own extension method, which takes the number of elements as a parameter. It might look like this:

 public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source, int count) { if (source == null) throw new ArgumentNullException("source"); if (count < 0) throw new ArgumentOutOfRangeException("count"); var array = new TSource[count]; int i = 0; foreach (var item in source) { array[i++] = item; } //   RouR     ,  source.Count < count,  //     if(i != count) throw new ArgumentOutOfRangeException(); return array; }

Note from the translator: PsyHaSTe proposed an alternative method without explicitly specifying the count:

 public static TResult[] SelectToArray<T, TResult>(this ICollection<T> source, Func<T, TResult> func) { if (source == null) throw new ArgumentNullException("source"); TResult[] result = new TResult[source.Count]; int i = 0; foreach (var value in source) { result[i++] = func(value); } return result; }

Now we can optimize our example like this:

 List<User> users = GetUsers(); string[] array = users.Select(u => u.Name).ToArray(users.Count);

Note that if you pass fewer elements than you actually get, you will get an IndexOutOfRangeException . You must provide the method with the correct number.
So, what performance will we get now? According to my tests, this version of ToArray is approximately twice as fast as the standard version for large sequences (tests were run on 1,000,000 items). Pretty good!

Notice that we can improve the ToList method in a similar way, using the constructor of the List <T> class, which allows you to set the initial capacity of the list:

 public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source, int count) { if (source == null) throw new ArgumentNullException("source"); if (count < 0) throw new ArgumentOutOfRangeException("count"); var list = new List<TSource>(count); foreach (var item in source) { list.Add(item); } return list; }

In this case, there is not such a large increase in productivity, as in the case of ToArray (approximately 25% instead of 50%), probably due to the fact that the list does not need to be truncated.
Obviously, the same optimization can be applied to the ToDictionary method, since the Dictionary <TKey, TValue> class is also provided by a constructor indicating the initial capacity of the dictionary.
Improved versions of the ToArray and ToList methods are available in my Linq.Extras library, which also provides many other useful methods for working with sequences and collections.

Source: https://habr.com/ru/post/252761/

All Articles

Optimizing the ToArray and ToList methods by providing the number of elements

More articles: