
DataTable Memory Optimization

I think many people are familiar with the DataTable class. When reading large tables from a database into a client application through ADO.NET, it is sometimes necessary to keep instances of this class alive for a long time, for example when the data needs further processing and analysis without resorting to ORM materialization (yes, it would be better to do this in the database, but not everything can always be pushed there). When the amount of data is small, this is not much of a problem. On wide tables with a large number of rows, however, you can end up with rather fat objects in memory. Comparing the amount of data coming from the database with the size of the resulting DataTable, you can come to two conclusions:

1. String columns usually contain many duplicate values, and keeping only one instance of each distinct string can free a noticeable amount of memory.
2. The DataTable itself is a fairly heavy container, and it could be replaced with a leaner internal structure.

The question of replacing the DataTable with an internal structure will be discussed in a follow-up article, if there is interest. For now, let us consider the first item. As is known, string interning consists in eliminating duplicate strings from memory (you can read more here). We will not use the built-in interning mechanism, so that the strings do not hang around in the process memory after we no longer need them.
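
To make the contrast with built-in interning concrete, here is a minimal sketch of the "temporary dictionary" idea. The class and method names (LocalInterning, Dedup, Reset) are purely illustrative and not part of the article's final code.

using System.Collections.Generic;

// a minimal sketch of "interning" through a temporary dictionary instead of string.Intern
internal static class LocalInterning
{
    private static readonly Dictionary<string, string> ExclusiveStrings =
        new Dictionary<string, string>();

    public static string Dedup(string value)
    {
        if (value == null)
            return null;

        // string.Intern(value) would also collapse duplicates, but interned strings
        // stay in the process-wide pool until the process exits; the dictionary below
        // can simply be cleared once we are done with the table.
        string existing;
        if (ExclusiveStrings.TryGetValue(value, out existing))
            return existing;                 // reuse the single stored instance

        ExclusiveStrings.Add(value, value);  // first occurrence becomes the canonical one
        return value;
    }

    public static void Reset()
    {
        ExclusiveStrings.Clear();            // drop the references so the GC can collect them
    }
}

Once the pass over the data is finished and the dictionary is cleared, strings that have no other references become eligible for garbage collection, which is exactly the difference from the process-wide intern pool.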

The idea is to walk through all the string (varchar) columns of the table and, in each column, replace every duplicate with a reference taken from a temporary dictionary in which each string exists in a single copy. The state of the heap before and after will look like this:
[Figure: heap contents before and after deduplication; duplicate strings collapse into references to a single instance]


It is worth noting that the data in a DataTable is stored by column, not by row, as one might expect. This is done to reduce memory costs: since all values in a column have the same type, they can be kept in ordinary arrays with constant-time access by index. For speed, we will read directly from these arrays via reflection (FieldInfo).

// cached reflection handles: the column's internal storage object and
// the raw string array inside System.Data.Common.StringStorage
private static readonly FieldInfo StorageField =
    typeof(DataColumn).GetField("_storage", BindingFlags.Instance | BindingFlags.NonPublic);

private static readonly FieldInfo ValuesField =
    typeof(DataTable).Assembly.GetType("System.Data.Common.StringStorage")
        .GetField("values", BindingFlags.Instance | BindingFlags.NonPublic);
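
Note that "_storage" and "values" are private implementation details of the .NET Framework and may change between framework versions, so it is worth guarding against the lookups returning null. The helper below is a hypothetical illustration (the name TryGetStringStorage is mine) of how these handles are meant to be used.

// hypothetical helper: extracts the raw string array backing a string column,
// or returns null if the internal layout does not match expectations
private static string[] TryGetStringStorage(DataColumn column)
{
    if (StorageField == null || ValuesField == null)
        return null;                                   // internals changed in this runtime

    object storage = StorageField.GetValue(column);    // the column's DataStorage instance
    return storage == null ? null : ValuesField.GetValue(storage) as string[];
}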

As I mentioned above, we will use a Dictionary to keep a single instance of each string in memory.

  var exclusiveStrings = new Dictionary<string, string>(); 

The key is needed only for its hash; the value holds the reference to the single instance in memory. It is worth noting that the Dictionary resolves hash collisions by chaining entries into buckets.

Full method code:

/// <summary>
/// Removes duplicate strings from the table's string columns, leaving a single
/// instance of each distinct value in memory.
/// </summary>
/// <param name="table">The table to compact.</param>
/// <returns>The same table with deduplicated string values.</returns>
public static DataTable Compact(this DataTable table)
{
    if (table.Rows.Count == 0)
        return table;

    var exclusiveStrings = new Dictionary<string, string>();

    for (int column = 0; column < table.Columns.Count; ++column)
    {
        if (table.Columns[column].DataType == typeof(string))
        {
            // get the column's underlying value array via reflection (no copying)
            var values = (string[])ValuesField.GetValue(StorageField.GetValue(table.Columns[column]));

            int rowCount = table.Rows.Count;
            for (int row = 0; row < rowCount; ++row)
            {
                var value = values[row];
                if (value != null)
                {
                    string hashed;
                    if (!exclusiveStrings.TryGetValue(value, out hashed))
                        // first occurrence: remember it
                        exclusiveStrings.Add(value, value);
                    else
                        // duplicate: point the cell at the stored instance
                        values[row] = hashed;
                }
            }

            exclusiveStrings.Clear();
        }
    }

    return table;
}
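
For completeness, here is a rough usage sketch. The connection string, query, table name and the LoadCompacted wrapper are placeholders of my own; Compact is assumed to be defined in a static extension class visible at the call site.

using System.Data;
using System.Data.SqlClient;

public static class Example
{
    public static DataTable LoadCompacted(string connectionString)
    {
        var table = new DataTable();

        // the query and table name below are illustrative only
        using (var connection = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter("SELECT * FROM dbo.SomeWideTable", connection))
        {
            adapter.Fill(table);
        }

        return table.Compact(); // duplicates in string columns are collapsed in place
    }
}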

Let us estimate the complexity of the resulting algorithm for a single column. It consists of two main parts: iterating over all n values and looking each one up in the Dictionary by key. The first part is straightforward; the lookup is a little more interesting. In the fantastically bad case where all n values collide and land in the same bucket, the lookup degrades to linear time. But, given the implementation of the string hash function in .NET, this thought can cause nothing but a smile. The Dictionary developers apparently came to the same conclusion, since the documentation for TryGetValue declares O(1) complexity. Thus, within a single column, the expected complexity comes to O(n). For the whole table it is O(mn), where m is the number of string columns.

Let us look at the speed on real data:

[Chart: Compact() running time versus number of rows; the dependence is linear]

As can be seen, the running time grows linearly, as was evident from the analysis of the algorithm. As for memory, at every step the algorithm either adds a new reference to the dictionary as it fills, or, when it finds a duplicate, redirects the cell's reference to the already stored string, leaving the duplicate for the garbage collector.
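
A simple way to reproduce such a timing measurement is sketched below. The CompactBenchmark class, the BuildTable generator, and the row counts and duplicate ratio are all made up for illustration, not taken from the article's data.

using System;
using System.Data;
using System.Diagnostics;

public static class CompactBenchmark
{
    // builds a single-column table whose values repeat with the given cardinality
    private static DataTable BuildTable(int rows, int distinctValues)
    {
        var table = new DataTable();
        table.Columns.Add("Value", typeof(string));
        for (int i = 0; i < rows; ++i)
            table.Rows.Add("value_" + (i % distinctValues));
        return table;
    }

    public static void Run()
    {
        foreach (var rows in new[] { 100000, 500000, 1000000 })
        {
            var table = BuildTable(rows, distinctValues: 1000);
            var watch = Stopwatch.StartNew();
            table.Compact();
            watch.Stop();
            Console.WriteLine("{0,10} rows: {1} ms", rows, watch.ElapsedMilliseconds);
        }
    }
}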

It is easy to see that the efficiency of the algorithm depends on the input data: the more duplicates there are, the more memory is freed. In my tasks, memory consumption dropped by 30-50%, but your results may differ.
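
To see how the saving depends on the share of duplicates in your own data, one could compare the managed heap size before and after the pass, roughly as below. The MemoryCheck class is a hypothetical sketch: GC.GetTotalMemory with a forced collection gives only an approximate figure, and it assumes no other code is allocating at the same time.

using System;
using System.Data;

public static class MemoryCheck
{
    public static void Measure(DataTable table)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        table.Compact();

        long after = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine("Freed roughly {0:N0} bytes ({1:P0})",
            before - after, (before - after) / (double)before);
    }
}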

I hope someone finds this solution as useful as I did.

Source: https://habr.com/ru/post/259069/

