I think many people are familiar with the DataTable class. When reading large tables from a database to the client through ADO.NET, it is sometimes necessary to keep instances of this class alive for a long time, for example when the data needs further processing and analysis and you do not want to resort to ORM materialization (yes, it would be better to do this in the database, but not everything can always be pushed there). With small amounts of data this is not much of a problem, but with wide tables that have many rows you can end up with rather fat objects in memory. Comparing the amount of data coming from the database with the size of the resulting DataTable, you can come to two conclusions:
- With large amounts of varchar data that contain duplicates, you can try to organize some kind of string interning inside the DataTable;
- DataTable carries a fairly hefty internal infrastructure. By varying the data types and the number of rows, I was able to establish that the overhead is around 15-20% for tables of 100 thousand records. Most of that infrastructure supports correct editing and other table functionality; when all you need from the DataTable is a simple container for data fetched from the database, you can write a lightweight, gutted analogue of this class (a rough way to measure the overhead is sketched right after this list).
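A rough way to see this overhead for yourself is to compare GC.GetTotalMemory before and after building a table. The schema and row count below are arbitrary and only illustrate the measurement; they are not meant to reproduce the 15-20% figure:

```csharp
using System;
using System.Data;

class MemoryProbe
{
    static void Main()
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        // Arbitrary schema and row count, chosen only to illustrate the measurement.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        for (int i = 0; i < 100_000; i++)
            table.Rows.Add(i, "Customer " + (i % 1000)); // plenty of duplicate strings

        long after = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine($"Approximate size of the table in memory: {after - before:N0} bytes");
        GC.KeepAlive(table);
    }
}
```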
Replacing the DataTable with such an internal structure will be covered in the next article, if there is interest. For now, let's look at the first item. As is known, string interning boils down to keeping a single instance of each distinct string in memory (you can read more here). We will not use the built-in interning mechanism, so that the strings do not hang around in the process memory after we no longer need them.
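To make the difference concrete, here is a small self-contained illustration (the demo class is hypothetical) of the built-in pool versus a throwaway dictionary:

```csharp
using System;
using System.Collections.Generic;

static class InternDemo
{
    static void Main()
    {
        string a = new string("order".ToCharArray()); // a fresh, non-interned instance

        // Built-in interning: the returned instance lives in the process-wide
        // intern pool for the lifetime of the application.
        string pooled = string.Intern(a);

        // Local "interning": the pool is an ordinary dictionary, so the strings
        // become collectible as soon as the dictionary itself is unreachable.
        var localPool = new Dictionary<string, string>();
        if (!localPool.TryGetValue(a, out string shared))
            localPool[a] = shared = a;

        Console.WriteLine(ReferenceEquals(pooled, string.IsInterned("order"))); // True
        Console.WriteLine(ReferenceEquals(shared, a));                          // True
    }
}
```

The process-wide pool would defeat the purpose here: the whole point is to let the GC reclaim the strings once the DataTable itself is gone.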
The idea is to walk through all the varchar columns of the table and, in each column, replace every duplicate with a reference taken from a temporary dictionary in which each string is kept in a single copy. The state of the heap before and after will be as follows:

It is worth noting that the data in a DataTable is stored by columns rather than by rows, as one might expect. This is done to reduce memory consumption: since all the values within a column have the same type, they can be stored in regular arrays with constant-time access by index. For speed, we will work with these arrays directly via reflection (FieldInfo).
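Such direct access could look roughly like the sketch below. The non-public field names "_storage" and "values" are assumptions about implementation details and differ between .NET Framework and .NET versions, so treat this purely as an illustration of the approach:

```csharp
using System.Data;
using System.Reflection;

static class ColumnAccess
{
    // Sketch only: relies on non-public DataColumn internals (field names assumed);
    // returns null if the internals do not match this layout.
    public static string[] TryGetStringArray(DataColumn column)
    {
        FieldInfo storageField = typeof(DataColumn).GetField(
            "_storage", BindingFlags.Instance | BindingFlags.NonPublic);
        object storage = storageField?.GetValue(column);

        FieldInfo valuesField = storage?.GetType().GetField(
            "values", BindingFlags.Instance | BindingFlags.NonPublic);
        return valuesField?.GetValue(storage) as string[];
    }
}
```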
As mentioned above, we will use a Dictionary to keep each string as a single instance in memory:
var exclusiveStrings = new Dictionary<string, string>();
The Key is needed only for its hash, the Value for the reference to the single instance in memory. It is worth noting that the Dictionary resolves hash collisions with buckets (chaining).
Full method code:
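A minimal sketch of such a method, assuming the public DataTable API is used instead of the internal arrays (class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DataTableStringInterner
{
    // Replaces duplicate values in every string column with a single shared
    // instance, using a local dictionary instead of the built-in intern pool.
    public static void InternStrings(DataTable table)
    {
        foreach (DataColumn column in table.Columns)
        {
            if (column.DataType != typeof(string))
                continue; // only string (varchar) columns are of interest

            // One dictionary per column keeps each distinct value exactly once.
            var exclusiveStrings = new Dictionary<string, string>();

            foreach (DataRow row in table.Rows)
            {
                if (row.RowState == DataRowState.Deleted)
                    continue;

                object cell = row[column];
                if (cell == DBNull.Value)
                    continue;

                var value = (string)cell;
                if (exclusiveStrings.TryGetValue(value, out string existing))
                {
                    if (!ReferenceEquals(existing, value))
                        row[column] = existing; // duplicate: point at the shared instance
                }
                else
                {
                    exclusiveStrings.Add(value, value); // first occurrence becomes canonical
                }
            }
            // The dictionary becomes unreachable here, so it does not keep
            // the strings alive the way string.Intern would.
        }

        // Editing rows through the public API creates original row versions;
        // accept them so the replaced duplicates are not kept alive.
        table.AcceptChanges();
    }
}
```

Writing into the internal column arrays directly, as the article does, avoids both the per-cell indexer cost and the extra row versions that public edits create.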
Let's estimate the complexity of the resulting algorithm for a single column. The cost consists of two main parts: traversing all n values and looking each one up in the Dictionary by key. The first part is straightforward; the lookup is a bit more interesting. In the fantastically bad case where all n values collide and land in the same bucket, the lookup degrades to linear time, but given the implementation of the string hash function in .NET, that thought can only make you smile. The Dictionary developers came to the same conclusion: the documentation for TryGetValue declares O(1) complexity. Thus, within a single column, the expected complexity comes out to O(n), and for the entire table it is O(mn), where m is the number of string columns.
Consider the speed on real data:

As can be seen, the running time is linear, just as the analysis of the algorithm suggested. As for memory, while filling the dictionary the algorithm adds at most one new reference per step and, whenever it finds a duplicate, drops the reference to the duplicate string so that it can be collected.
It is easy to see that the efficiency of the algorithm depends on the input data: the more duplicates there are, the more memory is freed. In my tasks, memory consumption dropped by 30-50%, but your mileage may vary.
I hope someone finds this solution as useful as I did.