String collections are read-only: we save on matches

It often happens that the program loads some data into memory and leaves it there for a long time (or even until the end of work) in an unchanged form. It uses data structures that are optimized for both reading and writing. For example, you subtract from Ensembl a list of identifiers of all human genes (including all sorts of miRNAs, etc. - just a little more than 50,000). If you read them in the standard ArrayList, then on a 32-bit HotSpot you will spend a little more than 4 megabytes. Is it possible to save memory, knowing that the collection will no longer change?

We will not deal with the possibility of reading data from the database in parts: this is a topic for another conversation. In addition, reading in parts is quite possible to combine with the approaches described here, achieving additional memory savings. Also, we will not use, for example, the -XX virtual machine option: + UseCompressedStrings: it saves memory not only in our case and can also be combined with our methods.

Before you optimize something, you need to measure it. In our case, we are talking about the amount of memory occupied by the list. And, of course, we are not interested in the shallow size (the size of the ArrayList object itself), but in the retained size (the size of the ArrayList and all the objects to which it refers directly or indirectly). Measurement will be made using Eclipse Memory Analyzer . Find the desired object in the heap (for example, by the name of the class) and use the Show retained set option. We get this picture:

')
As expected, ArrayList itself weighs quite a bit - 24 bytes. The Object [] array of string references, defined inside the ArrayList, weighs decently, but still more than 90% of the space is eaten by the string objects themselves and the associated char [] arrays.

A small improvement can be achieved if you trim the Object [] array to the actual number of elements. As you know, ArrayList allocates space with a margin so as not to re-allocate each time an element is added. Once we know that the elements will no longer be added, this supply is useless. You can do this simply and effectively:

List<String> genes = readGenes(); genes = Arrays.asList(genes.toArray(new String[genes.size()]));

Here the result is no longer an ArrayList (or rather, an ArrayList, but different). We first create a copy of the array of the required length, and then we make a wrapper over the copy. Now the retained set looks like this:

The array became a bit smaller, but in the end we saved 0.44% of the space. Well, nothing, we just warm up.

Let's look at these char [] arrays. They contain identifiers of genes like “ENSG00000121410” (15 characters - 30 bytes), another 8 bytes of system information about the object and 4 bytes of length each. And these 42 bytes are aligned by 8, that is, 6 unused bytes are added. Thus, out of 48 bytes, only 30 carry useful information (with + UseCompressedStrings they can be reduced to 15, then the share of “extra” information will increase). What can be done with this?

Recall how String.substring works. The string that is obtained as a result divides the common char [] buffer with the parent string: in the child string, the values of offset and count are set accordingly. In this case, the child string is no different from the user’s point of view, but if the parent line was deleted, the full buffer is still stored, even if you cut off one character from the megabyte line. Sometimes it is annoying, you have to look for such places in the code and sign the new String, but now we, on the contrary, will help. We put all the rows in the array on one char [] buffer. To do this, we write an auxiliary method in some utility class:

 public static List<String> compactify(Collection<String> list) { StringBuilder sb = new StringBuilder(); for( String item : list ) sb.append(item); String data = sb.toString(); List<String> result = new ArrayList<String>(list.size()); int index = 0; for( String item : list ) { result.add(data.substring(index, index + item.length())); index += item.length(); } return result; }

The method first sticks together all the strings of the original collection into one long string, and then cuts it into substrings using the substring. Run our list of genes through this seemingly useless operation and look at the retained set:

Yeah, already significant. Saved almost all of those extra bytes, as mentioned above, in the amount of winning 24% of the original size. If the strings in your array are shorter than 15 characters, then the savings will be even greater. Note, by the way, that when creating an ArrayList, we explicitly specified its length, so Object [] does not contain redundant elements, that is, we did not lose the first optimization.

The resulting array can be safely used for writing, you just have to keep in mind that any line drags all the others along with it, so if we try to delete the lines, we will not release all of the memory. But since we agreed that the array is read-only, everything is fine. Be careful if there are duplicate lines in the input array: the function does not check this, and you may lose more than get on it.

Still, 24% is small. The eye is cut by String objects, which in fact have no benefit (the data itself is in char []). Maybe you can not store them, and cut the string on demand? To do this, we have to write our own implementation of the List and store the starting line offsets:

 public class CompactStringList extends AbstractList<String> implements RandomAccess { private String data; private int[] offsets; public CompactStringList(Collection<String> list) { StringBuilder sb = new StringBuilder(); offsets = new int[list.size()]; int offset = 0, i = 0; for( String item : list ) { sb.append(item); offsets[i++] = offset; offset+=item.length(); } data = sb.toString(); } public String get(int index) { return index == offsets.length - 1 ? data.substring(offsets[index]) : data.substring(offsets[index], offsets[index + 1]); } public int size() { return offsets.length; } }

In the constructor, we again glued all the input lines into one, remembering the offset of the beginnings of the lines in the array. Upon receipt of the element (for simplicity, I do not check the correctness of the parameter), we cut the desired piece from the string. In this case, only a new String object is created, and the data itself is not copied. Take a look at the retained set of such an object:

Now we have saved 55.5% compared to the original version, here you can start dancing.

Of course, the latest optimization is the most controversial and depends on further usage scenarios. For example, if you often get the same element get and store the resulting strings in other collections, you will have String objects. Although they point to a common buffer, you will lose 40 bytes per line (in 32-bit HotSpot). Do not get carried away with optimizations for the sake of optimizations: it can only get worse. However, in many applications this method may be useful. Well, or at the very least, expand the understanding of Java features to some readers.

Source: https://habr.com/ru/post/145145/

All Articles

String collections are read-only: we save on matches

More articles: