
Why C is faster than Java (from the perspective of a Java developer)

On the Git mailing list, a discussion of JGit turned to how a high-level programming language limits application performance. The thread is especially interesting because top-level experts in both C and Java took part. One of them is Shawn O. Pearce, a well-known Java programmer at Google, an active Eclipse committer, a co-author of Git, and the author of JGit, a Java implementation of Git. In his message he described the real limitations a highly skilled developer runs into when trying to write Java code whose performance is comparable to heavily optimized C. Although the message dates from April 2009, some of Shawn's arguments are still relevant.

List: git
Subject: Re: Why Git is so fast (Was: Re: Eric Sink's blog - notes on git,
From: “Shawn O. Pearce” <spearce () spearce! org>

As mentioned earlier, we have made a lot of small optimizations in the C code of Git to get really high performance. 5% here, 10% there, and suddenly you are 60% faster than before. Nico [Pitre], Linus [Torvalds] and Junio [Hamano] have all spent time over the last three or four years optimizing individual parts of Git, purely to make it run as fast as possible.

High-level programming languages hide the machine to some extent, so we cannot make all of these optimizations.
For example, JGit suffers from the lack of mmap(), and even when we use a Java NIO MappedByteBuffer we still have to copy the data into a temporary byte[] array before we can do any real processing. C Git has no such copy. Of course, other high-level languages may offer a more convenient mmap facility, but they also tend to be garbage-collected, and most of them try to tie mmap management to the garbage collector "for safety and simplicity."
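The extra copy described above can be sketched in a few lines of Java. This is an illustration only, not JGit's actual code: the class MappedCopyDemo, the method readRegion and the temporary file are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical demo class; the names are not taken from JGit.
class MappedCopyDemo {

    // Maps a region of a file read-only, then copies it into a heap array
    // so that ordinary byte[]-based code can process it.
    static byte[] readRegion(Path file, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] tmp = new byte[length];
            map.get(tmp); // the copy that C code working on an mmap()ed pointer avoids
            return tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-copy-demo", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40, 50});
        System.out.println(Arrays.toString(readRegion(file, 1, 3))); // [20, 30, 40]
    }
}
```

In C the equivalent code would simply read through the pointer returned by mmap(), with no intermediate array.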

JGit also suffers from the lack of unsigned data types in Java. There are many places in JGit where we really need an unsigned int32_t, an unsigned long (the largest available machine word) or an unsigned char, but these types simply do not exist in Java. Converting a byte to an int just to treat it as unsigned requires an extra & 0xFF operation to clear the sign extension.
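A minimal sketch of the masking described above. The class UnsignedDemo and its helper names are hypothetical and exist only for this illustration.

```java
// Hypothetical helpers; the names are not taken from JGit.
class UnsignedDemo {

    // Treats a signed Java byte as an unsigned value in the range 0..255.
    static int u8(byte b) {
        return b & 0xFF; // the mask clears the sign-extended upper bits
    }

    // Treats a signed Java int as an unsigned 32-bit value held in a long.
    static long u32(int v) {
        return v & 0xFFFFFFFFL;
    }

    // Decodes a big-endian unsigned 32-bit integer from a byte array.
    static long decodeUInt32(byte[] buf, int off) {
        return ((long) (buf[off] & 0xFF) << 24)
                | ((buf[off + 1] & 0xFF) << 16)
                | ((buf[off + 2] & 0xFF) << 8)
                | (buf[off + 3] & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xF0;
        System.out.println(b);      // -16
        System.out.println(u8(b));  // 240
        System.out.println(u32(-1)); // 4294967295
    }
}
```

Every byte read this way pays for one extra AND, which is the per-access cost the message refers to. Later JDKs (8 and up) ship helpers such as Byte.toUnsignedInt, which perform the same mask.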

JGit suffers from the lack of an efficient way to represent a SHA-1. In C you can simply write unsigned char[20] and have the bytes stored inline in the memory of the containing structure. In Java, a byte[20] costs an additional 16 bytes of memory and is slower to access, because the bytes live in a different area of memory from the containing object. We try to work around this by converting the byte[20] into five ints, but that costs extra machine instructions.

C Git takes for granted that memcpy(a, b, 20) is extremely cheap when copying from an inflated tree into an object structure. JGit has to pay a heavy penalty to copy those 20 bytes into five ints, because later on those five ints are cheaper to work with.
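The trade-off above can be sketched as follows. This is an assumption-laden illustration, not JGit's ObjectId: the class Sha1Id, its field names and its methods are hypothetical.

```java
// Hypothetical value class: a 20-byte SHA-1 stored as five int fields,
// so the hash lives inside the object instead of in a separate byte[].
final class Sha1Id {
    final int w1, w2, w3, w4, w5;

    Sha1Id(int w1, int w2, int w3, int w4, int w5) {
        this.w1 = w1; this.w2 = w2; this.w3 = w3; this.w4 = w4; this.w5 = w5;
    }

    // Where C would do memcpy(a, b, 20), Java has to assemble five ints.
    static Sha1Id fromRaw(byte[] raw, int off) {
        return new Sha1Id(readInt(raw, off), readInt(raw, off + 4),
                readInt(raw, off + 8), readInt(raw, off + 12), readInt(raw, off + 16));
    }

    // Big-endian decode: three masks and three shifts per int.
    private static int readInt(byte[] b, int p) {
        return (b[p] << 24)
                | ((b[p + 1] & 0xFF) << 16)
                | ((b[p + 2] & 0xFF) << 8)
                | (b[p + 3] & 0xFF);
    }

    // The payoff: comparison is five int compares with no array indirection.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Sha1Id)) return false;
        Sha1Id other = (Sha1Id) o;
        return w1 == other.w1 && w2 == other.w2 && w3 == other.w3
                && w4 == other.w4 && w5 == other.w5;
    }

    // SHA-1 output is already well distributed, so one word is a usable hash.
    @Override
    public int hashCode() {
        return w1;
    }

    public static void main(String[] args) {
        byte[] raw = new byte[20];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) (0xF0 + i);
        Sha1Id a = Sha1Id.fromRaw(raw, 0);
        Sha1Id b = Sha1Id.fromRaw(raw.clone(), 0);
        System.out.println(a.equals(b)); // true
    }
}
```

The conversion cost is paid once per object; equality checks and hashing afterwards touch only the five fields.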

Other high-level programming languages also lack a way to mark a type as unsigned, or they pay similar penalties for storing a 20-byte binary region.

The native Java collection types have been a real trap for us in JGit. We used java.util.* types where they were convenient and seemed to already solve the data-structure problem at hand, but as a rule they performed much worse than a specialized data structure written for the job.

For example, we have ObjectIdSubclassMap for what should look like a Map<ObjectId,Object>. The only catch is that the Object type you use as the "value" must extend ObjectId, because the same instance serves as both key and value. It is far faster than a HashMap<ObjectId,Object>. (In case anyone does not know, ObjectId is JGit's equivalent of unsigned char[20] for a SHA-1.)

Just a couple of days ago I wrote LongMap, a faster version of HashMap<Long,Object>, for hashing objects by their indexes in a pack file. Same story here: the boxing cost of converting a long (the largest integer type) into an Object that the standard HashMap would accept was rather high.
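To show what avoiding the boxing cost can look like, here is a minimal open-addressing map keyed by primitive long. It is a sketch under stated assumptions, not the LongMap from JGit: the class LongKeyMap, its linear-probing layout and its hash multiplier are all choices made for this example, and null values and removal are deliberately unsupported.

```java
// Hypothetical primitive-keyed map; keys are stored in a long[] so no
// Long objects are allocated on put() or get().
final class LongKeyMap<V> {
    private long[] keys = new long[16];       // capacity is always a power of two
    private Object[] values = new Object[16]; // a null value marks an empty slot
    private int size;

    @SuppressWarnings("unchecked")
    V get(long key) {
        int i = slot(key, keys.length);
        while (values[i] != null) {
            if (keys[i] == key) return (V) values[i];
            i = (i + 1) & (keys.length - 1); // linear probing
        }
        return null;
    }

    void put(long key, V value) {
        if (value == null) throw new NullPointerException("null values are not supported");
        if ((size + 1) * 2 > keys.length) grow(); // keep the table at most half full
        if (insert(keys, values, key, value)) size++;
    }

    int size() {
        return size;
    }

    // Returns true if a new key was added, false if an existing one was replaced.
    private static boolean insert(long[] ks, Object[] vs, long key, Object value) {
        int i = slot(key, ks.length);
        while (vs[i] != null) {
            if (ks[i] == key) { vs[i] = value; return false; }
            i = (i + 1) & (ks.length - 1);
        }
        ks[i] = key;
        vs[i] = value;
        return true;
    }

    private void grow() {
        long[] nk = new long[keys.length * 2];
        Object[] nv = new Object[values.length * 2];
        for (int i = 0; i < keys.length; i++) {
            if (values[i] != null) insert(nk, nv, keys[i], values[i]);
        }
        keys = nk;
        values = nv;
    }

    // Multiplicative hash; the upper bits are folded into the table index.
    private static int slot(long key, int capacity) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 32) & (capacity - 1);
    }

    public static void main(String[] args) {
        LongKeyMap<String> m = new LongKeyMap<>();
        m.put(4096L, "commit");
        m.put(8192L, "tree");
        System.out.println(m.get(4096L)); // commit
        System.out.println(m.size());     // 2
    }
}
```

With HashMap<Long,Object>, each put or get with a long key outside the small cached range allocates a Long wrapper, which is the overhead the message describes.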

Even now JGit is still slower when it comes to picking apart a commit or a tree object to follow its object links, or when calling inflate(). We spend much more time on this work than C Git does, even though we try to stay as close to the machine as we can: using byte[] wherever possible, avoiding copies, and avoiding memory allocation whenever possible.

Notably, rev-list --objects --all takes about twice as long in JGit as in C Git on a project like the Linux kernel, and index-pack on a pack file of about 270 MB also takes about twice as long.

Both of these parts of JGit are about as good as I know how to make them, but we really are at the mercy of the JIT, and any change in the JIT can make our performance worse (or better). C Git is different: there, Linus Torvalds can work through sections of code at the assembler level and try different approaches.

So yes, it is practical to build Git in a high-level language, but you simply cannot get the same performance or the same tight memory usage that C Git gets. That is the price of the abstraction a high-level language provides. Still, JGit performs quite well; well enough that we use it as a Git server inside Google.

P.S. Shawn Pearce's message was written in 2009, so it does not take into account the changes made in Java 1.7. For example, Java now uses escape analysis to avoid heap allocation where possible.

Source: https://habr.com/ru/post/136210/