List: git
Subject: Re: Why Git is so fast (Was: Re: Eric Sink's blog - notes on git,
From: “Shawn O. Pearce” <spearce () spearce! org>
As mentioned earlier, we have made many small optimizations in Git's C code to achieve really high performance. 5% here, 10% there, and suddenly you are 60% faster than before. Nico [Pitre], Linus [Torvalds] and Junio [Hamano] have all spent time over the last three or four years optimizing individual pieces of Git, purely to make it run as fast as possible.
High-level programming languages hide the machine to a certain extent, so we cannot carry out all of these optimizations.
For example, JGit suffers from not being able to use mmap(), and even when we use the Java NIO MappedByteBuffer, we still have to copy into a temporary byte[] array before we can do any real processing. C Git has no such copy. Of course, other high-level languages may offer a more convenient mmap facility, but they also tend to rely on garbage collection, and most of them try to tie mmap management to the garbage collector "for safety and simplicity."
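The extra copy can be illustrated with a minimal sketch. This is not JGit's code: the class MappedCopyDemo and its methods readRegion and demo are hypothetical names chosen for illustration, and the file contents are made up. It maps a file with the standard NIO API and then copies a region into a temporary byte[] so that array-based code can work on it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration, not taken from JGit.
class MappedCopyDemo {
    // Maps the file read-only, then copies len bytes starting at offset
    // into a fresh byte[]. That bulk copy is the overhead discussed above.
    static byte[] readRegion(Path file, int offset, int len) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] tmp = new byte[len];
            map.position(offset);
            map.get(tmp, 0, len); // copy out of the mapping into the heap array
            return tmp;
        }
    }

    // Writes a small temporary file and reads three bytes back through the mapping.
    static byte[] demo() {
        try {
            Path f = Files.createTempFile("mapped-demo", ".bin");
            f.toFile().deleteOnExit();
            Files.write(f, new byte[] {10, 20, 30, 40, 50});
            return readRegion(f, 1, 3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

In C, the pointer returned by mmap() can be handed directly to the parsing code, so the equivalent of tmp never has to exist.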
JGit also suffers from the lack of unsigned data types in Java. There are many places in JGit where we really need "unsigned int32_t" or "unsigned long" (the largest machine word available) or "unsigned char", but these types simply do not exist in Java. Converting a byte to an int just to treat it as unsigned requires an extra "& 0xFF" operation to strip the sign extension.
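A small sketch of the masking in question follows. The class UnsignedDemo and the helpers u8 and u32 are hypothetical names used only for illustration; the second helper assumes big-endian byte order.

```java
// Hypothetical illustration of emulating unsigned types in Java.
class UnsignedDemo {
    // Reads a byte as an unsigned value in the range 0..255.
    static int u8(byte b) {
        return b & 0xFF; // the mask removes the sign extension
    }

    // Decodes four big-endian bytes as an unsigned 32-bit value.
    // The result needs a long, because int cannot hold values above 2^31 - 1.
    static long u32(byte[] buf, int off) {
        return ((long) (buf[off] & 0xFF) << 24)
             | ((buf[off + 1] & 0xFF) << 16)
             | ((buf[off + 2] & 0xFF) << 8)
             | (buf[off + 3] & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xF0;
        System.out.println((int) b); // sign-extended value
        System.out.println(u8(b));   // masked value
        System.out.println(u32(new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF}, 0));
    }
}
```

Every read of a byte that is meant to be unsigned pays for this mask, whereas C simply declares the variable as unsigned char.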
JGit suffers from not having an efficient way to represent a SHA-1. In C you can simply write "unsigned char[20]" and have it stored inline in the memory of the containing structure. In Java, a byte[20] costs an additional 16 bytes of memory and is slower to access, because the bytes themselves live in a different area of memory from the container object. We try to work around this by converting the byte[20] to five ints, but that costs additional machine instructions.
C Git takes for granted that memcpy(a, b, 20) is extremely cheap when copying from an inflated tree into a struct object. JGit has to pay a large penalty to copy those 20 bytes out into five ints, because later on those five ints are cheaper to work with.
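The conversion from 20 raw bytes to five ints can be sketched as below. This is an assumption-laden illustration rather than JGit's actual class: the name Sha1Ints, its field names, and the choice of hashCode are all hypothetical.

```java
// Hypothetical illustration: a 20-byte SHA-1 held as five int fields,
// so the hash lives inline in the object instead of in a separate byte[].
class Sha1Ints {
    final int w1, w2, w3, w4, w5;

    // Decodes 20 bytes starting at off. This is the per-object cost that
    // a plain memcpy(a, b, 20) avoids in C.
    Sha1Ints(byte[] raw, int off) {
        w1 = be32(raw, off);
        w2 = be32(raw, off + 4);
        w3 = be32(raw, off + 8);
        w4 = be32(raw, off + 12);
        w5 = be32(raw, off + 16);
    }

    // Assembles one big-endian 32-bit word; note the masks on the low bytes.
    static int be32(byte[] b, int p) {
        return (b[p] << 24)
             | ((b[p + 1] & 0xFF) << 16)
             | ((b[p + 2] & 0xFF) << 8)
             | (b[p + 3] & 0xFF);
    }

    // Comparing two hashes is five int comparisons, with no array to chase.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Sha1Ints)) return false;
        Sha1Ints x = (Sha1Ints) o;
        return w1 == x.w1 && w2 == x.w2 && w3 == x.w3 && w4 == x.w4 && w5 == x.w5;
    }

    // A SHA-1 is already well mixed, so one word serves as the hash code.
    @Override
    public int hashCode() {
        return w1;
    }

    String toHex() {
        return String.format("%08x%08x%08x%08x%08x", w1, w2, w3, w4, w5);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[20];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
        System.out.println(new Sha1Ints(raw, 0).toHex());
    }
}
```

The trade-off is the one described above: decoding costs shifts and masks up front, but equality checks and hashing afterwards touch only fields of the object itself.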
Other high-level programming languages also lack a way to mark a type as unsigned, or pay similar penalties for storing a 20-byte binary region.
The native Java collection types have been a real trap for us in JGit. We used the java.util.* types where they seemed handy and appeared to already solve the data-structure problem at hand, but they tend to perform much worse than a specialized data structure written for the job.
For example, we have ObjectIdSubclassMap for what should be a Map<ObjectId,Object>. The only catch is that the Object type you use as the "value" must extend ObjectId, because the same instance serves as both key and value. Yet it is dramatically faster than a HashMap<ObjectId,Object>. (For those who do not know, ObjectId is JGit's "unsigned char[20]" for a SHA-1.)
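The idea of a map whose values double as their own keys can be sketched as follows. This is not the real ObjectIdSubclassMap: SelfKeyedMap, Id and Commit are hypothetical classes, Id uses a single long as a stand-in for a 20-byte hash, and the open-addressing scheme is an assumption made for the example.

```java
// Hypothetical illustration: the stored object is both key and value,
// so no separate entry objects or key objects are allocated.
class SelfKeyedMap<V extends Id> {
    private Id[] table = new Id[16]; // length is always a power of two
    private int size;

    // Adds a value. Duplicates are not checked in this sketch.
    void add(V v) {
        if ((size + 1) * 2 > table.length) grow();
        insert(table, v);
        size++;
    }

    // Looks up by any Id; returns the stored subclass instance or null.
    @SuppressWarnings("unchecked")
    V get(Id key) {
        int mask = table.length - 1;
        int i = key.hashCode() & mask;
        for (Id e; (e = table[i]) != null; i = (i + 1) & mask) {
            if (e.equals(key)) return (V) e;
        }
        return null;
    }

    // Linear probing: place the value in the first free slot from its hash.
    private static void insert(Id[] t, Id v) {
        int mask = t.length - 1;
        int i = v.hashCode() & mask;
        while (t[i] != null) i = (i + 1) & mask;
        t[i] = v;
    }

    private void grow() {
        Id[] old = table;
        table = new Id[old.length * 2];
        for (Id e : old) if (e != null) insert(table, e);
    }

    public static void main(String[] args) {
        SelfKeyedMap<Commit> map = new SelfKeyedMap<>();
        map.add(new Commit(42L, "initial import"));
        System.out.println(map.get(new Id(42L)).message);
    }
}

// Stand-in for an object id; a real one would hold a 20-byte SHA-1.
class Id {
    final long bits;
    Id(long bits) { this.bits = bits; }
    @Override public boolean equals(Object o) { return o instanceof Id && ((Id) o).bits == bits; }
    @Override public int hashCode() { return Long.hashCode(bits); }
}

// A value type that carries its own key by extending Id.
class Commit extends Id {
    final String message;
    Commit(long bits, String message) { super(bits); this.message = message; }
}
```

Compared with a HashMap, each stored element here is a single object in a flat array, which is the kind of saving a specialized structure can offer.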
Just a couple of days ago I wrote LongMap, a faster version of HashMap<Long,Object>, for hashing objects by their offsets in a pack file. It is the same story again: the cost of boxing in Java, converting a long (the largest integer type) into an Object suitable for the standard HashMap type, was rather high.
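A map keyed by primitive long values, with no boxing, might look roughly like the sketch below. It is not the LongMap mentioned above: PrimitiveLongMap is a hypothetical name, and the open-addressing layout, the load factor, and the ban on null values are assumptions made to keep the example short.

```java
// Hypothetical illustration: keys are stored in a long[] so that neither
// put nor get ever allocates a java.lang.Long.
class PrimitiveLongMap<V> {
    private long[] keys = new long[16];   // length is always a power of two
    private Object[] vals = new Object[16]; // a null value marks an empty slot
    private int size;

    void put(long key, V value) {
        if (value == null) throw new IllegalArgumentException("null values not supported");
        if ((size + 1) * 2 > keys.length) grow(); // keep the table at most half full
        int i = slot(keys, vals, key);
        if (vals[i] == null) size++;
        keys[i] = key;
        vals[i] = value;
    }

    // Returns the value for key, or null if the key is absent.
    @SuppressWarnings("unchecked")
    V get(long key) {
        return (V) vals[slot(keys, vals, key)];
    }

    int size() {
        return size;
    }

    // Finds the slot holding key, or the empty slot where it would be stored.
    private static int slot(long[] ks, Object[] vs, long key) {
        int mask = ks.length - 1;
        int i = Long.hashCode(key) & mask;
        while (vs[i] != null && ks[i] != key) i = (i + 1) & mask;
        return i;
    }

    private void grow() {
        long[] oldKeys = keys;
        Object[] oldVals = vals;
        keys = new long[oldKeys.length * 2];
        vals = new Object[oldVals.length * 2];
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldVals[j] != null) {
                int i = slot(keys, vals, oldKeys[j]);
                keys[i] = oldKeys[j];
                vals[i] = oldVals[j];
            }
        }
    }

    public static void main(String[] args) {
        PrimitiveLongMap<String> byOffset = new PrimitiveLongMap<>();
        byOffset.put(4096L, "object at offset 4096");
        System.out.println(byOffset.get(4096L));
    }
}
```

With HashMap<Long,Object>, every put and most get calls would first wrap the long in a Long object; here the key goes straight into the array.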
Right now JGit is still slower when it comes to taking apart a commit or a tree object to follow its object links, or when calling inflate(). We spend much more time on this kind of work than C Git does, even though we try to stay as close to the machine as we can: using byte[] wherever possible, avoiding copies wherever possible, and avoiding memory allocation wherever possible.
Notably, JGit takes about twice as long as C Git to run rev-list --objects --all on a project like the Linux kernel, and index-pack for a pack file of about 270 MB also takes about twice as long.
Both parts of JGit are about as good as I know how to make them, but we really are at the mercy of the JIT, and any change in the JIT can make our performance worse (or better). That is unlike C Git, where Linus Torvalds can look at the assembler generated for a section of code and try different approaches.
So yes, it is practical to build Git in a high-level language, but you simply cannot get the same performance or the same tight memory usage that C Git gets. That is the price of the abstraction a high-level language gives you. Still, JGit performs reasonably well; well enough that we use it as a git server inside Google.
Source: https://habr.com/ru/post/136210/