Cliff Click is the CTO of Cratus (IoT sensors for process improvement) and the founder or co-founder of several startups (including Rocket Realtime School, Neurensic, and H2O.ai), with several successful exits. Cliff wrote his first compiler at 15 (Pascal for the TRS Z-80)! He is best known for his work on C2 in Java (the Sea of Nodes IR). That compiler showed the world that a JIT can produce high-quality code, which was one of the factors that made Java one of the main modern software platforms. Later, Cliff helped Azul Systems build an 864-core mainframe running pure Java software that kept GC pauses on a 500-gigabyte heap within 10 milliseconds. All in all, Cliff has managed to work on every aspect of the JVM.
This Habr post is a long interview with Cliff. Among other things, we talk about working on performance, cost models and Mechanical Sympathy, writing a toy compiler to learn how languages map onto hardware, data structures and memory bandwidth, language design, and distributed computing.
The interview is conducted by Andrei.
Andrei: You are a famous person in the world of JIT compilation, in Java, and in performance work in general, right?
Cliff: That's right!
Andrei: Let's start with general questions about working on performance. What do you think about the choice between high-level and low-level optimizations, like working at the CPU level?
Cliff: It's simple, really. The fastest code is the code that never runs. So you should always start at a high level and work on the algorithms. A better big-O will beat a worse big-O, unless some sufficiently large constants get in the way. Low-level things come last. Usually, if you have optimized the rest of the stack well enough and there is still something interesting left, that is the low level. But how do you start from the high level? How do you know when enough high-level work has been done? Well... you don't. There are no ready-made recipes. You need to understand the problem, decide what you are going to do (so as not to take unnecessary steps later), and then you can bring out the profiler, which can tell you something useful. At some point you realize that you have gotten rid of the unnecessary things and it is time to fine-tune the low level. That is definitely a special art form. A lot of people do unnecessary things, but they move so quickly that they have no time to worry about performance. That lasts only until the question actually comes up. Usually, 99% of the time nobody cares what I do, right up to the moment when something important lands on the critical path and nobody can deal with it. And then everyone starts grilling you about "why it wasn't perfect from the very beginning." In general, there is always something to improve in performance. But 99% of the time you have no clues! You are just trying to make something work, and along the way you figure out what matters. You can never know in advance that this particular piece has to be perfect, so, in effect, you would have to be perfect in everything. That is impossible, and you don't do it. There are always plenty of things to fix, and that is completely normal.
Andrei: How do you approach performance work? It is a cross-cutting concern. For example, have you had to deal with problems arising at the intersection of a large amount of existing functionality?
Cliff: I try to avoid that. If I know that performance is going to be a problem, I think about it before I start coding, especially about the data structures. But often you discover all of this much later. And then you have to go to extreme measures and do what I call "rewrite and rule": you need to grab a fairly large piece. Part of the code will have to be rewritten anyway, because of performance problems or for some other reason. Whatever the reason for the rewrite, it is almost always better to rewrite a larger piece than a smaller one. At this point everyone starts shaking with fear: "oh my God, you can't touch that much code!" But in fact this approach almost always works much better. You have to take on the big problem right away, draw a large circle around it and say: everything inside this circle I will rewrite. The boundary is much smaller than the contents inside it that need to be replaced. And if drawing the boundary that way lets you do the work inside perfectly, your hands are free, do whatever you want. Once you understand the problem, the rewriting process is much easier, so bite off a large piece!
At the same time, when you are rewriting a large piece and you understand that performance is going to be a problem, you can start worrying about it right away. It usually comes down to simple things like "don't copy the data, manage the data as simply as possible, keep it small." In large rewrites there are standard ways to improve performance, and they almost always revolve around the data.
Andrei: In one of the podcasts you talked about cost models in the context of performance. Can you explain what you meant by that?
Cliff: Of course. I was born in an era when processor performance was extremely important. And that era is coming back again; fate is not without irony. I started out in the days of eight-bit machines; my first computer worked with 256 bytes. Bytes. Everything was very small. You had to count instructions, and as we started moving up the stack of programming languages, the languages took on more and more. There was Assembler, then Basic, then C, and C took over a lot of the details, like register allocation and instruction selection. But everything there was still pretty clear, and if I took a pointer to an instance of a variable, I would get a load, and the cost of that instruction was known. The hardware produces a known number of machine cycles, so the execution speed of different pieces could be computed simply by adding up all the instructions you were going to run. Every compare/test/branch/call/load/store could be added up, and you could say: there is your execution time. When working on performance improvements, you just checked that the numbers matched the small hot loops.
But as soon as you switch to Java, Python and the like, you very quickly move away from the low-level hardware. What is the cost of calling a getter in Java? If the JIT in HotSpot inlined everything correctly, it turns into a load; if it did not, it is a function call. Since the call sits in a hot loop, it cancels all the other optimizations in that loop, so the real cost is much higher. And you immediately lose the ability to look at a piece of code and understand how it will execute in terms of processor clock cycles, memory and cache usage. All of that only becomes interesting once you have really dug into performance.
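To make this concrete, here is a minimal sketch (my illustration, not code from the interview) of the situation Cliff describes: a trivial getter in a hot loop whose cost depends entirely on whether HotSpot inlines it. The class and field names are made up; the diagnostic flags in the trailing comment are the standard way to ask HotSpot what it decided.

```java
// Minimal sketch: the cost of a "getter in a hot loop" depends on whether
// the JIT inlines it into a plain load or leaves it as a call.
public class GetterCost {
    static final class Point {
        private final int x;
        Point(int x) { this.x = x; }
        int getX() { return x; }              // ideally a single load after inlining
    }

    public static void main(String[] args) {
        Point[] points = new Point[1_000_000];
        for (int i = 0; i < points.length; i++) points[i] = new Point(i);

        long sum = 0;
        for (int iter = 0; iter < 100; iter++) {
            for (Point p : points) {
                sum += p.getX();              // hot loop: inlined getter vs. real call
            }
        }
        System.out.println(sum);
    }
}
// To see HotSpot's inlining decisions (output format varies by JDK):
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining GetterCost
```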
Now we are in a situation where processor speeds have barely grown for almost a decade. The old days are back! You can no longer count on good single-threaded performance. But if you suddenly take up parallel computing, it is incredibly hard, and everyone looks at you like you are James Bond. Ten-fold speedups here usually happen in places where someone dropped the ball. Parallelism takes a lot of work. To get that ten-fold speedup, you need to understand the cost model: what costs how much. And for that you need to understand how the language maps onto the underlying hardware.
Martin Thompson picked a great name for his blog: Mechanical Sympathy! You need to understand what the hardware is going to do, how it will do it, and why it does what it does in the first place. With that, it is fairly easy to start reading the instructions and figure out where the execution time is going. Without the appropriate training, you are just looking for a black cat in a dark room. I constantly see people optimizing performance who do not have the slightest idea what the hell they are doing. They suffer a lot and do not get very far. And when I take the same piece of code, slip in a couple of small hacks and get a five- or ten-fold speedup, they say: well, that's not fair, we already knew you were better. Amazing. What I am getting at is that the cost model is about what kind of code you write and how fast it runs on average in the big picture.
Andrei: And how can you keep that much in your head? Does it come with a lot of experience, or what? Where does such experience come from?
Cliff: Well, I did not get my experience the easy way. I programmed in Assembler back in the days when you could understand every single instruction. It sounds silly, but ever since then the Z80 instruction set has stayed in my head, in my memory, forever. I cannot remember people's names a minute after a conversation, but I remember code written 40 years ago. It's funny; it looks like savant syndrome.
Andrei: Is there an easier way in?
Cliff: Yes and no. The hardware we all use has not changed that much over this time. Everyone uses x86, with the exception of Arm. Unless you are doing some kind of hardcore embedded work, you are dealing with the same thing. OK, moving on. The instructions have not changed for ages either. You need to go and write something in assembly. Not much, but enough to start understanding it. Now you are smiling, but I am quite serious. You have to understand the correspondence between the language and the hardware. After that you need to go write a bit of code and make a small toy compiler for a small toy language. "Toy" means you have to be able to do it in a reasonable amount of time. It can be super simple, but it must generate instructions. The act of generating instructions lets you understand the cost model for the bridge between the high-level code that everyone writes and the machine code that runs on the hardware. That correspondence burns itself into your brain while you are writing the compiler. Even the simplest compiler. After that you can start looking at Java and at the fact that its semantic abyss is much deeper, and building bridges across it is much harder. In Java it is much harder to tell whether our bridge turned out good or bad, what will make it fall apart and what will not. But you need some kind of starting point where you look at the code and understand: "aha, this getter should be inlined every time." And then it turns out that sometimes it is, except when the method has become too large because the JIT started inlining everything in sight. The performance of such places can be predicted instantly. Usually getters work well, but then you look at the big hot loops and realize that there are function calls floating around in there doing who knows what. That is the problem with using getters everywhere, the reason they do not get inlined: it is not clear that it even is a getter. If you have a super-small code base, you can just memorize it and say: this is a getter and that is a setter. In a large code base, every function lives its own history, which, in general, nobody knows. The profiler says we lost 24% of the time in some loop, and to understand what that loop does you have to look at every function inside it. You cannot understand it without studying the functions, and that seriously slows down the process of understanding. That is why I do not use getters and setters; I have reached a new level!
Where do you get the cost model? Well, you can read something, of course... But I think the best way is to act. Writing a small compiler is the best way to grasp the cost model and fit it into your own head. A small compiler that would be enough to program a microwave is a task for a beginner. Well, I mean, if you already have programming skills, they should be enough. All these things, like parsing a string containing some algebraic expression, pulling out the instructions for the math operations in the right order, taking the right values from the registers, all of this is done in one go. And while you are doing it, it imprints itself on your brain. I think everyone knows what a compiler does. And that gives you an understanding of the cost model.
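As a rough idea of the scale of such an exercise, here is a sketch of a toy "compiler" (my own illustration, assuming a tiny arithmetic-only language): a recursive-descent parser that emits instructions for an imaginary stack machine. The point, as Cliff says, is the act of generating instructions, not the target.

```java
// Toy "compiler": parses expressions like "1+2*(3+4)" and prints instructions
// for an imaginary stack machine. Everything here is an illustrative sketch.
public class ToyCompiler {
    private final String src;
    private int pos;

    ToyCompiler(String src) { this.src = src; }

    public static void main(String[] args) {
        new ToyCompiler("1+2*(3+4)").expr();   // PUSH 1, PUSH 2, PUSH 3, PUSH 4, ADD, MUL, ADD
    }

    // expr := term (('+'|'-') term)*
    void expr() {
        term();
        while (peek() == '+' || peek() == '-') {
            char op = next();
            term();
            emit(op == '+' ? "ADD" : "SUB");
        }
    }

    // term := factor (('*'|'/') factor)*
    void term() {
        factor();
        while (peek() == '*' || peek() == '/') {
            char op = next();
            factor();
            emit(op == '*' ? "MUL" : "DIV");
        }
    }

    // factor := number | '(' expr ')'
    void factor() {
        if (peek() == '(') {
            next();                            // consume '('
            expr();
            next();                            // consume ')'
        } else {
            int start = pos;
            while (Character.isDigit(peek())) next();
            emit("PUSH " + src.substring(start, pos));
        }
    }

    char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    char next() { return src.charAt(pos++); }
    void emit(String insn) { System.out.println(insn); }
}
```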
Andrei: What else is worth paying attention to when working on performance?
Cliff: Data structures. By the way, yes, I have not taught those classes in a long time... Rocket School. It was fun, but it took so much energy, and I have a life too! OK. So, in one of the big and interesting lectures, "Where does your performance go," I gave students an example: two and a half gigabytes of fintech data were read from a CSV file, and then we had to count the number of products sold. Regular tick market data: UDP packets converted to a text format dating back to the '70s. Chicago Mercantile Exchange, all sorts of things like butter, corn, soybeans, and the like. We had to count those products, the number of trades, the average volume of money and goods moved, and so on. It is fairly simple trading math: find the product code (it is 1-2 characters, looked up in a hash table), get the amount, add it to one of the trade sets, add volume, add value, and a couple of other things. Very simple math. The toy implementation was completely straightforward: everything sits in a file, I read the file and walk through it, splitting the individual records into Java strings, looking for what I need in them and adding it up according to the math described above. And it runs at some rather low speed.
With this approach everything that is happening is obvious, and parallel computing will not help here, right? It turns out that a five-fold increase in performance can be had just by choosing the right data structures. And that surprises even experienced programmers! In my particular case the trick was not to do memory allocations in the hot loop. Well, that is not the whole truth; overall, you should not allocate "once per X" when X is large enough. When X is two and a half gigabytes, you should not allocate anything "once per character," or "once per line," or "once per field," nothing of the sort. That is where the time goes. How does this even work? Imagine I make a call to String.split() or BufferedReader.readLine(). readLine() builds a string out of the bytes that came over the network, once for every line, for each of the hundreds of millions of lines. I take that string, parse it, and throw it away. Why throw it away? Well, I have already processed it, that's it. So for every byte read out of those 2.7 GB, two more bytes get written into a string (a Java char is two bytes), which is 5.4 GB already, and none of it is needed afterwards, so it is thrown away. If you look at memory bandwidth: we load 2.7 GB, which goes through memory and the memory bus into the processor, and then twice as much again gets written out as string memory, and all of it is churned as each new line is created. But I have to read it back, the hardware reads it, even if it is all going to be ground up later. And I have to write it out, because I created a string and the caches have overflowed; the cache cannot hold 2.7 GB. So for every byte I read, I read two more bytes and write two more bytes, and in the end that is a 4:1 ratio: that is the proportion in which we are wasting memory bandwidth. And then it turns out that if I call String.split(), that is not the last time either; there may be 6-7 fields inside. So the classic code that reads a CSV and then parses the lines ends up wasting memory bandwidth at around 14:1 relative to what you would actually like to have. If you throw out those allocations, you can get a five-fold speedup.
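For concreteness, the "classic" CSV-reading pattern being criticized looks roughly like this (a sketch of the pattern, not the actual course code; the file name and column layout are made-up assumptions). Every line and every field becomes a fresh String that is thrown away a moment later:

```java
// Naive CSV aggregation: allocates a String per line and a String per field.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class NaiveTickCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Long> volumeByProduct = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("ticks.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {      // one new String per line
                String[] fields = line.split(",");        // one new String per field
                String product = fields[0];               // assumed column layout
                long qty = Long.parseLong(fields[2]);
                volumeByProduct.merge(product, qty, Long::sum);
            }
        }
        System.out.println(volumeByProduct.size() + " products");
    }
}
```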
And it is not even that difficult. If you look at the code from the right angle, it all becomes quite simple as soon as you grasp the essence of the problem. You do not have to stop allocating memory altogether: the only problem is that you allocate something and it immediately dies, and along the way it burns an important resource, which in this case is memory bandwidth. And all of that turns into a drop in performance. On x86 you usually have to actively burn CPU cycles, but here you have burned through all the memory bandwidth much earlier. The solution is to reduce the amount of churn.
Another part of the problem is that if you run a profiler when the memory bandwidth is exhausted, right at the moment it happens, you are usually just waiting on the cache, because it is full of the garbage you have just produced with all those strings. So every load or store operation becomes slow, because they lead to cache misses; the entire cache has become slow, waiting for the garbage to drain out. So the profiler will only show warm random noise smeared across the whole loop; there will be no single hot instruction or hot spot in the code. Only noise. And if you look at the GC cycles, they are all Young Generation and super fast, microseconds or milliseconds at most. After all, all this memory dies instantly. You allocate billions of gigabytes, and the GC mows them down, and mows, and mows again. It all happens very quickly. So you get cheap GC cycles, warm noise across the whole loop, and yet we want a five-fold speedup. At that moment something should click in your head and ask: "why is it like that?!" Memory bandwidth exhaustion does not show up in a classic debugger; you have to fire up a profiler with hardware performance counters and see it for yourself, directly. But indirectly it can be suspected from these three symptoms. The third symptom is when you look at what you are allocating, ask the profiler, and it answers: "you made a billion strings, but the GC worked for free." As soon as that happens, you realize that you have spawned too many objects and burned the whole memory bandwidth. There is a way to figure this out, but it is not obvious.
The problem is in the data structure: the bare structure underlying everything that happens is simply too big, it is 2.7 GB on disk, so making a copy of this thing is very undesirable. I want to load it from the network byte buffer straight into registers, so as not to read and write it back and forth five times. Unfortunately, Java by default does not give you such a library in the JDK. But it is trivial, right? In essence it is 5-10 lines of code that implement your own buffered string loader, which mimics the behavior of the String class while being a wrapper around the underlying byte buffer. As a result, you work almost as if with strings, but in reality these are pointers into the buffer, and the raw bytes are never copied anywhere; the same buffers get reused over and over, and the operating system happily does the things it is designed for, like transparent double buffering of those byte buffers, and you yourself no longer churn an endless stream of unneeded data. By the way, do you realize that, when working with a GC, every allocation is guaranteed to be memory the processor has not seen since the last GC cycle? So none of it can be in the cache, and you get a 100% guaranteed miss. When working with a pointer, on x86 it takes 1-2 cycles to load a register from memory, but as soon as you miss, you pay, and pay, and pay, because the memory is a whole NINE caches away, and that is the cost of allocating memory. The real cost.
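A sketch of the kind of wrapper being described (my own illustration, not H2O or course code): a String-like view that is nothing more than an offset and a length into the shared byte buffer, so nothing gets copied.

```java
// A String-like view over an underlying byte buffer: no copies, just
// (buffer, offset, length). Names and details are illustrative assumptions.
import java.util.Arrays;

final class ByteSlice {
    final byte[] buf;      // shared, reused buffer the file is read into
    final int off, len;    // this "string" is just a window into it

    ByteSlice(byte[] buf, int off, int len) { this.buf = buf; this.off = off; this.len = len; }

    @Override public int hashCode() {
        int h = 1;
        for (int i = off; i < off + len; i++) h = 31 * h + buf[i];
        return h;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ByteSlice)) return false;
        ByteSlice s = (ByteSlice) o;
        return len == s.len && Arrays.equals(buf, off, off + len, s.buf, s.off, s.off + s.len);
    }

    @Override public String toString() {       // debugging only: this one does copy
        return new String(buf, off, len);
    }

    public static void main(String[] args) {
        byte[] buf = "ZC,125,432\n".getBytes();
        ByteSlice product = new ByteSlice(buf, 0, 2);   // "ZC", no copy made
        System.out.println(product + " hash=" + product.hashCode());
    }
}
```

In a real parser the slice would only be copied out when it has to outlive the buffer being refilled, for example when a new product code is inserted as a hash-table key; lookups can use the transient view.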
In other words, data structures are the hardest thing to change. And as soon as you realize that you have chosen the wrong data structure, one that will later kill performance, you usually have substantial work ahead of you, but if you do not do it, things will get worse. First of all you need to think about the data structures; this is what matters. The main cost here falls on fat data structures that start being used in the style of "I copied data structure X into data structure Y, because Y fits the shape of the data better." But the copy operation (which looks cheap) actually spends memory bandwidth, and that is where all the lost execution time is buried. If I have a giant string of JSON and I want to turn it into a structured DOM tree of POJOs or something like that, then the operation of parsing that string and building the POJOs, and then visiting the POJOs again later, turns into extra cost, and it is not cheap. Unless you walk over the POJOs much more often than over the string. Offhand, you could instead try to decode the string and pull out only what you need, without turning it into any POJOs. If all of this happens on a path that needs maximum performance, then no POJOs for you; you need to dig into the string directly somehow.
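As a toy example of "digging into the string directly" (my illustration; it assumes a flat document with no escaping and a key that occurs once), here is how a single numeric field can be pulled out of a JSON string without building any POJOs:

```java
// Pull one numeric field straight out of a JSON string: no DOM, no POJOs.
// The field name and the simplifying assumptions are illustrative.
public class JsonFieldScan {
    static long extractLong(String json, String key) {
        int k = json.indexOf("\"" + key + "\"");          // locate the key
        if (k < 0) throw new IllegalArgumentException("key not found: " + key);
        int i = json.indexOf(':', k) + 1;
        while (Character.isWhitespace(json.charAt(i))) i++;
        long value = 0;
        while (i < json.length() && Character.isDigit(json.charAt(i))) {
            value = value * 10 + (json.charAt(i++) - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        String json = "{\"symbol\":\"ZC\",\"qty\":125,\"price\":432}";
        System.out.println(extractLong(json, "qty"));     // 125
    }
}
```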
Andrei: You said that to understand the cost model you need to write your own small language...
Cliff: Not a language, a compiler. A language and a compiler are different things. The most important difference is in my head.
Andrei: By the way, as far as I know, you are experimenting with creating your own languages. What for?
Cliff: Because I can! I am semi-retired, so this is my hobby. All my life I have been implementing other people's languages. I have also worked a lot on coding style. And also because I see problems in other languages. I see that there are better ways to do familiar things. And I would use them. I am simply tired of seeing problems, in myself, in Java, in Python, in any other language. I currently write in React Native, JavaScript and Elm as a hobby that is not about retirement but about active work. I also write in Python and will most likely keep working on machine learning for Java backends. There are many popular languages, and all of them have interesting features. Each is good at something, and you can try to bring all those pieces together. So I am studying the things that interest me, the behavior of languages, trying to come up with a reasonable semantics. And so far it is working! Right now I am wrestling with memory semantics, because I want to have it the way it is in C and Java: a strong memory model and memory semantics for loads and stores. And at the same time I want automatic type inference as in Haskell. So I am trying to mix Haskell-style type inference with memory that works like both C and Java. That is what I have been doing for the last 2-3 months, for example.
Andrei: If you are building a language that takes the best aspects of other languages, do you think someone will do the opposite: take your ideas and use them?
Cliff: That is exactly how new languages appear! Why is Java similar to C? Because C had a good syntax that everyone understood, and Java was inspired by that syntax, adding type safety, array bounds checking and GC, and they also improved some things from C and added their own. But they were heavily inspired, right? Everyone stands on the shoulders of the giants who came before them; that is how progress is made.
Andrei: As I understand it, your language will be memory-safe. Have you thought about implementing something like Rust's borrow checker? Have you looked into it, and what impression did it leave?
Cliff: Well, I have been writing C for ages, with all those mallocs and frees, managing lifetimes by hand. You know, 90-95% of manually managed lifetimes have the same structure. And doing it by hand is very, very painful. I would like the compiler to simply tell me what is going on there and what my actions have achieved. For some things the borrow checker does that out of the box. But it should infer the information automatically, understand everything, and not even burden me with stating that understanding. It should do at least a local escape analysis, and only if that fails should it require adding type annotations that describe lifetimes; such a scheme is much more complicated than the borrow checker, or any existing memory checker. The choice between "everything is fine" and "I understood nothing"... no, there has to be something better than that.
So, as someone who has written a lot of C code, I think that support for automatic lifetime control is the most important thing. I am also sick of how much memory Java uses, and the main complaint is the GC. When you allocate memory in Java, you will not get back memory that was still warm at the last GC cycle. In languages with more precise memory management this is not the case. If you call malloc, you immediately get memory that, as a rule, has just been used. You usually do something temporary with the memory and return it right away, and it immediately goes back into the malloc pool, and the next malloc cycle pulls it out again. So the actual memory usage comes down to the set of live objects at a given point in time, plus leaks. And if nothing leaks in a completely indecent way, most of the memory sits in the caches and the processor, and it works fast. But it requires a lot of manual memory management, with malloc and free called in the right order, in the right place. Rust can do this properly on its own and, in a bunch of cases, can give even better performance, since memory consumption narrows down to the current computation, as opposed to waiting for the next GC cycle to free the memory. As a result, we got a very interesting way to improve performance. And a rather powerful one, in the sense that I did this kind of thing when processing data for fintech, and it gave a five-fold speedup. That is quite a lot of speedup, especially in a world where processors are not getting faster and we are still waiting for improvements.
Andrei: I would also like to ask about your career in general. You became famous for your JIT work on HotSpot, and then you moved to Azul, which is also a JVM company, but one that was doing more hardware than software. And then you suddenly switched to Big Data and Machine Learning, and then to fraud detection. How did that happen? These are very different areas of development.
Cliff: I have been programming for quite a while and have managed to check a lot of different boxes. And when people say, "Oh, you're the one who made the JIT for Java!", it is always amusing. But before that, I worked on a clone of PostScript, the language Apple once used for its laser printers. And before that, I did an implementation of the Forth language. I think the common theme for me is tool development. All my life I have been building tools with which other people write their cool programs. But I have also worked on operating systems, drivers, kernel-level debuggers, languages for developing operating systems, things that started out trivial and became more and more complex over time. But the main theme, after all, is tool development. A big chunk of my life passed between Azul and Sun, and it was about Java. But when I got into Big Data and Machine Learning, I put my dress hat back on and said: "Oh, now we have a nontrivial problem, and there are a lot of interesting things going on and people doing them." This is a great path for growth, and one worth taking.
And yes, I really love distributed computing. My first job, as a student, was in C, on an advertising project. It was distributed computing on Zilog Z80 chips that collected data for analog optical character recognition produced by a real analog analyzer. It was a cool and completely crazy topic. But there were problems: some part would not be recognized correctly, so you had to pull up the image and show it to a person who would read it with their own eyes and report what it said; so there were jobs carrying data, and those jobs had their own language. There was a backend that handled all of this: Z80s running in parallel, with vt100 terminals, one per person, and there was a parallel programming model on the Z80s. A common chunk of memory that all the Z80s shared inside a star configuration; the backplane was split, half of the RAM was shared within the network, and the other half was private or went to something else. A meaningfully complex, parallel, distributed system with shared... semi-shared memory. When was that... I cannot even remember, somewhere in the mid-'80s. Quite a long time ago.
Yes, let us agree that 30 years is a long time ago. Problems in distributed computing have been around for quite a while; people have long struggled with Beowulf clusters. Such clusters look like this: for example, there is Ethernet, your fast x86 boxes are attached to that Ethernet, and now you want to get fake shared memory, because nobody back then could program distributed computing, it was too hard, so there was fake shared memory with memory-page protection on x86, and if you wrote to such a page, we told the other processors that if they accessed the same shared memory, it would have to be fetched from you, and so something like a cache-coherence protocol appeared, along with software for it. An interesting concept. The real problem, of course, was elsewhere. It all worked, but you quickly ran into performance problems, because nobody understood the performance model at a good enough level: what the memory access patterns were, how to keep the nodes from endlessly pinging each other, and so on.
In H2O I came up with this: the developers themselves are responsible for determining where the parallelism is hidden and where it is not. I came up with a coding model in which writing high-performance code is easy and simple, while writing slow code is hard; it is supposed to look bad. You have to seriously try to write slow code; you have to use non-standard methods. Code that brakes is visible at a glance. As a result, the code people write is usually fast, but you have to figure out what to do about shared memory. It is all tied to large arrays, and the behavior there is similar to non-volatile large arrays in parallel Java. In the sense that: imagine two threads writing into a parallel array; one of them wins and the other, accordingly, loses, and you do not know which is which. If they are not volatile, the ordering can be anything you like, and it really does work well. People genuinely care about the ordering of operations, they put volatile in the right places, and they expect memory-related performance problems in the right places. Otherwise they would simply write code in the form of loops from 1 to N, where N is some trillions, in the hope that all the complex cases will automatically become parallel, and that does not work there. But H2O is neither Java nor Scala; you could call it "Java minus minus" if you like. It is a very clear programming style, like writing simple C or Java code with loops and arrays. But at the same time the memory can be processed in terabytes. I still use H2O.
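To make the "two threads write, one wins" behavior concrete, here is a toy illustration (my own sketch, not H2O code): two threads race over a plain, non-volatile array, and for this kind of aggregate computation the unspecified winner is perfectly acceptable.

```java
// Two threads race to write into the same slots of a plain (non-volatile) array.
// One write per slot wins; which one is unspecified, and for aggregate-style
// computations that is fine. The join() calls make the final reads safe.
public class RacyArrayWrites {
    public static void main(String[] args) throws InterruptedException {
        int[] data = new int[1_000_000];

        Thread a = new Thread(() -> { for (int i = 0; i < data.length; i++) data[i] = 1; });
        Thread b = new Thread(() -> { for (int i = 0; i < data.length; i++) data[i] = 2; });
        a.start(); b.start();
        a.join();  b.join();

        long sum = 0;
        for (int v : data) sum += v;       // every slot holds 1 or 2; which one is unknown
        System.out.println(sum);
    }
}
```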
: ?
: ? , – .
. . , , , , . Sun, , , , . , , . , C1, , – . , . , x86- , , 5-10 , 50 .
, , , , C. , , - , C . C, C . , , C, - … , . , . , , . , , 5% . - – , « », , . : , , . . , – , . , . - – . , , ( , ), , , . , , , .
, , , , , , . , , , - . , , , . , , , , . , : , . , , - : , , - , . – , , – ! – , . Java. Java , , , , – , « ». , , . , Java C . – Java, C , , , . , – , . , . , , . : .
: - . , , - , ?
: ! – , NP- - . , ? It is simply impossible. , Ahead of Time – . - . , , – , ! – , . , , . . Why is it important? , : , , - ! - , . . , , . : - , - . , , . , , , , - . ! , , , – . . NP- .
: , – . , , , , …
: . «». . , . – , , , ( , ). , - . , , , . , , . , . , , . , , . , , - , – . – . , GC, , , , – , . , . , , . , – , ? , .
: , ? ?
: GPU , !
: . ?
: , - Azul. , . . H2O , . , GPU. ? , Azul, : – .
: ?
: , … . , . , , , , . , , . , Java C1 C2 – . , Java – . , , – . … . - , Sun, … , , . , . , . … … , . , , . . - , : . , , , , , , . , . . , . « , , ». : «!». , , , : , .
– , , , . . , , , , . , Java JIT, C2. , – . , – ! . , , , , , , . . . , . , , , , : , , . , – . , , - . : « ?». , . , , : , , – ? , . , , , , , , - .
: , -. ?
: , , . – . . , . . . : , , - – . . , , – , . , , , , - , . , . , , - . , , – , .
, . , – , , . , . , – . , . , , « », , – , , , , . , , « ».
. . - , , «»: , – . – . , , . «, -, , ». , : , . , , . . – , . Indeed, why? , ? ? , ? . , . – . . , . – – , . , « » . : «--», : «, !» . . , , , , . , . , . , – , . – , . , , , .
, – , . , , . , . , , , , . , , . , , , , . . , , , . , , , , . , , , . , – , , , . , .
Andrei: Thank you for the interview, and see you at Hydra!
A reminder: Hydra 2019 takes place on July 11-12, 2019 in Saint Petersburg. Cliff will be there with the talk «The Azul Hardware Transactional Memory experience». Tickets can be purchased on the official website.
Source: https://habr.com/ru/post/458718/