All these words are much more connected with mobile development than it seems at first glance: hexagonal accelerators are already helping to run neural networks on mobile devices; linear algebra and calculus come in handy for getting a job at Apple; and GPU programming not only lets you speed up applications, but also teaches you to see the essence of things.
At least, that's what Andrei Volodin, head of mobile development at Prisma, says. A new episode of AppsCast has been released in which he also talks about how ideas flow into mobile development from GameDev, what distinguishes the two paradigms, why Android has no native blur, and much more. Under the cut, we talk about Andrei's AppsConf talk without spoilers.
AppsCast is a podcast dedicated to the AppsConf mobile conference. Each episode features a new guest. Each guest is a conference speaker, with whom we discuss their talk and related topics. The podcast is hosted by AppsConf program committee members Alexey Kudryavtsev and Daniil Popov.
Alexey Kudryavtsev: Hello everyone! Andrei, please tell us about your experience.
Andrei Volodin: At Prisma we develop products mainly related to photo and video processing. Our flagship app is Prisma. Now we are also building another application, Lensa, with Facetune-like functionality.

I lead mobile development, but I come from GameDev. The whole core part is on me: I write the GPU pipelines for these applications and develop the core frameworks so that the algorithms and neural networks created by the R&D team run on mobile devices in real time. In short, the goal is to kill off server-side computation and all that.
Alexey Kudryavtsev: That doesn't sound like ordinary iOS development.
Andrei Volodin: Yes, that's my specifics: I write Swift every day, but it's very far from what is usually considered iOS development.
Daniil Popov: You mentioned GPU pipelines. What is that all about?
Andrei Volodin: When you make photo editors, you still need to tune the architecture and decompose the logic, because the application has different tools. For example, Lensa has a bokeh tool that blurs the background using a neural network, and a retouching tool that makes a person more beautiful. All of this needs to run efficiently on the GPU. Moreover, it is desirable not to transfer data between the processor and the video card every time, but to build a set of operations in advance, perform them in one run, and show the user the final result.

GPU pipelines are the "small building blocks" from which the instructions for the video card are assembled. The video card then executes all of this very quickly and efficiently, and you take the result once, not after each tool. My job is to make sure our GPU pipelines are as fast and efficient as possible — and that they exist at all.
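To make the idea concrete, here is a minimal sketch of such a chain in Swift and Metal: several compute kernels are encoded into one command buffer, so intermediate results never leave the video card. The kernel names are hypothetical placeholders, and a real pipeline would ping-pong between textures rather than reuse one pair.

```swift
import Metal

// A minimal sketch: encode several tools into one command buffer so that
// intermediate results never travel back to the CPU between steps.
// Kernel names ("bokeh", "retouch") are hypothetical placeholders.
func encodePipeline(device: MTLDevice, library: MTLLibrary,
                    input: MTLTexture, output: MTLTexture) throws {
    guard let queue = device.makeCommandQueue(),
          let commandBuffer = queue.makeCommandBuffer() else { return }

    for name in ["bokeh", "retouch"] {
        guard let function = library.makeFunction(name: name),
              let encoder = commandBuffer.makeComputeCommandEncoder() else { continue }
        let pipeline = try device.makeComputePipelineState(function: function)
        encoder.setComputePipelineState(pipeline)
        encoder.setTexture(input, index: 0)
        encoder.setTexture(output, index: 1)
        // One threadgroup per 8x8 tile of the output image.
        let group = MTLSize(width: 8, height: 8, depth: 1)
        let grid = MTLSize(width: (output.width + 7) / 8,
                           height: (output.height + 7) / 8, depth: 1)
        encoder.dispatchThreadgroups(grid, threadsPerThreadgroup: group)
        encoder.endEncoding()
    }

    // One commit, one round trip: the GPU runs the whole chain and the
    // CPU picks up only the final result.
    commandBuffer.commit()
}
```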
Alexey Kudryavtsev: Tell me, how did you come to this? An ordinary iOS developer starts by banging out screens, then calls some API somewhere and is happy. How did it happen that you do something completely different?
Andrei Volodin: For the most part it's a coincidence. Before this job I made games for iOS. It was always interesting to me, but I understood that in Russia there was no real place to grow in that direction. It so happened that Prisma and I found each other. They needed an iOS developer who could write Swift and at the same time knew the GPU — in particular Metal, which had only just come out — and I fit that description exactly.

I responded to the vacancy, we clicked, and for the third year now I'm going deeper and deeper into this. If something goes wrong now, I'll have to figure out all those VIPERs and MVVMs — I don't even know what they stand for — from scratch.
What does a GPU Engineer do?
Daniil Popov: Your AppsConf profile says GPU Engineer. What does a GPU Engineer do for most of the working day, besides drinking coffee?
Andrei Volodin: First I should mention how a GPU fundamentally differs from a CPU. The processor performs operations essentially sequentially. Even the multithreading we have is often fake: the processor stops and switches between small pieces of different tasks, executing them bit by bit in slices. The GPU works exactly the opposite way: there are n processors that work truly in parallel, so there is parallelism both between processes and inside the GPU.

My main job, apart from routine things like optimizing memory usage and organizing code reuse, is porting algorithms written for the CPU to the video card so that they run in parallel. This is not always trivial, because some very efficient algorithms are completely tied to the sequential execution of instructions. My job is to come up with, for example, an approximation of such an algorithm that may not do exactly the same thing, but whose result you cannot distinguish visually. That way we can get a 100x speedup while sacrificing a little quality.
I also port neural networks. By the way, we will soon make a major open source release. Even before Core ML appeared, we had our own analogue, and we have finally matured enough to put it in open source. Its paradigm is slightly different from Core ML's. Among other things, I develop its core part.
In general, I do everything around Computer Vision algorithms and calculations.
Alexey Kudryavtsev: An interesting announcement.
Andrei Volodin: It's not a secret. We won't announce it with any particular fanfare — it will simply become possible to see an example of the frameworks used inside Prisma.
Why optimize for the GPU
Alexey Kudryavtsev: Tell me, please, why optimize algorithms for the GPU at all? It may seem that it's enough to add processor cores or to optimize the algorithm itself. Why the GPU?
Andrei Volodin: Working on the GPU can accelerate algorithms tremendously. For example, we have neural networks that run for 30 s on the Samsung S10's CPU, while on the GPU they take one frame, i.e. 1/60 s. That changes the user experience incredibly: there is no eternal loading screen, you can see the result of the algorithm on a live video stream, or drag a slider and immediately see the effect.

The point is not at all that we're too cool to write for the CPU, so let's rewrite everything for the GPU. Using the GPU has a transparent goal: to speed up the work.
Alexey Kudryavtsev: The GPU is good at handling similar operations in parallel. Do you have such operations — is that why you can achieve such gains?
Andrei Volodin: Yes, the main difficulty is not writing the code, but designing algorithms that map well onto the GPU. This is not always trivial. Sometimes you figure out how to do everything nicely, but it requires too many synchronization points. For example, if everything writes to one property, that is a clear sign it will parallelize badly: if many threads write to one place, they all have to synchronize on it. Our task is to approximate the algorithms so that they parallelize well.
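A toy CPU-side illustration of the same principle in Swift: instead of all threads accumulating into one shared variable (a synchronization point), each thread writes to its own slot and the partial results are combined afterwards.

```swift
import Foundation

let data = (0..<1_000_000).map { _ in Float.random(in: 0...1) }
let chunks = 8
let chunkSize = data.count / chunks

// Bad: every thread contends for one accumulator (needs a lock or atomics).
// Good: each thread writes to its own slot; we combine at the end.
var partialSums = [Float](repeating: 0, count: chunks)
partialSums.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: chunks) { i in
        var local: Float = 0
        for j in (i * chunkSize)..<((i + 1) * chunkSize) {
            local += data[j]
        }
        buffer[i] = local   // no two threads ever touch the same index
    }
}
let total = partialSums.reduce(0, +)
```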
Alexey Kudryavtsev: For me, as a mobile developer, it sounds like rocket science.
Andrei Volodin: In fact, it is not so difficult. For me, rocket science is VIPER.
The third chip
Daniil Popov: At the last Google I/O they seem to have announced hardware for TensorFlow and the like. When will the third chip finally appear in our phones — a TPU or whatever it will be called — that does all the ML magic on the device?
Andrei Volodin: We have one of those; it connects via USB, and you can run Google's neural networks on it. Huawei already has something like this: we even wrote software for their hexagonal accelerators so that segmentation networks would run quickly on the P20.

I must say such chips actually already exist in the iPhone. For example, the latest iPhone XS has a coprocessor called the NPU (Neural Processing Unit), but so far only Apple has access to it. This coprocessor now outperforms the GPU in the iPhone: some Core ML models use the NPU and are therefore faster than bare Metal.

This is significant, given that in addition to the inference of the network itself, Core ML performs many extra steps. First you need to convert the input data into Core ML's format; it processes the data and returns it in its own format; you convert it back, and only then show it to the user. All this takes quite some time. We write overhead-free pipelines that work on the GPU from start to finish, and yet Core ML models can still be faster thanks to this hardware.
Most likely, a framework for working with NPU will be shown at WWDC in June.
That is, as you said, the hardware is already there — developers just can't use it to the full yet. My hypothesis is that the companies don't yet understand how to package this neatly into a framework. Or they simply don't want to give it away, to keep a market advantage.
Alexey Kudryavtsev: The same thing happened with the fingerprint scanner in the iPhone, as I recall.
Andrei Volodin: Even now it's not that accessible. You can use it at a high level, but you can't get the fingerprint itself; you can only ask Apple to authenticate the user with it. It's still not full access to the scanner.
Hexagonal Accelerators
Daniil Popov: You mentioned the term hexagonal accelerators. I think not everyone knows what that is.
Andrei Volodin: It's simply a feature of the hardware that Huawei uses. And it's quite tricky, I must say. Few people know it, but some Huawei phones contain these chips without using them, because they have a hardware bug. Huawei shipped them and then found the problem, so in some phones these special chips are dead weight. In the latest versions everything already works.

In programming there is the SIMD (Single Instruction, Multiple Data) paradigm: the same instruction is executed in parallel on different data. The chip is designed to process an operation on several data streams simultaneously — in particular, hexagonal means that 6 elements are processed in parallel.
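For a feel of the paradigm, Swift ships a simd module that exposes the same idea on the CPU; in this small sketch one operation conceptually touches all four lanes of a vector at once.

```swift
import simd

// Single Instruction, Multiple Data: one operation applied to a whole
// vector of values instead of element by element.
let brightness = SIMD4<Float>(repeating: 0.1)
var pixel = SIMD4<Float>(0.2, 0.4, 0.6, 1.0)   // RGBA

// One instruction adjusts all four channels in parallel.
pixel += brightness

// Dot products, min/max, etc. are also single vector operations.
let luma = simd_dot(pixel, SIMD4<Float>(0.299, 0.587, 0.114, 0.0))
```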
Alexey Kudryavtsev: I thought that's exactly how a GPU works: it vectorizes the task and performs the same operation on different data. What's the difference?
Andrei Volodin: The GPU is more general-purpose. Although programming for the GPU is rather low-level, it is quite high-level compared to working with coprocessors. A C-like language is used for GPU programming; on iOS the code is then compiled by LLVM into machine instructions. Code for these coprocessors is often written truly hardcore — in assembly, in machine instructions. That's why the performance gains there are much more noticeable: they are sharpened for specific operations. They can't compute just anything — only what they were originally designed for.
Alexey Kudryavtsev: And what are they usually designed for?
Andrei Volodin: Nowadays mainly for the operations most common in neural networks: convolutions, or some kinds of intermediate activations. They have hard-wired functionality that works super fast. So on some tasks they are much faster than GPUs, but for everything else they are simply not applicable.
Alexey Kudryavtsev: It sounds like the DSP processors once used for audio: all the plug-ins and effects ran on them very quickly. Special expensive hardware was sold, but then CPUs grew, and now we record and process podcasts directly on laptops.
Andrei Volodin: Yes, about the same.
GPU is not only for graphics
Daniil Popov: Do I understand correctly that you can now process data on the GPU that has no direct relation to graphics? It turns out the GPU is losing its original purpose.
Andrei Volodin: Exactly. I talk about this quite often at conferences. NVidia was first, introducing CUDA — a technology that makes GPGPU (General-Purpose computing on Graphics Processing Units) easier. With it you write algorithms in a superset of C++ and they are parallelized on the GPU.

But people did it even earlier. For example, craftsmen on OpenGL, or on even older DirectX, simply wrote data into a texture: each pixel was interpreted as data — the first 4 bytes into the first pixel, the second 4 bytes into the second pixel. They processed the textures, then extracted the data back out of the texture and interpreted it. It was very hacky and difficult. Modern video cards support general-purpose logic: you can feed the GPU any buffer, describe your own structures — even a hierarchy of structures referencing each other — compute something, and return it to the processor.
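Here is a small, self-contained sketch of that modern buffer-based style in Swift and Metal: no textures at all, just a float buffer that the GPU doubles in parallel. The kernel is written in the Metal Shading Language and compiled at runtime.

```swift
import Metal

// Every GPU thread doubles one element of the buffer.
let source = """
kernel void double_values(device float *values [[buffer(0)]],
                          uint id [[thread_position_in_grid]]) {
    values[id] = values[id] * 2.0;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "double_values")!)

var input: [Float] = [1, 2, 3, 4]
let buffer = device.makeBuffer(bytes: &input,
                               length: input.count * MemoryLayout<Float>.stride)!

let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: input.count,
                                                            height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Interpret the raw bytes back as floats on the CPU.
let result = buffer.contents().bindMemory(to: Float.self, capacity: input.count)
print(result[0], result[1], result[2], result[3])   // 2.0 4.0 6.0 8.0
```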
Daniil Popov: That is, we could say the GPU is now a Data PU.
Andrei Volodin: Yes, sometimes the GPU processes less graphics than general computation.
Alexey Kudryavtsev: The architectures of the CPU and the GPU are essentially different, yet you can compute on both.
Andrei Volodin: Indeed, in some things the CPU is faster, in others the GPU. You can't say the GPU is always faster.
Daniil Popov: As far as I remember, if the task is to compute lots of very different things, the CPU can be much faster.
Andrei Volodin: It also depends on the amount of data. There is always an overhead for transferring data from the CPU to the GPU and back. If you are processing, say, a million elements, using the GPU is usually justified. But a thousand items may be computed on the CPU faster than they can even be copied to the video card. So you always have to pick the right tasks.

By the way, Core ML does exactly that. According to Apple, Core ML can choose at runtime where a computation will be faster: on the processor or on the video card. I don't know whether that works in reality, but it's what they claim.
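For what it's worth, the public API does let you constrain that choice. A tiny sketch with the real MLModelConfiguration type; the model class is a hypothetical Xcode-generated one.

```swift
import CoreML

// Core ML lets you constrain the placement; by default (.all) the
// framework decides at runtime between CPU, GPU, and the Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU   // or .cpuOnly, or .all

// Hypothetical model class generated by Xcode from a .mlmodel file:
// let model = try MyStyleTransfer(configuration: config)
```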
Hardcore GPU Engineer knowledge for a mobile developer
Alexey Kudryavtsev: Let's get back to mobile development. You are a GPU Engineer with a lot of hardcore knowledge. How can a mobile developer apply this knowledge? For example, what do you see in UIKit that others don't?
Andrei Volodin: I will talk about that in detail at AppsConf. This knowledge applies in a lot of places. When I see, for example, how a UIKit API works, I can immediately understand why it is done that way. When I observe a drop in performance while rendering some view, I can understand the reason, because I know how the rendering is written inside. I understand that to display an effect like Gaussian blur over the frame buffer, you must first flush the entire texture, apply a heavy blur operation to it, take the result back, finish rendering the remaining views, and only then show the frame on the screen. And all of this must fit into 1/60 of a second, otherwise it will lag.
To me it's absolutely clear why this takes so long, but to my colleagues it isn't. That's why I want to share the design techniques we often use in GameDev, and my insights into how I look at problems and try to solve them. It will be an experiment, but I think it should be interesting.
Why there is no native blur in Android
Daniil Popov: You mentioned blur, and I have a question that worries, I think, all Android developers: why does iOS have native blur and Android doesn't?
Andrei Volodin: I think it's because of the architecture. Apple platforms use a tiled rendering architecture (Tiled Shading): not the whole frame is rendered at once, but small tiles — little squares, parts of the screen. This lets you optimize the algorithm, because the main performance gain on a GPU comes from efficient use of the cache. On iOS a frame is often rendered in a way that hardly takes up memory at all. For example, on the iPhone 7 Plus the resolution is 1920×1080, which is about 2 million pixels. Multiply by 4 bytes per pixel and you get roughly 8 MB — just to store the system frame buffer.

The tiled approach splits this buffer into small pieces and renders them bit by bit. This greatly increases the cache hit rate: to compute a blur you have to read already-drawn pixels and evaluate a Gaussian over them. If you read from all over the frame, the cache hit rate is very low, because every thread reads from different places. But if you read small tiles, the hit rate is very high, and so is the performance.
It seems to me the lack of native blur in Android is tied precisely to these architectural peculiarities. Although maybe it's a product decision.
Daniil Popov: Android has RenderScript for this, but there you have to blend, draw, and lay everything out by hand. That's much more complicated than one checkbox on iOS.
Andrei Volodin: Most likely, the performance is also lower.
Daniil Popov: Yes. To satisfy the designers' wishes we have to downscale the picture, blur it, and then upscale it back just to save some performance.
Andrei Volodin: By the way, you can use that for various tricks. A Gaussian kernel is a blurred circle, and the Gaussian sigma depends on the number of pixels you want to gather. Often, as an optimization, you can downscale the picture and proportionally shrink the sigma; when you return to the original scale, there is no visible difference, because the sigma scales directly with the size of the picture. We often use this trick internally to speed up blur.
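A sketch of that trick with Core Image: blur a half-size copy with half the radius, then scale back up. Here CIGaussianBlur's inputRadius is used as a stand-in for the Gaussian sigma, and the helper itself is hypothetical.

```swift
import CoreImage

// Hypothetical helper: blur a downscaled copy with a proportionally
// smaller sigma, then upscale — visually close to a full-size blur,
// but it reads far fewer pixels.
func fastBlur(_ image: CIImage, sigma: CGFloat, scale: CGFloat = 0.5) -> CIImage {
    let down = image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
    let blurred = down.applyingFilter("CIGaussianBlur",
                                      parameters: [kCIInputRadiusKey: sigma * scale])
    return blurred.transformed(by: CGAffineTransform(scaleX: 1 / scale, y: 1 / scale))
}
```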
Daniil Popov: Still, RenderScript on Android doesn't allow a radius greater than 30.
Andrei Volodin: Honestly, a radius of 30 is already a lot. Again, I understand how expensive it is to gather 30 pixels per thread on the GPU.
What mobile development and GameDev have in common
Alexey Kudryavtsev: In the abstract of your talk you say that mobile development and GameDev have a lot in common. Tell us a little — what exactly?
Andrei Volodin: UIKit's architecture is very similar to game engines — the old ones, that is. Modern engines have moved toward Entity Component System (ECS); that will also be in the talk. It's coming to UIKit too: there are articles on how to design views around components. But GameDev came up with it — the Component System was first used in the game Thief in '98.

Fundamentally, Cocos2d, which I worked on for a long time, and the ideas used in its first implementation are very similar to UIKit. Both use a scene graph — a scene tree where every node has sub-nodes — rendered by accumulating affine transformations, which on iOS are specifically called CGAffineTransform. These are simply 3×3 matrices that are multiplied together to change the coordinate system. Animation works about the same everywhere too.

Both in game engines and in UIKit everything is built on interpolation over time: we interpolate some values between frames, be they colors or positions. The optimizations are the same as well: in GameDev it's customary not to do unnecessary work, and UIKit has setNeedsLayout and layoutIfNeeded for that.
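That interpolation is literally a one-liner; a minimal Swift sketch:

```swift
import CoreGraphics

// Frame-to-frame animation boils down to interpolating values over time.
// t is the normalized progress of the animation, from 0 to 1.
func lerp(from a: CGFloat, to b: CGFloat, t: CGFloat) -> CGFloat {
    return a + (b - a) * t
}

// Halfway through a move from x = 0 to x = 100 the node is at x = 50,
// whether the engine is Cocos2d or Core Animation under UIKit.
let x = lerp(from: 0, to: 100, t: 0.5)   // 50.0
```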
I constantly draw these parallels for myself — between what I once did and what I now see in Apple's frameworks. That's what I'll talk about at AppsConf.
Daniil Popov: Indeed, the Cocos2d API is similar to iOS's (for UI). Do you think the developers were inspired by each other's work, or did it just turn out that way architecturally?
Andrei Volodin: I think they were inspired by something. Cocos2d appeared in 2008-2009, when UIKit was not yet the UIKit we know now. It seems to me some techniques were repeated so that it would be more comfortable for people to work — so they could draw parallels.
It's funny how the pendulum swung: the Cocos2d core team originally borrowed a few of Apple's ideas, and then Apple completely copied Cocos2d, down to all the architectural decisions. SpriteKit is essentially a full copy of the ideas that appeared in Cocos2d. In that sense, Apple took back its due.
Alexey Kudryavtsev: It seems to me the same techniques as in UIKit in 2009 already existed on macOS, which has been around since ancient times: the same setNeedsLayout and layoutIfNeeded, affine transformations.
Andrei Volodin: Of course, but GameDev has existed even longer than macOS.
Alexey Kudryavtsev: Can't argue with that!
Andrei Volodin: That's why I'm not so much comparing Cocos2d with Apple's frameworks as considering, in principle, the paradigms that originated in GameDev. It was in GameDev that people first understood that inheritance is bad. While the whole world was admiring OOP, GameDev was already realizing that inheritance brings problems, and came up with components. Mobile development, as an industry, has only arrived at this now.
Alexey Kudryavtsev: It seems that Alan Kay understood a long time ago that inheritance is bad.
Andrei Volodin: Yes, but on the whole you'll agree that just a few years ago everyone was saying OOP was cool. And now Swift has Protocol-Oriented Programming, functional approaches — everyone keeps coming up with something new. In GameDev these moods appeared quite a while ago.
Alexey Kudryavtsev: A side note: Alan Kay is the man who invented OOP. He said he never invented inheritance — only message passing — and that in general he was misunderstood.
Differences between mobile development and GameDev
Alexey Kudryavtsev: Now tell us about the differences: how are GameDev and mobile development radically different, and what can't we borrow from GameDev?
Andrei Volodin: It seems to me the fundamental difference is that product development is as lazy as possible. We try to write code on the principle of "I won't get up until I'm asked". Until a callback fires, we don't do anything. Even rendering in product development is lazy: not the whole frame is redrawn, only the parts that changed.

Game development is merciless in this sense. Everything is done every frame: 30 or 60 times per second the whole scene is redrawn from scratch, every object is updated, physics is simulated every frame. A lot happens, and it changes the paradigm completely. You begin to live inside a single frame — an entire part of my talk is devoted to this. Absolutely everything must fit into 1/60 or 1/30 of a second. So you start getting clever: doing as much precomputation as possible, parallelizing — while the GPU renders one frame, you prepare the next one on the CPU. That's why games drain the battery much faster than regular applications.
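A minimal sketch of such a frame loop in Swift, driven by CADisplayLink; the update/render bodies are placeholders.

```swift
import UIKit

// The per-frame loop a game engine runs: every frame, everything is
// updated and redrawn from scratch — unlike lazy, callback-driven UIKit code.
final class GameLoop {
    private var displayLink: CADisplayLink?
    private var lastTime: CFTimeInterval = 0

    func start() {
        displayLink = CADisplayLink(target: self, selector: #selector(tick))
        displayLink?.add(to: .main, forMode: .common)
    }

    @objc private func tick(_ link: CADisplayLink) {
        let dt = lastTime == 0 ? 0 : link.timestamp - lastTime
        lastTime = link.timestamp
        update(deltaTime: dt)   // simulate physics, advance animations...
        render()                // ...and redraw the whole scene, every frame
    }

    private func update(deltaTime: CFTimeInterval) { /* scene update */ }
    private func render() { /* encode this frame's GPU work */ }
}
```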
Alexey Kudryavtsev: So why can't games be just as lazy?
Andrei Volodin: The nature of games mostly doesn't allow it. Some games would definitely benefit — Tetris, for example, where there is little dynamics and only some parts change. But in general a game is a very complex thing. Even when a character is simply standing, he sways: some animation is playing, some logic runs, physics is being computed. Laziness can do more harm than good here, because each frame changes so much that reusing fragments becomes almost impossible.
Besides, there are hardware constraints. For example, the GPU works better with the float type than with double, so the precision is much lower. Because of that, if you redraw only part of the screen, noticeable artifacts can appear. On the CPU the precision is high — everything is rendered in double precision, so you can have beautiful fonts and neat curves — but on the GPU there will always be some approximation.

The combination of these factors means every frame requires heavy computation, and updating all the objects effectively means drawing from scratch.
Classic development is much closer to GameDev than you think.
Daniil Popov: I want to discuss a provocative statement from your upcoming talk: that "classical development is much closer to GameDev than you think". It immediately reminded me of a series of articles about hacks in games — shortcuts meant to speed up development when deadlines loomed. From those articles it seems GameDev is hack upon hack for the sake of optimization. In regular development everyone is now obsessed with architecture and beautiful code. I can't reconcile that with GameDev.
Andrei Volodin: Of course enterprise companies don't work that way, but in indie GameDev that's about right. But this particular thesis is about something else. I often notice that developers use many concepts that come from GameDev without even realizing it.

Take affine transformations. Few people can clearly say that this is just matrix multiplication. More often, CGAffineTransform is an opaque data structure that stores something, and it's unclear how it makes a view scale.
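For the record, CGAffineTransform is nothing more than the six free entries of a 3×3 matrix; a short illustration:

```swift
import CoreGraphics

// CGAffineTransform is not opaque: it's the six free numbers of a 3x3
// matrix (the third column is fixed at [0, 0, 1]):
//
//   | a  b  0 |
//   | c  d  0 |
//   | tx ty 1 |
//
// a and d scale x and y, b and c shear/rotate, tx and ty translate.
let scale = CGAffineTransform(a: 2, b: 0, c: 0, d: 2, tx: 0, ty: 0)

// Exactly the same matrix via the convenience initializer:
let same = CGAffineTransform(scaleX: 2, y: 2)

// "Concatenating" transforms is just matrix multiplication.
let moved = scale.concatenating(CGAffineTransform(translationX: 10, y: 0))
```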
In the talk I'll try to show the other side of the things we use every day but, perhaps, don't fully understand.
About the benefits of mathematics
Alexey Kudryavtsev: How can a mobile developer reach this understanding? How do you figure out what's under the hood of UIKit rendering and how affine transformations are arranged inside, and stop being scared? I understand it's a matrix, but I can't say which number is responsible for what. Where should one gather knowledge to stop being afraid and start understanding?
Andrei Volodin: The most obvious advice is to start a pet project.
The main thing to say here is that all the concepts of GPU development on mobile are exactly the same as on desktop. iOS GPU programming is not fundamentally different from the desktop environment. So if there isn't enough material on a topic for iOS, you can always read something written for NVidia or AMD and take inspiration from it. Ideologically it's all the same; the API differs a little, but it's usually clear how to transfer existing desktop practices to mobile.
Alexey Kudryavtsev: When you use an API — say, the Cocos2d or Unity engine — you don't really understand what's underneath at first; you just call some methods. How do you begin to understand? What should you look at or read so you can carry it over to UIKit?
Andrei Volodin: Cocos2d is an open source project, and a well-written one. I'm not very objective, because I had a hand in it, but it seems to me the code is good enough to read and be inspired by. It's written in not-very-modern Objective-C, but many tricky places have detailed comments.
But when I talk about a pet project, I don't so much mean a high-level project like making a game as writing an API that produces, say, a glitch effect. You know, there are popular apps that apply a VHS effect — and not on the processor, but on the GPU. It's a relatively simple task you can do over a weekend. But it's not that easy if you've never tried. When I did it for the first time, I learned amazing things: so that's how contrast and saturation work in Instagram, or in Lightroom presets! It turns out these are just shaders that multiply 4 numbers or raise them to a power — that's all.
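A per-pixel sketch in Swift of roughly what such a shader computes — the standard contrast and saturation formulas, using Rec. 601 luma weights; the helper is hypothetical, but the arithmetic is the same as in the GPU version.

```swift
import simd

// The "magic" of contrast and saturation presets really is this small.
// Per pixel, with rgb channels in 0...1:
func adjust(_ rgb: SIMD3<Float>, contrast: Float, saturation: Float) -> SIMD3<Float> {
    // Contrast: push channels away from mid-gray.
    let contrasted = (rgb - 0.5) * contrast + 0.5
    // Saturation: blend between the grayscale value and the color.
    let luma = simd_dot(contrasted, SIMD3<Float>(0.299, 0.587, 0.114))
    return simd_mix(SIMD3<Float>(repeating: luma), contrasted,
                    SIMD3<Float>(repeating: saturation))
}
```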
It just blows your mind how simple it is.

You use it every day and take it for granted — it works, but you don't understand how. Then you do it yourself, and it feels cool because you're doing something supposedly complicated, and at the same time it's funny how simple it actually turns out to be.
Daniil Popov: Still, it seems to me you need some mathematical foundation. In Cocos2d some shaders are literally 5 lines of code, and you sit staring at them blankly, not understanding a word of what's written there. Diving into shader languages without knowing the math and the basic concepts is probably not easy.
Andrei Volodin: I agree about the mathematics. Without basic knowledge of linear algebra it will be difficult — you'll have to figure that out first. But if you had a linear algebra course at university and can at least roughly recall, at a first-year level, what a dot product is and its geometric meaning, what a cross product is and its geometric meaning, what an eigenvector is, what a normal is, and how matrix multiplication works, it will be quite easy to understand.
Daniil Popov: Computer science students often whine that they don't need physics and mathematics. Probably many now barely remember how matrix multiplication works.
Andrei Volodin: For me this is a sore subject. I was the same — I arrogantly complained about why I needed functional analysis and the like. But then I had a valuable life lesson: I interviewed at Apple, for the ARKit team. There was such an enormous amount of mathematics in that interview that I later thanked myself for going to lectures. Without the background I got at university, I would never have answered those questions and never have understood how it all works.
Now, when I teach at the university myself or come to an open day, I always say: "Friends, you'll have plenty of time to sit in your IDE. Please go to your linear algebra and calculus classes and actually understand what they're about. In the era of machine learning this will definitely come in handy."
Daniil Popov: And the most important thing — did you pass the interview?
Andrei Volodin: Yes, of course — and only because I had the mathematical background.
Alexey Kudryavtsev: Now you know why to learn calculus — and where it can get you afterwards.
Andrei Volodin: For example, without an understanding of affine transformations and knowledge of what a normal is, you won't get far in VR. Even when you create a project template in Xcode, things are already being multiplied there: there are vectors, something is being transposed.
AppsConf will take place on April 22 and 23.