Renderscript part two

Renderscript is a new feature introduced in Honeycomb. It is also known that earlier Renderscript was already used by Android developers (for example, the embedded live wallpaper in 2.1 (Eclair) was written on it). Anyway, full access to the API was open only in Honeycomb. In the first introductory article from the developers blog (the original | translation ) it was promised that there would be a second soon, with a more detailed description of the Renderscript architecture and an example of its use. Actually, under the cut of both.

Hereinafter on behalf of an engineer from the Android development team, R. Jason Sams.

In the article introduction to Renderscript, I gave a brief overview of this technology. In this post I will concentrate on "calculations". In terms of Renderscript, “calculation” means a kind of offloading of data processing from Dalvik code to Renderscript code, which in turn can be run on the same processor, or on others (on several).
')

Renderscript design goals

Renderscript has three main goals, consider them in order from more important to less important:

Portability: Application code must run on all devices, even if their hardware contents are radically different from each other. Now ARM comes in several versions - with and without VFP support, with and without NEON support, as well as with different register counters. In addition to ARM, there are other CPU architectures, similar to x86, also different GPUs and even more different DSP (Digital Signal Processors).

Performance: The second goal is to get as high a level of performance as possible, but with limitations on portability. For Renderscript, we tried to achieve much better performance compared to already installed solutions.

Usability: The third goal was to simplify the development as much as possible. We automated some development steps as far as possible in order to avoid the appearance of a “glue code” and other unnecessary developer downloads.

These three goals limit design and lead to some tradeoffs. These are the trade-offs that separate Renderscript from existing solutions, such as Dalvik or NDK. They should be viewed as different tools that serve different tasks.

Major decisions at the design stage

The first decision we needed to make is the development language. When the choice is what language to use, there is an almost unlimited set of options. C and C ++ were considered as programming languages for shaders. Subsequently, due to the need to manipulate data structures for graphical applications, such as scenes, shader languages had to be abandoned. Missing pointers and recursions were considered usability limitations. On the other hand, using C ++ would be desirable, but then we would have to face limitations on portability. Advanced C ++ features are difficult to run on non-processor hardware. As a result, we decided to base the Renderscript C99 standard, because it offers the same performance for the remaining solutions, is well understood by the developers and does not generate any problems to run on a large range of different hardware.

The second compromise is the Renderscript workflow itself. We especially concentrated on how to compile the source code into machine code. Having explored various options during development, we implemented two solutions. The old version (between Eclair and Gingerbread) fully compiled the C source code into machine. While this gave applications such opportunities as code generation on the fly, it turned into a usability problem. It was very inconvenient to compile the application, install it, run it, and only then find the syntax error. Also, weak CPUs were deprived of the possibility of static analysis and some optimizations that could be made.

Then we switched to LLVM (a low-level virtual machine, it compiles the script into platform-independent bytecode), moving to a model in which scripts are compiled and analyzed on the host (device running the script) using a modified version of clang (who are not knows - this is the front end for the LLVM compiler). At this stage, we produce several high-level optimizations, after which we allocate LLVM bytecode. Translation from intermediate bytecode to machine occurs on the host (with additional device-specific optimizations).

The last major compromise in the design was the launch of threads. This is a trade-off between performance and portability. With enough knowledge, existing computing solutions allow the developer to customize the application for a specific hardware platform, to the detriment of others. Given unlimited time and resources, developers can customize the application for any combination of hardware. Although testing and tuning for a certain set of devices is not bad, no amount of work will allow you to fine-tune an application for equipment that has not yet been released and is missing from the developer. A more portable solution would be to perform analysis at run time, also it provides greater average performance, due to peak performance. Given that portability was our number one goal, we chose this solution.

The second effect of the choice of controlling the launch of threads at run time was the dynamic decisions about where to run the script. For example, some computing hardware supports pointers and recursion, while others do not. We could also ban these things and give developers some sort of least common denominator API, but instead we chose run-time analysis. This allows developers to use all the hardware features that are supported, although at the same time there is a full-featured CPU that you can always rely on. In the end, developers can focus on writing good applications, while hardware manufacturers can work on a full-featured and powerful hardware. As soon as a new performance advantage appears, the application uses it without any code changes.

Usability was our main goal during design. Most existing computational and graphical solutions require “gluing” logic to associate high-performance code with application code. Such code is prone to errors and is problematic for writing. Static analysis, which is performed by Renderscript on the host, helps to solve this problem. Each user of the script creates its own “gluing” Dalvik-class. The names of the gluing class and its accessors are derived from the script. This makes it easy to use a script from Dalvik.

Example: Application Level

Given all the above-mentioned compromises, what should a simple example of a computational application look like? In this basic example, we take the usual android.graphics.Bitmap object and run the script that copies it into another Bitmap, converting it into monochrome in parallel with this. Let's take a look at the application code that calls the script before watching the script itself; HelloCompute SDK example:

private Bitmap mBitmapIn; private Bitmap mBitmapOut; private RenderScript mRS; private Allocation mInAllocation; private Allocation mOutAllocation; private ScriptC_mono mScript; private void createScript() { mRS = RenderScript.create(this); mInAllocation = Allocation.createFromBitmap(mRS, mBitmapIn, Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT); mOutAllocation = Allocation.createTyped(mRS, mInAllocation.getType()); mScript = new ScriptC_mono(mRS, getResources(), R.raw.mono); mScript.set_gIn(mInAllocation); mScript.set_gOut(mOutAllocation); mScript.set_gScript(mScript); mScript.invoke_filter(); mOutAllocation.copyTo(mBitmapOut);

This method implies that two Bitmap'a already created and have the same size and format. The first thing Renderscript needs for applications is the context object. This is the central object that is used to create and manage other Renderscript objects. The first line of code creates an mRS object; this object must remain alive as long as the application intends to use it or any objects created with it.

The following two methods cause the creation of allocations (allocates) from the Bitmap for calculation. Renderscript has its own allocator (allocator) of memory, because there is a possibility that the memory will be shared among several processors and more than one memory space may exist. When a valve is created, its potential uses must be numbered, so that the system can determine the proper type of memory for the intended tasks.

The first method createFromBitmap () creates the layout and copies the contents of Bitmap into it. Placements are the basic units of memory that Renderscript uses. The second layout created with createTyped () generates a layout identical to the first. Moreover, the definition of this structure is returned by the getType () request from the first one. Renderscript types define the layout structure. In this case, the type is generated using the height, width, and format of the input bitmap.

The next line loads the script, which is named “mono.rs”, to retrieve it using R.raw.mono. The script itself is stored as a raw resource in the application package (in the APK). Notice the name of the generated “gluing” class, ScriptC_mono.

The following lines set the properties of the script using the generated methods of the “gluing” class.

Now everything is ready, although in fact the method invoke_filter () has done some work for us. The essence of the call to the filter () method of the script "mono.rs". If the method would have parameters, then they need to be passed here. Returning values is prohibited because calls are asynchronous.

The last line copies the result of the script in Bitmap. It has some synchronization code built in to ensure that the script has completed execution.

Example: script

Here is the “mono.rs” script itself, which is called by the above code:

 #pragma version(1) #pragma rs java_package_name(com.android.example.hellocompute) rs_allocation gIn; rs_allocation gOut; rs_script gScript; const static float3 gMonoMult = {0.299f, 0.587f, 0.114f}; void root(const uchar4 *v_in, uchar4 *v_out, const void *usrData, uint32_t x, uint32_t y) { float4 f4 = rsUnpackColor8888(*v_in); float3 mono = dot(f4.rgb, gMonoMult); *v_out = rsPackColorTo8888(mono); } void filter() { rsForEach(gScript, gIn, gOut, 0); }

The first line of the script simply indicates to the compiler what Renderscript revision it was written for. The second line controls the association of the generated reflexive code with the application package.

Next come three global variables that correspond to those created in the control code. The fourth global variable is not reflexive, since it is static. Constants are always better marked as static, because our script code is synchronized with the manager.

The root () method is special for Renderscript. Conceptually, it is similar to the main () method in the C language. When a script is called at runtime, then this function is the entry point. In this case, the parameters are the pixels from our locations. A general user pointer is also provided with an address within the location itself, which is processed by this call. In this example, these parameters are ignored.

Three lines of code in the root () method unpack the pixels from RGBA_8888 into the vector from float4. The second line uses the built-in mathematical dot function, which calculates the dot product of monochrome constants with incoming pixels to get a monochrome image. Notice that dot returns a regular float, which you can assign to float3, which simply copies the value to each component x, y, and z. At the end, we again use the built-in tool to pack the values into a regular 32-bit pixel. This is also an example of method overloading, since there are different versions of rsPackColorTo8888 that accept data in the form of RGB (float3) and RGBA (float4).

The filter () method is called from the control code in order to perform the conversion. It simply performs a calculation run on each placement element. The first parameter means that the script is launched, that is, the root () method is called for each location element. The second and third parameters are input and output data allocations. The last parameter is intended to transfer any user data to the root () method.

forEach will run on multiple processors if they are on the device. In the future, forEach will be able to provide transition points where control can move from one processor to another. In this example, it would be reasonable to assume that in the future filter () will be executed on the CPU, and root () on the GPU or DSP.

I hope that this gave you the opportunity to take a closer look at the design of Renderscript and to get acquainted with how it works with a simple example.

From the translator: I would be happy to hear any comments from developers who somehow began to work with this technology.

Source: https://habr.com/ru/post/115338/

All Articles

Renderscript part two

Renderscript design goals

Major decisions at the design stage

Example: Application Level

Example: script

More articles: