
Using Renderscript on Android devices with Intel® processors




In this article I would like to briefly describe how the Renderscript technology works inside Android, compare its performance with Dalvik on a specific example on an Android device with an Intel processor, and consider a small technique for optimizing renderscript.

Renderscript is an API that includes functions for 2D/3D rendering and high-performance mathematical computation. It lets you describe a task that consists of the same independent computation over a large amount of data, and split it into homogeneous subtasks that can run quickly and in parallel on multi-core Android platforms.

This technology can improve the performance of a number of dalvik applications involving image processing, pattern recognition, physical modeling, cellular automata and so on, without losing hardware independence.



1. Renderscript technology inside Android



I will give a brief overview of the mechanism of the Renderscript technology inside Android, its advantages and disadvantages.



1.1 Renderscript offline compilation


Renderscript has been supported since Honeycomb / Android 3.0 (API 11). Specifically, the platform-tools directory of the Android SDK gained llvm-rs-cc, an offline compiler that compiles renderscript (*.rs file) into bytecode (*.bc file) and generates Java classes (*.java files) for the structures and global variables inside the renderscript and for the renderscript itself. llvm-rs-cc is based on Clang, the front-end for the LLVM compiler, with minor modifications for Android.


1.2 Renderscript run-time compilation


Android also gained a framework based on the LLVM back-end. It is responsible for run-time compilation of the bytecode, linking with the necessary libraries, and launching and monitoring the execution of renderscript. The framework consists of the following parts:

libbcc initializes the LLVM context according to the pragmas and other metadata in the bytecode, compiles the bytecode and dynamically links it with the required libRS libraries;

libRS contains the implementation of the libraries (math, time, drawing, ref-counting, ...) and of the structures and data types (Script, Type, Element, Allocation, Mesh, various matrices, ...).




Benefits:



Disadvantages:



2. Dalvik vs. Renderscript in monochrome image processing



Consider the dalvik function Dalvik_MonoChromeFilter, which converts a color RGB image to black and white (monochrome):



private void Dalvik_MonoChromeFilter() {
    // Luma weights for the R, G and B channels
    float MonoMult[] = {0.299f, 0.587f, 0.114f};
    int mInPixels[] = new int[mBitmapIn.getHeight() * mBitmapIn.getWidth()];
    int mOutPixels[] = new int[mBitmapOut.getHeight() * mBitmapOut.getWidth()];
    mBitmapIn.getPixels(mInPixels, 0, mBitmapIn.getWidth(), 0, 0,
            mBitmapIn.getWidth(), mBitmapIn.getHeight());
    for (int i = 0; i < mInPixels.length; i++) {
        // getPixels returns packed ARGB: red in bits 16-23, green in 8-15, blue in 0-7
        float r = (float)((mInPixels[i] >> 16) & 0xff);
        float g = (float)((mInPixels[i] >> 8) & 0xff);
        float b = (float)(mInPixels[i] & 0xff);
        int mono = (int)(r * MonoMult[0] + g * MonoMult[1] + b * MonoMult[2]);
        // Write the gray value to all three channels and keep the original alpha
        mOutPixels[i] = mono + (mono << 8) + (mono << 16) + (mInPixels[i] & 0xff000000);
    }
    mBitmapOut.setPixels(mOutPixels, 0, mBitmapOut.getWidth(), 0, 0,
            mBitmapOut.getWidth(), mBitmapOut.getHeight());
}
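The per-pixel arithmetic of this filter can be isolated into a standalone sketch that runs on a plain JVM, without any Android classes. It assumes the pixel is packed as ARGB (red in bits 16-23, green in bits 8-15, blue in bits 0-7), which is the layout Bitmap.getPixels returns. The names MonoPixel and toMono are hypothetical and are not part of the original application.

```java
public class MonoPixel {
    // Luma weights for R, G, B; same values as MonoMult in the article.
    private static final float[] MONO_MULT = {0.299f, 0.587f, 0.114f};

    // Converts one packed ARGB pixel to gray and keeps its alpha byte.
    public static int toMono(int argb) {
        float r = (argb >> 16) & 0xff;
        float g = (argb >> 8) & 0xff;
        float b = argb & 0xff;
        int mono = (int) (r * MONO_MULT[0] + g * MONO_MULT[1] + b * MONO_MULT[2]);
        return (argb & 0xff000000) | (mono << 16) | (mono << 8) | mono;
    }

    public static void main(String[] args) {
        // Opaque pure red: 255 * 0.299 truncates to 76 (0x4c).
        System.out.println(Integer.toHexString(toMono(0xffff0000))); // prints ff4c4c4c
    }
}
```

Every output pixel depends only on the matching input pixel, which is what makes the loop a good candidate for parallel execution.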


What can I say? A simple loop with independent iterations, "grinding" a bunch of pixels. Let's see how fast it works!

For the experiment, take the MegaFon Mint with an Intel Atom Z2460 1.6GHz processor running Android ICS 4.0.4, and a 600x1024 image of a Lego robot carrying Christmas gifts.






The processing time is measured according to the following scheme:



private long startnow;
private long endnow;

startnow = android.os.SystemClock.uptimeMillis();
Dalvik_MonoChromeFilter();
endnow = android.os.SystemClock.uptimeMillis();
Log.d("Timing", "Execution time: " + (endnow - startnow) + " ms");


The message with the “Timing” tag can be read with ADB. We take a dozen measurements, restarting the device before each one, and make sure that the variation between the results is small.

The dalvik implementation processed the image in 353 ms.
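The measurement scheme can be wrapped in a small helper that repeats a task and reports the mean and the spread of the results. This is only a sketch for a plain JVM: it uses System.nanoTime instead of android.os.SystemClock.uptimeMillis, the names TimingHarness, measureMillis, mean and spread are hypothetical, and, unlike the procedure described here, nothing is restarted between runs.

```java
public class TimingHarness {
    // Runs the task `runs` times and returns each duration in milliseconds.
    public static double[] measureMillis(Runnable task, int runs) {
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return samples;
    }

    // Arithmetic mean of the samples.
    public static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    // Difference between the slowest and the fastest run.
    public static double spread(double[] samples) {
        double min = samples[0], max = samples[0];
        for (double s : samples) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max - min;
    }

    public static void main(String[] args) {
        double[] samples = measureMillis(() -> {
            long acc = 0;
            for (int i = 0; i < 5_000_000; i++) acc += i; // stand-in workload
        }, 12);
        System.out.printf("mean %.2f ms, spread %.2f ms%n", mean(samples), spread(samples));
    }
}
```

A small spread relative to the mean is the signal that a single reported number, such as the one below, is representative.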

Note: with multithreading tools (for example, the AsyncTask class, which describes tasks performed in separate threads) you can get at best a twofold speedup, because the Intel Atom Z2460 1.6GHz has two logical cores.

Now consider the renderscript implementation of the same filter, RS_MonoChromeFilter:
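The twofold bound from this note can be tried with a sketch that splits the pixel array into two halves, one per logical core. It deliberately uses plain java.lang.Thread rather than AsyncTask so that it runs on any JVM. The class ParallelMono and its methods are hypothetical, and the pixel layout is assumed to be packed ARGB.

```java
public class ParallelMono {
    // Converts one packed ARGB pixel to gray and keeps its alpha byte.
    public static int toMono(int argb) {
        float r = (argb >> 16) & 0xff, g = (argb >> 8) & 0xff, b = argb & 0xff;
        int mono = (int) (r * 0.299f + g * 0.587f + b * 0.114f);
        return (argb & 0xff000000) | (mono << 16) | (mono << 8) | mono;
    }

    // Filters in[from, to) into out; every iteration is independent.
    public static void filterRange(int[] in, int[] out, int from, int to) {
        for (int i = from; i < to; i++) out[i] = toMono(in[i]);
    }

    // Splits the work into two halves and gives each half its own thread.
    public static int[] filterTwoThreads(int[] in) {
        int[] out = new int[in.length];
        int mid = in.length / 2;
        Thread first = new Thread(() -> filterRange(in, out, 0, mid));
        Thread second = new Thread(() -> filterRange(in, out, mid, in.length));
        first.start();
        second.start();
        try {
            first.join();   // join() also makes the workers' writes visible here
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = new int[600 * 1024];
        java.util.Arrays.fill(in, 0xff336699);
        int[] out = filterTwoThreads(in);
        System.out.println(Integer.toHexString(out[0]));
    }
}
```

The two threads write to disjoint halves of the output array, so no locking is needed; the speedup is still capped by the number of cores.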



//mono.rs
//or our small renderscript
#pragma version(1)
#pragma rs java_package_name(com.example.hellocompute)

//multipliers to convert RGB colors to black and white
const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};

void root(const uchar4 *v_in, uchar4 *v_out) {
    //unpack a color to a float4
    float4 f4 = rsUnpackColor8888(*v_in);
    //take the dot product of the color and the multiplier
    float3 mono = dot(f4.rgb, gMonoMult);
    //repack the float to a color
    *v_out = rsPackColorTo8888(mono);
}


private RenderScript mRS;
private Allocation mInAllocation;
private Allocation mOutAllocation;
private ScriptC_mono mScript;
…
private void RS_MonoChromeFilter() {
    // Create the Renderscript context
    mRS = RenderScript.create(this);
    // Allocate memory shared between dalvik and renderscript and fill it from the input bitmap
    mInAllocation = Allocation.createFromBitmap(mRS, mBitmapIn,
            Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
    mOutAllocation = Allocation.createTyped(mRS, mInAllocation.getType());
    // Create the renderscript and bind it to the Renderscript context
    mScript = new ScriptC_mono(mRS, getResources(), R.raw.mono);
    // Run the renderscript function root in parallel (SMP) on the 2 cores
    mScript.forEach_root(mInAllocation, mOutAllocation);
    // Copy the result into the output bitmap
    mOutAllocation.copyTo(mBitmapOut);
}


Note: the performance of this implementation is measured in the same way as for dalvik.

The renderscript implementation processed the same image in 112 ms.

That is a performance gain of 3.2x over dalvik (comparing the run times: 353/112 = 3.2).

Note: the renderscript implementation runtime includes creating the renderscript context, allocating and initializing the necessary memory, creating and binding the renderscript to the context, and running the root function in mono.rs.

Note: the size of the resulting apk file is a critical concern for mobile application developers. Compared with the dalvik implementation, the apk here grows only by the size of the renderscript bytecode (*.bc file). In my case, the dalvik version was 404KB and the renderscript version was 406KB, of which 2KB is the renderscript bytecode (mono.bc).



3. Optimize renderscript



The renderscript performance can be improved further by giving up a little precision in floating-point arithmetic, which does not matter for the task at hand. To do this, add the rs_fp_imprecise pragma to the renderscript:



//mono.rs
//or our small renderscript
#pragma version(1)
#pragma rs java_package_name(com.example.hellocompute)
#pragma rs_fp_imprecise

//multipliers to convert RGB colors to black and white
const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};

void root(const uchar4 *v_in, uchar4 *v_out) {
    //unpack a color to a float4
    float4 f4 = rsUnpackColor8888(*v_in);
    //take the dot product of the color and the multiplier
    float3 mono = dot(f4.rgb, gMonoMult);
    //repack the float to a color
    *v_out = rsPackColorTo8888(mono);
}


As a result, the renderscript implementation gains roughly another 10% in performance: 112 ms -> 99 ms.

Note: the resulting monochrome image is visually the same, without any artifacts or distortions.

Note: unlike the NDK, Renderscript has no explicit mechanism for controlling compiler optimization at run time, because the compiler flags are preset inside Android for each platform (x86, ARM, ...).



4. Dependence of the running time of dalvik and renderscript implementations on image sizes



Let us look at the next question: how does the running time of each implementation depend on the size of the processed image? To find out, take 4 images with dimensions of 300x512, 600x1024 (our original image with the Lego robot), 1200x1024 and 1200x2048, and measure the monochrome processing time for each. The results are presented in the table below.



                   300x512   600x1024   1200x1024   1200x2048
dalvik, ms              85        353         744        1411
renderscript, ms        75         99         108         227
win, x                1.13       3.56         6.8         6.2


Note that the dalvik time grows linearly with the size of the image, while the renderscript time does not. The difference is explained by the initialization time of the renderscript context.

For relatively small images the gain is insignificant, since the initialization time of the renderscript context is about 50-60 ms. However, on medium-sized images, which are the most common on Android devices, the gain is 4-6x.
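The gain figures follow directly from the measured times. Here is a minimal sketch that recomputes them from the measurements in this section; the times are in milliseconds and are copied from the table, and the class name SpeedupTable is hypothetical.

```java
public class SpeedupTable {
    static final String[] SIZES = {"300x512", "600x1024", "1200x1024", "1200x2048"};
    static final int[] DALVIK_MS = {85, 353, 744, 1411};
    static final int[] RENDERSCRIPT_MS = {75, 99, 108, 227};

    // Gain of renderscript over dalvik as a ratio of run times.
    public static double speedup(int dalvikMs, int renderscriptMs) {
        return (double) dalvikMs / renderscriptMs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("%-10s %.2fx%n",
                    SIZES[i], speedup(DALVIK_MS[i], RENDERSCRIPT_MS[i]));
        }
    }
}
```

With a fixed start-up cost of 50-60 ms, most of the 75 ms renderscript run on the 300x512 image is initialization, which is why the ratio there stays close to 1.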



Conclusion



This article compared dalvik and renderscript implementations of monochrome image processing on images of different sizes. Thanks to parallelization, compiler optimization and native code execution, renderscript significantly outperforms dalvik on medium-sized images. With this small example, I tried to show when renderscript can help improve the performance of applications while keeping them hardware-independent.



Source: https://habr.com/ru/post/159699/


