Accelerate Android code

I will continue the work on optimization of the algorithm begun in my previous article . Briefly tell what was done. Ready-made java sources and a cascade model of one of the implementations of the Viola-Jones algorithm were taken. This algorithm is used to search for objects in a photo, in particular, to search for faces. Testing was done on my phone, according to the results, the initial code on java worked 54 seconds on a 300x400 photo. It was too slow, the C ++ code I rewrote showed a result of 14 seconds. In the comments, it was suggested to catch up to the java-implementation before C ++ as follows: profile and find bottlenecks, and replace the two-dimensional array with a one-dimensional one. I also had plans to parallelize the algorithm in C ++. Everything was done and investigated, the results below.

We profile java-release. We remove the statistics at the time of the main part of the algorithm. We see this picture (the result in the file is interesting to anyone):

List methods eat a lot of time. In this situation, all the lists can be easily replaced with arrays, which is what is used in C ++. Replace and get the following result ( in the file ):

pay attention to Math.sqrt, the percentage of its execution has risen from 6.7 to 13.5, which means that others have reduced the consumption of time, and the launch on the phone showed 38 seconds. The rest of the time is spent on primitive methods: multiplication, division and obtaining of the element of the array and Math.sqrt. Accordingly, I no longer see where and what can be changed, since C ++ repeats the code exactly.

One-dimensional arrays instead of two-dimensional . Change the following code:

int[][] grayImage=new int[width][height]; int[][] squares=new int[width][height];

on:

  int[] grayImage=new int[width * height]; int[] squares=new int[width * height];

and at the appropriate places of use:

  int total_x = grayImage[(i + w) * height + j + h] + grayImage[i * height + j] - grayImage[i * height + j + h] - grayImage[(i + w) * height + j]; int total_x2 = squares[(i + w) * height + j + h] + squares[i * height + j] - squares[i * height + j + h] - squares[(i + w) * height + j]; ... rect_sum += (int) ((grayImage[rx2 * height + ry2] - grayImage[rx1 * height + ry2] - grayImage[rx2 * height + ry1] + grayImage[rx1 * height + ry1]) * r.weight); }

the result is exactly the same. Let's try to reduce the number of multiplication operations to calculate the array element, i.e. do this:

  int iwh = (i + w) * height; int ih = i * height; int total_x = grayImage[iwh + j + h] + grayImage[ih + j] - grayImage[ih + j + h] - grayImage[iwh + j]; int total_x2 = squares[iwh + j + h] + squares[ih + j] - squares[ih + j + h] - squares[iwh + j]; ... int rx2h = rx2 * height; int rx1h = rx1 * height; rect_sum += (int) ((grayImage[rx2h + ry2] - grayImage[rx1h + ry2] - grayImage[rx2h + ry1] + grayImage[rx1h + ry1]) * r.weight);

the result has not changed. The one-dimensional array did not give pluses, only minuses in the readability of the code.
And so, with Java, we ended up at a result of 38 versus 14 seconds in C ++.

Parallelization. Let's try to parallelize the algorithm written in C ++ on the phone GT-I9300I with a quad-core processor. On Android, you can use posix threads, as it is written in the documentation, it is slightly trimmed, but we just need to create threads, pass them the parameters and wait for them to execute. All this is done simply:

 #include <pthread.h> ... extern "C" { struct ThreadArgument { void *array; int threadNum; }; void *worker_thread(void *arg) { ThreadArgument *arg1 = (ThreadArgument*) arg; ((Detector*) (arg1->array))->workerTansient(arg1->threadNum); pthread_exit(NULL); } } ... pthread_t m_pt[threadsNum - 1]; if (threadsNum > 1) { res2 = new VectorRects*[threadsNum - 1]; for (int l = 0; l < threadsNum - 1; l++) { ThreadArgument* args = new ThreadArgument; args->array = (void*)this; args->threadNum = l + 1; int success = pthread_create(&m_pt[l], NULL, worker_thread, args); if (success == 0) { __android_log_print(ANDROID_LOG_INFO, "Detector", "thread %d started", l); } } } __android_log_print(ANDROID_LOG_INFO, "Detector", "gettFaces3 baseScale=%f maxScale=%f scale_inc=%f", baseScale, maxScale, scale_inc); ret = getResult(0); if (threadsNum > 1) { for (int l = 0; l < threadsNum - 1; l++) { int success = pthread_join(m_pt[l], NULL); for (int b = 0; b < res2[l]->currIndex; b++) { ret->addRect(res2[l]->rects[b]); } } } ... VectorRects* Detector::workerTansient(int threadNum) { __android_log_print(ANDROID_LOG_INFO, "Detector", "workerTansient thread running %d", threadNum); res2[threadNum - 1] = getResult(threadNum); return NULL; } ...

We get the following result 14 seconds for one stream, 8 seconds for two streams, 5 seconds for three and 4 seconds for four or more streams. Parallelization on a Nexus S virtual device with a single-core processor did not lead to anything.
')
The source of the algorithm can be downloaded here . Use as follows:

  Detector detector = Detector.create(inputHaas); List<Rectangle> res = detector.getFaces(image, 1.2f, 1.1f, .05f, 2, true, useCpp, threadsNum);

where inputHaas is the model stream, i.e. the haarcascade_frontalface_default.xml file from the original algorithm, useCpp - use C ++ or not, threadsNum - the number of threads for C ++ code. It is assumed that the Detector is triggered once.

Total Replacing LinkedList used in iteration with arrays as a whole gave a 1.5-fold increase for the current algorithm, but still lags 2.5-fold behind the implementation in C ++. Translation of a two-dimensional array into one-dimensional showed no changes. Using parallelization in C ++, we managed to speed up the process by 3 times on a quad-core processor, which is much better, but far from ideal. You can look in the direction of RenderScript - a mechanism for complex calculations at the native level, maybe it will give some benefit.

Source: https://habr.com/ru/post/268987/

All Articles

Accelerate Android code

More articles: