In this article, I will share the experience of writing a tool that lets you automate a wide range of routine tasks with a minimum of effort and time.
Foreword
I needed a bot to perform several tasks demanding in terms of logic and reaction speed. I didn't want to dig into APIs or poke at program binaries, so I decided to go with visual automation. I found several bots, but none of them met my requirements: they were either too slow, or the scripting part was severely curtailed, or the functionality for working with the visual component was insufficient. Since I had had a successful experience with a visual bot in the past (albeit a slow one, severely curtailed in the scripting part), I decided to write my own implementation.
The functionality required at the outset
The following features were needed:
Mouse clicks, cursor movement, and pressing mouse buttons.
Simulating keyboard keystrokes.
The ability to search the screen for a previously prepared piece of an image, such as an icon or a letter, and, once found, to do anything with that information.
A script interpreter, so that you can simply describe the algorithm of actions without having to recompile time after time.
Existing analogues
There are a number of analogues, but each has its own advantages and disadvantages. Let's consider the most functional ones:
Sikuli - has a huge set of useful features, convenient scripting in Python with clear syntax, and is cross-platform with some reservations. However, my original task demanded exactly the reaction speed that it could not deliver, largely because of Tesseract. The second problem is its binding to Java 1.6, which won't let you move it to, say, 1.7 or 1.8 without a lot of hassle. The aging version of Jython is not particularly pleasing either, although 2.5.2 is not all that outdated.
Other than that, Sikuli is a great tool, I recommend it to everyone!
AutoIt - uses a BASIC-like language for scripting.
Tinkerers on the forums have added the ability to search for images on the screen; unfortunately, that function did not meet my requirements because of excessive simplifications and a lot of restrictions. It works only under Windows and requires installation.
It requires compilation to be used on other machines, and there is no way to quickly fix a script if AutoIt is not installed.
AutoHotKey - uses its own syntax for scripts, which still takes a long time to learn; it would be more accurate to call it "horrible". Without the proper habit it is hard to build anything with really extensive logic. Its on-screen image search is too limited and did not fit my needs.
It has several ports for Linux / Unix systems and requires installation.
Clickermann - uses its own language for scripts, which again has to be learned. Functionality was cut down for the sake of simplicity: the same HTTP requests, for example, are missing.
There is no image search on the screen, although there is a primitive pixel search.
UOPilot - attaches to processes, which did not suit me, and it shares Clickermann's diseases. There is no cross-platform support, and large scripts are inconvenient to write.
There were also many macro tools, most of which work on the "repeat after me" principle, memorizing and replaying user actions; that, in turn, does not let you build anything with an extensive algorithm full of ifs and whiles.
And the rest were trivial, with no function for finding icons on the screen at all.
I also read an article in which the author made a bot to get a discount on laser vision correction. It describes many of the problems encountered along the way. Most of those problems arose precisely because of understandable simplifications, and one crutch was fixed with another crutch. I recommend reading it.
Choosing the technology for my own reinvented wheel
From the outset I decided to write the kernel itself in Java SE, which saved time, since Java is the language I use most often. Besides, I was already familiar with the Robot class, which conveniently simulates mouse and keyboard input.
For the script interpreter, Python was chosen as a fairly simple and popular language. For Java there is the Jython implementation: it runs on the JVM and does not require installation. Moreover, it allows working with Java classes and objects directly from a script, which significantly expands the scripting possibilities beyond what is built into the bot's kernel.
Later I added on-screen image search via GPGPU using OpenCL; for Java there is the JOCL binding, but more on that later.
The graphical user interface is Swing: a simple yet functional toolkit, available on any JRE right out of the box.
The first steps
Java has the Robot class, which can simulate keyboard presses, mouse movements, and clicks; there were no particular problems with it. I simply extended some of its functionality: mouseClick(x, y) is built from mouseMove + mousePress + mouseRelease with Thread.sleep(ms) between the actions, plus a few more methods with different arguments added by overloading. The same goes for Drag & Drop as a single method (a sketch of it follows the listing below).
public void mouseClick(int x, int y) throws AWTException {
    mouseClick(x, y, InputEvent.BUTTON1_MASK, mouseDelay);
}

public void mouseClick(int x, int y, int button_mask) throws AWTException {
    mouseClick(x, y, button_mask, mouseDelay);
}

public void mouseClick(int x, int y, int button_mask, int sleepTime) throws AWTException {
    bot.mouseMove(x, y);
    bot.mousePress(button_mask);
    sleep(sleepTime);
    bot.mouseRelease(button_mask);
}

public void mouseClick(MatrixPosition mp) throws AWTException {
    mouseClick(mp.x, mp.y);
}

public void mouseClick(MatrixPosition mp, int button_mask) throws AWTException {
    mouseClick(mp.x, mp.y, button_mask);
}

public void mouseClick(MatrixPosition mp, int button_mask, int sleepTime) throws AWTException {
    mouseClick(mp.x, mp.y, button_mask, sleepTime);
}

public MatrixPosition mousePos() {
    return new MatrixPosition(MouseInfo.getPointerInfo().getLocation());
}
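The single-method Drag & Drop mentioned above is not shown in the article, so here is a minimal sketch of how it could look in the same style (the method name and the reuse of bot, sleep, and mouseDelay from the surrounding class are my assumptions):

// hypothetical single-method Drag & Drop in the style of the helpers above
public void mouseDrag(int x_from, int y_from, int x_to, int y_to) throws AWTException {
    bot.mouseMove(x_from, y_from);
    bot.mousePress(InputEvent.BUTTON1_MASK);
    sleep(mouseDelay); // give the target window time to register the press
    bot.mouseMove(x_to, y_to);
    sleep(mouseDelay);
    bot.mouseRelease(InputEvent.BUTTON1_MASK);
}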
All this was needed to make describing actions in a script easier, raising scripting to the more abstract level of "you need it, you do it".
Phew, done reading... Rested? Now let's move on!
The eyes of the kernel
The next step was the hardest and took the most time: adding the ability to take a screenshot of the screen and find a previously prepared icon in it.
First, how do you take a screenshot and how do you store it? Second, how do you search for a pattern (icon)? Third, where do you get that pattern (icon)?
Creating a screenshot is not that difficult; the search, however, posed certain problems. Where do you get the pattern to search for? You create it, but how? To make the first pattern I used good old Paint: with PrintScreen I dropped a capture of the screen into the editor, cut a small piece out of the screenshot, and saved it into a separate .bmp file.
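For the screenshot itself the standard Robot API is enough; a minimal sketch (my illustration of the standard approach, not the bot's actual grabbing code):

import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;

public class ScreenGrabber {
    public static void main(String[] args) throws Exception {
        Robot robot = new Robot();
        // capture the entire primary screen into a BufferedImage
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        BufferedImage screenshot = robot.createScreenCapture(screen);
        System.out.println("Grabbed " + screenshot.getWidth() + "x" + screenshot.getHeight());
    }
}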
So the pattern exists, and I loaded it from code into a BufferedImage. The screenshot is also created as a BufferedImage; now the search algorithm had to be written. Searching the net, I came across a brute-force option: take the first pixel of the small picture and a pixel of the big picture and compare them; if the pixels have the same color code, check the remaining pixels relative to that point. If all the pixels match, the desired image has been found. If not, take the next pixel of the big picture and repeat.
It doesn't sound like much, but it works.
for (int y = 0; y < screenshot.getHeight() - fragment.getHeight(); y++) {
    __columnscan:
    for (int x = 0; x < screenshot.getWidth() - fragment.getWidth(); x++) {
        // quick reject: compare against the first pixel of the fragment
        if (screenshot.getRGB(x, y) != fragment.getRGB(0, 0))
            continue;
        // first pixel matched: verify the rest of the fragment
        for (int yy = 0; yy < fragment.getHeight(); yy++) {
            for (int xx = 0; xx < fragment.getWidth(); xx++) {
                if (screenshot.getRGB(x + xx, y + yy) != fragment.getRGB(xx, yy))
                    continue __columnscan;
            }
        }
        System.out.println("found!");
    }
}
Run it, and... It works! It found the match! But rather slowly, which was completely unacceptable. The time was going into the getRGB() calls: invoked for every pixel, they use the processor cache extremely inefficiently, while what we have is essentially a pure matrix search. A matrix of pixels! So I decided to convert the BufferedImage holding the screenshot into an int[][] matrix, convert the search fragment into an int[][] matrix as well, and adjust the loops to work with matrices. Run it and... It finds nothing.
After an active hunt through the search engines it became clear that the culprit was the ARGB / RGBA / BGR formats in which BufferedImage data can be stored: the screenshot was in ARGB while the fragment was in BGR.
Everything had to be brought to a single format, namely the screenshot's ARGB, since it is faster to convert each fragment once to the screenshot's format than to convert every screenshot to the fragments' format. Getting this right took quite a while, but eventually it worked, and patterns began to be found in the screenshot much faster, almost twice as fast!
// Used for BMP/PNG BufferedImages stored as 3-byte BGR
private int[][] loadFromFile(BufferedImage image) {
    final byte[] pixels = ((DataBufferByte) image.getData().getDataBuffer()).getData();
    final int width = image.getWidth();
    if (rgbData == null)
        rgbData = new int[image.getHeight()][width];
    for (int pixel = 0, row = 0; pixel < pixels.length; row++)
        for (int col = 0; col < width; col++, pixel += 3)
            // 0xFF000000 (opaque alpha) | red << 16 | green << 8 | blue
            rgbData[row][col] = -16777216
                    + ((int) pixels[pixel] & 0xFF)              // blue
                    + (((int) pixels[pixel + 1] & 0xFF) << 8)   // green
                    + (((int) pixels[pixel + 2] & 0xFF) << 16); // red
    return rgbData;
}
After that, only small optimizations remained, like caching a matrix row and reordering the if conditions, which squeezed out even more search speed.
public MatrixPosition findIn(Frag b, int x_start, int y_start, int x_stop, int y_stop) {
    // precalculate all frequently used data
    final int[][] small = this.rgbData;
    final int[][] big = b.rgbData;
    final int small_height = small.length;
    final int small_width = small[0].length;
    final int small_height_minus_1 = small_height - 1;
    final int small_width_minus_1 = small_width - 1;
    final int first_pixel = small[0][0];
    final int last_pixel = small[small_height_minus_1][small_width_minus_1];
    int[] row_cache_big = null;
    int[] row_cache_big2 = null;
    int[] row_cache_small = null;
    for (int y = y_start; y < y_stop; y++) {
        row_cache_big = big[y];
        __columnscan:
        for (int x = x_start; x < x_stop; x++) {
            // if (row_cache_big[x] != first_pixel)
            if (row_cache_big[x] != first_pixel
                    || big[y + small_height_minus_1][x + small_width_minus_1] != last_pixel)
                continue __columnscan; // no first match
            // There is a match for the first element in small.
            // Check if all the elements in small match those in big.
            for (int yy = 0; yy < small_height; yy++) {
                row_cache_big2 = big[y + yy];
                row_cache_small = small[yy];
                for (int xx = 0; xx < small_width; xx++) {
                    // if there is at least one difference, there is no match
                    if (row_cache_big2[x + xx] != row_cache_small[xx]) {
                        continue __columnscan;
                    }
                }
            }
            // If we arrived here, then small matches a region of big
            return new MatrixPosition(x, y);
        }
    }
    return null;
}
I tried playing with the matrix element type, long vs int, and the best result still came from the int[][] matrix, in both the 64-bit and 32-bit JVM configurations on an i7 4790.
Bot brains
Namely, the scripting part. It should be convenient, with syntax that is clear without extra explanation. Ideally, any popular language that can be embedded into the bot's kernel and has rich documentation will do. Using the kernel API should be simple and easy to remember.
The choice fell on Python: it is popular, easy to learn, well documented, has many ready-made libraries, and most importantly, a script is easy to edit in any text editor! Besides, I had long wanted to learn it.
For Java there is an embedded Python implementation called Jython. It runs on the JVM, needs nothing extra to get started, and lets you use literally all Java classes and libraries, even .jar packages! This only strengthened my confidence in the choice.
We connect Jython to the project, create an interpreter object and run our script file.
import org.python.util.PythonInterpreter;

class JythonVM {
    private boolean isJythonVMLoaded = false;
    private Object jythonLoad = new Object();
    private PythonInterpreter pi = null;

    public JythonVM() {
        // TODO Auto-generated constructor stub
    }

    void load() {
        System.out.println("CORE: Loading JythonVM...");
        pi = new PythonInterpreter();
        isJythonVMLoaded = true;
        System.out.println("CORE: JythonVM loaded.");
        synchronized (jythonLoad) {
            jythonLoad.notify();
        }
    }

    void run(String script) throws Exception {
        System.out.println("CORE: Waiting for JythonVM to load");
        if (!isJythonVMLoaded)
            synchronized (jythonLoad) {
                jythonLoad.wait();
            }
        System.out.println("CORE: Running " + script + "...\n\n");
        pi.execfile(script);
        System.out.println("CORE: Script execution finished.");
    }
}
Now look at the script.
# -*- coding: utf-8 -*-
print("hello")
From the script we load the necessary kernel classes and simply create their objects. Through them we can call class methods, which means we can use the kernel API to perform the actions we need!
These classes are Action and MatrixPosition; the exception classes FragmentNotLoadedException and ScreenNotGrabbedException were added later.
Action - the main class for calling kernel functionality. It contains useful methods designed to simplify writing a script and to cut down the extra lines needed to solve a problem: the same mouseClick, keyClick, finding fragments in a screenshot, grab for taking the screenshots themselves, and so on.
Moreover, you can create many objects of this class and use them independently in several threads at once!
Let's add a couple of lines to our script to use the kernel API.
# -*- coding: utf-8 -*-
from bot.penguee import Action

a = Action()            # create the kernel API object
print("hello")
a.mouseMove(1000, 500)  # move the cursor to x = 1000, y = 500
MatrixPosition - used as a wrapper for on-screen coordinates; the bot API returns coordinates in this form. Of course, the ready-made Point class with the necessary functionality comes to mind. However, not everything is so simple: its X and Y values are accessed through the pos.getX() and pos.getY() methods, which causes a lot of inconvenience when writing scripts. It is much more convenient to access the fields as pos.x and pos.y. Besides, practice showed that positions should also carry their own names, which turned out to be necessary for tasks such as sorting positions among themselves (processing numbers from the screen alphabetically).
The possibilities are further extended by the add and sub methods, which create a new position offset from the coordinates of the current object.
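A minimal sketch of what such a wrapper might look like (the public x/y fields, the name, and add/sub follow the description above; relative appears in the final script at the end of the article; the rest is my assumption):

public class MatrixPosition {
    public final int x;
    public final int y;
    public String name; // e.g. for sorting recognized positions by name

    public MatrixPosition(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public MatrixPosition(java.awt.Point p) {
        this(p.x, p.y);
    }

    // create a new position shifted by (dx, dy) relative to this one
    public MatrixPosition add(int dx, int dy) {
        return new MatrixPosition(x + dx, y + dy);
    }

    public MatrixPosition sub(int dx, int dy) {
        return new MatrixPosition(x - dx, y - dy);
    }

    // assumed alias used in the example script at the end of the article
    public MatrixPosition relative(int dx, int dy) {
        return add(dx, dy);
    }
}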
Search statistics showed that most of the images to be found are static and do not change their position on the screen. For these, a coordinate cache was added: if an image is in the cache, the cached coordinates are checked first, and only if the image is not found there does the search cover the rest of the screen. This little detail greatly increased the execution speed of scripts.
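A sketch of how such a cache might work (the class and method names here are my assumptions, not the bot's actual API; Screen and Fragment are stand-ins for the kernel's real types):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Fragment { }

interface Screen {
    boolean matchesAt(Fragment fragment, MatrixPosition position);
    MatrixPosition fullScan(Fragment fragment); // the search shown earlier
}

public class PositionCache {
    private final Map<String, MatrixPosition> lastSeen = new ConcurrentHashMap<>();

    // try the cached coordinates first; fall back to a full scan of the screen
    public MatrixPosition find(String fragmentName, Screen screen, Fragment fragment) {
        MatrixPosition cached = lastSeen.get(fragmentName);
        if (cached != null && screen.matchesAt(fragment, cached)) {
            return cached; // the image is still where we last saw it
        }
        MatrixPosition found = screen.fullScan(fragment);
        if (found != null) {
            lastSeen.put(fragmentName, found);
        }
        return found;
    }
}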
Pressing GPGPU into service
I had always wanted to speed up the search for patterns on a large screen, and optimizing the algorithm has its limits. Until then the entire search ran on the processor; splitting it into separate threads would not give a real speedup but would multiply the load problems. Having experience writing kernel code for GPGPU, I pulled in the OpenCL library and ported the same search algorithm used on the processor (not the best fit for video cards), with some changes to adapt it to the specifics of kernel programs.
For comparison, on an Intel i7 4790 with a 1920*1080 screen the processor search took 0-12 ms in the worst case (the farthest corner of the screen), while the Intel HD 4600 managed a stable 0-2 ms. However, you pay more for creating the screenshot itself, since the screenshot matrix has to be uploaded into the video card's memory, which takes time. This is compensated by the fact that you can search for many different pictures in the same screenshot, which ultimately gives a performance win over the processor search.
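For illustration, a minimal sketch of what such a kernel program might look like in OpenCL C, held as a Java string for JOCL (my assumption, not the bot's actual kernel). The host code would enqueue it with a global size of (bigWidth - smallWidth) by (bigHeight - smallHeight), so that no work item reads out of bounds:

// hypothetical OpenCL kernel: one work item per candidate top-left position
private static final String SEARCH_KERNEL =
    "__kernel void find(__global const int *big,   int bigW,\n" +
    "                   __global const int *small, int smallW, int smallH,\n" +
    "                   __global int *result) {\n" +
    "    int x = get_global_id(0);\n" +
    "    int y = get_global_id(1);\n" +
    "    // quick reject on the first pixel, as in the CPU version\n" +
    "    if (big[y * bigW + x] != small[0]) return;\n" +
    "    for (int yy = 0; yy < smallH; yy++)\n" +
    "        for (int xx = 0; xx < smallW; xx++)\n" +
    "            if (big[(y + yy) * bigW + x + xx] != small[yy * smallW + xx])\n" +
    "                return;\n" +
    "    result[0] = x; // full match found at (x, y)\n" +
    "    result[1] = y;\n" +
    "}\n";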
Thread safety
It is especially important to be able to use threads and search for fragments independently of each other, so the buffers and objects were made local, to avoid surprises when writing a script.
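The idea in a sketch (the classes here are my illustration, not the bot's actual code): every mutable buffer lives in the instance, so each thread simply works with its own object and no synchronization is needed.

class GrabberState {
    private int[][] screenMatrix; // per-instance buffer, never shared between threads

    void grab(int[][] freshScreenshot) {
        screenMatrix = freshScreenshot;
    }
}

public class ThreadSafetyDemo {
    public static void main(String[] args) {
        Runnable worker = () -> {
            GrabberState state = new GrabberState(); // each thread owns its state
            state.grab(new int[1080][1920]);
            System.out.println(Thread.currentThread().getName() + ": done");
        };
        new Thread(worker).start();
        new Thread(worker).start();
    }
}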
Cross platform
Scripts work the same on any platform; the only exception is text printed to the console, because of different encodings on different operating systems, but that is a separate problem. The JVM lets you run the bot on any platform without installation: "download and run".
Final result
The number of kernel methods that can be called from a script has passed 70, and it keeps growing.
This tool is suitable for testing software and graphical interfaces. It helps when something needs to be automated but there is no desire, or not enough knowledge, to dig into binaries (good for the "any-key" support crowd).
Real examples of use
For obvious reasons, I don’t provide titles or source code.
A trading bot for an online MMO auction house: it analyzes item prices and potential resale profit in real time, then uses overlays to draw the potential profit figures directly on the user's screen.
A passive macro for a single-player game: it upgrades buildings automatically, passively watching for upgrade buttons to appear, briefly taking control, and clicking the necessary buttons in a very short time.
When a new Skype message arrives, it opens the program and the window of the desired conversation.
Office accounting of work done, broken down by employee name; the data is taken visually from the interface of an obsolete program that has neither an API nor easy access to its database.
Office product counting from 1C; the any-key guy could not work with the API or the database directly, so he used this bot.
GUI bot overview
We write a simple parsing script (video; see the YouTube playlist).
The script from the video:
# -*- coding: utf-8 -*-
from bot.penguee import MatrixPosition, Action
from java.awt.event import InputEvent, KeyEvent

a = Action()
p1 = MatrixPosition(630, 230)
p2 = MatrixPosition(1230, 780)

while True:
    a.grab(p1, p2)
    a.searchRect(630, 230, 1230, 780)       # the whole game window
    if a.find("verstak.gui"):               # the workbench GUI is open
        a.searchRect(760, 320, 960, 500)    # the crafting grid
        emptyCells = a.findAllPos("cell_empty")
        a.searchRect(700, 520, 1220, 770)   # the inventory
        if a.findClick("coal.item"):
            coalRecentPos = a.recentPos()
            print(coalRecentPos.name)
            for i in range(len(emptyCells)):
                a.mouseClick(emptyCells[i], InputEvent.BUTTON3_MASK)
                a.sleep(50)
            a.mouseClick(coalRecentPos)
            a.searchRect(630, 230, 1230, 780)
            result = a.findPos("verstak.arrow").relative(70, 0)
            a.keyPress(KeyEvent.VK_SHIFT)
            a.sleep(100)
            a.mouseClick(result)
            a.sleep(100)
            a.keyRelease(KeyEvent.VK_SHIFT)
    elif a.find("pech.gui"):                # the furnace GUI is open
        if a.find("pech.off"):
            a.searchRect(700, 520, 1220, 770)   # the inventory
            if a.findClick("coal.block"):
                coalBlockRecentPos = a.recentPos()
                a.searchRect(630, 230, 1230, 780)
                a.mouseClick(a.findPos("pech.off"), InputEvent.BUTTON3_MASK)
                a.mouseClick(coalBlockRecentPos)
                result = a.findPos("verstak.arrow").relative(70, 0)
                a.keyPress(KeyEvent.VK_SHIFT)
                a.sleep(100)
                a.mouseClick(result)
                a.sleep(100)
                a.keyRelease(KeyEvent.VK_SHIFT)
        if a.find("pech.empty"):
            a.searchRect(700, 520, 1220, 770)   # the inventory
            if a.findClick("gold.ore"):
                a.searchRect(630, 230, 1230, 780)
                a.mouseClick(a.findPos("pech.empty"))
    a.sleep(6000)