
How to recognize text in a photo: new capabilities of the Vision framework

Now the Vision framework can truly recognize text, not just detect it as before. We look forward to the moment we can apply it in Dodo IS. In the meantime, here is a translation of an article about recognizing cards from the Magic The Gathering board game and extracting text information from them.




The Vision framework was first introduced to the general public at WWDC 2017, together with iOS 11.
Vision was created to help developers classify and identify objects, horizontal planes, barcodes, facial expressions, and text.

However, there was a problem with text recognition: Vision could find where the text was located, but it did not actually recognize it. It was nice to see bounding boxes around individual text fragments, but the text itself then had to be pulled out and recognized separately.
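For comparison, here is a minimal sketch of that older, detection-only flow (assuming you already have a cgImage): VNDetectTextRectanglesRequest reports where the text is, but not what it says.

    import Vision

    // Pre-iOS 13 approach: detect text regions only; the actual OCR had to be
    // done by a separate library.
    let detectRequest = VNDetectTextRectanglesRequest { request, error in
        guard let observations = request.results as? [VNTextObservation] else { return }
        for observation in observations {
            // Only a normalized bounding box is available, no recognized string.
            print(observation.boundingBox)
        }
    }
    detectRequest.reportCharacterBoxes = true

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([detectRequest])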

This problem was solved in the Vision update, which was included in iOS 13. Now the Vision framework provides true text recognition.

To test this, I created a very simple application that can recognize a card from the Magic The Gathering board game and extract text information from it:


Here is an example of a card and the highlighted text that I would like to extract.



Looking at the card you might think: "This text is rather small, plus there is a lot of other text on the card that can interfere." But for Vision this is not a problem.

First we need to create a VNRecognizeTextRequest. In essence, it is a description of what we hope to recognize, plus the settings for recognition language and accuracy level:

    let request = VNRecognizeTextRequest(completionHandler: self.handleDetectedText)
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en_GB"]

The completion block has the form handleDetectedText(request: VNRequest?, error: Error?). We pass it to the VNRecognizeTextRequest constructor and then set the remaining properties.

There are two levels of recognition accuracy: .fast and .accurate. Since our card has rather small text at the bottom, I chose the higher accuracy. The fast option is likely to work better for larger amounts of text.

I have limited recognition to British English, since all my cards are in it. You can specify several languages, but be aware that scanning and recognition may take a little longer for each additional language.
There are two more properties worth mentioning:
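A short sketch of a few of the optional knobs VNRecognizeTextRequest exposes, such as usesLanguageCorrection, customWords, and minimumTextHeight (the values below are illustrative, not the article's settings):

    // Illustrative settings; adjust to your own content.
    request.usesLanguageCorrection = true            // run results through language correction
    request.customWords = ["Planeswalker", "Izzet"]  // hypothetical domain terms added to the lexicon
    request.minimumTextHeight = 0.01                 // ignore text smaller than 1% of the image height

    // You can also ask which languages a given accuracy level supports:
    let supported = try? VNRecognizeTextRequest.supportedRecognitionLanguages(
        for: .accurate,
        revision: VNRecognizeTextRequestRevision1
    )
    print(supported ?? [])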


Now that we have our request, we need to pass it along with the image to the request handler:

    let requests = [textDetectionRequest]
    let imageRequestHandler = VNImageRequestHandler(cgImage: cgImage, orientation: .right, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try imageRequestHandler.perform(requests)
        } catch let error {
            print("Error: \(error)")
        }
    }

I use the image straight from the camera, converting it from a UIImage to a CGImage. The CGImage is passed to the VNImageRequestHandler together with an orientation flag, to help the handler understand the orientation of the text it should recognize.

As part of this demo, I only use the phone in portrait orientation, so naturally I pass the .right orientation. Here comes the surprising part.

It turns out that the orientation of the camera on your device is completely separate from the rotation of the device and is always considered to be landscape-left (back in 2009, the default was to hold your phone in landscape to take photos). Of course, times have changed, and we mostly take photos and videos in portrait, but the camera is still aligned to the left.
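If you would rather derive the orientation from the image instead of hard-coding .right, a small helper like this (a sketch assuming a uiImage captured by the camera) maps UIImage.Orientation to the CGImagePropertyOrientation that VNImageRequestHandler expects:

    import UIKit
    import ImageIO

    extension CGImagePropertyOrientation {
        // A photo taken in portrait typically ends up as .right here,
        // which matches the hard-coded value above.
        init(_ orientation: UIImage.Orientation) {
            switch orientation {
            case .up: self = .up
            case .down: self = .down
            case .left: self = .left
            case .right: self = .right
            case .upMirrored: self = .upMirrored
            case .downMirrored: self = .downMirrored
            case .leftMirrored: self = .leftMirrored
            case .rightMirrored: self = .rightMirrored
            @unknown default: self = .up
            }
        }
    }

    // Usage with a hypothetical uiImage from the camera:
    // guard let cgImage = uiImage.cgImage else { return }
    // let handler = VNImageRequestHandler(cgImage: cgImage,
    //                                     orientation: CGImagePropertyOrientation(uiImage.imageOrientation),
    //                                     options: [:])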

As soon as our handler is configured, we dispatch to a background queue with .userInitiated priority and try to perform our requests. You may notice that this is an array of requests: that's because you can pull out several pieces of data in one pass (for example, detect faces and text in the same image). If there are no errors, the callback we created for our request will be called once the text is found:

    func handleDetectedText(request: VNRequest?, error: Error?) {
        if let error = error {
            print("ERROR: \(error)")
            return
        }
        guard let results = request?.results, results.count > 0 else {
            print("No text found")
            return
        }

        for result in results {
            if let observation = result as? VNRecognizedTextObservation {
                for text in observation.topCandidates(1) {
                    print(text.string)
                    print(text.confidence)
                    print(observation.boundingBox)
                    print("\n")
                }
            }
        }
    }

Our handler is passed our request, which now has a results property. Each result is a VNRecognizedTextObservation, which contains several possible readings of the text, called candidates.

You can get up to 10 candidates for each unit of recognized text, sorted in descending order of confidence. This can be useful if you have specific terminology that the parser fails to recognize on the first attempt but gets right in a later candidate, even if it is less confident about that result.
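For example, a sketch of picking a candidate against a hypothetical list of expected card keywords, falling back to the most confident candidate if nothing matches:

    import Vision

    // Hypothetical domain vocabulary; not part of the original app.
    let knownKeywords = ["Trample", "Hexproof", "Deathtouch"]

    func bestString(from observation: VNRecognizedTextObservation) -> String? {
        let candidates = observation.topCandidates(10)
        // Prefer a candidate containing a known keyword, even if its confidence is lower.
        if let match = candidates.first(where: { candidate in
            knownKeywords.contains { candidate.string.localizedCaseInsensitiveContains($0) }
        }) {
            return match.string
        }
        // Otherwise fall back to the most confident candidate.
        return candidates.first?.string
    }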

In this example we only need the first result, so we loop over observation.topCandidates(1) and extract both the text and the confidence. While each candidate has its own text and confidence, the .boundingBox stays the same. The .boundingBox uses a normalized coordinate system with the origin in the lower-left corner, so if it is going to be used in UIKit later, for your own convenience it needs to be converted.
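A sketch of that conversion, assuming you know the pixel size of the image (or view) you are drawing into: VNImageRectForNormalizedRect handles the scaling, and the Y axis still has to be flipped for UIKit.

    import UIKit
    import Vision

    // Convert Vision's normalized, bottom-left-origin box into a UIKit rect
    // (top-left origin) for an image of the given size.
    func convert(boundingBox: CGRect, to imageSize: CGSize) -> CGRect {
        let rect = VNImageRectForNormalizedRect(boundingBox,
                                                Int(imageSize.width),
                                                Int(imageSize.height))
        // Flip the Y axis: Vision counts from the bottom, UIKit from the top.
        return CGRect(x: rect.origin.x,
                      y: imageSize.height - rect.origin.y - rect.height,
                      width: rect.width,
                      height: rect.height)
    }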

This is almost all you need. If I run a photo of the card through this, I get the following result in less than 0.5 seconds on an iPhone XS Max:

    Carnage Tyrant
    1.0
    (0.2654155572255453, 0.6955686092376709, 0.18710780143737793, 0.019915008544921786)

    Creature
    1.0
    (0.26317582130432127, 0.423814058303833, 0.09479101498921716, 0.013565015792846635)

    Dinosaur
    1.0
    (0.3883238156636556, 0.42648010253906254, 0.10021591186523438, 0.014479541778564364)

    Carnage Tyrant can't be countered.
    1.0
    (0.26538230578104655, 0.3742666244506836, 0.4300231456756592, 0.024643898010253906)

    Trample, hexproof
    0.5
    (0.2610074838002523, 0.34864263534545903, 0.23053167661031088, 0.022259855270385653)

    Sun Empire commanders are well versed
    1.0
    (0.2619712670644124, 0.31746063232421873, 0.45549616813659666, 0.022649812698364302)

    in advanced martial strategy. Still, the
    1.0
    (0.2623249689737956, 0.29798884391784664, 0.4314465204874674, 0.021180248260498136)

    correct maneuver is usually to deploy the
    1.0
    (0.2620727062225342, 0.2772137641906738, 0.4592740217844645, 0.02083740234375009)

    giant, implacable death lizard.
    1.0
    (0.2610833962758382, 0.252408218383789, 0.3502468903859457, 0.023736238479614258)

    7/6
    0.5
    (0.6693102518717448, 0.23347826004028316, 0.04697717030843107, 0.018937730789184593)

    179/279 M
    1.0
    (0.24829587936401368, 0.21893787384033203, 0.08339192072550453, 0.011646795272827193)

    XLN: EN N YEONG-HAO HAN
    0.5
    (0.246867307027181, 0.20903720855712893, 0.19095951716105145, 0.012227916717529319)

    TN & 0 2017 Wizards of the Coast
    1.0
    (0.5428387324015299, 0.21133480072021482, 0.19361832936604817, 0.011657810211181618)

This is incredible! Every piece of text was recognized, placed in its own bounding box, and returned as a result, almost all of it with a confidence of 1.0.

Even the very small copyright text is mostly correct. All of this was done on a 3024x4032 image weighing 3.1 MB. The process would be even faster if I downscaled the image first. It is also worth noting that the process is much faster on the newer A12 Bionic chips, which have a dedicated Neural Engine.
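As a sketch of that optimization (the target width of 1024 points is an arbitrary assumption, not a value from the article), the UIImage could be downscaled with UIGraphicsImageRenderer before being handed to Vision:

    import UIKit

    // Downscale a photo before recognition to cut processing time.
    func downscaled(_ image: UIImage, toWidth targetWidth: CGFloat = 1024) -> UIImage {
        guard image.size.width > targetWidth else { return image }
        let scale = targetWidth / image.size.width
        let targetSize = CGSize(width: targetWidth, height: image.size.height * scale)
        let renderer = UIGraphicsImageRenderer(size: targetSize)
        return renderer.image { _ in
            image.draw(in: CGRect(origin: .zero, size: targetSize))
        }
    }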

When the text is recognized, the last thing to do is to pull out the information I need. I will not put all the code here, but the key logic is to check the location of each .boundingBox so that I can pick out the text in the lower-left and upper-left corners while ignoring anything further to the right (see the sketch below).
The end result was an application that scans the card and returns the result to me in less than one second.
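A rough sketch of that filtering, assuming the observations are already collected; the 0.5 / 0.6 / 0.3 thresholds are illustrative guesses, not the author's values:

    import Vision

    // Keep observations whose boxes sit on the left side of the card:
    // the title near the top and the set/collector info near the bottom.
    func interestingText(in observations: [VNRecognizedTextObservation]) -> [String] {
        observations.compactMap { observation in
            let box = observation.boundingBox      // normalized, origin at bottom-left
            let isLeftSide = box.minX < 0.5
            let isTitleArea = box.minY > 0.6
            let isCollectorArea = box.minY < 0.3
            guard isLeftSide, isTitleArea || isCollectorArea else { return nil }
            return observation.topCandidates(1).first?.string
        }
    }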

P.S. In fact, I only need the set code and the collector number (the index). These can then be used with the Scryfall API to obtain all the available information about the card, including game rulings and price.
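For illustration, a minimal sketch of such a lookup; the endpoint format https://api.scryfall.com/cards/{set}/{number} follows Scryfall's public documentation, but verify it before relying on it. With the card above, the set code would be "xln" and the collector number "179".

    import Foundation

    // A sketch of fetching card data from Scryfall by set code and collector number.
    func fetchCard(setCode: String, collectorNumber: String) {
        guard let url = URL(string: "https://api.scryfall.com/cards/\(setCode)/\(collectorNumber)") else { return }
        URLSession.shared.dataTask(with: url) { data, _, error in
            guard let data = data, error == nil else {
                print("Request failed: \(String(describing: error))")
                return
            }
            // The response is JSON describing the card (name, oracle text, prices, ...).
            if let json = try? JSONSerialization.jsonObject(with: data) {
                print(json)
            }
        }.resume()
    }

    // fetchCard(setCode: "xln", collectorNumber: "179")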



A sample application is available on GitHub.

Source: https://habr.com/ru/post/459668/

