At interviews we are often asked what the Department of Products for Developers does. We briefly describe
ABBYY FineReader Engine, but many applicants know only by hearsay what an SDK is and how it can be used, so our story comes across as generalities.
Today we have a great example of how ABBYY FineReader Engine is used in a real product of a real company to solve real problems. Recently the Russian company
SECURIT integrated FineReader Engine into its data leak prevention (DLP) products, including a product called Zgate. There was a
press release about it, and here we take a close look at the technical side.
To detect data leaks, Zgate analyzes the messages that users create in the course of their work: email (including messages sent through webmail services), posts on social networks, forums and blogs, and instant messages. To do this, it integrates with the mail server and the proxy server and can therefore monitor all traffic (it is usually enough to control only outbound traffic).
As soon as a suspicious message is detected (that is, the product has decided that the message contains confidential information), its transmission can be blocked, it can be quarantined until a person has reviewed it, or it can be delivered immediately while a copy is set aside for later inspection. If the message flow is large and holding messages back is unnecessary or could harm the company, you can put Zgate on a dedicated server and configure routing so that all traffic is duplicated to that server. Zgate then works completely independently and does not affect message delivery.
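The handling modes described above can be pictured as a small decision function. This is only a sketch: the `DeliveryAction` enum, the `DeliveryPolicy` class and the `mirroredTraffic` flag are hypothetical names invented for illustration and are not taken from Zgate's configuration.

```csharp
using System;

// Hypothetical names for illustration only; not Zgate's actual settings.
public enum DeliveryAction
{
    Deliver,          // nothing suspicious, pass the message through
    Block,            // stop the transmission
    Quarantine,       // hold the message until a person has reviewed it
    DeliverAndReview  // deliver now, keep a copy for later inspection
}

public static class DeliveryPolicy
{
    // configuredAction: what the administrator chose for suspicious messages.
    // mirroredTraffic: true when the server only receives a duplicate of the traffic.
    public static DeliveryAction Decide(
        bool isSuspicious, DeliveryAction configuredAction, bool mirroredTraffic )
    {
        if( !isSuspicious ) {
            return DeliveryAction.Deliver;
        }
        // On a mirrored copy the original message is already on its way,
        // so the only thing left to do is record it for review.
        if( mirroredTraffic ) {
            return DeliveryAction.DeliverAndReview;
        }
        return configuredAction;
    }
}
```

For example, `DeliveryPolicy.Decide( true, DeliveryAction.Quarantine, false )` yields `Quarantine`, while the same call with `mirroredTraffic` set to `true` yields `DeliverAndReview`.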
Suspicious messages are found using rules specified by the administrator. The search is performed either against a dictionary, with support for regular expressions and morphology (all inflected forms of a phrase such as “rollback size” should be detected equally reliably), or by comparing the transmitted message with sample documents; in the latter case the same duplicate-detection methods are used as in Google Search and other search engines. The text analysis methods also include some nice touches. For example, the product can treat letters from different alphabets that look the same (Cyrillic “а” and Latin “a”) as equivalent, and can treat digits as equivalent to certain letter combinations (“w8ing” and “waiting”).
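The look-alike handling mentioned above can be approximated by normalizing text before it reaches the dictionary search. The sketch below is an assumption about how such a step might look, not a description of Zgate's implementation; the `LookalikeNormalizer` class and both tables are hypothetical and deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of Zgate or FineReader Engine.
public static class LookalikeNormalizer
{
    // Cyrillic letters that look like Latin ones (a small illustrative subset).
    private static readonly Dictionary<char, char> Homoglyphs = new Dictionary<char, char>
    {
        { '\u0430', 'a' }, // Cyrillic a
        { '\u0435', 'e' }, // Cyrillic ie
        { '\u043E', 'o' }, // Cyrillic o
        { '\u0440', 'p' }, // Cyrillic er
        { '\u0441', 'c' }, // Cyrillic es
        { '\u0445', 'x' }  // Cyrillic ha
    };

    // Digit-for-sound substitutions (also just a sample).
    private static readonly Dictionary<string, string> Substitutions = new Dictionary<string, string>
    {
        { "w8", "wait" },
        { "gr8", "great" }
    };

    public static string Normalize( string text )
    {
        // Step 1: lowercase and map look-alike Cyrillic letters to Latin.
        var builder = new StringBuilder( text.Length );
        foreach( char ch in text.ToLowerInvariant() ) {
            builder.Append( Homoglyphs.TryGetValue( ch, out char latin ) ? latin : ch );
        }
        // Step 2: expand digit-based abbreviations.
        string result = builder.ToString();
        foreach( KeyValuePair<string, string> pair in Substitutions ) {
            result = result.Replace( pair.Key, pair.Value );
        }
        return result;
    }
}
```

With these tables, `LookalikeNormalizer.Normalize( "w8ing" )` returns `"waiting"`, and a word typed with a Cyrillic “а” in place of the Latin one normalizes to the same string as its all-Latin spelling, so a single dictionary entry covers both.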
This is a very simplified description: the product's administration guide alone runs to about two hundred pages, and a good deployment usually requires the supplier's close involvement in order to choose the right hardware and set up the system for the needs of a particular customer. For example, you need to take into account the typical amount of traffic and the requirements for controlling it in order to select a number and capacity of servers sufficient to process messages as they arrive.
Zgate is a sophisticated DLP product with well-understood functionality that could continue to work on its own. Until now, however, the product could only inspect documents created digitally (RTF, MS Word, PDF with a text layer and so on), and not all documents in organizations exist in this form. Documents can also take the form of images (scans and photos) or PDFs without a text layer, and the transfer of documents in such formats may need to be controlled as well.
With FR Engine embedded, Zgate works in exactly the same scenarios, but now it can also inspect and analyze image files. Previously such files had to be either always let through (unconditional trust) or always blocked (unconditional distrust); now an informed decision can be made on each file.
Zgate extracts files from messages and sends them to FR Engine for recognition; the recognized text is then fed to the same analysis methods as before. Thanks to the high accuracy of text recognition in various languages, Zgate becomes applicable in more situations.
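The flow described above (extract the files, recognize them, run the usual analysis on the resulting text) might be sketched as follows. Everything here is an assumption made for illustration: the `MessageInspector` class is hypothetical, the OCR step is passed in as a plain delegate rather than a real SDK call, and a single regular expression stands in for the whole set of analysis methods.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; not Zgate's actual architecture.
public static class MessageInspector
{
    // bodyText:   the text part of the message.
    // imagePaths: image attachments already extracted to files.
    // recognize:  any OCR backend that maps an image file path to plain text.
    // rule:       stands in for the administrator-defined analysis rules.
    public static bool IsSuspicious(
        string bodyText,
        IEnumerable<string> imagePaths,
        Func<string, string> recognize,
        Regex rule )
    {
        if( rule.IsMatch( bodyText ) ) {
            return true;
        }
        foreach( string path in imagePaths ) {
            // Recognized text goes through the same check as the message body.
            string recognizedText = recognize( path );
            if( rule.IsMatch( recognizedText ) ) {
                return true;
            }
        }
        return false;
    }
}
```

Passing the recognizer as a delegate keeps the analysis code unaware of where the text came from, which matches the point that the same methods are applied as before. In a real integration the delegate would wrap the SDK calls and read back the exported text.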
Without the SDK, the Zgate developers would have had to implement recognition themselves, which is not easy (our company has been developing and improving recognition technologies for many years). Instead, they license our SDK and can, for example, simply write C# code like this (based on a sample supplied with the SDK):
    // Recognizes one image with default settings and saves the result as plain text.
    void processOneImage( FREngine.IEngine engine, string imageFilePath, string resultPath )
    {
        FREngine.FRDocument document = engine.CreateFRDocument();
        try {
            document.AddImageFile( imageFilePath, null, null );
            document.Process( null, null, null );
            document.Export( resultPath, FREngine.FileExportFormatEnum.FEF_Text, null );
        } finally {
            // Release the document's resources even if processing fails.
            document.Close();
        }
    }
And that's all: FR Engine opens the image, recognizes it with the default settings and exports the result to a text file. If necessary, you can easily select the desired set of languages and other options. It does not matter that several million lines of code in all sorts of subsystems are at work inside, opening images, recognizing and exporting. The FR Engine user gets a well-designed programming interface that gives access to all the product's features.
Those features include, for example, almost two hundred recognition languages, many of them with dictionary support, the ability to open a wide range of image formats, and very high recognition accuracy. The developers of the host product do not have to build any of this themselves; they license it in the form of an SDK. The text file with the recognition results produced by FR Engine can be passed to the text analysis methods, which then decide what to do with the image.
SECURIT Zgate is a great example of FR Engine integration. The Zgate developers build part of the product's functionality themselves and license the other part from us. This lets everyone do what they know best.
A few words about Linux. Solutions such as Zgate can afford to run on a single operating system (in this case Windows), because the cost of deployment is usually high enough that the cost of a Windows license is not a concern: you simply buy the right hardware with the right operating system. For example, in a "spherical company in a vacuum" with several thousand users, the amount of outgoing traffic is usually about 20 gigabytes; a pair of HP ProLiant DL160 G6 E5620 servers with one quad-core processor each is used to process it. If the Zgate developers decide to move to Linux, we have
FR Engine for Linux.
Dmitry Mescheryakov,
Product Development Department