Your own Google Search, now for document scans

How do you make the documents on a company's servers available for full-text search and keep them confidential at the same time? How do you get the functionality of Google Search ~~without airing your dirty laundry in public~~ while leaving the documents inside the company's network? Corporate search is yet another fast-growing, delicious pie.

The ~~tiny, little-known~~ company Google offers a solution in the form of a handsome yellow box that mounts in a standard 19-inch rack: the Google Search Appliance (GSA).

The scheme is as follows:

sign a contract
install the yellow box
assign it an IP address (a domain name doesn't hurt either)
the box crawls and indexes the documents on the network
anyone who opens that IP address in a browser sees exactly the same page as on www.google.com: you can enter the same queries there and get results the same way
???
HAPPINESS

It is the same familiar search (so training employees takes minimal effort), and the documents never leave the company's network. One significant limitation: image files in file storage (document scans, for example) are not searchable, because the GSA cannot extract text from them. Houston, we have a problem.

As often happens in this corporate blog, ~~Captain Obvious~~ optical text recognition comes to the rescue.

The Google Search Appliance can not only crawl websites on its own but also accept so-called feeds (alas, no adequate Russian word for them has been found yet).

A feed is a special XML document in which you can include (URL + text) pairs. An external program sends the feed to the GSA with a plain HTTP POST request to the appropriate port. The GSA accepts the feed, parses it, and records in its index that "the document at this URL has this text."
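To make the feed idea concrete, here is a minimal sketch of building such an XML document from (URL + text) pairs and wrapping it in an HTTP POST request. The element names (`gsafeed`, `header`, `group`, `record`, `content`), the form fields (`feedtype`, `datasource`, `data`), port `19900`, and the `/xmlfeed` path follow the GSA feeds protocol as best recalled, not anything stated in this post, so treat them as assumptions and check them against Google's documentation. The host name, data source name, and file URL are made up for illustration.

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET


def build_feed(datasource, records):
    """Build feed XML from (url, text) pairs, e.g. OCR results for scans.

    Element and attribute names are assumptions based on the GSA feeds
    protocol; verify them against the official documentation.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "incremental"
    group = ET.SubElement(root, "group")
    for url, text in records:
        record = ET.SubElement(
            group, "record", url=url, mimetype="text/plain", action="add"
        )
        # ElementTree escapes <, > and & in the recognized text for us.
        ET.SubElement(record, "content").text = text
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


def build_post_request(gsa_host, datasource, feed_xml):
    """Wrap the feed in a multipart/form-data POST request (not sent here).

    Port 19900, the /xmlfeed path and the form field names are assumptions.
    """
    boundary = uuid.uuid4().hex.encode("ascii")
    fields = (
        ("feedtype", b"incremental"),
        ("datasource", datasource.encode("utf-8")),
        ("data", feed_xml),
    )
    body = b""
    for name, value in fields:
        body += b"--" + boundary + b"\r\n"
        body += (
            'Content-Disposition: form-data; name="%s"\r\n\r\n' % name
        ).encode("ascii")
        body += value + b"\r\n"
    body += b"--" + boundary + b"--\r\n"
    return urllib.request.Request(
        "http://%s:19900/xmlfeed" % gsa_host,
        data=body,
        headers={
            "Content-Type": "multipart/form-data; boundary="
            + boundary.decode("ascii")
        },
        method="POST",
    )


# Hypothetical data source name, file URL and host.
feed = build_feed(
    "scans", [("file://fileserver/scans/contract.tif", "Text from the scan")]
)
request = build_post_request("gsa.example.local", "scans", feed)
# urllib.request.urlopen(request) would actually send the feed.
```

The sketch only constructs the request; sending it is left as a commented-out call so that nothing is posted by accident.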

From then on, when a user enters a matching search query, the document (a link plus the extracted text with the matches highlighted) appears in the search results. It is the same Google Search, except that the text was extracted and "slipped in" by an external program.

Happiness is near. For text recognition we will, as usual, use ~~electrical tape~~ ABBYY Recognition Server. It includes a separate service that can crawl file storage, pass the files to Recognition Server for recognition, build feeds from the recognition results, and send those feeds to the Google Search Appliance.

The storage crawl can be repeated as often as needed: changed files are recognized again and new feeds are sent for them, while for deleted files special feeds are sent that tell the GSA to remove the file's URL from the index. The service runs on the same machine as Recognition Server.
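The repeated crawl boils down to comparing two snapshots of the storage. The post does not describe how the ABBYY service detects changes, so the following is only a sketch of the idea under one assumption: a file counts as changed when its size or modification time differs from the previous crawl. The function names and sample paths are hypothetical.

```python
import os


def take_snapshot(root_dir):
    """Map every file path under root_dir to a (size, mtime) signature."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            snapshot[path] = (info.st_size, info.st_mtime)
    return snapshot


def diff_snapshots(previous, current):
    """Return (to_recognize, to_delete) lists of paths.

    to_recognize: new files and files whose signature changed; these go
                  through recognition again and get a fresh feed.
    to_delete:    files that disappeared; these get a "remove this URL
                  from the index" feed.
    """
    to_recognize = sorted(
        path for path, sig in current.items() if previous.get(path) != sig
    )
    to_delete = sorted(path for path in previous if path not in current)
    return to_recognize, to_delete


# Two crawls of a hypothetical share, written out by hand.
first_crawl = {"/scans/a.tif": (100, 1.0), "/scans/b.tif": (200, 1.0)}
second_crawl = {"/scans/a.tif": (100, 1.0), "/scans/c.tif": (300, 2.0)}
to_recognize, to_delete = diff_snapshots(first_crawl, second_crawl)
# to_recognize == ["/scans/c.tif"], to_delete == ["/scans/b.tif"]
```

Unchanged files produce no work at all, which is what makes running the crawl often affordable.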

The feed mechanism lets you decouple recognition from the GSA itself completely. Because Recognition Server scales well, recognition can be done fairly quickly even for a large number of documents. For example, if you need to get a large archive into the index quickly, you can deploy recognition stations to employees' machines via SMS and configure the product so that the stations are used only on weekends or only at night.
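The "only on weekends or only at night" rule is set up in the product's own settings, which this post does not show. Purely to illustrate the logic of such a schedule, here is a hypothetical check; the function name and the 20:00 to 07:00 night window are assumptions.

```python
from datetime import datetime

# Assumed night window: from 20:00 until 07:00 the next morning.
NIGHT_START_HOUR = 20
NIGHT_END_HOUR = 7


def station_may_work(moment):
    """Return True if a station on an employee's machine may process jobs."""
    is_weekend = moment.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat, Sun
    is_night = (
        moment.hour >= NIGHT_START_HOUR or moment.hour < NIGHT_END_HOUR
    )
    return is_weekend or is_night


# Wednesday at noon: the employee is using the machine.
busy = station_may_work(datetime(2010, 11, 3, 12, 0))
# Wednesday at 23:00: the machine is idle and can recognize documents.
idle = station_may_work(datetime(2010, 11, 3, 23, 0))
```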

Naturally, the same installation of Recognition Server can be used for other business processes of the organization.

So here is one more use case for Recognition Server: helping you get a slice of that fast-growing pie.

Dmitry Mescheryakov
Department of Data Entry Products

Source: https://habr.com/ru/post/107066/
