Google adds an OCR engine for indexing PDFs

Google has taken a significant step toward indexing the so-called Invisible Web: the lion’s share of web content that search engine robots still cannot reach. This consists mainly of password-protected sites and various databases, as well as huge collections of scanned PDF documents.

Google and many other search engines index a PDF without trouble if it has a text layer, that is, if the text is stored in standard text form inside the file container. But there are actually relatively few such “proper” PDFs. Far more documents are simply scanned copies in image form that happen to be saved as PDF. To index them, Google has now added an OCR engine. Millions of previously inaccessible government reports, court decisions and academic papers will now be included in the index. Here are some examples of the new engine at work.
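The difference between a PDF with a text layer and a plain scan can be illustrated with a minimal sketch. It is not Google's implementation, which has not been described. The `Page` type, the `extract_indexable_text` function and the `ocr` callable are all hypothetical names introduced for illustration, and a real indexer would obtain them from a PDF parser and an OCR library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one parsed PDF page."""
    text_layer: str  # text stored in the PDF itself; empty for a pure scan
    image: bytes     # raster scan of the page, if there is one


def extract_indexable_text(pages: List[Page], ocr: Callable[[bytes], str]) -> List[str]:
    """Return one string per page, running OCR only where no text layer exists."""
    result = []
    for page in pages:
        if page.text_layer.strip():
            # A "proper" PDF page: the text can be indexed directly.
            result.append(page.text_layer)
        elif page.image:
            # A scanned page: fall back to the (assumed) OCR function.
            result.append(ocr(page.image))
        else:
            # Nothing to index on this page.
            result.append("")
    return result
```

Passing the OCR step in as a parameter keeps the sketch independent of any particular OCR engine.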

Recall that in April Google learned to handle drop-down menus and other HTML forms in various database interfaces, which is another important technology for indexing the Invisible Web.

Source: https://habr.com/ru/post/43864/
