
From the dusty archive to the Internet: how ABBYY Recognition Server digitizes libraries

We have written on this blog many times about unusual applications of our ABBYY Recognition Server, and in the comments we were regularly asked why we do not work with libraries. We always answered that we do, but we never described that work in detail. Today we are making up for that.

image
To start, a short excursion into history: we have been working with libraries for more than 10 years. One of the first projects in this area was the digitization of the catalog of the National Library of Lithuania. First, over the course of a year, more than three million (!) cards were scanned. Each card held the title of the book, its author, publisher, year of publication, and much other useful information. As a reminder, a library card looks like this, and recognizing it is not that easy.

Then all of the cards were recognized and verified by operators, and the library got a fast, easy-to-use electronic catalog.
But that was a long time ago.

Library projects today are more complex. They are no longer about electronic catalogs, of course, but about digitizing large volumes of printed material. Some customers want to build an online archive of historical periodicals, some want to open up access to rare books, and some want to give everyone access to the entire library over the Internet.

Here, for example, is another Lithuanian project, which we carried out 10 years after the first one. The task was to build a database of digital cultural heritage from the holdings of Lithuanian archives, museums and libraries. 50 thousand publications (newspapers and magazines) published before 1940 had to be recognized and made searchable. The work was complicated by the fact that many of the documents were in rather poor condition.

Based on Recognition Server, the following scheme was created:

image
The central distribution server (with a peak load of 500 thousand pages per month) picked up TIFF files of scanned A3 pages from the inbox folder and distributed them to 30 recognition stations. The recognized documents were sent to 4 operators for verification, after which the central server published the finished documents as searchable PDF files.
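As a rough illustration only, the distribution step can be pictured with the short Python sketch below. This is not Recognition Server code: the round-robin policy, the `inbox` folder name, and the function names `distribute` and `pending_tiffs` are all assumptions made for the example, and the product's real scheduling is not described in this post.

```python
from itertools import cycle
from pathlib import Path

STATION_COUNT = 30  # number of recognition stations mentioned in the post


def distribute(pages, station_count=STATION_COUNT):
    """Assign each page to a station in round-robin order (an assumed policy)."""
    queues = {station: [] for station in range(station_count)}
    for page, station in zip(pages, cycle(range(station_count))):
        queues[station].append(page)
    return queues


def pending_tiffs(inbox):
    """List TIFF files waiting in a hypothetical inbox folder, sorted by name."""
    return sorted(
        p for p in Path(inbox).iterdir()
        if p.suffix.lower() in {".tif", ".tiff"}
    )


if __name__ == "__main__":
    # Stand-in file names; a real run would use pending_tiffs("inbox") instead.
    demo_pages = [f"page_{i:04d}.tif" for i in range(65)]
    queues = distribute(demo_pages)
    for station in (0, 29):
        print(f"station {station}: {len(queues[station])} pages")
```

Round-robin is used here only because it is the simplest way to spread pages evenly. A production dispatcher would more likely hand the next page to whichever station is free.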

The outcome exceeded all expectations: the whole project took just three months, and its budget was a quarter of what our partners had originally planned.

So that you do not think we work only with our closest neighbors, here is an example of another project, this time in Malaysia. The Malaysian Department of Museums set us the task of making all periodical materials stored in local museums (9 thousand books, newspapers, magazines and other items) accessible over the Internet. A solution from MediaUniverse was chosen to store the results. The full path of a newspaper article to an Internet user's screen looked like this:

image

And finally, a few words about a project with another of our neighbors, Estonia. Its difficulty was that some of the materials to be recognized (namely, 600 thousand pages of newspapers, magazines and books), which were provided by the Estonian National Library, the customer of the project, had been published in the 19th century and printed in Gothic fonts. The first books in Estonian were printed not in Estonia (there were no printing houses there at the time) but in Sweden, Finland and Germany. At that time the Estonian alphabet contained letters that did not exist in other languages, so sometimes the same letter came out differently at different foreign printing houses, while different letters, on the contrary, came out looking alike. For such difficult cases we had to additionally “train” a special version of our product, ABBYY FineReader XIX Engine OCR (this version can recognize Gothic fonts). Recognition Server handled the remaining materials, working together with high-performance Zeutschel OK 300 Hybrid Color scanners. By the way, if you know Estonian, you can see the end result here.

In addition to working with museums and libraries in individual countries, we also take part in various European projects to digitize library books. Many articles and press releases have covered them, so we will not repeat those and will simply list the projects. One is IMPACT (IMProving ACcess to Text), a large-scale project to digitize books printed before the 20th century, initiated by the European Commission (for more information, visit the ABBYY website). Another is METAe, as part of which the company developed FineReader XIX, a program designed to recognize the Gothic typeface Fraktur, which is often found in texts printed from 1800 to 1938 (Itogi wrote about this initiative in detail).

Source: https://habr.com/ru/post/128104/

