
For several years now, the UK has been running an ambitious program to digitize the archives of universities and large libraries using modern technology. Its goal is to convert information into a user-friendly electronic form and make it accessible to everyone. Educational institutions are encouraged to take on this feat (financially, too) by the Joint Information Systems Committee (JISC), a non-governmental public organization.
And it works: last year the Hartley Library of the University of Southampton, one of the best research libraries in the country, joined the program. It set itself an ambitious task: to digitize everything it can at a presto tempo (half a million pages a year) and give it to the public. To make sure presto really was presto and not some adagio, Hartley used ABBYY Recognition Server, a solution for automatic recognition of large volumes of documents. Below are the technical details and a few extras.
7 scanners, 2 programs and 1 open API

Digitization projects large and small at Hartley are handled by a dedicated unit, the LDU (Library Digitization Unit). It has 7 scanners (6 book scanners and 1 of another type) and uses ABBYY Recognition Server for processing text and images. The process is managed by the web application Goobi Production Workflow, an open-source software package adopted by major European libraries to digitize cultural heritage on an “industrial” scale (details about it are available in English and German).
Recognition Server’s open programming interface made integration with Goobi easy, and the “production algorithm” looked like this:
• Each LDU scanner is assigned to an operator. As soon as the operator finishes the task (a book or multi-page document is scanned in full), Goobi steps in and instructs Recognition Server to process the finished stack of files. With several operators and a lot of documents, the result is a kind of conveyor, and Goobi watches everyone like Big Brother.
• Recognition Server automatically processes the specified files: it recognizes, converts, and indexes them. Goobi checks that the task has completed, and the result is published online.
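The two steps above can be sketched as a small conveyor loop. This is only an illustration of the control flow, not the actual Goobi or Recognition Server API: `Batch`, `Status`, `run_workflow`, `fake_recognize`, and the file names are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SCANNED = "scanned"
    RECOGNIZED = "recognized"
    PUBLISHED = "published"
    FAILED = "failed"


@dataclass
class Batch:
    """One fully scanned book or multi-page document (hypothetical model)."""
    title: str
    pages: List[str]                 # paths of the scanned page images
    status: Status = Status.SCANNED
    output: str = ""                 # path of the resulting searchable PDF


def run_workflow(
    batches: List[Batch],
    recognize: Callable[[List[str]], str],
    publish: Callable[[str], None],
) -> List[Batch]:
    """Send each finished batch to recognition, check the result, publish it."""
    for batch in batches:
        try:
            # The workflow manager hands the finished stack of files to OCR.
            batch.output = recognize(batch.pages)
        except Exception:
            # Any recognition error marks the batch as failed; nothing is published.
            batch.status = Status.FAILED
            continue
        batch.status = Status.RECOGNIZED
        publish(batch.output)
        batch.status = Status.PUBLISHED
    return batches


def fake_recognize(pages: List[str]) -> str:
    """Stand-in for the OCR server: returns the name of a 'searchable PDF'."""
    if not pages:
        raise ValueError("empty batch")
    return f"{len(pages)}-pages.pdf"


published: List[str] = []
batches = [Batch("Pamphlet", ["p1.tif", "p2.tif"]), Batch("Empty", [])]
run_workflow(batches, fake_recognize, published.append)
print([b.status.value for b in batches])  # ['published', 'failed']
```

In a real installation the `recognize` and `publish` callables would be replaced by calls into the OCR server and the publishing target; passing them in as parameters keeps the status tracking independent of both.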
Thanks to the “7 + 2 + 1” combination, the Hartley Library has processed more than two million images, and web users have gained access to some rather unusual PDF collections.
Which PDF do you want, sir?
Hartley was not afraid to put antiquarian rarities on the Internet: from topical pamphlets and 19th-century parliamentary bills to doctoral theses and, for dessert, antiquarian books on knitting. Everything is available as searchable PDF and lives on several web resources.
For example, any thesis (there are 20,000 of them in the archive) can be downloaded from ePrints Soton, the university’s electronic library. Besides theses, it holds plenty of other interesting material, and most works are fully accessible to almost everyone. Readers (fellow researchers) are expected to show basic courtesy: compliance with copyright law.
The collection of documents of the English Parliament from 1700 to 1834 is available here. It includes official records of sittings of the House of Lords and the House of Commons, parliamentary registers, reports from sessions of the House of Commons, and regulations.
A collection of pamphlets gathered from literally all over England is available at this address. The country’s research libraries sent over 23,000 masterpieces of literary and satirical thought to Hartley for digitization; they describe the social, political, and economic climate of 19th-century Britain. The catalog and a description of the project are included.
You can touch the beautiful (and the practical) here: a collection of knitting books assembled by Richard Rutt, a bishop and scholar, which came to Southampton from the library of the Winchester School of Art. The oldest item in the collection dates from 1800, and the most recent from 1911.
But that's not all. Hartley Library actively supports those humanities and technical courses at the university where you simply have to read a lot in order to know a lot. The course material is digitized and posted on the web as searchable PDF, but, as you have already guessed, these files are available only to those enrolled in the course.
That's it :)