The previous article on text recognition in images in the Evernote service focused mainly on functionality: what it is, how it works, and what it gives the Evernote platform as a whole. This time we will look at the technical side.
Hardware
Image text recognition puts a significant load on the Evernote computing cluster, so performance and efficiency play a major role when evaluating equipment. After testing several platforms, we settled on the iX1204-563UB from iX Systems. In essence, it is a Supermicro X8DTU motherboard in an 815TQ-563UB chassis. Each of the 37 recognition systems in the cluster consists of the following hardware:
- CPU: two Intel Xeon L5630 @ 2.13 GHz (rated thermal design power: 40 watts each)
- Motherboard: Supermicro X8DTU
- Chassis: Supermicro 815TQ-563UB
- Power supply: 560 watts (80 Plus Gold efficiency rating)
- Data storage: 5.25-inch drive with low power consumption
- RAM: 12 GB PC3-8500 (1066 MHz)
CPU, RAM, and other components were chosen as a compromise between throughput and efficiency. Earlier we evaluated some more sophisticated 2U Twin² systems, but found them less reliable under the constant heavy load they would face. We also considered traditional blades, but in the end they proved too difficult to fit into our existing infrastructure, especially given the typical 100% load on these servers.
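To put the per-node figures in cluster terms, here is a minimal sketch that multiplies the listed specifications by the 37 nodes. The `NodeSpec` class and `cluster_totals` function are hypothetical names introduced for illustration. The results are rated upper bounds derived from the list above (CPU thermal design power, power supply capacity), not measured power draw.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures taken from the hardware list (hypothetical container)."""
    cpu_sockets: int = 2      # two Xeon L5630 per node
    cpu_tdp_watts: int = 40   # rated TDP per CPU, not measured draw
    ram_gb: int = 12
    psu_watts: int = 560      # rated PSU capacity, not actual consumption


def cluster_totals(node: NodeSpec, node_count: int) -> dict:
    """Scale the per-node ratings to the whole cluster."""
    return {
        "cpu_sockets": node.cpu_sockets * node_count,
        "cpu_tdp_watts": node.cpu_sockets * node.cpu_tdp_watts * node_count,
        "ram_gb": node.ram_gb * node_count,
        "psu_capacity_watts": node.psu_watts * node_count,
    }


totals = cluster_totals(NodeSpec(), node_count=37)
print(totals)
```

With these inputs the CPUs account for at most 2960 watts of rated TDP across 74 sockets, which helps explain why a low-power CPU model matters for servers that run at full load all the time.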
Operating system
The underlying operating system is a Debian “Squeeze” build (AMD64) with everything unnecessary stripped out. We chose Debian for its stability and ease of upgrading. The OS remains almost stock, with a few exceptions:
- A modified 3.0.4 kernel was configured for our higher throughput requirements, and the CFLAGS were customized for the specific processor type we use.
- We use the XFS file system with relatively large buffers and with features such as 'barriers' and 'atime' disabled.
- The network stack has been tuned for more stable operation with many parallel file transfers.
- The kernel 'swappiness' parameter is set to zero (instead of the default 60).
- 802.1Q VLAN trunking is enabled at the OS level.
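Two of the settings above are easy to verify on a running host. The following is a minimal sketch, assuming the values are exposed the usual Linux way, through `/proc/sys/vm/swappiness` and `/proc/mounts`. The function names and the sample mount line (including `logbufs=8`) are illustrative assumptions, not the actual Evernote configuration, and `nobarrier` is the XFS option name used by kernels of that generation.

```python
def parse_mount_entry(mounts_text, mount_point):
    """Return (fs_type, options) for mount_point from /proc/mounts-style text.

    The last matching line wins, because a later mount shadows an earlier
    one (for example the initial 'rootfs' entry for '/').
    """
    found = (None, set())
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = (fields[2], set(fields[3].split(",")))
    return found


def check_tuning(swappiness_text, mounts_text, mount_point="/"):
    """Return a list of human-readable deviations from the tuning described."""
    problems = []
    if int(swappiness_text.strip()) != 0:
        problems.append("vm.swappiness is not 0")
    fs_type, options = parse_mount_entry(mounts_text, mount_point)
    if fs_type != "xfs":
        problems.append(f"{mount_point} is not XFS")
    for wanted in ("noatime", "nobarrier"):
        if wanted not in options:
            problems.append(f"{mount_point} is missing mount option {wanted}")
    return problems


# Illustrative input only; on a real host, read the two /proc files instead.
sample_mounts = (
    "rootfs / rootfs rw 0 0\n"
    "/dev/sda1 / xfs rw,noatime,nobarrier,logbufs=8 0 0\n"
)
print(check_tuning("0\n", sample_mounts))
```

On a live system the same check would be called with the contents of `/proc/sys/vm/swappiness` and `/proc/mounts`. An empty list means both settings match the description above.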
The idea was to remove as many bottlenecks as possible and let the image recognition toolkit get on with its work undisturbed. Kernel tuning had an unexpectedly large effect, improving performance by 7-30% depending on conditions. XFS, in turn, let us minimize I/O contention on a single-disk volume at the cost of a little more RAM, and gave us the ability to reassign the file system on the fly.
Software
The Evernote image recognition toolkit includes internally developed software for managing recognition queues and for the image processing itself, as well as a set of recognition engines focused on different types of text. These include both our own engines and best-in-class third-party technology from IRIS. Our own software consists of AMP (Asynchronous Media Processor) and ENRS (Evernote Recognition Service). We described this software suite in detail in the previous article, so we will limit ourselves to a brief description:
- ENRS, through its child processes AIR / ANR, is the engine responsible for the actual image recognition.
- AMP acts as an intermediary between the Evernote service cluster and ENRS, accepting raw images and passing them to ENRS.
The load from communication between AMP servers is contained by giving them their own broadcast domain, with isolation enforced by the 802.1Q tagged VLAN mentioned above. This lets the recognition servers tell each other which shard they are working on and avoid duplicating work, which significantly reduces the load on the main Evernote service.
We hope this account has made one of the more unusual components of the Evernote service clearer to interested readers. It is a hard topic to cover in depth without getting lost in minor details.