
Some details about the indexing system in Evernote

The previous article on text recognition in images in the Evernote service dealt mainly with functionality: what it is, how it works, and what it brings to the Evernote platform as a whole. This time we will talk about the technical side.

Hardware


Text recognition in images puts a significant load on Evernote's computing cluster, so performance and efficiency play a major role when evaluating equipment. After testing several different platforms, we settled on the iX1204-563UB from iX Systems. In essence, it is a Supermicro X8DTU in an 815TQ-563UB chassis. Each of the 37 recognition systems in the cluster consists of the following hardware:



The CPU, RAM and other components were chosen as a compromise between throughput and efficiency. Earlier, we evaluated some sophisticated 2U Twin² systems, but found them less reliable under the constant heavy load they would have to handle. Traditional blades were also considered, but in the end they proved too difficult to fit into the existing infrastructure, especially given the typical 100% load on these servers.

Operating system


The underlying operating system is a build of Debian “Squeeze” (AMD64) with everything unnecessary stripped out. We chose Debian for its stability and ease of upgrades. The OS has remained almost untouched, with a few exceptions:


The idea was to remove as many bottlenecks as possible and let the image recognition toolkit go about its business undisturbed. Kernel tuning had an unexpectedly large effect: a 7-30% increase in performance, depending on conditions. As for XFS, it let us minimize I/O contention on a single-disk volume at the expense of a bit more RAM, and gave us the ability to reassign the file system on the fly.

Software


Evernote's image recognition toolkit includes internally developed software for managing the recognition queues and for the image processing itself, as well as a set of recognition engines focused on different types of text. Among them are both our own developments and best-in-class third-party technology from IRIS. Our own software consists of AMP (Asynchronous Media Processor) and ENRS (Evernote Recognition Service). We have already described this software suite in detail in the previous article, so we will limit ourselves to a brief description:


The load from AMP server interaction is mitigated by giving the servers their own broadcast domain, with isolation enforced through the 802.1Q tagged VLAN mentioned above. This lets the recognition servers tell each other which shard they are working on and avoid duplicated work, which significantly reduces the load on the main Evernote service.
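The article does not describe how the servers actually exchange this information, so the following is only a minimal sketch of the idea, not Evernote's implementation. It assumes each server announces "I am working on shard N" to its peers and keeps a local table of the announcements it has heard. The class name `ShardClaims`, the server names, the expiry time and the rule that a server works on one shard at a time are all hypothetical; the network transport is left out entirely.

```python
import time
from typing import Dict, Iterable, Optional, Tuple


class ShardClaims:
    """Local table of which server has announced work on which shard.

    Hypothetical helper: in a real deployment record() would be called
    for every announcement received from a peer over the network.
    """

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        # Assumption: a claim expires if it is not refreshed, so a crashed
        # server does not block a shard forever.
        self.ttl = ttl_seconds
        self._claims: Dict[int, Tuple[str, float]] = {}  # shard -> (server, expiry)

    def record(self, server: str, shard: int, now: Optional[float] = None) -> None:
        """Store an announcement that `server` is working on `shard`."""
        now = time.monotonic() if now is None else now
        # Assumption: one shard per server at a time, so drop its older claim.
        self._claims = {s: c for s, c in self._claims.items() if c[0] != server}
        self._claims[shard] = (server, now + self.ttl)

    def claimed_by_other(self, server: str, shard: int, now: Optional[float] = None) -> bool:
        """True if a different server holds an unexpired claim on `shard`."""
        now = time.monotonic() if now is None else now
        claim = self._claims.get(shard)
        return claim is not None and claim[0] != server and claim[1] > now

    def pick(self, server: str, shards: Iterable[int], now: Optional[float] = None) -> Optional[int]:
        """Claim and return the first shard no other server is working on."""
        now = time.monotonic() if now is None else now
        for shard in shards:
            if not self.claimed_by_other(server, shard, now):
                self.record(server, shard, now)
                return shard
        return None  # every shard is currently taken by a peer


if __name__ == "__main__":
    claims = ShardClaims(ttl_seconds=10)
    claims.record("ocr-02", 7)             # announcement heard from a peer
    print(claims.pick("ocr-01", [7, 8]))   # skips shard 7, which ocr-02 holds
```

Because every server consults its own table before asking the main service for work, two servers rarely poll the same shard at once, which is the duplication the paragraph above refers to.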

We hope this account has made one of the most unusual components of the Evernote service clearer to interested readers. It is a topic that is hard to cover in depth without getting lost in minor details.

Source: https://habr.com/ru/post/139930/

