
Report from the Lucene Revolution conference

In early October I was able to attend the Lucene Revolution conference, held in Boston. The conference was dedicated to the open-source search technologies Apache Lucene and Apache Solr. It seems to me that these technologies receive undeservedly little attention on Habr in particular and in the Russian-language web in general. Let's correct this omission.



I have been working with these technologies for a long time, but I could not have imagined that a conference on such a narrow, specialized area would bring together so many speakers, participants, and companies: over 300 attendees and more than 40 speakers in total. The range of applications of these search technologies is quite impressive: social networks, microblogging services, CRM platforms, online stores, government websites, online and university libraries, dating sites, and even bioinformatics projects.
LinkedIn

The largest professional social network contains more than 80 million user profiles and serves approximately 350 million search queries per week.

In my opinion, LinkedIn has the most advanced search among all social networks due to two things:
  1. Use of the social graph in search
    That is, the ranking of search results depends on the distance between users in the social graph. After all, a user is most likely interested in the Vasya Pupkin from his own circle of friends, not in whoever happens to have a smaller or larger profile identifier in the system. The user can also filter the results by a given distance: first circle / second circle.
  2. Faceted navigation
    Long gone are the days when a flat list of results was considered an adequate search interface. Online stores, content management systems, and specialized search engines have long enriched their results with filters that let the user narrow the search criteria and drill down to the information they need. The key feature of this interface is that the user sees in advance how many results each filter will return. LinkedIn provides a rich set of filters: by employer, location, university, and more.
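To make the faceting idea concrete, here is a minimal sketch of how a client might assemble a faceted Solr request. The parameters `facet`, `facet.field`, and `facet.mincount` are standard Solr request parameters; the field names `company` and `location` are hypothetical examples, not LinkedIn's actual schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FacetQueryBuilder {
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Build the HTTP query string for a faceted Solr search:
    // each facet.field asks Solr to return, alongside the results,
    // a count of matching documents per value of that field.
    static String buildFacetQuery(String terms, String... facetFields) {
        StringBuilder sb = new StringBuilder("/select?q=").append(enc(terms))
                .append("&facet=true")         // turn faceting on
                .append("&facet.mincount=1");  // hide empty buckets
        for (String field : facetFields) {
            sb.append("&facet.field=").append(enc(field));
        }
        return sb.toString();
    }
}
```

Solr's response then carries a `facet_counts` section, which the UI renders as the "narrow by employer / location" filter counts the user sees before clicking.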

At the moment, LinkedIn engineers are focused on the scalability and performance of their solution, as well as on near-instant updates of the search index, which is what their presentation was devoted to.

Twitter: a billion queries per day

Every day, about 80 million new tweets are added to the Twitter database, and the delay before new data appears in the search index must not exceed a few seconds. The previous version of the search engine was built on Summize technology on top of MySQL. The new version uses a somewhat reworked Apache Lucene library, and Twitter plans to open-source this work in the near future and contribute it back to the Apache project.

Salesforce

The largest SaaS CRM company also makes active use of Lucene. The Salesforce index currently exceeds 8 terabytes (20 billion documents). About 500 thousand users run searches every day, and the average load is about 4,000 requests per second.

Loggly.com

Loggly is a startup offering a cloud solution for storing and analyzing your logs. You can configure your servers to ship syslog messages to the company's servers, then search through the logs and run all kinds of analytics. The core of the architecture is the SolrCloud platform, which can index up to 100 thousand messages per second.

Archive.org

The archive-it.org commercial service from the well-known archive.org provides full-text search over nearly 1 billion documents for more than 120 customers and is migrating to a solution based on Apache Solr. As a crawler, they use a mix of their own Heritrix and a customized Apache Nutch.

Search.USA.gov and WhiteHouse.gov

The US government's IT budget is about 75 billion dollars, so it is not surprising that government contracts are highly valued by technology companies. Over the past ten years, the search engine on these government sites has gone through a series of commercial solutions: Inktomi (2000), FAST (2002), and Vivisimo (2005), before finally finding stability in the open-source Apache Lucene/Solr stack. Another interesting fact: the developers use Rackspace and Amazon Web Services for hosting, Pivotal Tracker for project management, and GitHub for storing source code. Many commercial corporations try to keep full control over all of this inside their intranets, so seeing such openness in the design of a government solution is quite surprising.

Libraries and Institutions (HathiTrust, Yale, Smithsonian)

Various university libraries use Apache Solr to search their catalogs as well as to provide full-text search over scanned and OCR-recognized books. Their main challenges are support for various languages (CJK, compound words, etc.) and massive scalability at a reasonable price. The HathiTrust search indexes 6.5 million scanned books (244 terabytes of images, 6 terabytes of recognized text) on limited computational resources.

Bioinformatics (Metarep project)

This is probably the most surprising application of these technologies, in a field radically different from full-text search. Honestly, I did not understand much, except that this field also faces the problem of processing and analyzing large volumes of data.

New features for Apache Lucene and Solr

There were also several presentations about new features coming in upcoming releases of Apache Lucene and Apache Solr, such as:

SolrCloud

This functionality greatly simplifies the creation, configuration, and maintenance of a distributed cluster. At its core the solution uses Apache ZooKeeper, which has proven its reliability in projects such as HBase and many systems at Yahoo.
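As an illustration of how little wiring this requires, here is a launch sketch in the spirit of the early SolrCloud examples: the first node runs an embedded ZooKeeper and uploads the shared configuration, and further nodes simply point at it. Ports, paths, and the config name `myconf` are placeholder values, and the exact flags varied between early SolrCloud versions.

```shell
# Node 1: run embedded ZooKeeper (-DzkRun) and bootstrap the shared
# config set from a local directory under the name "myconf".
java -DzkRun \
     -Dbootstrap_confdir=./solr/conf \
     -Dcollection.configName=myconf \
     -jar start.jar

# Node 2: join the cluster by pointing at node 1's ZooKeeper;
# the collection layout and config come from ZooKeeper, not local files.
java -Djetty.port=7574 \
     -DzkHost=localhost:9983 \
     -jar start.jar
```

The point of the design is that cluster state lives in ZooKeeper rather than in per-node configuration files, so adding a node is just starting a process with the right `zkHost`.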

Geographic Search (Solr Spatial Search)

It is no secret that all sorts of location-based services have gained popularity recently. They often need to search for objects not only by distance from the user, but also with ordinary filters: full-text, by category or tag, and so on. Apache Solr now provides this capability out of the box (it could be implemented before, but everyone reinvented the wheel). Lucene/Solr is used for this by such major industry players as Yelp.com and YP.com.
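A sketch of what such a combined query looks like on the wire: Solr's spatial support exposes the `{!geofilt}` filter with the `sfield`, `pt`, and `d` parameters, and a `geodist()` sort. Those parameter names are real Solr spatial parameters; the field name `location` and the coordinates are illustrative only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SpatialQueryBuilder {
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Combine a full-text query with a radius filter: {!geofilt} keeps only
    // documents whose location field (sfield) lies within d km of point pt.
    static String buildGeoQuery(String terms, String field,
                                double lat, double lon, double radiusKm) {
        return "/select?q=" + enc(terms)
             + "&fq=" + enc("{!geofilt}")                              // spatial filter
             + "&sfield=" + enc(field)                                 // indexed lat/lon field
             + "&pt=" + String.format(Locale.ROOT, "%.4f,%.4f", lat, lon) // center point
             + "&d=" + radiusKm                                        // radius in km
             + "&sort=" + enc("geodist() asc");                        // nearest first
    }
}
```

This is exactly the combination the talk emphasized: the geo filter narrows by distance while `q` stays a normal full-text query, so category or tag filters can be stacked on as additional `fq` parameters.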

Instant Search (Realtime Search)

The index data structure and search algorithm are such that updating the index with new documents nearly instantly (within a second, or even a fraction of a second) has always been, and remains, very difficult. The reasons lie in in-memory cache structures, the merging of index segments, the flushing of index segments from memory to disk, and so on. Several companies (Twitter, LinkedIn, and others) are working on this and have made good progress in reducing the delay to a minimum.

Flexible Indexing in Lucene (Flexible Indexing)

At the moment, the Lucene library is being seriously rewritten to give developers control over how the index is written to disk.

There were also sessions devoted to the following issues and related technologies:


An archive of the conference talks is available here.

Source: https://habr.com/ru/post/107638/

