At the moment, the load on Twitter's servers has grown to 1,000 TPS (tweets per second) and 12,000 QPS (queries per second), which is more than 1 billion queries per day. The current infrastructure still copes, but to build in headroom for several years to come, the company decided to replace the backend of its search engine. “If we worked well, then you shouldn’t have noticed anything in recent weeks,” the Twitter blog says.
Until recently, Twitter's search backend was based on the old SQL system from Summize. Summize was acquired in July 2008 for exactly this purpose, and five of its six developers joined Twitter. The need to upgrade Twitter became clear immediately after the launch of the iPhone 3G, which is when the cooperation with Summize began. Now the time has come for another upgrade.
About six months ago, the company decided to develop a new, modern search architecture based on an efficient inverted index instead of a relational database. Since Twitter loves open source, it chose Apache Lucene, a search library written in Java, as the starting point for the solution.
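To illustrate the idea, here is a minimal sketch of an inverted index in Python. It is not Twitter's or Lucene's implementation: the `InvertedIndex` class, the `tokenize` helper, and the sample tweets are all hypothetical, and the tokenizer is deliberately naive. The point is only that each term maps directly to the list of tweets containing it, so a query never has to scan the tweet texts themselves.

```python
from collections import defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption for this sketch);
    # a real analyzer handles punctuation, stemming, and much more.
    return text.lower().split()


class InvertedIndex:
    def __init__(self):
        # term -> posting list: ids of tweets containing the term,
        # in the order the tweets were indexed
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # set() ensures each tweet appears once per term
        for term in set(tokenize(text)):
            self.postings[term].append(tweet_id)

    def search(self, query):
        # AND semantics: a tweet must contain every query term
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], []))
        for term in terms[1:]:
            result &= set(self.postings.get(term, []))
        return sorted(result)


index = InvertedIndex()
index.add(1, "Lucene is a search library")
index.add(2, "Twitter search now uses Lucene")
index.add(3, "Summize joined Twitter in 2008")

print(index.search("lucene search"))  # [1, 2]
print(index.search("twitter"))        # [2, 3]
```

A relational database can answer the same queries, but the inverted index stores exactly the structure that full-text search needs, which is presumably why it was chosen as the basis for the new architecture.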
The requirements for the new search engine were good scalability and maximum indexing speed. The goal was that no more than 10 seconds should pass between the publication of a tweet and the moment it becomes available to full-text search. Since the indexer is only one stage of the pipeline along that path, it had to work as fast as possible (in under 1 second).
To meet these goals, the developers had to rework parts of Lucene, because out of the box it is not well suited to real-time search. The core in-memory data structures were rewritten, especially the posting lists, while support for the standard Lucene API was preserved, so the search side of the library needed almost no changes. Here are the key benefits resulting from the modification:
* significantly improved garbage collection performance
* lock-free data structures and algorithms (non-blocking synchronization)
* posting lists that can be traversed in reverse order
* efficient early termination of queries
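The last two items in the list can be sketched together. Assuming a posting list is appended to in chronological order (an assumption of this sketch, not something the article states), walking it backwards yields the newest tweets first, and the query can stop as soon as it has enough hits. The function name `search_newest_first` and its parameters are hypothetical; the sketch does not model the lock-free or garbage-collection aspects.

```python
def search_newest_first(postings, limit):
    """Walk a chronologically appended posting list backwards and
    stop as soon as `limit` hits have been collected."""
    if limit <= 0:
        return [], 0
    hits = []
    visited = 0
    for tweet_id in reversed(postings):  # newest entries come first
        visited += 1
        hits.append(tweet_id)
        if len(hits) == limit:
            break  # early termination: older entries are never touched
    return hits, visited


# Hypothetical posting list for one term: tweet ids 1..1000, oldest first.
postings = list(range(1, 1001))

hits, visited = search_newest_first(postings, limit=3)
print(hits)     # [1000, 999, 998]
print(visited)  # 3
```

In this toy example only 3 of the 1,000 entries are visited, which shows why reverse traversal and early termination fit a search where the most recent results matter most.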
According to the developers, some of the techniques they used may be interesting and useful to other programmers (not only in the search field), so a more detailed discussion of the topic may follow in the future.
One way or another, all the modifications made to Lucene will be contributed back to Apache; some are already included in the main Lucene code and in its new branch for real-time search.
As a result of the search infrastructure upgrade, the load on the backend dropped significantly (it now uses only 5% of the resources), so there is a good reserve for the future. The new indexer can index about 50 times more tweets per second than are published today. The new search engine has also been running stably, without any problems.
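As a quick sanity check on the figures quoted in this article, the arithmetic can be written out as follows. The variable names are illustrative, and the "about 50 times" factor is treated as exactly 50 for the sake of the calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_second = 12_000  # QPS figure quoted in the article
tweets_per_second = 1_000    # TPS figure quoted in the article
indexer_headroom = 50        # "about 50 times", taken as exactly 50 here

queries_per_day = queries_per_second * SECONDS_PER_DAY
indexer_capacity_tps = tweets_per_second * indexer_headroom

print(queries_per_day)       # 1036800000
print(indexer_capacity_tps)  # 50000
```

So 12,000 QPS does correspond to slightly more than 1 billion queries per day, and a 50-fold reserve would put the indexer's capacity at roughly 50,000 tweets per second.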
One long-standing annoyance of Twitter search has been the inability to search the archive for tweets more than a few days old. Twitter explained this by a “lack of space.” To get around this limit, you have to use third-party search engines that index tweets independently, for example, Topsy.
On January 14, 2010, Danny Sullivan checked the search results for the word [today] and found that the oldest tweet had been published 7 days earlier.
A similar test in mid-September showed that the index depth had been reduced to 4 days.
With the introduction of the new search, it was announced that the index would be doubled with no effect on the speed of search queries. Apparently, this means a return to the same seven-day limit.