Although the primary copies of user tweets are stored in MySQL and Cassandra databases, the company also maintains additional storage on Hadoop, which is used for analytics and other software applications.
Information in this system can be queried with Java MapReduce or with Pig, Hadoop's high-level query language (its scripts play a role similar to SQL queries). The search system has already been moved to this backend, and other applications are expected to follow.
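The kind of query such a backend serves can be sketched in plain Python as a map step followed by a reduce step. The record layout and field names below are illustrative, not Twitter's actual schema; the comments show what a roughly equivalent Pig script might look like.

```python
from collections import defaultdict

# Hypothetical sample of tweet records (illustrative field names only).
tweets = [
    {"user": "alice", "text": "hello world"},
    {"user": "bob",   "text": "protocol buffers"},
    {"user": "alice", "text": "hadoop at scale"},
]

# Map phase: emit a (user, 1) pair for every tweet.
def map_tweet(tweet):
    yield tweet["user"], 1

# Shuffle + reduce phase: sum the emitted counts per user.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [kv for t in tweets for kv in map_tweet(t)]
counts = reduce_counts(pairs)

# A roughly equivalent Pig script (sketch, not a tested query):
#   tweets  = LOAD 'tweets' AS (user, text);
#   grouped = GROUP tweets BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(tweets);
print(counts)
```

On a real cluster the map and reduce functions would run distributed over HDFS blocks; the logic is the same.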
Having rejected popular formats such as XML, CSV and JSON, Twitter's engineers chose the relatively little-known Protocol Buffers format developed by Google (previously covered on Habré) for storing backend data. Twitter representatives presented the technical details of the implementation at the Hadoop World conference on Tuesday.
Every day, 12 TB of new data is added to the Twitter database. At such volumes, the choice of storage format becomes critical, and the combination of Protocol Buffers, Hadoop and related technologies is meant to address exactly that.
Each tweet occupies 17 fields in the database, six of which have at least one subfield, and Twitter plans to add more subfields over time. Eventually the storage system must scale to, and work efficiently with, a trillion tweets from a billion users.
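A record with nested subfields like this is naturally expressed as a Protocol Buffers schema. The sketch below is purely illustrative (the message and field names are invented; Twitter's actual schema is not given in the article), but it shows how subfields become nested messages:

```protobuf
// Illustrative sketch only -- not Twitter's real schema.
syntax = "proto2";

message Tweet {
  optional int64  id   = 1;
  optional string text = 2;
  optional User   user = 3;   // a field with subfields

  message User {
    optional int64  id          = 1;
    optional string screen_name = 2;
  }
  // ... further fields (the real record has 17) ...
}
```

Running the `protoc` compiler over such a `.proto` file generates reader and writer classes in Java, C++ or Python, which is what makes the format convenient on a Hadoop backend.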
In addition to user content, service information from internal logs also enters the storage (more than 80 types of events that occur in the system). Much of this data is aggregated using Scribe, a free log-aggregation technology developed by Facebook.
The advantage of Protocol Buffers over XML becomes evident at large data volumes: according to Twitter's developers, a trillion tweets stored as XML would occupy about ten petabytes instead of one. JSON also carries a lot of redundant markup. At the other extreme is CSV, where values are separated only by commas: there is no overhead, but nested subfields are hard to represent.
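This overhead can be illustrated with Python's standard library. The record below is hypothetical, and the packed-binary layout is only a stand-in for Protocol Buffers' wire format, but the relative sizes show why tag-heavy formats cost so much at scale:

```python
import csv, io, json, struct
from xml.etree import ElementTree as ET

# One hypothetical record (illustrative field names only).
user_id, status_id, text = 42, 1001, "hello"

# XML: every value is wrapped in named opening and closing tags.
root = ET.Element("tweet")
ET.SubElement(root, "user_id").text = str(user_id)
ET.SubElement(root, "status_id").text = str(status_id)
ET.SubElement(root, "text").text = text
xml_bytes = ET.tostring(root)

# JSON: lighter than XML, but field names still repeat in every record.
json_bytes = json.dumps(
    {"user_id": user_id, "status_id": status_id, "text": text}
).encode()

# CSV: no field names at all, but no way to express nested subfields.
buf = io.StringIO()
csv.writer(buf).writerow([user_id, status_id, text])
csv_bytes = buf.getvalue().encode()

# Compact binary (a stand-in for the protobuf wire format):
# two fixed-width integers plus a length-prefixed string.
binary_bytes = struct.pack("<qqI", user_id, status_id, len(text)) + text.encode()

print(len(xml_bytes), len(json_bytes), len(csv_bytes), len(binary_bytes))
```

Even on a single tiny record the XML encoding is several times larger than the binary one; multiplied by a trillion records, that is the difference between one petabyte and ten.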
Protocol Buffers has none of these drawbacks, and it also automates work with the data structures themselves. As the Protocol Buffers tutorial puts it, you define how your data is structured once, and then specially generated source code lets you easily write and read that structured data to and from a variety of streams, in a variety of languages.