Once again, I want to offer my translation of an article, this time by Todd Hoff, devoted to the WhatsApp architecture at the time of its acquisition by Facebook.
Note: the beginning of the article contains the original author's reasoning about why Facebook bought WhatsApp for a fabulous $19 billion. If that does not interest you, just scroll down; the description of the architecture is below.
Rick Reed, in his upcoming March talk entitled "That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp", reveals some stunning WhatsApp statistics:
What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM, and hopes to serve the billions of smartphones that will soon be a reality around the world? The Erlang and FreeBSD-based WhatsApp architecture. We have faced many challenges in meeting the ever-growing demand for our messaging service, but we continue to push our system in terms of size (> 8000 cores) and speed (> 70M Erlang messages per second).
But since we do not have that talk yet, let's look at the talk Rick Reed gave two years ago: "Scaling to Millions of Simultaneous Connections".
With experience building a high-performance messaging bus in C++ at Yahoo, Rick Reed is no stranger to the world of scalable architectures. The founders of WhatsApp are also former Yahoo employees with considerable experience in scaling systems. So WhatsApp comes by its scaling skills honestly. And since their big, bold goal is to be on every smartphone in the world, of which there may be about 5 billion in a few years, they will need all of that experience.
Before we get to the facts, let's digress to an astounding riddle: how could Facebook ever value WhatsApp at $19 billion?
If you ask me, as a programmer, whether WhatsApp is worth that much money, I will answer: of course not! It's just sending data over a network! Get real. Then again, I am one of those people who thought blogging platforms were unnecessary, because there is nothing difficult about logging in to your server remotely, opening index.html in vi, and writing your post in HTML. It took me a while to understand that the hard part is not writing the silly code; it is getting all those users to love your product. You can't buy love.
So what makes WhatsApp so valuable? The technology? Ignore those who say they could write WhatsApp in PHP in a week. That is simply not true. As we will see, this is very cool technology. But, of course, Facebook has enough resources to build WhatsApp itself if it wanted to.
Let's look at the features. We all know that WhatsApp is a no-gimmicks product (no ads, no games, no tricks) with loyal users all over the world. It offers free messaging in a harsh world where SMS bills can be terrifying. As a sheltered American, I was very surprised at how many real people use WhatsApp to really stay in touch with their family and friends. So when you install WhatsApp, the people you know are probably already on it, since everyone has a phone, which eliminates the empty-social-network problem. It is aggressively cross-platform, so everyone you know can use it and it will just work. The phrase "it just works" comes up often. It is full featured (you can share location, audio, video, and pictures, use push-to-talk, send voice messages and photos, get delivery receipts, join group chats, and send messages over WiFi, and all of it is delivered whether or not the recipient is online). It also handles the display of native writing systems well. And using the mobile phone number as the identity and the contact list as the social graph is diabolically simple. No email confirmation, no username and password, no credit card number required. It just works.
All of that is cool, but it is not worth $19 billion. Other products can compete on features.
One possible reason is that Google wanted to buy WhatsApp, offering 99 cents per user. That is a threat to Facebook; perhaps they are simply desperate. The money is being paid for your phone book and for metadata (even though WhatsApp does not keep any).
It is for the more than 450 million active users, growing by a million every day, with a potential of a billion users. Facebook needs WhatsApp to get its next billion users. But if so, that is only part of the story. And a price of about $40 per user does not look unreasonable, especially when paid in stock. Facebook bought Instagram for $30 per user. A Twitter user is valued at $110.
Benedict Evans argues that mobile is a trillion-dollar market, and that WhatsApp is disrupting the SMS part of it, an industry with $100 billion in revenue: WhatsApp sends 18 billion messages a day, while the global SMS system carries only 20 billion. With the fundamental shift from personal computers to nearly universal smartphones, the size of the opportunity is a much larger addressable market than the one Facebook is used to.
But Facebook has promised no ads and no interference, so where is the benefit?
There is interesting business development going on around mobile technologies. WhatsApp is used for group discussions within project teams, and investors discuss deal flow over WhatsApp.
Instagram is used in Kuwait to sell sheep.
WeChat, a WhatsApp competitor, launched a taxi-hailing service in January; in the first month, 21 million cabs were hailed.
With the future of e-commerce looking like it will be funneled through mobile messaging apps, is this an e-commerce play?
It is not only businesses that use WhatsApp for tasks once handled by desktop or web applications. Spanish police use WhatsApp to catch criminals, and Italians use it to organize basketball games.
Commercial and other applications are moving to mobile phones for obvious reasons. Everyone has a phone, and these messaging apps are full featured and free or cheap to use. You no longer need a desktop computer to do business. Many functions can be covered by a mobile application.
Thus, messaging is a threat to Google and Facebook. The desktop is dead. The web is dying. Messaging plus mobile is an ecosystem that can replace them. Messaging, rather than search, has become the center of engagement on mobile, changing how things are found and the very nature of which applications will win the future. We are not just pre-PageRank, we are pre-web.
Facebook has to get into this market or become irrelevant.
With the transition to mobile, we are seeing the deportalization of Facebook. Its desktop interface is a portal that provides access to every feature of the backend. It is big, tangled and creaky. Who really likes the Facebook interface?
When Facebook came to mobile devices, they tried the portal approach and it did not work. So they moved to a strategy of small, more focused, single-purpose applications. Mobile first! There is only so much you can do on a small screen. On a mobile phone, it is easier to find a separate application than a menu buried deep inside a confusing portal application.
But Facebook is going one step further. Not only are they building individual applications for specific tasks, they are offering several competing applications with similar functionality, and these applications do not necessarily share a backend. We see this with WhatsApp and Messenger, and Instagram competes with photos on Facebook. Paper is an alternative Facebook interface that provides limited functionality, but what it does, it does well.
Conway's law may apply here. The idea is that "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations." With a monolithic backend infrastructure, we get a Borg-like portal design. The transition to mobile frees organizations from that kind of thinking. If applications can be built that use only a slice of the Facebook infrastructure, then applications can be built that do not use the Facebook infrastructure at all. And if they do not use the Facebook infrastructure, then they do not have to be built by Facebook. What, then, is Facebook?
Facebook CEO Mark Zuckerberg has his own point of view, voiced at the Mobile World Congress conference, that the WhatsApp takeover is closely related to Internet.org:
The idea is to develop a set of basic, free Internet services: "a 911 for the Internet". It could be a social network like Facebook, a messaging service, maybe search, and other things like weather. This set of free services would work as a kind of gateway drug: users who can afford data services and phones today simply do not see the point in paying for those services. This would give them some context for why the services matter and encourage them to pay for other, similar services. That, at least, is the hope.
It is a long game, but there is enough value in it to make it worth playing.
Have we arrived at an answer? I don't think so. This is a staggering amount of money with no obvious short-term payoff, so the long-game explanation makes some sense. We are still at the dawn of mobile. No one knows what the future will look like, so it is better not to try to make the future look like the past. Facebook seems to be doing just that.
But enough of that. How do you serve 450 million active users with just 32 engineers? Let's take a look...
Sources
Warning: we do not know much about the WhatsApp architecture as a whole, just fragments and scraps collected from various sources. Rick Reed's talk is about the optimizations that made it possible to handle 2 million connections on a single server using Erlang, not an overview of the entire architecture.
Statistics
These statistics mostly describe the current system, not the system covered by the talk we have. The talk on the current system will include more on hacks for data storage, messaging, meta-clustering, and more BEAM/OTP patches.
- 450 million active users, a figure reached faster than by any other company in the world.
- 32 engineers, or one developer per 14 million active users.
- 50 billion messages every day across seven platforms (inbound plus outbound).
- More than a million new users register every day.
- $0 spent on advertising.
- $60 million investment from Sequoia Capital; Sequoia stands to make $3.4 billion.
- Facebook spent 35% of its cash on the deal.
- Hundreds of nodes.
- More than 8000 processor cores.
- Hundreds of terabytes of RAM.
- Over 70 million Erlang messages per second.
- In 2011, WhatsApp reached a million established TCP connections on a single server, with memory and CPU to spare. In 2012, it reached 2 million connections. In 2013, they tweeted: "On December 31, we set a new record: 7 billion incoming messages, 11 billion outgoing messages = 18 billion messages processed in one day! Happy 2013!"
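A few of these figures can be cross-checked with simple arithmetic. The sketch below uses only numbers from the list above; the variable names are illustrative, and the per-second rate is a daily average that assumes traffic is spread evenly over the day (it is not, so real peaks are higher).

```python
# Back-of-the-envelope checks on the statistics listed above.
ACTIVE_USERS = 450_000_000
ENGINEERS = 32
MESSAGES_PER_DAY = 50_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Active users per engineer (the list rounds this to 14 million).
users_per_engineer = ACTIVE_USERS / ENGINEERS

# Average message rate, assuming an even spread over 24 hours.
avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# New Year's Eve record quoted in the tweet, in billions of messages.
inbound_billions, outbound_billions = 7, 11
total_billions = inbound_billions + outbound_billions

print(f"{users_per_engineer / 1e6:.1f} million users per engineer")
print(f"{avg_messages_per_second:,.0f} messages per second on average")
print(f"{total_billions} billion messages on December 31")
```

The average works out to well over half a million messages per second across the whole system, which gives a sense of scale for the per-server numbers discussed later.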
Platform
Backend
- Erlang
- FreeBSD
- Yaws, lighttpd
- PHP
- Custom patches to BEAM (BEAM is to Erlang what the JVM is to Java)
- Custom XMPP
- Hosting, possibly at SoftLayer
Frontend
- Seven client platforms: iPhone, Android, Blackberry, Nokia Symbian S60, Nokia S40, Windows Phone, ?
- SQLite
Hardware
- Standard user-facing server:
- Two 6-core Westmere processors (24 logical CPUs);
- 100GB RAM, SSD;
- Two network interfaces (a public one for user traffic, a private one for the backend)
Product
- Focus on messaging. Connecting people around the world, wherever they are, without having to pay a lot of money. Founder Jan Koum remembers how difficult it was in 1992 to stay in contact with family around the world.
- Privacy. Shaped by Jan Koum's memories of growing up in Ukraine, where nothing was private. Messages are not stored on servers; chat history is not stored; the goal is to know as little about the user as possible; your name and gender are unknown; chat history lives only on your phone.
General
- The WhatsApp server is almost completely written in Erlang.
- The server systems that route messages are written in Erlang.
- The big achievement is that so many users are served by so few servers. The team agrees that this is largely thanks to Erlang.
- It is worth noting that Facebook Chat was written in Erlang in 2009, but the language was later abandoned there because it was hard to find qualified programmers.
- The WhatsApp server grew out of ejabberd.
- Ejabberd is a well-known open source Jabber server written in Erlang.
- It was originally chosen because it is open, had great reviews from developers, is easy to get started with, and because of the promise of Erlang's long-term suitability for large communication systems.
- The next few years were spent rewriting and modifying quite a few parts of ejabberd, including switching from XMPP to their own protocol, restructuring the codebase, redesigning several key components, and making many important changes to the Erlang virtual machine to optimize performance.
- To handle 50 billion messages a day, the focus has to be on building a reliable system that works. Monetization is something to think about later; it is far down the road.
- The main indicator of system health is message queue length. The message queue length of every process on a node is constantly monitored, and an alert is sent when it grows beyond a threshold. If one or more processes fall behind, an alert is raised for them, which points to the next bottleneck to attack.
- When a multimedia message is sent, the image, audio or video is uploaded to an HTTPS server, and then a link to the content is sent to the recipient along with a Base64-encoded thumbnail.
- Some code is deployed every day, often several times a day, though rollouts generally avoid peak traffic times. Erlang makes it possible to be aggressive about pushing fixes and new features to production. Hot code loading means updates are deployed without restarts or traffic shifting. Errors can usually be fixed very quickly, again by hot loading. Such systems tend to be loosely coupled, which makes incremental rollouts easy.
- What protocol is used in the WhatsApp application? An SSL socket to the WhatsApp server pools. All messages are queued on the server until the client connects to retrieve them. Successful retrieval of a message is reported back to the WhatsApp server, which forwards this status to the original sender (who sees it as a checkmark next to the message). Messages are wiped from server memory as soon as the client has accepted them.
- How is the registration process implemented inside WhatsApp? WhatsApp used to create a username and password based on the phone's IMEI. This was changed recently. WhatsApp now uses a general request from the application to send a unique 5-character PIN. WhatsApp then sends an SMS with this code to the specified phone number (which means the WhatsApp client no longer needs to run on the same phone). Based on the PIN, the application requests a unique key from WhatsApp. This key is used as the password for all subsequent calls (this "permanent" key is stored on the device). It also means that registering a new device invalidates the key on the old device.
- On Android, Google's push notification service is used.
- There are more Android users, and Android is more enjoyable to work with. Developers can prototype a feature and push it out to hundreds of millions of users almost instantly; if there is a problem, it can be fixed quickly. With iOS, things are not so simple.
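The queue-length health check described in the list above can be sketched in a few lines. This is only an illustration in Python, not WhatsApp's code: the process names, the snapshot values and the threshold are all made up, and in a real Erlang system the per-process numbers would come from something like `erlang:process_info(Pid, message_queue_len)`.

```python
from typing import Dict, List, Tuple


def find_backlogged(queue_lengths: Dict[str, int], threshold: int) -> List[Tuple[str, int]]:
    """Return (process, queue length) pairs above the threshold, longest first."""
    over = [(name, n) for name, n in queue_lengths.items() if n > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of per-process message queue lengths on one node.
snapshot = {"router_1": 12, "router_2": 48_000, "offline_store": 950, "auth": 3}

# Hypothetical alert threshold; the source does not say what value is used.
alerts = find_backlogged(snapshot, threshold=500)
for name, length in alerts:
    print(f"ALERT: {name} has {length} queued messages")
```

Sorting the offenders by queue length puts the process that is furthest behind first, which matches the idea that a lagging process points at the next bottleneck to investigate.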
The challenge of 2+ million connections per server
- They experienced a large influx of users, which is a good problem to have, but it also means spending money on additional hardware and dealing with the increased complexity of managing all those machines.
- You need to plan for traffic surges. Examples are football matches and earthquakes in Spain and Mexico. These happen close to peak load, so there must be enough spare capacity to absorb both the daily peaks and the spikes. A recent football match caused a 35% jump in outgoing messages right at the daily peak.
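The 35% spike figure above translates directly into a headroom requirement. The sketch below is a simplified model that assumes load scales linearly with message volume and that the spike lands exactly on the normal daily peak; the function names and the example peak load are hypothetical.

```python
def max_safe_utilization(spike_fraction: float) -> float:
    """Highest normal-peak utilization that still leaves room for the spike."""
    return 1.0 / (1.0 + spike_fraction)


def required_capacity(peak_load: float, spike_fraction: float) -> float:
    """Capacity needed to absorb a spike on top of the normal daily peak."""
    return peak_load * (1.0 + spike_fraction)


spike = 0.35  # the 35% jump in outgoing messages mentioned above

# To survive the spike, normal peak load must stay below this share of capacity.
print(f"Keep normal peak below {max_safe_utilization(spike):.0%} of capacity")

# Hypothetical normal peak of 1,000,000 outgoing messages per second.
print(f"Capacity needed: {required_capacity(1_000_000, spike):,.0f} messages/s")
```

In other words, under these assumptions a server that already runs above roughly three quarters of its capacity at the daily peak has no room left for an event of this size.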
Tools and technologies used to increase scalability
- Wrote a system activity reporting tool (wsar):
- It records statistics across the entire system, including OS, hardware and BEAM statistics. It was built so that metrics from other systems, such as virtual memory measurements, are easy to plug in. It tracks CPU utilization, overall utilization, user time, system time, interrupt time, context switches, system calls, traps, packets sent and received, total message count across process queues, busy port events, traffic rate, bytes in and out, scheduler statistics, garbage collection statistics, and so on.
- Initially, statistics were collected once a minute. As the load on the system grew, the polling period had to be shortened to one second, because events lasting less than a minute were invisible. The result is very fine-grained statistics that show how everything is performing.
- CPU hardware performance counters (pmcstat):
- Look at where the CPU spends its time. This shows how much time goes to running the emulator loop. In their case it is 16%, which means that only 16% of the time is spent executing emulated code, so even if all of the Erlang code execution time could be eliminated, it would save only 16% of the total run time. The implication is that efficiency gains have to be sought in other areas.
- dtrace, kernel lock-counting, fprof:
- Dtrace was used mostly for debugging, not for performance monitoring.
- They patched BEAM on FreeBSD to include CPU time stamps.
- They wrote scripts to produce an aggregated view across all processes, to see where time was being spent.
- The biggest win was compiling the emulator with lock counting turned on.
- Some problems:
- Early on, a lot of time was being spent in garbage collection routines; this was brought down.
- They saw some issues with the network stack, which were tuned away.
- Most of the problems came from lock contention, which showed up clearly in the output of the lock counter.
- Measurements:
- Synthetic workloads, that is, traffic generated by your own test scripts, are of little value for tuning huge user-facing systems.
- They worked well for simple interfaces, such as a user table, generating inserts and reads as quickly as possible.
- If a server can handle a million connections, it takes 30 hosts to open enough IP ports to generate that many connections to it. For a server holding two million connections, 60 hosts are required. It is hard to generate load at that scale.
- It is hard to generate the kind of traffic seen in production. You can guess at a normal workload, but in reality there are network events and world events, and because the system is multi-platform, client behavior varies by platform and by country.
- Tee'd workload:
- Take normal production traffic and pipe it to a separate system.
- This is very convenient for systems where side effects can be contained. You do not want to tee traffic and do something that would affect users' permanent state or result in multiple copies of a message being delivered to users.
- Erlang supports hot code loading, so under full production load you can come up with an idea, compile it, load the change while the program is running, and see instantly whether things got better or worse.
- They added knobs to change the production load dynamically and see how that affects performance. They would watch the sar output, looking at CPU utilization, memory utilization and queue overflows, and then turn the knobs to see how the system responded.
- Real loads:
- The ultimate test, exercising both input work and output work.
- Put a server in DNS two or three times so that it gets two or three times the usual traffic. This creates TTL problems, because clients do not respect DNS TTLs, so there is a delay and you cannot react quickly when a server is getting more traffic than it can handle.
- IPFW. Forward traffic from one server to another to give a host exactly the desired number of client connections. A bug caused a kernel panic, so this did not work very well.
- Results:
- They started at 200 thousand simultaneous connections per server.
- The first bottleneck showed up at 425 thousand. The system ran into heavy lock contention and work stopped. They instrumented the scheduler to measure how much time went to useful work, sleeping, or spinning on locks. Under load it began hitting sleeping locks, so only 35-45% of the CPU was in use across the system while the schedulers were at 95% utilization.
- The first round of fixes got them past a million connections.
- Memory utilization was at 76% and CPU at 73%. The BEAM emulator was running at 45% utilization, which closely matches the user-space percentage; that is good, because the emulator runs as user.
- Ordinarily, CPU utilization is not a good measure of how busy the system is, because the scheduler itself uses CPU.
- A month later, after tackling more bottlenecks, they reached two million connections per server.
- BEAM was using 80% of memory, close to the point where FreeBSD starts paging. CPU utilization was about the same, but with twice as many connections. The scheduler was hitting contention, but running pretty well.
- This looked like a good place to stop, so they started profiling the Erlang code.
- Originally there were two Erlang processes per connection. This was cut to one.
- They made small changes to timer handling.
- 2.8
- 3 , .
- , . , .
- BEAM . , , .
- 10 , , 600 , 15 40 , 41 .
- Findings:
- Erlang + BEAM + — SMP . . . , 85% CPU . .
- —
- , BEAM.
- BEAM.
- , .
- , , , , .
- . (BIF, build-in-function).
- , , , -.
- mseg — . .
- . , .
- , . , .
- .
- TSE FreeBSD 9 8. . , , .
- .
- Pmcstat , c PCB . -, .
- BEAM
- . , , , "" , , , . Erlang, c procinfo, .
- , .
- : 1, 10 100 . .
- .
- .
- Customization
- , "" .
- mseg, malloc.
- FreeBSD, . FreeBSD -. TLB CPU.
- .
- BEAM real-time , , cron . .
- , .
- Mnesia
- os:timestamp, erlang:now.
- , . , .
- .
Lessons
- , . , , , . , , , , , , , , . , , .
- , . . . . , , , , . , .
- . . . . , .
- Erlang rocks! Erlang , , . , .
- . , , .
- . , . WhatsApp . WhatsApp , - , . . .
- - . WhatsApp , , , . . , , , .
- — . , , . -. .
- . , WhatsApp, Twitter Facebook 2009 - , , , .
- , . , ejabberd. , Erlang. , Erlang .
- . , , , . , .
- . , , , , .
- , . , WhatsApp , 10000 . , 1000 . , , , .
- . Skype, , .
- On Hacker News
- Keynote: Benedict Evans, InContext 2014
- Whatsapp and $19bn
- WhatsApp's blog: The telling diary of a 16 billion dollar startup
- Erlang Github
- WhatsApp :
- WhatsApp: The inside story
- The Open Source projects used at WhatsApp
- Whatsapp, Facebook, Erlang and realtime messaging: It all started with ejabberd
- Quora: How does WhatsApp Work? , How does WhatsApp work out of mobile network? , How did WhatsApp grow so big?
- WhatsApp is broken, really broken
- WhatsApp CEO Jan Koum Hates Advertising and the Tech Rumor Mill (Full Dive Video)
- Singapore is progressively doing business over WhatsApp. Are You?
- Four Numbers That Explain Why Facebook Acquired WhatsApp
- Announcement from Mark Zuckerberg
- A Million-user Comet Application with Mochiweb, Part 3
- Inside Erlang, The Rare Programming Language Behind WhatsApp's Success
- WhatsApp Is Actually Worth More Than $19B, Says Facebook's Zuckerberg, And It Was Internet.org That Sealed The Deal
- Facebook buys Whatsapp for $19 billion: Value and Pricing Perspectives
- Facebook's $19 Billion Craving, Explained By Mark Zuckerberg
- IMHO: Lessons learned from WhatsApp
- You May Not Use WhatsApp, But the Rest of the World Sure Does
- The WhatsApp Story Challenges Some Of The Valley's Conventional Wisdom
- What WhatsApp Did Right, According to Jan Koum (Video)
- Why did Facebook buy WhatsApp?
- Can Someone Explain WhatsApp's Valuation To Me?
- Google's Unusual Offer to WhatsApp