
Cloud Bottlenecks: Pokémon Go and Trivia Crack Stories


The lesson: "A system that works with two million users will not necessarily be able to cope with ten million."

After its US release in July 2016, Pokémon Go became the most popular augmented reality game of its time. The product is the result of a multi-year collaboration between game developer Niantic and Google (until Niantic got on its feet, it was an internal Google startup). As a result, the Pokémon Go infrastructure depended heavily on Google's cloud platform and application services. (Nintendo and The Pokémon Company also took part in creating exciting mobile gameplay built around raising small monsters.)

This was not Niantic's first augmented reality game. The company had previously created Ingress, a game about an alien invasion, released in 2013 for Android devices. But Pokémon Go was a game on a completely different level: Pokémon has long been a cultural phenomenon, and the audience had been waiting for a mobile game for years. As a result, the number of installations grew rapidly. Within half a day the game took first place by revenue on the iPhone. By some measures, it was the largest mobile game launch in the world.

But success increased the load on the platform: two days after the release, Niantic CEO John Hanke announced that the company was postponing the global rollout of Pokémon Go because of overloaded servers. At the same time, privacy concerns arose over the way Niantic worked with Google's identity and location services. The company had to fix many bugs while simultaneously solving problems with server capacity.

In theory, the cloud should cope with peak loads and simplify application management, and the services offered by cloud providers should simplify the development of all kinds of mobile applications (not just games). And cloud capabilities really have made it easier to use new features that require a lot of computing power (such as augmented reality).

But, as with other network platforms in the past, developers found that all this capacity matters little if there is no way to connect to it. The more interactive a mobile application is, the harder it becomes to exchange data between mobile devices and the cloud infrastructure. Add the differences in data transfer speeds between telecom operators around the world, and you get a system in which many parameters must be taken into account to deliver the speed users need.

Few applications take off the way Pokémon Go did when it went viral on a global scale. Developers who want to scale games and applications will find it useful to study how Niantic and other game developers coped with unexpected success. If hit mobile games can overcome the obstacles encountered when testing and debugging the performance of the interaction between devices and the cloud, then corporations can also solve problems with unexpected peaks in user activity.

We will process all of them


To learn the lessons of Pokémon Go, Ars Technica recently spoke with Niantic Technical Director Phil Keslin. We talked about the complex interactions between the open-source parts of the Google cloud and internal data.

Pokémon Go uses Google Compute Engine, Google's cloud storage, and a full stack of networking technologies, including data processing and query infrastructure. Of course, the game also uses Google Maps to determine the player's location. According to Keslin, every gameplay change requires the mobile client to make calls to the Niantic data store. “Each time you change the state of the game, by throwing a Poké Ball, catching a Pokémon, or taking some other action, it interacts with the data store.”

When the first big peak arrived, according to Keslin, “Google didn't even notice, but the game at least doubled the amount of data being processed.” However, this did not overload the system. “The easiest way to put it: we had a worst-case forecast, and the game surpassed even that.” On release day there was a real explosion. “We found bottlenecks that slowed down performance. After eliminating them, we ran into the next bottleneck.”

Some of the bottlenecks were in Niantic's own code, “but we also had problems with a couple of open source libraries, which we didn't expect, and those were the hardest to solve.” In total, Niantic found five or six bottlenecks, and each took one or two days to eliminate.

But problems also arose on Google's side. Pokémon Go ran into trouble with the cloud infrastructure: the container engine contained subsystems that had never been tested under such a load. There were also a couple of problems with the network stack.

Removing the bottlenecks required a lot of work from a team of five: Keslin, the team leader, and three service engineers. “In the first two weeks, we barely slept,” says Keslin. “The guys from Google gave it their all, too.”

Another factor affecting performance in different regions, according to Keslin, was the difference between mobile operators in different parts of the world. “We designed Pokémon Go so that the game could work on mobile devices with low bandwidth. The problems that arose had more to do with the marketing programs of telecom operators.” For example, a large mobile operator in the Philippines gave all its subscribers free access to Pokémon Go, so Niantic needed to ensure that this access was switched on for those users and switched off once the promotion ended.

Despite the initial chaos, Keslin said that Niantic did not have to change the architecture of Pokémon Go after the game was released. (The company continues to streamline the application and is preparing to release the second generation of Pokémon Go, which, it hopes, will give the game a second wind after the initial Pokémon wave has subsided.) “The core of the system will remain the same; we are just adding new gameplay. We were lucky that we managed to create a scalable system, even though we did not test it. Fortunately, the architecture we built scales reliably.”

What advice does Keslin have for other developers hoping to create the next augmented reality phenomenon? “Think about scaling from the start. Our game development team focused on performance. Thanks to this, we were able to maximize performance at low cost and were able to scale the system.”

Addicted to Trivia Crack




Other companies have also run gaming infrastructure on a public cloud, with mixed results. Two years before Pokémon Go hit the jackpot, the Argentinean company Etermax created its own mobile gaming hit: Trivia Crack, one of a series of competitive games running on Amazon Web Services.

According to Etermax technical director Gonzalo Garcia, the tremendous success of Trivia Crack came in two waves. The first, in March 2014, marked the game's success in parts of South America; during it, traffic grew from 100 thousand to 10 million daily active users. The second wave began when the game became popular in the USA in October of the same year, bringing the number of daily active users to 25 million.

“We did not foresee this,” says Garcia. “According to our estimates and tests, we knew that we could cope with one million players, and we had planned for two million. But we did not expect such growth. We had not even invested that much in advertising! Without cloud infrastructure, we would never have managed it.”

“We thought we were releasing just another of the company's games,” adds Etermax Director of Information Technology Martin Dominguez. “Previously, our limit was one million users, but what works for two million is not always enough for ten million.”

Such a load put stress on the company's Agile development process. “We thought of ourselves as not tied to Scrum, and we always worked in an Agile style,” says Dominguez. “The problem was that with such a jump in the number of users, sprints could not be completed in two weeks; we had to work day to day.”

To cope with its popularity, Etermax first had to drop some features. It also had to change some databases and adapt its processes to increase efficiency. In particular, Etermax changed how it used Amazon Relational Database Service (RDS): it implemented sharding, secondary servers, and data exchange between secondary servers. It also worked with AWS to solve a problem with the number of packets per second.
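The article does not describe how Etermax split its data, so the following is only a minimal sketch of one common approach, hash-based sharding by user ID. The shard names and the `database_for_user` helper are hypothetical and chosen purely for illustration.

```python
import hashlib

# Hypothetical shard names; a real deployment would hold connection details.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for_user(user_id: str, shard_count: int) -> int:
    """Map a user ID to a shard index in the range [0, shard_count)."""
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for_user(user_id: str) -> str:
    """Return the name of the shard that stores this user's data."""
    return SHARDS[shard_for_user(user_id, len(SHARDS))]


if __name__ == "__main__":
    for uid in ("player-1", "player-2", "player-3"):
        print(uid, "->", database_for_user(uid))
```

Simple modulo routing like this forces data to move whenever the number of shards changes; schemes such as consistent hashing reduce that cost, at the price of more complexity.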

To get support for advanced networking capabilities, Etermax had to migrate from Amazon's public network to a virtual private cloud. “It was quite difficult,” Dominguez recalls. “At the second user peak, which brought the number to 25 million, AWS employees said they had never seen a company use RDS at that level.”

Garcia said that one feature helped Etermax cope with the explosive growth in Trivia Crack's popularity. “In Trivia Crack, the interface lived almost entirely on the mobile device, so only small amounts of information were transmitted.” Thanks to this, synchronous connections were less of a problem for Trivia Crack than for other Etermax games such as the mobile bingo game Bingo Crack: “We constantly needed to send information about the bingo balls to know who would get the first, second, or third Bingo!”

What did the success of Trivia Crack teach Etermax? According to Dominguez, when it comes to infrastructure, you need to think quickly, be proactive about making changes, and anticipate the problems that will arise. “Pokémon Go had the same problem: it needed to survive every day.”

To prepare for such trials and be ready for the next popular hit, Etermax grew its core development team from three to ten engineers.

Success management


Patric Palm is the technical director and founder of the Swedish company Hansoft, which created the Favro collaboration software used by many game developers, including Ubisoft and id Software. Palm has the opportunity to observe how the company's customers solve their problems in the mobile gaming industry.

Considering the problems that Pokémon Go encountered, Palm highlights the differences in connection speeds in different regions of the world. However, he emphasizes that thanks to cloud computing, scalability is now less of an obstacle.

“A few years ago, scalability was a much more serious problem, and it took many more people to solve.” Since Niantic was able to hand some of those problems over to Google Cloud, it could focus on implementation and registration in various countries. “Cloud systems solve one of the biggest business problems,” Palm said.

With the scalability problem solved, attention shifted to other, smaller problems. In the case of Pokémon Go, one of them was the rapid draining of the phone's battery, which forced players to run around the city with external batteries. “Niantic's game drained batteries dramatically,” says Palm. “Cloud developers are now thinking a lot more about power consumption.”

Another related problem is the data limits of different tariff plans. “Not every user has a good mobile data plan.” This is another incentive to shift most of the burden to the server side. Games running in the background consume not only energy, but also data.


"We will need a bigger cloud."

Rules of the game


If Pokémon Go and Trivia Crack were able to cope with the problems of scaling cloud computing for mobile platforms, then so can you. Here are some tips for evaluating your own challenges:

  1. Consider the worst-case scenario. Of course, any estimate has to be based on some chosen metrics. But you also need to plan how to scale beyond the maximum you expect. The main advantage of cloud computing is its elasticity, so think about what will happen if the load grows faster than anything you have tested.
  2. Discuss abnormal situations with your service provider. As traffic grew, Niantic and Etermax had to work closely with Google and Amazon. When choosing a cloud provider, discuss specifically what services it can offer if your needs grow dramatically beyond expectations.
  3. Study telecom operators and mobile devices. If your cloud technology will run on users' own mobile devices (perhaps in different parts of the world), consider how much data will be transferred between device and cloud, and evaluate the tariff plans and slowest connections of the most problematic operators your product will have to deal with.
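Tips 1 and 3 come down to simple arithmetic, which can be sketched as follows. This is only an illustration: the user counts come from the Etermax story above, while the request rate, per-server capacity, headroom factor, payload size, and link speed are assumed values, not figures from either company.

```python
import math


def overload_factor(actual_users: int, tested_users: int) -> float:
    """How many times the real load exceeds the load you tested for."""
    return actual_users / tested_users


def servers_needed(peak_requests_per_second: float,
                   requests_per_server: float,
                   headroom: float = 2.0) -> int:
    """Servers required to serve the peak with a safety margin (headroom)."""
    return math.ceil(peak_requests_per_second * headroom / requests_per_server)


def transfer_seconds(payload_bytes: int, link_kbit_per_second: float) -> float:
    """Time to move one payload over a link, ignoring latency and overhead."""
    return payload_bytes * 8 / (link_kbit_per_second * 1000)


if __name__ == "__main__":
    # Etermax planned for 2 million users and peaked at 25 million.
    print(overload_factor(25_000_000, 2_000_000))   # 12.5
    # Assumed: 9000 requests/s at peak, 500 requests/s per server.
    print(servers_needed(9000, 500))                # 36
    # Assumed: 50 kB response on a 100 kbit/s connection.
    print(transfer_seconds(50_000, 100))            # 4.0
```

Running the same calculation with the slowest connection you expect to support shows quickly whether a payload size is acceptable for your most constrained users.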

Source: https://habr.com/ru/post/321196/

