Scaling to 100 million: architecture defined by service level

This is the third part of the Wix “Scaling to 100 Million Users” series. See also the introduction and the second post.

Wix started with a single server that provided all the functionality: user registration, website editing, serving published websites, and image upload. At the time, a single server was the right decision, because it let us grow rapidly and develop in a flexible way. By 2008, however, we began to run into recurring deployment problems that led to unplanned downtime, both in the creation of new sites and in the serving of existing ones.

Deploying a new version of our system sometimes required a change to the MySQL schema. Since Hibernate does not tolerate discrepancies between the schema it expects and the actual database schema, we followed a common deployment practice: a planned two-hour outage during the period of least traffic (midnight in the US on a weekend). During this scheduled outage we had to stop the service, shut down the server, change the MySQL schema, deploy the new version, and restart the server.
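The coupling between application version and database schema can be illustrated with a minimal sketch. It is a simplified analogy, not Hibernate's actual mechanism: it uses Python's built-in `sqlite3` module instead of MySQL, and the names `EXPECTED_SCHEMA`, `missing_columns`, `start_server` and the `sites` table are all hypothetical.

```python
import sqlite3

# Hypothetical mapping the application expects: table name -> required columns.
EXPECTED_SCHEMA = {"sites": {"id", "owner", "title"}}


def missing_columns(conn, expected):
    """Return {table: missing column names} for every table that falls short."""
    problems = {}
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        actual = {row[1] for row in rows}
        missing = columns - actual
        if missing:
            problems[table] = missing
    return problems


def start_server(conn):
    """Refuse to start when the database does not match the expected schema."""
    problems = missing_columns(conn, EXPECTED_SCHEMA)
    if problems:
        raise RuntimeError(f"schema mismatch: {problems}")
    return "started"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, owner TEXT)")

try:
    start_server(conn)
except RuntimeError as error:
    print(error)  # the old schema lacks the 'title' column

# The schema change that had to happen during the maintenance window.
conn.execute("ALTER TABLE sites ADD COLUMN title TEXT")
print(start_server(conn))
```

In this sketch the new application version cannot start against the old schema, and the old version would face the same problem in reverse after a rollback, which is why each deployment and each rollback needed its own schema change.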
This planned two-hour outage often turned into something more complicated because of problems that came up during deployment. In some cases, changing the MySQL schema took much longer than planned (altering large tables, rebuilding indexes, removing constraints to allow data migration, and so on). Sometimes the server failed to start after the schema change because of an unforeseen problem with the deployment, the configuration, or the schema. And in some cases the new version of our software turned out not to work at all, so to restore service we had to change the MySQL schema again (to match the previous version) and redeploy the previous version of the system.

But the worst case was when, a few days after a “successful” deployment, we found a critical, if rare, bug in the new version that damaged user sites. The best option then was to roll back to the previous version (at least until the bug was fixed), which required changing the schema again and therefore meant an unplanned outage.
It is important to note that, since a single server application served all Wix systems, an outage affected the entire service, including published sites. As our user base grew, more and more sites were affected by our planned and unplanned outages.

And then it dawned on us

Wix performs two different functions: serving existing sites and creating new ones. Downtime in site creation has a direct impact on our daily sales, but downtime in serving existing sites has a direct impact on our existing, paying users. We needed to define and provide a different level of service for each function.
Further analysis showed that most of our innovation concerned the building of new sites, and only a small number of changes concerned serving existing sites. This meant that our frequent software releases put both site building and already published sites at risk, even when the changes affected only site building.

Realizing this, we divided our system into two segments: the editorial segment, which is responsible for building new sites, and the public segment, which is responsible for serving sites. This allowed us to provide a different level of service for each business function.

The technology stack chosen for the public segment was intentionally simple. We dropped Hibernate, gave up any form of cache, and started using Spring MVC 3.0. An important design principle was to decouple the two segments in terms of software, release cycles and data storage, and to keep the public segment's software stack easy to understand and optimized for serving sites.

A clear expression of this decoupling was the publication process (descendants of which are still present in the Wix core), which copied data from the editorial database to the public segment. During this process, the data structures were transformed from a form that is efficient for editing into a form best suited to serving a published site.
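The idea behind such a publication step can be sketched as follows. This is a minimal illustration under stated assumptions, not Wix's implementation: the two stores are plain in-memory dictionaries, and the document layout (`components`, `undo_history`, `blocks`) as well as the functions `to_published_view` and `publish` are hypothetical.

```python
import copy

# Hypothetical editor-side store: rich, edit-friendly documents.
editor_db = {
    "site-1": {
        "title": "My Shop",
        "components": {
            "c1": {"type": "text", "value": "Welcome", "order": 2},
            "c2": {"type": "image", "value": "logo.png", "order": 1},
        },
        "undo_history": [{"op": "move", "target": "c2"}],
    }
}

# Hypothetical public-side store: only what is needed to serve a site.
public_db = {}


def to_published_view(document):
    """Transform an edit-friendly document into a read-optimized one."""
    ordered = sorted(document["components"].values(), key=lambda c: c["order"])
    return {
        "title": document["title"],
        # Components are flattened into render order; editing metadata is dropped.
        "blocks": [{"type": c["type"], "value": c["value"]} for c in ordered],
    }


def publish(site_id):
    """Copy one site from the editor store into the public store."""
    snapshot = copy.deepcopy(editor_db[site_id])
    public_db[site_id] = to_published_view(snapshot)


publish("site-1")

# Later edits do not reach the public store until the next publish.
editor_db["site-1"]["title"] = "Work in progress"
print(public_db["site-1"]["title"])  # prints "My Shop"
```

Because the public store holds its own transformed copy, the serving side in this sketch shares no data structures or storage with the editing side, so either one can change without affecting the other between publications.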

As a result, deployments of the public segment became rare and low-risk. It still runs in the Wix system six years after its first deployment (although some things have changed since then).

What we have learned

We already understood that every release cycle carries risk, but now we realized that our two key business functions, building websites and serving them, carry different levels of risk. We realized that we need to provide different levels of service for these functions, and that our system has to be built around them. What are these different levels of service? We considered the following aspects: availability, performance, risk of change, and recovery time after a failure. The public segment, which affects all users and all Wix sites, should have the highest level of service on all of these aspects. In the editorial segment, by contrast, a failure affects only the users who are building a site at that moment, so the consequences for the business are less significant. There we traded some level of service for greater flexibility, which reduced the effort required for development.

Chief Software Architect for Wix Website Designer,
Yoav Abrahami
Original article: Wix engineers blog

Source: https://habr.com/ru/post/282045/
