This is the second part of the “Scaling Wix to 100 million users” series; the introduction can be read here.
When we started Wix, our stack was Tomcat, Hibernate, and Ehcache with a MySQL database and a Flash frontend. Why did we choose this stack? Simply because our first backend developer already had experience with it. Part of that architecture was Ehcache, an excellent cache library for Hibernate and the JVM, which exposes the in-memory cache through a map-like abstraction and can also be configured as a distributed cache. Unlike Memcached, Ehcache runs as part of the JVM process and replicates the cache state exactly across all nodes in the cluster. Note that at the time (around 2006–2008) Ehcache was still an independent open-source project and not yet part of Terracotta (under Terracotta the replication and distribution model may be different, but that is not important for this article).
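To illustrate the map-like abstraction mentioned above, here is a minimal sketch using the Ehcache 2.x API; the cache name, key, and value are purely illustrative and not taken from Wix's code.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheExample {
    public static void main(String[] args) {
        // Create a cache manager (picks up ehcache.xml if present, otherwise defaults)
        CacheManager cacheManager = CacheManager.create();
        cacheManager.addCache("siteDefinitions");            // illustrative cache name
        Cache cache = cacheManager.getCache("siteDefinitions");

        // The cache behaves much like a map: put a value, read it back
        cache.put(new Element("site-123", "{ \"pages\": [] }"));
        Element hit = cache.get("site-123");
        if (hit != null) {
            String definition = (String) hit.getObjectValue();
            System.out.println(definition);
        }

        cacheManager.shutdown();
    }
}
```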
Cache Usage Aspects
Since we already had real customers, we ran two Tomcat servers for redundancy. Following what was considered good architectural practice, we set up a distributed Ehcache cluster between the servers. Our reasoning was that MySQL is slow (like any other SQL database, or so we thought), so an in-memory cache would give much faster reads and reduce the load on the database.
But what happens when we run into a data problem, such as data corruption or data pollution? Let's call such a problem an “invalid state”. One such case occurred when we rolled out a version of the Wix Editor that produced incorrect site definitions. The symptom was broken user sites: users could neither view nor edit their sites. Fortunately, because we had (and still have) a very large user base, users discovered the problem immediately and reported it. We rolled back the problematic version and fixed the corrupted definition files in the database. Unfortunately, even after we had corrected the data everywhere it was stored, users kept complaining about broken sites. The reason was that we had only fixed the invalid state stored in the database, forgetting that the cache also holds copies of our data, including the corrupted documents.
Ehcache is something of a “black box”: a Java library with no SQL-like query interface and no management application for inspecting the cache contents. Since we had no easy way to “look inside” the cache, we could not diagnose and analyze it when we ran into the data corruption (note that some other cache solutions do ship management applications that turn them into a white box, but we had not worked with any of them). Once we realized that the invalid state was apparently also preserved in the cache, we first had to fix it in the database. Both application servers held the invalid state in their caches, so we stopped one of the servers to clear its in-memory cache and started it again. But because the cache was distributed, as soon as the server came back up, its in-memory cache was replicated from the second application server, and we ended up with the invalid state again. Restarting the second server at that point would not have helped either: it would simply receive the invalid state replicated from the first server. The only way to get rid of the invalid state was to stop and restart both servers at the same time, which caused a short downtime of our service.
What about cache invalidation?
Because we used Ehcache, which has an invalidation-management API, we could have written dedicated code that told both servers to invalidate certain cache entries (an “invalidation switch”). But if we had not prepared such a switch for a particular type of data, we would again have had to restart both servers simultaneously to get rid of the invalid state. Of course, we could have built a management application for Ehcache with the ability to view and invalidate data. But at the moment this decision had to be made, we asked ourselves: “Do we really need a cache?”
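As a rough idea of what such an “invalidation switch” could look like, here is a sketch built on the Ehcache 2.x API; the class and cache names are hypothetical, not the code Wix actually wrote. With a replicated cache, whether a removal propagates to the other node depends on how the cache's event replication is configured.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;

// Hypothetical "invalidation switch": an operator-triggered hook that can evict
// a single suspect entry or flush a whole cache on this node.
public class CacheInvalidationSwitch {
    private final CacheManager cacheManager;

    public CacheInvalidationSwitch(CacheManager cacheManager) {
        this.cacheManager = cacheManager;
    }

    // Evict one entry, e.g. a corrupted site definition
    public void invalidateEntry(String cacheName, Object key) {
        Cache cache = cacheManager.getCache(cacheName);
        if (cache != null) {
            cache.remove(key);
        }
    }

    // Flush the entire cache for this data type
    public void invalidateAll(String cacheName) {
        Cache cache = cacheManager.getCache(cacheName);
        if (cache != null) {
            cache.removeAll();
        }
    }
}
```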
First we checked the MySQL statistics. It turned out that when MySQL is used correctly, read operations take fractions of a millisecond even on large tables. Today we have tables with over 100 million rows, and we read from them in fractions of a millisecond. We achieved this by giving the MySQL process enough memory to work with the disk cache and by reading individual rows by primary key or index, without joining tables (JOIN). In the end, we realized that we did not need a cache. In fact, in most cases where people use a cache, there is no real need for it. We believe that a cache is not part of the architecture; it is one possible solution to a performance problem, and not the best one.
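The access pattern described above, reading a single row by primary key with no JOINs, might look like the following JDBC sketch; the table and column names are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SiteDefinitionDao {
    // Read one row by primary key, with no JOINs, so MySQL can serve it
    // from memory (the buffer pool) in a fraction of a millisecond.
    public String loadDefinition(Connection connection, long siteId) throws SQLException {
        String sql = "SELECT definition FROM site_definitions WHERE site_id = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, siteId);
            try (ResultSet resultSet = statement.executeQuery()) {
                return resultSet.next() ? resultSet.getString("definition") : null;
            }
        }
    }
}
```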
Our recommendations for using a cache are:
1. You do not need a cache.
2. No, you really don't.
3. If you still have a performance problem, find its source. What is slow, and why is it slow? Can you change the architecture so it isn't slow? Can you optimize the data for reading?
If you need to use a cache, consider the following:
• How will you handle cache invalidation?
• How will you view the cached data (black box or white box)?
• How will the system handle a cold start? Can it cope with traffic while the cache is still empty? (See the sketch after this list.)
• What performance degradation does using the cache introduce?
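As one way to think about the first and third questions, here is a minimal read-through sketch on top of the Ehcache 2.x API: a miss falls back to a loader (for example, a primary-key lookup in MySQL), so an empty cache after a cold start still serves traffic, just more slowly, and an explicit evict method serves as the invalidation hook. The class and loader are assumptions for illustration, not Wix's actual code.

```java
import java.util.function.Function;
import net.sf.ehcache.Cache;
import net.sf.ehcache.Element;

public class ReadThroughCache<K, V> {
    private final Cache cache;
    private final Function<K, V> loader;   // fallback read, e.g. a primary-key query

    public ReadThroughCache(Cache cache, Function<K, V> loader) {
        this.cache = cache;
        this.loader = loader;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        Element hit = cache.get(key);
        if (hit != null) {
            return (V) hit.getObjectValue();   // warm path: serve from memory
        }
        V value = loader.apply(key);           // cold start / miss: read the source of truth
        if (value != null) {
            cache.put(new Element(key, value));
        }
        return value;
    }

    // Invalidation hook: evict a single entry when its source data changes
    public void evict(K key) {
        cache.remove(key);
    }
}
```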
Yoav Abrahami,
Chief Software Architect, Wix

Original article: the Wix engineering blog