
The story of one MySQL optimization

This is the story of one optimization in a MySQL database.

This happened while we were building a system for email newsletters. The system was supposed to send tens of millions of emails per day. Sending an email is not an easy task, even though the flow looks rather primitive:

  1. Build the email from an HTML creative and substitute personalized data.
  2. Add a tracking pixel and replace every link in the email with our own, so that clicks can be tracked.
  3. Check before sending that the address is not blacklisted.
  4. Send the email through a specific pool.

Let me tell you more about the second step. The mail-builder microservice prepares an email for dispatch by rewriting its links.

Thus, all of the original links are replaced by UUIDs, and the domain is changed to ours. On a GET request to such a link we proxy the original image or redirect to the original URL. The mapping is saved to a MySQL database: we store the generated UUID together with the original link and some meta information (the user's email address, the message id and other data). This denormalization lets us fetch, in a single query, everything needed to record statistics or to start some trigger chain.
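
Roughly, this step can be sketched in Python like so (the names shorten_links, TRACKING_DOMAIN and the dict standing in for the MySQL table are made up for illustration):

import re
import uuid

TRACKING_DOMAIN = "https://track.example.com"  # hypothetical tracking domain


def shorten_links(html: str, meta: dict, storage: dict) -> str:
    """Replace every href in the email body with a tracking link.

    `storage` stands in for the MySQL table: uuid -> original link plus
    the denormalized meta information (user email, message id, ...).
    """
    def repl(match: re.Match) -> str:
        original = match.group(1)
        link_id = uuid.uuid4().hex  # 32 hex characters, fits varchar(32)
        storage[link_id] = {"url": original, **meta}
        return f'href="{TRACKING_DOMAIN}/{link_id}"'

    return re.sub(r'href="([^"]+)"', repl, html)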

Problem number 1


UUID generation depended on the timestamp.

Since mailings usually go out within a certain window of time, and many instances of the mail-builder microservice run in parallel, some of the UUIDs came out very similar. This gave low selectivity. UPD: because the values were similar, working with the B-tree index was not very effective. This implicit detail degraded index performance.

We solved the problem by generating UUIDs with Python's uuid module, using a variant with no time dependence.
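
In standard-library terms this is the difference between time-based uuid1 and random uuid4; the article does not name the exact calls, so treat this as a sketch:

import uuid

# uuid1 is built from a 60-bit timestamp, a clock sequence and the host
# MAC address: ids generated during one mailing differ only in a few
# timestamp characters and are otherwise almost identical.
print(uuid.uuid1().hex)
print(uuid.uuid1().hex)

# uuid4 is random, so consecutive ids land uniformly across the key
# space and spread out over the B-tree index.
print(uuid.uuid4().hex)
print(uuid.uuid4().hex)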

How was the data stored?

The table structure was as follows:

CREATE TABLE IF NOT EXISTS `Messages` (
    `UUID` varchar(32) NOT NULL,
    `Message` json NOT NULL,
    `Inserted` DATE NOT NULL,
    PRIMARY KEY (`UUID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

At the time of creation everything looked logical: UUID is the primary key, and therefore also the clustered index. When we select by this field, we fetch the record directly, because all the values are stored right there in the index. This was a deliberate decision (see the MySQL documentation on clustered indexes).
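
In InnoDB the row data lives right in the leaves of the primary-key B-tree, so a lookup by UUID is a single index dive:

SELECT `Message`, `Inserted`
FROM `Messages`
WHERE `UUID` = ?;  -- one clustered-index lookup, no separate row fetch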

Everything was great until the table grew.

Problem number 2


If you read more about the clustered index, you learn about this nuance: when a new row is added to the table, it is inserted not at the end of the file or of some flat list, but into the branch of the tree structure where it belongs in sort order. Thus, as the load increased, the insertion time grew.

The solution was to use a different table structure.

CREATE TABLE IF NOT EXISTS `Messages` (
    `ID` int(11) NOT NULL AUTO_INCREMENT,
    `UUID` varchar(32) NOT NULL,
    `Message` json NOT NULL,
    `Inserted` DATE NOT NULL,
    PRIMARY KEY (`ID`),
    UNIQUE KEY `UUID` (`UUID`, `Inserted`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Since the primary key is now auto-incrementing, and MySQL caches the last insertion point, inserts always happen at the end: InnoDB is optimized for writing sequentially increasing values.

I found the details of this optimization in the PostgreSQL source code; MySQL implements a very similar one. We did, of course, have to add a unique key to prevent collisions, but we gained insertion speed.

As the table grew even larger, we started thinking about deleting old data. Using DELETE over the Inserted field is absolutely suboptimal: it takes a very long time, and the space is not reclaimed until an OPTIMIZE TABLE command is executed; that operation, by the way, completely locks the table, which did not suit us at all.

Therefore, we decided to split the table into partitions: one day, one partition, with old partitions dropped automatically when their time comes.
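
The article does not show the DDL, so here is a rough sketch of a day-per-partition scheme. MySQL requires the partitioning column to be part of every unique key, which is why the UUID unique key above already includes Inserted; extending the primary key the same way is our assumption:

-- Every unique key must include the partitioning column:
ALTER TABLE `Messages`
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (`ID`, `Inserted`);

-- One day = one partition:
ALTER TABLE `Messages`
    PARTITION BY RANGE (TO_DAYS(`Inserted`)) (
        PARTITION p20190424 VALUES LESS THAN (TO_DAYS('2019-04-25')),
        PARTITION p20190425 VALUES LESS THAN (TO_DAYS('2019-04-26')),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );

-- Dropping an expired partition frees its space immediately, with no
-- long DELETE followed by OPTIMIZE TABLE:
ALTER TABLE `Messages` DROP PARTITION p20190424;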

Problem number 3


We could now delete old data, but we lost the ability to select from the right partition: in a SELECT we specify only the UUID, so MySQL does not know which partition to look in and searches all of them.

The solution grew out of Problem #1: embed the timestamp into the generated UUID. This time we acted a little differently: we inserted the timestamp at a random position inside the string, neither at the beginning nor at the end, and surrounded it with dash characters so that it could be extracted with a regular expression.

With this optimization we could recover the date on which a UUID was generated and issue the SELECT with a concrete value of the Inserted field. Now we read the data straight from the partition we need.
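
A sketch of the idea in Python; the exact production format is not given in the article, so the layout and names are assumptions:

import random
import re
import time
import uuid
from datetime import date


def make_uuid() -> str:
    """Random id with the generation timestamp hidden at a random
    position inside it, fenced by dashes. The result is longer than
    32 characters, so the column has to be widened accordingly."""
    body = uuid.uuid4().hex                 # hex only, contains no dashes
    stamp = str(int(time.time()))
    pos = random.randint(1, len(body) - 1)  # not at the start or the end
    return f"{body[:pos]}-{stamp}-{body[pos:]}"


def extract_date(link_id: str) -> date:
    """Pull the timestamp back out with a regular expression."""
    stamp = re.search(r"-(\d+)-", link_id).group(1)
    return date.fromtimestamp(int(stamp))


link_id = make_uuid()
inserted = extract_date(link_id)
# Both values go into the query, so MySQL prunes the search down to a
# single partition:
#   SELECT * FROM Messages WHERE UUID = %s AND Inserted = %s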

Also, thanks to ROW_FORMAT=COMPRESSED and switching the encoding to latin1, we saved even more disk space.
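
Both changes fit into ordinary ALTER TABLE statements; a sketch, assuming the stored strings actually fit into latin1 (the JSON column keeps its own character set regardless):

-- Compressed row format needs innodb_file_per_table (and, on older
-- MySQL versions, the Barracuda file format):
ALTER TABLE `Messages` ROW_FORMAT=COMPRESSED;

-- The UUID strings are plain ASCII, so the single-byte latin1 is enough:
ALTER TABLE `Messages` CONVERT TO CHARACTER SET latin1;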

Source: https://habr.com/ru/post/451020/

