This article is about optimizing the deletion of files from a sharded system. The project is a service for sharing and collaborating on files. It started as a startup about 8 years ago, then took off and was sold several times. Four developers have been with the project from the very beginning, which is very valuable. The documentation, traditionally, either never got written or is out of date.
Why should you read this, and why did I write it all? I want to talk about the rakes that lie hidden inside the system and hit you hard enough to see stars.
I want to say a big thank-you to Hanna_Hlushakova for working on this together, seeing the project through to the end, and helping prepare the article. You will mostly find a description of the problem and of the algorithm we used to solve it; there are no code samples, data structures, or other such specifics. I don't know whether my experience will help you avoid the rakes, but I hope you find something useful. Or perhaps this article will be an absolutely irretrievable waste of your precious time.

About the project
The project is one of the leaders in its Gartner Magic Quadrant, has client companies with more than 300 thousand employees in the US and Europe, and several billion files to maintain.
The project runs on the Microsoft stack: C#/.NET application servers and MS SQL databases, 14 active database servers plus 14 more in data-mirroring mode.
The databases are up to 4 TB each, with a sustained load during business hours of about 400 thousand requests per minute.
The database has a lot of business logic:
450 tables
1000 stored procedures
80,000 lines of SQL code
Documentation traditionally either did not have time to write, or it is not relevant.
About the task
The task was to rework the deletion from storage of files that clients have already deleted and whose retention period (kept in case the client wants to restore them) has expired. By the company's own estimate, files deleted a year earlier were still sitting on the servers, although under the business rules they should have been kept for only 1 month. Since some of the files are stored on S3, the company paid for the extra data, and customers using On-Premises storage wondered why the files took up more space than they should.
The databases are sharded by client company.
How did deletion work before?

On the global server, which holds information about every file in the system, ranges of 15 thousand file identifiers were formed.
Then, on a schedule, the shards were polled range by range.
Each shard received the boundaries of a range.
The shard responded with the files it found inside the range.
The main server added the files missing from the shards to the deletion queue table in its database.
Then the physical-deletion service fetched a batch of identifiers from the queue table, confirmed that it was about to delete those files, and a check was run on all shards: are any of these files still in use there?
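In pseudocode, the old pipeline looked roughly like this (a simplified in-memory sketch of my reading of the scheme; all names are invented, and the real system did this with SQL JOBs across servers):

```python
RANGE_SIZE = 15_000  # identifiers per range, as in the original scheme

def make_ranges(max_file_id, range_size=RANGE_SIZE):
    """Split the global id space into [lo, hi) ranges."""
    return [(lo, min(lo + range_size, max_file_id))
            for lo in range(0, max_file_id, range_size)]

def find_deleted(shards, global_files):
    """Walk every range over every shard and collect the files that no
    shard still holds: the O(total files) scan that became the bottleneck."""
    to_delete = []
    for lo, hi in make_ranges(max(global_files) + 1):
        candidates = {f for f in global_files if lo <= f < hi}
        alive = set()
        for shard in shards:  # each shard is polled per range
            alive |= {f for f in shard if lo <= f < hi}
        to_delete.extend(candidates - alive)  # missing on every shard
    return sorted(to_delete)
```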
As the number of files grew, this approach became very slow: there were several billion files, and the number of ranges grew significantly. Fewer than 5% of all files were deleted ones, so walking through billions of files to find a few million deleted ones is very inefficient.
For example, after a user deletes a file it is normally kept for 1 month, in case it needs to be restored. After that period the file must be deleted from storage programmatically. With the current number of files, the number of ranges, and the speed of traversing them, the server would need a whole year to walk all the ranges once.
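A quick back-of-the-envelope calculation shows the scale (illustrative only: the article says "several billion" files, so 3 billion is an assumed figure):

```python
# Assumed total; the system holds "several billion" files.
files_total = 3_000_000_000
range_size = 15_000                       # ids per range, from the old scheme
ranges = files_total // range_size        # number of ranges to poll
seconds_per_year = 365 * 24 * 3600
secs_per_range = seconds_per_year / ranges
print(ranges, round(secs_per_range))      # 200000 ranges, ~158 s each
```

So even at under three minutes per range, a full pass over all ranges eats an entire year, while the files worth finding are under 5% of the total.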
Naturally, space was not being freed, and this caused discontent among users: there were many times more files on their servers than their reports said there should be. The extra space on S3 was paid for by the company providing the service, a direct loss for it.
On S3 alone, at the time we started the work, 2 petabytes of deleted files were being stored, and that is just the cloud. In addition, there were clients whose files were stored on their own dedicated servers, with the same problem: server space was occupied by files deleted by users but never removed from the server.
What did we decide to do?
We decided to track the deletion events themselves:
- the client deleted a file and its retention period then expired;
- the user deleted a file immediately, with no possibility of recovery.
For deleting a file from a shard database we decided to take an optimistic approach and drop one of the usage checks. We knew that 99% of files are used within a single shard only. So we add the file to the deletion queue immediately, without checking the other shards for usage, since that check is performed again anyway when the deletion service confirms that it is about to remove the file from storage.
In addition, we kept the existing JOB that checked deleted files by ranges, to pick up files deleted before the release of the new version.
Everything deleted on a shard is collected into a table and then transferred to the single server that holds information about all files.
On that server it goes into the deletion queue table.
Before deletion, the queue table is checked to ensure the file is not used on any shard. This part of the check existed before the code change, and we decided not to touch it.
What had to change in the code?
On each shard we added a table to record the identifiers of deleted files.
We found all the procedures that delete files from the database; there turned out to be only 2. (After a user deletes a file, it remains in the database for some time.)
In the file-deletion procedure we added an insert into this local table of deleted files.
On the global file server we made a JOB that downloads the list of files from the shard databases. It simply calls a procedure on the shard database, which does a DELETE inside and returns the deleted rows via OUTPUT. In MS SQL Server, pulling data from a remote server is significantly faster than pushing data to a remote server. All of this is done in batches.
These files are then added to the deletion queue table on the global server.
To the queue table we added a field with the shard identifier, so we know where each deletion event came from.
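The batched pull can be sketched like this (a Python model of the data flow; in the real system the shard side is a stored procedure doing DELETE with OUTPUT and the global side is a JOB over a linked server, and all names and the batch size here are invented):

```python
BATCH = 1000  # hypothetical batch size; the article does not state the real one

def pop_deleted_batch(shard_deleted, batch=BATCH):
    """Shard side: atomically remove and return up to `batch` ids,
    mimicking T-SQL's DELETE TOP (@n) ... OUTPUT deleted.file_id."""
    out = shard_deleted[:batch]
    del shard_deleted[:batch]
    return out

def pull_from_shards(shards):
    """Global side: pull batches from every shard until each is drained,
    tagging each id with the shard it came from (the new queue-table field)."""
    queue = []
    for shard_id, deleted in shards.items():
        while True:
            batch = pop_deleted_batch(deleted)
            if not batch:
                break
            queue.extend((shard_id, fid) for fid in batch)
    return queue
```

The point of the pattern is that the remote call both removes the rows and hands them back in one round trip, so nothing is lost if the ids are pulled again later.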
How was all this tested?
There are 3 environments:
Dev is the development environment. The code is taken from the develop branch in git. You can deploy a different version of the code on IIS and run several versions of the orchestration. Clients connect to the dev environment over a VPN. Until recently the only inconvenience was with the databases, since any change to a database can break the work of other parts of the system; later the databases were made local. On the dev database server you can deploy already-working code so as not to disrupt everyone else's work. The dev environment has 3 shards instead of the 12 in production, but that is usually enough to test the interaction.
Staging is the same environment as production (almost the same: rarely, but administrators do make changes directly on prod). It runs a copy of the code from the master branch. In the database we sometimes caught differences from the prod code, but in general they are identical. Staging also has 3 shards, like dev, and carries no load, just like dev. Here you can run the integration tests cleanly, since the code matches prod. All tests must pass; this is a prerequisite for releasing the patch.
The perf lab is where tests are run under load. The load is generated with JMeter at one tenth of production level, and there is only one shard, which is sometimes inconvenient. The data is taken from production, anonymized, and loaded into the perf lab. All servers have the same configuration as in production.
The load is 10 times smaller because it is meant to approximate the production load on a single shard. The drawback is that the global database ends up heavily underloaded compared to production, so if the changes mainly concern the global file database, the test results can only be trusted approximately: things may not go so smoothly in production. Still, even though the perf lab does not match production load perfectly, being able to test under load at all helps catch a lot of errors before deploying to prod.
There is also a backup server, where you can examine production data to investigate particular cases. The company operates under a license that forbids giving developers access to prod data; the administration and support team (Operations) does have such access, so you have to ask the DBAs for help. Production data is very convenient for testing, because some cases only arise on prod data, and studying the data in a real system is very useful for understanding how the system behaves for users.
During the perf lab tests it turned out that the load profile did not exercise file deletion from storage at all. When load testing was set up, the most popular requests from the client software were chosen, and the storage-cleanup part was not among them. Since all the changes were in the database, we managed a simplified test of all the modified objects by calling the modified procedures manually on different data (for the cases I knew about).
In addition, both the integration and the performance tests focused mainly on the most popular type of file storage.
Another peculiarity of the perf lab, which we did not discover immediately, is the discrepancy between the data volumes in some tables on prod and in the perf lab. All the production JOBs that generate data run in the perf lab too, but not everything that processes the generated data does. For example, the aforementioned deletion queue table is much larger in the perf lab than on prod: 20 million records versus 200 thousand.
Deploy process
The deploy process is fairly standard. There were no application-code changes for this task; all changes are in the database. Database changes are always applied by a DBA, and this process is not automated. Two versions of the scripts are prepared, one to apply the changes and one to roll them back, along with instructions for the DBA. Both script versions are always made, and both applying and rolling back are always tested. The same scripts are used to apply the changes to staging and the perf lab before running integration and load testing.
What happened after the deploy?
During the first 5 hours after deployment, 1 million events arrived indicating that the client software got an error when trying to download a file: the "file corrupted" event. It means the client tried to download a file, but the file was not found in storage. Usually there are either no such events at all, or 1-2 thousand per day.
I'll say right away that it took a team of 3 (and sometimes 5) people, including me, at least a week to find the cause.
We collected the full list of files for which the "File Corrupted" event had arrived.
Although there were over 1 million events, all from different users and different companies, the list contained only 250 unique files.
The DBAs restored the databases on the backup server to the moment the events arrived, so we could investigate. The project's databases contain quite a few tables with all sorts of logs, which helped the analysis. Even when information is deleted from the database, a log record is always written of what was removed and by which event. On prod such records are kept for 1 week and then moved off to the archive server.
Here are the log tables that were very helpful in analyzing what happened:
The full log of client events, kept on each shard
On the global server:
- Log of file-download requests by users
- Log of file uploads into the system by users
- Log of files with the FileCorrupt event
- Log of files whose removal from storage was canceled
- Log of files deleted from the database
In addition, ELK with the application logs was available.
The error was reproduced in the dev environment, which confirmed the hypothesis. At first nobody took the hypothesis seriously, since it was very hard to believe that so many factors had coincided at the same moment and that so many users had hit exactly that moment in time.
What went wrong?
It turned out that the system had about 250 (out of the billions of files in the system) super-duper-mega-popular files. 250, yes!
These files were also very old. Back when they were uploaded, a different scheme was used to generate a file's key in storage.
It turned out that for this type of key, physical deletion from storage behaves differently than for other files.
The deletion class has a block of code with a condition specifically for files with the old key. At deletion time, before the check that the file is absent on the shards has completed, the system moves such an old file to a different location. That is what backfired.
It turned out that at the moment the file is moved (and remember, it is very popular), if some user grants rights to it to a new user, the client software goes to storage for the file, and the file is not in its expected place, since it has been moved, so the download fails. The client software then sends a message that the file is broken. In the database the file is marked as corrupted, and all its information is removed from the database (well, almost all).
Meanwhile, our shard check discovers that the file is in use and replies that it must be brought back. But all the information has already been removed from the database, and there is nothing to bring back.
Funny, yes?
That is, the failure required a user to hit exactly the window when the file had been moved and the shard check was still in progress; it was precisely at that point in time that the user sent the download request.
There it is, highload in action: even the most incredible coincidences happen to you.

Having recovered from the surprise and rolled everything back, we made sure the users' files were intact: they were restored from copies on other clients' disks.
Naturally, everything had been fine in the tests, because the tests deleted newer files with the new type of key, the one used for the last 5 years. Such files are not moved to another storage location during deletion.
Our optimism diminished, and we decided not to take the most optimistic path.
Retrospective
We decided that we need to:
- add tests for the different storage types;
- add load to the perf lab that exercises the calls used when deleting from storage;
- close the known race conditions;
- add monitoring (it had been in the plans, but did not fit into the initial scope).
About monitoring
We had planned monitoring from the start, but it was pushed into the background, since we needed to deploy quickly.
For monitoring, Zabbix, ELK, Grafana, NewRelic, SQL Sentry and the test version of AppDynamics were used in the project.
Of these, the perf lab had NewRelic and SQL Sentry.
All the system metrics were already being measured; now we wanted to measure business metrics. I had experience setting up such monitoring through Zabbix, so we decided to do the same here.
The scheme is very simple: in the database, create a table into which a JOB collects the needed metrics, plus a procedure that pushes the collected metrics to Zabbix.
Metrics:
- Number of files in the deletion queue in the global database
- Number of files in the queue per shard server
- Number of files handed to the storage-deletion program
- Number of deleted files
- Number of FileCorrupt events
- Number of files pending deletion on each shard
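A minimal sketch of the pushing side (assuming the standard zabbix_sender input format of "host key value" per line; the metric keys, table layout, and host names are invented):

```python
import subprocess

def collect_metrics(db):
    """Gather business metrics from (hypothetical) queue tables.
    In the real system a SQL JOB fills a metrics table instead."""
    return {
        "files.queue.global": len(db["global_queue"]),
        "files.deleted.today": db["deleted_today"],
        "files.corrupt.events": db["file_corrupt_events"],
    }

def to_sender_lines(host, metrics):
    """Format metrics as zabbix_sender input lines: '<host> <key> <value>'."""
    return "\n".join(f"{host} {key} {value}"
                     for key, value in sorted(metrics.items()))

def ship(host, metrics, zabbix_server="zabbix.example.local"):
    """Push the values via the zabbix_sender CLI, reading items from stdin
    ('-i -' tells it to read the input file from standard input)."""
    subprocess.run(
        ["zabbix_sender", "-z", zabbix_server, "-i", "-"],
        input=to_sender_lines(host, metrics).encode(),
        check=True,
    )
```

On the Zabbix side each key would be a trapper item, and alerts on, say, a growing `files.corrupt.events` would have caught the incident above within minutes instead of hours.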
Monitoring was implemented and rolled out to production separately, before we started deploying the new deletion implementation.
New solution
In general, we decided that it is better to overdo it than to lose sleep over it, and made a new plan:
- check on the shard itself that the file is definitely not used by anyone else, and transfer only unused files to the global server;
- when transferring to the server, collect all files in a table and check that they are not used on any shard before placing them in the deletion queue table;
- whenever a file is used or looked up in the system, mark it in the deletion queue table as requiring a re-check;
- hand over for deletion only files that nobody has looked up;
- re-check files that were looked up for presence on the shards;
- in general, remove the check from the procedure that deletes the file, since it must work fast, and a file that is in use should in principle never reach it;
- make the procedure that wipes all of a file's information aware that the file is in the process of being deleted, and not delete its information.
Point 6, removing the check, was rolled out with a delay, in several stages: at first the check was kept; a week later it was turned off for files belonging to the company's own employees; 2 weeks after that it was turned off completely.
What had to change in the code?
Again, all the changes affect only the database.
The scale of change was largest on the global server:
Add an intermediate table to hold all the files downloaded from the shards.
Make a JOB that checks the files in the intermediate table for usage on the shards.
Add a last-access-date field to the deletion queue table and add an index on it.
Find all procedures that access a file; there turned out to be 5 of them. Add to each a block that updates the last-access date in the queue table. The date is updated every time, regardless of whether it was already set.
In the procedure that issues files to the deletion program, return only files with an empty last-access date.
Add a JOB that collects all files with a last-access date set and checks their usage on all shards (with a 10-minute delay, needed for the client software to finish uploading the file to a shard; normally that takes up to 2 minutes, but we decided to play it safe). After the check completes, the last-access date is reset to empty if the file was not found anywhere; otherwise the file is removed from the queue. If the last-access date changed while the check was running, the data is left untouched: while the check was in progress, the file could have been uploaded to a shard that had already been checked, so the file needs a new check cycle.
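The heart of this JOB is an optimistic concurrency check on the last-access date. A simplified in-memory model (all names are invented; the real version is a SQL JOB over the queue table):

```python
def recheck(queue_entry, last_access_at_start, file_in_use_on_shards):
    """Re-check one queued file.

    queue_entry: dict with a 'last_access' field (a timestamp or None).
    last_access_at_start: the value read when the shard scan began.
    file_in_use_on_shards: result of scanning every shard for the file.

    Returns the action taken, mirroring the JOB described above.
    """
    if queue_entry["last_access"] != last_access_at_start:
        # The date moved while we were scanning: the file may have landed on
        # a shard we had already passed, so it needs a fresh check cycle.
        return "recheck-later"
    if file_in_use_on_shards:
        return "remove-from-queue"       # still used somewhere: never delete
    queue_entry["last_access"] = None    # reset: eligible for deletion again
    return "eligible-for-deletion"
```

The compare-against-the-starting-value step is what closes the race from the incident: a file touched mid-scan is never handed to the deletion program on stale information.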
On shards:
In the procedure that deletes files from the table of deleted files, we had to add a check that the file is not in use. The procedure lost a bit of its simplicity and beauty: a NOT EXISTS was simply added to the DELETE with OUTPUT.
We added a JOB that, in the background, purges from that table the files that are still in use on the server.
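In Python terms, the shard procedure's new behavior (the NOT EXISTS added to the DELETE with OUTPUT) amounts to popping only the identifiers that no longer appear among the files in use (names and batch size are invented):

```python
def pop_unused(deleted_ids, used_ids, batch=1000):
    """Remove and return up to `batch` ids from deleted_ids that are NOT
    present in used_ids: the NOT EXISTS condition grafted onto the
    DELETE ... OUTPUT. Ids still in use stay in the table untouched."""
    used = set(used_ids)
    popped, kept = [], []
    for fid in deleted_ids:
        if fid not in used and len(popped) < batch:
            popped.append(fid)
        else:
            kept.append(fid)
    deleted_ids[:] = kept  # mutate in place, like the DELETE
    return popped
```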
Tests
To the integration tests we added scenarios exercising every storage option.
We also wrote new cases to test the new file-deletion functionality.
In the perf lab we added load on the global server, and also load on deleting files from storage.
Deploy
Scripts were prepared for applying and rolling back the database changes. The DBA applied the scripts, and it turned out that during load testing we had not paid attention to locking on the deletion queue table. As a result, we did not manage to fix the index, which was not the most optimal one.
Because of this, we had to disable the range-checking JOB and instead analyze and add the identifiers of deleted files manually, for files that had been deleted in the system before the new code was rolled out.
Results
As a result of the deployment, newly deleted files are now removed from storage within 24 hours.
Files deleted before the launch of the new system were identified on the backup server and added to the queue on prod.
The 2 petabytes of extra data on S3 were removed. The same happened with the files on dedicated client servers, and now the space occupied on their servers matches the space shown in their clients.
The crooked index on the queue table still lives on prod; the task to change it is in the backlog, but it has been postponed in favor of higher-priority tasks.