Online flower shop, or how we screwed up on Valentine's Day

The holidays are over, and the profits and losses have been tallied. It's time for a story. This one is about how, because of a technical error, an online flower delivery store lost several hundred orders and 1 million rubles in revenue on Valentine's Day.

In life there are good moments that you want to share, talk about, show off a little, and have others be glad about. And there are negative situations that nobody is insured against. You simply have to learn from them and not let them happen again. As promised, in this section I publish not only positive stories but instructive ones too. The error was not my fault, but I was part of the team that day (and still am), so I share the responsibility with everyone. All that remains is to tell what happened and what the root cause was.

You all know perfectly well that flowers are in demand at any time of year: people give them for holidays and birthdays, to win someone over or do something nice, to show love, and sometimes for no reason at all. But it is also well known that the flower business is seasonal.
If you look at the query history in wordstat.yandex for one of the most popular queries, “flower delivery”, you can see the same characteristic rises in any previous year: from the end of January to the middle of March, in August, and at the end of November.

Here are the dates behind these peaks:

January 25 - Tatiana's Day;
February 14 - Valentine's Day;
February 23 - Defender of the Fatherland Day;
March 8 - International Women's Day;
mid-August - preparing children for school, gifts to teachers;
November 25 - Mother's Day.

So the first three months of each year are the most intense for anyone who owns this kind of business, and it is very important to go into these holidays fully prepared. We tried, and year after year everything went fine, but on February 14, 2018 we were let down.

A little about the project: an online flower shop with delivery in Moscow and the Moscow region. The main promotion channels are SEO (75%), contextual advertising (20%, this is the area I am responsible for), and e-mail marketing (5%). Social networks are practically unused. The site runs on 1C-Bitrix, on regular (shared) hosting at Timeweb. The hosting matters a lot, and I will explain why later. The development team that maintains our project worked remotely, and that also played a role to some extent.

There is a myth that “florists” earn almost half a year's revenue on these holidays, which lets them take it easy for the rest of the year. This is not true. Yes, there are more orders, and revenue is several times higher. But profit does not always grow with it, because the purchase price of flowers, florists' pay, packaging, rent, couriers, advertising, and returns all rise sharply during periods of high demand.

So, everything was fine until 12:30 Moscow time on February 14. By then we had already accepted more than 120 orders totalling 500,000 rubles, and judging by last year's experience we expected another 200+. I even captured a record of 100+ simultaneous visitors on the site.

At 12:43 the first failure occurred. Of course, we immediately asked our remote developers for help. The correspondence with the owner in WhatsApp went something like this:

The error was 502 Bad Gateway, and at one point I even managed to capture it:

And this was the absolute peak time for sales and orders. The developers could not figure out the problem on their own and wrote to the hosting support. After some time, we received the following answer:

Here is the full chain of site outages:

12:43-12:47 = 4 minutes
13:01-13:18 = 17 minutes
13:27-13:42 = 15 minutes
13:57-14:17 = 20 minutes
14:26-14:54 = 28 minutes
14:58-15:15 = 17 minutes
15:30-15:46 = 16 minutes
16:08-16:27 = 19 minutes
16:35-16:37 = 2 minutes
16:48-16:51 = 3 minutes
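
To get a feel for the scale, the outages above can be added up. This is a minimal Python sketch that uses only the times from the list; the function names are mine, and the "window" is simply the span from the first failure to the end of the last one.

```python
from datetime import datetime

# Outage windows on February 14 (Moscow time), copied from the list above.
OUTAGES = [
    ("12:43", "12:47"),
    ("13:01", "13:18"),
    ("13:27", "13:42"),
    ("13:57", "14:17"),
    ("14:26", "14:54"),
    ("14:58", "15:15"),
    ("15:30", "15:46"),
    ("16:08", "16:27"),
    ("16:35", "16:37"),
    ("16:48", "16:51"),
]


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def downtime_summary(outages):
    """Return (minutes down, minutes in the whole window, share of time down)."""
    total_down = sum(minutes_between(s, e) for s, e in outages)
    window = minutes_between(outages[0][0], outages[-1][1])
    return total_down, window, total_down / window


total_down, window, share = downtime_summary(OUTAGES)
print(f"{total_down} min down out of {window} min ({share:.0%})")
# prints "141 min down out of 248 min (57%)"
```

In other words, during the four hours that matter most on Valentine's Day, the site was unavailable for more than half of the time.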

And this was the dialogue with the owner:

At that moment I was not on site (thank God!), so I could only guess at all the emotions from the messages. What was also scary is that there was no working copy of the site (or rather, it went down together with the main site), so the florists had to assemble all the bouquets from memory. The orders themselves existed, since they arrived by e-mail, but the color schemes, quantities, shapes, pictures, and some non-standard bouquets all had to be asked from the client again or recalled from memory.

The developers were unable to determine why the site was going down. At some point they even started to believe in conspiracy theories and DDoS attacks by competitors. Could the hosting simply not cope? They considered that too, but before February 14 there was February 13, a day with loads no less intense, and everything went without problems then.

Our online store stayed down until February 15 (in the remaining time we received only 30 orders), until I turned for an independent opinion to a freelancer who had recently done a job for us.

The first thing the programmer advised was to move the site off shared hosting, at least to a separate VPS (virtual private server), since a virtual server is more powerful and more stable.

Analysis of the log files showed that there had been no attack. At one point, though, someone had simply switched the site off at the hosting level (13:01-13:18). In other words, it looked less like a DDoS attack or an error on the PHP side and more like someone at the hosting company disabling the site for a set period. That is exactly what happened: the hosting staff disconnected us manually because of extremely high load on the server. What caused such a sharp spike still had to be found out.

For each virtual account, the hoster allocates a certain amount of resources. When an account sharply exceeds that allocation, the hoster simply puts it on an ignore list for a while and then writes to the account owner: “You have exceeded the threshold for the allocated capacity.”

To the question:

“Why didn't they (the current developers) tell us that it would be better to switch to a different, more powerful and more reliable server?”

I received a very specific and sensible answer:

- “Many programmers won't suggest moving there. Doing so requires some knowledge of Linux administration, and that is closer to system administration.”

It all became clear at once: incompetence. Meanwhile, these were the messages we were getting from our developers:

“It looks like it is a DDoS after all, just a cunning one. Right now there are almost no people on the site, yet it makes 40,000 requests to the database per minute. The queries are large and they overflow the cache, so in the end nothing loads. Yesterday there were more visitors, but the site coped and there were no such problems. We deployed a copy of the site and simulated a large number of users logging in. It starts to hang. We must move to a VDS. Once the site is working, we can investigate in a planned manner.”

Now even I know that a DDoS is not about database queries. If there are no requests hitting the web server but the database is being hammered, then it is not a DDoS but some internal problem... In short, zero competence. The developers' real level showed itself immediately in a stressful, non-standard situation. In 1.5 days, a team of 3-4 people could not work out why the store was going down or bring it back to life.
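
The reasoning above can be put as a rough triage rule: a DDoS shows up as a flood of requests to the web server, while heavy database load with almost no visitors points inward. Here is a minimal Python sketch of that rule. The thresholds (`web_spike_rpm`, `max_queries_per_request`) and the visitor figure in the example are hypothetical, chosen only for illustration; the only number taken from the developers' message is 40,000 database queries per minute.

```python
def classify_load(web_rpm, db_qpm, web_spike_rpm=5000, max_queries_per_request=100):
    """Rough triage of a load spike.

    web_rpm: HTTP requests per minute reaching the web server
    db_qpm:  database queries per minute
    Both thresholds are illustrative assumptions, not measured values.
    """
    # A flood of incoming HTTP requests: real visitors or a possible DDoS.
    if web_rpm >= web_spike_rpm:
        return "external"
    # Far more DB queries than the incoming traffic could plausibly generate:
    # the load originates inside the site (a module, a cron job, a loop).
    if db_qpm > web_rpm * max_queries_per_request:
        return "internal"
    return "normal"


# "Almost no people on the site" (assumed here as 20 requests per minute)
# combined with 40,000 database queries per minute:
print(classify_load(web_rpm=20, db_qpm=40_000))  # prints "internal"
```

With real numbers from the access log and the database statistics, a check like this would have ruled out the DDoS theory within minutes.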

At some point the SEO guys joined in with their own remedies. They:

reduced the load on the site from Yandex bots via robots.txt, with strict limits for now;
reduced the load on the site from Google bots via Google Webmasters;
reduced the load on the site from various bots we do not need by blocking them in the .htaccess file.
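
For readers who have not seen such restrictions, here is roughly what a crawler limit in robots.txt looks like and how a well-behaved bot would read it. This is a sketch using Python's standard `urllib.robotparser`; the robots.txt content, the 10-second delay, and the `/catalog/` path are my assumptions, not the rules the SEO team actually set. The Google limit was set through Google Webmasters and the .htaccess blocks act on the server side, so neither is modeled here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask Yandex to wait between requests,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Yandex
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawl_policy(agent, url="/catalog/"):
    """Return (may this agent fetch the URL?, requested delay in seconds or None)."""
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


print(crawl_policy("Yandex"))     # prints "(True, 10)"
print(crawl_policy("Googlebot"))  # prints "(True, None)"
```

Rules like these only slow down crawlers that choose to obey them, and they do nothing about load that originates inside the site itself.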

The developer who helped us through this whole story laughed at these changes and said they had absolutely nothing to do with the problem. Something real had to be done before someone covered their tracks or made things worse.

We trusted the independent programmer completely and waited for news from him. Meanwhile the site was still hanging and would not even let me into the admin panel. In the end, the developer took a copy of the site directly from the hosting and deployed it locally.

A small digression: the online store has been operating since 2015, and over that time it has been serviced by several SEO agencies. Quite a few developers came and went in those agencies, and each of them had their own access. As it turned out later, practically everyone and their dog had administrative access to the site, the SEO people included.

First of all, all access was closed. Of course, 1C-Bitrix is “nice” in this respect: you do not even need to have access, just a handful of scripts that, when run, restore it and add you as an administrator. If any of the people who had access to the site (and there were about a dozen) had planted such a script somewhere, they could run it without any trouble, get in as an admin, and do damage...

After some time, we got the first results from him:

The guilty party was not named, but this is what was responsible:

\Bitrix\CmskassaEkam\CheckListTable::loadUpdates()

EKAM – - . - CMS «1-» «54- - » - .

, «. (cmskassa.ekam) cmskassa.ru» - «\Bitrix\CmskassaEkam\CheckListTable::loadUpdates()» . () . , , - .