We regularly share our experience of optimizing the service systems of our IaaS provider.
Today, the Ticketmaster case caught our attention. Let us briefly analyze it.
/ Photo by Ginny / CC
Audyn Espinoza says that the Ticketmaster team monitors its systems constantly. The project's IT department would rather spend its time optimizing the service, but problems have to be solved as they appear.
In this case, it all started when one of the requests began producing an unusually high number of timeouts. A look at the monitoring logs showed that the problem affected the entire cluster: timeouts occurred every minute.
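A pattern such as "every minute" is easy to confirm from log timestamps. The following is a minimal sketch, not the Ticketmaster team's actual tooling. It assumes the timeout events have already been extracted from the logs as Unix timestamps; the function names `seconds_histogram` and `dominant_offset`, the 0.8 threshold, and the sample data are all hypothetical. It counts how many events fall on each second of the minute; one strong peak suggests a 60-second cycle.

```python
from collections import Counter


def seconds_histogram(timestamps):
    """Count timeout events by their offset (0-59) within the minute."""
    return Counter(int(ts) % 60 for ts in timestamps)


def dominant_offset(timestamps, threshold=0.8):
    """Return the second-of-minute on which at least `threshold` of all
    events fall, or None if no single offset dominates."""
    hist = seconds_histogram(timestamps)
    if not hist:
        return None
    offset, count = hist.most_common(1)[0]
    return offset if count / len(timestamps) >= threshold else None


# Hypothetical sample: ten timeouts exactly one minute apart, plus one stray.
base = 60 * 28_000_000  # an arbitrary timestamp that is a multiple of 60
sample = [base + 60 * i + 17 for i in range(10)] + [base + 123]

print(dominant_offset(sample))  # prints 17
```

If the events cluster on one offset like this, the next question is which scheduled job or poller runs on the same cycle.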
Further investigation with tcpdump showed that the problem could be localized to the point where traffic passes through the firewall. For a more detailed analysis, the team decided to use OPNET and Wireshark.
These tools showed that the SYN packet passed through without problems, but the corresponding SYN/ACK did not. When a test packet was sent in the opposite direction, the result was the same.
As a result, the team went back to examining the firewall. They found that when the allowed TCP connection timeout was reached, the SYN packet was retransmitted, and the SYN/ACK never made it through the virtual firewall.
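The symptom described here, SYN packets that are retransmitted because no SYN/ACK ever comes back, can be checked mechanically once a capture has been reduced to simple records. The sketch below is an illustration under assumptions, not the analysis the team ran: the `Packet` record, the `unanswered_syns` function, and the sample addresses are hypothetical, and the capture is assumed to be taken on the client side of the firewall.

```python
from dataclasses import dataclass

SYN = frozenset({"SYN"})
SYN_ACK = frozenset({"SYN", "ACK"})


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of one captured TCP segment."""
    time: float
    src: str
    sport: int
    dst: str
    dport: int
    flags: frozenset


def unanswered_syns(packets):
    """Return the first SYN of every flow that never received a SYN/ACK."""
    pending = {}  # (src, sport, dst, dport) -> first SYN seen for that flow
    for p in sorted(packets, key=lambda pkt: pkt.time):
        if p.flags == SYN:
            # Retransmitted SYNs reuse the same 4-tuple; keep only the first.
            pending.setdefault((p.src, p.sport, p.dst, p.dport), p)
        elif p.flags == SYN_ACK:
            # A SYN/ACK travels in the reverse direction of its SYN.
            pending.pop((p.dst, p.dport, p.src, p.sport), None)
    return list(pending.values())


# Hypothetical capture: one completed handshake, one SYN sent twice
# without ever being answered.
capture = [
    Packet(0.00, "10.0.0.5", 40001, "10.0.1.9", 443, SYN),
    Packet(0.01, "10.0.1.9", 443, "10.0.0.5", 40001, SYN_ACK),
    Packet(2.00, "10.0.0.5", 40002, "10.0.1.9", 443, SYN),
    Packet(3.00, "10.0.0.5", 40002, "10.0.1.9", 443, SYN),  # retransmission
]

for syn in unanswered_syns(capture):
    print(f"no SYN/ACK for {syn.src}:{syn.sport} -> {syn.dst}:{syn.dport}")
```

Running the same check on captures from both sides of the firewall would show whether the SYN/ACK is lost at the firewall or is never sent at all.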
The final verdict: firewall monitoring showed that SNMP monopolized the CPU every 60 seconds. It was a bug at the code level. To solve the problem, the team disabled SNMP polling.
P.S. A little more about how our IaaS provider works: