It so happened that for the last three years I have done nothing but solve users' problems. Not as a helpdesk "any-key" guy but at a higher level; still, that does not change the fact.
Problems vary. Those caused by software bugs, hardware faults, or bugs in the customer's own head have to be localized and handed over to the right people. That is not very interesting.
The tricky, convoluted puzzles are far more interesting.
For example:
An OSPF adjacency comes up and goes down from time to time.
Routes received by the Route Reflector via BGP are not advertised to its clients.
Inter-AS Option C does not work.
As a rule, these are configuration problems. I read the wording of the request, realize that I understand absolutely nothing about it, and my heart sinks. That is a normal reaction when the task looks huge and you have no idea from which side to approach it.
The hardest part is picking up the phone and calling the customer. After all, he is obviously more experienced than I am: he built this setup, and I have not yet.
Let's take a simple ticket as an illustration.
MPLS TE Fast Reroute (FRR) does not work. FRR is a technology that switches traffic to a pre-established backup path within a few milliseconds and keeps it there until the IGP computes a new route. What question do I even ask the customer? How do I get to the essence of the problem when I cannot yet picture how this mechanism works?
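To make the mechanism concrete, here is a toy Python model of facility (bypass) protection. It assumes a hypothetical five-node topology (A to E) that has nothing to do with the ticket. It models only the forwarding decision at the point of local repair (PLR); real FRR is signalled with RSVP-TE.

```python
"""Toy model of MPLS TE Fast Reroute with next-hop bypass tunnels.

Hypothetical topology: node names A..E are made up for illustration.
"""

# Primary LSP, computed in advance by the head end.
PRIMARY_LSP = ["A", "B", "C", "D"]

# Bypass tunnels pre-signalled by each point of local repair (PLR):
# key = protected link (PLR, next hop), value = detour from PLR to the merge point.
BYPASS = {
    ("A", "B"): ["A", "E", "B"],
    ("B", "C"): ["B", "E", "C"],
    ("C", "D"): ["C", "E", "D"],
}


def forward(path, failed_links, bypass):
    """Return the hop sequence actually used when some links are down.

    The PLR splices its precomputed bypass in place of the failed link,
    so traffic keeps flowing without waiting for the IGP to reconverge.
    """
    result = [path[0]]
    for plr, nxt in zip(path, path[1:]):
        link = (plr, nxt)
        if link in failed_links:
            detour = bypass.get(link)
            if detour is None:
                raise RuntimeError(f"link {plr}-{nxt} failed and is not protected")
            # The detour starts at the PLR, which is already in the result.
            result.extend(detour[1:])
        else:
            result.append(nxt)
    return result


if __name__ == "__main__":
    print(forward(PRIMARY_LSP, set(), BYPASS))
    print(forward(PRIMARY_LSP, {("B", "C")}, BYPASS))
```

The key design point is that the detour already exists before the failure, so the switchover is a local lookup at the PLR rather than a new path computation.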
This is the first stage: I grab the documentation and start studying. What do we have here? Traffic engineering?! OK :( That is 3-4 hours of reading to understand it (to be fair, this is not my first encounter with the topic; I am broadly familiar with all of it).
The request has minor priority, so I can afford to spend that much time. Once I have a rough picture, I can call the customer and carefully find out the details. The call is the second stage.
At this point he starts pouring out terminology and describing his network, of which I still have only a very vague idea. If I understand the topic at least a little, I can ask questions that steer both the engineer and me in the right direction.
Don't be afraid to look stupid. I am afraid, but the issue has to get solved :)
Once the problem becomes clear, I restate it to the engineer the way I understood it and wait for confirmation that I got it right.
This is where an interesting problem lies: not everyone is willing to listen to me or to follow what I say:
Some interrupt and retell their own version, which generally matches mine. But I want to stick to the double-handshake rule: 1) I understood him, and 2) he understood that I understood correctly.
Others say, “Yes, yes, that's right,” and then explain something that clearly contradicts it. When I ask questions, they say “Yes” again and again explain everything differently.
Here it is important not to give in and to reach complete understanding on both sides. Otherwise I will end up solving a different problem or heading in the wrong direction.
In my example, the setup was as follows:

It was important for me to establish that FRR fails specifically between the two nodes but works fine inside the ring. After my questions, the engineer confirmed this. At the end of the conversation, I ask him to send me some information (network diagram, diagnostic information, logs, etc.).
After the call, I have to write an email restating my conclusions and requests. A phone conversation is just a phone conversation, while an email is at least recorded somewhere and stays on the engineer's screen.
Otherwise, he will forget to send the data or will not send all of it.
It is also very important to keep the request history. I could maintain an Excel file with all the ticket information: when I wrote, who answered, when I escalated or froze the ticket. But I lack the discipline; such a file needs too much attention.
Correspondence history, on the other hand, is kept for years. If the manager asks me 9 months later why the resolution took so long, I can open the archive and show that I requested the data on 01/01/2013 at 00:20 and the engineer replied only on 10/01/2013.
But here, too, there is room for a facepalm. You write the email point by point, for example:
1) When did the problem arise for the first time? Have any changes been made before this?
2) Provide a network map with IP addresses and interfaces
3) Send the log file from the memory card
4) Send diagnostic information.
5) Is remote access possible?
What comes back is:
I already described the topology to you on the phone: router-switch-router 2-switch 2-(STP)-client.
Attached is a single file with diagnostics, and that's all. What useful information did he give me? Only the understanding that this will be hard and that I need to call him again.
Thank IP, there are few like him.
In my example, the engineer promptly sent the configuration, the diagram, and a description of the desired behavior. So now I have a general idea of the problem, the network diagram, and what the customer wants.
The third stage is a more detailed reading of the documentation and configuration examples. I search for relevant cases that other engineers wrote up to make our lives easier. I also look for general information on Google to broaden my understanding and to see how other vendors implement the feature.
Usually new questions come up after this, and we repeat the iteration with calls and emails if there is no remote access.
Often the problem is solved at this stage: we find an error in the configuration or (admittedly, much less often) a hole in the engineer's understanding.
A vivid example of the latter:
Me: “You have the same Router ID in OSPF”
Engineer: “Well, yes, so what? They are in different areas.”
Me: “The Router ID must be unique within the entire OSPF domain.”
Engineer: “No, no. Of course the same Router ID is allowed in different areas.”
Me: a link to the documentation and Wikipedia.
Engineer: “I will check, but I have done it this way before and there were no problems.”
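The engineer's claim is easy to check mechanically. Below is a minimal sketch that assumes a hypothetical inventory of (device, area, router ID) tuples collected from the configs; the device names and addresses are made up. It groups devices by router ID across the whole domain and deliberately ignores the area.

```python
from collections import defaultdict

# Hypothetical inventory: (device name, OSPF area, configured router ID).
OSPF_ROUTERS = [
    ("R1", "0.0.0.0", "10.0.0.1"),
    ("R2", "0.0.0.0", "10.0.0.2"),
    ("R3", "0.0.0.1", "10.0.0.3"),
    ("R4", "0.0.0.1", "10.0.0.1"),  # same ID as R1, different area
]


def duplicate_router_ids(routers):
    """Return {router_id: [devices]} for IDs used by more than one device.

    Areas are ignored on purpose: the router ID must be unique in the
    whole OSPF domain, not just inside one area.
    """
    by_rid = defaultdict(set)
    for device, _area, rid in routers:
        by_rid[rid].add(device)
    return {rid: sorted(devs) for rid, devs in by_rid.items() if len(devs) > 1}


if __name__ == "__main__":
    for rid, devices in duplicate_router_ids(OSPF_ROUTERS).items():
        print(f"router ID {rid} is used by {', '.join(devices)}")
```

An ABR that appears once per area it belongs to is not flagged, because devices are collected into a set before counting.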
But there are also genuinely complex problems: everything is configured correctly, and all the data, logs, and statistics look fine, yet the problem is there.
Then troubleshooting (TShoot) begins: the fourth stage.
It is good if there is remote access. But sometimes customers reply: “We have told you more than once: there is no remote access. We have provided you with the configuration and diagnostics.”
Besides digging through someone else's console, I fire up a software simulator of the equipment.

I watch how the packets flow, take packet captures, and check at which stage the protocol fails.
If the problem reproduces only on specific hardware or a specific software version, I turn to a real lab with real hardware.
Sometimes reproducing the problem takes eight devices, a software upgrade on each of them, and license uploads.
In my example, the obvious clue was that OSPF is split into areas. The customer's FRR works fine inside an area but does not work between Area 0 and Area 1.
Then I switch on psychology: I open Outlook and start writing an email to the customer that explains in detail what is happening and what I have found. I spell everything out in words, with excerpts from the logs, the configuration, and Wireshark captures.
While writing, I run into some odd question that I cannot explain to the customer. I then continue the simulation and read the manuals until I can resolve it. I describe that, and a new cycle begins.
In the end, I often do not send the email at all, but I get much closer to the root of the problem.
Reasoning it through, I realize that CSPF (Constrained Shortest Path First) is built on the SPF algorithm, which, as is well known, works only within a single area. CSPF knows only summarized routes for the other areas, not their topology. Therefore it cannot compute an end-to-end shortest path.
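Here is a toy illustration of that reasoning. It assumes a hypothetical TE database for Area 1 with made-up metrics and bandwidths, and it treats links as bidirectional, although a real TED stores each direction separately. CSPF first prunes the links that fail the constraint and then runs Dijkstra. A destination known only as a summary route from another area is simply absent from the graph, so no path is found.

```python
import heapq

# Hypothetical TE database as seen by a head end in Area 1:
# (node, neighbour) -> (IGP metric, unreserved bandwidth in Mbit/s).
# Only links flooded inside Area 1 are present; nodes in Area 0 are
# known to the IGP only through summary prefixes, not as topology.
TED_AREA1 = {
    ("R1", "R2"): (10, 1000),
    ("R2", "ABR"): (10, 1000),
    ("R1", "R3"): (5, 100),
    ("R3", "ABR"): (5, 100),
}


def cspf(ted, src, dst, min_bw):
    """Constrained SPF: drop links lacking bandwidth, then run Dijkstra.

    Returns (cost, path), or None if dst is unreachable in this TED.
    Links are treated as bidirectional for simplicity.
    """
    graph = {}
    for (a, b), (metric, bw) in ted.items():
        if bw >= min_bw:  # the constraint is applied before SPF runs
            graph.setdefault(a, []).append((b, metric))
            graph.setdefault(b, []).append((a, metric))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, metric in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(queue, (cost + metric, nbr, path + [nbr]))
    return None


if __name__ == "__main__":
    print(cspf(TED_AREA1, "R1", "ABR", min_bw=500))  # low-bandwidth branch pruned
    print(cspf(TED_AREA1, "R1", "R9", min_bw=50))    # R9 lives in Area 0: no path
```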
I built this setup in the lab and verified that it really behaves this way. I found where the documentation mentions it and sent that to the customer.
And then:
Me: “Run the following command:
[R1] Make me feel good”
Customer:

However, the ticket does not always state the customer's real need. When you answer "no, it won't work like that", you need to offer a solution. And a new iteration begins.
The solution for my request: instead of FRR, use two LSPs within a single TE tunnel, a primary one and a backup one. For this, explicit paths are built in loose mode. This gives end-to-end protection between the two nodes.
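The idea of this workaround can be modelled in a few lines. This is a sketch under several assumptions:

- Node names and metrics are hypothetical.
- Links are treated as bidirectional.
- One global link table stands in for the per-area databases. In reality, each ABR would expand its loose hop using its own area's topology.

The primary LSP is loose via ABR1, and the backup is loose via ABR2 and excludes every primary link. The head end switches to the backup when any primary link fails.

```python
import heapq

# Hypothetical topology: undirected links with IGP metrics.
LINKS = {
    ("PE1", "ABR1"): 10, ("ABR1", "P0"): 10, ("P0", "PE2"): 10,
    ("PE1", "P1"): 20, ("P1", "ABR2"): 20, ("ABR2", "PE2"): 20,
    ("ABR1", "P1"): 5,
}


def spf(links, src, dst, banned=frozenset()):
    """Plain Dijkstra that skips every link listed in `banned` (either direction)."""
    graph = {}
    for (a, b), metric in links.items():
        if (a, b) in banned or (b, a) in banned:
            continue
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr, metric in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(queue, (cost + metric, nbr, path + [nbr]))
    return None


def loose_path(links, hops, banned=frozenset()):
    """Expand a loose explicit path: run SPF between each pair of consecutive hops."""
    full = [hops[0]]
    for a, b in zip(hops, hops[1:]):
        segment = spf(links, a, b, banned)
        if segment is None:
            return None
        full.extend(segment[1:])
    return full


def path_links(path):
    return {(a, b) for a, b in zip(path, path[1:])}


# Primary LSP: loose via ABR1. Backup LSP: loose via ABR2, avoiding primary links.
primary = loose_path(LINKS, ["PE1", "ABR1", "PE2"])
backup = loose_path(LINKS, ["PE1", "ABR2", "PE2"], banned=path_links(primary))


def active_lsp(failed_links):
    """The head end keeps traffic on the primary LSP unless one of its links is down."""
    primary_links = path_links(primary)
    for a, b in failed_links:
        if (a, b) in primary_links or (b, a) in primary_links:
            return backup
    return primary
```

Unlike FRR, the switchover here happens at the head end, so it is typically slower than local repair. In exchange, no single node needs the full topology of both areas.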
Here are a few requests that were solved this way:
1) New LSPs are not being established.
Note: until this morning everything worked and LSPs were established. The older LSPs still work and do not flap.
2) An L2VPN runs across the MPLS network. On the client side, packets larger than 1546 bytes do not get through.
3) There is a network that carries video via multicast. When VRRP is configured on two routers, the client that uses them as gateways receives two copies of the video traffic (one from each router).
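For an L2VPN MTU case like the second one, a natural first step is the overhead arithmetic. This back-of-the-envelope helper rests on two assumptions. First, it assumes Ethernet pseudowire encapsulation, where the ingress PE strips the customer preamble and FCS. Second, it assumes the limit being checked is the MPLS MTU of the core links, excluding the outer Ethernet header; how MTU is counted differs between vendors. It is only a calculator, not the resolution of that ticket.

```python
ETH_HEADER = 14    # destination MAC + source MAC + EtherType (no FCS)
VLAN_TAG = 4       # one 802.1Q tag
MPLS_LABEL = 4     # one label stack entry
CONTROL_WORD = 4   # optional pseudowire control word


def customer_frame(ip_packet, vlan_tags=0):
    """Size of the customer frame as carried inside the pseudowire (FCS stripped)."""
    return ip_packet + ETH_HEADER + vlan_tags * VLAN_TAG


def core_mpls_mtu_needed(frame, labels=2, control_word=False):
    """Minimum MPLS MTU on every core link for `frame` to pass unfragmented.

    labels=2 is the pseudowire label plus one transport label; TE tunnels,
    FRR bypasses or LDP over TE push additional labels.
    """
    return frame + labels * MPLS_LABEL + (CONTROL_WORD if control_word else 0)


def max_customer_frame(core_mpls_mtu, labels=2, control_word=False):
    """Largest customer frame (no FCS) that fits into a given core MPLS MTU."""
    return core_mpls_mtu - labels * MPLS_LABEL - (CONTROL_WORD if control_word else 0)


if __name__ == "__main__":
    frame = customer_frame(1500, vlan_tags=1)
    print(frame, core_mpls_mtu_needed(frame), core_mpls_mtu_needed(frame, control_word=True))
```

Comparing the observed size threshold with `max_customer_frame` for the label depth actually in use shows quickly whether the core MTU is a plausible suspect.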