
Good day.
To be honest, I never thought I would write an article about such trivial things, but this is the fifth time I have run into a disregard for the simplest rules of building networks. It would be one thing if these were small shops, but this happens at large providers, banks, and government offices, whose names I will not disclose for certain reasons.
Let me say up front that everything written below is purely my opinion, which I do not impose on anyone. I should also note that this is mainly about IP networks; where would we be without IP in the modern world?
In practice, the problems of any organization that runs communication networks fall into several groups:
- Physical structure of the SCS
- Logical structure of the SCS
- Monitoring
- Access control, security and remote access
- Application processing system
- Backup system
It would seem that all of this is clear, well known, and thoroughly chewed over, with nothing left to say. Reality, however, is harsher.
Let's start with item number one:
Physical structure of the SCS
I don't have to look far for an example: right now I am watching road workers ruthlessly mangle fiber-optic cable that has been lying on the ground, unsuspended, for several months for reasons I do not understand. The most interesting part is that the fiber still works...
But back to more mundane things: data centers, server rooms, NOCs, and so on.
I have changed jobs about five times, from large providers and state offices to banks, and everywhere, in every place, it looked like the header photo. The one exception was a major provider that had several showcase nodes, the ones management was taken to see. The rest of its nodes were a sad sight.
I think there is no need to argue for buying cable organizers, laying wires neatly, labeling them, not leaving kilometer-long "snot", and keeping a cross-connect log. These simple steps are what separate a tangled rack from a tidy one.
I remember my colleagues and I once tidied up such "snot", but since we were not the only ones with access to the node, a month later the cobweb had grown back. The main problem is that every task comes with a deadline of "yesterday", installers do not care much about "beauty" and throw cables any which way, and because nobody tracks who ran what, there is no way to hold anyone accountable, so the hellish web keeps growing. Routing and splicing diagrams for optical cables are usually not so bad, but wiring diagrams inside the rooms are a sad story.
It is also worth periodically blowing the dust out of communication nodes. Things are better here: about half of my employers did this. Still, I have personally seen dust stalactites hanging from routers.
My sorest point is the
logical structure of the SCS.
I speak not as an abstract narrator, but as someone who for more than a year maintained a fairly large network with static routing, no address plan, and complete anarchy. Every argument for introducing dynamic routing broke against the phrase "static is the most reliable".
Of course, in a small local area network static routing is quite appropriate, but once the geography of your network grows beyond your cozy office, it is time to give up static.
There is another sad example: a network with OSPF as the IGP and several hundred routers in a single area. The person simply did not suspect that the network needs to be split into areas, let alone that areas come in different types. Every question got the same answer: "it still works."
The description field was invented for a reason: label ports, VLANs, subinterfaces, and VLAN interfaces; write everything down. Whoever comes after you, or comes to help you, will not want to study kilometers of ARP and MAC tables and walk through a dozen switches and routers to understand how it all works. Also set the bandwidth value: even if it is not used in calculating route cost, it lets anyone look up the link capacity, and it will be useful for monitoring (see the Monitoring section below).
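
Checking for missing descriptions is easy to automate. Below is a minimal sketch, assuming a Cisco IOS-style text configuration in which interface settings are indented under an `interface` line; the function name and the sample configuration are hypothetical and only illustrate the idea.

```python
def interfaces_missing_description(config_text):
    """Return the names of interfaces that have no 'description' line.

    Assumes an IOS-like layout: a top-level 'interface <name>' line
    followed by indented settings (an assumption, not a full parser).
    """
    missing = []
    current = None            # interface block currently being scanned
    has_description = False
    for line in config_text.splitlines():
        if line.startswith("interface "):
            if current is not None and not has_description:
                missing.append(current)
            current = line.split(None, 1)[1].strip()
            has_description = False
        elif current is not None:
            if line.startswith((" ", "\t")):
                if line.strip().startswith("description "):
                    has_description = True
            else:
                # Any non-indented line ('!' or a new command) ends the block.
                if not has_description:
                    missing.append(current)
                current = None
    if current is not None and not has_description:
        missing.append(current)
    return missing


# Hypothetical sample configuration.
SAMPLE = """\
interface GigabitEthernet0/1
 description Uplink to core-sw1, 1G
 bandwidth 1000000
!
interface GigabitEthernet0/2
 switchport access vlan 120
!
interface Vlan120
 ip address 10.0.120.1 255.255.255.0
!
"""

print(interfaces_missing_description(SAMPLE))
```

Run against saved configurations, a check like this gives a list of ports to label before the next person has to reverse-engineer them.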
Draw diagrams! None of my workplaces had proper diagrams, let alone separate L2/L3 diagrams. Instead, every enterprise has its most "valuable" employee, who is valuable because he remembers how that damned channel was connected. When asked for a diagram, at best an ancient volume is pulled out of the safe, showing a small cloud with two lines.
Another example: an operator once sent me a channel diagram in which every device was drawn as a square, which made it completely unreadable. Icons for routers, switches, satellites, and so on were invented for a reason. Draw readable diagrams, gentlemen!
An address plan is something that should exist a priori, from the moment the network is conceived. In reality, either it simply does not exist, or it is "bitten off" in pieces from different ranges, sometimes even overlapping ones. Combined with static L3 routing, you are guaranteed a loop, and more than one.
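
Overlapping ranges are easy to catch mechanically. Here is a minimal sketch using Python's standard `ipaddress` module; the plan contents and the `find_overlaps` name are hypothetical, chosen only for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(plan):
    """Return sorted (name_a, name_b) pairs whose prefixes overlap.

    plan maps a human-readable name to a CIDR string.
    """
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical address plan with one intersection.
PLAN = {
    "office": "10.0.0.0/24",
    "dc": "10.0.0.128/25",
    "mgmt": "10.0.1.0/24",
}

print(find_overlaps(PLAN))
```

Here `dc` sits inside `office`, so that pair is reported; `mgmt` is clean. Keeping the plan in a file and running such a check on every change is cheaper than hunting a routing loop later.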
In addition to the address plan, it is worth keeping a table of VLANs, and better yet deploying QinQ, especially if you are a provider selling communication services, since VLANs tend to run out and overlap.
Then there is network design. Saving on it leads to sad consequences. One well-known large operator provides an IPTV service in which multicast runs in a single VLAN and ADSL is the last mile; incorrect VPI/VCI settings on a subscriber's modem create L2 loops, the TV picture breaks up, the internet works badly, and all subscribers suffer.
Another very sore topic is VLAN 1. Why do people keep using it for management, sometimes even for data, and then act surprised by L2 loops? Why not pick another VLAN and make it the native one? Hunting for a loop is especially "pleasant" when there is an unmanaged switch at the access layer.
Next on the agenda:
Monitoring
For some reason, many small providers prefer to learn about problems solely from subscriber calls. At larger ones, at best, all monitoring is based only on ICMP.
There are many excellent open monitoring systems: Zabbix, Cacti, Dude, and so on, plus a bunch of paid ones. I think there is no need to explain how important monitoring is, or that besides ICMP there is also SNMP.
However, I once had an employer who claimed to have excellent Zabbix monitoring that watched absolutely everything. In fact, the person who set it up had not bothered to read the documentation: the pollers were overloaded, so data was lost; little data was collected; and all host data was entered manually. The MySQL configuration was left at its defaults.
That monitoring has since been brought into decent shape: LLD (low-level discovery) is configured and automatically adds and removes interfaces, tunnels, modules, fans, power supplies, and so on. Graph labels are taken from the interface description and interface speeds from the bandwidth setting, which is why this information has to be kept up to date. The one thing that, in my view, Zabbix does badly is housekeeping: large installations need partitioning.
Collect syslog from your equipment, if not from everything then at least from critical nodes. This can be done through Zabbix, but it seems to me that solution would be too "heavy" (there is a corresponding article on the Zabbix blog).
Deploy several NetFlow collectors and install an analyzer, and you will always see who is overloading your channels and with what. If you have a billing system, you can use it for this.
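
To show what "who is overloading your channels" boils down to once the flows are collected, here is a minimal sketch. It assumes the collector has already decoded the records into `(source, destination, bytes)` tuples; that record shape, the function name, and the sample data are assumptions for illustration, not the output format of any particular collector.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Sum bytes per source address and return the n heaviest senders."""
    totals = Counter()
    for src, _dst, nbytes in flows:
        totals[src] += nbytes
    return totals.most_common(n)


# Hypothetical, already-decoded flow records: (src, dst, bytes).
FLOWS = [
    ("10.0.0.5", "8.8.8.8", 1200),
    ("10.0.0.7", "1.1.1.1", 300),
    ("10.0.0.5", "1.1.1.1", 800),
]

print(top_talkers(FLOWS))
```

A real analyzer adds time windows, ports, and interfaces, but the core is the same aggregation.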
A no less important point:
Access control, security and remote access
As a rule, at the word AAA I heard only "what is that?" or "oh, I know, that's a battery!" So either one local account is shared by everyone, or each person has their own. Either way, when an employee leaves, especially on bad terms, everyone runs around wide-eyed looking for where else he has an account. I have an article on how to deploy TACACS+; believe me, it will make your life much easier. In addition, the logs will show who did what on the equipment.
Some use RADIUS, some invent something else. I personally like TACACS+; almost all modern equipment supports this protocol.
Also put an ACL on management access, at least on equipment with public ("white") addresses, because login attempts from the internet never stop.
Do not expose management to the outside: SSH, Telnet, RDP, and so on. If you must expose it, at least restrict it by firewall to specific IP addresses. Personally, I always set up OpenVPN for access, generating an individual key for each employee.
I don't have to look far for an example: at my current job the network runs on Cisco equipment with a bunch of local accounts and different passwords for privileged mode, which of course no one remembers. By superhuman willpower, TACACS+ was deployed on this network: two redundant, geographically separated, synchronized servers. Despite this, new equipment still gets a million local accounts instead of one. When I point out that local accounts are not used while TACACS+ is running, the answer is "local accounts are reliable; what if something happens?"
Also, when I launched TACACS+, I saw a bunch of entries in the authorization logs: attempts to log in as "root", "guest", and so on from Chinese IP addresses. My question about access ACLs was met with round eyes. By the way, this was a large bank...
Actually, I did not plan to include the
application processing system in the list, but there was a case at a state office providing telecommunications services to several thousand subscribers where requests were taken by phone, printed out, and delivered to the engineer by, drum roll, fax! In general, though, things are fine in this area, except that the usability of every system I have had to work with is zero.
Moving on:
Backup system
I think there is no need to explain that equipment can fail because of a power surge, fire, flood, a cleaner, or other natural disasters. To replace it quickly, we need its configuration. I once watched a person re-create the configuration of a burned-out router from memory, with smoke coming out of his ears and other body parts. I do not urge you to deploy anything grand, such as RANCID + SVN (as Mr.
EvilMause suggested, RANCID can be made to take backups from equipment of any vendor). But you can at least write a simple script in bash or another language that walks through the routers and tells each one to copy its configuration file to FTP/TFTP, sorted into folders; TACACS+ will let you create an account with limited rights for this.
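
The "sort into folders" half of such a script can be sketched as follows. This is only an illustration under stated assumptions: the configuration text has already been fetched somehow (over FTP/TFTP or otherwise), and the `archive_config` name and the `<root>/<hostname>/<date>.cfg` layout are hypothetical choices, not something the tools above prescribe.

```python
from datetime import date
from pathlib import Path


def archive_config(root, hostname, config_text, day=None):
    """Store a device configuration as <root>/<hostname>/<YYYY-MM-DD>.cfg.

    Returns the path written, or None when the text is identical to the
    newest copy already stored (so unchanged configs do not pile up).
    """
    day = day or date.today()
    device_dir = Path(root) / hostname
    device_dir.mkdir(parents=True, exist_ok=True)
    # ISO dates sort chronologically, so the last name is the newest copy.
    existing = sorted(device_dir.glob("*.cfg"))
    if existing and existing[-1].read_text() == config_text:
        return None
    target = device_dir / f"{day.isoformat()}.cfg"
    target.write_text(config_text)
    return target
```

For example, `archive_config("/srv/backups", "core-r1", text)` would keep one dated file per change for `core-r1` (the path and hostname here are made up). A second save on the same day overwrites that day's file, which is acceptable for a sketch but worth refining in practice.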
By the way, Cisco equipment supports changing and copying the configuration over SNMP when an RW community is configured, but the community must be tied to an ACL, otherwise... Especially with SNMPv1 or v2c. At that same bank, RW communities were configured in many places without an ACL, and the community string was no more complicated than "public". One day, as legend has it, a certain "hacker" got into the bank's network and turned off IP routing. How ever did he do it?!
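
Finding such communities in saved configurations takes a few lines. The sketch below assumes a simplified form of the IOS line, `snmp-server community <string> [view <name>] [RO|RW] [<acl>]`; it ignores other options, and the function name and sample are hypothetical.

```python
def rw_communities_without_acl(config_text):
    """Return RW community strings that have no ACL after the RO/RW keyword.

    Handles only the simplified line form described above (an assumption).
    """
    findings = []
    for line in config_text.splitlines():
        tokens = line.split()
        if tokens[:2] != ["snmp-server", "community"] or len(tokens) < 3:
            continue
        name, rest = tokens[2], tokens[3:]
        if rest[:1] == ["view"]:
            rest = rest[2:]          # skip 'view <name>'
        is_rw = bool(rest) and rest[0].upper() == "RW"
        has_acl = len(rest) > 1      # anything after RW is treated as the ACL
        if is_rw and not has_acl:
            findings.append(name)
    return findings


# Hypothetical sample lines.
SNMP_SAMPLE = """\
snmp-server community public RW
snmp-server community monitor RO 10
snmp-server community s3cret RW 99
snmp-server community ops view MGMT RW
"""

print(rw_communities_without_acl(SNMP_SAMPLE))
```

In the sample, `public` and `ops` are flagged; `s3cret` is bound to ACL 99 and `monitor` is read-only.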
For servers you can use something like Bacula. If you are really lazy and the server has no RAID controller, use at least software RAID 1, for example Linux mdadm, but not the built-in FakeRAID; that gives you at least some data redundancy. Best of all, since we live in an age of modern technology, use virtualization. As it happens, most operators are not doing so badly on the server side, but backups are a disaster. Banks are much better at this, since all their work is tied to the AU.
Summary
This was only about IP networks. If you look toward satellite networks, SCPS suffers from "interference" from an unidentified source, and in VSAT the time slots miraculously run out. I personally witnessed how "well" one person set up an iDirect hub. That model has two Protocol Processors that can balance load between themselves, so dynamic routing is needed, and the only protocol the system supports is RIPv2. Yet on the link between the router and the Protocol Processors this person configured static routing, pointing everything at one Protocol Processor without even bothering to add a route to the second with a higher metric. As a result, half of the modems did not work for an "inexplicable" reason, and when the load moved to the second Protocol Processor, nothing worked at all. Or take the designer who tried to assure everyone that his calculation was correct and that he did not understand why the antenna was aimed precisely at a lone pole standing in a field (a real case, by the way). Fortunately, I have never touched PDH and SDH, although apparently things are even more fun there. I cannot speak about telephony, as I have never worked with it.
It would seem that these are all such obvious things that there is no need to talk about them, but we have what we have. Perhaps in other regions the situation is radically different, but that is hard to believe.
It seems to me that all this happens for the following reasons:
- Commercial companies focus primarily on the needs of the business, and rightly so, but "high" executives focus not on the good of the company but on their personal good. The result is unrealistic deadlines, unpaid overtime, and then a search for someone to blame. For state organizations, just delete the words "business" and "commerce".
- As a consequence of the first point, workers develop a certain attitude toward their work.
- The desire to save on personnel: instead of a staff of competent specialists, hire one specialist and ten recent graduates or humanities majors whom nobody else would take.
- A lack of competent line managers who would act as a layer between technical specialists and directors. As a rule, managers are either incompetent, so everything falls on the shoulders of their subordinates, or they have long since stopped caring.
- The desire to save on purchases, or to skim money from them. This leads to buying equipment that is completely unable to cope with its duties, is incompatible, or is simply unnecessary.
To those who had the patience to read this stream of consciousness to the end, it may seem that I am just whining, and I would agree with you, but after running into the same problems for the fifth time, it is hard to hold back.
Let me remind you once again that everything in this article is purely my opinion and is based on personal experience. It is written in the hope that someone will avoid making these mistakes.
Thank you.