In the previous article I described the main problems facing administrators of large companies. Today I will continue the topic and describe the main configuration problems in large networks and their possible solutions.
As a reminder, the original setup:
- SQUID 3.0 in transparent mode via pf
- external categorization service: Orange Filter Database
- 300-350 clients
- peak load of up to 300 requests per second
- internal categorizer with data from bigblacklist, rejik and other sites
Problem one: DHCP

Assigning computers static addresses in a large network is considered bad form from an administration point of view and brings plenty of other problems with it. However, DHCP is not a panacea either.
Consider a setup with AD and ISC-DHCPd. The main problem is correctly synchronizing the DHCPd address base with AD. There are not many options here. Opening the root domain zone for unsecure dynamic updates invites address spoofing on the network. Domain computers that get an address via DHCP register themselves in DNS on the domain controllers when the network interface comes up at boot (unless configured otherwise). The trouble starts when computers that are not domain members join the network.

A quick note on MS DHCP Server: Microsoft's DHCP server is fine in every respect except the convenience of managing it and its handling of MAC address masks for dynamic device configuration. But it is not the subject here. To register a reverse zone properly without MS DHCP Server, the best option is the ISC-DHCPd and ISC BIND combination. Reverse zones for your subnets are set up on BIND, and key-based zone updates are configured by writing the same key into both the dhcpd and the bind configs. On the domain controllers, the reverse zones are configured as secondary (or forwarded) so that reverse resolution works correctly. When dhcpd issues an address to a host, it automatically updates the reverse zone on bind, and bind notifies the controllers that the zone has changed.
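A minimal sketch of such a pair of configs, assuming a 192.168.0.0/24 subnet and a shared TSIG key (the key name, secret and addresses are placeholders, not taken from a real setup):

    # dhcpd.conf fragment: the same key as in named.conf, plus the
    # reverse zone that dhcpd should update when it issues a lease
    key dhcpupdate {
        algorithm hmac-md5;
        secret "base64-encoded-shared-secret==";
    }
    ddns-update-style interim;
    zone 0.168.192.in-addr.arpa. {
        primary 192.168.0.53;        # the BIND server
        key dhcpupdate;
    }

    // named.conf fragment: accept updates signed with that key and
    // notify the domain controller that holds the secondary zone
    key dhcpupdate {
        algorithm hmac-md5;
        secret "base64-encoded-shared-secret==";
    };
    zone "0.168.192.in-addr.arpa" {
        type master;
        file "dynamic/0.168.192.rev";
        allow-update { key dhcpupdate; };
        also-notify { 192.168.0.10; };   // domain controller
    };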
Problem two: Disabled hosts

Using FQDNs for the hosts allowed access in squid configs and elsewhere means that one day, when a host has not refreshed its A and PTR records for a long time, its name will stop resolving in DNS. That in turn will crash the process that tried to resolve the name (unless a workaround for this problem is in place). Listing every host name in /etc/hosts while using dhcpd leads to stale information. One solution is to authorize in the redirector by host name, with a preliminary reverse resolution of the IP address.
What does this give us? The scheme frees you from binding rules to a host's IP address, making the configuration more flexible. The scheme itself is simple (a sketch in code follows the list):
- data arrives at the external acl as "<IP address> <URI>"
- gethostbyaddr is called for the address and the result is compared against the external acl settings
- to reduce the load on DNS, the gethostbyaddr result is cached for a certain time
- if the resolved name does not match, the rule does not apply; if it does, it applies
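A minimal sketch of such a helper, assuming squid passes "%SRC %URI" per line to an external acl (the allowed-name list and cache TTL here are placeholders; the real pradm redirector is more elaborate):

    #!/usr/bin/perl
    # External acl helper sketch: authorize by reverse-resolved host name.
    use strict;
    use warnings;
    use Socket qw(inet_aton AF_INET);

    my %cache;                      # ip => [ name, expiry ]
    my $ttl = 300;                  # cache reverse lookups for 5 minutes
    my %allowed = map { $_ => 1 } qw(host1.example.local host2.example.local);

    $| = 1;                         # squid helpers must not buffer output
    while (my $line = <STDIN>) {
        chomp $line;
        my ($ip, $uri) = split /\s+/, $line, 2;
        my $entry = $cache{$ip};
        my $name;
        if ($entry && $entry->[1] > time) {
            $name = $entry->[0];    # fresh cached answer
        } else {
            my $packed = inet_aton($ip);
            $name = $packed ? (gethostbyaddr($packed, AF_INET) || '') : '';
            $cache{$ip} = [ $name, time + $ttl ];
        }
        my $ok = $name && $allowed{lc $name};
        print $ok ? "OK\n" : "ERR\n";
    }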
Problem three: "You may not do what others may, and may do what others may not"

The problem is not so much one of configuration as one of logic, and it is common to all access control systems. Most access control systems assume an unambiguous definition of rights: either "allowed" or "prohibited", without accounting for a resource falling into several categorization groups at once. The best option for access control is to support both the "allow, deny" and "deny, allow" evaluation orders.
Take the encoding "allow = 0, deny = 1" as the default.
For the "allow, deny" order, a logical OR applies: first we allow access to everything, then check whether the domain falls into a denying category (0 or 1 = 1).
For the "deny, allow" order, a logical AND applies: first we deny access to everything, then check whether the domain falls into an allowing category (1 and 0 = 0).
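In code the two orders differ only in the default verdict and in which set of categories is consulted; a sketch (domain_in_categories is a hypothetical lookup against the category database):

    # "allow, deny": default allow, a hit in a denying category blocks
    sub check_allow_deny {
        my ($domain, $deny_cats) = @_;
        my $deny = 0;                                              # allow = 0
        $deny = 1 if domain_in_categories($domain, $deny_cats);   # 0 or 1 = 1
        return $deny;
    }

    # "deny, allow": default deny, a hit in an allowing category unblocks
    sub check_deny_allow {
        my ($domain, $allow_cats) = @_;
        my $deny = 1;                                              # deny = 1
        $deny = 0 if domain_in_categories($domain, $allow_cats);  # 1 and 0 = 0
        return $deny;
    }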
Problem four: External categorization of domains

Most existing categorization solutions for squid rely on offline databases, built by converting text files downloaded from the Internet into whatever database format the redirector understands. Usually that is BerkeleyDB, with all its attendant comforts and inconveniences. BerkeleyDB is very convenient for processing large volumes of data, but it is thoroughly inconvenient for maintaining the contents of those databases, since it has no mechanism for controlling a value's lifetime. As a result, updating the database means rebuilding it entirely, discarding all previously loaded data. The better approach is a differential or incremental update of the data within the database.
Why did I bring up external categorization of domains at all?
On a heavily loaded server, categorizing domains through external data sources can push the load over the edge. To spread the load, intermediate caching of the categorization server's responses is necessary.
Consider blocking TOR connections using dnsbl. Two domain zones exist for this, tor.dan.me.uk and torexit.dan.me.uk: the first lists tor clients, the second lists tor nodes that can route traffic through themselves. At a load of 100 to 200 requests per second we will send 6000 to 12000 requests per minute to the DNS server. Not every local DNS server will enjoy that. You can reduce the number of requests to the dnsbl server by not checking every host that is accessed and limiting the checks to the hosts that initiate connections. Either way, the more external categorization services we use to determine the category of a host or domain, the more system resources we have to spend on it.
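The dnsbl check itself is a single A-record lookup: the octets of the client address are reversed and queried inside the list zone. A sketch against the zones named above (no caching yet; that comes next):

    # An address is listed in a dnsbl zone if <reversed-ip>.<zone>
    # resolves to any A record.
    sub dnsbl_listed {
        my ($ip, $zone) = @_;
        my $query = join('.', reverse split /\./, $ip) . '.' . $zone;
        return defined(gethostbyname($query)) ? 1 : 0;
    }

    # e.g. refuse the request if the client is a routing tor node:
    # if (dnsbl_listed($client_ip, 'torexit.dan.me.uk')) { ... }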
What is the way out of this situation?
In my opinion there is only one way out: query caching with control over the cache lifetime. The cache can be refreshed either by a full reload of the data or by differential and incremental updates. The system should run automatically, without administrator involvement, and should not fall over when some external resource becomes unavailable.
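A sketch of such a cache on top of memcached, using Cache::Memcached::Fast (already among the project's dependencies); the key prefix, the one-hour TTL and external_lookup are placeholders:

    use Cache::Memcached::Fast;

    my $memd = Cache::Memcached::Fast->new({ servers => ['127.0.0.1:11211'] });

    # Consult the cache first; on a miss ask the external service and store
    # the answer with its own lifetime, so an unavailable source costs us
    # one failed lookup per TTL instead of one per request.
    sub cached_category {
        my ($domain) = @_;
        my $cat = $memd->get("cat:$domain");
        return $cat if defined $cat;
        $cat = external_lookup($domain);        # the real categorization call
        $cat = 'unknown' unless defined $cat;   # never die on a dead source
        $memd->set("cat:$domain", $cat, 3600);  # expires after an hour
        return $cat;
    }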
The same domain categorization method can be used to identify open proxy servers, malware hosts and other malicious servers. A list of dnsbl zones can be found at
www.robtex.com/ip/127.0.0.1.html#blacklists . Many of you will now want to object: "but dnsbl lists are for checking mail for spam!" The answer is simple: what is the difference between checking a host for membership in a spam list and checking it for membership in a malware list? None!
Once again, a warning: depending on the availability of the DNS servers and the number of requests sent, this scheme can slow access to the network down considerably.
Problem five: One second-level domain, many domains of the third level and above

How do you allow third-level domains while banning the second-level domain? Some blocking services block a domain strictly by its second-level name. Domain whitelists do exist, of course, but they usually apply to everyone. Consider restricting access to the yandex.ru domain (as an example of a multi-level domain tree). One user needs access only to mail and calendar, with everything else in the domain closed; another should be allowed everything in the domain except mail. The simplest way to express this is something like "user1: +mail.yandex.ru +calendar.yandex.ru -yandex.ru", "user2: -mail.yandex.ru". But not all redirectors support such a scheme. In essence, the resolution for a second-level domain is computed the same way as for domains of the third level and higher, except that the second-level root domain may have no category of its own. To speed up the search, the second-level root domain should record at which levels its categorized subdomains sit, to reduce the number of checks.
Say there is a domain vasya.pupkin.home.drive.narod.ru, and we have a category for the third-level domain drive.narod.ru, "Users Drives", while the second-level domain narod.ru is categorized as "Users Homepages". Clearly, with categories present only at the second and third levels, there is no point checking domains above the third level. Suppose the url vasya.super.puper.good.narod.ru is requested. The first check should be against good.narod.ru, since we know that categorized domains exist at the third level. If no category is found at that level, the next level down is checked (the second), and as a result the requested url receives the "Users Homepages" category in just two checks instead of five.
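A sketch of that lookup: knowing the deepest categorized level, walk from it down toward the second level and stop at the first hit (the hash-based store is illustrative):

    # Hypothetical category store; the deepest categorized level is 3,
    # so labels above the third level are never checked directly.
    my %category = (
        'drive.narod.ru' => 'Users Drives',
        'narod.ru'       => 'Users Homepages',
    );
    my $max_level = 3;

    sub lookup_category {
        my ($host) = @_;
        my @labels = split /\./, $host;
        my $start = @labels > $max_level ? $max_level : scalar @labels;
        for my $level (reverse 2 .. $start) {
            my $candidate = join '.', @labels[ -$level .. -1 ];
            return $category{$candidate} if exists $category{$candidate};
        }
        return undef;    # no category known
    }

    # lookup_category('vasya.super.puper.good.narod.ru') checks
    # good.narod.ru, then narod.ru -> 'Users Homepages' in two passes.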
Problem six: Correct user identification on internal information resources

When a user is redirected to a page explaining why a connection was blocked, the information page sees the proxy server's address instead of the user's. You can rely on proxy headers such as X-Forwarded-For and Via, but letting those travel outside the company is not safe enough. The best solution is therefore the automatic browser configuration script (WPAD). A Google search will give you comprehensive information on configuring this file. The point is that during automatic configuration of the user's browser you can specify which resources must be reached through the proxy server and which should be reached directly. For internal intranet sites, for example, there is no point in going through the proxy.
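A minimal wpad.dat sketch (PAC files are JavaScript by specification; the proxy address and the intranet suffix are placeholders):

    // wpad.dat -- served from the web server's DocumentRoot
    function FindProxyForURL(url, host) {
        // intranet sites go direct, everything else through the proxy
        if (isPlainHostName(host) || dnsDomainIs(host, ".intranet.local"))
            return "DIRECT";
        return "PROXY proxy.intranet.local:3128; DIRECT";
    }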
Problem seven: Controlling after hours what is available during business hours

While the system administrator is at work, most problems are solved fairly quickly, from virus activity to monitoring proxy traffic. Once the administrator's working day ends, all sorts of nastiness can happen in his absence. This borders on paranoia, of course, but there is a rational grain in it. Squid can strictly limit users' working hours. The main problem with that scheme is that users cannot extend their own network access when, say, they need to stay after hours; such things usually do not happen without administrator involvement, so separate temporary lists, user groups and other contraptions get invented. Yet the simplest way to control time is to hand the task to the checking script: when needed, the user extends his stay on the network himself, after being shown a warning page saying that working time is over. The scheme is also extremely convenient as protection against viruses, trojans and other nastiness: with the network configured correctly, not a single request gets past the controlling script.
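A sketch of the time check inside the controlling script, assuming the extension flag is kept in memcached under a per-host key (hours, key name and extension length are arbitrary):

    # $memd is a Cache::Memcached::Fast handle as in the earlier sketch.
    # Allow access during business hours, or after hours if the user has
    # confirmed an extension on the warning page.
    sub access_allowed_now {
        my ($host) = @_;
        my $hour = (localtime)[2];
        return 1 if $hour >= 9 && $hour < 18;      # business hours
        return 1 if $memd->get("extend:$host");    # user asked to stay online
        return 0;                                  # redirect to the warning page
    }

    # error.cgi would set the flag when the user confirms:
    # $memd->set("extend:$host", 1, 2 * 3600);     # two more hours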
Problem eight: Proxy failover and configuration integrity

If you have a large network, the question of a standby server comes up sooner or later, and the stumbling block is the integrity and identity of the configuration across all your proxy servers. I want the config kept somewhere central and managed from anywhere, without being tied to any particular server. Keeping the config in memcached or mysql would be ideal.
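The idea fits in a few lines: publish the config once from any machine and let every proxy read the same copy. A sketch with Cache::Memcached::Fast (server address and key name are illustrative; in spirit this is what the reloadconfig.pl script described below does):

    use Cache::Memcached::Fast;
    my $memd = Cache::Memcached::Fast->new({ servers => ['memcached.local:11211'] });

    # Publish the config from wherever it was edited...
    open my $fh, '<', '/usr/local/etc/squid/pradm.conf' or die $!;
    my $conf = do { local $/; <$fh> };
    $memd->set('pradm:config', $conf);     # no TTL: keep until replaced

    # ...and every redirector instance fetches the same copy at startup
    my $shared = $memd->get('pradm:config');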
So here we are. Why did I actually write all this? :)
I want to share my development with everyone.
The project is called PRADM (Proxy Administration Kit)
Currently the project includes two components: a redirector that provides access control, and a categorization server that works with lists of resources, ip addresses and an unlimited number of categories. Both components are designed for continuous operation and are managed in real time, without having to restart squid to pick up configuration changes.
All components are written in perl and use a number of external modules:
- memcached 1.4.4
- perl 5.8.9 (with defined-or)
- Cache::Memcached::Fast
- Digest::MD5
- URI::URL
- MIME::Base64
Main locations of some files (for those who want to try it):
- pradm - /usr/local/squid/scripts
- category server - /usr/local/squid/catdbserver/
- pradm logs - /usr/local/squid/logs/redir.log
- pradm.conf config - /usr/local/etc/squid/pradm.conf
The pradm component consists of:
- pradm.conf.example - example config
- pradm.pl - the redirector script
- md5.pm - auxiliary module with service functions
- squid.conf - an example squid.conf showing how to call the redirector as an external acl
- reloadconfig.pl - script that loads the configuration into memcached
- showperm.pl - script that shows the data for a given host
Web directory:
- error.cgi - script that outputs information about a blocked resource and manages the extension of network access (should live in /cgi-bin/ of the web server whose address is specified in squid.conf)
- stop.gif and wpad.dat - should live in the DocumentRoot of the web server whose address is specified in squid.conf
The catdbserver component consists of:
- README - brief information on using the category loader
- categoryloader.pl - script that builds the database file for the category server
- catserver.pl - the category server itself
- testfilter.pl - script for testing your category server
- updatelists.sh - script that downloads lists of malware domains from external resources and uploads them to the category server
Most of the scripts have brief built-in help on their options, with the exception of the redirector and the informing script.
Project sources:
aborche.com/pradm/source/pradm.tar.gz
aborche.com/pradm/source/catdbserver.tar.gz
An SQL-backed category server is also in development (the server itself is ready, and the speed of adding domains to the database has been optimized). A memcached-backed category server is in development as well (there are still problems with indexing the data in the memcached server's database).
A brief summary of the project's features is at aborche.com/pradm .
If you have questions, ask; I will be glad to answer them.
Aborche 2010