To-do: Filtering anything and everything

This article is more of a FAQ than a full manual. Much has already been written on Habr, and the tag search will find it, so there is no point in rewriting it all here.

Recently our state, for better or worse, has taken an interest in the Internet and its contents.
Many will undoubtedly say that rights and freedoms are being violated. I think few people doubt that these laws were written by people with little understanding of the Internet, and that their main goal is not to protect us from what is out there. Still, whether you are simply a responsible person or are being pushed by prosecutors, in some institutions the question of limiting incoming information does arise. Such institutions include schools, kindergartens, universities, and the like. Businesses also need to take care of information security.
The first step on the way to a local content filter is

Analysis of what the Internet is and how it works.

It's no secret that 99 percent of the Internet is http. Every site has a name, page content, a URL, and an IP address. Several sites can share one IP address, and one site can have several. URLs can be either dynamic or permanent.
And whatever is written on a page is its content. From this we conclude that sites can be monitored:

By site name
By page URL
By the content written on the page
By IP address

Further, all content on the Internet can be divided into three groups:

This is bad
This is an unknown
This is good.

Two ideological paths follow from this:

We allow only what is good and prohibit both the bad and the unknown. This path is called the BIG (and sometimes small) whitelist.
We allow what is good and what is unknown, and prohibit only the bad. This path bears the proud name of blacklist.

And of course there is a middle ground between the two: we prohibit the bad, allow the good, and analyze the unknown, deciding on the fly whether it is bad or good.
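The three policies differ only in how they treat the unknown group. As a minimal sketch (the names `classify`, `is_allowed`, and the `analyze` callback are illustrative assumptions, not part of any tool mentioned in this article), the decision could look like this:

```python
from enum import Enum


class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"
    UNKNOWN = "unknown"


def classify(host, whitelist, blacklist):
    """Sort a host into one of the three groups using the two lists."""
    if host in whitelist:
        return Verdict.GOOD
    if host in blacklist:
        return Verdict.BAD
    return Verdict.UNKNOWN


def is_allowed(host, whitelist, blacklist, policy, analyze=None):
    """Apply one of three policies: 'whitelist', 'blacklist', or 'mixed'."""
    verdict = classify(host, whitelist, blacklist)
    if verdict is Verdict.GOOD:
        return True
    if verdict is Verdict.BAD:
        return False
    # The host is unknown: this is the only place where the policies differ.
    if policy == "whitelist":
        return False
    if policy == "blacklist":
        return True
    if policy == "mixed":
        # Middle ground: decide on the fly with a caller-supplied analyzer
        # (hypothetical; in practice this would be content analysis).
        return analyze(host)
    raise ValueError(f"unknown policy: {policy}")
```

With the same two lists, an unknown host is rejected under "whitelist", accepted under "blacklist", and handed to the analyzer under "mixed".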

Means of implementation.

Here again there are two ways:

We take a turnkey solution.

Such solutions come in three types: paid, free, and limited (until you pay).

Paid solutions come as hardware (that is, a box with unknown contents that simply does its job), hardware plus software (the same box, but with a full-fledged OS and the corresponding applications), and software only.
Free solutions are software only. There are exceptions, but they are exactly the kind that prove the rule.
Paid solutions include, for example, the related functionality in Kaspersky antivirus, ideco.ru, netpolice, kerio, and so on. They are easy to find because they are well advertised; it is enough to search for something like "buy a content filter".
Free solutions have one drawback: they cannot do everything at once. They are also harder to find. Here is a list of them: pfSense, SmoothWall (comes in paid and free versions; the free one is somewhat lacking in functionality), Untangle Gateway, Endian Firewall (also paid and free), IPCop, Vyatta, eBox Platform, and ComixWall (a wonderful solution; you can download it from my site at 93.190.205.100/main/moya-biblioteka/comixwall ). All turnkey solutions share one drawback: limitations.

We do everything by hand.

This path is the most difficult but also the most flexible. It lets you build whatever your heart desires (including a loophole).
There are a great many components, but the most powerful and necessary are:

Squid. You cannot get anywhere without a proxy.
Dansguardian. This is the heart of the entire content filter. Its only free rival (not counting its forks) is the POESIA filter, which is very dated.
The bind DNS server.
Clamav. Antivirus.
Squidguard, rejik, and similar redirectors for the proxy.
Squidclamav.
Sslstrip. This utility turns encrypted https traffic into decrypted http traffic.
www.thoughtcrime.org/software/sslstrip . Its analogs are the flipper proxy server and charly proxy, but they run on Windows, and the second one is paid. Whoever needs them can run them under Wine.
Blacklists. These can be obtained from www.shallalist.de (1.7 million sites), www.urlblacklist.com (specifically the big version, with more than 10 million sites), www.digincore.com (about 4 million), and from site directories.
Whitelists. Here things are very tight. The only decent (meaning large) Russian-language list comes from the Safe Internet League, and only in the form of the League's proxy or its program: www.ligainternet.ru/encyclopedia-of-security/parents-and-teachers/parents-and-teachers-detail.php?ID=532 . By the way, because the League's proxy uses digest authentication, squid cannot use it as a parent. If anyone knows how to attach a proxy server with digest authentication as a parent proxy, please let me know.
DNS lists. There are two well-known options. The first is the skydns filter, www.skydns.ru .
The second is yandex dns, dns.yandex.ru .
Skydns is the more functional of the two.

Where filtering occurs.

The following options are possible:

On users' computers without centralized management, as a system component or an application.
The same as the first, but with centralized management (for example, KASPERSKY ADMINISTRATION KIT).
As a browser component. There are suitable plugins for Chrome and Firefox.
On a single computer or a cluster of computers (including on the gateway).
Distributed.

In terms of filtering speed under heavy network use, options 1, 2, and 3 are the fastest.
In terms of effort, 1 and 3 are the most labor-intensive.
In terms of reliability, meaning how hard it is for the user to bypass the filtering, option 4 takes first place.
Option 5 is a dream, but it does not exist anywhere.

Now the next question:

Reliability of filtering.

I think this is clear. Protection needs to be multi-level, because what leaks through one level will be blocked by another.

Let's talk about the shortcomings of each level of protection.

Lists

The Internet is a constantly and, more importantly, very rapidly changing environment. Clearly our lists will not keep up with it, all the more so if we maintain them by hand. So take part in list-making communities, and use not only list files but also list services, where everything is done for us (for example, skydns and yandex).
A list also cannot cover the case where something bad is written on one particular page while the site itself is otherwise perfectly clean.
Use multiple lists. What did not make it into one may be in another. !!!
List-based programs include Netpolice (http://netpolice.ru), censor (http://icensor.ru/), Traffic Inspector for schools (http://www.smart-soft.ru/ru), and others. Usually, programs that can do lexical analysis can also work with lists.
Censor has an old database from 2008, but it is completely free. Netpolice has many versions; the free one is cut down.
And do not forget: neither blacklists nor whitelists can protect you 100%. Only lexical analysis is capable of that.
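To illustrate the advice about using multiple lists, here is a minimal sketch of merging several domain lists and checking a host against the result. It assumes a simple format of one domain per line with `#` comments, which may not match the actual format of the lists named above; the function names are made up for illustration.

```python
def load_domains(lines):
    """Parse one list: one domain per line, '#' comments and blanks ignored."""
    domains = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            domains.add(entry)
    return domains


def merge_lists(*lists):
    """Union of several lists: what one list missed, another may contain."""
    merged = set()
    for domains in lists:
        merged |= domains
    return merged


def is_listed(host, domains):
    """True if the host itself or any of its parent domains is in the set."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in domains:
            return True
    return False
```

Matching parent domains means that an entry for a site also covers its subdomains, so the lists do not have to enumerate every `www.` variant.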

Analysis for viruses.

Here the main problem is the anti-virus database. Again, use more than one level: one antivirus on the gateway, another on the workstations.

Analysis of the content written on the page.

Here the main problem is the lexical analysis of the text. Nobody has money for artificial intelligence, so a base of words and expressions with weighting factors is used instead. The smaller the base, the less effective the filtering; the larger the base, the more effective, but also the more resource-intensive. For example, parsing Jules Verne's The Mysterious Island from lib.ru takes 8 seconds with my base and dansguardian (Core 2 Duo 2.66). And the base has to come from somewhere. I had to build a decent base myself, which I am sharing with you: 93.190.205.100/main/dlya-dansguardian/spiski/view .
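The idea of a weighted base of words and expressions can be sketched roughly as follows. This is an illustration only and not dansguardian's actual algorithm; the phrases, weights, and limit are invented for the example.

```python
def score_text(text, weighted_phrases):
    """Sum weight * occurrences for every phrase found in the text."""
    lowered = text.lower()
    total = 0
    for phrase, weight in weighted_phrases.items():
        # Each phrase costs one scan of the text, which is why a larger
        # base is more effective but also slower.
        total += lowered.count(phrase.lower()) * weight
    return total


def is_blocked(text, weighted_phrases, limit):
    """Block the page when its total score exceeds the configured limit."""
    return score_text(text, weighted_phrases) > limit
```

Negative weights let harmless context pull the score down, so a page that merely mentions a bad word in passing is less likely to be blocked.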

The next question is

The ability of users to bypass content filtering.

This issue can be solved in two radical ways.

Prohibit direct access to the network, so that all traffic must pass through the proxy server (on the proxy, the CONNECT method must also be restricted to a list of domains and/or IP or MAC addresses). We do this either with iptables or simply by setting net.ipv4.ip_forward = 0 in sysctl.conf. iptables itself is a topic for a separate article.
Prohibit users from installing anything on their workstations. The logic is clear: no program, no bypass.
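The CONNECT restriction from the first point can be thought of as a check like the one below. This is a sketch of the logic only, not proxy configuration; the function name is hypothetical, and limiting tunnels to port 443 is an added assumption rather than something stated above.

```python
def allow_connect(host, port, allowed_domains, allowed_ports=frozenset({443})):
    """Allow a CONNECT tunnel only to listed domains on permitted ports."""
    if port not in allowed_ports:
        return False
    labels = host.lower().rstrip(".").split(".")
    # Accept the host if it or any parent domain is on the allowed list.
    return any(
        ".".join(labels[i:]) in allowed_domains for i in range(len(labels))
    )
```

Without such a check, a user could tunnel arbitrary traffic through the proxy with CONNECT and bypass the content filter entirely.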

The issue of performance.

Here everything is more or less clear: more memory, more hertz, more cache. For those with modest hardware it is also very useful to build with CFLAGS optimization. Any Linux or FreeBSD allows this, but it is especially convenient on gentoo, calculate linux, slackware, and freebsd.
Those with multi-core processors should use OpenMP (a dansguardian build adapted for it can be taken from me at 93.190.205.100/main/dlya-dansguardian . By the way, it also fixes the bug that made it impossible to upload data to the Internet.) Set CFLAGS="-fopenmp" and LDFLAGS="-lgomp". Remember to add -O3 -mfpmath=sse+387. About autopatching here.

The question of cache and proxy hierarchy.

If you have a lot of computers and can dedicate several of them to filtering, do so. On one machine, install the squid proxy server and list the parent caches in its configuration with the round-robin option (http://habrahabr.ru/post/28063/). The parents are dansguardian instances, each running on its own machine together with a squid (because dansguardian cannot work without an upstream proxy). These upstream squids sit on the same machines as the dansguardians. A large cache makes no sense for the upstream squids; the first squid, however, must have the largest cache. Even if you have only one machine, still build the chain squid1 -> dansguardian -> squid2 -> provider on it, with the same distribution of caching. Do not load dansguardian with anything except analysis of the text on pages, content rewriting, headers, and some URL and MIME type blocking. Under no circumstances hang antivirus or blacklists on it, otherwise everything will slow down.

Let squid1 and squid2 do the list checks.
Let squidclamav do the virus scanning via c-icap on squid2. We put the whitelists on squid1.
Everything on the whitelist should go directly to the Internet, bypassing the parent proxy. !!!
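The routing decision made on the first proxy can be sketched like this: whitelisted hosts go direct, everything else is spread over the filtering parents in turn. The class and the parent addresses are hypothetical, and the sketch assumes at least one parent is configured; in a real setup squid itself does this.

```python
from itertools import cycle


class Router:
    """Pick a route for each request the way the first proxy would."""

    def __init__(self, whitelist, parents):
        self.whitelist = set(whitelist)
        # Endless round-robin iterator over the filtering parents.
        self._parents = cycle(parents)

    def route(self, host):
        if host in self.whitelist:
            # Trusted sites skip the content filter entirely.
            return "DIRECT"
        return next(self._parents)
```

Sending whitelisted traffic direct keeps the expensive lexical analysis for pages that actually need it.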

Be sure to run your own DNS server and have it forward to skydns or yandex dns. If the provider has local resources, add a forward zone pointing to the provider's DNS. Also register a local zone in the DNS server for the intranet resources you need (they are needed if you want things to look neat). Specify nosslsearch for google search. In the squid configs, be sure to use our own DNS.
For everything we use the Webmin web interface and the command line. On Windows servers we do everything with the mouse.

LAN setup

Use authentication by IP address. Unless you are a “serious” organization, access with mandatory logging requires nothing more.
Use logically separated networks on one physical network. Assign IP addresses by MAC address. Forbid connections to the proxy port if the machine's MAC address does not match the IP address assigned to that MAC.
Configure iptables so that requests to any of the ports (3128, 80, 80, 3130, 443) go through the proxy server's port.
Set up automatic proxy configuration on the network via dns and dhcp: www.lissyara.su/articles/freebsd/trivia/proxy_auto_configuration
Assign groups and filtering levels by IP address.
You can also set the proxy in the browser settings.
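The IP-to-MAC binding rule from the list above amounts to a lookup like the following. This is a sketch of the check only; in practice it would be enforced by the firewall or the proxy, and the table and function names are assumptions for illustration.

```python
def normalize_mac(mac):
    """Bring a MAC address to one form: lowercase, colon-separated."""
    return mac.replace("-", ":").lower()


def may_use_proxy(ip, mac, bindings):
    """Allow the connection only if this IP was assigned to this MAC."""
    expected = bindings.get(ip)
    return expected is not None and normalize_mac(expected) == normalize_mac(mac)
```

A user who changes their IP address by hand to get into a less restricted group fails the check, because the new address is bound to someone else's MAC or to none at all.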

Preparing for an inspection.

In this case, turn all the sliders to the maximum.
Additionally, block all video sites, VKontakte, social networks, music portals, file hosting sites, and file-sharing networks.
Block mp3.
Tick the safe search box in your SKYDNS account.
Be sure to get your documentation in order !!!

Https filtering

To do this, we insert sslstrip between squid2 and the provider. This utility turns encrypted https traffic into decrypted http traffic. www.thoughtcrime.org/software/sslstrip . It is also possible to set rules in squid1 that match port 443 and domains for blocking or allowing.

A few more tips.

Not all sites are filtered correctly, so use dansguardian's bypass feature for blocked pages. A ready-made page can be taken from me: 93.190.205.100/main/dlya-dansguardian/stranichka-blokirovka-i-razblokirovka-dlya-dansguardian/view .
Always keep logs of site visits and keep statistics for a year. There will always be clever people who want to do something illegal on the Internet. An IP address is a sufficient identifier, both because it is enough and in order to comply with the personal data protection law. Make the statistics open.
There are sites that dansguardian cannot handle, namely those that use json, for example yandex.ru and video.yandex.ru. For these, set up password authorization through squid1.
Not all providers comply with the law, and not everything in the federal list of extremist materials and on zapret-info.gov.ru gets blocked. So for the former, read the list and add to your base of words and expressions, and for the latter, use the antizapret.info export.
Be aware that most prosecutors do not care who is to blame. If it can be seen, it can be seen, whatever you do.
Do not forget to install snort with snortsam. Security comes first, all the more so if you have a public IP address on the gateway.
Many search engines can filter their results. This is done by adding a special parameter to the request, or via cookies. Lately they have increasingly been switching to cookies, so a corresponding dansguardian setting is needed. You can take the configs for that from me; it is all set up there. In addition, lists must be made in 4 encodings (1251, utf8, koi8r, utf16), and the correct filtering method must be selected (more details in the configs). For youtube use edufilter .
A good squid configuration manual can be found here .
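The four-encoding requirement for phrase lists exists because the same Russian word is a different byte sequence in each encoding, and a filter that compares bytes will only catch the variant it has on file. A minimal sketch of producing all four variants (the function name and the one-phrase-per-line layout are assumptions, not the format dansguardian expects):

```python
# Python codec names for 1251, utf8, koi8r, and utf16.
ENCODINGS = ("cp1251", "utf-8", "koi8_r", "utf-16")


def encode_phrase_list(phrases, encodings=ENCODINGS):
    """Return the same phrase list as bytes in each requested encoding."""
    text = "\n".join(phrases) + "\n"
    return {enc: text.encode(enc) for enc in encodings}
```

Each resulting byte string can then be written to its own list file, so that a page served in any of these encodings is still matched.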

Source: https://habr.com/ru/post/188444/
