Rspamd Spam Filtering System

Rspamd is being developed as the main spam filtering system for Rambler-Mail. From the start, though, I planned a system that would match SpamAssassin in capability, flexibility and filtering quality while being free of its main drawbacks: excessive reliance on regular expressions, poor optimization and general sluggishness, and relatively inaccurate statistics. That is how the idea of rspamd took shape: a system whose core is optimized for filtering a large stream of messages, which is easy to extend and which uses more advanced statistical algorithms. The rspamd core is written in C and uses an event-driven message processing model (based on libevent). Rspamd is extended by writing plugins and rules in Lua. The project has been open source (under the BSD license) from the very beginning and is now hosted on Bitbucket.

Preamble

Projects such as nginx, crm114 and, of course, SpamAssassin greatly influenced rspamd. From nginx, rspamd took its data processing model and its approach to handling data: the most efficient available algorithms, such as finite automata and suffix trees, are used wherever possible. In my opinion, crm114 implements the largest number of statistical algorithms and approaches, which then slowly make their way into other systems. For example, for statistical analysis of messages rspamd uses word bigrams rather than unigrams, unlike SA and many other spam filters. This makes it possible to estimate the probability (or frequency) not just of single words but of combinations of words. On the one hand, this increases the size of the statistics; on the other, it increases their accuracy. But rspamd took the most from SpamAssassin, which served as the prototype and reference point for the project. The idea of evaluating a message on many factors (regular expressions, DNS blocklists, various lists, statistics, signatures, phishing checks and others) comes from SpamAssassin. In addition, the scoring was generalized into the concept of a "metric", which makes it possible to evaluate a message against different sets of rules.
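To make the difference between unigrams and bigrams concrete, here is a minimal Python sketch. It is not rspamd's tokenizer (which is not described in this article); the function names and the whitespace splitting are assumptions made purely for illustration.

```python
def unigrams(text):
    """Split text into lowercase word tokens (naive whitespace split)."""
    return text.lower().split()


def bigrams(text):
    """Pair each word with its successor, yielding adjacent word pairs."""
    words = unigrams(text)
    return list(zip(words, words[1:]))


message = "buy cheap pills now"
print(unigrams(message))  # ['buy', 'cheap', 'pills', 'now']
print(bigrams(message))   # [('buy', 'cheap'), ('cheap', 'pills'), ('pills', 'now')]
```

A classifier that counts pairs can tell "cheap pills" apart from "cheap" and "pills" occurring separately, at the cost of a larger table of distinct tokens.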

Who is Rspamd for?

Rspamd suits systems of very different scales, from large mail services to small installations that process a few messages per hour. Large systems will benefit from easy horizontal scaling, master-slave synchronization of statistics, built-in monitoring commands, high speed, the ability to withstand sudden load spikes (by disabling expensive checks), and a flexible, extensible architecture. For small systems, rspamd works well with its out-of-the-box settings. I also plan to distribute prebuilt statistics, so that users can install rspamd and get a ready-to-use spam filter. Rspamd can be integrated with various MTAs and can also work in SMTP proxy mode (for details, see the documentation).

Obtaining, installing and configuring the system

Rspamd currently runs only on Unix-like systems (it has been tested on various Linux distributions, FreeBSD and OpenSolaris). On FreeBSD, rspamd is available as a port (mail/rspamd); users of other systems will, unfortunately, have to build rspamd from source. This process, as well as the initial configuration, is described in the "quick start" guide. Rspamd can be configured in two ways: through an XML file and through Lua. The first method is meant for basic parameters; for example, this is how weights and thresholds for the various message actions are set:

<!-- Metrics section -->
<metric>
  <name>default</name>
  <required_score>14.0</required_score>
  <!-- Sample actions -->
  <action>reject</action>
  <action>greylist:4</action>
  <action>add_header:8</action>
  <!-- Weights for symbols -->
  <!-- Subject is missing inside message -->
  <symbol weight="2.00" description="Subject is missing inside message">MISSING_SUBJECT</symbol>
  ...
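One plausible reading of this metric can be sketched in Python: the weights of all symbols that matched are summed, and the action with the highest threshold that the score has reached is applied. This is an illustration, not rspamd's implementation. The names `ACTIONS`, `message_score`, `choose_action` and `HYPOTHETICAL_RULE` are invented here, and treating `required_score` as the threshold for `reject` is an assumption.

```python
# Thresholds copied from the sample metric; mapping required_score (14.0)
# to the "reject" action is an assumption of this sketch.
ACTIONS = [("reject", 14.0), ("add_header", 8.0), ("greylist", 4.0)]


def message_score(symbols, weights):
    """Sum the weights of the symbols (rules) that matched the message."""
    return sum(weights.get(name, 0.0) for name in symbols)


def choose_action(score, actions=ACTIONS):
    """Return the action with the highest threshold that the score reaches."""
    for name, threshold in sorted(actions, key=lambda a: a[1], reverse=True):
        if score >= threshold:
            return name
    return "no action"


# MISSING_SUBJECT comes from the sample; HYPOTHETICAL_RULE is made up.
weights = {"MISSING_SUBJECT": 2.0, "HYPOTHETICAL_RULE": 3.0}
score = message_score(["MISSING_SUBJECT", "HYPOTHETICAL_RULE"], weights)
print(score, choose_action(score))  # 5.0 greylist
```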

The second method is used for writing rules and for finer tuning of rspamd. For example, you can write a function that selects the "appropriate" statistics files for a message (the base package ships such a function, which selects statistics according to the language of the message). Here is an example rule that detects a message with an empty text part and an attached image:

reconf['R_EMPTY_IMAGE'] = function (task)
    local parts = task:get_text_parts()
    if parts then
        for _, part in ipairs(parts) do
            -- An empty text part combined with any image triggers the rule
            if part:is_empty() then
                local images = task:get_images()
                if images then
                    return true
                end
            end
        end
    end
    return false
end

You can use the rspamc console client for training statistics, checking messages and monitoring rspamd.

Current project status

Not long ago I ran a small performance test of rspamd and SA (a stream of 30 simultaneous connections, with approximately the same rule sets). The approximate results were as follows:

Rspamd: [figure: rspamd stat]

SpamAssassin: [figure: spamassassin stat]

Besides Rambler-Mail, rspamd is now used by several large providers. I am the project's only developer, so I welcome new users, who help improve its quality and bring new ideas. In the latest version (0.4.0) I put a lot of effort into stabilizing and simplifying the project and improving its quality. So if someone gets less spam thanks to rspamd, then I did not write the code and this article in vain.
