
Making a spam filter for mail


Thanks to gmail.com, I was able to bring all my mailboxes together. But there is a catch: once every message arrives in a single inbox, the spam becomes noticeable. I am too lazy to go in and delete it by hand, and the filter built into the mail service does not always do a satisfactory job.

So why not write a bot that cleans up the mailbox, especially since spam is easy to recognize by certain signs?
Here is what counts as spam in my eyes:
- messages written entirely in upper case
- messages whose main topic is porn, dating, casinos, money, and so on
- mail that someone sends regularly and that I never read

First you need to set up the PHP IMAP extension to work with the mailbox. Then come a few algorithms, which are deliberately not optimal in this article, because everyone needs their own filter (some people, for example, actually look forward to mail from pornographic sites).

This article only offers ideas and food for thought. Those who want to build their own filter will at least have a foundation.

Getting started...

There are plenty of articles on how to set up IMAP for PHP, so you can look them up. I am on Ubuntu, and it took me a couple of minutes and a small change to the settings.
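Before going further, it can help to confirm that the extension is actually available to PHP. The snippet below is a minimal sketch, not part of the original script: `imap_status_message` is a hypothetical helper name, while `extension_loaded` and `function_exists` are standard PHP functions. On Ubuntu the extension usually ships as a separate package (commonly named php-imap, though the exact name depends on the PHP version, so treat that as an assumption).

```php
<?php
// Hypothetical helper: report whether the IMAP extension can be used.
function imap_status_message(): string
{
    // Both checks are cheap; imap_open() only exists when the extension is enabled.
    if (extension_loaded('imap') && function_exists('imap_open')) {
        return "IMAP extension is loaded";
    }
    return "IMAP extension is missing; install and enable it first";
}

echo imap_status_message(), "\n";
```

If the second message appears, the rest of the code in this article will fail at the first imap_open() call.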

Once IMAP is configured, you can connect to the mailbox.
<?php
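// Connection settings
$imapaddress = "{imap.gmail.com:993/imap/ssl}";
$imapmainbox = "INBOX";
$maxmessagecount = 10;
$user = "your_login@gmail.com"; // your Gmail login
$password = "your_password";    // your password

// Connect to the mailbox and delete spam
spam_delete($imapaddress, $imapmainbox, $user, $password, $maxmessagecount);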


Now we go to the mailbox and fetch a message. Once we have it, we split the whole text into words and count them. Then, in a loop, we take the words one at a time and check whether each word suggests that the message is spam. The signs that I consider spam are described above. Finally, we estimate the probability that the message is spam with the following formula:

probability = words that did not pass the filter / total words in the message
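To make the arithmetic concrete, here is a small worked sketch of that ratio. The helper name `spam_ratio` and the toy word check are assumptions made for illustration; they are not part of the article's script. The guard for an empty message avoids a division by zero.

```php
<?php
// Hypothetical helper: fraction of words flagged by a per-word check.
function spam_ratio(array $words, callable $isSpamWord): float
{
    $total = count($words);
    if ($total === 0) {
        return 0.0; // empty message: nothing to score
    }
    $flagged = 0;
    foreach ($words as $word) {
        if ($isSpamWord($word)) {
            $flagged++;
        }
    }
    return $flagged / $total;
}

// Toy per-word check, for illustration only.
$isSpamWord = function (string $word): bool {
    return in_array(strtolower($word), ['casino', 'money'], true);
};

$words = ['Win', 'money', 'at', 'our', 'casino', 'today', 'for', 'free'];
echo spam_ratio($words, $isSpamWord), "\n"; // 2 flagged out of 8 words: 0.25
```

With a threshold of 0.5, this sample message would be kept, since only a quarter of its words are flagged.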

Here is how it all looks in code:
function spam_delete($imapaddress, $imapmainbox, $imapuser, $imappassword, $maxmessagecount)
{
    $imapaddressandbox = $imapaddress . $imapmainbox;

    // Connect to the mailbox (the password is kept out of the error message)
    $connection = imap_open($imapaddressandbox, $imapuser, $imappassword)
        or die("Can't connect to '" . $imapaddress .
            "' as user '" . $imapuser .
            "': " . imap_last_error());

    echo "Gmail information for " . $imapuser . "\n";

    echo "Inbox headers\n";
    $headers = imap_headers($connection)
        or die("can't get headers: " . imap_last_error());

    // Process at most $maxmessagecount messages (10 in this example)
    $totalmessagecount = count($headers);

    echo $totalmessagecount . " messages\n";

    $displaycount = min($totalmessagecount, $maxmessagecount);

    echo "Message bodies\n";
    // Walk through the messages one by one
    for ($count = 1; $count <= $displaycount; $count++) {
        // Section "2" is kept from the original; in many messages the plain-text part is section "1"
        $body = imap_fetchbody($connection, $count, "2");
        // Split the body into words on any whitespace
        $text = preg_split('/\s+/', $body, -1, PREG_SPLIT_NO_EMPTY);
        $spam = 0;
        // Count the words that do not pass the filter
        $n = count($text);
        for ($i = 0; $i < $n; $i++) {
            $spam += test_spam($text[$i]) == 1 ? 1 : 0;
        }
        // Share of suspicious words in the message (0 for an empty body)
        $result = $n > 0 ? $spam / $n : 0;
        // More than 50% suspicious words: mark the message for deletion
        if ($result > 0.5) {
            imap_delete($connection, $count);
        }
    }
    // Expunge only after the loop, so message numbers do not shift during the scan
    imap_expunge($connection);
    // Close the IMAP connection
    imap_close($connection);
}


The spam-checking algorithm is very simple and is written only as an example. If you want a stronger and smarter algorithm, I recommend reading the chapters on spam in the book “Programming Collective Intelligence”; the topic has also been covered on Habr.

The algorithm does two things:
1. Looks for the words that occur most often in spam.
2. Checks the letter case: if a word is entirely in upper case, it is most likely spam.

The code itself:
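// Check a single word: returns 1 if it looks like spam, 0 otherwise
function test_spam ($string) {
    // Words that often occur in spam.
    // Example entries based on the topics listed at the start of the article; replace them with your own.
    $array = array('porn' => 1, 'dating' => 1, 'casino' => 1, 'money' => 1);
    if (isset($array[strtolower($string)])) { return 1; }
    // A word written entirely in upper case (and containing letters) is also suspicious
    if (strtoupper($string) === $string && strtolower($string) !== $string) {
        return 1;
    }
    return 0;
}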
?>

I tested it on two examples, and it seems to work...
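One possible refinement: splitting a message on whitespace leaves punctuation attached to words ("casino!" will not match "casino"), and an ASCII-only case check misses non-Latin text. The sketch below shows one way to handle both with Unicode-aware regular expressions. The helpers `tokenize` and `is_shouting` are hypothetical names introduced here, not part of the script above, and the sketch assumes the message body is valid UTF-8.

```php
<?php
// Hypothetical helper: split text into words on anything that is not a letter or digit.
function tokenize(string $body): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', $body, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; treat that as "no words".
    return $words === false ? [] : $words;
}

// Hypothetical helper: true when a word has at least one upper-case letter
// and no lower-case letters, in any alphabet.
function is_shouting(string $word): bool
{
    return preg_match('/\p{Lu}/u', $word) === 1
        && preg_match('/\p{Ll}/u', $word) === 0;
}

foreach (tokenize("BUY NOW, лучшее КАЗИНО!") as $word) {
    echo $word, ' => ', is_shouting($word) ? 'upper case' : 'not upper case', "\n";
}
```

Words made only of digits are not treated as upper case here, so a message full of numbers does not get flagged by this check alone.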

P.S. I would be very glad to hear how you fight this garbage. If you find an error in the code, please do not be too harsh: this is just an example and a foundation for building something bigger.

Source: https://habr.com/ru/post/81514/

