The diary and photos of Chris Beach. I'm into windsurfing, coding, badminton, drawing and composing music using computers and synths.




automating spam assassin's learning process on plesk

Update [22/11/07]: I have updated the script to make it more generic

Amongst other techniques, Spam Assassin uses a Bayesian filter to judge the probability that a mail is spam. The filter works from how often particular words have appeared in spam and in non-spam mail. To be effective it needs to be taught with examples of both, and this can be automated to a degree.
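To give a feel for the idea, here is a toy sketch of how a single word could be scored from learnt counts. It is only an illustration: the function name and the numbers are made up, it assumes spam and non-spam are equally likely to begin with, and Spam Assassin's real Bayes code is considerably more elaborate.

```shell
#!/bin/sh
# Toy illustration only, not Spam Assassin's actual algorithm.
# token_spam_probability is a hypothetical helper name.
token_spam_probability() {
    # $1 = spam mails containing the word, $2 = total spam mails learnt
    # $3 = ham mails containing the word,  $4 = total ham mails learnt
    # Assumes both totals are non-zero and the word was seen at least once.
    awk -v s="$1" -v ns="$2" -v h="$3" -v nh="$4" 'BEGIN {
        ps = s / ns              # how often the word shows up in spam
        ph = h / nh              # how often the word shows up in ham
        printf "%.2f\n", ps / (ps + ph)
    }'
}

# A word seen in 40 of 100 spam mails but only 2 of 100 ham mails
token_spam_probability 40 100 2 100
```

The more spam and non-spam the filter has been shown, the more trustworthy these per-word figures become, which is why regular learning matters.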

It took a fair amount of Googling to work out this solution, so hopefully it will save you some time if you have a similar setup to mine (Linux, Plesk, Qmail, Spam Assassin). You may need to substitute your Spam, Learn and Trash folder names if they differ from mine:

1. Inside your Spam mail folder, create a folder named Learn.
2. If a spam mail is not caught by Spam Assassin, get in the habit of moving it manually to your Spam/Learn folder (do this in your mail client).
3. Create a script /var/scripts/dailyMailJobs as follows:


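#!/bin/bash

MAILNAMES_PATH="/var/qmail/mailnames"
SPAM_LIFETIME_DAYS=2
TRASH_LIFETIME_DAYS=4

# learnAndFlush args are the following directories: MAIL SPAM SPAM.LEARN TRASH
function learnAndFlush {
    echo -e "\n\nLearning new Bayesian data from spam for $1 on $(date)"
    sa-learn --dbpath "${MAILNAMES_PATH}/$1/.spamassassin" --spam "${MAILNAMES_PATH}/$1/Maildir/$3/cur/"
    # Flush Spam.Learn (no age limit: everything in it has just been learnt)
    flush "$1" "$3"
    # Flush Spam
    flush "$1" "$2" "${SPAM_LIFETIME_DAYS}"
    # Flush Trash
    flush "$1" "$4" "${TRASH_LIFETIME_DAYS}"
    echo -e "\nLearning new Bayesian data from the last 24hrs of non-spam for $1"
    # Learn as ham every folder changed in the last day, skipping Spam, Spam.Learn, Trash and the inbox
    find "${MAILNAMES_PATH}/$1/Maildir" -mtime -1 -type d -name cur -not -path "*$2*" -not -path "*$4*" -not -path "*/Maildir/cur" -print -exec sa-learn --dbpath "${MAILNAMES_PATH}/$1/.spamassassin" --ham {} \;
}

function flush {
    echo -e "\nCleaning $2 from ${MAILNAMES_PATH}/$1"
    mtimeArg=""
    if [ "$3" ]
    then
        echo "Only deleting mail older than $3 days"
        mtimeArg="-mtime +$3"
    fi
    # $mtimeArg is deliberately unquoted so it splits into two find arguments (or none)
    find "${MAILNAMES_PATH}/$1/Maildir/$2/cur" $mtimeArg -type f -exec rm {} \;
}

# Run as popuser. A bare "su popuser" only opens a separate shell and the rest of
# the script would still run as the original user, so re-run the script via su instead.
# Assumption: popuser may have no login shell, hence -s /bin/bash.
if [ "$(id -un)" != "popuser" ]
then
    exec su -s /bin/bash popuser -c "$0"
fi

# Substitute your mail directory here:
learnAndFlush chrisbeach.co.uk/chris .Spam .Spam.Learn .Trash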

4. Place this in your crontab so it runs every day (e.g. at 00:15):


15 0 * * * /var/scripts/dailyMailJobs >> /var/cronjobs/logs/dailyMailJobs.log

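This script teaches Spam Assassin that the mails in the Spam/Learn folder are spam, and that the mails in your other folders are non-spam. For the non-spam pass it skips Spam, Trash and the top-level inbox, and only looks at folders that have changed in the last 24 hours. It also performs some housekeeping: it deletes the Spam/Learn mails once they have been learnt, and deletes old mails from Trash (5 days or older) and Spam (3 days or older). These ages are one more than the lifetime settings in the script because find's -mtime +N only matches files that are more than N whole days old.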
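Before letting the script delete anything, it may be worth previewing what a flush would remove. The sketch below is not part of the script above: previewFlush is a hypothetical helper that runs the same kind of find test but prints the matches instead of removing them. The demo assumes GNU touch (for the -d option) and uses made-up file names in a throwaway directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script:
# list the files that a flush with the same age limit would delete.
previewFlush() {
    # $1 = a Maildir "cur" directory
    # $2 = the N that would be passed to find as "-mtime +N"
    find "$1" -mtime "+$2" -type f -print
}

# Demo on a throwaway directory, backdating two files with GNU touch.
demoDir=$(mktemp -d)
touch -d '80 hours ago' "$demoDir/old-mail"     # a little over 3 days old
touch -d '30 hours ago' "$demoDir/recent-mail"  # a little over 1 day old

# With N=2 (the Spam lifetime setting) only old-mail is listed
previewFlush "$demoDir" 2

rm -rf "$demoDir"
```

To try it on real mail, point it at one of your own folders, e.g. the cur directory of your Spam folder, with the lifetime value you intend to use.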

written by Chris Beach
01/02/07 8:12pm
(10 years, 3 months ago)