
DSpam



Sorry, this is a tad long.  I've only posted a couple of times in the last
year, so I'll make up for it here. :)

So, I set up dspam as a filter on my postfix gateway today.  As background,
I have an AMD 5x86-133 (basically about a pentium 133 in a 486 socket) with
64MB RAM and a few 50-100MB IDE drives running postfix (as well as named
and xinetd acting as a port forwarder).  It's the baddest 486 in town, but
not much of a system by any modern measure.  Anyway, it takes care of
blacklists, header and body checks, virtual address redirection, etc., and
forwards mail either externally or to the internal server, depending on how
the virtual domain's set up.  The load average sits at 0.00 almost all the
time, despite handling a few messages/minute pretty constantly.

I've been running SpamAssassin on the internal machine, with spamc launched
from procmail and another machine running MySQL and spamd.  Any of you
who've run SpamAssassin know that it's a pain to keep trained, and that it
slowly falls behind spammers until the next version is tested and released,
which only happens a couple of times a year.

A few days ago, I built dspam on the Slackware 486 with the intention of
running the client on the 486 and the server on the machine which now runs
spamd.  However, I had problems getting ./configure to find the pthreads
implementation on this machine, which is required for the client/server
support, so I just built it without.  I mean, a super 486 like this oughtta
be able to handle heavy mail filtering, right? :)

Apparently, it does just fine.  I assembled a corpus of about 20K spams
(about what I got in the last 6 months) and about 25K non-spams (a few
months of the rsync and reiserfs mailing lists, the last 6 months of my
inbox, and about half of the messages I've sent to and received from my
wife over the last couple of years).  I put the messages into an mbox, used
formail and a shell script to split the messages up into
consecutively-numbered 500-message mboxes for ease of processing, and
started training.  It took about two days to train, not counting the
short period where I ran the database server out of disk space twice and
had to repair the token table (100MB+, nearly 2 million records).  Whoops.
I'm sure it would've gone faster on another machine.  BTW, I configured
dspam to use MySQL to store prefs and spam stuff.  Anyway, after training,
I set it up as a filter with one forced user so everyone uses the same
data, and set up spam@/ham@ aliases that override the filter for
self-training.  DSpam adds an ID to each message for training purposes, so if
it's wrong it can easily retrain itself on the original data without lots
of extra re-parsing and worrying about whether or not the original message
was modified before it was sent back for retraining.
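
For anyone who wants to do the same corpus split without formail, here is a
rough sketch using Python's standard mailbox module instead of the formail
and shell script approach I actually used.  The function name, the
"chunk-NNNN.mbox" file naming, and the paths are made up for illustration;
only the 500-message chunk size comes from what I described above.

```python
import mailbox
import os


def split_mbox(src_path, out_dir, chunk_size=500):
    """Split one mbox into consecutively numbered mboxes.

    Each output file holds at most chunk_size messages.  Returns the
    list of output paths in order.  Output names are hypothetical.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    chunk = None
    src = mailbox.mbox(src_path, create=False)
    try:
        for i, msg in enumerate(src):
            # Start a new output mbox every chunk_size messages.
            if i % chunk_size == 0:
                if chunk is not None:
                    chunk.close()  # close() flushes pending messages to disk
                path = os.path.join(
                    out_dir, "chunk-%04d.mbox" % (len(paths) + 1)
                )
                # Note: if the file already exists, messages are appended.
                chunk = mailbox.mbox(path, create=True)
                paths.append(path)
            chunk.add(msg)
    finally:
        if chunk is not None:
            chunk.close()
        src.close()
    return paths
```

Something like split_mbox("corpus.mbox", "chunks") would then leave
chunks/chunk-0001.mbox, chunks/chunk-0002.mbox, and so on, each of which can
be fed to the trainer one at a time.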
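
To show why the per-message ID makes retraining easy, here is a toy sketch
of the idea.  This is not DSpam's code or its storage format: the class, the
whitespace tokenizer, the SHA-1 based ID, and the in-memory dictionaries are
all assumptions for illustration (my setup keeps this data in MySQL).  The
point is only that the tokens are saved under the ID at filtering time, so a
correction needs the ID and nothing else.

```python
import hashlib
from collections import Counter


class SignatureStore:
    """Toy model of ID-based retraining (hypothetical, not DSpam's design)."""

    def __init__(self):
        self.signatures = {}  # message ID -> (tokens, label recorded)
        self.counts = {"spam": Counter(), "ham": Counter()}

    def record(self, raw_message, label):
        """Save the tokens seen at filtering time and return the ID."""
        tokens = sorted(set(raw_message.lower().split()))
        sig_id = hashlib.sha1(raw_message.encode("utf-8")).hexdigest()[:16]
        self.signatures[sig_id] = (tokens, label)
        self.counts[label].update(tokens)
        return sig_id

    def retrain(self, sig_id, correct_label):
        """Move the saved tokens to the correct class using only the ID."""
        tokens, old_label = self.signatures[sig_id]
        if old_label == correct_label:
            return
        self.counts[old_label].subtract(tokens)
        self.counts[correct_label].update(tokens)
        self.signatures[sig_id] = (tokens, correct_label)
```

Because retrain() never looks at the message body again, it doesn't matter
if a mail client rewrapped or otherwise mangled the message before it was
forwarded to the spam@ or ham@ alias.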

So far, the new filter has been 100% accurate - probably due to the huge
set of training mail.  I left SA active for a while, and it marked as ham
several messages that DSpam caught.  I'm not using the web interface for
anything but stats, but with thttpd set up to serve the CGIs, the 486 is
really pretty responsive.  Due to the way I set things up, using the
quarantine wouldn't work well anyway. Load is still near zero, though
message delay is increased by a second or two.  Overall, I'm pretty darned
impressed.  Once I get the database server running on a better machine
(it's on my awesome SMP Celeron 333 machine with a broken RedHat - ugh), I
expect performance to be a bit better.  I've disabled SA, though, because
even if this is only "as" good as SA in the long term, it's *way* easier to
keep up to date.

Maybe one day I'll report back with documentation or longer term testing.
But I'm just so darned excited now that I had to get it out. :)

--Danny, hopefully on his way to being spam-free again (for a while)
