Eventually I will write a blog post

Wherein I will describe how to set up a cyrus-imapd cluster in fairly straightforward and mostly reliable terms. It will probably include lots of invective concerning cyrus-sasl (the scourge of my existence) and openldap (which is like a little kid that tends to get the pointy scissors and stab you over and over with a gleeful smile on its face).

For the moment, I will just note that I have moved email for a couple of my secondary addresses to our little cluster. That represents a certain optimism on my part.

NPR to the rescue

So, my dad and I have had a long-running debate over whether or not we should be including spam with so-called “Poison Paragraphs” in the corpus we hand-manage for “AnteSpam’s”:http://antespam.com/ Bayesian database.

I’ve long maintained that the right solution is to just bung it in there–the text being inserted is generally far too atypical of real emails to make a difference. Dad was more hesitant.

With this in mind, I tried to be gracious when he called to mention that NPR had “a story”:http://www.npr.org/templates/story/story.php?storyId=5624749, including an interview with “Paul Graham”:http://paulgraham.com/, the guy who “first proposed using Bayesian analysis”:http://paulgraham.com/spam.html, who confirmed that it really wasn’t a problem.

Implementing VERP for AnteSpam v2

My big accomplishment today–it was an otherwise fairly busy day, still catching up from the last couple of weekends–was adding VERP handling to the AnteSpam daemon process.

Those of you who don’t hang out in email handling circles probably don’t recognize the acronym“1”:#fn1, but if you’re subscribed to a mailing list these days, you’ve probably seen it in action.

What happens is that during the SMTP delivery process, when the mailing list server hands the message to whatever server hosts your mail, the message is given a special address as its originator. This is often, but not always, of the form @bounce-mdorman=tendentious.org@bounce.antespam.com@–the important bit is that the address to which the mail is being delivered is included (albeit mangled) in the address from which the mail seems to be coming.

This might seem weird, but when you send to a non-existent address, any bounce message is almost certainly going to be delivered to that specially encoded address, and modern mail transfer agents make it easy to route all mail for @bounce-*@bounce.antespam.com@ to a program which can then extract the address whose delivery failed and behave appropriately–in the case of mailing lists, by removing the user from the list.
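For the curious, the mangling and un-mangling can be sketched in a few lines of Python. This is only an illustration of the address form shown above, not the AnteSpam code; the function names @verp_encode@ and @verp_decode@ and the default @bounce@ prefix are made up for the example.

```python
from typing import Optional


def verp_encode(recipient: str, prefix: str = "bounce",
                bounce_domain: str = "bounce.antespam.com") -> str:
    """Fold the recipient address into the envelope sender (hypothetical helper)."""
    local, sep, domain = recipient.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a usable address: %r" % recipient)
    # user@example.org becomes prefix-user=example.org@bounce_domain
    return "%s-%s=%s@%s" % (prefix, local, domain, bounce_domain)


def verp_decode(bounce_address: str, prefix: str = "bounce",
                bounce_domain: str = "bounce.antespam.com") -> Optional[str]:
    """Recover the failed recipient from a bounce address, or None if it is not ours."""
    local, sep, domain = bounce_address.rpartition("@")
    if not sep or domain.lower() != bounce_domain:
        return None
    head = prefix + "-"
    if not local.startswith(head):
        return None
    # Split on the last "=", since a domain name cannot contain one.
    user, eq, rcpt_domain = local[len(head):].rpartition("=")
    if not eq or not user or not rcpt_domain:
        return None
    return "%s@%s" % (user, rcpt_domain)
```

The program that receives mail for the bounce domain would call something like @verp_decode@ on the envelope recipient and act on whatever address comes back.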

Now you might ask why we would want this–it’s not like we’re running mailing lists, we’re checking for spam.

There are two reasons.

First and foremost, this will allow us to recognize, in an automated way, that an address doesn’t exist on the destination server, and we can mark that address as non-existent in our database, and refuse to even accept mail for it in the future. This cuts down on the load on our servers and our customers’ servers.

Second, this means our system will not run afoul of senders who have implemented SPF and customers who pay attention to it. Right now, if a sender has SPF records and our customer honors them, we will probably not be able to deliver the mail from that sender: when we try to do the delivery, we use the original sender address during the SMTP transaction with the customer’s mail server, and we aren’t cleared to send mail for that sender. If we’re using an address that’s in our domain, we are certainly allowed to send it.

Both of these are important quality-of-implementation issues.

The cool part is that, after an hour of investigation and testing, the actual diff turned out to be a one-line change. We were already using the QMQP protocol to hand clean messages to the postfix system for final delivery (because it operates well over unix sockets, and I was sick of having postfix listening on non-standard TCP/IP sockets for what was ultimately an entirely internal transaction). It turns out that, because the postfix QMQP service strives to be compatible with qmail’s QMQP service (it was written, I understand, because securityfocus wanted to keep using ezmlm, which depends on QMQP, but wanted to move away from qmail), you just have to use a specially constructed sender address, and postfix will do the hard work for you.
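To give a feel for what goes over the wire, here is a rough sketch of building a QMQP request in Python. A QMQP request is a single netstring wrapping the message, the envelope sender, and each recipient, each as its own netstring. The sender form @prefix-@domain-@[]@ is the qmail convention as I understand it from the Postfix VERP documentation; treat it as an assumption, since the exact address the daemon uses isn't given here. All function names are invented for the example.

```python
from typing import List


def netstring(data: bytes) -> bytes:
    """Encode bytes as a netstring: decimal length, colon, payload, comma."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def qmqp_request(message: bytes, sender: str, recipients: List[str]) -> bytes:
    """Build the single netstring a QMQP client sends (hypothetical helper).

    The payload is the message, then the envelope sender, then one
    netstring per recipient, all concatenated and wrapped once more.
    """
    parts = [netstring(message), netstring(sender.encode("ascii"))]
    parts += [netstring(rcpt.encode("ascii")) for rcpt in recipients]
    return netstring(b"".join(parts))


def verp_sender(prefix: str, domain: str) -> str:
    """Assumed qmail-style sender that asks the server to apply VERP."""
    return "%s-@%s-@[]" % (prefix, domain)
```

If the assumption holds, switching the sender argument from the original envelope sender to @verp_sender("bounce", "bounce.antespam.com")@ is the kind of one-line change described above.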

1 “VERP”:http://cr.yp.to/proto/verp.txt stands for ??Variable envelope return paths??, and was pioneered by the qmail MTA, largely for automating bounce handling in its companion Mailing List Manager, ezmlm. Yes, I just wanted to try out the footnoting.

I have little sympathy for SpamCop

I’ve worked on a service that sends out lots of email. We were very careful to 1) only add people to our system who have requested it (which involves sending out confirmation emails), and 2) not send mail to someone who never wants to hear from us again.

Now anyone who does this sort of work will have immediately spotted that if we send an email in step 1 in order to verify the address, it is possible for someone to have us send emails to arbitrary addresses.

This is where #2 comes in; every email we send out includes a link that will put you on our “Do Not Call” list–get on that list and you’ll never be able to sign up for the service, because we won’t ever send you an email again even if you (or someone else trying to annoy you) ask.
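The interplay between the two rules is easy to show in a toy model. This is a sketch under my own assumptions, not the service's real code: the class name, the in-memory sets, and the @outbox@ list standing in for actual mail delivery are all invented.

```python
import secrets


class SignupService:
    """Toy confirmed-opt-in flow with a permanent do-not-contact list."""

    def __init__(self):
        self.pending = {}            # confirmation token -> address
        self.subscribed = set()
        self.do_not_contact = set()
        self.outbox = []             # (address, token) pairs we "sent"

    @staticmethod
    def _normalize(address: str) -> str:
        return address.strip().lower()

    def request_signup(self, address: str) -> bool:
        addr = self._normalize(address)
        # Rule 2 is checked first: never mail anyone who opted out,
        # no matter who is asking on their behalf.
        if addr in self.do_not_contact:
            return False
        token = secrets.token_urlsafe(16)
        self.pending[token] = addr
        self.outbox.append((addr, token))   # the confirmation email
        return True

    def confirm(self, token: str) -> bool:
        # Rule 1: only a confirmed address is ever subscribed.
        addr = self.pending.pop(token, None)
        if addr is None or addr in self.do_not_contact:
            return False
        self.subscribed.add(addr)
        return True

    def opt_out(self, address: str) -> None:
        addr = self._normalize(address)
        self.do_not_contact.add(addr)
        self.subscribed.discard(addr)
        # Drop any outstanding confirmations for this address as well.
        self.pending = {t: a for t, a in self.pending.items() if a != addr}
```

The worst a prankster can do under this model is trigger confirmation emails until the victim clicks the opt-out link once.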

Nonetheless, I’ve had to deal with several SpamCop complaints. Each time it’s the same thing–a forwarded message with wild invective, etc., and, inevitably, SpamCop *never tells us the address the message was sent to!*

Yep, that’s right, it’s a great game of Hide The Ball–“If you’re not a spammer, remove this person’s address. No, we’re not going to give it to you, just do it.”

Usually there’s something in the headers that gives it away–which arguably just proves how stupid a game it is for SpamCop to play–but it’s a waste of time.

That said, it’s unfortunate that it’s Scott Richter “who is suing them”:http://www.clickz.com/news/article.php/3348241, since he’s a fucking spamboy wanker.

In real news…

I found that the refactored code was very amenable to modifying the block/pass list processing to do two consecutive passes, first with any domain settings, then with any user settings.
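In rough outline, the two-pass idea looks something like the Python below. This is a guess at the shape, not the daemon's code: the settings layout, the glob-style patterns, and especially the rule that the first pass to reach a verdict wins are all assumptions made for the example.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

Settings = Dict[str, List[str]]   # {"block": [patterns], "pass": [patterns]}


def check_lists(sender: str, settings: Settings) -> Optional[str]:
    """Return "block", "pass", or None if no pattern in this pass matches."""
    sender = sender.lower()
    for verdict in ("block", "pass"):
        for pattern in settings.get(verdict, []):
            if fnmatchcase(sender, pattern.lower()):
                return verdict
    return None


def classify(sender: str, recipient: str,
             domain_settings: Dict[str, Settings],
             user_settings: Dict[str, Settings]) -> Optional[str]:
    """Run the domain pass, then the user pass (assumed: first verdict wins)."""
    rcpt = recipient.lower()
    domain = rcpt.rpartition("@")[2]
    passes = (domain_settings.get(domain, {}), user_settings.get(rcpt, {}))
    for settings in passes:
        verdict = check_lists(sender, settings)
        if verdict is not None:
            return verdict
    return None   # no list matched; fall through to normal spam checks
```

Because both passes go through the same @check_lists@ function, adding the second pass amounts to looping over one more settings dictionary.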

In what will no doubt be the first of many messages with this name…

So I look at the stats this morning, like I do most every morning, and I see that hiwaayoffice.net has been seeing incredibly high volume–more than two messages per minute, which is enormously more than they normally do. _Enormously_. And the message size was pretty frigging huge, too.

As usual when things involve HiWAAY, I called dad. I asked if he knew of someone trying to beat up on our machines, etc. He asked who it was, so I did a couple of quick queries on the log file, and found out it was his address that was getting all the mail.

After some investigation, we discovered it was a problem with his email being over quota, and that address being on the list of people to be notified if someone goes over quota. Oops.

I laughed long and hard, though.

Well, that was a productive day

Generally, Saturday is my day Away From The Machine. Some Saturdays I don’t even log in–no email, no web surfing, nothing.

Unusually, though, I did some work today, and mighty productive it was, too.

One of the things we need to get a handle on for “AnteSpam”:http://antespam.com/ is building (and maintaining) a corpus of messages. Having a good corpus gives us what we need to build a good Bayes database, which will hopefully keep us nice and accurate, and it will also allow us to contribute some to the SpamAssassin development by running mass-checks and generally giving input on how well things are working.

At the moment we just grab random messages that come through the system and someone has to go in and classify those–which is tough, because what might be spam to me is someone else’s precious newsletter. I end up deleting a lot of messages that I think are probably spam, but might not be–and it’s better to be conservative.

Better, though, is the new capability I implemented. In addition to the random messages, it’s now possible to send messages to some special addresses, and those will be picked up and put in the corpus appropriately marked (but not as verified–we don’t want anyone to be able to screw us up by hitting us with a bunch of spam marked as ham or anything).
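A minimal sketch of the routing, in Python, might look like this. The two submission addresses, the @CorpusEntry@ class, and the function name are placeholders invented for illustration; the real addresses and storage aren't described here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical submission addresses mapped to the label they imply.
SUBMISSION_ADDRESSES = {
    "spam@corpus.example": "spam",
    "ham@corpus.example": "ham",
}


@dataclass
class CorpusEntry:
    raw: bytes
    label: str
    verified: bool = False   # a human still has to confirm the label


def file_submission(envelope_recipient: str, raw: bytes) -> Optional[CorpusEntry]:
    """Turn a message sent to a special address into an unverified corpus entry."""
    label = SUBMISSION_ADDRESSES.get(envelope_recipient.strip().lower())
    if label is None:
        return None          # not a submission; handle as ordinary mail
    # Submitted labels are only a hint, so the entry is never trusted as-is.
    return CorpusEntry(raw=raw, label=label, verified=False)
```

Keeping @verified@ false on every submission is what stops someone from poisoning the corpus by bouncing in a pile of spam labelled as ham.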

Combined with some options to let people submit mail to the corpus when they see good or bad messages in their sideline folder, this could work out really well. I, personally, get thousands of messages each day, good and bad, that I can bounce into these addresses–instant training ground.

I’m pretty stoked.

OK, I was wrong.

So I got back into the swing of things, and ended up making some fairly significant revisions to the code for the main daemon at the heart of “AnteSpam”:http://antespam.com/. No really groundbreaking changes to the core functionality, but some optimizations, and some cleanup of the code. It may be a little more accessible now.

We’re growing

When you talk actual numbers it sounds pitiful in a way, but “AnteSpam”:http://antespam.com/ is growing consistently, if not super-fast. We’re up to 18 paying domains, and there are reportedly several “ready to land” any moment now.

If I ever felt for a moment that this wouldn’t sell because it didn’t provide value to the customers, I need only look at the stats to see that we’ve got domains that get three spam mails for every good mail. Amazing.