If you’re using SpamAssassin and PostgreSQL for Bayes data

You should know that you can see increased performance by making some small changes to your database schema.

Specifically, you should run the following set of SQL commands on your bayes database:

bc. alter table bayes_token drop constraint bayes_token_pkey;
alter table bayes_token add constraint bayes_token_pkey primary key (token, id);
drop index bayes_token_idx1;
analyze table bayes_token;

If you are running SpamAssassin <= 3.0.4, that last statement will probably complain that bayes_token_idx1 doesn’t exist–in which case you should fasten your seat-belt because your bayes database is probably going to get a *lot* faster very suddenly.

What this change does is modify the primary key to make the token column the first part of the key, meaning the index that is implied by the primary key (to guarantee uniqueness) can now also be for most queries, and the now-redundant additional index (which is on the token column alone) can be dropped.

As a result, with each insert or update on the database, you will do one fewer index modifications, so you should see some performance increase (albeit perhaps a modest one, if you are running a low-traffic installation) and some disk space savings.

The reason this is all important is simple.

Most queries made by SpamAssassin have a where clause of the form “where id = 1 and token in (…)”–looking for rows with a particular token, associated with a particular user (represented here by the id).

With the old primary key definition (and without the additional index on token), PostgreSQL would look at its available indices, and see that it could match the id and token (in that order) against the primary key index…but, in most cases, the id is likely to have very low selectivity–that is, there are many rows with the same value–so it doesn’t help cut down the number of rows that must be examined very much, and PostgreSQL opts to do a sequential scan of the table.

On the other hand, the token column will probably have relatively high selectivity–meaning querying for a given token against an index is likely to produce relatively few rows–so, realizing the problem, someone did the obvious and added an index just on the token column.

However, the same result could be achieved by switching the order of the columns in the primary key, and without introducing the overhead of maintaining a second index. I have, in fact, “opened a ticket in SpamAssassin’s Bugzilla”:http://bugzilla.spamassassin.org/show_bug.cgi?id=4400, but I don’t have time to jump through their benchmarking hoops right now, so I’m making this entry here in the hopes that people who need the benefit can find it until such time as I can setup the infrastructure to run their benchmarks.

Published by

Michael Alan Dorman

Yogi, brigand, programmer, thief, musician, Republican, cook. I leave it to you figure out which ones are accurate.