If you’re using SpamAssassin and PostgreSQL for Bayes data

You should know that you can see increased performance by making some small changes to your database schema.

Specifically, you should run the following set of SQL commands on your bayes database:

bc. -- Rebuild the primary key with token as the leading column, then drop the
-- now-redundant standalone token index and refresh the planner's statistics.
alter table bayes_token drop constraint bayes_token_pkey;
alter table bayes_token add constraint bayes_token_pkey primary key (token, id);
drop index bayes_token_idx1;
analyze bayes_token;
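One way to apply these, for what it’s worth, is to save them to a file and feed it to psql (this assumes your Bayes database is named bayes; adjust the database name and connection options for your installation):

bc. psql -d bayes -f fix_bayes_token.sql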

If you are running SpamAssassin <= 3.0.4, the drop index statement will probably complain that bayes_token_idx1 doesn’t exist; in that case, fasten your seat-belt, because your bayes database is probably going to get a *lot* faster very suddenly.

What this change does is modify the primary key so that the token column is the first part of the key. That means the index implied by the primary key (to guarantee uniqueness) can now also be used for most queries, and the now-redundant additional index (on the token column alone) can be dropped.
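For the sake of concreteness, here is roughly the shape of the table before and after (the column list is approximate; the schema file shipped with SpamAssassin is authoritative):

bc. -- Before (approximately): id leads the key, so a separate token index is needed
create table bayes_token (
  id integer not null,
  token bytea not null,
  spam_count integer not null default 0,
  ham_count integer not null default 0,
  atime integer not null default 0,
  primary key (id, token)
);
-- After: token leads the key, and the separate token index becomes redundant
--   primary key (token, id)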

As a result, each insert or update on the database does one fewer index modification, so you should see some performance increase (albeit perhaps a modest one, if you are running a low-traffic installation) and some disk space savings.

The reason this is all important is simple.

Most queries made by SpamAssassin have a where clause of the form “where id = 1 and token in (…)”, looking for rows with a particular token that are associated with a particular user (represented here by the id).

With the old primary key definition (and without the additional index on token), PostgreSQL would look at its available indices and see that it could match the id and token (in that order) against the primary key index. In most cases, though, the id is likely to have very low selectivity (that is, many rows share the same value), so it does little to cut down the number of rows that must be examined, and PostgreSQL opts to do a sequential scan of the table instead.
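You can watch the planner make this decision yourself with explain; a sketch, using illustrative values (the real token column holds binary hashes, not readable strings):

bc. explain
select spam_count, ham_count
  from bayes_token
 where id = 1
   and token in ('abc', 'def');  -- illustrative values
-- With the old key order, expect something like "Seq Scan on bayes_token";
-- with (token, id) leading, an "Index Scan using bayes_token_pkey" instead.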

The token column, on the other hand, will probably have relatively high selectivity (querying for a given token against an index is likely to produce relatively few rows), so, recognizing the problem, someone did the obvious thing and added an index on the token column alone.
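If you want to verify the selectivity claim against your own data, the planner’s statistics are visible in the pg_stats view once the table has been analyzed; n_distinct is a reasonable proxy:

bc. select attname, n_distinct
  from pg_stats
 where tablename = 'bayes_token';
-- id will typically show only a handful of distinct values, while token will
-- show a great many (or a negative value, meaning the count scales with the table).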

However, the same result can be achieved by switching the order of the columns in the primary key, without introducing the overhead of maintaining a second index. I have, in fact, “opened a ticket in SpamAssassin’s Bugzilla”:http://bugzilla.spamassassin.org/show_bug.cgi?id=4400, but I don’t have time to jump through their benchmarking hoops right now, so I’m making this entry here in the hopes that people who need the benefit can find it until I can set up the infrastructure to run their benchmarks.

DBIx::POS 0.03 released

I just put out the first official release of my DBIx::POS module (which, incidentally, is my first ever CPAN entry in roughly ten years of Perl development).

This is the end product of more than a year’s worth of effort in finding a good, lightweight way of organizing and documenting the SQL code associated with a large application.

This is not any sort of object-mapping layer or automatic SQL generator; I keep looking at those sorts of things, and experimenting with them, and I keep coming back to plain old SQL. This will not save you any keystrokes; in fact, it’s going to take more, because this is as much about documentation as anything else.

All this is, really, is a way to organize and document your SQL as a component alongside (rather than an integral part of) your Perl code.
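To give a flavor of the idea, here is a hand-rolled sketch (emphatically not DBIx::POS’s actual interface, just the shape of the problem it addresses): SQL kept in a named, documented dictionary beside the code that uses it.

bc. #!/usr/bin/perl
use strict;
use warnings;
use DBI;
# A named dictionary of plain old SQL, each entry carrying its own documentation.
my %sql = (
    token_counts => {
        desc => 'Spam/ham counts for one token belonging to one user.',
        sql  => 'select spam_count, ham_count from bayes_token where id = ? and token = ?',
    },
);
my $dbh = DBI->connect('dbi:Pg:dbname=bayes', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare($sql{token_counts}{sql});
$sth->execute(1, 'some-token');    # illustrative arguments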

There’s still lots to do; at the very least, I want to put together tools for turning your POS (that’s Plain Old SQL, and a reference to POD, the format from which it is derived) into DocBook or POD. But for the moment, this works and, I believe, works well.

You can find it “here”:http://search.cpan.org/~mdorman/DBIx-POS-0.03/

Feedback welcome.