Bayesian training tips

Praetor allows the administrator to train its Bayesian filter to be more accurate by reducing false positives (non-spam messages that are quarantined) and false negatives (spam messages that are delivered unfiltered because the Bayesian filter is unsure).  Your goal is to attain an acceptable level of both classification errors.

A reasonable goal is a false positive rate of one in 2,000 (0.05%) or lower.  This number can be found in the special summary traffic report as the percentage of messages totaled for the approved event group.  The sample snippet from this report shows one approved message which was originally quarantined because of the rule called "NOT Praetor ZIP files".
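As a quick sanity check, the target rate can be computed directly from the report's raw counts.  This sketch uses illustrative numbers and variable names of our own choosing, not fields from the actual report:

```python
# Illustrative check of the false positive target (names and counts are hypothetical).
approved_after_quarantine = 1   # good messages the admin had to release
total_inbound = 2500            # all inbound messages in the report period

fp_rate = approved_after_quarantine / total_inbound
print(f"False positive rate: {fp_rate:.4%}")   # prints 0.0400%
# Compare against the one-in-2,000 goal (0.05% = 0.0005):
print("within target" if fp_rate <= 0.0005 else "needs more training")
```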

Keep in mind that as the user population grows, it becomes statistically harder to achieve low false positive rates, simply because the wide-ranging message content across the entire population produces many more conflicting tokens.  For example, some people may want joke-of-the-day messages and others may not, so training on these as both spam and non-spam samples will dilute the statistics for the tokens in those messages.
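The dilution effect can be seen with a deliberately simplified per-token spamicity estimate (count-ratio only, ignoring the corpus-size weighting a real Bayesian filter applies; the function is ours, not Praetor's):

```python
# Simplified per-token spamicity: fraction of trainings in which the token
# appeared in a message labeled spam.  Real filters weight these counts,
# but the dilution effect is the same.
def token_spamicity(spam_count, ham_count):
    return spam_count / (spam_count + ham_count)

# A token trained almost exclusively as spam is strongly indicative:
print(token_spamicity(40, 2))    # ~0.95

# After other users' similar mail is also trained as non-spam, the same
# token drifts toward the uninformative 0.5 midpoint:
print(token_spamicity(40, 35))   # ~0.53
```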

The false negative rate is not as easy to see because these messages are among those whose spamicity is rated unsure, which the Bayesian filter permits to be delivered by default.  Most of these messages will be spam and some will be non-spam.  To a lesser extent, there will also be spam with low spamicity values.  A reasonable goal for the false negative rate is for unsure-spamicity messages to account for less than 5% of all inbound messages received.

In discussing false positive and false negative rates, realize that these represent opposing forces.  Driving false positives extremely low typically produces more false negatives, and vice versa.  Be prepared to see this trade-off reflected in your own Praetor Bayesian filtering.
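The trade-off follows from how the two thresholds partition messages.  This sketch uses the default threshold values mentioned later in this article (0.30 and 0.60); the function and boundary handling are our illustration, not Praetor's exact logic:

```python
# Sketch of how two thresholds split messages into three classes.
# Values match the defaults cited in the Don'ts section below; exact
# boundary behavior (>= vs >) is an assumption.
LOW, HIGH = 0.30, 0.60   # non-spam threshold, spam threshold

def classify(spamicity):
    if spamicity >= HIGH:
        return "spam (quarantined)"
    if spamicity < LOW:
        return "good (accepted)"
    return "unsure (accepted by default)"

for s in (0.05, 0.45, 0.75):
    print(s, "->", classify(s))
# Raising HIGH shrinks the spam class (fewer false positives) but widens
# the unsure class (more false negatives), and vice versa.
```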

Here are some tips to train the Praetor Bayesian filter in an optimal manner.

Dos:

1. Be sensible in training

You do not need to train excessively, with samples numbering in the tens or hundreds of thousands, to get good performance from Praetor's Bayesian filter.  CMS has found that starting from the default token database and training selectively (as described below) over short periods of up to one week yields better results.

Going to the trouble of accumulating tens of thousands or more samples may even work against you.

2. Train only on errors

The Bayesian filter is best trained with message samples on which it has made classification errors.

For good messages that are quarantined, the administrator simply uses the Praetor administration program to immediately train on the message before releasing it for delivery.

For messages deemed unsure, which Praetor has accepted and delivered to the intended recipients, the administrator must set the rule to archive a copy of such messages into the TrainUNSURE Waiting Review folder.  The steps to archive such accepted messages are described under "Saving unsure messages for training" below.

3. Train periodically only when needed

The administrator can use the Praetor Log Analyzer to determine if more training is needed.  Just view the special summary traffic report to see if the number of false positives that have been approved is too high.  

To a certain degree, the false negatives are also reflected in the number of accepted messages found to have unsure spamicity, but realize that this number includes both good messages and spam.  Of course, your user population is more likely to complain if they are still receiving too much spam.

4. Add listserver addresses

Listservers sending newsletters, discount notices, etc., tend to have spam-like characteristics, so it is a good idea to bypass the Bayesian filter by adding their addresses to the Approved Listserver Address list.  This can easily be done while reviewing the details of the message: select the Address Options tab and check the box for the desired list.

Keep in mind, however, that some listservers use a sending address where the portion to the left of the @-sign is auto-generated and contains a string of digits that may identify the newsletter issue number.  In such cases, adding the entire address unmodified to the Approved Listserver Address list will not be sufficient.  Instead, you may want to simply add an entry to the Approved Domains list.
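A small sketch makes the problem concrete.  The addresses and set names below are hypothetical, chosen only to show why an exact-address entry goes stale while a domain entry keeps matching:

```python
# Hypothetical example: the listserver's local part changes every issue,
# so last issue's exact address never matches again.
approved_addresses = {"newsletter-1042@news.example.com"}  # last issue's sender
approved_domains = {"news.example.com"}

sender = "newsletter-1043@news.example.com"  # next issue: new auto-generated local part

print(sender in approved_addresses)              # False -- exact entry is stale
print(sender.split("@")[1] in approved_domains)  # True  -- domain entry still matches
```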

Note:

CMS has found that people tend to use Praetor to filter out messages from listservers they no longer want to receive.  Perhaps they are heeding the common warning not to unsubscribe from mailing lists.  That warning should not apply to listservers you voluntarily subscribed to in the past.

As the administrator, you should not comply with user requests to train on such messages as spam.  Doing so may cause problems for other users who want to continue their subscriptions.  Simply ask those who no longer want messages from a particular listserver to unsubscribe.

 

5. Add entries to the whitelist for addresses and domains

If some users request to always accept messages originating from a particular address or domain, then add these to the Approved Senders or Approved Domains list.  

Today's email-borne infections usually forge the sender address using entries found in the local address book on the infected machine, so there is a small risk that such a forged address may already be on the whitelist.  Praetor eliminates this risk as long as you retain the default rule order, which places the attachment-test rules before the whitelist rules.
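The reason rule order matters can be sketched as a first-match pipeline.  The rule names, message fields, and evaluation function here are our illustration of the principle, not Praetor's internal implementation:

```python
# Sketch of first-match rule evaluation: rules run top-down, and the first
# rule whose test matches determines the action.
def evaluate(message, rules):
    for name, test, action in rules:
        if test(message):
            return name, action
    return "default", "accept"

# Default-style ordering: attachment test BEFORE the whitelist.
rules = [
    ("attachment test", lambda m: m["has_exe_attachment"], "quarantine"),
    ("whitelist", lambda m: m["sender"] in {"boss@example.com"}, "accept"),
]

# An infected message forging a whitelisted sender is still quarantined,
# because the attachment test fires before the whitelist is ever consulted.
forged = {"sender": "boss@example.com", "has_exe_attachment": True}
print(evaluate(forged, rules))  # ('attachment test', 'quarantine')
```

Reversing the two rules would let the forged whitelisted address win, which is exactly the risk the default ordering avoids.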

Don'ts:

1. Don't train spam that is already classified with high spamicity

If your spam threshold is at the default of 0.60 and you see spam messages rated by the Bayesian filter with a spamicity of 0.61 or higher, it is pointless to train on such messages just to push that spamicity even higher.  Whether the spamicity is computed at 0.61 or 1.00 in the future, the message will still be caught simply because it exceeds your spam threshold.

2. Don't train non-spam that is already classified with low spamicity

If your low non-spam threshold is at the default of 0.30 and you find good messages being rated with a spamicity below it, it is pointless to train on such messages just to push the spamicity even lower.  Whether a similar non-spam message's spamicity is computed at 0.00 or 0.29 in the future, it will still be treated as a good message and accepted because its spamicity is below the low threshold.
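Both Don'ts rest on the same observation: classification depends only on which side of a threshold the spamicity falls, not on how far past it the value lands.  A minimal sketch, using the default thresholds (the functions are ours):

```python
# Extra training past a threshold cannot change the outcome: only the
# comparison against the threshold matters, not the margin.
SPAM_THRESHOLD = 0.60      # default high threshold
NON_SPAM_THRESHOLD = 0.30  # default low threshold

def caught_as_spam(spamicity):
    return spamicity >= SPAM_THRESHOLD

def accepted_as_good(spamicity):
    return spamicity < NON_SPAM_THRESHOLD

print(caught_as_spam(0.61), caught_as_spam(1.00))      # True True  -- same outcome
print(accepted_as_good(0.29), accepted_as_good(0.00))  # True True  -- same outcome
```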

 

Saving unsure messages for training

In ADVANCED filtering mode, you will need to configure an additional rule action to save the accepted message into the TrainUNSURE folder.  Perform these steps in the Praetor administration program.

1. In the Inbound rules list, highlight the "Spamicity is unsure" rule and double-click to open its properties.

2. View the Actions tab and select "save message as training type in training folder" as shown below.

3. In the rule description window, click on the training type and select "Save message as Train UNSURE".

4. In the rule description window, click on the training folder and leave the selection at the "Waiting for Review" folder.  This default selection is the only valid one for UNSURE messages.  Press .

5. Your rule description should now include this optional action to save the message, as shown below.  Click to complete the rule modification.  Don't forget to save your rule set from the toolbar and put it into effect.

 

In the BASIC filtering mode, saving the UNSURE messages is much simpler since this mode uses the pre-configured rules.  Just check the box on the Classify tab of the Bayesian configuration page.

Once captured in that folder, you need to review these messages, queue them as good or bad samples, and then perform the training.  Click here to read more about this process.

 

Warning:

If you enable this option to save the UNSURE messages, your free disk space will be consumed faster than normal.  CMS recommends enabling it only for a short period when you want to capture samples for training the Bayesian filter.