Bayesian Training Reset

Although CMS does not recommend it, you can use only the tokens trained from your own samples.  Doing so clears the default token database provided by CMS, and you will have to capture your own samples.

To reset the token database and only populate it with your message samples, you will first need to completely clear the PraetorWordList table in the PraetorClassify SQL database.  Call CMS Support for instructions on how to empty this table of the default entries.

In the meantime, before the default token database is cleared, you will need to capture message samples for use in training.  In this situation, preliminary classification relies entirely on the default rules to distinguish spam from good messages.

The process is:

  1. Capture message samples after switching to BASIC filtering mode.

  2. Verify each message has been classified correctly as spam or non-spam.

  3. Shut down Praetor by stopping IIS.

  4. Clear the default token database entries per CMS instructions.

  5. Train the Bayesian filter using these captured samples.

  6. Restart IIS.

 

Capturing message samples

To facilitate the accumulation of spam and ham messages for initial bulk training, Praetor provides several novel training options for capturing samples as seen below on the Train tab page of the configuration window.  

The only administrator controls on this training page are in the Options group box.  The remaining group boxes show how many messages are in the various states.  You may set the following options:

Using the default rule-based filters to make the initial classification, Praetor will save a copy of those messages that were quarantined as spam or accepted as ham.  This will allow you to review each type for any incorrect classification, removing them as needed.  Such messages will be placed on the list of the and nodes.

This selection automatically queues for training as BAD (spam) any message that originates from a known spam mail server whose IP address is on the DNS blacklist.  

While it is possible that GOOD email might be incorrectly classified if your trading partner has a blacklisted mailserver, this should happen infrequently if at all.  If you detect this in the auto-queued BAD list, you can manually mark the message as Queued for GOOD.  Be sure to add their IP address into the DNSBL bypass IP list.
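As a rough illustration of how a DNS blacklist check with a bypass list generally works, the sketch below builds the conventional DNSBL query name (the IPv4 octets reversed, followed by the blacklist zone) and skips the lookup for bypassed addresses.  The function names, the zone name, and the bypass set are assumptions made for this example; they are not Praetor settings or its actual implementation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL lookup name: IPv4 octets reversed, then the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_blacklisted(ip: str, zone: str, bypass: set) -> bool:
    """Return True if the IP is listed in the zone and is not bypassed."""
    if ip in bypass:
        return False  # trusted partner: skip the lookup entirely
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True   # any address answer means "listed"
    except socket.gaierror:
        return False  # no such name (or lookup failure): treated as not listed


# Hypothetical zone and address, for illustration only.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

A message from a bypassed address is never auto-queued as BAD in this sketch, which mirrors the purpose of adding a trading partner's IP address to the bypass list.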

This selection will automatically mark messages from local users in your domain that go out to the Internet as ham.

Warning:

BAD outbound messages are possible, and they could bias the GOOD frequency table.  This could happen if a machine in your network is infected with a mass-mailing worm that spews outbound email to everyone in the local contact list.  Another possibility is that you send broadcasts to customers that are spam-like.

In either of these cases, don't enable this option.  Enabling it and training upon those messages will likely make your token database 'dirty' and decrease its effectiveness.
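To see why spam-like outbound mail matters, consider a simplified per-token estimate in the style commonly used by Bayesian spam filters.  This is only a sketch: the formula, the function name, and the counts are assumptions for illustration, and the source does not state how Praetor actually scores tokens.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Estimate how strongly a token indicates spam from per-class frequencies."""
    good_rate = good_count / n_good   # share of GOOD samples containing the token
    bad_rate = bad_count / n_bad      # share of BAD samples containing the token
    if good_rate + bad_rate == 0:
        return 0.5  # unseen token: no evidence either way
    return bad_rate / (good_rate + bad_rate)


# Hypothetical counts: a token seen in 50 of 100 BAD samples.
clean = token_spam_probability(good_count=0, bad_count=50, n_good=100, n_bad=100)
# The same token after spam-like outbound mail put it into 40 of 100 GOOD samples.
dirty = token_spam_probability(good_count=40, bad_count=50, n_good=100, n_bad=100)
print(clean, round(dirty, 3))
```

In this made-up case the token's score falls from 1.0 to about 0.556, so a strong spam indicator becomes nearly neutral once the GOOD table is polluted.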

 

This selection will cause Praetor to capture a specified number of sample inbound and outbound messages using the existing methods (default Bayesian tables and rules) to classify as GOOD or BAD for review.

CMS suggests a value of at least 1000 messages.  Depending on your user population, this may provide enough diversity to represent your site's business correspondence.  A larger sample tends to show greater diversity and makes the Bayesian filter more accurate at discriminating spam from ham, but it takes longer to verify the classifications before training and longer to process.  A smaller sample lets you complete the initial training sooner and start seeing the benefits of Bayesian filtering quickly.

These Praetor options are convenient ways to accumulate trainable messages.  The DNSBL and outgoing-message options are especially convenient, since the messages they capture are not likely to need your review.

For initial bulk training, CMS recommends that you check all four options above, adjusting the number of messages to save for training if needed.  Periodic incremental training is discussed separately.

Note:

If you have placed Praetor G2 behind an existing Praetor v1.5, G2 may not receive any messages to quarantine because v1.5 may already have quarantined them.  In this situation you can copy the message samples from the v1.5 QUAR folder and import them into the G2 training folder.  This is done with POPTRAIN, a command-line tool that populates this training folder.

Other group boxes on this page are:

  • Waiting for Review

shows messages that were captured by rule filters and require review to ensure they were correctly classified as GOOD or BAD

  • Ready to Train

are those messages you have already reviewed and those automatically  captured by DNSBL (BAD) or by outgoing from your own local users (GOOD)

  • Already Trained

are those messages that have been analyzed and the frequency database has been updated with what tokens were found

 

 

Performing the training

Once messages have been captured, the initial training process consists of two steps: review and train.

1. Review captured messages

Periodically check on the captured BAD messages by selecting the node under on the left pane.  Survey the messages you see in the right pane and identify any message that was incorrectly classified as spam by using the sender address or subject as clues.

Right-click a selected message to display the following actions:

  1. Train as GOOD

  2. Train as BAD

  3. Queue as GOOD

  4. Queue as BAD

  5. Delete the message from consideration

Choosing either action to train will cause immediate analysis and adjust the appropriate frequency table with the tokens seen.  If this action is selected, the message entry will move to the node.  

Please note that training on the fly in this manner may take some time and will impact Praetor mail reception and filtering.  For initial bulk training, CMS does not recommend training immediately; instead, choose to Queue the message or simply Delete it from the list.

When you choose to Queue (recommended for initial bulk training), this will simply mark that message for later training using the training wizard.  This choice will also move the message entry to the appropriate node.
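The effect of the five review actions can be summarized as a small lookup table.  The sketch below is illustrative only: the action keys and the DELETED state are hypothetical names, and the mapping of the Train actions to the Already Trained state and the Queue actions to the Ready to Train state is an assumption based on the group box descriptions on the Train tab page.

```python
from enum import Enum


class State(Enum):
    WAITING_FOR_REVIEW = "Waiting for Review"
    READY_TO_TRAIN = "Ready to Train"
    ALREADY_TRAINED = "Already Trained"
    DELETED = "Deleted"  # hypothetical: stands for removal from the list


# Hypothetical action keys mirroring the five right-click menu entries.
ACTIONS = {
    "train_good": ("GOOD", State.ALREADY_TRAINED),  # immediate analysis
    "train_bad": ("BAD", State.ALREADY_TRAINED),    # immediate analysis
    "queue_good": ("GOOD", State.READY_TO_TRAIN),   # deferred to the wizard
    "queue_bad": ("BAD", State.READY_TO_TRAIN),     # deferred to the wizard
    "delete": (None, State.DELETED),                # dropped from consideration
}


def apply_action(action: str):
    """Return the (label, new_state) pair that results from a review action."""
    return ACTIONS[action]


print(apply_action("queue_bad"))
```

Only the two Train actions touch the frequency tables right away, which is why the Queue actions are the lighter choice during initial bulk review.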

Note:

It is possible to highlight multiple entries in the message list and perform the chosen action on all the selected messages.

In case the message summary that is visible from the list is insufficient and you are unable to determine whether it is really GOOD or BAD based on the sender address or subject matter, then you can open the selected message by double-clicking on the entry.  This will present you with full message details showing the message header and its contents.  The same actions are now available as buttons in the message window that appears.

 

In a similar manner, periodically check on the captured GOOD messages by selecting the node under on the left pane.

 

2. Invoke Training Wizard

Once you have reviewed the captured GOOD and BAD messages, you may begin training on all the accumulated queued messages, which are now in the Ready to Train state, by running the training wizard.

Pressing displays the review window allowing you to select which class you want to train -- GOOD, BAD, or both.

Note the numbers shown for training are those found in the state for both GOOD and BAD.  Press to begin the training.

Warning:

CMS does not recommend performing bulk training during the business day.  The number of simultaneous accesses to the MSDE SQL Server will likely exceed the limit of 8, which engages its workload governor.  The governor slows both the Bayesian classification of incoming messages and the training.

The training process consists of processing each message to break it down into its component tokens and accumulating them.

After this collection is done, the token database is updated with the addition of new tokens that are found in the training samples.
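The two phases described above, collecting tokens from every sample and then updating the token table in one pass, can be sketched as follows.  This is a minimal illustration under stated assumptions: the tokenizer pattern, the choice to count each token once per message, and the in-memory Counter standing in for the token database are all choices made for this example, not Praetor's actual behavior.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text: str) -> list:
    """Break a message into lowercase tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def train(samples, table=None):
    """Accumulate tokens from all samples, then update the table once."""
    batch = Counter()
    for message in samples:
        batch.update(set(tokenize(message)))  # count a token once per message
    table = Counter() if table is None else table
    table.update(batch)  # single update after collection is complete
    return table


# Hypothetical BAD samples, for illustration only.
bad_table = train(["Buy cheap pills", "cheap cheap offer"])
print(bad_table["cheap"], bad_table["buy"])
```

Collecting into a batch first and writing once keeps the number of table updates small, which fits the observation that the database is updated after the collection is done.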

Depending on how many messages you train, how many tokens are found, and how busy the Praetor machine is, the training process may take some time and a significant amount of CPU resources.  Therefore it is a good idea to perform this training during a period when the machine is not busy.

The counter in the above display is only updated periodically so that resources are not wasted for display.

Once the initial bulk training is completed, all the messages once listed as will now be found in the archive list for the GOOD and BAD nodes.  You can always check how many messages have been trained thus far by viewing the Train tab page of the Bayesian configuration.

To begin using the Bayesian filter to classify your messages, you will need to make the setting change from training on this page.

Note:

Once training has been performed, the message samples will remain in the TrainBAD or TrainGOOD physical folders on the disk indefinitely.  The reason for leaving these message samples is to allow you to re-use them in future initial training after resetting (emptying) the token database.  If you want, you may delete these files by using the administration program and highlighting all entries in for the and/or and then right-mouse click and choose to delete them.