What are Bayesian Statistics? What is Bayesian Mathematics?
Bayesian calculations indicate how likely it is that an item falls within a certain category based on information that has been collected from samples and other information for each possible category. The process is that first, statistical models are created for these categories using Bayesian methods. The items to be categorized are then compared to the models. The Bayesian calculation produces a number, usually shown as a percentage, that indicates how likely it is that the item is part of each category. Generally, items are assigned to the category with the highest probability estimate.
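The calculation described above is usually written as Bayes' rule. The symbols here are illustrative and do not come from the original text: C_k stands for one of the possible categories and x for the observations made about the item.

```latex
P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{\sum_{j} P(x \mid C_j)\, P(C_j)}
```

P(C_k) is how common the category is in the sample, P(x | C_k) is how often items in that category show the observations x, and the denominator makes the probabilities across all categories add up to 1. The item is then typically assigned to the category with the largest P(C_k | x).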
The Bayesian approach differs from rule-based systems for categorizing items. For example, if rules were used to categorize playing cards, a rule for a "picture card category" might be "if a card is greater than 10 and less than an ace, it is a picture card." In contrast, the Bayesian approach would be to compare the card to be categorized to the statistical model of the cards already assigned to each category. This allows the user to specify the observations to be used. For instance, if the observation were “the number of colors on the card,” then in general picture cards would have high numbers, while other cards would have low numbers. If the card to be categorized has more than one color, then the probability that it is a picture card would be expected to be very high.
Bayesians can use other information, not just the colors on the card, to help categorize the card. For example, you may note that picture cards all have images on them or that none of them have a number on them. Any observation can be used to make the estimate even more accurate.
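The playing card example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample data, the two yes/no observations ("multi_color" and "has_number"), and the function names are all made up for this sketch, and it uses a simple "naive" Bayes model that treats the observations as independent.

```python
from collections import Counter, defaultdict

# Made-up sample: 12 picture cards and 40 other cards, each described by two
# yes/no observations chosen purely for illustration.
SAMPLE = (
    [("picture", {"multi_color": True, "has_number": False})] * 12
    + [("other", {"multi_color": False, "has_number": True})] * 36
    + [("other", {"multi_color": False, "has_number": False})] * 4  # aces
)


def train(sample):
    """Count how often each observation value occurs in each category."""
    category_counts = Counter()
    feature_counts = defaultdict(Counter)  # category -> (name, value) -> count
    for category, observations in sample:
        category_counts[category] += 1
        for item in observations.items():
            feature_counts[category][item] += 1
    return category_counts, feature_counts


def classify(observations, category_counts, feature_counts):
    """Return the estimated probability of each category for one card."""
    total = sum(category_counts.values())
    scores = {}
    for category, n in category_counts.items():
        score = n / total  # how common the category is in the sample
        for item in observations.items():
            # Add-one smoothing; every observation here has two possible values.
            score *= (feature_counts[category][item] + 1) / (n + 2)
        scores[category] = score
    norm = sum(scores.values())
    return {category: score / norm for category, score in scores.items()}


category_counts, feature_counts = train(SAMPLE)
card = {"multi_color": True, "has_number": False}
print(classify(card, category_counts, feature_counts))
```

With this sample, a multi-colored card with no number on it comes out as a picture card with a probability well above 90%, even though no rule about card ranks was ever written down.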
Of course, this simple picture card example does not really show the power of the Bayesian approach. Picture cards can be easily defined by rules. In more complex examples, rules that always define a category correctly may be hard or impossible to write. Also, new situations may arise that do not fit neatly into existing rules. Bayesian statistics are powerful because you do not need to know explicit rules and exceptions for how an item was categorized; you only need to have a sample to measure. For example, molecular scientists may not know how certain molecular interactions work, but they can observe the results. Bayesian statistics are very useful for categorizing these kinds of results.
Another benefit of the Bayesian approach is that as new items are categorized, the Bayesian estimations can be automatically updated. This is because new information can help to refine or reinforce the models. Bayesian models are also self-correcting; as the analysis changes with new information, so do the results from the models.
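A minimal sketch of this kind of updating is shown here. The class name and the add-one smoothing are assumptions made for illustration; the point is only that each newly categorized item changes the counts, and the estimate follows.

```python
from collections import Counter


class RunningEstimate:
    """Tracks how often one yes/no observation occurs within each category."""

    def __init__(self):
        self.seen = Counter()          # items seen per category
        self.with_feature = Counter()  # items per category showing the feature

    def update(self, category, has_feature):
        """Fold one newly categorized item into the counts."""
        self.seen[category] += 1
        if has_feature:
            self.with_feature[category] += 1

    def probability(self, category):
        """Estimate P(feature | category) with add-one smoothing."""
        return (self.with_feature[category] + 1) / (self.seen[category] + 2)


estimate = RunningEstimate()
print(estimate.probability("picture"))  # 0.5 before any items are seen
for _ in range(3):
    estimate.update("picture", has_feature=True)
print(estimate.probability("picture"))  # 0.8 after three matching items
```

If later items contradict the earlier ones, the same update pulls the estimate back, which is the self-correcting behavior described above.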
How do Bayesian anti-spam products work?
Bayesian spam or junk e-mail filters generally have two things in common: (1) they use previous examples of actual e-mail and spam messages to classify new mail and (2) they apply Bayesian statistics to observations to calculate probabilities. Bayesian junk e-mail filters estimate the likelihood that a message should or should not be blocked based on a wide range of content. This approach differs from rule- or list-based systems that block e-mail messages that contain a certain combination of stop words or are from a "blacklist" of bad domains.
Almost all Bayesian spam and junk e-mail filters sample and categorize actual e-mail messages. For example, the messages might be sorted into a "spam" category (junk e-mails) and a "ham" category (good e-mails). The Bayesian paradigm does not define how the samples are to be collected, where the messages might come from, or how they are to be categorized. Each company can use its own methodology. For example, some companies may collect their own messages and will build models from them. Other companies may collect information from the user's e-mail folders.
The next step is to statistically analyze the content of the messages to identify significant or meaningful characteristics. For example, certain words may be used frequently in good messages, but rarely in junk e-mail messages. (The names of family members or your company's products are often strong indicators of good e-mail.)
The key to this step is a process known as tokenization. A token is the smallest unit for which a statistic is collected. A word may be a token. However, some companies may also identify punctuation marks, phrases, e-mail addresses, and invisible contents as tokens. In some cases, a word in the body of an e-mail and the same word in the subject line of an e-mail will be considered separate tokens. Tokens may also be weighted to give extra importance to certain types of tokens. Much of the quality of a Bayesian junk e-mail filter, and its ability to identify spam properly, rests on the skill of the designer of the tokenization process.
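As a rough illustration of tokenization, the sketch below splits a message into words, e-mail addresses, and runs of "!" or "?", and marks subject-line tokens with a "subject:" prefix so they are counted separately from body tokens. The regular expression, the prefix, and the function name are assumptions for this sketch; real products make their own, usually far more elaborate, choices.

```python
import re

# E-mail addresses first, then plain words, then runs of "!" or "?".
TOKEN_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|[a-z0-9$'-]+|[!?]+")


def tokenize(subject, body):
    """Return a list of tokens, keeping subject tokens distinct from body tokens."""
    tokens = [f"subject:{t}" for t in TOKEN_RE.findall(subject.lower())]
    tokens += TOKEN_RE.findall(body.lower())
    return tokens


print(tokenize("Free offer!!", "Mail bob@example.com now"))
# ['subject:free', 'subject:offer', 'subject:!!', 'mail', 'bob@example.com', 'now']
```

Here the word "free" in a subject line and "free" in a message body would be tracked as two different tokens, each with its own statistics.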
Each token is assigned a mathematical value based on how strongly it indicates whether a given message is a good message or junk e-mail. A straightforward mathematical formula is used to evaluate all of the statistics (or just the most significant statistics) to produce a probability estimate that a message is either good or junk e-mail. The result is not a categorization, but rather the probability that the message is either good or junk. For example, the process might compute that the probability that a message is spam is 0.80 (80%), based on its content.
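One way to picture this step is the sketch below. It is a simplified version of the combining rule popularized by Paul Graham's A Plan For Spam, not necessarily the formula used by InBoxer or any other product. The function names, the 0.01 to 0.99 clamp, the neutral value of 0.5 for unseen tokens, and the choice of the 15 most significant tokens are assumptions of this sketch.

```python
import math


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly one token indicates junk (near 1) or good mail (near 0)."""
    spam_rate = spam_counts.get(token, 0) / n_spam
    ham_rate = ham_counts.get(token, 0) / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # unseen token: treated as neutral in this sketch
    p = spam_rate / (spam_rate + ham_rate)
    return min(0.99, max(0.01, p))  # never let one token be absolutely certain


def message_spam_probability(tokens, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the most significant token values into one probability."""
    probs = [
        token_spam_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokens)
    ]
    # "Most significant" here means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    # prod(p) / (prod(p) + prod(1 - p)), computed with logarithms for stability.
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1 - p) for p in probs)
    return 1 / (1 + math.exp(log_ham - log_spam))


# Made-up counts from 10 junk and 10 good messages.
spam_counts = {"viagra": 8, "meeting": 1}
ham_counts = {"meeting": 9}
print(message_spam_probability(["viagra", "meeting"], spam_counts, ham_counts, 10, 10))
```

In this made-up example the strong junk indicator outweighs the weaker good indicator, so the message scores roughly 0.92. A message with no known tokens scores 0.5.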
A characteristic of many Bayesian systems is that there can be a "maybe" category. It is possible to calculate that a given message has a 0.50 (50%) probability of being junk because it has both strong positive and negative indicators. The identification as “maybe” is under the user’s control, based on the observed probabilities.
Once a message is reviewed and identified as good or junk e-mail, its contents are added to the statistics that make future evaluations more accurate. Many Bayesian junk e-mail and spam filters will allow the user to adjust the threshold for good, junk, and maybe messages. Proper thresholds will minimize mistakes and false alarms.
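The thresholds and the feedback step can be sketched as follows. The cutoff values of 0.20 and 0.90, the category names, and the function names are illustrative assumptions, not settings taken from any particular product.

```python
from collections import Counter


def label(probability, ham_cutoff=0.20, spam_cutoff=0.90):
    """Turn a junk probability into 'good', 'maybe', or 'junk'."""
    if probability >= spam_cutoff:
        return "junk"
    if probability <= ham_cutoff:
        return "good"
    return "maybe"


def learn(tokens, counts):
    """Add a reviewed message's tokens to the counts for its category."""
    counts.update(set(tokens))  # count each token once per message


print(label(0.95), label(0.50), label(0.05))  # junk maybe good

junk_counts = Counter()
learn(["free", "offer", "free"], junk_counts)
print(junk_counts["free"])  # 1
```

Lowering spam_cutoff blocks more junk at the cost of more false alarms, and raising ham_cutoff does the reverse, which is why letting the user tune both values matters.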
Many of the original concepts used in Bayesian spam and junk e-mail filters may be found in Paul Graham's A Plan For Spam and Better Bayesian Filtering. A detailed description of the Bayesian mathematics is given in Gary Robinson's Rants. Much of the core work in InBoxer comes from the efforts of the SpamBayes community and the work of Python Labs.
Why Are Bayesian Spam and Junk E-Mail Filters Better?
U.S. Supreme Court Justice Potter Stewart once wrote that he might never succeed in defining hard-core pornography, but "I know it when I see it." The same could be said about spam. It is hard to create a set of rules that everyone would accept as a definition of spam, but each individual may know what he or she considers to be spam on seeing it. That means that the best way to categorize spam is for each person to categorize it by example.
Bayesian spam and junk e-mail filters are based on real-world examples of junk e-mail and spam, which are used to classify future e-mail. The best ones let each user define spam so that the filter is highly personalized. Bayesian filters also automatically update and are self-correcting as they process new information and add it to the database.
In addition, Bayesian systems generally act faster to block new types of junk e-mail. This is because rule- and list-based spam blockers react to spam; they do not anticipate it. Each set of rules or sender names to be blocked is created in response to the initial flood of messages. Then, after the messages arrive, the anti-spam product is updated. Unfortunately, the spammers get the updates at the same time as the filter users. This enables the spammers to modify their messages so that the next salvo will bypass even the newest filters.
Are All Bayesian Spam and Junk E-Mail Filters The Same?
No. The term "Bayesian spam or junk e-mail filter" generally means that past examples of messages are used to predict whether a new message is junk and that certain types of statistical formulas are used to do the prediction. Beyond that, each filter uses different methods to calculate the result. Three factors are often considered to be very important for a technical evaluation:
- Bayesian filters use previous messages as the starting point for analysis. Each company is free to determine its source for these messages. For example, one company collects opinions from a large panel of users to categorize a message as either good or junk e-mail. This has the advantage of providing broad input. It has the disadvantage of not being personalized. Therefore, a community can identify an e-mail about a topic that concerns you as junk. (A community may define solicitations for a political campaign as junk and would then block them. However, if you are a political science student, you may have wanted to see those messages.)
- The tokenization process is critical. Each company is free to break up an e-mail message in any way and will make a determination as to the weight of each term. For example, a company may determine that a word in the subject line is twice as important as a word in the body of a message. One company may decide to ignore the time of day that a message was sent and another could consider it a strong indicator of spam. Companies need extensive experience in understanding language usage and technologies to make optimal choices.
- The mathematical formulas used to calculate the Bayesian statistics are very important and choices in the formulation can change the result of the analysis. Companies with extensive mathematical backgrounds can have an edge.
InBoxer's shameless plug: InBoxer, Inc. believes that the InBoxer product is superior for the following reasons: (1) InBoxer uses the mail messages kept by the user or identified as junk by the user for its source material. As such, it accurately reflects the user's view of which messages are important to read, rather than a community standard that is prone to mistakes. (2) InBoxer's tokenization is done using significant expertise gained in the speech and language industry. Such experience is hard to duplicate in other ways. (3) Extensive experience with high-performance mathematical formulas is common to the work being done by InBoxer.