Bayesian spam filtering
Bayesian spam filtering is the process of using
Bayesian statistical methods to classify documents into categories.
Bayesian filtering was proposed by Sahami et al. (1998) and gained attention
in 2002 when it was described in the paper "A Plan for Spam" by Paul Graham.
Since then it has become a popular mechanism to distinguish illegitimate spam
email from legitimate email. Many modern mail programs such as Mozilla
Thunderbird implement Bayesian
spam filtering. Server-side email filters, such as SpamAssassin and ASSP, make
use of Bayesian spam filtering techniques, and the functionality is sometimes
embedded within mail server software itself.
Advantages
The advantage of Bayesian spam filtering is that it can be trained on a
per-user basis.
The spam that a user receives is often related to the user's online
activities. For example, a user may have been subscribed to an online newsletter
that the user considers to be spam. This online newsletter is likely to contain
words that are common to all of its issues, such as the name of the newsletter
and its originating email address. A Bayesian spam filter will eventually assign
a higher spam probability to these words, based on the user's specific patterns.
The legitimate e-mails a user receives will tend to be different. For
example, in a corporate environment, the company name and the names of clients
or customers will be mentioned often. The filter will assign a lower spam
probability to emails containing those names.
The word probabilities are unique to each user and can evolve over time with
corrective training whenever the filter incorrectly classifies an email. As a
result, Bayesian spam filtering accuracy after training is often superior to
pre-defined rules.
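The per-user word probabilities and corrective training described above can be sketched as follows. This is a minimal illustration, not the method of any particular filter: the class name WordStats, its method names, the whitespace tokenizer, the add-one smoothing, and the assumption of equal prior probabilities for spam and legitimate mail are all choices made for this example.

```python
from collections import Counter


class WordStats:
    """Hypothetical per-user word statistics (illustration only)."""

    def __init__(self):
        self.spam_docs = 0
        self.ham_docs = 0
        self.spam_counts = Counter()  # number of spam messages containing a word
        self.ham_counts = Counter()   # number of legitimate messages containing a word

    def train(self, text, is_spam):
        """Record one message; each distinct word is counted once per message."""
        tokens = set(text.lower().split())  # naive tokenizer, an assumption
        if is_spam:
            self.spam_docs += 1
            self.spam_counts.update(tokens)
        else:
            self.ham_docs += 1
            self.ham_counts.update(tokens)

    def word_spam_probability(self, word):
        """Estimate P(spam | word), assuming equal priors for spam and ham."""
        word = word.lower()
        # Add-one smoothing keeps unseen words at a neutral 0.5.
        p_word_given_spam = (self.spam_counts[word] + 1) / (self.spam_docs + 2)
        p_word_given_ham = (self.ham_counts[word] + 1) / (self.ham_docs + 2)
        return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


stats = WordStats()
stats.train("weekly deals newsletter", is_spam=True)
stats.train("newsletter special offer", is_spam=True)
stats.train("meeting with acme tomorrow", is_spam=False)
print(stats.word_spam_probability("newsletter"))  # above 0.5 for this user
print(stats.word_spam_probability("acme"))        # below 0.5 for this user

# Corrective training: the user marks a message as legitimate,
# which lowers the spam probability of the words it contains.
stats.train("newsletter from acme", is_spam=False)
print(stats.word_spam_probability("newsletter"))
```

Because the counts belong to one user, two users who train on different mail end up with different probabilities for the same word.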
It can perform particularly well in avoiding false positives, where legitimate
email is incorrectly classified as spam. For example, if the email contains the
word "Nigeria", which frequently appeared in a long spam campaign, a pre-defined
rules filter might reject it outright. A Bayesian filter would mark the word
"Nigeria" as a probable spam word, but would take into account other important
words that usually indicate legitimate e-mail. For example, the name of a spouse
may strongly indicate the e-mail is not spam, which could overcome the use of
the word "Nigeria".
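The way one strongly legitimate word can outweigh a strong spam word can be sketched with the usual naive Bayes combination, which treats words as independent and assumes equal priors. The function name combine and the per-word probabilities below (0.95 for "nigeria", 0.02 for a spouse's name) are made-up values for illustration, not measurements from any filter.

```python
import math


def combine(word_probabilities):
    """Combine per-word spam probabilities into one message-level probability.

    Computes prod(p) / (prod(p) + prod(1 - p)) in log space.
    Each probability must lie strictly between 0 and 1.
    """
    log_spam = sum(math.log(p) for p in word_probabilities)
    log_ham = sum(math.log(1.0 - p) for p in word_probabilities)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Illustrative values only: "nigeria" alone looks like spam ...
print(combine([0.95]))
# ... but a word that almost never appears in this user's spam
# (for example a spouse's name) pulls the result below 0.5.
print(combine([0.95, 0.02]))
```

A fixed rule that rejects every message containing "Nigeria" has no equivalent way to weigh the second word.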
Some spam filters combine the results of both Bayesian spam filtering and
pre-defined rules, resulting in even higher filtering accuracy. Recent spammer
tactics include the insertion of random innocuous words that are not normally
associated with spam, thereby decreasing the email's spam score and making it
more likely to slip past a Bayesian spam filter.
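The effect of padding a message with innocuous words can be illustrated with a small sketch. It assumes a naive Bayes style score in which each word multiplies the odds of spam, with independent words and equal priors; the function name spam_score and all probabilities are hypothetical values chosen for this example.

```python
def spam_score(word_probabilities):
    """Message-level spam probability from per-word spam probabilities.

    Each word multiplies the spam odds by p / (1 - p); probabilities
    must lie strictly between 0 and 1.
    """
    odds = 1.0
    for p in word_probabilities:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)


spam_words = [0.9, 0.9, 0.8]  # made-up probabilities for typical spam words
innocuous = 0.3               # made-up probability for one padding word

# Each added innocuous word lowers the score a little further.
for padding in range(0, 9):
    score = spam_score(spam_words + [innocuous] * padding)
    print(padding, round(score, 3))
```

Under these assumed values the score falls steadily as padding words are added, which is why such padding can let a message slip through.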
References
- (Sahami et al., 1998): M. Sahami, S. Dumais, D. Heckerman, E. Horvitz:
A Bayesian approach to filtering junk e-mail, AAAI'98 Workshop on
Learning for Text Categorization, 1998.
This guide is licensed under the GNU
Free Documentation License. It uses material from the Wikipedia.