# Using Naive Bayes in the Classification of Spam Email

Team Members: Aspen Tng, Damien Snyder, Yadi Wang, Thai Quoc Hoang

### Overview

This article is a step-by-step guide to building a (very simplified) machine learning model that classifies spam emails, with Bayes' Theorem at its core.

Under the hood, we have a list of emails that are already labeled as spam or non-spam. We want to use this existing list of emails as data to predict whether an unseen email is spam or not.

### Assumptions

This simplified model

1. ignores punctuation

2. ignores repeated words

3. treats all words as conditionally independent
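These assumptions translate directly into how an email is turned into words. Here is a minimal sketch (the `tokenize` name and the regular expression are our own illustrative choices, not part of the original model): lowercasing and a letters-only pattern drop the punctuation, and returning a set collapses repeated words.

```python
import re

def tokenize(text):
    """Split an email into lowercase words, mirroring the assumptions:
    punctuation is dropped and repeated words collapse into a set."""
    return set(re.findall(r"[a-z']+", text.lower()))
```

For example, `tokenize("Send us your food! Food!!")` yields `{"send", "us", "your", "food"}`.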

### Architecture of machine

In order to create such a machine, we need the following components:

#### a. What is an ML model?

For simplicity, we can think of an ML model as an artifact that "learns" the patterns of the data given to it during a training process. In this example, our ML model is the core component that classifies whether an email is spam or non-spam.

#### b. Training data

The training data used in our example is the list of previously labeled emails. This data is fed into our ML model so the machine can learn its patterns.

#### c. Intelligence

Finally, intelligence refers to the final product obtained by training the ML model on the training data. This intelligence can look at any "unseen"/new email and predict whether it should be classified as spam or not. The quality of the intelligence is strongly influenced by both the ML model and the training data.

Let's get started!

In this section, we will walk through the process of training an email classifier using Naive Bayes interactively. You control how your own training inputs are classified (by dragging the boxes between the different columns).

#### Simple Spam/Non-spam Classifier

Drag and drop the white boxes from the "To Classify" section to either "Spam" or "Non-Spam".

The first box of text in "Spam" and "Non-spam" are examples, and cannot be moved.

##### 3. Now, we know the total number of spam/non-spam emails in our training set and the number of spam/non-spam emails containing each word that we just processed.

This wraps up our training section. Next, we'll move on to how we predict whether an "unread"/new email is spam or not.
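The training step described above boils down to counting. As a sketch (the function and variable names are illustrative), each training email is a set of words paired with its label, and we tally emails per label plus, for each label, how many emails contain each word:

```python
from collections import Counter

def train(labeled_emails):
    """Count total emails per label and, for each label, the number
    of emails containing each word (a word counts once per email)."""
    n_emails = Counter()
    word_counts = {"spam": Counter(), "nonspam": Counter()}
    for words, label in labeled_emails:
        n_emails[label] += 1
        for w in words:
            word_counts[label][w] += 1
    return n_emails, word_counts
```

Note that because each email is a set of words, a word repeated within one email is still only counted once, matching assumption 2 above.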

### Testing the Classifier

How does the classifier classify? What happens under the hood? Navigate using the 2 buttons to learn more.

Let's now try to use this classifier to predict whether an "unseen" email is spam or not.

Let's look at this sentence as an example: "send us your food"

First, let's break the sentence into words:

### { send, us, your, food }

Then, let us calculate P(word|spam): given that an email is spam, what is the probability that each of these words appears in it? Using add-one smoothing, for each word w in { send, us, your, food }:

P(w|spam) = (|spam emails containing w| + 1) / (|total spam emails| + 2)

The counts come from the training emails you classified above.
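The formula above can be written as a small helper. This is a sketch assuming counts like those produced in the training step (names are our own); the `+1` and `+2` terms are the add-one (Laplace) smoothing from the formula, which keeps unseen words from producing a probability of zero:

```python
def word_likelihood(word, label, n_emails, word_counts):
    """P(word | label) with add-one smoothing:
    (emails of this label containing word + 1) / (emails of this label + 2)."""
    return (word_counts[label].get(word, 0) + 1) / (n_emails[label] + 2)
```

For instance, with 2 spam emails of which 2 contain "food", P(food|spam) = (2 + 1) / (2 + 2) = 0.75.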

In the same way, we can then calculate P(word|nonspam): given that an email is non-spam, what is the probability that each of these words appears in it? For each word w:

P(w|nonspam) = (|non-spam emails containing w| + 1) / (|total non-spam emails| + 2)

Let's then take a step back and calculate P(spam) and P(nonspam): based on our simple training data, what is the probability that a random email is spam, or non-spam?

P(spam) = |total spam emails| / |total emails|

P(nonspam) = |total non-spam emails| / |total emails|
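The priors are just label frequencies in the training set. A one-line sketch (names illustrative, taking the per-label email counts from training):

```python
def prior(label, n_emails):
    """P(label): the fraction of training emails carrying that label."""
    return n_emails[label] / sum(n_emails.values())
```

With 3 spam and 1 non-spam training emails, P(spam) = 3 / 4 = 0.75.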

Now, recall the formula log(A·B) = log(A) + log(B).

Because probabilities lie in the range [0, 1], multiplying many of them together produces extremely small numbers (risking numerical underflow). To prevent that, we apply the log transformation shown above, turning the products into sums:

spam score = log(P(spam)) + log(P(send|spam)) + log(P(us|spam)) + log(P(your|spam)) + log(P(food|spam))

non-spam score = log(P(nonspam)) + log(P(send|nonspam)) + log(P(us|nonspam)) + log(P(your|nonspam)) + log(P(food|nonspam))

Because log is a monotonically increasing function, log(P(A)) > log(P(B)) guarantees P(A) > P(B), so comparing log scores is equivalent to comparing the probabilities themselves.

For that reason, if the spam score is greater than the non-spam score, the given email is classified as spam;

and if the spam score is less than the non-spam score, it is classified as non-spam.

The given email is then predicted accordingly.
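Putting the pieces together, the whole decision rule can be sketched as one function (a minimal implementation under the same add-one smoothing; the names and data layout are our own, matching the earlier sketches):

```python
import math

def classify(words, n_emails, word_counts):
    """Score each label as log P(label) plus the summed log-likelihoods
    of the email's words, then return the label with the higher score."""
    total = sum(n_emails.values())
    scores = {}
    for label in n_emails:
        score = math.log(n_emails[label] / total)  # log prior
        for w in words:
            count = word_counts[label].get(w, 0)
            score += math.log((count + 1) / (n_emails[label] + 2))  # smoothed
        scores[label] = score
    return max(scores, key=scores.get)
```

Summing logs instead of multiplying raw probabilities keeps the scores in a safe numeric range even for long emails.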


## Analysis

#### Explanation

For a single word, we estimate the probability that an email containing it is spam.

Suppose the email contains a particular word w. From the training data, we know what fraction of real emails contain w and what fraction of spam emails contain w. Then:

* P(real) × P(w|real) = the fraction of all emails that are real and contain w.

* P(spam) × P(w|spam) = the fraction of all emails that are spam and contain w.

Dividing the second quantity by the sum of the two gives the fraction of emails containing w that are spam:

P(spam|w) = P(spam) × P(w|spam) / (P(spam) × P(w|spam) + P(real) × P(w|real))
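This per-word posterior is just Bayes' rule. A sketch with illustrative names, taking the three probabilities as inputs:

```python
def word_posterior(p_spam, p_word_given_spam, p_word_given_real):
    """P(spam | word): the share of word-containing emails that are spam."""
    spam_and_word = p_spam * p_word_given_spam
    real_and_word = (1 - p_spam) * p_word_given_real
    return spam_and_word / (spam_and_word + real_and_word)
```

For example, if half of all emails are spam, 80% of spam emails contain the word, and 20% of real emails do, then P(spam|word) = 0.4 / (0.4 + 0.1) = 0.8.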