# Using Naive Bayes in the Classification of Spam Email

Team Members: Aspen Tng, Damien Snyder, Yadi Wang, Thai Quoc Hoang

### Overview

This article is a step-by-step guide to building a (very simplified) machine learning model that classifies spam emails, with Bayes' Theorem at its core.

Under the hood, we have a list of emails that are already labeled as spam or non-spam. We want to use this existing list of emails as data to predict whether an unseen email is spam or not.

### Assumptions

This simplified model

1. ignores punctuation

2. ignores repeated words

3. treats all words as conditionally independent
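These assumptions translate directly into how an email is turned into words. Here is a minimal sketch (the `tokenize` name and the regular expression are our own illustrative choices, not part of the original model): lowercasing and a letters-only pattern drop the punctuation, and returning a set collapses repeated words.

```python
import re

def tokenize(text):
    """Split an email into lowercase words, mirroring the assumptions:
    punctuation is dropped and repeated words collapse into a set."""
    return set(re.findall(r"[a-z']+", text.lower()))
```

For example, `tokenize("Send us your food! Food!!")` yields `{"send", "us", "your", "food"}`.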

### Architecture of machine

In order to create such a machine, we need the following components:

#### a. What is an ML model?

For simplicity, we can think of an ML model as an artifact that "learns" the patterns of the data given to it during a training process. In this example, our ML model is the core component that classifies whether an email is spam or non-spam.

#### b. Training data

The training data used in our example is the list of previously labeled emails. This data is fed into our ML model so the machine can learn its patterns.

#### c. Intelligence

Finally, intelligence refers to the final product obtained by training the ML model on the training data. This intelligence can look at any "unseen"/new email and predict whether it should be classified as spam or not. The quality of the intelligence is strongly influenced by both the ML model and the training data.

Let's get started!

In this section, we will walk through the process of training an email classifier using Naive Bayes interactively. You control how your own training inputs are classified (by dragging the boxes between the different columns).

#### Simple Spam/Non-spam Classifier

Drag and drop the white boxes from the "To Classify" section to either "Spam" or "Non-Spam".

The first box of text in "Spam" and "Non-spam" are examples, and cannot be moved.

##### 3. Now, we know the total number of spam/non-spam emails in our training set and the number of spam/non-spam emails containing each word that we just processed.

This wraps up our training section. Next, we'll move on to how we predict whether an "unread"/new email is spam or not.
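The training step described above boils down to counting. As a sketch (the function and variable names are illustrative), each training email is a set of words paired with its label, and we tally emails per label plus, for each label, how many emails contain each word:

```python
from collections import Counter

def train(labeled_emails):
    """Count total emails per label and, for each label, the number
    of emails containing each word (a word counts once per email)."""
    n_emails = Counter()
    word_counts = {"spam": Counter(), "nonspam": Counter()}
    for words, label in labeled_emails:
        n_emails[label] += 1
        for w in words:
            word_counts[label][w] += 1
    return n_emails, word_counts
```

Note that because each email is a set of words, a word repeated within one email is still only counted once, matching assumption 2 above.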

### Testing the Classifier

How does the classifier classify? What happens under the hood? Navigate using the 2 buttons to learn more.

Let's now try to use this classifier to predict whether an "unseen" email is spam or not.

Let's look at this sentence as an example: "send us your food"

First, let's break the sentence into words:

### { send, us, your, food }

Then, let us calculate P(word|spam): given that an email is spam, what is the probability that each of these words appears in it? Using add-one smoothing, for each word w in { send, us, your, food }:

P(w|spam) = (|spam emails containing w| + 1) / (|total spam emails| + 2)

The counts come from the training emails you classified above.
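The formula above can be written as a small helper. This is a sketch assuming counts like those produced in the training step (names are our own); the `+1` and `+2` terms are the add-one (Laplace) smoothing from the formula, which keeps unseen words from producing a probability of zero:

```python
def word_likelihood(word, label, n_emails, word_counts):
    """P(word | label) with add-one smoothing:
    (emails of this label containing word + 1) / (emails of this label + 2)."""
    return (word_counts[label].get(word, 0) + 1) / (n_emails[label] + 2)
```

For instance, with 2 spam emails of which 2 contain "food", P(food|spam) = (2 + 1) / (2 + 2) = 0.75.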

In the same way, we can then calculate P(word|nonspam): given that an email is non-spam, what is the probability that each of these words appears in it? For each word w:

P(w|nonspam) = (|non-spam emails containing w| + 1) / (|total non-spam emails| + 2)

Let's then take a step back and calculate P(spam) and P(nonspam): based on our simple training data, what is the probability that a random email is spam, or non-spam?

P(spam) = |total spam emails| / |total emails|

P(nonspam) = |total non-spam emails| / |total emails|
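The priors are just label frequencies in the training set. A one-line sketch (names illustrative, taking the per-label email counts from training):

```python
def prior(label, n_emails):
    """P(label): the fraction of training emails carrying that label."""
    return n_emails[label] / sum(n_emails.values())
```

With 3 spam and 1 non-spam training emails, P(spam) = 3 / 4 = 0.75.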

Now, recall the formula log(A·B) = log(A) + log(B).

Because probabilities lie in the range [0, 1], multiplying many of them together produces extremely small numbers (risking numerical underflow). To prevent that, we apply the log transformation shown above, turning the products into sums:

spam score = log(P(spam)) + log(P(send|spam)) + log(P(us|spam)) + log(P(your|spam)) + log(P(food|spam))

non-spam score = log(P(nonspam)) + log(P(send|nonspam)) + log(P(us|nonspam)) + log(P(your|nonspam)) + log(P(food|nonspam))

Because log is a monotonically increasing function, log(P(A)) > log(P(B)) guarantees P(A) > P(B), so comparing log scores is equivalent to comparing the probabilities themselves.

For that reason, if the spam score is greater than the non-spam score, the given email is classified as spam;

and if the spam score is less than the non-spam score, it is classified as non-spam.

The given email is then predicted accordingly.
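Putting the pieces together, the whole decision rule can be sketched as one function (a minimal implementation under the same add-one smoothing; the names and data layout are our own, matching the earlier sketches):

```python
import math

def classify(words, n_emails, word_counts):
    """Score each label as log P(label) plus the summed log-likelihoods
    of the email's words, then return the label with the higher score."""
    total = sum(n_emails.values())
    scores = {}
    for label in n_emails:
        score = math.log(n_emails[label] / total)  # log prior
        for w in words:
            count = word_counts[label].get(w, 0)
            score += math.log((count + 1) / (n_emails[label] + 2))  # smoothed
        scores[label] = score
    return max(scores, key=scores.get)
```

Summing logs instead of multiplying raw probabilities keeps the scores in a safe numeric range even for long emails.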


## Analysis

#### Explanation

For a single word, we estimate the probability that an email containing it is spam.

Suppose the email contains a particular word w. From the training data, we know what fraction of real emails contain w and what fraction of spam emails contain w. Then:

* P(real) × P(w|real) = the fraction of all emails that are real and contain w.

* P(spam) × P(w|spam) = the fraction of all emails that are spam and contain w.

Dividing the second quantity by the sum of the two gives the fraction of emails containing w that are spam:

P(spam|w) = P(spam) × P(w|spam) / (P(spam) × P(w|spam) + P(real) × P(w|real))
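This per-word posterior is just Bayes' rule. A sketch with illustrative names, taking the three probabilities as inputs:

```python
def word_posterior(p_spam, p_word_given_spam, p_word_given_real):
    """P(spam | word): the share of word-containing emails that are spam."""
    spam_and_word = p_spam * p_word_given_spam
    real_and_word = (1 - p_spam) * p_word_given_real
    return spam_and_word / (spam_and_word + real_and_word)
```

For example, if half of all emails are spam, 80% of spam emails contain the word, and 20% of real emails do, then P(spam|word) = 0.4 / (0.4 + 0.1) = 0.8.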