# Using Naive Bayes in the Classification of Spam Email

Team Members: Aspen Tng, Damien Snyder, Yadi Wang, Thai Quoc Hoang

### Overview

This article is a step-by-step guide to building a (very simplified) machine learning model that classifies spam emails, with Bayes' Theorem at its core.

Under the hood, we have a list of emails that are already labeled as spam or non-spam. We want to use this existing list of emails as data to predict whether an unseen email is spam or not.

### Assumptions

This simplified model

1. ignores punctuation

2. ignores repeated words

3. treats all words as conditionally independent
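These assumptions translate directly into how an email is turned into words. Here is a minimal sketch (the `tokenize` name and the regular expression are our own illustrative choices, not part of the original model): lowercasing and a letters-only pattern drop the punctuation, and returning a set collapses repeated words.

```python
import re

def tokenize(text):
    """Split an email into lowercase words, mirroring the assumptions:
    punctuation is dropped and repeated words collapse into a set."""
    return set(re.findall(r"[a-z']+", text.lower()))
```

For example, `tokenize("Send us your food! Food!!")` yields `{"send", "us", "your", "food"}`.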

### Architecture of machine

In order to create such a machine, we need the following components:

#### a. What is an ML model?

For simplicity, we can think of an ML model as an artifact that "learns" the patterns of the data given to it during a training process. In this example, our ML model is the core component that classifies whether an email is spam or non-spam.

#### b. Training data

The training data used in our example is the list of previously labeled emails. This data is fed into our ML model so the machine can learn its patterns.

#### c. Intelligence

Finally, intelligence refers to the final product obtained by training the ML model on the training data. This intelligence can look at any "unseen"/new email and predict whether it should be classified as spam or not. The quality of the intelligence is strongly influenced by both the ML model and the training data.

Let's get started!

In this section, we will walk through the process of training an email classifier using Naive Bayes interactively. You control how your own training inputs are classified (by dragging the boxes between the different columns).

#### Simple Spam/Non-spam Classifier

Drag and drop the white boxes from the "To Classify" section to either "Spam" or "Non-Spam".

The first box of text in "Spam" and "Non-spam" are examples, and cannot be moved.

##### 3. Now, we know the total number of spam/non-spam emails in our training set and the number of spam/non-spam emails containing each word that we just processed.

This wraps up our training section. Next, we'll move on to how we predict whether an "unread"/new email is spam or not.
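The training step described above boils down to counting. As a sketch (the function and variable names are illustrative), each training email is a set of words paired with its label, and we tally emails per label plus, for each label, how many emails contain each word:

```python
from collections import Counter

def train(labeled_emails):
    """Count total emails per label and, for each label, the number
    of emails containing each word (a word counts once per email)."""
    n_emails = Counter()
    word_counts = {"spam": Counter(), "nonspam": Counter()}
    for words, label in labeled_emails:
        n_emails[label] += 1
        for w in words:
            word_counts[label][w] += 1
    return n_emails, word_counts
```

Note that because each email is a set of words, a word repeated within one email is still only counted once, matching assumption 2 above.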

### Testing the Classifier

How does the classifier classify? What happens under the hood? Navigate using the 2 buttons to learn more.

Let's now try to use this classifier to predict whether an "unseen" email is spam or not.

Let's look at this sentence as an example: "send us your food"

First, let's break the sentence into words:

### { send, us, your, food }

Then, let us calculate P(word|spam): given that an email is spam, what is the probability that each of these words appears in it? Using add-one smoothing, for each word w in { send, us, your, food }:

P(w|spam) = (|spam emails containing w| + 1) / (|total spam emails| + 2)

The counts come from the training emails you classified above.
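The formula above can be written as a small helper. This is a sketch assuming counts like those produced in the training step (names are our own); the `+1` and `+2` terms are the add-one (Laplace) smoothing from the formula, which keeps unseen words from producing a probability of zero:

```python
def word_likelihood(word, label, n_emails, word_counts):
    """P(word | label) with add-one smoothing:
    (emails of this label containing word + 1) / (emails of this label + 2)."""
    return (word_counts[label].get(word, 0) + 1) / (n_emails[label] + 2)
```

For instance, with 2 spam emails of which 2 contain "food", P(food|spam) = (2 + 1) / (2 + 2) = 0.75.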

In the same way, we can then calculate P(word|nonspam): given that an email is non-spam, what is the probability that each of these words appears in it? For each word w:

P(w|nonspam) = (|non-spam emails containing w| + 1) / (|total non-spam emails| + 2)

Let's then take a step back and calculate P(spam) and P(nonspam): based on our simple training data, what is the probability that a random email is spam, or non-spam?

P(spam) = |total spam emails| / |total emails|

P(nonspam) = |total non-spam emails| / |total emails|
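The priors are just label frequencies in the training set. A one-line sketch (names illustrative, taking the per-label email counts from training):

```python
def prior(label, n_emails):
    """P(label): the fraction of training emails carrying that label."""
    return n_emails[label] / sum(n_emails.values())
```

With 3 spam and 1 non-spam training emails, P(spam) = 3 / 4 = 0.75.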

Now, recall the formula log(A·B) = log(A) + log(B).

Because probabilities lie in the range [0, 1], multiplying many of them together produces extremely small numbers (risking numerical underflow). To prevent that, we apply the log transformation shown above, turning the products into sums:

spam score = log(P(spam)) + log(P(send|spam)) + log(P(us|spam)) + log(P(your|spam)) + log(P(food|spam))

non-spam score = log(P(nonspam)) + log(P(send|nonspam)) + log(P(us|nonspam)) + log(P(your|nonspam)) + log(P(food|nonspam))

Because log is a monotonically increasing function, log(P(A)) > log(P(B)) guarantees P(A) > P(B), so comparing log scores is equivalent to comparing the probabilities themselves.

For that reason, if the spam score is greater than the non-spam score, the given email is classified as spam;

and if the spam score is less than the non-spam score, it is classified as non-spam.

The given email is then predicted accordingly.
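Putting the pieces together, the whole decision rule can be sketched as one function (a minimal implementation under the same add-one smoothing; the names and data layout are our own, matching the earlier sketches):

```python
import math

def classify(words, n_emails, word_counts):
    """Score each label as log P(label) plus the summed log-likelihoods
    of the email's words, then return the label with the higher score."""
    total = sum(n_emails.values())
    scores = {}
    for label in n_emails:
        score = math.log(n_emails[label] / total)  # log prior
        for w in words:
            count = word_counts[label].get(w, 0)
            score += math.log((count + 1) / (n_emails[label] + 2))  # smoothed
        scores[label] = score
    return max(scores, key=scores.get)
```

Summing logs instead of multiplying raw probabilities keeps the scores in a safe numeric range even for long emails.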


## Analysis

#### Explanation

For a single word, we estimate the probability that an email containing it is spam.

Suppose the email contains a particular word w. From the training data, we know what fraction of real emails contain w and what fraction of spam emails contain w. Then:

* P(real) × P(w|real) = the fraction of all emails that are real and contain w.

* P(spam) × P(w|spam) = the fraction of all emails that are spam and contain w.

Dividing the second quantity by the sum of the two gives the fraction of emails containing w that are spam:

P(spam|w) = P(spam) × P(w|spam) / (P(spam) × P(w|spam) + P(real) × P(w|real))
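This per-word posterior is just Bayes' rule. A sketch with illustrative names, taking the three probabilities as inputs:

```python
def word_posterior(p_spam, p_word_given_spam, p_word_given_real):
    """P(spam | word): the share of word-containing emails that are spam."""
    spam_and_word = p_spam * p_word_given_spam
    real_and_word = (1 - p_spam) * p_word_given_real
    return spam_and_word / (spam_and_word + real_and_word)
```

For example, if half of all emails are spam, 80% of spam emails contain the word, and 20% of real emails do, then P(spam|word) = 0.4 / (0.4 + 0.1) = 0.8.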