Sms spam-detection

SMS Spam Detection using Machine Learning Approach
Author:
Houshmand Shirani-Mehr
hshirani@stanford.edu

Presented By:
Name ID
Tanvirul Islam 152-15-6117
Md.Shariful Islam 152-15-5955
Mimshadur Rahman 152-15-6119
Juaid Rakin 152-15-5753
Shamsujoha Sumon 152-15-6099

Introduction
[Figure: incoming SMS, spam SMS, SMS]

Problem Definition
 Short Message Service (SMS) has grown into a multi-billion-dollar commercial industry.
 SMS spam is still not as common as email spam.
 SMS spam is more irritating than email spam.
 SMS spam is growing, and in 2012 up to 30% of text messages in parts of Asia were spam.

Description of Dataset
The dataset consists of 5574 text messages from the UCI Machine Learning Repository,
gathered in 2012.
Total messages    5574
Spam              747
Non-spam (ham)    4827
Source: SMS Spam Collection Data Set, UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
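A minimal sketch of reading data in this shape and checking the class balance. It assumes each record is a line of the form label, tab, message, with labels "ham" and "spam"; the sample lines are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, text) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)
        pairs.append((label, text))
    return pairs

def class_counts(pairs):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in pairs)

# Made-up sample in the assumed format (hypothetical messages).
sample = [
    "ham\tSee you at lunch\n",
    "spam\tWIN a FREE prize now\n",
    "ham\tCall me later\n",
]
pairs = parse_sms_lines(sample)
counts = class_counts(pairs)

# Class balance from the figures on the slide: 747 spam out of 5574 messages.
spam_share = 747 / 5574
print(counts, f"{100 * spam_share:.2f}% spam")
```

With the slide's counts, spam makes up roughly 13.4% of the messages, so the classes are noticeably imbalanced.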

Methodology
Training: SMS text → pre-processing → feature extraction → algorithm → classifier model
Testing: new SMS text → pre-processing → feature extraction → classifier model → classification of new text
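The pre-processing and feature-extraction stages could look like the sketch below. The slides do not say which pre-processing steps or features were used, so the lowercasing, the alphanumeric tokenizer, and the bag-of-words counts here are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(messages):
    """Map every token seen in the training messages to a column index."""
    vocab = sorted({tok for msg in messages for tok in preprocess(msg)})
    return {tok: i for i, tok in enumerate(vocab)}

def extract_features(text, vocab):
    """Return a token-count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(preprocess(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec

# Training path: the vocabulary is built from the training messages only.
train = ["Free prize, claim your FREE prize!", "See you at lunch"]
vocab = build_vocabulary(train)

# Testing path: a new SMS goes through the same steps with the same vocabulary.
x_new = extract_features("free lunch today", vocab)
print(x_new)
```

The same `preprocess` and `vocab` are reused for new messages so that training and test vectors share one column layout.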

Classification Algorithm (Naïve Bayes)
The naïve Bayes (NB) algorithm is applied to the final extracted features. Its speed and simplicity, along
with its high accuracy, make it a desirable classifier for spam detection
problems. Applying naïve Bayes with a multinomial event model to the dataset and using
10-fold cross-validation gives the results in Table 1.
Overall Error % 1.12
Accuracy % 98.88
SC % 94.5
BH % 0.51
Table 1
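A small from-scratch sketch of naïve Bayes with a multinomial event model, to show what the classifier computes. The Laplace smoothing value (alpha = 1) and the toy messages are assumptions for illustration; the slides do not give the settings used.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, docs, labels):
        """docs is a list of token lists, labels a list of class names."""
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        # Log prior: fraction of training messages in each class.
        self.log_prior = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        return self

    def _log_likelihood(self, tok, c):
        """Smoothed log P(token | class) from per-class token counts."""
        total = sum(self.token_counts[c].values())
        num = self.token_counts[c][tok] + self.alpha
        den = total + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, doc):
        """Return the class with the highest log posterior score."""
        scores = {}
        for c in self.classes:
            scores[c] = self.log_prior[c] + sum(
                self._log_likelihood(tok, c) for tok in doc if tok in self.vocab
            )
        return max(scores, key=scores.get)

# Toy, made-up training messages (already tokenized).
docs = [
    ["free", "prize", "now"],
    ["win", "free", "cash"],
    ["see", "you", "at", "lunch"],
    ["call", "me", "later"],
]
labels = ["spam", "spam", "ham", "ham"]
model = MultinomialNB().fit(docs, labels)
print(model.predict(["free", "cash", "prize"]))
print(model.predict(["see", "you", "later"]))
```

Working in log space avoids numerical underflow when many token probabilities are multiplied.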

Classification Algorithm (Support Vector Machines)
Table 2 shows the 10-fold cross-validation results of SVM with different
kernels applied to the dataset with the extracted features…
Accuracy % 98.88
SC % 94.5
BH % 0.51
Table 2
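The 10-fold cross-validation behind these tables can be sketched as follows. Whether the folds were shuffled or stratified is not stated in the slides, so the seeded shuffle is an assumption, and `evaluate` is a hypothetical caller-supplied function that trains on one index set and scores on the other.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Split the 5574 messages of the dataset into 10 folds.
splits = list(k_fold_indices(5574, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each classifier is trained ten times, each time on nine folds, and the reported figure is the average over the ten held-out folds.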

Classification Algorithm (k-nearest neighbor)
Table 3 shows the 10-fold cross-validation results of the k-nearest neighbor classifier applied
to the dataset.
Table 3
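For reference, a k-nearest neighbor classifier labels a message by majority vote among the closest training vectors. The sketch below assumes Euclidean distance and k = 3 on made-up count vectors; the slides do not state the distance measure or the value of k that was used.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up token-count vectors: the first two columns are spam-like tokens.
X = [[2, 1, 0, 0], [1, 2, 0, 0], [2, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
y = ["spam", "spam", "spam", "ham", "ham"]
print(knn_predict(X, y, [1, 1, 0, 0]))
print(knn_predict(X, y, [0, 0, 1, 1]))
```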

Classification Algorithm (Random Forests)
Random forests is an averaging ensemble method for classification. The ensemble is a
combination of decision trees, each built from a bootstrap sample of the training set.
Accuracy % 97.84
SC % 87.7
BH % 0.73
Table 4
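The bootstrap-and-combine idea can be sketched as below. This is only an illustration: `train_stump` is a hypothetical one-feature stand-in for a real decision tree, the data is made up, and the per-tree outputs are combined by majority vote (some libraries average predicted probabilities instead).

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def train_stump(sample):
    """Stand-in for a decision tree: pick the feature whose presence best predicts spam."""
    n_features = len(sample[0][0])

    def n_correct(j):
        return sum((x[j] > 0) == (label == "spam") for x, label in sample)

    best = max(range(n_features), key=n_correct)
    return lambda x: "spam" if x[best] > 0 else "ham"

def forest_predict(trees, x):
    """Combine the per-tree predictions for x by majority vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Made-up (features, label) pairs; column 0 stands for a spam-like token.
data = [([1, 0, 0], "spam"), ([1, 1, 0], "spam"), ([0, 1, 1], "ham"), ([0, 0, 1], "ham")]
rng = random.Random(0)
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(5)]
print(forest_predict(trees, [1, 0, 1]), forest_predict(trees, [0, 1, 0]))
```

Because each tree sees a different resample of the training set, the combined prediction tends to vary less than that of a single tree.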

Classification Algorithm (AdaBoost with Decision Trees)
AdaBoost is a boosting ensemble method. The author tried the implementation of AdaBoost with
decision trees from the scikit-learn library. Using 10 estimators, the simulation shows…
Accuracy % 98.59
SC % 92.17
BH % 0.51
Table 5
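To show what one boosting round does, here is a sketch of the textbook discrete AdaBoost update for labels in {-1, +1}. It is not the scikit-learn implementation mentioned on the slide, whose details may differ; with 10 estimators this step would be repeated ten times, each time with a newly trained weak learner.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost step for labels in {-1, +1}.

    Returns the weak learner's vote weight (alpha) and the renormalized
    sample weights, which grow for the examples the learner got wrong.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)  # requires 0 < err < 1
    updated = [
        w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

def ensemble_predict(alphas, learner_outputs):
    """Sign of the alpha-weighted sum of weak-learner outputs for one sample."""
    score = sum(a * h for a, h in zip(alphas, learner_outputs))
    return 1 if score >= 0 else -1

# Made-up round: +1 = spam, -1 = ham; the weak learner misses the second message.
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
weights = [0.25] * 4
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After the update the single misclassified message carries half of the total weight, so the next weak learner is pushed to get it right.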

Performance Measure
Spams caught (SC) = True positive cases / Number of spams
Blocked hams (BH) = False positive cases / Number of hams
Accuracy = (True positives + True negatives) / Total number of test data
(Spam is taken as the positive class.)
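A short sketch of computing these three measures from true and predicted labels. It assumes spam is the positive class, so that spams caught counts the spam messages correctly flagged as spam; the labels in the example are made up.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return SC, BH and accuracy as percentages, with `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham blocked
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed
    return {
        "SC": 100 * tp / (tp + fn),
        "BH": 100 * fp / (fp + tn),
        "accuracy": 100 * (tp + tn) / len(pairs),
    }

# Made-up test set: 4 spam and 6 ham messages.
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam", "spam", "spam", "ham"] + ["ham"] * 5 + ["spam"]
metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

A good filter has a high SC and a low BH at the same time; accuracy alone can look high on an imbalanced dataset even when many spams are missed.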

Final results of different classifiers
Model                           SC %     BH %    Accuracy %
Multinomial NB                  94.87    0.51    98.88
SVM                             92.99    0.31    98.86
k-nearest neighbour             82.60    0.40    97.47
Random Forests                  90.62    0.29    98.57
Adaboost with decision trees    92.17    0.51    98.59
Table 6

Conclusion
From the simulation results, multinomial naive Bayes and SVM (Support Vector Machine) are
among the best classifiers for SMS spam detection, yielding overall accuracies of
98.88% and 98.86%, respectively.
[Figure: Average classification accuracy for five classifiers (multinomial NB, SVM, k-nearest neighbor, random forests, AdaBoost with decision trees)]

Thanks!
Any questions?
You can find us at
￮ tanvirul15-6117@diu.edu.bd
