SMS Spam Detection using Machine Learning Approach
Author:
Houshmand Shirani-Mehr
hshirani@stanford.edu
Presented By:
Name                 ID
Tanvirul Islam       152-15-6117
Md. Shariful Islam   152-15-5955
Mimshadur Rahman     152-15-6119
Juaid Rakin          152-15-5753
Shamsujoha Sumon     152-15-6099
Introduction
[Figure: incoming SMS messages are filtered into spam SMS and legitimate SMS]
Problem Definition
• Short Message Service (SMS) has grown into a multi-billion dollar commercial industry.
• SMS spam is still not as common as email spam.
• SMS spam is particularly more irritating than email spam.
• SMS spam is growing: in 2012, up to 30% of text messages in parts of Asia were spam.
Description of Dataset
The dataset consists of 5574 text messages from the UCI Machine Learning Repository, gathered in 2012.
Total messages   5574
Spam              747
Non-spam (ham)   4827
SMS Spam Collection Data Set, UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Methodology
[Pipeline diagram]
Training: SMS text -> pre-processing -> feature extraction -> classification algorithm -> classifier model
Testing: new SMS text -> pre-processing -> feature extraction -> classification of the new text
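As a rough illustration of this pipeline (not the authors' actual code), the sketch below loads the dataset and performs a simple pre-processing and feature-extraction step with scikit-learn. The file name, column names, and the bag-of-words feature choice are assumptions; the `X` and `y` it builds are reused in the classifier sketches on the following slides.

```python
# Minimal sketch of the pipeline above, assuming the UCI SMS Spam Collection
# is available locally as a tab-separated file of (label, text) pairs.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the raw messages (file name and column names are assumptions).
data = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

# Pre-processing + feature extraction: lowercase the text and build
# bag-of-words counts (one assumed choice of features).
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(data["text"])

# Binary target: 1 for spam, 0 for ham.
y = (data["label"] == "spam").astype(int)
```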
Classification Algorithm (Naïve Bayes)
The NB algorithm is applied to the final extracted features. Its speed and simplicity, along with
its high accuracy, make it a desirable classifier for spam detection problems. Applying naïve Bayes
with a multinomial event model to the dataset and using 10-fold cross-validation gives the results
in Table 1.
Table 1
Overall Error %   1.12
Accuracy %       98.88
SC %              94.5
BH %              0.51
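A hedged sketch of the multinomial naive Bayes evaluation, reusing the `X` and `y` built in the Methodology sketch; the exact cross-validation configuration is an assumption rather than the authors' setup.

```python
# Sketch: 10-fold cross-validated multinomial naive Bayes on the
# bag-of-words features X and labels y built earlier (assumed setup).
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

nb_scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring="accuracy")
print("NB mean accuracy:", nb_scores.mean())
```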
Classification Algorithm (Support Vector Machines)
Table 2 shows the 10-fold cross-validation results of SVM with different kernels applied to the
dataset with the extracted features.
Table 2
Overall Error %   1.12
Accuracy %       98.88
SC %              94.5
BH %              0.51
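One possible way to compare kernels, again reusing `X` and `y` from the earlier sketch; the kernel list and default parameters are assumptions, not the authors' configuration.

```python
# Sketch: 10-fold cross-validation of SVMs with different kernels (assumed setup).
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for kernel in ("linear", "rbf", "poly"):
    svm_scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10, scoring="accuracy")
    print(kernel, "mean accuracy:", svm_scores.mean())
```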
Classification Algorithm (k-Nearest Neighbor)
Table 3 shows the 10-fold cross-validation results of the k-nearest neighbor classifier applied
to the dataset.
Table 3
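A corresponding sketch for the k-nearest neighbor classifier; the value of k is an assumption, since the slide does not report the one used.

```python
# Sketch: 10-fold cross-validated k-nearest neighbor classifier (k is an assumption).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
knn_scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
print("kNN mean accuracy:", knn_scores.mean())
```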
Classification Algorithm (Random Forests)
Random forests is an averaging ensemble method for classification. The ensemble is a combination
of decision trees, each built from a bootstrap sample of the training set.
Table 4
Overall Error %   2.16
Accuracy %       97.84
SC %              87.7
BH %              0.73
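A minimal random forest sketch in the same style; the number of trees is an assumption, and `bootstrap=True` gives each tree its own bootstrap resample of the training set, as described above.

```python
# Sketch: 10-fold cross-validated random forest; each tree is grown on a
# bootstrap sample of the training set (tree count is an assumption).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf_scores = cross_val_score(rf, X, y, cv=10, scoring="accuracy")
print("RF mean accuracy:", rf_scores.mean())
```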
Classification Algorithm (Adaboost with Decision Trees)
Adaboost is a boosting ensemble method. The authors tried an implementation of Adaboost with
decision trees using the scikit-learn library. Using 10 estimators, the simulation gives the
results in Table 5.
Table 5
Overall Error %   1.41
Accuracy %       98.59
SC %             92.17
BH %              0.51
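A minimal sketch of AdaBoost with 10 estimators in scikit-learn; by default `AdaBoostClassifier` boosts shallow decision trees, which matches the description above, though the exact tree depth the authors used is unknown.

```python
# Sketch: 10-fold cross-validated AdaBoost with 10 estimators; the default
# base learner is a shallow decision tree (depth used by the authors unknown).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

ada = AdaBoostClassifier(n_estimators=10, random_state=0)
ada_scores = cross_val_score(ada, X, y, cv=10, scoring="accuracy")
print("AdaBoost mean accuracy:", ada_scores.mean())
```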
Performance Measure
Spams caught (SC) = (true positive cases, i.e. spam messages classified as spam) / (number of spams)
Blocked hams (BH) = (false positive cases, i.e. ham messages classified as spam) / (number of hams)
Accuracy = (true positives + true negatives) / (total number of test data)
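A sketch of how SC, BH, and accuracy can be computed from a confusion matrix; the single hold-out split and the choice of classifier are assumptions here, since the slides report 10-fold cross-validated figures.

```python
# Sketch: computing SC, BH and accuracy from a confusion matrix on a
# held-out split (the split is an assumption; the slides use 10-fold CV).
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = MultinomialNB().fit(X_train, y_train).predict(X_test)

# With labels ham=0 and spam=1, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

sc = tp / (tp + fn)                        # spams caught: fraction of spam classified as spam
bh = fp / (fp + tn)                        # blocked hams: fraction of ham classified as spam
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"SC={sc:.2%}  BH={bh:.2%}  Accuracy={accuracy:.2%}")
```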
Final results of different classifiers
Model                          SC %    BH %   Accuracy %
Multinomial NB                 94.87   0.51   98.88
SVM                            92.99   0.31   98.86
k-nearest neighbour            82.60   0.40   97.47
Random Forests                 90.62   0.29   98.57
Adaboost with decision trees   92.17   0.51   98.59
Table 6
Conclusion
From the simulation results, multinomial naive Bayes and SVM (Support Vector Machine) are among
the best classifiers for SMS spam detection, yielding overall accuracies of 98.88% and 98.86%,
respectively.
[Bar chart: average classification accuracy for the five classifiers: Multinomial NB 98.88, SVM 92.99, k-nearest neighbor 82.6, Random Forests 90.62, Adaboost with decision trees 98.59]
Thanks!
Any questions?
You can find us at
○ tanvirul15-6117@diu.edu.bd