This document presents research on using machine learning algorithms to detect SMS spam messages. It introduces the problem of SMS spam, describes the dataset used containing over 5,000 SMS messages, and explains the preprocessing and feature extraction steps. It then evaluates the performance of various classification algorithms - Naive Bayes, SVM, k-NN, Random Forests, and AdaBoost with Decision Trees - on the SMS spam detection task, reporting the accuracy, spam caught rate, and ham blocked rate for each. It finds that Naive Bayes and SVM performed best with over 98% accuracy.
2. Presented By:
Name ID
Tanvirul Islam 152-15-6117
Md.Shariful Islam 152-15-5955
Mimshadur Rahman 152-15-6119
Juaid Rakin 152-15-5753
Shamsujoha Sumon 152-15-6099
4. Problem Definition
Short Message Service (SMS) has grown into a multi-billion-dollar commercial industry.
SMS spam is still not as common as email spam.
SMS spam is often more irritating than email spam.
SMS spam is growing: in 2012, up to 30% of text messages in parts of Asia were spam.
5. Description of Dataset
The dataset consists of 5,574 text messages from the UCI Machine Learning Repository,
gathered in 2012.
Total Data 5574
Spam 747
Non-spam (Ham) 4827
Source: SMS Spam Collection Data Set, UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
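As a rough illustration of how the class balance can be checked, the sketch below parses lines in a `label<TAB>message` layout and computes the share of each class. The tab-separated layout is an assumption about how the collection is distributed, the function names are hypothetical, and the sample messages are invented for the example.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, text) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, text = line.split("\t", 1)
        pairs.append((label, text))
    return pairs

def class_balance(pairs):
    """Return the fraction of messages carrying each label."""
    counts = Counter(label for label, _ in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Invented sample lines, only to exercise the parser.
sample = [
    "ham\tOk, see you at lunch",
    "spam\tWINNER! Claim your free prize now",
    "ham\tCan you call me later?",
    "ham\tRunning late, sorry",
]
pairs = parse_sms_lines(sample)
balance = class_balance(pairs)

# Spam share implied by the counts on this slide: 747 of 5574 messages.
spam_share = 747 / 5574
```

With the counts on this slide, spam makes up about 13.4% of the messages, so the classes are clearly imbalanced and accuracy alone can hide a poor spam-caught rate.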
7. Classification Algorithm (Naïve Bayes)
The naïve Bayes (NB) algorithm is applied to the final extracted features. Its speed and
simplicity, along with its high accuracy, make it a desirable classifier for spam detection
problems. Applying naïve Bayes with a multinomial event model to the dataset, using
10-fold cross-validation, gives the results in Table 1.
Overall Error % 1.12
Accuracy % 98.88
SC % 94.5
BH % 0.51
Table 1
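The multinomial event model can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind Table 1: the Laplace smoothing constant `alpha`, the whitespace tokenization, and the four toy messages are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(labels)),
            "log_lik": {
                w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in doc if w in params["log_lik"]
        )
    return max(model, key=score)

# Toy training messages (invented), already split into tokens.
docs = [
    "free prize claim now".split(),
    "win free cash now".split(),
    "see you at lunch".split(),
    "call me at home later".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)
```

Each word occurrence contributes one log-likelihood term, so a word repeated in a message counts as many times as it appears; that is what distinguishes the multinomial model from a presence/absence (Bernoulli) model.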
8. Classification Algorithm (Support Vector Machines)
Overall Error % 1.12
Accuracy % 98.88
SC % 94.5
BH % 0.51
Table 2 shows the 10-fold cross validation results of SVM with different
kernels applied to the dataset with extracted features…
Table 2
10. Classification Algorithm (Random Forests)
Overall Error % 2.16
Accuracy % 97.84
SC % 87.7
BH % 0.73
Random forests is an averaging ensemble method for classification. The ensemble is a
combination of decision trees, each built from a bootstrap sample of the training set.
Table 4
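The two ingredients named above, bootstrap sampling and averaging, can be sketched without any library. The per-tree spam probabilities below are invented numbers standing in for the outputs of trained trees, and the 0.5 decision threshold is an assumption.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples training indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def average_spam_probability(tree_probs):
    """Average the spam probability predicted by each tree for one message."""
    return sum(tree_probs) / len(tree_probs)

def forest_label(tree_probs, threshold=0.5):
    """Label a message 'spam' when the averaged probability reaches the threshold."""
    return "spam" if average_spam_probability(tree_probs) >= threshold else "ham"

rng = random.Random(0)
sample_idx = bootstrap_sample(10, rng)  # indices one tree would be trained on

# Hypothetical per-tree spam probabilities for a single message.
tree_probs = [0.9, 0.2, 0.7, 0.8]
label = forest_label(tree_probs)
```

Because each tree sees a different resample of the training set, the trees make partly independent errors, and averaging their outputs reduces the variance of the final prediction.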
11. Classification Algorithm (AdaBoost with Decision Trees)
Overall Error % 1.41
Accuracy % 98.59
SC % 92.17
BH % 0.51
AdaBoost is a boosting ensemble method. AdaBoost with decision trees was implemented
using the scikit-learn library. Using 10 estimators, the simulation shows..
Table 5
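To make the boosting idea concrete, here is a from-scratch sketch of discrete AdaBoost with one-level decision trees (stumps). It is not the scikit-learn implementation mentioned on this slide; the two numeric features and the six toy rows are invented, and labels are encoded as +1 (spam) and -1 (ham).

```python
import math

def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[f] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] >= thr else -pol

def adaboost(X, y, n_estimators=10):
    """Fit n_estimators stumps, reweighting the training rows after each round."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_estimators):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified rows gain weight, correctly classified rows lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Invented features: [count of the word "free", 1 if the message contains digits].
X = [[2, 1], [1, 1], [0, 0], [0, 1], [3, 0], [0, 0]]
y = [1, 1, -1, -1, 1, -1]
ensemble = adaboost(X, y, n_estimators=10)
```

The toy data is separable on the first feature, so every round picks the same stump; on real message features each round would concentrate on the messages the previous stumps got wrong.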
12. Performance Measure
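Spam is treated as the positive class: a true positive is a spam message classified as spam, and a false positive is a ham message classified as spam.

$$\text{Spams caught (SC)} = \frac{\text{True positive cases}}{\text{Number of spams}}$$

$$\text{Blocked hams (BH)} = \frac{\text{False positive cases}}{\text{Number of hams}}$$

$$\text{Accuracy} = \frac{\text{True positive cases} + \text{True negative cases}}{\text{Total number of test data}}$$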
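The three measures can be computed directly from true and predicted labels, as in the sketch below. It takes SC as the share of spam messages that are correctly flagged, which is the reading consistent with the SC values above 80% reported in these slides; the function names and the ten toy labels are invented for the example.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute SC, BH and accuracy, treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    tn = sum(t != positive and p != positive for t, p in pairs)  # ham passed
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham blocked
    n_spam = sum(t == positive for t in y_true)
    n_ham = len(y_true) - n_spam
    return {
        "SC": tp / n_spam,
        "BH": fp / n_ham,
        "accuracy": (tp + tn) / len(y_true),
    }

def accuracy_from_rates(sc, bh, n_spam, n_ham):
    """Accuracy implied by an SC rate, a BH rate and the class sizes."""
    return (sc * n_spam + (1 - bh) * n_ham) / (n_spam + n_ham)

# Toy labels: 4 spam and 6 ham messages, with one miss and one false alarm.
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam", "spam", "spam", "ham"] + ["ham"] * 5 + ["spam"]
metrics = spam_metrics(y_true, y_pred)

# Consistency check using the multinomial NB row of the final results
# (SC 94.87%, BH 0.51%) and the dataset counts (747 spam, 4827 ham).
implied = accuracy_from_rates(0.9487, 0.0051, 747, 4827)
```

For the multinomial NB figures the implied accuracy comes out at roughly 98.87%, close to the reported 98.88%, which supports reading SC as a catch rate over spam messages.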
13. Final results of different classifiers

| Model | SC % | BH % | Accuracy % |
|---|---|---|---|
| Multinomial NB | 94.87 | 0.51 | 98.88 |
| SVM | 92.99 | 0.31 | 98.86 |
| k-nearest neighbour | 82.60 | 0.40 | 97.47 |
| Random Forests | 90.62 | 0.29 | 98.57 |
| AdaBoost with decision trees | 92.17 | 0.51 | 98.59 |

Table 6
14. Conclusion
From the simulation results, multinomial naive Bayes and SVM (Support Vector Machine)
are among the best classifiers for SMS spam detection, yielding overall accuracies of
98.88% and 98.86%, respectively.
[Figure: bar chart, "Average classification accuracy for five classifiers" (Multinomial NB, SVM, k-nearest neighbor, Random Forests, AdaBoost with decision trees)]