PRESENTATION ON
SPAM E-MAIL DETECTION
PRESENTED BY
Nabin Jamkatel (3391)
Rajiv Gupta (3396)
Rakesh Chhetri (3397)
Sabina Lamichhane (3398)
INTRODUCTION
Spam e-mails are not only annoying but can also be dangerous to
consumers.
Spam e-mails are characterized by:
1. Anonymity
2. Mass mailing
3. Unsolicited content
Spam e-mails are messages sent indiscriminately to many addressees by all
sorts of groups, but mostly by lazy advertisers and by criminals who wish to
lead you to phishing sites.
NAÏVE BAYES CLASSIFIER
A simple probabilistic classifier that calculates a set of
probabilities by counting the frequency and combinations of
values in a given dataset.
Each e-mail is represented as a vector of feature values.
It is well suited to classifying e-mails as spam or non-spam.
The method is known to achieve high precision and recall on
this task.
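The counting idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the function names, the tiny training set, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count documents per class and word frequencies per class.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log-probability for the tokens."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption) avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, for illustration only.
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "offer", "click", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["project", "report", "for", "review"],
]
train_labels = ["spam", "spam", "non-spam", "non-spam"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(["free", "prize"], *model))
print(predict(["monday", "meeting"], *model))
```

Working with sums of logarithms rather than products of probabilities keeps the scores numerically stable when messages contain many words.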
PROBLEM STATEMENT
Unwanted e-mails waste Internet bandwidth and annoy users
Critical e-mail messages are missed and/or delayed
Millions of compromised computers
Billions of dollars lost worldwide
Identity theft
Spam can crash mail servers and fill up hard drives
OBJECTIVE
The objectives of identifying spam e-mails are:
• To inform the user about which e-mails are fake and which are
relevant
• To classify each e-mail as spam or not spam.
LITERATURE REVIEW
• We consulted G. He, Spam Detection, 1st ed. 2007 to learn
about this problem.
• Spam prevention is often neglected, although some simple
measures can dramatically reduce the amount of spam that
reaches your mailbox.
• Before they are able to send you spam, spammers obviously
first need to obtain your email address, which they can do
through different routes.
SCOPE OF THE PROJECT:
• It is sensitive to the client's needs and adapts well to
future spam techniques.
• It considers the complete message and its organization instead
of single words.
• It increases security and control.
• It reduces IT administration costs.
• It also reduces network resource costs.
DOCUMENT PREPROCESSING
Tokenization
• Tokenization is the process of breaking a stream of text up into
words, phrases, symbols, or other meaningful elements called
tokens.
• The list of tokens becomes input for further processing such as
parsing or text mining.
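As a rough illustration of tokenization, the sketch below lowercases a message and splits it into word tokens with a regular expression. The function name and the pattern are assumptions chosen for the example; the presentation does not state which tokenizer was used.

```python
import re

def tokenize(text):
    """Lowercase the text and return alphanumeric word tokens.

    The pattern keeps simple contractions such as "you've" together.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Congratulations! You've WON a FREE prize. Click now...")
print(tokens)
```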
LEMMATIZATION
• In linguistics, lemmatization is the process of grouping
together the different inflected forms of a word so they can be
analysed as a single item.
• In computational linguistics, lemmatization is the algorithmic
process of determining the lemma for a given word.
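The idea can be illustrated with a toy lookup table that maps inflected forms to a lemma. The table and function name below are hypothetical and cover only a handful of words; a real system would more likely rely on a dictionary-backed lemmatizer (for example, a WordNet-based one) rather than a hand-written mapping.

```python
# Hypothetical, hand-written mapping from inflected forms to lemmas.
LEMMA_TABLE = {
    "winning": "win",
    "won": "win",
    "wins": "win",
    "offers": "offer",
    "offered": "offer",
    "prizes": "prize",
    "clicked": "click",
}

def lemmatize(tokens):
    """Replace each token by its lemma; unknown tokens are left unchanged."""
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]

print(lemmatize(["won", "prizes", "today"]))
```

After this step, "won", "wins", and "winning" all count as the same feature, which reduces the size of the vocabulary.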
REMOVAL OF STOP WORDS
• Extremely common words that appear to be of very little value
in helping select documents matching a user's need are sometimes
excluded from the vocabulary entirely.
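A minimal sketch of stop-word removal follows. The stop-word set here is a short, hypothetical list made up for the example; practical lists are considerably longer.

```python
# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(remove_stop_words(["you", "are", "the", "winner", "of", "a", "prize"]))
```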
REQUIREMENT ANALYSIS
Functional Requirement
The system classifies e-mails by first extracting a feature
vector, which involves determining whether each word indicates
spam or not.
Non-Functional Requirement
The system ensures high availability of the e-mail data (here, the datasets).
The user should get the result as fast as possible.
It should be easy to use: the user only needs to type the words
and click to display the result, or to enter a pair of
reasonable sentences.
FEASIBILITY STUDY
• Technical Feasibility
• Economic Feasibility
• Operational Feasibility
TESTING
• We tested the datasets and determined which e-mails are spam
and which are non-spam, indicated as 0 and 1 respectively.
• We calculated the feature vector of each e-mail to determine
whether it is spam or non-spam.
• Using that feature vector, the Naïve Bayes algorithm classifies
the test data based on what it learned from the training data.
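One common way to build such a feature vector is a bag-of-words count over a fixed vocabulary. The sketch below assumes that representation; the function names and the max_words cutoff are hypothetical, and the presentation does not specify the exact features used.

```python
from collections import Counter

def build_vocabulary(docs, max_words=3000):
    """Return the most frequent words across all documents, sorted alphabetically.

    max_words is an assumed cutoff, not a value taken from the project.
    """
    counts = Counter(tok for tokens in docs for tok in tokens)
    return sorted(word for word, _ in counts.most_common(max_words))

def to_feature_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical tokenized training messages.
training_docs = [["free", "prize", "free"], ["meeting", "monday"]]
vocabulary = build_vocabulary(training_docs)

print(vocabulary)
print(to_feature_vector(["free", "meeting", "free", "hello"], vocabulary))
```

Words that are not in the vocabulary (such as "hello" here) are simply ignored, so every message maps to a vector of the same length.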
DATASET
• A dataset is a collection of data or related information that is
composed of separate elements.
• An e-mail spam dataset contains both spam and non-spam
messages.
OUTPUT
Any external e-mail can be checked and classified as spam or
non-spam, so users will be aware of spam e-mail.
Mails are classified into spam and non-spam.
From the classified data we calculated:
Accuracy = 99.18%
Recall = 99.07%
F-measure = 99.53%
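For reference, these metrics are conventionally computed from the counts of true and false positives and negatives. The sketch below uses the standard definitions with made-up counts; the numbers are hypothetical and are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure

# Hypothetical counts for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f_measure={f1:.4f}")
```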
CONCLUSION
• We are able to classify e-mails as spam or non-spam. With a
high number of e-mails and lots of people using the system, it
will be difficult to handle all possible mails, as our project
deals with only a limited corpus.
REFERENCES
• [1] Clemmer, A. (2012). How Bayesian algorithm works. [online] Available
at: https://www.quora.com/How-do-Bayesian-algorithms-work-for-the-identification-of-spam
[Accessed 16 Aug. 2017].
• [2] What is Email Spam?. (2017). [Blog] comm100. Available at:
https://emailmarketing.comm100.com/email-marketing-ebook/email-spam.aspx
[Accessed 27 Aug. 2017].
• [3] G. He, Spam Detection, 1st ed. 2007.
• [4] bot2, V. (2017). Email Spam Filtering : A python implementation with
scikit-learn. [online] Machine Learning in Action. Available at:
https://appliedmachinelearning.wordpress.com/2017/01/23/email-spam-filter-python-scikit-learn/
[Accessed 30 Aug. 2017].
Thank You
