High time to add
machine learning to
your information
security stack?
Minhaz | minhaz@owasp.org | https://blog.minhazav.xyz | twitter.com/minhazav | github.com/mebjas | Hyderabad, India
whoami
• Currently Software Engineer II at
Microsoft Azure, Production &
Infrastructure Engineering.
• Like to play around with data, statistics,
and machine learning.
• OWASP project maintainer for the CSRF
Protector Project; currently mentoring a
student as a GSoC mentor for the OWASP
Security Knowledge Framework Project.
CODING SOMETHING OR OTHER SINCE 2009
Minhaz
Some Previous Talks with
OWASP / CSA!
Disclaimer
1. This talk is about defending, not attacking.
2. No IP was damaged to make this presentation.
3. I’m not here to judge what is or is not the perfect way to solve a
problem, or whether ML is going to be the solution for everyone.
4. I’ll be citing a couple of organizations / individuals whose work I’ll
be using here. I have no formal connection with or sponsorship from
them; it’s purely based on my personal research.
Outline
High time to add machine learning to your
information security stack?
Machine
Learning
Machine Learning
Deep Learning
Artificial Intelligence
Problems being solved in the world
using these techniques
Different areas of machine
learning
Components of Machine
Learning Pipeline
Let’s walk through most of them using a case study: malware
classification.
Supervised Learning: Classification
Malware Classification
Malware: Malicious Software
Problem: How do traditional antivirus systems work, and could
machine learning be helpful?
Traditional antivirus products rely on:
1. Signature-based detection
2. Heuristic-based detection
3. Behavior-based detection
4. Sandbox detection
5. Data mining techniques
Malware Classification
Classify an application as malware or not based on its behavior, i.e.
train a computer to learn the boundary between the behavior of a normal
application and that of malware.
Step 1: Define your problem and see if you can gather
data + Domain Knowledge
Problems:
1. Missing Items
2. Incorrect Items, specifically labels
3. Skewness
4. Low Volume
5. Outdated data
Data Source for demo: https://github.com/Te-k/malware-classification
ALWAYS REMEMBER:
Garbage IN Garbage OUT
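Some of the data problems listed above (missing items, skewness, low volume) can be checked mechanically before any training starts. The sketch below is only an illustration: the `audit_dataset` helper, the field names and the toy records are hypothetical and are not taken from the demo dataset.

```python
from collections import Counter


def audit_dataset(rows, label_key="label"):
    """Summarise missing values and label balance for a list of dict records."""
    missing = Counter()  # column name -> number of empty cells
    labels = Counter()   # label value -> number of rows
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1
        labels[row.get(label_key)] += 1
    total = len(rows)
    # Share of the most common label; values close to 1.0 mean a skewed dataset.
    majority_share = max(labels.values()) / total if total else 0.0
    return {
        "rows": total,
        "missing": dict(missing),
        "labels": dict(labels),
        "majority_share": majority_share,
    }


# Hypothetical toy records, for illustration only.
sample = [
    {"size": 1024, "entropy": 6.1, "label": "malware"},
    {"size": None, "entropy": 3.2, "label": "benign"},
    {"size": 2048, "entropy": 7.4, "label": "benign"},
    {"size": 512, "entropy": "", "label": "benign"},
]
report = audit_dataset(sample)
print(report)
```

Incorrect labels and outdated data cannot be detected this way; those still need domain knowledge and manual review.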
Step 2: Feature Engineering
• Feature Extraction
• Feature Addition
• Feature Selection
• Manual
• Automatic
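As a rough illustration of feature extraction, the sketch below turns the raw bytes of a file into a few numeric features. The feature set (size, byte entropy, printable ratio) is an assumption chosen for brevity; it is not necessarily the feature set used in the talk's demo.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(data: bytes) -> dict:
    """Map raw file bytes to a small, hypothetical feature dictionary."""
    printable = sum(32 <= b < 127 for b in data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }


features = extract_features(b"MZ\x90\x00hello world")
print(features)
```

High entropy is often associated with packed or encrypted content, which is why it is a common candidate feature; whether it helps on a given dataset is something feature selection has to confirm.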
Step 3: Choice of Algorithm
There is a wide range of algorithms to choose from, depending on
whether we are trying to do prediction, classification or clustering.
We can also choose between linear and non-linear algorithms.
Naive Bayes, Support Vector Machines, Decision Trees and k-Means
Clustering are some commonly used algorithms.
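To make the idea of a learned decision boundary concrete, here is a minimal sketch of a decision stump, i.e. a decision tree with a single split. It is a teaching example under stated assumptions, not the algorithm used in the demo; the feature values and labels are made up.

```python
def fit_stump(samples, labels):
    """Find the single (feature, threshold) split with the fewest training errors.

    samples: list of numeric feature lists; labels: list of 0/1.
    Returns (feature_index, threshold, label_if_above).
    """
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({s[f] for s in samples}):
            for above in (0, 1):
                errors = sum(
                    (above if s[f] > threshold else 1 - above) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    _, f, threshold, above = best
    return f, threshold, above


def predict_stump(stump, sample):
    """Classify one sample with a fitted stump."""
    f, threshold, above = stump
    return above if sample[f] > threshold else 1 - above


# Hypothetical data: [entropy, size in KB]; label 1 = malware, 0 = benign.
X = [[6.9, 120], [7.4, 300], [7.8, 80], [4.1, 200], [5.0, 150], [3.2, 90]]
y = [1, 1, 1, 0, 0, 0]
stump = fit_stump(X, y)
print(stump, predict_stump(stump, [7.0, 100]))
```

A real decision tree repeats this split search recursively; in practice a library implementation would be used instead of hand-written code.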
Step 4: Training
• In this step we tune our algorithm based on the data we already have. This data is called the training set, as it is
used to train our algorithm. This is the part where our machine or software learns and improves with experience.
• Train/Test Split
• We divide our data (randomly) into training and testing datasets so that we can evaluate how well our models
perform on data they have not seen.
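The random split described above can be sketched in a few lines. The function name, the default test fraction and the fixed seed are assumptions made for the example.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100), test_fraction=0.2, seed=42)
print(len(train), len(test))
```

For skewed datasets a stratified split, which preserves the label ratio in both parts, is usually preferable to a purely random one.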
Step 5: Choice of Metrics / Evaluation
Criteria
• Accuracy
• False Positive Rate (FPR)
• False Negative Rate (FNR)
• Precision
• Recall
• f1-measure
• & More…
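All of the metrics listed above can be derived from the four cells of a binary confusion matrix. The following is a minimal sketch; the label vectors are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute common binary classification metrics from true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "fpr": ratio(fp, fp + tn),   # benign samples flagged as malware
        "fnr": ratio(fn, fn + tp),   # malware samples that were missed
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On skewed data accuracy alone can be misleading: a model that labels everything as benign scores high accuracy while its false negative rate is 100%, which is why FPR, FNR, precision and recall are tracked separately.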
Step 6: Testing
Lastly, we test how our machine learning algorithm performs on an
unseen set of test cases. One way to do this is to partition the data
into training and testing sets. The training set is used in step 4, while the
test set is used in this step. Techniques such as cross-validation
and leave-one-out can be used when we do not have enough data.
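Cross-validation reuses the same small dataset for several train/test rounds. The sketch below only generates the index sets for k folds; the function name is an assumption, and leave-one-out is simply the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
leave_one_out = list(k_fold_indices(5, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

The indices are taken in order, so the data should be shuffled beforehand; the model is then trained and evaluated once per fold and the metrics are averaged.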
Another interesting example
Another interesting way to do malware classification is to convert
malware samples to images and apply machine learning / deep learning
techniques on top of them.
The proposed method generates RGB-colored pixels on image matrices
using the opcode sequences extracted from malware samples and
calculates the similarities for the image matrices.
Reference: Malware Analysis Using Visualized Image Matrices
https://www.hindawi.com/journals/tswj/2014/132713/
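The general idea of mapping opcode sequences to coloured pixels can be sketched as follows. This is not the paper's exact method: the use of SHA-256, the 8x8 matrix size and the similarity measure are all assumptions chosen to keep the example short.

```python
import hashlib

BLACK = (0, 0, 0)


def opcode_image(opcode_sequences, size=8):
    """Hash each opcode sequence to a pixel position and an RGB colour."""
    image = [[BLACK] * size for _ in range(size)]
    for sequence in opcode_sequences:
        digest = hashlib.sha256(" ".join(sequence).encode("utf-8")).digest()
        x, y = digest[0] % size, digest[1] % size        # pixel coordinates
        image[y][x] = (digest[2], digest[3], digest[4])  # pixel colour
    return image


def image_similarity(a, b):
    """Fraction of coloured pixel positions where both images hold the same colour."""
    lit = matching = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a != BLACK or pixel_b != BLACK:
                lit += 1
                matching += pixel_a == pixel_b
    return matching / lit if lit else 1.0


# Hypothetical opcode sequences from two samples.
sample_a = opcode_image([["push", "mov", "call"], ["xor", "ret"]])
sample_b = opcode_image([["push", "mov", "call"], ["jmp", "nop"]])
print(image_similarity(sample_a, sample_b))
```

Samples that share many opcode sequences end up with many identical pixels, so the resulting matrices can be compared directly or fed to an image classifier.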
So is malware detection being done using
machine learning as of now?
Kaspersky: Machine Learning for Malware
Detection
Key points they mention are:
• Have the right data.
• Know theoretical machine learning and how to apply it to cybersecurity.
• Know users’ practical needs and be an expert at implementing machine
learning in products.
• Earn a sufficient user base and use the power of feedback loops and
crowdsourcing.
• Keep detection methods in multi-layered synergy.
Link to whitepaper
Unsupervised Learning
Anomaly Detection
Anomaly Detection
• Statistical techniques: Mean, Standard Deviation
• Supervised Algorithms: KNN, Random Forest, SVM
• Unsupervised Algorithms: SOM, K-means, CART-based, Local Outlier
Factor
• Deep Learning Models: LSTM, Autoencoders
Twitter Anomaly Detection | scikit-learn | Facebook Prophet | LinkedIn
Luminol
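The simplest of the statistical techniques above flags values that lie far from the mean, measured in standard deviations (a z-score). The sketch below is a minimal illustration; the function name, the threshold and the login counts are assumptions.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical failed-login counts per minute; the last value is a spike.
logins_per_minute = [12, 11, 13, 12, 10, 14, 11, 12, 13, 95]
# With only ten points a single outlier inflates the standard deviation,
# so a lower threshold than the usual 3.0 is used here.
anomalies = zscore_anomalies(logins_per_minute, threshold=2.5)
print(anomalies)
```

This approach assumes roughly stationary data without seasonality; the tools listed above handle trends and periodic patterns that a plain z-score would misreport.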
Auto Encoders
Use cases in other areas
Supervised Learning
• Classification: Malware Detection / Classification, Spam Detection, Phishing Detection
• Regression: Risk Scoring, User Behavior Analysis and Fraud Detection
Unsupervised Learning
• Clustering: Forensic Analysis
• Anomaly Detection: Network Traffic Analysis, Fraud Detection
Recommendations
• Remediation Action Recommendations in Incident Response
Pattern Detection, Correlation and NLP
• Log Correlation
• Noise Reduction
Why now?
1. Volume of data
Data has posed perhaps the single greatest challenge in cybersecurity
over the past decade. For a human, or even a large team of humans,
the amount of data produced daily on a global scale is unimaginable.
For every minute in 2017 there were:
2. To focus on what’s important
3. Attacks are getting more sophisticated
Breaking CAPTCHAs using deep convolutional networks
4. Solve a set of problems the way we solved
spam
5. Vendor management: new vendors
coming up every other day
• You need to prepare yourself and know what the technology has to
offer before evaluating what vendors offer.
• AI/ML is no longer just a buzzword. It has strong capabilities. But in
the end it is a tool.
And how?
As an individual or an enterprise
• Online courses
• Online challenges and open source
tools to try things out and build
proofs of concept
• Using the power of the cloud to do
things at scale
• There is no lack of content out
there on this topic.
https://github.com/jivoi/awesome-ml-for-cybersecurity
Research related to Machine Learning
And cyber security.
Security ML requirements
MACHINE LEARNING EXPERTISE
TO THINK BEYOND STANDARD
TOOLKITS.
DATA ACROSS THE STACK
HOST (EVENT LOGS, SYS LOGS,
AV LOGS)
NETWORK LOGS
SERVICE & APPLICATION LOGS
SECURE AND SCALABLE
PLATFORM
EYES ON GLASS TESTING WITH REAL ATTACKS
Open Source Communities
01 Create, share and validate open data repositories.
02 Get involved in crowdsourced generation of labelled data.
03 Initiate research in this area and collaborate.
04 Brace ourselves for the next generation of attack and defence.
Takeaways
• ML/DL are here, embrace the change: applied correctly, ML
can enhance defensive practices.
• There are a lot of possibilities in InfoSec for these techniques.
• Machine Learning / Deep Learning / AI are tools. You have to know
how to apply a tool for it to reveal true insight. ML is not the only
tool we need, but it is bound to get more powerful with time. We need
to mix in experience: we have to work with experts to capture their
knowledge so that the algorithms reveal actual security insights or issues.
Thanks
Appendix
• Visual introduction to machine learning - http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
• Microsoft Malware Challenge on Kaggle - https://www.kaggle.com/c/malware-classification
• Malware Detection and Classification Using Machine Learning on Microsoft Malware Classification challenge - https://github.com/dchad/malware-detection
• Collection of deep learning research papers - https://medium.com/@jason_trost/collection-of-deep-learning-cyber-security-research-papers-e1f856f71042
• Security data science papers - http://www.covert.io/security-datascience-papers/
All Code and references available at
https://github.com/mebjas/owasp.tw.0718
