High time to add machine learning to your information security stack
- 1. High time to add
machine learning to
your information
security stack?
Minhaz | minhaz@owasp.org | https://blog.minhazav.xyz | twitter.com/minhazav | github.com/mebjas | Hyderabad, India
- 2. whoami
• Currently, Software Engineer II at
Microsoft Azure, Production &
Infrastructure Engineering.
• Like to play around with data, statistics,
machine learning.
• OWASP project maintainer for the CSRF
Protector Project; currently mentoring a
student as a GSoC mentor for the OWASP
Security Knowledge Framework Project
CODING SOMETHING OR OTHER SINCE 2009
Minhaz
- 4. Disclaimer
1. This talk is about defending not attacking
2. No IP was damaged to make this presentation.
3. I’m not here to judge what is or is not the perfect
way to solve a problem, or whether ML is going to be the solution for
everyone.
4. I’ll be citing a couple of organizations / individuals whose work I’ll
be using here. I have no formal connection with or sponsorship from
them; it’s purely based on my personal research.
- 6. High time to add machine learning to your
information security stack?
- 21. Malware: Malicious Software
Problem: How do traditional antivirus systems work, and could
machine learning be helpful?
Traditional antivirus software relies on:
1. Signature-based detection
2. Heuristic-based detection
3. Behavior based detection
4. Sandbox detection
5. Data mining techniques
Malware Classification
Classify an application as malware or not based on its behavior, i.e.
train a computer to learn the boundary between the behavior of a normal
application and that of malware.
- 22. Step 1: Define your problem and see if you can gather
data + domain knowledge
Common data problems:
1. Missing Items
2. Incorrect Items, specifically labels
3. Skewness
4. Low Volume
5. Outdated data
Data Source for demo: https://github.com/Te-k/malware-classification
ALWAYS REMEMBER:
Garbage IN Garbage OUT
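A minimal sketch of how some of the data problems above (missing items, skewness, low volume) could be checked before any training. The field names and toy records are hypothetical and are not taken from the demo data source.

```python
from collections import Counter


def audit_dataset(rows, label_key="label"):
    """Report missing values per field and the label distribution."""
    missing = Counter()
    labels = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1
        labels[row.get(label_key)] += 1
    total = len(rows)
    # Share of the most common label: close to 1.0 means heavy skew.
    majority_share = max(labels.values()) / total if total else 0.0
    return {
        "rows": total,
        "missing": dict(missing),
        "labels": dict(labels),
        "majority_share": majority_share,
    }


# Hypothetical toy records; field names are illustrative only.
samples = [
    {"size": 1024, "entropy": 6.1, "label": "malware"},
    {"size": None, "entropy": 3.2, "label": "benign"},
    {"size": 2048, "entropy": 2.9, "label": "benign"},
    {"size": 4096, "entropy": "", "label": "benign"},
]
report = audit_dataset(samples)
print(report)
```

Incorrect labels and outdated data cannot be detected this way; those usually need domain knowledge or manual review.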
- 23. Step 2: Feature Engineering
• Feature Extraction
• Feature Addition
• Feature Selection
• Manual
• Automatic
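To make feature extraction and automatic feature selection concrete, here is a small sketch. The chosen features (size, byte entropy, an "MZ" header flag) and the variance-based selection rule are assumptions for illustration, not the feature set used in the talk's demo.

```python
import math
from collections import Counter
from statistics import pvariance


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def extract_features(data: bytes) -> dict:
    """Feature extraction: turn raw bytes into numeric features."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "has_mz_header": 1 if data[:2] == b"MZ" else 0,
    }


def select_features(rows, min_variance=1e-9):
    """Automatic selection: drop features that never vary across samples."""
    names = rows[0].keys()
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


# Hypothetical byte blobs standing in for real executables.
blobs = [b"MZ" + bytes(range(256)), b"MZ" + b"\x00" * 64, b"MZ" + b"AB" * 40]
rows = [extract_features(b) for b in blobs]
kept = select_features(rows)
print(kept)
```

Here the header flag is identical for every sample, so it carries no information and is dropped; manual selection would apply domain knowledge instead of a variance threshold.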
- 24. Step 3: Choice of Algorithm
There is a wide range of algorithms to choose from,
depending on whether we are trying to do
prediction, classification or clustering. We can also
choose between linear and non-linear algorithms.
Naive Bayes, Support Vector Machines, Decision
Trees and k-Means Clustering are some commonly
used algorithms.
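As one illustration of the algorithms named above, this is a tiny Gaussian Naive Bayes classifier written from scratch. It is a teaching sketch under stated assumptions (numeric features, made-up training values), not the model used in the talk; libraries such as scikit-learn provide production-grade versions.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes classifier for numeric feature vectors."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for features, label in zip(X, y):
            by_class[label].append(features)
        self.stats = {}
        for label, rows in by_class.items():
            columns = list(zip(*rows))
            self.stats[label] = {
                "prior": len(rows) / len(X),
                "means": [mean(c) for c in columns],
                # Small constant keeps the variance strictly positive.
                "vars": [pvariance(c) + 1e-9 for c in columns],
            }
        return self

    def _log_likelihood(self, x, s):
        # log prior plus the sum of per-feature Gaussian log densities.
        total = math.log(s["prior"])
        for value, mu, var in zip(x, s["means"], s["vars"]):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mu) ** 2 / (2 * var)
        return total

    def predict(self, x):
        return max(
            self.stats,
            key=lambda label: self._log_likelihood(x, self.stats[label]),
        )


# Hypothetical features: [byte entropy, number of imported functions].
X = [[7.5, 3], [7.8, 2], [7.2, 4], [4.1, 12], [3.9, 15], [4.5, 11]]
y = ["malware", "malware", "malware", "benign", "benign", "benign"]
model = GaussianNaiveBayes().fit(X, y)
print(model.predict([7.6, 3]), model.predict([4.0, 13]))
```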
- 25. Step 4: Training
• In this step we tune our algorithm based on the data we already have. This data is called the training set, as it is
used to train our algorithm. This is the part where our machine or software learns and improves with experience.
• Train / Test Split
• We divide our data (randomly) into training and testing datasets so that we can evaluate how our models perform
on data they have not seen.
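The random train / test split can be sketched in a few lines. The function name, the 25% test ratio and the fixed seed are illustrative assumptions; scikit-learn ships an equivalent helper.

```python
import random


def train_test_split(samples, labels, test_ratio=0.25, seed=42):
    """Randomly split paired samples and labels into train and test sets."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (
        pick(samples, train_idx),
        pick(samples, test_idx),
        pick(labels, train_idx),
        pick(labels, test_idx),
    )


# Toy data: the label is simply whether the sample value is odd.
X = list(range(20))
y = [value % 2 for value in X]
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))
```

The test portion must stay untouched during training and tuning, otherwise the evaluation no longer reflects unseen data.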
- 26. Step 5: Choice of Metrics / Evaluation
Criteria
• Accuracy
• False Positive Rate (FPR)
• False Negative Rate (FNR)
• Precision
• Recall
• F1-measure
• & More…
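The metrics listed above all derive from the four confusion-matrix counts. The sketch below computes them for a binary classifier; the label encoding (1 = malware) and the toy predictions are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class is absent."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Toy labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

On skewed data, accuracy alone can look good while the false negative rate is poor, which is why several metrics are usually reported together.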
- 27. Step 6: Testing
Lastly, we test how our machine learning algorithm performs on an
unseen set of test cases. One way to do this is to partition the data
into a training set and a testing set. The training set is used in step 4, while the
test set is used in this step. Techniques such as cross-validation
and leave-one-out can be used to deal with scenarios where we do not
have enough data.
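Cross-validation can be sketched as a generator of train / test index pairs; leave-one-out is the special case where the number of folds equals the number of samples. The function name is hypothetical, and the sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

# Leave-one-out: every sample is the test set exactly once.
leave_one_out = list(k_fold_indices(5, 5))
```

Each sample is used for testing exactly once, so the averaged score uses all of the data without ever testing on a sample the model was trained on in that fold.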
- 28. Another interesting example
Another interesting way to do malware classification is by converting
malware samples to images and applying machine learning / deep learning
techniques on top of them.
The proposed method generates RGB-colored pixels on image matrices
using the opcode sequences extracted from malware samples and
calculates the similarities for the image matrices.
Reference: Malware Analysis Using Visualized Image Matrices
https://www.hindawi.com/journals/tswj/2014/132713/
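A loose illustration of the idea, and not the algorithm from the referenced paper: opcode n-grams are hashed to a pixel position and an RGB colour, and two images are compared by the share of matching pixels. The hash function, n-gram length, image size and similarity measure are all assumptions made for this sketch.

```python
import hashlib


def opcodes_to_image(opcodes, size=8, ngram=3):
    """Map opcode n-grams to RGB pixels on a size x size matrix."""
    image = [[(0, 0, 0) for _ in range(size)] for _ in range(size)]
    for i in range(len(opcodes) - ngram + 1):
        gram = " ".join(opcodes[i:i + ngram])
        digest = hashlib.md5(gram.encode()).digest()
        # First two bytes choose the pixel, next three its colour.
        x, y = digest[0] % size, digest[1] % size
        image[y][x] = (digest[2], digest[3], digest[4])
    return image


def pixel_similarity(a, b):
    """Fraction of pixel positions with an identical colour in both images."""
    flat_a = [pixel for row in a for pixel in row]
    flat_b = [pixel for row in b for pixel in row]
    same = sum(1 for p, q in zip(flat_a, flat_b) if p == q)
    return same / len(flat_a)


# Hypothetical opcode sequences from two samples.
sample_a = ["push", "mov", "call", "pop", "ret"]
sample_b = ["push", "mov", "xor", "jmp", "ret"]
image_a = opcodes_to_image(sample_a)
image_b = opcodes_to_image(sample_b)
print(pixel_similarity(image_a, image_b))
```

In practice the resulting matrices would be fed to an image classifier or compared across known malware families.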
- 31. So is malware detection being done using
machine learning as of now?
- 32. Kaspersky: Machine Learning for Malware
Detection
Key points they mention are:
• Have the right data.
• Know theoretical machine learning and how to apply it to cybersecurity.
• Know users' practical needs and be an expert at implementing machine
learning into products.
• Earn a sufficient user base and use the power of feedback loop and
crowdsourcing.
• Keep detection methods in multi-layered synergy.
Link to whitepaper
- 35. Anomaly Detection
• Statistical techniques: Mean, Standard Deviation
• Supervised Algorithms: KNN, Random Forest, SVM
• Unsupervised Algorithms: SOM, K-means, CART Based, Local Outlier
Factor
• Deep Learning Models: LSTM, Auto Encoders
Twitter Anomaly Detection | scikit-learn | Facebook Prophet | LinkedIn
Luminol
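The simplest statistical technique above, mean and standard deviation, can be sketched as a z-score check. The threshold of 3 standard deviations and the hourly login counts are illustrative assumptions.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` sigmas from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # All values identical: nothing stands out.
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical logins per hour; the last hour contains a burst.
logins_per_hour = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15,
                   12, 14, 13, 11, 15, 14, 12, 13, 14, 250]
anomalies = zscore_anomalies(logins_per_hour)
print(anomalies)
```

A single extreme value also inflates the mean and standard deviation it is measured against, which is one reason the more robust methods in the list above are often preferred.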
- 37. Use cases in other areas
Supervised Learning
• Classification
  • Malware Detection / Classification
  • Spam Detection
  • Phishing Detection
• Regression
  • Risk Scoring
  • User Behavior Analysis and Fraud Detection
Unsupervised Learning
• Clustering
  • Forensic Analysis
• Anomaly Detection
  • Network Traffic Analysis
  • Fraud Detection
• Recommendations
  • Remediation Action Recommendations in Incident Response
• Pattern Detection, Correlation and NLP
  • Log Correlation
  • Noise Reduction
- 39. 1. Volume of data
Data has posed perhaps the single greatest challenge in cybersecurity
over the past decade. For a human, or even a large team of humans,
the amount of data produced daily on a global scale is unimaginable.
For every minute in 2017 there were:
- 46. 3. Attacks are getting more sophisticated
Breaking captcha using deep convolutional networks
- 48. 5. Vendor Management - New vendors
coming up every other day
• You need to know what the technology itself has to
offer before evaluating what vendors offer.
• AI/ML is no longer just a buzzword. It has strong capabilities, but in the
end it is a tool.
- 50. As an individual or an enterprise
• Online courses
• Online challenges and open
source tools to try things out and
build proofs of concept
• Using the power of the cloud to do
things at scale
• There is no lack of content out
there on this topic.
- 52. Security ML requirements
MACHINE LEARNING EXPERTISE
TO THINK BEYOND STANDARD
TOOLKITS.
DATA ACROSS THE STACK
HOST (EVENT LOGS, SYS LOGS,
AV LOGS)
NETWORK LOGS
SERVICE & APPLICATION LOGS
SECURE AND SCALABLE
PLATFORM
EYES ON GLASS TESTING WITH REAL ATTACKS
- 53. Open Source Communities
01 Create, share and validate open data repositories.
02 Get involved in crowdsourced generation of labelled data.
03 Initiate research in this area and collaborate.
04 Brace ourselves for the next generation of attack and defence.
- 55. Takeaways
• ML/DL are here, embrace the change: applied correctly, ML
can enhance defensive practices.
• There are many possibilities in InfoSec for these techniques.
• Machine Learning / Deep Learning / AI are tools. You
have to know how to apply a tool for it to reveal true insight. ML
is not the only tool we need, but it is bound to get more
powerful with time. We need to mix in experience: we have to work
with experts to capture their knowledge so that the algorithms reveal
actual security insights or issues.
- 57. Appendix
• Visual introduction to machine learning - http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
• Microsoft Malware Challenge on Kaggle - https://www.kaggle.com/c/malware-classification
• Malware Detection and Classification Using Machine Learning on Microsoft
Malware Classification challenge - https://github.com/dchad/malware-detection
• Collection of deep learning research papers -
https://medium.com/@jason_trost/collection-of-deep-learning-cyber-security-research-papers-e1f856f71042
• Security data science papers - http://www.covert.io/security-datascience-papers/
- 58. All Code and references available at
https://github.com/mebjas/owasp.tw.0718