Malicious Client Detection Using Machine Learning
Satyam Saxena
Threats
• There are many types of malware for all types of devices and operating systems.
• Most, if not all, malware relies on a support system: command and control (C&C) infrastructure.
• Bad guys use DNS to scale and hide their C&C infrastructure.
• Bad guys use DNS for C&C to bypass corporate security (tunneling).
• Bad guys use cloud providers to roll out, scale, manage, and quickly move their C&C infrastructure.
Without relying on any particular endpoint operating system or configuration, we can use big data analytics on network data to detect malware.
Malware use of DNS
[Diagram: End point device, DNS Server, DNS Server, Command and Control Infrastructure. Example lookup: rndruppbakyokv[.]com, 1.2.3.4]
• Communication channel with C&C is established.
• Compromised device receives updates, instructions, targets.
Machine Learning Pipeline
[Diagram]
• Input: Raw pDNS
• Classifiers: Domain Name classifier, DNS Resolver classifier, Device Behavior classifier, Compromised Device (Security Event) classifier
• Outputs: Malicious Domains, Malicious Resolvers, Behavior Anomalies
• Model labels: DGA, Network, Time, Tunnel
Architecture
DGA Model
• Detects randomly generated domains in the pDNS data.
• The model is trained on 6 malware families, such as Zeus, Tinba, and Pushdo.
• 29 features are extracted from each domain.
• The 29 features are reduced to 16 dimensions using PCA.
• The reduced feature set is then used to train a GBM classifier.
Domain Features
[Plots: Common Letter Score; Entropy]
Domain Features (2)
[Plots: Length of largest meaningful string; Mean length of dictionary words]
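Three of the features named on these slides (entropy, length of the largest meaningful string, mean length of dictionary words) can be sketched as below. This is a minimal illustration, not the talk's actual feature extractor: the word list `WORDS`, the helper names, and the 3-character minimum word length are all assumptions, and the input is assumed to be a registered domain without subdomains.

```python
import math
from collections import Counter

# Hypothetical mini dictionary; a real system would load a full word list.
WORDS = {"mail", "google", "book", "face", "shop", "cloud"}

def shannon_entropy(label):
    """Shannon entropy (bits per character) of the characters in label."""
    if not label:
        return 0.0
    n = len(label)
    counts = Counter(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def dictionary_words(label, min_len=3):
    """All substrings of label with at least min_len characters found in WORDS."""
    found = []
    for i in range(len(label)):
        for j in range(i + min_len, len(label) + 1):
            if label[i:j] in WORDS:
                found.append(label[i:j])
    return found

def domain_features(domain):
    """Per-domain features; only the first label (before the first dot) is scored."""
    label = domain.lower().split(".")[0]
    words = dictionary_words(label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "longest_word_len": max((len(w) for w in words), default=0),
        "mean_word_len": sum(map(len, words)) / len(words) if words else 0.0,
    }
```

A randomly generated name such as rndruppbakyokv contains no words from this toy dictionary, while a name like facebook contains two, which is the kind of separation these features are meant to expose.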
DGA Features
DGA Classification Performance
Overall model performance (Random Forest)
Metric       Performance
Accuracy     98.738%
Precision    99.288%
Recall       98.181%
AUC          99.801%
Performance per malware family
Malware Family   % Detection
Conficker        86.309%
Cryptolocker     98.348%
Pushdo           95.515%
Ramdo            99.823%
Tinba            96.715%
Zeus             100.0%
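For reference, accuracy, precision, and recall follow from confusion-matrix counts as sketched below. The slides do not give the underlying counts, so the function name and any numbers passed to it are purely illustrative. Reading the per-family "% Detection" column as recall within each family is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        # Share of all domains labeled correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of domains flagged as DGA that really are DGA.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real DGA domains that were flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```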
Network Model
• Uses the WHOIS record to decide whether a domain is malicious or benign.
• A WHOIS record contains rich information about a domain.
• Age-based features.
• Registration features.
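Age-based features of the kind listed above could be derived from WHOIS dates roughly as follows. This is a sketch under assumptions: the record is a plain dict with hypothetical keys `creation_date`, `updated_date`, and `expiration_date` holding `datetime.date` values, and the four derived features are illustrative choices, not the talk's confirmed feature set.

```python
from datetime import date

def whois_age_features(record, today):
    """Derive day-count features from a parsed WHOIS record (hypothetical keys)."""
    created = record["creation_date"]
    updated = record["updated_date"]
    expires = record["expiration_date"]
    return {
        "age_days": (today - created).days,
        "days_since_update": (today - updated).days,
        "days_to_expiry": (expires - today).days,
        "registration_period_days": (expires - created).days,
    }

# Hypothetical record for a recently registered domain.
example = {
    "creation_date": date(2024, 1, 1),
    "updated_date": date(2024, 1, 11),
    "expiration_date": date(2025, 1, 1),
}
features = whois_age_features(example, today=date(2024, 1, 31))
```

Passing `today` explicitly keeps the function deterministic, which helps when the same records are rescored later.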
Network Features – WHOIS Server
[Plots: Malicious Domains; Benign Domains]
Network Features – creation Date
Network Model Performance
• Final set of features: creation date, update date, expiration date, admin country, registrant country, tech country, status, WHOIS server
Metric             Performance
Error              0.00450864127
Area Under Curve   0.96615884041
Compromised Client Detection
[Diagram: pDNS Data, Hadoop HDFS, Spark Compute, Group By]
IP    DGA   WHOIS   NX   SERVER
ip1   #10   #3      #4   #5
ip2   #8    #1      #2   #3
ip3   #5    #2      #0   #0
ip4   #3    #3      #0   #0
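The per-IP table above looks like the result of a group-by over flagged pDNS events. A plain-Python sketch of that aggregation follows; the slides run this step on Spark, so this only mirrors the logic on a small scale. The event format of `(client_ip, signal)` pairs, the function names, and ranking clients by total count are assumptions, since the slides do not state how the counts are combined into a final decision.

```python
from collections import defaultdict

# Column names taken from the slide's table.
SIGNALS = ("DGA", "WHOIS", "NX", "SERVER")

def count_signals_per_ip(events):
    """Group (client_ip, signal) pairs by IP and count each signal type."""
    table = defaultdict(lambda: dict.fromkeys(SIGNALS, 0))
    for ip, signal in events:
        if signal in SIGNALS:  # ignore unknown signal names
            table[ip][signal] += 1
    return dict(table)

def rank_clients(table):
    """Order IPs by total flagged events, highest first (assumed scoring rule)."""
    return sorted(table, key=lambda ip: sum(table[ip].values()), reverse=True)

# Hypothetical events, e.g. produced by the upstream classifiers.
events = [("ip1", "DGA"), ("ip1", "DGA"), ("ip1", "NX"), ("ip2", "WHOIS")]
table = count_signals_per_ip(events)
ranking = rank_clients(table)
```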
Thank You
