Malware vs Big Data
Frank Denis - @jedisct1
OpenDNS - Umbrella Security Labs
http://opendns.com - http://umbrella.com
DNS still has unsolved issues
A lot.
Why use an alternative DNS resolver?
• Because most DNS resolvers run by ISPs suck.
• Because some governments use DNS for censorship.
• Because DNS is a cheap and easy way to track users.
• OpenDNS: 208.67.222.222 & 208.67.220.220, ports 53 and 443.
Why use OpenDNS?
Fast
We have machines in 20 datacenters with 20 Gb/s bandwidth minimum everywhere, we use anycast, we peer with a lot of people, and we have an operations team with mad skillz at optimizing routes for super-low latency.
Reliable
100% uptime since 2007, at least from a user perspective (not from a Nagios perspective, but anycast + a kickass ops team works).
Not only DNS
We are running VPN servers, too. With the same features as what we offer for DNS.
Stats
Domain & category blocking
Look 'ma. (Almost) no proxy required!
</marketing>
Security
In addition to generic categories, we can block internet threats (but you have to give us $$$ for that.
Please proceed):
• Phishing, fraud
• Malware (botnets, ransomware, bitcoin mining...)
• APT (actually no, you don't get it in the package).
A typical infection
Infectors
It all begins with some exciting spam. Or a Facebook app. Or some shit on Craigslist. Or an ad banner
on a benign web site. Or a Google result for “download Justin Bieber MP3”.
Or an XSS vulnerability. Because these can do more than alert(document.cookie). Like, injecting an iframe.
Exploitation
Browser exploits, Flash, PDF, Java. Even smart people can be p0wn3d by a 0day.
Multiple intermediaries between an infector and the actual exploit are common.
Payload
The actual shit that is going to constantly run on your machine and that will make your life miserable. That shit is either downloaded directly, or through an installer, involving more intermediaries.
You’ve been p0wn3d, now what?
Ransomware: you will have to pay in order to recover your data (or not).
Scam: you will have to pay for a fake antivirus.
Botnet: welcome to the community. You’re now going to send massive amounts of spam and help with
DDoSing some random dudes.
Click fraud: this is your lucky day. Your browser has a new cute toolbar that hijacks all your search results and offers cures for your erectile dysfunction.
Keyloggers / banking trojans / file scrapers / bitcoin miners: thanks for your donations.
Targeted attacks: if you happen to work on a nuclear weapon, shit is going to hit the nuclear fan.
And more! Endless fun!
Command and conquer^H^H^H^Htrol
This is how botnet operators are going to mess with your computer.
• DGAs: infected clients are going to send queries to pseudorandom domain names / URLs, derived from the current date (toy sketch after this list).
• Infected clients can also try to connect to a hard-coded (sometimes huge) list of URLs, some of them being totally benign, just to confuse antiviruses (and our algorithms as well). Oh, IRC is still cool as a C&C, too.
• Malware authors are getting more and more creative and are now also taking advantage of Google Docs, Google Search, Twitter, Pastebin and abusing various protocols.
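A toy sketch of the DGA idea, in the same Mathematica as the rest of this deck. The seeding scheme and name length are entirely made up for illustration; no real family works exactly like this:

(* Toy DGA: derive pseudorandom domain names from the current date.
   The hashing scheme and 12-char length are invented. *)
dga[date_String, n_Integer] := Table[
  BlockRandom[
   SeedRandom[Hash[{date, i}]];
   StringJoin[RandomChoice[CharacterRange["a", "z"], 12]] <> ".com"],
  {i, n}]

dga[DateString[{"Year", "Month", "Day"}], 3]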
OMG, this is running in the cloud!
This is a highly scalable, fault-tolerant architecture.
Each piece of the puzzle can easily be moved/removed/updated.
Exploits and payloads are constantly repacked in order to bypass antiviruses.
By the time a sample is found and analyzed, it’s already too late.
Exploit kits rule the cybercrime world, and more diversity is not required.
Exploit kits are sufficiently advanced to give a lot of work to researchers already. All they need is an
update from time to time, to take advantage of new Java^H^H^H^H vulnerabilities.
Our approach
Antivirus vendors / malware researchers are doing an excellent job.
But we take a different, complementary approach, leveraging the massive amount of traffic that our
DNS resolvers are seeing.
Not a new idea in academic papers, and “predictive analytics” has become a cool marketing buzzword.
But we’re doing it with real data, for real customers, in a constantly evolving landscape.
Linear regression
SeedRandom[1]
data = Table[{i, i + RandomReal[{0, 2}]}, {i, 1, 10}]
Show[ListPlot[data, PlotStyle -> Red, PlotMarkers -> {"*", 42}],
 ImageSize -> Scaled[0.75]]
(plot: the ten noisy data points)
model = LinearModelFit[data, {x}, x]
Show[ListPlot[data, PlotStyle -> Red, PlotMarkers -> {"*", 42}],
 Plot[model[x], {x, 0, 10}, PlotStyle -> Blue], ImageSize -> Scaled[0.75]]

FittedModel[0.880909 + 0.988312 x]
(plot: the data points with the fitted line overlaid)
data2 = Append[{#[[1]], #[[2]]^2} & /@ data, {2.5, 250}]
model2 = LinearModelFit[data2, {x}, x]
Show[ListPlot[data2, PlotStyle -> Red, PlotMarkers -> {"*", 42}],
 Plot[model2[x], {x, 0, 10}, PlotStyle -> Blue], ImageSize -> Scaled[0.75]]

FittedModel[36.5318 + 5.75022 x]
(plot: data2 and its poor linear fit)
Please note that throwing more data wouldn’t help. We need to change the model and/or do something
to the training set.
data3 = 8@@1DD, Sqrt@@@2DDD< & êû data2
H* data3 = Delete@data3, 11D <- outlier! *L
model3 = LinearModelFit@data3, 8x<, xD
Show@ListPlot@data3, PlotStyle Æ Red, PlotMarkers Æ 8"*", 42<D,
Plot@model3@xD, 8x, 0, 10<, PlotStyle Æ BlueD, ImageSize Æ Scaled@0.75DD
FittedModelB 3.97241 +0.613584 x F
(plot: data3 and its linear fit, reasonable again after the Sqrt transform)
model4 = LinearModelFit[data, {x, x^2}, x]
model5 = LinearModelFit[data, {x, x^2, x^3, x^4, x^5, x^6, x^7, x^8, x^9}, x]
Show[ListPlot[data, PlotStyle -> Red, PlotMarkers -> {"*", 42}],
 Plot[{model5[x], model[x]}, {x, 0, 10}, PlotStyle -> {Blue, Dotted}],
 ImageSize -> Scaled[0.75]]

FittedModel[1.82804 + 0.514745 x + 0.0430515 x^2]
FittedModel[326.819 - 865.008 x + <<9>> + 0.0204298 x^8 - 0.000411603 x^9]
(plot: the degree-9 fit vs. the dotted linear fit)
Overfitting can do more harm than good.
Multivariate linear regression
Unless we've found an actual malware sample, all we can infer is a bunch of scores.
This holds true for all reputation-based systems.
This also holds true for antiviruses when using heuristics (different “severity levels”).
This also holds true for antiviruses when not using heuristics, because false positives and false negatives are present in any signature/whitelist/blacklist-based system.
We use multivariate linear regression to aggregate scores.
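A minimal sketch of what aggregating scores this way can look like, on synthetic rows; the feature names (lexical, reputation, popularity) and the weights are invented for illustration:

(* Combine three per-model scores into one aggregate via multivariate
   linear regression. Rows are synthetic: {lexical, reputation, popularity, target}. *)
SeedRandom[2];
training = Table[
   Module[{lex = RandomReal[], rep = RandomReal[{-100, 100}],
     pop = RandomReal[{0, 100}]},
    {lex, rep, pop, 2 lex - 0.05 rep - 0.01 pop + RandomReal[{-0.1, 0.1}]}],
   {200}];
agg = LinearModelFit[training, {x1, x2, x3}, {x1, x2, x3}];
agg[0.7, -40, 3.2]  (* aggregated score for one unseen domain *)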
Classification
ts1 = TableForm[{{"PHP", "Cpanel", "Wordpress", "Popularity", "Reputation", "Label"},
  {"T", "T", "T", 0.1, -50, "malicious"}, {"T", "T", "F", 75, 80, "benign"},
  {"T", "F", "T", 2.3, -4, "malicious"}, {"F", "F", "F", 40, 2, "benign"},
  {"...", "...", "...", "...", "...", "..."},
  {"T", "F", "F", 20, -10, "NA"}}]
PHP Cpanel Wordpress Popularity Reputation Label
T T T 0.1 -50 malicious
T T F 75 80 benign
T F T 2.3 -4 malicious
F F F 40 2 benign
... ... ... ... ... ...
T F F 20 -10 NA
What classifiers are used for
Data we are storing, indexing and crunching
• 40 billion client queries / day. But only 4.3 billion valid distinct (client_ip, domain) pairs.
• Responses from authoritative servers: IP addresses, TTLs, mail servers and response codes.
• Routing tables.
• Lists of URLs and domain names that have been flagged as malicious by 3rd party feeds, by individual security researchers, by ourselves (manually) and by our models. And, for phishing web sites, by our community (PhishTank).
• We keep historical data for all of these.
• Most of this data, plus the output of our models, is stored twice: in HDFS for our M/R jobs, and in HBase for ad-hoc lookups and classification. We're planning to store it in GraphLab as well.
• Some bits are stored in PostgreSQL, too. Because for JOINs and custom index types, it's dope.
• Pro-tip: bloom filters work really well for deduplicating data without having to sort it (toy sketch below).
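The bloom filter pro-tip, as a toy sketch. The size m and hash count k are arbitrary here; a real deployment sizes them from the expected cardinality and acceptable false-positive rate, and accepts that a dedup filter occasionally drops a genuinely new item:

(* Toy Bloom filter for deduplicating (client_ip, domain) pairs without sorting. *)
m = 2^20; k = 4;
bits = ConstantArray[0, m];
hashes[item_] := Table[Mod[Hash[{i, item}, "SHA256"], m] + 1, {i, k}];
seenQ[item_] := And @@ (bits[[#]] == 1 & /@ hashes[item]);
add[item_] := Scan[(bits[[#]] = 1) &, hashes[item]];

pairs = {{"1.2.3.4", "example.com"}, {"1.2.3.4", "example.com"},
   {"5.6.7.8", "example.org"}};
deduped = Reap[Scan[If[!seenQ[#], add[#]; Sow[#]] &, pairs]][[2, 1]]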
The Security Graph
The “Security Graph” exposes some of our data, including the output of some of our models, through a
simple RESTful API.
Other security researchers can use our data to build their own models. Access is restricted, but free.
Exposed data includes:
• Lexical features computed on the fly.
• Network features: domain records + client diversity.
• Output from some models.
Client IP addresses are never exposed.
The Security Graph: demo!
(dear Wi-Fi, please do not crap out. Not now).
Fast flux?
Fast-fluxing is the art of having a domain name quickly switch to new IP addresses, and/or to new name servers (double-fluxing).
These domains can quickly jump to another host before being detected. And firewalls can hardly block
them.
The only way to take them down is to disable the domain name itself. Only registrars can do that. And
they very rarely do.
We are using a classifier with 21 features collected over a short time window to discover new fast-fluxy
domains:
• TTL (mean, stddev) for A and NS records
• Number of IPs, prefixes and ASNs
• Number of IPs / Number of ASNs
• Number of countries
• Mean geographical distance for IP records and NS records (sketch after this list)
• Number of name servers
• ASN and domain age
We used a very strict training set, and as a result, only ≈10 domains are flagged every day, but with no known false positives so far.
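One of those features, the mean geographical distance between the hosts behind a domain's A records, is cheap to compute. A sketch with invented coordinates:

(* Mean pairwise geographic distance between a domain's A-record hosts.
   The three coordinates below are made up. *)
locs = GeoPosition /@ {{48.85, 2.35}, {37.77, -122.42}, {35.68, 139.69}};
meanDistance = Mean[GeoDistance @@@ Subsets[locs, {2}]]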
Fast flux is dying
Netcraft confirms™.
Fast-flux domains are mostly used for phishing and scams, and by rogue pharmacies.
Cybercriminals are shifting to other techniques: disposable domains, dynamic DNS services, redirection
services and compromised hosts.
Security features
Client geographic distribution
For a given domain, we use the Kolmogorov-Smirnov test to compare the geographic distribution of client queries with the predicted one for the TLD of this domain.
The output plays a very important role for detecting domain names that are *not* malicious.
Caveat: only applicable to domain names using a ccTLD.
ASN, prefix and IP reputation
Ex: the ASN score.
D_a: the set of domain names resolving to at least 1 IP announced by ASN a.
M: the set of domain names flagged as malicious.
c: mean number of malicious domains over {a : D_a ∩ M ≠ ∅}.

S(a) = |D_a ∩ M| / (c + |D_a|)
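A literal toy implementation of S(a), with made-up ASN-to-domain data:

(* Toy ASN reputation score S(a) = |Da ∩ M| / (c + |Da|).
   asnDomains and malicious are invented inputs. *)
asnDomains = <|"AS1" -> {"a.com", "b.com", "evil1.biz"},
   "AS2" -> {"evil2.biz", "evil3.biz"}, "AS3" -> {"c.org"}|>;
malicious = {"evil1.biz", "evil2.biz", "evil3.biz"};
malCount[doms_] := Length[Intersection[doms, malicious]];
c = Mean[Select[malCount /@ Values[asnDomains], # > 0 &]];
score[a_] := N[malCount[asnDomains[a]]/(c + Length[asnDomains[a]])];
score /@ Keys[asnDomains]  (* higher score = worse neighborhood *)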
Wait...
Popularity
The Alexa rank only ranks web sites, not domain names. In particular, CDNs and ad networks are not
included.
Mobile traffic? Nope.
And who is still using the Alexa toolbar? No, seriously?
The number of DNS queries is not a good popularity indicator, if only because it highly depends on the
TTL.
Thus, we compute a “popularity score” based on the number of distinct IP addresses having visited a domain name. This is a Bayesian average, similar to reputation scores.
In addition, we also run the PageRank algorithm on an undirected graph built from (client_ip, domain_name) pairs.
The SecureRank
We consider the (client_ip, domain_name) pairs as an undirected graph, and assign an initial positive rank Sr0 to domain names from the Alexa top 200,000 list, and a negative score -Sr0 to domain names from our blacklists.
Initialization

Sr(C1) = Sr0
Sr(C2) = -Sr0

First iteration

(diagram: client C1 is connected to domains D1, D2, D3, D4; client C2 to D4, D5, D6. Each client splits its rank equally among its domains.)

Sr(D1) := Sr(C1)/4
Sr(D4) := Sr(C1)/4 + Sr(C2)/3
Sr(D6) := Sr(C2)/3

(and similarly for D2, D3, D5)

Next iteration

(diagram: each domain splits its rank equally among its clients; only D4 has two.)

Sr(C1) := Sr(D1) + Sr(D2) + Sr(D3) + Sr(D4)/2
Sr(C2) := Sr(D4)/2 + Sr(D5) + Sr(D6)
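A toy run of both propagation steps on this exact example graph (Sr0 = 100 is an arbitrary choice):

(* SecureRank toy run: C1 queried D1..D4, C2 queried D4..D6. *)
edges = <|"C1" -> {"D1", "D2", "D3", "D4"}, "C2" -> {"D4", "D5", "D6"}|>;
sr0 = 100.;
clientRank = <|"C1" -> sr0, "C2" -> -sr0|>;

(* each client splits its rank equally among its domains *)
domainRank = Merge[KeyValueMap[
    Function[{cl, ds}, Association[# -> clientRank[cl]/Length[ds] & /@ ds]],
    edges], Total];

(* each domain splits its rank equally among its clients *)
degree = Merge[KeyValueMap[
    Function[{cl, ds}, Association[# -> 1 & /@ ds]], edges], Total];
newClientRank = Association[KeyValueMap[
    Function[{cl, ds}, cl -> Total[domainRank[#]/degree[#] & /@ ds]], edges]];

{domainRank, newClientRank}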
Caveats
• What prevents the SecureRank from converging towards the actual PageRank in practice:
• We only perform a limited number of iterations, not actually iterating until convergence
• The graph is not dense
• Only immediate neighbors are visited at each iteration
• Low-degree vertices can lead to false positives
• The final SecureRank is highly dependent on the initial SecureRank
• A dangerous variable to use for classification
Still a very useful algorithm, see next slide...
Candidate selection
Every day, we build several lists of domain names having very low scores:
• SecureRank
• IP, Prefix and ASN reputation scores
• Lexical scores
• FrequencyRank
• Traffic spikes
• Traffic spikes for nonexistent domains
• C-Rank?
These lists have too many false positives to be blindly added to our list of blocked domains. However,
they are used as inputs for other models.
Co-occurrences: demo
(Saint Wi-Fi, pray that the network doesn't crap out during this demo)
DNS is a mess
github.com [A]
a248.e.akamai.net [A]
1.courier-push-apple.com.akadns.net [A]
e3191.c.akamaiedge.net.0.1.cn.akamaiedge.net [A]
api.travis-ci.org [A]
codeclimate.com [A]
api.twitter.com [A]
s.twitter.com [AAAA]
gmail-imap.l.google.com [A]
i2.wp.com [A]
raw.github.com [A]
github.com [AAAA]
a248.e.akamai.net [AAAA]
api.travis-ci.org [AAAA]
i2.wp.com [AAAA]
raw.github.com [AAAA]
secure.gravatar.com [A]
travis-ci.org [A]
secure.gravatar.com [AAAA]
travis-ci.org [AAAA]
l.ghostery.com [A]
codeclimate.com [AAAA]
DNS logs are a mess
Our DNS resolver event loop:
• Accept / read a client query
• Retrieve ACLs, Active Directory permissions, and stuff
• Check if the response is cached
• Bummer, it's not. Ask authoritative servers, cope with errors, timeouts and stuff
• Store the response in a dedicated log file for the beloved research team
• Cache the response
• Check for blocked domains, categories, security flags and stuff
• Recurse if required: go to (current step - 5)
• Update stats, counters, and stuff
• Send the response to the client
• Add a line to the log file, containing the current timestamp, and something that approximately describes what just happened.
See the problem?
The timestamps we are logging don't reflect the order of queries as sent by a client. And UDP doesn't preserve order anyway.
But temporal proximity is all we have in order to link domains to each other, anyway.
More often than not, more data doesn't improve a crappy model, it just makes it slow. But in this case, throwing a lot of data at it definitely helps turn shit into gold.
Co-occurrences
Intuition: domain names frequently looked up around the same time are more likely to be related than domain names that aren't.
Intuition: the data really needs to be cleaned up if we want to get decent results.
Intuition: the previous intuitions don’t look too bad. Let’s do it.
Co-occurrences (2)
All we have to identify a device is a client IP, and the "1 client IP = 1 device" assertion just doesn't work:
• Dynamic IP addresses / roaming users
• NAT
What we are doing to mitigate this:
• Instead of processing log files from an entire day, we process a 1-hour time slice over 5 days (the peak hour for each datacenter).
• When multiple identical (client_ip, domain_name) pairs are seen, we only keep the one with the oldest timestamp, so that our model can't be screwed up (intentionally or not) by a single client.
• Queries from client IPs having sent more queries than 99.9% of other clients are discarded. This ignores heavy NAT users in order to improve our model.
Pro-Tip: LinkedIn’s DataFu package is cool for computing approximate quantiles on a data stream.
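The real job runs Pig-side with DataFu's approximate streaming quantiles; here is only the shape of the filter, with an exact quantile over synthetic heavy-tailed counts:

(* Drop clients whose query count exceeds the 99.9th percentile.
   Counts below are synthetic. *)
SeedRandom[3];
queryCounts = AssociationThread[Range[10000],
   RandomVariate[ParetoDistribution[10, 1.5], 10000]];
cutoff = Quantile[Values[queryCounts], 0.999];
kept = Select[queryCounts, # <= cutoff &];
{cutoff, Length[kept]}  (* ≈ 10 clients dropped *)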
Co-occurrences (3)
t1, ..., tn: timestamps of the queries sent by a given client.
We need to define the distance between 2 domains i and j looked up by this client.
How about d(i,j) = |ti - tj|?
What does the distribution of d(i,j) look like, on real examples of related domains?

g(i,j) = 1 / (1 + a |ti - tj|)

What if all the clients having sent queries to domains i and j had a say?

g(i,j) = Σ_{c ∈ C, {i,j} ⊆ Dc} 1 / (1 + a |ti(c) - tj(c)|)

C: client IPs
D: domain names
Dc: domain names looked up by client c
ti(c): min(timestamp) of a query for domain i by client c
g(i,j): co-occurrence score of domain j for domain i
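A direct transcription of g(i, j) over invented per-client lookup timestamps:

(* Each client that looked up both i and j contributes 1/(1 + a |ti(c) - tj(c)|). *)
a = 0.1;
lookups = <|
   "c1" -> <|"i.com" -> 0., "j.com" -> 3., "k.com" -> 500.|>,
   "c2" -> <|"i.com" -> 10., "j.com" -> 11.|>|>;
g[i_, j_] := Total[
   1/(1 + a Abs[#[i] - #[j]]) & /@
    Select[Values[lookups], KeyExistsQ[#, i] && KeyExistsQ[#, j] &]];
g["i.com", "j.com"]  (* ≈ 1.678: both clients vote, small time gaps *)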
Co-occurrences (4)
This simple model performs very well.
Unfortunately, it’s useless for discovering interesting domains.
Let’s refine the function to address this:
s(i,j) = g(i,j) / Σ_{k ∈ D} g(k,j)

Normalization doesn't hurt:

s'(i,j) = s(i,j) / Σ_{k ∈ D} s(i,k)
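Continuing the g sketch above (same toy lookups), both normalizations become matrix operations over a small domain set:

(* s(i,j) = g(i,j)/Σk g(k,j); s'(i,j) = s(i,j)/Σk s(i,k). *)
domains = {"i.com", "j.com", "k.com"};
G = Outer[g, domains, domains];
S = Transpose[Transpose[G]/Total[G]];  (* divide column j by Σk g(k,j) *)
Sp = S/Total[S, {2}];                  (* divide row i by Σk s(i,k) *)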
Co-occurrences (done!)
C-Rank
Intuition: domain names frequently co-occurring with domain names already flagged as malicious are more likely to be malicious themselves.
Let's use the co-occurrence scores for that.
M: the set of domain names already flagged as malicious.

Cr(i) = - Σ_{j ∈ M} s'(j,i) / Σ_{j ∈ D} s'(j,i)
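And C-Rank on the toy matrices from the previous sketch, with j.com pretend-flagged as malicious:

(* More negative C-Rank = more suspicious. mal indexes into `domains`. *)
mal = {2};  (* pretend j.com is in M *)
cRank[i_] := -Total[Sp[[mal, i]]]/Total[Sp[[All, i]]];
cRank /@ Range[Length[domains]]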
Domain classification
We extract 50 features (prior to vectorization) from our log records, from the output of our models, and
some basic features computed on the fly:
• TLD
• TTL {min, max, mean, median, stddev}
• Server countries and country count
• ASNs and ASN count
• Prefixes and prefix count
• Geographic locations, their count, and mean distance
• IP stability and diversity
• Lexical scores and other features based on the domain name
• KS test on geographic client diversity
• Response codes
• CNAME record
• Popularity and PageRank
• C-Rank
• Tags, for domain names that have been manually reviewed
• Number of days since the first lookup for this domain was seen
• ...
We use a Random Forest classifier (1,000 trees, 12 random variables per tree, using a private fork of the RF-Ace project) to predict a label for unknown domains.
Predictions are done in real time, in order to always use the most up-to-date dataset.
The false-positive rate is very low (≈ 1%), but the classifier has to be retrained frequently.
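The RF-Ace fork is private, so as a stand-in: Mathematica's built-in Classify with a random-forest method, on a few invented feature rows (far too few to mean anything):

(* Stand-in for the real pipeline: rows are {TLD, mean TTL, ASN count, reputation}. *)
train = {
   {"com", 300., 5, -50.} -> "malicious",
   {"com", 86400., 1, 80.} -> "benign",
   {"biz", 60., 7, -4.} -> "malicious",
   {"org", 3600., 1, 2.} -> "benign"};
cl = Classify[train, Method -> "RandomForest"];
cl[{"biz", 120., 4, -10.}]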
On the importance of the training set
Our current training set contains ≈ 200,000 records.
The initial training set we used had ≈ 500,000 records but performed really poorly. We threw in everything we had, including domains from the Conficker botnet, phishing, and domain names that were flagged as malicious a long time ago.
Since we didn't have a lot of internal categories to sort our blacklists (“malware” was a catch-all list we dumped everything into), we used the Google Safe Browsing API in order to filter the list.
The size of the training set went down to ≈ 30,000 records, but the performance of the classifier increased drastically, without any other changes.
Combining the output of different models
Every day, we build several lists of domain names having very low scores (yay, that was already said a couple of slides back).
We remove domain names that we have already flagged as malicious from these lists, and use the remainder to build new lists, some of which are added to our blacklists.
• Prefix scores ∩ Spikes
• Prefix scores ∩ C-Rank
• IP score ∩ SecureRank
• IP score ∩ FrequencyRank
• IP score ∩ NXDomain spikes
• ASN score ∩ Google Safe Browsing ∩ classifier label
• 3rd party lists ∩ classifier label
• Experimental models ∩ VirusTotal
A paradigm shift
Newly registered domains. Pseudorandom names. Short TTLs. A myriad of IPs spread over unrelated ASNs, most of them already well known for hosting malicious content.
These are strong indicators, among others, that a domain is very likely to be malicious, and we have long been using algorithms that leverage them to automatically spot such domains and protect our customers.
However, the security industry is currently observing a significant paradigm shift.
Spammers, scammers and malware authors are now massively abusing compromised machines in
order to operate their business.
Apache/Lighty/Nginx backdoors can stay under the radar for a long time.
The Kelihos trojan
Right after the Boston marathon bombing tragedy, a spam campaign drove recipients to a web page
containing actual videos of the explosion.
With, as a bonus, a malicious iframe exploiting a recent vulnerability in Java in order to download and
install the Kelihos trojan.
Here are some of the infectors serving the malicious Java bytecode:
kentuckyautoexchange.com
infoland.xtrastudio.com
aandjlandscapecreations.com
detectorspecials.com
incasi.xtrastudio.com
sylaw.net
franklincotn.us
earart.com
bigbendrivertours.com
aeroimageworks.com
winerackcellar.com
What all of these web sites have in common is that they were not malicious.
These were totally benign web sites, established for a long time, with a decent security track record and
no distinctive network features.
They just got compromised, usually because of weak or stolen passwords.
Results
Results (2)
When a new threat has been found by other researchers, we frequently discover related domains before they are flagged by antiviruses and reputation systems. “Before” varies between 1 minute and 1 week; the median is < 1 day.
Our models actually do a pretty good job at spotting false positives from 3rd party models.
But let's be fair: we didn't discover anything major (yet).
Or maybe we actually did block malicious content that nobody knew was malicious. Not even us.
We are constantly seeing suspicious behaviors, suspicious domain names, suspicious client activity, suspicious network features. Now, is it an actual threat? Unless we find actual samples, we don't know.
This is why we don't rely only on our own work. We just add our little piece to the puzzle.
An endless battle?
• We are mining DNS queries in order to find a needle in the haystack, namely domain names that are more likely to be malicious than others.
• Malware authors are smart, and are constantly finding new tricks to keep security researchers busy.
• There's no such thing as a one-size-fits-all model. We build different models to predict the maliciousness of a domain, combine these with 3rd party models, do a lot of manual investigation, and proactively block really suspicious stuff.
• This is slide #42.