Mining human-scale insights from log data with machine learning
David Andrzejewski - @davidandrzej
Data Sciences Engineering, Sumo Logic
OC Big Data Meetup, September 17, 2014
OC Big Data Monthly Meetup #5 - Session 2 - Sumo Logic
Logs
The Problem We Solve
“More Logs Are Created In A Single Day Now Than in All of FY 2003,” Gartner 
Machine Generated 
Clickstream 
Web Servers, Email 
Applications, Mobile 
Security Devices, Desktops 
Human Generated 
Orders, Blogs, Social Networks, 
HR, Inventory, Manufacturing 
Networks, Servers, Hypervisors 
Machine Data is the largest, fastest growing, most complex segment of Big Data.
[Chart: time axis from 2003 to 2015]
Sumo Logic
“Turning Machine Data Into IT and Business Insights”
Search, monitor, visualize
Learn, classify, predict
Use Cases
• Availability & Performance
• Customer Insights
• Security and Compliance
Monitoring and reporting
Troubleshooting and root cause analysis
Custom App Code 
Open Source Software 
Middleware 
Databases 
Server / OS 
Virtualization 
Network 
Session ID Customer ID 
12/20/2011 17:23:44 PST [user=234fsf] failed transaction, 
sessionid:2F0A232324, [host=pay002.sjc] amount=1725.00 
66.249.67.24 - - [20/Dec/2011:17:23:40 -0700] ”POST /APP/ 
Order.php HTTP/1.1" 304 146 "-" SESSION=2F0A232324 
Job number 
12/20/11 17:23:34 AMQ7163: WebSphere MQ job number 18429 
started FOR client_session=2F0A232324. 
12202011 17:23:27 /usr/local/build/mysql/libexec/mysqld: 
Abnormal shutdown [18429] 
20-12-2011 17:23:19 database-host login[3866]: DEAD_PROCESS: 18429 
ttys000 
Process ID 
Root cause! 
Dec 20, 2011 17:22:14,,, message=Created virtual machine 
user-3 on esxi01.office.thedomain.com 
<134>Dec 20 2011 17:22:12: %PIX-6-106100: access-list 
inside_access_out denied tcp inside/68.162.72.163(4326) -> 
outside/45.200.244.124(3127) hit-cnt 1(first hit)
Anatomy of a log message: Five W’s
• When? Timestamp with time zone
• Where? Host, module, code location
• Who? Authentication context
• What? Log level and key-value pairs
Inhuman scale
• Logs: like “computer tweets”
• Twitter 2013*
  • Peak @ ~144k TPS
  • Avg ~6k tweets / second
• Log data
  • Example: 1 TB / day
  • Avg ~25k logs / second
* https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
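
As a rough sanity check on the log-data example, the two figures together imply an average message size. The sketch below assumes a decimal terabyte (10^12 bytes) and a uniform rate across the day; both are assumptions, not numbers from the talk.

```python
# Back-of-the-envelope check: what average message size do
# "1 TB / day" and "~25k logs / second" imply together?
BYTES_PER_DAY = 1e12            # assumption: decimal terabyte
SECONDS_PER_DAY = 24 * 60 * 60
LOGS_PER_SECOND = 25_000        # figure quoted on the slide

bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
bytes_per_log = bytes_per_second / LOGS_PER_SECOND

print(f"{bytes_per_second / 1e6:.1f} MB/s, ~{bytes_per_log:.0f} bytes per log")
```

Under these assumptions the stream averages a few hundred bytes per message, which is plausible for typical application log lines.
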
Inhuman complexity
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Leslie Lamport
[Figure: London Underground map]
All-too-human messiness and variety
• (wildly) varying formats
  • printf, JSON, XML, Windows, X-delimited, ...
• Specialized knowledge
[2008-05-07 09:50:08.450 'App' 3560 verbose] [VpxdHeartbeat] Invalid heartbeat from 10.17.218.46
Q: how to get human-scale insights from log data?
A: machine learning (and friends)
• Unsupervised pattern discovery
• Anomaly / outlier detection
• Supervised classification
• Time-series data modeling
• Graph analysis
• Probabilistic data structures
Too many logs! “data disorientation”
~60k results: 30 minutes, one component
Unsupervised clustering
• Given: set of items
• Do: group similar items
Distill logs down to underlying structure
Results “compressed” ~1000x
In the beginning, there was the printf()
printf("Health status check: %s is %s", hostid, hoststatus)
Log generation:
Health status check: zim-5 is OK
Health status check: gir-3 is OK
Health status check: gir-2 is TIMED OUT
Health status check: dib-1 is OK
Reverse engineering printf()
“magic”:
Health status check: *** is ***
Unsupervised clustering
• Given: log messages
• Do: group by “signature”
1. Define string distance function
2. Do distance-based clustering
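
The two steps can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the product: the token-level distance built on `difflib.SequenceMatcher`, the greedy single-pass clustering, and the 0.5 threshold are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher

WILDCARD = "***"

def distance(a, b):
    """Step 1: token-level string distance in [0, 1] (0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()

def merge(sig, toks):
    """Keep tokens shared by both sequences; collapse each differing run into one wildcard."""
    out = []
    matcher = SequenceMatcher(None, sig, toks, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(sig[i1:i2])
        elif not out or out[-1] != WILDCARD:
            out.append(WILDCARD)
    return out

def cluster(messages, threshold=0.5):
    """Step 2: greedy distance-based clustering; returns (signature, members) pairs."""
    clusters = []  # each entry: [signature tokens, member messages]
    for message in messages:
        toks = message.split()
        for entry in clusters:
            if distance(entry[0], toks) <= threshold:
                entry[0] = merge(entry[0], toks)
                entry[1].append(message)
                break
        else:
            clusters.append([toks, [message]])
    return [(" ".join(sig), members) for sig, members in clusters]

logs = [
    "Health status check: zim-5 is OK",
    "Health status check: gir-3 is OK",
    "Health status check: gir-2 is TIMED OUT",
    "Health status check: dib-1 is OK",
]
for signature, members in cluster(logs):
    print(len(members), signature)
```

On the four health-check messages this recovers a single signature with wildcards in the host and status positions. A single greedy pass depends on message order and on the threshold, so a production system would likely need something more robust.
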
Drill-down into the original raw logs
Partially supervised clustering
• Given: set of items + side info
• Do: group similar items
Too many wildcards!
“Hint” from human user
Not enough wildcards!
“Hint” from human user
unknown unknowns
Outlier detection
• Given: data points
• Do: identify outliers
Anomaly detection
• Given: log data
• Do: flag anomalies
Health check OK
Request processed
Txn timeout, retry
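
The slides do not say which model flags the anomalies. One simple possibility, sketched here purely as an assumption, is to count messages per signature in each time window and flag signatures whose current count is far from their historical mean. The function name, the 3-sigma cutoff, and the `min_std` floor are illustrative choices.

```python
from collections import Counter
from statistics import mean, pstdev

def flag_anomalies(history, current, n_sigmas=3.0, min_std=1.0):
    """history: list of Counters (signature -> count), one per past window.
    current: Counter for the window under test.
    Returns {signature: z_score} for signatures that deviate strongly."""
    signatures = set(current)
    for window in history:
        signatures.update(window)
    flagged = {}
    for sig in signatures:
        past = [window[sig] for window in history]  # missing signatures count as 0
        mu = mean(past)
        sigma = max(pstdev(past), min_std)          # floor avoids division by ~0
        z = (current[sig] - mu) / sigma
        if abs(z) > n_sigmas:
            flagged[sig] = z
    return flagged

history = [
    Counter({"Health check OK": 100, "Request processed": 50}),
    Counter({"Health check OK": 98, "Request processed": 55}),
    Counter({"Health check OK": 102, "Request processed": 45, "Txn timeout, retry": 1}),
]
current = Counter({"Health check OK": 101, "Request processed": 52, "Txn timeout, retry": 40})
print(flag_anomalies(history, current))
```

Here only the burst of "Txn timeout, retry" messages is flagged; the steady signatures stay within their usual range.
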
Investigate and annotate events
HUMAN
• timeline / alerts
• event
• signatures
• logs
RAW DATA
Supervised classification
• Given: labeled data points
• Do: predict future labels
Supervised classification
• Given: log data, annotated events
• Do: classify new occurrences
[Figure labels: event, timeline / alerts]
Connected components
• Given: nodes/edges
• Do: identify component
User action webID=7F92
Initiating requestID=082A for webID=7F92
…
…
orderID=34C8 received for requestID=082A
…
Retrieving userID=11D2 for requestID=082A
…
…
accountID=1234 access, userID=11D2
…
ERROR accountID=1234 not found!
PROCESSING FAILED: webID=7F92
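
A minimal sketch of how such a chain could be recovered: treat every log line and every `key=value` identifier as a node, add an edge whenever a line mentions an identifier, and group lines by connected component with union-find. The regular expression, the helper names, and the extra unrelated line (`webID=AAAA`) are assumptions made for illustration.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"\b(\w+ID)=(\w+)")  # assumed ID naming convention

def find(parent, x):
    """Return the representative of x's set, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def components(lines):
    """Group log lines that are linked, directly or transitively, by shared IDs."""
    parent = {}
    for i, line in enumerate(lines):
        line_node = ("line", i)
        parent.setdefault(line_node, line_node)
        for key, value in ID_PATTERN.findall(line):
            id_node = (key, value)
            parent.setdefault(id_node, id_node)
            union(parent, line_node, id_node)
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        groups[find(parent, ("line", i))].append(line)
    return list(groups.values())

lines = [
    "User action webID=7F92",
    "Initiating requestID=082A for webID=7F92",
    "orderID=34C8 received for requestID=082A",
    "Retrieving userID=11D2 for requestID=082A",
    "accountID=1234 access, userID=11D2",
    "ERROR accountID=1234 not found!",
    "PROCESSING FAILED: webID=7F92",
    "User action webID=AAAA",  # hypothetical unrelated session
]
for group in components(lines):
    print(len(group), group[0])
```

The ERROR line never mentions the web session, yet it lands in the same component because accountID, userID, and requestID link it back step by step.
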
Time-series detection
• Given: time-series metric data
• Do: identify unusual data pts
Examples: level change, spikes
“Bollinger bands”: rolling window approach
μ ± 3σ
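
A minimal sketch of the rolling-window band, assuming the mean μ and standard deviation σ are computed over the preceding `window` points and a point is flagged when it falls outside μ ± 3σ. The function name, window length, and sample series are illustrative.

```python
from collections import deque
from statistics import mean, pstdev

def bollinger_outliers(values, window=20, n_sigmas=3.0):
    """Return (index, value) pairs lying outside mean ± n_sigmas * std
    of the `window` points that precede them."""
    recent = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(recent) == window:       # wait until the window is full
            mu = mean(recent)
            sigma = pstdev(recent)
            if abs(x - mu) > n_sigmas * sigma:
                flagged.append((i, x))
        recent.append(x)
    return flagged

pattern = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
series = pattern * 3 + [30] + [10, 11, 9]
print(bollinger_outliers(series, window=10))  # -> [(30, 30)]
```

Note that a flagged spike still enters the window afterwards and widens the band for a while; whether to exclude flagged points from the statistics is a design choice this sketch does not address.
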
Top-K identification
• Given: stream of observations
• Do: identify k most frequent (WITH FIXED MEMORY!)
[Figure: item counts 4, 3, 2, 2, with the TOP 2 highlighted]
Count-Min Sketch (Cormode & Muthukrishnan, 2003)
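
A minimal Count-Min Sketch, paired with a small candidate table to track the top k. The table dimensions, the use of salted BLAKE2b as the hash family, and the candidate-replacement rule are assumptions for illustration; they are not taken from the talk.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash function per row, derived by salting a single digest.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))

def top_k(stream, k, sketch=None):
    """Track the k items with the largest estimated counts in fixed memory."""
    if sketch is None:
        sketch = CountMinSketch()
    candidates = {}  # at most k entries: item -> estimated count
    for item in stream:
        sketch.add(item)
        est = sketch.estimate(item)
        if item in candidates or len(candidates) < k:
            candidates[item] = est
        else:
            weakest = min(candidates, key=candidates.get)
            if est > candidates[weakest]:
                del candidates[weakest]
                candidates[item] = est
    return sorted(candidates.items(), key=lambda kv: -kv[1])

stream = ["a", "b", "a", "c", "b", "d", "a", "c", "b", "d", "a"]
print(top_k(stream, 2))
```

Memory is bounded by `width * depth` counters plus k candidates, regardless of stream length. Estimates can overcount because of hash collisions but never undercount.
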
Cardinality estimation
• Given: stream of observations
• Do: identify number of distinct items (WITH FIXED MEMORY!)
[Figure: stream of observations containing 4 distinct items]
HyperLogLog (Flajolet et al., 2007)
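
A compact HyperLogLog sketch, for illustration only. It assumes a 64-bit hash taken from SHA-1, 2^10 registers, and the standard bias constant for m ≥ 128 with a linear-counting fallback for small counts; the class and parameter names are hypothetical.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        index = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)                # bias constant, m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)              # linear counting, small range
        return raw

    def merge(self, other):
        """Combine two sketches of equal size by taking register-wise maxima."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"host-{i}")
    hll.add(f"host-{i}")    # duplicates do not change the estimate
print(round(hll.count()))
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%, using about a thousand small counters no matter how many items pass through.
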
Hooray! Monoid homomorphism!
[Figure: logs, logs, logs]
f(s1 + s2) = f(s1) ⊕ f(s2)
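
The identity can be checked concretely with an exact summary. In this sketch `f` is simply a `Counter` of signatures, "+" on streams is list concatenation, and ⊕ is `Counter` addition; these stand-ins are assumptions chosen to keep the example tiny. The same property holds for the sketches above: Count-Min tables combine by element-wise addition and HyperLogLog registers by element-wise maximum.

```python
from collections import Counter
from functools import reduce

def f(stream):
    """Summarize a stream of log signatures as counts."""
    return Counter(stream)

s1 = ["Health check OK", "Request processed", "Health check OK"]
s2 = ["Txn timeout, retry", "Health check OK"]

# Homomorphism: summarizing the concatenation equals combining the summaries.
assert f(s1 + s2) == f(s1) + f(s2)
# Identity element: the empty stream maps to the empty summary.
assert f([]) == Counter()

# Practical consequence: each shard can be summarized independently
# (for example on different machines) and the results combined afterwards.
shards = [s1, s2, ["Request processed"]]
combined = reduce(lambda a, b: a + b, map(f, shards), Counter())
print(combined.most_common())
```
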
FINAL OBLIGATORY PLUG
freesumo.com
