Pattern Mining: Extracting Value from Log Data

Pattern Mining: Getting the
most out of your log data.
Krishna Sridhar
Staff Data Scientist, Dato Inc.
@krishna_srd

• Background
- Machine Learning (ML) Research.
- Ph.D. in Numerical Optimization @Wisconsin
• Now
- Build ML tools for data scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me!

45+ and growing fast!
About Us!

Questions?
• (Now) We are monitoring the chat window.
• (Later) Email me at srikris@dato.com.
Webinars

Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data
exploration
data
modeling
Data Science Workflow
Ingest Transform Model Deploy

GraphLab Create
Train
Model
Pipeline
Deploy
Models
Serve
Requests
(REST API)
Monitor
Services
Get Live
Feedback
Update
Pipelines
Prototype &
Develop
Model
Pipelines
Update Live
Experiment
Deploy New Pipeline
Dato Predictive Services
Dato’s Products
Dato Distributed
We can help!

Log Journey
Lots of data
Insights Profits

Machine Learning in Logs
Source: Mining Your Logs - Gaining Insight Through Visualization

Frequent Pattern Mining
What sets of items were bought together?

Can we recommend items?
Rule Mining
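One common way to turn co-occurrence into a recommendation rule is to score "X → Y" by how often Y appears among the transactions that contain X. The sketch below assumes that scoring (usually called confidence); the slides do not name a specific measure, and the receipts are made-up illustration data.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t))


def rule_confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Hypothetical receipts, for illustration only.
receipts = [
    {"coffee", "bagel"},
    {"coffee", "bagel", "juice"},
    {"coffee"},
    {"tea", "bagel"},
]

# How often does a coffee purchase also include a bagel?
conf = rule_confidence({"coffee"}, {"bagel"}, receipts)
```

A rule with high confidence and enough support is a candidate recommendation: customers who bought the antecedent are offered the consequent.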

Log Mining: Feature Extraction

Feature Extraction
0 1 0 0 0 0 1 1 0
1 1 0 0 1 0 0 0 0
0 0 1 1 1 0
Receipt Space
Features in Menu Space
ML
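A minimal sketch of one way to build such 0/1 feature vectors: each mined pattern becomes one column, and a transaction gets a 1 when it contains the whole pattern. This encoding is an assumption for illustration; the function name and the example patterns are hypothetical.

```python
def pattern_features(transaction, patterns):
    """Return one 0/1 indicator per pattern: 1 if the transaction contains it."""
    items = set(transaction)
    return [1 if set(p) <= items else 0 for p in patterns]


# Hypothetical mined patterns; each one becomes a feature column.
patterns = [("B", "D"), ("C", "D"), ("A",)]

# {A, B, D} contains {B, D} and {A}, but not {C, D}.
features = pattern_features({"A", "B", "D"}, patterns)
```

The resulting vectors can be fed to any downstream ML model in place of the raw item lists.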

3 Useful Data Mining Tasks
Rule Mining, Pattern Mining, Feature Extraction

ML is not a black-box.
Transparency
Learning is also about understanding.
Interpretability
Whatever can go wrong, will go wrong.
Diagnosis
Moving on

Formulating Pattern Mining
N distinct items → 2^N itemsets

Find the top K most frequent sets of length at least L
that occur at least M times.

Find the top K most frequent sets of length at least L
that occur at least M times.
- K: max_patterns
- L: min_length
- M: min_support
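The problem statement can be written down directly as a brute-force baseline. This is only a sketch to make the three parameters concrete: it enumerates every subset of every transaction, which is exactly the exponential cost the later algorithms avoid. The function name is hypothetical; the parameter names mirror the ones above.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(transactions, max_patterns, min_length, min_support):
    """Brute force: count every subset of every transaction, then filter."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(min_length, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    # Most frequent first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda pc: (-pc[1], pc[0]))
    return frequent[:max_patterns]


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "D"},
    {"B", "C", "D"},
    {"B", "C", "D"},
]

top = top_k_patterns(db, max_patterns=3, min_length=2, min_support=4)
```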

Pattern Mining
N distinct items → 2^N itemsets

Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent

Principle 1: What is frequent?
A pattern is frequent if it occurs at least M times.
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{C, D}: 5 is frequent
M = 4
{A, D}: 3 is not frequent
min_support
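The frequency test is a containment count. A minimal sketch over the six example transactions, with hypothetical helper names:

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t))


def is_frequent(pattern, transactions, min_support):
    """A pattern is frequent if it occurs in at least `min_support` transactions."""
    return support(pattern, transactions) >= min_support


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "D"},
    {"B", "C", "D"},
    {"B", "C", "D"},
]
M = 4

cd_frequent = is_frequent({"C", "D"}, db, M)  # {C, D} meets the threshold
ad_frequent = is_frequent({"A", "D"}, db, M)  # {A, D} does not
```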

Principle 2: Apriori principle
A pattern is frequent only if all of its subsets are frequent
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
{B, C, D} : 4 is frequent therefore
{C, D} : 5 is frequent
{A} : 3 is not frequent therefore
{A, D} : 3 is not frequent
M = 4
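In practice the principle is used as a pruning test: a candidate can be discarded without counting it if any of its immediate subsets is already known to be infrequent. A sketch of that check, with a hypothetical helper name:

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_smaller):
    """True if some subset one item smaller than `candidate` is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_smaller
        for subset in combinations(candidate, k - 1)
    )


# With M = 4 on the example data, the frequent single items are B, C and D.
frequent_1 = {frozenset("B"), frozenset("C"), frozenset("D")}

prune_ad = has_infrequent_subset(frozenset("AD"), frequent_1)  # {A} is infrequent
prune_cd = has_infrequent_subset(frozenset("CD"), frequent_1)  # both subsets are frequent
```

Passing the check does not make a candidate frequent; it only means the candidate still has to be counted against the data.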

Two Main Algorithms
• Candidate Generation
- Apriori
- Eclat
• Pattern Growth
- FP-Growth
- TopK FP-Growth [GLC 1.6]

Lots of Generalizations
Source: http://www.philippe-fournier-viger.com/spmf/

Candidate Generation
Two phases
1. Candidate generation.
2. Candidate filtering.
Exploit Apriori Principle!

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}

{AB} : ? {AC} : ? {AD} : ? {BC} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}

{AB} : ? {AC} : ? {AD} : ? {BC} : 4 {BD} : 4 {CD} : 5
{A} : 3 {B} : 4 {C} : 5 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{B, C, D}
{A, C, D}
{A, B, C, D}
{A, D}
{B, C, D}
{B, C, D}
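The level-by-level fill of the lattice above can be sketched as a small Apriori-style loop. This is an illustrative implementation under simple assumptions (in-memory data, naive join step), not the code of any particular library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset pattern: count} for all frequent patterns."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    level = {frozenset([item]) for item in items}  # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # Candidate filtering: one pass over the data per level.
        counted = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counted.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join survivors into sets of size k + 1 and
        # keep only those whose size-k subsets all survived (Apriori principle).
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        }
    return frequent


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "D"},
    {"B", "C", "D"},
    {"B", "C", "D"},
]

result = apriori(db, min_support=4)
```

Because {A} fails at the first level, no candidate containing A is ever generated or counted.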

Pattern Growth
Two phases
1. Candidate filtering.
2. Conditional database construction.
Avoid full scans over the data & large
candidate sets!

Pattern Growth - Depth First
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : 1 {AC} : 2 {AD} : 3 {BD} : 4 {CD} : 4
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : 0 {ABD} : 1 {ACD} : 2 {BCD} : 2
{BC} : 2

Pattern Growth - Preprocessing
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6

{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : ? {B} : ? {C} : ? {D} : ?
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?

{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : ? {AC} : ? {AD} : ? {BD} : ? {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : ?

{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : 2

Pattern Growth
{B} : 4
{ } : 6
Call: Growth(db = DB{}, item = B, freq = {B,C,D})
DB{}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}

Pattern Growth
{B} : 4
{ } : 6
Conditional Database Construction
DB{} DB{B}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
{C, D}
{D}
{C, D}
{D}
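Building DB{B} amounts to keeping the transactions that contain B and stripping them down to the other frequent items. A minimal sketch, with a hypothetical function name:

```python
def conditional_database(db, item, frequent_items):
    """Transactions containing `item`, restricted to the other frequent items."""
    keep = set(frequent_items) - {item}
    return [set(t) & keep for t in db if item in t]


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

# With M = 4 the frequent items are B, C and D; A is dropped.
db_b = conditional_database(db, "B", {"B", "C", "D"})
```

In a full recursion the kept items are usually also limited to those not yet used as a prefix, so that each pattern is generated only once.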

Pattern Growth
{B} : 4
{ } : 6
Candidate Filtering
DB{B}
{C, D}
{D}
{C, D}
{D}
{D} : 4
{C} : 2
DB{}
{B, C, D}
{A, C, D}
{B, D}
{A, C, D}
{B, C, D}
{A, B, D}
DB{B}
Add {BD} as frequent

Pattern Growth - Depth First
{C, D}
{D}
{C, D}
{D}
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : ? {ACD} : ? {BCD} : ?
{BC} : 2

Pattern Growth
Recurse: Growth(db = DB{B}, item = D, freq = {D})
DB{B}
{C, D}
{D}
{C, D}
{D}
{B} : 4
{ } : 6
{BD} : 4
DB{BD}

Pattern Growth - Depth First
{AB} : X {AC} : ? {AD} : ? {BD} : 4 {CD} : ?
{A} : 3 {B} : 4 {C} : 4 {D} : 6
{ } : 6
{ABC} : ? {ABD} : X {ACD} : ? {BCD} : X
{BC} : 2
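The walk-through above can be condensed into a short recursive sketch. It works on plain lists of sets rather than an FP-tree, so it illustrates the two phases (filter, then build conditional databases) without the compression; the function name is hypothetical.

```python
from collections import Counter


def pattern_growth(db, min_support, prefix=()):
    """Depth-first pattern growth. Returns {pattern tuple: count}."""
    # Candidate filtering: count items in the current (conditional) database.
    counts = Counter(item for t in db for item in t)
    frequent_items = sorted(i for i, n in counts.items() if n >= min_support)
    patterns = {}
    for idx, item in enumerate(frequent_items):
        pattern = prefix + (item,)
        patterns[pattern] = counts[item]
        # Conditional database construction: keep transactions with `item`,
        # restricted to frequent items that come later in the order.
        later = set(frequent_items[idx + 1:])
        conditional = [set(t) & later for t in db if item in t]
        patterns.update(pattern_growth(conditional, min_support, pattern))
    return patterns


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

found = pattern_growth(db, min_support=4)
```

Each recursive call only touches the shrinking conditional database, so {BC} is rejected inside DB{B} and no superset of it is ever examined.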

Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.

Compare & Contrast
• Candidate Generation
+ Better than brute force
+ Filters candidate sets
- Multiple passes over the data
• Pattern Growth
+ Fewer passes over the data
+ Space efficient.
Better choice

FP-Tree Compression
Figures From Florian Verhein’s Slides on FP-Growth
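The compression comes from storing transactions in a prefix tree: items are sorted by descending frequency, so common prefixes share nodes. The sketch below builds only the tree and its counts; the header table and node links of a full FP-tree are omitted, and the class and function names are hypothetical.

```python
from collections import Counter


class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(db, min_support):
    """Insert each transaction's frequent items, most frequent first."""
    counts = Counter(item for t in db for item in set(t))
    root = Node(None)
    for t in db:
        kept = [i for i in set(t) if counts[i] >= min_support]
        kept.sort(key=lambda i: (-counts[i], i))
        node = root
        for item in kept:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def tree_size(node):
    """Number of item nodes below `node`."""
    return sum(1 + tree_size(child) for child in node.children.values())


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

root = build_fp_tree(db, min_support=4)
nodes = tree_size(root)
```

On this data the six transactions hold 14 frequent item occurrences in total, but the tree needs only a handful of nodes because every transaction starts with D.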

FP-Growth Algorithm
Figures From Florian Verhein’s Slides on FP-Growth
Two phases
1. Candidate filtering.
2. Conditional database construction.

TopK FP-Growth Algorithm
Similar to FP-Growth
1. Dynamically raise min_support.
2. Estimates of min_support greatly help.
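One way to raise min_support dynamically is to keep the best K patterns in a min-heap and, once the heap is full, use its smallest count as the new threshold. The sketch below applies that idea to a simple list-based pattern growth; it is an assumption about the general technique, not the GraphLab Create implementation, and it ignores min_length.

```python
import heapq
from collections import Counter


def top_k_growth(db, k, min_support=1):
    """Return the k most frequent patterns as [(pattern, count)], best first."""
    heap = []  # min-heap of (count, pattern) holding at most k entries
    threshold = min_support

    def grow(cond_db, prefix):
        nonlocal threshold
        counts = Counter(item for t in cond_db for item in t)
        # Visit the most frequent items first so the threshold rises early.
        items = sorted(counts, key=lambda i: (-counts[i], i))
        for idx, item in enumerate(items):
            if counts[item] < threshold:
                break  # all remaining items are no more frequent
            pattern = prefix + (item,)
            entry = (counts[item], pattern)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            else:
                heapq.heappushpop(heap, entry)
            if len(heap) == k:
                # Dynamically raise min_support to the k-th best count so far.
                threshold = max(threshold, heap[0][0])
            later = {i for i in items[idx + 1:] if counts[i] >= threshold}
            conditional = [set(t) & later for t in cond_db if item in t]
            grow(conditional, pattern)

    grow([set(t) for t in db], ())
    return [(tuple(sorted(p)), c) for c, p in sorted(heap, reverse=True)]


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

best = top_k_growth(db, k=3)
```

Passing a good initial `min_support` estimate lets the pruning start before the heap fills, which is why such estimates help.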

Performance on Website Logs
• 1.5m events
• 84k sessions
• 3k unique ids

Distributed FP-Growth
Partition database on item-ids.
Database
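A rough sketch of what partitioning on item ids could look like: each frequent item's conditional database is self-contained, so items can be assigned to workers and mined independently. The round-robin assignment and the function name are assumptions for illustration; the slide does not describe the actual scheme.

```python
def partition_by_item(db, frequent_items, num_workers):
    """Assign each frequent item's conditional database to one worker shard."""
    order = sorted(frequent_items)
    shards = [dict() for _ in range(num_workers)]
    for idx, item in enumerate(order):
        # Restrict to later items so that no pattern is mined on two workers.
        later = set(order[idx + 1:])
        conditional = [set(t) & later for t in db if item in t]
        shards[idx % num_workers][item] = conditional
    return shards


db = [
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

shards = partition_by_item(db, {"B", "C", "D"}, num_workers=2)
```

Each worker can then run pattern growth on its own shard and the results are simply concatenated.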

Bags + Sequences
× 2
Itemset: {Item}
Bags: {Item: quantity}
Sequences: (Item)
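The three representations differ in what they keep from the raw event stream. A small sketch using Python built-ins, with a made-up session for illustration:

```python
from collections import Counter

# Hypothetical page-view session.
session = ["home", "search", "home", "checkout"]

as_itemset = frozenset(session)   # which items occurred
as_bag = Counter(session)         # which items occurred, and how many times
as_sequence = tuple(session)      # which items occurred, in what order
```

Itemsets discard both quantity and order, bags keep quantity, and sequences keep order.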

Summary
Log Data Mining
≠
Rocket Science
• FP-Growth for finding frequent patterns.
• Find rules from patterns to make predictions.
• Extract features for useful ML in pattern space.

SELECT questions FROM audience
WHERE difficulty == “Easy”
Thanks!
