Data Analysis - Making Big Data Work

Data Analysis Making Big Data Work
David Chiu
2014/11/24

About Me
Founder of LargitData
Ex-Trend Micro Engineer
ywchiu.com

Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

To Be a Data Scientist, You Need to Know That Much? Seriously?

Statistics
Single Variable, Multi Variable, ANOVA
Data Munging
Data Extraction, Transformation, Loading
Data Visualization
Figure, Business Intelligence
Required Skills

What You Probably Need Is A Team
Business Analyst: Knowing how to use different tools under different circumstances
Statistician: How to process big data?
DBA: How to deal with unstructured data?
Software Engineer: Knowing how to use statistics

Four Dimensions
Single Machine: Memory, R, Local File
Cloud Distributed: Hadoop, HDFS
Statistics: Analysis, Linear Algebra
Architect: Management, Standard
Concept: MapReduce, Linear Algebra, Logistic Regression
Tool: Hadoop, PostgreSQL, R
Analyst: How to use these tools
Hackers: R, Python, Java

“80% are doing summing and averaging”
Content
1. Data Munging
2. Data Analysis
3. Interpret Results
What Do Data Scientists Do?

Applications of Data Analysis
Text Mining
Classify Spam Mail
Build Index
Data Search Engine
Social Network Analysis
Finding Opinion Leaders
Recommendation System
What do users like?
Opinion Mining
Positive/Negative Opinion
Fraud Analysis
Credit Card Fraud

Feed data to the computer
Make the Computer Do the Analysis

Predictive Analysis
Learn from experience (data) to predict future behavior
What to Predict?
e.g. Who is likely to click on that ad?
For What?
e.g. To decide which ad to show, based on click probability and revenue.
Predictive Analysis

Customers who buy beer also buy Pampers?
People who browse telephone rate plans are likely to switch vendors
People who belong to the same group tend to use the same telecom vendor
Surprising Conclusions

A predictive model uses a person's characteristics and behavior to generate a probability score: the higher the score, the more likely the behavior.
Predictive Model

Linear Model
e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the click probability (15%) gives the score (or probability) for each group:
Female 0.90 × 15% = 13.5%, Male 0.10 × 15% = 1.5%
Rule Model
e.g.
If the user is female
and her income is over 30k
and she hasn't seen the ad yet
then the click rate is 11%
Simple Predictive Model
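
The two simple models above can be sketched in a few lines of Python. This is only an illustration: the function names and the income figure in the usage lines are hypothetical, while the weights, the 15% click probability, and the 11% rule come from the slide.

```python
# Weights and base click probability from the cosmetics-ad example.
BASE_CLICK_PROBABILITY = 0.15
GENDER_WEIGHTS = {"female": 0.90, "male": 0.10}


def linear_score(gender):
    """Linear model: segment weight times the base click probability."""
    return GENDER_WEIGHTS[gender] * BASE_CLICK_PROBABILITY


def rule_score(gender, income, has_seen_ad):
    """Rule model: 0.11 when every condition holds, None when the rule does not fire."""
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None


female_score = round(linear_score("female"), 4)
male_score = round(linear_score("male"), 4)
matched_rule = rule_score("female", 45_000, has_seen_ad=False)  # hypothetical user
unmatched_rule = rule_score("male", 45_000, has_seen_ad=False)  # hypothetical user
print(female_score, male_score, matched_rule, unmatched_rule)
```

The linear model gives every user a score, while the rule model only speaks for users who match all of its conditions.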

Induction
From the specific to the general
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
-- Tom Mitchell (1998)
Discover an effective model
Start from a simple model
Update the model as new data is fed in
Keep improving predictive power
Machine Learning

Statistical Analysis
Regression Analysis
Clustering
Classification
Recommendation
Text Mining
Applications

Decision Tree
Rate > 1,299/Month
Probability to switch vendor 15%
Yes
No

Decision Tree
Rate > 1,299/Month
Yes
No
Income>22,000
Yes
No

Decision Tree
Rate > 1,299/Month
Yes
No
Income>22,000
Yes
No
Free for intranet
Yes
No
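
A decision tree like the one built up over these slides is just a set of nested yes/no questions. The sketch below shows one way to encode and walk such a tree. The three questions come from the slides; the branch layout, the field names, the reading of "Free for intranet" as free in-network calls, and every leaf probability are hypothetical placeholders, not results from real data.

```python
# Internal nodes ask a yes/no question; leaves hold a probability of switching vendor.
# All leaf values and the branch layout are hypothetical placeholders.
TREE = {
    "question": lambda c: c["monthly_rate"] > 1299,
    "yes": {
        "question": lambda c: c["income"] > 22000,
        "yes": {"leaf": 0.10},
        "no": {
            "question": lambda c: c["free_in_network_calls"],
            "yes": {"leaf": 0.20},
            "no": {"leaf": 0.40},
        },
    },
    "no": {"leaf": 0.05},
}


def predict(node, customer):
    """Walk from the root to a leaf, following the answer to each question."""
    while "leaf" not in node:
        node = node["yes"] if node["question"](customer) else node["no"]
    return node["leaf"]


customer = {"monthly_rate": 1500, "income": 18000, "free_in_network_calls": False}
switch_probability = predict(TREE, customer)
print(switch_probability)
```

Each added question splits one group of customers into two smaller groups, which is how the tree grows from the first slide to the third.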

Supervised Learning
Regression
Classification
Unsupervised Learning
Dimension Reduction
Clustering
Machine Learning

Classification
e.g. Predicting a bull or bear stock market
Regression
e.g. Price prediction
Supervised Learning

Dimension Reduction
e.g. Making a new index
Clustering
e.g. Customer Segmentation
Unsupervised Learning

Lift
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design a strategy for each persona?
The lift for 4 campaigns?
The lift for 20 campaigns?
Lift
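
Lift is commonly defined as the response rate inside a targeted segment divided by the overall response rate. A minimal sketch of that calculation follows; the function name and all counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def lift(segment_responders, segment_size, total_responders, total_size):
    """Lift = response rate inside the targeted segment / overall response rate."""
    segment_rate = segment_responders / segment_size
    overall_rate = total_responders / total_size
    return segment_rate / overall_rate


# Hypothetical campaign: 1,000 customers, 150 of whom would switch vendor.
# A decision rule selects 200 customers, 90 of whom are switchers.
campaign_lift = lift(segment_responders=90, segment_size=200,
                     total_responders=150, total_size=1000)
print(round(campaign_lift, 2))
```

A lift of 3 means the targeted segment responds three times as often as a random sample of the same size would.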

Can we use the production rate of butter to predict the stock market?
Overfitting

Treating noise as information
Over-assumption
Over-interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test.
Overfitting

Testing Model
Use external data or a held-out portion of the data as the testing dataset
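
Why held-out data matters can be shown with a deliberately overfitted "model" that simply memorizes its training answers. The records below are hypothetical, and the labels follow an arbitrary fixed pattern so that the example is deterministic.

```python
from collections import Counter

# Hypothetical records: (customer_id, churned). Labels follow an arbitrary pattern.
records = [(i, i % 3 == 0) for i in range(100)]
train, test = records[:70], records[70:]

# Overfitted "model": memorize every training answer, guess the majority class otherwise.
memory = dict(train)
majority = Counter(label for _, label in train).most_common(1)[0][0]


def memorizer(customer_id):
    """Return the memorized label, or the training majority for unseen customers."""
    return memory.get(customer_id, majority)


def accuracy(predict, rows):
    """Share of rows whose predicted label equals the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


train_accuracy = accuracy(memorizer, train)
test_accuracy = accuracy(memorizer, test)
print(train_accuracy, round(test_accuracy, 3))
```

The perfect score on the training rows says nothing about new customers; only the score on the held-out rows does.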

Statistics On The Fly
Built-in Math and Graphics Functions
Free and Open Source
http://cran.r-project.org/src/base/
R Language

Functional Programming
Use Function Definition To Retrieve Answer
Interpreted Language
Statistics On the Fly
Object Oriented Language
S3 and S4 Methods
R Language

Most Used Analytic Language
The most popular languages are R, Python (39%), SQL (37%), and SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.

Kaggle
http://www.kaggle.com/
Most often used language in Kaggle competitions

Data Scientists at Google and Apple Use R
What is your programming language of choice, R, Python or something else?
“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.”
http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred”
http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook

Which customers are likely to churn?
Customer Churn Analysis

Account Information
state
account length
area code
phone number
User Behavior
international plan
voice mail plan, number vmail messages
total day minutes, total day calls, total day charge
total eve minutes, total eve calls, total eve charge
total night minutes, total night calls, total night charge
total intl minutes, total intl calls, total intl charge
number customer service calls
Target
Churn (Yes/No)
Data Description

> install.packages("C50")
> library(C50)
> data(churn)
> str(churnTrain)
> churnTrain = churnTrain[,! names(churnTrain) %in% c("state", "area_code", "account_length") ]
> set.seed(2)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]
Split data into training and testing dataset
70% as training dataset
30% as testing dataset
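
For readers who work in Python, the same splitting idea can be sketched with the standard library. As in the R call to sample(), each row is assigned independently, so the split is only approximately 70/30. The function name is hypothetical, the row list is a stand-in for the churn data, and the seed will not reproduce R's random numbers.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=2):
    """Assign each row independently to train (probability train_fraction) or test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for the rows of the churn data
trainset, testset = split_rows(rows)
print(len(trainset), len(testset))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models.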

library(rpart)
churn.rp <- rpart(churn ~ ., data=trainset)
plot(churn.rp, margin= 0.1)
text(churn.rp, all=TRUE, use.n = TRUE)
Build Classifier
Classification

> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)
Prediction Result
           pred
actual     no    yes
  no       859   18
  yes      41    100

> library(caret)
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions yes  no
        yes 100  18
        no   41 859
               Accuracy : 0.942
                 95% CI : (0.9259, 0.9556)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7393
 Mcnemar's Test P-Value : 0.004181
            Sensitivity : 0.70922
            Specificity : 0.97948
         Pos Pred Value : 0.84746
         Neg Pred Value : 0.95444
             Prevalence : 0.13851
         Detection Rate : 0.09823
   Detection Prevalence : 0.11591
      Balanced Accuracy : 0.84435
       'Positive' Class : yes
Use Confusion Matrix
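
Most of the statistics in that output follow directly from the four counts in the confusion matrix. The short sketch below recomputes them by hand, using the counts from the slide with "yes" (churn) as the positive class; the variable names are my own.

```python
# Counts from the confusion matrix; "yes" (churn) is the positive class.
tp, fp = 100, 18   # predicted yes: actually yes / actually no
fn, tn = 41, 859   # predicted no:  actually yes / actually no

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)      # share of real churners that were caught
specificity = tn / (tn + fp)      # share of non-churners correctly left alone
pos_pred_value = tp / (tp + fp)   # share of predicted churners that really churn
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

print(round(accuracy, 3), round(sensitivity, 5), round(specificity, 5))
print(round(pos_pred_value, 5), round(neg_pred_value, 5), round(balanced_accuracy, 5))
```

The gap between 94% accuracy and 71% sensitivity shows why accuracy alone is a weak measure when only about 14% of customers churn.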

Use Testing Data to Validate Result
library(ROCR)
# Predicted class probabilities for the testing dataset
predictions <- predict(churn.rp, testset, type="prob")
pred.to.roc <- predictions[, 1]
# The last column of testset holds the churn label
pred.rocr <- prediction(pred.to.roc, as.factor(testset[,(dim(testset)[[2]])]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")
plot(perf.tpr.rocr, colorize=TRUE,main=paste("AUC:",(perf.rocr@y.values)))
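
The AUC reported in the plot title can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way by brute force; the scores and labels are hypothetical, and this is an illustration of the idea, not a description of how ROCR computes it internally.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn probabilities and true outcomes (True = churned).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]
model_auc = auc(scores, labels)
print(model_auc)
```

An AUC of 0.5 means the scores rank churners no better than chance, and 1.0 means every churner is ranked above every non-churner.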

Finding the Most Important Variable
library(rminer)
model=fit(churn~.,trainset,model="svm")
VariableImportance=Importance(model,trainset,method="sensv")
L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)

Dynamic Language
Execution at runtime
Dynamic Type
Interpreted Language
See the result after execution
OOP
Python Language

Cross Platform (Python VM)
Third-Party Resources
(Data Analysis, Graphics, Website Development)
Simple and easy to learn
Benefits of Python

Data Analysis
SciPy
NumPy
scikit-learn
pandas

Use InfoLite Tool To Extract DOM

Use Python To Build Up Dashboard

Monitor Social Media and News
Monitor posts on social media
Configure keywords and alerts
Use a line plot to show daily post statistics
Apple Daily, NOWnews, udn, CNA, Storm Media, and other financial media
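
The core of such a monitor is small: count the posts that mention a keyword on each day, then raise an alert when the count reaches a threshold. The sketch below uses only the Python standard library. The posts, function names, and threshold are hypothetical; a real monitor would read crawled posts instead of a hard-coded list.

```python
from collections import Counter
from datetime import date

# Hypothetical posts: (publication date, text).
posts = [
    (date(2014, 11, 22), "Big data conference announced"),
    (date(2014, 11, 22), "New phone released"),
    (date(2014, 11, 23), "Why big data matters for retail"),
    (date(2014, 11, 23), "Big Data jobs are growing"),
    (date(2014, 11, 24), "Weather update"),
]


def daily_counts(posts, keyword):
    """Count posts per day whose text contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return Counter(day for day, text in posts if keyword in text.lower())


def alerts(counts, threshold):
    """Return the days on which the keyword count reached the threshold."""
    return sorted(day for day, n in counts.items() if n >= threshold)


counts = daily_counts(posts, "big data")
alert_days = alerts(counts, threshold=2)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

The per-day counts are exactly the series a line plot of daily post statistics would display.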

Configure Alerts and Keywords

Have You Learned Big Data?

Product Centric
Customer Centric
Product Centric vs. Customer Centric

Customer Centric?
http://goo.gl/iuy4lY

Knowing Who You Are?
Personal recommendation
Customer relationship management
Knowing What the Future Looks Like?
From the history, we can see the future
Predictive analysis
Knowing What is Hidden Beneath?
Correlation, Correlation, Correlation
So… What is Big Data?

Apache Project – From Yahoo
Features
Extensible
Cost Effective
Flexible
Highly Fault Tolerant
Hadoop

Hadoop Ecosystem
HDFS
MR
IMPALA
HBASE
PIG
HIVE
SQOOP FLUME
HUE, Oozie, Mahout

Tools for different scale
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, Hbase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout

Recommendation System
Javascript
Flume
HDFS
HBase
Pig
Mahout

Use Flume To Collect Streaming Data
From /tmp/postlog.txt To /user/cloudera/flume

JSON sample data
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
Demo Code
second_table = LOAD 'second_table.json'
USING JsonLoader('food:chararray, person:chararray, amount:int');
Use Pig To Load JSON
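
To check what the Pig loader should produce, the same three sample records can be parsed on a single machine with Python's json module. This is only a local sanity check under the schema declared above (food and person as strings, amount as an integer), not a replacement for the Pig job.

```python
import json

# The three sample records from the slide, one JSON object per line.
raw = """{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}"""

# Parse each line and enforce the field types that the Pig schema declares.
rows = []
for line in raw.splitlines():
    record = json.loads(line)
    rows.append((str(record["food"]), str(record["person"]), int(record["amount"])))

total_amount = sum(amount for _, _, amount in rows)
print(rows)
print(total_amount)
```

Each tuple corresponds to one row of the relation that the LOAD statement defines.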

$ hbase shell
> create 'mydata', 'mycf'
Build Table In HBase

Use Pig To Transfer Data Into HBase

Focus on algorithms
Divide and Conquer, Trie, Collaborative Filtering
Be an expert in a single programming language
But know which tools and algorithms you can use to solve your problem
Define your role
Statistician
Software engineer
What You Should Do
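
As an example of one algorithm named above, here is a minimal sketch of user-based collaborative filtering in plain Python. The ratings, user names, and function names are hypothetical, and cosine similarity is just one of several reasonable similarity choices.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "alice": {"tacos": 5, "soup": 3, "cheese": 4},
    "sarah": {"tacos": 4, "soup": 3},
    "alex": {"soup": 5, "cheese": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("sarah"))
```

At larger scale the same idea is what a tool such as Mahout runs in distributed form; knowing the algorithm makes it easier to pick the tool.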

Website:
largitdata.com
ywchiu.com
Email:
david@largitdata.com
tr.ywchiu@gmail.com
Contacts
