Data Analysis Making Big Data Work 
David Chiu 
2014/11/24
About Me 
Founder of LargitData 
Ex-Trend Micro Engineer 
ywchiu.com
Big Data & Data Science
US Election Prediction 
World Cup Prediction
Hurricane Prediction
Data Science 
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
As a Data Scientist, Do You Really Need to Know That Much? Seriously?
Statistics
Single Variable, Multi Variable, ANOVA
Data Munging 
Data Extraction, Transformation, Loading 
Data Visualization 
Figure, Business Intelligence 
Required Skills
What You Probably Need Is A Team 
Business Analyst: knowing how to use different tools under different circumstances
Statistician: how to process big data?
DBA: how to deal with unstructured data?
Software Engineer: knowing how to use statistics
Four Dimensions
Single Machine Memory R Local File 
Cloud Distributed Hadoop HDFS 
Statistics Analysis Linear Algebra 
Architect Management Standard 
Concept MapReduce Linear Algebra Logistic Regression 
Tool Hadoop PostgreSQL R 
Analyst How to use these tools 
Hackers R Python Java
“80% are doing summing and averaging” 
Content 
1. Data Munging
2. Data Analysis
3. Interpret Results
What Do Data Scientists Do?
Application of Data Analysis 
Text Mining 
Classify Spam Mail 
Build Index 
Data Search Engine 
Social Network Analysis 
Finding Opinion Leader 
Recommendation System 
What user likes? 
Opinion Mining 
Positive/Negative Opinion 
Fraud Analysis 
Credit Card Fraud
Feed data to the computer
Make the Computer Do the Analysis
Let the Computer Predict for You
Predictive Analysis 
Learn from experience (data) to predict future behavior
What to Predict? 
e.g. Who is likely to click on that ad? 
For What? 
e.g. To decide which ad to show, based on click probability and revenue.
Predictive Analysis
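One way to read "decide which ad to show from click probability and revenue" is to rank ads by expected revenue per impression. The sketch below assumes that reading; the ad names, probabilities, and revenue figures are hypothetical and are not taken from the slides.

```python
# Hypothetical candidate ads: click probabilities and revenue per click
# are invented for illustration only.
ads = {
    "cosmetics": {"click_prob": 0.15, "revenue_per_click": 2.0},
    "telecom": {"click_prob": 0.05, "revenue_per_click": 8.0},
    "snacks": {"click_prob": 0.20, "revenue_per_click": 1.0},
}


def expected_revenue(ad):
    """Expected revenue of one impression: P(click) * revenue per click."""
    return ad["click_prob"] * ad["revenue_per_click"]


def choose_ad(candidates):
    """Return the name of the ad with the highest expected revenue."""
    return max(candidates, key=lambda name: expected_revenue(candidates[name]))


best = choose_ad(ads)
print(best, expected_revenue(ads[best]))
```

Note that the ad with the highest click probability is not necessarily the one chosen, because revenue per click also enters the score.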
Customers who buy beer also buy Pampers?
People who browse phone rate plans are likely to switch vendors
People in the same group tend to use the same telecom vendor
Surprising Conclusions
Based on personal behavior, a predictive model uses personal characteristics to generate a probabilistic score: the higher the score, the more likely the behavior.
Predictive Model
Linear Model 
e.g. For a cosmetics ad, we can give a 90% weight to female customers and a 10% weight to male customers. Multiplying each weight by the overall click probability (15%) gives the probability score:
Female 13.5%, Male 1.5%
Rule Model
e.g.
If the user is female
and income is over 30k
and the user hasn't seen the ad yet
then the click rate is 11%
Simple Predictive Model
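The two toy models on this slide can be sketched in a few lines. The numbers (90%/10% weights, 15% click probability, 30k income threshold, 11% click rate) come from the slide; the function and parameter names are illustrative, and returning `None` when the rule does not fire is an assumption, since the slide gives no rate for the other cases.

```python
BASE_CLICK_PROB = 0.15                          # overall click probability
SEGMENT_WEIGHT = {"female": 0.9, "male": 0.1}   # weights per segment


def linear_score(gender):
    """Linear model: segment weight times the overall click probability."""
    return SEGMENT_WEIGHT[gender] * BASE_CLICK_PROB


def rule_score(gender, income, has_seen_ad):
    """Rule model: a fixed click rate when every condition holds."""
    if gender == "female" and income > 30_000 and not has_seen_ad:
        return 0.11
    return None  # assumption: no rate is defined outside the rule


print(linear_score("female"), linear_score("male"))
print(rule_score("female", 35_000, has_seen_ad=False))
```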
Induction 
From the specific to the general
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E 
-- Tom Mitchell (1998) 
Discover an effective model 
Start from a simple model 
Update the model as more data is fed in
Keep improving predictive power
Machine Learning
Statistical Analysis
Regression Analysis 
Clustering 
Classification 
Recommendation 
Text Mining 
Application 
Image recognition
Decision Tree 
Rate > 1,299/Month 
Probability to switch vendor 15% 
Probability to switch vendor 3% 
Yes 
No
Decision Tree 
Rate > 1,299/Month 
Probability to switch vendor 3% 
Yes 
No 
Probability to switch vendor 10% 
Probability to switch vendor 22% 
Income>22,000 
Yes 
No
Decision Tree 
Rate > 1,299/Month 
Yes 
No 
Probability to switch vendor 10% 
Probability to switch vendor 22% 
Income>22,000 
Yes 
No 
Probability to switch vendor 1% 
Probability to switch vendor 7% 
Free for intranet 
Yes 
No
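A decision tree is just nested if/else tests ending in a leaf probability. The sketch below encodes the final tree from the slide. The thresholds and leaf probabilities are from the slide, but which leaf hangs off the Yes branch and which off the No branch is an assumption, because the diagram layout does not survive in the text; the parameter names are illustrative as well.

```python
def switch_probability(monthly_rate, income, free_in_network_calls):
    """Walk the tree from the root and return the leaf probability.

    Assumed branch layout (not recoverable from the slide text):
      rate > 1,299: income > 22,000 -> 10%, otherwise 22%
      rate <= 1,299: free in-network calls -> 1%, otherwise 7%
    """
    if monthly_rate > 1299:
        if income > 22000:
            return 0.10
        return 0.22
    if free_in_network_calls:
        return 0.01
    return 0.07


print(switch_probability(1500, 20000, free_in_network_calls=False))
```

Each added split refines one leaf of the previous slide's tree: the 15% leaf becomes the income split, and the 3% leaf becomes the in-network split.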
Supervised Learning 
Regression 
Classification 
Unsupervised Learning 
Dimension Reduction 
Clustering 
Machine Learning
Supervised Learning
Classification 
e.g. Stock prediction on bull/bear market 
Regression 
e.g. Price prediction 
Supervised Learning
Dimension Reduction 
e.g. Making a new index 
Clustering 
e.g. Customer Segmentation 
Unsupervised Learning
Lift 
The better the lift, the greater the cost?
The more decision rules, the more campaigns?
Design strategies for different personas?
The lift for 4 campaigns?
The lift for 20 campaigns?
Lift
Can we use butter production to predict the stock market?
Overfitting
Treating noise as information
Over-assumption
Over-interpretation
What an overfitted model learns is not the truth
Like memorizing all the answers to a single test.
Overfitting
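The "memorize all the answers" analogy can be made concrete with a model that is nothing but a lookup table. In this sketch the labels are coin flips generated with a fixed seed (synthetic data, not from the talk), so there is nothing real to learn; the `Memorizer` class and its fallback answer of "no" are illustrative assumptions.

```python
import random


def make_noise(ids, seed):
    """Label each id with a coin flip, so the id carries no information."""
    rng = random.Random(seed)
    return {i: rng.choice(["yes", "no"]) for i in ids}


class Memorizer:
    """A 'model' that stores every training answer verbatim."""

    def fit(self, labelled):
        self.answers = dict(labelled)
        return self

    def predict(self, key):
        # Unseen keys fall back to a fixed guess.
        return self.answers.get(key, "no")


def accuracy(model, labelled):
    hits = sum(model.predict(key) == label for key, label in labelled.items())
    return hits / len(labelled)


train = make_noise(range(0, 200), seed=1)
test = make_noise(range(200, 400), seed=2)
model = Memorizer().fit(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Training accuracy is perfect by construction, while accuracy on unseen ids can be no better than guessing, which is why the next slide insists on a separate testing dataset.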
Testing Model 
Use external data or a held-out part of the data as the testing dataset
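Holding out part of the data can be sketched as a shuffle followed by a cut. This is a minimal illustration assuming a 70/30 split; the helper name `train_test_split`, the seed, and the stand-in rows are hypothetical, and this is not the procedure used by any particular library.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=2):
    """Shuffle a copy of rows, then cut it into training and testing parts."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                # stand-in for 100 customer records
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is touched once, at the end, to estimate how well the model generalizes.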
Traditional Analysis Tool
Statistics On The Fly 
Built-in Math and Graphics Functions
Free and Open Source 
http://cran.r-project.org/src/base/ 
R Language 
Functional Programming 
Use Function Definition To Retrieve Answer 
Interpreted Language 
Statistics On the Fly 
Object Oriented Language 
S3 and S4 Methods
R Language
Most Used Analytic Language 
The most popular languages are R, Python (39%), SQL (37%), and SAS (20%).
By Gregory Piatetsky, Aug 27, 2013.
Kaggle 
http://www.kaggle.com/ 
Most often used language in Kaggle competitions
Data Scientists at Google and Apple Use R
What is your programming language of choice, R, Python or something else? 
“I use R, and occasionally matlab, for data analysis. There is a large, active and extremely knowledgeable R community at Google.” 
http://simplystatistics.org/2013/02/15/interview-with-nick-chamandy-statistician-at-google/ 
“Expert knowledge of SAS (With Enterprise Guide/Miner) required and candidates with strong knowledge of R will be preferred” 
http://www.kdnuggets.com/jobs/13/03-29-apple-sr-data-scientist.html?utm_source=twitterfeed&utm_medium=facebook&utm_campaign=tfb&utm_content=FaceBook&utm_term=analytics#.UVXibgXOpfc.facebook
Discover which customers are likely to churn
Customer Churn Analysis
Account Information 
state 
account length
area code 
phone number 
User Behavior 
international plan 
voice mail plan, number vmail messages 
total day minutes, total day calls, total day charge 
total eve minutes, total eve calls, total eve charge 
total night minutes, total night calls, total night charge 
total intl minutes, total intl calls, total intl charge 
number customer service calls 
Target 
Churn (Yes/No) 
Data Description
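> install.packages("C50")
> library(C50)
> data(churn)   # loads churnTrain and churnTest
> str(churnTrain)
> # drop columns that are not used as predictors
> churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
> set.seed(2)   # make the random split reproducible
> # assign each row to group 1 (about 70%) or group 2 (about 30%)
> ind <- sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]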
Split data into training and testing dataset 
70% as training dataset 
30% as testing dataset
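library(rpart)   # recursive partitioning (decision tree) package
churn.rp <- rpart(churn ~ ., data = trainset)
plot(churn.rp, margin = 0.1)
text(churn.rp, all = TRUE, use.n = TRUE)   # label nodes with observation counts
Build Classifier
Classification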
> predictions <- predict(churn.rp, testset, type="class")
> table(testset$churn, predictions)
Prediction Result
(rows: actual churn; columns: predicted churn)
        pred
         no   yes
  no    859    18
  yes    41   100
> library(caret)   # provides confusionMatrix()
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics

predictions yes  no
        yes 100  18
        no   41 859

               Accuracy : 0.942
                 95% CI : (0.9259, 0.9556)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7393
 Mcnemar's Test P-Value : 0.004181
            Sensitivity : 0.70922
            Specificity : 0.97948
         Pos Pred Value : 0.84746
         Neg Pred Value : 0.95444
             Prevalence : 0.13851
         Detection Rate : 0.09823
   Detection Prevalence : 0.11591
      Balanced Accuracy : 0.84435
       'Positive' Class : yes
Use Confusion Matrix
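Most of the statistics in the report above are simple ratios of the four cell counts. The sketch below recomputes a few of them from the counts on the slide (100, 41, 18, 859), taking "yes" (churn) as the positive class; the function and variable names are illustrative.

```python
# Cell counts from the slide, with "yes" (churn) as the positive class.
TP, FN = 100, 41    # actual churners: predicted yes / predicted no
FP, TN = 18, 859    # actual non-churners: predicted yes / predicted no


def confusion_metrics(tp, fn, fp, tn):
    """Derive the headline ratios from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,       # share of correct predictions
        "sensitivity": tp / (tp + fn),       # churners that were caught
        "specificity": tn / (tn + fp),       # non-churners left alone
        "pos_pred_value": tp / (tp + fp),    # predicted churners that churn
        "neg_pred_value": tn / (tn + fn),    # predicted stayers that stay
        "prevalence": (tp + fn) / total,     # share of churners in the data
    }


metrics = confusion_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

The gap between accuracy (about 94%) and sensitivity (about 71%) is the point to notice: because only about 14% of customers churn, a model can look accurate overall while still missing nearly a third of the churners.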
Use Testing Data to Validate Result 
library(ROCR)   # provides prediction() and performance()
predictions <- predict(churn.rp, testset, type = "prob")
pred.to.roc <- predictions[, 1]   # column 1: probability of the first class level
pred.rocr <- prediction(pred.to.roc, testset$churn)   # churn is the last column of testset
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", (perf.rocr@y.values)))
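The AUC shown in the plot title has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by brute force over all pairs; the scores are hypothetical, and this is an illustration of the definition, not how ROCR computes it internally.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical predicted churn probabilities, for illustration only.
churner_scores = [0.9, 0.8, 0.4]
stayer_scores = [0.7, 0.3, 0.2, 0.1]
print(auc(churner_scores, stayer_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, regardless of which cutoff is later chosen.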
Finding the Most Important Variables
library(rminer)   # provides fit(), Importance() and mgraph()
model = fit(churn ~ ., trainset, model = "svm")
VariableImportance = Importance(model, trainset, method = "sensv")
L = list(runs = 1, sen = t(VariableImportance$imp), sresponses = VariableImportance$sresponses)
mgraph(L, graph = "IMP", leg = names(trainset), col = "gray", Grid = 10)
Dynamic Language 
Execution at runtime 
Dynamic Type 
Interpreted Language 
See the result after execution 
OOP 
Python Language 
Cross-Platform (Python VM)
Third-Party Resources
(Data Analysis, Graphics, Website Development)
Simple and easy to learn
Benefits of Python
Data Analysis 
SciPy
NumPy
scikit-learn
pandas
Companies that use Python
Use InfoLite Tool To Extract DOM
Use Python To Build Up Dashboard
Monitor Social Media and News 
Monitor posts on social media
Configure keywords and alerts
Use a line plot to show daily post statistics
Apple Daily, NOWnews, udn, Central News Agency, Storm Media, and other financial media
Daily Statistics Report
Examine Associated Articles
Configure Alerts and Keywords
Configure Monitored Channels
Track Specific Articles
Have You Learned Big Data?
The 3Vs of Big Data
Product-Centric
Customer-Centric
Product-Centric vs. Customer-Centric
Customer Centric? 
http://goo.gl/iuy4lY
Personal Recommendation
Knowing Who You Are?
Personalized recommendation
Customer relationship management
Knowing What the Future Looks Like?
From history, we can see the future
Predictive analysis 
Knowing What is Hidden Beneath? 
Correlation, Correlation, Correlation 
So… What is Big Data?
So… How To Analyze?
Apache Project – From Yahoo 
Features
Extensible
Cost-Effective
Flexible
Highly Fault-Tolerant
Hadoop
Hadoop Ecosystem
HDFS 
MR 
IMPALA 
HBASE 
PIG 
HIVE 
SQOOP FLUME 
HUE, Oozie, Mahout
Tools for Different Scales
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
Amazon
Facebook
Recommendation System 
Javascript 
Flume 
HDFS 
HBase 
Pig 
Mahout
Item-Based
User-Based
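Item-based recommendation compares items by how similarly users rated them. The sketch below uses cosine similarity, which is one common choice and not necessarily what the demo used (the pipeline in the talk relies on Mahout); the users, items, and ratings are hypothetical.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, for illustration only.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 2},
    "carol": {"B": 1, "C": 5},
}


def item_vector(item):
    """Ratings of one item, keyed by the users who rated it."""
    return {user: r[item] for user, r in ratings.items() if item in r}


def cosine(v1, v2):
    """Cosine similarity of two sparse rating vectors."""
    shared = set(v1) & set(v2)
    if not shared:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in shared)
    norm1 = sqrt(sum(x * x for x in v1.values()))
    norm2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2)


def most_similar_item(item):
    """Return the other item whose rating pattern is closest to `item`."""
    others = sorted({i for r in ratings.values() for i in r} - {item})
    target = item_vector(item)
    return max(others, key=lambda other: cosine(target, item_vector(other)))


print(most_similar_item("A"))
```

The user-based variant is the same computation with the roles swapped: build one vector per user over the items they rated, find the most similar users, and recommend what they liked.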
Monitor User Rating
Send User Behavior to Backend
Use Flume To Collect Streaming Data 
From /tmp/postlog.txt To /user/cloudera/flume
JSON sample data 
{"food":"Tacos", "person":"Alice", "amount":3} 
{"food":"Tomato Soup", "person":"Sarah", "amount":2} 
{"food":"Grilled Cheese", "person":"Alex", "amount":5} 
Demo Code 
second_table = LOAD 'second_table.json' 
USING JsonLoader('food:chararray, person:chararray, amount:int'); 
Use Pig To Load JSON
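For small files, the same records can be read without a cluster. The sketch below parses the slide's three sample lines (one JSON object per line) into tuples that mirror the Pig schema `food:chararray, person:chararray, amount:int`; the helper name `load_records` is illustrative.

```python
import json

# The three sample records from the slide, one JSON object per line.
SAMPLE = """\
{"food":"Tacos", "person":"Alice", "amount":3}
{"food":"Tomato Soup", "person":"Sarah", "amount":2}
{"food":"Grilled Cheese", "person":"Alex", "amount":5}
"""


def load_records(text):
    """Parse JSON lines into (food, person, amount) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        raw = json.loads(line)
        records.append((str(raw["food"]), str(raw["person"]), int(raw["amount"])))
    return records


records = load_records(SAMPLE)
total_amount = sum(amount for _, _, amount in records)
print(records)
print(total_amount)
```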
Build Recommendation Model
$ hbase shell 
> create 'mydata', 'mycf'
Build Table In HBase
Examine Data In HDFS
Use Pig To Transfer Data Into HBase
Examine Data In HBase
Build API
Recommendation System
Focus on algorithms
Divide and Conquer, Trie, Collaborative Filtering
Rather than being an expert in a single programming language,
know which tools and algorithms you can use to solve your problem
Define your role 
Statistician 
Software engineer 
What You Should Do
Website: 
largitdata.com 
ywchiu.com 
Email: 
david@largitdata.com 
tr.ywchiu@gmail.com 
Contacts