Apply Machine Learning to Your Data
Starting with R
@ COSCUP 2013 / David Chiu
About Me
Trend Micro
Taiwan R User Group
ywchiu-tw.appspot.com
Big Data Era
Quick analysis, finding meaning beneath data.
Data Analysis
1. Preparing the data (munging)
2. Running the model (analysis)
3. Interpreting the results
Machine Learning
Black-box, algorithmic approach to producing predictions or
classifications from data
A computer program is said to learn from
experience E with respect to some task T and some
performance measure P, if its performance on T, as
measured by P, improves with experience E
Tom Mitchell (1998)
Using R to do Machine Learning
Why Use R?
1. Statistical analysis on the fly
2. Mathematical functions and graphics modules built in
3. FREE! & Open Source!
Applications of Machine Learning
1. Recommender systems
2. Pattern Recognition
3. Stock market analysis
4. Natural language processing
5. Information Retrieval
Facial Recognition
Topics of Machine Learning
Supervised Learning
Regression
Classification
Unsupervised Learning
Dimension Reduction
Clustering
Regression
Predict one set of numbers given another set of numbers
Given the number of friends x, predict how many "goods" I will
receive on each Facebook post
Scatter Plot
dataset <- read.csv('fbgood.txt', header = TRUE, sep = '\t', row.names = 1)
x <- dataset$friends
y <- dataset$getgoods
plot(x, y)
Linear Fit
fit <- lm(y ~ x)                     # ordinary least-squares line
abline(fit, col = 'red', lwd = 3)    # overlay the fitted line on the scatter plot
2nd order polynomial fit
plot(x,y)
polyfit2 <- lm(y ~ poly(x, 2))       # quadratic (2nd-order) fit
lines(sort(x), fitted(polyfit2)[order(x)], col = 2, lwd = 3)
3rd order polynomial fit
plot(x,y)
polyfit3 <- lm(y ~ poly(x, 3))       # cubic (3rd-order) fit
lines(sort(x), fitted(polyfit3)[order(x)], col = 2, lwd = 3)
Other Regression Packages
MASS rlm - Robust Regression
GLM - Generalized Linear Models
GAM - Generalized Additive Models
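A minimal robust-regression sketch with MASS::rlm, assuming the fbgood x and y vectors from the scatter plot above are still loaded; the call mirrors the lm() fit:
library(MASS)
# Robust fit: rlm down-weights outlying posts that would pull an lm() line
rfit <- rlm(y ~ x)
plot(x, y)
abline(rfit, col = 'blue', lwd = 3)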
Classification
Identifying to which of a set of categories a new observation belongs,
on the basis of a training set of data
Given the features of a bank customer, predict whether
the client will subscribe to a term deposit
Data Description
Features:
age,job,marital,education,default,balance,housing,loan,contact
Labels:
Whether the customer subscribes to a term deposit (yes/no)
Classify Data With LibSVM
library(e1071)
dataset <- read.csv('bank.csv', header = TRUE, sep = ';')
# split.data is not part of e1071; a simple helper that splits the rows
# into a p / (1 - p) training/test partition
split.data <- function(data, p = 0.7) {
  idx <- sample(nrow(data), size = floor(p * nrow(data)))
  list(train = data[idx, ], test = data[-idx, ])
}
dati <- split.data(dataset, p = 0.7)
train <- dati$train
test <- dati$test
model <- svm(y ~ ., data = train, probability = TRUE)
pred <- predict(model, test[, 1:(dim(test)[2] - 1)], probability = TRUE)
Verify the predictions
table(pred,test[,dim(test)[2]])
pred    no   yes
  no  1183    99
  yes   27    47
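A small follow-up sketch, reusing pred and test from above, that turns the confusion matrix into an overall accuracy number (the counts are the ones shown in the table):
conf.mat <- table(pred, test[, dim(test)[2]])
# Overall accuracy: correctly classified cases divided by all cases
sum(diag(conf.mat)) / sum(conf.mat)
# (1183 + 47) / (1183 + 99 + 27 + 47) ≈ 0.907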
Using ROC for assessment
library(ROCR)
pred.prob <- attr(pred, "probabilities")   # class probabilities from the SVM
pred.to.roc <- pred.prob[, 2]              # probability column for the 'yes' class
pred.rocr <- prediction(pred.to.roc, as.factor(test[, dim(test)[2]]))
perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")
perf.tpr.rocr <- performance(pred.rocr, "tpr", "fpr")
plot(perf.tpr.rocr, colorize = TRUE, main = paste("AUC:", perf.rocr@y.values))
Then, get your thesis
Support Vector Machines and Kernel Methods
e1071 - LIBSVM
kernlab - SVM, RVM and other kernel learning algorithms
klaR - SVMlight
rdetools - Model selection and prediction
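For comparison, a minimal kernlab sketch, assuming the same bank train/test split as above; the RBF kernel and default settings are illustrative, not tuned:
library(kernlab)
# RBF-kernel SVM trained on the same bank training data
kmodel <- ksvm(y ~ ., data = train, kernel = "rbfdot", prob.model = TRUE)
kpred <- predict(kmodel, test[, 1:(dim(test)[2] - 1)])
table(kpred, test[, dim(test)[2]])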
Dimension Reduction
Seeks linear combinations of the columns of X with maximal variance
Calculate a new index to measure the economic status
of each Taiwan city/county
Economic Index of Taiwan Counties
City/County (縣市)
Sales of profit-seeking enterprises (營利事業銷售額)
Share of annual expenditure spent on economic development (經濟發展支出佔歲出比例)
Average disposable income per income recipient (得收入者平均每人可支配所得)
Source: 2012 CommonWealth Magazine (天下雜誌) Happiness City Survey, Issue 505
Component Bar Plot
dataset <- read.csv('eco_index.csv', header = TRUE, sep = ',', row.names = 1)
pc.cr <- princomp(dataset, cor = TRUE)   # PCA on the correlation matrix
plot(pc.cr)                              # bar plot of component variances
Component Line Plot
screeplot(pc.cr, type="lines")
abline(h=1, lty=3)
PCA biplot
biplot(pc.cr)
PCA barplot
barplot(sort(-pc.cr$scores[, 1], TRUE))  # flip the (arbitrary) sign of PC1 and sort counties
Other Dimension Reduction Packages
kpca - Kernel PCA
cmdscale - Multidimensional Scaling
SVD - Singular Value Decomposition
fastICA - Independent Component Analysis
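As a hedged sketch of one alternative, assuming the eco_index dataset from the PCA example is still loaded, classical multidimensional scaling with cmdscale gives a similar 2-D map of the counties:
d <- dist(scale(dataset))          # distances between counties on standardized features
mds <- cmdscale(d, k = 2)          # classical MDS into two dimensions
plot(mds, type = "n")
text(mds, labels = rownames(dataset))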
Clustering
Birds of a feather flock together
Segment customers based on existing features
Customer Segmentation
Clustering by 4 features
Visit Time
Average Expense
Loyalty Days
Age
Determining Clusters
mydata <- read.csv('costumer_segment.txt', header = TRUE, sep = '\t')
mydata <- scale(mydata)                  # standardize each feature
d <- dist(mydata, method = "euclidean")  # pairwise distances between customers
fit <- hclust(d, method = "ward.D")      # Ward hierarchical clustering ("ward" in R < 3.1)
plot(fit)
Cutting trees
k1 = 4
groups <- cutree(fit, k=k1)
rect.hclust(fit, k=k1, border="red")
Kmeans Clustering
fit <- kmeans(mydata, k1)
plot(mydata, col = fit$cluster)
Principal Component Plot
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, lines=0)
Other Clustering Packages
kernlab - Spectral Clustering
specc - Spectral Clustering
fpc - DBSCAN
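A minimal fpc sketch, assuming the scaled mydata matrix from the hierarchical-clustering example; the eps and MinPts values are illustrative and would need tuning on real data:
library(fpc)
# Density-based clustering; points assigned to no cluster get label 0 (noise)
db <- dbscan(mydata, eps = 0.8, MinPts = 5)
plot(mydata, col = db$cluster + 1)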
Machine Learning Diagnostics
1. Get more training examples
2. Try smaller sets of features
3. Try getting additional features
4. Try adding polynomial features
5. Try increasing or decreasing model parameters (see the tuning sketch below)
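For the parameter point above, a hedged sketch using e1071's built-in grid search, assuming the bank train data from the classification example; the gamma/cost grids are illustrative:
library(e1071)
# 10-fold cross-validated grid search over the RBF kernel parameters
tuned <- tune.svm(y ~ ., data = train,
                  gamma = 10^(-3:-1), cost = 10^(0:2))
summary(tuned)
tuned$best.model                   # model refit with the best gamma/cost pair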
Overfitting
Training error is low, but test error is high, e.g. J_training(θ) ≪ J_test(θ)
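An illustrative sketch of this on the fbgood data, assuming the x and y vectors from the regression slides; the 70/30 split and the degree range are arbitrary. Training error keeps shrinking with higher polynomial degree while test error eventually grows:
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
idx <- sample(length(x), size = floor(0.7 * length(x)))
fb.train <- data.frame(x = x[idx],  y = y[idx])
fb.test  <- data.frame(x = x[-idx], y = y[-idx])
for (d in 1:5) {
  fit <- lm(y ~ poly(x, d), data = fb.train)
  cat("degree", d,
      "train RMSE:", rmse(fb.train$y, predict(fit, fb.train)),
      "test RMSE:",  rmse(fb.test$y,  predict(fit, fb.test)), "\n")
}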
Use R For Data Analysis
THANK YOU
Please Come and Visit Taiwan R User Group
