Big Data Analysis With
RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
About Me
• Co-Founder of NumerInfo
• Ex-Trend Micro Engineer
• ywchiu-tw.appspot.com
R + Hadoop
http://cdn.overdope.com/wp-content/uploads/2014/03/dragon-ball-z-x-hmn-alns-fusion-1.jpg
• Scaling R
Hadoop enables R to do parallel computing
• No need to learn a new language
Learning Java takes time
Why Use RHadoop

RHadoop Architecture
[Diagram: R talks to HDFS through rhdfs, to MapReduce through rmr2 (Streaming API), and to HBase through rhbase (HBase Thrift Gateway)]
• Enables developers to write the mapper/reducer in
any scripting language (R, Python, Perl)
• Mapper, reducer, and optional combiner
processes read from standard
input and write to standard output
• A streaming job has the additional overhead
of starting a scripting VM
Streaming vs. Native Java
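The stdin/stdout contract above can be sketched in plain R. This is a minimal sketch, not how rmr2 is implemented: `stream_map` and `stream_reduce` are hypothetical helper names, the tab-separated key/value line format is an assumption, and `sort()` stands in for Hadoop's shuffle so the pair can be tried without a cluster.

```r
# Streaming-style word count: text lines in, "key<TAB>value" lines out.
# stream_map / stream_reduce are hypothetical names used only for this sketch.
stream_map <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  paste(words, 1, sep = "\t")            # emit one "word<TAB>1" line per word
}

stream_reduce <- function(lines) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  vals <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(vals, keys, sum)      # sum the counts for each key
  paste(names(totals), totals, sep = "\t")
}

# In a real streaming script each function would be wrapped roughly as
#   writeLines(stream_map(readLines(file("stdin"))))
# Local dry run, with sort() standing in for the shuffle phase:
mapped  <- stream_map(c("a b a", "b c"))
reduced <- stream_reduce(sort(mapped))
print(reduced)
```

Every record crosses a process boundary as text here, which is where the extra streaming overhead mentioned above comes from.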
• Writing MapReduce using R
• mapreduce function
mapreduce(input, output, map, reduce, …)
• Changelog
rmr 3.0.0 (2014/02/10): 10X faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): supports plyrmr
rmr2
• Access HDFS from R
• Exchange data between R data frames and HDFS
rhdfs

• Exchange data between R and HBase
• Uses the Thrift API
rhbase
• Perform common data manipulation operations,
as found in plyr and reshape2
• It provides a familiar plyr-like interface while
hiding many of the mapreduce details
• plyr: Tools for splitting, applying and
combining data
NEW! plyrmr
RHadoop
Installation
• R and the related packages should be installed on
each task node of the cluster
• A Hadoop cluster: CDH3 and higher, or Apache
1.0.2 and higher but limited to mr1, not mr2.
Compatibility with mr2 starts from Apache 2.2.0 or
HDP2
Prerequisites

Getting Ready (Cloudera VM)
• Download
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
• This VM runs
CentOS 6.2
CDH4.4
R 3.0.1
Java 1.6.0_32
CDH 4.4
Get RHadoop
• https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Installing rmr2 dependencies
• Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "Rcpp",
"RJSONIO", "bitops", "digest", "functional", "stringr",
"plyr", "reshape2", "rJava", "caTools"))

Install rmr2
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Installing…
• http://cran.r-project.org/src/contrib/Archive/Rcpp/
Downgrade Rcpp
$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install Rcpp_0.11.0

$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install rmr2 again
Install RHDFS
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
Enable hdfs
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
Javareconf error
$ sudo R CMD javareconf

javareconf with correct JAVA_HOME
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
MapReduce
With RHadoop
MapReduce
• mapreduce(input, output, map, reduce)
• Like sapply, lapply, tapply within R
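The analogy with the apply family can be made concrete in base R. This is only an illustrative sketch of the idea, not rmr2 code: `sapply` plays the role of the map step, and `tapply` plays the role of grouping by key and reducing. The odd/even key is an arbitrary choice for the example.

```r
# Map step: apply a function to every input value (like sapply/lapply)
values <- 1:10
mapped <- sapply(values, function(v) v^2)

# Each mapped value gets a key; here the key is simply "odd" or "even"
keys <- ifelse(values %% 2 == 0, "even", "odd")

# Reduce step: combine all values that share a key (like tapply)
reduced <- tapply(mapped, keys, sum)
print(reduced)
```

The difference on a cluster is that the map and reduce functions run on separate chunks of data in separate processes, rather than over one in-memory vector.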
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png

Move File Into HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
Wordcount Mapper
map <- function(k, lines) {
  # split every input line on whitespace
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  # emit (word, 1) for each word
  return( keyval(words, 1) )
}
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
#Mapper
Wordcount Reducer
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
#Reducer
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount <- function (input, output=NULL) {
mapreduce(input=input, output=output,
input.format="text", map=map, reduce=reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
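Before submitting the job, the mapper and reducer logic can be dry-run locally. The sketch below is an assumption-laden stand-in, not rmr2 itself: `keyval` is redefined here as a hypothetical helper that only pairs keys with values, and `split()` imitates the grouping by key that Hadoop performs between the map and reduce phases.

```r
# Hypothetical stand-in for rmr2's keyval(): it only pairs keys with values.
keyval <- function(key, val) list(key = key, val = val)

# Mapper and reducer with the same logic as on the slides
map <- function(k, lines) {
  words <- unlist(strsplit(lines, '\\s'))
  keyval(words, 1)
}
reduce <- function(word, counts) keyval(word, sum(counts))

# Local dry run: map, group values by key (the shuffle), reduce each group
mapped <- map(NULL, c("the quick fox", "the lazy dog"))
vals   <- rep_len(mapped$val, length(mapped$key))  # recycle the single 1 per word
groups <- split(vals, mapped$key)
counts <- sapply(names(groups), function(w) reduce(w, groups[[w]])$val)
print(counts)
```

Running the pieces this way on a few lines of text makes it easier to spot logic errors before paying the start-up cost of a real Hadoop job.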

Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing
= TRUE)][1:10]
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 |
sort -k 2 -nr | head -n 10
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2=1:100000
> result.normal = sapply(small.ints2, function(x) x^2)
> proc.time() - a.time
> b.time <- proc.time()
> small.ints= to.dfs(1:100000)
> result = mapreduce(input = small.ints, map = function(k,v)
cbind(v,v^2))
> proc.time() - b.time
sapply
Elapsed 0.982 seconds
mapreduce
Elapsed 102.755 seconds

• HDFS stores your files as data chunks distributed across
multiple datanodes
• M/R runs multiple programs called mappers on each of
the data chunks or blocks. The (key, value) outputs of
these mappers are compiled together into the result by
reducers.
• It takes time for mappers and reducers to be spawned on
these distributed systems.
Hadoop Latency
It is not possible to apply built-in machine learning
methods directly in a MapReduce program
kcluster = kmeans(mydata, 4, iter.max=10)
Kmeans Clustering
kmeans =
  function(points, ncenters, iterations = 10, distfun = NULL) {
    # default distance: Frobenius norm of the difference (Euclidean distance)
    if(is.null(distfun))
      distfun = function(a,b) norm(as.matrix(a-b), type = 'F')
    # first pass: no centers yet, so points are assigned at random
    newCenters =
      kmeans.iter(
        points,
        distfun,
        ncenters = ncenters)
    # iteratively choose new centers
    for(i in 1:iterations) {
      newCenters = kmeans.iter(points, distfun,
                               centers = newCenters)
    }
    newCenters
  }
Kmeans in MapReduce Style
kmeans.iter =
  function(points, distfun, ncenters = dim(centers)[1], centers = NULL)
  {
    from.dfs(mapreduce(input = points,
      map =
        if (is.null(centers)) { # no centers yet: give each point a random cluster
          function(k,v) keyval(sample(1:ncenters,1),v)}
        else {
          function(k,v) { # find the center at minimum distance
            distances = apply(centers, 1, function(c) distfun(c,v))
            keyval(centers[which.min(distances),], v)}},
      # new center = column-wise mean of the points assigned to it
      reduce = function(k,vv) keyval(NULL,
        apply(do.call(rbind, vv), 2, mean))),
      to.data.frame = T)
  }
Kmeans in MapReduce Style
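One iteration of the map/reduce formulation above can be tried in base R without a cluster. This is a sketch under stated assumptions, not the rmr code: `kmeans_iter_local` and `dist_fun` are hypothetical names, points and centers are assumed to be numeric matrices with one row per point, and a cluster that receives no points is simply dropped.

```r
# One k-means iteration written as a "map" step and a "reduce" step (base R only).
dist_fun <- function(a, b) sqrt(sum((a - b)^2))   # Euclidean distance

kmeans_iter_local <- function(points, centers) {
  # "map": key each point by the index of its nearest center
  nearest <- apply(points, 1, function(p) {
    which.min(apply(centers, 1, function(ctr) dist_fun(ctr, p)))
  })
  # "reduce": average the points that share a key to get the new centers
  groups <- split(as.data.frame(points), nearest)
  do.call(rbind, lapply(groups, colMeans))
}

points  <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
centers <- rbind(c(1, 1), c(9, 9))
new_centers <- kmeans_iter_local(points, centers)
print(new_centers)
```

Repeating this step until the centers stop moving gives the usual k-means loop; on Hadoop each repetition is a separate MapReduce job, so the per-job latency is paid on every iteration.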

One More Thing…
plyrmr
• Perform common data manipulation operations,
as found in plyr and reshape2
• It provides a familiar plyr-like interface while
hiding many of the mapreduce details
• plyr: Tools for splitting, applying and
combining data
NEW! plyrmr
Installing plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Installing plyrmr

> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)
> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"),
carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl =
carb/cyl), "/tmp/mtcars.out")
Transform in plyrmr
where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)
select and where
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))
Group by
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
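For comparison, the same group-by can be computed in memory with base R. This sketch does not use plyrmr at all: it uses `aggregate` on the built-in `mtcars` data, and renaming the result column to `mean.mpg` is only done to mirror the column name used above.

```r
# Base R equivalent of the group-by: mean mpg per number of cylinders
data(mtcars)
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
names(mean_mpg)[2] <- "mean.mpg"   # mirror the column name used with plyrmr
print(mean_mpg)
```

On a data set as small as mtcars, an in-memory result like this is a convenient reference to check the Hadoop output against.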
Reference
• https://github.com/RevolutionAnalytics/RHadoop/wiki
• http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
• http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop

Website: ywchiu-tw.appspot.com
Email: david@numerinfo.com, tr.ywchiu@gmail.com
Company: numerinfo.com
Contact

Big Data Analysis With RHadoop

  • 1. Big Data Analysis With RHadoop. David Chiu (Yu-Wei, Chiu), @ML/DM Monday, 2014/03/17
  • 2. About Me: Co-Founder of NumerInfo; Ex-Trend Micro Engineer; ywchiu-tw.appspot.com
  • 4. Why Use RHadoop: Scaling R (Hadoop enables R to do parallel computing). No new language to learn (learning Java takes time).
  • 6. Streaming vs. Native Java: Hadoop Streaming lets developers write the mapper and reducer in any scripting language (R, Python, Perl). The mapper, reducer, and optional combiner processes read from standard input and write to standard output. A streaming job carries the additional overhead of starting a scripting VM.
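The stdin/stdout contract described in this slide can be sketched as a word-count mapper in R. This is an illustration under assumptions, not code from the talk: the function name `stream.map` is hypothetical, and the tab-separated "key, value" line format is the usual Hadoop Streaming convention.

```r
# Word-count mapper in the Hadoop Streaming style: read text lines and
# emit one "word<TAB>1" line per word.
# `con` defaults to standard input, which is how a streaming job feeds the script.
stream.map <- function(con = file("stdin")) {
  lines <- readLines(con)
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Nothing to emit for empty input.
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Local demonstration with an in-memory connection instead of real stdin.
demo <- textConnection(c("big data with R", "R with Hadoop"))
out <- stream.map(demo)
close(demo)

# A real mapper script would write these lines to standard output.
writeLines(out)
```

Taking the connection as a parameter lets the same function be tested locally and then called with its default inside the script submitted to the streaming job.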
  • 7. rmr2: Write MapReduce jobs in R through the mapreduce function, mapreduce(input, output, map, reduce, …). Changelog: rmr 3.0.0 (2014/02/10) is 10x faster than rmr 2.3.0; rmr 2.3.0 (2013/10/07) adds plyrmr support.
  • 8. rhdfs: Access HDFS from R; exchange data between R data frames and HDFS.
  • 9. rhbase: Exchange data between R and HBase, using the Thrift API.
  • 10. NEW! plyrmr: Performs common data manipulation operations, as found in plyr and reshape2; provides a familiar plyr-like interface while hiding many of the MapReduce details. (plyr: tools for splitting, applying, and combining data.)
  • 12. Prerequisites: R and the related packages must be installed on every task node of the cluster; the Hadoop cluster must run CDH3 or higher, or Apache Hadoop 1.0.2 or higher, limited to mr1 (not mr2); compatibility with mr2 starts from Apache 2.2.0 or HDP2.
  • 13. Getting Ready (Cloudera VM): This VM runs CentOS 6.2, CDH4.4, R 3.0.1, and Java 1.6.0_32. Download: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
  • 16. Installing rmr2 dependencies: make sure the packages are installed system-wide.
      $ sudo R
      > install.packages(c("codetools", "R", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"))
  • 17. Install rmr2
      $ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
      $ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
  • 20. Install Rcpp_0.11.0
      $ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
      $ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
  • 21. Install rmr2 again
      $ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
  • 22. Install rhdfs
      $ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
      $ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
  • 23. Enable hdfs
      > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
      > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
      > library(rmr2)
      > library(rhdfs)
      > hdfs.init()
  • 24. javareconf error
      $ sudo R CMD javareconf
  • 25. javareconf with correct JAVA_HOME
      $ echo $JAVA_HOME
      $ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
  • 27. MapReduce: mapreduce(input, output, map, reduce) works like sapply, lapply, and tapply within R.
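To make the analogy concrete, here is a small base R sketch (no Hadoop or rmr2 involved). The data and key labels are made up for illustration: `sapply` plays the role of the map step, and `tapply` plays the role of the grouped reduce step.

```r
# Illustrative data: the values and keys are invented for this sketch
values <- c(3, 1, 4, 1, 5, 9)
keys   <- c("odd", "odd", "even", "odd", "odd", "odd")

# Map step: apply a function to every element, as sapply/lapply do
squared <- sapply(values, function(v) v^2)

# Reduce step: combine all values that share a key, as tapply does
totals <- tapply(squared, keys, sum)
print(totals)
```

The main difference in mapreduce is that both steps run as distributed tasks over HDFS blocks rather than over an in-memory vector.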
  • 28. Hello World – For Hadoop http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
  • 29. Move File Into HDFS
      # Put data into HDFS from R
      Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
      Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
      library(rmr2)
      library(rhdfs)
      hdfs.init()
      hdfs.mkdir("/user/cloudera/wordcount/data")
      hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")
      # Equivalent shell commands
      $ hadoop fs -mkdir /user/cloudera/wordcount/data
      $ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
  • 30. Wordcount Mapper
      # Mapper in R (rmr2): split each line on whitespace and emit (word, 1)
      map <- function(k, lines) {
        words.list <- strsplit(lines, '\\s')
        words <- unlist(words.list)
        return(keyval(words, 1))
      }
      // Mapper in Java, for comparison
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
  • 31. Wordcount Reducer
      # Reducer in R (rmr2): sum the counts collected for each word
      reduce <- function(word, counts) {
        keyval(word, sum(counts))
      }
      // Reducer in Java, for comparison
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
  • 32. Call Wordcount
      hdfs.root <- 'wordcount'
      hdfs.data <- file.path(hdfs.root, 'data')
      hdfs.out <- file.path(hdfs.root, 'out')
      wordcount <- function(input, output = NULL) {
        mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
      }
      out <- wordcount(hdfs.data, hdfs.out)
  • 33. Read data from HDFS
      results <- from.dfs(out)
      results$key[order(results$val, decreasing = TRUE)][1:10]
      $ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
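The word count flow above (map, group by key, reduce, then rank) can be rehearsed on a single machine before submitting a job. The sketch below is a base R imitation under stated assumptions: `local_mapreduce` is a hypothetical helper, not the rmr2 `mapreduce` function, and the input lines are made up.

```r
# Local, single-machine imitation of map -> shuffle -> reduce in base R.
# local_mapreduce is a hypothetical helper, not the rmr2 mapreduce function.
local_mapreduce <- function(input, map, reduce) {
  # Map: call map on every input record; each call returns list(key=, val=)
  mapped <- lapply(input, map)
  keys <- unlist(lapply(mapped, function(kv) kv$key))
  vals <- unlist(lapply(mapped, function(kv) kv$val))
  # Shuffle: group the values by key
  grouped <- split(vals, keys)
  # Reduce: one call per distinct key
  mapply(reduce, names(grouped), grouped)
}

lines <- c("a b a", "b a")  # made-up input
counts <- local_mapreduce(
  input  = lines,
  map    = function(line) {
    words <- unlist(strsplit(line, "\\s+"))
    list(key = words, val = rep(1, length(words)))
  },
  reduce = function(word, ones) sum(ones)
)

# Rank words by frequency, as slide 33 does with order()
print(sort(counts, decreasing = TRUE))
```

Because nothing here touches HDFS, it is only useful for checking the mapper and reducer logic, not for measuring performance.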
  • 34. MapReduce Benchmark
      # Plain R
      > a.time <- proc.time()
      > small.ints2 = 1:100000
      > result.normal = sapply(small.ints2, function(x) x^2)
      > proc.time() - a.time
      # Same computation through rmr2
      > b.time <- proc.time()
      > small.ints = to.dfs(1:100000)
      > result = mapreduce(input = small.ints, map = function(k,v) cbind(v,v^2))
      > proc.time() - b.time
  • 37. Hadoop Latency: HDFS stores your files as data chunks distributed across multiple datanodes; M/R runs multiple programs called mappers, one on each data chunk or block, and the (key, value) output of these mappers is compiled into the result by reducers; spawning mappers and reducers across such a distributed system takes time.
  • 38. Kmeans Clustering: It's not possible to apply R's built-in machine learning methods directly in a MapReduce program, for example kcluster = kmeans(mydata, 4, iter.max = 10).
  • 39. Kmeans in MapReduce Style
      kmeans = function(points, ncenters, iterations = 10, distfun = NULL) {
        if (is.null(distfun))
          distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
        newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
        # iteratively choosing new centers
        for (i in 1:iterations) {
          newCenters = kmeans.iter(points, distfun, centers = newCenters)
        }
        newCenters
      }
  • 40. Kmeans in MapReduce Style
      kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
        from.dfs(
          mapreduce(
            input = points,
            map =
              if (is.null(centers)) {
                # give random point as sample
                function(k, v) keyval(sample(1:ncenters, 1), v)
              } else {
                function(k, v) {
                  # find center of minimum distance
                  distances = apply(centers, 1, function(c) distfun(c, v))
                  keyval(centers[which.min(distances), ], v)
                }
              },
            reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
          to.data.frame = T)
      }
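The two phases of each iteration (assign every point to its nearest center, then average the points per center) can be checked locally in base R. The following is a sketch, not the talk's implementation: `kmeans_step` is a hypothetical helper, the points and starting centers are made up, and plain squared Euclidean distance is assumed.

```r
# One k-means iteration in base R, split into the same two phases as the
# MapReduce version. kmeans_step is a hypothetical helper name.
kmeans_step <- function(points, centers) {
  # "Map": index of the nearest center for every point
  nearest <- apply(points, 1, function(p) {
    diffs <- centers - matrix(p, nrow(centers), ncol(centers), byrow = TRUE)
    which.min(rowSums(diffs^2))
  })
  # "Reduce": new center = column means of the points assigned to it.
  # A center that attracts no points is dropped in this simple sketch.
  t(sapply(sort(unique(nearest)), function(i) {
    colMeans(points[nearest == i, , drop = FALSE])
  }))
}

# Made-up data: two obvious groups and two rough starting centers
points  <- rbind(c(0, 0), c(0, 2), c(10, 10), c(10, 12))
centers <- rbind(c(1, 1), c(9, 9))
new_centers <- kmeans_step(points, centers)
print(new_centers)
```

Repeating `kmeans_step` with its own output as the next `centers` argument mirrors the loop on slide 39.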
  • 42. NEW! plyrmr: Performs common data manipulation operations, as found in plyr and reshape2; provides a familiar plyr-like interface while hiding many of the MapReduce details. (plyr: tools for splitting, applying, and combining data.)
  • 43. Installing plyrmr dependencies
      $ sudo yum install libxml2-devel
      $ sudo yum install curl-devel
      $ sudo R
      > install.packages(c("RCurl", "httr"), dependencies = TRUE)
      > install.packages("devtools", dependencies = TRUE)
      > library(devtools)
      > install_github("pryr", "hadley")
      > install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
  • 45. Transform in plyrmr
      > data(mtcars)
      > head(mtcars)
      > transform(mtcars, carb.per.cyl = carb/cyl)
      > library(plyrmr)
      > output(input(mtcars), "/tmp/mtcars")
      > as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
      > output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")
  • 46. select and where
      where(
        select(
          mtcars,
          carb.per.cyl = carb/cyl,
          .replace = FALSE),
        carb.per.cyl >= 1)
      Source: https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
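For comparison, roughly the same query can be written in base R on the in-memory `mtcars` data frame. This is a sketch of the idea rather than a statement about plyrmr internals; it assumes that `select(..., .replace = FALSE)` keeps the existing columns and adds the new one, which is what `transform` does here.

```r
# Base R counterpart on the in-memory mtcars data frame (no Hadoop involved)
with_ratio <- transform(mtcars, carb.per.cyl = carb / cyl)

# Keep only the rows where the derived column is at least 1
result <- subset(with_ratio, carb.per.cyl >= 1)
print(result[, c("cyl", "carb", "carb.per.cyl")])
```

The plyrmr version reads the same way but can take an HDFS path wrapped in input() instead of a local data frame.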
  • 47. Group by
      as.data.frame(
        select(
          group(
            input("/tmp/mtcars"),
            cyl),
          mean.mpg = mean(mpg)))
      Source: https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
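A local sanity check for the grouped mean can be done in base R with `aggregate`. This sketch works on the in-memory `mtcars` data frame only; the column name `mean.mpg` is copied from the slide so the two results are easy to compare.

```r
# Mean mpg per cylinder count, computed locally with base R
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Rename the aggregated column to match the plyrmr example
names(mean_mpg)[2] <- "mean.mpg"
print(mean_mpg)
```

The result has one row per distinct `cyl` value, which is the shape the plyrmr group-by is expected to return after as.data.frame().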