1 
Scalable Analytics with R, Hadoop and RHadoop
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
2
3
4

#include warning.h 
5
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop
6
Get Started with RStudio
7
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
8

R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
9
Complex Data Types 
• Vector 
• c, seq, rep, [] 
• List 
• Data Frame 
• Lists of vectors of same length 
• Not a matrix 
10
Creating vectors 
> (v1 <- c(1,2,3,4))        # wrapping an assignment in () also prints it
[1] 1 2 3 4
> v1 * 4
[1] 4 8 12 16
> (v4 <- c(1:5))
[1] 1 2 3 4 5
> (v2 <- seq(2,12,by=3))
[1] 2 5 8 11
> v1 * v2
[1] 2 10 24 44
> (v3 <- rep(3,4))
[1] 3 3 3 3
11
Accessing and filtering vectors 
> (v1 <- c(2,4,6,8))
[1] 2 4 6 8
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8 
12

Lists 
> (lst <- list(1,"x",FALSE))
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1 
13
Data Frames 
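books <- read.csv("~/books.csv")   # load a CSV file into a data frame
books[1,]                          # first row
books[,1]                          # first column
books[3:4]                         # columns 3 and 4
books$price                        # the price column as a vector
books[books$price==6.99,]          # rows where price is 6.99
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)                 # average price of Martin's books
subset(books,select=-c(id,cat,sequence_i))   # drop three columns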
14
15
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
16 
Note: 
R supports functional programming.
Functions are first-class objects
and can be passed to other functions.

packages 
17
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop
18
“In pioneer days they used oxen for heavy
pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox.”
— Grace Hopper, early advocate of distributed computing
20 
Hadoop in a Nutshell

Map-Reduce is the interesting bit 
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output by key and sort each partition
• Reduce – Apply an aggregation function to all values in each partition
• Map reads input from disk
• Reduce writes output to disk
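The three phases above can be sketched in a single process. This is a minimal Python illustration, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, and everything stays in memory, whereas Hadoop reads input from and writes output to disk.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle & sort, and reduce in a single process."""
    # Map: apply map_fn to each input record; it yields (key, value) pairs.
    mapped = [pair for record in records for pair in map_fn(record)]

    # Shuffle & sort: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: apply the aggregation function to all values of each key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

def count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def count_reduce(word, counts):
    return word, sum(counts)

result = run_mapreduce(["a b a", "b c"], count_map, count_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

On a real cluster the grouping step is what moves data between machines, which is why it usually dominates the cost of a job.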
21
Example – Sessionize clickstream 
22
Sessionize 
Identify unique “sessions” of interaction with our website.
Session – for each user (IP), a set of clicks that happened
within 30 minutes of each other
23
Input – Apache Access Log Records 
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
24

Output – Add Session ID 
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
25
Overview 
26 
[Diagram: log lines -> Map (x3) -> (IP1, log lines) -> Reduce (x2) -> (log line, session ID)]
Map 
parsedRecord = re.search(r'(\d+\.\d+….', record)  # pattern elided on the slide
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print((IP, timestamp), record)
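A runnable version of this mapper might look like the sketch below. The slide elides its regular expression, so `LOG_RE` is an assumed pattern for the Common Log Format, and `session_map` is a hypothetical name. Parsing the timestamp into a `datetime` is also an assumption, made so that later steps can sort and subtract timestamps.

```python
import re
from datetime import datetime

# Assumed pattern for the Common Log Format; the slide does not show its own.
LOG_RE = re.compile(r'^(\d+\.\d+\.\d+\.\d+) \S+ \S+ \[([^\]]+)\]')

def session_map(record):
    """Return ((ip, timestamp), record) for one log line, or None if it does not parse."""
    match = LOG_RE.search(record)
    if match is None:
        return None
    ip = match.group(1)
    # Example timestamp text: 10/Oct/2000:13:55:36 -0700
    timestamp = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
    return (ip, timestamp), record

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
key, value = session_map(line)
print(key[0], key[1].isoformat())
```

Returning None for malformed lines keeps one bad record from failing the whole job; a real job would probably count or log them instead.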
27
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
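The partition-by-IP, sort-by-timestamp step can be imitated in memory as follows. This is a sketch only: `shuffle_and_sort` is a hypothetical helper, and the timestamps are plain integers (minutes) to keep the example short. In Hadoop this behaviour comes from configuring the partitioner and sort order of the job, not from user code like this.

```python
from collections import defaultdict

def shuffle_and_sort(mapped):
    """Group map output by IP and sort each group by timestamp."""
    partitions = defaultdict(list)
    for (ip, timestamp), record in mapped:
        partitions[ip].append((timestamp, record))
    # Sort each partition by timestamp so the reducer sees clicks in order.
    return {
        ip: [record for _, record in sorted(items, key=lambda item: item[0])]
        for ip, items in partitions.items()
    }

# Hypothetical map output: ((ip, minutes), record)
mapped = [
    (("10.0.0.1", 50), "click C"),
    (("10.0.0.2", 5), "click X"),
    (("10.0.0.1", 10), "click A"),
    (("10.0.0.1", 20), "click B"),
]
grouped = shuffle_and_sort(mapped)
print(grouped["10.0.0.1"])  # ['click A', 'click B', 'click C']
```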
28

Reduce 
# getTimestamp(record) is assumed to return the click time in minutes
sessionID = 1
curr_timestamp = getTimestamp(records[0])
for record in records:
    # a gap of more than 30 minutes since the last click starts a new session
    if getTimestamp(record) - curr_timestamp > 30:
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + str(sessionID))
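A self-contained version of this reducer could look like the sketch below. It assumes the records of one IP arrive already sorted by time, that a new session starts when the gap to the previous click exceeds 30 minutes, and that session IDs restart at 1 for each user (the slide's sample output suggests IDs may instead be global). `get_timestamp` and `sessionize` are hypothetical names.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def get_timestamp(record):
    """Parse the bracketed timestamp of a Common Log Format line."""
    raw = record[record.index("[") + 1:record.index("]")]
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

def sessionize(records):
    """Append a session ID to each record; records must be sorted by time."""
    session_id = 1
    previous = None
    output = []
    for record in records:
        current = get_timestamp(record)
        # A gap of more than 30 minutes since the previous click starts a new session.
        if previous is not None and current - previous > SESSION_GAP:
            session_id += 1
        previous = current
        output.append(f"{record} {session_id}")
    return output

records = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 1',
    '127.0.0.1 - frank [10/Oct/2000:14:05:00 -0700] "GET /b HTTP/1.0" 200 1',
    '127.0.0.1 - frank [10/Oct/2000:15:00:00 -0700] "GET /c HTTP/1.0" 200 1',
]
for line in sessionize(records):
    print(line)
```

The first two clicks are under ten minutes apart and share session 1; the third arrives 55 minutes later and opens session 2.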
29
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop
30
reshape2
• Two functions:
• melt – wide format to long format
• cast (dcast, acast) – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
• Unique identifiers
• New column – variable name
• New column – value
• Default – all numeric columns are treated as measured values
31
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
32

Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips,sex+time~variable,mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636 
33
*Apply 
• apply – apply a function to the rows or columns of a matrix
• lapply – apply a function to each item of a list
• Returns a list
• sapply – like lapply, but simplifies the result to a vector when possible
• tapply – apply a function to subsets of a vector, grouped by a factor
34
plyr 
• Split – apply – combine 
• ddply – data frame to data frame
ddply(.data, .variables, .fun = NULL, ...)
• summarize – aggregate data into a new data frame
• transform – modify a data frame
35
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826 
36

Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop
37
RHadoop Projects
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
38
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it. 
39
What does that mean? 
• Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece
• Use a commercial R+Hadoop product if you need a parallel version of a well-known algorithm
• Good fit: Fit a piecewise regression model for each county in the US
• Bad fit: Fit a piecewise regression model for the entire US population
• Bad fit: Logistic regression
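The "good fit" case works because each county's model depends only on that county's rows, so the data can be keyed by county and each group fitted independently. The Python sketch below illustrates that pattern with made-up rows and ordinary least squares on a straight line, which stands in for the piecewise regression named on the slide. `fit_line` is a hypothetical helper.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical rows: (county, x, y)
rows = [
    ("Alameda", 1, 3), ("Alameda", 2, 5), ("Alameda", 3, 7),
    ("Marin", 1, 10), ("Marin", 2, 9), ("Marin", 3, 8),
]

# Map: key each observation by county. Shuffle: group by key.
by_county = defaultdict(list)
for county, x, y in rows:
    by_county[county].append((x, y))

# Reduce: fit one independent model per county.
models = {county: fit_line(points) for county, points in by_county.items()}
print(models)  # {'Alameda': (2.0, 1.0), 'Marin': (-1.0, 11.0)}
```

A single model over all rows has no such split: every observation influences the same coefficients, which is why the slide lists it as a bad fit.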
40

Use-case examples – Good or Bad? 
1. Model power consumption per household to 
determine if incentive programs work 
2. Aggregate corn yield per 10x10 portion of field to 
determine best seeds to use 
3. Create churn models for service subscribers and 
determine who is most likely to cancel 
4. Determine correlation between device restarts and 
support calls 
41
Second Most Important: 
RMR requires R, RMR and all libraries you’ll
use to be installed on all nodes and
accessible by the Hadoop user
42
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations 
43
How to RMRify a Problem 
44

In more detail… 
• Mappers get list of values 
• You need to process each one independently 
• But do it for all lines at once. 
• Reducers work normally 
45
Demo 6 
> library(rmr2)
# Try the splitting logic locally on a small list first
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))
# Mapper: split every line of the batch into words and emit (word, 1)
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}
# Reducer: sum the counts collected for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))}
mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
  output="~/wc.json",input.format="text",output.format="json",
  map=wc.map,reduce=wc.reduce)
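The same word count can be imitated in plain Python to show the batch-style mapper: it receives a list of lines and processes all of them in one call. This is an illustration of the idea only, not of the rmr2 API; `wc_map` and `wc_reduce` are hypothetical names and the grouping loop stands in for Hadoop's shuffle.

```python
from collections import defaultdict

def wc_map(key, lines):
    """Receive a batch of lines and return (word, 1) pairs for all of them."""
    words = [word for line in lines for word in line.split(" ")]
    return [(word, 1) for word in words]

def wc_reduce(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)

batch = ["hello world", "don't worry be happy", "hello again"]

# Stand-in for shuffle: group the emitted counts by word.
groups = defaultdict(list)
for word, one in wc_map(None, batch):
    groups[word].append(one)

counts = dict(wc_reduce(word, ones) for word, ones in groups.items())
print(counts["hello"], counts["happy"])  # 2 1
```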
46
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
47
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
48

Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) {keyval(k,mean(v))} 
mapreduce(input="~/hadoop-recipes/data/tip1.txt",
output="~/avg.txt", 
input.format=make.input.format("csv"), 
output.format="text", 
map=avg.map,reduce=avg.reduce); 
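What this job computes can be checked with a small in-memory Python sketch that uses the tip values from the input slide. The mapper emits (name, tip), the grouping loop stands in for the shuffle, and the reducer averages each group. The function names are hypothetical.

```python
from collections import defaultdict

# Values taken from the "Naive Input" slide.
tips = [("Gwen", 1), ("Jeff", 2), ("Leon", 1), ("Gwen", 2.5), ("Leon", 3),
        ("Jeff", 1), ("Gwen", 1), ("Gwen", 2), ("Jeff", 1.5)]

def avg_map(name, tip):
    # One output pair per input record, keyed by person.
    return name, tip

def avg_reduce(name, values):
    return name, sum(values) / len(values)

# Stand-in for shuffle: every record has to be moved to its key's group.
groups = defaultdict(list)
for name, tip in (avg_map(n, t) for n, t in tips):
    groups[name].append(tip)

averages = dict(avg_reduce(name, values) for name, values in groups.items())
print(averages)  # {'Gwen': 1.625, 'Jeff': 1.5, 'Leon': 2.0}
```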
49
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
50
Avg Tips per Person - Optimized 
avg2.map <- function(k,v) {
  # split each person's tips into a vector, then average inside the mapper
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}
mapreduce(input="~/hadoop-recipes/data/tip2.txt",
  output="~/avg2.txt",
  input.format=make.input.format("csv",sep=","),
  output.format="text",map=avg2.map)
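With one line per person, the average can be computed entirely in the mapper, so no shuffle or reduce is needed. The Python sketch below parses the layout shown on the "Awesome Input" slide (name, a space, then comma-separated tips). That parsing is an assumption and may differ from how the actual tip2.txt file is delimited. `avg2_map` is a hypothetical name.

```python
# Pre-grouped input, as displayed on the slide: one line per person.
lines = ["Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3"]

def avg2_map(line):
    """Compute one person's average inside the mapper; no reduce step follows."""
    name, raw_tips = line.split(" ", 1)
    tips = [float(tip) for tip in raw_tips.split(",")]
    return name, sum(tips) / len(tips)

averages = dict(avg2_map(line) for line in lines)
print(averages)  # {'Gwen': 1.625, 'Jeff': 1.5, 'Leon': 2.0}
```

The results match the naive job, but each line is handled independently and nothing has to be moved between machines.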
51
Few Final RMR Tips 
• Backend = "local" uses local files as input and output
• Backend = "hadoop" uses HDFS directories
• In "hadoop" mode, print(X) inside the mapper will fail the job.
• Use: cat("ERROR!", file = stderr())
52
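The last two tips can be sketched as a small logging helper. The likely reason print() breaks the job is that rmr2 runs over Hadoop streaming, where stdout carries the key/value data, so diagnostics belong on stderr. The names `log_msg` and `safe_mean` are hypothetical and not part of rmr2. The example runs in plain R.

```r
# Hypothetical helper: send diagnostics to stderr so stdout stays free
# for the key/value stream.
log_msg <- function(...) {
  cat(format(Sys.time(), "%H:%M:%S"), ..., "\n", file = stderr())
}

# Hypothetical mapper-style function: average the parseable values and
# report the rest on stderr instead of calling print().
safe_mean <- function(x) {
  nums <- suppressWarnings(as.numeric(x))
  bad <- sum(is.na(nums))
  if (bad > 0) log_msg("ERROR! skipped", bad, "unparseable value(s)")
  mean(nums, na.rm = TRUE)
}

result <- safe_mean(c("1", "2.5", "oops", "2"))
```

On a cluster, messages written this way should end up in the task's stderr log rather than in the job output.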

Recommended Reading
• http://cran.r-project.org/doc/manuals/R-intro.html
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
• http://had.co.nz/reshape/paper-dsc2005.pdf
• http://seananderson.ca/2013/12/01/plyr.html
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
• http://cran.r-project.org/web/packages/data.table/index.html

R for hadoopers

  • 1. Scalable Analytics with R, Hadoop and RHadoop. Gwen Shapira, Software Engineer, @gwenshap, gshapira@cloudera.com
  • 6. Agenda
      • R Basics
      • Hadoop Basics
      • Data Manipulation
      • RHadoop
  • 7. Get Started with RStudio
  • 8. Basic Data Types
      • String
      • Number
      • Boolean
      • Assignment: <-
  • 9. R can be a nice calculator
      > x <- 1
      > x * 2
      [1] 2
      > y <- x + 3
      > y
      [1] 4
      > log(y)
      [1] 1.386294
      > help(log)
  • 10. Complex Data Types
      • Vector: c, seq, rep, []
      • List
      • Data Frame: a list of vectors of the same length (not a matrix)
  • 11. Creating vectors
      > v1 <- c(1,2,3,4)
      > v1
      [1] 1 2 3 4
      > v1 * 4
      [1]  4  8 12 16
      > v4 <- c(1:5)
      > v4
      [1] 1 2 3 4 5
      > v2 <- seq(2,12,by=3)
      > v2
      [1]  2  5  8 11
      > v1 * v2
      [1]  2 10 24 44
      > v3 <- rep(3,4)
      > v3
      [1] 3 3 3 3
  • 12. Accessing and filtering vectors
      > v1 <- c(2,4,6,8)
      > v1
      [1] 2 4 6 8
      > v1[2]
      [1] 4
      > v1[2:4]
      [1] 4 6 8
      > v1[-2]
      [1] 2 6 8
      > v1[v1>3]
      [1] 4 6 8
  • 13. Lists
      > lst <- list(1,"x",FALSE)
      > lst
      [[1]]
      [1] 1
      [[2]]
      [1] "x"
      [[3]]
      [1] FALSE
      > lst[1]
      [[1]]
      [1] 1
      > lst[[1]]
      [1] 1
  • 14. Data Frames
      books <- read.csv("~/books.csv")
      books[1,]                     # first row
      books[,1]                     # first column
      books[3:4]                    # columns 3 and 4
      books$price                   # column by name
      books[books$price==6.99,]     # rows where price is 6.99
      martin_price <- books[books$author_t=="George R.R. Martin",]$price
      mean(martin_price)
      subset(books,select=-c(id,cat,sequence_i))   # drop columns by name
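The data frame slide reads `~/books.csv`, which is not part of this transcript. A minimal sketch of the same indexing operations, assuming a made-up `books` data frame (the column names `id`, `author_t` and `price` follow the slide; the rows are invented for illustration):

```r
# Hypothetical stand-in for books.csv; the values are invented for illustration.
books <- data.frame(
  id       = 1:4,
  author_t = c("George R.R. Martin", "George R.R. Martin",
               "Rick Riordan", "Isaac Asimov"),
  price    = c(7.99, 6.99, 12.50, 6.99),
  stringsAsFactors = FALSE
)

first_row   <- books[1, ]                    # one row, all columns
id_column   <- books[, 1]                    # first column as a vector
cheap_books <- books[books$price == 6.99, ]  # rows filtered by a condition

# Mean price of one author's books
martin_price <- books[books$author_t == "George R.R. Martin", ]$price
martin_mean  <- mean(martin_price)

# Drop a column by name
without_id <- subset(books, select = -c(id))

# A data frame is a list of equal-length vectors, not a matrix
is.list(books)    # TRUE
is.matrix(books)  # FALSE
```

Because each column is an ordinary vector, every vector operation from the earlier slides also works on `books$price`.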
  • 16. Functions
      > sq <- function(x) { x*x }
      > sq(3)
      [1] 9
      Note: R is a functional programming language. Functions are first-class objects and can be passed to other functions.
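The note about first-class functions can be illustrated with a short sketch. The helper names `apply_twice` and `make_multiplier` are hypothetical, chosen only for this example:

```r
sq <- function(x) { x * x }

# A function can be passed as an argument to another function
apply_twice <- function(f, x) { f(f(x)) }
nested <- apply_twice(sq, 3)          # sq(sq(3))

# sapply takes a function and applies it to every element
squares <- sapply(1:4, sq)

# A function can also build and return another function
make_multiplier <- function(n) { function(x) x * n }
triple <- make_multiplier(3)
fifteen <- triple(5)
```

This is the same mechanism that lets RMR accept plain R functions as its `map` and `reduce` arguments later in the deck.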
  • 18. Agenda
      • R Basics
      • Hadoop Basics
      • Data Manipulation
      • RHadoop
  • 19. “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox.” (Grace Hopper, early advocate of distributed computing)
  • 20. Hadoop in a Nutshell
  • 21. Map-Reduce is the interesting bit
      • Map: apply a function to each input record
      • Shuffle & Sort: partition the map output and sort each partition
      • Reduce: apply an aggregation function to all values in each partition
      • Map reads input from disk
      • Reduce writes output to disk
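The three phases can be imitated on a single machine in plain R. This is only a sketch of the data flow, with no Hadoop involved; the input strings and the names `map_fn`, `mapped`, `shuffled` and `counts` are made up for the example:

```r
# Single-machine simulation of map, shuffle & sort, and reduce (word count).
records <- c("to be or not to be", "to err is human")

# Map: apply a function to each input record, emitting (key, value) pairs
map_fn <- function(record) {
  words <- strsplit(record, " ")[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(records, map_fn))

# Shuffle & sort: group the values by key
shuffled <- split(mapped$value, mapped$key)

# Reduce: apply an aggregation function to all values of each key
counts <- sapply(shuffled, sum)
```

In a real job the map calls run on different nodes and the shuffle moves data across the network; only the shape of the computation is the same.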
  • 22. Example: Sessionize clickstream
  • 23. Sessionize: identify unique “sessions” of interaction with our website. A session is, for each user (IP), the set of clicks that happened within 30 minutes of each other.
  • 24. Input: Apache access log records
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
  • 25. Output: add session ID
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
  • 26. Overview (diagram): log lines go to the Map tasks; each Reduce task receives (IP1, log lines) and emits (log line, session ID)
  • 27. Map
      # the regex is truncated on the slide
      parsedRecord = re.search('(\d+.\d+….', record)
      IP = parsedRecord.group(1)
      timestamp = parsedRecord.group(2)
      print((IP, timestamp), record)
  • 28. Shuffle & Sort
      • Partition by: IP
      • Sort by: timestamp
      • Now reduce gets: (IP, timestamp) [record1, record2, record3, …]
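  • 29. Reduce
      # records: one IP's log lines, already sorted by timestamp
      # getTimestamp(): helper not shown on the slide, assumed to return minutes
      sessionID = 1
      curr_timestamp = getTimestamp(records[0])
      for record in records:
          # a gap of more than 30 minutes since the previous click starts a new session
          if getTimestamp(record) - curr_timestamp > 30:
              sessionID += 1
          curr_timestamp = getTimestamp(record)
          print(record + " " + str(sessionID))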
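The reduce step can also be sketched in R. This version is an assumption-laden illustration rather than the code from the talk: it takes the click times of a single IP as `POSIXct` values and swaps the explicit loop for the vectorized `diff` and `cumsum`. The function name `sessionize` and the sample timestamps are hypothetical.

```r
# Assign session IDs to one user's clicks, as a reducer would.
# A new session starts whenever the gap from the previous click
# exceeds `gap_minutes`.
sessionize <- function(timestamps, gap_minutes = 30) {
  timestamps <- sort(timestamps)
  gaps <- as.numeric(diff(timestamps), units = "mins")
  data.frame(timestamp  = timestamps,
             session_id = 1 + cumsum(c(0, gaps > gap_minutes)))
}

clicks <- as.POSIXct(c("2000-10-10 13:55:36", "2000-10-10 14:10:00",
                       "2000-10-10 15:00:00", "2000-10-10 15:05:00"),
                     tz = "UTC")
sessions <- sessionize(clicks)
```

The first two clicks are about 14 minutes apart and share a session; the 50-minute pause before the third click opens a second one.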
  • 30. Agenda
      • R Basics
      • Hadoop Basics
      • Data Manipulation Libraries
      • RHadoop
  • 31. Reshape2
      • Two functions:
          • Melt: wide format to long format
          • Cast: long format to wide format
      • Columns are either identifiers or measured variables
      • Molten data:
          • Unique identifiers
          • New column: variable name
          • New column: value
      • Default: all numeric columns are treated as measured values
  • 32. Melt
      > tips
        total_bill  tip    sex smoker day   time size
             16.99 1.01 Female     No Sun Dinner    2
             10.34 1.66   Male     No Sun Dinner    3
             21.01 3.50   Male     No Sun Dinner    3
      > melt(tips)
           sex smoker day   time   variable value
        Female     No Sun Dinner total_bill 16.99
        Female     No Sun Dinner        tip  1.01
        Female     No Sun Dinner       size     2
  • 33. Cast
      > m_tips <- melt(tips)
      > m_tips
           sex smoker day   time   variable value
        Female     No Sun Dinner total_bill 16.99
        Female     No Sun Dinner        tip  1.01
        Female     No Sun Dinner       size     2
      > dcast(m_tips,sex+time~variable,mean)
           sex   time total_bill      tip     size
        Female Dinner   19.21308 3.002115 2.461538
        Female  Lunch   16.33914 2.582857 2.457143
          Male Dinner   21.46145 3.144839 2.701613
          Male  Lunch   18.04848 2.882121 2.363636
  • 34. *Apply
      • apply: apply a function to the rows or columns of a matrix
      • lapply: apply a function to each item of a list; returns a list
      • sapply: like lapply, but returns a vector
      • tapply: apply a function to subsets of a vector, grouped by a factor
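A small base-R sketch of the four functions side by side. All data here is invented for illustration:

```r
m <- matrix(1:6, nrow = 2)           # rows are (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)         # margin 1 = rows
col_sums <- apply(m, 2, sum)         # margin 2 = columns

lst <- list(a = 1:3, b = 4:6)
as_list   <- lapply(lst, mean)       # result is a list
as_vector <- sapply(lst, mean)       # result is a named numeric vector

tip <- c(1.0, 2.0, 3.0, 4.0)
sex <- c("F", "M", "F", "M")
by_group <- tapply(tip, sex, mean)   # mean tip within each group
```

The practical difference between `lapply` and `sapply` is only the shape of the result, which matters when the output feeds another vectorized call.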
  • 35. plyr
      • Split, apply, combine
      • ddply: data frame to data frame
          ddply(.data, .variables, .fun = NULL, ...,
      • summarize: aggregate data into a new data frame
      • transform: modify a data frame
  • 36. DDPLY Example
      > ddply(tips,c("sex","time"),summarize,
      +       mean=mean(tip),
      +       sd=sd(tip),
      +       ratio=mean(tip/total_bill)
      + )
           sex   time     mean       sd     ratio
      1 Female Dinner 3.002115 1.193483 0.1693216
      2 Female  Lunch 2.582857 1.075108 0.1622849
      3   Male Dinner 3.144839 1.529116 0.1554065
      4   Male  Lunch 2.882121 1.329017 0.1660826
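The split, apply, combine pattern behind `ddply` can be spelled out with base R alone. This is a sketch of the idea, not plyr's implementation, and the four-row `tips` data frame below is a made-up miniature rather than the real dataset:

```r
# Hypothetical miniature of the tips data; values invented for illustration.
tips <- data.frame(
  sex        = c("Female", "Female", "Male", "Male"),
  time       = c("Dinner", "Dinner", "Dinner", "Lunch"),
  tip        = c(1.0, 3.0, 2.0, 4.0),
  total_bill = c(10, 20, 20, 20),
  stringsAsFactors = FALSE
)

# Split: one data frame per (sex, time) combination that actually occurs
pieces <- split(tips, list(tips$sex, tips$time), drop = TRUE)

# Apply: summarize each piece into a one-row data frame
summaries <- lapply(pieces, function(d) {
  data.frame(sex   = d$sex[1],
             time  = d$time[1],
             mean  = mean(d$tip),
             ratio = mean(d$tip / d$total_bill))
})

# Combine: stack the one-row results into a single data frame
result <- do.call(rbind, summaries)
```

The same three steps map naturally onto MapReduce: the shuffle does the split, the reducer does the apply, and the job output is the combine.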
  • 37. Agenda
      • R Basics
      • Hadoop Basics
      • Data Manipulation Libraries
      • RHadoop
  • 38. RHadoop Projects
      • RMR
      • RHDFS
      • RHBase
      • (new) PlyRMR
  • 39. Most important: RMR does not parallelize algorithms. It allows you to implement MapReduce in R. Efficiently. That’s it.
  • 40. What does that mean?
      • Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece
      • Use a commercial R+Hadoop product if you need a parallel version of a well-known algorithm
      • Good fit: fit a piecewise regression model for each county in the US
      • Bad fit: fit a piecewise regression model for the entire US population
      • Bad fit: logistic regression
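The "good fit" case is one where each group's model depends only on that group's rows. A minimal sketch in plain R, using ordinary linear regression (`lm`) as a stand-in for the piecewise model and two invented "counties":

```r
# Hypothetical data: two groups, each with its own exact linear trend.
d <- data.frame(
  county = rep(c("A", "B"), each = 4),
  x      = rep(1:4, times = 2),
  y      = c(2, 4, 6, 8,    # county A: y = 2x
             5, 6, 7, 8)    # county B: y = x + 4
)

# Fit one model per piece; no fit needs data from another group,
# so each call could run in a separate reducer.
fit_one <- function(piece) coef(lm(y ~ x, data = piece))
slopes  <- sapply(split(d, d$county), function(p) fit_one(p)[["x"]])
```

A single model over all rows at once has no such independent pieces, which is why the slide calls the whole-population fit a bad match for RMR.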
  • 41. Use-case examples: good or bad?
      1. Model power consumption per household to determine whether incentive programs work
      2. Aggregate corn yield per 10x10 portion of a field to determine the best seeds to use
      3. Create churn models for service subscribers and determine who is most likely to cancel
      4. Determine the correlation between device restarts and support calls
  • 42. Second most important: RMR requires R, RMR and every library you use to be installed on all nodes and accessible by the Hadoop user
  • 43. RMR is different from Hadoop Streaming. RMR mapper input: Key, [List of Records]. This is so we can use vector operations.
  • 44. How to RMRify a Problem
  • 45. In more detail…
      • Mappers get a list of values
      • You need to process each one independently
      • But do it for all lines at once
      • Reducers work normally
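What "all lines at once" looks like can be tried without Hadoop or rmr2 installed. In this sketch `keyval_sketch` is a hypothetical stand-in that only bundles keys and values; it is not rmr2's `keyval`, and the shuffle is simulated with `split`:

```r
# Stand-in for a key-value constructor; recycles the value so every key has one.
keyval_sketch <- function(k, v) list(key = k, val = rep(v, length.out = length(k)))

# The mapper receives a whole batch of lines in `v`, not a single line.
wc_map <- function(k, v) {
  words <- unlist(strsplit(v, " "))   # strsplit is vectorized over the batch
  keyval_sketch(words, 1)
}

# The reducer sees one key together with all of its values.
wc_reduce <- function(k, v) keyval_sketch(k, sum(v))

batch   <- c("hello world", "hello hadoop")
mapped  <- wc_map(NULL, batch)
grouped <- split(mapped$val, mapped$key)   # simulated shuffle
reduced <- mapply(function(k, v) wc_reduce(k, v)$val, names(grouped), grouped)
```

Testing mapper and reducer functions this way on a small vector is a cheap check before submitting a job to the cluster.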
  • 46. Demo 6
      > library(rmr2)
      # Try the splitting logic on a small list first
      t <- list("hello world","don't worry be happy")
      unlist(sapply(t,function (x) {strsplit(x," ")}))
      # Map: emit (word, 1) for every word in the batch of lines
      wc.map <- function(k,v) {
        ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
        keyval(ret_k,1)
      }
      # Reduce: sum the counts for each word
      wc.reduce <- function(k,v) { keyval(k,sum(v)) }
      mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
                output="~/wc.json",input.format="text",output.format="json",
                map=wc.map,reduce=wc.reduce)
  • 47. Cheating in MapReduce: do everything possible to have map-only jobs
  • 48. Avg Tips per Person: Naïve Input
      Gwen 1
      Jeff 2
      Leon 1
      Gwen 2.5
      Leon 3
      Jeff 1
      Gwen 1
      Gwen 2
      Jeff 1.5
  • 49. Avg Tips per Person: Naïve
      avg.map <- function(k,v) { keyval(v$V1,v$V2) }
      avg.reduce <- function(k,v) { keyval(k,mean(v)) }
      mapreduce(input="~/hadoop-recipes/data/tip1.txt",
                output="~/avg.txt",
                input.format=make.input.format("csv"),
                output.format="text",
                map=avg.map,reduce=avg.reduce)
  • 50. Avg Tips per Person: Awesome Input
      Gwen 1,2.5,1,2
      Jeff 2,1,1.5
      Leon 1,3
  • 51. Avg Tips per Person: Optimized
      # Map-only job: each input line already holds all of one person's tips
      avg2.map <- function(k,v) {
        v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
        keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
      }
      mapreduce(input="~/hadoop-recipes/data/tip2.txt",
                output="~/avg2.txt",
                input.format=make.input.format("csv",sep=","),
                output.format="text",map=avg2.map)
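The reason the pre-grouped input needs no reducer can be seen in plain R. This sketch assumes the tips for each person arrive as one space-separated string, matching the `strsplit(..., " ")` call in the mapper; the vectors `people` and `tips` are hypothetical stand-ins for the parsed columns `V1` and `V2`:

```r
# Hypothetical pre-grouped input: one entry per person.
people <- c("Gwen", "Jeff", "Leon")
tips   <- c("1 2.5 1 2", "2 1 1.5", "1 3")

# Each entry already contains everything needed for its average,
# so the work finishes in the map step with no shuffle or reduce.
avg_tips <- sapply(strsplit(tips, " "), function(x) mean(as.numeric(x)))
names(avg_tips) <- people
```

With the naive layout, Gwen's four tips sit on four different lines and may land in different mappers, which is what forces the shuffle and the reducer.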
  • 52. Few Final RMR Tips
      • Backend = "local" has files as input and output
      • Backend = "hadoop" uses HDFS directories
      • In "hadoop" mode, print(X) inside the mapper will fail the job
      • Use: cat("ERROR!", file = stderr())
  • 53. Recommended Reading
      • http://cran.r-project.org/doc/manuals/R-intro.html
      • http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
      • http://had.co.nz/reshape/paper-dsc2005.pdf
      • http://seananderson.ca/2013/12/01/plyr.html
      • https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
      • http://cran.r-project.org/web/packages/data.table/index.html

Editor's Notes

  1. Modern CPUs are optimized with vector instructions, so many operations can be applied to an entire vector in a single instruction. Loops take many instructions, both for the operations themselves and for stepping through the loop.
  2. This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
  3. An example to illustrate MapReduce.
  4. RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
  5. Just saved you hours of debugging. You can thank me later 
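The contrast drawn in note 1 between loops and vector operations can be sketched with two versions of a sum of squares. The function names are hypothetical, and no timings are claimed here; wrapping either call in `system.time()` is one way to compare them on your own machine:

```r
x <- as.numeric(1:1000)

# Loop version: one element at a time
loop_sum_sq <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi * xi
  total
}

# Vectorized version: the multiplication and the sum each run over the whole vector
vec_sum_sq <- function(x) sum(x * x)

loop_result <- loop_sum_sq(x)
vec_result  <- vec_sum_sq(x)
```

Both produce the same number; the vectorized form is also the style RMR mappers are designed around, since they receive whole batches of records.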