RHadoop
Introduction & installation
By: Mohamed Ramadan
Agenda
• Quick introduction about Hadoop
• Preparing the RHadoop environment
• Installing rmr2
• Installing rhdfs
• Operating HDFS with rhdfs
Introduction about Hadoop
• Apache Hadoop is an open source Java framework for processing and
querying vast amounts of data on large clusters of commodity hardware.
• However, if the data (for example, the behavior of all online users) is too
large to fit in the memory of a single machine, you have no choice but
to use a supercomputer or some other scalable solution. The most
popular scalable big-data solution is Hadoop.
Cont.
• A cluster can be one or more computers connected to each other by a
very fast network.
• Apache Hadoop has two main components:
• HDFS (Hadoop Distributed File System) – storage
• MapReduce – processing
Agenda
• Quick introduction about Hadoop
• Preparing the RHadoop environment
• Installing rmr2
• Installing rhdfs
• Operating HDFS with rhdfs
Preparing the RHadoop environment
• RHadoop is a collection of five R packages that allow users to manage
and analyze data with Hadoop.
• The five packages are:
1. rhdfs: (install only on the node that will run the R client)
This is an interface between R and HDFS, which calls the HDFS API to
access the data stored in HDFS. The use of rhdfs is very similar to the
use of the Hadoop shell, which allows users to manipulate HDFS easily
from the R console.
Cont.
2. rmr: (install on every task node)
This is an interface between R and Hadoop MapReduce, which calls the
Hadoop streaming MapReduce API to perform MapReduce jobs across
Hadoop clusters. To develop an R MapReduce program, you only need
to focus on the design of the map and reduce functions; the remaining
scalability issues are taken care of by Hadoop itself (see the sketch after
this list).
3. plyrmr: (install on every node)
This is a higher-level abstraction of MapReduce, which allows users to
perform common data manipulation in a plyr-like syntax. This package
greatly lowers the learning curve of big-data manipulation.
Cont.
4. rhbase: (install only on the node that will run the R client)
This is an interface between R and HBase; it accesses HBase, which is
distributed across the cluster, through a Thrift server. You can use rhbase to
read/write data and manipulate tables stored within HBase.
5. ravro: (install only on the node that will run the R client)
This allows users to read and write Avro files in R, and lets R
exchange data with HDFS.
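To make the rmr idea concrete, here is a minimal word-count sketch using rmr2. This is an illustrative addition, not from the original slides; it assumes rmr2 is installed and that HADOOP_CMD and HADOOP_STREAMING are set as shown later in this deck:
library(rmr2)
wordcount <- function(input.path) {
  mapreduce(
    input        = input.path,
    input.format = "text",   # each map value is a line of text
    map = function(k, lines) {
      words <- unlist(strsplit(lines, "\\s+"))
      keyval(words, rep(1, length(words)))   # emit (word, 1) pairs
    },
    reduce = function(word, counts) keyval(word, sum(counts))
  )
}
# Example usage against a file already in HDFS (hypothetical path):
# out <- wordcount('/user/cloudera/word.txt')
# from.dfs(out)
Notice that only the map and reduce functions are application-specific; partitioning, shuffling, and distribution are handled by Hadoop.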
Cont.
• 1. Instead of building a new Hadoop system, we can use the Cloudera
QuickStart VM (the VM is free), which contains a single-node Apache
Hadoop cluster:
https://www.cloudera.com/downloads/quickstart_vms/5-13.html
• 2. Alternatively, download the VM image directly:
https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.2.0-0-vmware.7z
• 3. Then test Hadoop and R by typing hadoop and R in the terminal.
You will likely face several issues before everything runs successfully (good luck!).
• There is a text file that includes commands for solving the potential problems.
Agenda
• Quick introduction about Hadoop
• Preparing the RHadoop environment
• Installing rmr2
• Installing rhdfs
• Operating HDFS with rhdfs
Installing rmr2
1. First, open the terminal within the Cloudera QuickStart VM.
2. Use root permissions to enter an R session:
$ sudo R
3. You can then install dependent packages before installing rmr2:
> install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops",
"digest", "functional", "stringr", "plyr", "reshape2", "rJava",
"caTools"))
4. Quit the R session:
> q()
Cont.
5. Next, you can download rmr2 3.3.0 to the QuickStart VM. You may need to
update the link if Revolution Analytics upgrades the version of rmr2:
$ wget --no-check-certificate https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/3.3.0/build/rmr2_3.3.0.tar.gz
6. You can then install rmr2 3.3.0 on the QuickStart VM:
$ sudo R CMD INSTALL rmr2_3.3.0.tar.gz
7. Lastly, you can enter an R session and use the library function to test
whether the library has been successfully installed:
$ R
> library(rmr2)
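As an optional smoke test (an addition to the original steps), you can round-trip a small object through HDFS with rmr2's to.dfs/from.dfs. This assumes the environment variables described in the rhdfs section below are already set:
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar")
> small <- to.dfs(1:10)   # write a vector to a temporary HDFS file
> from.dfs(small)         # read it back as key-value pairs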
Hint about installing RStudio Server
• wget https://download2.rstudio.org/rstudio-server-rhel-1.1.453-x86_64.rpm
• sudo yum install rstudio-server-rhel-1.1.453-x86_64.rpm -y
• Browse to localhost:8787
Log in with your OS username and password, or change your password first:
$ whoami
$ sudo passwd <your username>
Agenda
• Quick introduction about Hadoop
• Preparing the RHadoop environment
• Installing rmr2
• Installing rhdfs
• Operating HDFS with rhdfs
Installing rhdfs
1. First, you can download rhdfs 1.0.8 from GitHub. You may need to update
the link if Revolution Analytics upgrades the version of rhdfs:
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
2. Next, you can install rhdfs from the command line:
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
3. You can then set up JAVA_HOME. The configuration of JAVA_HOME
depends on the Java version installed within the VM:
$ sudo JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera R CMD javareconf
Cont.
4. Last, you can set up the system environment and initialize rhdfs. You
may need to update the environment setup if you use a different version
of the QuickStart VM:
$ R
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar")
> library(rhdfs)
> hdfs.init()
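As a quick sanity check (an addition, not in the original slides), you can list the HDFS root from R to confirm that rhdfs can reach the cluster:
> hdfs.ls("/")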
Agenda
• Quick introduction about Hadoop
• Preparing the RHadoop environment
• Installing rmr2
• Installing rhdfs
• Operating HDFS with rhdfs
Operating HDFS with rhdfs
1. Initialize the rhdfs package (this must be run in each new session):
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar")
> library(rhdfs)
> hdfs.init()
• hdfs.put: Copy a file from the local filesystem to HDFS:
> hdfs.put('word.txt', './')
• hdfs.ls: List the contents of an HDFS directory:
> hdfs.ls('./')
• hdfs.copy: Copy a file from one HDFS directory to another:
> hdfs.copy('word.txt', 'wordcnt.txt')
• hdfs.move: Move a file from one HDFS directory to another:
> hdfs.move('wordcnt.txt', './data/wordcnt.txt')
Cont.
• hdfs.delete: Delete an HDFS directory from R:
> hdfs.delete('./data/')
• hdfs.rm: Also delete an HDFS directory from R (effectively an alias for the same operation):
> hdfs.rm('./data/')
• hdfs.get: Download a file from HDFS to the local filesystem:
> hdfs.get('word.txt', '/home/cloudera/word.txt')
• hdfs.rename: Rename a file stored on HDFS:
> hdfs.rename('./test/q1.txt', './test/test.txt')
• hdfs.chmod: Change the permissions of a file or directory:
> hdfs.chmod('test', permissions= '777')
• hdfs.file.info: Read the meta information of the HDFS file:
> hdfs.file.info('./')
Cont.
• Write a stream to an HDFS file:
> f = hdfs.file("iris.txt", "w")
> data(iris)
> hdfs.write(iris, f)
> hdfs.close(f)
• Read a stream from an HDFS file:
> f = hdfs.file("iris.txt", "r")
> dfserialized = hdfs.read(f)
> df = unserialize(dfserialized)
> df
> hdfs.close(f)
(By default, hdfs.write serializes the R object, which is why the bytes read back must be passed to unserialize.)
Cont.
Instead of setting the configuration each time you use rhdfs, you
can put the configuration in your .Rprofile file. Then, every time
you start an R session, the configuration will be loaded automatically.
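A minimal sketch of such an .Rprofile (paths assume the CDH 5.2 QuickStart VM used throughout these slides; adjust them for your installation):
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar")
if (interactive()) {
  suppressMessages(library(rhdfs))  # load rhdfs quietly at startup
  hdfs.init()                       # connect to HDFS for every new session
}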