R can be used in the cloud to perform data analysis and modeling. Amazon EC2 offers several instance types with varying memory sizes that suit R workloads. Data transfer between EC2 and S3 is free within the same region. Security and access are managed using key pairs. R scripts can be run on EC2 instances and their results persisted to S3. Parallel and distributed processing can be achieved with packages such as Rmpi and Revolution R, and Hadoop Streaming can be used to parallelize R algorithms over big data in Hadoop.
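As a minimal sketch of that workflow (not from the original), the boto3 call below launches an EC2 instance whose user-data script runs an R job and copies the results to S3; the AMI ID, key pair, bucket, and script path are hypothetical placeholders.

```python
import boto3

# User-data script: run a (hypothetical) R script baked into the AMI, then
# persist its output to S3 so it survives instance termination.
user_data = """#!/bin/bash
Rscript /home/ec2-user/analysis.R
aws s3 cp /home/ec2-user/results.csv s3://my-analysis-bucket/results.csv
"""

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: an AMI with R preinstalled
    InstanceType="r5.large",          # a memory-optimized type suits R's in-memory model
    KeyName="my-key-pair",            # key pair used for SSH access
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
)
print(response["Instances"][0]["InstanceId"])
```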
CloudForecast is a system-monitoring and visualization tool that uses Perl and RRDTool to collect data from servers and generate graphs. It gathers metrics such as CPU usage, network traffic, and Gearman worker status, storing the data in RRD files and a SQLite database. A radar component performs the collection, and a web interface displays the graphs generated from the collected data.
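CloudForecast itself is written in Perl, so the following is only a conceptual stand-in: a minimal Python sketch of the radar's collect-and-store loop, polling one metric and persisting timestamped samples for later graphing.

```python
import os
import sqlite3
import time

# Stand-in for CloudForecast's RRD/SQLite storage: one table of timestamped samples.
conn = sqlite3.connect("metrics.db")
conn.execute("CREATE TABLE IF NOT EXISTS samples (ts INTEGER, metric TEXT, value REAL)")

def collect_once():
    load1, _, _ = os.getloadavg()  # 1-minute load average (Unix) as a stand-in metric
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)",
                 (int(time.time()), "load1", load1))
    conn.commit()

for _ in range(3):  # the real radar runs on a fixed schedule, not a fixed count
    collect_once()
    time.sleep(1)
```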
Provides a system-level and pseudo-code-level anatomy of Hive, a data warehousing system built on Hadoop.
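As a rough illustration of that pseudo-code-level anatomy (an assumption, not Hive's actual source), the sketch below shows how a HiveQL aggregate such as `SELECT word, COUNT(*) FROM docs GROUP BY word` conceptually compiles to a MapReduce job: the mapper emits (group key, 1) pairs and the reducer sums them per key.

```python
def map_phase(row):
    # Emit the GROUP BY column as the key, with a count of 1.
    yield (row["word"], 1)

def reduce_phase(key, values):
    # COUNT(*) becomes a sum over the emitted 1s for each key.
    return (key, sum(values))

def run_job(rows):
    # Driver logic a Hive-like planner might generate: map, shuffle by key, reduce.
    shuffled = {}
    for row in rows:
        for key, value in map_phase(row):
            shuffled.setdefault(key, []).append(value)
    return [reduce_phase(k, vs) for k, vs in shuffled.items()]

print(run_job([{"word": "hive"}, {"word": "hadoop"}, {"word": "hive"}]))
# [('hive', 2), ('hadoop', 1)]
```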
This document introduces infrastructure as code (IaC) using Terraform and provides examples of deploying infrastructure on AWS, including:
- A single EC2 instance
- A single web server
- A cluster of web servers using an Auto Scaling Group
- Adding a load balancer using an Elastic Load Balancer
It also discusses Terraform concepts and syntax such as variables, resources, outputs, and interpolation (a workflow sketch follows below). The target audience is people who deploy infrastructure on AWS or other clouds.
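A minimal sketch of the simplest example above, assuming placeholder AMI and region values: write a configuration for a single EC2 instance with an output, then drive the standard init/plan/apply workflow.

```python
import pathlib
import subprocess

# A single-EC2-instance configuration; the output block demonstrates
# interpolation of a resource attribute. The AMI ID is a placeholder.
config = """
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"
}

output "instance_id" {
  value = aws_instance.example.id
}
"""

pathlib.Path("main.tf").write_text(config)
subprocess.run(["terraform", "init"], check=True)
subprocess.run(["terraform", "plan"], check=True)
subprocess.run(["terraform", "apply", "-auto-approve"], check=True)
```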
The document covers advanced functions in RHive, including UDFs, UDAFs, and UDTFs. It explains: 1) RHive allows UDF and UDAF functions to be written in R and deployed for use in Hive, letting R functions stand in for Hive's more complex MapReduce programming. 2) The rhive.assign, rhive.export, and rhive.exportAll functions deploy R functions and objects to the distributed Hive environment for processing large datasets. 3) An example demonstrates creating a sum function in R, assigning it with rhive.assign, and executing it on the USArrests Hive table.
Terraform can be used to automate the deployment and management of infrastructure as code. It allows infrastructure components such as VMs, networks, and DNS records to be defined as code in configuration files. Key benefits include versioned infrastructure changes, consistency across environments, and automated deployments. The document then details installing Terraform, common commands such as plan, apply, and import, defining resources, variables, and modules, and managing remote state. It also demonstrates creating an EC2 instance from a generated AMI.
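Of those commands, import is the least familiar; here is a minimal sketch (with a placeholder resource address and instance ID) of bringing an existing EC2 instance under Terraform management.

```python
import subprocess

subprocess.run(["terraform", "init"], check=True)
# Map an existing instance to the aws_instance.example resource in the config.
subprocess.run(
    ["terraform", "import", "aws_instance.example", "i-0123456789abcdef0"],
    check=True,
)
# If the configuration matches the real instance, plan should report no changes.
subprocess.run(["terraform", "plan"], check=True)
```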
The document discusses Terraform, an infrastructure as code tool. It covers installing Terraform, deploying infrastructure such as EC2 instances from Terraform configuration files, destroying resources, and managing Terraform state. Key topics include authenticating Terraform with AWS, creating a basic EC2 instance, validating and applying configuration changes, and storing state locally versus remotely.
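On the local-versus-remote state choice, a minimal sketch (bucket and key names are placeholders): a backend block moves state out of the local terraform.tfstate file into S3, and re-running init migrates it.

```python
import pathlib
import subprocess

# Remote-state backend configuration; without it, state lives in a local
# terraform.tfstate file next to the configuration.
backend = """
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}
"""

pathlib.Path("backend.tf").write_text(backend)
# init detects the new backend and offers to migrate existing local state.
subprocess.run(["terraform", "init"], check=True)
```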
Vector and ListBuffer have similar performance for random reads. Benchmarking showed no significant difference in throughput, average time, or sample times between reading randomly from a Vector versus a ListBuffer. Vectors are generally faster than Lists for random access because a Vector is backed by a shallow tree of arrays (a 32-way trie), giving effectively constant-time indexed access, whereas a List must be traversed linearly.
We are excited to continue our work on Elastic Beanstalk with the introduction of a range of great new features. If you are a Python shop, you'll learn how Elastic Beanstalk now supports Python containers and the Django and Flask frameworks. Hear about Elastic Beanstalk's integration with RDS and how containers can be customized through simple configuration files.
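A minimal Flask application of the kind those Python containers run; by convention, the Elastic Beanstalk Python platform looks for a WSGI callable named application.

```python
from flask import Flask

# Elastic Beanstalk's Python platform serves the WSGI callable named
# "application" (conventionally in application.py) by default.
application = Flask(__name__)

@application.route("/")
def index():
    return "Hello from Elastic Beanstalk"

if __name__ == "__main__":
    application.run()  # local development only; EB fronts this with its own server
```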
Foursquare uses MongoDB to power their location-based social network. They have over 9 million users generating around 3 million check-ins per day across over 15 million venues. Foursquare chose MongoDB because it is fast and supports rich queries, sharding, replication, and geo-indexes. Foursquare runs 8 MongoDB clusters across around 40 machines, storing over 2.3 billion records and handling around 15,000 queries per second. They developed Rogue, a Scala DSL for MongoDB, to make queries type-safe and to add features such as pagination, logging, and index awareness.
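Rogue itself is a Scala DSL, so the following is only illustrative: a pymongo sketch of the kind of geo-indexed "venues near me" query it wraps, with hypothetical collection, field names, and coordinates.

```python
from pymongo import MongoClient, GEOSPHERE

venues = MongoClient().foursquare.venues
venues.create_index([("location", GEOSPHERE)])  # 2dsphere geo-index

# Find up to 10 venues within 500 meters of a point (longitude, latitude).
nearby = venues.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
            "$maxDistance": 500,
        }
    }
}).limit(10)

for venue in nearby:
    print(venue["name"])  # hypothetical field
```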
Terraform allows users to define infrastructure as code to provision resources across multiple cloud platforms. It aims to describe infrastructure in a configuration file, provision resources efficiently by leveraging APIs, and manage the full lifecycle from creation to deletion. Key features include composability across different infrastructure tiers, a graph-based approach that parallelizes operations for efficiency, and state management that tracks resources' unique IDs and allows resources to be recreated. Providers supply connectivity to the different cloud APIs, while resources define the specific infrastructure components and their properties.
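The graph-based approach can be inspected directly: the terraform graph command emits the dependency graph in DOT format, which is what lets Terraform run independent operations in parallel. A small sketch:

```python
import subprocess

# Emit the resource dependency graph in DOT format; pipe it into Graphviz
# (e.g. `dot -Tpng`) to visualize which operations can run in parallel.
result = subprocess.run(["terraform", "graph"], check=True,
                        capture_output=True, text=True)
print(result.stdout)
```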
The document discusses refactoring Terraform configuration files to improve their design. It provides an example of refactoring a "supermarket-terraform" configuration that originally defined AWS resources across multiple files. The refactoring consolidates the configuration into a single file and adds testing using Test Kitchen. It emphasizes starting small, adding tests incrementally, and never making changes without tests, to avoid introducing errors.
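The original uses Test Kitchen, but as a stand-in for the "add tests incrementally" idea, even a minimal pytest-style harness that shells out to Terraform catches many refactoring errors:

```python
import subprocess

def test_config_validates():
    # Syntax and internal-consistency check of the refactored configuration.
    assert subprocess.run(["terraform", "validate"]).returncode == 0

def test_plan_is_clean():
    # -detailed-exitcode: 0 = no changes, 2 = changes pending, 1 = error.
    # A refactor that only moves code around should never error out.
    result = subprocess.run(["terraform", "plan", "-detailed-exitcode"])
    assert result.returncode in (0, 2)
```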
1. Terraform allows users to define infrastructure as code and treat it like versioned code, using configuration files that are shared and versioned.
2. Terraform uses providers to manage cloud infrastructure through their APIs. It generates and executes plans to build, change, and destroy infrastructure based on the configuration files.
3. Terraform supports variables, modules, data sources, and workspaces to manage infrastructure across environments such as dev, staging, and production in an automated, reusable way (see the sketch below).
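A minimal sketch of that workspace-plus-variables workflow, assuming a configuration that declares an instance_type variable: one set of files reused across environments, each with its own state.

```python
import subprocess

# Create (and switch to) a dev workspace; each workspace keeps separate state.
subprocess.run(["terraform", "workspace", "new", "dev"], check=True)

# Apply the shared configuration with a dev-sized value for the
# (assumed) instance_type variable.
subprocess.run(
    ["terraform", "apply", "-auto-approve", "-var", "instance_type=t2.micro"],
    check=True,
)

# Later, `terraform workspace select prod` would switch to production state.
```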
Tips and experiences on building and running Ruby on the Solaris OS, with examples of bug fixes made to support Solaris in Ruby.
It's Dangerous to GC alone. Take this! IBM's talk on its work integrating the OMR GC into Ruby. OMR preview: goo.gl/P3yXuy
Scripting Embulk plugins makes plugin development drastically easier. You can develop, test, and productionize data integrations using any scripting language. It is the most suitable way to integrate data with SaaS services using vendor-provided SDKs. https://techplay.jp/event/781988
Akka uses Actors together with STM to create a unified runtime and programming model for scaling both up (multi-core) and out (grid/cloud). Akka provides location transparency by abstracting away both of these dimensions of scalability, turning them into an ops task. This gives the Akka runtime the freedom to do adaptive automatic load balancing, cluster rebalancing, replication, and partitioning.
This document gives an introduction to Terraform and infrastructure as code. It covers what Terraform is, key features such as being declarative and reusable, and pros and cons, such as reducing human error but having a high barrier to entry. It surveys the major providers Terraform supports, including cloud platforms, software, and monitoring tools. It concludes with the basic steps for getting started: installation, AWS profile setup, creating demo files, and running commands to provision a VPC on AWS (sketched below).
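A compact sketch of those getting-started steps, with a placeholder profile name and CIDR block: point the AWS provider at a named profile, then provision a VPC.

```python
import pathlib
import subprocess

# Provider pinned to a pre-configured AWS profile, plus a single VPC resource.
config = """
provider "aws" {
  profile = "demo"
  region  = "ap-northeast-1"
}

resource "aws_vpc" "demo" {
  cidr_block = "10.0.0.0/16"
}
"""

pathlib.Path("vpc.tf").write_text(config)
subprocess.run(["terraform", "init"], check=True)
subprocess.run(["terraform", "apply", "-auto-approve"], check=True)
```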
Abhishek Sinha is a senior product manager at Amazon for Amazon EMR. Amazon EMR allows customers to easily run data frameworks like Hadoop, Spark, and Presto on AWS. It provides a managed platform and tools to launch clusters in minutes that leverage the elasticity of AWS. Customers can customize clusters and choose from different applications, instance types, and access methods. Amazon EMR allows separating compute and storage: low-cost S3 can be used for persistent storage while clusters are dynamically scaled based on workload.
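A hedged sketch of launching such a cluster with boto3; the release label, instance types, and counts are placeholders, and the default EMR IAM roles are assumed to already exist in the account.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-5.30.0",                # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Presto"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # long-lived; set False for transient clusters
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```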
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way. In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop Applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; dDeployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively."
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
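A minimal PySpark sketch of the EMRFS pattern described above: on an EMR cluster, s3:// paths resolve through EMRFS, so Spark can query data in place. The bucket, path, and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-query").getOrCreate()

# Read Parquet data directly from S3 (via EMRFS on EMR) and aggregate it.
df = spark.read.parquet("s3://my-bucket/events/")
df.groupBy("event_type").count().show()
```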
Introduction to Yaetos, an open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them into production in the AWS cloud. The focus is on the Spark component.
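This is not Yaetos's actual API (see the project for that); it is just a generic sketch of the Spark transform such a pipeline tool wraps: load inputs, apply SQL logic, write the output. Paths and fields are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline").getOrCreate()

orders = spark.read.json("s3://my-bucket/raw/orders/")  # hypothetical input
orders.createOrReplaceTempView("orders")

# The SQL half of a Python-and-SQL pipeline: a daily aggregate.
daily = spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date"
)
daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_totals/")
```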
The document discusses strategies for scaling LAMP applications on cloud computing platforms like AWS. It recommends:
1) Moving static files to scalable services like S3 and using a CDN to distribute load.
2) Using dedicated caching systems like Memcache instead of local caches, and storing sessions in Memcache or DynamoDB for scalability (see the sketch after this list).
3) Scaling databases horizontally using master-slave replication or sharding across multiple Availability Zones for high availability and read scaling.
4) Leveraging auto scaling and load balancing on AWS with tools like Elastic Load Balancers, CloudWatch, and scaling alarms to dynamically scale application instances based on metrics.
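A hedged sketch of the session-storage advice in item 2: keep sessions in DynamoDB so any instance behind the load balancer can serve any user. Table and attribute names are hypothetical.

```python
import time
import boto3

sessions = boto3.resource("dynamodb").Table("sessions")  # hypothetical table

def save_session(session_id, data, ttl_seconds=3600):
    sessions.put_item(Item={
        "session_id": session_id,
        "data": data,
        # Pair with a TTL attribute on the table so stale sessions expire.
        "expires_at": int(time.time()) + ttl_seconds,
    })

def load_session(session_id):
    return sessions.get_item(Key={"session_id": session_id}).get("Item")
```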
Amazon EMR is a managed service that makes it easy for customers to use big data frameworks and applications like Apache Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this session, we will introduce Amazon EMR and the greater Apache Hadoop ecosystem, and show how customers use them to implement and scale common big data use cases such as batch analytics, real-time data processing, interactive data science, and more. Then, we will walk through a demo to show how you can start processing your data at scale within minutes.
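In the spirit of that demo, a hedged boto3 sketch of submitting a Spark step to an already-running cluster (the cluster ID and script path are placeholders); command-runner.jar is EMR's generic step runner.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-0123456789ABC",  # placeholder cluster ID
    Steps=[{
        "Name": "spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],
        },
    }],
)
```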
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
This document summarizes Netflix's big data platform, which uses Presto and Spark on Amazon EMR and S3. Key points:
- Netflix processes over 50 billion hours of streaming per quarter from 65+ million members across more than 1,000 device types.
- Their data warehouse contains over 25 PB stored on S3. Roughly 10% of that data is read daily, and writes amount to about 10% of what is read.
- They use Presto for interactive queries and Spark for both batch and iterative jobs.
- They have customized Presto and Spark for better performance on S3 and Parquet, and contributed code back to the open source projects.
- Their architecture leverages dynamic EMR clusters, with Presto and Spark deployed via bootstrap actions for scalability (see the sketch after this list).
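A hedged sketch of that bootstrap-action mechanism as boto3 sees it; the script path is a placeholder, not Netflix's actual deployment script.

```python
# Passed as BootstrapActions=bootstrap_actions to emr.run_job_flow(...);
# each action runs a script on every node before applications start.
bootstrap_actions = [{
    "Name": "install-custom-presto",
    "ScriptBootstrapAction": {
        "Path": "s3://my-bucket/bootstrap/install_presto.sh",
    },
}]
```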
This document discusses various options for migrating data and workloads between on-premises environments and AWS. It covers tools such as AWS Database Migration Service for database migration, VM Import/Export for virtual machine migration, copying files between S3 buckets, and using services like Route 53 to shift traffic during a migration. Specific techniques discussed include copying AMIs, EBS snapshots, security groups, and database parameters between regions; using the AWS Schema Conversion Tool; and DynamoDB cross-region replication.
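A hedged sketch of one of those techniques, copying an AMI between regions with boto3; the IDs and region names are placeholders. Note the call is made against the destination region's client.

```python
import boto3

# Copy an AMI from us-east-1 into us-west-2.
ec2_west = boto3.client("ec2", region_name="us-west-2")
copy = ec2_west.copy_image(
    Name="migrated-ami",
    SourceImageId="ami-0123456789abcdef0",  # placeholder source AMI
    SourceRegion="us-east-1",
)
print(copy["ImageId"])  # new AMI ID in the destination region
```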
by Dario Rivera, Solutions Architect, AWS Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices.
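On the dynamic-scaling pattern, a hedged boto3 sketch of resizing a running cluster's instance group (the IDs and target count are placeholders):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
# Grow (or shrink) a core/task instance group on a running cluster.
emr.modify_instance_groups(
    ClusterId="j-0123456789ABC",                # placeholder cluster ID
    InstanceGroups=[{
        "InstanceGroupId": "ig-0123456789ABC",  # placeholder group ID
        "InstanceCount": 10,
    }],
)
```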