R Jobs on the Cloud
Doxaras Yiannis for mineknowledge
[Diagram: your company's cloud, another company's cloud, a competitor's cloud, and my company's cloud.]
Instance Types

Data set sizes for a single-instance R AMI: 1-2 GB, 4-5 GB, 9-10 GB.
Pricing*

•   Small: m1.small

•   Large: m1.large

•   XLarge: m1.xlarge

Data received by EC2 instances costs 10¢ per GB (1024³ bytes). Data sent from EC2 instances is charged on a sliding scale, depending on the volume of data transferred during the month: 18¢/GB from 0 to 10 TB, 16¢/GB from 10 to 50 TB, and 13¢/GB for any amount over 50 TB.

Data transfers between EC2 instances incur no transfer fees. Data transfers between EC2 instances and S3 buckets located in the United States are also free, but data transfers between EC2 instances and S3 buckets located in Europe incur the standard transfer fees.
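To make the sliding scale concrete, here is a small R sketch that prices one month of outbound transfer using only the rates quoted above (the 12 TB figure is an arbitrary example):

# price outbound EC2 transfer for one month under the sliding scale above
outbound_cost <- function(gb) {
  t1 <- min(gb, 10 * 1024)                       # first 10 TB at 18 cents/GB
  t2 <- min(max(gb - 10 * 1024, 0), 40 * 1024)   # next 40 TB at 16 cents/GB
  t3 <- max(gb - 50 * 1024, 0)                   # everything beyond 50 TB at 13 cents/GB
  0.18 * t1 + 0.16 * t2 + 0.13 * t3              # dollars
}
outbound_cost(12 * 1024)   # 12 TB out: 10 TB @ 18c + 2 TB @ 16c = $2170.88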


EBS

Think of EBS as an external hard drive: a block-storage volume (backed up to S3 via snapshots) that keeps your data persistent across AMI reboots and instance failures.
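A minimal shell sketch of that workflow with the classic EC2 command-line tools; the volume size, zone, volume/instance IDs, and device names below are placeholders:

$ ec2-create-volume --size 10 --availability-zone us-east-1a   # new 10 GB volume
$ ec2-attach-volume vol-12345678 -i i-12345678 -d /dev/sdf     # attach to a running instance
# then, on the instance itself:
$ mkfs.ext3 /dev/sdf    # format once
$ mount /dev/sdf /data  # mount; snapshot later with ec2-create-snapshot vol-12345678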
Security

•Key pairs: public/private key cryptography (OpenSSL)

•Network security: the pre-configured "default" security group allows traffic between EC2 instances; configure it explicitly for external communication
AMI Setup

•Search for the AMI manifest ID.
•Image location in S3.
•m1.manifest.xml.
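For example, the public image catalogue can be searched with the API tools (the grep pattern is just illustrative):

$ ec2-describe-images -a | grep -i fedora   # list public AMIs, filter for Fedora images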
AMI Statuses

•pending (0):        launching and not yet started

•running (16):       launched and performing like a normal computer (though not necessarily finished booting the AMI's operating system)

•shutting-down (32): in the process of terminating

•terminated (48):    no longer running
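These states show up when polling an instance with the API tools (the instance ID is a placeholder):

$ ec2-describe-instances i-12345678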


Starting Instances
•   ImageId*

•   MinCount*

•   MaxCount*

•   KeyName

•   SecurityGroup

•   InstanceType

•   UserData

•   AddressingType

Parameters marked * are required; a launch sketch follows.
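A minimal launch sketch with the classic API tools; the AMI ID, key pair, and group below are placeholders:

$ ec2-run-instances ami-12345678 -n 1 -k my-keypair -g default \
    -t m1.large -d 'optional user data handed to the instance at boot'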
Logging@Instance

•   proper security group

•   proper public DNS entry

•   RSA key-pair authentication

•   $ chmod 400 ec2-private-key.enc

•   $ ssh -i ec2-private-key.enc root@ec2-67-202-4-222.z-1.compute-1.amazonaws.com
After logging in you land at the EC2 public image welcome banner; getting-started notes live in /etc/ec2/release-notes.txt, and the prompt looks like:

[root@domU-12-31-35-00-53-82 ~]#
Register An AMI

•Bundle and upload to S3 with manifest.xml.
•Register the AMI.
•Describe AMI attributes.
•Reset AMI attributes.
•Confirm the AMI product code.
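A hedged sketch of that pipeline with the AMI tools; the key, certificate, account ID, and bucket names are placeholders:

$ ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u 123456789012   # bundle the volume
$ ec2-upload-bundle -b my-bucket -m /mnt/image.manifest.xml \
    -a $AWS_ACCESS_KEY -s $AWS_SECRET_KEY                        # upload parts to S3
$ ec2-register my-bucket/image.manifest.xml                      # register the AMI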

Performance Issues

•Instance Type
•Shared Subsystems*
•Network Bandwidth
•Storage Space Initialization
•RAID
Persistence

• S3 is the main storage service for EC2.

• Cloud programming involves backup mechanisms from the beginning of deployment:

  •   Use EC2 as a cache.
  •   Perform scheduled backups.
  •   Perform scheduled bundling to an AMI.
  •   Mount S3 as a local partition.
  •   Push your luck.
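One way to script the scheduled-backup bullet; a minimal sketch assuming s3cmd is installed and configured, with a placeholder bucket and path:

#!/bin/bash
# nightly-backup.sh -- sync R results to S3; schedule from cron, e.g.
#   0 3 * * * /root/nightly-backup.sh
s3cmd sync /data/r-output/ s3://my-backup-bucket/r-output/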
Our AMI Choice

• Operating system*
• Software*
• Auditing actions*
• Configure system services*
• Install the Amazon AMI-building tools*
• Develop shell utility scripts*
• Build and upload to S3

R on Fedora

•Extra packages* (installed via R scripting)

•Plotting utilities (to EPS, PDF)?

•Services?

•Data distribution*: integration via web services with Oracle BI and Microsoft Reports
Demo

ssh@AMI

Tools used: ElasticFox, S3Fox, bash scripting, python, Rscript.
R Cloud Data Handling

[Diagram: inside AWS, an EC2 instance (root volume sda, plus sdb and sdd) exchanges data with S3 over a network drive; R input is read from S3 and R output written back, while the instance is periodically backed up as AMI #1, #2, #3.]
Batch Processing With R

#! /usr/bin/Rscript --vanilla --default-packages=utils
# install the packages named on the command line; exit non-zero on failure
args <- commandArgs(TRUE)
res <- try(install.packages(args))

if (inherits(res, "try-error")) {
   q(status = 1)
} else {
   q()
}


$ R --vanilla --slave < hello_world.R
$ R --vanilla --slave < hello_world.R > result.txt

$ cat > print_my_args.R << EOF
args <- commandArgs(TRUE)
print(args)
q()
EOF

$ R --slave --args a=100 b=200 < print_my_args.R

Large Data Sets

•Excel, SAS, SPSS, etc.
•Upload files to S3 (use scripts; see the sketch after this list)
•Data parallelism vs. task parallelism

•Service queuing
•Messaging interfaces
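A minimal upload sketch for the second bullet, again assuming s3cmd and a placeholder bucket:

#!/bin/bash
# push every CSV export (Excel/SAS/SPSS dumps) to S3 for the R AMIs to consume
for f in exports/*.csv; do
    s3cmd put "$f" s3://my-data-bucket/input/
done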
R Data Fragmentation

• Avoid correlation-type algorithms in R scripting (they need the whole data set at once, so they fragment poorly)

• Data capture and delivery
• Choose the proper AMI type
• Probabilistic algorithm outcomes
• Consider data fragmentation in R scripting* (S3 integration and data preparation?)
To Parallelize Or Not?

•R is not thread-safe
•R stores all data in memory
•Algorithms are serial processes

•Solutions like Rmpi raise the learning curve
Data Parallelism vs. Task Parallelism

[Diagram: a parallel agent distributing work between the two models.]

Recommended for you

Above the clouds: introducing Akka
Above the clouds: introducing AkkaAbove the clouds: introducing Akka
Above the clouds: introducing Akka

Akka is using the Actors together with STM to create a unified runtime and programming model for scaling both UP (multi-core) and OUT (grid/cloud). Akka provides location transparency by abstracting away both these tangents of scalability by turning them into an ops task. This gives the Akka runtime freedom to do adaptive automatic load-balancing, cluster rebalancing, replication & partitioning

actorsconcurrencyakka
Introduce to Terraform
Introduce to TerraformIntroduce to Terraform
Introduce to Terraform

This document discusses an introduction to Terraform infrastructure as code. It covers what Terraform is, its key features like being declarative and reusable, pros and cons like reducing human error but having a high entry barrier. It discusses major providers supported by Terraform including cloud providers, software, and monitoring tools. It concludes with basic steps for getting started with Terraform like installation, AWS profile setup, creating demo files, and executing commands to provision a VPC on AWS.

terraformiacdevops
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR

Abhishek Sinha is a senior product manager at Amazon for Amazon EMR. Amazon EMR allows customers to easily run data frameworks like Hadoop, Spark, and Presto on AWS. It provides a managed platform and tools to launch clusters in minutes that leverage the elasticity of AWS. Customers can customize clusters and choose from different applications, instances types, and access methods. Amazon EMR allows separating compute and storage where the low-cost S3 can be used for persistent storage while clusters are dynamically scaled based on workload.

awsaws-loft-london-2016aws cloud
R Parallel

Loop parallelization for task fragmentation
Rmpi (cluster configuration from inside R)
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
   library("Rmpi")
}

# Spawn as many slaves as possible
mpi.spawn.Rslaves()
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
   if (is.loaded("mpi_initialize")){
      if (mpi.comm.size(1) > 0){
         print("Please use mpi.close.Rslaves() to close slaves.")
         mpi.close.Rslaves()
      }
      print("Please use mpi.quit() to quit R")
      .Call("mpi_finalize")
   }
}

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))

# Tell all slaves to close down, and exit the program

mpi.close.Rslaves()
mpi.quit()
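Once the slaves are spawned, the same loop-fragmentation idea can be expressed with Rmpi's parallel apply helpers; a minimal sketch:

# fan a loop body out across the spawned slaves and gather the results
x <- 1:100
res <- mpi.parSapply(x, function(i) i^2)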
REvolution Parallel R

# Load the Parallel R stack (doNWS pulls in foreach and iterators)
require('doNWS')
# Define the function f in our local environment
f <- function(x) { sqrt(x) }
# Start up two R worker processes and register them with the
# parallel version of foreach
setSleigh(sleigh(workerCount=2))
registerDoNWS()
# Run a simple foreach loop in parallel on the two workers
foreach(j=1:9, .combine=c) %dopar% f(j)
# Note that the workers use the function f from our local environment,
# even though it was not explicitly defined on the workers!
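With .combine=c, the nine worker results are concatenated into a single numeric vector, so the parallel loop returns the same values as a serial sqrt(1:9).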

Map Reduce
R as a Data Worker For Hadoop (HadoopStreaming)

1. Data plumbing: take apply's in R and present them to Hadoop as input for a job -- essentially split the input vector into partitions, each of which goes to a Mapper task, then have the Reducers combine the results, which are sent back to R. R then continues its processing.

2. R algorithm parallelization: rewrite critical parts of popular algorithms implemented in R so that they can take advantage of R-Hadoop integration.



User experience: roughly 400 GB of data spread across 5 AMIs.
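To make the data-plumbing idea concrete, here is a hypothetical Hadoop Streaming mapper written as an Rscript; the job invocation in the comments and all file names are placeholders, not the deck's actual setup:

#!/usr/bin/Rscript
# mapper.R -- read records from stdin, emit tab-separated key/value pairs.
# A streaming job would run it roughly like:
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper mapper.R -reducer reducer.R
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (w in strsplit(line, "[[:space:]]+")[[1]]) {
    if (nchar(w) > 0) cat(w, "\t", 1, "\n", sep = "")
  }
}
close(con)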
R Data Parallelization

• Use RToolkit for "Parallel R" processing

• DNS and DynDNS node configuration
• Node and memory optimization
• Develop the R script and distribute it
Further Lookup

•   http://calculator.s3.amazonaws.com/calc5.html

•   Secure EC2 Instance: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1233

•   http://www.revolution-computing.com/

•   http://math.acadiau.ca/ACMMaC/Rmpi/sample.html

•   http://finzi.psych.upenn.edu/R/library/utils/html/Rscript.html

•   http://www.rparallel.org/

•   http://cran.r-project.org/web/views/

Hopefully You Have More Clouded Days From Now On

doxaras@mineknowledge.com

More Related Content

What's hot

Hadoop on osx
Hadoop on osxHadoop on osx
Hadoop on osx
Devopam Mittra
 
Terraform at Scale - All Day DevOps 2017
Terraform at Scale - All Day DevOps 2017Terraform at Scale - All Day DevOps 2017
Terraform at Scale - All Day DevOps 2017
Jonathon Brouse
 
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
Stephane Jourdan
 
Introduction to cloudforecast
Introduction to cloudforecastIntroduction to cloudforecast
Introduction to cloudforecast
Masahiro Nagano
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
nzhang
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
Jason Vance
 
R hive tutorial - udf, udaf, udtf functions
R hive tutorial - udf, udaf, udtf functionsR hive tutorial - udf, udaf, udtf functions
R hive tutorial - udf, udaf, udtf functions
Aiden Seonghak Hong
 
Configuration management II - Terraform
Configuration management II - TerraformConfiguration management II - Terraform
Configuration management II - Terraform
Xavier Serrat Bordas
 
Terraform day1
Terraform day1Terraform day1
Terraform day1
Gourav Varma
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法
x1 ichi
 
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic BeanstalkAWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic Beanstalk
Amazon Web Services
 
MongoSF - mongodb @ foursquare
MongoSF - mongodb @ foursquareMongoSF - mongodb @ foursquare
MongoSF - mongodb @ foursquare
jorgeortiz85
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
Radek Simko
 
Refactoring terraform
Refactoring terraformRefactoring terraform
Refactoring terraform
Nell Shamrell-Harrington
 
Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018
Mathieu Herbert
 
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
ngotogenome
 
The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015
craig lehmann
 
Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
Sadayuki Furuhashi
 
Above the clouds: introducing Akka
Above the clouds: introducing AkkaAbove the clouds: introducing Akka
Above the clouds: introducing Akka
nartamonov
 
Introduce to Terraform
Introduce to TerraformIntroduce to Terraform
Introduce to Terraform
Samsung Electronics
 

What's hot (20)

Hadoop on osx
Hadoop on osxHadoop on osx
Hadoop on osx
 
Terraform at Scale - All Day DevOps 2017
Terraform at Scale - All Day DevOps 2017Terraform at Scale - All Day DevOps 2017
Terraform at Scale - All Day DevOps 2017
 
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
Using Terraform.io (Human Talks Montpellier, Epitech, 2014/09/09)
 
Introduction to cloudforecast
Introduction to cloudforecastIntroduction to cloudforecast
Introduction to cloudforecast
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
 
R hive tutorial - udf, udaf, udtf functions
R hive tutorial - udf, udaf, udtf functionsR hive tutorial - udf, udaf, udtf functions
R hive tutorial - udf, udaf, udtf functions
 
Configuration management II - Terraform
Configuration management II - TerraformConfiguration management II - Terraform
Configuration management II - Terraform
 
Terraform day1
Terraform day1Terraform day1
Terraform day1
 
あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法あなたのScalaを爆速にする7つの方法
あなたのScalaを爆速にする7つの方法
 
AWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic BeanstalkAWS Update | London - Elastic Beanstalk
AWS Update | London - Elastic Beanstalk
 
MongoSF - mongodb @ foursquare
MongoSF - mongodb @ foursquareMongoSF - mongodb @ foursquare
MongoSF - mongodb @ foursquare
 
Declarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with TerraformDeclarative & workflow based infrastructure with Terraform
Declarative & workflow based infrastructure with Terraform
 
Refactoring terraform
Refactoring terraformRefactoring terraform
Refactoring terraform
 
Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018
 
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
Running Ruby on Solaris (RubyKaigi 2015, 12/Dec/2015)
 
The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015The OMR GC talk - Ruby Kaigi 2015
The OMR GC talk - Ruby Kaigi 2015
 
Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Above the clouds: introducing Akka
Above the clouds: introducing AkkaAbove the clouds: introducing Akka
Above the clouds: introducing Akka
 
Introduce to Terraform
Introduce to TerraformIntroduce to Terraform
Introduce to Terraform
 

Similar to R Jobs on the Cloud

Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
Amazon Web Services
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
prevota
 
Cloud computing & lamp applications
Cloud computing & lamp applicationsCloud computing & lamp applications
Cloud computing & lamp applications
Corley S.r.l.
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Amazon Web Services
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
Amazon Web Services Korea
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
Amazon Web Services
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Amazon Web Services
 
Manage cloud infrastructures using Zend Framework 2 (and ZF1)
Manage cloud infrastructures using Zend Framework 2 (and ZF1)Manage cloud infrastructures using Zend Framework 2 (and ZF1)
Manage cloud infrastructures using Zend Framework 2 (and ZF1)
Enrico Zimuel
 
Closing the DevOps gaps
Closing the DevOps gapsClosing the DevOps gaps
Closing the DevOps gaps
dev2ops
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 

Similar to R Jobs on the Cloud (20)

Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Cloud computing & lamp applications
Cloud computing & lamp applicationsCloud computing & lamp applications
Cloud computing & lamp applications
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Manage cloud infrastructures using Zend Framework 2 (and ZF1)
Manage cloud infrastructures using Zend Framework 2 (and ZF1)Manage cloud infrastructures using Zend Framework 2 (and ZF1)
Manage cloud infrastructures using Zend Framework 2 (and ZF1)
 
Closing the DevOps gaps
Closing the DevOps gapsClosing the DevOps gaps
Closing the DevOps gaps
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 

More from John Doxaras

Chatbots - A new era in digital banking
Chatbots - A new era in digital bankingChatbots - A new era in digital banking
Chatbots - A new era in digital banking
John Doxaras
 
Programmatic Mobile First
Programmatic Mobile FirstProgrammatic Mobile First
Programmatic Mobile First
John Doxaras
 
Mobile Wallets and Value Added Services
Mobile Wallets and Value Added Services Mobile Wallets and Value Added Services
Mobile Wallets and Value Added Services
John Doxaras
 
Entrepreneurship for Physicists
Entrepreneurship for Physicists Entrepreneurship for Physicists
Entrepreneurship for Physicists
John Doxaras
 
Warply Features
Warply FeaturesWarply Features
Warply Features
John Doxaras
 
Enter2013 Travel Industry Context Marketing
Enter2013 Travel Industry Context MarketingEnter2013 Travel Industry Context Marketing
Enter2013 Travel Industry Context Marketing
John Doxaras
 
Speed
SpeedSpeed
Responsive design
Responsive designResponsive design
Responsive design
John Doxaras
 
eDMO Our proposal for Greek's strategic tourism marketing.
eDMO Our proposal for Greek's strategic tourism marketing.eDMO Our proposal for Greek's strategic tourism marketing.
eDMO Our proposal for Greek's strategic tourism marketing.
John Doxaras
 
Open Source Mobile Development
Open Source Mobile Development 	Open Source Mobile Development
Open Source Mobile Development
John Doxaras
 
Business Planning in Real Life, Part 1
Business Planning in Real Life, Part 1Business Planning in Real Life, Part 1
Business Planning in Real Life, Part 1
John Doxaras
 
Reality Advert Slideshare
Reality Advert SlideshareReality Advert Slideshare
Reality Advert Slideshare
John Doxaras
 
Open Source GIS and Modeling Tools
Open Source GIS and  Modeling ToolsOpen Source GIS and  Modeling Tools
Open Source GIS and Modeling Tools
John Doxaras
 

More from John Doxaras (15)

Chatbots - A new era in digital banking
Chatbots - A new era in digital bankingChatbots - A new era in digital banking
Chatbots - A new era in digital banking
 
Programmatic Mobile First
Programmatic Mobile FirstProgrammatic Mobile First
Programmatic Mobile First
 
Mobile Wallets and Value Added Services
Mobile Wallets and Value Added Services Mobile Wallets and Value Added Services
Mobile Wallets and Value Added Services
 
Entrepreneurship for Physicists
Entrepreneurship for Physicists Entrepreneurship for Physicists
Entrepreneurship for Physicists
 
Warply Features
Warply FeaturesWarply Features
Warply Features
 
Enter2013 Travel Industry Context Marketing
Enter2013 Travel Industry Context MarketingEnter2013 Travel Industry Context Marketing
Enter2013 Travel Industry Context Marketing
 
Speed
SpeedSpeed
Speed
 
Responsive design
Responsive designResponsive design
Responsive design
 
Cheapcoffee
CheapcoffeeCheapcoffee
Cheapcoffee
 
eDMO Our proposal for Greek's strategic tourism marketing.
eDMO Our proposal for Greek's strategic tourism marketing.eDMO Our proposal for Greek's strategic tourism marketing.
eDMO Our proposal for Greek's strategic tourism marketing.
 
Open Source Mobile Development
Open Source Mobile Development 	Open Source Mobile Development
Open Source Mobile Development
 
Business Planning in Real Life, Part 1
Business Planning in Real Life, Part 1Business Planning in Real Life, Part 1
Business Planning in Real Life, Part 1
 
Reality Advert Slideshare
Reality Advert SlideshareReality Advert Slideshare
Reality Advert Slideshare
 
Learning 2.0
Learning 2.0Learning 2.0
Learning 2.0
 
Open Source GIS and Modeling Tools
Open Source GIS and  Modeling ToolsOpen Source GIS and  Modeling Tools
Open Source GIS and Modeling Tools
 

Recently uploaded

論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
Toru Tamaki
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
Awais Yaseen
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
R Jobs on the Cloud

  • 1. R Jobs on the Cloud Doxaras Yiannis for mineknowledge
  • 2. (Diagram) Your company's cloud, another company's cloud, your competitors' clouds, my company's cloud.
  • 3. Instance Types: data sets for a single-instance R AMI come in 1-2 GB, 4-5 GB, and 9-10 GB sizes.
  • 4. Pricing*
    • Small (m1.small)
    • Large (m1.large)
    • XLarge (m1.xlarge)
    Data received by EC2 instances costs 10¢ per GB (1024³ bytes). Data sent from EC2 instances is charged on a sliding scale, depending on the volume of data transferred during the month: 18¢/GB from 0 to 10 TB, 16¢/GB from 10 to 50 TB, and 13¢/GB for any amount over 50 TB. Data transfers between EC2 instances incur no transfer fees. Data transfers between EC2 instances and S3 buckets located in the United States are also free, but transfers between EC2 instances and S3 buckets located in Europe incur the standard transfer fees.
  • 5. EBS: think of an EBS volume as an external hard drive for your instance, persistent block storage (with snapshots kept on S3) that survives AMI reboots and failures.
  • 6. Security
    • Key pairs: public/private key cryptography (OpenSSL).
    • Network security: the pre-configured "default" security group allows traffic between your EC2 instances; configure it explicitly for external communication (see the sketch below).
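    A minimal sketch of both steps, driving the classic EC2 API tools from R (the key pair name is a placeholder, not from the slides):

      # Create a key pair and save the private key locally, then open
      # port 22 on the pre-configured "default" security group so that
      # external SSH connections are allowed.
      system("ec2-add-keypair r-cloud-keypair > ec2-private-key.enc")
      system("ec2-authorize default -p 22")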
  • 7. AMI Setup
    • Search AMI manifest ID.
    • Image location in S3.
    • m1.manifest.xml.
  • 8. AMI Statuses
    • pending (0): launching and not yet started
    • running (16): launched and performing like a normal computer (though not necessarily finished booting the AMI's operating system)
    • shutting-down (32): in the process of terminating
    • terminated (48): no longer running
  • 9. Starting Instances (the parameters of the RunInstances call; a launch sketch follows this list)
    • ImageId*
    • MinCount*
    • MaxCount*
    • KeyName
    • SecurityGroup
    • InstanceType
    • UserData
    • AddressingType
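    A hedged launch sketch from R, shelling out to ec2-run-instances (the AMI ID, key pair, and group names below are placeholders):

      # Map the RunInstances parameters onto ec2-run-instances flags:
      # ImageId (positional), MinCount/MaxCount (-n), KeyName (-k),
      # SecurityGroup (-g), InstanceType (-t).
      cmd <- paste(
        "ec2-run-instances ami-12345678",  # ImageId: hypothetical R AMI
        "-n 1",                            # MinCount/MaxCount: exactly one instance
        "-k r-cloud-keypair",              # KeyName from the security slide
        "-g default",                      # SecurityGroup
        "-t m1.large"                      # InstanceType
      )
      system(cmd)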
  • 10. Logging@Instance
    • proper security group
    • proper public DNS entry
    • RSA key authentication
    • $ chmod 400 ec2-private-key.enc (ssh refuses keys that are world-readable)
    • $ ssh -i ec2-private-key.enc root@ec2-67-202-4-222.z-1.compute-1.amazonaws.com
  • 11. Logging@Instance: a successful login greets you with the EC2 public-image ASCII banner ("Welcome to an EC2 Public Image :-)", Rev: 2), points to the getting-started notes in /etc/ec2/release-notes.txt, and drops you at a root prompt such as [root@domU-12-31-35-00-53-82 ~]#.
  • 12. Register An AMI (a sketch of the cycle follows this list)
    • Bundle and upload to S3 with manifest.xml.
    • Register an AMI.
    • Describe AMI attributes.
    • Reset AMI attributes.
    • Confirm AMI product code.
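    A hedged sketch of the bundle/upload/register cycle, driven from R with the AMI and API tools (bucket name, key files, and account ID are placeholders):

      # Bundle the running volume, upload the bundle to S3, then register
      # the manifest so the image receives an AMI ID.
      system("ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u 123456789012")
      system(paste("ec2-upload-bundle -b my-r-ami-bucket",
                   "-m /mnt/image.manifest.xml",
                   "-a $AWS_ACCESS_KEY -s $AWS_SECRET_KEY"))
      system("ec2-register my-r-ami-bucket/image.manifest.xml")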
  • 13. Performance Issues
    • Instance Type
    • Shared Subsystems*
    • Network Bandwidth
    • Storage Space Initialization
    • RAID
  • 14-15. Persistence
    • S3 is the main storage service for EC2.
    • Cloud programming involves backup mechanisms from the beginning of deployment!
    • Use EC2 as a cache.
    • Perform scheduled backups (see the sketch after this list).
    • Perform scheduled bundling to an AMI.
    • Mount S3 as a local partition.
    • Push your luck.
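    A minimal backup sketch in R, assuming the s3cmd tool is installed and configured on the AMI (directory and bucket names are placeholders); run it from cron to get the scheduled backups mentioned above:

      # Archive the results directory and push the tarball to S3.
      backup_to_s3 <- function(dir = "/mnt/results",
                               bucket = "s3://my-r-backups/") {
        stamp   <- format(Sys.time(), "%Y%m%d-%H%M%S")
        archive <- sprintf("/tmp/results-%s.tar.gz", stamp)
        system(paste("tar czf", archive, dir))
        system(paste("s3cmd put", archive, bucket))
      }
      backup_to_s3()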
  • 16. Our AMI Choice
    • Operating System*
    • Software*
    • Auditing Actions*
    • Configure System Services*
    • Installed Amazon Building Tools*
    • Develop Shell Util Scripts*
    • Build and Upload to S3
  • 17. R on Fedora
    • Extra packages*: using R scripting, plotting to EPS and PDF.
    • Plotting utilities?
    • Services?
    • Data distribution*: integration via web services with Oracle BI and Microsoft Reports.
  • 18. Demo: ssh@AMI. Tools used: ElasticFox, S3Fox, bash scripting, Python, Rscript.
  • 19. R Cloud Data Handling (diagram): R input flows from AWS S3 into EC2 and R output flows back; instances AMI #1-#3 plus an AMI backup attach local drives (sda, sdb, sdd) over the root network.
  • 20. Batch Processing With R

      #!/usr/bin/Rscript --vanilla --default-packages=utils
      args <- commandArgs(TRUE)
      res <- try(install.packages(args))
      if (inherits(res, "try-error")) q(status = 1) else q()

      $ R --vanilla --slave < hello_world.R
      $ R --vanilla --slave < hello_world.R > result.txt
      $ cat > print_my_args.R << EOF
      args <- commandArgs(TRUE)
      print(args)
      q()
      EOF
      $ R --slave "--args a=100 b=200" < print_my_args.R
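    hello_world.R is piped into R above but never shown on the slide; a minimal assumed version is just:

      # hello_world.R (assumed contents, not from the slides)
      cat("Hello, world!\n")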
  • 21. Large Data Sets
    • Excel, SAS, SPSS, etc.
    • Upload files to S3 (use scripts; see the sketch after this list).
    • Data Parallelism vs. Task Parallelism.
    • Service Queuing.
    • Messaging Interfaces.
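    A hedged upload sketch: push each exported CSV to S3 with s3cmd (assumed installed and configured; directory and bucket names are placeholders):

      # One S3 object per exported file.
      for (f in list.files("exports", pattern = "\\.csv$", full.names = TRUE)) {
        system(paste("s3cmd put", f, "s3://my-dataset-bucket/"))
      }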
  • 22. R Data Fragmentation
    • No correlation-type algorithms should be used in R scripting (fragments must be independent of one another).
    • Data capture and delivery.
    • Choose the proper AMI type.
    • Probabilistic algorithm outcomes.
    • Consider data fragmentation in R scripting* (a splitting sketch follows this list).
    S3 integration and data preparation?
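    A minimal fragmentation sketch (function and file names are illustrative, not from the slides): split a data frame into n roughly equal row chunks, one file per AMI:

      # Write chunk-01.csv ... chunk-NN.csv, each holding a contiguous
      # slice of rows; ship one chunk to each instance.
      fragment <- function(df, n = 3, prefix = "chunk") {
        bins <- cut(seq_len(nrow(df)), breaks = n, labels = FALSE)
        for (i in seq_len(n)) {
          write.csv(df[bins == i, ],
                    sprintf("%s-%02d.csv", prefix, i),
                    row.names = FALSE)
        }
      }
      fragment(airquality, n = 3)  # demo with a built-in data set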
  • 23. To Parallelize Or Not?
    • R is not thread safe.
    • R stores all data in memory.
    • Algorithms are serial processes.
    • Solutions like Rmpi raise the learning curve.
  • 24. Data Parallelism vs. Task Parallelism (diagram: a parallel agent distributing work).
  • 25. R Parallel Loop: loop parallelization for task fragmentation (a launcher sketch follows).
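    A hedged task-fragmentation sketch: launch one background R process per data chunk (worker.R is a hypothetical script that processes a single chunk file), then collect the result files afterwards:

      # Fire one Rscript per chunk without waiting; each worker writes
      # its own output file for later collection.
      chunks <- list.files(pattern = "^chunk-.*\\.csv$")
      for (f in chunks) {
        system(paste("Rscript worker.R", f), wait = FALSE)
      }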
  • 26-27. Rmpi: cluster configuration from inside R.

      # Load the R MPI package if it is not already loaded.
      if (!is.loaded("mpi_initialize")) {
        library("Rmpi")
      }
      # Spawn as many slaves as possible
      mpi.spawn.Rslaves()
      # In case R exits unexpectedly, have it automatically clean up
      # resources taken up by Rmpi (slaves, memory, etc.)
      .Last <- function() {
        if (is.loaded("mpi_initialize")) {
          if (mpi.comm.size(1) > 0) {
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
          }
          print("Please use mpi.quit() to quit R")
          .Call("mpi_finalize")
        }
      }
      # Tell all slaves to return a message identifying themselves
      mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))
      # Tell all slaves to close down, and exit the program
      mpi.close.Rslaves()
      mpi.quit()
  • 28-29. REvolution Parallel R (built on the foreach and iterators packages).

      # Load the Parallel R stack
      require('doNWS')
      # We define the function f in our local environment
      f <- function(x) { sqrt(x) }
      # Start up two R worker processes and register them with the
      # foreach parallel backend
      setSleigh(sleigh(workerCount = 2))
      registerDoNWS()
      # Run a simple foreach loop in parallel on the two workers
      foreach(j = 1:9, .combine = c) %dopar% f(j)
      # Note that the workers use the function f from our local
      # environment, even though it was not explicitly defined on
      # the workers!
  • 32-34. R as a Data Worker For Hadoop (the HadoopStreaming package)
    1. Data plumbing: take apply() calls in R and present them to Hadoop as input for a job; essentially, split the input vector into partitions, each of which goes to a Mapper task, then have the Reducers combine the results, which are sent back to R. R then continues its processing.
    2. R algorithm parallelization: rewriting critical parts of popular algorithms implemented in R so that they can take advantage of R-Hadoop integration.
    User experiences: 400 GB across 5 AMIs. A streaming sketch follows.
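    A hedged Hadoop Streaming sketch (not from the slides): the mapper half of a word count in R, reading stdin and writing tab-separated key/value pairs, which is the contract Hadoop Streaming expects of any mapper executable:

      #!/usr/bin/Rscript
      # mapper.R: emit "word<TAB>1" for every word on stdin; Hadoop
      # Streaming sorts these by key and hands them to the reducer.
      con <- file("stdin", open = "r")
      while (length(line <- readLines(con, n = 1)) > 0) {
        for (w in strsplit(line, "[[:space:]]+")[[1]]) {
          if (nzchar(w)) cat(w, "\t1\n", sep = "")
        }
      }
      close(con)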
  • 35. R Data Parallelization
    • Use RToolkit for "Parallel R" processing.
    • DNS and DynDNS node configuration.
    • Node and memory optimization.
    • Develop the R script and distribute.
  • 36. Further Lookup
    • AWS cost calculator: http://calculator.s3.amazonaws.com/calc5.html
    • Secure EC2 Instance: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1233
    • http://www.revolution-computing.com/
    • Rmpi sample: http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
    • Rscript manual: http://finzi.psych.upenn.edu/R/library/utils/html/Rscript.html
    • http://www.rparallel.org/
    • CRAN task views: http://cran.r-project.org/web/views/
  • 37. Hopefully You Have More Clouded Days From Now On doxaras@mineknowledge.com