From Hadoop to Spark
1/2
Dr. Fabio Fumarola
Contents
• Aggregate and Cluster
• Scatter Gather and MapReduce
• MapReduce
• Why Spark?
• Spark:
– Example, task and stages
– Docker Example
– Scala and Anonymous Functions
• Next Topics in 2/2
2
Aggregates and Clusters
• Aggregate-oriented databases change the rules for
data storage (CAP)
• But running on a cluster also changes the computation
model
• When you store data on a cluster, you can process
it in parallel
3
Database vs Client Processing
• With a centralized database, data can be processed
on the database server or on the client machine
• Running on the client:
– Pros: flexibility and programming languages
– Cons: data transfer from the server to the client
• Running on the server:
– Pros: Data locality
– Cons: programming languages and debugging
4

Cluster and Computation
• We can spread the computation across the cluster
• However, we have to reduce the amount of data
transferred over the network
• We need computation locality
• That is, we process the data on the same node where it is
stored
5
Scatter-Gather
6
Scatter-Gather broadcasts a message to multiple recipients and
re-aggregates the responses back into a single message
(Enterprise Integration Patterns, 2003).
Map-Reduce
• It is a way to organize processing by taking
advantage of clusters
• It gained prominence with Google’s MapReduce
framework (Dean and Ghemawat 2004)
• It was then implemented in the Hadoop framework
7
http://research.google.com/archive/mapreduce.html
https://hadoop.apache.org/
Programming Model: MapReduce
• We have a huge text document
• Count the number of times each distinct word
appears in the file
• Sample applications
– Analyze web server logs to find popular URLs
– Term statistics for search
8

Word Count
• Assumption: the input file is too large for memory,
but all <word, count> pairs fit in memory
• We can compute the pairs in the shell, one word per line, by
– tr -s ' ' '\n' < file.txt | sort | uniq -c
• Encyclopedia Britannica, 11th Edition, Volume 4, Part
3 (http://www.gutenberg.org/files/19699/19699.zip)
9
Word Count Steps
tr -s ' ' '\n' < file.txt | sort | uniq -c
•Map
• Scan input file record-at-a-time
• Extract keys from each record
•Group by key
• Sort and shuffle
•Reduce
• Aggregate, summarize, filter or transform
• Write the results
10
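The same three logical steps as a minimal Scala sketch on an in-memory collection (no cluster involved; lines stands in for the input file):
// Map: emit a (word, 1) pair for every word of every record
val lines = Seq("to be or not to be", "to do or not to do")
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))
// Group by key: gather the pairs belonging to the same word
val grouped = pairs.groupBy(_._1)
// Reduce: sum the counts of each word and write the results
val counts = grouped.map { case (w, ws) => (w, ws.map(_._2).sum) }
counts.foreach(println) // e.g. (to,4), (be,2), (or,2), ...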
MapReduce: Logical Steps
• Map
• Group by Key
• Reduce
11
Map Phase
12

Group and Reduce Phase
13
Partition and shuffling
MapReduce: Word Counting
14
Word Count with MapReduce
15
Example: Language Model
• Count the number of times each 5-word sequence
occurs in a large corpus of documents
• Map
– Extract <5-word sequence, count> pairs from each
document
• Reduce
– Combine the counts
16
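A sketch of the Map step under these assumptions (plain Scala; fiveGramPairs and doc are our own hypothetical names):
// Slide a 5-word window over the document and emit one
// (sequence, 1) pair per occurrence
def fiveGramPairs(doc: String): Seq[(String, Int)] =
  doc.split("\\s+").toSeq.sliding(5).map(w => (w.mkString(" "), 1)).toSeq

fiveGramPairs("the quick brown fox jumps over") // two overlapping 5-grams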

MapReduce: Physical Execution
17
Physical Execution: Concerns
• Mapper intermediate results for a given key are sent to a single
reducer:
– This is the only step that requires communication over
the network
• Thus the Partition and Shuffle phases are critical
18
Partition and Shuffle
Reducing the cost of these steps dramatically reduces the
running time of the computation:
•The Partitioner determines which partition a given
(key, value) pair will go to.
•The default partitioner computes a hash value for the
key and assigns the partition based on this result (a sketch follows below).
•The Shuffler moves map outputs to the reducers.
19
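A minimal sketch of such a hash-based partitioner (our own illustration, not the exact Hadoop or Spark code; the guard keeps negative hash codes in range):
def partitionFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // keep result in [0, numPartitions)
}

partitionFor("page_1", 4) // the same key always maps to the same partition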
MapReduce: Features
• Partitioning the input data
• Scheduling the program’s execution across a set of
machines
• Performing the group by key step
• Handling node failures
• Managing required inter-machine communication
20

Hadoop
MapReduce
21
Hadoop
• Open-source software for distributed storage of large
datasets on commodity hardware
• Provides a programming model/framework for processing
large datasets in parallel
22
[Diagram: the input is split across parallel Map tasks whose outputs feed Reduce tasks that produce the output]
Hadoop: Architecture
23
Distributed File System
• Data is kept in “chunks” spread across machines
• Each chunk is replicated on different machines
(Persistence and Availability)
24

Distributed File System
• Chunk Servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated 3 times
– Try to keep replicas on different machines
• Master Node
– Name Node in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
25
Hadoop’s Limitations
26
Limitations of MapReduce
• Slow due to replication, serialization, and disk IO
• Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & Network Analysis)
– Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)
27
[Diagram: in Hadoop MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS before the next Map/Reduce pass]
Solutions?
• Leverage memory:
– load data into memory
– replace disks with SSDs
28

Apache Spark
• A big data analytics cluster-computing framework
written in Scala.
• Originally open-sourced by the AMPLab at UC Berkeley
• Provides in-memory analytics based on RDDs
• Highly compatible with the Hadoop Storage API
– Can run on top of a Hadoop cluster
• Developers can write programs using multiple
programming languages
29
Spark architecture
30
[Diagram: the Spark Driver (Master) and a Cluster Manager coordinate Spark Workers, each with an in-memory Cache, co-located with the HDFS DataNodes that hold the data Blocks]
Hadoop Data Flow
31
[Diagram: iterations chained through HDFS: HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, HDFS write, ...]
Spark Data Flow
32
Not tied to the two-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
[Diagram: a single HDFS read feeds a chain of in-memory iterations; a chart compares logistic regression in Hadoop and Spark]

Spark Programming Model
33
The user (developer) writes a driver program:
sc = new SparkContext
rDD = sc.textFile(“hdfs://…”)
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
[Diagram: the Driver Program’s SparkContext talks to a Cluster Manager, which runs Tasks in Executors (each with a Cache) on Worker Nodes, placed alongside the HDFS DataNodes]
Spark Programming Model
34
The user (developer) writes the same driver program as before; each
rDD it creates is an RDD (Resilient Distributed Dataset):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
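A runnable version of the driver sketch above (assuming the Spark 1.x Scala API; the application name, master URL, and HDFS path are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("example").setMaster("local[2]")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt") // placeholder path
val nonEmpty = rdd.filter(_.nonEmpty) // transformation: lazy
nonEmpty.cache()                      // mark for in-memory reuse
println(nonEmpty.count())             // action: triggers the job
nonEmpty.map(_.length)                // a further transformation (still lazy)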
RDD
• Programming Interface: the programmer can perform 3 types of
operations (one example of each follows the table)
35
Transformations
•Create a new dataset from
an existing one.
•Lazy in nature: they are
executed only when some
action is performed.
•Examples:
• map(func)
• filter(func)
• distinct()
Actions
•Return a value to the driver
program, or export data to a
storage system, after
performing a computation.
•Examples:
• count()
• reduce(func)
• collect()
• take()
Persistence
•For caching datasets in-
memory for future
operations.
•Option to store on disk or
RAM or mixed (Storage
Level).
•Examples:
• persist()
• cache()
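One operation of each kind, as a minimal sketch (assumes an existing SparkContext sc):
val nums  = sc.parallelize(1 to 100)   // build an RDD
val evens = nums.filter(_ % 2 == 0)    // Transformation: lazy, nothing runs yet
evens.persist()                        // Persistence: keep it in memory once computed
val total = evens.reduce(_ + _)        // Action: triggers execution, returns 2550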
How Spark works
• RDD: Parallel collection with partitions
• User applications create RDDs, transform them, and
run actions.
• This results in a DAG (Directed Acyclic Graph) of
operators.
• The DAG is compiled into stages
• Each stage is executed as a series of Tasks (one Task
for each Partition).
36

Example
37
sc.textFile(“/wiki/pagecounts”) RDD[String]
textFile
Example
38
sc.textFile(“/wiki/pagecounts”)
.map(line => line.split(“\t”))
RDD[String]
textFile map
RDD[List[String]]
Example
39
sc.textFile(“/wiki/pagecounts”)
.map(line => line.split(“\t”))
.map(r => (r(0), r(1).toInt))
RDD[String]
textFile map
RDD[List[String]]
RDD[(String, Int)]
map
Example
40
sc.textFile(“/wiki/pagecounts”)
.map(line => line.split(“\t”))
.map(r => (r(0), r(1).toInt))
.reduceByKey(_+_)
RDD[String]
textFile map
RDD[List[String]]
RDD[(String, Int)]
map
RDD[(String, Int)]
reduceByKey

Example
41
sc.textFile(“/wiki/pagecounts”)
.map(line => line.split(“\t”))
.map(r => (r(0), r(1).toInt))
.reduceByKey(_+_, 3)
.collect()
RDD[String]
RDD[List[String]]
RDD[(String, Int)]
RDD[(String, Int)]
reduceByKey
Array[(String, Int)]
collect
Execution Plan
Stages are sequences of RDDs that don’t have a shuffle in
between
42
[Diagram: Stage 1 = textFile → map → map; Stage 2 = reduceByKey → collect]
Execution Plan
43
[Diagram: the same plan split at the shuffle boundary into Stage 1 (textFile, map, map, partial reduceByKey) and Stage 2 (final reduceByKey, collect)]
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to the driver
program
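The stage boundary can be inspected from the shell with RDD.toDebugString; the output below is abridged and hypothetical, since its exact format varies by Spark version:
val counts = sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _)
println(counts.toDebugString)
// ShuffledRDD[3] at reduceByKey        <- Stage 2 (after the shuffle)
//  +-(MapPartitionsRDD[2] at map ...)  <- Stage 1 (textFile, map, map)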
Stage Execution
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all this happens internally (you don’t need to do anything)
44
[Diagram: one Task per Partition]

Spark Executor (Slaves)
45
Each executor core runs its tasks as a pipeline:
Fetch Input → Execute Task → Write Output
[Diagram: Core 1, Core 2 and Core 3 each repeatedly fetch input, execute a task, and write output]
Summary of Components
• Task: The fundamental unit of execution in Spark
• Stage: Set of Tasks that run in parallel
• DAG: Logical Graph of RDD operations
• RDD: Parallel dataset with partitions
46
Start the docker container
From
•https://github.com/sequenceiq/docker-spark
docker pull sequenceiq/spark:1.3.0
docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu /etc/bootstrap.sh bash
•Run the spark shell using yarn or local
spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2
47
Separate Container Master/Worker
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
•These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
48

Running the example and Shell
• To Run an example
$ run-example SparkPi 10
• We can start a spark shell via
–spark-shell --master local[n]
• The --master option specifies the master URL for a
distributed cluster
• Example applications are also provided in Python
–spark-submit examples/src/main/python/pi.py 10
49
Scala Base Course - Start
50
Scala vs Java vs Python
• Spark was originally written in Scala, which allows
concise function syntax and interactive use
• Java API added for standalone applications
• Python API added more recently along with an
interactive shell.
51
Why Scala?
• High-level language for the JVM
– Object oriented + functional programming
• Statically typed
– Type Inference
• Interoperates with Java
– Can use any Java Class
– Can be called from Java code
52
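A tiny illustration of the Java interoperability point (a plain REPL sketch: any Java class can be used directly from Scala):
val list = new java.util.ArrayList[String]() // a plain Java class
list.add("spark")
println(list.size()) // 1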

Quick Tour of Scala
53
Laziness
• For variables we can define lazy vals, which are evaluated only
when first accessed
lazy val x = 10 * 10 * 10 * 10 //long computation
• For methods we can define call by value and call by name for
the parameters
def square(x: Double) // call by value
def square(x: => Double) // call by name
• It changes the order in which the parameters are evaluated (see the demo below)
54
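A small demo of the difference (a sketch; noisy is our own helper): a call-by-value parameter is evaluated once before the body runs, while a call-by-name parameter is re-evaluated at each use:
def noisy(): Double = { println("evaluating"); 10.0 }

def twiceByValue(x: Double): Double   = x + x // call by value
def twiceByName(x: => Double): Double = x + x // call by name

twiceByValue(noisy()) // prints "evaluating" once
twiceByName(noisy())  // prints "evaluating" twice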
Anonymous functions
55
scala> val square = (x: Int) => x * x
square: Int => Int = <function1>
We define an anonymous function from Int to Int
Here square is a val of type Function1, which is equivalent to
scala> def square(x: Int) = x * x
square: (x: Int)Int
Anonymous Functions
(x: Int) => x * x
This is a syntactic sugar for
new Function1[Int, Int] {
def apply(x: Int): Int = x * x
}
56

Currying
Converting a function with multiple arguments into a function
with a single argument that returns another function.
def gen(f: Int => Int)(x: Int) = f(x)
def identity(x: Int) = gen(i => i)(x)
def square(x: Int) = gen(i => i * i)(x)
def cube(x: Int) = gen(i => i * i * i)(x)
57
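Thanks to the two parameter lists, gen can also be partially applied to obtain a reusable function value (a small usage sketch):
val squareFn = gen(i => i * i) _ // fix f, leave x open: Int => Int
squareFn(4)                      // 16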
Anonymous Functions
//doWithOneAndTwo is assumed to apply the given function to 1 and 2
//(its definition is omitted on the slide):
def doWithOneAndTwo(f: (Int, Int) => Int) = f(1, 2)
//Explicit type declaration
val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)
//The compiler expects 2 ints so x and y types are inferred
val call2 = doWithOneAndTwo((x, y) => x + y)
//Even more concise syntax
val call3 = doWithOneAndTwo(_ + _)
58
Returning multiple variables
def swap(x:String, y:String) = (y, x)
val (a,b) = swap("hello","world")
println(a, b)
59
Higher-Order Functions
Methods that take functions as parameters
val list = (1 to 4).toList
list.foreach( x => println(x))
list.foreach(println)
list.map(x => x + 2)
list.map(_ + 2)
list.filter(x => x % 2 == 1)
list.filter(_ % 2 == 1)
list.reduce((x,y) => x + y)
list.reduce(_ + _)
60

Function Methods on Collections
61
http://www.scala-lang.org/api/2.11.6/index.html#scala.collection.Seq
Scala Base Course - End
http://scalatutorials.com/
62
Next Topics
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning
63

Linux containers and docker
Fabio Fumarola
 
08 datasets
08 datasets08 datasets
08 datasets
Fabio Fumarola
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
Fabio Fumarola
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
Fabio Fumarola
 

More from Fabio Fumarola (11)

9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
6 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/26 Data Modeling for NoSQL 2/2
6 Data Modeling for NoSQL 2/2
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
 
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and Docker
 
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
1. Introduction to the Course "Designing Data Bases with Advanced Data Models...
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbt
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and docker
 
Linux containers and docker
Linux containers and dockerLinux containers and docker
Linux containers and docker
 
08 datasets
08 datasets08 datasets
08 datasets
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
 

Recently uploaded

Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
nikita dubey$A17
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
luqmansyauqi2
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
taqyea
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
Sunshine Coast University diploma
Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
cwavvyy
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
KiranKumar139571
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
aashuverma204
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
chetankumar9855
 
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
Amazon Web Services Korea
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Alisha Pathan $A17
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
yogita singh$A17
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
simmi singh$A17
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
khansayyad1256
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
taqyea
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
shruti singh$A17
 

Recently uploaded (20)

Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeVasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Vasant Kunj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Sunshine Coast University diploma
Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeLaxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Laxmi Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 

Word Count
• Assumption: the input file is too large for memory, but all <word, count> pairs fit in memory
• We can compute the pairs by
– wc file.txt | sort | uniq -c
• Encyclopedia Britannica, 11th Edition, Volume 4, Part 3 (http://www.gutenberg.org/files/19699/19699.zip)
9
Word Count Steps: wc file.txt | sort | uniq -c
• Map
– Scan input file record-at-a-time
– Extract keys from each record
• Group by key
– Sort and shuffle
• Reduce
– Aggregate, summarize, filter or transform
– Write the results
10
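To make the three steps concrete, here is a minimal Scala sketch that runs the same map / group-by-key / reduce pipeline on an in-memory string; the text value is just a stand-in for the input file.

object WordCountSteps {
  def main(args: Array[String]): Unit = {
    val text = "the quick brown fox jumps over the lazy dog the fox"

    // Map: scan record-at-a-time and emit (word, 1) pairs
    val pairs: Seq[(String, Int)] = text.split("\\s+").toSeq.map(w => (w, 1))

    // Group by key: collect all counts emitted for the same word
    val grouped: Map[String, Seq[Int]] =
      pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

    // Reduce: aggregate the counts per word
    val counts: Map[String, Int] = grouped.map { case (w, cs) => (w, cs.sum) }

    counts.foreach { case (w, c) => println(s"$w: $c") }
  }
}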
MapReduce: Logical Steps
• Map
• Group by Key
• Reduce
11
Group and Reduce Phase
(Figure: partition and shuffling)
13
Word Count with MapReduce
15
Example: Language Model
• Count the number of times each 5-word sequence occurs in a large corpus of documents
• Map
– Extract <5-word sequence, count> pairs from each document
• Reduce
– Combine the counts
16
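As a rough illustration, the map side can be sketched in Scala with sliding(5); the document text here is a placeholder, and in a real job the reduce side would combine partial counts across documents.

object FiveGramCount {
  def main(args: Array[String]): Unit = {
    val document = "a b c d e f g a b c d e"

    // Map: emit one (5-word sequence, 1) pair per window
    val pairs = document.split("\\s+").toSeq
      .sliding(5)                                  // every contiguous 5-word sequence
      .map(window => (window.mkString(" "), 1))
      .toSeq

    // Reduce: combine the counts for identical sequences
    val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

    counts.foreach(println)
  }
}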
Physical Execution: Concerns
• Mapper intermediate results are sent to a single reducer:
– This is the only step that requires communication over the network
• Thus the Partition and Shuffle phases are critical
18
Partition and Shuffle
Reducing the cost of these steps dramatically reduces the running time of the computation:
• The Partitioner determines which partition a given (key, value) pair will go to.
• The default partitioner computes a hash value for the key and assigns the partition based on this result.
• The Shuffler moves map outputs to the reducers.
19
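A minimal sketch of the default hash-based assignment, mirroring what Hadoop's HashPartitioner does (the function name here is illustrative):

// Illustrative hash partitioner: maps a key to one of numPartitions buckets.
// Masking with Int.MaxValue keeps the result non-negative even when
// hashCode is negative.
def partitionFor(key: String, numPartitions: Int): Int =
  (key.hashCode & Int.MaxValue) % numPartitions

// Example: all occurrences of the same key land in the same partition.
List("spark", "hadoop", "spark").foreach { k =>
  println(s"$k -> partition ${partitionFor(k, 3)}")
}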
MapReduce: Features
• Partitioning the input data
• Scheduling the program’s execution across a set of machines
• Performing the group-by-key step
• Handling node failures
• Managing required inter-machine communication
20
Hadoop
• Open-source software for distributed storage of large datasets on commodity hardware
• Provides a programming model/framework for processing large datasets in parallel
(Figure: Input -> Map -> Reduce -> Output)
22
Distributed File System
• Data is kept in “chunks” spread across machines
• Each chunk is replicated on different machines (persistence and availability)
24
Distributed File System
• Chunk Servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated 3 times
– Try to keep replicas on different machines
• Master Node
– Name Node in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
25
Limitations of MapReduce
• Slow due to replication, serialization, and disk I/O
• Inefficient for:
– Iterative algorithms (Machine Learning, Graphs & Network Analysis)
– Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)
(Figure: each iteration reads its input from HDFS and writes its output back to HDFS)
27
Solutions?
• Leverage memory:
– Load data into memory
– Replace disks with SSDs
28
Apache Spark
• A big data analytics cluster-computing framework written in Scala
• Originally open-sourced in the AMPLab at UC Berkeley
• Provides in-memory analytics based on RDDs
• Highly compatible with the Hadoop Storage API
– Can run on top of a Hadoop cluster
• Developers can write programs using multiple programming languages
29
Spark Architecture
(Figure: the Spark Driver (master) and a Cluster Manager coordinate Spark Workers, each with a local cache, co-located with HDFS DataNodes holding the data blocks)
30
Hadoop Data Flow
(Figure: every iteration performs an HDFS read and an HDFS write)
31
Spark Data Flow
Not tied to the two-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
(Figure: a single HDFS read feeds all iterations; example benchmark: logistic regression in Hadoop and Spark)
32
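A minimal sketch of that extract/cache/query pattern, assuming the spark-shell where sc: SparkContext is predefined; the HDFS path and filter predicate are placeholders.

// 1. Extract a working set (path and predicate are placeholders)
val errors = sc.textFile("hdfs://namenode/logs/app.log")
               .filter(line => line.contains("ERROR"))

// 2. Cache it: the first action materializes the partitions in memory
errors.cache()

// 3. Query it repeatedly: later actions reuse the cached partitions
println(errors.count())
println(errors.filter(_.contains("timeout")).count())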
Spark Programming Model
The user (developer) writes a driver program:
sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
(Figure: the driver’s SparkContext talks to the Cluster Manager, which schedules Tasks on Executors (each with a Cache) running on Worker Nodes co-located with HDFS DataNodes)
33
Spark Programming Model
The driver program manipulates RDDs (Resilient Distributed Datasets):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
34
RDD
• Programming interface: the programmer can perform 3 types of operations
Transformations
• Create a new dataset from an existing one
• Lazy in nature: they are executed only when some action is performed
• Examples: map(func), filter(func), distinct()
Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation
• Examples: count(), reduce(func), collect(), take()
Persistence
• For caching datasets in memory for future operations
• Option to store on disk, in RAM, or mixed (Storage Level)
• Examples: persist(), cache()
35
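A short sketch that exercises one operation of each kind, again assuming the shell's predefined sc:

// Transformations are lazy: nothing runs yet
val numbers = sc.parallelize(1 to 10)        // parallel collection RDD
val evens   = numbers.filter(_ % 2 == 0)     // transformation
val squares = evens.map(n => n * n)          // transformation

// Persistence: ask Spark to keep the result in memory once computed
squares.cache()

// Actions trigger the actual computation
println(squares.count())        // 5
println(squares.reduce(_ + _))  // 4 + 16 + 36 + 64 + 100 = 220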
How Spark Works
• RDD: a parallel collection with partitions
• User applications create RDDs, transform them, and run actions
• This results in a DAG (Directed Acyclic Graph) of operators
• The DAG is compiled into stages
• Each stage is executed as a series of tasks (one task per partition)
36
Example
sc.textFile(“/wiki/pagecounts”)
  .map(line => line.split(“\t”))
  .map(R => (R[0], int(R[1])))
Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)]
39
Example
sc.textFile(“/wiki/pagecounts”)
  .map(line => line.split(“\t”))
  .map(R => (R[0], int(R[1])))
  .reduceByKey(_+_)
Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)] -> reduceByKey -> RDD[(String, Int)]
40
Example
sc.textFile(“/wiki/pagecounts”)
  .map(line => line.split(“\t”))
  .map(R => (R[0], int(R[1])))
  .reduceByKey(_+_, 3)
  .collect()
Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)] -> reduceByKey -> RDD[(String, Int)] -> collect -> Array[(String, Int)]
41
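The slide code mixes Scala with Python-style indexing; a version that actually compiles in the Scala shell would look roughly like this. The input path and the tab-separated <page, count> layout are assumptions carried over from the slide.

val counts = sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))                 // Array[String] per line
  .map(fields => (fields(0), fields(1).toInt))   // Scala indexing and conversion
  .reduceByKey(_ + _, 3)                         // 3 output partitions
  .collect()                                     // Array[(String, Int)] on the driver

counts.take(10).foreach(println)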
Execution Plan
Stages are sequences of RDDs that don’t have a shuffle in between.
(Figure: textFile -> map -> map form Stage 1; reduceByKey -> collect form Stage 2)
42
Execution Plan
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
43
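You can inspect this stage structure yourself: Spark's toDebugString prints an RDD's lineage, with an indentation step marking each shuffle (stage) boundary. A quick sketch, again assuming the shell's sc:

val rdd = sc.textFile("/wiki/pagecounts")
  .map(_.split("\t"))
  .map(f => (f(0), f(1).toInt))
  .reduceByKey(_ + _)

// Prints the lineage; the indentation step at the ShuffledRDD marks the
// boundary between the two stages described above.
println(rdd.toDebugString)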
Stage Execution
• Create a task for each partition in the new RDD
• Serialize the tasks
• Schedule and ship tasks to the slaves
All this happens internally (you don’t need to do anything).
44
Spark Executor (Slaves)
(Figure: each executor core repeatedly runs a fetch input -> execute task -> write output loop)
45
Summary of Components
• Task: the fundamental unit of execution in Spark
• Stage: a set of tasks that run in parallel
• DAG: the logical graph of RDD operations
• RDD: a parallel dataset with partitions
46
Start the Docker Container
• From https://github.com/sequenceiq/docker-spark:
docker pull sequenceiq/spark:1.3.0
docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu /etc/bootstrap.sh bash
• Run the Spark shell using YARN or local:
spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2
47
Separate Container Master/Worker
$ docker pull snufkin/spark-master
$ docker pull snufkin/spark-worker
• These images are based on snufkin/spark-base
$ docker run … master
$ docker run … worker
48
Running the Examples and the Shell
• To run an example:
$ run-example SparkPi 10
• We can start a Spark shell via:
– spark-shell --master local[n]
• The --master option specifies the master URL for a distributed cluster
• Example applications are also provided in Python:
– spark-submit examples/src/main/python/pi.py 10
49
Scala Base Course - Start
50
Scala vs Java vs Python
• Spark was originally written in Scala, which allows concise function syntax and interactive use
• A Java API was added for standalone applications
• A Python API was added more recently, along with an interactive shell
51
Why Scala?
• High-level language for the JVM
– Object-oriented + functional programming
• Statically typed
– Type inference
• Interoperates with Java
– Can use any Java class
– Can be called from Java code
52
Quick Tour of Scala
53
Laziness
• We can define lazy vals, which are evaluated only when first used:
lazy val x = 10 * 10 * 10 * 10 // long computation
• For methods we can choose call-by-value or call-by-name for the parameters:
def square(x: Double) // call by value
def square(x: => Double) // call by name
• This changes the order in which the parameters are evaluated
54
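A small sketch that makes the evaluation order visible with println side effects (illustrative only):

object LazinessDemo extends App {
  lazy val x = { println("computing x"); 10 * 10 * 10 * 10 }
  println("before first use")
  println(x) // "computing x" is printed only now
  println(x) // cached: not recomputed

  def byValue(a: Double): Double = { println("in byValue"); a + a }
  def byName(a: => Double): Double = { println("in byName"); a + a }

  def expensive(): Double = { println("evaluating argument"); 21.0 }

  byValue(expensive()) // argument evaluated once, before the call
  byName(expensive())  // argument evaluated twice, once per use in the body
}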
Anonymous Functions
scala> val square = (x: Int) => x * x
square: Int => Int = <function1>
We define an anonymous function from Int to Int.
Here square is a val of type Function1, which is equivalent to:
scala> def square(x: Int) = x * x
square: (x: Int)Int
55
Anonymous Functions
(x: Int) => x * x
This is syntactic sugar for:
new Function1[Int, Int] {
  def apply(x: Int): Int = x * x
}
56
Currying
Converting a function with multiple arguments into a function with a single argument that returns another function.
def gen(f: Int => Int)(x: Int) = f(x)
def identity(x: Int) = gen(i => i)(x)
def square(x: Int) = gen(i => i * i)(x)
def cube(x: Int) = gen(i => i * i * i)(x)
57
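The payoff of the curried form is partial application: supplying only the first argument list yields a new function. A brief sketch restating gen from the slide:

def gen(f: Int => Int)(x: Int): Int = f(x)

// Partially apply gen: fix f, leave x open
val square: Int => Int = gen(i => i * i)
val cube: Int => Int = gen(i => i * i * i)

println(square(4)) // 16
println(cube(3))   // 27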
Anonymous Functions
// Explicit type declaration
val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)
// The compiler expects 2 Ints, so the x and y types are inferred
val call2 = doWithOneAndTwo((x, y) => x + y)
// Even more concise syntax
val call3 = doWithOneAndTwo(_ + _)
58
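The slide never shows doWithOneAndTwo; a plausible definition (an assumption, but any higher-order method with this shape works) is:

// Hypothetical helper matching the calls above: applies f to 1 and 2.
def doWithOneAndTwo(f: (Int, Int) => Int): Int = f(1, 2)

val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y) // 3
val call2 = doWithOneAndTwo((x, y) => x + y)           // 3
val call3 = doWithOneAndTwo(_ + _)                     // 3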
Returning multiple variables
def swap(x: String, y: String) = (y, x)
val (a, b) = swap("hello", "world")
println(a, b)
59
Higher-Order Functions
Methods that take functions as parameters:
val list = (1 to 4).toList
list.foreach(x => println(x))
list.foreach(println)
list.map(x => x + 2)
list.map(_ + 2)
list.filter(x => x % 2 == 1)
list.filter(_ % 2 == 1)
list.reduce((x, y) => x + y)
list.reduce(_ + _)
60
Function Methods on Collections
http://www.scala-lang.org/api/2.11.6/index.html#scala.collection.Seq
61
Scala Base Course - End
http://scalatutorials.com/
62
Next Topics
• Spark Shell
– Scala
– Python
• Shark Shell
• Data Frames
• Spark Streaming
• Code Examples: Processing and Machine Learning
63

Editor's Notes

1. Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. There are currently two types of RDDs:
– Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into; ideally 2-3 slices per CPU.
– Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.), using SparkContext’s textFile method. The default number of slices in this case is 1 slice per file block.
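A short sketch of the two creation paths (the slice count and the HDFS path are illustrative, and sc is the shell's SparkContext):

// Parallelized collection: 4 slices requested explicitly
val fromCollection = sc.parallelize(1 to 1000, 4)

// Hadoop dataset: one slice per HDFS block by default
val fromFile = sc.textFile("hdfs://namenode/data/input.txt")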
2. Transformations: map, for example, takes an RDD as input, passes and processes each element through a function, and returns a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify that the RDD should be cached in memory; Spark will then try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY.
– MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
– MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
– MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
– MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
– DISK_ONLY: store the RDD partitions only on disk.
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
Which storage level is best? A few things to consider:
– Try to keep as much as possible in memory
– Try not to spill to disk unless your computed datasets are memory-expensive
– Use replication only if you want fault tolerance
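For reference, a minimal sketch of choosing a non-default storage level (the path is a placeholder, and sc is the shell's SparkContext):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://namenode/data/input.txt")

// Persist with an explicit storage level; note that a storage level can
// be assigned to an RDD only once.
data.persist(StorageLevel.MEMORY_AND_DISK)

// Equivalent shorthand for the default MEMORY_ONLY level:
// data.cache()  ==  data.persist(StorageLevel.MEMORY_ONLY)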