Apache Spark
Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Nov 2014
JAVACONF 2014
Fernando Rodriguez Olivera
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler
Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group, Professor at
nosqlessentials.com
Apache Spark
Apache Spark is a fast and general engine
for large-scale data processing.
It supports batch, interactive and stream
processing with a unified API, and provides
in-memory computing primitives.
Hadoop MR Limits
Job -> Job -> Job
Hadoop HDFS
- Communication between jobs goes through the file system
- Fault tolerance (between jobs) by persistence to the file system
- Memory not managed (relies on OS caches)
MapReduce was designed for batch processing;
its limits are compensated with: Storm, Samza, Giraph, Impala, Presto, etc.
Daytona Gray Sort 100TB Benchmark
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 physical
Apache Spark (2014)   100 TB      23 min   206     6,592 virtualized

3X faster using 10X fewer machines
Hadoop vs Spark for Iterative Processing
source: https://spark.apache.org/
Logistic regression in Hadoop and Spark
Apache Spark
Apache Spark (Core)
Spark SQL | Spark Streaming | MLlib | GraphX
Powered by Scala and Akka
Resilient Distributed Datasets (RDD)

Hello World
...
A New Line
hello
The End
...

RDD of Strings:
- Immutable Collection of Objects
- Partitioned and Distributed
- Stored in Memory
- Partitions Recomputed on Failure
RDD Transformations and Actions

RDD of Strings        RDD of Ints
Hello World     ->    11
A New Line      ->    10
hello           ->    5
The End         ->    7
...                   ...

e.g. apply a function that counts the chars of each line:
a Compute Function (transformation) produces a new RDD
that depends on the original one.

An Action (e.g. count) computes a final value (Int N)
from the RDD.

RDD Implementation:
- Partitions
- Compute Function
- Dependencies
- Preferred Compute Location (for each partition)
- Partitioner
Spark API

Scala:
val spark = new SparkContext()
val lines = spark.textFile("hdfs://docs/")   // RDD[String]
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String]
val count = nonEmpty.count

Java 8:
SparkContext spark = new SparkContext();
JavaRDD<String> lines = spark.textFile("hdfs://docs/");
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
long count = nonEmpty.count();

Python:
spark = SparkContext()
lines = spark.textFile("hdfs://docs/")
nonEmpty = lines.filter(lambda line: len(line) > 0)
count = nonEmpty.count()
RDD Operations

Transformations           Actions
map(func)                 count()
flatMap(func)             collect()
filter(func)              take(N)
mapValues(func)           takeOrdered(N)
groupByKey()              top(N)
reduceByKey(func)         reduce(func)
...                       ...
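As a rough local analogue (plain Python over lists, not Spark APIs; the helper names are illustrative), the core operations map onto familiar collection idioms:

```python
from itertools import chain

def map_rdd(f, data):
    # map(func): apply f to every element
    return [f(x) for x in data]

def flat_map_rdd(f, data):
    # flatMap(func): apply f, then flatten one level
    return list(chain.from_iterable(f(x) for x in data))

def filter_rdd(pred, data):
    # filter(func): keep elements where pred holds
    return [x for x in data if pred(x)]

def reduce_by_key(f, pairs):
    # reduceByKey(func): merge all values of each key with f
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

print(reduce_by_key(lambda a, b: a + b,
                    [("a", 1), ("b", 1), ("a", 1)]))  # [('a', 2), ('b', 1)]
```

In Spark the same operations run lazily and per-partition across the cluster; the semantics per element, however, are the ones shown here.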
Text Processing Example
Top Words by Frequency
(step by step)

Create RDD from External Data

// Step 1 - Create RDD from Hadoop Text File
val docs = spark.textFile("hdfs://docs/")

Spark can read/write from any data source supported by Hadoop
(Hadoop FileSystem, I/O Formats, Codecs):
HDFS, S3, HBase, Cassandra, MongoDB, ElasticSearch, ...
I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop)
Function map

.map(line => line.toLowerCase)  =  .map(_.toLowerCase)

RDD[String]       RDD[String]
Hello World   ->  hello world
A New Line    ->  a new line
hello         ->  hello
...               ...
The end       ->  the end

// Step 2 - Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap

RDD[String]       .map(_.split("\\s+"))    RDD[Array[String]]
hello world   ->                           [hello, world]
a new line    ->                           [a, new, line]
hello         ->                           [hello]
...                                        ...
the end       ->                           [the, end]

.flatten *                                 RDD[String]
                                           hello
                                           world
                                           a
                                           new
                                           line
                                           ...

.flatMap(line => line.split("\\s+")) combines both steps in one
* Note: flatten() is not available in Spark, only flatMap

// Step 3 - Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
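The map-then-flatten equivalence is easy to check locally; this sketch uses plain Python lists (not Spark), with the same \s+ split pattern:

```python
import re
from itertools import chain

lines = ["hello world", "a new line", "hello", "the end"]

# .map(_.split("\\s+")) -> a list of word arrays
arrays = [re.split(r"\s+", line) for line in lines]

# .flatten -> one flat list of words
flattened = list(chain.from_iterable(arrays))

# .flatMap does both in a single step
flat_mapped = list(chain.from_iterable(re.split(r"\s+", line) for line in lines))

assert flattened == flat_mapped
print(flat_mapped)  # ['hello', 'world', 'a', 'new', 'line', 'hello', 'the', 'end']
```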
Key-Value Pairs

.map(word => (word, 1))  =  .map(word => Tuple2(word, 1))

RDD[String]       RDD[(String, Int)]
hello         ->  (hello, 1)
world         ->  (world, 1)
a             ->  (a, 1)
new           ->  (new, 1)
line          ->  (line, 1)
hello         ->  (hello, 1)
...               ...

// Step 4 - Map words to (word, 1) pairs
val counts = words.map(word => (word, 1))

RDD[(String, Int)] = RDD[Tuple2[String, Int]]  (a "Pair RDD")
Shuffling

RDD[(String, Int)]
(hello, 1) (a, 1) (world, 1) (new, 1) (line, 1) (hello, 1)

.groupByKey                           RDD[(String, Iterable[Int])]
(world, [1]) (a, [1]) (new, [1]) (line, [1]) (hello, [1, 1])

.mapValues(_.reduce((a, b) => a + b)) RDD[(String, Int)]
(world, 1) (a, 1) (new, 1) (line, 1) (hello, 2)

.reduceByKey((a, b) => a + b) performs both steps at once,
combining values map-side before the shuffle.

// Step 5 - Count all words
val freq = counts.reduceByKey(_ + _)
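The difference between the two routes can be sketched locally in plain Python (no actual shuffle happens here): groupByKey materializes every value per key, while reduceByKey folds values into one running total per key as they arrive, which is what lets Spark combine map-side before the shuffle.

```python
from collections import defaultdict

pairs = [("hello", 1), ("a", 1), ("world", 1),
         ("new", 1), ("line", 1), ("hello", 1)]

# Route 1: groupByKey + mapValues(reduce) -- holds every value per key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
via_group = {k: sum(vs) for k, vs in groups.items()}

# Route 2: reduceByKey -- keeps only one running total per key
via_reduce = {}
for k, v in pairs:
    via_reduce[k] = via_reduce.get(k, 0) + v

assert via_group == via_reduce
print(via_reduce["hello"])  # 2
```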
Top N (Prepare data)

// Step 6 - Swap tuples (partial code)
freq.map(_.swap)

RDD[(String, Int)]       RDD[(Int, String)]
(world, 1)           ->  (1, world)
(a, 1)               ->  (1, a)
(new, 1)             ->  (1, new)
(line, 1)            ->  (1, line)
(hello, 2)           ->  (2, hello)
Top N (First Attempt)

RDD[(Int, String)]
(1, world) (1, a) (1, new) (1, line) (2, hello)

.sortByKey          (sortByKey(false) for descending)
RDD[(Int, String)]
(2, hello) (1, world) (1, a) (1, new) (1, line)

.take(N)
Array[(Int, String)]
(2, hello) (1, world)
Top N

RDD[(Int, String)]
(1, world) (1, a) (1, new) (1, line) (2, hello)

.top(N)
Each partition computes a local top N *,
then the partial results are reduced.

Array[(Int, String)]
(2, hello) (1, line)

* local top N implemented with bounded priority queues

// Step 6 - Swap tuples and take top N (complete code)
val top = freq.map(_.swap).top(N)
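The bounded-priority-queue idea behind top(N) can be sketched with Python's heapq.nlargest: each partition keeps only its own N best candidates, and the small partial results are then merged (the partitions below are hypothetical data, not Spark output):

```python
import heapq

# hypothetical partitions of (count, word) pairs
partitions = [
    [(1, "world"), (1, "a"), (2, "hello")],
    [(1, "new"), (1, "line"), (3, "the")],
]
N = 2

# local top N per partition (bounded: at most N items kept per partition)
local_tops = [heapq.nlargest(N, part) for part in partitions]

# reduction: merge the partial results and take the global top N
top = heapq.nlargest(N, [item for part in local_tops for item in part])
print(top)  # [(3, 'the'), (2, 'hello')]
```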
Top Words by Frequency (Full Code)

val spark = new SparkContext()

// RDD creation from external data source
val docs = spark.textFile("hdfs://docs/")

// Lower-case lines and split them into words
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))

// Count all words (with automatic map-side combination)
val freq = counts.reduceByKey(_ + _)

// Swap tuples and get top results
val top = freq.map(_.swap).top(N)

top.foreach(println)
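As a sanity check of the pipeline's logic, here is a pure-Python analogue of the whole word count (no Spark involved; a small local list stands in for the HDFS input):

```python
import re
from collections import Counter

# stand-in for spark.textFile("hdfs://docs/")
docs = ["Hello World", "A New Line", "hello", "The End"]

lower = [line.lower() for line in docs]                        # map(_.toLowerCase)
words = [w for line in lower for w in re.split(r"\s+", line)]  # flatMap(split)
freq = Counter(words)                                          # map + reduceByKey
top = sorted(((n, w) for w, n in freq.items()), reverse=True)[:3]  # swap + top(N)
print(top)  # [(2, 'hello'), (1, 'world'), (1, 'the')]
```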
RDD Persistence (in-memory)

.cache()                 (memory only)
.persist()               (memory only)
.persist(storageLevel)   (lazy persistence & caching)

StorageLevel:
MEMORY_ONLY, MEMORY_ONLY_SER,
MEMORY_AND_DISK, MEMORY_AND_DISK_SER,
DISK_ONLY, ...
SchemaRDD

RDD of Row + Column Metadata
Queries with SQL
Support for Reflection, JSON, Parquet, ...

case class Word(text: String, n: Int)

val wordsFreq = freq.map {
  case (text, count) => Word(text, count)
} // RDD[Word]

wordsFreq.registerTempTable("wordsFreq")

val topWords = sql("select text, n
                    from wordsFreq
                    order by n desc
                    limit 20") // RDD[Row]

topWords.collect().foreach(println)
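A rough local analogue of the SQL step, using Python's built-in sqlite3 in place of Spark SQL (the table and column names mirror the slide's wordsFreq; the inserted rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table wordsFreq (text text, n int)")
conn.executemany("insert into wordsFreq values (?, ?)",
                 [("hello", 2), ("world", 1), ("line", 1)])

# same query shape as the slide: top words ordered by frequency
top_words = conn.execute(
    "select text, n from wordsFreq order by n desc limit 20"
).fetchall()
print(top_words[0])  # ('hello', 2)
```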
RDD Lineage

words = sc.textFile("hdfs://large/file/")  // HadoopRDD
         .map(_.toLowerCase)               // MappedRDD
         .flatMap(_.split(" "))            // FlatMappedRDD

alpha = words.filter(_.matches("[a-z]+"))  // FilteredRDD
nums  = words.filter(_.matches("[0-9]+"))  // FilteredRDD

alpha.count()  // Action: runs a job on the cluster

The lineage is built on the driver by the transformations;
only an action triggers execution on the cluster.
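The lazy-lineage idea can be sketched in a few lines of Python: transformations only record a recipe, and an action walks the whole chain (an illustrative toy class, not Spark's internals):

```python
class LocalRDD:
    """Toy RDD: transformations are recorded lazily; an action executes them."""
    def __init__(self, compute):
        self._compute = compute  # zero-arg function producing an iterator

    def map(self, f):
        # returns a new node that depends on self; nothing runs yet
        return LocalRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return LocalRDD(lambda: (x for x in self._compute() if pred(x)))

    def count(self):
        # action: evaluates the entire lineage now
        return sum(1 for _ in self._compute())

words = LocalRDD(lambda: iter(["Hello", "world", "42"]))
alpha = words.map(str.lower).filter(str.isalpha)  # lineage only
print(alpha.count())  # 2
```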
Deployment with Hadoop

The HDFS file /large/file is split into blocks A, B, C, D,
replicated (RF 3) across Data Nodes 1-4;
the Name Node tracks block metadata.

Spark Workers are co-located with the Data Nodes (DN + Spark),
coordinated by a Spark Master.

The Client submits the application (mode=cluster);
the Master allocates resources (cores and memory);
the Driver runs the application and coordinates the
Executors launched on the Workers.
Fernando Rodriguez Olivera
twitter: @frodriguez
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
 
L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
 
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-IDUNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
UNIT I INCEPTION OF INFORMATION DESIGN 20CDE09-ID
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
Unit 1 Information Storage and Retrieval
Unit 1 Information Storage and RetrievalUnit 1 Information Storage and Retrieval
Unit 1 Information Storage and Retrieval
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
 
Lecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdfLecture 6 - The effect of Corona effect in Power systems.pdf
Lecture 6 - The effect of Corona effect in Power systems.pdf
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
 
GUIA_LEGAL_CHAPTER_4_FOREIGN TRADE CUSTOMS.pdf
GUIA_LEGAL_CHAPTER_4_FOREIGN TRADE CUSTOMS.pdfGUIA_LEGAL_CHAPTER_4_FOREIGN TRADE CUSTOMS.pdf
GUIA_LEGAL_CHAPTER_4_FOREIGN TRADE CUSTOMS.pdf
 
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdfGUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
 
IWISS Catalog 2024
IWISS Catalog 2024IWISS Catalog 2024
IWISS Catalog 2024
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
PMSM-Motor-Control : A research about FOC
PMSM-Motor-Control : A research about FOCPMSM-Motor-Control : A research about FOC
PMSM-Motor-Control : A research about FOC
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
 

Apache Spark with Scala

  • 1. Apache Spark Fernando Rodriguez Olivera @frodriguez Buenos Aires, Argentina, Nov 2014 JAVACONF 2014
  • 2. Fernando Rodriguez Olivera Twitter: @frodriguez Professor at Universidad Austral (Distributed Systems, Compiler Design, Operating Systems, …) Creator of mvnrepository.com Organizer at Buenos Aires High Scalability Group, Professor at nosqlessentials.com
  • 3. Apache Spark Apache Spark is a Fast and General Engine for Large-Scale data processing Support for Batch, Interactive and Stream processing with a Unified API In-Memory computing primitives
  • 4. Hadoop MR Limits Job Job Job Hadoop HDFS - Communication between jobs through FS - Fault-Tolerance (between jobs) by Persistence to FS - Memory not managed (relies on OS caches) MapReduce designed for Batch Processing: Compensated with: Storm, Samza, Giraph, Impala, Presto, etc
  • 5. Daytona Gray Sort 100TB Benchmark source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html Data Size Time Nodes Cores Hadoop MR (2013) 102.5 TB 72 min 2,100 50,400 physical Apache Spark (2014) 100 TB 23 min 206 6,592 virtualized
  • 6. Daytona Gray Sort 100TB Benchmark source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html Data Size Time Nodes Cores Hadoop MR (2013) 102.5 TB 72 min 2,100 50,400 physical Apache Spark (2014) 100 TB 23 min 206 6,592 virtualized 3X faster using 10X fewer machines
  • 7. Hadoop vs Spark for Iterative Proc source: https://spark.apache.org/ Logistic regression in Hadoop and Spark
  • 8. Apache Spark Apache Spark (Core) Spark SQL Spark Streaming ML lib GraphX Powered by Scala and Akka
  • 9. Resilient Distributed Datasets (RDD) Hello World ... ... ... ... A New Line hello The End ... RDD of Strings Immutable Collection of Objects
  • 10. Resilient Distributed Datasets (RDD) Hello World ... ... ... ... A New Line hello The End ... RDD of Strings Immutable Collection of Objects Partitioned and Distributed
  • 11. Resilient Distributed Datasets (RDD) Hello World ... ... ... ... A New Line hello The End ... RDD of Strings Immutable Collection of Objects Partitioned and Distributed Stored in Memory
  • 12. Resilient Distributed Datasets (RDD) Hello World ... ... ... ... A New Line hello The End ... RDD of Strings Immutable Collection of Objects Partitioned and Distributed Stored in Memory Partitions Recomputed on Failure
  • 13. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings
  • 14. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings e.g: apply function to count chars Compute Function (transformation)
  • 15. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings 11 ... ... ... ... 10 5 7 ... RDD of Ints e.g: apply function to count chars Compute Function (transformation)
  • 16. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings 11 ... ... ... ... 10 5 7 ... RDD of Ints e.g: apply function to count chars Compute Function (transformation) depends on
  • 17. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings 11 ... ... ... ... 10 5 7 ... RDD of Ints e.g: apply function to count chars Compute Function (transformation) depends on N Int Action
  • 18. RDD Transformations and Actions Hello World ... ... ... ... A New Line hello The End ... RDD of Strings 11 ... ... ... ... 10 5 7 ... RDD of Ints e.g: apply function to count chars Compute Function (transformation) RDD Implementation: Partitions, Compute Function, Dependencies, Partitioner, Preferred Compute Location (for each partition) depends on N Int Action
  • 19. Spark API Scala: val spark = new SparkContext(); val lines = spark.textFile("hdfs://docs/") // RDD[String]; val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String]; val count = nonEmpty.count Java 8: SparkContext spark = new SparkContext(); JavaRDD<String> lines = spark.textFile("hdfs://docs/"); JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); long count = nonEmpty.count(); Python: spark = SparkContext(); lines = spark.textFile("hdfs://docs/"); nonEmpty = lines.filter(lambda line: len(line) > 0); count = nonEmpty.count()
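Outside Spark, the filter-then-count pattern on slide 19 amounts to a one-line list comprehension. A minimal plain-Python sketch (the sample lines are made up for illustration; no Spark required):

```python
# Plain-Python sketch of the filter/count pattern shown on slide 19.
lines = ["Hello World", "", "A New Line", "", "hello", "The End"]

# filter(lambda line: len(line) > 0) keeps only the non-empty lines
non_empty = [line for line in lines if len(line) > 0]

# count() returns the number of remaining elements
count = len(non_empty)
print(count)  # 4
```

The difference in Spark is that `filter` is lazy (it only records the transformation) and `count` is the action that triggers distributed execution.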
  • 21. Text Processing Example Top Words by Frequency (Step by step)
  • 22. Create RDD from External Data // Step 1 - Create RDD from Hadoop Text File val docs = spark.textFile("/docs/") Hadoop FileSystem, I/O Formats, Codecs HBase S3 HDFS MongoDB Cassandra … Apache Spark Spark can read/write from any data source supported by Hadoop I/O via Hadoop is optional (e.g: the Cassandra connector bypasses Hadoop) ElasticSearch
  • 23. Function map Hello World A New Line hello ... The end .map(line => line.toLowerCase) RDD[String] RDD[String] hello world a new line hello ... the end .map(_.toLowerCase) // Step 2 - Convert lines to lower case val lower = docs.map(line => line.toLowerCase)
  • 24. Functions map and flatMap hello world a new line hello ... the end RDD[String]
  • 25. Functions map and flatMap hello world a new line hello ... the end RDD[String] .map( … ) _.split("\s+") a hello hello ... the world new line end RDD[Array[String]]
  • 26. Functions map and flatMap hello world a new line hello ... the end RDD[String] .map( … ) _.split("\s+") a hello hello ... the world new line end RDD[Array[String]] .flatten hello a ... world new line RDD[String] *
  • 27. Functions map and flatMap hello world a new line hello ... the end .flatMap(line => line.split("\s+")) RDD[String] .map( … ) _.split("\s+") a hello hello ... the world new line end RDD[Array[String]] .flatten hello a ... world new line RDD[String] *
  • 28. Functions map and flatMap hello world a new line hello ... the end .flatMap(line => line.split("\s+")) Note: flatten() is not available in Spark, only flatMap RDD[String] .map( … ) _.split("\s+") a hello hello ... the world new line end RDD[Array[String]] // Step 3 - Split lines into words val words = lower.flatMap(line => line.split("\s+")) .flatten hello a ... world new line RDD[String] *
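The map-then-flatten picture on slides 25-28 can be sketched in plain Python (illustrative data, no Spark required): `map` with a splitting function yields one list per line, while `flatMap` yields the words themselves.

```python
# Plain-Python sketch of map vs flatMap on the example lines above.
lines = ["hello world", "a new line", "hello", "the end"]

# map(_.split("\s+")) gives one list of words per line: a nested structure
mapped = [line.split() for line in lines]

# flatMap(_.split("\s+")) gives the words directly, flattened
flat = [word for line in lines for word in line.split()]

print(mapped[0])  # ['hello', 'world']
print(flat[:4])   # ['hello', 'world', 'a', 'new']
```

This is why the word-count pipeline uses `flatMap`: the next step needs an RDD of words, not an RDD of arrays of words.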
  • 29. Key-Value Pairs hello a ... world new line hello hello a ... world new line hello .map(word => Tuple2(word, 1)) 1 1 1 1 1 1 .map(word => (word, 1)) RDD[String] RDD[(String, Int)] // Step 4 - Map each word to a (word, 1) pair val counts = words.map(word => (word, 1)) = RDD[Tuple2[String, Int]] Pair RDD
  • 32. Shuffling hello a world new line hello 1 1 1 1 1 1 world a 1 1 new 1 line hello 1 1 .groupByKey RDD[(String, Iterable[Int])] 1 RDD[(String, Int)] world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b RDD[(String, Int)]
  • 33. Shuffling hello a world new line hello 1 1 1 1 1 1 .reduceByKey((a, b) => a + b) world a 1 1 new 1 line hello 1 1 .groupByKey RDD[(String, Iterable[Int])] 1 RDD[(String, Int)] world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b RDD[(String, Int)]
  • 34. Shuffling hello a world new line hello 1 1 1 1 1 1 .reduceByKey((a, b) => a + b) // Step 5 - Count all words val freq = counts.reduceByKey(_ + _) world a 1 1 new 1 line hello 1 1 .groupByKey RDD[(String, Iterable[Int])] 1 RDD[(String, Int)] world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b RDD[(String, Int)]
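The point of `reduceByKey` over `groupByKey` is that partial sums are combined per partition before any data moves, so far less crosses the shuffle. A plain-Python sketch of that two-phase aggregation (the partitions and words are illustrative; no Spark required):

```python
# Sketch of reduceByKey((a, b) => a + b) semantics: map-side combine
# per partition, then a merge across partitions (the shuffle + reduce).
from collections import defaultdict

partitions = [
    ["hello", "a", "world"],
    ["new", "line", "hello"],
]

def combine_locally(words):
    # map-side combine: at most one partial count per key per partition
    partial = defaultdict(int)
    for w in words:
        partial[w] += 1
    return partial

# merge the per-partition partial counts
freq = defaultdict(int)
for part in partitions:
    for word, n in combine_locally(part).items():
        freq[word] += n

print(freq["hello"])  # 2
```

With `groupByKey`, every individual `1` would be shuffled before being reduced; with `reduceByKey`, each partition ships one pre-summed value per key.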
  • 35. Top N (Prepare data) world a 1 1 new 1 line hello 1 2 // Step 6 - Swap tuples (partial code) freq.map(_.swap) .map(_.swap) world a 1 1 new1 line hello 1 2 RDD[(String, Int)] RDD[(Int, String)]
  • 36. Top N (First Attempt) world a 1 1 new1 line hello 1 2 RDD[(Int, String)]
  • 37. Top N (First Attempt) world a 1 1 new1 line hello 1 2 RDD[(Int, String)] .sortByKey RDD[(Int, String)] hello world 2 1 a1 new line 1 1 (sortByKey(false) for descending)
  • 38. Top N (First Attempt) world a 1 1 new1 line hello 1 2 RDD[(Int, String)] Array[(Int, String)] hello world 2 1 .take(N).sortByKey RDD[(Int, String)] hello world 2 1 a1 new line 1 1 (sortByKey(false) for descending)
  • 39. Top N Array[(Int, String)] world a 1 1 new1 line hello 1 2 RDD[(Int, String)] world a 1 1 .top(N) hello line 2 1 hello line 2 1 local top N * local top N * reduction // Step 6 - Swap tuples (complete code) val top = freq.map(_.swap).top(N) * local top N implemented by bounded priority queues
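Slide 39 notes that the local top N is implemented with bounded priority queues. A plain-Python sketch of the same idea using `heapq.nlargest` (which also keeps only a bounded heap of size N); the pair data is illustrative and Spark is not required:

```python
# Sketch of top(N): each partition keeps a bounded local top N,
# then the driver merges the local winners in a final reduction.
import heapq

N = 2
partitions = [
    [(1, "world"), (2, "hello")],
    [(1, "new"), (1, "line"), (1, "a")],
]

# local top N per partition (bounded heap, O(partition size * log N))
local_tops = [heapq.nlargest(N, part) for part in partitions]

# final reduction over the merged local winners only
top = heapq.nlargest(N, [pair for local in local_tops for pair in local])
print(top)  # [(2, 'hello'), (1, 'world')]
```

This avoids the full `sortByKey` of the first attempt: only N pairs per partition ever reach the driver, instead of the whole sorted RDD.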
  • 40. val spark = new SparkContext() // RDD creation from external data source val docs = spark.textFile("hdfs://docs/") // Split lines into words val lower = docs.map(line => line.toLowerCase) val words = lower.flatMap(line => line.split("\s+")) val counts = words.map(word => (word, 1)) // Count all words (automatic combination) val freq = counts.reduceByKey(_ + _) // Swap tuples and get top results val top = freq.map(_.swap).top(N) top.foreach(println) Top Words by Frequency (Full Code)
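For reference, the full Scala pipeline on slide 40 can be mirrored end to end in a few lines of plain Python (illustrative input lines; Spark is not required, so laziness and distribution are of course absent):

```python
# Plain-Python equivalent of the word-count pipeline on slide 40.
import heapq
from collections import Counter

docs = ["Hello World", "A New Line", "hello", "The End"]

# map(toLowerCase) + flatMap(split) + map(word => (word, 1)) + reduceByKey(_ + _)
words = [w for line in docs for w in line.lower().split()]
freq = Counter(words)

# map(_.swap).top(N)
top = heapq.nlargest(3, ((n, w) for w, n in freq.items()))
print(top[0])  # (2, 'hello')
```

The structure matches the Spark version step for step; what Spark adds is that each step runs partitioned across a cluster with the shuffle and combine behavior described on the earlier slides.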
  • 42. SchemaRDD Row ... ... ... ... Row Row Row ... SchemaRDD RDD of Row + Column Metadata Queries with SQL Support for Reflection, JSON, Parquet, …
  • 43. SchemaRDD Row ... ... ... ... Row Row Row ... topWords case class Word(text: String, n: Int) val wordsFreq = freq.map { case (text, count) => Word(text, count) } // RDD[Word] wordsFreq.registerTempTable("wordsFreq") val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row] topWords.collect().foreach(println)
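To show what the SchemaRDD query on slide 43 actually computes, the same SQL can be run against the same shape of data with Python's built-in sqlite3 (the table rows here are illustrative; Spark SQL would execute this distributed over RDD partitions):

```python
# The slide's SQL query run against illustrative rows with sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table wordsFreq (text TEXT, n INTEGER)")
conn.executemany("insert into wordsFreq values (?, ?)",
                 [("hello", 2), ("world", 1), ("new", 1), ("line", 1)])

top_words = conn.execute(
    "select text, n from wordsFreq order by n desc limit 20").fetchall()
print(top_words[0])  # ('hello', 2)
```

The point of the SchemaRDD is exactly this: once an RDD of case classes is registered as a table, familiar SQL replaces the manual swap/top pipeline of the previous slides.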
  • 44. RDD Lineage words = sc.textFile("hdfs://large/file/") .flatMap(_.split(" ")) .map(_.toLowerCase) HadoopRDD FlatMappedRDD MappedRDD alpha = words.filter(_.matches("[a-z]+")) FilteredRDD nums = words.filter(_.matches("[0-9]+")) FilteredRDD alpha.count() Lineage (built on the driver by the transformations) Action (run job on the cluster) RDD Transformations
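The lineage idea on slide 44 (each RDD records its parent and deriving function, so lost partitions can be recomputed from the source) can be sketched in a few lines. This is a toy model, not Spark's actual classes, and it ignores partitioning:

```python
# Minimal sketch of lineage-based recomputation: each dataset remembers
# its parent and the function that derives it from the parent.
class SimpleRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, fn):
        # lazily record the transformation; nothing is computed yet
        return SimpleRDD(parent=self, fn=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return SimpleRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # walk the lineage back to the source, then replay transformations;
        # this is also what recovery after a lost partition would do
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

base = SimpleRDD(source=["Hello", "World", "42"])
alpha = base.map(str.lower).filter(str.isalpha)
print(alpha.compute())  # ['hello', 'world']
```

As on the slide, `map` and `filter` only build the lineage graph on the driver; only the action (`compute` here, `count` in Spark) runs the chain.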
  • 45. Deployment with Hadoop File /large/file is stored in HDFS as blocks A, B, C, D, replicated across Data Nodes 1-4 (RF 3), with a Name Node for metadata. Spark runs co-located with HDFS: a Spark Master plus a Spark Worker on each Data Node (DN + Spark). The client submits the application (mode=cluster); the master allocates resources (cores and memory), and the driver and executors run on the workers.