Big Data processing
with Apache Spark
Lucian Neghina
Big Data Architect
by Developers for Developers
Outline
● Spark Overview
● Structured APIs
● Low-Level APIs
● Streaming
● Writing an Application
Who uses Spark, and for what
Data Science
● Analyze and model the data
● Transform the data into a usable format
● Ad-hoc analysis, statistics,
machine learning
Data Processing
● Parallelize across clusters
● Hides the complexity of
distributed systems
○ Programming
○ Networking communication
○ Fault tolerance
Ecosystem
Functionality
Basic Architecture
Master-Workers
Directed Acyclic Graph (DAG)
DataFrame & Partitions
Core data structures
● Immutable
● Lives in memory
● Strongly typed
● Operations
○ Transformations (lazy)
○ Actions
Transformations and Actions
● Transformations return new RDDs as results
They are lazy: their result is not immediately computed
● Actions compute a result based on an RDD; the result is either returned to the driver or saved to storage
They are eager: their result is immediately computed
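A minimal sketch of the distinction, assuming the spark-shell's sc:

```scala
// Transformation: lazy — this only records the computation to perform
val squares = sc.parallelize(1 to 10).map(x => x * x)

// Action: eager — collect() triggers execution and returns the result
val result = squares.collect()   // Array(1, 4, 9, ..., 100)
```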
Transformations
Actions
[diagram: the lines RDD split into partitions]
count() causes Spark to:
● read data
● sum within partitions
● combine sums in driver
Actions
[diagram: the lines RDD split into partitions]
Spark recomputes lines:
● read data (again)
● sum within partitions
● combine sums in driver
[diagram: the comments RDD derived from lines]
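A sketch of this recomputation, assuming the spark-shell's sc and a hypothetical input file:

```scala
val lines = sc.textFile("/data/notes.txt")            // hypothetical path
val comments = lines.filter(_.startsWith("#"))

comments.count()   // reads the file, filters, counts
comments.count()   // lines was never cached, so Spark reads and filters again
```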
What happens when an action is executed?
[diagram: a driver coordinating three workers]
What happens when an action is executed?
The data is partitioned into different blocks
What happens when an action is executed?
Driver sends the code to be executed on
each block
What happens when an action is executed?
Read HDFS block
What happens when an action is executed?
Read HDFS block and cache the data
Process and send the result to the driver
What happens when an action is executed?
The driver combines the results (sum)
What happens when an action is executed?
Process from cache
What happens when an action is executed?
Send the data back to the driver
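The walkthrough above, sketched in code (the HDFS path is hypothetical; sc comes from the spark-shell):

```scala
// cache() asks each worker to keep its partitions in memory after first use
val logLines = sc.textFile("hdfs://namenode:8020/data/log.txt").cache()

logLines.filter(_.contains("ERROR")).count()  // reads blocks, fills worker caches
logLines.filter(_.contains("WARN")).count()   // served from the per-worker caches
```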
Distributed processing
The result is calculated as follows:
Partition 1: sum(all elements) + 3 (zero value)
Partition 2: sum(all elements) + 3 (zero value)
Partition 3: sum(all elements) + 3 (zero value)
Result = Partition 1 + Partition 2 + Partition 3 + 3 (zero value)
So we get 21 + 22 + 31 + (4 * 3) = 86
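This is fold with a zero value of 3; a minimal sketch with hypothetical data chosen so the raw partition sums are 21, 22 and 31:

```scala
// Six elements in three partitions: (10, 11), (10, 12), (15, 16)
val rdd = sc.parallelize(List(10, 11, 10, 12, 15, 16), 3)

// fold applies the zero value once per partition and once more in the driver:
// (21 + 3) + (22 + 3) + (31 + 3) + 3 = 21 + 22 + 31 + (4 * 3) = 86
val result = rdd.fold(3)(_ + _)   // 86
```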
Structured APIs
DataFrames, Datasets and SQL
Structured API Overview
Structured APIs apply to both batch and streaming
computation.
Core type of distributed collections:
● Datasets (typed) - checks types at compile time
● DataFrames (untyped) - checks types only at runtime
● SQL tables and views
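A short sketch of the three flavors, assuming the spark-shell's spark session (the data and the Person fields are hypothetical):

```scala
case class Person(name: String, age: Long)
import spark.implicits._

val df = Seq(("Ana", 34L), ("Ion", 28L)).toDF("name", "age")

// DataFrame (untyped): a wrong column name fails only at runtime
df.select("name")

// Dataset (typed): the compiler checks field names and types
val ds = df.as[Person]
ds.filter(_.age > 30)

// SQL view over the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30")
```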
Structured API Execution
Logical Planning
Physical Planning
Execution Pipeline
Basic Operations
● Schemas (schema-on-read)
Defines the column names and types of a DataFrame
● Columns and Expressions
Columns in Spark are similar to columns in a spreadsheet.
Expressions are operations on columns, such as selecting, manipulating, or removing them
● Records and Rows
Each row is a single record, represented as an object of type Row.
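A sketch tying these together (spark from the shell; the column names are hypothetical):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// An explicit schema instead of schema-on-read
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  LongType,   nullable = true)))

val rows = spark.sparkContext.parallelize(Seq(Row("Ana", 34L), Row("Ion", 28L)))
val people = spark.createDataFrame(rows, schema)

// Column expressions: select and derive columns
people.select(col("name"), (col("age") + 1).alias("age_next_year"))

// Each record is an object of type Row
val firstRecord: Row = people.first()
```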
Data Sources
Read API structure
DataFrameReader.format(...).option("key", "value").schema(...).load()
● CSV
● JSON
● Parquet
● ORC
● JDBC/ODBC connections
● Plain-text files
● and many, many others from the community
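A hedged sketch of the read API (the paths are hypothetical):

```scala
// CSV needs options; header handling and schema inference are opt-in
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/flights.csv")

// Parquet carries its own schema, so options are rarely needed
val parquetDf = spark.read.format("parquet").load("/data/flights.parquet")
```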
Aggregations
● Aggregation Functions
count, countDistinct, first and last, min and max, sum, avg
● Grouping
with Expressions, with Maps
● Grouping Sets
Rollups, Cube, Pivot
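A sketch of these aggregations on a hypothetical DataFrame df with name, department, and salary columns:

```scala
import org.apache.spark.sql.functions._

// Aggregation functions
df.select(count("name"), countDistinct("name"), min("salary"), avg("salary"))

// Grouping with expressions
df.groupBy("department").agg(sum("salary"), max("salary"))

// Grouping sets: rollup adds per-department subtotals plus a grand total
df.rollup("department").agg(sum("salary"))
```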
Joins
● Join Types
● Computation strategy
○ node-to-node (shuffle join)
○ per-node (broadcast join)
Big table to big table - shuffle join
Big table to small table - broadcast join
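A sketch of the two strategies (the DataFrames are hypothetical):

```scala
import org.apache.spark.sql.functions.broadcast

// Big table to big table: Spark defaults to a shuffle (sort-merge) join
val joined = ordersDf.join(customersDf, Seq("customerId"))

// Big table to small table: broadcast() ships the small table to every node,
// avoiding the shuffle entirely
val enriched = ordersDf.join(broadcast(countriesDf), Seq("countryCode"))
```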
Low-Level APIs
Resilient Distributed Datasets (RDDs)
What is an RDD
RDD as a Distributed Dataset
RDD is Fault-Tolerant
RDD is Immutable
Working with RDDs
RDD Transformations
RDD Actions
Actions that return results to the driver program
RDD Actions
Actions with side effects
Key-Value Pairs
What is a Key-Value Pair RDD?
● Any RDD whose elements are key-value pairs
○ A key-value pair is a tuple with two components: (key, value)
○ Different pairs may have the same key
○ Both keys and values can be of primitive or complex data types
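A minimal sketch, assuming the spark-shell's sc; the pairs RDD below is reused in the sketches that follow:

```scala
// A pair RDD: each element is a (key, value) tuple; keys may repeat
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
```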
Transformations on Pair RDDs
Actions on Pair RDDs
Aggregation on Pair RDDs
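A hedged sketch of a per-key transformation and a per-key action on the pairs RDD above:

```scala
// Transformation: merge values per key (combines within each partition first)
val sums = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)

// Action: returns a Map of counts per key to the driver
val counts = pairs.countByKey()       // Map("a" -> 2, "b" -> 1)
```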
Grouping and Sorting on Pair RDDs
● Grouping values with the same key
○ Reorganizing data by a new key
○ Post-processing per-key groups
● Sorting values using keys
○ Generating special-purpose datasets
○ Generating reports that require ordering
Grouping and Sorting on Pair RDDs
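A sketch on the same pairs RDD:

```scala
// Group all values sharing a key (moves every value for a key to one node)
val grouped = pairs.groupByKey()   // ("a", Iterable(1, 3)), ("b", Iterable(2))

// Sort by key, e.g. for reports that require ordering
val sorted = pairs.sortByKey()     // ordered by key: all "a" pairs before "b"
```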
Distributed Variables
Broadcast Variables
Main use cases:
● Application tasks across multiple stages need the same, relatively large, immutable dataset
● Application tasks need the same, relatively large, immutable dataset cached in deserialized form
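A minimal sketch (the lookup table and codesRdd are hypothetical):

```scala
// Ship a read-only lookup table to each executor once, not with every task
val countryNames = Map("ro" -> "Romania", "de" -> "Germany")
val bc = sc.broadcast(countryNames)

val named = codesRdd.map(code => bc.value.getOrElse(code, "unknown"))
```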
Accumulators
Main use cases:
● Counting and summation
● Application needs to compute multiple aggregates on the same dataset
● Application needs custom aggregation not supported by existing Spark operations
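A minimal sketch of the counting use case (linesRdd is hypothetical):

```scala
// Count malformed records as a side effect of another computation
val badRecords = sc.longAccumulator("badRecords")

val parsed = linesRdd.flatMap { line =>
  if (line.isEmpty) { badRecords.add(1); None } else Some(line)
}
parsed.count()              // accumulators only update when an action runs
println(badRecords.value)   // the total is read on the driver
```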
Persistence
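A sketch of persisting an RDD (the input path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/data/input.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()   // first action computes the RDD and populates the cache
lines.count()   // later actions reuse the persisted partitions
lines.unpersist()
```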
Tuning Partitioning
Distributed Collection of Partitions
● Spark automatically partitions RDDs
● Spark automatically distributes partitions among nodes
RDD Partitioning Properties
Number of partitions
Property Description
partitions Returns an array with all partition references for the source RDD
partitions.size Returns the number of partitions in the source RDD
Partitioner
Property Description
partitioner Returns an Option[Partitioner] for the source RDD
The partitioner can be a HashPartitioner, a RangePartitioner, or a custom partitioner
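A quick sketch of inspecting these properties, assuming the spark-shell's sc:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.partitions.size   // number of partitions (defaultParallelism here)
pairs.partitioner       // None: parallelize sets no partitioner

val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner      // Some(HashPartitioner(4))
```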
Partitioning and Computation
● Partition is the smallest unit of data
● Task is the smallest unit of computation
● Number of partitions = number of tasks
● Number of partitions
○ Affects the number of tasks and the level of parallelism
○ Goal: balancing task execution and scheduling times
● Partitioner
○ Affects key-based operations
○ Goal: Avoiding shuffling the same dataset multiple times
Controlling Partitioning
Partitioning Rules
● Parallelizing a Scala Collection
○ partitions size = defaultParallelism
○ partitioner = None
● Reading data from HDFS
○ partitions size = max(number of file blocks, defaultParallelism)
○ partitioner = None
● Retrieving data from Cassandra
○ partitions size = max(data size / 64 MB, defaultParallelism)
○ partitioner = None
Setting and Reusing a Partitioner
Mechanisms for Controlling Partitioning
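A hedged sketch of those mechanisms (rdd and pairs are hypothetical RDDs from earlier):

```scala
import org.apache.spark.HashPartitioner

// Change only the number of partitions
val wider    = rdd.repartition(8)   // full shuffle into 8 partitions
val narrower = rdd.coalesce(2)      // merges partitions without a shuffle

// Set a partitioner once, cache, and reuse it for later key-based operations
val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()
byKey.reduceByKey(_ + _)            // no shuffle: the partitioning is reused
```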
Data Shuffling