Big Data processing
with Apache Spark
Lucian Neghina
Big Data Architect
by Developers for Developers
Outline
● Spark Overview
● Structured APIs
● Low-Level APIs
● Streaming
● Writing an Application
Who uses Spark, and for what
Data Science
● Analyze and model the data
● Transform the data into a usable format
● Ad-hoc analysis, statistics,
machine learning
Data Processing
● Parallelize across clusters
● Hides the complexity of
distributed systems
○ Programming
○ Networking communication
○ Fault tolerance
Ecosystem
Functionality
Basic Architecture
Master-Workers
Directed Acyclic Graph (DAG)
DataFrame & Partitions
Core data structures
● Immutable
● Lives in memory
● Strongly typed
● Operations
○ Transformations (lazy)
○ Actions
Transformations and Actions
● Transformations return new RDDs as results
They are lazy: their result is not immediately computed
● Actions compute a result based on an RDD; the result is either returned to the driver or saved to storage
They are eager: their result is immediately computed
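A minimal sketch of the distinction, assuming the spark-shell's sc:

```scala
// Transformation: lazy — this only records the computation to perform
val squares = sc.parallelize(1 to 10).map(x => x * x)

// Action: eager — collect() triggers execution and returns the result
val result = squares.collect()   // Array(1, 4, 9, ..., 100)
```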
Transformations
Actions
[diagram: the lines RDD split into partitions]
count() causes Spark to:
● read data
● sum within partitions
● combine sums in driver
Actions
[diagram: the lines RDD split into partitions]
Spark recomputes lines:
● read data (again)
● sum within partitions
● combine sums in driver
[diagram: the comments RDD derived from lines]
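A sketch of this recomputation, assuming the spark-shell's sc and a hypothetical input file:

```scala
val lines = sc.textFile("/data/notes.txt")            // hypothetical path
val comments = lines.filter(_.startsWith("#"))

comments.count()   // reads the file, filters, counts
comments.count()   // lines was never cached, so Spark reads and filters again
```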
What happens when an action is executed?
[diagram: a driver coordinating three workers]
What happens when an action is executed?
The data is partitioned into different blocks
What happens when an action is executed?
Driver sends the code to be executed on
each block
What happens when an action is executed?
Read HDFS block
What happens when an action is executed?
Read HDFS block and cache the data
Process and send the result to the driver
What happens when an action is executed?
The driver combines the results (sum)
What happens when an action is executed?
Process from cache
What happens when an action is executed?
Send the data back to the driver
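The walkthrough above, sketched in code (the HDFS path is hypothetical; sc comes from the spark-shell):

```scala
// cache() asks each worker to keep its partitions in memory after first use
val logLines = sc.textFile("hdfs://namenode:8020/data/log.txt").cache()

logLines.filter(_.contains("ERROR")).count()  // reads blocks, fills worker caches
logLines.filter(_.contains("WARN")).count()   // served from the per-worker caches
```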
Distributed processing
The result is calculated as follows:
Partition 1: sum(all elements) + 3 (zero value)
Partition 2: sum(all elements) + 3 (zero value)
Partition 3: sum(all elements) + 3 (zero value)
Result = Partition 1 + Partition 2 + Partition 3 + 3 (zero value)
So we get 21 + 22 + 31 + (4 * 3) = 86
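This is fold with a zero value of 3; a minimal sketch with hypothetical data chosen so the raw partition sums are 21, 22 and 31:

```scala
// Six elements in three partitions: (10, 11), (10, 12), (15, 16)
val rdd = sc.parallelize(List(10, 11, 10, 12, 15, 16), 3)

// fold applies the zero value once per partition and once more in the driver:
// (21 + 3) + (22 + 3) + (31 + 3) + 3 = 21 + 22 + 31 + (4 * 3) = 86
val result = rdd.fold(3)(_ + _)   // 86
```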
Structured APIs
DataFrames, Datasets and SQL
Structured API Overview
Structured APIs apply to both batch and streaming
computation.
Core type of distributed collections:
● Datasets (typed) - checks types at compile time
● DataFrames (untyped) - checks types only at runtime
● SQL tables and views
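A short sketch of the three flavors, assuming the spark-shell's spark session (the data and the Person fields are hypothetical):

```scala
case class Person(name: String, age: Long)
import spark.implicits._

val df = Seq(("Ana", 34L), ("Ion", 28L)).toDF("name", "age")

// DataFrame (untyped): a wrong column name fails only at runtime
df.select("name")

// Dataset (typed): the compiler checks field names and types
val ds = df.as[Person]
ds.filter(_.age > 30)

// SQL view over the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30")
```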
Structured API Execution
Logical Planning
Physical Planning
Execution Pipeline
Basic Operations
● Schemas (schema-on-read)
Defines the column names and types of a DataFrame
● Columns and Expressions
Columns in Spark are similar to columns in a spreadsheet.
Expressions are operations on columns, such as selecting, manipulating, or removing them
● Records and Rows
Each row is a single record, represented as an object of type Row.
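A sketch tying these together (spark from the shell; the column names are hypothetical):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// An explicit schema instead of schema-on-read
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  LongType,   nullable = true)))

val rows = spark.sparkContext.parallelize(Seq(Row("Ana", 34L), Row("Ion", 28L)))
val people = spark.createDataFrame(rows, schema)

// Column expressions: select and derive columns
people.select(col("name"), (col("age") + 1).alias("age_next_year"))

// Each record is an object of type Row
val firstRecord: Row = people.first()
```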
Data Sources
Read API structure
DataFrameReader.format(...).option("key", "value").schema(...).load()
● CSV
● JSON
● Parquet
● ORC
● JDBC/ODBC connections
● Plain-text files
● and many, many others from the community
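A hedged sketch of the read API (the paths are hypothetical):

```scala
// CSV needs options; header handling and schema inference are opt-in
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/flights.csv")

// Parquet carries its own schema, so options are rarely needed
val parquetDf = spark.read.format("parquet").load("/data/flights.parquet")
```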
Aggregations
● Aggregation Functions
count, countDistinct, first and last, min and max, sum, avg
● Grouping
with Expressions, with Maps
● Grouping Sets
Rollups, Cube, Pivot
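A sketch of these aggregations on a hypothetical DataFrame df with name, department, and salary columns:

```scala
import org.apache.spark.sql.functions._

// Aggregation functions
df.select(count("name"), countDistinct("name"), min("salary"), avg("salary"))

// Grouping with expressions
df.groupBy("department").agg(sum("salary"), max("salary"))

// Grouping sets: rollup adds per-department subtotals plus a grand total
df.rollup("department").agg(sum("salary"))
```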
Joins
● Join Types
● Computation strategy
○ node-to-node (shuffle join)
○ per-node (broadcast join)
Big table to big table - shuffle join
Big table to small table - broadcast join
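A sketch of the two strategies (the DataFrames are hypothetical):

```scala
import org.apache.spark.sql.functions.broadcast

// Big table to big table: Spark defaults to a shuffle (sort-merge) join
val joined = ordersDf.join(customersDf, Seq("customerId"))

// Big table to small table: broadcast() ships the small table to every node,
// avoiding the shuffle entirely
val enriched = ordersDf.join(broadcast(countriesDf), Seq("countryCode"))
```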
Low-Level APIs
Resilient Distributed Datasets (RDDs)
What is an RDD
RDD as a Distributed Dataset
RDD is Fault-Tolerant
RDD is Immutable
Working with RDDs
RDD Transformations
RDD Actions
Actions that return results to the driver program
RDD Actions
Actions with side effects
Key-Value Pairs
What is a Key-Value Pair RDD?
● Any RDD whose elements are key-value pairs
○ A key-value pair is a tuple with two components: (key, value)
○ Different pairs may have the same key
○ Both keys and values can be of primitive or complex data types
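A minimal sketch, assuming the spark-shell's sc; the pairs RDD below is reused in the sketches that follow:

```scala
// A pair RDD: each element is a (key, value) tuple; keys may repeat
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
```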
Transformations on Pair RDDs
Actions on Pair RDDs
Aggregation on Pair RDDs
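A hedged sketch of a per-key transformation and a per-key action on the pairs RDD above:

```scala
// Transformation: merge values per key (combines within each partition first)
val sums = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)

// Action: returns a Map of counts per key to the driver
val counts = pairs.countByKey()       // Map("a" -> 2, "b" -> 1)
```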
Grouping and Sorting on Pair RDDs
● Grouping values with the same key
○ Reorganizing data by a new key
○ Post-processing per-key groups
● Sorting values using keys
○ Generating special-purpose datasets
○ Generating reports that require ordering
Grouping and Sorting on Pair RDDs
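A sketch on the same pairs RDD:

```scala
// Group all values sharing a key (moves every value for a key to one node)
val grouped = pairs.groupByKey()   // ("a", Iterable(1, 3)), ("b", Iterable(2))

// Sort by key, e.g. for reports that require ordering
val sorted = pairs.sortByKey()     // ordered by key: all "a" pairs before "b"
```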
Distributed Variables
Broadcast Variables
Main use cases:
● Application tasks across multiple stages need the same, relatively large, immutable dataset
● Application tasks need the same, relatively large, immutable dataset cached in deserialized form
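A minimal sketch (the lookup table and codesRdd are hypothetical):

```scala
// Ship a read-only lookup table to each executor once, not with every task
val countryNames = Map("ro" -> "Romania", "de" -> "Germany")
val bc = sc.broadcast(countryNames)

val named = codesRdd.map(code => bc.value.getOrElse(code, "unknown"))
```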
Accumulators
Main use cases:
● Counting and summation
● Application needs to compute multiple aggregates on the same dataset
● Application needs custom aggregation not supported by existing Spark operations
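A minimal sketch of the counting use case (linesRdd is hypothetical):

```scala
// Count malformed records as a side effect of another computation
val badRecords = sc.longAccumulator("badRecords")

val parsed = linesRdd.flatMap { line =>
  if (line.isEmpty) { badRecords.add(1); None } else Some(line)
}
parsed.count()              // accumulators only update when an action runs
println(badRecords.value)   // the total is read on the driver
```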
Persistence
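A sketch of persisting an RDD (the input path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/data/input.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs
lines.persist(StorageLevel.MEMORY_AND_DISK)

lines.count()   // first action computes the RDD and populates the cache
lines.count()   // later actions reuse the persisted partitions
lines.unpersist()
```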
Tuning Partitioning
Distributed Collection of Partitions
● Spark automatically partitions RDDs
● Spark automatically distributes partitions among nodes
RDD Partitioning Properties
Number of partitions
Property Description
partitions Returns an array with all partition references for the source RDD
partitions.size Returns the number of partitions in the source RDD
Partitioner
Property Description
partitioner Returns an Option[Partitioner] for the source RDD
The partitioner can be a HashPartitioner, a RangePartitioner, or a custom partitioner
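A quick sketch of inspecting these properties, assuming the spark-shell's sc:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
pairs.partitions.size   // number of partitions (defaultParallelism here)
pairs.partitioner       // None: parallelize sets no partitioner

val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner      // Some(HashPartitioner(4))
```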
Partitioning and Computation
● Partition is the smallest unit of data
● Task is the smallest unit of computation
● Number of partitions = number of tasks
● Number of partitions
○ Affects the number of tasks and the level of parallelism
○ Goal: balancing task execution and scheduling times
● Partitioner
○ Affects key-based operations
○ Goal: Avoiding shuffling the same dataset multiple times
Controlling Partitioning
Partitioning Rules
● Parallelizing a Scala Collection
○ partitions size = defaultParallelism
○ partitioner = None
● Reading data from HDFS
○ partitions size = max(number of file blocks, defaultParallelism)
○ partitioner = None
● Retrieving data from Cassandra
○ partitions size = max(data size / 64 MB, defaultParallelism)
○ partitioner = None
Setting and Reusing a Partitioner
Mechanisms for Controlling Partitioning
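A hedged sketch of those mechanisms (rdd and pairs are hypothetical RDDs from earlier):

```scala
import org.apache.spark.HashPartitioner

// Change only the number of partitions
val wider    = rdd.repartition(8)   // full shuffle into 8 partitions
val narrower = rdd.coalesce(2)      // merges partitions without a shuffle

// Set a partitioner once, cache, and reuse it for later key-based operations
val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()
byKey.reduceByKey(_ + _)            // no shuffle: the partitioning is reused
```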
Data Shuffling