Big Data processing with Apache Spark
- 3. Who uses Spark, and for what
Data Science
● Analyze and model the data
● Transforming the data into a
usable format
● Ad-hoc analysis, statistics,
machine learning
Data Processing
● Parallelize across clusters
● Hides the complexity of
distributed systems
○ Programming
○ Networking communication
○ Fault tolerance
- 10. Core data structures
● Immutable
● Lives in memory
● Strongly typed
● Operations
○ Transformations (lazy)
○ Actions
- 11. Transformations and Actions
● Transformations return new RDDs as results
They are lazy: their result is not immediately computed
● Actions compute a result based on an RDD; the result is either
returned to the driver or saved to external storage
They are eager: their result is immediately computed
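The lazy/eager split can be sketched in plain Python with a hypothetical `LazyRDD` class (not a real Spark type): transformations only record work as closures, and an action forces the evaluation.

```python
# Minimal pure-Python sketch of Spark's lazy-transformation model.
# LazyRDD is a hypothetical stand-in, not Spark's API: transformations
# only build up a computation; actions trigger it.

class LazyRDD:
    def __init__(self, data_fn):
        # data_fn is a zero-argument callable producing a fresh iterator
        self._data_fn = data_fn

    def map(self, f):
        # Transformation: returns a new LazyRDD, computes nothing yet
        return LazyRDD(lambda: (f(x) for x in self._data_fn()))

    def filter(self, pred):
        # Transformation: also lazy
        return LazyRDD(lambda: (x for x in self._data_fn() if pred(x)))

    def collect(self):
        # Action: forces evaluation and returns the results
        return list(self._data_fn())

    def count(self):
        # Action: eager as well
        return sum(1 for _ in self._data_fn())

rdd = LazyRDD(lambda: iter(range(10)))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed up to this point; collect() triggers it:
print(squares_of_evens.collect())  # [0, 4, 16, 36, 64]
```

Note that each action call re-runs the chain from the source; this is also true of a real RDD unless it is cached.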
- 17. What happens when an action is executed?
The data is partitioned into different blocks
[Diagram: Driver above three Workers, each holding one block (Block 1, Block 2, Block 3)]
- 18. What happens when an action is executed?
Driver sends the code to be executed on each block
[Diagram: Driver dispatching code to the three Workers]
- 19. What happens when an action is executed?
Each worker reads its HDFS block
[Diagram: Workers reading their HDFS blocks]
- 20. What happens when an action is executed?
Each worker reads its HDFS block and caches the data, then processes it and sends the result to the driver
[Diagram: Workers with a per-worker Cache, sending results to the Driver]
- 21. What happens when an action is executed?
The driver combines the partial results (e.g. a sum)
[Diagram: Driver aggregating the Workers' results]
- 22. What happens when an action is executed?
On subsequent actions, workers process the data from their cache
[Diagram: Workers serving data from Cache]
- 23. What happens when an action is executed?
Workers send the data back to the driver
[Diagram: Workers returning results to the Driver]
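The walkthrough above can be condensed into a toy single-process sketch, where "workers" are plain function calls and the "cache" is a dict. All names here are illustrative, not Spark APIs.

```python
# Toy sketch of the action-execution flow: read blocks, cache them,
# process per block, combine on the driver; a second action reuses the cache.

def run_action(blocks, task, combine, cache):
    partials = []
    for block_id, block in enumerate(blocks):
        if block_id not in cache:               # read the "HDFS block" once
            cache[block_id] = block             # ... and cache it on the worker
        partials.append(task(cache[block_id]))  # process the block
    return combine(partials)                    # driver combines the results

data = list(range(1, 10))                   # 1..9
blocks = [data[0:3], data[3:6], data[6:9]]  # partitioned into 3 blocks
cache = {}

total = run_action(blocks, sum, sum, cache)    # first action: reads and caches
print(total)    # 45
maximum = run_action(blocks, max, max, cache)  # second action: served from cache
print(maximum)  # 9
```

The second action never touches the source data again; each worker answers from its cached block.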
- 25. Distributed processing
The result is calculated as follows:
Partition 1: sum(all elements) + 3 (zero value)
Partition 2: sum(all elements) + 3 (zero value)
Partition 3: sum(all elements) + 3 (zero value)
Result = Partition 1 + Partition 2 + Partition 3 + 3 (zero value)
So we get 21 + 22 + 31 + (4 * 3) = 86
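The arithmetic can be reproduced in plain Python. The element values are not shown on the slide, so the partitions below are assumed: three partitions whose elements sum to 21, 22 and 31, with a zero value of 3 applied once per partition and once more when combining.

```python
# Reproducing the aggregate-with-zero-value arithmetic above.
# The partition contents are hypothetical; only their sums (21, 22, 31)
# are taken from the slide.

ZERO = 3
partitions = [[10, 11], [9, 13], [15, 16]]  # sums: 21, 22, 31

# Per-partition step: zero value + sum of that partition's elements
partials = [ZERO + sum(p) for p in partitions]  # [24, 25, 34]

# Combine step on the driver: the zero value is applied once more
result = ZERO + sum(partials)
print(result)  # 86  == 21 + 22 + 31 + 4 * 3
```

The zero value appears four times in total (3 partitions + 1 combine), which is where the `4 * 3` term comes from.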
- 27. Structured API Overview
Structured APIs apply to both batch and streaming
computation.
Core type of distributed collections:
● Datasets (typed) - checks schema at compile time
● DataFrames (untyped) - checks schema at runtime
● SQL tables and views
- 32. Basic Operations
● Schemas (schema-on-read)
Defines the column names and types of a DataFrame
● Columns and Expressions
Columns in Spark are similar to columns in a spreadsheet.
Expressions are operations on columns, such as selecting, manipulating,
or removing them
● Records and Rows
Each row is a single record represented as an object of type Row.
- 33. Data Sources
Read API structure
DataFrameReader.format(...).option("key", "value").schema(...).load()
● CSV
● JSON
● Parquet
● ORC
● JDBC/ODBC connections
● Plain-text files
● and many, many others from community
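The reader chain above follows a builder pattern: each call returns the reader itself so configuration accumulates until `load()`. The mock below (a hypothetical `MockDataFrameReader`, not Spark's implementation) makes the shape of that chain concrete in plain Python.

```python
# Simplified mock of the DataFrameReader builder pattern.
# Only the chaining mechanics are shown; no data is actually read.

class MockDataFrameReader:
    def __init__(self):
        self._format = None
        self._options = {}
        self._schema = None

    def format(self, fmt):
        self._format = fmt
        return self            # returning self is what enables chaining

    def option(self, key, value):
        self._options[key] = value
        return self

    def schema(self, s):
        self._schema = s
        return self

    def load(self, path=None):
        # A real reader would now open the source; here we just report
        # the accumulated configuration.
        return {"format": self._format,
                "options": dict(self._options),
                "path": path}

plan = (MockDataFrameReader()
        .format("csv")
        .option("header", "true")
        .load("/data/flights.csv"))   # hypothetical path
print(plan["format"], plan["options"])
```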
- 35. Joins
● Join Types
● Computation strategy
○ node-to-node (shuffle join)
○ per-node (broadcast join)
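The broadcast strategy can be illustrated without a cluster: instead of shuffling both sides across the network, the small table is copied to every partition of the large table and joined locally. All names below are illustrative, not Spark APIs.

```python
# Toy broadcast join: the small lookup table travels to each partition
# of the large table; each partition joins locally with no shuffle.

small = {1: "US", 2: "DE", 3: "FR"}   # small table: id -> country
large_partitions = [                   # large table, already partitioned
    [(1, "alice"), (2, "bob")],
    [(3, "carol"), (1, "dave")],
]

def broadcast_join(partition, lookup):
    # Runs per partition; lookup is the broadcast copy of the small table
    return [(key, name, lookup[key]) for key, name in partition if key in lookup]

joined = []
for part in large_partitions:
    joined.extend(broadcast_join(part, small))
print(joined)
```

A shuffle join would instead repartition both tables by key so matching keys land on the same node; broadcasting avoids that network cost when one side is small enough to fit in memory.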
- 49. What is a Key-Value Pair RDD?
● Any RDD whose elements are key-value pairs
○ A key-value pair is a tuple with two components: (key, value)
○ Different pairs may have the same key
○ Both keys and values can be of primitive or complex data types
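In plain Python, such a collection is just a list of 2-tuples. The `reduce_by_key` helper below is a pure-Python sketch of the reduceByKey idea (not Spark's API), shown to make the pair structure concrete.

```python
# Key-value pairs as (key, value) tuples; different pairs share a key.

pairs = [("spark", 1), ("hdfs", 1), ("spark", 1), ("yarn", 1), ("spark", 1)]

def reduce_by_key(kv_pairs, f):
    # Combine all values that share a key using the function f
    out = {}
    for key, value in kv_pairs:
        out[key] = f(out[key], value) if key in out else value
    return out

counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'spark': 3, 'hdfs': 1, 'yarn': 1}
```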
- 55. Grouping and Sorting on Pair RDDs
● Grouping values with the same key
○ Reorganizing data by a new key
○ Post-processing per-key groups
● Sorting values using keys
○ Generating special-purpose datasets
○ Generating reports that require ordering
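Both operations can be sketched in plain Python; the helper names mirror Spark's groupByKey/sortByKey but are not Spark calls.

```python
# Grouping collects all values with the same key; sorting orders pairs by key.

from collections import defaultdict

pairs = [("b", 2), ("a", 1), ("b", 5), ("a", 4), ("c", 3)]

def group_by_key(kv):
    groups = defaultdict(list)
    for k, v in kv:
        groups[k].append(v)     # values with the same key end up together
    return dict(groups)

def sort_by_key(kv):
    return sorted(kv, key=lambda pair: pair[0])  # ordered by key

print(group_by_key(pairs))  # {'b': [2, 5], 'a': [1, 4], 'c': [3]}
print(sort_by_key(pairs))   # [('a', 1), ('a', 4), ('b', 2), ('b', 5), ('c', 3)]
```

In real Spark, both of these are wide operations: grouping by a new key or sorting generally requires a shuffle.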
- 60. Broadcast Variables
Main use cases:
● Application tasks across multiple stages need the same, relatively large and
immutable dataset
● Application tasks need the same, relatively large and immutable dataset cached
in deserialized form
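The "cached in deserialized form" point can be sketched in plain Python: one deserialized copy of a large immutable dataset is shared by all tasks, instead of being deserialized per task. The `Broadcast` class below is a toy model, not Spark's implementation (the real entry point is `sc.broadcast(...)`).

```python
# Toy broadcast variable: the serialized payload is deserialized at most
# once, however many tasks read it.

import json

big_lookup_serialized = json.dumps({str(i): i * i for i in range(1000)})

class Broadcast:
    def __init__(self, serialized):
        self._serialized = serialized
        self._value = None
        self.deserializations = 0

    @property
    def value(self):
        if self._value is None:  # deserialize once, then reuse
            self._value = json.loads(self._serialized)
            self.deserializations += 1
        return self._value

bc = Broadcast(big_lookup_serialized)

# Many tasks across stages read the same broadcast value:
results = [bc.value[str(k)] for k in (2, 3, 4)]
print(results, bc.deserializations)  # [4, 9, 16] 1
```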
- 61. Accumulators
Main use cases:
● Counting and summation
● Application needs to compute multiple aggregates on the same dataset
● Application needs custom aggregation not supported by existing Spark operations
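A minimal accumulator can be sketched in plain Python (not Spark's `LongAccumulator` API): tasks add to it as a side effect, and the driver reads the total afterwards.

```python
# Toy accumulator: tasks call add(); only the driver reads value.

class Accumulator:
    def __init__(self, initial=0):
        self._value = initial

    def add(self, n):          # called from tasks (workers)
        self._value += n

    @property
    def value(self):           # read on the driver
        return self._value

bad_records = Accumulator()
records = ["ok", "ok", "corrupt", "ok", "corrupt"]

for rec in records:            # imagine this loop distributed over tasks
    if rec == "corrupt":
        bad_records.add(1)

print(bad_records.value)  # 2
```

This side-channel counting is the classic use: the main computation produces the dataset, while the accumulator tallies something about it (here, corrupt records) in the same pass.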
- 69. Distributed Collection of Partitions
● Spark automatically partitions RDDs
● Spark automatically distributes partitions among nodes
- 70. RDD Partitioning Properties
Number of partitions
Property Description
partitions Returns an array with all partition references for the source RDD
partitions.size Returns a number of partitions in the source RDD
Partitioner
Property Description
partitioner Returns an Option[Partitioner] for the source RDD
Partitioner can refer to HashPartitioner, RangePartitioner or custom
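The idea behind a hash partitioner can be shown in a few lines of plain Python (this mirrors what Spark's HashPartitioner does conceptually, not its actual code): the key's hash, taken modulo the partition count, picks the partition, so equal keys always land together.

```python
# Toy hash partitioner: same key -> same partition, every time.

def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 4
keys = [0, 1, 2, 3, 4, 5, 6, 7]
assignment = {k: hash_partition(k, num_partitions) for k in keys}
print(assignment)
```

A RangePartitioner instead assigns contiguous key ranges to partitions, which is what makes sorted output possible.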
- 71. Partitioning and Computation
● Partition is the smallest unit of data
● Task is the smallest unit of computation
● Number of partitions = number of tasks
- 72. Controlling Partitioning
● Number of partitions
○ Affects the number of tasks and the level of parallelism
○ Goal: balancing task execution and scheduling times
● Partitioner
○ Affects key-based operations
○ Goal: avoiding shuffling the same dataset multiple times
- 73. Partitioning Rules
● Parallelizing a Scala Collection
○ partitions size = defaultParallelism
○ partitioner = None
● Reading data from HDFS
○ partitions size = max(number of file blocks, defaultParallelism)
○ partitioner = None
● Retrieving data from Cassandra
○ partitions size = max(data-size / 64 MB, defaultParallelism)
○ partitioner = None
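The HDFS and Cassandra rules above reduce to small formulas; the helpers below express them directly (the functions are illustrative, not Spark APIs; `defaultParallelism` and the 64 MB split size are taken from the slide).

```python
# Partition-count rules as plain functions.

def hdfs_partitions(num_file_blocks, default_parallelism):
    # One partition per HDFS block, but never fewer than defaultParallelism
    return max(num_file_blocks, default_parallelism)

def cassandra_partitions(data_size_mb, default_parallelism):
    # One partition per 64 MB of data, but never fewer than defaultParallelism
    return max(data_size_mb // 64, default_parallelism)

print(hdfs_partitions(10, 4))         # 10 -> one partition per block
print(cassandra_partitions(1024, 4))  # 16 -> 1024 MB / 64 MB splits
print(hdfs_partitions(2, 4))          # 4  -> falls back to defaultParallelism
```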