Introduction to Spark 2.0 Dataset API
- 2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
- 3. Agenda
● API’s of Spark
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset Vs Dataframe API’s
● Understanding Encoders
- 4. API’s of Spark
● RDD
○ Lowest level, functional style for unstructured data
processing, introduced in 0.1
● Dataframe
○ Structured processing, relational API introduced in
1.3
● Dataset
Combines both functional and relational to one API
introduced in 1.6
- 5. Dawn of Structured Processing
● From last few years, more and more data in big data
coming from structured or semi structured sources
● Hadoop only supported unstructured data at platform
level
● But as demand for structure data processing increased,
spark went ahead and supported structured data in
platform level itself[1]
● Spark 2.0 is the big step in the direction of structured
first, unstructured next approach.
- 6. Dataset Abstraction
● A Dataset is a strongly typed collection of domain-
specific objects that can be transformed in parallel using
functional or relational operations. Each dataset also
has an untyped view called a DataFrame, which is a
Dataset of Row
● RDD represents an immutable,partitioned collection of
elements that can be operated on in parallel
● Has custom DSL and runs on custom memory
management
- 7. Spark Session API
● New entry point in spark for creating for creating
datasets
● Replaces SQLContext,HiveContext and
StreamingContext
● Most of the programs only need to create this no more
SparkContext
● Move from SparkContext to SparkSession signifies
move away from RDD
● Ex : SparkSessionExample.scala
- 8. Dataset WordCount
● Dataset provides very similar DSL as RDD
● It combines best of RDD and Dataframe to single API
● Dataframe is now aliased now to Dataset[Row]
● One of the big change from RDD API, is moving away
from key/value pair based API to more SQL like API
● Dataset signifies departure from well know Map/Reduce
like API to more of optimized data handling DSL
● Ex : DataSetWordCount.scala
- 9. RDD to Dataset
● Dataframe lacked functional programming aspects of
RDD which made moving code from RDD to DF more
challenging
● But with Dataset, most of the RDD expressions can be
easily expressed in more elegantly
● Though both are DSL, they differ large in
implementation
● Most of the Dataset operation is ran through code
generation and custom serialization
● Ex : RDDToDataset.scala
- 11. Dataframe vs Dataset
● Most of the logical plans and optimizations of Dataframe
are now moved into Dataset
● Dataframe is now a schema less Dataset
● One of the difference of Dataset from Dataframe is, it
adds an additional step for serialization and checking for
proper schema
● This serialization is different than spark and kryo. It’s a
macro based serialization framework
● Ex : DatasetVsDataframe.scala