Introduction to Spark 2.0 Dataset API

Introduction to Dataset API
Spark 2.0 Dataset Abstraction
https://github.com/phatak-dev/spark2.0-examples

● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com

Agenda
● APIs of Spark
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs DataFrame APIs
● Understanding Encoders

APIs of Spark
● RDD
○ Lowest level, functional style for unstructured data
processing, introduced in 0.1
● DataFrame
○ Structured processing, relational API, introduced in 1.3
● Dataset
○ Combines the functional and relational styles in one API, introduced in 1.6

Dawn of Structured Processing
● Over the last few years, more and more big data has come from structured or semi-structured sources
● Hadoop supported only unstructured data at the platform level
● As demand for structured data processing increased, Spark added support for structured data at the platform level itself [1]
● Spark 2.0 is a big step toward a structured-first, unstructured-next approach

Dataset Abstraction
● A Dataset is a strongly typed collection of domain-
specific objects that can be transformed in parallel using
functional or relational operations. Each dataset also
has an untyped view called a DataFrame, which is a
Dataset of Row
● By contrast, an RDD represents an immutable, partitioned collection of elements that can be operated on in parallel
● A Dataset has its own DSL and runs on custom memory management
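The typed and untyped views described above can be sketched as follows. This is a minimal illustration, not code from the talk: the `Person` case class, the sample rows, and the `DatasetAbstractionSketch` object are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetAbstractionSketch {
  def adultNames(spark: SparkSession): Array[String] = {
    // Brings encoders for case classes and primitives into scope.
    import spark.implicits._

    // Strongly typed collection of domain-specific objects.
    val people: Dataset[Person] = Seq(Person("asha", 31), Person("ravi", 17)).toDS()

    // Functional operation: the lambda receives Person objects.
    val adults: Dataset[Person] = people.filter(p => p.age >= 18)

    // Untyped view: a DataFrame is a Dataset of Row.
    val df: DataFrame = adults.toDF()

    // Relational operation on the untyped view, then back to a typed Dataset.
    df.select("name").as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-abstraction-sketch")
      .getOrCreate()
    try adultNames(spark).foreach(println)
    finally spark.stop()
  }
}
```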

Spark Session API
● New entry point in Spark for creating Datasets
● Replaces SQLContext and HiveContext (DStream-based streaming still uses StreamingContext)
● Most programs only need to create a SparkSession; no explicit SparkContext is required
● The move from SparkContext to SparkSession signals a move away from RDDs
● Ex : SparkSessionExample.scala
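A minimal sketch of creating a session, assuming a local master. It is not the contents of SparkSessionExample.scala; the `SparkSessionSketch` object and the application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  // Builds (or reuses) a session; this is the single entry point.
  def build(): SparkSession =
    SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration
      .appName("spark-session-sketch")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()

    // The underlying SparkContext is still reachable when RDDs are needed.
    val sc = spark.sparkContext
    println(sc.appName)

    // Datasets are created directly from the session.
    val numbers = spark.range(5) // Dataset of java.lang.Long
    println(numbers.count())

    spark.stop()
  }
}
```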

Dataset WordCount
● Dataset provides a DSL very similar to that of RDD
● It combines the best of RDD and DataFrame in a single API
● DataFrame is now an alias for Dataset[Row]
● One big change from the RDD API is the move away from a key/value pair based API to a more SQL-like API
● Dataset marks a departure from the well-known Map/Reduce-like API toward a more optimized data handling DSL
● Ex : DataSetWordCount.scala
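A word count in the Dataset style might look like the sketch below. It is an illustration under assumptions, not the contents of DataSetWordCount.scala: the input is an in-memory `Seq` rather than a file, and `DatasetWordCountSketch` is a hypothetical name. Note that grouping uses `groupByKey` followed by `count` instead of the RDD idiom of mapping to `(word, 1)` pairs and calling `reduceByKey`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetWordCountSketch {
  def wordCount(spark: SparkSession, lines: Seq[String]): Map[String, Long] = {
    import spark.implicits._

    val text: Dataset[String] = lines.toDS()

    // Same functional operators as on an RDD.
    val words: Dataset[String] =
      text.flatMap(line => line.split("\\s+")).filter(word => word.nonEmpty)

    // SQL-like grouping: no explicit (word, 1) key/value pairs.
    val counts: Dataset[(String, Long)] =
      words.groupByKey(word => word.toLowerCase).count()

    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-wordcount-sketch")
      .getOrCreate()
    try wordCount(spark, Seq("to be or", "not to be")).foreach(println)
    finally spark.stop()
  }
}
```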

RDD to Dataset
● DataFrame lacked the functional programming aspects of RDD, which made moving code from RDD to DataFrame challenging
● With Dataset, most RDD expressions can be expressed easily and more elegantly
● Though both are DSLs, they differ greatly in implementation
● Most Dataset operations run through code generation and custom serialization
● Ex : RDDToDataset.scala
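To show how closely the two DSLs line up, here is the same small pipeline written against both APIs. This is a sketch, not the contents of RDDToDataset.scala; the `RddToDatasetSketch` object and the sample numbers are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDatasetSketch {
  // RDD version: filter and map on plain JVM objects.
  def doubledEvensRdd(spark: SparkSession, nums: Seq[Int]): Array[Int] =
    spark.sparkContext
      .parallelize(nums)
      .filter(n => n % 2 == 0)
      .map(n => n * 2)
      .collect()

  // Dataset version: the expressions are the same, but the data is
  // converted with an encoder instead of Java/Kryo serialization.
  def doubledEvensDs(spark: SparkSession, nums: Seq[Int]): Array[Int] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(nums)
    val ds: Dataset[Int] = rdd.toDS() // existing RDD converted to a Dataset
    ds.filter(n => n % 2 == 0)
      .map(n => n * 2)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-dataset-sketch")
      .getOrCreate()
    try {
      val nums = Seq(1, 2, 3, 4, 5, 6)
      println(doubledEvensRdd(spark, nums).sorted.mkString(","))
      println(doubledEvensDs(spark, nums).sorted.mkString(","))
    } finally spark.stop()
  }
}
```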

DataFrame vs Dataset
● Most of the logical plans and optimizations of DataFrame have now moved into Dataset
● DataFrame is now an untyped Dataset (Dataset[Row]); its columns are not checked at compile time
● One difference between Dataset and DataFrame is that Dataset adds an extra step for serialization and for checking that the schema matches
● This serialization is different from Java and Kryo serialization. It's a macro based serialization framework
● Ex : DatasetVsDataframe.scala
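The difference in schema checking can be sketched as below. This is an illustration, not the contents of DatasetVsDataframe.scala: the `Sale` case class, the sample rows, and the deliberately misspelled column name `amout` are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import scala.util.Try

// Hypothetical domain class, used only for this illustration.
case class Sale(item: String, amount: Double)

object DatasetVsDataFrameSketch {
  def sales(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("pen", 2.5), ("book", 10.0)).toDF("item", "amount")
  }

  // Untyped: column names are plain strings, so a misspelled name
  // compiles and only fails when Spark analyzes the query.
  def selectMisspelledColumn(df: DataFrame): Try[DataFrame] =
    Try(df.select("amout"))

  // Typed: as[Sale] resolves the columns against the case class fields,
  // and later field access is checked by the compiler.
  def typedTotal(spark: SparkSession, df: DataFrame): Double = {
    import spark.implicits._
    val typed: Dataset[Sale] = df.as[Sale]
    typed.map(sale => sale.amount).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local run for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-vs-dataframe-sketch")
      .getOrCreate()
    try {
      val df = sales(spark)
      println(selectMisspelledColumn(df).isFailure)
      println(typedTotal(spark, df))
    } finally spark.stop()
  }
}
```

With the typed version, writing `sale.amout` would be rejected by the Scala compiler rather than by Spark at runtime.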

References
● https://www.youtube.com/watch?v=0jd3EWmKQfo
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/
● https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
● https://www.youtube.com/watch?v=hHFuKeeQujc
