Introduction to Dataset API
Spark 2.0 Dataset Abstraction
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● APIs of Spark
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs Dataframe APIs
● Understanding Encoders
APIs of Spark
● RDD
○ Lowest level, functional style for unstructured data
processing, introduced in 0.1
● Dataframe
○ Structured processing, relational API introduced in
1.3
● Dataset
○ Combines both functional and relational styles into
one API, introduced in 1.6
Dawn of Structured Processing
● Over the last few years, more and more big data has
come from structured or semi-structured sources
● Hadoop only supported unstructured data at the
platform level
● But as demand for structured data processing grew,
Spark went ahead and supported structured data at the
platform level itself[1]
● Spark 2.0 is a big step in the direction of a structured-
first, unstructured-next approach
Dataset Abstraction
● A Dataset is a strongly typed collection of domain-
specific objects that can be transformed in parallel using
functional or relational operations. Each dataset also
has an untyped view called a DataFrame, which is a
Dataset of Row
● An RDD represents an immutable, partitioned collection
of elements that can be operated on in parallel
● Dataset has a custom DSL and runs on custom memory
management
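The two views described above can be sketched as follows. This is a minimal illustration assuming Spark 2.x on the classpath; the `Person` case class and its fields are hypothetical, not from the deck's repository.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain object (not from the original examples)
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DatasetAbstractionSketch")
  .getOrCreate()
import spark.implicits._

// Strongly typed collection of domain-specific objects
val people = Seq(Person("alice", 30), Person("bob", 15)).toDS()

// Functional operation, checked at compile time
val adults = people.filter(_.age >= 18)

// Untyped view: a DataFrame is just Dataset[Row]
val df = people.toDF()
df.printSchema()
```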
Spark Session API
● New entry point in Spark for creating datasets
● Replaces SQLContext and HiveContext
● Most programs only need to create this; no more
explicit SparkContext
● The move from SparkContext to SparkSession signifies
a move away from RDD
● Ex : SparkSessionExample.scala
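A minimal sketch of the new entry point; the repository's `SparkSessionExample.scala` may differ in details such as the app name and master URL.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")             // local mode, for illustration only
  .appName("SparkSessionExample")
  // .enableHiveSupport()         // opt in to what HiveContext provided (needs Hive deps)
  .getOrCreate()

// The underlying SparkContext is still reachable when needed
val sc = spark.sparkContext
```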
Dataset WordCount
● Dataset provides a DSL very similar to RDD's
● It combines the best of RDD and Dataframe into a
single API
● Dataframe is now aliased to Dataset[Row]
● One of the big changes from the RDD API is the move
away from a key/value pair based API to a more
SQL-like API
● Dataset signifies a departure from the well-known
Map/Reduce-like API to a more optimized data
handling DSL
● Ex : DataSetWordCount.scala
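The shift from pair-based to SQL-like grouping can be sketched as below; the actual `DataSetWordCount.scala` in the linked repo may differ, and the input path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DataSetWordCountSketch")
  .getOrCreate()
import spark.implicits._

val lines = spark.read.textFile("input.txt")  // Dataset[String]

val counts = lines
  .flatMap(_.split(" "))      // RDD-like functional step
  .groupByKey(identity)       // SQL-like grouping, no pair RDDs
  .count()                    // Dataset[(String, Long)]

counts.show()
```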
RDD to Dataset
● Dataframe lacked the functional programming aspects
of RDD, which made moving code from RDD to DF
challenging
● But with Dataset, most RDD expressions can be
expressed more elegantly
● Though both are DSLs, they differ greatly in
implementation
● Most Dataset operations run through code generation
and custom serialization
● Ex : RDDToDataset.scala
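The point above can be sketched with the same lambda on both abstractions; names and data are illustrative, not the repo's `RDDToDataset.scala` verbatim.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RddToDatasetSketch")
  .getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))

// RDD style: closure runs as-is via Java/Kryo serialization
val doubledRdd = rdd.map(_ * 2)

// Same expression on a Dataset: identical lambda, but it runs
// through Catalyst code generation and encoder-based serialization
val ds = rdd.toDS()
val doubledDs = ds.map(_ * 2)
```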
Dataframe and Dataset
Dataframe vs Dataset
● Most of the logical plans and optimizations of Dataframe
are now moved into Dataset
● Dataframe is now simply an untyped Dataset
(Dataset[Row])
● One difference of Dataset from Dataframe is that it
adds an additional serialization step and checks for a
proper schema
● This serialization is different from Java and Kryo
serialization; it's a macro based serialization framework
● Ex : DatasetVsDataframe.scala
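A sketch of the contrast between the untyped and typed views; the `Sale` case class and column names are hypothetical, and the real `DatasetVsDataframe.scala` may differ.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain object (not from the original examples)
case class Sale(item: String, amount: Double)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DatasetVsDataframeSketch")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Sale("book", 10.0), Sale("pen", 2.0)).toDS()
val df = ds.toDF()  // DataFrame == Dataset[Row]

// DataFrame: columns resolved at runtime; a typo fails only on execution
df.select("amount").show()

// Dataset: fields checked at compile time; ds.map(_.amout) would not compile
val amounts = ds.map(_.amount)
```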
References
● https://www.youtube.com/watch?v=0jd3EWmKQfo
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/
● https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
● https://www.youtube.com/watch?v=hHFuKeeQujc
