SlideShare a Scribd company logo
Spark SQL, Dataframes, SparkR
hadoop fs -cat /data/spark/books.xml
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>
An in-depth look at creating applications
…
…
</book>
<book id="bk101">
…
…
</book>
…
...
</catalog>
Loading XML
Spark SQL, Dataframes, SparkR
We will use: https://github.com/databricks/spark-xml
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Loading XML
Spark SQL, Dataframes, SparkR
We will use: https://github.com/databricks/spark-xml
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Load the Data:
val df = spark.read.format("xml").option("rowTag",
"book").load("/data/spark/books.xml")
OR
val df = spark.read.format("com.databricks.spark.xml")
.option("rowTag", "book").load("/data/spark/books.xml")
Loading XML
Spark SQL, Dataframes, SparkR
Loading XML
scala> df.show()
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
| _id| author| description| genre|price|publish_date| title|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
|bk101|Gambardella, Matthew|
An in...| Computer|44.95| 2000-10-01|XML Developer's G...|
|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16| Midnight Rain|
|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17| Maeve Ascendant|
|bk104| Corets, Eva|In post-apocalyps...| Fantasy| 5.95| 2001-03-10| Oberon's Legacy|
|bk105| Corets, Eva|The two daughters...| Fantasy| 5.95| 2001-09-10| The Sundered Grail|
|bk106| Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| Lover Birds|
|bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| Splish Splash|
|bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| Creepy Crawlies|
|bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| Paradox Lost|
|bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09|Microsoft .NET: T...|
|bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01|MSXML3: A Compreh...|
|bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16|Visual Studio 7: ...|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
Display Data:
df.show()
Spark SQL, Dataframes, SparkR
What is RPC - Remote Process Call
[{
Name: John,
Phone: 1234
},
{
Name: John,
Phone: 1234
},]
…
getPhoneBook("myuserid")
Spark SQL, Dataframes, SparkR
Avro is:
1. A Remote Procedure call
2. Data Serialization Framework
3. Uses JSON for defining data types and protocols
4. Serializes data in a compact binary format
5. Similar to Thrift and Protocol Buffers
6. Doesn't require running a code-generation program
Its primary use is in Apache Hadoop, where it can provide both a serialization format
for persistent data, and a wire format for communication between Hadoop nodes,
and from client programs to the Hadoop services.
Apache Spark SQL can access Avro as a data source.[1]
AVRO
Spark SQL, Dataframes, SparkR
We will use: https://github.com/databricks/spark-avro
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Loading AVRO
Spark SQL, Dataframes, SparkR
We will use: https://github.com/databricks/spark-avro
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Load the Data:
val df = spark.read.format("com.databricks.spark.avro")
.load("/data/spark/episodes.avro")
Display Data:
df.show()
+--------------------+----------------+------+
| title| air_date|doctor|
+--------------------+----------------+------+
| The Eleventh Hour| 3 April 2010| 11|
| The Doctor's Wife| 14 May 2011| 11|
Loading AVRO
Spark SQL, Dataframes, SparkR
https://parquet.apache.org/
Data Sources
● Columnar storage format
● Any project in the Hadoop ecosystem
● Regardless of
○ Data processing framework
○ Data model
○ Programming language.
Spark SQL, Dataframes, SparkR
var df = spark.read.load("/data/spark/users.parquet")
Data Sources
Method1 - Automatically (parquet unless otherwise configured)
Spark SQL, Dataframes, SparkR
var df = spark.read.load("/data/spark/users.parquet")
df = df.select("name", "favorite_color")
df.write.save("namesAndFavColors_21jan2018.parquet")
Data Sources
Method1 - Automatically (parquet unless otherwise configured)
Spark SQL, Dataframes, SparkR
Data Sources
Method2 - Manually Specifying Options
df = spark.read.format("json").load("/data/spark/people.json")
df = df.select("name", "age")
df.write.format("parquet").save("namesAndAges.parquet")
Spark SQL, Dataframes, SparkR
Data Sources
Method3 - Directly running sql on file
val sqlDF = spark.sql("SELECT * FROM parquet.`/data/spark/users.parquet`")
val sqlDF = spark.sql("SELECT * FROM json.`/data/spark/people.json`")
Spark SQL, Dataframes, SparkR
● Spark SQL also supports reading and writing data stored in Apache Hive.
● Since Hive has a large number of dependencies, it is not included in the default Spark assembly.
Hive Tables
Spark SQL, Dataframes, SparkR
Hive Tables
● Place your hive-site.xml, core-site.xml and hdfs-site.xml file in conf/
● Not required in case of CloudxLab, it already done.
Spark SQL, Dataframes, SparkR
Hive Tables - Example
/usr/spark2.0.1/bin/spark-shell
scala> import spark.implicits._
import spark.implicits._
scala> var df = spark.sql("select * from a_student")
scala> df.show()
+---------+-----+-----+------+
| name|grade|marks|stream|
+---------+-----+-----+------+
| Student1| A| 1| CSE|
| Student2| B| 2| IT|
| Student3| A| 3| ECE|
| Student4| B| 4| EEE|
| Student5| A| 5| MECH|
| Student6| B| 6| CHEM|
Spark SQL, Dataframes, SparkR
Hive Tables - Example
import java.io.File
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.enableHiveSupport()
.getOrCreate()
Spark SQL, Dataframes, SparkR
From DBs using JDBC
● Spark SQL also includes a data source that can read data from DBs using JDBC.
● Results are returned as a DataFrame
● Easily be processed in Spark SQL or joined with other data sources
Spark SQL, Dataframes, SparkR
hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
From DBs using JDBC
Spark SQL, Dataframes, SparkR
hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
/usr/spark2.0.1/bin/spark-shell --driver-class-path
mysql-connector-java-5.1.36-bin.jar --jars
mysql-connector-java-5.1.36-bin.jar
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex")
.option("dbtable", "widgets")
.option("user", "sqoopuser")
.option("password", "NHkkP876rp")
.load()
jdbcDF.show()
From DBs using JDBC
Spark SQL, Dataframes, SparkR
val jdbcDF = spark.read.format("jdbc").option("url",
"jdbc:mysql://ip-172-31-13-154/sqoopex").option("dbtable",
"widgets").option("user", "sqoopuser").option("password",
"NHkkP876rp").load()
jdbcDF.show()
var df = spark.sql("select * from a_student")
df.show()
jdbcDF.createOrReplaceTempView("jdbc_widgets");
df.createOrReplaceTempView("hive_students");
spark.sql("select * from jdbc_widgets j, hive_students h where h.marks =
j.id").show()
Joining Across
Spark SQL, Dataframes, SparkR
Data Frames
Dataframes
(Spark SQL)
JSON
HIVE
RDD
TEXT
Parquet map(), reduce() ...
SQL
RDMS (JDBC)
Spark SQL, Dataframes, SparkR
● Spark SQL as a distributed query engine
● using its JDBC/ODBC
● or command-line interface.
● Users can run SQL queries on Spark
● without the need to write any code.
Distributed SQL Engine
Spark SQL, Dataframes, SparkR
Distributed SQL Engine - Setting up
Step 1: Running the Thrift JDBC/ODBC server
The thrift JDBC/ODBC here corresponds to HiveServer. You can start it
from the local installation:
./sbin/start-thriftserver.sh
It starts in the background and writes data to log file. To see the logs use,
tail -f command
Spark SQL, Dataframes, SparkR
Step 2: Connecting
Connect to thrift service using beeline:
./bin/beeline
On the beeline shell:
!connect jdbc:hive2://localhost:10000
You can further query using the same commands as hive.
Distributed SQL Engine - Setting up
Spark SQL, Dataframes, SparkR
Demo
Distributed SQL Engine
Thank you!
Dataframes & Spark SQL

More Related Content

What's hot

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_spark
Yiguang Hu
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
Yogesh Kulkarni
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
Hortonworks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
Databricks
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
Russell Spitzer
 
Scala+data
Scala+dataScala+data
Scala+data
Samir Bessalah
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
sparkInstructor
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 

What's hot (20)

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Scala+data
Scala+dataScala+data
Scala+data
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 

Similar to Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab

Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Luciano Resende
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Final Report - Spark
Final Report - SparkFinal Report - Spark
Final Report - Spark
Syed Danyal Khaliq
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache BahirBuilding iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
Luciano Resende
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache BahirIoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache Bahir
Luciano Resende
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
Olgun Aydın
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
Ankara Big Data Meetup
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Databricks
 
Spark浅谈
Spark浅谈Spark浅谈
Spark浅谈
Jiahua Zhu
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
Kristian Alexander
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
Edureka!
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
Alex Thompson
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
Edureka!
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
Sri Ambati
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
Eran Rom
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
hkyoon2
 

Similar to Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab (20)

Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Final Report - Spark
Final Report - SparkFinal Report - Spark
Final Report - Spark
 
Building iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache BahirBuilding iot applications with Apache Spark and Apache Bahir
Building iot applications with Apache Spark and Apache Bahir
 
IoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache BahirIoT Applications and Patterns using Apache Spark & Apache Bahir
IoT Applications and Patterns using Apache Spark & Apache Bahir
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
 
Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
Spark浅谈
Spark浅谈Spark浅谈
Spark浅谈
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
 
실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재실시간 Streaming using Spark and Kafka 강의교재
실시간 Streaming using Spark and Kafka 강의교재
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
CloudxLab
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Oozie | Big Data Hadoop Spark Tutorial | CloudxLab
 

Recently uploaded

RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Lidia A.
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Bert Blevins
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
Bert Blevins
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 

Recently uploaded (20)

RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab

  • 1. Spark SQL, Dataframes, SparkR hadoop fs -cat /data/spark/books.xml <?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description> An in-depth look at creating applications … … </book> <book id="bk101"> … … </book> … ... </catalog> Loading XML
  • 2. Spark SQL, Dataframes, SparkR We will use: https://github.com/databricks/spark-xml Start Spark-Shell: /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1 Loading XML
  • 3. Spark SQL, Dataframes, SparkR We will use: https://github.com/databricks/spark-xml Start Spark-Shell: /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1 Load the Data: val df = spark.read.format("xml").option("rowTag", "book").load("/data/spark/books.xml") OR val df = spark.read.format("com.databricks.spark.xml") .option("rowTag", "book").load("/data/spark/books.xml") Loading XML
  • 4. Spark SQL, Dataframes, SparkR Loading XML scala> df.show() +-----+--------------------+--------------------+---------------+-----+------------+--------------------+ | _id| author| description| genre|price|publish_date| title| +-----+--------------------+--------------------+---------------+-----+------------+--------------------+ |bk101|Gambardella, Matthew| An in...| Computer|44.95| 2000-10-01|XML Developer's G...| |bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16| Midnight Rain| |bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17| Maeve Ascendant| |bk104| Corets, Eva|In post-apocalyps...| Fantasy| 5.95| 2001-03-10| Oberon's Legacy| |bk105| Corets, Eva|The two daughters...| Fantasy| 5.95| 2001-09-10| The Sundered Grail| |bk106| Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| Lover Birds| |bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| Splish Splash| |bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| Creepy Crawlies| |bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| Paradox Lost| |bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09|Microsoft .NET: T...| |bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01|MSXML3: A Compreh...| |bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16|Visual Studio 7: ...| +-----+--------------------+--------------------+---------------+-----+------------+--------------------+ Display Data: df.show()
  • 5. Spark SQL, Dataframes, SparkR What is RPC - Remote Process Call [{ Name: John, Phone: 1234 }, { Name: John, Phone: 1234 },] … getPhoneBook("myuserid")
  • 6. Spark SQL, Dataframes, SparkR Avro is: 1. A Remote Procedure call 2. Data Serialization Framework 3. Uses JSON for defining data types and protocols 4. Serializes data in a compact binary format 5. Similar to Thrift and Protocol Buffers 6. Doesn't require running a code-generation program Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Apache Spark SQL can access Avro as a data source.[1] AVRO
  • 7. Spark SQL, Dataframes, SparkR We will use: https://github.com/databricks/spark-avro Start Spark-Shell: /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 Loading AVRO
  • 8. Spark SQL, Dataframes, SparkR We will use: https://github.com/databricks/spark-avro Start Spark-Shell: /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 Load the Data: val df = spark.read.format("com.databricks.spark.avro") .load("/data/spark/episodes.avro") Display Data: df.show() +--------------------+----------------+------+ | title| air_date|doctor| +--------------------+----------------+------+ | The Eleventh Hour| 3 April 2010| 11| | The Doctor's Wife| 14 May 2011| 11| Loading AVRO
  • 9. Spark SQL, Dataframes, SparkR https://parquet.apache.org/ Data Sources ● Columnar storage format ● Any project in the Hadoop ecosystem ● Regardless of ○ Data processing framework ○ Data model ○ Programming language.
  • 10. Spark SQL, Dataframes, SparkR var df = spark.read.load("/data/spark/users.parquet") Data Sources Method1 - Automatically (parquet unless otherwise configured)
  • 11. Spark SQL, Dataframes, SparkR var df = spark.read.load("/data/spark/users.parquet") df = df.select("name", "favorite_color") df.write.save("namesAndFavColors_21jan2018.parquet") Data Sources Method1 - Automatically (parquet unless otherwise configured)
  • 12. Spark SQL, Dataframes, SparkR Data Sources Method2 - Manually Specifying Options df = spark.read.format("json").load("/data/spark/people.json") df = df.select("name", "age") df.write.format("parquet").save("namesAndAges.parquet")
  • 13. Spark SQL, Dataframes, SparkR Data Sources Method3 - Directly running sql on file val sqlDF = spark.sql("SELECT * FROM parquet.`/data/spark/users.parquet`") val sqlDF = spark.sql("SELECT * FROM json.`/data/spark/people.json`")
  • 14. Spark SQL, Dataframes, SparkR ● Spark SQL also supports reading and writing data stored in Apache Hive. ● Since Hive has a large number of dependencies, it is not included in the default Spark assembly. Hive Tables
  • 15. Spark SQL, Dataframes, SparkR Hive Tables ● Place your hive-site.xml, core-site.xml and hdfs-site.xml file in conf/ ● Not required in case of CloudxLab, it already done.
  • 16. Spark SQL, Dataframes, SparkR Hive Tables - Example /usr/spark2.0.1/bin/spark-shell scala> import spark.implicits._ import spark.implicits._ scala> var df = spark.sql("select * from a_student") scala> df.show() +---------+-----+-----+------+ | name|grade|marks|stream| +---------+-----+-----+------+ | Student1| A| 1| CSE| | Student2| B| 2| IT| | Student3| A| 3| ECE| | Student4| B| 4| EEE| | Student5| A| 5| MECH| | Student6| B| 6| CHEM|
  • 17. Spark SQL, Dataframes, SparkR Hive Tables - Example import java.io.File val spark = SparkSession .builder() .appName("Spark Hive Example") .enableHiveSupport() .getOrCreate()
  • 18. Spark SQL, Dataframes, SparkR From DBs using JDBC ● Spark SQL also includes a data source that can read data from DBs using JDBC. ● Results are returned as a DataFrame ● Easily be processed in Spark SQL or joined with other data sources
  • 19. Spark SQL, Dataframes, SparkR hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar From DBs using JDBC
  • 20. Spark SQL, Dataframes, SparkR hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar /usr/spark2.0.1/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar val jdbcDF = spark.read .format("jdbc") .option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex") .option("dbtable", "widgets") .option("user", "sqoopuser") .option("password", "NHkkP876rp") .load() jdbcDF.show() From DBs using JDBC
  • 21. Spark SQL, Dataframes, SparkR val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex").option("dbtable", "widgets").option("user", "sqoopuser").option("password", "NHkkP876rp").load() jdbcDF.show() var df = spark.sql("select * from a_student") df.show() jdbcDF.createOrReplaceTempView("jdbc_widgets"); df.createOrReplaceTempView("hive_students"); spark.sql("select * from jdbc_widgets j, hive_students h where h.marks = j.id").show() Joining Across
  • 22. Spark SQL, Dataframes, SparkR Data Frames Dataframes (Spark SQL) JSON HIVE RDD TEXT Parquet map(), reduce() ... SQL RDMS (JDBC)
  • 23. Spark SQL, Dataframes, SparkR ● Spark SQL as a distributed query engine ● using its JDBC/ODBC ● or command-line interface. ● Users can run SQL queries on Spark ● without the need to write any code. Distributed SQL Engine
  • 24. Spark SQL, Dataframes, SparkR Distributed SQL Engine - Setting up Step 1: Running the Thrift JDBC/ODBC server The thrift JDBC/ODBC here corresponds to HiveServer. You can start it from the local installation: ./sbin/start-thriftserver.sh It starts in the background and writes data to log file. To see the logs use, tail -f command
  • 25. Spark SQL, Dataframes, SparkR Step 2: Connecting Connect to thrift service using beeline: ./bin/beeline On the beeline shell: !connect jdbc:hive2://localhost:10000 You can further query using the same commands as hive. Distributed SQL Engine - Setting up
  • 26. Spark SQL, Dataframes, SparkR Demo Distributed SQL Engine