Spark SQL, Dataframes, SparkR
hadoop fs -cat /data/spark/books.xml
<?xml version="1.0"?>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<description>An in-depth look at creating applications ...</description>
</book>
Loading XML
We will use the spark-xml package from Databricks.
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Load the Data:
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "book").load("/data/spark/books.xml")
Display Data:
df.show()
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
|  _id|              author|         description|          genre|price|publish_date|               title|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
|bk101|Gambardella, Matthew|            An in...|       Computer|44.95|  2000-10-01|XML Developer's G...|
|bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
|bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
|bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
|bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
|bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
|bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
|bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
|bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
|bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
|bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
|bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
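Once loaded, the result is an ordinary DataFrame, so the usual column operations apply. The sketch below is only an illustration: it assumes a local SparkSession and uses a hand-built stand-in for the books DataFrame (four rows and four columns copied from the output above) instead of reading books.xml through spark-xml.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("books-sketch").getOrCreate()

// Stand-in for the DataFrame loaded from books.xml (assumption: same column names).
val books = spark.createDataFrame(Seq(
  ("bk101", "Gambardella, Matthew", "Computer", 44.95),
  ("bk102", "Ralls, Kim", "Fantasy", 5.95),
  ("bk106", "Randall, Cynthia", "Romance", 4.95),
  ("bk110", "O'Brien, Tim", "Computer", 36.95)
)).toDF("_id", "author", "genre", "price")

// Keep only the Computer titles, most expensive first.
val computerBooks = books.filter(books("genre") === "Computer").orderBy(books("price").desc)
computerBooks.show()

// Number of books per genre.
val perGenre = books.groupBy("genre").count()
perGenre.show()
```

The same filter and groupBy calls should work unchanged on the df returned by the spark-xml reader, since both are plain DataFrames.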
What is RPC - Remote Procedure Call
Name: John,
Phone: 1234
Avro is:
1. A remote procedure call (RPC) framework
2. A data serialization framework
3. Uses JSON for defining data types and protocols
4. Serializes data in a compact binary format
5. Similar to Thrift and Protocol Buffers
6. Doesn't require running a code-generation program
Its primary use is in Apache Hadoop, where it can provide both a serialization format
for persistent data, and a wire format for communication between Hadoop nodes,
and from client programs to the Hadoop services.
Apache Spark SQL can access Avro as a data source.[1]
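Points 3, 4 and 6 above can be sketched with Avro's generic Java API, called here from Scala. This is a minimal illustration, not part of the original slides: the record name Contact and its fields are hypothetical (borrowed from the Name/Phone example), and it assumes the Avro library is on the classpath, as it normally is inside spark-shell.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// The schema is plain JSON; the record and field names are illustrative.
val schemaJson =
  """{"type": "record", "name": "Contact",
    | "fields": [{"name": "name", "type": "string"},
    |            {"name": "phone", "type": "int"}]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// Build a record at runtime; no generated classes are involved.
val contact = new GenericData.Record(schema)
contact.put("name", "John")
contact.put("phone", Int.box(1234))

// Serialize to Avro's compact binary encoding.
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(contact, encoder)
encoder.flush()
val bytes = out.toByteArray

// Read the bytes back using the same schema.
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
println(s"${bytes.length} bytes -> $decoded")
```

The binary form carries no field names, only the values, which is why the reader needs the schema to decode it.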
Loading AVRO
We will use the spark-avro package from Databricks.
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Load the Data:
val df = spark.read.format("com.databricks.spark.avro").load("<path-to-avro-file>")
Display Data:
df.show()
+-----------------+------------+------+
|            title|    air_date|doctor|
+-----------------+------------+------+
|The Eleventh Hour|3 April 2010|    11|
|The Doctor's Wife| 14 May 2011|    11|
+-----------------+------------+------+
Data Sources
Apache Parquet:
● Columnar storage format
● Available to any project in the Hadoop ecosystem
● Regardless of the choice of
○ Data processing framework
○ Data model
○ Programming language
Data Sources
Method1 - Automatically (parquet unless otherwise configured)
var df = spark.read.load("/data/spark/users.parquet")
df = df.select("name", "favorite_color")
df.write.save("namesAndFavColors_21jan2018.parquet")
Data Sources
Method2 - Manually Specifying Options
df = spark.read.format("json").load("/data/spark/people.json")
df = df.select("name", "age")
Data Sources
Method3 - Directly running sql on file
val sqlDF = spark.sql("SELECT * FROM parquet.`/data/spark/users.parquet`")
val sqlDF = spark.sql("SELECT * FROM json.`/data/spark/people.json`")
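The three methods can be compared side by side in one small round trip. The sketch below makes several assumptions: it runs against a local SparkSession, writes to a scratch directory instead of /data/spark, and uses two hypothetical sample rows with the column names of the users.parquet example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("datasources-sketch").getOrCreate()

// Hypothetical sample rows; column names follow the users.parquet example.
val users = spark.createDataFrame(Seq(
  ("Alyssa", "red"),
  ("Ben", "blue")
)).toDF("name", "favorite_color")

// Scratch directory so the sketch does not touch /data/spark.
val dir = Files.createTempDirectory("datasources-sketch").toString
val parquetPath = s"$dir/users.parquet"
val jsonPath = s"$dir/users.json"

// Method 1: no format given, so spark.sql.sources.default (parquet unless changed) is used.
users.write.save(parquetPath)
val viaDefault = spark.read.load(parquetPath)

// Method 2: the format is named explicitly.
users.write.format("json").save(jsonPath)
val viaFormat = spark.read.format("json").load(jsonPath)

// Method 3: SQL directly on the files, without registering a table.
val viaSql = spark.sql(s"SELECT name FROM parquet.`$parquetPath`")

viaDefault.show()
viaFormat.show()
viaSql.show()
```

Method 3 is handy for a quick look at a file; Methods 1 and 2 return a DataFrame that can be transformed further before anything is queried.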
Hive Tables
● Spark SQL also supports reading and writing data stored in Apache Hive.
● Since Hive has a large number of dependencies, it is not included in the default Spark assembly.
● Place your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/
● Not required on CloudxLab; it is already done.
Hive Tables - Example
scala> import spark.implicits._
import spark.implicits._
scala> var df = spark.sql("select * from a_student")
scala> df.show()
+--------+-----+-----+------+
|    name|grade|marks|stream|
+--------+-----+-----+------+
|Student1|    A|    1|   CSE|
|Student2|    B|    2|    IT|
|Student3|    A|    3|   ECE|
|Student4|    B|    4|   EEE|
|Student5|    A|    5|  MECH|
|Student6|    B|    6|  CHEM|
+--------+-----+-----+------+
Hive Tables - Example
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()
From DBs using JDBC
● Spark SQL also includes a data source that can read data from DBs using JDBC.
● Results are returned as a DataFrame
● Can easily be processed in Spark SQL or joined with other data sources
hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
/usr/spark2.0.1/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex")
  .option("dbtable", "widgets")
  .option("user", "sqoopuser")
  .option("password", "NHkkP876rp")
  .load()
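The JDBC options can also be passed as a single map, which makes it easy to keep the password out of the code. A minimal sketch, assuming the password is supplied in an environment variable; the name MYSQL_PASSWORD and both helper functions are hypothetical, not from the slides.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Builds the JDBC options. MYSQL_PASSWORD is a hypothetical variable name.
def jdbcOptions(env: Map[String, String]): Map[String, String] = Map(
  "url" -> "jdbc:mysql://ip-172-31-13-154/sqoopex",
  "dbtable" -> "widgets",
  "user" -> "sqoopuser",
  "password" -> env.getOrElse("MYSQL_PASSWORD", sys.error("MYSQL_PASSWORD is not set"))
)

// Reads the widgets table. It needs the MySQL connector jar on the classpath
// and a reachable database, so it is only defined here, not called.
def readWidgets(spark: SparkSession): DataFrame =
  spark.read.format("jdbc").options(jdbcOptions(sys.env)).load()
```

In spark-shell this would be used as `val jdbcDF = readWidgets(spark)` after exporting the variable in the shell that launches Spark.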
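Joining Across
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex").option("dbtable", "widgets").option("user", "sqoopuser").option("password", "NHkkP876rp").load()
jdbcDF.createOrReplaceTempView("jdbc_widgets")
var df = spark.sql("select * from a_student")
df.createOrReplaceTempView("hive_students")
spark.sql("select * from jdbc_widgets j, hive_students h where h.marks = ...").show()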
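A join across sources only needs both DataFrames registered as temporary views. The sketch below runs locally with in-memory stand-ins for the two tables, so several things are assumptions: the widgets columns and rows are hypothetical, the student rows are copied from the earlier a_student output, and the join condition (marks equal to widget id) is chosen purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()

// Stand-in for the JDBC widgets table (hypothetical columns and rows).
val widgets = spark.createDataFrame(Seq(
  (1, "sprocket"),
  (2, "gizmo"),
  (7, "gadget")
)).toDF("id", "widget_name")

// Stand-in for the Hive a_student table (rows taken from the earlier output).
val students = spark.createDataFrame(Seq(
  ("Student1", "A", 1, "CSE"),
  ("Student2", "B", 2, "IT"),
  ("Student3", "A", 3, "ECE")
)).toDF("name", "grade", "marks", "stream")

// Register both under the names used in the query.
widgets.createOrReplaceTempView("jdbc_widgets")
students.createOrReplaceTempView("hive_students")

// Assumed join condition: widget id equals student marks.
val joined = spark.sql(
  "select h.name, j.widget_name from jdbc_widgets j, hive_students h where h.marks = j.id")
joined.show()
```

Spark plans the join itself, so it does not matter that one side would come from MySQL and the other from Hive.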
[Diagram labels: Data Frames (Spark SQL); Parquet; map(), reduce() ...]
Distributed SQL Engine
● Spark SQL can act as a distributed query engine using its JDBC/ODBC or command-line interface.
● Users can run SQL queries on Spark without the need to write any code.
Distributed SQL Engine - Setting up
Step 1: Running the Thrift JDBC/ODBC server
The Thrift JDBC/ODBC server here corresponds to HiveServer2. You can start it
from the local installation with Spark's sbin/start-thriftserver.sh script.
It starts in the background and writes its output to a log file. To follow the log,
use the tail -f command.
Step 2: Connecting
Connect to the Thrift service using beeline, which ships in Spark's bin/ directory.
On the beeline shell:
!connect jdbc:hive2://localhost:10000
You can then run queries using the same commands as in Hive.
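Because the Thrift server speaks JDBC, a program can connect the same way beeline does. This is a rough sketch using only the standard java.sql API; the helper names thriftUrl and firstColumn are hypothetical, and it assumes the Hive JDBC driver is on the classpath and the server from Step 1 is running.

```scala
import java.sql.DriverManager

// Same URL form as the beeline !connect line.
def thriftUrl(host: String, port: Int): String = s"jdbc:hive2://$host:$port"

// Runs a query and returns the first column of each row as text.
// It needs the Hive JDBC driver and a running Thrift server,
// so it is defined here but not called.
def firstColumn(url: String, sql: String): List[String] = {
  val conn = DriverManager.getConnection(url)
  try {
    val rs = conn.createStatement().executeQuery(sql)
    // Advance the cursor row by row and collect column 1 before closing.
    Iterator.continually(rs).takeWhile(_.next()).map(_.getString(1)).toList
  } finally conn.close()
}
```

For example, `firstColumn(thriftUrl("localhost", 10000), "show tables")` would list the tables visible to the server.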
Thank you!
Dataframes & Spark SQL

Research Directions for Cross Reality Interfaces
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab

  • 1. Spark SQL, Dataframes, SparkR
    Loading XML
    hadoop fs -cat /data/spark/books.xml
        <?xml version="1.0"?>
        <catalog>
          <book id="bk101">
            <author>Gambardella, Matthew</author>
            <title>XML Developer's Guide</title>
            <genre>Computer</genre>
            <price>44.95</price>
            <publish_date>2000-10-01</publish_date>
            <description> An in-depth look at creating applications … …
          </book>
          <book id="bk102"> … … </book>
          … ...
        </catalog>
  • 2. Loading XML
    We will use: the spark-xml package (com.databricks:spark-xml)
    Start Spark-Shell:
        /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
  • 3. Loading XML
    We will use: the spark-xml package (com.databricks:spark-xml)
    Start Spark-Shell:
        /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
    Load the Data:
        val df = spark.read.format("xml").option("rowTag", "book").load("/data/spark/books.xml")
    OR
        val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("/data/spark/books.xml")
  • 4. Loading XML
    Display Data:
        scala> df.show()
        +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
        |  _id|              author|         description|          genre|price|publish_date|               title|
        +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
        |bk101|Gambardella, Matthew|            An in...|       Computer|44.95|  2000-10-01|XML Developer's G...|
        |bk102|          Ralls, Kim|A former architec...|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
        |bk103|         Corets, Eva|After the collaps...|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
        |bk104|         Corets, Eva|In post-apocalyps...|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
        |bk105|         Corets, Eva|The two daughters...|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
        |bk106|    Randall, Cynthia|When Carla meets ...|        Romance| 4.95|  2000-09-02|         Lover Birds|
        |bk107|      Thurman, Paula|A deep sea diver ...|        Romance| 4.95|  2000-11-02|       Splish Splash|
        |bk108|       Knorr, Stefan|An anthology of h...|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
        |bk109|        Kress, Peter|After an inadvert...|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
        |bk110|        O'Brien, Tim|Microsoft's .NET ...|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
        |bk111|        O'Brien, Tim|The Microsoft MSX...|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
        |bk112|         Galos, Mike|Microsoft Visual ...|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
        +-----+--------------------+--------------------+---------------+-----+------------+--------------------+
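Once the books are in a DataFrame, they can be queried through the DataFrame API or through SQL. The following is only a sketch under stated assumptions: instead of reading /data/spark/books.xml, it builds a three-row stand-in DataFrame from values in the table above so that it runs without the spark-xml package. The view name `books`, the app name, and the local master are illustrative choices, not part of the original slides.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("books-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in rows copied from the displayed table (not read from the XML file).
val books = Seq(
  ("bk101", "Gambardella, Matthew", "Computer", 44.95),
  ("bk102", "Ralls, Kim", "Fantasy", 5.95),
  ("bk106", "Randall, Cynthia", "Romance", 4.95)
).toDF("_id", "author", "genre", "price")

// DataFrame API: keep books cheaper than 10 and project two columns.
val cheap = books.filter($"price" < 10).select("_id", "price")

// The same question in SQL, through a temporary view (the name "books" is arbitrary).
books.createOrReplaceTempView("books")
val cheapSql = spark.sql("SELECT _id, price FROM books WHERE price < 10")

cheap.show()
cheapSql.show()
```

Both queries should return the same two rows; which style to use is mostly a matter of taste, since both go through the same Spark SQL engine.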
  • 5. What is RPC - Remote Procedure Call
        [{ Name: John, Phone: 1234 }, { Name: John, Phone: 1234 },]
        …
        getPhoneBook("myuserid")
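The idea behind a remote procedure call can be sketched in plain Scala: the caller invokes what looks like a local method, and a stub turns that call into a message for the server. Everything below except the `getPhoneBook` name is hypothetical (the `Contact` type, the text wire format, and the in-memory data), and the "network" is simulated by a direct function call.

```scala
// Hypothetical sketch of the RPC idea: a client stub hides the message exchange.

final case class Contact(name: String, phone: String)

// The service contract shared by client and server.
trait PhoneBookService {
  def getPhoneBook(userId: String): List[Contact]
}

// Server side: the real implementation (the data is made up for illustration).
object PhoneBookServer extends PhoneBookService {
  private val store = Map("myuserid" -> List(Contact("John", "1234")))
  def getPhoneBook(userId: String): List[Contact] = store.getOrElse(userId, Nil)

  // Decode a request message, run the procedure, encode the reply.
  def handle(request: String): String = request.split(":", 2) match {
    case Array("getPhoneBook", userId) =>
      getPhoneBook(userId).map(c => s"${c.name},${c.phone}").mkString(";")
    case _ => ""
  }
}

// Client side: a stub with the same interface. `transport` stands in for the network.
class PhoneBookClient(transport: String => String) extends PhoneBookService {
  def getPhoneBook(userId: String): List[Contact] = {
    val reply = transport(s"getPhoneBook:$userId")
    if (reply.isEmpty) Nil
    else reply.split(";").toList.map { entry =>
      val Array(name, phone) = entry.split(",", 2)
      Contact(name, phone)
    }
  }
}

// Here the "network" is a direct function call; a real system would use sockets or HTTP.
val client = new PhoneBookClient(PhoneBookServer.handle)
println(client.getPhoneBook("myuserid"))
```

A framework such as Avro replaces the hand-written text encoding with a defined protocol and a compact binary format.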
  • 6. AVRO
    Avro is:
      1. A remote procedure call framework
      2. A data serialization framework
      3. Uses JSON for defining data types and protocols
      4. Serializes data in a compact binary format
      5. Similar to Thrift and Protocol Buffers
      6. Doesn't require running a code-generation program
    Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Apache Spark SQL can access Avro as a data source.[1]
  • 7. Loading AVRO
    We will use: the spark-avro package (com.databricks:spark-avro)
    Start Spark-Shell:
        /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
  • 8. Loading AVRO
    We will use: the spark-avro package (com.databricks:spark-avro)
    Start Spark-Shell:
        /usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
    Load the Data:
        val df = spark.read.format("com.databricks.spark.avro").load("/data/spark/episodes.avro")
    Display Data:
        scala> df.show()
        +--------------------+----------------+------+
        |               title|        air_date|doctor|
        +--------------------+----------------+------+
        |   The Eleventh Hour|    3 April 2010|    11|
        |   The Doctor's Wife|     14 May 2011|    11|
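Because Avro defines data types in JSON, a schema for records with the three columns displayed above could look roughly like the one below. This is a hedged sketch, not the actual schema stored in episodes.avro: the record name `Episode` and the field types are assumptions. It uses the Avro Java library's `Schema.Parser`, which is assumed to be on the classpath (it normally is when Spark or spark-avro is loaded).

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical schema: field names follow the displayed columns,
// but the record name and the types are assumptions.
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "Episode",
    |  "fields": [
    |    {"name": "title",    "type": "string"},
    |    {"name": "air_date", "type": "string"},
    |    {"name": "doctor",   "type": "int"}
    |  ]
    |}""".stripMargin

// Parsing validates the JSON against Avro's schema rules.
val schema = new Schema.Parser().parse(schemaJson)

// List the declared field names in order.
val fieldNames = schema.getFields.asScala.map(_.name).toList
println(fieldNames)
```

Since the schema travels with the data file, a reader such as Spark can derive DataFrame columns without any generated classes.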
  • 9. Data Sources
    Parquet:
    ● Columnar storage format
    ● Available to any project in the Hadoop ecosystem
    ● Regardless of
      ○ Data processing framework
      ○ Data model
      ○ Programming language
  • 10. Data Sources
    Method 1 - Automatically (parquet unless otherwise configured)
        var df = spark.read.load("/data/spark/users.parquet")
  • 11. Data Sources
    Method 1 - Automatically (parquet unless otherwise configured)
        var df = spark.read.load("/data/spark/users.parquet")
        df = df.select("name", "favorite_color")
        df.write.save("namesAndFavColors_21jan2018.parquet")
  • 12. Data Sources
    Method 2 - Manually Specifying Options
        df = spark.read.format("json").load("/data/spark/people.json")
        df = df.select("name", "age")
        df.write.format("parquet").save("namesAndAges.parquet")
  • 13. Data Sources
    Method 3 - Running SQL directly on files
        val sqlDF = spark.sql("SELECT * FROM parquet.`/data/spark/users.parquet`")
        val sqlDF = spark.sql("SELECT * FROM json.`/data/spark/people.json`")
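The three methods can be tried side by side on a small round trip. The sketch below is an illustration under assumptions: the sample rows, the temporary directory, and the sub-directory names are made up, and it assumes Spark runs in local mode so that the temporary directory is on a file system Spark can reach (on a cluster, a path without a scheme may resolve to HDFS instead).

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("datasource-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data and a throwaway output directory.
val users = Seq(("Alyssa", "blue"), ("Ben", "red")).toDF("name", "favorite_color")
val dir = Files.createTempDirectory("datasource-sketch").toString

// Method 1: no format given, so the default source (parquet) is used.
users.write.save(s"$dir/users_default")
val viaDefault = spark.read.load(s"$dir/users_default")

// Method 2: the format is named explicitly on both write and read.
users.write.format("json").save(s"$dir/users_json")
val viaJson = spark.read.format("json").load(s"$dir/users_json")

// Method 3: SQL directly on the files, with the format as a prefix to the path.
val viaSql = spark.sql(s"SELECT name FROM parquet.`$dir/users_default`")

viaDefault.show()
viaJson.show()
viaSql.show()
```

The default format used by Method 1 is controlled by the `spark.sql.sources.default` setting, which is why the slide says "unless otherwise configured".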
  • 14. Hive Tables
    ● Spark SQL also supports reading and writing data stored in Apache Hive.
    ● Since Hive has a large number of dependencies, it is not included in the default Spark assembly.
  • 15. Hive Tables
    ● Place your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/
    ● Not required on CloudxLab; it is already done.
  • 16. Hive Tables - Example
        /usr/spark2.0.1/bin/spark-shell
        scala> import spark.implicits._
        import spark.implicits._
        scala> var df = spark.sql("select * from a_student")
        scala> df.show()
        +---------+-----+-----+------+
        |     name|grade|marks|stream|
        +---------+-----+-----+------+
        | Student1|    A|    1|   CSE|
        | Student2|    B|    2|    IT|
        | Student3|    A|    3|   ECE|
        | Student4|    B|    4|   EEE|
        | Student5|    A|    5|  MECH|
        | Student6|    B|    6|  CHEM|
  • 17. Hive Tables - Example
        import org.apache.spark.sql.SparkSession
        val spark = SparkSession
          .builder()
          .appName("Spark Hive Example")
          .enableHiveSupport()
          .getOrCreate()
  • 18. From DBs using JDBC
    ● Spark SQL also includes a data source that can read data from DBs using JDBC.
    ● Results are returned as a DataFrame.
    ● They can easily be processed in Spark SQL or joined with other data sources.
  • 19. From DBs using JDBC
        hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
  • 20. From DBs using JDBC
        hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
        /usr/spark2.0.1/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
        val jdbcDF = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex")
          .option("dbtable", "widgets")
          .option("user", "sqoopuser")
          .option("password", "NHkkP876rp")
          .load()
  • 21. Joining Across
        val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex").option("dbtable", "widgets").option("user", "sqoopuser").option("password", "NHkkP876rp").load()
        var df = spark.sql("select * from a_student")
        jdbcDF.createOrReplaceTempView("jdbc_widgets");
        df.createOrReplaceTempView("hive_students");
        spark.sql("select * from jdbc_widgets j, hive_students h where h.marks = …").show()
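The point of registering both DataFrames as temporary views is that Spark SQL then treats them as ordinary tables, whatever their origin. The sketch below illustrates this with in-memory stand-ins rather than the real JDBC and Hive sources. The widget columns (`id`, `widget_name`), the sample rows, and the join key (`h.marks = j.id`) are all assumptions made for the example, not the actual layout of the `widgets` table.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-ins for the two sources; the real ones would come from JDBC and Hive.
// The widget columns are assumed, not taken from the actual table.
val widgets = Seq((1, "sprocket"), (2, "gizmo"), (9, "gadget")).toDF("id", "widget_name")
val students = Seq(
  ("Student1", "A", 1, "CSE"),
  ("Student2", "B", 2, "IT")
).toDF("name", "grade", "marks", "stream")

// Once registered as views, both are just tables to Spark SQL.
widgets.createOrReplaceTempView("jdbc_widgets")
students.createOrReplaceTempView("hive_students")

// Join on an assumed key: student marks matched against widget id.
val joined = spark.sql(
  "SELECT h.name, j.widget_name FROM jdbc_widgets j JOIN hive_students h ON h.marks = j.id")
joined.show()
```

With the real sources, only the two DataFrame definitions change; the view registration and the SQL stay the same.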
  • 22. Data Frames
    [Diagram: Dataframes (Spark SQL). Labels: JSON, HIVE, RDD, TEXT, Parquet, RDBMS (JDBC); map(), reduce() ...; SQL]
  • 23. Distributed SQL Engine
    ● Spark SQL can act as a distributed query engine
    ● using its JDBC/ODBC
    ● or command-line interface.
    ● Users can run SQL queries on Spark
    ● without the need to write any code.
  • 24. Distributed SQL Engine - Setting up
    Step 1: Running the Thrift JDBC/ODBC server
    The Thrift JDBC/ODBC server here corresponds to HiveServer2. You can start it from the local installation:
        ./sbin/start-thriftserver.sh
    It starts in the background and writes its output to a log file. To follow the logs, use the tail -f command.
  • 25. Distributed SQL Engine - Setting up
    Step 2: Connecting
    Connect to the Thrift service using beeline:
        ./bin/beeline
    On the beeline shell:
        !connect jdbc:hive2://localhost:10000
    You can then run queries using the same commands as in Hive.
  • 26. Distributed SQL Engine
    Demo