Dataframes & Spark SQL
Spark SQL & Dataframes
Spark module for Structured Data Processing
Spark SQL
Spark SQL & Dataframes
Integrated
○ Provides DataFrames
○ Mix SQL queries & Spark programs
Spark SQL
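The two APIs can be mixed freely in one program. A minimal sketch, assuming a DataFrame has already been registered as a temp view named "people":
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.groupBy("age").count().show() // continue with DataFrame operations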
Spark SQL & Dataframes
Uniform Data Access
○ Sources:
■ HDFS
■ Hive
■ Relational Databases
○ Formats: Avro, Parquet, ORC, JSON
○ You can even join data across these sources.
○ Hive Compatibility
○ Standard Connectivity (JDBC/ODBC)
Spark SQL
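For example, data read from different sources can be joined directly. A sketch, where the paths and the user_id join column are hypothetical:
val users = spark.read.json("/data/users.json") // JSON source
val orders = spark.read.parquet("/data/orders.parquet") // Parquet source
users.join(orders, "user_id").show() // join across the two sources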

Spark SQL & Dataframes
DataFrames
RDD (unstructured):
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Needs custom code for processing.
DataFrame (structured):
ID Name
1 sandeep
2 ted
3 thomas
4 priya
5 kush
Can use SQL or R-like syntax:
df.createOrReplaceTempView("people")
spark.sql("SELECT ID FROM people WHERE name = 'priya'")
head(where(df, df$ID > 21)) // SparkR
Spark SQL & Dataframes
● Collection with named columns
● Distributed
● Similar to, but not the same as, a database table
● Similar to, but not the same as, a data frame in R/Python (it is distributed, with richer optimizations)
Data Frames
(Diagram: columns col1, col2, col3 spread across Partition1 and Partition2)
Spark SQL & Dataframes
Data Frames
(Diagram: columns col1, col2, col3 spread across Partition1 and Partition2)
Can be constructed from:
○ Structured data files: CSV, JSON
○ Hive tables
○ Relational databases (RDBMS)
○ Existing RDDs
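A hedged sketch of each construction route; the file paths, the Hive table name, and the RDD rdd are assumptions:
val fromCsv = spark.read.option("header", "true").csv("/data/people.csv")
val fromJson = spark.read.json("/data/people.json")
val fromHive = spark.sql("SELECT * FROM people_hive") // an existing Hive table
val fromRdd = rdd.toDF("id", "name") // an existing RDD; needs spark.implicits._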

Spark SQL & Dataframes
Data Frames
The DataFrame API is available in Scala, Java, Python, and R.
(Diagram: columns col1, col2, col3 spread across Partition1 and Partition2)
Spark SQL & Dataframes
● Available from Spark 2.0.x onwards via the SparkSession entry point
● Usable through the usual interfaces:
○ spark-shell
○ Spark applications
○ pyspark
○ Java
○ etc.
Getting Started
Spark SQL & Dataframes
$ export HADOOP_CONF_DIR=/etc/hadoop/conf/
$ export YARN_CONF_DIR=/etc/hadoop/conf/
Getting Started
Spark SQL & Dataframes
$ export HADOOP_CONF_DIR=/etc/hadoop/conf/
$ export YARN_CONF_DIR=/etc/hadoop/conf/
$ ls /usr/
bin games include jdk64 lib64 local share spark1.6
spark2.0.2 tmp etc hdp java lib libexec sbin spark1.2.1
spark2.0.1 src
Getting Started

Spark SQL & Dataframes
$ export HADOOP_CONF_DIR=/etc/hadoop/conf/
$ export YARN_CONF_DIR=/etc/hadoop/conf/
$ ls /usr/
bin games include jdk64 lib64 local share spark1.6
spark2.0.2 tmp etc hdp java lib libexec sbin spark1.2.1
spark2.0.1 src
$ /usr/spark2.0.2/bin/spark-shell
Getting Started
Spark SQL & Dataframes
Getting Started
Spark context Web UI available at http://172.31.60.179:4040
Spark context available as 'sc' (master = local[*], app id = local-1498489557917).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark SQL & Dataframes
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
Starting Point: SparkSession

Spark SQL & Dataframes
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions, e.g. RDDs to DataFrames
import spark.implicits._
Starting Point: SparkSession
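With the implicits in scope, local collections (and RDDs of case classes) convert directly to DataFrames; the column names here are illustrative:
val df = Seq((1, "sandeep"), (2, "ted")).toDF("id", "name")
df.show()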
Spark SQL & Dataframes
Creating DataFrames from JSON
In web console or ssh:
$ hadoop fs -cat /data/spark/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Spark SQL & Dataframes
val df = spark.read.json("/data/spark/people.json")
// Displays the content of the DataFrame to stdout
df.show()
Creating DataFrames from JSON
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Spark SQL & Dataframes
Creating DataFrames from JSON
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Original JSON:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
val df = spark.read.json("/data/spark/people.json")
// Displays the content of the DataFrame to stdout
df.show()

Spark SQL & Dataframes
// Print the schema in a tree format
df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
DataFrame Operations
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Spark SQL & Dataframes
// Select only the "name" column
df.select("name").show()
+-------+
| name|
+-------+
|Michael|
| Andy|
| Justin|
+-------+
DataFrame Operations
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Spark SQL & Dataframes
// Increment the age by 1
df.select($"name", $"age" + 1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
DataFrame Operations
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Spark SQL & Dataframes
// Select people older than 21
df.filter($"age" > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
DataFrame Operations
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Spark SQL & Dataframes
// Count people by age
df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
| 19| 1|
|null| 1|
| 30| 1|
+----+-----+
// SQL equivalent:
// SELECT age, COUNT(*) FROM df GROUP BY age
DataFrame Operations
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
Spark SQL & Dataframes
Running SQL Queries Programmatically
Spark SQL & Dataframes
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
Running SQL Queries Programmatically
Spark SQL & Dataframes
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
Running SQL Queries Programmatically

Spark SQL & Dataframes
Running SQL Queries Programmatically
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
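Note that a temp view is scoped to the current SparkSession. From Spark 2.1 onwards there is also createGlobalTempView, which shares a view across sessions through the global_temp database; a short sketch:
df.createGlobalTempView("people_global")
spark.sql("SELECT * FROM global_temp.people_global").show()
spark.newSession().sql("SELECT * FROM global_temp.people_global").show()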
Spark SQL & Dataframes
Datasets
● Similar to RDDs
● Instead of Java serialization or Kryo...
● ...they use a specialized Encoder to serialize objects
Encoders
● Are dynamically generated code
● Let Spark perform many operations (filter, sort, hash) without deserializing the bytes back into objects
Datasets
Spark SQL & Dataframes
// Encoders for most common types are automatically
// provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
Creating Datasets
Spark SQL & Dataframes
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
Creating Datasets

Spark SQL & Dataframes
val path = "/data/spark/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Creating Datasets
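Typed transformations now compile against the Person class. A small hedged example; filtering on age first sidesteps Michael's null age:
val adultsDS = peopleDS.filter($"age" > 20) // still a Dataset[Person]
adultsDS.map(p => "Name: " + p.name).show() // typed lambda over Person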
Spark SQL & Dataframes
Interoperating with RDDs
How do we convert an RDD into a DataFrame?
(Diagram: an RDD of raw rows becomes a DataFrame with named columns ID and Name)
Spark SQL & Dataframes
Interoperating with RDDs
Two ways to convert RDDs to DataFrames:
a. Inferring the Schema Using Reflection
b. Programmatically Specifying the Schema
(Diagram: an RDD of raw rows becomes a DataFrame with named columns ID and Name)


Spark SQL & Dataframes
Inferring the Schema Using Reflection
● Spark SQL can convert an RDD of case classes to a DataFrame
● The names of the case class arguments are read using reflection and become column names
● Case classes can be nested or contain complex types
● Let us try to convert people.txt into a DataFrame
people.txt:
Michael, 29
Andy, 30
Justin, 19
Spark SQL & Dataframes
Inferring the Schema Using Reflection
https://github.com/cloudxlab/bigdata/blob/master/spark/examples/dataframes/rdd_to_df.scala
Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
Inferring the Schema Using Reflection
Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
Inferring the Schema Using Reflection

Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
Inferring the Schema Using Reflection
Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
scala> val arrayRDD = textRDD.map(_.split(","))
arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at
map at <console>:32
Inferring the Schema Using Reflection
Spark SQL & Dataframes
scala> import spark.implicits._
import spark.implicits._
scala> case class Person(name: String, age: Long)
defined class Person
scala> val textRDD = sc.textFile("/data/spark/people.txt")
textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt
MapPartitionsRDD[3] at textFile at <console>:30
scala> val arrayRDD = textRDD.map(_.split(","))
arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at
map at <console>:32
scala> val personRDD = arrayRDD.map(attributes => Person(attributes(0),
attributes(1).trim.toInt))
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5] at map
at <console>:36
Inferring the Schema Using Reflection
Spark SQL & Dataframes
scala> val peopleDF = personRDD.toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
Inferring the Schema Using Reflection

Spark SQL & Dataframes
scala> val peopleDF = personRDD.toDF()
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
scala> peopleDF.show()
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
Inferring the Schema Using Reflection
Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
Inferring the Schema Using Reflection
Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
Inferring the Schema Using Reflection
Spark SQL & Dataframes
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
Inferring the Schema Using Reflection

Spark SQL & Dataframes
Inferring the Schema Using Reflection
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can also be defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
Spark SQL & Dataframes
Programmatically Specifying the Schema
● When case classes can't be defined ahead of time
a. E.g. the expected fields are only known at runtime, passed in as arguments
● We need to create the DataFrame programmatically:
1. Create an RDD of Row objects
2. Create the schema, represented by a StructType
3. Apply the schema with createDataFrame
Spark SQL & Dataframes
Programmatically Specifying the Schema
people.txt:
Michael, 29
Andy, 30
Justin, 19
val schemaString = "name age"

Recommended for you

Oracle sharding : Installation & Configuration
Oracle sharding : Installation & ConfigurationOracle sharding : Installation & Configuration
Oracle sharding : Installation & Configuration

The document describes the steps to configure Oracle sharding in an Oracle 12c environment. It includes installing Oracle software on shardcat, shard1, and shard2 nodes, creating an SCAT database, installing the GSM software, configuring the shard catalog, registering the shard nodes, creating a shard group and adding shards, deploying the shards to create databases on shard1 and shard2, verifying the shard configuration, creating a global service, and creating a sample schema and shard table to verify distribution across shards.

oracle 12c shardingoracle shardingcreating oracle shards
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning

Computer vision is a branch of computer science which deals with recognising objects, people and identifying patterns in visuals. It is basically analogous to the vision of an animal. Topics covered: 1. Overview of Machine Learning 2. Basics of Deep Learning 3. What is computer vision and its use-cases? 4. Various algorithms used in Computer Vision (mostly CNN) 5. Live hands-on demo of either Auto Cameraman or Face recognition system 6. What next?

computer visionmachine learningbig data
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview

This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.

cloudxlabdeep learningmachine learning
Spark SQL & Dataframes
import org.apache.spark.sql.types._
import org.apache.spark.sql._
Programmatically Specifying the Schema
Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
  f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
  f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
val peopleRDD = spark.sparkContext.textFile(filename)
val rowRDD = peopleRDD.map(_.split(",")).map(
  attributes => Row.fromSeq(attributes)
)
val peopleDF = spark.createDataFrame(rowRDD, schema)
Spark SQL & Dataframes
Programmatically Specifying the Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// The schema is encoded in a string
// User provided variable
val schemaString = "name age"
val filename = "/data/spark/people.txt"
val fieldsArray = schemaString.split(" ")
val fields = fieldsArray.map(
  f => StructField(f, StringType, nullable = true)
)
val schema = StructType(fields)
val peopleRDD = spark.sparkContext.textFile(filename)
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row.fromSeq(attributes))
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
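Here every column was hard-coded to StringType, so age came out as a string. As an assumed variation, the schema can carry real types if the Row values are converted to match:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
val typedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val typedRowRDD = peopleRDD.map(_.split(",")).map(a => Row(a(0), a(1).trim.toInt))
val typedDF = spark.createDataFrame(typedRowRDD, typedSchema)
typedDF.printSchema() // age: integer (nullable = true)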

Thank you!
Spark SQL & Dataframes

More Related Content

What's hot

Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
Derek Stainer
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
CQRS and Event Sourcing for Java Developers
CQRS and Event Sourcing for Java DevelopersCQRS and Event Sourcing for Java Developers
CQRS and Event Sourcing for Java Developers
Markus Eisele
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark Streaming
Sigmoid
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Marin Dimitrov
 
MapReduce
MapReduceMapReduce
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 

What's hot (20)

Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Spark overview
Spark overviewSpark overview
Spark overview
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
CQRS and Event Sourcing for Java Developers
CQRS and Event Sourcing for Java DevelopersCQRS and Event Sourcing for Java Developers
CQRS and Event Sourcing for Java Developers
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark Streaming
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
MapReduce
MapReduceMapReduce
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Spark SQL & Dataframes Spark module for Structured Data Processing Spark SQL
  • 3. Spark SQL & Dataframes Integrated ○ Provides DataFrames ○ Mix SQL queries & Spark programs Spark SQL
  • 4. Spark SQL & Dataframes Uniform Data Access ○ Source: ■ HDFS, ■ Hive ■ Relational Databases ○ Avro, Parquet, ORC, JSON ○ You can even join data across these sources. ○ Hive Compatibility ○ Standard Connectivity Spark SQL
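  To make the join-across-sources point concrete, here is a minimal hedged sketch in Scala; the JSON path and the Hive database/table names are hypothetical, and the SparkSession setup is introduced later in the deck:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CrossSourceJoin")
      .enableHiveSupport() // required for reading Hive tables
      .getOrCreate()

    val ordersDF = spark.read.json("/data/orders.json") // JSON file on HDFS
    val customersDF = spark.table("sales.customers")    // Hive table

    // join the two sources on a shared column
    ordersDF.join(customersDF, "customer_id").show()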
  • 5. Spark SQL & Dataframes DataFrames 1 sandeep 2 ted 3 thomas 4 priya 5 kush RDD Unstructured Need code for processing
  • 6. Spark SQL & Dataframes DataFrames 1 sandeep 2 ted 3 thomas 4 priya 5 kush RDD ID Name 1 sandeep 2 ted 3 thomas 4 priya 5 kush Data Frame Unstructured Structured Need code for processing Can use SQL or R like syntax: df.sql("select Id where name = 'priya'") head(where(df, df$ID > 21))
  • 7. Spark SQL & Dataframes ● Collection with named columns ● Distributed ● Not the same as a database table ● Not the same as a data frame in R/Python Data Frames col1 col2 col3 Partition1 Partition2
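  The distributed bullet can be observed directly: a DataFrame is backed by partitioned data, and the partition count is visible through its underlying RDD. A small sketch, assuming a running spark session:
    val numbersDF = spark.range(0, 1000)    // Dataset with a single id column
    println(numbersDF.rdd.getNumPartitions) // number of partitions backing the data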
  • 8. Spark SQL & Dataframes Data Frames col1 col2 col3 Partition1 Partition2 Structured data: CSV, JSON Hive RDBMS RDDs Can be constructed from
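  A hedged sketch of the construction paths listed above; the CSV path and Hive table name are hypothetical, and the RDD route assumes import spark.implicits._ is in scope:
    import spark.implicits._

    // from structured files
    val csvDF  = spark.read.option("header", "true").csv("/data/people.csv")
    val jsonDF = spark.read.json("/data/spark/people.json")

    // from an existing RDD
    val rddDF = sc.parallelize(Seq((1, "sandeep"), (2, "ted"))).toDF("ID", "Name")

    // from a Hive table (with Hive support enabled)
    val hiveDF = spark.table("default.people")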
  • 9. Spark SQL & Dataframes Data Frames DataFrame API is available in Scala, Java, Python, and R col1 col2 col3 Partition1 Partition2
  • 10. Spark SQL & Dataframes ● Available from Spark 2.0.x onwards ● Usable through the usual interfaces ○ Spark-shell ○ Spark application ○ PySpark ○ Java ○ etc. Getting Started
  • 11. Spark SQL & Dataframes Getting Started
    $ export HADOOP_CONF_DIR=/etc/hadoop/conf/
    $ export YARN_CONF_DIR=/etc/hadoop/conf/
  • 12. Spark SQL & Dataframes Getting Started
    $ export HADOOP_CONF_DIR=/etc/hadoop/conf/
    $ export YARN_CONF_DIR=/etc/hadoop/conf/
    $ ls /usr/
    bin games include jdk64 lib64 local share spark1.6 spark2.0.2 tmp
    etc hdp java lib libexec sbin spark1.2.1 spark2.0.1 src
  • 13. Spark SQL & Dataframes Getting Started
    $ export HADOOP_CONF_DIR=/etc/hadoop/conf/
    $ export YARN_CONF_DIR=/etc/hadoop/conf/
    $ ls /usr/
    bin games include jdk64 lib64 local share spark1.6 spark2.0.2 tmp
    etc hdp java lib libexec sbin spark1.2.1 spark2.0.1 src
    $ /usr/spark2.0.2/bin/spark-shell
  • 14. Spark SQL & Dataframes Getting Started
    Spark context Web UI available at http://172.31.60.179:4040
    Spark context available as 'sc' (master = local[*], app id = local-1498489557917).
    Spark session available as 'spark'.
    Welcome to Spark version 2.0.2
    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
    Type in expressions to have them evaluated.
    Type :help for more information.
    scala>
  • 16. Spark SQL & Dataframes Starting Point: SparkSession
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
  • 17. Spark SQL & Dataframes Starting Point: SparkSession
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // For implicit conversions, e.g. RDDs to DataFrames
    import spark.implicits._
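  In spark-shell the session already exists as spark (see the banner above), so the builder pattern matters mainly inside standalone applications. A minimal hedged sketch with a hypothetical app name:
    import org.apache.spark.sql.SparkSession

    object MyApp {
      def main(args: Array[String]): Unit = {
        // getOrCreate() reuses an existing session if one is already running
        val spark = SparkSession.builder().appName("MyApp").getOrCreate()
        spark.range(0, 5).show() // any DataFrame work goes here
        spark.stop()             // release resources when done
      }
    }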
  • 18. Spark SQL & Dataframes Creating DataFrames from JSON
    In web console or ssh:
    $ hadoop fs -cat /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
  • 19. Spark SQL & Dataframes Creating DataFrames from JSON
    var df = spark.read.json("/data/spark/people.json")
    // Displays the content of the DataFrame to stdout
    df.show()
    scala> df.show()
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
  • 20. Spark SQL & Dataframes Creating DataFrames from JSON
    var df = spark.read.json("/data/spark/people.json")
    // Displays the content of the DataFrame to stdout
    df.show()
    scala> df.show()
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    Original JSON:
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
  • 21. Spark SQL & Dataframes DataFrame Operations
    // Print the schema in a tree format
    df.printSchema()
    root
    |-- age: long (nullable = true)
    |-- name: string (nullable = true)
    // input: /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
  • 22. Spark SQL & Dataframes DataFrame Operations
    // Select only the "name" column
    df.select("name").show()
    +-------+
    |   name|
    +-------+
    |Michael|
    |   Andy|
    | Justin|
    +-------+
    // input: /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
  • 23. Spark SQL & Dataframes DataFrame Operations
    // Increment the age by 1
    df.select($"name", $"age" + 1).show()
    +-------+---------+
    |   name|(age + 1)|
    +-------+---------+
    |Michael|     null|
    |   Andy|       31|
    | Justin|       20|
    +-------+---------+
    // input: /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
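  A note on the $"age" syntax: it is sugar provided by import spark.implicits._. The same projection can be written with the column functions; a small sketch:
    import org.apache.spark.sql.functions.{col, expr}

    df.select(col("name"), col("age") + 1).show()    // col() instead of $-interpolation
    df.select(expr("name"), expr("age + 1")).show()  // SQL expression strings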
  • 24. Spark SQL & Dataframes DataFrame Operations
    // Select people older than 21
    df.filter($"age" > 21).show()
    +---+----+
    |age|name|
    +---+----+
    | 30|Andy|
    +---+----+
    // input: /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
  • 25. Spark SQL & Dataframes DataFrame Operations
    // Count people by age
    df.groupBy("age").count().show()
    +----+-----+
    | age|count|
    +----+-----+
    |  19|    1|
    |null|    1|
    |  30|    1|
    +----+-----+
    // SQL equivalent:
    SELECT age, count(*) FROM df GROUP BY age
    // input: /data/spark/people.json
    {"name":"Michael"}
    {"name":"Andy", "age":30}
    {"name":"Justin", "age":19}
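  groupBy is not limited to count(); agg() accepts the built-in aggregate functions. A hedged sketch against the same df:
    import org.apache.spark.sql.functions.{avg, count, max}

    // named aggregate per group, ordered by the grouping column
    df.groupBy("age").agg(count("*").as("people")).orderBy("age").show()

    // whole-frame aggregates, no grouping
    df.agg(avg("age"), max("age")).show()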
  • 26. Spark SQL & Dataframes Running SQL Queries Programmatically
  • 27. Spark SQL & Dataframes Running SQL Queries Programmatically
    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")
  • 28. Spark SQL & Dataframes Running SQL Queries Programmatically
    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")
    val sqlDF = spark.sql("SELECT * FROM people")
  • 29. Spark SQL & Dataframes Running SQL Queries Programmatically
    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")
    val sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
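  Temporary views created with createOrReplaceTempView are scoped to one SparkSession. On Spark 2.1 or later (the shell above runs 2.0.2), global temporary views outlive a single session; a hedged sketch:
    // global temp views live in the reserved global_temp database (Spark 2.1+)
    df.createGlobalTempView("people_global")
    spark.sql("SELECT * FROM global_temp.people_global").show()
    // visible from a different session too
    spark.newSession().sql("SELECT * FROM global_temp.people_global").show()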
  • 30. Spark SQL & Dataframes Datasets
    ● Similar to RDDs
    ● Instead of Java serialization or Kryo, they use a specialized Encoder to serialize objects
    Encoders
    ● Are dynamically generated code
    ● Let Spark perform many operations (filtering, sorting, hashing) without deserializing the bytes back into objects
  • 31. Spark SQL & Dataframes Creating Datasets
    // Encoders for most common types are automatically provided by importing spark.implicits._
    val primitiveDS = Seq(1, 2, 3).toDS()
    primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
  • 32. Spark SQL & Dataframes Creating Datasets
    case class Person(name: String, age: Long)

    // Encoders are created for case classes
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    caseClassDS.show()
    // +----+---+
    // |name|age|
    // +----+---+
    // |Andy| 32|
    // +----+---+
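  Because the Dataset keeps the Person type, transformations can be written as plain Scala lambdas with compile-time checked field access. A small sketch, assuming import spark.implicits._ is in scope:
    val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
    // typed filter and map, no Column expressions needed
    people.filter(_.age > 21).map(_.name).show()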
  • 33. Spark SQL & Dataframes Creating Datasets
    val path = "/data/spark/people.json"
    val peopleDS = spark.read.json(path).as[Person]
    peopleDS.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+
  • 34. Spark SQL & Dataframes Interoperating with RDDs How do we convert an RDD into a DataFrame? RDD: 1 sandeep 2 ted 3 thomas 4 priya 5 kush Data Frame: ID Name 1 sandeep 2 ted 3 thomas 4 priya 5 kush
  • 35. Spark SQL & Dataframes Interoperating with RDDs Two ways to convert RDDs to DF: a. Inferring the Schema Using Reflection b. Programmatically Specifying the Schema RDD ID Name 1 sandeep 2 ted 3 thomas 4 priya 5 kush Data Frame 1 sandeep 2 ted 3 thomas 4 priya 5 kush
  • 36. Spark SQL & Dataframes Interoperating with RDDs Two ways to convert RDDs to DF: a. Inferring the Schema Using Reflection b. Programmatically Specifying the Schema RDD ID Name 1 sandeep 2 ted 3 thomas 4 priya 5 kush Data Frame 1 sandeep 2 ted 3 thomas 4 priya 5 kush
  • 37. Spark SQL & Dataframes Inferring the Schema Using Reflection ● Spark SQL can convert an RDD of case-class objects to a DataFrame ● The names of the case class arguments are read using reflection and become the column names ● Case classes can be nested or contain complex types ● Let us try to convert people.txt into a DataFrame people.txt: Michael, 29 Andy, 30 Justin, 19
  • 38. Spark SQL & Dataframes Inferring the Schema Using Reflection https://github.com/cloudxlab/bigdata/blob/master/spark/examples/dataframes/rdd_to_df.scala
  • 39. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> import spark.implicits._
    import spark.implicits._
  • 40. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> import spark.implicits._
    import spark.implicits._
    scala> case class Person(name: String, age: Long)
    defined class Person
  • 41. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> import spark.implicits._
    import spark.implicits._
    scala> case class Person(name: String, age: Long)
    defined class Person
    scala> val textRDD = sc.textFile("/data/spark/people.txt")
    textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt MapPartitionsRDD[3] at textFile at <console>:30
  • 42. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> import spark.implicits._
    import spark.implicits._
    scala> case class Person(name: String, age: Long)
    defined class Person
    scala> val textRDD = sc.textFile("/data/spark/people.txt")
    textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt MapPartitionsRDD[3] at textFile at <console>:30
    scala> val arrayRDD = textRDD.map(_.split(","))
    arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at map at <console>:32
  • 43. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> import spark.implicits._
    import spark.implicits._
    scala> case class Person(name: String, age: Long)
    defined class Person
    scala> val textRDD = sc.textFile("/data/spark/people.txt")
    textRDD: org.apache.spark.rdd.RDD[String] = /data/spark/people.txt MapPartitionsRDD[3] at textFile at <console>:30
    scala> val arrayRDD = textRDD.map(_.split(","))
    arrayRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[4] at map at <console>:32
    scala> val personRDD = arrayRDD.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5] at map at <console>:36
  • 44. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> val peopleDF = personRDD.toDF()
    peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
  • 45. Spark SQL & Dataframes Inferring the Schema Using Reflection
    scala> val peopleDF = personRDD.toDF()
    peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
    scala> peopleDF.show()
    +-------+---+
    |   name|age|
    +-------+---+
    |Michael| 29|
    |   Andy| 30|
    | Justin| 19|
    +-------+---+
  • 46. Spark SQL & Dataframes Inferring the Schema Using Reflection
    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")
    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
  • 47. Spark SQL & Dataframes Inferring the Schema Using Reflection
    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")
    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
    // The columns of a row in the result can be accessed by field index
    teenagersDF.map(teenager => "Name: " + teenager(0)).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+
  • 48. Spark SQL & Dataframes Inferring the Schema Using Reflection
    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")
    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
    // The columns of a row in the result can be accessed by field index
    teenagersDF.map(teenager => "Name: " + teenager(0)).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+
    // or by field name
    teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+
  • 49. Spark SQL & Dataframes Inferring the Schema Using Reflection
    // No pre-defined encoders for Dataset[Map[K,V]], define explicitly
    implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
    // Primitive types and case classes can be also defined as
    // implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
    teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
    // Array(Map("name" -> "Justin", "age" -> 19))
  • 50. Spark SQL & Dataframes Programmatically Specifying the Schema
    ● When case classes can't be defined ahead of time
      a. e.g. the fields expected in the case class are passed as arguments
    ● We need to create the DataFrame programmatically:
  • 51. Spark SQL & Dataframes Programmatically Specifying the Schema
    ● When case classes can't be defined ahead of time
      a. e.g. the fields expected in the case class are passed as arguments
    ● We need to create the DataFrame programmatically:
      1. Create an RDD of Row objects
      2. Create the schema represented by a StructType
      3. Apply the schema with createDataFrame
  • 52. Spark SQL & Dataframes Programmatically Specifying the Schema
    people.txt:
    Michael, 29
    Andy, 30
    Justin, 19

    val schemaString = "name age"
  • 53. Spark SQL & Dataframes Programmatically Specifying the Schema
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._
  • 54. Spark SQL & Dataframes Programmatically Specifying the Schema
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._

    // The schema is encoded in a string
    // User provided variable
    val schemaString = "name age"
    val filename = "/data/spark/people.txt"
    val fieldsArray = schemaString.split(" ")
    val fields = fieldsArray.map(f => StructField(f, StringType, nullable = true))
    val schema = StructType(fields)
  • 55. Spark SQL & Dataframes Programmatically Specifying the Schema
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._

    // The schema is encoded in a string
    // User provided variable
    val schemaString = "name age"
    val filename = "/data/spark/people.txt"
    val fieldsArray = schemaString.split(" ")
    val fields = fieldsArray.map(f => StructField(f, StringType, nullable = true))
    val schema = StructType(fields)

    val peopleRDD = spark.sparkContext.textFile(filename)
    val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row.fromSeq(attributes))
    val peopleDF = spark.createDataFrame(rowRDD, schema)
  • 56. Spark SQL & Dataframes Programmatically Specifying the Schema
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._

    // The schema is encoded in a string
    // User provided variable
    val schemaString = "name age"
    val filename = "/data/spark/people.txt"
    val fieldsArray = schemaString.split(" ")
    val fields = fieldsArray.map(f => StructField(f, StringType, nullable = true))
    val schema = StructType(fields)

    val peopleRDD = spark.sparkContext.textFile(filename)
    val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row.fromSeq(attributes))
    val peopleDF = spark.createDataFrame(rowRDD, schema)

    peopleDF.show()
    +-------+---+
    |   name|age|
    +-------+---+
    |Michael| 29|
    |   Andy| 30|
    | Justin| 19|
    +-------+---+
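  Note that the programmatic schema above declares every field as StringType, so age is a string in peopleDF. A hedged sketch of casting it to an integer afterwards:
    import org.apache.spark.sql.types.IntegerType

    // cast the string column to an integer; values that do not parse become null
    val typedDF = peopleDF.withColumn("age", peopleDF("age").cast(IntegerType))
    typedDF.printSchema() // age: integer (nullable = true)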
  • 57. Thank you! Spark SQL & Dataframes