Apache Hive
Cloud Computing, Class Presentation Tarbiat Modarres University
MAHMOODREZA ESMAILI ZAND
Supervisor: Dr. Sadegh Dorri Nogoorani
Winter – Spring 2018
What Is Hive
• Apache Hive is a data warehouse software project (initially developed by Facebook) built on top of Apache Hadoop that provides data summarization, query, and analysis.
• Hive gives an SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. It can query data stored in the Hadoop Distributed File System (HDFS) or in Amazon S3, which Amazon EMR exposes through an HDFS-like abstraction layer called EMRFS (Elastic MapReduce File System).
Apache Hive: Fast Facts
Hive Is Not…
• A relational database
• Designed for online transaction processing (OLTP)
• Suited for real-time queries and row-level updates
Limitations
• Certain standard SQL features, such as NOT IN, NOT LIKE, and NOT EQUAL, do not exist or require workarounds (one such workaround is sketched below).
• Hive is not made for low-latency, real-time, or near-real-time querying.
• SQL queries translate to MapReduce jobs, which means slower performance for certain queries compared to a traditional RDBMS.
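As an illustration of one such workaround (the table and column names are hypothetical), an anti-join can stand in for NOT IN on older Hive releases:

-- Standard SQL: SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders);
-- Hive-friendly rewrite: LEFT OUTER JOIN plus an IS NULL filter.
SELECT c.*
FROM customers c
LEFT OUTER JOIN orders o
  ON c.id = o.customer_id
WHERE o.customer_id IS NULL;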
Typical Use Case
• Supports uses such as: ad-hoc queries, summarization, data analysis
Hadoop Hive Components
• Hive Clients – Apache Hive supports applications written in languages such as C++, Java, and Python using its JDBC, Thrift, and ODBC drivers. Thus, one can easily write a Hive client application in the language of their choice.
• Hive Services – Hive provides various services, such as a web interface and a CLI, to perform queries.
• Processing Framework and Resource Management – Hive internally uses the Hadoop MapReduce framework to execute queries.
• Distributed Storage – Since Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.
Hive Services
• a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute your Hive queries and commands directly.
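For example, after launching the CLI with the hive command, statements can be typed interactively; a minimal sketch (the table name is a placeholder):

-- Statements entered at the hive> prompt:
SHOW DATABASES;
USE default;
SHOW TABLES;
SELECT COUNT(*) FROM my_table;   -- 'my_table' is a hypothetical table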
Hive Services
• b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
Pig vs Hive
Pig and Hive work well together, and many businesses use both.
• 1) The Hive Hadoop component is used mainly by data analysts, whereas the Pig Hadoop component is generally used by researchers and programmers.
• 2) The Hive Hadoop component is used for completely structured data, whereas the Pig Hadoop component is used for semi-structured data.
How to process data with Apache Hive?
How to process data with Apache Hive?
• The User Interface (UI) calls the execute interface on the Driver.
• The Driver creates a session handle for the query, then sends the query to the compiler to generate an execution plan.
• The compiler needs metadata, so it sends a getMetaData request to the Metastore and receives the metadata back in the sendMetaData response.
• The compiler uses this metadata to type-check the expressions in the query. It then generates the plan, which is a DAG of stages, each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
How to process data with Apache Hive?
• The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate output is used to read rows from HDFS files, and the rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. These temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table’s location.
• For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver.
HiveQL
HiveQL Features
• HiveQL is similar to other SQL dialects
-Uses familiar relational database concepts (tables, rows, schemas, …)
• Supports multi-table inserts (see the sketch below)
-Accesses Big Data via tables
• Converts SQL queries into MapReduce jobs
-The user doesn’t need to know MapReduce
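A minimal sketch of a multi-table insert, which scans the source once and writes to several targets; all table names here are hypothetical:

-- Read web_logs once, populate two summary tables in one statement.
FROM web_logs src
INSERT OVERWRITE TABLE errors_by_day
  SELECT src.log_date, COUNT(*) WHERE src.status >= 500 GROUP BY src.log_date
INSERT OVERWRITE TABLE hits_by_page
  SELECT src.page, COUNT(*) GROUP BY src.page;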
Getting Started With Hive
Hive Table
Hive Table
• A Hive Table:
-Data: a file or group of files in HDFS
-Schema: in the form of metadata stored in a relational database (the metastore)
• Schema and data are separate
-A schema can be defined for existing data
-Data can be added or removed independently
-Hive can be pointed at existing data
• You have to define a schema if you have existing data in HDFS that you want to use in Hive
Defining a Table
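A sketch of two ways to define a table; the table names, columns, and HDFS path are hypothetical:

-- Managed table: Hive owns the data in its warehouse directory.
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: a schema defined over data that already sits in HDFS.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    INT,
  action     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_events';   -- hypothetical HDFS directory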
Managing Tables
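Common table-management statements, sketched against the hypothetical employees table from the previous example:

SHOW TABLES;
DESCRIBE employees;                                -- list columns and types
ALTER TABLE employees ADD COLUMNS (dept STRING);   -- extend the schema
ALTER TABLE employees RENAME TO staff;
DROP TABLE IF EXISTS staff;                        -- for managed tables this also deletes the data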
Loading Data
• Use LOAD DATA to import data into a Hive table
• The data is not modified by Hive – it is loaded as is
• Use the OVERWRITE keyword to replace the table’s existing data
• The schema is checked only when the data is queried
• If a value does not match the schema, it will be read as NULL
Loading Data
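A minimal sketch of LOAD DATA, using hypothetical paths and the employees table from the earlier example:

-- From the local file system (the file is copied into the table's directory):
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- From HDFS, replacing whatever the table already holds:
LOAD DATA INPATH '/data/staging/employees.csv' OVERWRITE INTO TABLE employees;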
Insert
• Use the INSERT statement to populate a table with data from another Hive table.
• Since query results are usually large, it is best to use an INSERT clause to tell Hive where to store your query results (see the sketch below)
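For example (hypothetical tables), populating a summary table from the employees table; the target table is assumed to already exist:

INSERT INTO TABLE salaries_by_dept
SELECT dept, AVG(salary)
FROM employees
GROUP BY dept;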
Insert Overwrite
• OVERWRITE is used to replace the data in the table; otherwise the data is appended to the table
• Appends happen by adding files to the directory holding the table data
• Can write to a directory in HDFS
• Can write to a local directory
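Sketches of each destination, with hypothetical tables and paths:

-- Replace the contents of a table:
INSERT OVERWRITE TABLE salaries_by_dept
SELECT dept, AVG(salary) FROM employees GROUP BY dept;

-- Write query results to a directory in HDFS:
INSERT OVERWRITE DIRECTORY '/output/salaries_by_dept'
SELECT dept, AVG(salary) FROM employees GROUP BY dept;

-- Write query results to a local directory:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/salaries_by_dept'
SELECT dept, AVG(salary) FROM employees GROUP BY dept;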
Performing Queries (HiveQL)
• SELECT supports the following:
• WHERE clause
• UNION ALL and DISTINCT
• GROUP BY and HAVING
• LIMIT clause
• REGEX column specification
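A sketch that combines several of these clauses (the table and columns are hypothetical):

SELECT dept, COUNT(*) AS headcount
FROM employees
WHERE salary > 50000
GROUP BY dept
HAVING COUNT(*) > 10
LIMIT 5;

-- REGEX column specification: select every column except ds and hr.
-- (On recent Hive releases this also requires: SET hive.support.quoted.identifiers=none;)
SELECT `(ds|hr)?+.+` FROM sales;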
Subqueries
• Hive supports subqueries only in the FROM clause
• The columns in the subquery’s SELECT list are available in the outer query
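A minimal sketch of a FROM-clause subquery (hypothetical table); note that the subquery must be given an alias:

SELECT t.dept, t.avg_salary
FROM (
  SELECT dept, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY dept
) t
WHERE t.avg_salary > 60000;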
JOIN – Inner Joins
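A sketch of an inner join on the hypothetical customers and orders tables from earlier; only rows with a matching key in both tables are returned:

SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o
  ON c.id = o.customer_id;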
JOIN – Outer Joins
• Allow you to find rows with no match in the tables being joined
• Three types:
• LEFT OUTER JOIN
• Returns a row for every row in the first (left) table
• RIGHT OUTER JOIN
• Returns a row for every row in the second (right) table
• FULL OUTER JOIN
• Returns a row for every row from both tables
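Sketches of the three types, again using the hypothetical customers and orders tables:

-- Every customer, with NULLs for customers that have no orders:
SELECT c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o ON c.id = o.customer_id;

-- Every order, with NULLs where the customer is missing:
SELECT c.name, o.order_id
FROM customers c
RIGHT OUTER JOIN orders o ON c.id = o.customer_id;

-- Every row from both sides, matched where possible:
SELECT c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.id = o.customer_id;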
Sorting
• ORDER BY
• Produces a total sort, but sets the number of reducers to 1
• SORT BY
• Uses multiple reducers, with a sorted file from each
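The difference in a sketch (hypothetical table; the reducer-count property name assumes MRv2 and may differ on your Hadoop version):

-- Total ordering; forces a single reducer:
SELECT name, salary FROM employees ORDER BY salary DESC;

-- Per-reducer ordering; each reducer writes its own sorted file:
SET mapreduce.job.reduces=4;
SELECT name, salary FROM employees SORT BY salary DESC;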
Summary
Build a data warehouse with Hive
• If you want to build a data library using Hadoop, but have no Java or MapReduce knowledge, Hive can be a great alternative (if you know SQL).
• It also allows you to build star schemas on top of HDFS.
Example: Build a data warehouse for baseball
information
• Begin by downloading the CSV files that contain statistics about baseball and baseball players. From within Linux, create a directory, then run the commands (sketched below).
• The example contains four main tables, each keyed by a unique column
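The original slide’s commands and download URL are not included in this export, so the sketch below only shows the general shape of the step. To keep every example in the Hive shell, '!' is used to run the local Linux commands and 'dfs' to copy files into HDFS; the URL and all paths are placeholders:

-- All paths and the URL below are placeholders.
!mkdir -p /tmp/baseball;
!wget -P /tmp/baseball http://example.com/baseball-csv.zip;
!unzip /tmp/baseball/baseball-csv.zip -d /tmp/baseball;
dfs -mkdir -p /user/hive/baseball;
dfs -put /tmp/baseball/*.csv /user/hive/baseball/;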
Load data into HDFS or Hive
• Different theories and practices are used to load data into Hadoop.
Sometimes, you ingest raw files directly into HDFS. You might create
certain directories and subdirectories to organize the files, but it's a
simple process of copying or moving files from one location to
another.
Build the data library with Hive
• To keep it simple, let's use the Hive shell. The high-level steps are:
1. Create the baseball database
2. Create the tables
3. Load the tables
4. Verify that the tables are correct
Build the data library with Hive
Build the data library with Hive
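A minimal sketch of steps 1 and 2. The real column layouts depend on the downloaded CSV files, so the table below is illustrative only (one of the four main tables, keyed by a unique player_id):

CREATE DATABASE IF NOT EXISTS baseball;
USE baseball;

CREATE TABLE master (
  player_id  STRING,
  birth_year INT,
  first_name STRING,
  last_name  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;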
• To load data into the Hive table, open the Hive shell again, then run
the following code:
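A sketch of step 3 (with a quick check for step 4), assuming the CSVs were copied to the HDFS directory used in the earlier sketch and the illustrative master table from above:

USE baseball;
LOAD DATA INPATH '/user/hive/baseball/Master.csv' OVERWRITE INTO TABLE master;
-- Repeat for the other tables, then verify:
SELECT COUNT(*) FROM master;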
Build a normalized database with Hive
• The baseball database is more or less normalized: you have the four main tables and several secondary tables. Again, Hive is schema on read, so you have to do most of the work in the data analysis and ETL stages, because there is no indexing or referential integrity as in traditional RDBMSes.
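Because there is no referential integrity, relationships between the main and secondary tables are expressed in the queries themselves; for instance, joining a hypothetical batting table to master on player_id:

SELECT m.first_name, m.last_name, SUM(b.home_runs) AS career_hr
FROM batting b
JOIN master m
  ON b.player_id = m.player_id
GROUP BY m.first_name, m.last_name;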
Thanks For Your Time