Apache Hive Introduction
- 1. Apache Hive
Cloud Computing, Class Presentation Tarbiat Modarres University
MAHMOODREZA ESMAILI ZAND
Supervisor: Dr. Sadegh Dorri Nogoorani
Winter – Spring 2018
- 2. What Is Hive
• Apache Hive is a data warehouse software project, initially developed by
Facebook, built on top of Apache Hadoop to provide data summarization,
query, and analysis.
• Hive gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop. It provides an
SQL interface to query data stored in the Hadoop Distributed File
System (HDFS) or, on Amazon EMR, in Amazon S3 through an HDFS-like
abstraction layer called EMRFS (the Elastic MapReduce File System).
- 5. Hive Is Not…
• A relational database
• Designed for online transaction processing (OLTP)
• Suited for real-time queries or row-level updates
- 6. Limitations
• Certain standard SQL constructs, such as NOT IN, NOT LIKE, and NOT
EQUAL, do not exist (or did not in older Hive versions) or require
workarounds.
• Hive is not made for low-latency, real-time, or near-real-time
querying.
• SQL queries translate to MapReduce jobs, which means slower
performance for certain queries compared to a traditional RDBMS.
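For example, NOT IN with a subquery was long unsupported; a common workaround is a LEFT OUTER JOIN with a NULL filter (table and column names here are illustrative, not from the original slides):

```sql
-- Goal: rows in orders whose customer_id is NOT IN the blacklist table.
SELECT o.*
FROM orders o
LEFT OUTER JOIN blacklist b
  ON o.customer_id = b.customer_id
WHERE b.customer_id IS NULL;   -- keep only rows with no blacklist match
```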
- 7. Typical Use Case
• Supports uses such as: ad-hoc queries, summarization, and data
analysis
- 9. Hadoop Hive Components
• Hive Clients – Apache Hive supports applications written in
languages such as C++, Java, and Python using the JDBC, Thrift, and ODBC
drivers. Thus, one can easily write a Hive client application in a
language of their choice.
• Hive Services – Hive provides various services, such as the web
interface and the CLI, to perform queries.
• Processing Framework and Resource Management – Hive internally
uses the Hadoop MapReduce framework to execute queries.
• Distributed Storage – As Hive is built on top of Hadoop, it uses the
underlying HDFS for distributed storage.
- 10. Hive Services
• a) CLI (Command Line Interface) – This is the default shell that Hive
provides, in which you can execute your Hive queries and commands
directly.
- 11. Hive Services
• b) Web Interface – Hive also provides a web-based GUI for executing
Hive queries and commands.
- 12. Pig vs Hive
Pig and Hive work well together, and many businesses use both.
• 1) The Hive Hadoop component is used mainly by data analysts,
whereas the Pig Hadoop component is generally used by researchers and
programmers.
• 2) The Hive Hadoop component is used for completely structured data,
whereas the Pig Hadoop component is used for semi-structured data.
- 14. How to process data with Apache Hive?
• The user interface (UI) calls the execute interface on the Driver.
• The Driver creates a session handle for the query, then sends the
query to the compiler to generate an execution plan.
• The compiler needs metadata, so it sends a getMetaData request to the
Metastore and receives the metadata in the sendMetaData response.
• The compiler uses this metadata to type-check the expressions in the
query. It then generates the plan, which is a DAG of stages, with
each stage being either a map/reduce job, a metadata operation, or
an operation on HDFS. For map/reduce stages, the plan contains map
operator trees and a reduce operator tree.
- 15. How to process data with Apache Hive?
• The execution engine then submits these stages to the appropriate
components. In each task, the deserializer associated with the
table or with intermediate outputs is used to read rows from HDFS
files, which are then passed through the associated operator tree.
Once the output is generated, it is written to a temporary HDFS file
through the serializer. The temporary file feeds the subsequent
map/reduce stages of the plan. For DML operations, the final
temporary file is moved to the table's location.
• For queries, the execution engine reads the contents of the
temporary file directly from HDFS as part of the fetch call from the Driver.
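The DAG of stages the compiler produces can be inspected with EXPLAIN (table name illustrative):

```sql
-- Prints the stage dependency graph and, for each map/reduce stage,
-- its map operator tree and reduce operator tree.
EXPLAIN
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
```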
- 18. HiveQL Features
• HiveQL is similar to other SQL dialects
-Uses familiar relational database concepts (tables, rows, schemas, …)
• Supports multi-table inserts
-accesses Big Data via tables
• Converts SQL queries into MapReduce jobs
-the user doesn't need to know MapReduce
- 21. Hive Table
• A Hive Table:
-Data: a file or group of files in HDFS
-Schema: metadata stored in a relational database
Schema and data are separate
-a schema can be defined for existing data
-data can be added or removed independently
-Hive can be pointed at existing data
You have to define a schema if you have existing data in HDFS that you want to
use in Hive
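Pointing Hive at existing HDFS data is typically done with an external table; a sketch, with illustrative path and columns:

```sql
-- Defines a schema over files that already exist in HDFS.
-- Dropping an EXTERNAL table removes only the metadata, not the files.
CREATE EXTERNAL TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views';
```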
- 24. Loading Data
• Use LOAD DATA to import data into a Hive table
• The data is not modified by Hive – it is loaded as is
• Use the keyword OVERWRITE to replace the existing contents of the table
• The schema is checked only when the data is queried
• If a field does not match the schema, it is read as NULL
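A minimal LOAD DATA sketch (paths and table name illustrative):

```sql
-- Moves the file into the table's directory; Hive does not parse or
-- validate it at load time (schema-on-read).
LOAD DATA INPATH '/staging/page_views.csv'
OVERWRITE INTO TABLE page_views;

-- The LOCAL keyword copies from the local file system instead of HDFS:
LOAD DATA LOCAL INPATH '/tmp/page_views.csv' INTO TABLE page_views;
```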
- 26. Insert
• Use the INSERT statement to populate a table with data from another
Hive table.
• Since query results are usually large, it is best to use an INSERT clause
to tell Hive where to store them
- 27. Insert Overwrite
• OVERWRITE replaces the data in the table; otherwise the data
is appended to the table
• Appending happens by adding files to the directory holding the table data
Can write to a directory in HDFS
Can write to a local directory
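The three variants can be sketched as follows (table names illustrative):

```sql
-- Replace the table's contents with a query result:
INSERT OVERWRITE TABLE daily_counts
SELECT dt, COUNT(*) FROM page_views GROUP BY dt;

-- Without OVERWRITE, the result files are appended to the table:
INSERT INTO TABLE daily_counts
SELECT dt, COUNT(*) FROM page_views GROUP BY dt;

-- Results can also be written to an HDFS directory
-- (add LOCAL before DIRECTORY for a local path):
INSERT OVERWRITE DIRECTORY '/output/daily_counts'
SELECT dt, COUNT(*) FROM page_views GROUP BY dt;
```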
- 28. Performing Queries (HiveQL)
• SELECT
• Supports the following:
• WHERE clause
• UNION ALL and DISTINCT
• GROUP BY and HAVING
• LIMIT clause
• Can use a regex column specification
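A single query exercising most of these clauses (table and column names illustrative):

```sql
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE ts > 1514764800           -- filter rows
GROUP BY url                     -- aggregate
HAVING COUNT(*) > 100            -- filter groups
LIMIT 10;                        -- cap result size

-- Regex column specification: every column except user_id
-- (requires hive.support.quoted.identifiers=none):
SELECT `(user_id)?+.+` FROM page_views;
```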
- 30. Subqueries
• Hive supports subqueries only in the FROM clause
• The columns in the subquery's SELECT list are available in the outer query
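A FROM-clause subquery must be given an alias; a sketch with illustrative names:

```sql
-- The inner query's columns (url, hits) are visible to the outer query
-- through the alias t.
SELECT t.url, t.hits
FROM (
  SELECT url, COUNT(*) AS hits
  FROM page_views
  GROUP BY url
) t
WHERE t.hits > 100;
```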
- 32. JOIN – Outer Joins
• Allows finding rows with no match in the tables being
joined
• Three types:
• LEFT OUTER JOIN
• Returns a row for every row in the first (left) table
• RIGHT OUTER JOIN
• Returns a row for every row in the second (right) table
• FULL OUTER JOIN
• Returns a row for every row from both tables
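A LEFT OUTER JOIN sketch (illustrative tables): users with no page views still appear, with NULL in the right-hand columns:

```sql
SELECT u.user_id, pv.url
FROM users u
LEFT OUTER JOIN page_views pv
  ON u.user_id = pv.user_id;   -- pv.url is NULL for users with no views
```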
- 33. Sorting
• ORDER BY
• Produces a total ordering, but sets the number of reducers to 1
• SORT BY
• Uses multiple reducers, producing a sorted file from each
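Side by side (table names illustrative):

```sql
-- Total order, single reducer; can be slow on large data:
SELECT * FROM page_views ORDER BY ts;

-- Per-reducer order only; each reducer writes its own sorted file:
SELECT * FROM page_views SORT BY ts;

-- DISTRIBUTE BY routes rows with the same key to one reducer,
-- then SORT BY orders them within that reducer:
SELECT * FROM page_views DISTRIBUTE BY user_id SORT BY user_id, ts;
```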
- 35. Build a data warehouse with Hive
• If you want to build a data library using Hadoop, but have no Java
or MapReduce knowledge, Hive can be a great alternative (if you
know SQL).
• It also allows you to build star schemas on top of HDFS.
- 36. Example: Build a data warehouse for baseball
information
• Begin by downloading the CSV file that contains statistics about
baseball and baseball players. From within Linux, create a directory,
then run the download command.
• The example contains four main tables, each with a unique key column
- 37. Load data into HDFS or Hive
• Different theories and practices are used to load data into Hadoop.
Sometimes, you ingest raw files directly into HDFS. You might create
certain directories and subdirectories to organize the files, but it's a
simple process of copying or moving files from one location to
another.
- 38. Build the data library with Hive
• To keep it simple, let's use the Hive shell. The high-level steps are:
1. Create the baseball database
2. Create the tables
3. Load the tables
4. Verify that the tables are correct
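Steps 1, 2, and 4 can be sketched as follows; the column list is illustrative, not the actual schema of the downloaded CSV files:

```sql
-- Step 1: create the database
CREATE DATABASE IF NOT EXISTS baseball;
USE baseball;

-- Step 2: create a table (hypothetical columns for one of the four
-- main tables)
CREATE TABLE IF NOT EXISTS master (
  player_id  STRING,
  name_first STRING,
  name_last  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Step 4: verify
SHOW TABLES;
DESCRIBE master;
```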
- 40. Build the data library with Hive
• To load data into the Hive table, open the Hive shell again, then run
the following code:
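The original snippet is not shown in these slides; a minimal sketch, assuming the table from the previous step and a CSV on the local file system (path is hypothetical):

```sql
USE baseball;

-- Load the file and replace any existing contents of the table:
LOAD DATA LOCAL INPATH '/home/user/baseball/Master.csv'
OVERWRITE INTO TABLE master;

-- Sanity check:
SELECT COUNT(*) FROM master;
```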
- 41. Build a normalized database with Hive
• The baseball database is more or less normalized: you have the four
main tables and several secondary tables. Again, Hive is schema-on-read,
so you have to do most of the work in the data analysis and ETL
stages, because there is no indexing or referential integrity as in
traditional RDBMSes.