Hadoop Ecosystem
Mohamed khouder
mohamedkhouder0100@Hotmail.com
mohamedkhouder0100@Gmail.com
1
Hadoop Ecosystem
2
Overview
The Hadoop Ecosystem
Hadoop core components
 HDFS
 Map Reduce
Other Hadoop ecosystem components
 Hive
 Pig
 Impala
 Sqoop
3
HDFS
Hadoop Distributed File System (HDFS) is designed to reliably
store very large files across machines in a large cluster.
It is inspired by the GoogleFileSystem.
Distributes large data files into blocks.
Blocks are managed by different nodes in the cluster.
Each block is replicated on multiple nodes.
The name node stores metadata information about files and blocks.
4


Hadoop Distributed File System (HDFS)
5
Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
[Diagram: file F is divided into blocks 1-5 of 64 MB each.]
HDFS consists of a name node and data nodes.
HDFS Architecture
Name node: remembers where the data is stored in the cluster.
Data node: stores the actual data in the cluster.
Name node: the master node through which clients must initiate reads and writes.
6
Name node
- Has metadata information about a file: file name, permissions, directory.
- Knows which nodes contain which blocks.
- Disk backup of metadata is very important: if you lose the name node, you lose HDFS.
7
HDFS Architecture
8


HDFS: Comparing Versions
HDFS 1.0 vs HDFS 2.0
- Disaster failure: in HDFS 1.0 the name node is a single point of failure; HDFS 2.0 adds name node high availability.
- Resource management: HDFS 1.0 ties the Resource Manager to MapReduce; HDFS 2.0 uses the Resource Manager with YARN.
- Scalability and performance: HDFS 1.0 suffers with larger clusters; HDFS 2.0 scales and performs well with larger clusters.
9
Fault Tolerance
HDFS was built under the premise that hardware will fail.
Ensures that when hardware fails, users can still have their data available.
Achieved through storing multiple copies throughout the cluster.
10
Fault Tolerance
11
MapReduce
12

Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (Map Reduce)
13
What's MapReduce?
A programming model for expressing distributed computations at a massive scale.
A patented software framework introduced by Google; processes 20 petabytes of data per day.
Popularized by the open-source Hadoop project.
Used at Yahoo!, Facebook, Amazon, …
14
MapReduce: High Level
[Diagram: a MapReduce job submitted by a client computer goes to the Job Tracker, which hands work to Task Trackers; each Task Tracker runs task instances.]
15
 Code is usually written in Java, though it can be written in other languages with the Hadoop Streaming API.
MapReduce core functionality (I)
 Two fundamental components:
• Map step:
 The master node takes a large problem and slices it into smaller sub-problems, distributing these to worker nodes.
 A worker node may do this again if necessary.
 Each worker processes its smaller problem and hands the result back to the master.
• Reduce step:
 The master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
16
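As a concrete illustration of the two steps, here is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce Java API. This is the standard textbook example, not code taken from these slides:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for every word in the input split, emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}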

Hadoop data flow
17
Input reader
- The input reader reads a block and divides it into splits.
- Each split is sent to a map function.
- A line is one input to a map function.
- The key could be some internal number (filename - block id - line id).
- The value is the content of the textual line.
[Diagram: Block 1 holds the lines "Apple Orange Mongo" and "Orange Grapes Plum"; Block 2 holds "Apple Plum Mongo" and "Apple Apple Plum". The input reader turns each block into line-by-line inputs.]
18
Mapper: map function
The mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs.
[Diagram: four map calls m1..m4, one per line:
"Apple Orange Mongo" → (Apple,1) (Orange,1) (Mongo,1)
"Orange Grapes Plum" → (Orange,1) (Grapes,1) (Plum,1)
"Apple Plum Mongo" → (Apple,1) (Plum,1) (Mongo,1)
"Apple Apple Plum" → (Apple,1) (Apple,1) (Plum,1)]
19
Reducer: reduce function
The reducer takes the output generated by the mapper, aggregates the values for each key, and outputs the final result.
There is a shuffle/sort step before reducing.
[Diagram: the twelve (fruit, 1) pairs are shuffled and sorted so equal keys are grouped, then reducers r1..r5 emit (Apple, 4), (Orange, 2), (Grapes, 1), (Mongo, 2), (Plum, 3).]
20

Execute MapReduce on a
cluster of machines with HDFS
21
Execution
22
MapReduce: Execution Details
Input reader
Divide input into splits, assign each split to a Map task.
Map task
Apply the Map function to each record in the split.
Each Map function returns a list of (key, value) pairs.
Shuffle/Partition and Sort
Shuffle distributes sorting & aggregation to many reducers.
All records for key k are directed to the same reduce processor.
Sort groups the same keys together, and prepares for aggregation.
Reduce task
Apply the Reduce function to each key.
The result of the Reduce function is a list of (key, value) pairs.
23
MapReduce Phases
Deciding what will be the key and what will be the value is the developer's responsibility.
24

MapReduce – Group AVG Example
MAP(k, v) emits intermediate pairs; REDUCE(k, list(v)) aggregates them.

Input Data:
New York, US, 10
Los Angeles, US, 40
London, GB, 20
Berlin, DE, 60
Glasgow, GB, 10
Munich, DE, 30
…

Intermediate (K,V)-Pairs (grouped by key after the shuffle):
(US,10) (US,40)
(GB,20) (GB,10)
(DE,60) (DE,30)

Result (average per group):
US, 25
GB, 15
DE, 45
25
Map-Reduce Execution Engine (Example: Color Count)
[Diagram: input blocks on HDFS feed four Map tasks; each map parse-hashes its records and produces (k, v) = (color, 1). Shuffle & sorting based on k sends all pairs for a color to one of three Reduce tasks; each reduce consumes (k, [v]) = (color, [1,1,1,1,1,1, …]) and produces (k', v') = (color, total), e.g. (color, 100).]
Users only provide the "Map" and "Reduce" functions.
26
Properties of MapReduce Engine
JobTracker is the master node (runs with the namenode).
- Receives the user's job.
- Decides how many tasks will run (number of mappers).
- Decides where to run each mapper (concept of locality).
Example: a file with 5 blocks → run 5 map tasks. Where should the task reading block "1" run? Try to run it on Node 1 or Node 3, the nodes holding replicas of that block.
27
Properties of MapReduce Engine (Cont’d)
 TaskTracker is the slave node (runs on each datanode)
 Receives the task from JobTracker
 Runs the task until completion (either map or reduce task)
 Always in communication with the JobTracker reporting progress
[Diagram: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.]
28

Example - Word count
[Diagram: word-count data flow.
Input lines: "Hello Cloud", "TA cool Hello", "TA cool" → three Mappers.
Mapper output: (Hello,1) (Cloud,1) | (TA,1) (cool,1) (Hello,1) | (TA,1) (cool,1).
Sort/Copy/Merge: Hello [1,1], TA [1,1], Cloud [1], cool [1,1] → two Reducers.
Output: Hello 2, TA 2, Cloud 1, cool 2.]
29
Example 1: Word Count
 Job: Count the occurrences of each word in a data set.
[Diagram: map tasks feeding reduce tasks.]
30
Example 2: Color Count
Job: Count the number of each color in a data set.
[Diagram: same flow as the Color Count engine above: input blocks on HDFS feed Map tasks producing (color, 1); shuffle & sorting based on k; Reduce tasks consume (color, [1,1,1, …]) and produce (color, total).]
Output: Part0001, Part0002, Part0003. That's the output file; it has 3 parts, probably on 3 different machines.
31
Example 3: Color Filter
Job: Select only the blue and the green colors.
[Diagram: input blocks on HDFS feed four Map tasks; each map task selects only the blue or green colors and writes its output directly to HDFS. No need for a reduce phase.]
Output: Part0001, Part0002, Part0003, Part0004. That's the output file; it has 4 parts, probably on 4 different machines.
32

Word Count Execution
Input (three lines): "the quick brown fox", "the fox ate the mouse", "how now brown cow" → three Map tasks.
Map output: (the,1) (quick,1) (brown,1) (fox,1) | (the,1) (fox,1) (the,1) (ate,1) (mouse,1) | (how,1) (now,1) (brown,1) (cow,1).
Shuffle & Sort groups equal keys across mappers and routes them to two Reduce tasks.
Reduce output: brown 2, fox 2, how 1, now 1, the 3 | ate 1, cow 1, mouse 1, quick 1.
33
Word Count with Combiner
Same input and flow, but each map task runs a combiner that pre-aggregates locally, so the mapper for "the fox ate the mouse" emits (the, 2) instead of (the, 1) twice. Shuffle & sort and reduce proceed as before and produce the same final output: brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.
34
Apache Hive
35
Hive: A data warehouse on Hadoop
36

Why Hive?
Problem: data, data and more data
- From 200 GB per day in March 2008 to 1 TB compressed per day today.
The Hadoop experiment
Problem: Map/Reduce (MR) is great, but not everyone is a Map/Reduce expert.
- "I know SQL, and I am a Python and PHP expert."
So what do we do? HIVE.
37
What is HIVE?
• A system for querying and managing structured data, built on top of Map/Reduce and Hadoop.
• MapReduce (MR) is very low level and requires customers to write custom programs.
• HIVE supports queries expressed in a SQL-like language called HiveQL, which are compiled into MR jobs that are executed on Hadoop.
• Data model
• Hive structures data into well-understood database concepts such as tables, rows, and columns.
• It supports primitive types: integers, floats, doubles, and strings.
38
Hive Components
 Shell interface: like the MySQL shell.
 Driver:
 Session handles, fetch, execution.
 Compiler:
 Parse, plan, optimize.
 Execution engine:
 DAG of stages; runs map or reduce jobs.
39
Hive architecture
40

Architecture
[Diagram: Web UI + Hive CLI + JDBC/ODBC clients (browse, query, DDL) feed Hive QL through the Parser, Planner, Optimizer, and Execution layers, which run as Map Reduce jobs over HDFS. The MetaStore is reached via a Thrift API. Pluggable pieces: SerDe (CSV, Thrift, Regex), UDF/UDAF (substr, sum, average), file formats (TextFile, SequenceFile, RCFile), and user-defined map-reduce scripts.]
41
Hive Metastore
 Stores Hive metadata.
 Default metastore database uses Apache Derby.
 Various configurations:
 Embedded (in-process metastore, in-process database)
 Mainly for unit tests.
 Only one process can connect to the metastore at a time.
 Local (in-process metastore, out-of-process database)
 Each Hive client connects to the metastore directly.
 Remote (out-of-process metastore, out-of-process database)
 Each Hive client connects to a metastore server, which connects to the
metadata database itself.
 Metastore server and client communicate using the Thrift protocol.
42
Hive Warehouse
 Hive tables are stored in the Hive “warehouse”.
Default HDFS location: /user/hive/warehouse.
 Tables are stored as sub-directories in the warehouse directory.
 Partitions are subdirectories of tables.
 External tables are supported in Hive.
 The actual data is stored in flat files.
43
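To make the layout concrete, a partitioned table such as page_view would sit under the default warehouse directory roughly like this (paths illustrative, not taken from the slides):

/user/hive/warehouse/page_view/                        table directory
/user/hive/warehouse/page_view/ds=2008-06-08/          one partition subdirectory
/user/hive/warehouse/page_view/ds=2008-06-08/000000_0  flat data file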
Hive Schemas
 Hive is schema-on-read
o Schema is only enforced when the data is read (at query time).
o Allows greater flexibility: the same data can be read using multiple schemas.
 Contrast with an RDBMS, which is schema-on-write
o Schema is enforced when the data is loaded.
o Speeds up queries at the expense of load times.
44


Data Hierarchy
 Hive is organised hierarchically into:
 Databases: namespaces that separate tables and other objects.
 Tables: homogeneous units of data with the same schema.
Analogous to tables in an RDBMS.
 Partitions: determine how the data is stored
Allow efficient access to subsets of the data.
 Buckets/clusters
For subsampling within a partition.
Join optimization.
45
HiveQL
 HiveQL / HQL provides the basic SQL-like operations:
 Select columns using SELECT.
 Filter rows using WHERE.
 JOIN between tables.
 Evaluate aggregates using GROUP BY.
 Store query results into another table.
 Download results to a local directory (i.e., export from HDFS).
 Manage tables and queries with CREATE, DROP, and ALTER.
46
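As a sketch of several of these operations in one statement (the table and column names are hypothetical, not taken from the slides):

-- aggregate per country and store the result into another table
CREATE TABLE country_views (country STRING, views BIGINT);

INSERT OVERWRITE TABLE country_views
SELECT country, COUNT(*)
FROM page_view
WHERE ip IS NOT NULL
GROUP BY country;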
Primitive DataTypes
Type Comments
TINYINT, SMALLINT, INT,
BIGINT
1, 2, 4 and 8-byte integers
BOOLEAN TRUE/FALSE
FLOAT, DOUBLE Single and double precision real numbers
STRING Character string
TIMESTAMP Unix-epoch offset or datetime string
DECIMAL Arbitrary-precision decimal
BINARY 47
Complex DataTypes
Type Comments
STRUCT
A collection of elements
If S is of type STRUCT {a INT, b INT}:
S.a returns element a
MAP
Key-value tuple
If M is a map from 'group' to GID:
M['group'] returns value of GID
ARRAY
Indexed list
IfA is an array of elements ['a','b','c']:
A[0] returns 'a'
48

Recommended for you

Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance

This document summarizes a proposal to improve fault tolerance in Hadoop clusters. It proposes adding a "Backup" state to store intermediate MapReduce data, so reducers can continue working even if mappers fail. It also proposes a "supernode" protocol where neighboring slave nodes communicate task information. If one node fails, a neighbor can take over its tasks without involving the JobTracker. This would improve fault tolerance by allowing computation to continue locally between nodes after failures.

Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx

Hadoop is an open-source framework that uses clusters of commodity hardware to store and process big data using the MapReduce programming model. It consists of four main components: MapReduce for distributed processing, HDFS for storage, YARN for resource management and scheduling, and common utilities. HDFS stores large files as blocks across nodes for fault tolerance. MapReduce jobs are split into map and reduce phases to process data in parallel. YARN schedules resources and manages job execution. The common utilities provide libraries and scripts used by all Hadoop components. Major companies use Hadoop to analyze large amounts of data.

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop

Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.

sparkforecasthaddop
CreateTable
 CreateTable is a statement used to create a table in Hive.
 The syntax and example are as follows:
49
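A condensed sketch of the core clauses of standard HiveQL (square brackets mark optional parts):

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  (col_name data_type [COMMENT 'col_comment'], ...)
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type, ...)]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION 'hdfs_path'];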
CreateTable Example
 Let us assume you need to create a table named employee using the CREATE TABLE statement.
 The following table lists the fields and their data types in the employee table:
50
Sr.No  Field Name   Data Type
1      Eid          int
2      Name         string
3      Salary       float
4      Designation  string
CreateTable Example
 The following query creates a table named employee.
51
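A minimal sketch of that statement, assuming tab-delimited text data (the exact ROW FORMAT on the original slide is not recoverable):

CREATE TABLE IF NOT EXISTS employee (
  eid INT,
  name STRING,
  salary FLOAT,
  designation STRING)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;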
 If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists.
 On successful creation of table, you get to see the following response:
Load Data
 Generally in SQL, we can insert data using the INSERT statement. But in Hive, we load data using the LOAD DATA statement.
 The syntax and example are as follows:
52
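A sketch of the LOAD DATA syntax (square brackets mark the optional parts discussed below):

LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2, ...)];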
 LOCAL is an identifier to specify the local path. It is optional.
 OVERWRITE is optional; it overwrites the existing data in the table.
 PARTITION is optional.


Load Data Example
 We will insert data from a text file named sample.txt in the /home/user directory.
53
 The following query loads the given text into the table.
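A sketch, assuming the file's columns line up with the employee table created above:

LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;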
 On a successful load, you get to see an OK response from Hive.
HiveQL - Select-Where
 Given below is the syntax of the SELECT query:
54
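A condensed sketch of the standard HiveQL SELECT syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];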
Select-Where Example
 Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.
55
 The following query retrieves the employee details using the above scenario:
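A minimal sketch (column names follow the text above):

SELECT * FROM employee WHERE salary > 30000;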
Select-Where Example
 On successful execution of the query, you get to see the following response:
56

HiveQL Limitations
HQL only supports equi-joins, outer joins, left semi-joins.
Because it is only a shell over MapReduce, complex queries can be hard to optimise.
Missing large parts of full SQL specification:
 Correlated sub-queries.
 Sub-queries outside FROM clauses.
 Updatable or materialized views.
 Stored procedures.
58
External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
59
Browsing Tables and Partitions
Command Comments
SHOW TABLES;                                    Show all the tables in the database
SHOW TABLES 'page.*';                           Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                      Show the partitions of the page_view table
DESCRIBE page_view;                             List the columns of the table
DESCRIBE EXTENDED page_view;                    More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31'); List information about a partition
60


Loading Data
Use LOAD DATA to load data from a file or directory
 Will read from HDFS unless LOCAL keyword is specified
 Will append data unless OVERWRITE specified
 PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US')
61
Inserting Data
Use INSERT to load data from a Hive query
 Will append data unless OVERWRITE specified
 PARTITION required if destination table is partitioned
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
62
Apache Pig
63
What is Apache Pig?
Pig is a high-level platform for creating MapReduce programs.
Pig is a tool/platform used to analyze large sets of data by representing them as data flows.
Pig is made up of two components:
 Pig Latin.
 Runtime environment.
64


Why Apache Pig?
Programmers who are not so good at Java used to struggle working with Hadoop, especially while performing MapReduce tasks.
 Apache Pig is a boon for all such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
65
Features of Pig
Rich set of operators − Pig provides many operators to perform operations like join, sort, filter, etc.
Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
66
Apache Pig Vs MapReduce
67
- Apache Pig is a data flow language; MapReduce is a data processing paradigm.
- Pig is a high-level language; MapReduce is low level and rigid.
- Performing a Join operation in Apache Pig is pretty simple; it is quite difficult in MapReduce to perform a Join operation between datasets.
- Any programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
- Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce will require almost 20 times more lines to perform the same task.
- There is no need for compilation: on execution, every Apache Pig operator is converted internally into a MapReduce job; MapReduce jobs have a long compilation process.
Apache Pig Vs Hive
68
- Apache Pig uses a language called Pig Latin, originally created at Yahoo; Hive uses a language called HiveQL, originally created at Facebook.
- Pig Latin is a data flow language; HiveQL is a query processing language.
- Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
- Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.


Apache Pig - Architecture
69
Apache Pig converts scripts into a series of MapReduce jobs, making the programmer's job easy.
 Parser:
Checks the syntax, does type checking, and performs other miscellaneous checks.
The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
 Optimizer:
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
Apache Pig - Architecture
70
 Compiler :
The compiler compiles the optimized logical plan into a
series of MapReduce jobs.
 Execution engine:
The MapReduce jobs are submitted to Hadoop in a sorted order.
Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
Pig Latin Data Model
71
The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple.
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
Pig Latin statements
72
 Basic constructs:
These statements work with relations; they include expressions and schemas.
Every statement ends with a semicolon (;).
Pig Latin statements take a relation as input and produce another relation as output.
 Pig Latin example:
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
       as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);


Pig Latin Data types
73
DataType Description & Example
int Represents a signed 32-bit integer. Example : 8
long Represents a signed 64-bit integer. Example : 5L
float Represents a signed 32-bit floating point. Example : 5.5F
double Represents a 64-bit floating point. Example : 10.5
chararray Represents a character array (string) in Unicode UTF-8 format. Example : ‘tutorials point’
bytearray Represents a byte array (blob).
boolean Represents a Boolean value. Example : true / false
datetime Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00
biginteger Represents a Java BigInteger. Example : 60708090709
bigdecimal Represents a Java BigDecimal. Example : 185.98376256272893883
Pig Latin Complex Types
74
DataType Description & Example
Tuple A tuple is an ordered set of fields. Example : (raja, 30)
Bag A bag is a collection of tuples. Example : {(raju,30),(Mohhammad,45)}
Map A Map is a set of key-value pairs. Example : [‘name’#’Raju’,‘age’#30]
Apache Pig Filter Operator
 The FILTER operator is used to select the required tuples from a relation based on a condition.
75
 Syntax of the FILTER operator:
relation2 = FILTER relation1 BY (condition);
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
Filter Operator Example
 And we have loaded this file into Pig with the relation name student_details as shown below.
76
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
 Now use the FILTER operator to get the details of the students who belong to the city Chennai, as shown below.
 Verify the relation filter_data using the DUMP operator; it will display the contents of filter_data.
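A minimal sketch of the two statements referenced above, assuming the student_details schema loaded earlier:

grunt> filter_data = FILTER student_details BY city == 'Chennai';  -- keep only tuples whose city field is Chennai
grunt> DUMP filter_data;  -- print the filtered relation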

Apache Pig Distinct Operator
 The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
77
 Syntax of the DISTINCT operator:
relation2 = DISTINCT relation1;
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
Distinct Operator Example
78
 Remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, as shown below.
 Verify the relation distinct_data using the DUMP operator; it will display the contents of distinct_data.
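A minimal sketch of the statements referenced above, assuming student_details is loaded as in the FILTER example:

grunt> distinct_data = DISTINCT student_details;  -- drop duplicate tuples
grunt> DUMP distinct_data;  -- print the de-duplicated relation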
Apache Pig Group Operator
 The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
79
 Syntax of the GROUP operator:
relation2 = GROUP relation1 BY key;
 Example:
 Assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
Group Operator Example
80
 Let us group the records/tuples in the relation by age, as shown below.
 Verify the relation group_data using the DUMP operator; it will display the contents of group_data.
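A minimal sketch of the statements referenced above; GROUP emits one tuple per distinct key, holding the key and a bag of the matching input tuples:

grunt> group_data = GROUP student_details BY age;  -- output tuples look like (age, {tuples with that age})
grunt> DUMP group_data;  -- print the grouped relation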


Apache Pig Join Operator
 The JOIN operator is used to combine records from two or more relations.
81
 Joins can be of the following types (a sketch follows the list):
 Self-join: joins a relation with itself.
 Inner join: returns rows when there is a match in both relations.
 Left outer join: returns all rows from the left relation, even if there are no matches in the right relation.
 Right outer join: returns all rows from the right relation, even if there are no matches in the left relation.
 Full outer join: returns rows when there is a match in either of the relations.
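A minimal sketch of these joins in Pig Latin, using two hypothetical relations: customers (with an id field) and orders (with a customer_id field):

grunt> inner_joined = JOIN customers BY id, orders BY customer_id;  -- inner join
grunt> left_joined = JOIN customers BY id LEFT OUTER, orders BY customer_id;  -- keep unmatched customers
grunt> full_joined = JOIN customers BY id FULL OUTER, orders BY customer_id;  -- keep unmatched rows from both sides

A self-join requires loading the same data under two different aliases, since Pig does not allow joining an alias directly with itself.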
Impala
82
What is Impala?
Cloudera Impala is a query engine that runs on Apache Hadoop.
Uses a SQL dialect similar to HiveQL.
Does not use MapReduce.
Optimized for low-latency queries.
Open-source Apache project.
Developed by Cloudera.
Much faster than Hive or Pig for interactive queries.
83
Comparing Pig, Hive and Impala
Feature                               Pig        Hive       Impala
SQL-based query language              no         yes        yes
Schema                                optional   required   required
Process data with external scripts    yes        yes        no
Extensible file format support        yes        yes        no
Query speed                           slow       slow       fast
Accessible via ODBC/JDBC              no         yes        yes
84


Apache Sqoop
85
What is Sqoop?
 Command-line tool for transferring data between relational databases and
Hadoop
 Supports incremental imports
 Imports are used to populate tables in Hadoop
 Exports are used to push data from Hadoop into a relational database such as SQL Server
Hadoop <— sqoop —> RDBMS
86
Sqoop Import
 sqoop import \
--connect jdbc:postgresql://hdp-master/sqoop_db \
--username sqoop_user \
--password postgres \
--table cities
87
Sqoop Export
sqoop export \
--connect jdbc:postgresql://hdp-master/sqoop_db \
--username sqoop_user \
--password postgres \
--table cities \
--export-dir cities
88


Sqoop – Example
 An example Sqoop command to
– load data from MySQL into Hive
bin/sqoop-import \
--connect jdbc:mysql://<mysql host>:<mysql port>/db3 \
--username <username> \
--password <password> \
--table <tableName> \
--hive-table <Hive tableName> \
--create-hive-table \
--hive-import \
--hive-home <hive path>
89
How Sqoop works
The dataset being transferred is split into chunks.
A map-only job is launched.
Each mapper is responsible for transferring one chunk of the
dataset (the degree of parallelism can be set with Sqoop's -m/--num-mappers option).
90
How Sqoop works
91
92

