Dr. Sandeep G. Deshmukh
DataTorrent
Introduction to
Contents
• Motivation
• Scale of Cloud Computing
• Hadoop
• Hadoop Distributed File System (HDFS)
• MapReduce
• Sample Code Walkthrough
• Hadoop Ecosystem
Motivation - Traditional Distributed Systems
• Processor bound
• Using multiple machines
• Developer is burdened with managing too many things
  • Synchronization
  • Failures
• Data moves from shared disk to compute node
• Cost of maintaining clusters
• Scalability as and when required is not available
What is the scale we are talking about?
• Couple of CPUs?
• 10s of CPUs?
• 100s of CPUs?
Hadoop @ Yahoo!
What we need
• Handling failure
  • One computer = fails once in 1000 days
  • 1000 computers = 1 failure per day
• Petabytes of data to be processed in parallel
  • 1 HDD = 100 MB/sec
  • 1000 HDDs = 100 GB/sec
• Easy scalability
  • Performance increases or decreases in proportion to the number of nodes added or removed
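The figures on this slide follow from simple arithmetic. The sketch below reproduces them in Python; the function names and the 1 TB scan example are illustrative assumptions, and only the 1000-day failure interval and the 100 MB/sec disk rate come from the slide.

```python
# Back-of-the-envelope arithmetic for the figures quoted on the slide.
MTBF_DAYS = 1000         # one machine fails about once in 1000 days (slide)
DISK_MB_PER_SEC = 100    # throughput of a single disk (slide)

def expected_failures_per_day(machines, mtbf_days=MTBF_DAYS):
    """Average number of machine failures per day in a cluster."""
    return machines / mtbf_days

def aggregate_throughput_mb(disks, per_disk=DISK_MB_PER_SEC):
    """Combined read rate in MB/sec when all disks are read in parallel."""
    return disks * per_disk

def scan_time_seconds(data_mb, disks):
    """Time to read data_mb megabytes spread evenly over the disks."""
    return data_mb / aggregate_throughput_mb(disks)

ONE_TB_MB = 1_000_000    # assumption: decimal terabyte

print(expected_failures_per_day(1000))     # 1.0 failure per day
print(aggregate_throughput_mb(1000))       # 100000 MB/sec, i.e. 100 GB/sec
print(scan_time_seconds(ONE_TB_MB, 1))     # 10000.0 seconds on one disk
print(scan_time_seconds(ONE_TB_MB, 1000))  # 10.0 seconds on 1000 disks
```

Reading 1 TB takes close to three hours on one disk and about ten seconds on a thousand, which is why both parallel reads and routine failure handling are needed.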
What we've got: Hadoop!
• Created by Doug Cutting
• Started as a module in Nutch and then matured into an Apache project
• Named after his son's stuffed elephant
What we've got: Hadoop!
• Fault-tolerant file system
  • Hadoop Distributed File System (HDFS)
  • Modeled on the Google File System
• Takes computation to data
  • Data locality
• Scalability:
  • Program remains the same for 10, 100, 1000, … nodes
  • Corresponding performance improvement
• Parallel computation using MapReduce
• Other components: Pig, HBase, Hive, ZooKeeper
HDFS
Hadoop Distributed File System
How HDFS works
• NameNode (master)
• DataNodes (slaves)
• Secondary NameNode
Storing a file on HDFS
Motivation
• Reliability
• Availability
• Network bandwidth
• The input file (say 1 TB) is split into smaller chunks/blocks of 64 MB (or multiples of 64 MB)
• The chunks are stored on multiple nodes as independent files on slave nodes
• To ensure that data is not lost, replicas are stored in the following way:
  • One on the local node
  • One on a remote rack (in case the local rack fails)
  • One on the local rack (in case the local node fails)
  • Others randomly placed
• Default replication factor is 3
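The block and replica rules above can be sketched in a few lines of Python. This is an illustration rather than HDFS code: `place_replicas` and the `topology` dictionary are hypothetical names, 1 TB is taken as 1024 × 1024 MB, and the placement simply follows the order listed on the slide (local node, remote rack, local rack).

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed for a file (the last block may be partly filled)."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(local_node, topology):
    """Choose three nodes: the local node, one on a remote rack, one on the local rack.

    topology is a hypothetical dict mapping rack name -> list of node names.
    The first eligible node is always chosen so the sketch stays deterministic.
    """
    local_rack = next(rack for rack, nodes in topology.items() if local_node in nodes)
    remote_rack = next(rack for rack in topology if rack != local_rack)
    remote_node = topology[remote_rack][0]
    local_peer = next(node for node in topology[local_rack] if node != local_node)
    return [local_node, remote_node, local_peer]

ONE_TB_MB = 1024 * 1024          # assumption: binary terabyte
blocks = block_count(ONE_TB_MB)
print(blocks)                    # 16384 blocks of 64 MB
print(blocks * REPLICATION)      # 49152 block replicas in the cluster

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))   # ['n1', 'n3', 'n2']
```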
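[Figure: a file is split into blocks B1, B2, B3, …, Bn; each block is stored three times on DataNodes connected through Hub 1 and Hub 2, with the Master Node tracking the blocks. Link labels: 8 gigabit, 1 gigabit]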
The master node: NameNode
Functions:
• Manages the file system: maps files to blocks and blocks to DataNodes
• Maintains the status of DataNodes
  • Heartbeat
    • Each DataNode sends a heartbeat at regular intervals
    • If a heartbeat is not received, the DataNode is declared dead
  • Blockreport
    • Each DataNode sends a list of the blocks it holds
    • Used to check the health of HDFS
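A minimal sketch of the heartbeat bookkeeping described above, assuming the master only needs the time of the last heartbeat per DataNode. The class name, the 30-second timeout and the explicit `now` arguments are illustrative assumptions, not HDFS internals.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per DataNode and reports nodes that went silent."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}   # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode, now):
        """Record a heartbeat received from a DataNode at time `now` (seconds)."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)     # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40))    # ['dn2']
```

Once a node is declared dead, the blocks it held become candidates for re-replication from the remaining replicas.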
NameNode Functions
• Replication
  • On DataNode failure
  • On disk failure
  • On block corruption
• Data integrity
  • Checksum for each block
  • Stored in a hidden file
• Rebalancing (balancer tool)
  • Addition of new nodes
  • Decommissioning
  • Deletion of some files
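The data-integrity bullet can be illustrated with a per-block checksum. The sketch below uses CRC32 from Python's standard `zlib` module as a stand-in; the function names are hypothetical and the exact checksum scheme HDFS uses is not stated on the slide.

```python
import zlib

def checksum(block: bytes) -> int:
    """Checksum computed when the block is written and stored alongside it."""
    return zlib.crc32(block)

def is_corrupt(block: bytes, stored_checksum: int) -> bool:
    """On read, recompute the checksum and compare it with the stored value."""
    return checksum(block) != stored_checksum

block = b"hadoop"
stored = checksum(block)

print(is_corrupt(block, stored))        # False: block is intact
print(is_corrupt(b"hadooq", stored))    # True: a changed byte is detected
```

A block that fails this check would be treated like a lost replica and re-replicated from a healthy copy.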
HDFS Robustness
• Safemode
  • At startup: no replication possible
  • Receives Heartbeats and Blockreports from DataNodes
  • Only a percentage of blocks are checked for the defined replication factor
• All is well → exit Safemode
• Replicate blocks wherever necessary
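The safemode decision can be sketched as a threshold check over the block reports received so far. The threshold of 0.999, the minimum of one replica and the function names are assumptions made for illustration; the slide only says that a percentage of blocks is checked.

```python
def can_exit_safemode(replica_counts, min_replicas=1, threshold=0.999):
    """True once a large enough fraction of blocks has been reported.

    replica_counts maps block id -> number of replicas reported by DataNodes.
    """
    if not replica_counts:
        return True
    safe = sum(1 for count in replica_counts.values() if count >= min_replicas)
    return safe / len(replica_counts) >= threshold

def under_replicated(replica_counts, replication=3):
    """Blocks that need more copies after safemode is left."""
    return sorted(b for b, count in replica_counts.items() if count < replication)

reports = {"b1": 3, "b2": 2, "b3": 0}
print(can_exit_safemode(reports))    # False: b3 has not been reported yet

reports["b3"] = 1
print(can_exit_safemode(reports))    # True: every block has at least one replica
print(under_replicated(reports))     # ['b2', 'b3']
```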
HDFS Summary
• Fault tolerant
• Scalable
• Reliable
• Files are distributed in large blocks for
  • Efficient reads
  • Parallel access
Questions?
MapReduce
What is MapReduce?
• It is a powerful paradigm for parallel computation
• Hadoop uses MapReduce to execute jobs on files in HDFS
• Hadoop intelligently distributes computation over the cluster
• Takes computation to data
Origin: Functional Programming
map f [a, b, c] = [f(a), f(b), f(c)]
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)]
                 = [1, 4, 9]
• Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument
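The same idea in Python, using the built-in `map`; the helper `sq` mirrors the slide and is defined here only for the example.

```python
def sq(x):
    """Square a number."""
    return x * x

# map applies sq to every item independently, so the calls could run in parallel.
squares = list(map(sq, [1, 2, 3]))
print(squares)   # [1, 4, 9]
```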
Origin: Functional Programming
reduce f [a, b, c] = f(a, b, c)
reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, sum(NULL))))
                     = 14
• Returns a single value obtained by applying a function (the first argument) across the list passed as the second argument
• Can be identity (do nothing)
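In Python the equivalent is `functools.reduce`. Note that it folds from the left, starting with the initial value, whereas the slide nests the calls from the right; for addition the result is the same. The helper `add` is defined only for this example.

```python
from functools import reduce

def add(a, b):
    """Combine two values into one."""
    return a + b

# reduce folds the list into a single value: ((0 + 1) + 4) + 9
total = reduce(add, [1, 4, 9], 0)
print(total)   # 14

# Chaining map and reduce gives a sum of squares.
sum_of_squares = reduce(add, map(lambda x: x * x, [1, 2, 3, 4]), 0)
print(sum_of_squares)   # 30
```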
Sum of squares example
Input: [1, 2, 3, 4]
MAPPER (M1 to M4): Sq(1), Sq(2), Sq(3), Sq(4)
Intermediate output: 1, 4, 9, 16
REDUCER (R1): sum
Output: 30
Sum of squares of even and odd
Input: [1, 2, 3, 4]
MAPPER (M1 to M4): Sq(1), Sq(2), Sq(3), Sq(4)
Intermediate output: (odd, 1), (even, 4), (odd, 9), (even, 16)
REDUCER (R1, R2): one reducer per key
Output: (even, 20), (odd, 10)
Programming model: key, value pairs
Format of input and output: (key, value)
Map: (k1, v1) → list (k2, v2)
Reduce: (k2, list v2) → list (k3, v3)
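The two signatures can be exercised with a tiny in-memory driver. This is a sketch of the programming model only, not of Hadoop itself: `run_mapreduce`, `parity_mapper` and `sum_reducer` are hypothetical names, and the input key k1 is taken to be the record's position.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce over (k1, v1) records in a single process."""
    # Map: each (k1, v1) record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: each (k2, list of v2) yields a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def parity_mapper(position, number):
    """Emit (even, n*n) or (odd, n*n) for one input number."""
    label = "even" if number % 2 == 0 else "odd"
    return [(label, number * number)]

def sum_reducer(label, squares):
    """Add up all squares that share a label."""
    return [(label, sum(squares))]

records = list(enumerate([1, 2, 3, 4]))
print(run_mapreduce(records, parity_mapper, sum_reducer))
# [('even', 20), ('odd', 10)]
```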
Sum of squares of even and odd and prime
Input: [1, 2, 3, 4]
MAPPER: Sq(1), Sq(2), Sq(3), Sq(4)
Intermediate output: (odd, 1), (even, 4), (prime, 4), (odd, 9), (prime, 9), (even, 16)
REDUCER (R1, R2, R3): one reducer per key
Output: (even, 20), (odd, 10), (prime, 13)
Many keys, many values
Format of input and output: (key, value)
Map: (k1, v1) → list (k2, v2)
Reduce: (k2, list v2) → list (k3, v3)
Fibonacci sequence
• f(n) = f(n-1) + f(n-2), i.e. f(5) = f(4) + f(3)
• 0, 1, 1, 2, 3, 5, 8, 13, …
[Figure: recursion tree of f(5), which expands into f(4) and f(3) and continues down to f(1) and f(0)]
• MapReduce will not work on this kind of calculation
• No inter-process communication
• No data sharing
Input: 1 TB text file containing color names: Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey, …
Desired output: Occurrence of colors Blue and Green
[Figure: each node N1 … Nn holds one chunk f.001 … f.00n of the input file (lines such as Blue, Purple, Blue, Red, Green, Blue, Maroon, Green, Yellow) and processes it locally]
MAPPER: grep -E 'Blue|Green' keeps only the matching lines of the local chunk (Blue, Blue, Green, Blue, Green, …)
COMBINER: sort | uniq -c, or awk '{arr[$1]++} END {print arr["Blue"], arr["Green"]}', counts the colors on each node, e.g. Blue=500, Green=200 on one node and Blue=420, Green=200 on another
REDUCER: awk '{arr[$1]+=$2} END {print arr["Blue"], arr["Green"]}' adds up the per-node counts, e.g. Blue=3000, Green=5500
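The same three stages can be sketched in Python. The first chunk reuses the lines shown on the slide; the second chunk, the function names and the single-process driver loop are assumptions made for illustration.

```python
from collections import Counter

WANTED = {"Blue", "Green"}

def mapper(lines):
    """Like grep: keep only the wanted colors, emitting (color, 1) pairs."""
    return [(color, 1) for color in lines if color in WANTED]

def combiner(pairs):
    """Like sort | uniq -c: add up the counts produced on one node."""
    local = Counter()
    for color, n in pairs:
        local[color] += n
    return list(local.items())

def reducer(pairs):
    """Like the final awk: add up the per-node counts from every node."""
    total = Counter()
    for color, n in pairs:
        total[color] += n
    return dict(total)

# One list per node; in Hadoop each chunk would live on a different DataNode.
chunks = [
    ["Blue", "Purple", "Blue", "Red", "Green", "Blue", "Maroon", "Green", "Yellow"],
    ["Green", "Blue", "Pink", "Blue", "Blue", "Green"],   # hypothetical second chunk
]

shuffled = []
for chunk in chunks:
    shuffled.extend(combiner(mapper(chunk)))   # only small per-node totals travel

print(reducer(shuffled))   # {'Blue': 6, 'Green': 4}
```

Because the combiner runs on each node, only one pair per color leaves a node instead of one pair per matching line.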
MapReduce Overview
[Figure: Input data → Map (×3) → Shuffle → Reduce (×2) → Output]
Stages: INPUT → MAP → SHUFFLE → REDUCE → OUTPUT
• Map works on a record
• Reduce works on output of Map

Recommended for you

Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals

This Hadoop MapReduce tutorial will unravel MapReduce Programming, MapReduce Commands, MapReduce Fundamentals, Driver Class, Mapper Class, Reducer Class, Job Tracker & Task Tracker. At the end, you'll have a strong knowledge regarding Hadoop MapReduce Basics. PPT Agenda: ✓ Introduction to BIG Data & Hadoop ✓ What is MapReduce? ✓ MapReduce Data Flows ✓ MapReduce Programming ---------- What is MapReduce? MapReduce is a programming framework for distributed processing of large data-sets via commodity computing clusters. It is based on the principal of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This ensures a faster, secure & scalable solution. Mapreduce commands are based in Java. ---------- What are MapReduce Components? It has the following components: 1. Combiner: The combiner collates all the data from the sample set based on your desired filters. For example, you can collate data based on day, week, month and year. After this, the data is prepared and sent for parallel processing. 2. Job Tracker: This allocates the data across multiple servers. 3. Task Tracker: This executes the program across various servers. 4. Reducer: It will isolate the desired output from across the multiple servers. ---------- Applications of MapReduce 1. Data Mining 2. Document Indexing 3. Business Intelligence 4. Predictive Modelling 5. Hypothesis Testing ---------- Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance. Email: sales@skillspeed.com Website: https://www.skillspeed.com

big datahadoopmapreduce fundamentals
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark

Apache Spark is a In Memory Data Processing Solution that can work with existing data source like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction of Apache Spark with its various components like MLib, Shark, GrpahX and with few examples.

sparkapache spark
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark

Apache Spark is a fast, general engine for large-scale data processing. It provides unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.

big dataapache sparkmeetup
MapReduce Overview
[Figure: Input data → Map (×3) → Combine (×3) → Reduce (×2) → Output]
Stages: INPUT → MAP → REDUCE → OUTPUT
• Combine works on output of Map
• Reduce works on output of Combiner
MapReduce Overview
• Mapper, reducer and combiner act on <key, value> pairs
• Mapper gets one record at a time as input
• Combiner (if present) works on the output of map
• Reducer works on the output of map (or combiner, if present)
• Combiner can be thought of as a local reducer
  • Reduces the output of maps that are executed on the same node
MapReduce Summary
What Hadoop is not
• It is not a POSIX file system
• It is not a SAN file system
• It is not for interactive file access
• It is not meant for a large number of small files; it is for a small number of large files
• MapReduce cannot be used for any and all applications
Hadoop: Take Home
 Takes computation to data
 Suitable for large data centric operations
 Scalable on demand
 Fault tolerant and highly transparent
37
Questions?
38
 Coming up next …
 First hadoop program
 Second hadoop program
Your first program in Hadoop
Open any Hadoop tutorial and the first program you see will be word count.
Task:
 Given a text file, generate a list of words with the number of times each of them appears in the file
Input:
 Plain text file
Expected Output:
 <word, frequency> pairs for all words in the file
Sample input:
hadoop is a framework written in java
hadoop supports parallel processing
and is a simple framework
Sample output:
<hadoop, 2>
<is, 2>
<a, 2>
<java, 1>
<framework, 2>
<written, 1>
<in, 1>
<and, 1>
<supports, 1>
<parallel, 1>
<processing, 1>
<simple, 1>
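The word count task can be sketched in plain Java as three steps: map, shuffle, reduce. This is a single-process simulation and uses no Hadoop API. The class `WordCountSketch` and its method names are assumptions chosen for illustration, and the shuffle step stands in for the grouping that the framework performs between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of word count; not the Hadoop API.
class WordCountSketch {

    // Map: invoked once per line; emits a (word, 1) pair for every word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group the values of all pairs by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: invoked once per key; sums the list of ones for that word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every line, then shuffle, then reduce on every key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(Arrays.asList(
                "hadoop is a framework written in java",
                "hadoop supports parallel processing",
                "and is a simple framework"));
        counts.forEach((word, n) -> System.out.println("<" + word + ", " + n + ">"));
    }
}
```

Running `main` prints one `<word, frequency>` pair per distinct word, in alphabetical order because a `TreeMap` is used. Splitting on whitespace is the simplest possible tokenizer and does not strip punctuation.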
39
Your second program in Hadoop
Task:
 Given a text file containing numbers, one per line, compute the sum of squares of the odd, even, and prime numbers
Input:
 File containing integers, one per line
Expected Output:
 <type, sum of squares> for odd, even, prime
Sample input:
1
2
5
3
5
6
3
7
9
4
Sample output:
<odd, 199>
<even, 56>
<prime, 121>
40

Your second program in Hadoop
41
File on HDFS: 3, 9, 6, 2, 3, 7, 8
Map: square (input: value; output: (key, value))
3 → <odd,9>, <prime,9>
9 → <odd,81>
6 → <even,36>
2 → <even,4>, <prime,4>
3 → <odd,9>, <prime,9>
7 → <odd,49>, <prime,49>
8 → <even,64>
Reducer: sum (input: (key, list of values); output: (key, value))
odd: <9,81,9,49> → <odd,148>
even: <36,4,64> → <even,104>
prime: <9,4,9,49> → <prime,71>
Your second program in Hadoop
42
Map (invoked on a record)
void map(int x) {
    int sq = x * x;
    if (odd(x))
        print("odd", sq);
    if (even(x))
        print("even", sq);
    if (prime(x))
        print("prime", sq);
}
Reduce (invoked on a key)
void reduce(String key, List<Integer> values) {
    int sum = 0;
    for (int y : values) {
        sum += y;
    }
    print(key, sum);
}
Library functions
boolean odd(int x) { … }
boolean even(int x) { … }
boolean prime(int x) { … }
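The pseudocode above can be tried out without a cluster. The following is a plain-Java sketch, not Hadoop code: the class `OddEvenPrimeSketch`, the `emit` and `run` helpers, and the bodies of `odd`, `even`, and `prime` are assumptions filled in for illustration (the slides leave the library functions unspecified). The in-memory map plays the role of the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the odd/even/prime job; not the Hadoop API.
class OddEvenPrimeSketch {

    // Assumed implementations of the library functions.
    static boolean odd(int x) { return x % 2 != 0; }
    static boolean even(int x) { return x % 2 == 0; }
    static boolean prime(int x) {
        if (x < 2) return false;
        for (int d = 2; (long) d * d <= x; d++) {
            if (x % d == 0) return false;
        }
        return true;
    }

    // Stand-in for print(key, value) in map: adds the value to the key's list.
    static void emit(Map<String, List<Integer>> shuffle, String key, int value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Map: invoked once per record; one record may emit under several keys.
    static void map(int x, Map<String, List<Integer>> shuffle) {
        int sq = x * x;
        if (odd(x)) emit(shuffle, "odd", sq);
        if (even(x)) emit(shuffle, "even", sq);
        if (prime(x)) emit(shuffle, "prime", sq);
    }

    // Reduce: invoked once per key; sums the squares collected for that key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int y : values) {
            sum += y;
        }
        return sum;
    }

    // Runs map over all records, then reduce over every key.
    static Map<String, Integer> run(List<Integer> records) {
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (int x : records) {
            map(x, shuffle);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {even=104, odd=148, prime=71}
        System.out.println(run(Arrays.asList(3, 9, 6, 2, 3, 7, 8)));
    }
}
```

Note that 3 and 7 are both odd and prime, so their squares are emitted under two keys, while 2 is emitted under both even and prime.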

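Your second program in Hadoop
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class OddEvenPrime {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor
        Job job = Job.getInstance(conf, "OddEvenPrime");
        job.setJarByClass(OddEvenPrime.class);
        job.setMapperClass(OddEvenPrimeMapper.class);
        // The reducer only sums values, so it can also run as the combiner
        job.setCombinerClass(OddEvenPrimeReducer.class);
        job.setReducerClass(OddEvenPrimeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        // args[0] is the input path, args[1] the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}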
45
Questions?
46
 Coming up next …
 More examples
 Hadoop Ecosystem
Example: Counting Fans
47
Example: Counting Fans
48
Problem: Give crowd statistics
Count fans supporting India and Pakistan

49
45882 67917
 Traditional Way
 Central Processing
 Every fan comes to the centre and presses the India/Pak button
 Issues
 Slow/Bottlenecks
 Only one processor
 Processing time determined by the speed at which people (data) can move
Example: Counting Fans
50
Hadoop Way
 Appoint processors per block (MAPPER)
 Send them to each block and ask them to send a signal for each person
 Central processor will aggregate the results (REDUCER)
 Can make the processor smart by asking him/her to aggregate locally and only send the aggregated value (COMBINER)
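The stadium analogy can be sketched in plain Java with a thread pool standing in for the cluster. Everything here is an assumption for illustration: the class `FanCountSketch`, its method names, and the tiny sample blocks are not from the slides, and no Hadoop API is involved. One worker per block counts locally (mapper plus combiner), and the caller adds up the per-block totals (reducer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java simulation of the fan-counting analogy; not the Hadoop API.
class FanCountSketch {

    // Mapper + combiner: one worker walks one block and returns local totals only.
    static Map<String, Integer> countBlock(List<String> block) {
        Map<String, Integer> local = new HashMap<>();
        for (String team : block) {
            local.merge(team, 1, Integer::sum);
        }
        return local;
    }

    // Reducer: the central processor adds up the per-block totals.
    static Map<String, Integer> countStadium(List<List<String>> blocks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, blocks.size()));
        try {
            // Send one worker to each block; the blocks are counted in parallel.
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (List<String> block : blocks) {
                Callable<Map<String, Integer>> task = () -> countBlock(block);
                partials.add(pool.submit(task));
            }
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> partial : partials) {
                try {
                    partial.get().forEach((team, n) -> total.merge(team, n, Integer::sum));
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("block count failed", e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
                Arrays.asList("India", "India", "Pakistan"),
                Arrays.asList("Pakistan", "India"),
                Arrays.asList("India", "Pakistan", "India"));
        // prints {India=5, Pakistan=3}
        System.out.println(countStadium(blocks));
    }
}
```

The fans never move: only one small map of totals per block travels to the central processor, which is the point of taking computation to the data.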
Example: Counting Fans
Reducer
Combiner
Homework: Exit Polls 2014
51
Hadoop EcoSystem: Basic
52

Hadoop Distributions
53
Who is using Hadoop
54
http://wiki.apache.org/hadoop/PoweredBy
References
For understanding Hadoop
 Official Hadoop website: http://hadoop.apache.org/
 Hadoop presentation wiki: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile
 http://developer.yahoo.com/hadoop/
 http://wiki.apache.org/hadoop/
 http://www.cloudera.com/hadoop-training/
 http://developer.yahoo.com/hadoop/tutorial/module2.html#basics
55
References
Setup and Installation
 Installing on Ubuntu
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
 Installing on Debian
 http://archive.cloudera.com/docs/_apt.html
56

Further Reading
 Hadoop: The Definitive Guide, Tom White
 http://developer.yahoo.com/hadoop/tutorial/
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
57
Questions?
58
59
Acknowledgements
Surabhi Pendse
Sayali Kulkarni
Parth Shah

More Related Content

What's hot

Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
sunera pathan
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Rutvik Bapat
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
Sandip Darwade
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
Manish Borkar
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
Harshdeep Kaur
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Bhavesh Padharia
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 

What's hot (20)

Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 

Viewers also liked

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Cloudera, Inc.
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 

Viewers also liked (9)

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 

Similar to Introduction to Hadoop

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
Kelly Technologies
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
GiannisPagges
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Hadoop
HadoopHadoop
Hadoop
Anil Reddy
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Data Science
Data ScienceData Science
Data Science
Subhajit75
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
Habiba Abderrahim
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Desing Pathshala
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
pramodbiligiri
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
VNIT-ACM Student Chapter
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
IndicThreads
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 
Hadoop
HadoopHadoop
Handout3o
Handout3oHandout3o
Handout3o
Shahbaz Sidhu
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
Intel® Software
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
Francesc Lordan Gomis
 

Similar to Introduction to Hadoop (20)

Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Data Science
Data ScienceData Science
Data Science
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop
HadoopHadoop
Hadoop
 
Handout3o
Handout3oHandout3o
Handout3o
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 


Introduction to Hadoop

  • 1. Introduction to Hadoop. Dr. Sandeep G. Deshmukh, DataTorrent
  • 2. Contents: Motivation; Scale of Cloud Computing; Hadoop; Hadoop Distributed File System (HDFS); MapReduce; Sample Code Walkthrough; Hadoop Ecosystem
  • 3. Motivation - Traditional distributed systems: Processor bound; Using multiple machines; Developer is burdened with managing too many things; Synchronization; Failures; Data moves from shared disk to compute node; Cost of maintaining clusters; Scalability as and when required is not present
  • 4. What is the scale we are talking about? A couple of CPUs? 10s of CPUs? 100s of CPUs?
  • 8. What we need: Handling failure (one computer fails once in 1000 days, so 1000 computers see about 1 failure per day); Petabytes of data to be processed in parallel (1 HDD = 100 MB/sec, 1000 HDDs = 100 GB/sec); Easy scalability (performance increases or decreases in proportion to the number of nodes added or removed)
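The arithmetic on this slide can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the 1000-day failure interval and the 100 MB/sec disk rate come from the slide, while the class name `ClusterScale` and the 1 PB workload are illustrative assumptions.

```java
// Back-of-the-envelope arithmetic for the slide's cluster figures.
// ClusterScale is a hypothetical helper, not part of Hadoop.
final class ClusterScale {
    /** Expected machine failures per day when each machine fails once every mtbfDays. */
    static double expectedFailuresPerDay(int machines, double mtbfDays) {
        return machines / mtbfDays;
    }

    /** Aggregate sequential throughput in MB/s when all disks are read in parallel. */
    static long aggregateThroughputMBps(int disks, long perDiskMBps) {
        return disks * perDiskMBps;
    }

    /** Hours needed to scan dataMB megabytes at the given throughput. */
    static double scanHours(long dataMB, long throughputMBps) {
        return dataMB / (double) throughputMBps / 3600.0;
    }

    public static void main(String[] args) {
        long onePetabyteMB = 1_000_000_000L; // 1 PB in decimal megabytes (assumed workload)
        System.out.println(expectedFailuresPerDay(1000, 1000.0)); // failures per day, 1000 machines
        System.out.println(aggregateThroughputMBps(1000, 100));   // MB/s across 1000 disks
        System.out.println(scanHours(onePetabyteMB, 100));        // hours on a single disk
        System.out.println(scanHours(onePetabyteMB, 100_000));    // hours on 1000 disks
    }
}
```

Under these assumptions a single disk needs roughly 2,778 hours (about 116 days) to scan 1 PB, while 1000 disks reading in parallel need under 3 hours, which is the case for spreading both data and computation across many nodes.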
  • 9. What we've got: Hadoop! Created by Doug Cutting; Started as a module in Nutch and then matured into an Apache project; Named after his son's stuffed elephant
  • 10. What we've got: Hadoop! Fault-tolerant file system: Hadoop Distributed File System (HDFS), modeled on the Google File System; Takes computation to data (data locality); Scalability: the program remains the same for 10, 100, 1000, … nodes, with corresponding performance improvement; Parallel computation using MapReduce; Other components: Pig, HBase, Hive, ZooKeeper
  • 12. How HDFS works: NameNode (master); DataNodes (slaves); Secondary NameNode
  • 13. Storing a file on HDFS. Motivation: reliability, availability, network bandwidth. The input file (say 1 TB) is split into smaller chunks/blocks of 64 MB (or multiples of 64 MB); The chunks are stored on multiple slave nodes as independent files; To ensure that data is not lost, replicas are stored as follows: one on the local node, one on a remote rack (in case the local rack fails), one on the local rack (in case the local node fails), others placed randomly; Default replication factor is 3
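To get a feel for the numbers on this slide, the following sketch computes how many 64 MB blocks a 1 TB file occupies and how many block replicas the cluster stores at replication factor 3. The class `BlockMath` is a hypothetical helper, and binary units (1 TB = 1024 × 1024 MB) are an assumption.

```java
// Block and replica counts for a file stored in fixed-size blocks.
// BlockMath is a hypothetical helper, not an HDFS API.
final class BlockMath {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    /** Number of fixed-size blocks needed to hold fileBytes (the last block may be partial). */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Total number of block replicas stored across the cluster. */
    static long replicaCount(long blocks, int replicationFactor) {
        return blocks * replicationFactor;
    }

    public static void main(String[] args) {
        long blocks = blockCount(1 * TB, 64 * MB);
        System.out.println(blocks);                  // 16384
        System.out.println(replicaCount(blocks, 3)); // 49152
    }
}
```

With three replicas, the 1 TB file consumes about 3 TB of raw disk space across the DataNodes.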
  • 14. [Diagram: a File split into Blocks (B1, B2, B3, … Bn, with B1 and B2 each shown three times) on Datanodes; labels: Master Node, Hub 1, Hub 2, 8 gigabit, 1 gigabit]
  • 15. NameNode - Master. Functions of the master node (NameNode): Manages the file system, mapping files to blocks and blocks to DataNodes; Maintains the status of DataNodes; Heartbeat: each DataNode sends a heartbeat at regular intervals, and if no heartbeat is received the DataNode is declared dead; Blockreport: each DataNode sends the list of blocks it holds, which is used to check the health of HDFS
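The heartbeat rule described on this slide can be sketched as a small bookkeeping class. This is an illustration of the idea only, not the NameNode's actual implementation: `HeartbeatMonitor`, its method names, and the 10-second timeout used in `main` are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of heartbeat tracking; names and timeout are assumptions.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record a heartbeat received from a DataNode at the given time. */
    void heartbeat(String dataNodeId, long nowMillis) {
        lastSeen.put(dataNodeId, nowMillis);
    }

    /** DataNodes whose last heartbeat is older than the timeout are declared dead. */
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.heartbeat("dn1", 0);
        monitor.heartbeat("dn2", 0);
        monitor.heartbeat("dn1", 9_000);               // dn1 keeps reporting, dn2 goes silent
        System.out.println(monitor.deadNodes(12_000)); // [dn2]
    }
}
```

Once a node is declared dead, the blocks it held fall below their replication factor, which is what triggers the re-replication described on the next slide.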
  • 16. NameNode - Master. NameNode functions: Replication (on DataNode failure, on disk failure, on block corruption); Data integrity (checksum for each block, stored in a hidden file); Rebalancing with the balancer tool (on addition of new nodes, decommissioning, deletion of some files)
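The data-integrity idea (store a checksum with each block and compare it on read) can be illustrated with the JDK's `java.util.zip.CRC32`. This is a simplification under stated assumptions: `BlockChecksum` is a hypothetical class, and it checksums a whole block at once, whereas HDFS itself checksums smaller chunks within each block.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of per-block checksumming; not the HDFS implementation.
final class BlockChecksum {
    /** Compute a CRC32 checksum over the block contents. */
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** A block is intact if its current checksum matches the one stored at write time. */
    static boolean isIntact(byte[] block, long storedChecksum) {
        return checksum(block) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "contents of one block".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);               // saved alongside the block
        System.out.println(isIntact(block, stored)); // true
        block[0] ^= 1;                               // simulate a single flipped bit on disk
        System.out.println(isIntact(block, stored)); // false
    }
}
```

A failed check marks that replica as corrupt, so a healthy replica from another node can be used to restore the replication factor.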
  • 17. HDFS Robustness. Safemode: At startup, no replication is possible; The NameNode receives Heartbeats and Blockreports from DataNodes; Only a percentage of blocks are checked for the defined replication factor; If all is well, exit Safemode and replicate blocks wherever necessary
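One way to picture the Safemode exit decision is as a threshold test over the replica counts reported by DataNodes. The sketch below is a loose illustration: the class `SafemodeCheck`, the minimum of 1 replica, and the 99% threshold are assumptions chosen for the example, not values taken from the slides.

```java
// Hypothetical sketch of a Safemode exit check; thresholds are assumptions.
final class SafemodeCheck {
    /** Fraction of blocks that have at least minReplicas reported replicas. */
    static double safeFraction(int[] reportedReplicas, int minReplicas) {
        if (reportedReplicas.length == 0) {
            return 1.0; // nothing to wait for
        }
        int safe = 0;
        for (int replicas : reportedReplicas) {
            if (replicas >= minReplicas) {
                safe++;
            }
        }
        return safe / (double) reportedReplicas.length;
    }

    /** Safemode can end once a sufficient fraction of blocks is safely replicated. */
    static boolean canExitSafemode(int[] reportedReplicas, int minReplicas, double threshold) {
        return safeFraction(reportedReplicas, minReplicas) >= threshold;
    }

    public static void main(String[] args) {
        int[] allReported = {3, 3, 2, 1}; // replicas reported so far for four blocks
        int[] oneMissing = {3, 0, 2, 1};  // one block has no replica reported yet
        System.out.println(canExitSafemode(allReported, 1, 0.99)); // true
        System.out.println(canExitSafemode(oneMissing, 1, 0.99));  // false
    }
}
```

After leaving Safemode, blocks that are present but under-replicated (such as the ones with 1 or 2 replicas above) become candidates for re-replication.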
  • 18. HDFS Summary: Fault tolerant; Scalable; Reliable; Files are distributed in large blocks for efficient reads and parallel access
  • 21. What is MapReduce? It is a powerful paradigm for parallel computation; Hadoop uses MapReduce to execute jobs on files in HDFS; Hadoop intelligently distributes computation over the cluster; Takes computation to data
  • 22. Origin: Functional Programming. map f [a, b, c] = [f(a), f(b), f(c)]; map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1, 4, 9]. Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument
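The functional `map` on this slide has a direct counterpart in Java streams. The sketch below wraps it in a helper with the same argument order as the slide; the class name `MapExample` is an illustrative assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Functional map: apply f to every item and collect the results in order.
final class MapExample {
    static <A, B> List<B> map(Function<A, B> f, List<A> items) {
        return items.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> squares = map(x -> x * x, Arrays.asList(1, 2, 3));
        System.out.println(squares); // [1, 4, 9]
    }
}
```

Because each item is transformed independently of the others, the calls to `f` can run on different machines, which is what makes the map phase easy to parallelize.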
  • 23. Origin: Functional Programming. reduce f [a, b, c] = f(a, b, c); reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, sum(NULL)))) = 14. Returns a single value obtained by combining the items of the list (the second argument) with a function (the first argument); The function can be identity (do nothing)
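The nested expansion sum(1, sum(4, sum(9, sum(NULL)))) is a right fold. A minimal Java sketch of it follows, with the class name `ReduceExample` and the explicit `identity` parameter (standing in for sum(NULL) = 0) as assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Functional reduce as a right fold: reduce f [a, b, c] = f(a, f(b, f(c, identity))).
final class ReduceExample {
    static <T> T reduce(BinaryOperator<T> f, T identity, List<T> items) {
        T acc = identity;
        // Walk from the last item to the first so the nesting matches the slide.
        for (int i = items.size() - 1; i >= 0; i--) {
            acc = f.apply(items.get(i), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        Integer total = reduce(Integer::sum, 0, Arrays.asList(1, 4, 9));
        System.out.println(total); // 14
    }
}
```

For an associative operation such as addition the fold order does not change the result, which is why partial sums can be computed on separate nodes and combined later.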
  • 24. Sum of squares example. Input: [1, 2, 3, 4]. MAPPER (M1 to M4): Sq(1), Sq(2), Sq(3), Sq(4). Intermediate output: 1, 4, 9, 16. REDUCER (R1): sum. Output: 30
  • 25. Sum of squares of even and odd. Input: [1, 2, 3, 4]. MAPPER (M1 to M4): Sq(1), Sq(2), Sq(3), Sq(4). Intermediate output: (odd, 1), (even, 4), (odd, 9), (even, 16). REDUCER (R1, R2): sum per key. Output: (even, 20), (odd, 10)
  • 26. Programming model: key, value pairs. Format of input and output: (key, value). Map: (k1, v1) → list (k2, v2). Reduce: (k2, list v2) → list (k3, v3)
  • 27. Sum of squares of even, odd and prime. Input: [1, 2, 3, 4]. Map: Sq(1), Sq(2), Sq(3), Sq(4). Intermediate output: (odd, 1); (even, 4), (prime, 4); (odd, 9), (prime, 9); (even, 16). Reducers: R1, R2, R3. Output: (even, 20), (odd, 10), (prime, 13)
  • 28. Many keys, many values. Format of input and output: (key, value). Map: (k1, v1) → list (k2, v2). Reduce: (k2, list v2) → list (k3, v3)
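The key-value model can be tried out without a cluster. The sketch below runs map, shuffle (group by key) and reduce inside a single JVM and applies it to the even/odd/prime sum of squares for the input [1, 2, 3, 4]. It is a teaching aid under stated assumptions, not the Hadoop API: `MiniMapReduce`, `run`, `squareByCategory` and `sum` are hypothetical names, and the input key k1 is omitted for brevity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Single-JVM sketch of the MapReduce programming model (not the Hadoop API).
final class MiniMapReduce {
    /** Map: v1 -> list (k2, v2); shuffle: group by k2; reduce: (k2, list v2) -> v3. */
    static <V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
            List<V1> input,
            Function<V1, List<Map.Entry<K2, V2>>> mapper,
            BiFunction<K2, List<V2>, V3> reducer) {
        // Map and shuffle: collect every emitted value under its key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (V1 record : input) {
            for (Map.Entry<K2, V2> kv : mapper.apply(record)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduce: one call per key, receiving all values for that key.
        Map<K2, V3> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    /** Mapper: emit the square of x under every category that x belongs to. */
    static List<Map.Entry<String, Integer>> squareByCategory(int x) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        int sq = x * x;
        out.add(new SimpleEntry<String, Integer>(x % 2 == 0 ? "even" : "odd", sq));
        if (isPrime(x)) {
            out.add(new SimpleEntry<String, Integer>("prime", sq));
        }
        return out;
    }

    /** Reducer: sum all values that share a key. */
    static int sum(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> result =
                run(Arrays.asList(1, 2, 3, 4), MiniMapReduce::squareByCategory, MiniMapReduce::sum);
        System.out.println(result); // {even=20, odd=10, prime=13}
    }
}
```

A single input record can emit several pairs (2 is both even and prime), and a single key can receive many values, which is the "many keys, many values" point of the slide.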
  • 29. Fibonacci sequence: f(n) = f(n-1) + f(n-2), e.g. f(5) = f(4) + f(3); 0, 1, 1, 2, 3, 5, 8, 13, … [Diagram: recursion tree expanding f(5) into f(4) and f(3), down to f(1) and f(0)] MapReduce will not work on this kind of calculation: there is no inter-process communication and no data sharing
  • 30. Input: 1 TB text file containing color names (Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey). Desired output: occurrences of the colors Blue and Green
  • 31. Each node N1 … Nn holds one chunk of the file (f.001 … f.00n), e.g. Blue Purple Blue Red Green Blue Maroon Green Yellow. MAPPER: grep -E 'Blue|Green' keeps only the matching lines (Blue Blue Green Blue Green). COMBINER: sort | uniq -c counts them per node (e.g. Blue=500 Green=200 on one node, Blue=420 Green=200 on another). REDUCER: awk '{arr[$2]+=$1} END{print arr["Blue"], arr["Green"]}' adds up the per-node counts (Blue=3000 Green=5500). Without a combiner, the reducer counts the mapper output directly: awk '{arr[$1]++} END{print arr["Blue"], arr["Green"]}'
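The same mapper, combiner and reducer roles can be written out in plain Java to show what the combiner saves. In this sketch each node filters and counts its own chunk, and only the small per-node count maps travel to the reducer. The class `ColorCount` and its method names are hypothetical; the sample chunk is the one shown on the slide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-JVM sketch of mapper + combiner + reducer for the color count.
final class ColorCount {
    /** Mapper and combiner for one node: keep only the wanted colors and count them locally. */
    static Map<String, Integer> countLocally(List<String> chunk, List<String> wanted) {
        Map<String, Integer> counts = new HashMap<>();
        for (String color : chunk) {
            if (wanted.contains(color)) {
                counts.merge(color, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Reducer: add up the partial counts sent by each node. */
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> wanted = Arrays.asList("Blue", "Green");
        List<String> chunk = Arrays.asList(
                "Blue", "Purple", "Blue", "Red", "Green", "Blue", "Maroon", "Green", "Yellow");
        // Two nodes that happen to hold identical chunks, as drawn on the slide.
        Map<String, Integer> node1 = countLocally(chunk, wanted);
        Map<String, Integer> nodeN = countLocally(chunk, wanted);
        Map<String, Integer> total = mergeCounts(Arrays.asList(node1, nodeN));
        System.out.println(total.get("Blue") + " " + total.get("Green")); // 6 4
    }
}
```

Each node sends two numbers instead of every matching line, so network traffic no longer grows with the size of the chunk.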
  • 32. MapReduce Overview. INPUT (input data) → MAP (works on a record) → SHUFFLE → REDUCE (works on output of Map) → OUTPUT
  • 33. MapReduce Overview. INPUT (input data) → MAP → COMBINE (works on output of Map) → REDUCE (works on output of Combiner) → OUTPUT
  • 35. MapReduce Summary: Mapper, reducer and combiner act on <key, value> pairs; Mapper gets one record at a time as input; Combiner (if present) works on the output of map; Reducer works on the output of map (or combiner, if present); Combiner can be thought of as a local reducer that reduces the output of maps executed on the same node
  • 36. What Hadoop is not: It is not a POSIX file system; It is not a SAN file system; It is not for interactive file access; It is not meant for a large number of small files, it is for a small number of large files; MapReduce cannot be used for any and all applications
  • 37. Hadoop: Take Home. Takes computation to data; Suitable for large data-centric operations; Scalable on demand; Fault tolerant and highly transparent
  • 38. Questions? Coming up next: First Hadoop program; Second Hadoop program
  • 39. Your first program in Hadoop. Open any tutorial on Hadoop and the first program you see will be word count. Task: given a text file, generate a list of words with the number of times each of them appears in the file. Input: plain text file. Expected output: <word, frequency> pairs for all words in the file. Example input: "hadoop is a framework written in java" and "hadoop supports parallel processing and is a simple framework". Example output: <hadoop, 2> <is, 2> <a, 2> <java, 1> <framework, 2> <written, 1> <in, 1> <and, 1> <supports, 1> <parallel, 1> <processing, 1> <simple, 1>
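Before writing word count against the Hadoop API, it can help to see the expected result computed in plain Java. The sketch below is a single-JVM stand-in, not a Hadoop job: the class `WordCountSketch` is hypothetical, words are split on whitespace, and the two input lines are the ones from the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java word count: conceptually, map emits (word, 1) and reduce sums per word.
final class WordCountSketch {
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // add 1 to this word's running total
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList(
                "hadoop is a framework written in java",
                "hadoop supports parallel processing and is a simple framework"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println("<" + e.getKey() + ", " + e.getValue() + ">");
        }
    }
}
```

In a real job the inner loop corresponds to the mapper and the `merge` call to the reducer, with the framework handling the grouping by word in between.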
  • 40. Your second program in Hadoop. Task: given a text file containing numbers, one per line, compute the sum of squares of the odd, even and prime numbers. Input: file containing integers, one per line. Expected output: <type, sum of squares> for odd, even, prime. Example input: 1 2 5 3 5 6 3 7 9 4. Example output: <odd, 199> <even, 56> <prime, 121>
  • 41. Your second program in Hadoop. File on HDFS (input values): 3, 9, 6, 2, 3, 7, 8. Map (square), output (key, value): 3 → <odd,9> <prime,9>; 9 → <odd,81>; 6 → <even,36>; 2 → <even,4> <prime,4>; 3 → <odd,9> <prime,9>; 7 → <odd,49> <prime,49>; 8 → <even,64>. Reducer (sum), input (key, list of values): prime: <9,4,9,49>; odd: <9,81,9,49>; even: <36,4,64>. Output (key, value): <odd,148> <even,104> <prime,71>
  • 42. Your second program in Hadoop
      Map (invoked on a record):
        void map(int x) {
          int sq = x * x;
          if (odd(x)) print("odd", sq);
          if (even(x)) print("even", sq);
          if (prime(x)) print("prime", sq);
        }
      Reduce (invoked on a key):
        void reduce(String key, List<Integer> l) {
          int sum = 0;
          for (int y : l) { sum += y; }
          print(key, sum);
        }
      Library functions:
        boolean odd(int x) { … }
        boolean even(int x) { … }
        boolean prime(int x) { … }
  • 43. Your second program in Hadoop: Map (invoked on a record)
        void map(int x) {
          int sq = x * x;
          if (odd(x)) print("odd", sq);
          if (even(x)) print("even", sq);
          if (prime(x)) print("prime", sq);
        }
  • 44. Your second program in Hadoop: Reduce (invoked on a key)
        void reduce(String key, List<Integer> l) {
          int sum = 0;
          for (int y : l) { sum += y; }
          print(key, sum);
        }
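  • 45. Your second program in Hadoop
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "OddEvenPrime");
          job.setJarByClass(OddEvenPrime.class);
          job.setMapperClass(OddEvenPrimeMapper.class);
          // The reducer doubles as the combiner because summing is associative.
          job.setCombinerClass(OddEvenPrimeReducer.class);
          job.setReducerClass(OddEvenPrimeReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setInputFormatClass(TextInputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
          FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not already exist
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }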
  • 46. Questions? Coming up next: More examples; Hadoop Ecosystem
  • 48. Example: Counting Fans. Problem: give crowd statistics; count fans supporting India and Pakistan
  • 49. Example: Counting Fans (counts shown: 45882 and 67917). Traditional way: central processing, where every fan comes to the centre and presses the India/Pak button. Issues: slow, with bottlenecks; only one processor; processing time is determined by the speed at which people (data) can move
  • 50. Example: Counting Fans. Hadoop way: Appoint processors per block (MAPPER); Send them to each block and ask them to send a signal for each person; A central processor aggregates the results (REDUCER); The processors can be made smart by asking them to aggregate locally and send only the aggregated value (COMBINER)
  • 51. Homework: Exit Polls 2014
  • 54. Who is using Hadoop: http://wiki.apache.org/hadoop/PoweredBy
  • 55. References: For understanding Hadoop
        Official Hadoop website: http://hadoop.apache.org/
        Hadoop presentation wiki: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile
        http://developer.yahoo.com/hadoop/
        http://wiki.apache.org/hadoop/
        http://www.cloudera.com/hadoop-training/
        http://developer.yahoo.com/hadoop/tutorial/module2.html#basics
  • 56. References: Setup and Installation
        Installing on Ubuntu:
        http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
        http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
        Installing on Debian:
        http://archive.cloudera.com/docs/_apt.html
  • 57. Further Reading
        Hadoop: The Definitive Guide, Tom White
        http://developer.yahoo.com/hadoop/tutorial/
        http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html