MapReduce Paradigm

  Dilip Reddy Kancharla
        Spring 2012
Outline
• Introduction
• Motivating example
• Hadoop
  – Hadoop MapReduce
  – HDFS
• Pros & Cons of MapReduce
• Hadoop Applicability to different workflows
• Conclusions and Future work
Critical MapReduce Execution Overview [DG08]
[Figure: the user program forks a master and a set of workers. The master assigns map tasks and reduce tasks to the workers. Map workers read key/value pairs from the input splits (Split 1 to Split 5, ...) and write their results to local disk. Reduce workers fetch those results by remote read and write the output files (Output file 1, Output file 2, ...). Stages, left to right: Input Files, Map Phase, Intermediate Operations, Reduce Phase, Output Files.]
MapReduce Paradigm
• Splits input files into blocks (typically 64 MB
  each)
• Operates on key/value pairs
• Mappers filter & transform input data
• Reducers aggregate the mappers' output
• Efficient way to process data on a cluster:
  – Move code to data
  – Run code on all machines

• Map (hash function)
     (k1, v1) → list(k2, v2)
• Reduce (aggregate function)
     (k2, list(v2)) → list(k3, v3)
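The two signatures above can be illustrated with a word count run entirely in memory. This is only a sketch: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the in-memory grouping stands in for what a real framework does across machines.

```python
from collections import defaultdict


def map_fn(offset, line):
    """(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)


def reduce_fn(word, counts):
    """(k2, list(v2)) -> list(k3, v3): sum the counts for one word."""
    yield (word, sum(counts))


def run_job(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reduce_fn(k2, groups[k2]))
    return output


records = [(0, "the quick fox"), (1, "the lazy dog")]
print(run_job(records, map_fn, reduce_fn))
```

Here the input key k1 is a line offset that the mapper ignores, which is a common choice for text input but still an assumption of this sketch.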
Advanced MapReduce
• Hadoop Streaming
  – Lets you run mappers and reducers written in
    other languages, such as Python or Ruby
• Chaining MapReduce jobs
• Joining data
• Bloom filters
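As a rough illustration of the Hadoop Streaming bullet above: a streaming mapper and reducer exchange plain text lines, conventionally `key<TAB>value`, and the reducer receives its lines sorted by key. The sketch below assumes that convention and uses word count as the job. The function names are hypothetical, and the final `sorted(...)` call only imitates the framework's sort step for a local dry run.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(value) for _, value in group)}"


# Local dry run: map, sort by key, reduce.
for out in reducer(sorted(mapper(["a b a", "b c"]))):
    print(out)
```

In an actual streaming job the two functions would live in separate scripts that read `sys.stdin` and print to standard output, passed to the streaming jar through its `-mapper` and `-reducer` options.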
Hadoop
• Open Source Implementation of MapReduce by
  Apache Software Foundation.
• Created by Doug Cutting.
• Derived from Google's MapReduce and Google
  File System (GFS) papers.
• Apache Hadoop is a software framework that
  supports data-intensive distributed applications
  under a free license.
• It enables applications to work with thousands of
  computationally independent computers and
  petabytes of data.
Hadoop Architecture
• Hadoop MapReduce
  – Single master node, many worker nodes
  – Client submits a job to master node
  – Master splits each job into map and reduce tasks,
    and assigns the tasks to worker nodes
• Hadoop Distributed File System (HDFS)
  – Single name node, many data nodes
  – Files stored as large, fixed-size (e.g. 64MB) blocks
  – HDFS typically holds map input and reduce output
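The block arithmetic is easy to check by hand. The sketch below assumes the 64 MB block size quoted above; `split_into_blocks` is a hypothetical helper, not an HDFS API. Note that the final block of a file is usually shorter than the fixed size.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the example size from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)
print(len(blocks), "blocks; last block is", blocks[-1][1] // MB, "MB")
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block.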

Hadoop Architecture
[Figure: the Namenode, Secondary Namenode, and JobTracker sit at the top. Below them, each worker machine runs a Data node and a TaskTracker, and each TaskTracker hosts several Map tasks and a Reduce task.]
Job Scheduling in Hadoop
• One map task for each block of the input file
  – Applies user-defined map function to each record in
    the block
  – Record = <key, value>
• User-defined number of reduce tasks
  – Each reduce task is assigned a set of record groups
  – For each group, apply user-defined reduce function to
    the record values in that group
• Reduce tasks read from every map task
  – Each read returns the record groups for that reduce
    task
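One way to picture how record groups reach reduce tasks is a hash partitioner: every map task routes a key to the same reduce task, so all values for that key meet in one place. The sketch below is an assumption-laden illustration. It uses `zlib.crc32` as a stand-in hash because Python's built-in `hash` for strings varies between runs, and `partition` and `shuffle` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reduce task index; the same key always gets the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Collect (key, value) pairs from every map task into per-reducer groups."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for task_output in map_outputs:  # each reduce task reads from every map task
        for key, value in task_output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
for index, groups in enumerate(shuffle(map_outputs, num_reducers=2)):
    print(index, dict(groups))
```

Which reduce task receives which key depends on the hash, so the printed assignment is not meaningful in itself. What matters is that each key lands in exactly one group.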
Dataflow in Hadoop
• Map tasks write their output to local disk
  – Output available after map task has completed
• Reduce tasks write their output to HDFS
  – Once job is finished, next job’s map tasks can be
    scheduled, and will read input from HDFS
• Therefore, fault tolerance is simple: re-run
  tasks on failure
  – No consumers see partial operator output
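The re-run strategy can be sketched in a few lines. This is a simplification under stated assumptions: `run_task_with_retries` is a hypothetical helper, the `published` dictionary stands in for output that becomes visible to consumers, and the limit of four attempts is only an illustrative default.

```python
def run_task_with_retries(task, published, name, max_attempts=4):
    """Run task() until it succeeds; publish its output only after success.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()  # may fail partway through
        except Exception:
            continue  # discard any partial work and simply run the task again
        published[name] = result  # consumers only ever see complete output
        return attempt
    raise RuntimeError(f"task {name} failed {max_attempts} times")


calls = {"n": 0}


def flaky_map_task():
    """Fails on the first two attempts, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("worker lost")
    return ["a\t2", "b\t1"]


published = {}
attempts = run_task_with_retries(flaky_map_task, published, "map-0")
print(attempts, published)
```

Because a task's output is published only once the task completes, a failed attempt leaves nothing behind that a consumer could have read.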
Dataflow in Hadoop [CAHER10]
[Figure: a job is submitted, and the map and reduce tasks are scheduled.]
Dataflow in Hadoop [CAHER10]
[Figure: the map tasks read the input file from HDFS, one block (Block 1, Block 2) per map task.]
Dataflow in Hadoop [CAHER10]
[Figure: each map task writes its output to the local file system (Local FS); the reduce tasks fetch it with HTTP GET.]
Dataflow in Hadoop [CAHER10]
[Figure: the reduce tasks write the final answer to HDFS.]
HDFS
• Data is distributed and replicated over
  multiple machines.
• Files are not stored contiguously on one server;
  they are broken up into blocks.
• Designed for large files (large means GB or TB)
• Block oriented
• Linux-style commands (e.g. ls, cp, mkdir, mv)
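To make "distributed and replicated" concrete, the sketch below spreads the blocks of one file over several data nodes, with each block stored on more than one node. It is deliberately naive. The round-robin rule is an assumption for illustration and is not HDFS's actual placement policy, and the replication factor of 3 and the node names are likewise illustrative.

```python
def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
for block, replicas in place_replicas(2, nodes).items():
    print(f"block {block}: {replicas}")
```

With every block held by several nodes, losing a single machine does not make any block unavailable.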

Recommended for you

Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop

Hadoop, flexible and available architecture for large scale computation and data processing on a network of commodity hardware.

hadoophbasehive
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt

The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.

zookeeperhivehadoop
An Introduction to Azure IaaS
An Introduction to Azure IaaSAn Introduction to Azure IaaS
An Introduction to Azure IaaS

Slides from AIS and Microsoft's half-day session on the recently-announced Windows Azure Infrastructure as a Service (IaaS) offering. After a brief overview of the Azure Platform as a Service (PaaS) model, we will focus on key IaaS concepts. Additionally, we will walk you through a number of scenarios enabled by Azure IaaS and several demonstrations. Agenda: Overview of Windows Azure Platform Azure IaaS Why IaaS? IaaS Core Concepts Supported Applications Azure Virtual Machines Disk Mobility VM export / Import Availability Azure Virtual Network

azure iaasmicrosoft windowswindows azure
Different Workflows [MTAGS11]
Hadoop Applicability by Workflow [MTAGS11]

  Score meaning:
  • A score of 0 means Hadoop is easily adaptable to the workflow
  • A score of 0.5 means Hadoop is moderately adaptable to the
    workflow
  • A score of 1 marks a workflow area where Hadoop needs
    improvement
Relative Merits and Demerits of Hadoop over DBMS
Pros
• Fault tolerance
• Self-healing: rebalances files across the cluster
• Highly scalable
• Highly flexible, as it does not depend on any
  data model or schema
Cons
• No high-level language like SQL in DBMS
• No schema and no index
• Low efficiency
• Very young (since 2004) compared to over 40 years
  of DBMS

Hadoop                          Relational
Scale out (add more machines)   Scaling is difficult
Key/value pairs                 Tables
Say how to process the data     Say what you want (SQL)
Offline/batch                   Online/real-time
Conclusions and Future Work
• MapReduce is easy to program
• Hadoop=HDFS+MapReduce
• Distributed, Parallel processing
• Designed for fault tolerance and high scalability
• MapReduce is unlikely to replace DBMS in
  data warehousing; instead, we expect the two to
  complement each other and help in analyzing
  patterns in scientific data
• Finally, efficiency, and especially I/O cost, needs
  to be addressed for successful implementations

References
[LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, “Parallel data processing with MapReduce: a survey,” SIGMOD, January 2012, pp. 11-20.
[MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and Lavanya Ramakrishnan, “Riding the Elephant: Managing Ensembles with Hadoop,” Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers, ACM, New York, NY, USA, pp. 49-58.
[DG08] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified data processing on large clusters,” ACM, January 2008, pp. 107-113.
[CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears, “MapReduce online,” Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10), USENIX Association, Berkeley, CA, USA, 2010, pp. 21-37.
Thank You!



Questions?

Principles of Roods Approach!!!!!!!.pptx
 
L1 L2- NLC PPT for Grade 10 intervention
L1 L2- NLC PPT for Grade 10 interventionL1 L2- NLC PPT for Grade 10 intervention
L1 L2- NLC PPT for Grade 10 intervention
 
The basics of sentences session 9pptx.pptx
The basics of sentences session 9pptx.pptxThe basics of sentences session 9pptx.pptx
The basics of sentences session 9pptx.pptx
 
Credit limit improvement system in odoo 17
Credit limit improvement system in odoo 17Credit limit improvement system in odoo 17
Credit limit improvement system in odoo 17
 
Unlocking Educational Synergy-DIKSHA & Google Classroom.pptx
Unlocking Educational Synergy-DIKSHA & Google Classroom.pptxUnlocking Educational Synergy-DIKSHA & Google Classroom.pptx
Unlocking Educational Synergy-DIKSHA & Google Classroom.pptx
 
AI_in_HR_Presentation Part 1 2024 0703.pdf
AI_in_HR_Presentation Part 1 2024 0703.pdfAI_in_HR_Presentation Part 1 2024 0703.pdf
AI_in_HR_Presentation Part 1 2024 0703.pdf
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 
How to Handle the Separate Discount Account on Invoice in Odoo 17
How to Handle the Separate Discount Account on Invoice in Odoo 17How to Handle the Separate Discount Account on Invoice in Odoo 17
How to Handle the Separate Discount Account on Invoice in Odoo 17
 

MapReduce Paradigm

  • 1. MapReduce Paradigm Dilip Reddy Kancharla Spring 2012
  • 2. Outline • Introduction • Motivating example • Hadoop – Hadoop MapReduce – HDFS • Pros & Cons of MapReduce • Hadoop Applicability to different workflows • Conclusions and Future work
  • 3. Critical MapReduce Execution Overview [DG08] (figure: the user program forks a master and several workers; the master assigns map and reduce tasks; map workers read input splits 1 to 5 and write intermediate key/value pairs to local disk; reduce workers read them remotely and write output files 1 and 2. Stages: Input Files, Map Phase, Intermediate Operations, Reduce Phase, Output Files)
  • 4. MapReduce Paradigm • Splits input files into blocks (typically 64 MB each) • Operates on key/value pairs • Mappers filter & transform input data • Reducers aggregate the mappers' output • Efficient way to use the cluster: – Move code to data – Run code on all machines
  • 5. • Map (hash function): (k1, v1) → list(k2, v2) • Reduce (aggregate function): (k2, list(v2)) → list(k3, v3)
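The map and reduce signatures on this slide can be illustrated with a small in-memory sketch. It assumes a word-count job; the helper `run_mapreduce`, the sample lines, and the single-process execution are illustrative assumptions, not part of the slides or of Hadoop.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). Here k1 is a line number, v1 a line of text."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (k2, list(v2)) -> list of (k3, v3). Here it sums the counts of one word."""
    return [(key, sum(values))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

A real framework runs the map calls and the reduce calls on different machines; only the two user-defined functions keep the same shape.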
  • 6. Advanced MapReduce • Hadoop Streaming – Lets you run mappers and reducers written in other languages such as Python, Ruby, etc. • Chaining MapReduce jobs • Joining data • Bloom filters
  • 7. Hadoop • Open-source implementation of MapReduce by the Apache Software Foundation. • Created by Doug Cutting. • Derived from Google's MapReduce and Google File System (GFS) papers. • Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. • It enables applications to work with thousands of computationally independent computers and petabytes of data.
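As a rough illustration of the Hadoop Streaming idea, the sketch below writes a word-count mapper and reducer as plain text filters that exchange tab-separated `key<TAB>value` lines. In an actual Streaming job each function would typically live in its own script and read standard input; here both are ordinary generator functions and the pipeline is simulated locally, which is an assumption made so the example is self-contained.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word of the input text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. The input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the pipeline: input | mapper | sort | reducer
sample = ["hadoop streams text", "text in text out"]
streaming_result = list(reducer(sorted(mapper(sample))))
print(streaming_result)
```

The reducer relies on sorted input because it only compares adjacent keys; the `sorted(...)` call plays the role of the framework's sort between the two stages.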
  • 8. Hadoop Architecture • Hadoop MapReduce – Single master node, many worker nodes – Client submits a job to master node – Master splits each job into tasks (MapReduce), and assigns tasks to worker nodes • Hadoop Distributed File System (HDFS) – Single name node, many data nodes – Files stored as large, fixed-size (e.g. 64MB) blocks – HDFS typically holds map input and reduce output
  • 9. Hadoop Architecture (figure: a Namenode with a Secondary Namenode, and a JobTracker; three worker machines, each running a Data node and a TaskTracker; each TaskTracker runs several Map and Reduce tasks)
  • 10. Job Scheduling in Hadoop • One map task for each block of the input file – Applies user-defined map function to each record in the block – Record = <key, value> • User-defined number of reduce tasks – Each reduce task is assigned a set of record groups – For each group, apply user-defined reduce function to the record values in that group • Reduce tasks read from every map task – Each read returns the record groups for that reduce task
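The assignment of record groups to reduce tasks can be sketched with a hash partitioner: every key is mapped to exactly one reduce task, so each reduce task collects its groups from every map task's output. The use of `zlib.crc32` as the hash, the function names, and the sample records are assumptions for illustration only.

```python
import zlib
from collections import defaultdict


def partition(key, num_reduce_tasks):
    """Pick the reduce task for a key. crc32 is used because it is stable across
    runs, unlike Python's built-in hash() for strings."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def shuffle(map_outputs, num_reduce_tasks):
    """Return {reduce_task_id: {key: [values]}} built from every map task's records."""
    assignments = defaultdict(lambda: defaultdict(list))
    for records in map_outputs:  # one list of (key, value) records per map task
        for key, value in records:
            assignments[partition(key, num_reduce_tasks)][key].append(value)
    return assignments


map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
assignments = shuffle(map_outputs, num_reduce_tasks=2)
for task_id in sorted(assignments):
    print(task_id, dict(assignments[task_id]))
```

Because the partition depends only on the key, both occurrences of `"a"` end up in the same record group even though they came from different map tasks.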
  • 11. Dataflow in Hadoop • Map tasks write their output to local disk – Output available after map task has completed • Reduce tasks write their output to HDFS – Once job is finished, next job’s map tasks can be scheduled, and will read input from HDFS • Therefore, fault tolerance is simple: simply re-run tasks on failure – No consumers see partial operator output
  • 12. Dataflow in Hadoop [CAHER10] (figure: a job is submitted and its map and reduce tasks are scheduled)
  • 13. Dataflow in Hadoop [CAHER10] (figure: map tasks read the input file, Block 1 and Block 2, from HDFS)
  • 14. Dataflow in Hadoop [CAHER10] (figure: map tasks write their output to the local file system; reduce tasks fetch it with HTTP GET)
  • 15. Dataflow in Hadoop [CAHER10] (figure: reduce tasks write the final answer to HDFS)
  • 16. HDFS • Data is distributed and replicated over multiple machines. • Files are not stored contiguously on servers; they are broken up into blocks. • Designed for large files (large means GB or TB) • Block oriented • Linux-style commands (e.g. ls, cp, mkdir, mv)
  • 18. Hadoop Applicability by Workflow [MTAGS11] Score meaning: • A score of 0 implies Hadoop is easily adaptable to the workflow • A score of 0.5 implies Hadoop is moderately adaptable to the workflow • A score of 1 indicates a potential workflow area where Hadoop needs improvement
  • 19. Relative Merits and Demerits of Hadoop Over DBMS • Pros: – Fault tolerance – Self-healing: rebalances files across the cluster – Highly scalable – Highly flexible, as it does not have any dependency on data model and schema • Cons: – No high-level language like SQL in DBMS – No schema and no index – Low efficiency – Very young (since 2004) compared to over 40 years of DBMS • Hadoop vs. relational: – Scale out (add more machines) vs. scaling is difficult – Key/value pairs vs. tables – Say how to process the data vs. say what you want (SQL) – Offline/batch vs. online/real-time
  • 20. Conclusions and Future Work • MapReduce is easy to program • Hadoop = HDFS + MapReduce • Distributed, parallel processing • Designed for fault tolerance and high scalability • MapReduce is unlikely to replace DBMSs in data warehousing; instead, we expect them to complement each other and help in the analysis of scientific data patterns • Finally, efficiency, and especially I/O cost, needs to be addressed for successful implementations
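To make "broken up into blocks" and "replicated over multiple machines" concrete, here is a small sketch that splits a file into fixed-size blocks and assigns each block to several nodes. The round-robin placement, the replication factor of 3, and the node names are assumptions chosen for illustration; this is not HDFS's actual placement policy.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs covering the file; the last block may be short."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication):
    """Illustrative round-robin placement: block i is stored on `replication`
    distinct nodes (requires replication <= len(nodes))."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200 * MB, 64 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], 3)
for block_id, (offset, length) in enumerate(blocks):
    print(block_id, offset // MB, length // MB, placement[block_id])
```

With this layout no single node holds the whole 200 MB file, and losing one node still leaves two copies of every block.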
  • 21. References [LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, “Parallel data processing with MapReduce: a survey,” SIGMOD Record, January 2012, pp. 11-20. [MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and Lavanya Ramakrishnan, “Riding the Elephant: Managing Ensembles with Hadoop,” Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, ACM, New York, NY, USA, pp. 49-58. [DG08] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, January 2008, pp. 107-113. [CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears, “MapReduce online,” Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10), USENIX Association, Berkeley, CA, USA, 2010, pp. 21-37.

Editor's Notes

  1. If distributed computing is so hard, do we need it?
  2. Run the code on the machines that hold the data, unlike conventional systems, where we move the data to the code, process it, and then store the results back.
  3. - Out of the scope of papers
  4. The master (JobTracker) is responsible for splitting each job into tasks and assigning those tasks to worker nodes. Each worker runs a TaskTracker process that manages the execution of the tasks currently assigned to that node. Each TaskTracker has a fixed number of slots for executing tasks. Each map task is assigned a portion of the input file called a split. By default, a split contains a single HDFS block, so the total number of file blocks determines the number of map tasks.
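The arithmetic in this note can be worked through in a few lines. The sketch below assumes one split per block, as the note states for the default case; the file size, worker count, and slots per worker are made-up numbers, and the notion of "waves" (rounds needed when there are more tasks than slots) is added here only to show why the fixed slot count matters.

```python
import math

MB = 1024 * 1024


def num_map_tasks(file_size_bytes, block_size_bytes):
    """One map task per split, one split per block."""
    return math.ceil(file_size_bytes / block_size_bytes)


def num_waves(tasks, workers, slots_per_worker):
    """Rounds needed if every slot runs one task at a time."""
    return math.ceil(tasks / (workers * slots_per_worker))


map_tasks = num_map_tasks(1024 * MB, 64 * MB)  # a 1 GB input file, 64 MB blocks
waves = num_waves(map_tasks, workers=4, slots_per_worker=2)
print(map_tasks, waves)
```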
  5. Reducers begin processing data as soon as it is produced by mappers, so they can generate and refine an approximation of their final answer during the course of execution. MapReduce jobs can run continuously, accepting new data as it arrives and analyzing it immediately. This allows MapReduce to be used for applications such as event monitoring and stream processing. DataNode: stores the actual file blocks on disk, not entire files; reports block information to the NameNode and receives instructions from it. Secondary NameNode: keeps a snapshot of the NameNode; it is not a failover server for the NameNode, but it helps minimize downtime and data loss if the NameNode fails. JobTracker: partitions tasks across the cluster, tracks MapReduce tasks, and restarts failed tasks on different nodes. TaskTracker: does the task processing and logs every event.
  6. The input to a job is an input specification that is in key-value pairs. Each job consists of two stages: first, a user-defined map function is applied to each input record to produce a list of intermediate key-value pairs. Second, a user-defined reduce function is called once for each distinct key in the map output and passed the list of intermediate values associated with that key. A reduce task has three steps. The shuffle phase: each reduce task is assigned a partition of the key range produced by the map step, so the reduce task must fetch the content of this partition from every map task’s output. The sort phase: records with the same key are grouped together. Finally, the user-defined reduce function is applied.
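The three steps of a reduce task described in this note can be sketched as follows. The function `run_reduce_task`, the sample partitions, and the choice of "maximum value per key" as the reduce function are illustrative assumptions; a real reduce task fetches its partition over the network rather than from in-memory lists.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(fetched_partitions, reduce_fn):
    # Shuffle: gather this task's partition from every map task's output.
    records = [record for part in fetched_partitions for record in part]
    # Sort: bring records with the same key next to each other.
    records.sort(key=itemgetter(0))
    # Reduce: call the user-defined function once per distinct key.
    output = []
    for key, group in groupby(records, key=itemgetter(0)):
        output.append(reduce_fn(key, [value for _, value in group]))
    return output


# One list per map task, already restricted to this reduce task's key range.
fetched = [[("x", 2), ("y", 1)], [("x", 5)], [("y", 4), ("x", 1)]]
reduced = run_reduce_task(fetched, lambda key, values: (key, max(values)))
print(reduced)
```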
  7. The buffer content is written to the local file system as an index file and a data file. The index file is used to locate each partition within the data file; the data file contains only the records, which are sorted by key within each partition segment. A reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. A reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
  8. The map phase reads the task’s split (its HDFS block) from HDFS, parses it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to each input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has finished executing.
  9. A reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. Note that a reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
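A rough sketch of the index-file and data-file layout described in this note: records are sorted by partition and then by key, and the index records where each partition's segment starts and how many records it holds. Hadoop works with byte offsets in real files; the in-memory lists, the list positions used as offsets, and the toy `by_length` partition function are simplifying assumptions.

```python
def build_spill(records, num_partitions, partition_fn):
    """Return (index, data): data is sorted by (partition, key); index maps
    partition -> (start position, record count) within data."""
    data = sorted(
        records, key=lambda kv: (partition_fn(kv[0], num_partitions), kv[0])
    )
    index = {}
    for position, (key, _) in enumerate(data):
        part = partition_fn(key, num_partitions)
        start, count = index.get(part, (position, 0))
        index[part] = (start, count + 1)
    return index, data


def read_partition(index, data, part):
    """What a reduce task would fetch: only its own segment of the data."""
    start, count = index.get(part, (0, 0))
    return data[start:start + count]


def by_length(key, num_partitions):
    """Toy partition function used only for this example."""
    return len(key) % num_partitions


index, data = build_spill([("bb", 1), ("a", 2), ("dd", 3), ("c", 4)], 2, by_length)
print(index)
print(read_partition(index, data, 1))
```

The index lets a reduce task read its segment directly instead of scanning the whole map output.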
  10. In this design, the output of both map and reduce tasks is written to disk before it can be consumed. This is particularly expensive for reduce tasks, because their output is written to HDFS. Output materialization simplifies fault tolerance, because it reduces the amount of state that must be restored to consistency after a node failure. If any task (either map or reduce) fails, the JobTracker simply schedules a new task to perform the same work as the failed task.
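The recovery rule in this note, scheduling a new task that repeats the failed task's work, can be sketched as a retry loop. The `FlakyTask` class, the use of `RuntimeError` to simulate a node failure, and the limit of four attempts are assumptions made for the example. The key property is that the task's input is still available unchanged, so a rerun produces the same output.

```python
class FlakyTask:
    """A task that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def __call__(self, numbers):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated node failure")
        return sum(numbers)


def run_with_retries(task, task_input, max_attempts=4):
    """Re-run the same work on the same input until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except RuntimeError:
            continue  # no partial output was published, so simply try again
    raise RuntimeError(f"task failed after {max_attempts} attempts")


result, attempts = run_with_retries(FlakyTask(failures=2), [1, 2, 3])
print(result, attempts)
```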
  11. It was possible to implement all patterns in the framework, but the level of difficulty varied. This evaluation helps in identifying whether an application's workflow will be suitable to run in the MapReduce framework.
  12. Fault tolerant: when a node fails, the data is still available thanks to high data replication. Scalable: just by adding nodes, we can process as much data as we want. Low efficiency: with fault tolerance and scalability as its primary goals, MapReduce operations are not always optimized for I/O efficiency. Also, map and reduce are blocking operations.
  13. Easy, since it hides the implementation details of parallelization, fault tolerance, locality optimization, and load balancing. Horizontal scale-out helps in processing as much data as we want, simply by adding as many nodes as we want.