This document provides a high-level overview of MapReduce and Hadoop. It begins with an introduction to MapReduce, describing it as a distributed computing framework that decomposes work into parallelized map and reduce tasks. Key concepts like mappers, reducers, and job tracking are defined. The structure of a MapReduce job is then outlined, showing how input is divided and processed by mappers, then shuffled and sorted before being combined by reducers. Example map and reduce functions for a word counting problem are presented to demonstrate how a full MapReduce job works.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a Master and Slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. The NameNode tracks locations of data blocks and regulates access to files, while DataNodes store file blocks and manage read/write operations as directed by the NameNode. HDFS provides high-performance, scalable access to data across large Hadoop clusters.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
This document provides an overview of Apache Sqoop, a tool for transferring bulk data between Apache Hadoop and structured data stores like relational databases. It describes how Sqoop can import data from external sources into HDFS or related systems, and export data from Hadoop to external systems. The document also demonstrates how to use basic Sqoop commands to list databases and tables, import and export data between MySQL and HDFS, and perform updates during export.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (aka RAID at a server level).
What is Hadoop?
The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling this load. Then your company starts growing very quickly, and that data grows to 10GB. And then 100GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10TB, and then 100TB, you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is also not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
ACID properties
Atomicity, Consistency, Isolation, Durability
Transactions should possess several properties, often called the ACID properties; they should be enforced by the concurrency control and recovery methods of the DBMS.
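To make the atomicity part of ACID concrete, here is a minimal JDBC sketch in Java; the connection URL, table and column names are hypothetical and are not taken from any of the documents summarised here.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal sketch of an atomic transfer: either both updates commit together,
// or the whole transaction is rolled back. URL, table and columns are hypothetical.
public class AtomicTransferSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:bank", "sa", "")) {
            conn.setAutoCommit(false);   // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();
                conn.commit();           // durability: both changes persist together
            } catch (SQLException e) {
                conn.rollback();         // atomicity: undo any partial work
                throw e;
            }
        }
    }
}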
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
This document provides an introduction to Hadoop and HDFS. It defines big data and Hadoop, describing how Hadoop uses a scale-out approach to distribute data and processing across clusters of commodity servers. It explains that HDFS is the distributed file system of Hadoop, which splits files into blocks and replicates them across multiple nodes for reliability. HDFS is optimized for large streaming reads and writes of large files. The document also gives an overview of the Hadoop ecosystem and common Hadoop distributions.
The document provides an introduction to MapReduce, describing its motivation as a framework for simplifying large-scale data processing across distributed systems. It outlines MapReduce's programming model and main features, including automatic parallelization, fault tolerance, and locality. The document also provides a detailed example of counting letter frequencies in a large file to illustrate how MapReduce works.
These are the slides from my presentation on Running R in the Database using Oracle R Enterprise. The second half of the presentation is a live demo of using the Oracle R Enterprise. Unfortunately the demo is not listed in these slides
Predictive analytics: Mining gold and creating valuable product
My presentation about building predictive analytics and machine learning solutions. Presented using a number of real world projects that I've worked on over the past couple of years
MapReduce provides a programming model for processing large datasets in a distributed, parallel manner. It involves two main steps - the map step where the input data is converted into intermediate key-value pairs, and the reduce step where the intermediate outputs are aggregated based on keys to produce the final results. Hadoop is an open-source software framework that allows distributed processing of large datasets across clusters of computers using MapReduce.
This document provides an introduction to using MapReduce with MongoDB. It explains what MapReduce is, how it works, and provides examples of mapping and reducing sample data to calculate applications by state, applications by status and state, and average wages by visa class and status. It also discusses some limitations and considerations when using MapReduce with MongoDB.
OUG Ireland Meet-up - Updates from Oracle Open World 2016
OUG Ireland meet-up held on 20th October 2016, with presentations on updates from Oracle Open World 2016, covering Tech/Database, Big Data, Analytics, and Oracle Cloud.
The document provides an overview of the Hadoop Distributed File System (HDFS). It describes HDFS's master-slave architecture with a single NameNode master and multiple DataNode slaves. The NameNode manages filesystem metadata and data placement, while DataNodes store data blocks. The document outlines HDFS components like the SecondaryNameNode, DataNodes, and how files are written and read. It also discusses high availability solutions, operational tools, and the future of HDFS.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
The Big Data and Hadoop training course is designed to provide the knowledge and skills to become a successful Hadoop Developer. In-depth coverage of concepts such as the Hadoop Distributed File System, setting up the Hadoop cluster, Map-Reduce, PIG, HIVE, HBase, ZooKeeper, SQOOP, etc. will be included in the course.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
MapReduce is a framework for processing large datasets in a distributed manner. It involves two functions: map and reduce. The map function processes individual elements to generate intermediate key-value pairs, and the reduce function merges all intermediate values with the same key. Hadoop is an open-source implementation of MapReduce that uses HDFS for storage. A typical MapReduce job in Hadoop involves defining map and reduce functions, configuring the job, and submitting it to the JobTracker which schedules tasks across nodes and monitors execution.
This document provides an introduction to MapReduce and Disco, an open source implementation of MapReduce in Erlang and Python. It explains the motivation for MapReduce frameworks like Google's in addressing the need to process massive amounts of data across large clusters reliably. The core concepts of MapReduce are described, including how the input is split and mapped in parallel, intermediate key-value pairs are grouped and reduced, and the final output is produced. An example word counting algorithm demonstrates how a problem can be solved using MapReduce.
This slide deck gives simple and purposeful knowledge about popular Hadoop platforms.
From a simple definition to the importance of Hadoop in the modern era, the presentation also introduces Hadoop service providers along with its core components.
Do go through it once and comment below with your feedback. I am sure that this slide deck will help many in presenting the basics of Hadoop for their projects or business purposes.
The crisp information has been compiled after going through detailed information available on the internet as well as in research papers.
Hadoop DFS consists of HDFS for storage and MapReduce for processing. HDFS provides massive storage, fault tolerance through data replication, and high throughput access to data. It uses a master-slave architecture with a NameNode managing the file system namespace and DataNodes storing file data blocks. The NameNode ensures data reliability through policies that replicate blocks across racks and nodes. HDFS provides scalability, flexibility and low-cost storage of large datasets.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for distributed storage and fault tolerance, YARN for resource management, and MapReduce for parallel processing of large datasets. It provides details on the architecture of HDFS including the name node, data nodes, and clients. It also explains the MapReduce programming model and job execution involving map and reduce tasks. Finally, it states that as data volumes continue rising, Hadoop provides an affordable solution for large-scale data handling and analysis through its distributed and scalable architecture.
The document appears to be a presentation about Oracle's R technologies and how they address challenges with the R programming language. It discusses Oracle R Distribution, Oracle R Enterprise, Oracle R Advanced Analytics for Hadoop, and ROracle. It also covers how Oracle has added capabilities for embedded R execution in the Oracle Database using SQL, including functions like rqEval and rqScriptCreate that allow running R scripts and accessing database contents directly from R.
The document discusses key concepts related to Hadoop including its components like HDFS, MapReduce, Pig, Hive, and HBase. It provides explanations of HDFS architecture and functions, how MapReduce works through map and reduce phases, and how higher-level tools like Pig and Hive allow for more simplified programming compared to raw MapReduce. The summary also mentions that HBase is a NoSQL database that provides fast random access to large datasets on Hadoop, while HCatalog provides a relational abstraction layer for HDFS data.
If we are interested in performing Big Data analytics, we need to learn Hadoop to perform operations with Hadoop MapReduce. In this presentation, we will discuss what MapReduce is, why it is necessary, how MapReduce programs can be developed through Apache Hadoop, and more.
This document discusses MapReduce and how it can be used to parallelize a word counting task over large datasets. It explains that MapReduce programs have two phases - mapping and reducing. The mapping phase takes input data and feeds each element to mappers, while the reducing phase aggregates the outputs from mappers. It also describes how Hadoop implements MapReduce by splitting files into splits, assigning splits to mappers across nodes, and using reducers to aggregate the outputs.
Hadoop/MapReduce is an open source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce, a programming model where input data is processed by "map" functions in parallel, and results are combined by "reduce" functions, to process and generate outputs from large amounts of data and nodes. The core components are the Hadoop Distributed File System for data storage, and the MapReduce programming model and framework. MapReduce jobs involve mapping data to intermediate key-value pairs, shuffling and sorting the data, and reducing to output results.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through HDFS and distributed processing via MapReduce. HDFS handles storage and MapReduce provides a programming model for parallel processing of large datasets across a cluster. The MapReduce framework consists of a mapper that processes input key-value pairs in parallel, and a reducer that aggregates the output of the mappers by key.
Advanced MapReduce - Apache Hadoop Big Data training by Design Pathshala
Learn Hadoop and Bigdata Analytics, Join Design Pathshala training programs on Big data and analytics.
This slide deck covers the advanced MapReduce concepts of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
L19CloudMapReduce introduction for cloud computing .ppt
This document provides an overview of cloud computing with MapReduce and Hadoop. It discusses what cloud computing and MapReduce are, how they work, and examples of applications that use MapReduce. Specifically, MapReduce is introduced as a programming model for large-scale data processing across thousands of machines in a fault-tolerant way. Example applications like search, sorting, inverted indexing, finding popular words, and numerical integration are described. The document also outlines how to get started with Hadoop and write MapReduce jobs in Java.
Hadoop ecosystem with MapReduce, Hive and Pig
This document provides an overview of MapReduce architecture and components. It discusses how MapReduce processes data using map and reduce tasks on key-value pairs. The JobTracker manages jobs by scheduling tasks on TaskTrackers. Data is partitioned and sorted during the shuffle and sort phase before being processed by reducers. Components like Hive, Pig, partitions, combiners, and HBase are described in the context of how they integrate with and optimize MapReduce processing.
The document provides interview questions and answers related to Hadoop. It discusses common InputFormats in Hadoop like TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat. It also describes concepts like InputSplit, RecordReader, partitioner, combiner, job tracker, task tracker, jobs and tasks relationship, debugging Hadoop code, and handling lopsided jobs. HDFS, its architecture, replication, and reading files from HDFS is also covered.
Hadoop and MapReduce for .NET User Group
This document provides an introduction to Hadoop and MapReduce. It discusses big data characteristics and challenges. It provides a brief history of Hadoop and compares it to RDBMS. Key aspects of Hadoop covered include the Hadoop Distributed File System (HDFS) for scalable storage and MapReduce for scalable processing. MapReduce uses a map function to process key-value pairs and generate intermediate pairs, and a reduce function to merge values by key and produce final results. The document demonstrates MapReduce through an example word count program and includes demos of implementing it on Hortonworks and Azure HDInsight.
Learning Objectives - In this module, you will understand the Hadoop MapReduce framework and how MapReduce works on data stored in HDFS. You will also learn about the different types of input and output formats in the MapReduce framework and their usage.
This document summarizes a proposal to improve fault tolerance in Hadoop clusters. It proposes adding a "Backup" state to store intermediate MapReduce data, so reducers can continue working even if mappers fail. It also proposes a "supernode" protocol where neighboring slave nodes communicate task information. If one node fails, a neighbor can take over its tasks without involving the JobTracker. This would improve fault tolerance by allowing computation to continue locally between nodes after failures.
Hadoop is an open-source framework that uses clusters of commodity hardware to store and process big data using the MapReduce programming model. It consists of four main components: MapReduce for distributed processing, HDFS for storage, YARN for resource management and scheduling, and common utilities. HDFS stores large files as blocks across nodes for fault tolerance. MapReduce jobs are split into map and reduce phases to process data in parallel. YARN schedules resources and manages job execution. The common utilities provide libraries and scripts used by all Hadoop components. Major companies use Hadoop to analyze large amounts of data.
This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
This document provides an introduction to MapReduce and Hadoop, including an overview of computing PageRank using MapReduce. It discusses how MapReduce addresses challenges of parallel programming by hiding details of distributed systems. It also demonstrates computing PageRank on Hadoop through parallel matrix multiplication and implementing custom file formats.
Introduction to the Map-Reduce framework.pdf
The document provides an introduction to the MapReduce programming model and framework. It describes how MapReduce is designed for processing large volumes of data in parallel by dividing work into independent tasks. Programs are written using functional programming idioms like map and reduce operations on lists. The key aspects are:
- Mappers process input records in parallel, emitting (key, value) pairs.
- A shuffle/sort phase groups values by key and sends them to the same reducer.
- Reducers process grouped values to produce final output, aggregating as needed.
- This allows massive datasets to be processed across a cluster in a fault-tolerant way.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second, manage petabytes of data, and scale relational database workloads in Aurora beyond the limits of a single Aurora writer instance, without creating custom application logic or managing multiple databases.
An LLM-powered contract compliance application which uses the advanced RAG method Self-RAG and a Knowledge Graph together for the first time.
It provides the highest accuracy for contract compliance recorded so far in the Oil and Gas industry.
How We Added Replication to QuestDB - J On The Beach
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
How we implemented "Exactly Once" semantics in our database ...
Distributed systems are hard. High-performance distributed systems, even more so. Network latencies, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers that the system on the other side is tolerant of duplicates.
QuestDB is an open source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees, deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, which deduplicates and also allows upserts on real-time data, adding only 8% of processing time, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. Of course, I show all of this with demos, so you can see how it works in practice.
1. www.oralytics.com
t : @brendantierney
e : brendan.tierney@oralytics.com
Big Data Analysis using Hadoop!
Map-Reduce – An Introduction!
Lecture 2!
Brendan Tierney
3. MapReduce
• A batch based, distributed computing framework modelled on Google’s paper on
MapReduce [http://research.google.com/archive/mapreduce.html]
• MapReduce decomposes work into small parallelised map and reduce tasks which
are scheduled for remote execution on slave nodes
• Terminology
• A job is a full programme
• A task is the execution of a single map or reduce task over a slice of
data called a split
• A Mapper is a map task
• A Reducer is a reduce task
• MapReduce works by manipulating key/value pairs in the general format
map(key1,value1)➝ list(key2,value2)
reduce(key2,list(value2)) ➝ (key3, value3)
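As a hedged illustration of that general format (not part of the original slides), the following plain-Java sketch simulates the map, shuffle & sort, and reduce steps locally for the word-count example built later in the deck.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, single-JVM simulation of map(key1,value1) -> list(key2,value2) and
// reduce(key2,list(value2)) -> (key3,value3), using word counting.
public class LocalWordCountSketch {
    // map: (byte offset, line of text) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: (word, list of counts) -> (word, total)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return Map.entry(word, sum);
    }

    public static void main(String[] args) {
        String[] splits = {"the cat sat", "the cat ran"};
        Map<String, List<Integer>> grouped = new TreeMap<>();   // stands in for shuffle & sort
        long offset = 0;
        for (String line : splits) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        grouped.forEach((w, counts) -> System.out.println(reduce(w, counts)));
        // Prints: cat=2, ran=1, sat=1, the=2
    }
}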
5. [from Hadoop in Practice, Alex Holmes]
A MapReduce Job
The input is divided into fixed-size pieces called input splits. A map task is created for each split.
6. [from Hadoop in Practice, Alex Holmes]
A MapReduce Job
The role of the programmer is to define the Map and Reduce functions.
7. [from Hadoop in Practice, Alex Holmes]
A MapReduce Job
The Shuffle & Sort phases between the Map and the Reduce phases combine the map outputs and sort them for the Reducers.
8. [from Hadoop in Practice, Alex Holmes]
A MapReduce Job
The Shuffle & Sort phases between the Map and the Reduce phases combine the map outputs and sort them for the Reducers.
The Reduce phase merges the data, as defined by the programmer, to produce the outputs.
9. Map
• The Map function
• The Mapper takes as input a key/value pair which represents a logical
record from the input data source (e.g. a line in a file)
It produces zero or more output key/value pairs for each input pair
• e.g. a filtering function may only produce output if a certain
condition is met
• e.g. a counting function may produce multiple key/value pairs, one
per element being counted
map(in_key, in_value) ➝ list(temp_key, temp_value)
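The driver code later in the deck references a WordMapper class whose body is not shown in this extract; the following is a minimal sketch of what such a Mapper could look like using the org.apache.hadoop.mapreduce API (an assumption, not the author's exact code).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line (LongWritable); input value: the line itself (Text).
// Output: one (word, 1) pair per word found in the line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token.toLowerCase());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}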
10. Reduce
• The Reducer(s)
• A single Reducer handles all the map output for a unique map output
key
• A Reducer outputs zero to many key/value pairs
• The output is written to HDFS files, to external DBs, or to any data sink...
reduce(temp_key, list(temp_values)) ➝ list(out_key, out_value)
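Likewise, here is a minimal sketch of the SumReducer named in the driver code, assuming the standard org.apache.hadoop.mapreduce.Reducer API; it sums the counts emitted by the Mapper for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// All (word, 1) pairs that share a key arrive at a single reduce() call;
// the reducer sums them and writes the final (word, total) pair.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}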
11. MapReduce
• JobTracker - (Master)
• Controls MapReduce jobs
• Assigns Map & Reduce tasks to the other nodes on the cluster
• Monitors the tasks as they are running
• Relaunches failed tasks on other nodes in the cluster
• TaskTracker - (Slave)
• A single TaskTracker per slave node
Manages the execution of the individual tasks on the node
• Can instantiate many JVMs to handle tasks in parallel
• Communicates back to the JobTracker (via a heartbeat)
13. [from Hadoop the Definitive Guide, Tom White]
A MapReduce Job
14. [from Hadoop the Definitive Guide, Tom White]
Monitoring progress
15. YARN (Yet Another Resource Negotiator) Framework
16. Data Locality!
“This is a local node for local Data”
• Whenever possible Hadoop will attempt to ensure that a Mapper on a node is
working on a block of data stored locally on that node via HDFS
• If this is not possible, the Mapper will have to transfer the data across the network as
it accesses the data
• Once all the Map tasks are finished, the map output data is transferred across the
network to the Reducers
• Although Reducers may run on the same node (physical machine) as the Mappers
there is no concept of data locality for Reducers
17. Bottlenecks?
• Reducers cannot start until all Mappers are finished and the output has been
transferred to the Reducers and sorted
• To alleviate bottlenecks in Shuffle & Sort - Hadoop starts to transfer data to the
Reducers as the Mappers finish
• The percentage of Mappers which should finish before the Reducers
start retrieving data is configurable
• To alleviate bottlenecks caused by slow Mappers - Hadoop uses speculative
execution
• If a Mapper appears to be running significantly slower than the others, a
new instance of the Mapper will be started on another machine,
operating on the same data (remember replication)
• The results of the first Mapper to finish will be used
• The Mapper which is still running will be terminated by Hadoop
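As a hedged example of the two tuning knobs above, the reduce slow-start percentage and speculative execution can be set on the job's Configuration. The property names below are the ones used in recent Hadoop 2.x/3.x releases and are an assumption, not taken from the slides, so check your distribution's documentation.

import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Start copying map output to the Reducers once 80% of the Mappers have
        // finished (the configurable "slow start" percentage mentioned above).
        job.getConfiguration().setFloat(
                "mapreduce.job.reduce.slowstart.completedmaps", 0.80f);

        // Enable speculative execution of straggling map tasks
        // (set to false to disable it).
        job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
    }
}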
19. The MapReduce Job!
Let us build up an example
20. The Scenario
• Build a Word Counter
• Using the Shakespeare Poems
• Count the number of times a word appears
in the data set
• Use Map-Reduce to do this work
• Step-by-Step of creating the MR process
21. Driving Class
Mapper
Reducer
22. Setting up the MapReduce Job
• A Job object forms the specification for the job
• Job needs to know:
• the jar file that the code is in which will be distributed around the cluster; setJarByClass()
• the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath()
• the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath()
• the Mapper and Reducer classes; setMapperClass() setReducerClass()
• the output key and value classes; setOutputKeyClass() setOutputValueClass()
• the Mapper output key and value classes if they are different from the Reducer;
setMapOutputKeyClass() setMapOutputValueClass()
• the name of the job, default is the name of the jar file; setJobName()
• The default input considers the file as lines of text
The default key input is LongWritable (the byte offset into the file)
• The default value input is Text (the contents of the line read from the file)
23. Driver Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1); }
Job job = Job.getInstance();
job.setJarByClass(WordCount.class);
job.setJobName("WordCount");
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
24. Driver Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1); }
Job job = Job.getInstance();
job.setJarByClass(WordCount.class);
job.setJobName("WordCount");
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You will typically import these classes into every
MapReduce job you write. We will omit the import
statements in future slides for brevity.
25. Driver Code
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1); }
Job job = Job.getInstance();
job.setJarByClass(WordCount.class);
job.setJobName("WordCount");
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
26. Driver Code
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1); }
Job job = Job.getInstance();
job.setJarByClass(WordCount.class);
job.setJobName("WordCount");
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The main method accepts two command-line arguments: the
input and output directories.
The first step is to ensure that we have been given two
command line arguments. If not, print a help message and exit.
27. Driver Code
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1); }
Job job = Job.getInstance();
job.setJarByClass(WordCount.class);
job.setJobName("WordCount");
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
• Create a new Job, specify the class which will be called to run the job, and give the job a name.
• Give the Job information about the classes to use for the Mapper and the Reducer.
• Specify the types of the intermediate output key and value produced by the Mapper.
• Specify the types of the Reducer output key and value.
• Specify the input directory (where the data will be read from) and the output directory (where the results will be written).
File formats - Inputs
• The default InputFormat (TextInputFormat) will be used unless you specify otherwise
• To use an InputFormat other than the default, set it on the job; with the Job-based API used in the driver above this is job.setInputFormatClass(KeyValueTextInputFormat.class)
• By default, FileInputFormat.setInputPaths() will read all files from the specified directory and send them to the Mappers
• Exception: items whose names begin with a period (.) or underscore (_) are ignored
• Globs can be specified to restrict the input
• For example, /2010/*/01/* (see the sketch below)
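A minimal sketch of both settings, using the Job-based API from the driver above (KeyValueTextInputFormat lives in org.apache.hadoop.mapreduce.lib.input; import statements are omitted as before, and the glob is illustrative only):
// Use an InputFormat other than the default TextInputFormat
job.setInputFormatClass(KeyValueTextInputFormat.class);
// A glob restricts the input to matching paths only
FileInputFormat.addInputPath(job, new Path("/2010/*/01/*"));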
File formats - Outputs
• FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
• The driver can also specify the format of the output data
• Default is a plain text file
• With the Job-based API this can be set explicitly with job.setOutputFormatClass(TextOutputFormat.class) (see the sketch below)
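As a minimal sketch with the Job-based API (TextOutputFormat lives in org.apache.hadoop.mapreduce.lib.output; imports omitted as before):
// Explicitly select the default plain-text output format
job.setOutputFormatClass(TextOutputFormat.class);
// Each Reducer writes a file named part-r-NNNNN into this directory
FileOutputFormat.setOutputPath(job, new Path(args[1]));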
Driver Code
• Finally, submit the Job and wait for completion: waitForCompletion(true) prints progress while the job runs and returns true if the job succeeded, so the driver exits with status 0 on success and 1 on failure.
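If you do not want the driver to block, the Job API also allows the job to be submitted asynchronously. A minimal sketch (not from the original slides; the polling interval is arbitrary):
// Non-blocking alternative: submit the job, then poll for completion
job.submit();
while (!job.isComplete()) {
  Thread.sleep(5000);   // check every five seconds
}
System.exit(job.isSuccessful() ? 0 : 1);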
Mapper
• The Mapper takes as input a key/value pair which represents a logical record from the
input data source (e.g. a line in a file)
• The Mapper may use or ignore the input key
• E.g. a standard pattern is to read a file one line at a time
• Key = byte offset into the file where the line starts
• Value = contents of the line in the file
• Typically the key can be considered irrelevant
• It produces zero or more output key/value pairs for each input pair
• e.g. a filtering function may only produce output if a certain condition is met (see the sketch below)
• e.g. a counting function may produce multiple key/value pairs, one per element being counted
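To illustrate a Mapper that emits zero or one output pairs per input record, here is a minimal sketch of a filtering Mapper (the class name LineFilterMapper and the "ERROR" condition are invented for this example; imports omitted as before):
public class LineFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // Only produce output when the condition is met; otherwise emit nothing
    if (line.contains("ERROR")) {
      context.write(new Text(line), new IntWritable(1));
    }
  }
}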
Mapper Class
• extends the Mapper<K1, V1, K2, V2> class
• key classes implement the WritableComparable interface; value classes implement the Writable interface
• most Mappers override the map method, which is called once for every key/value pair in the input:
void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
• the default map method is the identity mapper - it maps the inputs directly to the outputs
• in general the map input types K1, V1 are different from the map output types K2, V2
Mapper Class
• Hadoop provides a number of Mapper implementations:
InverseMapper - swaps the keys and values
TokenCounterMapper - tokenises the input and outputs each token with a count of 1
RegexMapper - extracts text matching a regular expression
Example:
job.setMapperClass(TokenCounterMapper.class);
Mapper Code
// import statements omitted (see the earlier Driver Code slide)
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Inputs: the byte offset of the line (key) and the line itself (value)
  // Outputs: one (word, 1) pair per word found in the line
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String s = value.toString();
    // Processes the input text: split the line into words on non-word characters
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        // Writes the outputs
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
What the mapper does
• Input to the Mapper:
("this one I think is called a yink")
("he likes to wink, he likes to drink")
("he likes to drink and drink and drink")
• Output from the Mapper:
(this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
(he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
(he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
Shuffle and sort
• Shuffle
• Moves the intermediate key/value pairs output by each Mapper to the Reducers
• Conceptually, the outputs destined for each Reducer are merged into a single stream
• The Partitioner decides which Reducer each key is sent to; the default HashPartitioner partitions on a hash of the key (see the sketch below)
• Sort
• The set of intermediate keys arriving at a single Reducer node is automatically sorted by Hadoop before being presented to the Reducer
• Values are grouped by key, so each reduce() call receives one key together with all of its values
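A custom Partitioner can replace the default HashPartitioner when you need to control which Reducer handles which keys. A minimal sketch (the class name FirstLetterPartitioner and the routing rule are invented for this example; it would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class), and imports are omitted as before):
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Route each word to a Reducer based on its first character;
    // the result must lie between 0 and numPartitions - 1
    String word = key.toString();
    if (word.isEmpty()) {
      return 0;
    }
    char first = Character.toLowerCase(word.charAt(0));
    return first % numPartitions;
  }
}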
Reducer Class
• extends the Reducer<K2, V2, K3, V3> class
• key classes implement the WritableComparable interface; value classes implement the Writable interface
• void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException
• called once for each input key
• generates a list of output key/value pairs by iterating over the values associated with the input key
• the reduce input types K2, V2 must be the same types as the map output types
• the reduce output types K3, V3 can be different from the reduce input types
• the default reduce method is the identity reducer - it outputs each input key/value pair directly
• context.getConfiguration() - access the Configuration for the job
• void setup(Context context) - called once at the beginning of the reduce task
• void cleanup(Context context) - called once at the end of the task to wrap up any loose ends: close files, database connections, etc.
• Default number of reducers = 1; this can be changed in the driver (see the sketch below)
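A minimal sketch of overriding setup() and cleanup() (the class name, the wordcount.min.count property, and the minimum-count filter are invented for this example; imports omitted as before):
public class MinCountSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private int minimumCount;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Called once before any reduce() calls - e.g. read a setting from the job Configuration
    minimumCount = context.getConfiguration().getInt("wordcount.min.count", 1);
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    // Only emit words that reached the configured minimum count
    if (wordCount >= minimumCount) {
      context.write(key, new IntWritable(wordCount));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Called once after the last reduce() call - close files, database connections, etc.
  }
}
The default of a single Reducer can be changed in the driver, e.g. job.setNumReduceTasks(4);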
Reducer Class
• Hadoop provides some Reducer implementations
IntSumReducer - sums the values (integers) for a given key
LongSumReducer - sums the values (longs) for a given key
Example:
job.setReducerClass(IntSumReducer.class);
http://hadooptutorial.info/predefined-mapper-and-reducer-classes/
http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.mapreduce.lib.map.InverseMapper
Reducer Code
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Inputs: a word (key) and the list of counts emitted for that word by the Mappers
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    // Processes the input: sum the counts for this word
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    // Writes the output: the word and its total count
    context.write(key, new IntWritable(wordCount));
  }
}