Understanding Big Data Technology
Foundations
Module 3
Syllabus
• The MapReduce Framework
• Techniques to Optimize MapReduce Jobs
• Uses of MapReduce
• Role of HBase in Big Data Processing
Understanding Big Data Technology
Foundations
• The advent of Local Area Networks (LANs) and other
networking technologies shifted the focus of the IT industry
toward solving bigger and bigger problems by combining the
computing and storage capacities of systems on the network
• This chapter focuses on explaining the basics and exploring
the relevance and role of various functions that are used in
the MapReduce framework.
Big Data
• The MapReduce Framework: At the start of the 21st century, a
team of engineers working at Google concluded that, because of
the increasing number of Internet users, the resources and
solutions then available would be inadequate to fulfill future
requirements.
• In preparation for this upcoming issue, Google engineers
established that the concept of task distribution across
economical resources, and their interconnectivity as a cluster
over the network, could be presented as a solution.
• The concept of task distribution alone, though, was not a
complete answer to the issue; it also requires the tasks to be
distributed in parallel.

A parallel distribution of tasks
• Helps in automatic expansion and contraction of processes
• Enables continuation of processes without being affected by network
failures or individual system failures
• Empowers developers with rights to access the services that other
developers have created in the context of multiple usage scenarios
• A generic implementation to the entire concept was, therefore, provided
with the development of the MapReduce programming model
Exploring the Features of MapReduce
• MapReduce keeps all the processing operations separate for parallel execution. Problems that are
extremely large in size are divided into subtasks, which are chunks of data separated into manageable
blocks.
• The principal features of MapReduce include the following:
Synchronization
Co-location of Code/Data (Data Locality)
Handling of Errors/Faults
Scale-Out Architecture
Working of
MapReduce
1. Take a large dataset or set of records.
2. Perform iteration over the data.
3. Extract some interesting patterns to prepare an
output list by using the map function.
4. Arrange the output list properly to enable
optimization for further processing.
5. Compute a set of results by using the reduce
function.
6. Provide the final output.
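The six steps listed above can be sketched as a minimal, single-machine Python simulation of a word count (the input records are hypothetical; a real MapReduce job would distribute this work across many nodes):

```python
from collections import defaultdict

# Step 1: take a set of records (hypothetical sample data).
records = ["big data", "big ideas", "data systems"]

# Steps 2-3: iterate over the data and use the map function
# to emit <word, 1> pairs for the output list.
def map_fn(record):
    return [(word, 1) for word in record.split()]

mapped = []
for record in records:
    mapped.extend(map_fn(record))

# Step 4: arrange the output list by key for further processing.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Steps 5-6: compute one result per key with the reduce function
# and provide the final output.
def reduce_fn(key, values):
    return (key, sum(values))

output = sorted(reduce_fn(k, v) for k, v in groups.items())
print(output)  # [('big', 2), ('data', 2), ('ideas', 1), ('systems', 1)]
```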
The MapReduce programming model also works on an
algorithm to execute the map and reduce operations.
This algorithm can be depicted as follows
Working of the MapReduce approach

Working of the MapReduce
approach
• Is a combination of a master and three slaves
• The master monitors the entire job assigned to the MapReduce algorithm
and is given the name of JobTracker
• Slaves, on the other hand, are responsible for keeping track of individual
tasks and are called TaskTrackers
• First, the given job is divided into a number of tasks by the master, i.e., the
JobTracker, which then distributes these tasks among the slaves
• It is the responsibility of the JobTracker to further keep an eye on the
processing activities and the re-execution of the failed tasks
• Slaves coordinate with the master by executing the tasks they are given by
the master.
• The JobTracker receives jobs from client applications to process large
volumes of information. These jobs are assigned in the form of individual
tasks (after a job is divided into smaller parts) to various TaskTrackers
• The task distribution operation is completed by the JobTracker. The data,
after being processed by the TaskTrackers, is transmitted to the reduce function
so that the final, integrated output, which is an aggregate of the data
processed by the map function, can be provided.
Operations performed in the MapReduce
model
• The input is provided from large data files in the form of
key-value pair (KVP), which is the standard input format
in a Hadoop MapReduce programming model
• The input data is divided into small pieces, and master
and slave nodes are created. The master node usually
executes on the machine where the data is present, and
slaves are made to work remotely on the data.
• The map operation is performed simultaneously on all the
data pieces, which are read by the map function. The
map function extracts the relevant data and generates the
KVP for it
The input/output operations of the
map function are shown in Figure
•The output list is generated from the map operation,
and the master instructs the reduce function about
further actions that it needs to take
•The list of KVPs obtained from the map function is
passed on to the reduce function. The reduce function
sorts the data on the basis of the KVP list
•The process of collecting the map output list from
the map function and then sorting it as per the keys is
known as shuffling. Every unique key is then taken up by
the reduce function, which is called for each key, as
required, to produce the final output to be sent to the file
The input/output operations of the reduce function are
shown in Figure
The output is finally generated by the reduce function, and the control is handed
over to the user program by the master
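The shuffling and reducing stages described above can be illustrated with a short Python sketch (the list of KVPs standing in for the map output is hypothetical):

```python
from itertools import groupby

# Hypothetical map output: an unsorted list of KVPs.
map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]

# Shuffling: sort the map output by key so identical keys sit together.
shuffled = sorted(map_output, key=lambda kv: kv[0])

# Reduce: each unique key is taken once to produce the final output.
final = [(key, sum(v for _, v in group))
         for key, group in groupby(shuffled, key=lambda kv: kv[0])]
print(final)  # [('a', 2), ('b', 2), ('c', 1)]
```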

The entire process of data analysis conducted in the
MapReduce programming model:
• Let’s now try to understand the
working of the MapReduce
programming model with the help of
a few examples
Example 1
• Consider that there is a data analysis project in which 20 terabytes of data needs to be
analyzed on 20 different MapReduce server nodes
• At first, the data distribution process simply copies data to all the nodes before starting
the MapReduce process.
• You need to keep in mind that the determination of the format of the file rests with the
user and no standard file format is specified in MapReduce as in relational databases.
• Next, the scheduler comes into the picture as it receives two programs from the
programmer. These two programs are the map and reduce programs. The data is made
available from the disk to the map function, which runs the logic on the data. In our
example, all the 20 nodes independently perform the operation.
•The map function passes the results to the reduce function for summarizing and
providing the final output in an aggregate form
Example 1
• The ancient Roman census can help in understanding the mapping process of the map and
reduce functions. In the Rome census, volunteers were sent to cover various places
that are situated near the kingdom of Rome. Volunteers had to count the number of
people living in the area assigned to them and send the report of the population to the
organization. The census chief added the count of people recorded from all the areas
to reach an aggregate whole. The map function performs the processing operation in
parallel to counting the number of people living in an area, and the reduce function
combines the entire result.
Example 2
• A data analytic professional parses out every term available in the chat text by creating a map step. He
creates a map function to find out every word of the chat. The count is incremented by one after the word
is parsed from the paragraph.
• The map function provides the output in the form of a list that involves a number of KVPs, for example,
″<my, 1>,″ ″<product, 1>,″ ″<broke, 1>.″
• Once the operations of all map functions are complete, the information is provided to the scheduler by
the map function itself.
• After completing the map operation, the reduce function starts performing the reduce operation. Keeping
in mind the current target of finding the count of the number of times a word appears in the text, shuffling is
performed next
• This process involves distribution of the map output through hashing in order to map the same keywords
to the respective node of the reduce function. Assuming a simple situation of processing an English text,
for example, we require 26 nodes that can handle words starting with individual letters of the alphabet
• In this case, words starting with A will be handled by one node, words that start with B will be handled
by another node, and so on. Thus, the number of words can easily be counted by the reduce step.
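The hashing step described above, which routes all words starting with the same letter to the same reduce node, can be sketched as follows (the word list is hypothetical, and each of the 26 "nodes" is simulated as a dictionary entry):

```python
from collections import defaultdict

# Hypothetical words parsed out of the chat text by the map step.
words = ["apple", "banana", "avocado", "cherry", "berry"]

# Shuffle: hash each <word, 1> pair to the node that handles its
# first letter, so the same keywords always land on the same node.
partitions = defaultdict(list)
for word in words:
    node = word[0].lower()
    partitions[node].append((word, 1))

# Reduce: each node counts the words routed to it independently.
counts = {node: len(kvps) for node, kvps in partitions.items()}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```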

The detailed MapReduce process used in this
example:
•The final output of the process will include ″<my, 10>,″ ″<product, 25>,″ ″<broke, 20>,″ where the first
value of each angular bracket (<>) is the analyzed word, and the second value is the count of the word,
i.e., the number of times the word appears within the entire text
• The result set will include 26 files. Each of these files is produced from an individual node and contains
the count of words in a sorted order. You need to keep in mind that the combining operation will also
require a process to handle all the 26 files obtained as a result of the MapReduce operations. After we
obtain the count of words, we can feed the results for any kind of analysis.
Exploring Map and Reduce Functions
•The MapReduce programming model facilitates faster data analysis for which the data is taken in the
form of KVPs.
•Both MapReduce functions and Hadoop can be created in many languages; however, programmers
generally prefer to create them in Java. The Pipes library allows C++ source code to be utilized for map
and reduce code
•The generic Application Programming Interface (API) called streaming allows programs created in
most languages to be utilized as map and reduce functions in Hadoop
•Consider an example of a program that counts the number of Indian cities having a population of above
one lakh. You must note that the following is not programming code but a plain-English
representation of the solution to the problem.
•One way to achieve this task is to determine the input data and generate a list in the following
manner:
mylist = ("all the cities in India")
Exploring Map and Reduce Functions
• Use the map function to create a function, howManyPeople, which selects the cities
having a population of more than one lakh:
map howManyPeople (mylist) = [howManyPeople "city 1";howManyPeople"city 2";
howManyPeople "city 3"; howManyPeople "city 4";...]
•Now, generate a new output list of all the cities having a population of more than one lakh:
(no, city 1; yes, city 2; no, city 3; yes, city 4;?, city nnn)
•The preceding function gets executed without making any modifications to the original list.
Moreover, you can notice that each element of the output list gets mapped to a
corresponding element of the input list, having a “yes” or “no” attached.
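The howManyPeople mapping above can be sketched in Python (the city names and populations are hypothetical; note that the original input list is never modified):

```python
# Hypothetical input data: city populations.
mylist = {"city 1": 40_000, "city 2": 250_000,
          "city 3": 90_000, "city 4": 300_000}

def howManyPeople(city):
    # Emit a ("yes"/"no", city) pair without changing the input.
    return ("yes" if mylist[city] > 100_000 else "no", city)

# Map howManyPeople over the list to build the output list.
output_list = [howManyPeople(city) for city in mylist]
print(output_list)
# [('no', 'city 1'), ('yes', 'city 2'), ('no', 'city 3'), ('yes', 'city 4')]
```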
Example 3
• Consider temperature data collected from a number of cities and stored in
five data files. In this example, city is the key and temperature is the
value.
• Out of all the data we have collected, we want to find the maximum temperature for each
city across all of the data files (note that each file might have the same city represented
multiple times). Using the MapReduce framework, we can break this down into five map
tasks, where each mapper works on one of the five files, and the mapper task goes through
the data and returns the maximum temperature for each city. For example, the results
produced from one mapper task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)

Let’s assume the other four mapper tasks produced the following intermediate results
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the
input results and output a single value for each city, producing the final result set as
follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
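The reduce step above can be sketched in Python by folding all five mapper output streams into a single maximum per city:

```python
# The five mapper output streams from the example above.
mapper_outputs = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

# Reduce: keep the maximum temperature seen for each city.
maxima = {}
for stream in mapper_outputs:
    for city, temp in stream:
        maxima[city] = max(temp, maxima.get(city, temp))

print(maxima)  # {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}
```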
Techniques to Optimize MapReduce Jobs
•MapReduce optimization techniques are in the following categories:
Hardware or network topology
 Synchronization
 File system
•You need to keep the following points in mind while designing a file that supports
MapReduce implementation:
Keep it Warm
The Bigger the Better
The Long View
Right Degree of Security
Fields that benefit from the use of MapReduce:
1. Web Page Visits—Suppose a researcher wants to know the number of times the website of a particular
newspaper was accessed. The map task would be to read the logs of the Web page requests and make a
complete list. The map outputs may look similar to the following:
The reduce function would find the results for the newspaper URL and add them.
The output of the preceding code is:
<newspaperURL, 3>
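A minimal sketch of this reduce step, assuming a hypothetical list of <URL, 1> pairs produced by the map task from the web-server logs:

```python
# Hypothetical map output: one <URL, 1> pair per logged page request.
log_entries = [("newspaperURL", 1), ("otherURL", 1),
               ("newspaperURL", 1), ("newspaperURL", 1)]

# Reduce: find the results for the newspaper URL and add them.
total = sum(count for url, count in log_entries if url == "newspaperURL")
print(f"<newspaperURL, {total}>")  # <newspaperURL, 3>
```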
Fields that benefit from the use of MapReduce:
2. Web Page Visitor Paths- Consider a situation in which an advocacy group wishes to
know how visitors get to know about its website. To determine this, they designed a
link known as “source,” and the Web page to which the link transfers the information is
known as “target.” The map function scans the Web links for returning the results of the
type <target, source>. The reduce function scans this list for determining the results
where the “target” is the Web page. The reduce function output, which is the final
output, will be of the form <advocacy group page, list (source)>.
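A sketch of this aggregation, assuming hypothetical <target, source> pairs emitted by the map function:

```python
from collections import defaultdict

# Hypothetical <target, source> pairs returned by the map function.
links = [("advocacy page", "blog A"), ("other page", "blog A"),
         ("advocacy page", "forum B"), ("advocacy page", "newsletter C")]

# Reduce: collect the sources for each target page.
sources = defaultdict(list)
for target, source in links:
    sources[target].append(source)

# Final output of the form <advocacy group page, list(source)>.
print(("advocacy page", sources["advocacy page"]))
# ('advocacy page', ['blog A', 'forum B', 'newsletter C'])
```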

Fields that benefit from the use of MapReduce:
3. Word Frequency—A researcher wishes to read articles about earthquakes, but he
does not want those articles in which earthquakes are discussed as a minor topic.
Therefore, he decides that an article basically dealing with earthquakes
should have the term "tectonic plate" in it more than 10 times. The map
function will count the number of times the specified term occurs in each
document and provide the result as <document, frequency>. The reduce
function will select only the results that have a frequency of more
than 10.
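A minimal sketch of this filtering reduce step, with hypothetical document frequencies from the map output:

```python
# Hypothetical map output: <document, frequency> pairs.
map_output = [("doc1", 14), ("doc2", 3), ("doc3", 11), ("doc4", 9)]

# Reduce: keep only the documents whose frequency exceeds 10.
selected = [(doc, freq) for doc, freq in map_output if freq > 10]
print(selected)  # [('doc1', 14), ('doc3', 11)]
```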
Fields that benefit from the use of MapReduce:
4. Word Count—Suppose a researcher wishes to determine the number of times celebrities talk about the present
bestseller. The data to be analyzed comprises written blogs, posts, and tweets of the celebrities. The map function
will make a list of all the words. This list will be in the KVP format, in which the key is each word, and the value is
1 for every appearance of that word. The output from the map function would be obtained somewhat as follows:
The preceding output will be converted into the following form by the reduce function:
HBase
• The MapReduce programming model can utilize other components of the Hadoop ecosystem to
perform its operations better. One such component is HBase.
• Role of HBase in Big Data Processing: HBase is an open source, non-relational, distributed,
column-oriented database developed as a part of the Apache Software Foundation's Hadoop project.
• While MapReduce enhances Big Data processing, HBase takes care of its storage and access
requirements. Characteristics of HBase -- HBase helps programmers to store large quantities
of data in such a way that it can be accessed easily and quickly, as and when required.
• It stores data in a compressed format and thus occupies less memory space. HBase has low
latency and is, therefore, beneficial for lookups and scanning of large amounts of data.
HBase saves data in cells in descending order of timestamp; therefore, a read
will always first retrieve the most recent values. Columns in HBase belong to a column family.
• The column family name is utilized as a prefix for determining the members of the family; for
instance, Cars:Wagon R and Cars:i10 are members of the Cars column family. A key is
associated with each row in an HBase table.
HBase
• The structure of the key is very flexible. It can be a calculated value, a string, or
any other data structure. The key is used for controlling the retrieval of data to the
cells in the row. All these characteristics help build the schema of the HBase data
structure before the storage of any data. Moreover, tables can be modified and new
column families can be added once the database is up and running.
• The columns can be added very easily and are added row-by-row, providing great
flexibility, performance, and scalability. In case you have a large volume and
variety of data, you can use a columnar database. HBase is suitable in conditions
where the data changes either gradually or rapidly. In addition, HBase can store data
that has a slow-changing rate and ensure its availability for Hadoop tasks.
• HBase is a framework written in Java for supporting applications that are used to
process Big Data. HBase is a non-relational Hadoop database that provides fault
tolerance for huge amounts of data.

HBase-Installation
• Before starting the installation of HBase, you need to install the Java Software
Development Kit (SDK). The installation of HBase requires the following
operations to be performed in a stepwise manner: In the Linux terminal, install
the dependencies: $ sudo apt-get install ntp libopts25. Figure 5.7 shows the
installation of the dependencies for HBase:
HBase-Installation
• The HBase file can be customized as per the user's needs by exporting JAVA_HOME
and HBASE_OPTS (in hbase-env.sh). To customize the HBase file, set the following
variables:
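A sketch of what such a customization might look like in conf/hbase-env.sh (the JDK path and JVM option below are examples only and must match your own installation):

```shell
# Excerpt from conf/hbase-env.sh -- paths and options are examples.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
```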
HBase-Installation
• ZooKeeper, the file management engine of the Hadoop ecosystem, manages the files
that HBase plans to use currently and in the future. Therefore, to let HBase manage
ZooKeeper and ensure that it is enabled, use the following setting:
export HBASE_MANAGES_ZK=true
• Figure 5.9 shows ZooKeeper enabled in HBase:
HBase-Installation
• Site-specific customizations are done in hbase-site.xml (HBASE_HOME/conf). Figure
5.10 shows customized hbase-site.xml (HBASE_HOME/conf):
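A sketch of a minimal hbase-site.xml customization (the directory paths below are examples only; hbase.rootdir and hbase.zookeeper.property.dataDir are standard HBase properties):

```xml
<!-- Excerpt from HBASE_HOME/conf/hbase-site.xml; values are examples. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/zookeeper</value>
  </property>
</configuration>
```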

HBase-Installation
• To enable connection with a remote HBase server, edit /etc/hosts. Figure
5.11 shows the edited /etc/hosts:
HBase-Installation
• Start HBase by using the following command: $ bin/start-hbase.sh. Figure
5.12 shows the initiation process of the HBase daemons:
HBase-Installation
• Check all HBase daemons by using the following command: $ jps. Figure
5.13 shows the implementation of the $jps command:

Hbase-Installation
• Paste the following link in your Web browser to access the Web interface, which lists the tables created along with their definitions: http://localhost:60010. Figure 5.14 shows the Web interface for HBase.
• Check the region server for HBase by pasting the following link in your Web browser: http://localhost:60030
  (Source: DT Editorial Services. Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, p. 138. Wiley India, Kindle Edition.)
• Start the HBase shell by using the following command: $bin/hbase shell. Figure 5.16 shows the $bin/hbase shell running in a terminal.


Thank You

Big Data.pptx

  • 5. A parallel distribution of tasks:
    • Helps in the automatic expansion and contraction of processes
    • Enables processes to continue without being affected by network failures or individual system failures
    • Empowers developers to access services that other developers have created, in the context of multiple usage scenarios
    • A generic implementation of this entire concept was therefore provided by the development of the MapReduce programming model
  • 6. Exploring the Features of MapReduce
    • MapReduce keeps all processing operations separate for parallel execution. Problems that are extremely large in size are divided into subtasks, i.e., chunks of data separated into manageable blocks.
    • The principal features of MapReduce include the following:
      • Synchronization
      • Co-location of Code/Data (Data Locality)
      • Handling of Errors/Faults
      • Scale-Out Architecture
  • 7. Working of MapReduce
    1. Take a large dataset or set of records.
    2. Perform iteration over the data.
    3. Extract some interesting patterns to prepare an output list by using the map function.
    4. Arrange the output list properly to enable optimization for further processing.
    5. Compute a set of results by using the reduce function.
    6. Provide the final output.
    The MapReduce programming model also works on an algorithm to execute the map and reduce operations. This algorithm can be depicted as follows.
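The six steps above can be sketched as a minimal, single-machine word-count simulation in Python. This is only an illustration of the flow, not the Hadoop API: the sample records and the `map_func`/`reduce_func` names are assumptions made for the example.

```python
from collections import defaultdict

def map_func(record):
    # Step 3: extract interesting patterns -- emit a <word, 1> KVP per word
    return [(word, 1) for word in record.split()]

def reduce_func(key, values):
    # Step 5: compute a result for each unique key
    return key, sum(values)

records = ["my product broke", "my product is great"]   # Step 1: the dataset

intermediate = []
for record in records:                                  # Step 2: iterate over the data
    intermediate.extend(map_func(record))

grouped = defaultdict(list)                             # Step 4: arrange (sort/group) the output list
for key, value in sorted(intermediate):
    grouped[key].append(value)

output = dict(reduce_func(k, v) for k, v in grouped.items())
print(output)                                           # Step 6: the final output
# {'broke': 1, 'great': 1, 'is': 1, 'my': 2, 'product': 2}
```

In a real Hadoop job, steps 2 and 5 run on different machines and step 4 is performed by the framework's shuffle phase; the data flow, however, is exactly this.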
  • 8. Working of the MapReduce approach
  • 9. Working of the MapReduce approach
    • Is a combination of a master and three slaves
    • The master monitors the entire job assigned to the MapReduce algorithm and is called the JobTracker
    • Slaves, on the other hand, are responsible for keeping track of individual tasks and are called TaskTrackers
    • First, the given job is divided into a number of tasks by the master, i.e., the JobTracker, which then distributes these tasks among the slaves
    • It is the responsibility of the JobTracker to further keep an eye on the processing activities and on the re-execution of failed tasks
    • Slaves coordinate with the master by executing the tasks they are given by the master
    • The JobTracker receives jobs from client applications for processing large volumes of information. After a job is divided into smaller parts, it is assigned in the form of individual tasks to various TaskTrackers
    • The task distribution operation is completed by the JobTracker. The data, after being processed by the TaskTrackers, is transmitted to the reduce function so that the final, integrated output, which is an aggregate of the data processed by the map function, can be provided
  • 10. Operations performed in the MapReduce model
    • The input is provided from large data files in the form of key-value pairs (KVPs), which is the standard input format in the Hadoop MapReduce programming model
    • The input data is divided into small pieces, and master and slave nodes are created. The master node usually executes on the machine where the data is present, and the slaves are made to work remotely on the data
    • The map operation is performed simultaneously on all the data pieces, which are read by the map function. The map function extracts the relevant data and generates the KVPs for it
  • 11. The input/output operations of the map function are shown in the figure
    • The output list is generated from the map operation, and the master instructs the reduce function about the further actions it needs to take
    • The list of KVPs obtained from the map function is passed on to the reduce function. The reduce function sorts the data on the basis of the KVP list
    • The process of collecting the map output list from the map function and then sorting it by key is known as shuffling. Every unique key is then taken by the reduce function. These keys are called, as required, to produce the final output to be sent to the file
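The shuffling described above, collecting the map output list and sorting it by key so that each unique key reaches a single reduce call, can be sketched in a few lines. The sample KVP list is an illustrative assumption, not the output of a real job.

```python
from itertools import groupby
from operator import itemgetter

# An assumed map-output list of KVPs, in the order the mappers emitted them
map_output = [("product", 1), ("my", 1), ("broke", 1), ("my", 1)]

# Shuffle: sort by key, then group so every unique key appears exactly once,
# with all of its values collected into one list for the reduce function
shuffled = sorted(map_output, key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(shuffled, key=itemgetter(0))}
print(grouped)
# {'broke': [1], 'my': [1, 1], 'product': [1]}
```

The `groupby` call relies on the preceding sort, which mirrors the sort-then-group behavior of the shuffle phase.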
  • 12. The input/output operations of the reduce function are shown in the figure. The output is finally generated by the reduce function, and control is handed back to the user program by the master.
  • 13. The entire process of data analysis conducted in the MapReduce programming model: • Let’s now try to understand the working of the MapReduce programming model with the help of a few examples
  • 14. Example 1
    • Consider a data analysis project in which 20 terabytes of data need to be analyzed on 20 different MapReduce server nodes
    • At first, the data distribution process simply copies the data to all the nodes before starting the MapReduce process
    • Keep in mind that determining the file format rests with the user; no standard file format is specified in MapReduce, as it is in relational databases
    • Next, the scheduler comes into the picture, as it receives two programs from the programmer: the map program and the reduce program. The data is made available from disk to the map function, which runs its logic on the data. In our example, all 20 nodes independently perform the operation
    • The map function passes its results to the reduce function for summarizing and providing the final output in aggregate form
  • 15. Example 1 • The ancient Rome census can help in understanding the map and reduce functions. In the Rome census, volunteers were sent to cover the various places situated around the kingdom of Rome. Each volunteer counted the number of people living in the area assigned to him and sent the population report to the organization. The census chief then added the counts recorded from all the areas to reach an aggregate total. The map function corresponds to the volunteers counting the people of their areas in parallel, and the reduce function corresponds to the chief combining all their results.
  • 16. Example 2 • A data analytics professional parses out every term available in the chat text by creating a map step. He creates a map function to find every word of the chat. The count is incremented by one each time a word is parsed from the paragraph. • The map function provides the output in the form of a list of KVPs, for example, ″<my, 1>,″ ″<product, 1>,″ ″<broke, 1>.″ • Once all map functions complete their operations, they report this to the scheduler. • After the map operation completes, the reduce function starts performing the reduce operation. Keeping in mind the current target of finding the number of times each word appears in the text, shuffling is performed next • Shuffling distributes the map output through hashing so that identical keywords are routed to the same node of the reduce function. Assuming the simple situation of processing an English text, we require 26 nodes, one to handle words starting with each letter of the alphabet • In this case, words starting with A are handled by one node, words starting with B by another node, and so on. The number of occurrences of each word can then easily be counted by the reduce step.
  • 17. The detailed MapReduce process used in this example: • The final output of the process will include ″<my, 10>,″ ″<product, 25>,″ ″<broke, 20>,″ where the first value inside each angular bracket (<>) is the analyzed word, and the second value is the count of the word, i.e., the number of times the word appears within the entire text • The result set will include 26 files, each produced by an individual node and containing its share of the word counts in sorted order. Keep in mind that a combining operation is still required to merge the 26 files obtained as a result of the MapReduce operations. Once we have the word counts, we can feed the results into any kind of analysis.
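The 26-way shuffle described in this example can be sketched as follows. The partitioning here is simplified to the first letter of each word (function names are illustrative; Hadoop's real partitioner hashes the whole key):

```python
from collections import defaultdict

def partition_by_first_letter(pairs):
    """Route each (word, count) pair to one of 26 'nodes' keyed by its
    first letter, mimicking the hash-based shuffle described above."""
    nodes = defaultdict(list)
    for word, count in pairs:
        nodes[word[0].upper()].append((word, count))
    return nodes

def reduce_counts(pairs):
    """Reduce step on one node: sum the counts for each word it received."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Map output from the chat text, shuffled to per-letter nodes:
map_output = [("my", 1), ("product", 1), ("my", 1), ("broke", 1)]
nodes = partition_by_first_letter(map_output)
```

Each entry of `nodes` plays the role of one reduce node's input file; calling `reduce_counts` on it yields that node's share of the final 26 output files.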
  • 18. Exploring Map and Reduce Functions •The MapReduce programming model facilitates faster data analysis, for which the data is taken in the form of KVPs. •MapReduce functions for Hadoop can be created in many languages; however, programmers generally prefer to create them in Java. The Pipes library allows C++ source code to be used for map and reduce code •The generic Application Programming Interface (API) called Streaming allows programs written in most languages to be used as map and reduce functions in Hadoop •Consider an example of a program that counts the number of Indian cities having a population above one lakh. Note that the following is not programming code but a plain-English representation of the solution to the problem. •One way to achieve this task is to determine the input data and generate a list in the following manner: mylist = ("all the cities in India")
  • 19. Exploring Map and Reduce Functions • Use the map function to apply a function, howManyPeople, which checks whether each city has a population of more than one lakh: map howManyPeople (mylist) = [howManyPeople "city 1"; howManyPeople "city 2"; howManyPeople "city 3"; howManyPeople "city 4"; ...] •This generates a new output list marking the cities having a population of more than one lakh: (no, city 1; yes, city 2; no, city 3; yes, city 4; ?, city nnn) •The preceding function executes without making any modifications to the original list. Moreover, notice that each element of the output list maps to a corresponding element of the input list, with a “yes” or “no” attached.
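The plain-English howManyPeople mapping above might look like this in code. The population figures here are invented for illustration; real input would come from census data files:

```python
# Hypothetical population data; real input would come from census files.
populations = {
    "city 1": 50_000,
    "city 2": 250_000,
    "city 3": 80_000,
    "city 4": 1_200_000,
}

def how_many_people(city):
    # Emit ("yes", city) when the population exceeds one lakh (100,000),
    # and ("no", city) otherwise, without modifying the input list.
    return ("yes" if populations[city] > 100_000 else "no", city)

# Apply the map over the whole list, producing one output element per city.
output_list = [how_many_people(city) for city in populations]
```

As in the slide, every input element maps to exactly one output element with a "yes" or "no" attached.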
  • 20. Example: city is the key and temperature is the value. • Out of all the data we have collected, we want to find the maximum temperature for each city across all the data files (note that each file might list the same city multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files; the mapper task goes through its file and returns the maximum temperature for each city. For example, the results produced from one mapper task for the data above would look like this: (Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
  • 21. Example: city is the key and temperature is the value. Let’s assume the other four mapper tasks produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final result set as follows: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
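The reduce step for this temperature example can be sketched directly from the five mapper output streams listed above (a simplified single-process sketch; in Hadoop each city's values would be routed to a reducer by the shuffle):

```python
from collections import defaultdict

# Intermediate results from the five mapper tasks shown above.
mapper_outputs = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

def reduce_max(outputs):
    # Combine all streams and keep the maximum temperature per city key.
    best = defaultdict(lambda: float("-inf"))
    for stream in outputs:
        for city, temp in stream:
            best[city] = max(best[city], temp)
    return dict(best)

final_result = reduce_max(mapper_outputs)
```

Running this reproduces the final result set given on the slide: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38).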
  • 22. Techniques to Optimize MapReduce Jobs •MapReduce optimization techniques fall into the following categories: • Hardware or network topology • Synchronization • File system •Keep the following points in mind while designing a file system that supports a MapReduce implementation: • Keep it Warm • The Bigger the Better • The Long View • The Right Degree of Security
  • 23. The fields benefitted by the use of MapReduce are: 1. Web Page Visits — Suppose a researcher wants to know the number of times the website of a particular newspaper was accessed. The map task would read the logs of the Web page requests and emit a complete list of entries, one per request. The reduce function would then find all the entries for the newspaper’s URL and add them. The final output would be: <newspaperURL, 3>
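A minimal sketch of this web-page-visit counting, assuming each log line ends with the requested URL (the log format and function names here are assumptions for illustration):

```python
def map_visits(log_lines):
    # Map: emit <URL, 1> for every request found in the logs.
    # Assumes the requested URL is the last whitespace-separated field.
    return [(line.split()[-1], 1) for line in log_lines]

def reduce_visits(pairs, url):
    # Reduce: add up the counts recorded for the URL of interest.
    return (url, sum(value for key, value in pairs if key == url))

logs = [
    "GET newspaperURL",
    "GET otherURL",
    "GET newspaperURL",
    "GET newspaperURL",
]
visit_count = reduce_visits(map_visits(logs), "newspaperURL")
```

With these sample logs the reduce output is <newspaperURL, 3>, matching the slide.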
  • 24. The fields benefitted by the use of MapReduce are: 2. Web Page Visitor Paths — Consider a situation in which an advocacy group wishes to know how visitors arrive at its website. The page containing a link is designated the “source,” and the Web page to which the link transfers the visitor is the “target.” The map function scans the Web links and returns results of the type <target, source>. The reduce function scans this list to find the results where the “target” is the advocacy group’s Web page. The reduce function output, which is the final output, will be of the form <advocacy group page, list (source)>.
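The <target, source> inversion described here can be sketched as follows (the link data and page names are invented for illustration):

```python
def map_links(links):
    # Map: invert each (source, target) hyperlink into <target, source>.
    return [(target, source) for source, target in links]

def reduce_sources(pairs, target_page):
    # Reduce: collect every source that links to the page of interest,
    # yielding <target_page, list(source)>.
    sources = [source for target, source in pairs if target == target_page]
    return (target_page, sources)

# Hypothetical crawled links: (source page, target page).
links = [("blogA", "advocacy"), ("newsB", "advocacy"), ("blogA", "other")]
referrers = reduce_sources(map_links(links), "advocacy")
```

The result pairs the advocacy group's page with the list of pages that link to it, as the slide describes.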
  • 25. The fields benefitted by the use of MapReduce are: 3. Word Frequency — A researcher wishes to read articles about earthquakes, but he does not want those articles in which earthquakes are discussed only as a minor topic. Therefore, he decides that an article dealing substantially with earthquakes should have the phrase “tectonic plate” in it more than 10 times. The map function counts the number of times the specified phrase occurs in each document and provides the result as <document, frequency>. The reduce function then selects only the results that have a frequency of more than 10.
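This frequency filter can be sketched as follows (the documents and the threshold parameter name are assumptions for illustration):

```python
def map_frequency(documents, term):
    # Map: emit <document, frequency> for the term in each document.
    # `documents` maps a document name to its full text.
    return [(name, text.lower().count(term)) for name, text in documents.items()]

def reduce_filter(pairs, threshold=10):
    # Reduce: keep only documents whose frequency exceeds the threshold.
    return [(doc, freq) for doc, freq in pairs if freq > threshold]

# Hypothetical corpus: one article mentions the phrase 12 times, one only 3.
docs = {
    "doc1": "tectonic plate " * 12,
    "doc2": "tectonic plate " * 3,
}
selected = reduce_filter(map_frequency(docs, "tectonic plate"))
```

Only documents crossing the 10-occurrence bar survive the reduce step.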
  • 26. The fields benefitted by the use of MapReduce are: 4. Word Count — Suppose a researcher wishes to determine the number of times celebrities talk about the present bestseller. The data to be analyzed comprises the written blogs, posts, and tweets of the celebrities. The map function makes a list of all the words in KVP format, in which the key is each word and the value is 1 for every appearance of that word, for example, <bestseller, 1>. The reduce function then adds up the values for each key, converting the output into the form <word, total count>.
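The <word, 1> emission and its reduction can be sketched like this (the sample posts are invented; a real job would read the celebrities' blogs, posts, and tweets):

```python
from collections import defaultdict

def map_words(text):
    # Map: emit <word, 1> for every word occurrence, with light cleanup.
    return [(word.strip(".,!?").lower(), 1) for word in text.split()]

def reduce_word_counts(pairs):
    # Reduce: sum the 1s emitted for each word key.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical posts mentioning the bestseller.
posts = ["Loved the bestseller!", "The bestseller is great."]
pairs = [pair for post in posts for pair in map_words(post)]
word_counts = reduce_word_counts(pairs)
```

Here `word_counts["bestseller"]` gives the total number of mentions across all posts.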
  • 27. HBase • The MapReduce programming model can utilize other components of the Hadoop ecosystem to perform its operations better. One such component is HBase • Role of HBase in Big Data Processing — HBase is an open-source, non-relational, distributed, column-oriented database developed as a part of the Apache Software Foundation’s Hadoop project. • While MapReduce enhances Big Data processing, HBase takes care of its storage and access requirements. Characteristics of HBase — HBase helps programmers store large quantities of data in such a way that it can be accessed easily and quickly, as and when required • It stores data in a compressed format and thus occupies less memory space. HBase has low latency and is therefore beneficial for lookups and scans of large amounts of data. HBase saves the versions of a cell in descending timestamp order; therefore, a read always finds the most recent values first. Columns in HBase belong to a column family. • The column family name is used as a prefix for identifying the members of the family; for instance, Cars:Wagon R and Cars:i10 are members of the Cars column family. A key is associated with each row in an HBase table
  • 28. HBase • The structure of the key is very flexible: it can be a calculated value, a string, or any other data structure. The key is used for controlling the retrieval of data from the cells in the row. All these characteristics help build the schema of the HBase data structure before any data is stored. Moreover, tables can be modified and new column families can be added once the database is up and running. • Columns can be added very easily, row by row, providing great flexibility, performance, and scalability. If you have a large volume and variety of data, you can use a columnar database. HBase is suitable both for data that changes gradually and for data that must be accessed rapidly. In addition, HBase can store slow-changing data and ensure its availability for Hadoop tasks. • HBase is a framework written in Java for supporting applications that process Big Data. As a non-relational Hadoop database, it provides fault tolerance for huge amounts of data.
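The storage characteristics above (column families as prefixes, timestamped cell versions kept newest-first) can be illustrated with a toy in-memory model. This is not the HBase API, only a sketch of its data model:

```python
from collections import defaultdict

class ToyHBaseTable:
    """Toy model of HBase storage: each cell keeps its versions sorted by
    timestamp in descending order, so a read sees the newest value first."""

    def __init__(self):
        # row key -> "family:qualifier" column name -> [(timestamp, value)]
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self.rows[row_key][column]
        versions.append((timestamp, value))
        # Keep versions in descending timestamp order, as HBase does.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row_key, column):
        # Return the most recent value for the cell, or None if empty.
        versions = self.rows[row_key][column]
        return versions[0][1] if versions else None

table = ToyHBaseTable()
# Columns share the "Cars" column-family prefix, as in the slide's example.
table.put("row1", "Cars:WagonR", "silver", timestamp=1)
table.put("row1", "Cars:WagonR", "blue", timestamp=2)
```

A `get` on the cell now returns "blue", the value with the latest timestamp.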
  • 29. Hbase-Installation • Before starting the installation of HBase, you need to install the Java Software Development Kit (SDK). The installation of HBase requires the following operations to be performed step by step: In the Linux terminal, install the dependencies: $ sudo apt-get install ntp libopts25 Figure 5.7 shows the installation of the dependencies for HBase:
  • 30. Hbase-Installation • The HBase environment can be customized as per the user’s needs by exporting JAVA_HOME and HBASE_OPTS in hbase-env.sh. To customize the HBase environment file, type the following code:
  • 31. Hbase-Installation • ZooKeeper, the coordination service of the Hadoop ecosystem, manages the files that HBase plans to use currently and in the future. Therefore, to let HBase manage ZooKeeper and ensure that it is enabled, use the following command: export HBASE_MANAGES_ZK=true • Figure 5.9 shows ZooKeeper enabled in HBase:
  • 32. Hbase-Installation • Site-specific customizations are done in hbase-site.xml (HBASE_HOME/conf). Figure 5.10 shows customized hbase-site.xml (HBASE_HOME/conf):
  • 33. Hbase-Installation • To enable connection with remote HBase server, edit /etc/hosts. Figure 5.11 shows the edited /etc/hosts:
  • 35. Hbase-Installation • Start HBase by using the following command: $bin/start-hbase.sh Figure 5.12 shows the initiation process of HBase daemons:
  • 36. Hbase-Installation • Check all HBase daemons by using the following command: $jps Figure 5.13 shows the implementation of the $jps command:
  • 37. Hbase-Installation • Paste the following link to access the Web interface, which has the list of tables created, along with their definition: http://localhost:60010 Figure 5.14 shows the Web interface for HBase:
  • 38. Hbase-Installation • Check the region server for HBase by pasting the following link in your Web browser: http://localhost:60030 • DT Editorial Services. Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization (p. 138). Wiley India. Kindle Edition.
  • 39. Hbase-Installation • Start the HBase shell by using the following command: $bin/hbase shell Figure 5.16 shows the $bin/hbase shell running in a terminal: