Understanding Big Data Technology
Foundations
Module 3
Syllabus
• The MapReduce Framework
• Techniques to Optimize MapReduce Jobs
• Uses of MapReduce
• Role of HBase in Big Data Processing
Understanding Big Data Technology
Foundations
• The advent of Local Area Networks (LANs) and other
networking technologies shifted the focus of the IT industry
toward solving bigger and bigger problems by combining the
computing and storage capacities of systems on the network
• This chapter focuses on explaining the basics and exploring
the relevance and role of the various functions used in the
MapReduce framework.
Big Data: The MapReduce Framework
• At the start of the 21st century, the team of engineers working
at Google concluded that, because of the increasing number of
Internet users, the resources and solutions available would be
inadequate to fulfill future requirements.
• In preparation for this upcoming issue, Google engineers
established that the concept of task distribution across
economical resources, interconnected as a cluster over the
network, could be presented as a solution.
• The concept of task distribution alone, though, could not be a
complete answer; the issue also required the tasks to be
distributed in parallel.
A parallel distribution of tasks
• Helps in the automatic expansion and contraction of processes
• Enables processes to continue without being affected by network
failures or individual system failures
• Empowers developers to use services that other developers have
created, in the context of multiple usage scenarios
• A generic implementation of this entire concept was, therefore, provided
with the development of the MapReduce programming model
Exploring the Features of MapReduce
• MapReduce keeps all the processing operations separate for parallel execution. Problems that are
extremely large in size are divided into subtasks, which are chunks of data separated into manageable
blocks.
• The principal features of MapReduce include the following:
Synchronization
Co-location of Code/Data (Data Locality)
Handling of Errors/Faults
Scale-Out Architecture
Working of
MapReduce
1. Take a large dataset or set of records.
2. Perform iteration over the data.
3. Extract some interesting patterns to prepare an
output list by using the map function.
4. Arrange the output list properly to enable
optimization for further processing.
5. Compute a set of results by using the reduce
function.
6. Provide the final output.
The MapReduce programming model also follows an
algorithm to execute the map and reduce operations.
This algorithm is depicted in the figure below.
Working of the MapReduce approach
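To make the six steps concrete, here is a minimal, self-contained Python sketch of the same pipeline (a toy word count of my own; map_fn and reduce_fn are illustrative names, not Hadoop API calls):

from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Steps 2-3: iterate over a record and emit (key, value) pairs
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Step 5: compute one result per key
    return (key, sum(values))

records = ["big data needs big tools", "map then reduce"]      # step 1: the dataset
mapped = [pair for r in records for pair in map_fn(r)]         # steps 2-3: map
mapped.sort(key=itemgetter(0))                                 # step 4: arrange the output list
results = [reduce_fn(k, [v for _, v in grp])
           for k, grp in groupby(mapped, key=itemgetter(0))]   # step 5: reduce
print(results)                                                 # step 6: final output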
Working of the MapReduce
approach
• The setup is a combination of a master and several slaves (three in the figure)
• The master monitors the entire job assigned to the MapReduce algorithm
and is given the name JobTracker
• Slaves, on the other hand, are responsible for keeping track of individual
tasks and are called TaskTrackers
• First, the given job is divided into a number of tasks by the master, i.e., the
JobTracker, which then distributes these tasks among the slaves
• It is the responsibility of the JobTracker to further keep an eye on the
processing activities and to re-execute failed tasks
• Slaves coordinate with the master by executing the tasks they are given by
the master.
• The JobTracker receives jobs from client applications that process large
amounts of information. These jobs are assigned, in the form of individual
tasks (after a job is divided into smaller parts), to various TaskTrackers
• The task distribution operation is completed by the JobTracker. The data,
after being processed by the TaskTrackers, is transmitted to the reduce function
so that the final, integrated output, which is an aggregate of the data
processed by the map function, can be provided.
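As a rough simulation of this master/slave division of labor (a toy sketch, not actual JobTracker or TaskTracker code), the following Python fragment has a master split a job into chunks, dispatch them to parallel workers, and aggregate the partial results:

from concurrent.futures import ProcessPoolExecutor

def task_tracker(chunk):
    # a slave ("TaskTracker") processes one task and returns a partial result
    return sum(chunk)

def job_tracker(data, n_tasks=3):
    # the master ("JobTracker") divides the job, distributes tasks, and aggregates
    size = max(1, len(data) // n_tasks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(task_tracker, chunks))  # tasks run in parallel
    return sum(partials)

if __name__ == "__main__":
    print(job_tracker(list(range(100))))  # 4950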
Operations performed in the MapReduce
model
• The input is provided from large data files in the form of
key-value pairs (KVPs), which are the standard input format
in the Hadoop MapReduce programming model
• The input data is divided into small pieces, and master
and slave nodes are created. The master node usually
executes on the machine where the data is present, and
slaves are made to work remotely on the data.
• The map operation is performed simultaneously on all the
data pieces, which are read by the map function. The
map function extracts the relevant data and generates
KVPs for it
The input/output operations of the
map function are shown in Figure
•The output list is generated from the map operation,
and the master instructs the reduce function about
further actions that it needs to take
•The list of KVPs obtained from the map function is
passed on to the reduce function, which sorts the data
on the basis of the KVP list
•The process of collecting the map output list from
the map function and then sorting it by key is
known as shuffling. The reduce function is then called
for every unique key, as required, to produce the final
output, which is sent to a file
The input/output operations of the reduce function are
shown in Figure
The output is finally generated by the reduce function, and the control is handed
over to the user program by the master
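The shuffle and per-key reduce described above can be pictured in a few lines of Python (purely illustrative; Hadoop performs this grouping internally):

from collections import defaultdict

map_output = [("rome", 33), ("toronto", 20), ("rome", 37), ("toronto", 18)]

shuffled = defaultdict(list)
for key, value in map_output:        # collect every value under its unique key
    shuffled[key].append(value)

for key in sorted(shuffled):         # reduce is invoked once per sorted key
    print(key, shuffled[key])        # e.g. rome [33, 37]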
The entire process of data analysis conducted in the
MapReduce programming model:
• Let’s now try to understand the
working of the MapReduce
programming model with the help of
a few examples
Example 1
• Consider that there is a data analysis project in which 20 terabytes of data needs to be
analyzed on 20 different MapReduce server nodes
• At first, the data distribution process simply copies the data to all the nodes before starting
the MapReduce process.
• You need to keep in mind that the determination of the file format rests with the user;
no standard file format is specified in MapReduce, as it is in relational databases.
• Next, the scheduler comes into the picture as it receives two programs from the
programmer: the map program and the reduce program. The data is made available
from the disk to the map function, which runs its logic on the data. In our example,
all 20 nodes perform the operation independently.
• The map function passes its results to the reduce function for summarizing and
providing the final output in aggregate form
Example 1
• The ancient Rome census can help in understanding the working of the map and
reduce functions. In the Rome census, volunteers were sent to cover the various places
situated near the kingdom of Rome. Volunteers had to count the number of
people living in the area assigned to them and send a report of the population to the
organization. The census chief added the counts of people recorded from all the areas
to reach an aggregate whole. The map function corresponds to the volunteers counting,
in parallel, the number of people living in each area; the reduce function corresponds
to the chief combining the counts into the final result.
Example 2
• A data analytics professional parses out every term available in the chat text by creating a map step. He
creates a map function to find every word in the chat. The count is incremented by one each time a word
is parsed from the paragraph.
• The map function provides the output in the form of a list that involves a number of KVPs, for example,
″<my, 1>,″ ″<product, 1>,″ ″<broke, 1>.″
• Once the operations of all the map functions are complete, the map function itself provides this
information to the scheduler.
• After the map operation completes, the reduce function starts performing the reduce operation. Keeping
in mind the current target of finding the number of times each word appears in the text, shuffling is
performed next
• This process involves distribution of the map output through hashing in order to map the same keywords
to the respective node of the reduce function. Assuming a simple situation of processing an English text,
for example, we require 26 nodes that can handle words starting with individual letters of the alphabet
• In this case, words starting with A will be handled by one node, words that start with B will be handled
by another node, and so on. Thus, the number of words can easily be counted by the reduce step.
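A hedged Python sketch of this flow, including the letter-based partitioning just described (the helper names are invented for illustration and are not Hadoop API calls):

from collections import defaultdict

def map_fn(text):
    return [(word.lower(), 1) for word in text.split()]

def partition(key, n_nodes=26):
    # shuffle by hashing: words with the same first letter go to the same node
    return (ord(key[0]) - ord("a")) % n_nodes

def reduce_fn(key, values):
    return (key, sum(values))

map_output = map_fn("my product broke my product")
nodes = defaultdict(list)
for key, value in map_output:                 # distribute the map output
    nodes[partition(key)].append((key, value))

for node_id, pairs in sorted(nodes.items()):  # each node reduces its own keys
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    print(node_id, [reduce_fn(k, v) for k, v in sorted(grouped.items())])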
The detailed MapReduce process used in this
example:
•The final output of the process will include ″<my, 10>,″ ″<product, 25>,″ and ″<broke, 20>,″ where the
first value in each angular bracket (<>) is the analyzed word, and the second value is the count of the
word, i.e., the number of times the word appears within the entire text
• The result set will include 26 files. Each of these files is produced by an individual node and contains
the counts of words in sorted order. You need to keep in mind that the combining operation will also
require a process to handle all 26 files obtained as a result of the MapReduce operations. After we
obtain the word counts, we can feed the results into any kind of analysis.
Exploring Map and Reduce Functions
•The MapReduce programming model facilitates faster data analysis, for which the data is taken in the
form of KVPs.
•Map and reduce functions for Hadoop can be created in many languages; however, programmers
generally prefer to create them in Java. The Pipes library allows C++ source code to be used for map
and reduce code
•The generic Application Programming Interface (API) called Streaming allows programs created in
most languages to be used as map and reduce functions in Hadoop
•Consider an example of a program that counts the number of Indian cities having a population above
one lakh. Note that the following is not programming code but a plain-English
representation of the solution to the problem.
•One way to achieve this task is to determine the input data and generate a list in the following
manner:
mylist = ("all cities in India that participated in the most recent general
election")
Exploring Map and Reduce Functions
• Use the map function to create a function, howManyPeople, which selects the cities
having a population of more than one lakh:
map howManyPeople (mylist) = [howManyPeople "city 1"; howManyPeople "city 2";
howManyPeople "city 3"; howManyPeople "city 4"; ...]
•Now, generate a new output list of all the cities having a population of more than one lakh:
(no, city 1; yes, city 2; no, city 3; yes, city 4; ?, city nnn)
•The preceding function executes without making any modifications to the original list.
Moreover, you can notice that each element of the output list maps to a
corresponding element of the input list, with a “yes” or “no” attached.
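The same idea in runnable Python (the city names and populations below are invented purely for illustration):

cities = {"city 1": 40_000, "city 2": 250_000, "city 3": 90_000, "city 4": 700_000}

def how_many_people(city):
    # map step: tag each city with "yes"/"no" without modifying the input list
    return ("yes" if cities[city] > 100_000 else "no", city)

output_list = [how_many_people(c) for c in cities]
print(output_list)  # [('no', 'city 1'), ('yes', 'city 2'), ('no', 'city 3'), ('yes', 'city 4')]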
Example: city is the key and temperature is the value
• Out of all the data we have collected, we want to find the maximum temperature for each
city across all of the data files (note that each file might have the same city represented
multiple times). Using the MapReduce framework, we can break this down into five map
tasks, where each mapper works on one of the five files; the mapper task goes through
the data and returns the maximum temperature for each city. For example, the results
produced by one mapper task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Example: city is the key and temperature is the value (continued)
Let's assume the other four mapper tasks produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the
input results and output a single value for each city, producing the final result set as
follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
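A short Python sketch of this max-temperature job (plain Python for illustration; the streams copy the intermediate results listed above):

mapper_outputs = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

maxima = {}
for stream in mapper_outputs:              # reduce: combine all five output streams
    for city, temp in stream:
        maxima[city] = max(maxima.get(city, temp), temp)
print(maxima)  # {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}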
Techniques to Optimize MapReduce Jobs
•MapReduce optimization techniques fall into the following categories:
- Hardware or network topology
- Synchronization
- File system
•You need to keep the following points in mind while designing a file system that supports
a MapReduce implementation:
- Keep it Warm
- The Bigger the Better
- The Long View
- Right Degree of Security
The fields benefited by the use of MapReduce:
1. Web Page Visits: Suppose a researcher wants to know the number of times the website of a particular
newspaper was accessed. The map task would be to read the logs of the Web page requests and make a
complete list of <URL, 1> entries.
The reduce function would then find the results for the newspaper URL and add them together.
The output of this process is:
<newspaperURL, 3>
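In Python terms, this job might be sketched as follows (the log format is invented for illustration):

from collections import Counter

log = ["GET /newspaper", "GET /sports", "GET /newspaper", "GET /newspaper"]
pairs = [(line.split()[1], 1) for line in log]   # map: emit <URL, 1> per request
counts = Counter()
for url, one in pairs:                           # reduce: add the 1s per URL
    counts[url] += one
print(counts["/newspaper"])                      # 3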
The fields benefited by the use of MapReduce:
2. Web Page Visitor Paths: Consider a situation in which an advocacy group wishes to
know how visitors get to know about its website. To determine this, the page that
contains a link is designated the “source,” and the Web page to which the link transfers
the visitor is known as the “target.” The map function scans the Web links and returns
results of the type <target, source>. The reduce function scans this list to determine
the results where the “target” is the group's Web page. The reduce function output,
which is the final output, will be of the form <advocacy group page, list(source)>.
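A sketch of the same grouping in Python (the page names are made up):

from collections import defaultdict

links = [("group.org", "blog.example"), ("group.org", "news.example"),
         ("other.org", "blog.example")]      # map output: (target, source)

paths = defaultdict(list)
for target, source in links:                 # reduce: gather sources per target
    paths[target].append(source)
print(paths["group.org"])                    # ['blog.example', 'news.example']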
The fields benefited by the use of MapReduce:
3. Word Frequency: A researcher wishes to read articles about floods, but he
does not want those articles in which flooding is discussed only as a minor topic.
Therefore, he decides that an article dealing seriously with floods should have
the word “flood” in it more than 10 times. The map function will count the
number of times the specified word occurs in each document and provide the
result as <document, frequency>. The reduce function will then select only the
results that have a frequency of more than 10 occurrences.
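For illustration, a toy version of this filter in Python (the document contents are invented):

docs = {"doc1": "flood " * 12, "doc2": "a brief flood mention"}
freqs = [(name, text.split().count("flood")) for name, text in docs.items()]  # map: <document, frequency>
print([name for name, f in freqs if f > 10])  # reduce keeps documents above the threshold: ['doc1']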
The fields benefited by the use of MapReduce:
4. Word Count: Suppose a researcher wishes to determine the number of times celebrities talk about the present
bestseller. The data to be analyzed comprises the written blogs, posts, and tweets of the celebrities. The map function
will make a list of all the words. This list will be in the KVP format, in which the key is each word, and the value is
1 for every appearance of that word. The reduce function then converts this map output into pairs of the form
<word, total count>.
HBase
• The MapReduce programming model can utilize other components of the Hadoop ecosystem to
perform its operations better. One such component is HBase.
• Role of HBase in Big Data Processing: HBase is an open-source, non-relational, distributed,
column-oriented database developed as part of the Apache Software Foundation's Hadoop project.
• While MapReduce enhances Big Data processing, HBase takes care of its storage and access
requirements. Characteristics of HBase: HBase helps programmers to store large quantities
of data in such a way that it can be accessed easily and quickly, as and when required
• It stores data in a compressed format and thus occupies less memory space. HBase has low
latency and is, therefore, beneficial for lookups and scans of large amounts of data.
HBase saves data in cells in descending order of timestamp; therefore, a read
will always find the most recent values first. Columns in HBase belong to a column family.
• The column family name is used as a prefix for identifying the members of the family; for
instance, Cars:WagonR and Cars:i10 are members of the Cars column family. A key is
associated with each row in an HBase table
HBase
• The structure of the key is very flexible. It can be a calculated value, a string, or
any other data structure. The key is used for controlling the retrieval of data from the
cells in the row. All these characteristics help build the schema of the HBase data
structure before any data is stored. Moreover, tables can be modified, and new
column families can be added, once the database is up and running.
• Columns can be added very easily and are added row by row, providing great
flexibility, performance, and scalability. If you have a large volume and
variety of data, you can use a columnar database. HBase is suitable in conditions
where the data changes gradually and rapidly; it can store slowly changing data
and ensure its availability for Hadoop tasks.
• HBase is a framework written in Java for supporting applications that process
Big Data. It is a non-relational Hadoop database that provides fault
tolerance for huge amounts of data.
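As a usage sketch, the column-family layout above could be exercised from Python with the third-party happybase client (my choice for illustration; the chapter does not prescribe a client library, and a running HBase Thrift server is assumed):

import happybase

connection = happybase.Connection("localhost")            # Thrift gateway assumed on the default port
connection.create_table("vehicles", {"Cars": dict()})     # one column family: Cars
table = connection.table("vehicles")

# Qualifiers carry the family prefix, e.g. Cars:WagonR and Cars:i10
table.put(b"row1", {b"Cars:WagonR": b"hatchback", b"Cars:i10": b"hatchback"})
print(table.row(b"row1"))                                  # reads return the most recent cell values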
Hbase-Installation
• Before starting the installation of HBase, you need to install the Java Software
Development Kit (SDK). The installation of HBase requires the following
operations to be performed stepwise. In a Linux terminal, install the dependencies:
$sudo apt-get install ntp libopts25
Figure 5.7 shows the installation of the dependencies for HBase.
Hbase-Installation
• The HBase file can be customized as per the user's needs by exporting JAVA_HOME
and HBASE_OPTS in hbase-env.sh. To customize the HBase file, type code along the
lines shown below.
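For example, typical hbase-env.sh entries might look like the following (the JDK path shown is an assumption and varies by system; the GC option is a commonly used default):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # path is an assumption; point this at your JDK
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"          # a commonly used default GC setting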
Hbase-Installation
• ZooKeeper, the file management engine of the Hadoop ecosystem, manages the files
that HBase plans to use currently and in the future. Therefore, to let HBase manage
ZooKeeper and ensure that it is enabled, use the following command:
export HBASE_MANAGES_ZK=true
• Figure 5.9 shows ZooKeeper enabled in HBase.
Hbase-Installation
• Site-specific customizations are done in hbase-site.xml (HBASE_HOME/conf). Figure
5.10 shows a customized hbase-site.xml:
Hbase-Installation
• To enable connection with a remote HBase server, edit /etc/hosts. Figure
5.11 shows the edited /etc/hosts:
Hbase-Installation
• Start HBase by using the following command:
$bin/start-hbase.sh
Figure 5.12 shows the initiation process of the HBase daemons:
Hbase-Installation
• Check all HBase daemons by using the following command:
$jps
Figure 5.13 shows the output of the $jps command:
Hbase-Installation
• Paste the following link into your Web browser to access the Web interface, which
lists the tables created, along with their definitions:
http://localhost:60010
Figure 5.14 shows the Web interface for HBase:
Hbase-Installation
• Check the region server for HBase by pasting the following link in your Web browser:
http://localhost:60030
Hbase-Installation
• Start the HBase shell by using the following command:
$bin/hbase shell
Figure 5.16 shows the $bin/hbase shell running in a terminal:
Thank You

More Related Content

Similar to module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf

Hadoop
HadoopHadoop
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
Subhas Kumar Ghosh
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
softwarequery
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
ijccsa
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
Ankit Gupta
 
Map reduce
Map reduceMap reduce
Map reduce
대호 김
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
bhuvankumar3877
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
Romain Jacotin
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
Cleverence Kombe
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
BikalAdhikari4
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
পল্লব রায়
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
lmphuong06
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
closd computing 4th updated MODULE-4.pptx
closd computing 4th updated MODULE-4.pptxclosd computing 4th updated MODULE-4.pptx
closd computing 4th updated MODULE-4.pptx
MaruthiPrasad96
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
sreehari orienit
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
Indhujeni
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
dbpublications
 
Hadoop
HadoopHadoop
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
RojaT4
 

Similar to module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
Map reduce
Map reduceMap reduce
Map reduce
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
closd computing 4th updated MODULE-4.pptx
closd computing 4th updated MODULE-4.pptxclosd computing 4th updated MODULE-4.pptx
closd computing 4th updated MODULE-4.pptx
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 

Recently uploaded

22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
sharvaridhokte
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
VICTOR MAESTRE RAMIREZ
 
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdfOCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
Muanisa Waras
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
pavanaroshni1977
 
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdfGUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
GUIA_LEGAL_CHAPTER-9_COLOMBIAN ELECTRICITY (1).pdf
ProexportColombia1
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
YanKing2
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
Celine George
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
Anwar Patel
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
IJAEMSJORNAL
 
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafePaharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
aarusi sexy model
 
Introduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer NetworkingIntroduction to IP address concept - Computer Networking
Introduction to IP address concept - Computer Networking
Md.Shohel Rana ( M.Sc in CSE Khulna University of Engineering & Technology (KUET))
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
Tool and Die Tech
 
L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx
naseki5964
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
yadavsuyash008
 
IS Code SP 23: Handbook on concrete mixes
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
Mani Krishna Sarkar
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
Jim Mimlitz, P.E.
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
Servizi a rete
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
surekha1287
 
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
IJAEMSJORNAL
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
sipij
 

Recently uploaded (20)

22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
 
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdfOCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
 
LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
• 5. A parallel distribution of tasks
  • Helps in the automatic expansion and contraction of processes
  • Enables processes to continue without being affected by network failures or individual system failures
  • Empowers developers to use services that other developers have created, in the context of multiple usage scenarios
  • A generic implementation of the entire concept was, therefore, provided with the development of the MapReduce programming model
• 6. Exploring the Features of MapReduce
  • MapReduce keeps all the processing operations separate for parallel execution. Problems that are extremely large in size are divided into subtasks, which are chunks of data separated into manageable blocks.
  • The principal features of MapReduce include the following:
     Synchronization
     Co-location of Code/Data (Data Locality)
     Handling of Errors/Faults
     Scale-Out Architecture
• 7. Working of MapReduce
  1. Take a large dataset or set of records.
  2. Perform iteration over the data.
  3. Extract some interesting patterns to prepare an output list by using the map function.
  4. Arrange the output list properly to enable optimization for further processing.
  5. Compute a set of results by using the reduce function.
  6. Provide the final output.
  The MapReduce programming model also works on an algorithm to execute the map and reduce operations. This algorithm can be depicted as follows.
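As a rough, non-Hadoop illustration of steps 1-6 above, here is a minimal Python sketch (the tiny dataset is invented for illustration); the figure that follows then shows how Hadoop itself organizes the same flow:

from itertools import groupby
from operator import itemgetter

# Step 1: a (tiny) set of records
records = ["apple orange", "orange banana", "apple apple"]

# Steps 2-3: iterate over the data and map each record to (key, value) pairs
mapped = [(word, 1) for record in records for word in record.split()]

# Step 4: arrange the output list (sort by key) so equal keys sit together
mapped.sort(key=itemgetter(0))

# Step 5: reduce - aggregate the values for each key
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

# Step 6: provide the final output, e.g. {'apple': 3, 'banana': 1, 'orange': 2}
print(reduced)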
  • 8. Working of the MapReduce approach
• 9. Working of the MapReduce approach
  • Is a combination of a master and three slaves
  • The master monitors the entire job assigned to the MapReduce algorithm and is given the name JobTracker
  • Slaves, on the other hand, are responsible for keeping track of individual tasks and are called TaskTrackers
  • First, the given job is divided into a number of tasks by the master, i.e., the JobTracker, which then distributes these tasks among the slaves
  • It is the responsibility of the JobTracker to further keep an eye on the processing activities and the re-execution of failed tasks
  • Slaves coordinate with the master by executing the tasks they are given by the master
  • The JobTracker receives jobs from client applications to process large information. These jobs are assigned, in the form of individual tasks (after a job is divided into smaller parts), to various TaskTrackers
  • The task distribution operation is completed by the JobTracker. The data, after being processed by the TaskTrackers, is transmitted to the reduce function so that the final, integrated output, which is an aggregate of the data processed by the map function, can be provided
• 10. Operations performed in the MapReduce model
  • The input is provided from large data files in the form of key-value pairs (KVPs), which is the standard input format in the Hadoop MapReduce programming model
  • The input data is divided into small pieces, and master and slave nodes are created. The master node usually executes on the machine where the data is present, and slaves are made to work remotely on the data
  • The map operation is performed simultaneously on all the data pieces, which are read by the map function. The map function extracts the relevant data and generates KVPs for it
• 11. The input/output operations of the map function are shown in the figure
  • The output list is generated from the map operation, and the master instructs the reduce function about further actions that it needs to take
  • The list of KVPs obtained from the map function is passed on to the reduce function. The reduce function sorts the data on the basis of the KVP list
  • The process of collecting the map output list from the map function and then sorting it by key is known as shuffling. Every unique key is then taken up by the reduce function and processed, as required, to produce the final output, which is sent to the output file
• 12. The input/output operations of the reduce function are shown in the figure
  • The output is finally generated by the reduce function, and control is handed over to the user program by the master
• 13. The entire process of data analysis conducted in the MapReduce programming model
  • Let's now try to understand the working of the MapReduce programming model with the help of a few examples
• 14. Example 1
  • Consider a data analysis project in which 20 terabytes of data needs to be analyzed on 20 different MapReduce server nodes
  • At first, the data distribution process simply copies the data to all the nodes before starting the MapReduce process
  • Keep in mind that, unlike in relational databases, MapReduce specifies no standard file format; the determination of the format of the file rests with the user
  • Next, the scheduler comes into the picture as it receives two programs from the programmer: the map and reduce programs. The data is made available from the disk to the map function, which runs its logic on the data. In our example, all 20 nodes independently perform the operation
  • The map function passes the results to the reduce function for summarizing and providing the final output in an aggregate form
• 15. Example 1
  • The ancient Roman census can help in understanding the mapping process of the map and reduce functions. In the Roman census, volunteers were sent to cover the various places situated near the kingdom of Rome. Volunteers had to count the number of people living in the area assigned to them and send the report of the population to the organization. The census chief added the counts of people recorded from all the areas to reach an aggregate whole. The map function is analogous to the volunteers counting the people of their areas in parallel, and the reduce function is analogous to the chief combining the entire result.
• 16. Example 2
  • A data analytics professional parses every term available in the chat text by creating a map step. He creates a map function to find every word of the chat. The count is incremented by one each time a word is parsed from the paragraph
  • The map function provides the output in the form of a list of KVPs, for example, ″<my, 1>,″ ″<product, 1>,″ ″<broke, 1>″
  • Once the operations of all map functions are complete, the information is provided to the scheduler by the map function itself
  • After completing the map operation, the reduce function starts performing the reduce operation. Keeping in mind the current target of finding the count of the number of times a word appears in the text, shuffling is performed next
  • This process involves distribution of the map output through hashing, in order to map the same keywords to the respective node of the reduce function. Assuming a simple situation of processing an English text, for example, we require 26 nodes that can handle words starting with the individual letters of the alphabet
  • In this case, words starting with A will be handled by one node, words that start with B will be handled by another node, and so on. Thus, the number of words can easily be counted by the reduce step
• 17. The detailed MapReduce process used in this example:
  • The final output of the process will include ″<my, 10>,″ ″<product, 25>,″ and ″<broke, 20>,″ where the first value in each angular bracket (<>) is the analyzed word, and the second value is the count of the word, i.e., the number of times the word appears within the entire text
  • The result set will include 26 files. Each of these files is produced by an individual node and contains the count of words in sorted order. Keep in mind that the combining operation will also require a process to handle all 26 files obtained as a result of the MapReduce operations. After we obtain the count of words, we can feed the results into any kind of analysis
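In practice, a word count of this kind is often written against the streaming API mentioned on the next slide, which lets map and reduce functions read standard input and write standard output. The following pair of Python scripts is a sketch of that style (the file names and the tab-separated KVP convention are illustrative assumptions, not something specified in this deck):

# mapper.py - emits "<word>\t1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - the shuffle delivers input sorted by key, so equal words are adjacent
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The pipeline can be mimicked locally with: cat chat.txt | python mapper.py | sort | python reducer.py, where the sort step plays the role of shuffling.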
• 18. Exploring Map and Reduce Functions
  • The MapReduce programming model facilitates faster data analysis, for which the data is taken in the form of KVPs
  • Both MapReduce functions and Hadoop can be created in many languages; however, programmers generally prefer to create them in Java. The Pipes library allows C++ source code to be utilized for map and reduce code
  • The generic Application Programming Interface (API) called streaming allows programs created in most languages to be utilized as map and reduce functions in Hadoop
  • Consider an example of a program that counts the number of Indian cities having a population of above one lakh. Note that the following is not programming code, but a plain-English representation of the solution to the problem
  • One way to achieve the task is to determine the input data and generate a list in the following manner:
    mylist = ("all cities in India")
• 19. Exploring Map and Reduce Functions
  • Use the map function to create a function, howManyPeople, which checks whether a city has a population of more than one lakh:
    map howManyPeople (mylist) = [howManyPeople "city 1"; howManyPeople "city 2"; howManyPeople "city 3"; howManyPeople "city 4"; ...]
  • Now, generate a new output list marking all the cities having a population of more than one lakh:
    (no, city 1; yes, city 2; no, city 3; yes, city 4; ?, city nnn)
  • The preceding function gets executed without making any modifications to the original list. Moreover, you can notice that each element of the output list gets mapped to a corresponding element of the input list, with a "yes" or "no" attached
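Rendered literally in Python, the plain-English pseudo-code above might look as follows (the city names and population figures are invented placeholders):

# Hypothetical input list of (city, population) pairs
mylist = [("city 1", 50_000), ("city 2", 250_000),
          ("city 3", 80_000), ("city 4", 400_000)]

def howManyPeople(record):
    # Tag a city with "yes" or "no" depending on whether its
    # population exceeds one lakh (100,000)
    city, population = record
    return ("yes" if population > 100_000 else "no", city)

# Map over the input list without modifying it; each output element
# corresponds to one input element, with "yes" or "no" attached
output_list = list(map(howManyPeople, mylist))
print(output_list)
# [('no', 'city 1'), ('yes', 'city 2'), ('no', 'city 3'), ('yes', 'city 4')]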
• 20. Example: city is the key and temperature is the value
  • Assume we have five data files, where each record holds a city (the key) and a recorded temperature (the value)
  • Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times)
  • Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files; the mapper task goes through the data and returns the maximum temperature for each city
  • For example, the results produced from one mapper task would look like this: (Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
• 21. Example: city is the key and temperature is the value
  • Let's assume the other four mapper tasks produced the following intermediate results:
    (Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
    (Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
    (Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
    (Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
  • All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final result set as follows:
    (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
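The same computation is easy to reproduce end to end in a short Python sketch; the five intermediate lists below are copied from the slides, and only the reduce step is new code:

# Intermediate (city, temperature) results from the five mapper tasks
mapper_outputs = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

# Reduce: combine all five streams and keep a single maximum per city
maxima = {}
for stream in mapper_outputs:
    for city, temperature in stream:
        maxima[city] = max(maxima.get(city, temperature), temperature)

print(maxima)  # {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}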
• 22. Techniques to Optimize MapReduce Jobs
  • MapReduce optimization techniques are in the following categories:
     Hardware or network topology
     Synchronization
     File system
  • You need to keep the following points in mind while designing a file that supports MapReduce implementation:
     Keep it Warm
     The Bigger the Better
     The Long View
     Right Degree of Security
• 23. The fields benefitted by the use of MapReduce are:
  1. Web Page Visits: Suppose a researcher wants to know the number of times the website of a particular newspaper was accessed. The map task would be to read the logs of the Web page requests and emit a KVP for each hit, so the map outputs may look similar to the following: <newspaperURL, 1> <newspaperURL, 1> <newspaperURL, 1>. The reduce function would find the results for the newspaper URL and add them, producing the output: <newspaperURL, 3>
• 24. The fields benefitted by the use of MapReduce are:
  2. Web Page Visitor Paths: Consider a situation in which an advocacy group wishes to know how visitors get to its website. The page that contains a link is known as the "source," and the Web page to which the link transfers the visitor is known as the "target." The map function scans the Web links to return results of the type <target, source>. The reduce function scans this list to determine the results where the "target" is the advocacy group's Web page. The reduce function output, which is the final output, will be of the form <advocacy group page, list(source)>
• 25. The fields benefitted by the use of MapReduce are:
  3. Word Frequency: A researcher wishes to read articles about floods, but does not want articles in which flooding is discussed only as a minor topic. He therefore decides that an article dealing primarily with earthquakes and floods should have the word "tectonic plate" in it more than 10 times. The map function will count the number of times the specified word occurs in each document and provide the result as <document, frequency>. The reduce function will then select only the documents whose frequency is more than 10
• 26. The fields benefitted by the use of MapReduce are:
  4. Word Count: Suppose a researcher wishes to determine the number of times celebrities talk about the present bestseller. The data to be analyzed comprises the written blogs, posts, and tweets of the celebrities. The map function will make a list of all the words, in KVP format, where the key is each word and the value is 1 for every appearance of that word; for example, the map output might look like <bestseller, 1>, <book, 1>, <bestseller, 1>. The reduce function will then sum the values for each key, converting the preceding output into the form <bestseller, 2>, <book, 1>
• 27. HBase
  • The MapReduce programming model can utilize other components of the Hadoop ecosystem to perform its operations better. One such component is HBase
  • Role of HBase in Big Data Processing: HBase is an open-source, non-relational, distributed, column-oriented database developed as a part of the Apache Software Foundation's Hadoop project
  • While MapReduce enhances Big Data processing, HBase takes care of its storage and access requirements
  • Characteristics of HBase: HBase helps programmers to store large quantities of data in such a way that it can be accessed easily and quickly, as and when required
  • It stores data in a compressed format and thus occupies less memory space. HBase has low latency and is, therefore, beneficial for lookups and scans of large amounts of data
  • HBase saves data in cells in descending order of timestamp; therefore, a read will always find the most recent values first
  • Columns in HBase belong to a column family. The column family name is utilized as a prefix for identifying the members of the family; for instance, Cars:Wagon R and Cars:i10 are members of the Cars column family. A key is associated with each row in an HBase table
• 28. HBase
  • The structure of the key is very flexible: it can be a calculated value, a string, or any other data structure. The key is used for controlling the retrieval of data from the cells in the row. All these characteristics help build the schema of the HBase data structure before any data is stored. Moreover, tables can be modified and new column families added once the database is up and running
  • Columns can be added very easily and are added row-by-row, providing great flexibility, performance, and scalability. In case you have a large volume and variety of data, you can use a columnar database. HBase is suitable in conditions where the data changes gradually and rapidly; in addition, HBase can save data that has a slow-changing rate and ensure its availability for Hadoop tasks
  • HBase is a framework written in Java for supporting applications that process Big Data. HBase is a non-relational Hadoop database that provides fault tolerance for huge amounts of data
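To make the column-family idea concrete, here is a brief HBase shell sketch built around the Cars family mentioned above (the table name 'garage', row key, and cell values are invented for illustration):

hbase> create 'garage', 'Cars'                        # table with one column family
hbase> put 'garage', 'row1', 'Cars:WagonR', 'petrol'  # column = family:qualifier
hbase> put 'garage', 'row1', 'Cars:i10', 'petrol'
hbase> get 'garage', 'row1'                           # retrieve one row by its key
hbase> scan 'garage'                                  # scan the whole table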
• 29. HBase Installation
  • Before starting the installation of HBase, you need to install the Java Software Development Kit (SDK). The installation of HBase requires the following operations to be performed step by step
  • In the terminal, install the dependencies: $ sudo apt-get install ntp libopts25
  • Figure 5.7 shows the installation of the dependencies for HBase
• 30. HBase Installation
  • The HBase file can be customized as per the user's needs by exporting JAVA_HOME and HBASE_OPTS (hbase-env.sh). To customize an HBase file, type code along the following lines:
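The figure with the exact exports is not reproduced here, but entries in hbase-env.sh typically look like the following sketch (the JDK path is an assumption and will differ from machine to machine):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # assumed JDK location; adjust to your install
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"          # example JVM options for the HBase daemons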
• 31. HBase Installation
  • ZooKeeper, the coordination service of the Hadoop ecosystem, manages the files that HBase uses currently and plans to use in the future. Therefore, to have HBase manage ZooKeeper and ensure that it is enabled, use the following command: export HBASE_MANAGES_ZK=true
  • Figure 5.9 shows ZooKeeper enabled in HBase
• 32. HBase Installation
  • Site-specific customizations are done in hbase-site.xml (HBASE_HOME/conf). Figure 5.10 shows a customized hbase-site.xml
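The customized file appears only as a figure; a minimal hbase-site.xml along these lines (the local paths are assumptions) tells HBase where to keep its data and where ZooKeeper stores its state:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/hbase</value>        <!-- assumed local path -->
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/zookeeper</value>           <!-- assumed local path -->
  </property>
</configuration>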
• 33. HBase Installation
  • To enable connection with a remote HBase server, edit /etc/hosts. Figure 5.11 shows the edited /etc/hosts
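The edited file appears only as a figure; a typical /etc/hosts entry for reaching a remote HBase server looks like this (the IP address and hostname are placeholders):

127.0.0.1      localhost
192.168.1.10   hbase-server   # placeholder address and name of the remote HBase machine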
• 35. HBase Installation
  • Start HBase by using the following command: $ bin/start-hbase.sh
  • Figure 5.12 shows the initiation process of the HBase daemons
• 36. HBase Installation
  • Check all HBase daemons by using the following command: $ jps
  • Figure 5.13 shows the implementation of the jps command
• 37. HBase Installation
  • Paste the following link in your Web browser to access the Web interface, which lists the tables created, along with their definitions: http://localhost:60010
  • Figure 5.14 shows the Web interface for HBase
• 38. HBase Installation
  • Check the region server for HBase by pasting the following link in your Web browser: http://localhost:60030
• 39. HBase Installation
  • Start the HBase shell by using the following command: $ bin/hbase shell
  • Figure 5.16 shows the HBase shell running in a terminal