MapReduce
Algorithms
CSE 490H
Algorithms for MapReduce
 Sorting
 Searching
 TF-IDF
 BFS
 PageRank
 More advanced algorithms
MapReduce Jobs
 Tend to be very short, code-wise
IdentityReducer is very common
 “Utility” jobs can be composed
 Represent a data flow, more so than a
procedure
Sort: Inputs
 A set of files, one value per line.
 Mapper key is file name, line number
 Mapper value is the contents of the line

Sort Algorithm
 Takes advantage of reducer properties:
(key, value) pairs are processed in order
by key; reducers are themselves ordered
 Mapper: Identity function for value
(k, v) → (v, _)
 Reducer: Identity function (k’, _) → (k’, “”)
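A minimal sketch of these two identity functions in Hadoop's Java API (class names are illustrative and the driver setup is omitted; with the default TextInputFormat the mapper key is a byte offset rather than the (file name, line number) pair on the slide):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (k, v) -> (v, _): move the line contents into the key so the framework sorts it.
class SortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(line, NullWritable.get());
  }
}

// Identity on the key: the shuffle has already delivered keys in sorted order.
class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    for (NullWritable v : values) {          // keep duplicates, one output line each
      context.write(key, NullWritable.get());
    }
  }
}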
Sort: The Trick
 (key, value) pairs from mappers are sent to a
particular reducer based on hash(key)
 Must pick the hash function for your data such
that k1 < k2 => hash(k1) < hash(k2)
[Diagram: mappers M1, M2, M3 feed reducers R1, R2 through the partition and shuffle step]
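The ordering requirement lands on the partitioner, not the mappers. Below is a sketch of a hypothetical range partitioner; the split points are made up, and in practice they would be sampled from the input, which is what Hadoop's TotalOrderPartitioner automates:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys by range so every key sent to reducer i sorts before every key
// sent to reducer i+1; concatenating reducer outputs then gives a total order.
class RangePartitioner extends Partitioner<Text, NullWritable> {
  // Illustrative split points only; real jobs sample the data to choose them.
  private static final String[] SPLITS = { "g", "n", "t" };

  @Override
  public int getPartition(Text key, NullWritable value, int numPartitions) {
    String k = key.toString();
    int bucket = 0;
    while (bucket < SPLITS.length && k.compareTo(SPLITS[bucket]) >= 0) {
      bucket++;
    }
    return Math.min(bucket, numPartitions - 1);  // never exceed configured reducers
  }
}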
Final Thoughts on Sort
 Used as a test of Hadoop’s raw speed
 Essentially “IO drag race”
 Highlights utility of GFS
Search: Inputs
 A set of files containing lines of text
 A search pattern to find
 Mapper key is file name, line number
 Mapper value is the contents of the line
 Search pattern sent as special parameter

Search Algorithm
 Mapper:
Given (filename, some text) and “pattern”, if
“text” matches “pattern” output (filename, _)
 Reducer:
Identity function
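A sketch of the search mapper, assuming the pattern is passed through the job Configuration under an arbitrary property name ("search.pattern" is my choice, not from the slides). The slides treat the file name as part of the mapper key; with the default TextInputFormat the key is a byte offset, so this sketch recovers the file name from the input split instead:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private String pattern;

  @Override
  protected void setup(Context context) {
    // Hypothetical property name set by the driver before the job runs.
    pattern = context.getConfiguration().get("search.pattern");
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (line.toString().contains(pattern)) {
      // Recover the file name from the current input split.
      String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
      context.write(new Text(filename), NullWritable.get());
    }
  }
}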
Search: An Optimization
 Once a file is found to be interesting, we
only need to mark it that way once
 Use Combiner function to fold redundant
(filename, _) pairs into a single one
Reduces network I/O
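A combiner for this job only has to collapse duplicates; a minimal sketch (class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Many (filename, _) pairs in, exactly one out: less data crosses the network.
class DedupCombiner extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text filename, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(filename, NullWritable.get());
  }
}

The same class can double as the reducer; the driver would register it with job.setCombinerClass(DedupCombiner.class).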
TF-IDF
 Term Frequency – Inverse Document
Frequency
Relevant to text processing
Common web analysis algorithm
The Algorithm, Formally
For a term that appears n times in a document of N total terms, in a corpus of |D| documents of which m contain the term:
TF = n / N
IDF = log( |D| / m )
TF·IDF = TF × IDF
• |D| : total number of documents in the corpus
• m : number of documents where the term appears (that is, documents with n > 0)

Information We Need
 Number of times term X appears in a
given document
 Number of terms in each document
 Number of documents X appears in
 Total number of documents
Job 1: Word Frequency in Doc
 Mapper
Input: (docname, contents)
Output: ((word, docname), 1)
 Reducer
Sums counts for word in document
Outputs ((word, docname), n)
 Combiner is same as Reducer
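A sketch of Job 1, assuming an input format that delivers (docname, contents) pairs as the slide states, and flattening the composite (word, docname) key into a comma-joined Text for simplicity (a real job might define a custom WritableComparable):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (docname, contents); output: ((word, docname), 1)
class WordFreqMapper extends Mapper<Text, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Text docname, Text contents, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(contents.toString());
    while (tok.hasMoreTokens()) {
      context.write(new Text(tok.nextToken() + "," + docname), ONE);
    }
  }
}

// Sums the 1's for each (word, docname) pair; also registered as the combiner.
class WordFreqReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text wordAndDoc, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int n = 0;
    for (IntWritable c : counts) {
      n += c.get();
    }
    context.write(wordAndDoc, new IntWritable(n));
  }
}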
Job 2: Word Counts For Docs
 Mapper
Input: ((word, docname), n)
Output: (docname, (word, n))
 Reducer
Sums frequency of individual n’s in same doc
Feeds original data through
Outputs ((word, docname), (n, N))
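A sketch of Job 2 using the same comma-joined key convention, with tab-separated value fields (both are my conventions, not from the slides). The reducer buffers one document's (word, n) pairs, totals N, then feeds the original data through:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input: ("word,docname", n); output: (docname, "word \t n")
class DocLengthMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text wordAndDoc, Text n, Context context)
      throws IOException, InterruptedException {
    String[] key = wordAndDoc.toString().split(",");   // [word, docname]
    context.write(new Text(key[1]), new Text(key[0] + "\t" + n));
  }
}

// Buffers the (word, n) pairs of one document, totals N, then re-emits
// ((word, docname), (n, N)).
class DocLengthReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text docname, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String[]> wordCounts = new ArrayList<>();
    long totalTerms = 0;                               // N for this document
    for (Text v : values) {
      String[] parts = v.toString().split("\t");       // [word, n]
      wordCounts.add(parts);
      totalTerms += Long.parseLong(parts[1]);
    }
    for (String[] wc : wordCounts) {
      context.write(new Text(wc[0] + "," + docname),
                    new Text(wc[1] + "\t" + totalTerms));
    }
  }
}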
Job 3: Word Frequency In Corpus
 Mapper
Input: ((word, docname), (n, N))
Output: (word, (docname, n, N, 1))
 Reducer
Sums counts for word in corpus
Outputs ((word, docname), (n, N, m))
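A sketch of Job 3 in the same style. Buffering all of a word's per-document records in the reducer is exactly the memory concern raised on the "Working At Scale" slide later:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input: ("word,docname", "n \t N"); output: (word, "docname \t n \t N \t 1")
class CorpusFreqMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text wordAndDoc, Text nAndN, Context context)
      throws IOException, InterruptedException {
    String[] key = wordAndDoc.toString().split(",");   // [word, docname]
    context.write(new Text(key[0]), new Text(key[1] + "\t" + nAndN + "\t1"));
  }
}

// Buffers the per-document records for one word, sums the 1's into m,
// then re-emits ((word, docname), (n, N, m)).
class CorpusFreqReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text word, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String[]> perDoc = new ArrayList<>();
    long m = 0;                                        // documents containing word
    for (Text v : values) {
      String[] parts = v.toString().split("\t");       // [docname, n, N, 1]
      perDoc.add(parts);
      m += Long.parseLong(parts[3]);
    }
    for (String[] p : perDoc) {
      context.write(new Text(word + "," + p[0]),
                    new Text(p[1] + "\t" + p[2] + "\t" + m));
    }
  }
}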

Job 4: Calculate TF-IDF
 Mapper
Input: ((word, docname), (n, N, m))
Assume D is known (or, easy MR to find it)
Output ((word, docname), TF*IDF)
 Reducer
Just the identity function
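A sketch of the Job 4 mapper, passing |D| through the Configuration under an illustrative property name ("total.docs"); the reducer is the identity, which Hadoop's stock Reducer base class already provides:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: ("word,docname", "n \t N \t m"); output: ((word, docname), TF*IDF)
class TfIdfMapper extends Mapper<Text, Text, Text, DoubleWritable> {
  private long totalDocs;                              // |D|

  @Override
  protected void setup(Context context) {
    // Hypothetical property set by the driver (or by a trivial preliminary MR job).
    totalDocs = context.getConfiguration().getLong("total.docs", 1);
  }

  @Override
  protected void map(Text wordAndDoc, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");     // [n, N, m]
    double n = Double.parseDouble(parts[0]);
    double N = Double.parseDouble(parts[1]);
    double m = Double.parseDouble(parts[2]);
    double tfIdf = (n / N) * Math.log((double) totalDocs / m);
    context.write(wordAndDoc, new DoubleWritable(tfIdf));
  }
}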
Working At Scale
 Buffering (doc, n, N) counts while
summing 1’s into m may not fit in memory
How many documents does the word “the”
occur in?
 Possible solutions
Ignore very-high-frequency words
Write out intermediate data to a file
Use another MR pass
Final Thoughts on TF-IDF
 Several small jobs add up to full algorithm
 Lots of code reuse possible
Stock classes exist for aggregation, identity
 Jobs 3 and 4 can really be done at once in
same reducer, saving a write/read cycle
 Very easy to handle medium-large scale,
but must take care to ensure flat memory
usage for largest scale
BFS: Motivating Concepts
 Performing computation on a graph data
structure requires processing at each node
 Each node contains node-specific data as
well as links (edges) to other nodes
 Computation must traverse the graph and
perform the computation step
 How do we traverse a graph in
MapReduce? How do we represent the
graph for this?

Breadth-First Search
• Breadth-First Search is an iterated algorithm over graphs
• Frontier advances from origin by one level with each pass
[Diagram: BFS frontier levels 1, 2, 3, 4 expanding outward from the origin node]
Breadth-First Search & MapReduce
 Problem: This doesn't “fit” into MapReduce
 Solution: Iterated passes through
MapReduce – map some nodes, result
includes additional nodes which are fed into
successive MapReduce passes
Breadth-First Search & MapReduce
 Problem: Sending the entire graph to a map
task (or hundreds/thousands of map tasks)
involves an enormous amount of memory
 Solution: Carefully consider how we
represent graphs
Graph Representations
• The most straightforward representation of
graphs uses references from each node to
its neighbors

Direct References
 Structure is inherent to the object
 Iteration requires a linked list “threaded through” the graph
 Requires a common view of shared memory (synchronization!)
 Not easily serializable

import java.util.Vector;

class GraphNode {
    Object data;                   // node-specific payload
    Vector<GraphNode> out_edges;   // direct references to neighbor nodes
    GraphNode iter_next;           // linked list threaded through the graph for iteration
}
Adjacency Matrices
 Another classic graph representation.
M[i][j]= '1' implies a link from node i to j.
 Naturally encapsulates iteration over nodes
[Figure: a four-node directed graph and its 4×4 adjacency matrix of 0/1 entries, rows and columns labeled 1–4]
Adjacency Matrices: Sparse
Representation
 Adjacency matrix for most large graphs
(e.g., the web) will be overwhelmingly full of
zeros.
 Each row of the matrix is absurdly long
 Sparse matrices only include non-zero
elements
Sparse Matrix Representation
1: (3, 1), (18, 1), (200, 1)
2: (6, 1), (12, 1), (80, 1), (400, 1)
3: (1, 1), (14, 1)
…

Sparse Matrix Representation
1: 3, 18, 200
2: 6, 12, 80, 400
3: 1, 14
…
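A line such as "2: 6, 12, 80, 400" is easy to parse back into an adjacency list inside a mapper; a small helper sketch (class name is mine):

import java.util.ArrayList;
import java.util.List;

// Parses one adjacency-list line such as "2: 6, 12, 80, 400".
class AdjacencyLine {
  final long nodeId;
  final List<Long> outEdges = new ArrayList<>();

  AdjacencyLine(String line) {
    String[] halves = line.split(":", 2);
    nodeId = Long.parseLong(halves[0].trim());
    if (halves.length > 1 && !halves[1].trim().isEmpty()) {
      for (String t : halves[1].split(",")) {
        outEdges.add(Long.parseLong(t.trim()));
      }
    }
  }
}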
Finding the Shortest Path
• A common graph
search application is
finding the shortest
path from a start node
to one or more target
nodes
• Commonly done on a
single machine with
Dijkstra's Algorithm
• Can we use BFS to
find the shortest path
via MapReduce?
This is called the single-source shortest path problem. (a.k.a. SSSP)
Finding the Shortest Path: Intuition
 We can define the solution to this problem
inductively:
DistanceTo(startNode) = 0
For all nodes n directly reachable from
startNode, DistanceTo(n) = 1
For all nodes n reachable from some other set
of nodes S,
DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
From Intuition to Algorithm
 A map task receives a node n as a key, and
(D, points-to) as its value
D is the distance to the node from the start
points-to is a list of nodes reachable from n
 For each p ∈ points-to, emit (p, D+1)
 Reduce task gathers possible distances to
a given p and selects the minimum one
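A sketch of one BFS/SSSP iteration. The value format ("D \t p1,p2,..." with Integer.MAX_VALUE standing in for "not yet reached") and the GRAPH/DIST tags are my conventions; the mapper also carries the points-to list along, for the reason discussed on the next two slides:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class BfsMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text node, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    int d = Integer.parseInt(parts[0]);
    String pointsTo = parts.length > 1 ? parts[1] : "";
    context.write(node, new Text("GRAPH\t" + pointsTo));   // carry the structure along
    context.write(node, new Text("DIST\t" + d));           // carry the current distance
    if (d != Integer.MAX_VALUE && !pointsTo.isEmpty()) {
      for (String p : pointsTo.split(",")) {
        context.write(new Text(p), new Text("DIST\t" + (d + 1)));  // d + w(p) if weighted
      }
    }
  }
}

class BfsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text node, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    int best = Integer.MAX_VALUE;
    String pointsTo = "";                                  // empty if adjacency unknown
    for (Text v : values) {
      String[] parts = v.toString().split("\t", 2);
      if (parts[0].equals("GRAPH")) {
        pointsTo = parts.length > 1 ? parts[1] : "";
      } else {
        best = Math.min(best, Integer.parseInt(parts[1]));
      }
    }
    // Same format as the input, so the output can seed the next iteration.
    context.write(node, new Text(best + "\t" + pointsTo));
  }
}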

What This Gives Us
 This MapReduce task can advance the
known frontier by one hop
 To perform the whole BFS, a non-
MapReduce component then feeds the
output of this step back into the
MapReduce task for another iteration
Problem: Where'd the points-to list go?
Solution: Mapper emits (n, points-to) as well
Blow-up and Termination
 This algorithm starts from one node
 Subsequent iterations include many more
nodes of the graph as frontier advances
 Does this ever terminate?
Yes! Eventually, routes between nodes will stop
being discovered and no better distances will
be found. When distance is the same, we stop
Mapper should emit (n, D) to ensure that
“current distance” is carried into the reducer
Adding weights
 Weighted-edge shortest path is more useful
than cost==1 approach
 Simple change: points-to list in map task
includes a weight 'w' for each pointed-to
node
emit (p, D + wp), where wp is the weight of edge (n, p), instead of (p, D+1)
Works for graphs with positive edge weights
Comparison to Dijkstra
 Dijkstra's algorithm is more efficient
because at any step it only pursues edges
from the minimum-cost path inside the
frontier
 MapReduce version explores all paths in
parallel; not as efficient overall, but the
architecture is more scalable
 Equivalent to Dijkstra for weight=1 case

PageRank: Random Walks Over
The Web
 If a user starts at a random web page and
surfs by clicking links and randomly
entering new URLs, what is the probability
that s/he will arrive at a given page?
 The PageRank of a page captures this
notion
More “popular” or “worthwhile” pages get a
higher rank
PageRank: Visually
[Diagram: example link structure among www.cnn.com, en.wikipedia.org, and www.nytimes.com]
PageRank: Formula
Given page A, and pages T1 through Tn
linking to A, PageRank is defined as:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
C(P) is the cardinality (out-degree) of page P
d is the damping (“random URL”) factor
PageRank: Intuition
 Calculation is iterative: PR(i+1) is based on PR(i)
 Each page distributes its PR(i) to all pages it links to. Linkees add up their awarded rank fragments to find their PR(i+1)
 d is a tunable parameter (usually = 0.85)
encapsulating the “random jump factor”
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank: First Implementation
 Create two tables 'current' and 'next' holding
the PageRank for each page. Seed 'current'
with initial PR values
 Iterate over all pages in the graph,
distributing PR from 'current' into 'next' of
linkees
 current := next; next := fresh_table();
 Go back to iteration step or end if converged
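A single-machine sketch of this two-table loop (data structures, seed values, and the fixed iteration count are illustrative; dangling link targets are handled only approximately):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimplePageRank {
  static final double D = 0.85;                      // damping factor

  // graph: page -> list of pages it links to; returns page -> PageRank
  static Map<String, Double> run(Map<String, List<String>> graph, int iterations) {
    Map<String, Double> current = new HashMap<>();
    for (String page : graph.keySet()) current.put(page, 1.0);   // seed 'current'

    for (int i = 0; i < iterations; i++) {           // or loop until converged
      Map<String, Double> next = new HashMap<>();
      for (String page : graph.keySet()) next.put(page, 1.0 - D);
      for (Map.Entry<String, List<String>> e : graph.entrySet()) {
        List<String> links = e.getValue();
        if (links.isEmpty()) continue;
        double share = current.get(e.getKey()) / links.size();   // PR(T)/C(T)
        for (String target : links) {
          next.merge(target, D * share, Double::sum);
        }
      }
      current = next;                                // current := next
    }
    return current;
  }
}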
Distribution of the Algorithm
 Key insights allowing parallelization:
The 'next' table depends on 'current', but not on
any other rows of 'next'
Individual rows of the adjacency matrix can be
processed in parallel
Sparse matrix rows are relatively small
Distribution of the Algorithm
 Consequences of insights:
We can map each row of 'current' to a list of
PageRank “fragments” to assign to linkees
These fragments can be reduced into a single
PageRank value for a page by summing
Graph representation can be even more
compact; since each element is simply 0 or 1,
only transmit column numbers where it's 1
Map step: break page rank into even fragments to distribute to link targets
Reduce step: add together fragments into next PageRank
Iterate for next step...

Phase 1: Parse HTML
 Map task takes (URL, page content) pairs
and maps them to (URL, (PRinit, list-of-urls))
PRinit is the “seed” PageRank for URL
list-of-urls contains all pages pointed to by URL
 Reduce task is just the identity function
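A sketch of the Phase 1 mapper. The href regex is a crude stand-in for real HTML link extraction, the seed value 1.0 is arbitrary, and the "rank \t url1,url2,..." value format is my convention:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class ParseHtmlMapper extends Mapper<Text, Text, Text, Text> {
  // Crude stand-in for real link extraction.
  private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

  @Override
  protected void map(Text url, Text pageContent, Context context)
      throws IOException, InterruptedException {
    List<String> links = new ArrayList<>();
    Matcher m = HREF.matcher(pageContent.toString());
    while (m.find()) links.add(m.group(1));
    // Value: "PR_init \t url1,url2,..."
    context.write(url, new Text("1.0\t" + String.join(",", links)));
  }
}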
Phase 2: PageRank Distribution
 Map task takes (URL, (cur_rank, url_list))
For each u in url_list, emit (u, cur_rank/|url_list|)
Emit (URL, url_list) to carry the points-to list
along through iterations
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Phase 2: PageRank Distribution
 Reduce task gets (URL, url_list) and many
(URL, val) values
Sum vals and fix up with d
Emit (URL, (new_rank, url_list))
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
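A sketch of the Phase 2 mapper and reducer over the same value format; the LINKS/RANK tags are my own convention for telling the carried structure apart from the rank fragments:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class PageRankMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text url, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    double curRank = Double.parseDouble(parts[0]);
    String linkField = parts.length > 1 ? parts[1] : "";
    String[] urlList = linkField.isEmpty() ? new String[0] : linkField.split(",");
    context.write(url, new Text("LINKS\t" + linkField));   // carry the points-to list along
    for (String u : urlList) {
      // Distribute cur_rank / |url_list| to each linkee.
      context.write(new Text(u), new Text("RANK\t" + (curRank / urlList.length)));
    }
  }
}

class PageRankReducer extends Reducer<Text, Text, Text, Text> {
  private static final double D = 0.85;                    // damping factor

  @Override
  protected void reduce(Text url, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0.0;
    String urlList = "";
    for (Text v : values) {
      String[] parts = v.toString().split("\t", 2);
      if (parts[0].equals("LINKS")) {
        urlList = parts.length > 1 ? parts[1] : "";
      } else {
        sum += Double.parseDouble(parts[1]);
      }
    }
    double newRank = (1 - D) + D * sum;   // PR(A) = (1-d) + d * (sum of fragments)
    context.write(url, new Text(newRank + "\t" + urlList));
  }
}

The output has the same shape as the input, so it can be fed directly into another Phase 2 iteration, as the next slide describes.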
Finishing up...
 A subsequent component determines
whether convergence has been achieved
(Fixed number of iterations? Comparison of
key values?)
 If so, write out the PageRank lists - done!
 Otherwise, feed output of Phase 2 into
another Phase 2 iteration

PageRank Conclusions
 MapReduce runs the “heavy lifting” in
iterated computation
 Key element in parallelization is
independent PageRank computations in a
given step
 Parallelization requires thinking about
minimum data partitions to transmit (e.g.,
compact representations of graph rows)
Even the implementation shown today doesn't
actually scale to the whole Internet; but it works
for intermediate-sized graphs
