The most popular batch processing framework is Apache Hadoop's MapReduce. MapReduce is a Java-based system for processing large datasets in parallel. It reads data from HDFS and divides the dataset into smaller pieces (input splits) that are processed independently.
This document summarizes the MapReduce programming model and its associated implementation for processing large datasets across distributed systems. It describes how MapReduce lets users express computations over large datasets in a simple way while hiding the complexity of parallelization, fault tolerance, and data distribution. The core abstractions of Map and Reduce are explained, along with how an implementation like Google's leverages a distributed file system to parallelize tasks across clusters and provide fault tolerance through replication.
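As a reference point, the Map and Reduce abstractions can be written as type signatures: map takes an input (k1, v1) pair and emits a list of intermediate (k2, v2) pairs, and reduce merges all intermediate values that share a key. A minimal Java rendering of those signatures follows; the Pair, MapFunction, and ReduceFunction names are illustrative, not types from Hadoop or Google's implementation:

```java
import java.util.List;

// Illustrative signatures only; these interfaces are hypothetical,
// not taken from Hadoop or Google's implementation.
interface Pair<K, V> {
    K key();
    V value();
}

// map: (k1, v1) -> list of (k2, v2)
interface MapFunction<K1, V1, K2, V2> {
    List<Pair<K2, V2>> map(K1 key, V1 value);
}

// reduce: (k2, list of v2) -> list of v2
interface ReduceFunction<K2, V2> {
    List<V2> reduce(K2 key, Iterable<V2> values);
}
```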
Stratosphere is a distributed data processing system that extends the MapReduce model with additional operators and advanced data flow graphs composed of those operators. It has components such as a query parser, compiler, and optimizer that translate queries into execution plans built from operators like Map, Reduce, Join, Cross, CoGroup, and Union. Stratosphere supports arbitrary data flows, whereas MapReduce is limited to the fixed map-shuffle-reduce pipeline, and it achieves better performance through in-memory processing and pipelining, while MapReduce always writes intermediate results to disk.
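To make the contrast concrete, below is a hedged sketch of a two-input data flow (a Join, which plain MapReduce cannot express as a single operator) written against the DataSet API of Apache Flink, the open-source successor of Stratosphere. The datasets, field positions, and values are invented for illustration:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinFlowSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Two illustrative inputs: (userId, name) and (userId, purchase amount).
        DataSet<Tuple2<Integer, String>> users =
                env.fromElements(Tuple2.of(1, "alice"), Tuple2.of(2, "bob"));
        DataSet<Tuple2<Integer, Double>> purchases =
                env.fromElements(Tuple2.of(1, 9.99), Tuple2.of(1, 4.50));

        // A Join operator keyed on field 0 of both inputs; the engine can
        // pipeline this in memory instead of materializing intermediate
        // results to disk the way a MapReduce-based join would.
        users.join(purchases)
             .where(0)
             .equalTo(0)
             .print(); // print() triggers execution of the data flow
    }
}
```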
Abstract: The presentation describes:
- What the BigData problem is
- How Hadoop helps to solve BigData problems
- The main principles of the Hadoop architecture as a distributed computational platform
- The history and definition of the MapReduce computational model
- Practical examples of how to write MapReduce programs and run them on Hadoop clusters

The talk is targeted at a wide audience of engineers who do not have experience using Hadoop.
This presentation discusses the following topics:
- Introduction
- Components of Hadoop MapReduce
- Map Task
- Reduce Task
- Anatomy of a MapReduce
This document provides an overview of MapReduce architecture and components. It discusses how MapReduce processes data as key-value pairs using map and reduce tasks. The JobTracker manages jobs by scheduling their tasks on TaskTrackers. Data is partitioned and sorted during the shuffle-and-sort phase before being processed by reducers. Components like Hive, Pig, partitioners, combiners, and HBase are described in the context of how they integrate with and optimize MapReduce processing.
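As an illustration of the shuffle-phase hooks mentioned above, the following is a minimal sketch of a custom Partitioner in the Hadoop Java API; the key type and the partitioning rule (route words by first letter) are invented for the example:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to reducers by their first character, so all words that
// start with the same letter are sorted and reduced together. The rule
// itself is arbitrary and purely illustrative.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class); a combiner (a reducer run locally on map output to shrink shuffle traffic) is registered analogously with job.setCombinerClass(...).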
MapReduce is a programming model for processing large datasets in a distributed computing environment. It consists of two main tasks: the Map task, which converts input data into intermediate key-value pairs, and the Reduce task, which combines those intermediate pairs into a smaller set of output pairs. The framework operates on input and output in the form of key-value pairs, with keys and values implemented as Java objects that support Hadoop's serialization (the Writable interfaces). It divides a job into map and reduce tasks executed in parallel on a cluster, with a JobTracker coordinating task assignment and tracking progress.
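The canonical illustration of these two tasks is word counting. Below is a minimal sketch against the org.apache.hadoop.mapreduce API; the class names are the usual textbook ones, not taken from any of the documents summarized here:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: turn each input line into intermediate (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: sum the counts for each word into one output pair.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```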
Hadoop is an open-source framework for running large-scale data processing jobs across clusters of computers. It has two main components: HDFS for reliable storage and Hadoop MapReduce for distributed processing. HDFS stores large files across nodes through replication and uses a master-slave (NameNode/DataNode) architecture. MapReduce lets users write map and reduce functions to process large datasets in parallel and generate results. Hadoop has seen widespread adoption for processing massive datasets due to its scalability, reliability, and ease of use.
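Tying the pieces together, a driver for the mapper and reducer sketched above would typically be configured and submitted like this (input and output paths are taken from the command line; everything else is standard Hadoop job setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS paths, e.g.
        //   hadoop jar wc.jar WordCountDriver /data/in /data/out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```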
The document discusses key concepts related to Hadoop, including components like HDFS, MapReduce, Pig, Hive, and HBase. It explains the HDFS architecture and functions, how MapReduce works through its map and reduce phases, and how higher-level tools like Pig and Hive allow for simpler programming than raw MapReduce. It also notes that HBase is a NoSQL database providing fast random access to large datasets on Hadoop, while HCatalog provides a relational abstraction layer for HDFS data.
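Since the summary highlights HBase's fast random access, here is a hedged sketch of a single random write and read through the HBase Java client; the table name, column family, and row key are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "users" and the "info" column family are made-up names.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("alice"));
            table.put(put);

            // Random read: fetch the same row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```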
The document provides an introduction to MapReduce, including:
- MapReduce is a framework for executing parallel algorithms across large datasets using commodity computers. It is based on map and reduce functions.
- Mappers process input key-value pairs in parallel, and their outputs are sorted and grouped by key before being handed to the reducers.
- Examples demonstrate how MapReduce can be used for tasks like building indexes, joins, and iterative algorithms; see the index-building sketch after this list.
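For the index-building example mentioned in the list above, a common pattern is an inverted index: each mapper emits (word, documentId) pairs and the reducer collects the set of documents per word. A hedged sketch follows; deriving the document id from the input file name is one possible choice, not something prescribed by the source:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (word, documentId) for every word occurrence.
public class InvertedIndexMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Use the input file name as the document id (an illustrative choice).
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, docId);
        }
    }
}

// Reducer: deduplicate and join the document ids into one posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        Set<String> postings = new HashSet<>();
        for (Text id : docIds) {
            postings.add(id.toString());
        }
        context.write(word, new Text(String.join(",", postings)));
    }
}
```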