Lecture 3 – Hadoop
Technical Introduction
CSE 490H
Announcements
• My office hours: M 2:30–3:30 in CSE 212
• Cluster is operational; instructions in assignment 1 heavily rewritten
• Eclipse plugin is “deprecated”
• Students who already created accounts: let me know if you have trouble
Breaking news!
• Hadoop tested on a 4,000-node cluster
  – 32K cores (8 / node)
  – 16 PB raw storage (4 × 1 TB disk / node)
  – (about 5 PB usable storage)
• http://developer.yahoo.com/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html
You Say, “tomato…”
Google calls it:   Hadoop equivalent:
MapReduce          Hadoop
GFS                HDFS
Bigtable           HBase
Chubby             ZooKeeper

Some MapReduce Terminology
• Job – A “full program”: an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data
  – a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
• Running “Word Count” across 20 files is one job
• 20 files to be mapped imply 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, etc.
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
  – If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
• Task ID from TaskInProgress is not a unique identifier; don’t use it that way
MapReduce: High Level

Node-to-Node Communication
• Hadoop uses its own RPC protocol
• All communication begins in slave nodes
  – Prevents circular-wait deadlock
  – Slaves periodically poll for “status” message
• Classes must provide explicit serialization
Nodes, Trackers, Tasks
• Master node runs JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks separate Java process for task instances
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where’s the data distribution?
Data Distribution
• Implicit in design of MapReduce!
  – All mappers are equivalent, so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
  – Data transfer is handled implicitly by HDFS

Configuring With JobConf
• MR Programs have many configurable options
• JobConf objects hold (key, value) components mapping String → value
  – e.g., “mapred.map.tasks” → 20
• JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf
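The string-keyed (key, value) model above can be sketched without Hadoop on the classpath. `MiniJobConf` below is a hypothetical stand-in, not Hadoop's JobConf: it just shows the idea that every value is stored as a string and parsed by typed accessors on the way out.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for Hadoop's JobConf: all values are stored as strings,
// and typed accessors parse them on the way out.
class MiniJobConf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) { props.put(key, value); }
    void setInt(String key, int value) { props.put(key, Integer.toString(value)); }

    String get(String key, String defaultValue) {
        return props.getOrDefault(key, defaultValue);
    }
    int getInt(String key, int defaultValue) {
        String v = props.get(key);
        return (v == null) ? defaultValue : Integer.parseInt(v);
    }
}

public class JobConfDemo {
    public static void main(String[] args) {
        MiniJobConf conf = new MiniJobConf();
        conf.setInt("mapred.map.tasks", 20);       // the slide's example
        conf.set("mapred.job.name", "word count");
        System.out.println(conf.getInt("mapred.map.tasks", 1));    // 20
        System.out.println(conf.getInt("mapred.reduce.tasks", 1)); // absent: default 1
    }
}
```

The default-value argument on every getter is what lets Hadoop ship one serialized config and let each component fall back to sensible defaults for keys the job never set.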
What Happens In MapReduce?
Depth First
Job Launch Process: Client
• Client program creates a JobConf
  – Identify classes implementing Mapper and Reducer interfaces
    • JobConf.setMapperClass(), setReducerClass()
  – Specify inputs, outputs
    • FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath()
  – Optionally, other options too:
    • JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
  – runJob() blocks, submitJob() does not
• JobClient:
  – Determines proper division of input into InputSplits
  – Sends job data to master JobTracker server

Job Launch Process: JobTracker
• JobTracker:
  – Inserts jar and JobConf (serialized to XML) in shared location
  – Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
  – main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
  – Sets up the child TaskInProgress attempt
  – Reads XML configuration
  – Connects back to necessary MapReduce components via RPC
  – Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
  – Task knows ahead of time which InputSplits it should be mapping
  – Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
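The “call the Mapper once per record” loop can be sketched in plain Java. Everything here is a hypothetical stand-in (Hadoop's real MapRunner also threads an OutputCollector and Reporter through each call); the point is just the driving loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapRunnerSketch {
    // Minimal stand-in for Hadoop's Mapper interface: emits (K2, V2)
    // pairs into a shared output list instead of an OutputCollector.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, List<Map.Entry<K2, V2>> output);
    }

    // Drive the mapper: one map() call per record in the split.
    static <K1, V1, K2, V2> List<Map.Entry<K2, V2>> run(
            Mapper<K1, V1, K2, V2> mapper, List<Map.Entry<K1, V1>> split) {
        List<Map.Entry<K2, V2>> output = new ArrayList<>();
        for (Map.Entry<K1, V1> record : split) {
            mapper.map(record.getKey(), record.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        // Word-count map: emit (word, 1) for each word in the line.
        Mapper<Long, String, String, Integer> wordCountMap = (offset, line, out) -> {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) out.add(Map.entry(w, 1));
            }
        };
        // Records are (byte offset, line) pairs, as TextInputFormat would produce.
        List<Map.Entry<Long, String>> split = List.of(
                Map.entry(0L, "to be or"), Map.entry(9L, "not to be"));
        System.out.println(run(wordCountMap, split).size()); // 6 (word, 1) pairs
    }
}
```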

Creating the Mapper
• You provide the instance of Mapper
  – Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
  – Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
• void map(K1 key,
           V1 value,
           OutputCollector<K2, V2> output,
           Reporter reporter)
• K types implement WritableComparable
• V types implement Writable
What is Writable?
• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
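The Writable idea — a box class that knows how to serialize itself to a DataOutput and read itself back from a DataInput — can be demonstrated with only the JDK. `BoxedInt` below is a hypothetical stand-in for IntWritable, not Hadoop's class; it also implements Comparable, mirroring what WritableComparable adds for keys.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableSketch {
    // Stand-in for Hadoop's IntWritable: a mutable box with
    // Writable-style write()/readFields() plus Comparable for key sorting.
    static class BoxedInt implements Comparable<BoxedInt> {
        private int value;

        void set(int value) { this.value = value; }
        int get() { return value; }

        void write(DataOutput out) throws IOException { out.writeInt(value); }
        void readFields(DataInput in) throws IOException { value = in.readInt(); }

        @Override public int compareTo(BoxedInt o) { return Integer.compare(value, o.value); }
    }

    public static void main(String[] args) throws IOException {
        BoxedInt original = new BoxedInt();
        original.set(42);

        // Serialize to bytes (what happens when a value crosses the wire)...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // ...and deserialize into a fresh box on the "other side".
        BoxedInt copy = new BoxedInt();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.get()); // 42
    }
}
```

The boxes are mutable (set() reuses one object) because Hadoop deserializes millions of records and wants to avoid allocating a new object per record.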
Getting Data To The Mapper

Reading Data
• Data sets are specified by InputFormats
  – Defines input data (e.g., a directory)
  – Identifies partitions of the data that form an InputSplit
  – Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends
• TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
• SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
• SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
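KeyValueTextInputFormat's “k SEP v” splitting (the separator is a tab by default) can be sketched as a plain-Java helper — a hypothetical illustration of the parsing rule, not Hadoop's RecordReader:

```java
public class KeyValueLineDemo {
    // Split a line into (key, value) on the FIRST separator, as
    // KeyValueTextInputFormat does; a line with no separator becomes
    // a key with an empty value.
    static String[] parse(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) return new String[] { line, "" };
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("apple\t3", '\t');
        System.out.println(kv[0] + " -> " + kv[1]); // apple -> 3
    }
}
```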
Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
  – e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
Record Readers
• Each InputFormat provides its own RecordReader implementation
  – Provides (unused?) capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat

Input Split Size
• FileInputFormat will divide large files into chunks
  – Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
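In the old-API FileInputFormat the chunk size works out to roughly max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of maps. A plain-Java sketch of that arithmetic (simplified; Hadoop also applies a slop rule so the last chunk can be slightly larger):

```java
public class SplitSizeDemo {
    // Sketch of FileInputFormat's split-size rule:
    //   splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block
        long fileSize  = 1024L * 1024 * 1024; // 1 GB file
        long goalSize  = fileSize / 10;       // client asked for ~10 maps
        long minSize   = 1;                   // mapred.min.split.size default

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        System.out.println(splitSize == blockSize); // capped at the block size: true
        System.out.println(fileSize / splitSize);   // 16 splits, not 10
    }
}
```

Raising mapred.min.split.size above the block size is how you force fewer, larger splits; the “NeverChunkFile” idea on the slide takes the other extreme and reports the file as a single unsplittable chunk.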
Sending Data To Reducers
• Map function receives OutputCollector object
  – OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
• By default, mapper output type assumed to be same as reducer output type
WritableComparator
• Compares WritableComparable data
  – Will call WritableComparable.compare()
  – Can provide fast path for serialized data
• JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
  – incrCounter(Enum key, long amount)
  – setStatus(String msg)
• Allows self-identification of input
  – InputSplit getInputSplit()

Partition And Shuffle
Partitioner
• int getPartition(key, val, numPartitions)
  – Outputs the partition number for a given key
  – One partition == values sent to one Reduce task
• HashPartitioner used by default
  – Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
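The default hash-partitioning rule is a one-liner; the usual form masks off the sign bit so a negative hashCode() still yields a valid partition number, and it can be checked in plain Java:

```java
public class HashPartitionDemo {
    // Default partitioning: (hashCode & Integer.MAX_VALUE) % numPartitions.
    // The mask keeps the result non-negative even for negative hash codes.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String word : new String[] {"the", "quick", "brown", "fox"}) {
            // Every result lands in [0, numReducers).
            System.out.println(word + " -> partition " + getPartition(word, numReducers));
        }
        // Same key always lands in the same partition, which is what
        // guarantees all values for one key reach the same reduce task:
        System.out.println(getPartition("fox", numReducers)
                == getPartition("fox", numReducers)); // true
    }
}
```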
Reduction
• reduce( K2 key,
          Iterator<V2> values,
          OutputCollector<K3, V3> output,
          Reporter reporter)
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
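The sorted-keys-then-reduce contract can be sketched in plain Java: group the mapper's (k, v) pairs in a sorted map, then make one reduce call per key, in key order. This is a simplified word-count reduce as an illustration, not Hadoop's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // Shuffle/sort + reduce for word count: group values by key
    // (TreeMap keeps keys sorted, so "earlier" keys are reduced first),
    // then sum each key's value list — one reduce call per key.
    static TreeMap<String, Integer> reduceAll(List<Map.Entry<String, Integer>> mapped) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        TreeMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Mapper output: (word, 1) pairs, in arrival order.
        List<Map.Entry<String, Integer>> mapped = List.of(
                Map.entry("to", 1), Map.entry("be", 1), Map.entry("or", 1),
                Map.entry("not", 1), Map.entry("to", 1), Map.entry("be", 1));
        // Output emerges in key order: be 2, not 1, or 1, to 2
        reduceAll(mapped).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```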
Finally: Writing The Output

OutputFormat
• Analogous to InputFormat
• TextOutputFormat – Writes “key val\n” strings to output file
• SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
• NullOutputFormat – Discards output
Questions?

Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
Skills Matter Talks
 
Unit 2
Unit 2Unit 2
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Robert Metzger
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
SNEHAL MASNE
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
softwarequery
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
Chirag Ahuja
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 
Scala+data
Scala+dataScala+data
Scala+data
Samir Bessalah
 

Similar to Hadoop 2 (20)

hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
 
Hadoop institutes in Bangalore
Hadoop institutes in BangaloreHadoop institutes in Bangalore
Hadoop institutes in Bangalore
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
Hadoop_Pennonsoft
Hadoop_PennonsoftHadoop_Pennonsoft
Hadoop_Pennonsoft
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Unit 2
Unit 2Unit 2
Unit 2
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Scala+data
Scala+dataScala+data
Scala+data
 

More from EasyMedico.com

Soft computing from net
Soft computing from netSoft computing from net
Soft computing from net
EasyMedico.com
 
Nis1
Nis1Nis1
Neural network
Neural networkNeural network
Neural network
EasyMedico.com
 
Nn devs
Nn devsNn devs
Nis1
Nis1Nis1
L005.neural networks
L005.neural networksL005.neural networks
L005.neural networks
EasyMedico.com
 
Ch03
Ch03Ch03
Ch02
Ch02Ch02

More from EasyMedico.com (9)

Sds
SdsSds
Sds
 
Soft computing from net
Soft computing from netSoft computing from net
Soft computing from net
 
Nis1
Nis1Nis1
Nis1
 
Neural network
Neural networkNeural network
Neural network
 
Nn devs
Nn devsNn devs
Nn devs
 
Nis1
Nis1Nis1
Nis1
 
L005.neural networks
L005.neural networksL005.neural networks
L005.neural networks
 
Ch03
Ch03Ch03
Ch03
 
Ch02
Ch02Ch02
Ch02
 

Recently uploaded

How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
Celine George
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
IJAEMSJORNAL
 
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
Mani Krishna Sarkar
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
sipij
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
Servizi a rete
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
Tool and Die Tech
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
Blesson Easo Varghese
 
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
IJAEMSJORNAL
 
IS Code SP 23: Handbook on concrete mixes
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
Mani Krishna Sarkar
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
hamedmustafa094
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
Prakhyath Rai
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
VICTOR MAESTRE RAMIREZ
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
yadavsuyash008
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Sinan KOZAK
 
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
sharvaridhokte
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
bookhotbebes1
 
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafePaharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
aarusi sexy model
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
YanKing2
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
PriyankaKarn3
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
Tool and Die Tech
 

Recently uploaded (20)

How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
 
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
1239_2.pdf IS CODE FOR GI PIPE FOR PROCUREMENT
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
 
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE DonatoCONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
CONVEGNO DA IRETI 18 giugno 2024 | PASQUALE Donato
 
Vernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsxVernier Caliper and How to use Vernier Caliper.ppsx
Vernier Caliper and How to use Vernier Caliper.ppsx
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
 
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
 
IS Code SP 23: Handbook on concrete mixes
IS Code SP 23: Handbook  on concrete mixesIS Code SP 23: Handbook  on concrete mixes
IS Code SP 23: Handbook on concrete mixes
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
 
Chlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptxChlorine and Nitric Acid application, properties, impacts.pptx
Chlorine and Nitric Acid application, properties, impacts.pptx
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
 
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
 
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model SafeBangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
Bangalore @ℂall @Girls ꧁❤ 0000000000 ❤꧂@ℂall @Girls Service Vip Top Model Safe
 
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafePaharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Paharganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
 
Conservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic RegenerationConservation of Taksar through Economic Regeneration
Conservation of Taksar through Economic Regeneration
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
 

Hadoop 2

  • 1. Lecture 3 – Hadoop Technical Introduction CSE 490H
  • 2. Announcements – My office hours: M 2:30–3:30 in CSE 212 – Cluster is operational; instructions in assignment 1 heavily rewritten – Eclipse plugin is “deprecated” – Students who already created accounts: let me know if you have trouble
  • 3. Breaking news! – Hadoop tested on a 4,000-node cluster: 32K cores (8 / node), 16 PB raw storage (4 × 1 TB disk / node), about 5 PB usable storage – http://developer.yahoo.com/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html
  • 4. You Say, “tomato…” – Google calls it → Hadoop equivalent: MapReduce → Hadoop, GFS → HDFS, Bigtable → HBase, Chubby → ZooKeeper
  • 5. Some MapReduce Terminology – Job: a “full program,” an execution of a Mapper and Reducer across a data set – Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP) – Task Attempt: a particular instance of an attempt to execute a task on a machine
  • 6. Terminology Example – Running “Word Count” across 20 files is one job – 20 files to be mapped imply 20 map tasks + some number of reduce tasks – At least 20 map task attempts will be performed… more if a machine crashes, etc.
  • 7. Task Attempts – A particular task will be attempted at least once, possibly more times if it crashes – If the same input causes crashes over and over, that input will eventually be abandoned – Multiple attempts at one task may occur in parallel with speculative execution turned on – Task ID from TaskInProgress is not a unique identifier; don’t use it that way
  • 9. Node-to-Node Communication – Hadoop uses its own RPC protocol – All communication begins in slave nodes (prevents circular-wait deadlock; slaves periodically poll for “status” message) – Classes must provide explicit serialization
  • 10. Nodes, Trackers, Tasks – Master node runs JobTracker instance, which accepts Job requests from clients – TaskTracker instances run on slave nodes – TaskTracker forks separate Java process for task instances
  • 11. Job Distribution – MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options – Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code – … Where’s the data distribution?
  • 12. Data Distribution – Implicit in design of MapReduce! All mappers are equivalent, so map whatever data is local to a particular node in HDFS – If lots of data does happen to pile up on the same node, nearby nodes will map instead; data transfer is handled implicitly by HDFS
  • 13. Configuring With JobConf – MR programs have many configurable options – JobConf objects hold (key, value) components mapping String → value, e.g., “mapred.map.tasks” → 20 – JobConf is serialized and distributed before running the job – Objects implementing JobConfigurable can retrieve elements from a JobConf
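The string-keyed option map described above can be sketched in plain Java. This toy stand-in (the class and its defaults are ours, not Hadoop's API; no Hadoop types are used) mimics the spirit of JobConf's `set()` / `getInt(key, default)` accessors:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for JobConf's (String key, String value) option map.
// Illustrative only: method names mirror JobConf's, but this is not Hadoop code.
public class JobConfSketch {
    private final Map<String, String> props = new HashMap<>();

    public void set(String key, String value) {
        props.put(key, value);
    }

    // Typed accessor in the spirit of JobConf.getInt(key, defaultValue):
    // parse the stored string, or fall back to the default if unset.
    public int getInt(String key, int defaultValue) {
        String v = props.get(key);
        return v == null ? defaultValue : Integer.parseInt(v);
    }
}
```

Because everything is stored as strings, the whole map serializes trivially (Hadoop ships it to tasks as XML), and typed getters re-parse values on the consuming side.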
  • 14. What Happens In MapReduce? Depth First
  • 15. Job Launch Process: Client – Client program creates a JobConf – Identify classes implementing Mapper and Reducer interfaces: JobConf.setMapperClass(), setReducerClass() – Specify inputs, outputs: FileInputFormat.addInputPath(), FileOutputFormat.setOutputPath() – Optionally, other options too: JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
  • 16. Job Launch Process: JobClient – Pass JobConf to JobClient.runJob() or submitJob() – runJob() blocks, submitJob() does not – JobClient determines proper division of input into InputSplits and sends job data to master JobTracker server
  • 17. Job Launch Process: JobTracker – JobTracker inserts jar and JobConf (serialized to XML) in shared location – Posts a JobInProgress to its run queue
  • 18. Job Launch Process: TaskTracker – TaskTrackers running on slave nodes periodically query JobTracker for work – Retrieve job-specific jar and config – Launch task in separate instance of Java; main() is provided by Hadoop
  • 19. Job Launch Process: Task – TaskTracker.Child.main() sets up the child TaskInProgress attempt, reads XML configuration, connects back to necessary MapReduce components via RPC, and uses TaskRunner to launch user process
  • 20. Job Launch Process: TaskRunner – TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper – Task knows ahead of time which InputSplits it should be mapping – Calls Mapper once for each record retrieved from the InputSplit – Running the Reducer is much the same
  • 21. Creating the Mapper – You provide the instance of Mapper; it should extend MapReduceBase – One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress – Exists in separate process from all other instances of Mapper: no data sharing!
  • 22. Mapper – void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) – K types implement WritableComparable – V types implement Writable
  • 23. What is Writable? – Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc. – All values are instances of Writable – All keys are instances of WritableComparable
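The map() signature on slide 22 can be sketched without Hadoop on the classpath. In this plain-Java word-count sketch, a returned list of String pairs stands in for OutputCollector&lt;Text, IntWritable&gt;, and a long byte offset stands in for the LongWritable key; the class name is ours:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a word-count map() in the shape of the old Hadoop
// interface. A List<String[]> stands in for OutputCollector<Text, IntWritable>.
public class WordCountMapSketch {

    // For each whitespace-separated token in the line, emit a (word, "1") pair.
    // offsetKey plays the role of the byte-offset key TextInputFormat supplies;
    // it is unused here, just as in the classic word-count mapper.
    public static List<String[]> map(long offsetKey, String line) {
        List<String[]> output = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                output.add(new String[] { word, "1" });
            }
        }
        return output;
    }
}
```

In real Hadoop code the pairs would be wrapped in Writable box classes (Text, IntWritable) so the framework can serialize them between processes.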
  • 24. Getting Data To The Mapper
  • 25. Reading Data – Data sets are specified by InputFormats – Defines input data (e.g., a directory) – Identifies partitions of the data that form an InputSplit – Factory for RecordReader objects to extract (k, v) records from the input source
  • 26. FileInputFormat and Friends – TextInputFormat: treats each ‘\n’-terminated line of a file as a value – KeyValueTextInputFormat: maps ‘\n’-terminated text lines of “k SEP v” – SequenceFileInputFormat: binary file of (k, v) pairs with some add’l metadata – SequenceFileAsTextInputFormat: same, but maps (k.toString(), v.toString())
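The "k SEP v" parsing that KeyValueTextInputFormat performs can be sketched in a few lines of plain Java (the helper name is ours, not Hadoop's): split on the first occurrence of the separator; if no separator is present, the whole line becomes the key and the value is empty.

```java
// Sketch of KeyValueTextInputFormat-style line parsing: split a line on
// the first separator character into (key, value). Hadoop's default
// separator is a tab. KeyValueLineSketch is an illustrative name.
public class KeyValueLineSketch {

    public static String[] parse(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) {
            // No separator found: whole line is the key, value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }
}
```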
  • 27. Filtering File Inputs – FileInputFormat will read all files out of a specified directory and send them to the mapper – Delegates filtering this file list to a method subclasses may override – e.g., create your own “xyzFileInputFormat” to read *.xyz from directory list
  • 28. Record Readers – Each InputFormat provides its own RecordReader implementation – Provides (unused?) capability multiplexing – LineRecordReader reads a line from a text file – KeyValueRecordReader is used by KeyValueTextInputFormat
  • 29. Input Split Size – FileInputFormat will divide large files into chunks – Exact size controlled by mapred.min.split.size – RecordReaders receive file, offset, and length of chunk – Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
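The chunking on slide 29 can be illustrated with a simplified sketch. Hadoop's real split-size arithmetic also weighs the HDFS block size and a per-job goal size; this toy version (class and method names are ours) only carves a file of `length` bytes into (offset, length) pairs of at most `splitSize` bytes, which is the shape a RecordReader receives:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of FileInputFormat-style chunking: carve a file into
// (offset, length) splits of at most splitSize bytes. The real Hadoop
// computation also considers block size and goal size, omitted here.
public class SplitSketch {

    public static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        while (offset < length) {
            long chunk = Math.min(splitSize, length - offset);
            splits.add(new long[] { offset, chunk });
            offset += chunk;
        }
        return splits;
    }
}
```

A "NeverChunkFile"-style InputFormat would simply return one split covering the whole file, regardless of splitSize.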
  • 30. Sending Data To Reducers – Map function receives OutputCollector object – OutputCollector.collect() takes (k, v) elements – Any (WritableComparable, Writable) can be used – By default, mapper output type assumed to be same as reducer output type
  • 31. WritableComparator – Compares WritableComparable data – Will call WritableComparable.compare() – Can provide fast path for serialized data – JobConf.setOutputValueGroupingComparator()
  • 32. Sending Data To The Client – Reporter object sent to Mapper allows simple asynchronous feedback – incrCounter(Enum key, long amount) – setStatus(String msg) – Allows self-identification of input: InputSplit getInputSplit()
  • 34. Partitioner – int getPartition(key, val, numPartitions) outputs the partition number for a given key – One partition == values sent to one Reduce task – HashPartitioner used by default; uses key.hashCode() to return partition num – JobConf sets Partitioner implementation
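HashPartitioner's logic fits in one line of plain Java: mask off the sign bit of the key's hash code (so the result is never negative) and take it modulo the number of reduce tasks. The class name below is ours; no Hadoop types are used:

```java
// Plain-Java sketch of HashPartitioner.getPartition: every key with the
// same hash lands in the same partition, and each partition feeds
// exactly one reduce task.
public class HashPartitionerSketch {

    public static int getPartition(Object key, int numPartitions) {
        // & Integer.MAX_VALUE clears the sign bit so the modulus is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the partition depends only on the key, all values for one key are guaranteed to meet at the same reducer, which is what makes per-key aggregation possible.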
  • 35. Reduction – reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) – Keys & values sent to one partition all go to the same reduce task – Calls are sorted by key: “earlier” keys are reduced and output before “later” keys
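The reduce() side of word count, in the same plain-Java style as the mapper sketch above (Iterator&lt;Integer&gt; stands in for Iterator&lt;IntWritable&gt;; the class name is ours), just sums the counts the framework has grouped under one key:

```java
import java.util.Iterator;

// Plain-Java sketch of a word-count reduce(): the framework has already
// grouped and sorted all values for this key, so the reducer only sums them.
public class WordCountReduceSketch {

    public static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }
}
```

In real Hadoop code the result would go to the OutputCollector as (Text, IntWritable) rather than being returned.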
  • 37. OutputFormat – Analogous to InputFormat – TextOutputFormat writes “key \t value \n” strings to output file – SequenceFileOutputFormat uses a binary format to pack (k, v) pairs – NullOutputFormat discards output
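The record layout TextOutputFormat produces is simple enough to state as code (the helper name is ours): key, a tab, the value, and a trailing newline, which is exactly the shape KeyValueTextInputFormat can read back in a follow-on job.

```java
// Sketch of TextOutputFormat's default record layout: "key \t value \n".
// TextOutputSketch is an illustrative name, not a Hadoop class.
public class TextOutputSketch {

    public static String format(String key, String value) {
        return key + "\t" + value + "\n";
    }
}
```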