SlideShare a Scribd company logo
Introduction to MapReduce and Hadoop
Expected … what to be said!
● History.
● What is Hadoop.
● Hadoop vs SQl.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of MapReduce program.
Hadoop Series Resources
Introduction to MapReduce and Hadoop

Recommended for you

Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows

Map Reduce is a parallel and distributed approach developed by Google for processing large data sets. It has two key components - the Map function which processes input data into key-value pairs, and the Reduce function which aggregates the intermediate output of the Map into a final result. Input data is split across multiple machines which apply the Map function in parallel, and the Reduce function is applied to aggregate the outputs.

map reducehadoop on windowsparallel programming
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance

The document discusses fault tolerance in Apache Hadoop. It describes how Hadoop handles failures at different layers through replication and rapid recovery mechanisms. In HDFS, data nodes regularly heartbeat to the name node, and blocks are replicated across racks. The name node tracks block locations and initiates replication if a data node fails. HDFS also supports name node high availability. In MapReduce v1, task and task tracker failures cause re-execution of tasks. YARN improved fault tolerance by removing the job tracker single point of failure.

hadoopmapreduceyarn
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX

GraphX is Apache Spark's API for graph distributed computing based on the Pregel programming model. In this talk we'll see a brief introduction to Pregel and then we'll focus on transforming standard graph algorithms in their distributed counterpart using GraphX to speedup performance in a distributed environment.

graphgraphxapache spark
How hadoop was born?
Doug Cutting
Challenges of Distributed Processing of
Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
Hadoop tackles these
challenges!
So, what’s Hadoop?
What is Hadoop?
Hadoop is an open source framework for writing and
running distributed applications that process large
amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple

Recommended for you

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce

Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase where the data is processed key-value pairs and the reduce phase where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large scale processing of structured and unstructured data.

chapterstudentvnit-acm
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction

The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms

This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.

mapreducealgorithmsparallelization
Hadoop vs SQL
● Structured and Unstructured data.
● Datastore and Data Analysis.
● Scale-out and Scale-up.
● Offline batch processing and Online
transactions.
Hadoop Uses
MapReduce
What is MapReduce?...
● Parallel programming model for clusters of
commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
What is MapReduce?
MapReduce … Map then Reduce

Recommended for you

Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics

This document provides an overview of topics to be covered in a Big Data training. It will discuss uses of Big Data, Hadoop, HDFS architecture, MapReduce algorithm, WordCount example, tips for MapReduce, and distributing Twitter data for testing. Key concepts that will be covered include what Big Data is, how HDFS is architected, the MapReduce phases of map, sort, shuffle, and reduce, and how WordCount works as a simple MapReduce example. The goal is to introduce foundational Big Data and Hadoop concepts.

big data
Unit 1
Unit 1Unit 1
Unit 1

Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It was developed to support distributed processing of large datasets. The document provides an overview of Hadoop architecture including HDFS, MapReduce and key components like NameNode, DataNode, JobTracker and TaskTracker. It also discusses Hadoop history, features, use cases and configuration.

Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop

The document discusses linking the statistical programming language R with the Hadoop platform for big data analysis. It introduces Hadoop and its components like HDFS and MapReduce. It describes three ways to link R and Hadoop: RHIPE which performs distributed and parallel analysis, RHadoop which provides HDFS and MapReduce interfaces, and Hadoop streaming which allows R scripts to be used as Mappers and Reducers. The goal is to use these methods to analyze large datasets with R functions on Hadoop clusters.

mapreducehadoopr
Keys and Values
● Key/Value pairs.
● Keys divide Reduce Space.
Input Output
Map <k1, v1> list(<k2, v2>)
Reduce <k2, list(v2)> list(<k3, v3>)
WordCount in Action
Input:
foo.txt:
“This is the foo file”
bar.txt:
“And this is the bar one”
1
is
1
the
1
foo
1
file
1
and
1
this
1
is
1
the
1
Reduce#2:
Input:
Output:
is, [1, 1] is,
2
Reduce#1:
Input:
Output:
this, [1, 1]
this, 2
Reduce#3:
Input:
Output:
foo, [1]
foo, 1.
.
Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
WordCount with MapReduce
map(String filename, String document) {
List<String> T = tokenize(document);
for each token in T {
emit ((String)token,
(Integer) 1);
}
}
reduce(String token, List<Integer> values) {
Integer sum = 0;
for each value in values {
sum = sum + value;
}
emit ((String)token, (Integer) sum);
}
Hadoop Building Blocks
How does hadoop work?...

Recommended for you

MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model

The document describes how to use Gawk to perform data aggregation from log files on Hadoop by having Gawk act as both the mapper and reducer to incrementally count user actions and output the results. Specific user actions are matched and counted using operations like incrby and hincrby and the results are grouped by user ID and output to be consumed by another system. Gawk is able to perform the entire MapReduce job internally without requiring Hadoop.

mongodbhadoopscribe
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2

This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.

Map Reduce
Map ReduceMap Reduce
Map Reduce

This document provides an overview of MapReduce in Hadoop. It defines MapReduce as a distributed data processing paradigm designed for batch processing large datasets in parallel. The anatomy of MapReduce is explained, including the roles of mappers, shufflers, reducers, and how a MapReduce job runs from submission to completion. Potential purposes are batch processing and long running applications, while weaknesses include iterative algorithms, ad-hoc queries, and algorithms that depend on previously computed values or shared global state.

batch processingbig datamapreduce
Hadoop Building Blocks
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
HDFS: NameNode and DataNodes
JobTracker and TaskTracker
Typical Hadoop Cluster

Recommended for you

Spark and shark
Spark and sharkSpark and shark
Spark and shark

Spark is a fast and general engine for large-scale data processing. It provides an interface called resilient distributed datasets (RDDs) that allow data to be distributed in memory across clusters and manipulated using parallel operations. Shark is a system built on Spark that allows running SQL queries over large datasets using Spark's speed and generality. The document discusses Spark and Shark's performance advantages over Hadoop for iterative and interactive applications.

hadoophadoop summit
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance

This document summarizes a proposal to improve fault tolerance in Hadoop clusters. It proposes adding a "Backup" state to store intermediate MapReduce data, so reducers can continue working even if mappers fail. It also proposes a "supernode" protocol where neighboring slave nodes communicate task information. If one node fails, a neighbor can take over its tasks without involving the JobTracker. This would improve fault tolerance by allowing computation to continue locally between nodes after failures.

20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark

In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperform Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications. These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.

yahoohadoopspark
Running Hadoop
Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode “cluster of one” .
3. Fully distributed mode.
An Action
Running Hadoop on Local Machine
Actions ...
1. Installing Hadoop.
2. Configuring Hadoop (Pseudo-distributed mode).
3. Running WordCount example.
4. Web-based cluster UI.
HDFS
1. HDFS is a filesystem designed for large-scale
distributed data processing.
2. HDFS isn’t a native Unix filesystem.
Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs –ls
$ hadoop fs –mkdir /user/chuck
$ hadoop fs -copyFromLocal

Recommended for you

MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement

The document discusses MapReduce, a framework for processing large datasets in parallel. It provides an overview of MapReduce's basic principles, surveys research to improve the conventional MapReduce framework, and describes research projects ongoing at KAIST. The key points are that MapReduce provides automatic parallelization, fault tolerance, and distributed processing of large datasets across commodity computer clusters. It also introduces the map and reduce functions that define MapReduce jobs.

mapreducehadoopparallel computing
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1

Shark is a new data analysis system that marries SQL queries with complex analytics like machine learning on large clusters. It uses Spark as an execution engine and provides in-memory columnar storage with extensions like partial DAG execution and co-partitioning tables to optimize query performance. Shark also supports expressing machine learning algorithms in SQL to avoid moving data out of the database. It aims to efficiently support both SQL and complex analytics while retaining fault tolerance and allowing users to choose loading frequently used data into memory for fast queries.

apache sparkberkeleybdas
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce

Very detail description if MapReduce in this slides. I hope you will learn very much about MapReduce after reading these slides

mapreducesoftware engineeringcloud computing
Anatomy of a MapReduce program
MapReduce and beyond
Hadoop
1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
a. InputFormat
b. OutputFormat
Anatomy of a MapReduce program
Hadoop Data Types
● Certain defined way of serializing key/value pairs.
● Values should implement Writable Interface.
● Keys should implement WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable
o ...

Recommended for you

An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce

The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.

map-reducemapreduceintroduction
Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)

Presentation by Brandwatch Developer James Grant at the second Big Data Brighton meetup, hosted by Brandwatch: www.brandwatch.com

james grantmap reducebrandwatch
Hadoop
HadoopHadoop
Hadoop

Hadoop/MapReduce is an open source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce, a programming model where input data is processed by "map" functions in parallel, and results are combined by "reduce" functions, to process and generate outputs from large amounts of data and nodes. The core components are the Hadoop Distributed File System for data storage, and the MapReduce programming model and framework. MapReduce jobs involve mapping data to intermediate key-value pairs, shuffling and sorting the data, and reducing to output results.

hadoopsoftware developmentbig data
Mapper
Mapper
1. Mapper<K1,V1,K2,V2>
2. Override method:
void map(K1 key, V1 value, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
WordCount Mapperpublic static class Map extends Mapper<LongWritable, Text,
Text, IntWritable> {
private final static IntWritable one = new
IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context
context){
String line = value.toString();
StringTokenizer tokenizer = new
StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
Predefined Mappers

Recommended for you

Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using Disco

This document provides an introduction to MapReduce and Disco, an open source implementation of MapReduce in Erlang and Python. It explains the motivation for MapReduce frameworks like Google's in addressing the need to process massive amounts of data across large clusters reliably. The core concepts of MapReduce are described, including how the input is split and mapped in parallel, intermediate key-value pairs are grouped and reduced, and the final output is produced. An example word counting algorithm demonstrates how a problem can be solved using MapReduce.

mapreducepythondisco
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2

This document provides an overview of Hadoop and its core components. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as its programming model and the Hadoop Distributed File System (HDFS) for storage. HDFS stores data redundantly across nodes for reliability. The core subprojects of Hadoop include MapReduce, HDFS, Hive, HBase, and others.

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners

This presentation is about apache hadoop technology. This may be helpful for the beginners. The beginners will know about some terminologies of hadoop technology. There is also some diagrams which will show the working of this technology. Thank you.

litrecytechnologyengineering
Reducer
Reducer
1. Extends Reducer<K1,V1,K2,V2>
2. Overrides method:
void reduce(K2, Iterable<V2>, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
WordCount Reducer
public static class Reduce
extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
Context context){
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
Predefined Reducers

Recommended for you

Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology

This document provides an overview of Big Data and Hadoop. It defines Big Data as large volumes of structured, semi-structured, and unstructured data that is too large to process using traditional databases and software. It provides examples of the large amounts of data generated daily by organizations. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop including HDFS for distributed storage and fault tolerance, and MapReduce for distributed processing, are described at a high level. Common use cases for Hadoop by large companies are also mentioned.

hadoop
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals

This Hadoop MapReduce tutorial will unravel MapReduce Programming, MapReduce Commands, MapReduce Fundamentals, Driver Class, Mapper Class, Reducer Class, Job Tracker & Task Tracker. At the end, you'll have a strong knowledge regarding Hadoop MapReduce Basics. PPT Agenda: ✓ Introduction to BIG Data & Hadoop ✓ What is MapReduce? ✓ MapReduce Data Flows ✓ MapReduce Programming ---------- What is MapReduce? MapReduce is a programming framework for distributed processing of large data-sets via commodity computing clusters. It is based on the principal of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This ensures a faster, secure & scalable solution. Mapreduce commands are based in Java. ---------- What are MapReduce Components? It has the following components: 1. Combiner: The combiner collates all the data from the sample set based on your desired filters. For example, you can collate data based on day, week, month and year. After this, the data is prepared and sent for parallel processing. 2. Job Tracker: This allocates the data across multiple servers. 3. Task Tracker: This executes the program across various servers. 4. Reducer: It will isolate the desired output from across the multiple servers. ---------- Applications of MapReduce 1. Data Mining 2. Document Indexing 3. Business Intelligence 4. Predictive Modelling 5. Hypothesis Testing ---------- Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance. Email: sales@skillspeed.com Website: https://www.skillspeed.com

big datahadoopmapreduce fundamentals
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details

Twitter Data Analysis using various Hadoop tools and little description of Mapreduce concept and use case

flumemapreducehadoop
Partitioner
Partitioner
The partitioner decides
which key goes where
class WordSizePartitioner extends
Partitioner<Text, IntWritable> {
@Override
public int getPartition(Text
word, IntWritable count, int
numOfPartions) {
return 0;
}
}
Combiner
Combiner
It’s a local Reduce Task at
Mapper.
WordCout Mapper Output:
1. Without Combiner:<the, 1>, <file,
1>, <the, 1>, …
2. With Combiner:<the, 2>, <file, 2>,
...

Recommended for you

Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce

Here is how you can solve this problem using MapReduce and Unix commands: Map step: grep -o 'Blue\|Green' input.txt | wc -l > output This uses grep to search the input file for the strings "Blue" or "Green" and print only the matches. The matches are piped to wc which counts the lines (matches). Reduce step: cat output This isn't really needed as there is only one mapper. Cat prints the contents of the output file which has the count of Blue and Green. So MapReduce has been simulated using grep for the map and cat for the reduce functionality. The key aspects are - grep extracts the relevant data (map

hadoopbig dataapache apex
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce

This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.

mapreducegooglemapreducegoogle
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2

As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop. At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.

hadoop 2cloudera hadoophadoop yarn
Reading and Writing
Reading and Writing
1. Input data usually resides in large files.
2. MapReduce’s processing power is the splitting of the
input data into chunks(InputSplit).
3. Hadoop’s FileSystem provides the class
FSDataInputStream for file reading. It extends
DataInputStream with random read access.
InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o keytvaue => <key, value>
● NLineInputFormat
o <offset, nLines>
You can define your own InputFormat class ...
1. The output has no splits.
2. Each reducer generates output file named
part-nnnnn, where nnnnn is the partition ID
of the reducer.
Predefined OutputFormat classes:
> TextOutputFormat <k, v> => ktv
OutputFormat

Recommended for you

Presentation1
Presentation1Presentation1
Presentation1

This document proposes an app idea to solve conflicts that arise in Indian families when parents and children want to watch different things on TV. The app would display updates like cricket scores, news, stock prices, weather and social media feeds along the bottom of the TV screen as ticker updates, allowing the whole family to watch together while still getting the information they want. Key features outlined include customizing the number and type of information bars displayed, minimizing bars individually, and changing themes. Revenue would come from advertisements on the app interface. Initial sketches and wireframes are included to illustrate the proposed user interface and flows.

Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...

This document provides a guide for OpenStack developers to contribute code. It outlines the prerequisites like creating a Launchpad account and signing a CLA. It describes finding work by attending meetings, tracking bugs, or writing blueprints. Developers are instructed to write git commit messages linking their code patches to specific bugs or blueprints. The guide also covers submitting code for review using git-review and responding professionally to review comments.

2012osacopenstack
Module 3
Module 3Module 3
Module 3

The document provides information and instructions for students preparing for their orientation at the University of Kansas (KU). It discusses: 1) Completing the pre-orientation survey and selecting 10 interesting classes to share with an advisor in order to help plan their first year schedule. 2) Logging into the orientation portal to confirm their session date and research any other questions about the day. 3) Checking that they have the required documents like ID, transcripts, placement scores before attending orientation. 4) The orientation day will include introductions, meeting with student assistants and advisors, and enrolling in classes. Snacks and drinks will be provided.

Recap
END OF SESSION #1
Q

More Related Content

What's hot

Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
EasyMedico.com
 
MapReduce
MapReduceMapReduce
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
Marin Dimitrov
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
Muhammad Shahid
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
Ravindra Bandara
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
Andrea Iacono
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
VNIT-ACM Student Chapter
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
Sandeep Deshmukh
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
Abhishek Mukherjee
 
Unit 1
Unit 1Unit 1
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
Victoria López
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
Takahiro Inoue
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
Tianwei Liu
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Vigen Sahakyan
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
DataWorks Summit
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
Kyong-Ha Lee
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
Stefanie Zhao
 

What's hot (20)

Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
MapReduce
MapReduceMapReduce
MapReduce
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 

Viewers also liked

Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
Hassan A-j
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)
Brandwatch
 
Hadoop
HadoopHadoop
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using Disco
Jim Roepcke
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
Ankit Gupta
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
Atul Kushwaha
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
Anju Singh
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Presentation1
Presentation1Presentation1
Presentation1
Himanshu Bansal
 
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
OpenCity Community
 
Module 3
Module 3Module 3
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
giopersico
 
Маркетинговая программа "Быстрого роста 3+3"
Маркетинговая программа "Быстрого роста 3+3"Маркетинговая программа "Быстрого роста 3+3"
Маркетинговая программа "Быстрого роста 3+3"
Елена Шальнова
 
Indicaciones de un helipuerto
Indicaciones de un helipuertoIndicaciones de un helipuerto
Indicaciones de un helipuerto
Jorge Echeverria
 
Egoera: La Economía de Bizkaia - Junio 2016 - nº23
Egoera: La Economía de Bizkaia - Junio 2016 - nº23Egoera: La Economía de Bizkaia - Junio 2016 - nº23
Egoera: La Economía de Bizkaia - Junio 2016 - nº23
Cámara de Comercio de Bilbao
 

Viewers also liked (20)

Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)Map Reduce: An Example (James Grant at Big Data Brighton)
Map Reduce: An Example (James Grant at Big Data Brighton)
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using Disco
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Presentation1
Presentation1Presentation1
Presentation1
 
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
Assign, commit, and review - A developer’s guide to OpenStack contribution-20...
 
Module 3
Module 3Module 3
Module 3
 
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
Microsoft Excel and Financial Modeling - Global Survey Results (July 2011)
 
Маркетинговая программа "Быстрого роста 3+3"
Маркетинговая программа "Быстрого роста 3+3"Маркетинговая программа "Быстрого роста 3+3"
Маркетинговая программа "Быстрого роста 3+3"
 
Indicaciones de un helipuerto
Indicaciones de un helipuertoIndicaciones de un helipuerto
Indicaciones de un helipuerto
 
Egoera: La Economía de Bizkaia - Junio 2016 - nº23
Egoera: La Economía de Bizkaia - Junio 2016 - nº23Egoera: La Economía de Bizkaia - Junio 2016 - nº23
Egoera: La Economía de Bizkaia - Junio 2016 - nº23
 

Similar to Introduction to MapReduce and Hadoop

Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
datasalt
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
Svetlin Nakov
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Desing Pathshala
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
IndicThreads
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
Sean Murphy
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
Marc de Palol
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Andrey Vykhodtsev
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
Dilum Bandara
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
Koichi Fujikawa
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
Shay Sofer
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
Kumari Surabhi
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
Jazan University
 

Similar to Introduction to MapReduce and Hadoop (20)

Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Spark overview
Spark overviewSpark overview
Spark overview
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Cloud jpl
Cloud jplCloud jpl
Cloud jpl
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 

Recently uploaded

MVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptxMVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptx
Mitchell Marsh
 
Independence Day Hasn’t Always Been a U.S. Holiday.pdf
Independence Day Hasn’t Always Been a U.S. Holiday.pdfIndependence Day Hasn’t Always Been a U.S. Holiday.pdf
Independence Day Hasn’t Always Been a U.S. Holiday.pdf
Livetecs LLC
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdfAWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
karim wahed
 
NYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdfNYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdf
AUGNYC
 
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptxAddressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Sparity1
 
Cultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational TransformationCultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational Transformation
Mindfire Solution
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
Philip Schwarz
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
ThousandEyes
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
SSTech System
 
Leading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptxLeading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptx
taskroupseo
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
Severalnines
 
dachnug51 - Whats new in domino 14 .pdf
dachnug51 - Whats new in domino 14  .pdfdachnug51 - Whats new in domino 14  .pdf
dachnug51 - Whats new in domino 14 .pdf
DNUG e.V.
 
Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)
miso_uam
 
ANSYS Mechanical APDL Introductory Tutorials.pdf
ANSYS Mechanical APDL Introductory Tutorials.pdfANSYS Mechanical APDL Introductory Tutorials.pdf
ANSYS Mechanical APDL Introductory Tutorials.pdf
sachin chaurasia
 
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
Semiosis Software Private Limited
 
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
bhatinidhi2001
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
karim wahed
 
Overview of ERP - Mechlin Technologies.pptx
Overview of ERP - Mechlin Technologies.pptxOverview of ERP - Mechlin Technologies.pptx
Overview of ERP - Mechlin Technologies.pptx
Mitchell Marsh
 
Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …
908dutch
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
ssuser2b426d1
 

Recently uploaded (20)

MVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptxMVP Mobile Application - Codearrest.pptx
MVP Mobile Application - Codearrest.pptx
 
Independence Day Hasn’t Always Been a U.S. Holiday.pdf
Independence Day Hasn’t Always Been a U.S. Holiday.pdfIndependence Day Hasn’t Always Been a U.S. Holiday.pdf
Independence Day Hasn’t Always Been a U.S. Holiday.pdf
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdfAWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
 
NYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdfNYC 26-Jun-2024 Combined Presentations.pdf
NYC 26-Jun-2024 Combined Presentations.pdf
 
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptxAddressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
 
Cultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational TransformationCultural Shifts: Embracing DevOps for Organizational Transformation
Cultural Shifts: Embracing DevOps for Organizational Transformation
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
 
Leading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptxLeading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptx
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
 
dachnug51 - Whats new in domino 14 .pdf
dachnug51 - Whats new in domino 14  .pdfdachnug51 - Whats new in domino 14  .pdf
dachnug51 - Whats new in domino 14 .pdf
 
Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)Software development... for all? (keynote at ICSOFT'2024)
Software development... for all? (keynote at ICSOFT'2024)
 
ANSYS Mechanical APDL Introductory Tutorials.pdf
ANSYS Mechanical APDL Introductory Tutorials.pdfANSYS Mechanical APDL Introductory Tutorials.pdf
ANSYS Mechanical APDL Introductory Tutorials.pdf
 
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
 
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.CViewSurvey Digitech Pvt Ltd that  works on a proven C.A.A.G. model.
CViewSurvey Digitech Pvt Ltd that works on a proven C.A.A.G. model.
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
 
Overview of ERP - Mechlin Technologies.pptx
Overview of ERP - Mechlin Technologies.pptxOverview of ERP - Mechlin Technologies.pptx
Overview of ERP - Mechlin Technologies.pptx
 
Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …Prada Group Reports Strong Growth in First Quarter …
Prada Group Reports Strong Growth in First Quarter …
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
 

Introduction to MapReduce and Hadoop

  • 2. Expected … what to be said! ● History. ● What is Hadoop. ● Hadoop vs SQl. ● MapReduce. ● Hadoop Building Blocks. ● Installing, Configuring and Running Hadoop. ● Anatomy of MapReduce program.
  • 5. How hadoop was born? Doug Cutting
  • 6. Challenges of Distributed Processing of Large Data ● How to distribute the work? ● How to store and distribute the data itself? ● How to overcome failures? ● How to balance the load? ● How to deal with unstructured data? ● ...
  • 8. What is Hadoop? Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Key distinctions of Hadoop: ● Accessible ● Robust ● Scalable ● Simple
  • 9. Hadoop vs SQL ● Structured and Unstructured data. ● Datastore and Data Analysis. ● Scale-out and Scale-up. ● Offline batch processing and Online transactions.
  • 11. ● Parallel programming model for clusters of commodity machines. ● MapReduce provides: o Automatic parallelization & distribution. o Fault tolerance. o Locality of data. What is MapReduce?
  • 12. MapReduce … Map then Reduce
  • 13. Keys and Values ● Key/Value pairs. ● Keys divide Reduce Space. Input Output Map <k1, v1> list(<k2, v2>) Reduce <k2, list(v2)> list(<k3, v3>)
  • 14. WordCount in Action Input: foo.txt: “This is the foo file” bar.txt: “And this is the bar one” 1 is 1 the 1 foo 1 file 1 and 1 this 1 is 1 the 1 Reduce#2: Input: Output: is, [1, 1] is, 2 Reduce#1: Input: Output: this, [1, 1] this, 2 Reduce#3: Input: Output: foo, [1] foo, 1. . Final output: this 2 is 2 the 2 foo 1 file 1 and 1 bar 1 one 1
  • 15. WordCount with MapReduce map(String filename, String document) { List<String> T = tokenize(document); for each token in T { emit ((String)token, (Integer) 1); } } reduce(String token, List<Integer> values) { Integer sum = 0; for each value in values { sum = sum + value; } emit ((String)token, (Integer) sum); }
  • 16. Hadoop Building Blocks How does hadoop work?...
  • 17. Hadoop Building Blocks 1. NameNode 2. DataNode 3. Secondary NameNode 4. JobTracker 5. TaskTracker
  • 18. HDFS: NameNode and DataNodes
  • 21. Running Hadoop Three modes to run Hadoop: 1. Local (standalone) mode. 2. Pseudo-distributed mode “cluster of one” . 3. Fully distributed mode.
  • 22. An Action Running Hadoop on Local Machine
  • 23. Actions ... 1. Installing Hadoop. 2. Configuring Hadoop (Pseudo-distributed mode). 3. Running WordCount example. 4. Web-based cluster UI.
  • 24. HDFS 1. HDFS is a filesystem designed for large-scale distributed data processing. 2. HDFS isn’t a native Unix filesystem. Basic File Commands: $ hadoop fs -cmd <args> $ hadoop fs –ls $ hadoop fs –mkdir /user/chuck $ hadoop fs -copyFromLocal
  • 25. Anatomy of a MapReduce program MapReduce and beyond
  • 26. Hadoop 1. Data Types 2. Mapper 3. Reducer 4. Partitioner 5. Combiner 6. Reading and Writing a. InputFormat b. OutputFormat
  • 27. Anatomy of a MapReduce program
  • 28. Hadoop Data Types ● Certain defined way of serializing key/value pairs. ● Values should implement Writable Interface. ● Keys should implement WritableComparable interface. ● Some predefined classes: o BooleanWritable. o ByteWritable. o IntWritable o ...
  • 30. Mapper 1. Mapper<K1,V1,K2,V2> 2. Override method: void map(K1 key, V1 value, Context context) 3. Use context.write(K2, V2) to emit key/value pairs.
  • 31. WordCount Mapperpublic static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context){ String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); }
  • 34. Reducer 1. Extends Reducer<K1,V1,K2,V2> 2. Overrides method: void reduce(K2, Iterable<V2>, Context context) 3. Use context.write(K2, V2) to emit key/value pairs.
  • 35. WordCount Reducer public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context){ int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
  • 38. Partitioner The partitioner decides which key goes where class WordSizePartitioner extends Partitioner<Text, IntWritable> { @Override public int getPartition(Text word, IntWritable count, int numOfPartions) { return 0; } }
  • 40. Combiner It’s a local Reduce Task at Mapper. WordCout Mapper Output: 1. Without Combiner:<the, 1>, <file, 1>, <the, 1>, … 2. With Combiner:<the, 2>, <file, 2>, ...
  • 42. Reading and Writing 1. Input data usually resides in large files. 2. MapReduce’s processing power is the splitting of the input data into chunks(InputSplit). 3. Hadoop’s FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
  • 43. InputFormat Classes ● TextInputFormat o <offset, line> ● KeyValueTextInputFormat o keytvaue => <key, value> ● NLineInputFormat o <offset, nLines> You can define your own InputFormat class ...
  • 44. 1. The output has no splits. 2. Each reducer generates output file named part-nnnnn, where nnnnn is the partition ID of the reducer. Predefined OutputFormat classes: > TextOutputFormat <k, v> => ktv OutputFormat
  • 45. Recap
  • 47. Q

Editor's Notes

  1. https://sites.google.com/site/hadoopintroduction/home/what-is-hadoop
  2. Lucene is a full featured text indexer and searching library. Nutch was trying to build a complete web search engine with Lucene, it has web crawler and HTML parser and so on.. Problem: There are billions of web pages there!! What can the poor Nutch do? > Google announced GFS and MapReduce 2004, they said that they are using these techniques in their search engine … realy? :/ < Doug and his team used these techniques for nutch and then Hadoop was born. Doug Cutting
  3. Challenges in processing Large Data in a distributed way.
  4. Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2). Robust—Because it is intended to run on commodity hardware, Hadoop is archi­tected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures. Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster. Simple—Hadoop allows users to quickly write efficient parallel code. Hadoop in Action section 1.2
  5. REF: https://sites.google.com/site/hadoopintroduction/home/comparing-sql-databases-and-hadoop
  6. REF: https://developer.yahoo.com/hadoop/tutorial/module4.html
  7. Table from “Hadoop In Action” Images source: https://developer.yahoo.com/hadoop/tutorial/module4.html
  8. Pseudo-code for map and reduce functions for word counting Source: Hadoop In Action
  9. We now know a general overview about mapreduce, let’s see how hadoop works
  10. Hadoop In Action Figure 2.1
  11. Local (standalone) mode. No HDFS. No Hadoop Daemons. Debugging and testing the logic of MapReduce program. Pseudo-distributed mode. All daemons running on a single machine. Debugging your code, allowing you to examine memory usage, HDFS input/out­put issues, and other daemon interactions. Fully distributed mode.
  12. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  13. This slide is initially left blank.
  14. https://developer.yahoo.com/hadoop/tutorial/module4.html
  15. This slide is initially left blank.
  16. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.
  17. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.