Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Hadoop - Introduction to map reduce programming - Meeting 12/04/2014 (soujavajug)
Aaron Myers introduces MapReduce and Hadoop. MapReduce is a distributed programming paradigm that allows processing of large datasets across clusters. It works by splitting data, distributing it across nodes, processing it in parallel using map and reduce functions, and collecting the results. Hadoop is an open source software framework for distributed storage and processing of big data using MapReduce. It includes HDFS for storage and Hadoop MapReduce for distributed computing. Developers write MapReduce jobs in Java by implementing map and reduce functions.
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0
This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial:
1) Why MapReduce?
2) Write a MapReduce Job to Count Unique Words in a Text File
3) Create Mapper and Reducer in Java
4) Create Driver
5) MapReduce Input Splits, Secondary Sorting, and Partitioner
6) Combiner Functions in MapReduce
7) Job Chaining and Pipes in MapReduce
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations (scottcrespo)
Mastering Hadoop Map Reduce was a presentation I gave to Orlando Data Science on April 23, 2015. The presentation provides a clear overview of how Hadoop Map Reduce works, and then dives into more advanced topics of how to optimize runtime performance and implement custom data types.
The examples are written in Python and Java, and the presentation walks through how to create an n-gram count map reduce program using custom data types.
You can get the full source code for the examples on my Github! http://www.github.com/scottcrespo/ngrams
This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
Hadoop Streaming allows any executable or script to be used as a MapReduce job. It works by launching the executable or script as a separate process and communicating with it via stdin and stdout. The executable or script receives key-value pairs in a predefined format and outputs new key-value pairs that are collected. Hadoop Streaming uses PipeMapper and PipeReducer to adapt the external processes to the MapReduce framework. It provides a simple way to run MapReduce jobs without writing Java code.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
Big Data Analytics with Scala at SCALA.IO 2013 (Samir Bessalah)
This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, SummingBird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
This document provides a technical introduction to Hadoop, including:
- Hadoop has been tested on a 4000 node cluster with 32,000 cores and 16 petabytes of storage.
- Key Hadoop concepts are explained, including jobs, tasks, task attempts, mappers, reducers, and the JobTracker and TaskTrackers.
- The process of launching a MapReduce job is described, from the client submitting the job to the JobTracker distributing tasks to TaskTrackers and running the user-defined mapper and reducer classes.
MapReduce examples, starting from the basic WordCount and going up to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document provides a high-level overview of MapReduce and Hadoop. It begins with an introduction to MapReduce, describing it as a distributed computing framework that decomposes work into parallelized map and reduce tasks. Key concepts like mappers, reducers, and job tracking are defined. The structure of a MapReduce job is then outlined, showing how input is divided and processed by mappers, then shuffled and sorted before being combined by reducers. Example map and reduce functions for a word counting problem are presented to demonstrate how a full MapReduce job works.
MapReduce is a programming model for processing large datasets in parallel. It works by breaking the dataset into independent chunks which are processed by the map function, and then grouping the output of the maps into partitions to be processed by the reduce function. Hadoop uses MapReduce to provide fault tolerance by restarting failed tasks and monitoring the JobTracker and TaskTrackers. MapReduce programs can be written in languages other than Java using Hadoop Streaming.
This document provides an introduction to distributed programming and Apache Hadoop. It discusses sequential, asynchronous, concurrent, and distributed programming. It then describes how Hadoop works as an open source framework for distributed applications, using a cluster of nodes to scale linearly and process large amounts of data reliably in a simple way. Key concepts covered include MapReduce programming, Hadoop installation and usage, and working with files in the Hadoop Distributed File System.
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation.
This session was given in Arabic, and I may provide a video of the session soon.
Speaking of big data analysis, what usually comes to mind is HDFS and MapReduce within Hadoop. But to write a MapReduce program, one must face the problem of learning to write native Java. One might wonder: is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing and big data analysis?
These slides introduce how to install RHadoop step by step and how to write a MapReduce program in R. What is more, they discuss whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs.
Please email me if you find any problem with the slides. EMAIL: tr.ywchiu@gmail.com
When big data comes up, what usually comes to mind is Hadoop's MapReduce and HDFS; but to write MapReduce, one has to learn Java or go through the Thrift interface. Can R run on Hadoop at all? And does R + Hadoop really combine R's powerful analytics with big data analysis?
This talk introduces, step by step, how to install the RHadoop packages on Hadoop and how to write MapReduce programs in R. More importantly, it explores whether RHadoop is a guiding light for big data analysis, or merely another implementation approach.
PyCascading for Intuitive Flow Processing with Hadoop (Gabor Szabo) (PyData)
PyCascading provides a Python API for the Cascading framework to process data flows on Hadoop. It allows defining data flows as Python functions and operations instead of Java code. The document discusses Hadoop concepts, shows how to define a WordCount workflow in PyCascading with fewer lines of code than Java, and walks through a full example of finding friends' most common interests. Key advantages are using Python instead of Java and leveraging any Python libraries, though performance-critical parts require Java.
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running on computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distributed programming really accessible to data scientists. Here you can find an introduction and basic concepts.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
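As a quick, concrete illustration of the RDD model described above, here is a minimal PySpark word count. This is a sketch only: the input path and the local master setting are placeholders, not details taken from the summarized slides.

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")            # local mode for experimenting
lines = sc.textFile("books.txt")                      # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())    # transformation: split lines into words
               .map(lambda word: (word, 1))           # transformation: pair each word with 1
               .reduceByKey(lambda a, b: a + b))      # transformation: sum the 1s per word
print(counts.take(10))                                # action: returns results to the driver
sc.stop()

Until take() runs, nothing is computed; the transformations only build up the lineage of the RDD.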
The document provides an overview of MapReduce and how it addresses the problem of processing large datasets in a distributed computing environment. It explains how MapReduce, inspired by functional programming, works by splitting data, mapping functions to the pieces in parallel, and then reducing the results. Examples are given of word count and of sorting word counts to find the most frequent word. Finally, it discusses how Hadoop popularized MapReduce by providing an open-source implementation and ecosystem.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, the forming of stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains example Spark applications and a dockerized Hadoop environment to experiment with.
Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real-time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover:
+ What is Spark Streaming and what is it used for?
+ How does Spark Streaming work?
+ Example code to read, process, and write the processed data
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
Distributed Computing with Apache Hadoop. Introduction to MapReduce. (Konstantin V. Shvachko)
Abstract: The presentation describes
- What is the BigData problem
- How Hadoop helps to solve BigData problems
- The main principles of the Hadoop architecture as a distributed computational platform
- History and definition of the MapReduce computational model
- Practical examples of how to write MapReduce programs and run them on Hadoop clusters
The talk is targeted to a wide audience of engineers who do not have experience using Hadoop.
MapReduce is a software framework introduced by Google that enables automatic parallelization and distribution of large-scale computations. It hides the details of parallelization, data distribution, load balancing, and fault tolerance. MapReduce allows programmers to specify a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. It then automatically parallelizes the computation across large clusters of machines.
MapReduce: Recap
•Programmers must specify:
map(k, v) → <k’, v’>*
reduce(k’, v’) → <k’, v’>*
–All values with the same key are reduced together
•Optionally, also:
partition(k’, number of partitions) → partition for k’
–Often a simple hash of the key, e.g., hash(k’) mod n
–Divides up key space for parallel reduce operations
combine(k’, v’) → <k’, v’>*
–Mini-reducers that run in memory after the map phase
–Used as an optimization to reduce network traffic
•The execution framework handles everything else…
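To make the recap concrete, here is a small single-process simulation of those four functions for word counting. It is a sketch of the programming model only (assuming one map call per line and two reduce partitions), not of how Hadoop actually executes jobs.

from collections import defaultdict

def map_fn(key, line):                      # map(k, v) -> <k', v'>*
    for word in line.split():
        yield word, 1

def combine_fn(key, values):                # mini-reducer applied after the map phase
    yield key, sum(values)

def partition_fn(key, num_partitions):      # simple hash of the key: hash(k') mod n
    return hash(key) % num_partitions

def reduce_fn(key, values):                 # reduce(k', v'*) -> <k', v'>*
    yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
partitions = [defaultdict(list) for _ in range(2)]
for offset, line in enumerate(lines):
    mapped = defaultdict(list)
    for k, v in map_fn(offset, line):
        mapped[k].append(v)
    for k, vs in mapped.items():            # combine locally, then "shuffle" by partition
        for ck, cv in combine_fn(k, vs):
            partitions[partition_fn(ck, 2)][ck].append(cv)
for part in partitions:
    for k, vs in sorted(part.items()):      # all values with the same key reduced together
        for rk, rv in reduce_fn(k, vs):
            print(rk, rv)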
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015 (Codemotion)
Codemotion Rome 2015 - Big data is without doubt one of the hottest topics on today's technology landscape. To date, roughly 5 exabytes of data have been produced worldwide, a potential source of "intelligence" that, thanks to recent technologies, can be exploited in fields ranging from medicine to sociology to marketing. Through a virtual trip into space, the talk introduces the concepts, techniques, and tools for starting to exploit the potential of big data in everyday work.
This document provides an overview of using R, Hadoop, and Rhadoop for scalable analytics. It begins with introductions to basic R concepts like data types, vectors, lists, and data frames. It then covers Hadoop basics like MapReduce. Next, it discusses libraries for data manipulation in R like reshape2 and plyr. Finally, it focuses on Rhadoop projects like RMR for implementing MapReduce in R and considerations for using RMR effectively.
The document provides an introduction to MapReduce, describing its motivation as a framework for simplifying large-scale data processing across distributed systems. It outlines MapReduce's programming model and main features, including automatic parallelization, fault tolerance, and locality. The document also provides a detailed example of counting letter frequencies in a large file to illustrate how MapReduce works.
This document summarizes the MapReduce programming model and its associated implementation for processing large datasets across distributed systems. It describes how MapReduce allows users to express computations over large datasets in a simple way, while hiding the complexity of parallelization, fault-tolerance, and data distribution. The core abstractions of Map and Reduce are explained, along with how an implementation like Google's leverages a distributed file system to parallelize tasks across clusters and provide fault-tolerance through replication.
Hadoop and Mapreduce for .NET User Group (Csaba Toth)
This document provides an introduction to Hadoop and MapReduce. It discusses big data characteristics and challenges. It provides a brief history of Hadoop and compares it to RDBMS. Key aspects of Hadoop covered include the Hadoop Distributed File System (HDFS) for scalable storage and MapReduce for scalable processing. MapReduce uses a map function to process key-value pairs and generate intermediate pairs, and a reduce function to merge values by key and produce final results. The document demonstrates MapReduce through an example word count program and includes demos of implementing it on Hortonworks and Azure HDInsight.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers, as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor and worked at Google until taking his current position as VP of Data Engineering at NYC Data Science Academy.
--------------------------------------
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
In these slides we analyze why the aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation Hadoop. We consider how MapReduce jobs are written and executed by Hadoop.
Finally, we introduce Spark using a Docker image and show how to use anonymous functions in Spark.
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
The most popular batch processing framework is Apache Hadoop's MapReduce. MapReduce is a Java-based system for processing large datasets in parallel. It reads data from HDFS and divides the dataset into smaller pieces.
Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase, where the data is processed as key-value pairs, and the reduce phase, where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large-scale processing of structured and unstructured data.
This document provides an overview of MapReduce and HBase in big data processing. It discusses how MapReduce distributes tasks across nodes in a cluster and uses map and reduce functions to process large datasets in parallel. It also explains how HBase can be used for storage with MapReduce, providing fast access and retrieval of large amounts of flexible, column-oriented data.
The document discusses MapReduce, a framework for processing large datasets in a distributed manner. It begins by explaining how MapReduce addresses issues around scaling computation across large networks. It then provides details on the key features and working of MapReduce, including how it divides jobs into map and reduce phases that operate in parallel on data blocks. Examples are given to illustrate how MapReduce can be used to count word frequencies in text and tally population statistics from a census.
Sparse matrix computations in MapReduce (David Gleich)
The document announces an ICME MapReduce workshop from April 29 to May 1, 2013. The workshop goals are to learn the basics of MapReduce and Hadoop and be able to process large volumes of scientific data. The workshop overview includes presentations on sparse matrix computations in MapReduce, extending MapReduce for scientific computing, and evaluating MapReduce for science. The document also discusses how to program data computers using MapReduce and Hadoop and provides examples of matrix operations like matrix-vector multiplication in MapReduce.
This document discusses algorithms that can be implemented using MapReduce, including sorting, searching, TF-IDF, breadth-first search, PageRank, and more advanced algorithms. It provides details on how sorting, searching, TF-IDF, breadth-first search, and PageRank algorithms work in MapReduce, including explaining the map and reduce phases. It also discusses graph representations that can be used for algorithms like breadth-first search and PageRank and how the algorithms are distributed across parallel tasks in MapReduce.
Similar to MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab (20)
Understanding computer vision with Deep Learning (CloudxLab)
Computer vision is a branch of computer science which deals with recognising objects and people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori... (CloudxLab)
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
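A tiny PySpark sketch of a few of the pair-RDD operations listed above; the data is made up, and output ordering can vary across runs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "PairRDDDemo")
pets = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])
owners = sc.parallelize([("cat", "alice"), ("dog", "bob")])
print(pets.keys().collect())                        # ['cat', 'dog', 'cat']
print(pets.groupByKey().mapValues(list).collect())  # [('cat', [1, 3]), ('dog', [2])]
print(pets.join(owners).collect())                  # [('cat', (1, 'alice')), ('cat', (3, 'alice')), ('dog', (2, 'bob'))]
print(pets.countByKey())                            # {'cat': 2, 'dog': 1}
print(pets.lookup("cat"))                           # [1, 3]
sc.stop()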
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Loading XML
2) What is RPC - Remote Process Call
3) Loading AVRO
4) Data Sources - Parquet
5) Creating DataFrames From Hive Table
6) Setting up Distributed SQL Engine
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the following topics -
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the following topics -
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the following topics -
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
In this tutorial, we will learn the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2wLh5aF
This CloudxLab Introduction to Linux helps you to understand Linux in detail. Below are the topics covered in this tutorial:
1) Linux Overview
2) Linux Components - The Programs, The Kernel, The Shell
3) Overview of Linux File System
4) Connect to Linux Console
5) Linux - Quick Start Commands
6) Overview of Linux File System
4. MapReduce
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data? The data is made up of 10 billion 100-byte strings.
A: Around 6-10 hours (plausible as a rough estimate: sorting 10 billion items takes on the order of n log n, about 3x10^11 comparisons over 100-byte strings, and 1 TB far exceeds RAM, so disk I/O dominates).
What's wrong with 6-10 hours? We need:
1. Faster sorting
2. Sorting of bigger data
3. Sorting more often
5. MapReduce
BIG DATA PROBLEM - PROCESSING
Google, 8 Sept 2011:
Sorting 10 PB took 6.5 hours on 8,000 computers
6. MapReduce
Why is Sorting Such a Big Deal?
1. Every SQL query is impacted by sorting:
○ Where clause - index (sorting)
○ Group By - involves sorting
○ Joins - immensely enhanced by sorting
○ Distinct
○ Order By
2. Most algorithms depend on sorting
7. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into "key-value" pairs
• Reducer - aggregates all the values for a key
8. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems, such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, shell, Python, or any binary
11. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file containing 100s of text books [500 MB],
how would you find the frequencies of the words?
12. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
13. MapReduce
THINKING IN MAP / REDUCE
Problems?
[Flowchart: Start → initialize a dictionary/hashtable of (word, count) → read the next word from the file → if a word is left, look it up in the dictionary, adding it with count 0 if it is new, and increase its count by 1 → when no words are left, print the words and their counts → End.]
1. wordcount = {}
2. for word in file.read().split():
3.     if word not in wordcount:
4.         wordcount[word] = 0
5.     wordcount[word] += 1
6. for k, v in wordcount.items():
7.     print(k, v)
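The same logic in idiomatic Python uses collections.Counter, which handles the "new word" branch automatically (a minimal sketch, assuming the whole file fits in RAM; books.txt is a hypothetical input file):

from collections import Counter

# Count word frequencies in a single pass.
with open("books.txt") as f:
    wordcount = Counter(f.read().split())

for word, count in wordcount.items():
    print(word, count)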
14. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
Cannot process data beyond the size of RAM.
15. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 2 (SQL):
• Break the books into one word per line
• Insert one word per row in a database table
• Execute: select word, count(*) from table group by word
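A minimal sketch of Approach 2 using Python's built-in sqlite3 module (the table name words and the file books.txt are assumptions for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

# One word per row, then let the database do the grouping.
with open("books.txt") as f:
    conn.executemany("INSERT INTO words VALUES (?)",
                     ((w,) for w in f.read().split()))

for word, count in conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word"):
    print(word, count)

Note that a real database has to push the sorting/hashing behind GROUP BY to disk once the table outgrows RAM, which is exactly where this approach slows down.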
20. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 3 (Unix):
• Replace each space with a newline
• Order the lines with the sort command
• Then find the frequencies using uniq -c
• It scans from top to bottom
• and prints the count whenever the line value changes
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
22. MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
The moment the data grows beyond RAM, the time taken
starts increasing sharply. The following become bottlenecks:
• CPU
• Disk Speed
• Disk Space
23. MapReduce
THINKING IN MAP / REDUCE
Then?
Approach 4: Use an external sort.
• Split the file into pieces that fit in RAM
• Use the previous approaches (2 & 3) to find the frequencies of each piece
• Merge (sort -m) the pieces and sum up the frequencies
[Diagram: a launcher splits the input between Machine 1 ("sa, re, re" → re:2, sa:1) and Machine 2 ("ga, ga, re" → ga:2, re:1); merging the two sorted partial counts gives ga:2, re:3, sa:1.]
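A minimal sketch of Approach 4 in Python, with hypothetical run files standing in for the per-machine outputs: count each RAM-sized chunk separately, write the partial counts sorted by word, then merge the sorted runs (like sort -m) and sum the counts:

import heapq
from collections import Counter
from itertools import groupby

def count_chunk(words, run_path):
    # Phase 1: count one RAM-sized chunk and write
    # "word count" lines sorted by word.
    counts = Counter(words)
    with open(run_path, "w") as out:
        for word in sorted(counts):
            out.write("%s %d\n" % (word, counts[word]))

def merge_runs(run_paths):
    # Phase 2: stream-merge the sorted runs and sum the
    # partial counts for each word.
    files = [open(p) for p in run_paths]
    for word, group in groupby(heapq.merge(*files),
                               key=lambda line: line.split()[0]):
        print(word, sum(int(line.split()[1]) for line in group))
    for f in files:
        f.close()

count_chunk("sa re re".split(), "run1.txt")
count_chunk("ga ga re".split(), "run2.txt")
merge_runs(["run1.txt", "run2.txt"])   # ga 2, re 3, sa 1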
24. MapReduce
THINKING IN MAP / REDUCE
Merging
• Takes O(n) time to merge sorted data
• That is, the time is proportional to the number of
elements to be merged
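For example, merging two sorted lists takes one pass over both (a minimal sketch):

def merge(a, b):
    # One pass over both sorted lists: O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

# merge([1, 4, 6], [3, 5, 7]) -> [1, 3, 4, 5, 6, 7]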
41. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Use a min-heap
[Diagram: six sorted lists — (1, 4, 6), (9, 10, 12), (6, 7, 8), (8, 9, 9), (3, 5, 7), (5, 10, 17). A min-heap holds the current head of each list; the smallest head is popped to the output and replaced by the next element of its list.]
42. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Use a min-heap
[Same diagram, one step later: the minimum, 1, has been emitted to the output.]
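Python's heapq.merge implements exactly this min-heap strategy; a quick check with the six lists from the diagram:

import heapq

lists = [[1, 4, 6], [9, 10, 12], [6, 7, 8],
         [8, 9, 9], [3, 5, 7], [5, 10, 17]]
# A min-heap holds the head of each list; the smallest head
# is popped to the output and replaced by its list's next element.
print(list(heapq.merge(*lists)))
# [1, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 10, 12, 17]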
43. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Or merge two lists at a time
44. MapReduce
THINKING IN MAP / REDUCE
Problems with Approach 4?
[Same diagram as before: the launcher distributes "sa, re, re" to Machine 1 and "ga, ga, re" to Machine 2, and the partial counts are merged into ga:2, re:3, sa:1.]
45. MapReduce
THINKING IN MAP / REDUCE
Problems with the external sort?
Time is consumed in transporting the data.
+
For each requirement we would need to write a
special-purpose, network-oriented program.
+
It would require a lot of engineering.
Solution?
Use Map/Reduce
46. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into "key-value" pairs
• Reducer - aggregates all the values for a key
47. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems, such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, shell, Python, or any binary
48. MapReduce
EXAMPLE OF ONLY MAPPER
Directory of profile pictures in HDFS
→ Machine 1, Machine 2, and Machine 3 each run
Function Mapper (Image): convert the image to 100x100 pixels
→ HDFS output directory of 100x100 px profile pictures
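A rough sketch of such a mapper (assuming the Pillow library and a setup where each mapper reads image file paths, one per line, on stdin; the output naming is illustrative):

import sys
from PIL import Image

# Mapper-only job: no key-value pairs are needed, only a
# side effect per input record (one image path per line).
for line in sys.stdin:
    path = line.strip()
    img = Image.open(path)
    img.resize((100, 100)).save("resized_" + path.split("/")[-1])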
51. MapReduce
MAP / REDUCE
Mapper for the word frequency problem:
function map(line):
    foreach(word in line):
        print(word, 1)
[Input in HDFS: "sa re re" and "sa ga" → mapper output: (sa, 1), (re, 1), (re, 1), (sa, 1), (ga, 1)]
52. MapReduce
MAP / REDUCE
Mapper/Reducer for the word frequency problem:
function map(line):
    foreach(word in line):
        print(word, 1)
function reduce(word, freqArray):
    return Array.sum(freqArray)
[Input in HDFS: "sa re re" and "sa ga" → mapper output: (sa, 1), (re, 1), (re, 1), (sa, 1), (ga, 1) → grouped for the reducer: ga [1], re [1, 1], sa [1, 1] → reducer output: ga 1, re 2, sa 2]
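As runnable Hadoop Streaming scripts, the same pair might look like this (a minimal sketch; Streaming feeds the reducer its lines already sorted by key, which is what makes the grouping work):

mapper.py:
#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word of every input line.
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

reducer.py:
#!/usr/bin/env python
import sys
from itertools import groupby

# Input arrives sorted by key, so all lines for a word are adjacent.
for word, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
    print("%s\t%d" % (word, sum(int(l.split("\t")[1]) for l in group)))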
54. MapReduce
Mapper/Reducer for computing the max temperature.
Input lines are "Temp, City, Date":
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, SEATTLE, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-05-05
def mapp(line):
    (t, c, date) = line.split(",")
    print(c, (t, date))
def reduce(key, values):
    maxt = -19191919
    date = ''
    for i in values:
        t = i[0]
        if t > maxt:
            maxt = t
            date = i[1]
    return (maxt, date)
[The mapper emits (city, (temp, date)) pairs; after the shuffle/sort, the reducer receives all the (temp, date) pairs for one city and returns the maximum, e.g. NYC → (24, '2016-05-05').]
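In a real Streaming job the reducer does not receive a ready-made values array; it reads sorted "city<TAB>temp,date" lines and must do the grouping itself. A hedged sketch of the equivalent reducer (the line format is an assumption, matching a mapper that emits city as the key):

#!/usr/bin/env python
import sys
from itertools import groupby

# All lines for one city arrive consecutively after the shuffle/sort.
for city, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
    maxt, maxdate = -19191919, ''
    for line in group:
        t, date = line.strip().split("\t")[1].split(",")
        if int(t) > maxt:
            maxt, maxdate = int(t), date
    print("%s\t%d\t%s" % (city, maxt, maxdate))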
55. MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    (temp, city, time) = line.split(",")
    print(city, temp)
function reduce(city, arr_temps):
    return max(arr_temps)
select city, max(temp)
from table
group by city
56. MapReduce
MAP / REDUCE
Analogous to Group By
function map(input):
    foreach(word in input):
        print(word, 1)
function reduce(word, freqArray):
    return Array.sum(freqArray)
select word, count(*)
from table
group by word
57. MapReduce
MAP REDUCE - Multiple Reducers
[Diagram: three HDFS blocks feed three map tasks (Split 0, Split 1, Split 2). Each map task sorts its output locally; the sorted map outputs are then copied to the reducers and merged. The reducers write Part 0 and Part 1 back to HDFS. Keys such as Apple, Banana, Apricot, and Carrots are partitioned between the two reducers.]
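Which reducer a given key lands on is decided by the partitioner. Hadoop's default HashPartitioner behaves conceptually like the sketch below (not Hadoop's exact hash function):

def partition(key, num_reducers):
    # The same key always maps to the same reducer, so all
    # values for a key meet in one place.
    return hash(key) % num_reducers

# With 2 reducers, e.g. "Apple" and "Apricot" may land on reducer 0
# while "Banana" and "Carrots" land on reducer 1.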
62. MapReduce
MAP / REDUCE
The data generated by the mapper is given to the
reducer and then sorted/shuffled [Yes/No]?
63. MapReduce
MAP / REDUCE
The data generated by the mapper is given to the
reducer and then sorted/shuffled [Yes/No]?
No. The output of the mapper is first
shuffled/sorted and then given to the reducers.
64. MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
65. MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
False. A mapper can generate as many key-value pairs
as it wants for an input.
66. MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
67. MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
Wrong.
69. MapReduce
MAP / REDUCE
By default there is only one reducer in a
streaming job [Yes/No]?
Yes. By default there is a single reducer, but the
number of reducers can be changed with the
command-line option mapred.reduce.tasks.
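For example, a Streaming job with two reducers might be launched like this (the jar location and the input/output paths are illustrative):

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapred.reduce.tasks=2 \
  -input /data/mr/wordcount/input -output wcout \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py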
70. MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
trackers
71. MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
trackers
B.
72. MapReduce
MAP / REDUCE
Q: The Map logic is preferably executed on the
nodes that have the required data [Yes/No]?
73. MapReduce
MAP / REDUCE
Q: The Map logic is preferably executed on the
nodes that have the required data [Yes/No]?
Yes.
74. MapReduce
MAP / REDUCE
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?
77. MapReduce
MAP / REDUCE
Where does Hadoop store the result of the reducer?
In HDFS or the local file system?
In HDFS.
78. MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?
79. MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?
First in memory, then spilled to the
local file system.
The output of the mapper is saved in HDFS directly
only if there is no reduce phase.
80. MapReduce
MAP / REDUCE Assignment For Tomorrow
1. Frequencies of letters [a-z] - do you need Map/Reduce?
2. Find anagrams in a huge text. An anagram is basically a
different arrangement of the letters in a word. An anagram
does not need to have a meaning.
Input:
"the cat act in tic tac toe"
Output:
cat, tac, act
the
toe
in
tic
81. MapReduce
MAP / REDUCE Assignment For Tomorrow
3a. A file contains the DNA sequences of people. Find all the
people who have the same DNA.
Input:
"User1 ACGT"
"User2 TGCA"
"User3 ACG"
"User4 ACGT"
"User5 ACG"
"User6 AGCT"
Output:
User1, User4
User2
User3, User5
User6
82. MapReduce
MAP / REDUCE Assignment For Tomorrow
3b. A file contains the DNA sequences of people. Find all the
people who have the same DNA or a mirror image of it.
Input:
"User1 ACGT"
"User2 TGCA"
"User3 ACG"
"User4 ACGT"
"User5 ACG"
"User6 ACCT"
Output:
User1, User2, User4
User3, User5
User6
83. MapReduce
MAP / REDUCE Assignment For Tomorrow
4. In an unusual democracy, not everyone is equal. The vote count is a
function of the worth of the voter, though anyone may vote for anyone else.
For example, if A with a worth of 5 and B with a worth of 1 both vote
for C, the vote count of C is 6.
You are given a list of people with the worth of their vote (List 2). You are
also given another list describing who voted for whom (List 1).
Find out the vote count of everyone.
List 1:
Voter  Votee
A      C
B      C
C      F
List 2:
Person  Worth
A       5
B       1
C       11
Result:
Person  VoteCount
A       0
B       0
C       6
F       11
89. MapReduce
QUICK - CLUSTER HANDS ON
MapReduce Command
The example is available here.
Remove the old output directory:
hadoop fs -rm -r /user/student/wordcount/output
Execute the MapReduce command:
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /data/mr/wordcount/input mrout
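Once the job finishes, the counts can be inspected in the output directory (assuming the job wrote to mrout as above):

hadoop fs -ls mrout
hadoop fs -cat mrout/part-r-00000 | head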