This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
2. Combiner
• A Combiner is also known as a mini-reducer or mapper-side reducer.
• The Combiner receives as input all data emitted by the Mapper instances on a given node, and its output is then sent to the Reducers.
• The Combiner is used between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce.
• Usage of the Combiner is optional.
3. When ?
• If the reduce function is both commutative and associative, we do not need to write any additional code to take advantage of a Combiner:
job.setCombinerClass(Reduce.class);
• A Combiner does not have a predefined interface of its own; it should be an instance of the Reducer interface.
• If your Reducer itself cannot be used directly as a Combiner because of commutativity or associativity, you might still be able to write a third class to use as a Combiner for your job.
• Note – Hadoop does not guarantee how many times the combiner function will run for a given map output, or whether it will run at all.
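As a concrete illustration, here is a minimal sketch of a WordCount-style driver that reuses the reducer as the combiner via job.setCombinerClass(). The WordCountMapper and WordCountReducer class names are assumptions for illustration; summing counts is commutative and associative, so the reducer can safely double as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);     // hypothetical mapper class
    job.setCombinerClass(WordCountReducer.class);  // reducer reused as the combiner
    job.setReducerClass(WordCountReducer.class);   // hypothetical reducer class

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}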
8. Speculative execution
• One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program.
• The Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution.
• When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy; the remaining redundant copies are then abandoned.
9. • Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the following configuration properties to false:
• mapred.map.tasks.speculative.execution
• mapred.reduce.tasks.speculative.execution
• There is a hard limit of 10% of slots used for speculation across all Hadoop jobs. This is not configurable right now. However, there is a per-job option to cap the ratio of speculated tasks to total tasks:
mapreduce.job.speculative.speculativecap=0.1
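A hedged sketch of how those properties could be set programmatically in a driver, using the old-style property names listed above (newer Hadoop releases also expose them as mapreduce.map.speculative and mapreduce.reduce.speculative); the class and job names are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Disable speculative execution for both map and reduce tasks
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

    Job job = Job.getInstance(conf, "no-speculation-example");
    // ... mapper, reducer, input/output paths are configured as usual ...
  }
}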
10. Locating Stragglers
• Hadoop monitors each task's progress using a progress score between 0 and 1.
• If a task's progress score is less than (average – 0.2), and the task has run for at least 1 minute, it is marked as a straggler.
11. COUNTERS
• Counters are used to determine if and how often a
particular event occurred during a job execution.
• 4 categories of counters in Hadoop
• File system,
• Job
• Map Reduce Framework,
• Custom counter
16. Custom Counters
• MapReduce allows you to define your own custom counters. Custom counters are useful for counting specific records such as bad records, as the framework counts only total records. Custom counters can also be used to count outliers such as maximum and minimum values, and for summations.
17. Steps to write a custom counter
• Define an enum (in the mapper or reducer, anywhere based upon requirement):
public static enum MATCH_COUNTER {
  Score_above_400,
  Score_below_20,
  Temp_abv_55;
}
• Increment the counter wherever the event of interest occurs:
context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
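To make the two steps concrete, here is a minimal sketch of a mapper that maintains such counters, plus a driver-side read-back; the record layout (comma-separated id and score) and class names are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  public static enum MATCH_COUNTER { SCORE_ABOVE_400, SCORE_BELOW_20 }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");   // assumed "id,score" records
    int score = Integer.parseInt(fields[1].trim());

    if (score > 400) {
      context.getCounter(MATCH_COUNTER.SCORE_ABOVE_400).increment(1);
    } else if (score < 20) {
      context.getCounter(MATCH_COUNTER.SCORE_BELOW_20).increment(1);
    }
    context.write(new Text(fields[0]), new IntWritable(score));
  }
}

// Driver side, after job.waitForCompletion(true), the totals can be read back:
// long above400 = job.getCounters()
//     .findCounter(ScoreMapper.MATCH_COUNTER.SCORE_ABOVE_400).getValue();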
18. Data Types
• Hadoop MapReduce uses typed data at all times when it interacts with user-provided Mappers and Reducers.
• In WordCount, you must have seen LongWritable, IntWritable and Text. It is fairly easy to understand the relation between them and Java's primitive types: LongWritable is equivalent to long, IntWritable to int and Text to String.
19. Hadoop writable classes (data types) vs Java data types
Java       Hadoop
byte       ByteWritable
int        IntWritable / VIntWritable
float      FloatWritable
long       LongWritable / VLongWritable
double     DoubleWritable
String     Text
(NullWritable is a zero-size placeholder used when no key or value is needed.)
20. • What is a Writable in Hadoop?
• Why does Hadoop use Writable(s)?
• Limitation of primitive Hadoop Writable classes
• Custom Writable
21. Writable in Hadoop
• It is fairly easy to understand the relation between the Hadoop types and Java's primitive types: LongWritable is equivalent to long, IntWritable to int and Text to String.
• Writable is an interface in Hadoop, and types in Hadoop must implement this interface. Hadoop provides these writable wrappers for almost all Java primitive types and some other types.
• To implement the Writable interface we require two methods:
public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
22. Why does Hadoop use
Writable(s)
• As we already know, data needs to be transmitted between different
nodes in a distributed computing environment.
• This requires serialization and deserialization of data to convert the
data that is in structured format to byte stream and vice-versa.
• Hadoop therefore uses simple and efficient serialization protocol to
serialize data between map and reduce phase and these are called
Writable(s).
23. WritableComparable
• WritableComparable is just a subinterface of the Writable and java.lang.Comparable interfaces.
• For implementing a WritableComparable we must have a compareTo method apart from the readFields and write methods.
• Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another.
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
24. • Conceptually, a WritableComparable combines the following methods:
public interface WritableComparable<T> extends Writable, Comparable<T> {
  void readFields(DataInput in);   // from Writable
  void write(DataOutput out);      // from Writable
  int compareTo(T o);              // from Comparable
}
• WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
• Any type which is to be used as a value in the Hadoop Map-Reduce framework should implement the Writable interface.
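A hedged sketch of a custom composite key implementing WritableComparable, assuming a hypothetical (year, temperature) pair sorted by year and then by descending temperature; the field and class names are illustrative only.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
  private int year;
  private int temperature;

  public YearTempKey() { }                          // required no-arg constructor

  public YearTempKey(int year, int temperature) {
    this.year = year;
    this.temperature = temperature;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(temperature);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    temperature = in.readInt();
  }

  @Override
  public int compareTo(YearTempKey other) {
    int cmp = Integer.compare(year, other.year);    // ascending year first
    if (cmp != 0) return cmp;
    return -Integer.compare(temperature, other.temperature);  // then descending temp
  }

  @Override
  public int hashCode() { return year * 163 + temperature; }  // used by HashPartitioner

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearTempKey)) return false;
    YearTempKey k = (YearTempKey) o;
    return year == k.year && temperature == k.temperature;
  }
}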
25. Limitation of primitive Hadoop Writable classes
• The primitive Writable classes can be used in simple applications like WordCount, but clearly they cannot serve our purpose all the time.
• If you want to still use the primitive Hadoop Writable(s), you would have to convert the value into a string and transmit it. However, it gets very messy when you have to deal with string manipulations.
27. INPUT Format
• The InputFormat class is one of the fundamental classes in the Hadoop Map Reduce framework. This class is responsible for defining two main things:
Data splits
Record reader
• A data split is a fundamental concept in the Hadoop Map Reduce framework which defines both the size of an individual Map task and its potential execution server.
• The Record Reader is responsible for actually reading records from the input file and submitting them (as key/value pairs) to the mapper.
28. • public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
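As a small, hedged driver fragment: the job's InputFormat is selected with setInputFormatClass(). TextInputFormat (the default) hands each mapper a (byte offset, line) pair, while KeyValueTextInputFormat splits each line at the first tab into a (Text, Text) pair; the surrounding driver wiring is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input-format-example");
    // Each line becomes a (Text key, Text value) pair split at the first tab,
    // instead of the default TextInputFormat's (LongWritable offset, Text line).
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // ... mapper, reducer and paths configured as usual ...
  }
}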
30. MultiInputs
• We use the MultipleInputs class, which supports MapReduce jobs that have multiple input paths with a different InputFormat and Mapper for each path.
• MultipleInputs is a feature that supports different input formats in a MapReduce job.
32. • Step 1: Add configuration in the driver class
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, myMapper1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, myMapper2.class);
• Step 2: Write a different Mapper for each file path:
class myMapper1 extends Mapper<KI, VI, KO, VO> {
}
class myMapper2 extends Mapper<KI, VI, KO, VO> {
}
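A fuller, hedged sketch of what the driver from Step 1 could look like end to end; the MyDriver, MyMapper1, MyMapper2 and MyJoinReducer class names are assumptions, and both mappers must emit the same key/value output types so they can share one reducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multiple-inputs-example");
    job.setJarByClass(MyDriver.class);

    // One mapper per input path; both read plain text here
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, MyMapper1.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, MyMapper2.class);

    job.setReducerClass(MyJoinReducer.class);   // shared reducer for both inputs
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}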
33. MultipleOutputFormat
• FileOutputFormat and its subclasses generate a set of
files in the output directory.
• There is one file per reducer, and files are named by the
partition number: part-00000, part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files or to produce multiple files per
reducer.
34. • Step 1
MultipleOutputs.addNamedOutput(job, "NAMED_OUTPUT", TextOutputFormat.class, Text.class, DoubleWritable.class);
• Step 2
Override the setup() method in the reducer class and create an instance of MultipleOutputs:
public void setup(Context context) throws IOException, InterruptedException {
  mos = new MultipleOutputs<Text, DoubleWritable>(context);
}
• Step 3
Use the MultipleOutputs instance in the reduce() method to write data to the output:
mos.write("NAMED_OUTPUT", outputKey, outputValue);
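Tying the three steps together, here is a hedged reducer sketch. The named output is called "stats" here (MultipleOutputs only accepts alphanumeric named-output names, so the NAMED_OUTPUT placeholder above is replaced), and a cleanup() override is added to close the MultipleOutputs instance so that all named outputs are flushed.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class StatsReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private MultipleOutputs<Text, DoubleWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, DoubleWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;
    for (DoubleWritable v : values) {
      sum += v.get();
    }
    mos.write("stats", key, new DoubleWritable(sum));   // lands in stats-r-xxxxx files
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();   // flush and close all named outputs
  }
}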
35. DISTRIBUTED CACHE
• If you are writing Map Reduce applications where you want some files to be shared across all nodes in the Hadoop cluster, the Distributed Cache can be used. The shared file can be a simple properties file or an executable jar file.
• The Distributed Cache is configured with the Job Configuration. What it does is provide read-only data to all machines on the cluster.
• The framework will copy the necessary files onto the slave node before any tasks for the job are executed on that node.
37. • Step 1: Put the file into HDFS
hdfs dfs -put /rakesh/someFolder /user/rakesh/cachefile1
• Step 2: Add the cache file in the Job Configuration
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/rakesh/cachefile1"), job.getConfiguration());
• Step 3: Access the cached file
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());
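A hedged mapper sketch expanding Step 3: the cached file is read once in setup() into an in-memory lookup table and then consulted for every record in map(). The comma-separated file layout and the class name are assumptions for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cacheFiles = context.getLocalCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
      BufferedReader reader =
          new BufferedReader(new FileReader(cacheFiles[0].toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split(",", 2);     // assumed "key,value" lines
          lookup.put(parts[0], parts[1]);
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",", 2);
    String extra = lookup.get(fields[0]);          // join against the cached data
    if (extra != null) {
      context.write(new Text(fields[0]), new Text(fields[1] + "," + extra));
    }
  }
}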
38. Mapreduce 1.0 vs Mapreduce 2.0
• One easy way to differentiate between the Hadoop old API and the new API is the package name.
• Old API packages – mapred – the org.apache.hadoop.mapred package.
• New API packages – mapreduce – the org.apache.hadoop.mapreduce package.
40. Joins
• Joins is one of the interesting features available in MapReduce.
• When processing large data sets, the need for joining data by a common key can be very useful.
• By joining data you can gain further insight, such as joining with timestamps to correlate events with a time of day.
• MapReduce can perform joins between very large datasets. The implementation of a join depends on how large the datasets are and how they are partitioned. If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join.
41. Map-Side Join
• A map-side join between large inputs works by performing the join
before the data reaches the map function.
• For this to work, though, the inputs to each map must be partitioned
and sorted in a particular way.
• Each input data set must be divided into the same number of
partitions, and it must be sorted by the same key (the join key) in
each source.
• All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits
the description of the output of a MapReduce job.
42. Reduce side Join
• Reduce-side joins are simpler than map-side joins since the input datasets need not be structured. But they are less efficient, as both datasets have to go through the MapReduce shuffle phase. The records with the same key are brought together in the reducer. We can also use the Secondary Sort technique to control the order of the records.
• How is it done?
The key of the map output, of the datasets being joined, has to be the join key - so they reach the same reducer.
• Each dataset has to be tagged with its identity in the mapper - to help differentiate between the datasets in the reducer, so they can be processed accordingly.
43. • In each reducer, the data values from both datasets, for
keys assigned to the reducer, are available, to be
processed as required.
• A secondary sort needs to be done to ensure the
ordering of the values sent to the reducer.
• If the input files are of different formats, we would need
separate mappers, and we would need to use
MultipleInputs class in the driver to add the inputs and
associate the specific mapper to the same.
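A hedged reduce-side join sketch for a user/activity join of the kind described above: each mapper is assumed to emit the user id as the key and a tag-prefixed value ("U\t..." for the user dataset, "A\t..." for the activity log), and the reducer pairs them. The tag strings, record layout and class name are assumptions for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UserActivityJoinReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text userId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String userName = null;
    List<String> activities = new ArrayList<String>();

    for (Text v : values) {
      String tagged = v.toString();
      if (tagged.startsWith("U\t")) {
        userName = tagged.substring(2);          // record from the user dataset
      } else if (tagged.startsWith("A\t")) {
        activities.add(tagged.substring(2));     // record from the activity log
      }
    }

    if (userName != null) {
      for (String activity : activities) {
        context.write(userId, new Text(userName + "\t" + activity));
      }
    }
  }
}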
44. Improving MapReduce Performance
• Use a compression technique (LZO, GZIP, Snappy, ...)
• Tune the number of map and reduce tasks appropriately
• Write a Combiner
• Use the most appropriate and compact Writable type for your data
• Reuse Writables
• Reference: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
45. Yet Another Resource Negotiator
(YARN)
• YARN (Yet Another Resource Negotiator) is the resource
management layer for the Apache Hadoop ecosystem.
In a YARN cluster, there are two types of hosts;
• The ResourceManager is the master daemon that communicates
with the client, tracks resources on the cluster, and orchestrates
work by assigning tasks to NodeManagers.
• A NodeManager is a worker daemon that launches and tracks
processes spawned on worker hosts.
46. • Containers are an important YARN concept. You can think of a
container as a request to hold resources on the YARN cluster.
• Use of a YARN cluster begins with a request from a client consisting
of an application. The ResourceManager negotiates the necessary
resources for a container and launches an ApplicationMaster to
represent the submitted application.
• Using a resource-request protocol, the ApplicationMaster negotiates
resource containers for the application at each node. Upon
execution of the application, the ApplicationMaster monitors the
container until completion. When the application is complete, the
ApplicationMaster unregisters its container with the
ResourceManager, and the cycle is complete.
When a MapReduce Job is run on a large dataset, Hadoop Mapper generates large chunks of intermediate data that is passed on to Hadoop Reducer for further processing, which leads to massive network congestion.
To reduce this network congestion, the MapReduce framework offers the ‘Combiner’.
In MapReduce a job is broken into several tasks which execute in parallel. This model of execution is sensitive to slow tasks (even if they are very few in number) as they will slow down the overall execution of a job. Therefore, Hadoop detects such slow tasks and runs (duplicate) backup tasks for them. This is called speculative execution. Speculating more tasks can help jobs finish faster - but can also waste CPU cycles. Conversely - speculating fewer tasks can save CPU cycles - but cause jobs to finish slower. The options documented here allow the users to control the aggressiveness of the speculation algorithms and choose the right balance between efficiency and latency.
The FILE_BYTES_WRITTEN counter is incremented for each byte written to the local file system. These writes occur during the map phase when the mappers write their intermediate results to the local file system. They also occur during the shuffle phase when the reducers spill intermediate results to their local disks while sorting.
The off-the-shelf Hadoop counters that correspond to MAPRFS_BYTES_READ and MAPRFS_BYTES_WRITTEN are HDFS_BYTES_READ and HDFS_BYTES_WRITTEN.
The amount of data read and written will depend on the compression algorithm you use, if any.
The table above describes the counters that apply to Hadoop jobs.
The DATA_LOCAL_MAPS indicates how many map tasks executed on local file systems. Optimally, all the map tasks will execute on local data to exploit locality of reference, but this isn’t always possible.
The FALLOW_SLOTS_MILLIS_MAPS indicates how much time map tasks wait in the queue after the slots are reserved but before the map tasks execute. A high number indicates a possible mismatch between the number of slots configured for a task tracker and how many resources are actually available.
The SLOTS_MILLIS_* counters show how much time in milliseconds expired for the tasks. This value indicates wall clock time for the map and reduce tasks.
The TOTAL_LAUNCHED_MAPS counter defines how many map tasks were launched for the job, including failed tasks. Optimally, this number is the same as the number of splits for the job.
The COMBINE_* counters show how many records were read and written by the optional combiner. If you don’t specify a combiner, these counters will be 0.
The CPU statistics are gathered from /proc/cpuinfo and indicate how much total time was spent executing map and reduce tasks for a job.
The garbage collection counter is reported from GarbageCollectorMXBean.getCollectionTime().
The MAP*RECORDS are incremented for every successful record read and written by the mappers. Records that the map tasks failed to read or write are not included in these counters.
The PHYSICAL_MEMORY_BYTES statistics are gathered from /proc/meminfo and indicate how much RAM (not including swap space) was consumed by all the tasks.
All the counters, whether custom or framework, are stored in the JobTracker JVM memory, so there’s a practical limit to the number of counters you should use. The rule of thumb is to use less than 100, but this will vary based on physical memory capacity.
Serialization : it is a mechanism of writing the state of an object into a byte stream.
A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface.
More technically , To serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object.
The reverse operation of serialization is called deserialization.
Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. Hadoop provides several stock classes which implement Writable: Text (which stores String data), IntWritable, LongWritable, FloatWritable, BooleanWritable, and several others. The entire list is in the org.apache.hadoop.io package of the Hadoop source (see the API reference - http://hadoop.apache.org/docs/current/api/index.html).
Custom writable:
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
public interface Comparable{
public int compareTo(Object obj);
}
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Any split implementation extends the Apache base abstract class - InputSplit - defining a split length and locations. The split length is the size of the split data (in bytes), while locations is the list of node names where the data for the split would be local. Split locations are a way for a scheduler to decide on which particular machine to execute this split. A very simple job tracker works as follows:
Receive a heartbeat from one of the task trackers, reporting map slot availability.
Find a queued-up split for which the available node is "local".
Submit the split to the task tracker for execution.
Locality can mean different things depending on storage mechanisms and the overall execution strategy. In the case of HDFS, for example, a split typically corresponds to a physical data block size and locations is a set of machines (with the set size defined by a replication factor) where this block is physically located. This is how FileInputFormat calculates splits.
HIPI (Hadoop Image Processing Interface) is a framework for processing collections of image files with MapReduce.
Code example : http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/
Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives which are un-archived on the slaves.
How big is the DistributedCache?
The local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB.
Where does the DistributedCache store data?
/tmp/hadoop-<user.name>/mapred/local/taskTracker/archive
If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes. Before diving into the implementation let us understand the problem thoroughly.
A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable which means the ouput files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this.
If we have two datasets, for example, one dataset having user ids and names and the other having the user activity over the application, then in order to find out which user has performed what activity on the application we might need to join these two datasets, so that the user names and the user activity are joined together. The join strategy can be chosen based on dataset size: if one dataset is small enough to be distributed across the cluster, we can use the Side Data Distribution technique.
Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall.
Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3GB of disk writes. In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.
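A hedged configuration fragment showing how map-output compression could be switched on programmatically, using the old-style property names; the LzoCodec class comes from the separately installed hadoop-lzo library, so another CompressionCodec may be substituted if LZO is not set up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);       // compress map output
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");            // from hadoop-lzo
    Job job = Job.getInstance(conf, "compressed-shuffle-example");
    // ... mapper, reducer and paths configured as usual ...
  }
}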
The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
Conclusion
Summarizing the important concepts presented in this section:
A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host's resources, and the ResourceManager keeps track of the cluster's total.
A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the resources requested for the container. Once the container is allocated, those resources are usable by the container.
An application in YARN comprises three parts:
The application client, which is how a program is run on the cluster.
An ApplicationMaster which provides YARN with the ability to perform allocation on behalf of the application.
One or more tasks that do the actual work (runs in a process) in the container allocated by YARN.
A MapReduce application consists of map tasks and reduce tasks.
A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.