This document provides an overview of the big data technology stack, including the data layer (HDFS, S3, GPFS), data processing layer (MapReduce, Pig, Hive, HBase, Cassandra, Storm, Solr, Spark, Mahout), data ingestion layer (Flume, Kafka, Sqoop), data presentation layer (Kibana), operations and scheduling layer (Ambari, Oozie, ZooKeeper), and concludes with a brief biography of the author.
2. Agenda
▪ Big Data Stack : In a Nutshell
▪ Data Layer
▪ Data Processing Layer
▪ Data Ingestion Layer
▪ Data Presentation Layer
▪ Operations & Scheduling Layer
▪ Security & Governance
4. Data Layer
Hadoop Distributed File System (HDFS)
HDFS is a scalable, fault-tolerant, Java-based distributed file system used for storing large volumes of data on inexpensive commodity hardware.
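As a minimal sketch (directory and file names are hypothetical), data is typically moved in and out of HDFS with the hdfs dfs shell, which mirrors familiar Unix file commands:

  hdfs dfs -mkdir -p /data/raw/logs              # create a directory tree in HDFS
  hdfs dfs -put access.log /data/raw/logs/       # copy a local file into HDFS
  hdfs dfs -ls /data/raw/logs                    # list the stored files
  hdfs dfs -cat /data/raw/logs/access.log        # read the file contents back

Behind these commands, HDFS splits each file into large blocks and replicates them across DataNodes, which is what provides the fault tolerance mentioned above.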
Amazon Simple Storage Service (S3)
S3 is a cloud-based, scalable, distributed storage service from Amazon. It can be utilized as the data layer in big data applications, coupled with the other required components.
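For illustration, a short Python sketch using the AWS boto3 SDK (the bucket name my-datalake is hypothetical) lands a file in S3 and lists what arrived:

  import boto3

  s3 = boto3.client("s3")
  # Upload a local file into the raw zone of a hypothetical bucket
  s3.upload_file("access.log", "my-datalake", "raw/logs/access.log")
  # List objects under that prefix ('Contents' is present only if objects match)
  response = s3.list_objects_v2(Bucket="my-datalake", Prefix="raw/logs/")
  for obj in response.get("Contents", []):
      print(obj["Key"], obj["Size"])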
IBM General Parallel File System (GPFS) / Spectrum Scale
GPFS is a high-performance clustered file system developed by IBM.
5. Data Processing Layer
Hadoop MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system.
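As an illustration of the model, the canonical word count can be written as a mapper and reducer for Hadoop Streaming (a Python sketch; Streaming pipes data through stdin/stdout, and the framework sorts the map output by key before the reduce phase):

  # mapper.py: emit "word<TAB>1" for every word on stdin
  import sys
  for line in sys.stdin:
      for word in line.split():
          print(f"{word}\t1")

  # reducer.py: input arrives sorted by key, so counts for one word are adjacent
  import sys
  current, count = None, 0
  for line in sys.stdin:
      word, n = line.split("\t", 1)
      if word != current:
          if current is not None:
              print(f"{current}\t{count}")
          current, count = word, 0
      count += int(n)
  if current is not None:
      print(f"{current}\t{count}")

The pair would then be launched via the hadoop-streaming JAR, passing the scripts as -mapper and -reducer along with -input and -output HDFS paths.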
Apache Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. Apache Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed on the data. Pig Latin can be extended using UDFs (User-Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
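A small Pig Latin sketch (the file path and field layout are hypothetical) that loads a log, filters it, and counts records per user; each relation below is compiled into MapReduce stages:

  logs    = LOAD '/data/raw/logs/access.log' USING PigStorage('\t')
            AS (username:chararray, url:chararray, bytes:long);
  big     = FILTER logs BY bytes > 1024;
  by_user = GROUP big BY username;
  counts  = FOREACH by_user GENERATE group AS username, COUNT(big) AS hits;
  STORE counts INTO '/data/out/user_hits';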
6. Data Processing Layer
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn it into actionable business insight. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs.
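As a sketch (table and column names are hypothetical), the same kind of analysis reads as declarative SQL in HiveQL; the table definition simply overlays a schema on files already sitting in HDFS:

  -- Schema on read: define a table over existing HDFS files
  CREATE EXTERNAL TABLE access_logs (username STRING, url STRING, bytes BIGINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/raw/logs';

  -- Transparently compiled into MapReduce, Tez or Spark jobs
  SELECT username, COUNT(*) AS hits
  FROM access_logs
  WHERE bytes > 1024
  GROUP BY username;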
Apache HBase
HBase is an open-source NoSQL database that provides real-time read/write access to large datasets with extremely low latency as well as fault tolerance. HBase runs on top of HDFS. HBase provides a strong consistency model and range-based partitioning. Reads, including range-based reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on Cassandra.
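A brief HBase shell sketch (table, row keys and column family are hypothetical) showing the row-key-oriented read/write model, including the range-based reads noted above:

  create 'metrics', 'cf'                                      # table with one column family
  put 'metrics', 'host1|20200101', 'cf:cpu', '73'             # write a cell by row key
  get 'metrics', 'host1|20200101'                             # low-latency point read
  scan 'metrics', {STARTROW => 'host1', STOPROW => 'host2'}   # range-based read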
7. Data Processing Layer
Apache Cassandra
Cassandra is another open-source distributed NoSQL database. It is highly scalable, fault-tolerant and can be used to manage huge volumes of data. Cassandra's consistency model is based on Amazon's Dynamo: it provides eventual consistency. This is very appealing for applications where you want to guarantee the availability of writes. Similarly, Cassandra tends to provide very good write scaling.
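A minimal CQL sketch (keyspace, table and values are hypothetical); the replication factor, and the per-session consistency level in cqlsh, are where the availability-versus-consistency trade-off becomes explicit:

  CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  CREATE TABLE IF NOT EXISTS telemetry.readings (
    sensor_id text,
    ts        timestamp,
    value     double,
    PRIMARY KEY (sensor_id, ts)   -- partition key plus clustering column
  );

  CONSISTENCY ONE;                -- cqlsh session setting: favor availability
  INSERT INTO telemetry.readings (sensor_id, ts, value)
  VALUES ('s-42', '2020-01-01 00:00:00', 21.5);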
Apache Storm
Storm is a distributed real-time computation system for processing large volumes of high-velocity
data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop did for batch processing.
Apache Solr
Apache Solr is an open-source search platform for data stored in HDFS in Hadoop. Solr powers the search and navigation features of many of the world's largest Internet sites, enabling powerful full-text search and near real-time indexing. Apache Solr can be used for rapidly finding tabular, text, geo-location or sensor data that is stored in Hadoop.
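For illustration (the core name and fields are hypothetical), documents are indexed and searched over Solr's HTTP API:

  # Index a document into a hypothetical 'logs' core
  curl -X POST 'http://localhost:8983/solr/logs/update?commit=true' \
       -H 'Content-Type: application/json' \
       -d '[{"id": "1", "msg": "disk failure on node-7"}]'

  # Full-text query, returning the top 10 matches
  curl 'http://localhost:8983/solr/logs/select?q=msg:failure&rows=10'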
8. Data Processing Layer
Apache Spark
Apache Spark is an open source cluster computing framework for large-scale data processing. Studies have shown that Spark can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, for program execution. It provides in-memory computation for increased speed over MapReduce-based data processing. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS), and can also process structured data from Hive and streaming data from HDFS, Flume, Kafka, Twitter and other sources.
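A short PySpark sketch of the same word count shown earlier (the input path is hypothetical); because intermediate results can stay in memory as RDDs, iterative and interactive jobs avoid much of MapReduce's per-stage disk I/O:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("wordcount").getOrCreate()
  lines = spark.sparkContext.textFile("hdfs:///data/raw/logs/access.log")
  counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                 .map(lambda word: (word, 1))          # pair each word with 1
                 .reduceByKey(lambda a, b: a + b))     # sum counts per word
  counts.saveAsTextFile("hdfs:///data/out/wordcounts")
  spark.stop()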
Apache Mahout
Apache Mahout is a library of scalable machine-learning algorithms implemented on top of Apache Hadoop using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Mahout provides the tools and algorithms to automatically find meaningful patterns in big datasets stored in HDFS.
9. Data Ingestion Layer
Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and machine data, geo-location data and social media) into HDFS. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with configurable reliability mechanisms for failover and recovery.
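A minimal Flume agent configuration sketch (agent, source, channel and sink names, plus all paths, are hypothetical) that tails an application log into HDFS through a memory channel:

  agent1.sources  = src1
  agent1.channels = ch1
  agent1.sinks    = sink1

  agent1.sources.src1.type = exec                   # run a command as the source
  agent1.sources.src1.command = tail -F /var/log/app/access.log
  agent1.sources.src1.channels = ch1

  agent1.channels.ch1.type = memory                 # buffer events in memory
  agent1.channels.ch1.capacity = 10000

  agent1.sinks.sink1.type = hdfs                    # deliver events into HDFS
  agent1.sinks.sink1.hdfs.path = /data/raw/logs/%Y-%m-%d
  agent1.sinks.sink1.hdfs.useLocalTimeStamp = true  # needed for date escapes
  agent1.sinks.sink1.channel = ch1

An agent with this configuration would be started with the flume-ng command, naming agent1 and pointing at the properties file.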
Apache Kafka
Kafka is a high-throughput distributed messaging system. Kafka maintains feeds of messages in categories called topics. Producers are processes that publish messages to a Kafka topic. Consumers are processes that subscribe to topics and process the feed of published messages. Kafka is run as a cluster comprising one or more servers, each of which is called a broker.
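An illustrative producer/consumer pair in Python, assuming the kafka-python client library and a hypothetical broker at localhost:9092 with a topic named clickstream:

  from kafka import KafkaProducer, KafkaConsumer

  # Producer: publish a message to a topic
  producer = KafkaProducer(bootstrap_servers="localhost:9092")
  producer.send("clickstream", b'{"user": "u1", "url": "/home"}')
  producer.flush()

  # Consumer: subscribe to the topic and process the feed of published messages
  consumer = KafkaConsumer("clickstream",
                           bootstrap_servers="localhost:9092",
                           group_id="analytics")
  for message in consumer:
      print(message.offset, message.value)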
Apache Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS, transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
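A sketch of a typical Sqoop round trip (the JDBC connection string, table names and HDFS paths are hypothetical):

  # Import a relational table into HDFS using 4 parallel map tasks
  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl \
    --table orders \
    --target-dir /data/raw/orders \
    -m 4

  # Export processed results back into an RDBMS table
  sqoop export \
    --connect jdbc:mysql://dbhost/sales \
    --username etl \
    --table order_summaries \
    --export-dir /data/out/order_summaries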
10. Data Presentation Layer
Kibana
Kibana is an analytics and visualization plugin that works with Elasticsearch. It provides real-time summaries and charting of streaming data. Its visualization capabilities allow users to create different charts, plots and maps of large volumes of data.
11. Operations & Scheduling Layer
Ambari
Ambari is an open framework that helps in provisioning, managing and monitoring Apache Hadoop clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an intuitive web interface that allows one to easily provision, configure and test all the Hadoop services and core components. It also comes with the powerful Ambari Blueprints API, which can be utilized for automating cluster installations without any user intervention.
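As a hedged sketch of that Blueprints flow (host, credentials, file and cluster names are hypothetical): a blueprint describing services and host groups is registered first, and a cluster-creation request then maps real hosts onto it:

  # Register a blueprint with the Ambari server
  curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
       -d @my_blueprint.json http://ambari-host:8080/api/v1/blueprints/my-blueprint

  # Instantiate a cluster from the registered blueprint
  curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
       -d @my_cluster.json http://ambari-host:8080/api/v1/clusters/my-cluster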
Apache Oozie
Apache Oozie provides operational service capabilities for a Hadoop cluster, specifically around job scheduling within the cluster. Oozie is a Java-based web application that is primarily used to schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit of work. It can be integrated with the Hadoop stack and supports Hadoop jobs for various Apache tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
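For illustration (the NameNode, Oozie server and workflow path are hypothetical), a workflow defined in a workflow.xml stored on HDFS is submitted through the Oozie command-line client:

  # job.properties points at the workflow application directory in HDFS
  nameNode=hdfs://nn-host:8020
  oozie.wf.application.path=${nameNode}/user/etl/workflows/daily-load

  # Submit and start the job against the Oozie server
  oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run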
Apache ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed configuration service, a synchronization service and a naming registry for distributed systems, which can use ZooKeeper to store and mediate updates to important configuration information.
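A small sketch using the zkCli.sh shell bundled with ZooKeeper (znode paths and values are hypothetical), showing the hierarchical namespace that distributed components read and watch for configuration updates:

  create /app1 ""                        # parent znode for an application
  create /app1/config "batch.size=500"   # store a configuration value
  get /app1/config                       # any node in the cluster can read it
  set /app1/config "batch.size=1000"     # update it; watchers are notified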
12. About Me – Khalid Imran
A tester by passion, I've spent the past 16+ years testing disparate systems, learning new domains, developing innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and embracing emerging tools and technologies. The breadth and depth of my experience cut across functional testing, non-functional testing, manual and automation testing, test and project execution methodologies, licensed and open-stack tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery management, client engagement, on-site/off-shore team dynamics and more.

I am currently heading the 1400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing Centre of Excellence (TCoE), lead a team of architects and specialists, and assist with deliveries across the organization, pre-sales and business development, solutioning and consultancy, and the training and process improvement group. I hold multiple certifications, namely CSQA, CSM and CPISI.
I welcome any questions or feedback you may have on this presentation.