By: Khalid Imran
Big Data : Technology Stack
Agenda
▪ Big Data Stack : In a Nutshell
▪ Data Layer
▪ Data Processing Layer
▪ Data Ingestion Layer
▪ Data Presentation Layer
▪ Operations & Scheduling Layer
▪ Security & Governance
Big Data Technology Stack : In a nutshell
Data Layer
Hadoop Distributed File System (HDFS)
HDFS is a scalable, fault-tolerant, Java-based distributed file system that is used for storing
large volumes of data on inexpensive commodity hardware.
Amazon Simple Storage Service (S3)
S3 is a cloud-based, scalable object storage service from Amazon. It can be
utilized as the data layer in big data applications, coupled with other required
components.
IBM General Parallel File System (GPFS) / Spectrum Scale
GPFS is a high-performance clustered file system developed by IBM.

Data Processing Layer
Hadoop MapReduce
Hadoop MapReduce is a software framework for distributed processing of large data sets on
compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The
framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A
MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps,
which are then input to the reduce tasks. Typically both the input and the output of the job are
stored in a file-system.
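The map, sort and reduce phases described above can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop's API; the function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_sort(mapped_pairs):
    """Group values by key and sort the keys, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce task: combine all values collected for one key."""
    return key, sum(values)

def run_job(chunks):
    # Each chunk is independent, so on a cluster the map tasks could run in parallel.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]

word_counts = run_job(["big data stack", "data layer", "big data"])
print(word_counts)
```

Because the map function sees only its own chunk, adding more chunks (and more machines) does not change the logic, which is the property the framework relies on for parallelism.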
Apache Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. Apache Pig
allows Apache Hadoop users to write complex MapReduce transformations using a simple
scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can
be executed on the data. Pig Latin can be extended with user-defined functions (UDFs), which the
user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
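As a sketch of the UDF mechanism, the Python function below could be registered with Pig and called from Pig Latin. The function name `to_upper`, the file name `my_udfs.py` and the relation and field names in the closing comment are hypothetical. The `outputSchema` decorator is supplied by Pig at run time (through `pig_util` for the `streaming_python` engine); the fallback definition is an assumption added only so that the file also runs outside Pig.

```python
try:
    # Available when the script runs under Pig's streaming_python engine.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the function can be tried without Pig installed (assumption for this sketch).
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator

@outputSchema("upper_name:chararray")
def to_upper(name):
    """Return the upper-cased value, passing nulls through unchanged."""
    if name is None:
        return None
    return name.upper()

# Hypothetical Pig Latin usage:
#   REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;
#   names = FOREACH users GENERATE my_udfs.to_upper(name);
print(to_upper("hadoop"))
```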
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn
it into actionable business insight. Apache Hive supports analysis of large datasets stored in
Hadoop's HDFS and compatible file systems such as Amazon S3. It provides an SQL-like
language called HiveQL with schema on read and transparently converts queries to MapReduce,
Apache Tez or Spark jobs.
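Schema on read means the raw files are stored untouched and a schema is applied only when a query runs. The sketch below imitates that idea in plain Python; it is not HiveQL or a Hive API, and the sample rows, the `SCHEMA` column list and the `query` helper are assumptions made for illustration.

```python
# Raw lines as they might sit in a file on HDFS; nothing is validated at load time.
RAW_LINES = [
    "1,alice,34",
    "2,bob,not_a_number",
    "3,carol,29",
]

# The schema lives apart from the data and is applied only when reading.
SCHEMA = [("id", int), ("name", str), ("age", int)]

def read_with_schema(lines, schema):
    """Parse each line against the schema; values that do not fit become None."""
    for line in lines:
        row = {}
        for (column, cast), raw in zip(schema, line.split(",")):
            try:
                row[column] = cast(raw)
            except ValueError:
                row[column] = None
        yield row

def query(lines, schema, min_age):
    """Rough equivalent of: SELECT name FROM people WHERE age >= min_age."""
    return [row["name"] for row in read_with_schema(lines, schema)
            if row["age"] is not None and row["age"] >= min_age]

older_than_30 = query(RAW_LINES, SCHEMA, 30)
print(older_than_30)
```

The malformed age in the second row does not block loading; it only surfaces as a null when that column is read.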
Apache HBase
HBase is an open-source NoSQL database that provides real-time read/write access to large
datasets with extremely low latency as well as fault tolerance. HBase runs on top of HDFS. HBase
provides a strong consistency model and range-based partitioning. Reads, including range-based
reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on
Cassandra.
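Range-based partitioning keeps rows sorted by key and assigns contiguous key ranges to different servers, which is why range reads are efficient. The following plain-Python sketch illustrates the idea only; it does not use the HBase client API, and the split points, region names and row keys are invented for the example.

```python
import bisect

# Hypothetical split points: region-0 holds keys below "g", region-1 holds "g" up to "p",
# and region-2 holds everything from "p" onward.
SPLIT_POINTS = ["g", "p"]
REGIONS = ["region-0", "region-1", "region-2"]

def region_for(row_key):
    """Locate the region whose key range contains row_key."""
    return REGIONS[bisect.bisect_right(SPLIT_POINTS, row_key)]

def range_scan(sorted_keys, start, stop):
    """Return keys in [start, stop) using binary search on the sorted key list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

row_keys = sorted(["apple", "grape", "kiwi", "mango", "peach", "plum"])
print(region_for("kiwi"))
print(range_scan(row_keys, "g", "n"))
```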
Apache Cassandra
Cassandra is another open-source distributed NoSQL database. It is highly scalable, fault tolerant
and can be used to manage huge volumes of data. Cassandra's consistency model is based on
Amazon's Dynamo: it provides eventual consistency. This is very appealing for some applications
where you want to guarantee the availability of writes. Similarly, Cassandra tends to provide very
good write scaling.
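In Dynamo-style stores such as Cassandra, each write goes to N replicas and the client chooses how many replicas must acknowledge a write (W) or answer a read (R). A common rule of thumb is that a read is guaranteed to see the latest acknowledged write when R + W > N, and is only eventually consistent otherwise. The helper below just encodes that arithmetic; the function name and sample values are illustrative.

```python
def overlaps(n_replicas, write_acks, read_acks):
    """True when every read set must intersect every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas

# With 3 replicas, writing to 1 and reading from 1 favours availability: eventual consistency.
print(overlaps(3, 1, 1))
# Quorum writes and quorum reads (2 of 3) always share at least one replica.
print(overlaps(3, 2, 2))
```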
Apache Storm
Storm is a distributed real-time computation system for processing large volumes of high-velocity
data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time
processing what Hadoop did for batch processing.
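Storm topologies are built from spouts (stream sources) and bolts (processing steps). The generator pipeline below mimics that shape in plain Python to show how tuples are handled one at a time as they arrive; it does not use Storm's API, and the sensor readings and threshold are made up for the example.

```python
import itertools

def sensor_spout():
    """Spout stand-in: an unbounded stream of (sensor_id, reading) tuples."""
    for i in itertools.count():
        yield ("sensor-%d" % (i % 3), 20 + (i * 7) % 15)

def threshold_bolt(stream, limit):
    """Bolt stand-in: pass along only readings above the limit."""
    for sensor_id, reading in stream:
        if reading > limit:
            yield (sensor_id, reading)

# The stream never ends, so take a bounded slice for demonstration.
alerts = list(itertools.islice(threshold_bolt(sensor_spout(), 30), 3))
print(alerts)
```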
Apache Solr
Apache Solr is an open-source platform for searching data stored in HDFS. Solr
powers the search and navigation features of many of the world’s largest Internet sites, enabling
powerful full-text search and near real-time indexing. Apache Solr can be used for rapidly finding
tabular, text, geo-location or sensor data that is stored in Hadoop.
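Full-text search engines such as Solr (which is built on Apache Lucene) rely on an inverted index that maps each term to the documents containing it. The sketch below builds a toy index in plain Python; it is not Solr's API, and the documents and helper names are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "sensor data stored in Hadoop",
    2: "geo-location data from devices",
    3: "sensor logs and geo-location data",
}
index = build_index(docs)
print(sorted(search(index, "sensor data")))
```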
Apache Spark
Apache Spark is an open source cluster computing framework for large-scale data processing.
The Spark project has claimed speeds of up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk. It keeps intermediate results in memory, which avoids much of the disk I/O
that MapReduce incurs between processing steps. It can run on top of an existing Hadoop cluster
and access the Hadoop data store (HDFS), and it can also process structured data from Hive and
streaming data from HDFS, Flume, Kafka, Twitter and other sources.
Apache Mahout
Apache Mahout is a library of scalable machine-learning algorithms that can be implemented on
top of Apache Hadoop and it utilizes the MapReduce paradigm. Machine learning is a discipline of
artificial intelligence focused on enabling machines to learn without being explicitly programmed,
and it is commonly used to improve future performance based on previous outcomes. Mahout
provides the tools and algorithms to automatically find meaningful patterns in big data sets
stored in HDFS.

Data Ingestion Layer
Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and
machine data, geo-location data and social media) into HDFS. It has a simple and flexible
architecture based on streaming data flows, and it is robust and fault tolerant, with
configurable reliability mechanisms for failover and recovery.
Apache Kafka
Kafka is a high throughput distributed messaging system. Kafka maintains feeds of messages
in categories called topics. Producers are processes that publish messages to a Kafka topic.
Consumers are processes that subscribe to topics and process the feed of published
messages. Kafka runs as a cluster of one or more servers, each of which is called
a broker.
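The topic, producer and consumer roles can be illustrated with a small in-memory model. This is not the Kafka client API: the `Broker` class, its methods and the topic name are hypothetical, and a real cluster adds partitions, replication and persistence.

```python
from collections import defaultdict

class Broker:
    """Toy broker: each topic is an append-only list; each consumer tracks its own offset."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # keyed by (consumer, topic)

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Consumer side: return the messages this consumer has not yet seen."""
        start = self.offsets[(consumer, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(consumer, topic)] = start + len(messages)
        return messages

broker = Broker()
broker.publish("page-views", "user1:/home")
broker.publish("page-views", "user2:/cart")
first_poll = broker.poll("analytics", "page-views")
broker.publish("page-views", "user1:/checkout")
second_poll = broker.poll("analytics", "page-views")
other_consumer = broker.poll("audit", "page-views")
print(first_poll, second_poll, other_consumer)
```

Messages stay in the log after being read, so a second consumer that subscribes later still receives the full feed.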
Apache Sqoop
Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or
mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS,
transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
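The import step amounts to reading rows from a relational table and writing them out as delimited records. The sketch below imitates that with Python's built-in `sqlite3` and `csv` modules; it does not invoke Sqoop, an in-memory SQLite database stands in for the RDBMS, a string buffer stands in for a file in HDFS, and the table and column names are invented.

```python
import csv
import io
import sqlite3

def import_table(conn, table):
    """Read every row of a table and return it as comma-delimited text."""
    # The table name is interpolated directly, so pass only trusted names in this sketch.
    cursor = conn.execute("SELECT * FROM %s ORDER BY 1" % table)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, "disk", 4), (2, "rack", 1)])
exported = import_table(conn, "orders")
print(exported)
```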
Data Presentation Layer
Kibana
Kibana is an analytics and visualization plugin that works with Elasticsearch. It provides real-time
summary and charting of streaming data. Its visualization capabilities allow users to
create different charts, plots and maps from large volumes of data.
Operations & Scheduling Layer
Apache Ambari
Ambari is an open framework that helps with provisioning, managing and monitoring Apache
Hadoop clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an
intuitive web interface that allows one to easily provision, configure and test all the Hadoop
services and core components. It also comes with the powerful Ambari Blueprints API that can be
utilized for automating cluster installations without any user intervention.
Apache Oozie
Apache Oozie provides operational service capabilities for a Hadoop cluster, specifically around
job scheduling within the cluster. Oozie is a Java based web application that is primarily used to
schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit
of work. It can be integrated with the Hadoop stack, and supports Hadoop jobs for various Apache
tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop.
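Combining several jobs into one logical unit of work means the steps run in order and the workflow as a whole fails if any step fails. The sketch below shows that control flow in plain Python; Oozie itself defines workflows in XML, and the `run_workflow` helper and the step names here are hypothetical.

```python
def run_workflow(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (succeeded, names_of_completed_steps).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            return False, completed
        completed.append(name)
    return True, completed

def failing_step():
    raise RuntimeError("simulated job failure")

ok_run = run_workflow([("sqoop-import", lambda: None), ("hive-query", lambda: None)])
bad_run = run_workflow([("sqoop-import", lambda: None), ("pig-script", failing_step),
                        ("hive-query", lambda: None)])
print(ok_run, bad_run)
```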
Apache ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed
configuration service, a synchronization service and a naming registry. Distributed systems
can use ZooKeeper to store and mediate updates to important configuration information.
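A configuration service of this kind can be pictured as a small tree of named nodes that clients read and watch for changes. The class below is an in-memory illustration only, not the ZooKeeper client API; the `ConfigRegistry` name, its methods and the paths are assumptions for the example.

```python
class ConfigRegistry:
    """Toy registry: path -> value, with callbacks fired when a watched path changes."""

    def __init__(self):
        self.nodes = {}
        self.watchers = {}

    def watch(self, path, callback):
        """Register a callback to be invoked with (path, new_value) on each update."""
        self.watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        """Store a value and notify everyone watching this path."""
        self.nodes[path] = value
        for callback in self.watchers.get(path, []):
            callback(path, value)

    def get(self, path):
        """Return the current value, or None if the path has never been set."""
        return self.nodes.get(path)

seen = []
registry = ConfigRegistry()
registry.watch("/config/batch_size", lambda path, value: seen.append((path, value)))
registry.set("/config/batch_size", 500)
registry.set("/config/batch_size", 1000)
print(registry.get("/config/batch_size"), seen)
```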
About Me – Khalid Imran
A tester by passion, I’ve spent the past 16+ years testing disparate systems, learning new domains, developing
innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and
embracing emerging tools & technologies. The breadth and depth of my experience cuts across functional testing,
non-functional testing, manual, automation, test and project execution methodologies, licensed and open stack
tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery management,
client engagement, on-site, off-shore team dynamics and more.
I currently head the 1400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing
Centre of Excellence (TCoE), lead a team of architects and specialists and assist in deliveries across the organization,
pre-sales and business development, solutioning and consultancy, training and process improvement group. I hold
multiple certifications, namely CSQA, CSM and CPISI.
I welcome any questions or feedback you may have on this presentation.

Big Data Technology Stack : Nutshell

  • 1. By: Khalid Imran Big Data : Technology Stack
  • 2. Agenda ▪ Big Data Stack : In a Nutshell ▪ Data Layer ▪ Data Processing Layer ▪ Data Ingestion Layer ▪ Data Presentation Layer ▪ Operations & Scheduling Layer ▪ Security & Governance
  • 3. Big Data Technology Stack : In a nutshell
  • 4. Data Layer. Hadoop Distributed File System (HDFS): HDFS is a scalable, fault-tolerant, Java-based distributed file system used for storing large volumes of data on inexpensive commodity hardware. Amazon Simple Storage Service (S3): S3 is a scalable, cloud-based distributed object storage service from Amazon. It can be used as the data layer in big data applications, coupled with other required components. IBM General Parallel File System (GPFS) / Spectrum Scale: GPFS is a high-performance clustered file system developed by IBM.
  • 5. Data Processing Layer. Hadoop MapReduce: Hadoop MapReduce is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. Apache Pig: Pig is a high-level platform for creating MapReduce programs used with Hadoop. It lets Hadoop users write complex MapReduce transformations in a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed on the data. Pig Latin can be extended with UDFs (user-defined functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
  • 6. Data Processing Layer. Apache Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis. Hive is used to explore, structure and analyze data, then turn it into actionable business insight. It supports analysis of large datasets stored in Hadoop's HDFS and in compatible storage such as Amazon S3. It provides an SQL-like language called HiveQL with schema on read, and transparently converts queries into MapReduce, Apache Tez or Spark jobs. Apache HBase: HBase is an open-source NoSQL database that provides real-time read/write access to large datasets with very low latency as well as fault tolerance. HBase runs on top of HDFS. It provides a strong consistency model and range-based partitioning. Reads, including range-based reads, tend to scale much better on HBase, whereas writes do not scale as well as they do on Cassandra.
  • 7. Data Processing Layer. Apache Cassandra: Cassandra is another open-source distributed NoSQL database. It is highly scalable and fault tolerant, and can be used to manage huge volumes of data. Cassandra's consistency model is based on Amazon's Dynamo: it provides eventual consistency. This is very appealing for applications that need to guarantee the availability of writes. Cassandra also tends to provide very good write scaling. Apache Storm: Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Apache Solr: Apache Solr is an open source platform for searching data stored in HDFS. Solr powers the search and navigation features of many of the world’s largest Internet sites, enabling powerful full-text search and near real-time indexing. It can be used to rapidly find tabular, text, geo-location or sensor data stored in Hadoop.
  • 8. Data Processing Layer. Apache Spark: Apache Spark is an open source cluster computing framework for large-scale data processing. Spark has been reported to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Its in-memory computation is the main source of this speed advantage over MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS), and it can also process structured data from Hive and streaming data from HDFS, Flume, Kafka, Twitter and other sources. Apache Mahout: Apache Mahout is a library of scalable machine-learning algorithms implemented on top of Apache Hadoop using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Mahout provides the tools and algorithms to automatically find meaningful patterns in big data sets stored in HDFS.
  • 9. Data Ingestion Layer. Apache Flume: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (e.g. application logs, sensor and machine data, geo-location data and social media) into HDFS. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with configurable reliability mechanisms for failover and recovery. Apache Kafka: Kafka is a high-throughput distributed messaging system. Kafka maintains feeds of messages in categories called topics. Producers are processes that publish messages to a Kafka topic. Consumers are processes that subscribe to topics and process the feed of published messages. Kafka runs as a cluster of one or more servers, each of which is called a broker. Apache Sqoop: Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from an RDBMS or a mainframe into HDFS, transform the data using Hadoop MapReduce, and then export the data back into an RDBMS.
  • 10. Data Presentation Layer. Kibana: Kibana is an analytics and visualization plugin that works with Elasticsearch. It provides real-time summaries and charting of streaming data. Its visualization capabilities allow users to create different charts, plots and maps from large volumes of data.
  • 11. Operations & Scheduling Layer. Ambari: Ambari is an open framework for provisioning, managing and monitoring Apache Hadoop clusters. It simplifies the deployment and maintenance of hosts. Ambari also includes an intuitive web interface for provisioning, configuring and testing all the Hadoop services and core components. It also comes with the Ambari Blueprints API, which can be used to automate cluster installations without any user intervention. Apache Oozie: Apache Oozie provides operational service capabilities for a Hadoop cluster, specifically around job scheduling within the cluster. Oozie is a Java-based web application that is primarily used to schedule Apache Hadoop jobs. Oozie can combine multiple jobs sequentially into one logical unit of work. It integrates with the Hadoop stack and supports Hadoop jobs for various Apache tools such as MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Apache ZooKeeper: Apache ZooKeeper provides operational services for a Hadoop cluster. It provides a distributed configuration service, a synchronization service and a naming registry. Distributed systems can use ZooKeeper to store and mediate updates to important configuration information.
  • 12. About Me – Khalid Imran. A tester by passion, I’ve spent the past 16+ years testing disparate systems, learning new domains, developing innovative solutions, designing test strategies, challenging conventional methods, proving new techniques and embracing emerging tools & technologies. The breadth and depth of my experience cuts across functional testing, non-functional testing, manual, automation, test and project execution methodologies, licensed and open stack tools, platforms, devices, programming languages, custom-built test harnesses and utilities, delivery management, client engagement, on-site, off-shore team dynamics and more. I currently head the 1400+ strong QA testing practice at Cybage as a QA Evangelist. I manage the Testing Centre of Excellence (TCoE), lead a team of architects and specialists and assist in deliveries across the organization, pre-sales and business development, solutioning and consultancy, training and process improvement group. I hold multiple certifications, namely CSQA, CSM and CPISI. I welcome any questions or feedback you may have on this presentation.
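
The MapReduce flow described on slide 5 (split the input into chunks, map each chunk independently, sort the map outputs, then reduce) can be illustrated with a small word count. This is a single-process sketch in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle_and_sort`, `reduce_phase` and `run_job` are hypothetical, and a real cluster would run the map and reduce tasks in parallel on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Group map outputs by key and sort the keys, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: combine all values seen for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run the whole job in one process (illustrative only; chunks are
    independent, so a cluster could map them in parallel)."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(key, values)
                for key, values in shuffle_and_sort(mapped))


# Three independent input chunks, standing in for file splits.
chunks = ["big data stack", "data layer", "big data processing layer"]
word_counts = run_job(chunks)
```

Because each chunk is mapped without reference to any other chunk, the map step is the part that parallelizes trivially; the grouping step is where data from different chunks meets.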
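
Slide 6 mentions that HBase uses range-based partitioning. One way to picture this is a sorted list of split keys that divides the row-key space into contiguous regions, so a range read only touches the regions that overlap it. The sketch below is an assumption-laden illustration in plain Python: `split_keys`, `region_for` and `regions_for_range` are hypothetical names, and the split points are made up rather than taken from any real table.

```python
from bisect import bisect_right

# Hypothetical split points. Region 0 holds keys below "g", region 1 holds
# keys from "g" up to (not including) "n", and so on.
split_keys = ["g", "n", "t"]


def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(split_keys, row_key)


def regions_for_range(start_key, end_key):
    """Return the indices of every region a scan over
    [start_key, end_key] would need to visit."""
    return list(range(region_for(start_key), region_for(end_key) + 1))


single_region_scan = regions_for_range("h", "m")
multi_region_scan = regions_for_range("e", "p")
```

Since neighbouring keys live in the same or adjacent regions, a narrow range scan stays on a small number of servers, which is consistent with the slide's remark that range-based reads scale well.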
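
The Kafka vocabulary on slide 9 (topics, producers, consumers, brokers) can be made concrete with a toy in-memory model. This is not the Kafka client API: the `Broker` class and its `publish` and `poll` methods are hypothetical, and the sketch assumes a single broker that tracks one read position per consumer and topic.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-memory stand-in for a single broker."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only message feed
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index

    def publish(self, topic, message):
        """Producer side: append a message to the topic's feed."""
        self._topics[topic].append(message)

    def poll(self, consumer, topic):
        """Consumer side: return messages this consumer has not yet seen
        and advance its read position."""
        start = self._offsets[(consumer, topic)]
        messages = self._topics[topic][start:]
        self._offsets[(consumer, topic)] = start + len(messages)
        return messages


broker = Broker()
broker.publish("app-logs", "login ok")
broker.publish("app-logs", "disk full")
broker.publish("sensor-data", "temp=21")

first_poll = broker.poll("alerting", "app-logs")
second_poll = broker.poll("alerting", "app-logs")
archiver_poll = broker.poll("archiver", "app-logs")
```

Keeping the read position per consumer rather than deleting messages on delivery is what lets two independent consumers (here `alerting` and `archiver`) each process the full feed of the same topic.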
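
Slide 11 says Oozie can combine multiple jobs sequentially into one logical unit of work. A minimal sketch of that idea follows, assuming each job is just a Python callable and the workflow stops at the first failure. The `run_workflow` function, the action names and the status strings are illustrative choices; real Oozie workflows are defined in XML and submitted to the Oozie server, which this sketch does not attempt to reproduce.

```python
def run_workflow(actions):
    """Run (name, callable) actions in order as one unit of work.

    Stops at the first action that raises and reports which actions
    completed before the failure.
    """
    completed = []
    for name, action in actions:
        try:
            action()
        except Exception as exc:
            return {"status": "KILLED", "completed": completed,
                    "failed": name, "error": str(exc)}
        completed.append(name)
    return {"status": "SUCCEEDED", "completed": completed,
            "failed": None, "error": None}


def failing_step():
    """Stand-in for a job that fails partway through a workflow."""
    raise RuntimeError("table not found")


events = []
good_workflow = [
    ("sqoop-import", lambda: events.append("imported")),
    ("hive-aggregate", lambda: events.append("aggregated")),
    ("sqoop-export", lambda: events.append("exported")),
]
good_result = run_workflow(good_workflow)

bad_workflow = [
    ("sqoop-import", lambda: events.append("imported again")),
    ("hive-aggregate", failing_step),
    ("sqoop-export", lambda: events.append("never runs")),
]
bad_result = run_workflow(bad_workflow)
```

Treating the chain as a single unit means a downstream job such as the export never runs against the output of a failed upstream step.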