Real-time Analytics
Kafka, Apache Samza, Hadoop Yarn, Druid, Tranquility and Metabase.
Leandro Totino Pereira
DevOps/Cloud Engineer
Agenda
 What is Analytics?
 How can we get pattern data?
 Ad-hoc solution
 ETL’s types
 Real-Time Streaming
 What is Kafka?
 Apache Hadoop YARN
 Druid
 Tranquility
 Business intelligence web application
What is analytics?
Analytics is the discovery, interpretation, and communication of meaningful patterns in data. It can be used in scenarios such as:
Data-driven decisions
Forecast future results
Reporting
Machine Learning
Metrics/Monitoring
Optimize data
How can we get pattern data?
In computing, extract, transform, load (ETL) refers to a process in database usage, and especially in data warehousing, in which data is copied from source systems, reshaped, and loaded into a target store. Alternatively, you can use an interactive ad-hoc solution, where a single tool runs the ETL across multiple data sources on demand.
Ad-hoc solution
Presto – multiple database support: MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.
Apache Drill – multiple NoSQL database support: MongoDB, HBase, HDFS, S3, etc.
Advantages:
• No need to build a complex dedicated infrastructure for analytics
• No need to extract information into other systems
Disadvantages:
• All ETL steps happen at once, at query time
• Data cleansing is complex
• Queries pull information straight from production servers
A minimal Presto query sketch follows.
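To make this concrete, here is a minimal sketch of an ad-hoc federated query through Presto's JDBC driver. The coordinator host, catalogs, and table names are hypothetical, and the com.facebook.presto:presto-jdbc driver must be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoAdHocQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator; catalog "hive" exposes HDFS data, "mysql" a MySQL server
        String url = "jdbc:presto://presto-coordinator:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             // One SQL statement can join tables living in different data sources
             ResultSet rs = stmt.executeQuery(
                 "SELECT u.country, count(*) AS visits " +
                 "FROM hive.web.page_views v " +
                 "JOIN mysql.crm.users u ON v.user_id = u.id " +
                 "GROUP BY u.country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + " " + rs.getLong("visits"));
            }
        }
    }
}
```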
ETL’s types
Batch mode extracts data with copy tools, run as scheduled jobs, to populate a data warehouse such as HDFS, and business analytics are then built on top. Real-time streaming ETL, on the other hand, processes the data continuously, in real time.
Conclusion
In my perspective, batch mode only makes sense for legacy systems that cannot migrate to real-time streaming, or for very small ones.
Real-Time Streaming
Real-Time Streaming topology
You can extract data with a tool such as Apache Flume, or send it from your applications directly. Flume can collect data from various types of sources and output it to sinks such as Kafka and HDFS.
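A minimal sketch of such a Flume agent configuration, assuming a tailed log file as the source and a hypothetical Kafka topic named logs as the sink:

```properties
# Hypothetical agent "a1": tail a log file and publish each line to Kafka
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = logs
a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sinks.k1.channel = c1
```

Started with something like: flume-ng agent -n a1 -f flume-kafka.conf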
What is Kafka?
Kafka is a distributed messaging system providing fast, highly scalable, and redundant messaging through a pub-sub model.
The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers.
A topic is the container with which messages are associated. It is divided into a number of partitions.
Each node in the cluster is called a Kafka broker.
Consumers are responsible for reading messages from a topic.
Producers are responsible for publishing data/messages to a topic.
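For illustration, a minimal producer and consumer sketch with the standard Kafka Java client; the broker address and topic name are hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaQuickstart {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka1:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Producer: publish one message to the hypothetical "events" topic
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("events", "host1", "I'm Leandro and I'm a system engineer"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "kafka1:9092");
        c.put("group.id", "demo-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Consumer: subscribe to the topic and poll for messages
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.key() + " -> " + r.value());
            }
        }
    }
}
```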
Apache Hadoop YARN
(Yet Another Resource Negotiator)
Client
Submits an application/job.
Node Manager
Provides computational resources and manages application containers.
Application Master
Monitors the containers and their resource consumption; negotiates appropriate resources for containers.
Container
Runs the application processes spawned by the application master.
Resource Manager
Tracks Node Managers and the available resources in the cluster; monitors application masters.
What is Samza?
Apache Samza is a distributed stream processing framework, run as an application on YARN.
It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance,
processor isolation, security, and resource management. It is commonly used to transform,
clean up, and normalize data before saving it to the data warehouse.
You can transform/clean up data in one job and forward it to the next through Kafka topics. For example, if the message “I’m Leandro and I’m a system engineer” reaches Samza job1, it can partially normalize it to “name: Leandro, and I’m a system engineer”, and Samza job2 can then transform it to “name: Leandro, job: system engineer”.
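A minimal sketch of such a normalization step using Samza's low-level StreamTask API; the output topic name and the naive string parsing are hypothetical illustrations:

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class NormalizeTask implements StreamTask {
    // Hypothetical output topic consumed by the next job in the pipeline
    private static final SystemStream OUTPUT = new SystemStream("kafka", "normalized-events");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String raw = (String) envelope.getMessage();
        // Naive normalization: "I'm Leandro and I'm a system engineer"
        // becomes "name: Leandro, job: system engineer"
        String normalized = raw
            .replaceFirst("^I'm ", "name: ")
            .replaceFirst(" and I'm a ", ", job: ");
        collector.send(new OutgoingMessageEnvelope(OUTPUT, normalized));
    }
}
```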
Samza Hadoop Integration
In the YARN web UI we can see a lot of information about the cluster, such as used and available resources, the number of jobs and their status, and details about application masters and containers.
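The same information can be pulled from the command line, for example:

```shell
# List applications currently running on the cluster (id, name, state, tracking URL)
yarn application -list
# Fetch the aggregated container logs of a (hypothetical) finished application
yarn logs -applicationId application_1526000000000_0001
```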
Samza work-Flow
You start a job on the YARN grid by running the Samza script run-job.sh with a specific configuration file for each job. In the config file you must set the job name, the location of the YARN package file, the task class whose process method will be invoked, the Kafka input topic name, etc.
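A minimal config sketch for such a job; all paths, topics, and class names are hypothetical, and the keys follow the classic Samza properties format:

```properties
# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=normalize-events

# YARN package containing the job's code
yarn.package.path=hdfs://namenode:8020/samza/normalize-events.tar.gz

# Task class whose process() method is invoked per message, and its input topic
task.class=com.example.NormalizeTask
task.inputs=kafka.raw-events

# Kafka system
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=zk1:2181
systems.kafka.producer.bootstrap.servers=kafka1:9092
```

Submitted with something like:

```shell
bin/run-job.sh \
  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
  --config-path=file://$PWD/config/normalize-events.properties
```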
Druid – Real-time and historical data warehouse
Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of
data. Druid is most commonly used to power user-facing analytic applications.
Sub-second OLAP Queries
Druid’s unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations.
Real-time Streaming Ingestion
Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high dimensional, high volume data sets. Explore events immediately after they occur.
Power Analytic Applications
Druid has numerous features built for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users.
Cost Effective
Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs.
Highly Available
Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss.
Scalable
Existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second.
Source: http://druid.io/druid.htm
Druid architecture
Druid Components
Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve
queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and
serve queries on segments.
Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering
queries and gathering and merging results. Broker nodes know what segments live where.
Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new
segments, drop old segments, and move segments to load balance.
Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is
common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing
segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also
lossless; data remains queryable throughout the entire process.
Querying Druid data
Requests and responses are in JSON format. In this example we are getting values of the metrics field from host compute-3.
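A sketch of what such a native query could look like when POSTed to a broker node; the datasource name, metric field, and interval are hypothetical:

```json
{
  "queryType": "timeseries",
  "dataSource": "metrics",
  "granularity": "minute",
  "filter": { "type": "selector", "dimension": "host", "value": "compute-3" },
  "aggregations": [
    { "type": "doubleSum", "name": "value_sum", "fieldName": "value" }
  ],
  "intervals": ["2018-05-01T00:00:00/2018-05-02T00:00:00"]
}
```

Sent with, e.g., curl -X POST -H 'Content-Type: application/json' -d @query.json http://broker:8082/druid/v2/ (8082 is the broker's default port); the response is likewise a JSON array of timestamped result rows.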
Tranquility – Sending events to Druid
Tranquility is a tool that reads the final processed data from Kafka topics and writes it into Druid datasources.
You must know what data structure is coming and how it is going to be saved into the Druid datasource; therefore you must map the dimensions and metrics in the Tranquility configuration file.
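A rough sketch of that mapping in a Tranquility (Kafka) configuration file, assuming JSON events carrying a timestamp, one dimension (host), and one numeric field (value). Every name here is hypothetical and the exact layout should be checked against the Tranquility documentation:

```json
{
  "dataSources": {
    "metrics": {
      "spec": {
        "dataSchema": {
          "dataSource": "metrics",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "auto" },
              "dimensionsSpec": { "dimensions": ["host"] }
            }
          },
          "metricsSpec": [
            { "type": "count", "name": "count" },
            { "type": "doubleSum", "name": "value_sum", "fieldName": "value" }
          ],
          "granularitySpec": { "type": "uniform", "segmentGranularity": "hour", "queryGranularity": "none" }
        },
        "tuningConfig": { "type": "realtime", "windowPeriod": "PT10M" }
      },
      "properties": { "topicPattern": "metrics" }
    }
  },
  "properties": {
    "zookeeper.connect": "zk1:2181",
    "kafka.zookeeper.connect": "zk1:2181",
    "kafka.group.id": "tranquility-kafka"
  }
}
```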
Business intelligence web application
Business intelligence web applications let users explore and visualize data in the data warehouse and create reports easily.
Superset – An amazing tool developed by Airbnb that lets users create awesome reports, but we hit some limitations around querying raw (non-aggregated) data. Installing it requires many Python pip modules.
Tableau – We did not have an opportunity to test it, but it is an enterprise/commercial solution and looks like the most complete.
Metabase – Easy to install and operate. Setting up reports is pretty straightforward.
Metabase - Open source business intelligence tool
Get the jar file, run it, and access it at https://<Address>:3000.
Add a database/datasource connection in the web UI.
Ask a Question to build a report/analysis.
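In practice that is roughly (Metabase listens on port 3000 by default):

```shell
# Download metabase.jar from metabase.com, then start it
java -jar metabase.jar
# Open the web UI in a browser to add datasources and ask Questions
#   http(s)://<Address>:3000
```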
Thank you!
Questions?
More information:
Linkedin:
https://www.linkedin.com/in/leandro-totino-pereira
Facebook:
https://www.facebook.com/leandro.totinopereira