Apache Frameworks
for Big and Fast Data
- Naveen Korakoppa
Traditional way of request / response
● Request/response model — API consumers send requests to an API server and receive a response.
● Pull-based interaction — API consumers send an API request when data or functionality is required (e.g.
user interface, at a pre-scheduled time).
● Synchronous — API consumers receive the response after a request is sent.
● Multiple content types — since REST APIs are built upon HTTP, responses may be JSON, XML, or other
content types as necessary to support consumer needs (e.g. CSV, PDF).
● Internal and external access — REST APIs may be restricted for internal use or for external use by
partners or public developers.
Traditional way of request / response
● Flexible interactions — Building upon the available HTTP verbs, consumers may interact
with REST-based APIs through resources in a variety of ways: queries/search, creating new
resources, modifying existing resources, and deleting resources. We can also build complex
workflows by combining these interactions into higher-level processes.
● Caching and concurrency protocol support — HTTP has caching semantics built in, allowing caching
servers to be placed between the consumer and API server, as well as cache control of responses and
ETags for concurrency control to prevent overwriting content (see the sketch below).
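To make the caching and concurrency bullet concrete, here is a minimal, hypothetical sketch of a REST consumer making conditional requests with Python's requests library. The endpoint URL, resource, and payload are assumptions for illustration, not details from this deck.

```python
# Hypothetical REST consumer using HTTP caching and ETag concurrency
# semantics via the "requests" library. Endpoint and payload are invented.
import requests

BASE = "https://api.example.com"          # assumed API server
resource = f"{BASE}/orders/42"            # hypothetical resource

# Initial GET: the server returns the representation plus an ETag.
resp = requests.get(resource)
etag = resp.headers.get("ETag")

# Conditional GET: a cache (or client) revalidates with If-None-Match;
# 304 Not Modified means the cached copy is still fresh.
revalidate = requests.get(resource, headers={"If-None-Match": etag} if etag else {})
if revalidate.status_code == 304:
    print("cached representation is still valid")

# Conditional update: If-Match prevents overwriting someone else's change;
# the server answers 412 Precondition Failed if the ETag no longer matches.
update = requests.put(
    resource,
    json={"status": "shipped"},
    headers={"If-Match": etag} if etag else {},
)
if update.status_code == 412:
    print("resource was modified concurrently; re-fetch and retry")
```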
Modern way of data stream
● Publish/subscribe model — Apps or APIs publish messages to a topic which may have zero, one, or many
subscribers rather than a request/response model.
● Subscriber notification interaction — Apps receive notification when a new message is available, such as
when data is modified or new data is available.
● Asynchronous — Unlike REST APIs, apps cannot use message streams to submit a request and receive a
response back without complex coordination between parties.
● Single content type — message streaming here is built upon Avro, a compact binary format useful for
data serialization. Unlike HTTP, an Avro-based stream doesn’t support other content types (e.g. CSV,
PDF).
Modern way of data stream
● Replayability — because message streaming is built on Kafka, subscribers may revisit and replay
previous messages sequentially (see the sketch after this list).
● No caching or concurrency protocol support — Message streaming doesn’t offer caching
semantics, cache-control, or concurrency control between publisher and subscriber.
● Internal access only — Subscribers must be internal to the organization, unlike HTTP which
may be externalized to partner or public consumers.
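The following is a minimal publish/subscribe sketch of these properties using the kafka-python client: a producer publishes to a topic, a subscriber is notified asynchronously, and replay falls out of reading the persisted log from the earliest offset. The broker address, topic name, and consumer group are assumptions made for illustration.

```python
# Minimal pub/sub sketch with kafka-python. Assumes a broker at
# localhost:9092 and a topic named "payments"; both are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher: apps emit events to a topic instead of answering requests.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("payments", {"id": 1, "amount": 42.0})
producer.flush()

# Subscriber: notified of new messages asynchronously. Because the log is
# persisted, a consumer group that starts from the earliest offset can
# replay the whole history sequentially.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-checker",          # hypothetical consumer group
    auto_offset_reset="earliest",      # replay from the beginning
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,          # stop iterating when idle (demo only)
)
for msg in consumer:
    print(msg.offset, msg.value)
```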
Important concepts for Big & Fast data
architectures
Big data architecture is the overarching system used to ingest and process enormous
amounts of data (often referred to as "big data") so that it can be analyzed for business
purposes. The architecture can be considered the blueprint for a big data solution based
on the business needs of an organization. Big data architecture is designed to handle the
following types of work:
● Batch processing of big data sources.
● Real-time processing of big data.
● Predictive analytics and machine learning.
A well-designed big data architecture can save your company money and help you predict
future trends so you can make good business decisions.
Concepts of Big data architecture
Types:
● Lambda Architecture (batch-first approach): batching is used as the primary processing method,
with streams used to supplement it and provide early but unrefined results.
● Kappa Architecture (stream-first approach): streams are used for everything; this simplifies the
model and has only recently become possible as stream processing engines have grown more
sophisticated.
Lambda Architecture
This architecture was introduced by Nathan Marz. It has three layers that together provide real-time
streaming and compensate for any data errors that occur: the Batch Layer, the Speed Layer, and the
Serving Layer.
Data is routed to the batch layer and the speed layer concurrently by our data collector. Hadoop is our
batch layer, Apache Storm is our speed layer, and a NoSQL datastore such as Cassandra or MongoDB is
our serving layer, where analyzed results are stored.
The idea behind these layers is that the speed layer provides real-time results to the serving layer, and
if any data errors occur or any data is missed during stream processing, the batch job compensates for
that: the MapReduce job runs at regular intervals and updates the serving layer, thereby providing
accurate results.
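To make the batch/speed/serving split concrete, here is a toy, cluster-free Python sketch of how a lambda-style serving layer answers queries by merging a stale but complete batch view with a fresh but partial speed view. The keys and counts are invented for illustration; in a real deployment the views would be produced by Hadoop and Storm respectively.

```python
# Toy illustration of a lambda serving layer (no cluster involved).
# batch_view: authoritative counts produced by the last MapReduce run.
# speed_view: incremental counts produced by the stream processor since then.
batch_view = {"page/home": 1_000_000, "page/cart": 250_000}   # hypothetical
speed_view = {"page/home": 420, "page/checkout": 17}          # hypothetical

def serving_layer_count(key: str) -> int:
    """Answer a query by merging the batch view with the speed view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serving_layer_count("page/home"))      # 1000420
print(serving_layer_count("page/checkout"))  # 17

# When the next batch job finishes, its output replaces batch_view and
# speed_view is reset, correcting any errors made on the streaming path.
```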
Kappa Architecture
The lambda architecture above solves the data-error problem and also provides the flexibility to
deliver both real-time and accurate results to the user.
But the founders of Apache Kafka questioned this lambda architecture: they loved the benefits it
provides, but they also pointed out that it is very hard to build the pipeline and to maintain the
analysis logic in both the batch and the speed layer.
If we instead use frameworks like Apache Spark Streaming, Flink, or Beam, which support both batch
and real-time streaming, it becomes much easier for developers to maintain the logic of the data
pipeline.
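As a rough sketch of the kappa idea, the following PySpark Structured Streaming job keeps a single code path: reprocessing is done by pointing the same job back at the start of the Kafka log instead of maintaining a separate batch layer. It assumes PySpark with the spark-sql-kafka connector package, a broker on localhost:9092, and a topic named events; all of these are assumptions, not details from the deck.

```python
# Kappa-style sketch: one streaming job is the only code path. Requires the
# spark-sql-kafka connector package on the Spark classpath (assumption).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "earliest")   # replaying the log is the "batch" pass
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
)

# The same aggregation logic serves both live processing and full replays.
counts = events.groupBy(col("value")).count()

query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```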
Data Ingestion tools
Open-source data ingestion tools:
1. Apache Kafka
2. Apache Flume
3. Apache Sqoop
4. Apache NiFi
Apache Flume
● Flume is a distributed system that can be used to collect, aggregate, and transfer streaming events
into Hadoop.
● Flume is configuration-based and has interceptors to perform simple transformations on in-flight data.
● It comes with many built-in sources, channels, and sinks, for example the Kafka channel and the Avro sink.
● Flume data loads can be event-driven.
● To load streaming data such as tweets generated on Twitter or the log files of a web server, Flume
should be used; Flume agents are built for fetching streaming data.
Apache Sqoop
● Sqoop is used for importing data from structured data sources such as an RDBMS.
● Sqoop has a connector-based architecture: connectors know how to connect to the respective data
source and fetch the data.
● HDFS is the destination for data imported using Sqoop.
● Sqoop data loads are not event-driven.
● To import data from structured data sources, Sqoop is the tool to use, because its connectors know
how to interact with structured data sources and fetch data from them.
Apache Kafka
● Kafka is a distributed, high-throughput message bus that decouples data producers
from consumers. Messages are organized into topics, topics are split into partitions,
and partitions are replicated across the nodes — called brokers — in the cluster.
● Compared to Flume, Kafka offers better scalability and message durability.
● Kafka now comes in two flavors: the “classic” producer/consumer model, and the newer Kafka
Connect, which provides configurable connectors (sources/sinks) to external data stores.
● Kafka can be used for event processing and integration between components of
large software systems.
● Because messages are persisted on disk as well as replicated within the cluster, data
loss scenarios are less common than with Flume.
Apache NiFi
● Unlike Flume and Kafka, NiFi can handle messages with arbitrary sizes. Behind a drag-and-drop
Web-based UI, NiFi runs in a cluster and provides real-time control that makes it easy to
manage the movement of data between any source and any destination.
● It supports disparate and distributed sources with differing formats, schemas, protocols, speeds,
and sizes.
● NiFi can be used in mission-critical data flows with rigorous security & compliance
requirements, where we can visualize the entire process and make changes immediately, in
real-time.
● Some of NiFi’s key features are prioritized queuing, data traceability and back-pressure
threshold configuration per connection.
● Although it is used to create fault-tolerant production pipelines, NiFi does not yet replicate data
like Kafka. If a node goes down, the flow can be directed to another node, but data queued for
the failed node will have to wait until the node comes back up.
● NiFi is not a full-fledged ETL tool, nor is it ideal for complex computations and complex event
processing (CEP). For that, it should instead connect to a streaming framework like Apache Flink,
Spark Streaming, or Storm.
Data computation and analytics tools
These frameworks fall into three categories: batch-only, stream-only, and hybrid.
Open-source tools in this space:
1. Apache Hadoop
2. Apache Storm
3. Apache Spark
4. Apache Samza
5. Apache Flink
6. Apache Beam
7. Esper tool
Apache Hadoop - Batch-approach
● Distributed batch processing of large-volume, unstructured datasets.
● It has high latency (slow computation).
● Its architecture consists of HDFS and MapReduce.
● The processing framework used by Hadoop is distributed batch processing, which uses the
MapReduce engine for computation and follows a map, sort, shuffle, reduce algorithm.
● MapReduce jobs are executed sequentially until the job completes.
● Speed: because it performs batch processing on large volumes of data, Hadoop takes longer to
compute, which means higher latency; hence Hadoop is relatively slow.
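To make the map, sort, shuffle, reduce flow tangible, here is a small Hadoop Streaming style word count written in Python. The single-file layout, the argument-based dispatch, and the submit command in the comments are illustrative assumptions, not part of the deck.

```python
# Hadoop Streaming sketch: MapReduce as small Python scripts that read stdin
# and write tab-separated key/value pairs. Submit (roughly, paths assumed):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#       -files wordcount.py \
#       -mapper  "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce" \
#       -input /data/logs -output /data/wordcount
import sys

def mapper() -> None:
    """Map step: emit (word, 1) for every word on stdin."""
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer() -> None:
    """Reduce step: input arrives sorted by key, so sum runs of equal words."""
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # One file for the sketch; the framework does the sort/shuffle in between.
    mapper() if "map" in sys.argv[1:] else reducer()
```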
Apache Storm - Stream-approach
● Distributed real-time processing of data with large volume and high velocity.
● It has low latency (fast computation).
● Its architecture is based on a topology of spouts and bolts.
● The processing framework used by Storm is distributed real-time data processing, which uses DAGs
to generate topologies composed of streams, spouts, and bolts.
● Speed: thanks to near-real-time processing, Storm handles data with very low latency and gives
results with minimum delay.
Apache Spark - Batch-first-approach ( 3G of bigdata )
● Apache Spark supports batch processing as well as real-time data processing (lambda architecture).
● Apache Spark supports multiple languages: Java, Scala, Python, and R.
● Spark Streaming has higher latency than Apache Storm.
● Speed: Apache Spark can run an application on a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk.
● Apache Spark can integrate with all data sources and file formats supported by a Hadoop cluster.
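As a minimal sketch of what that looks like in practice, here is a PySpark batch word count over a Hadoop-compatible path; the path is a placeholder, and swapping spark.read for spark.readStream (plus a streaming sink) turns the same logic into a micro-batch streaming job.

```python
# Minimal PySpark batch sketch; the same DataFrame API also drives
# Structured Streaming. The input path is a hypothetical placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("spark-batch-sketch").getOrCreate()

# Batch: read text files, split into words, count them in memory.
lines = spark.read.text("hdfs:///data/logs/*.txt")      # hypothetical path
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(col("count").desc())
counts.show(10)

spark.stop()
```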
Apache Flink - Stream-first-approach ( 4G of bigdata )
● Apache Flink is a stream processing framework that can also handle batch tasks (kappa architecture).
● Flink’s stream-first approach offers low latency, high throughput, and true entry-by-entry processing.
● Flink is currently a unique option in the processing framework world. While Spark performs batch and
stream processing, its streaming is not appropriate for many use cases because of its micro-batch
architecture.
● Flink manages many things by itself. Somewhat unconventionally, it manages its own memory instead
of relying on the native Java garbage collection mechanisms, for performance reasons. Unlike Spark,
Flink does not require manual optimization and adjustment when the characteristics of the data it
processes change. It handles data partitioning and caching automatically as well.
● Flink has fewer APIs compared with Spark.
● One of the largest drawbacks of Flink at the moment is that it is still a very young project. Large-scale
deployments in the wild are still not as common as with other processing frameworks, and there hasn’t
been much research into Flink’s scaling limitations.
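For contrast with the Spark example, here is a minimal PyFlink DataStream word count that processes records entry by entry. It assumes the apache-flink (PyFlink) package is installed, and since the Python API has shifted slightly across Flink versions, read it as a sketch rather than a definitive implementation.

```python
# Minimal PyFlink DataStream sketch: entry-by-entry word count.
# Assumes the apache-flink package is installed; API details vary by version.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection stands in for an unbounded source (e.g. Kafka);
# Flink treats batch input as just a finite stream (kappa-style).
lines = env.from_collection([
    "flink processes records one at a time",
    "spark streaming processes records in micro batches",
])

counts = (
    lines.flat_map(lambda line: [(w, 1) for w in line.split()])
    .key_by(lambda pair: pair[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1]))
)

counts.print()
env.execute("flink-wordcount-sketch")
```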
EsperTech
● Esper is a streaming engine.
● Esper appears to be based primarily on streams, so of the two choices, it is most similar to
Flink.
● Esper has data storage/database functionality integrated, while Flink and Spark are pure
processing engines intended to work with external data stores
● Esper has reactive programming built in. Spark will have a very hard time supporting this,
while Flink should make it somewhat easier but still nontrivial.
● Esper’s integrations appear to target the enterprise, while Flink’s integrations target open-source
tools popular in Silicon Valley (e.g. Kafka).
● Esper appears to be much more mature, having had stable releases at least since 2008. I
believe Flink’s first stable release was in 2015.
● Esper started as an enterprise product, while Flink started with open source, bringing many
cultural differences
Differences between frameworks:
1. Spark & Flink: https://www.educba.com/apache-spark-vs-apache-flink/
2. Hadoop & Storm: https://www.educba.com/apache-hadoop-vs-apache-storm/
3. Storm & Spark: https://www.educba.com/apache-storm-vs-apache-spark/
4. Hadoop & Spark: https://www.educba.com/apache-storm-vs-apache-spark/
5. Hadoop, Spark & Flink: https://data-flair.training/blogs/hadoop-vs-spark-vs-flink/ (IMPORTANT)
Conclusions
The most important part is choosing the best streaming framework, and the honest answer is: it depends :)
1. For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely less
expensive to implement than some other solutions.
2. For stream-only workloads, Storm has wide language support and can deliver very low latency
processing, but can deliver duplicates and cannot guarantee ordering in its default
configuration. Samza integrates tightly with YARN and Kafka in order to provide flexibility, easy
multi-team usage, and straightforward replication and state management.
3. For mixed workloads, Spark provides high speed batch processing and micro-batch processing
for streaming. It has wide support, integrated libraries and tooling, and flexible integrations.
Flink provides true stream processing with batch processing support. It is heavily optimized,
can run tasks written for other platforms, and provides low latency processing, but is still in the
early days of adoption.
Lambda use case : Social media & network analysis
Kappa use case : Fraud Detection
Data Ingestion
References and blog links
• https://dzone.com/articles/big-data-ingestion-flume-kafka-and-nifi
• https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
• https://www.quora.com/What-is-the-closest-option-to-Esper-Apache-Spark-or-Apache-Flink
• https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
• https://www.slideshare.net/gschmutz/big-data-architecture-53231252?qid=3785d5c7-bd9c-408b-8714-ef9064a2be3b&v=&b=&from_search=1 [ IMPORTANT ]
System Requirements
Apache Kafka :
· at least 8 GB RAM
· at least 500 GB Storage
· Ubuntu 14.04 or later, RHEL 6, RHEL 7, or equivalent
· Access to Kafka (specifically, the ability to consume messages and to
communicate with Zookeeper)
· Access to Kafka Connect instances (if you want to configure Kafka Connect)
· Ability to connect to the server from the user’s web browser.
Docker image : https://hub.docker.com/r/bitnami/kafka/
Apache Hadoop :
System Requirements: per the Cloudera page, the VM takes 4 GB of RAM and 3 GB of disk
space, so your laptop should have more than that (8 GB+ is recommended). Storage-wise, as long
as you have enough to test with small and medium-sized data sets (tens of GB), you'll be fine. As
for the CPU, if your machine has that amount of RAM you'll most likely be fine; even a modest
single-node Pentium G3210 with 4 GB of RAM handles small jobs just fine.
Docker image : https://hub.docker.com/r/apache/hadoop
Apache NiFi :
NiFi Registry has the following minimum system requirements:
● Requires Java Development Kit (JDK) 8, newer than 1.8.0_45
● Supported Operating Systems:
○ Linux
○ Unix
○ Mac OS X
● Supported Web Browsers:
○ Google Chrome: Current & (Current - 1)
○ Mozilla FireFox: Current & (Current - 1)
○ Safari: Current & (Current - 1)
Docker image : https://hub.docker.com/r/apache/nifi/
System Requirements
Apache Spark :
Hardware
We used a virtual machine with the following setup:
* CPU core count: 32 virtual cores (16 physical cores), Intel Xeon CPU E5-2686 v4 @
2.30GHz
* System memory: 244 GB
* Total local disk space for shuffle: 4 x 1900 GB NVMe SSD
Software
● OS: Ubuntu 16.04
● Spark: Apache Spark 2.3.0 in local cluster mode
● Pandas version: 0.20.3
● Python version: 2.7.12
Docker image : https://hub.docker.com/r/sequenceiq/spark/
Apache Flink :
Recommended Operating System
● Microsoft Windows 10
● Ubuntu 16.04 LTS
● Apple macOS 10.13/High Sierra
Memory Requirement
● Memory - Minimum 4 GB, Recommended 8 GB
● Storage Space - 30 GB
Note: Java 8 must be available, with environment variables already set.
Docker image : https://hub.docker.com/_/flink
Thank You