Hadoop Summit
San Jose, California
June 28th 2016
Analysis of Major Trends in
Big Data Analytics
Slim Baltagi
Director, Enterprise Architecture
Capital One Financial Corporation
Welcome!
About me:
• I’m currently director of Enterprise Architecture at Capital One: a
top 10 US financial corporation based in McLean, VA.
• I have over 20 years of IT experience.
• I have over 7 years of Big Data experience: Engineer, Architect,
Evangelist, Blogger, Thought Leader, Speaker, Organizer of Apache
Flink meetups in many countries, Creator and maintainer of the Big
Data Knowledge Base: http://SparkBigData.com with over 7,000
categorized web resources about Hadoop, Spark, Flink, …
Thanks: This talk won the community vote of the ‘Future
of Apache Hadoop’ track. Thanks to all of you who voted
for this talk, are attending it now, or are reading these slides.
Disclaimer: This is a vendor-independent talk that
expresses my own opinions. I am neither endorsing nor
promoting any product or vendor mentioned in this talk.
2
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
3
What is a typical Big Data Analytics Stack:
Hadoop, Spark, Flink, …?
4
1. Portability between Big Data Execution Engines
If you have an existing Big Data application based on
MapReduce and you want to benefit from a different
execution engine such as Tez, Spark or Flink, you might
need to:
• Reuse some of your existing code, such as mapper and
reducer functions, on the new engine (see the sketch after this list).
• Leverage a ‘compatibility layer’ to run your existing
Big Data application on the new engine. Example: the
Hadoop Compatibility Layer from Flink.
• Switch to a different engine if the tool you use
supports it. Examples: Hive/Pig on Tez, Hive/Pig on
Spark, Sqoop on Spark, Cascading on Flink.
• Rewrite your Big Data application!
5
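A minimal sketch of the first two options, based on Flink’s flink-hadoop-compatibility module, which wraps an unmodified org.apache.hadoop.mapred.Mapper so it runs inside a Flink DataSet program. The Tokenizer mapper is a hypothetical example, and the exact class names and generics are assumptions to verify against your Flink version.

import java.io.IOException;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReuseHadoopMapper {

  // An existing Hadoop mapper, reused unchanged (hypothetical example).
  public static class Tokenizer extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
      for (String word : line.toString().split("\\s+")) {
        out.collect(new Text(word), new LongWritable(1L));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<LongWritable, Text>> input = env.fromElements(
        new Tuple2<>(new LongWritable(0L), new Text("reuse existing hadoop code on flink")));

    // Wrap the Hadoop mapper so Flink runs it as a flatMap operator.
    DataSet<Tuple2<Text, LongWritable>> words = input.flatMap(
        new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()));

    words.print();
  }
}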
1. Portability between Big Data Execution Engines
Apache Beam (unified Batch and Stream processing) is
a new Apache incubator project based on years of
experience developing Big Data infrastructure
(MapReduce, FlumeJava, MillWheel) within Google
http://beam.incubator.apache.org/
Apache Beam provides a unified API for Batch and
Stream processing, along with multiple runners.
Beam programs become portable across multiple
runtime environments, both proprietary (e.g., Google
Cloud Dataflow) and open-source (e.g., Flink, Spark).
Apache Beam web resources:
http://sparkbigdata.com/component/tags/tag/67
6
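To make this portability concrete, here is a minimal word-count sketch with the Beam Java SDK: the runner is chosen at launch time (DirectRunner, FlinkRunner, SparkRunner, DataflowRunner, …) without touching the pipeline code. The API follows the later, stable 2.x style, since details shifted while Beam was incubating; the sample input is made up.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class PortableWordCount {
  public static void main(String[] args) {
    // Passing e.g. --runner=FlinkRunner switches execution engines without code changes.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(Create.of("big data", "stream and batch", "big data"))
     .apply(ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Split each line into words.
         for (String word : c.element().split("\\s+")) {
           c.output(word);
         }
       }
     }))
     .apply(Count.<String>perElement())
     .apply(ParDo.of(new DoFn<KV<String, Long>, Void>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         System.out.println(c.element().getKey() + ": " + c.element().getValue());
       }
     }));

    p.run().waitUntilFinish();
  }
}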
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
7
2. Emergence of stream analytics
Stonebraker et al. predicted in 2005 that stream
processing would become increasingly important and
attributed this to the ‘sensorization of the real
world: everything of material significance on the
planet gets sensor-tagged and reports its state or
location in real time’. http://cs.brown.edu/~ugur/8rulesSigRec.pdf
I think stream processing is becoming important not
only because of this sensorization of the real world but
also because of the following factors:
1. Data streams
2. Technology
3. Business
4. Consumers
8
2. Emergence of stream analytics
The four drivers of the emergence of stream analytics:
1. Data Streams
2. Technology
3. Business
4. Consumers
9
2. Emergence of stream analytics
1 Data Streams
 Real-world data is available as a series of events that
are continuously produced by a variety of
applications and disparate systems inside and
outside the enterprise.
 Examples:
• Sensor networks data
• Web logs
• Database transactions
• System logs
• Tweets and social media data
• Click streams
• Mobile apps data
10
2. Emergence of stream analytics
2 Technology
Simplified data architectures, with Apache Kafka as a
major innovation and the backbone of stream
architectures (a minimal producer sketch follows this slide).
Rapidly maturing open source stream analytics tools:
Apache Flink, Apache Apex, Spark Streaming, Kafka Streams,
Apache Samza, Apache Storm, Apache Gearpump, Heron, …
Cloud services for stream processing: Google Cloud
Dataflow, Microsoft’s Azure Stream Analytics, Amazon Kinesis
Streams, IBM InfoSphere Streams, …
Vendors innovating in this space: Confluent, Data
Artisans, Databricks, MapR, Hortonworks, StreamSets, …
More mobile devices than human beings!
11
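A minimal producer sketch with the Kafka Java client, assuming a broker at localhost:9092 and a hypothetical ‘clickstream’ topic; downstream stream processors (Flink, Kafka Streams, …) would consume from the same log.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish one click event to the hypothetical 'clickstream' topic.
      producer.send(new ProducerRecord<>("clickstream", "user-42", "{\"page\":\"/home\"}"));
    }
  }
}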
2. Emergence of stream analytics
3 Business
Challenges:
Lag between data creation and actionable insights.
Infrastructure that is idle most of the time.
Web and mobile application growth, and new types/sources
of data.
Organizations need to shift from a reactive to a more
proactive approach to interactions with customers,
suppliers and employees.
12
2. Emergence of stream analytics
3 Business
Opportunities:
Embracing stream analytics helps organizations with
faster time to insight, competitive advantages and
operational efficiency in a wide range of verticals.
With stream analytics, new startups are, or will be,
challenging established companies. Example: Pay-As-
You-Go or Usage-Based Auto Insurance.
Speed is said to have become the new currency of
business.
13
2. Emergence of stream analytics
4 Consumers
Consumers expect everything to be online and
immediately accessible through mobile
applications.
Mobile, always-on consumers increasingly demand
instant responses from enterprise applications, just as
they are used to getting from social network apps such
as Twitter, Facebook, LinkedIn …
A younger generation that grew up with video gaming
and is accustomed to real-time interaction is now
itself a growing class of consumers.
14
2. Emergence of stream analytics
 Financial services
 Telecommunications
 Online gaming systems
 Security & Intelligence
 Advertisement serving
 Sensor Networks
 Social Media
 Healthcare
 Oil & Gas
 Retail & eCommerce
 Transportation and logistics
2. Emergence of stream analytics
End-to-end stream analytics solution architecture (diagram):
• Sourcing & Integration: Apps, Sensors, Devices and Other Sources feed an Event Collector & Broker.
• Analytics & Processing: a Stream Processor, a Data Lake, and Advanced Analytics & Machine Learning.
• Serving & Consuming: Real-Time Notifications, Real-Time Decisions, Dashboards, a Business System Backend, Business Applications (e.g. an Enterprise Command Center) and Personal Mobile Applications.
16
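As a sketch of the ‘Stream Processor’ box above, here is a minimal Flink streaming job that consumes events from Kafka and filters them. The broker, topic and group names are assumptions, and the connector class (FlinkKafkaConsumer09, for the Kafka 0.9 connector of that era) varies by Flink/Kafka version.

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class StreamProcessorSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
    props.setProperty("group.id", "stream-analytics");        // assumed consumer group

    // Consume raw events from the broker (the 'Event Collector & Broker' box).
    DataStream<String> events = env.addSource(
        new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));

    // A trivial 'Analytics & Processing' step: keep only error events.
    events.filter(e -> e.contains("ERROR"))
          .print(); // stand-in for real-time notifications/dashboards

    env.execute("Stream processor sketch");
  }
}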
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
17
3. In-Memory Analytics
While In-Memory Analytics is not new, it is the focus of
renewed attention thanks to:
• the availability of larger, cheaper memory that can
easily hold most active data sets
• the maturing or newly available in-memory open source
tools in many categories such as:
 Memory-centric distributed File System
 Columnar data format
 Key Value data stores
 IMDG: In-Memory Data Grids
 Distributed Cache
 Very Large Hashmaps
In the next couple of slides, I will share a few examples.
18
3. In-Memory Analytics
Alluxio http://alluxio.org (formerly known as Tachyon) is
an open source, memory-speed, virtual distributed
storage system. Examples of its usage patterns:
• Accelerating Big Data Analytics workloads by
prefetching views and creating caches on demand.
• Sharing data between applications by writing to
Alluxio’s in-memory data store and reading it back at
far greater speed.
 RocksDB https://github.com/facebook/rocksdb/ is an open
source library from Facebook that provides an
embeddable, persistent key-value store. It is suited for
fast storage of data on RAM and flash drives. It is used
as a state backend by Samza, Flink, Kafka Streams, …
(see the sketch after this slide)
19
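A minimal sketch of RocksDB’s embeddable key-value API via the RocksJava binding; newer releases make Options and RocksDB AutoCloseable, and the local path is an assumption.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class EmbeddedStateStore {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/rocksdb-state")) { // assumed local path
      // Store and read back a small piece of state, the way a stream
      // processor keeps per-key counters in its state backend.
      db.put("counter:page-views".getBytes(), "42".getBytes());
      byte[] value = db.get("counter:page-views".getBytes());
      System.out.println(new String(value));
    }
  }
}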
3. In-Memory Analytics
Apache Arrow (http://arrow.apache.org/) for columnar in-
memory analytics.
• Apache Arrow enables execution engines to take
advantage of the latest SIMD (Single Instruction,
Multiple Data) operations included in modern processors,
for native vectorized optimization of analytical data
processing.
• Columnar layout of data also allows for a better use of
CPU caches by placing all data relevant to a column
operation in as compact of a format as possible.
• Another Apache Arrow advantage is that systems using it
as a common memory format have no overhead for
cross-system data communication and can also share
functionality (a small vector sketch follows this slide).
20
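As a small illustration, here is a sketch of building a columnar in-memory vector with the Arrow Java library. Class names follow later Arrow Java releases (e.g. IntVector); treat the exact API as an assumption to check against your Arrow version.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowColumnSketch {
  public static void main(String[] args) {
    // Allocate an off-heap columnar vector: contiguous values plus a validity bitmap.
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector clicks = new IntVector("clicks", allocator)) {
      clicks.allocateNew(4);
      for (int i = 0; i < 4; i++) {
        clicks.set(i, i * 10);
      }
      clicks.setValueCount(4);

      // Columnar, cache-friendly scan over the single column.
      long sum = 0;
      for (int i = 0; i < clicks.getValueCount(); i++) {
        sum += clicks.get(i);
      }
      System.out.println("sum = " + sum);
    }
  }
}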
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics frameworks
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Deployment of Big Data applications in a
hybrid model: on-premise and on the cloud
21
4. Rapid Application Development of Big
Data applications
Four enablers of Rapid Application Development of Big Data Analytics:
1. APIs
2. Notebooks/Shells
3. GUIs
4. Microservices
22
4. Rapid Application Development of Big
Data applications
1 APIs
 Apache Spark and Apache Flink provide high-level,
easy-to-use APIs compared to Hadoop MapReduce (see the
word-count sketch after this list).
 Apache Beam is a new open source project from
Google that attempts to unify data processing
frameworks with a core API, allowing easy portability
between execution engines.
 Use Apache Beam’s unified API for batch and streaming,
then run on a local runner, Apache Spark, Apache
Flink, …
 The biggest advantage is in developer productivity and
ease of migration between processing engines.
23
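To illustrate the point about high-level APIs, here is a minimal word-count sketch with the Flink DataSet API; the equivalent MapReduce program needs a driver plus separate Mapper and Reducer classes. The sample input is made up.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class ConciseWordCount {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<String, Integer>> counts = env
        .fromElements("high level apis", "high developer productivity")
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        .groupBy(0)  // group by the word field
        .sum(1);     // sum the counts

    counts.print(); // triggers execution and prints (word, count) pairs
  }
}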
4. Rapid Application Development of Big
Data applications
2 Shells or Notebooks
Shells:
• REPL (Read Evaluate Print Loop) interpreter
• Interactive queries
• Explore data quickly
• Sketch out your ideas in the shell to make sure you’ve
got your code right before deploying it to a cluster.
Notebooks:
• Web-based interactive computation environment
• Collaborative data analytics and visualization tool
• Combines rich text, executable code, plots and rich
media
• Exploratory data science
• Saving and replaying of written code
24
4. Rapid Application Development of Big
Data applications
2 Shells or Notebooks Apache Zeppelin
25
4. Rapid Application Development of Big
Data applications
3 GUIs
 Apache NiFi
26
4. Rapid Application Development of Big
Data applications
4 Microservices:
 Microservices are an important trend in building larger
systems by:
• decomposing their functions into relatively simple,
single-purpose services
• that communicate asynchronously via Apache Kafka,
a message-passing technology that avoids unwanted
dependencies between these services (a minimal
consumer sketch follows this slide).
 This streaming architectural style provides agility,
as microservices can be built and maintained by
small, cross-functional teams.
27
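A minimal consumer sketch for a hypothetical ‘shipping’ microservice reacting asynchronously to events published by another service; the broker address, topic and group names are assumptions, and poll(long) matches the Kafka clients of that era (newer clients take a Duration).

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderEventsConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed broker
    props.put("group.id", "shipping-service");        // each microservice uses its own group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(500);
        for (ConsumerRecord<String, String> record : records) {
          // React to the event without any direct call into the producing service.
          System.out.println("shipping-service handling " + record.key() + " -> " + record.value());
        }
      }
    }
  }
}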
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics frameworks
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
28
5. Open sourcing Machine Learning systems
by tech giants
Open sourcing of machine learning systems by tech giants (diagram):
1. Facebook: Torch
2. IBM: SystemML
3. Google: TensorFlow
4. Microsoft: DMTK
5. Yahoo: CaffeOnSpark
6. Amazon: DSSTNE
29
5. Open sourcing Machine Learning systems
by tech giants
1 Torch http://torch.ch/ is an open source
Machine Learning library which provides a
wide range of deep learning algorithms.
Facebook donated its optimized deep learning modules to
the Torch project on January 16, 2015.
2 Apache SystemML http://systemml.apache.org/
is a distributed and declarative machine learning platform.
It was created in 2010 by IBM and donated as an open
source Apache project on November 2nd, 2015.
3 TensorFlow is an open source machine learning library
created by Google. https://www.tensorflow.org It was released
under the Apache 2.0 open source license on November 9th, 2015.
30
5. Open sourcing Machine Learning
systems by tech giants
4 DMTK (Distributed Machine Learning Toolkit) allows
models to be trained on multiple nodes at once.
http://www.dmtk.io/ DMTK was open sourced
by Microsoft on November 12, 2015.
5 CaffeOnSpark https://github.com/yahoo/CaffeOnSpark is an
open source machine learning library created by Yahoo. It
was open sourced on February 24th, 2016
6 DSSTNE (Deep Scalable Sparse Tensor Network
Engine), “Destiny”, is an Amazon-developed library for
building Deep Learning (DL) Machine Learning (ML)
models. It was open sourced on May 11th, 2016.
https://github.com/amznlabs/amazon-dsstne
31
5. Open sourcing Machine Learning
systems by tech giants
Wider adoption of Machine Learning tools is expected
from companies beyond these tech giants, in much the
same way that MapReduce and Hadoop helped make
“Big Data” part of just about every company’s strategy!
These tech giants are not keeping their machine
learning systems for internal use only; they are
racing to open source them, attract users and
committers, and advance the entire industry.
This, combined with deployment on commodity clusters,
will accelerate such adoption; as a result, we will see
new machine learning use cases, especially ones built on
deep learning, that will transform multiple industries.
32
Agenda
1. Portability between Big Data Execution
Engines
2. Emergence of stream analytics frameworks
3. In-Memory analytics
4. Rapid Application Development of Big Data
applications
5. Open sourcing Machine Learning systems by
tech giants
6. Hybrid Cloud Computing
33
6. Hybrid Cloud Computing
Cloud is becoming mainstream, and the software stack is
adapting.
Big Data applications will eventually all move to the
cloud to benefit from agility, elasticity and on-demand
computing!
Meanwhile, companies need to advance their strategy
for hybrid integration between cloud and on-premise
deployments.
Deployment of Big Data applications in a hybrid
model: on-premise and on the cloud
34
6. Hybrid Cloud Computing
The following are a few patterns for such hybrid
integration:
1. Replicating data from SaaS apps to existing on-
premise databases to be used by other on-premise
applications such as analytics ones.
2. Integrating SaaS applications themselves with on-
premise applications.
3. Hybrid Data Warehousing with the Cloud: move data
from on-premise data warehouse to the cloud.
4. Real-Time analytics on streaming data: depending on
your use case, you might keep your stream analytics
infrastructure directly accessible on-premise for low
latency.
Key Takeaways
1. Adopt Apache Beam for easier development and
portability between Big Data Execution Engines
2. Adopt stream analytics for faster time to insight,
competitive advantages and operational efficiency
3. Accelerate your Big Data applications with In-Memory
open source tools
4. Adopt Rapid Application Development of Big Data
applications: APIs, Notebooks, GUIs, Microservices…
5. Make Machine Learning part of your strategy, or
passively watch your industry be completely
transformed!
6. Advance your strategy for hybrid integration
between cloud and on-premise deployments.
36
Thanks!
To all of you for attending!
Any questions?
Let’s keep in touch!
• sbaltagi@gmail.com
• @SlimBaltagi
• https://www.linkedin.com/in/slimbaltagi
37