Big data trends challenges opportunities

Big Data
Trends, Challenges, and Opportunities
Mohammed Guller
Jan 30, 2015

About Me
• Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and machine learning
www.linkedin.com/in/mohammedguller
@MohammedGuller
Available on Amazon

CPU Trend
• CPU clock speeds plateaued around 2004
• Single-core performance is no longer improving at its earlier pace
• The trend is to add more cores per CPU and more CPUs per system

Challenges
• Multi-threaded programs are required to utilize all cores in a machine
• Writing multi-threaded programs is hard
• Tools provided by traditional languages are primitive
• Problems such as deadlocks, livelocks, starvation, and race conditions are difficult to avoid and detect
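
A minimal sketch of the kind of low-level tool the slide refers to, assuming Python's standard `threading` module as the "traditional" toolset. The function name `count_with_lock` and its parameters are hypothetical, chosen only for illustration. It shows a shared counter that stays correct only because every update is guarded by a lock; removing the lock would expose the read-modify-write step to a race condition.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # "counter += 1" is a read-modify-write sequence, not an atomic step.
            # Without the lock, two threads could interleave and lose updates.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = count_with_lock()
```

Even in this small case the programmer has to remember to take the lock around every access, which is the kind of manual discipline that makes multi-threaded code hard to get right.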

Functional Programming (FP)
• Based on theory developed in the 1930s (lambda calculus)
• Programs are composed of functions
– Executed by evaluating expressions
• Functions are first-class citizens
– Can be passed as an argument to another function
– Can be returned by another function
– Can be defined inside another function
– Can be defined as an unnamed literal similar to a string literal
• Functions do not have side effects
– Always return the same output for a given input
– Order of execution is not important
• Discourages mutable variables
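
The properties listed above can be sketched in a few lines. This is an illustrative example in Python rather than one of the FP languages named in this deck, and the names `apply_twice`, `make_adder`, `square`, and `total_of_squares` are hypothetical.

```python
from functools import reduce


def apply_twice(f, x):
    """A function passed as an argument to another function."""
    return f(f(x))


def make_adder(n):
    """A function defined inside, and returned by, another function."""
    def add(x):
        return x + n
    return add


# A function defined as an unnamed literal (a lambda).
square = lambda x: x * x


def total_of_squares(numbers):
    """Pure function: no side effects, same output for the same input,
    and no mutable variables (the sum is built with reduce, not a loop)."""
    return reduce(lambda acc, x: acc + x, map(square, numbers), 0)


add_five = make_adder(5)
```

For example, `apply_twice(square, 3)` evaluates to 81, `add_five(10)` to 15, and `total_of_squares([1, 2, 3])` to 14, regardless of how often or in what order the calls are made.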

Benefits of Functional Programming
• Makes it easier to write multi-threaded programs
• Improves developer productivity
• Enables better quality code

Functional Programming Languages
• Lisp
• Erlang
• Haskell
• Scala
• Swift

Opportunities
• High demand for people who know Scala
– Scala is one of the most popular FP languages
• Shortage of people who know Scala

3 Vs of Big Data
Volume: Scale of Data
Variety: Diversity of Data
Velocity: Speed of Data

Amount of Data Generated is Exploding

5x More Connected Things Than People by 2020
Network of objects embedded with software for
collecting and exchanging data over the Internet

Big Data Challenges
• Storage
– Traditional SAN and NAS storage devices are expensive
• Processing
– Traditional RDBMSs were not designed to handle big data
• How to get value out of data
• How to do it economically

Open-source Big Data Storage Technologies
• Distributed File Systems
– HDFS
• NoSQL data stores
– Cassandra
– HBase
– MongoDB
– Druid
– Elasticsearch
– SolrCloud

How Much Data Can a Standard Server Process?
[Figure: data sizes of 100 GB, 1 TB, 10 TB, and 100 TB]

Options For Increasing Data Processing Power
• Scale-up
• Scale-out

Scale-up
• Use a more powerful high-end server
– Faster CPUs
– Faster disks
– Larger number of CPUs
– Larger amount of memory
• Proprietary
• Expensive
• Limited scalability

Scale-out
• Use a cluster of commodity servers
• Inexpensive
• Economical to scale
• Preferred architecture

Challenges With Scale-out Architecture
• Writing a distributed application is even harder than writing a multi-threaded one
• Many details involved
– Split a workload into chunks that can be distributed across a cluster
– Schedule compute resources among different jobs
– Manage inter-node communication
– Handle network and node failures
• Hardware failures are more common at a cluster level
– Probability of a given single node failing is very low
– Probability of at least one node failing in a cluster of thousands of nodes is very high
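
The failure-probability point can be checked with a short calculation. This sketch assumes that nodes fail independently and uses an illustrative per-node failure probability of 0.1% over some fixed period; both the independence assumption and the numbers are hypothetical and do not come from the slides.

```python
def prob_any_failure(p_node, num_nodes):
    """Probability that at least one of num_nodes independent nodes fails,
    given that each node fails with probability p_node."""
    # P(at least one failure) = 1 - P(no node fails) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_node) ** num_nodes


# Illustrative (assumed) numbers: 0.1% failure chance per node.
single_node = prob_any_failure(0.001, 1)
large_cluster = prob_any_failure(0.001, 5000)
```

Under these assumptions a single node fails with probability 0.001, while a 5,000-node cluster sees at least one failure with probability of roughly 0.99, which is why a distributed application has to treat node failure as a routine event.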

Getting Value Out of Data
• Traditional analytics / BI
• Machine Learning
– Predictive analytics
– Train software to do human tasks

Traditional Analytics / BI
• What happened
– Revenue growth for the last month/quarter/year
– Customer growth for the last month/quarter/year
• Why it happened
– Why profit dropped
– Why sales dropped
• Other insights
– What is the breakdown by country of people downloading an app
– How much time people spend in an app

Predictive Analytics
• Ask software to predict
– What product will a customer most likely buy
– What ad will a visitor most likely click
– What movies/songs/books will a customer like
– What are the chances that a patient will have a heart attack
• More interesting and valuable than traditional analytics

Train Software To Do Human Tasks
• Image classification
– Facebook
– Flickr
• Voice recognition and natural language processing
– Siri
• Body movement recognition
– Xbox Kinect
• Self-driving cars
– Google car
• Medical diagnosis
• Anomaly detection
– Fraudulent transactions
– Security attacks

Distributed Data Processing Frameworks
• Batch processing
– MapReduce
• Stream processing
– Samza
– Heron
– Storm
• Batch and stream processing
– Spark
– Flink
– Apex
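
To make the batch-processing model behind MapReduce concrete, here is a single-process sketch of its map, shuffle, and reduce phases applied to a word count. It uses plain Python and none of the frameworks listed above; the function names are hypothetical, and a real framework would run each phase in parallel across a cluster and handle failures.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big clusters process big data"])
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, the work can be split into independent chunks, which is what lets a framework distribute it across nodes.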

Spark
Fast, easy-to-use, and general-purpose cluster
computing framework for processing large datasets

Supports a Variety of Data Sources

Spark Benefits
• Makes it easy to write distributed data processing applications
– Expressive API
• Takes care of the messy details of distributed computing
• Allows developers to just focus on the business logic
– Same code works on a single computer or a cluster of nodes

Integrated Libraries for a Variety of Tasks
[Diagram: Spark Core with the libraries Spark SQL, GraphX, Spark Streaming, and MLlib & Spark ML]

Spark is Fast
• In-memory computation
• Advanced Directed Acyclic Graph (DAG) execution engine

Why In-memory Computation Matters
[Figure: data access throughput of 100 MB/s, 500 MB/s, and 10 GB/s]
Read Time Comparison
[Figure: bar chart of time (min) to read 1 TB of data from HDD, SSD, and RAM]
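
The read-time comparison can be reproduced with simple arithmetic. This sketch assumes that the throughput figures on the previous slide map to HDD (100 MB/s), SSD (500 MB/s), and RAM (10 GB/s) in that order, and that units are decimal (1 TB = 10^12 bytes); the slides do not state either point explicitly.

```python
MB = 10**6
GB = 10**9
TB = 10**12

# Assumed mapping of the throughput figures to storage media.
THROUGHPUT_BYTES_PER_SEC = {
    "HDD": 100 * MB,
    "SSD": 500 * MB,
    "RAM": 10 * GB,
}


def read_time_minutes(num_bytes, bytes_per_second):
    """Time in minutes to read num_bytes sequentially at a given throughput."""
    return num_bytes / bytes_per_second / 60


times = {
    medium: read_time_minutes(1 * TB, rate)
    for medium, rate in THROUGHPUT_BYTES_PER_SEC.items()
}
```

Under these assumptions, reading 1 TB takes about 167 minutes from HDD, about 33 minutes from SSD, and under 2 minutes from RAM, which is consistent with the 0 to 200 minute axis of the chart.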

What Are People Using Spark For
Source: Databricks Survey 2015

Top Reasons For Using Spark
Source: Databricks Survey 2015

Adoption of Spark is Growing Rapidly

Opportunities
• Big data will only get bigger
– Everything will be data driven
– New data-driven applications will be invented
– Data will enable us to solve extremely difficult problems
• Spark and other big data technologies are rapidly evolving
• Strong demand for people who know how to store, process, and get value out of big data
