Big Data
Trends, Challenges, and Opportunities
Mohammed Guller
Jan 30, 2015
About Me
• Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and machine learning
www.linkedin.com/in/mohammedguller
@MohammedGuller
Available on Amazon
Functional Programming
CPU Trend
• CPU clock speed plateaued around 2004
• Individual cores are no longer getting much faster
• Trend is to add more cores per CPU and more CPUs per system
Challenges
• Multi-threaded programs are required to utilize all cores in a machine
• Writing multi-threaded programs is hard
• Tools provided by traditional languages are primitive
• Problems such as deadlocks, livelocks, starvation, and race conditions are difficult to avoid and detect
Functional Programming (FP)
• Based on theory developed in the 1930s (lambda calculus)
• Program composed of functions
– Executed by evaluating expressions
• Functions are first-class citizens
– Can be passed as an argument to another function
– Can be returned by another function
– Can be defined inside another function
– Can be defined as an unnamed literal, similar to a string literal
• Functions do not have side effects
– Always return the same output for a given input
– Order of execution is not important
• Discourages mutable variables
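The properties listed above can be illustrated with a minimal sketch. Python is used here only for illustration; the slides do not prescribe a language for examples, and the names `apply_twice`, `make_adder`, and `add_three` are hypothetical.

```python
from typing import Callable


def apply_twice(f: Callable[[int], int], x: int) -> int:
    """Takes a function as an argument and applies it twice."""
    return f(f(x))


def make_adder(n: int) -> Callable[[int], int]:
    """Returns a function that is defined inside another function."""
    def add(x: int) -> int:
        # No side effects: the result depends only on x and n.
        return x + n
    return add


add_three = make_adder(3)

# Passing a named function: 10 + 3 + 3
result_named = apply_twice(add_three, 10)

# Passing an unnamed function literal (a lambda): 5 * 2 * 2
result_literal = apply_twice(lambda x: x * 2, 5)
```

Because `add` and the lambda read no shared state and modify nothing, calling them again with the same input always gives the same result.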
Benefits of Functional Programming
• Makes it easier to write multi-threaded programs
• Improves developer productivity
• Enables better quality code
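A rough sketch of the first benefit, assuming a hypothetical pure function `square`: because it touches no shared state, it can be handed to a thread pool without locks, and the parallel result matches the sequential one.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """Pure function: no shared state is read or written."""
    return x * x


numbers = list(range(10))

# Sequential evaluation.
sequential = [square(n) for n in numbers]

# Parallel evaluation on a thread pool. map() yields results in input
# order even though the individual calls may run in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))
```

In standard CPython builds the global interpreter lock prevents threads from speeding up CPU-bound work like this, so the sketch demonstrates safety rather than speed.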
Functional Programming Languages
• Lisp
• Erlang
• Haskell
• Scala
• Swift
Opportunities
• High demand for people who know Scala
– Scala is one of the most popular FP languages
• Shortage of people who know Scala
Big Data
3 Vs of Big Data
• Volume: scale of data
• Variety: diversity of data
• Velocity: speed of data
Amount of Data Generated is Exploding
5x More Connected Things Than People by 2020
Internet of Things: a network of objects embedded with software for collecting and exchanging data over the Internet
Big Data Challenges
• Storage
– Traditional SAN and NAS storage devices are expensive
• Processing
– Traditional RDBMSs were not designed to handle big data
• How to get value out of data
• How to do it economically
Open-source Big Data Storage Technologies
• Distributed file systems
– HDFS
• NoSQL data stores
– Cassandra
– HBase
– MongoDB
– Druid
– Elasticsearch
– SolrCloud
How Much Data Can a Standard Server Process
100 GB, 1 TB, 10 TB, 100 TB
Options For Increasing Data Processing Power
• Scale-up
• Scale-out
Scale-up
• Use a more powerful high-end server
– Faster CPU
– Faster disk
– Larger number of CPUs
– Larger amount of memory
• Proprietary
• Expensive
• Limited scalability
Scale-out
• Use a cluster of commodity servers
• Inexpensive
• Economical to scale
• Preferred architecture
Challenges With Scale-out Architecture
• Writing a distributed application is even harder than writing a multi-threaded one
• Many details involved
– Split a workload into chunks that can be distributed across a cluster
– Schedule compute resources among different jobs
– Manage inter-node communication
– Handle network and node failures
• Hardware failures are more common at a cluster level
– Probability of a single node failing is very low
– Probability that at least one node fails in a cluster of thousands of nodes is very high
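The last point follows from basic probability. As a sketch, assume node failures are independent and each node fails with probability p in some period; the chance that at least one of n nodes fails is then 1 - (1 - p)^n. The 0.1% per-node figure below is a hypothetical value chosen for illustration and does not come from the slides.

```python
def prob_any_failure(p_single: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_single) ** nodes


# Hypothetical figure: each node has a 0.1% chance of failing in a given period.
P_SINGLE = 0.001

for n in (1, 100, 1000, 5000):
    print(f"{n:>5} nodes: {prob_any_failure(P_SINGLE, n):.1%}")
```

Under these assumptions a single node almost never fails, while a cluster of 1,000 nodes sees at least one failure about 63% of the time, which is why a distributed framework has to treat failure handling as routine.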
Getting Value Out of Data
• Traditional analytics / BI
• Machine learning
– Predictive analytics
– Train software to do human tasks
Traditional Analytics / BI
• What happened
– Revenue growth for the last month/quarter/year
– Customer growth for the last month/quarter/year
• Why it happened
– Why profit dropped
– Why sales dropped
• Other insights
– What is the breakdown by country of people downloading an app
– How much time people spend in an app
Predictive Analytics
• Ask software to predict
– What product will a customer most likely buy
– What ad will a visitor most likely click
– What movies/songs/books will a customer like
– What are the chances that a patient will have a heart attack
• More interesting and valuable than traditional analytics
Train Software To Do Human Tasks
• Image classification
– Facebook
– Flickr
• Voice recognition and natural language processing
– Siri
• Body movement recognition
– Xbox Kinect
• Self-driving car
– Google car
• Medical diagnosis
• Anomaly detection
– Fraudulent transaction
– Security attack
Distributed Data Processing Frameworks
• Batch processing
– MapReduce
• Stream processing
– Samza
– Heron
– Storm
• Batch and stream processing
– Spark
– Flink
– Apex
Spark
Fast, easy-to-use, and general-purpose cluster computing framework for processing large datasets
Supports a Variety of Data Sources
Spark Benefits
• Makes it easy to write distributed data processing applications
– Expressive API
• Takes care of the messy details of distributed computing
• Allows developers to just focus on the business logic
– Same code works on a single computer or a cluster of nodes
Integrated Libraries for a Variety of Tasks
• Spark Core
• Spark SQL
• GraphX
• Spark Streaming
• MLlib & Spark ML
Spark is Fast
• In-memory computation
• Advanced Directed Acyclic Graph (DAG) execution engine
Why In-memory Computation Matters
• HDD: 100 MB/s
• SSD: 500 MB/s
• RAM: 10 GB/s
Read Time Comparison
[Chart: time in minutes (0 to 200) to read 1 TB of data from HDD, SSD, and RAM]
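The read times in the chart can be approximated with simple arithmetic. This sketch assumes that the three throughput figures on the slide (100 MB/s, 500 MB/s, 10 GB/s) correspond to HDD, SSD, and RAM in that order, and that 1 TB means 10^12 bytes; neither assumption is stated explicitly in the slides.

```python
TB = 10**12  # bytes in one terabyte (decimal definition, an assumption)

# Throughput in bytes per second. The pairing of each figure with a
# medium is assumed from the order in which the slides list them.
throughput = {
    "HDD": 100 * 10**6,  # 100 MB/s
    "SSD": 500 * 10**6,  # 500 MB/s
    "RAM": 10 * 10**9,   # 10 GB/s
}

# Time to read 1 TB sequentially from each medium, in minutes.
read_minutes = {medium: TB / rate / 60 for medium, rate in throughput.items()}

for medium, minutes in read_minutes.items():
    print(f"{medium}: {minutes:.1f} min")
```

With these figures, reading 1 TB takes roughly 167 minutes from HDD, 33 minutes from SSD, and under 2 minutes from RAM, which is consistent with the 0 to 200 minute axis of the chart.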
What Are People Using Spark For
Source: Databricks Survey 2015
Top Reasons For Using Spark
Source: Databricks Survey 2015
Adoption of Spark is Growing Rapidly
Opportunities
• Big data will only get bigger
– Everything will be data driven
– New data-driven applications will be invented
– Data will enable us to solve extremely difficult problems
• Spark and other big data technologies are rapidly evolving
• Strong demand for people who know how to store, process, and get value out of big data
