This document summarizes an introduction-to-big-data presentation by Mohammed Guller. It discusses key big data concepts such as the volume, variety, and velocity of data. It introduces big data technologies such as Hadoop and Spark and how they address the challenges of storing, processing, and extracting value from large datasets. Specific technologies covered include Kafka for messaging, HDFS and MapReduce within Hadoop, and Spark's speed and programming model. The presenter's background and a book on big data analytics are also mentioned.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research-Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, and Historical Data Platform, serving Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare, and Retail. Mr. Baltagi has worked in various architecture, design, development, and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn. He also has over 14 years of IT experience with an emphasis on full life cycle development of enterprise web applications using Java and open-source software. He holds a master's degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
At Monsanto, emerging technologies such as IoT, advanced imaging, and geo-spatial platforms, along with molecular breeding, ancestry, and genomics data sets, have made us rethink how we approach developing, deploying, scaling, and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and to integrate analytics with our core product platforms. In this talk, we will share our journey of transformation, showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development; provisioning of data through APIs and streams; deployment of models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical, and batch analytics@scale; and integration of analytics with our core product platforms to turn data into actionable insights.
This presentation suggests the top 5 things architects and IT managers need to look for in a big data solution.
Industry thought leaders Gaurav Dhillon and David Linthicum discuss the future of cloud integration and data management in the API economy. Topics from this webinar and the accompanying slides include: key considerations of today's CIOs, approaching the reality of the multi-cloud world and new solutions for managing cloud and on-premise data. To learn more, visit: http://www.snaplogic.com/.
Non-interactive big-data analysis prohibits experimentation and can interrupt the analyst's train of thought, yet analyzing and drawing insights in real time is no easy task, with jobs often taking minutes or hours to complete. What if you want to put an interactive interface in front of that data that allows iterative insights? What if you need that interactive experience to be sub-second? Traditional SQL and most MPP/NoSQL databases cannot run complex calculations over large data in a performant manner. Popular distributed systems such as Hadoop or Spark can execute such jobs, but their job overhead prohibits sub-second response times. Learn how an in-memory computing framework enabled us to perform complex analysis jobs on massive data points with sub-second response times — allowing us to plug it into a simple, drag-and-drop web 2.0 interface.
This document discusses best practices for using Hadoop as an enterprise data hub. It provides an overview of how big data is driving new analytical workloads and the need for deeper customer insights. It discusses challenges with analyzing new sources of structured, unstructured and multi-structured data. It introduces the concept of a Hadoop enterprise data hub and data refinery to simplify access to new insights from big data. Key components of the data hub include a data reservoir to capture raw data from various sources, a data refinery to cleanse and transform the data, and publishing high value insights to data warehouses and other systems.
Looking to implement Hadoop but haven’t pulled the trigger yet? You are not alone. Many companies have heard the hype about how Hadoop can solve the challenges presented by big data, but few have actually implemented it. What’s preventing them from taking the plunge? Can it be done in small steps to ensure project success? This session will discuss some of the items to consider when getting started with Hadoop and how to go about making the decision to move to the de facto big data platform. Starting small can be a good approach when your company is learning the basics and deciding what direction to take. There is no need to invest large amounts of time and money up front if a proof of concept is all you aim to provide. Using well known data sets on virtual machines can provide a low cost and effort implementation to know if your big data journey will be successful with Hadoop.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
Sean McKeown, Technical Solutions Architect discusses big data architecture and deployment at Cisco Connect Toronto.
If you missed Strata + Hadoop World, you missed quite a bit. This year's event was packed with Big Data practitioners across industries who shared their experiences and how they are driving new innovations like never before. Just because you weren't there doesn't mean you have to miss out. In this session, we'll touch on a few of the key highlights from the show, including:
- Key trends in Big Data adoption
- The enterprise data hub
- How the enterprise data hub is used in practice
Cloudera Tech Day presentation by Eva Andreasson, Director of Product Management, Cloudera. Text-based search has recently become a critical part of the Hadoop stack and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for enabling next-generation big data analytics applications.
This document discusses ING NL's efforts to create a data lake architecture using Hadoop to integrate all of the bank's data sources onto a single processing platform. The data lake aims to collect data in a unified format, securely store it to prevent manipulation and unauthorized access, and make it available for analytical applications. Some of the challenges discussed include managing security, aligning with legacy systems, and facilitating interdepartmental cooperation on agile delivery. The presentation focuses on one part of the data lake, the archive, and how a Hadoop cluster can effectively address the goals of collecting, storing, and accessing data for business intelligence and data science purposes.
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
- Cloudera's Distribution including Apache Hadoop (CDH) is an enterprise-grade distribution of Apache Hadoop that includes additional components for management, security, and integration with existing systems.
- CDH enables enterprises to leverage Hadoop for data agility, consolidation of structured and unstructured data sources, complex data processing using various programming languages, and economical storage of data regardless of type or size.
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
This document discusses big data analytics using Spark. It provides an overview of the history and growth of data from the 1980s to present. It then demonstrates how to perform word count analytics on text data using both traditional MapReduce techniques in Hadoop as well as using Spark. The code examples show how to tokenize text, count word frequencies, and output results.
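The word-count pattern that the abstract demonstrates can be sketched in plain Python. This is an illustrative simulation of the map and reduce phases, not the actual Hadoop or Spark code from the presentation; the sample input lines are invented for the example:

```python
from collections import defaultdict

def map_phase(lines):
    # Tokenize each line and emit (word, 1) pairs, as a Hadoop mapper would
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Group pairs by key and sum the counts, as a Hadoop reducer would
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insights", "data drives insights"]
print(reduce_phase(map_phase(lines)))
# → {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```

In Spark the same pipeline collapses into a chain of `flatMap`, `map`, and `reduceByKey` calls over a distributed dataset, which is why the word-count example is the standard way to contrast the two programming models.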
This document discusses the big data market and winners and losers. It finds that traditional companies like Oracle, Teradata, and SAP are under pressure as newer big data technologies like Hadoop and NoSQL have seen rapid growth. While the big data market is expected to be worth over $50 billion by 2017, practitioners have faced barriers adopting big data within their organizations. Overall, practitioners are seen as the biggest winners from leveraging big data, though the market remains in early stages.
The document discusses big data market trends and provides advice on how organizations can develop a big data strategy and implementation plan. It outlines a 5 step approach for modernizing an organization's data warehouse with new big data technologies: 1) enhancing the data warehouse with unstructured data, 2) extending it with data virtualization, 3) increasing scalability with MPP databases, 4) accelerating analytics with in-database processing, and 5) creating an operational data store with Hadoop. The document also provides tips for selecting big data vendors, such as evaluating a vendor's ability to integrate with existing systems and make analytics accessible to both power users and business users.
This document discusses steps towards a data value chain, including big data, public open data, and linked (open) data. It provides definitions and examples for each topic. For big data, it discusses the large volumes of data being created and challenges in working with such data. For public open data, it outlines principles like completeness and ease of access. It also shows examples of apps using open government data. For linked open data, it discusses moving from a web of documents to a web of interconnected data through using URIs and typed links. It also shows the growth of the linked open data cloud over time.
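The move from a web of documents to a web of interconnected data can be illustrated with subject-predicate-object triples. In the sketch below, the `example.org` URIs are hypothetical, while the FOAF predicates are real linked-data vocabulary terms:

```python
# Linked data as subject-predicate-object triples, where URIs identify
# both the things being described and the typed links between them.
triples = [
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/bob"),
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/name",
     "Alice"),
]

def objects_of(subject, predicate, graph):
    # Follow a typed link: find everything the subject points to
    # via the given predicate.
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects_of("http://example.org/person/alice",
                 "http://xmlns.com/foaf/0.1/knows", triples))
```

Because every node and link is a dereferenceable URI, triples published by different organizations can be merged into one graph — which is how the linked open data cloud grows.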
Fully embracing a BI tool can mean the difference between the full payoff of your data analytics and returns that are just so-so. Learn how to avoid BI pitfalls and boost BI adoption to become a truly data-driven organisation.
Talk at #BigDataCanarias (June 16, 2014)
Are you lost among web pages and links about big data? I've collected everything about big data and Hadoop together.
Title: BigData, AllData, Old Data: Predictive Analytics in a Changing Data Landscape Abstract: The landscape of platforms, access methodologies, data shapes, and storage representations has changed dramatically. Many of the assumptions of a structured data world dominated by relational databases have been rendered obsolete. Today's data analyst faces a bewildering environment of semi-structured and unstructured data, with access methodologies that bear almost no relation to the past. This talk will cover issues and challenges in making the benefits of advanced analytics fit within the application environment. The requirement for real-time data streaming and in situ data mining is stronger than ever. We demonstrate how many of the critical problems remain open, with much opportunity for innovative solutions to play a huge enabling role. This opportunity extends equally well to Knowledge Management and several related fields.
This document provides an overview of Hadoop and Big Data. It begins with introducing key concepts like structured, semi-structured, and unstructured data. It then discusses the growth of data and need for Big Data solutions. The core components of Hadoop like HDFS and MapReduce are explained at a high level. The document also covers Hadoop architecture, installation, and developing a basic MapReduce program.
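As a rough sketch of the HDFS storage model mentioned above, the toy Python code below splits a file into fixed-size blocks and assigns each block to several data nodes. This is illustrative only: real HDFS placement is rack-aware, and the 128-byte block size here stands in for HDFS's default of 128 MB; the `dn1`–`dn4` node names are invented:

```python
def split_into_blocks(data: bytes, block_size: int):
    # HDFS-style splitting: fixed-size blocks, the last one may be smaller
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    # Toy round-robin replica placement; real HDFS also considers racks
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])   # → [128, 128, 44]
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```

Block-level replication is what lets MapReduce schedule tasks on the nodes that already hold the data, and what lets the cluster survive the loss of individual machines.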
The document discusses the importance of a data-driven culture for businesses. It provides the following key points:
1. Research has shown that companies that emphasize data-driven decision making have 5-6% higher productivity and output than comparable companies. This relationship also appears in other financial metrics such as return on equity.
2. Data science draws from various fields such as operations research, probability theory, analytics, and computer science. It is used for optimal decision making, handling uncertainty, generating insights from data, and implementing analytical solutions.
3. When adopting a data-driven approach, companies should focus on specific business goals and KPIs rather than just collecting data. Iterative testing is also important to measure impact.
This presentation introduces the concepts of Big Data in layman's language. The author does not claim originality of the content: the presentation was compiled from various sources, and the author claims no copyright over them. Big data is growing exponentially in today's age of information. This presentation aims to clarify the concept and the hype revolving around it.
A presentation about how to reach a data-driven culture, given at the Data Driven Digital Marketing Event on 26 September 2016.
Big Data 101 - Originally presented during a seminar when the idea behind big data was just beginning to catch on.
This presentation by Gartner discusses big data industry insights and trends. It provides an overview of organizations' investments in big data technology, the challenges they face in adoption, the types of big data being analyzed now and planned for the future, and examples of how different industries are using big data to address key business problems.
This document provides an introduction to machine learning. It begins with an agenda listing topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources, explaining the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It covers issues in machine learning such as overfitting and underfitting and the importance of testing algorithms. The document concludes that machine learning has vast potential, but that realizing this potential is difficult and requires strong mathematics skills.
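As an illustration of one of the listed algorithms, here is a minimal naive Bayes text classifier in plain Python with Laplace smoothing. The tiny spam/ham training set is invented for the example; a real introduction would use a proper corpus and a library implementation:

```python
from collections import Counter, defaultdict
import math

def train(samples):
    # samples: list of (list_of_words, label) pairs
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(words, class_counts, word_counts, vocab):
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Laplace (+1) smoothing avoids zero probability for unseen words
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

samples = [(["cheap", "pills"], "spam"), (["meeting", "notes"], "ham"),
           (["cheap", "meds"], "spam"), (["project", "notes"], "ham")]
model = train(samples)
print(predict(["cheap", "meds"], *model))  # → 'spam'
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together — one of the practical details the "theory to practice" gap in the talk refers to.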
Datameer 6 completely re-imagines the user experience for modern BI, helping you deliver new insights faster and more results to your data-hungry business. During this one-hour webinar, we demonstrated all that's new in Datameer 6 and how you can:
- Discover answers to a new range of business questions using an iterative, exploratory approach
- Find answers faster and deliver more insights with a new, faster analytic workflow
- Utilize Spark to speed analytic processing time without needing to know the technical details
Watch this on-demand webinar with special guest speaker Sean Anderson, Senior Product Marketing Manager at Cloudera, who discusses Cloudera's view of the Hadoop data processing stack and how the marketplace is benefiting from Spark.
The document discusses Oracle's approach to enterprise cloud strategy. It notes that most public and private cloud offerings are incompatible and incomplete for enterprise needs. Oracle proposes an integrated cloud solution providing enterprise SaaS, PaaS and IaaS that can span both private and public clouds. This would allow enterprises to run applications and workloads across their on-premises infrastructure and Oracle's public cloud platform. Oracle argues this integrated approach is needed to bring true cloud agility to enterprise applications and IT.
Learn how to get started with Big Data using a platform based on Apache Hadoop, Apache Spark, and IBM BigInsights technologies. The emphasis here is on free or low-cost options that require modest technical skills.
Data integration is just plain hard and there is no magic bullet. That said, three new data integration techniques do ameliorate the misery, making silo-busting possible, if not trivial. The three approaches – data lakes, virtual databases (aka federated databases), and data hubs – are a boon to organizations big enough to have separate systems, separate lines of business, and redundant acquired or COTS data stores. Each approach has its place, but how do you make the right decision about which data silo integration approach to choose and when? This webinar describes how you can use the key concepts of data Movement, Harmonization, and Indexing to determine what you are giving up or investing in, and make the best decision for your project.
This session will describe and demonstrate the longstanding integration between Couchbase Server and Apache Kafka and will include descriptions of both the mechanics of the integration and practical situations when combining these products is appropriate.
Oracle Italy Systems Presales Team presents: Big Data in any flavor, on-prem, public cloud, and Cloud at Customer. Presentation given at the Digital Transformation event, February 2017.
The keynote presentation discusses how cloud providers are impacting traditional data centers. It notes that as companies grow from startups to established enterprises, their hosting needs change from fully public cloud to hybrid models. The presentation outlines the tradeoffs of different hosting options such as owning your own data center, colocation, managed hosting, and public cloud. It argues that a hybrid multi-cloud approach combining on-premises, dedicated, managed, public, and other specialty clouds provides the most flexibility, cost savings, and ability to put the right workload in the right environment. Case studies are presented showing how hybrid cloud delivered major cost reductions and performance gains for Explore.org and enabled critical security and compliance requirements for Samsung. The presentation concludes that a hybrid multi-cloud strategy offers the best balance of cost, performance, and compliance.
3 Things to Learn:
- How data is driving digital transformation to help businesses innovate rapidly
- How Choice Hotels (one of the largest hoteliers) is using Cloudera Enterprise to gain meaningful insights that drive their business
- How Choice Hotels has transformed its business through innovative use of Apache Hadoop, Cloudera Enterprise, and deployment in the cloud — from developing customer experiences to meeting IT compliance requirements
Hyper-converged systems offer a great deal of promise and yet come with a set of limitations. While they allow enterprises to re-integrate system components into a single enclosure and reduce the physical complexity, floor space and cost of supporting a workload in the data center, they also often will not support existing storage in local SANs or offered by cloud service providers. There are solutions available to address these challenges and allow hyper-converged systems to realize their promise. During this session you will learn: - What are hyper-converged systems? - What challenges do they pose? - What should the ideal solution to those challenges look like? - About a solution that helps integrate hyper-converged systems with existing SANs
Moving to the cloud can raise more questions than answers. Do I move and improve to Infrastructure as a Service, or redesign my business processes on Software as a Service? This presentation covers several cloud migrations to OAC, EBS/Cloud (HCM/Financials), and Hyperion, and outlines what went well and what could have gone better.
One of the challenges that comes from deploying multi-tiered distributed systems, or microservices, atop a dynamic scheduler is the introduction of new problems surrounding load balancing. There are inherent challenges in building a load balancer that operates in a highly available way, without any single points of failure. In this talk, Sargun Dhillon will walk through the distributed load-balancing mechanism that he built for Mesos. This service discovery mechanism is meant to have the same kinds of features, API, and availability that existed in legacy, statically partitioned environments. The purpose is to ease the transition and remove some of the largest roadblocks in moving applications over to modern datacenters. In addition, he will speak to why he built it rather than using other alternatives for service discovery and load balancing, such as ZooKeeper, and the challenges that came from that choice. His team built a library called Lashup that provides a membership protocol, a multicast layer, a failure detector, and a CRDT key/value store, which has allowed them to build applications that orchestrate Mesos clusters with great ease.
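The abstract does not detail Lashup's CRDT key/value store, but the convergence property CRDTs provide can be sketched with a minimal state-based grow-only counter. This is an illustrative Python example, not Lashup's implementation (which is in Erlang); the node names are invented:

```python
class GCounter:
    """State-based grow-only counter CRDT: merge takes the entry-wise
    maximum, so replicas converge to the same value regardless of the
    order or duplication of the messages they exchange."""

    def __init__(self):
        self.counts = {}  # node id -> that node's local count

    def increment(self, node, n=1):
        self.counts[node] = self.counts.get(node, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        merged = GCounter()
        for node in set(self.counts) | set(other.counts):
            merged.counts[node] = max(self.counts.get(node, 0),
                                      other.counts.get(node, 0))
        return merged

a, b = GCounter(), GCounter()
a.increment("node-a", 2)
b.increment("node-b", 3)
print(a.merge(b).value())  # → 5
print(b.merge(a).value())  # → 5; merge is commutative
```

Because merge is commutative, associative, and idempotent, a gossip-style membership layer can spread updates lazily without any coordination — the property that makes CRDTs attractive for cluster state in systems like the one described above.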
We all know that consumer behavior has changed dramatically. How consumers engage with companies, do research and even purchase leaves a deluge of data that companies have never had. Those companies that can parse that data drive business results like never before. This session presentation at Dog Food Con 2016 helps you to learn how Big Data technology can drive business outcomes from data ingestion to cloud and talks about one company’s journey to Customer 360 and their decision process when moving to the cloud.
- Deep learning (TensorFlow) and microservices (Kubernetes, Docker, Kafka) are emerging trends.
- While batch processing is still dominant, stream processing is gaining traction with technologies like Kafka, Flink, and Beam.
- Python has surpassed Java as the most popular language for data analytics, according to Stack Overflow trends.
This document discusses the evolution of data center networking from 2007 to present day. It describes how earlier networks were static with clear divisions between teams, while modern networks are more dynamic with blurred lines between developers and operations. It outlines projects within DC/OS like Mesos-DNS, Minuteman, and Lashup that provide service discovery, load balancing, and a distributed control plane to manage today's complex networks and microservices applications. Future plans include improved security, quality of service, and potential rewriting of operating systems to enable zero-overhead network functions virtualization.
This document discusses the history and development of container networking and service discovery solutions. It describes how Mesosphere developed DC/OS to provide networking features like load balancing and service discovery using Erlang microservices including Spartan, Minuteman, and Lashup. Spartan provides high availability DNS, Minuteman provides distributed load balancing, and Lashup uses HyParView to maintain global network state across the cluster. The document outlines how these services were developed to enable dynamic container networking and service discovery.
Uptake is the industrial analytics platform that delivers products to major industries to increase productivity, security, safety, and reliability.
About the Event: Launching a successful startup takes more than building on the most flexible, reliable, and scalable infrastructure available today. Startup Day is an opportunity to hear from successful startups about how they've tackled the unique technical challenges in their industry. It's also an opportunity to meet other startup leaders in your community to share ideas, help each other grow, and inspire each other while tackling problems that affect your organization.
Who Should Attend? This event is built for early-stage (pre-seed & bootstrapped) technical leads and entrepreneurs. Attendees will learn, from a technical perspective, what did and didn't work well for other startups across a diverse range of industries. Hear what funded and late-stage startups wish they knew before they began building. Learn how companies are leveraging SageMaker to optimize machine learning training and deployment, deploying ECS for container orchestration, and using Lambda to build companies that are entirely serverless.
The document discusses how legacy customer data stored in organizations can provide a competitive advantage for training AI/machine learning models and powering personalized customer experiences while ensuring privacy protection. It explains that legacy data is needed to train accurate predictive models, enable cross-channel personalization, and allow for strong governance and control over sensitive customer information. Finally, it states that without access to legacy customer data stores, organizations cannot fully leverage AI/ML to drive predictive marketing, deliver personalized experiences, or comprehensively protect customer privacy.
The document discusses emerging cloud computing technologies including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Database as a Service. It notes that IaaS is currently the fastest growing cloud service, with Gartner reporting 42.4% growth in 2012. Popular IaaS providers include Amazon Web Services, CloudStack, and OpenStack. PaaS offerings from Google App Engine, Heroku, and Amazon Elastic Beanstalk are analyzed in terms of their approaches and limitations. Best practices for adopting PaaS include considering application requirements, resources, data needs, and interactions beyond the platform.
A high-level overview of 'The Cloud', Microsoft Windows Azure, and experiences building a cloud platform.
This document provides an introduction to integration platform as a service (iPaaS) and SnapLogic. It discusses the drivers for iPaaS adoption including big data, hybrid cloud environments, and the need for faster integration. Ten requirements for modern integration are outlined. The document then introduces SnapLogic and its unified platform for connecting applications, data and APIs anywhere through a library of pre-built connectors. Four primary iPaaS use cases are described: hybrid application integration, cloud data warehousing/analytics, big data ingestion/transformation/delivery, and replacing legacy integration platforms.