The Era of Big Data: Why and How?
VAHID AMIRI
VAHIDAMIRY.IR
VAHID.AMIRY@GMAIL.COM
Big Data
 Data Gathering
 Data Storing
 Data Processing
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
Big Data: 3V’s
Volume
 12+ TB of tweet data every day
 25+ TB of log data every day
 ? TB of data every day
 2+ billion people on the Web by end of 2011
 30 billion RFID tags today (1.3 billion in 2005)
 4.6 billion camera phones worldwide
 100s of millions of GPS-enabled devices sold annually
 76 million smart meters in 2009; 200 million by 2014
Variety (Complexity)
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data
 You can only scan the data once
 Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of data need to be linked together
A Single View of the Customer
The customer sits at the center, linked to data from Social Media, Gaming, Entertainment, Banking, Finance, Known History, and Purchases.
Velocity (Speed)
 Data is being generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
Social media and networks (all of us are generating data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Some Make it 4V’s
 The Model of Generating/Consuming Data has Changed
The Model Has Changed…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Solution
Big Data + Big Computation → Big Computer
Big Data Solutions
 Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
 Hadoop implements Google’s MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work.
 HDFS creates multiple replicas of data blocks for reliability, placing them on compute
nodes around the cluster
Hadoop
Spark Stack
 More than just the Elephant in the room
 Over 120 types of NoSQL databases
So many NoSQL options
 Extend the Scope of RDBMS
 Caching
 Master/Slave
 Table Partitioning
 Federated Tables
 Sharding (see the sketch below)
NoSQL
 Relational database (RDBMS) technology
 Has not fundamentally changed in over 40 years
 Default choice for holding data behind many web apps
 Handling more users means adding a bigger server
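To make the sharding idea above concrete, here is a minimal, hypothetical sketch of application-level sharding in Java: the shard for a row is chosen by hashing its key across a fixed set of database connection strings. The class name and JDBC URLs are invented for illustration and do not belong to any particular product.

import java.util.List;

// Hypothetical illustration of application-level sharding over several RDBMS instances.
public class ShardRouter {
    private final List<String> shardJdbcUrls;

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    // Pick a shard deterministically from the row key, e.g. a customer id.
    public String shardFor(String rowKey) {
        int index = Math.floorMod(rowKey.hashCode(), shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:mysql://db-shard-0/app",
                "jdbc:mysql://db-shard-1/app",
                "jdbc:mysql://db-shard-2/app"));
        System.out.println(router.shardFor("customer-42"));
    }
}

The simple modulo scheme shown here makes rebalancing painful when shards are added, which is one reason purpose-built NoSQL systems handle partitioning internally.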
RDBMS with Extended Functionality vs. Systems Built from Scratch with Scalability in Mind
NoSQL Movement
CAP Theorem
 “Of three properties of shared-data systems – data Consistency, system
Availability and tolerance to network Partition – only two can be achieved at
any given moment in time.”
 CA
 Highly-available consistency
 CP
 Enforced consistency
 AP
 Eventual consistency
Flavors of NoSQL
 Schema-less
 State (Persistent or Volatile)
 Examples:
 Redis
 Amazon DynamoDB
Key / Value Database
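As a concrete illustration of the key/value model, here is a minimal sketch using the Jedis client against a local Redis instance; the key name and time-to-live are arbitrary examples.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Connect to a local Redis server (assumed to be running on the default port 6379)
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Store a value under a key; the store is schema-less
            jedis.set("user:42:name", "Vahid");

            // Read it back; the key is the only access path
            System.out.println(jedis.get("user:42:name"));

            // Make the entry volatile by attaching a time-to-live in seconds
            jedis.expire("user:42:name", 3600);
        }
    }
}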
 Wide, sparse column sets
 Schema-light
 Examples:
 Cassandra
 HBase
 BigTable
 GAE HR DS
Column Database
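A minimal sketch of the wide-column model using the HBase Java client. It assumes a table named users with a column family info already exists; the table, family, and row names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnStoreExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key + column family + qualifier + value.
            // Qualifiers do not have to be declared up front (schema-light, sparse).
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Vahid"));
            table.put(put);

            // Read the row back and pick out the single cell we wrote
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}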
 Use for data that is
 document-oriented (collections of JSON documents) with semi-structured data
 Encodings include XML, YAML, JSON, and BSON
 binary forms (PDF, Microsoft Office documents: Word, Excel, …)
 Examples: MongoDB, CouchDB
Document Database
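A minimal sketch of the document model with the MongoDB Java (sync) driver; the database and collection names and the document fields are made up for illustration, and a local MongoDB instance on the default port is assumed.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");

            // Insert a semi-structured JSON document; no schema has to be declared first
            Document doc = new Document("name", "Vahid")
                    .append("interests", java.util.Arrays.asList("hadoop", "nosql"));
            users.insertOne(doc);

            // Query by a field inside the document rather than by a fixed key
            Document found = users.find(eq("name", "Vahid")).first();
            System.out.println(found.toJson());
        }
    }
}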
Graph Database
Use for data with
 a lot of many-to-many relationships
 when your primary objective is quickly
finding connections, patterns and
relationships between the objects within
lots of data
 Examples: Neo4j, Freebase (Google)
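A minimal sketch of graph-style access using the Neo4j Java driver (4.x) and Cypher; the node labels, relationship type, and credentials are placeholders for a local server reachable over Bolt.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class GraphStoreExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Create two people and a relationship between them
            session.run("MERGE (a:Person {name:'Ali'}) " +
                        "MERGE (b:Person {name:'Sara'}) " +
                        "MERGE (a)-[:KNOWS]->(b)");

            // Ask for connections: who does Ali know?
            Result result = session.run(
                    "MATCH (:Person {name:'Ali'})-[:KNOWS]->(p) RETURN p.name AS name");
            while (result.hasNext()) {
                System.out.println(result.next().get("name").asString());
            }
        }
    }
}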
So which type of NoSQL? Back to CAP…
CP = NoSQL/column: Hadoop, BigTable, HBase, MemCacheDB
AP = NoSQL/document or key/value: DynamoDB, CouchDB, Cassandra, Voldemort
CA = SQL/RDBMS: SQL Server / SQL Azure, Oracle, MySQL
Apache Hadoop Projects
Apache Hadoop
 A framework for storing and processing petabytes of data using commodity hardware and storage
 Apache project
 Implemented in Java
 Community of contributors is growing
 Yahoo: HDFS and MapReduce
 Powerset: HBase
 Facebook: Hive and the Fair Scheduler
 IBM: Eclipse plugins
A brief history of Hadoop
Organizations using Hadoop
Hadoop System Principles
 Scale-Out rather than Scale-Up
 Bring code to data rather than data to code
 Deal with failures – they are common
 Abstract complexity of distributed and concurrent applications
Scale-Out Instead of Scale-Up
 It is harder and more expensive to scale up
 Add additional resources to an existing node (CPU, RAM)
 New units must be purchased if the required resources cannot be added
 Also known as scaling vertically
 Scale-Out
 Add more nodes/machines to an existing distributed application
 The software layer is designed for node addition and removal
 Hadoop takes this approach - A set of nodes are bonded together as a single
distributed system
 Very easy to scale down as well
Code to Data
 Traditional data processing architecture
 Nodes are broken up into separate processing and storage nodes connected by a high-capacity link
 Many data-intensive applications are not CPU-demanding, which causes bottlenecks in the network
Code to Data
 Hadoop co-locates processors and storage
 Code is moved to data (size is tiny, usually in KBs)
 Processors execute code and access underlying local storage
Failures are Common
 Given a large number of machines, failures are common
 Large warehouses may see machine failures weekly or even daily
 Hadoop is designed to cope with node failures
 Data is replicated
 Tasks are retried
Abstract Complexity
 Hadoop abstracts many complexities in distributed and concurrent applications
 Defines a small number of components
 Provides simple, well-defined interfaces for interaction between these components
 Frees developers from worrying about system-level challenges
 processing pipelines, data partitioning, code distribution
 Allows developers to focus on application development and business logic
Distribution Vendors
 Cloudera Distribution for Hadoop (CDH)
 MapR Distribution
 Hortonworks Data Platform (HDP)
 Apache BigTop Distribution
Components
 Distributed File System
 HDFS
 Distributed Processing Framework
 Map/Reduce
The Storage:
Hadoop Distributed File System
HDFS is Good for...
 Storing large files
 Terabytes, Petabytes, etc...
 Millions rather than billions of files
 100MB or more per file
 Streaming data
 Write-once, read-many-times access patterns
 Optimized for streaming reads rather than random reads
 “Cheap” Commodity Hardware
 No need for super-computers, use less reliable commodity hardware
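A minimal sketch of the write-once / read-many pattern with the HDFS Java API. It assumes a core-site.xml on the classpath pointing fs.defaultFS at the cluster; the path is arbitrary.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write once...
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // ...read many times, as a stream
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}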
HDFS Daemons
Files and Blocks
HDFS Component Communication
REPLICA MANAGEMENT
 A common practice is to spread the nodes across multiple racks
 A good replica placement policy should improve data reliability, availability,
and network bandwidth utilization
 Namenode determines replica placement
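A small sketch of how a client can influence replication: dfs.replication sets the default factor for files this client creates, and setReplication changes it for an existing (hypothetical) file. Where the replicas actually land is still decided by the NameNode's placement policy.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default replication factor for files created by this client
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of an existing (hypothetical) file to 5;
            // the NameNode decides on which racks/nodes the extra replicas are placed
            fs.setReplication(new Path("/data/important.csv"), (short) 5);
        }
    }
}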
NETWORK TOPOLOGY AND HADOOP
The Execution Engine:
Apache YARN
Apache YARN
YARN Components
 ResourceManager:
 Arbitrates resources among all the applications in the system
 NodeManager:
 The per-machine slave, responsible for launching the applications' containers and monitoring their resource usage
 ApplicationMaster:
 Negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress
 Container:
 A unit of allocation incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1)
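As a small illustration of the ResourceManager / NodeManager split, this sketch uses the YarnClient API to ask the ResourceManager for reports on the running NodeManagers; it assumes a yarn-site.xml on the classpath pointing at the cluster.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Talk to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for the NodeManagers it currently knows about
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        yarnClient.stop();
    }
}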
YARN Architecture
The Processing Model:
MapReduce
Hadoop MapReduce Framework
What is MapReduce?
 Parallel programming model for large clusters
 User implements Map() and Reduce()
 Parallel computing framework
 Libraries take care of EVERYTHING else
 Parallelization
 Fault Tolerance
 Data Distribution
 Load Balancing
 MapReduce library does most of the hard work for us!
 Takes care of distributed processing and coordination
 Scheduling
 Task Localization with Data
 Error Handling
 Data Synchronization
MapReduce: Data Flow
Map and Reduce
 Map()
 Map workers read in contents of corresponding input partition
 Process a key/value pair to generate intermediate key/value pairs
 Reduce()
 Merge all intermediate values associated with the same key
 e.g. <key, [value1, value2, ..., valueN]>
 The output of the user's reduce function is written to an output file on the global file system
 When all tasks have completed, master wakes up user program
Distributed Processing
 Word count on a huge file
MapReduce Model
Example: Counting Words
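The counting-words example on this slide corresponds to the classic Hadoop WordCount job. The sketch below follows the standard example from the Hadoop documentation: Map() emits <word, 1> for every token, and Reduce() sums the counts for each word; input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emit <word, 1> for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): the framework groups all values for a key, e.g. <word, [1, 1, ..., 1]>
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be submitted with something like: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are illustrative).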