Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hadoop Training | Edureka
www.edureka.co/big-data-and-hadoop | EDUREKA HADOOP CERTIFICATION TRAINING
Agenda
1. Evolution Of Data
2. What is Big Data?
3. Big Data as an Opportunity
4. Problems in Encashing the Opportunity
5. Hadoop as a Solution
6. Hadoop Ecosystem
7. Edureka Big Data & Hadoop Training
Evolution of Data
1. Evolution of Technology: Telephone to Mobile, Desktop to Cloud, Car to Smart Car
2. IoT: 50 billion devices by 2020
3. Social Media: 4,166,667 likes & 200,000 photos; 347,222 tweets; 300 hours of video uploaded; 204,000,000 emails; 1,736,111 Instagram pics
4. Other Factors
What Is Big Data?
Big data is the term for a collection of data sets so large and complex that they become difficult to process using on-hand
database system tools or traditional data processing applications.
5 V’s of Big Data
1. Volume
By 2020, the accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes.
[Chart: accumulated digital data in exabytes by year, 2009 to 2020; y-axis from 10,000 to 40,000]
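The unit conversion behind the Volume figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 ZB = 10^21 bytes, 1 EB = 10^18 bytes and 1 GB = 10^9 bytes; the function names are illustrative only.

```python
# Assumption: decimal (SI) units, not binary (ZiB/GiB) units.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18
BYTES_PER_ZB = 10**21


def zettabytes_to_gigabytes(zb: int) -> int:
    """Convert whole zettabytes to gigabytes."""
    return zb * BYTES_PER_ZB // BYTES_PER_GB


def zettabytes_to_exabytes(zb: int) -> int:
    """Convert whole zettabytes to exabytes (the unit on the chart's y-axis)."""
    return zb * BYTES_PER_ZB // BYTES_PER_EB


projected_gb = zettabytes_to_gigabytes(44)   # 44 followed by 12 zeros: 44 trillion GB
projected_eb = zettabytes_to_exabytes(44)    # 44,000 EB
today_eb = 4_400                             # 4.4 ZB expressed in exabytes
growth_factor = projected_eb // today_eb     # tenfold growth

print(projected_gb, projected_eb, growth_factor)
```

Under these assumptions, 44 zettabytes is exactly 44 trillion gigabytes, and the projection corresponds to a tenfold increase over the 4.4 zettabyte starting point.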
2. Variety
Different kinds of data are being generated from various sources:
Structured: Table
Semi-Structured: XML, CSV, TSV, JSON, E-mail
Unstructured: Audio, Video, Image, Log
3. Velocity
Data is being generated at an alarming rate. Every 60 seconds:
100,000+ tweets
695,000+ status updates
11,000,000+ instant messages
698,445 Google searches
168,000,000+ emails
1,820 TB of data created
217+ new mobile users
[Chart: Mainframe, Client/Server, Internet, Mobile, social media, cloud, …]
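To get a feel for these rates, the per-minute figures can be rescaled to per-second and per-day values. The sketch below only rescales the numbers quoted on the slide; figures marked with "+" there are lower bounds, so the results are lower bounds too. The dictionary keys and function names are illustrative.

```python
# Per-60-second figures as quoted on the slide ("+" values are lower bounds).
PER_MINUTE = {
    "tweets": 100_000,
    "status updates": 695_000,
    "instant messages": 11_000_000,
    "Google searches": 698_445,
    "emails": 168_000_000,
    "TB of data created": 1_820,
    "new mobile users": 217,
}


def per_second(count_per_minute: int) -> float:
    """Average rate per second for a per-minute count."""
    return count_per_minute / 60


def per_day(count_per_minute: int) -> int:
    """Total per 24-hour day for a per-minute count."""
    return count_per_minute * 60 * 24


for name, count in PER_MINUTE.items():
    print(f"{name}: {per_second(count):,.1f} per second, {per_day(count):,} per day")
```

For example, 1,820 TB per minute works out to 2,620,800 TB per day at a constant rate.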
4. Value
Mechanism to bring the correct meaning out of the data
5. Veracity
Uncertainty and inconsistencies in the data
The V’s associated with Big Data may grow with time: Volume, Variety, Velocity, Value, Veracity, …
Big Data as an Opportunity
Big Data Analytics
Cost Reduction: Cost-effective storage system for huge data sets
Faster and Better Decision Making: Provides ways to analyze information quickly and make decisions
Improved Services or Products: Evaluation of customer needs & satisfaction
Next Generation Products: Automated Car, Healthcare, etc.
Many more opportunities
IBM Big Data Analytics
Big Data Collected by Smart Meter
Managing the large volume and velocity of information generated by short-interval reads of smart meter data can
overwhelm existing IT resources.
Earlier: data was collected once in 1 month
Now: data is collected every 15 minutes
Big Data generated by smart meters: 96 million reads per day for every million meters
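The figure of 96 million reads per day follows directly from the 15-minute read interval. A minimal sketch of the arithmetic, with illustrative function names; the 30-day month used for the comparison with monthly collection is an assumption, not a figure from the slides.

```python
MINUTES_PER_DAY = 24 * 60


def reads_per_day(interval_minutes: int, meters: int = 1) -> int:
    """Number of meter reads per day for a given read interval."""
    return (MINUTES_PER_DAY // interval_minutes) * meters


per_meter = reads_per_day(15)                  # 96 reads per meter per day
per_million = reads_per_day(15, 1_000_000)     # 96,000,000 reads per day

# Assumption: a 30-day month, to compare with one read per month earlier.
DAYS_PER_MONTH = 30
reads_per_month_now = per_meter * DAYS_PER_MONTH   # 2,880 reads per meter per month

print(per_meter, per_million, reads_per_month_now)
```

Under that assumption, each meter goes from 1 read per month to 2,880, which is the jump in volume and velocity that the slide describes.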
Problem with Smart Meter Big Data
To manage and use this information to gain insight, utility companies must be capable of high-volume data management and advanced
analytics designed to transform data into actionable insights.
Store, Analyze
How Smart Meter Big Data Is Analysed
Before and after analyzing Big Data:
• During peak load, users require more energy
• During off-peak times, users require less energy
• Energy utilization and billing have increased
• Time-of-use pricing encourages cost-savvy customers to run loads such as heavy industrial machines at off-peak times
IBM Smart Meter Solution
IBM offers an integrated suite of products designed to enable IT to leverage big data in a variety of ways that can
contribute to the success of energy companies.
IBM Solution: Data Analysis, Data Mining, Data Warehousing, User Data Security, Reporting
1. Managing smart meter data
2. Monitoring the distribution grid
3. Optimizing unit commitment
4. Optimizing energy trading
5. Forecasting and scheduling loads
ONCOR using IBM Smart Meter Solution
Oncor Electric Delivery has incorporated the IBM Smart Meter service.
1. Instrumented: Utilizes smart electricity meters to accurately measure the electricity usage of a household
2. Interconnected: Unprecedented access to detailed information about their electricity use
3. Intelligent: Consumers monitor and control their electricity usage through near-real-time readings of electricity meters
During the company’s biggest energy saver contest last year, customers in Oncor’s service territory showed that by
using the information from Oncor’s advanced meters, users reduced their electric usage and bills by 25 percent or more.
Problems with Encashing the Opportunity
Problems with Big Data
Problem 1: Storing exponentially growing huge datasets
• More data has been generated in the past 2 years than in all of previous history
• By 2020, total digital data will grow to approximately 44 zettabytes
• By 2020, about 1.7 MB of new information will be created every second for every person
Problem 2: Processing data having complex structure
Structured
• Organized data format
• Data schema is fixed
• Ex: RDBMS data, etc.
Semi-Structured
• Partially organized data
• Lacks the formal structure of a data model
• Ex: XML & JSON files, etc.
Unstructured
• Unorganized data
• Unknown schema
• Ex: multimedia files, etc.
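The three categories above can be illustrated with a small sketch. It uses an in-memory SQLite table as a stand-in for RDBMS data, JSON records for semi-structured data, and raw bytes for unstructured data. The table name, field names and sample values are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Structured: the RDBMS enforces a fixed schema on every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))
structured_rows = conn.execute("SELECT id, name FROM users").fetchall()

# A row that violates the schema (missing name) is rejected.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (2, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Semi-structured: records describe themselves, and fields may differ per record.
json_lines = ['{"id": 1, "name": "Asha"}', '{"id": 2, "tags": ["a", "b"]}']
records = [json.loads(line) for line in json_lines]
all_keys = sorted({key for record in records for key in record})

# Unstructured: opaque bytes with no schema to query against.
blob = bytes([0x89, 0x50, 0x4E, 0x47])  # hypothetical leading bytes of a media file

print(structured_rows, rejected, all_keys, len(blob))
```

The contrast is in who knows the structure: the database for structured data, each record for semi-structured data, and only the consuming application for unstructured data.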
Problem 3: Processing data faster
Data is growing at a much faster rate than disk read/write speeds.
Bringing huge amounts of data to the computation unit becomes a bottleneck.
[Diagram: a Master node connected to Slaves A, B, C, D and E]
Data Source: Tom’s Hardware
Hadoop-as-a-Solution
Hadoop - Solution to Big Data Problems
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
HDFS (Storage): Allows us to dump any kind of data across the cluster
MapReduce (Processing): Allows parallel processing of the data stored in HDFS
Hadoop Distributed File System
HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a single unit.
HDFS has two core components: NameNode and DataNode.
• The NameNode (master) is the main node that contains metadata about the data stored.
• Data is stored on the DataNodes (slaves), which are commodity hardware in the distributed environment.
[Diagram: Hadoop cluster with a NameNode (master) and DataNodes (slaves)]
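The split between metadata on the NameNode and data on the DataNodes can be sketched with a toy in-memory model. This is not the HDFS API: the class names, the block ID format and the round-robin placement are all assumptions made for illustration, and the sketch omits replication and failure handling.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where each block lives."""
    block_locations: dict = field(default_factory=dict)  # path -> [(block_id, datanode name)]

    def register(self, path, block_id, datanode_name):
        self.block_locations.setdefault(path, []).append((block_id, datanode_name))

    def locate(self, path):
        return self.block_locations[path]


@dataclass
class ToyDataNode:
    """Holds the actual block contents."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data


def write_file(namenode, datanodes, path, chunks):
    """Store each chunk on a DataNode and record its location on the NameNode."""
    for index, chunk in enumerate(chunks):
        node = datanodes[index % len(datanodes)]  # round-robin placement (assumption)
        block_id = f"{path}#blk{index}"           # hypothetical block ID format
        node.store(block_id, chunk)
        namenode.register(path, block_id, node.name)


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then fetch them from the DataNodes."""
    by_name = {node.name: node for node in datanodes}
    return b"".join(by_name[name].blocks[block_id] for block_id, name in namenode.locate(path))


namenode = ToyNameNode()
datanodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
write_file(namenode, datanodes, "/logs/a.txt", [b"aa", b"bb", b"cc"])
print(namenode.locate("/logs/a.txt"))
print(read_file(namenode, datanodes, "/logs/a.txt"))
```

The client sees one file path, while the bytes are spread over several nodes, which is the "single unit" abstraction described above.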
Storing Data (Solution)
Problem 1: Storing exponentially growing huge datasets
Solution: HDFS
▪ Storage unit of Hadoop
▪ It is a distributed file system
▪ Divides files (input data) into smaller chunks and stores them across the cluster
▪ Scalable as per requirement
Example: a 512 MB file is divided into four 128 MB blocks.
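The chunking in the 512 MB example is simple division by the block size. A minimal sketch, using the 128 MB block size shown on the slide as the default; the function name is illustrative, and the treatment of a smaller final block reflects how a file that is not an exact multiple of the block size would be split.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the sizes (in MB) of the blocks a file of the given size is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


print(split_into_blocks(512))  # the slide's example: four blocks of 128 MB
print(split_into_blocks(300))  # a size that is not a multiple of the block size
```

Because each block is an independent unit, the blocks can be placed on different machines, which is what lets storage scale by adding nodes.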
Store Different Kinds Of Data (Solution)
Problem 2: Storing unstructured data
Solution: HDFS
▪ Allows us to store any kind of data, be it structured, semi-structured or unstructured
▪ Follows WORM (Write Once Read Many)
▪ No schema validation is done while dumping data
[Diagram: HDFS with Write and Read operations]
Processing Data Faster (Solution)
Problem 3: Processing data faster
Solution: Hadoop MapReduce
▪ Provides parallel processing of data present in HDFS
▪ Allows data to be processed locally, i.e. each node works with the part of the data that is stored on it
[Figure: scenario 1 takes 4 hr.; scenario 2 takes 1 hr.]
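The map, shuffle and reduce steps can be sketched in plain Python with a word count. This is not the Hadoop MapReduce API: the function names and the sample chunks are illustrative, and the map calls run one after another here, whereas on a cluster each chunk would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group all emitted values by key across the mapped chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Combine the values for each key, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for the part of the data stored on one node.
chunks = ["big data big", "data hadoop", "hadoop big"]
mapped = [map_phase(chunk) for chunk in chunks]  # independent per chunk, so parallelizable
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each map call touches only its own chunk, the work divides across nodes without moving the raw data, which is the idea behind processing data locally.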
Hadoop Ecosystem
Hadoop provides a scalable solution to store and process huge data sets in a parallel and distributed
fashion.
Apache Hive is a data warehousing tool that allows us to perform big data analytics using Hive Query
Language, which is very similar to SQL.
Apache Pig is a platform used to analyze large data sets by representing them as data flows.
Apache Spark is an in-memory data processing engine that allows us to efficiently execute streaming,
machine learning or SQL workloads that require fast iterative access to datasets.
Apache HBase is a NoSQL database that allows us to store unstructured and semi-structured data
with ease and provides real-time read/write access.
Big Data & Hadoop Certification Training
Some Big Data & Hadoop Projects @ Edureka
Project #1: Analyze social bookmarking sites (Industry: Social Media)
Project #2: Customer Complaints Analysis (Industry: Retail)
Project #3: Tourism Data Analysis (Industry: Tourism)
Project #4: Airline Data Analysis (Industry: Aviation)
Project #5: Analyze Loan Dataset (Industry: Banking and Finance)
Project #6: Analyze Movie Ratings (Industry: Media)
Session In A Minute
How Data Evolved as Big Data
5 V’s of Big Data
Big Data as an Opportunity
Problems with Big Data
Hadoop-as-a-Solution
Big Data & Hadoop Training By Edureka