Introduction to Bigdata
Venkat Reddy

Contents
• What is Bigdata
• Sources of Bigdata
• What can be done with Bigdata?
• Handling Bigdata
• MapReduce
• Hadoop
• Hadoop components
• Hadoop ecosystem
• Bigdata example
• Other bigdata use cases

How much time did it take?
• Excel: Have you ever tried a pivot table on a 500 MB file?
• SAS/R: Have you ever tried a frequency table on a 2 GB file?
• Access: Have you ever tried running a query on a 10 GB file?
• SQL: Have you ever tried running a query on a 50 GB file?

Can you think of…
• What if we get a new data set like this every day?
• What if we need to execute complex queries on this data set every day?
• Does anybody really deal with this type of data set?
• Is it possible to store and analyze this data?
• Can you think of running a query on a 20,980,000 GB file?
• Yes, Google deals with more than 20 PB of data every day.

Yes… it’s true
• Google processes 20 PB a day (2008)
• The Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
That’s right.

In fact, in a minute…
• Email users send more than 204 million messages;
• The mobile Web receives 217 new users;
• Google receives over 2 million search queries;
• YouTube users upload 48 hours of new video;
• Facebook users share 684,000 bits of content;
• Twitter users send more than 100,000 tweets;
• Consumers spend $272,000 on Web shopping;
• Apple receives around 47,000 application downloads;
• Brands receive more than 34,000 Facebook 'likes';
• Tumblr blog owners publish 27,000 new posts;
• Instagram users share 3,600 new photos;
• Flickr users, on the other hand, add 3,125 new photos;
• Foursquare users perform 2,000 check-ins;
• WordPress users publish close to 350 new blog posts.
And this was one year back… Damn!!

What is a large file?
• Traditionally, many operating systems and their underlying file system implementations used 32-bit integers to represent file sizes and positions. Consequently, no file could be larger than 2^32 − 1 bytes (4 GB).
• In many implementations the problem was exacerbated by treating the sizes as signed numbers, which further lowered the limit to 2^31 − 1 bytes (2 GB).
• Files larger than this, too large for 32-bit operating systems to handle, came to be known as large files.
• If you are using a 32-bit OS, then a 4 GB file is a large file.
What the …

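As a quick arithmetic check (a small sketch added here, not part of the original slides), the two limits can be computed directly:

```python
# Sanity check of the 32-bit file-size limits described above.
unsigned_limit = 2**32 - 1   # largest size with an unsigned 32-bit length field
signed_limit = 2**31 - 1     # limit when the size is (mis)treated as signed

print(unsigned_limit, "bytes ~=", round(unsigned_limit / 2**30, 2), "GB")  # ~4 GB
print(signed_limit, "bytes ~=", round(signed_limit / 2**30, 2), "GB")      # ~2 GB
```
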
Definition of Bigdata
Sorry… there is no single standard definition…


Bigdata…
Any data that is difficult to
• Capture
• Curate
• Store
• Search
• Share
• Transfer
• Analyze
• and to create visualizations from

Bigdata means
• A collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications
• “Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
BTW, is it Bigdata / big data / Big data / bigdata / BigData / Big Data?

Bigdata is not just about size
• Volume
  • Data volumes are becoming unmanageable
  • Data complexity is growing: more types of data are captured than previously
• Velocity
  • Some data is arriving so rapidly that it must either be processed instantly or lost. This is a whole subfield called “stream processing”
• Variety (see the types of data on the next slide)

Types of data
• Relational data (tables / transactions / legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
  • Social networks, Semantic Web (RDF), …
• Streaming data
  • You can only scan the data once
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.


What can be done with Bigdata?
• Social media brand value analytics
• Product sentiment analysis
• Customer buying preference predictions
• Video analytics
• Fraud detection
• Aggregation and statistics
  • Data warehousing and OLAP
• Indexing, searching, and querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge discovery
  • Data mining
  • Statistical modeling

Ok… Analysis on this bigdata can give us awesome insights.
But the datasets are huge, complex, and difficult to process.
What is the solution?

Handling bigdata – parallel computing
• Imagine a 1 GB text file containing all the status updates posted on Facebook in a day
• Now suppose that a simple count of the number of rows, select count(*) from fb_status, takes 10 minutes
• What do you do if you have 6 months of data, a file of size 200 GB, and you still want the result in 10 minutes?
• Parallel computing?
  • Put multiple CPUs in a machine (100?)
  • Write code that computes 200 partial counts in parallel and finally sums them up
  • But you need a supercomputer

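To make the idea concrete, here is a minimal single-machine sketch of "compute partial counts in parallel, then sum them" using Python's multiprocessing module. The file name and the number of workers are made-up placeholders, and a real 200 GB job is exactly what motivates the distributed approach discussed next.

```python
# A minimal sketch of "split the work, count rows in parallel, sum the
# partial counts". File name and worker count are hypothetical.
import os
from multiprocessing import Pool

FILE = "fb_status.txt"  # hypothetical local copy of one day's status updates

def count_lines(byte_range):
    """Count the lines that *start* inside [start, end) of the file."""
    start, end = byte_range
    count = 0
    with open(FILE, "rb") as f:
        if start > 0:
            # if we landed mid-line, the previous chunk owns that line
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        else:
            f.seek(start)
        while f.tell() < end:
            if not f.readline():   # end of file
                break
            count += 1
    return count

if __name__ == "__main__":
    size = os.path.getsize(FILE)
    workers = 8
    step = size // workers
    ranges = [(i * step, size if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_lines, ranges)   # local calculations
    print("total rows:", sum(partial_counts))            # final summation
```
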
Handling bigdata – is there a better way?
• Until 1985, there was no way to connect multiple computers; all systems were centralized systems.
• So multi-core systems or supercomputers were the only options for big data problems.
• After 1985, we got powerful microprocessors and high-speed computer networks (LANs, WANs), which led to distributed systems.
• Now that we have distributed systems that make a collection of independent computers appear to their users as a single coherent system, can we use some cheap computers and process our bigdata quickly?

Distributed computing
• We want to cut the data into small pieces and place them on different machines
• Divide the overall problem into small tasks and run these small tasks locally
• Finally, collate the results from the local machines
• So we want to process our bigdata with a parallel programming model and an associated implementation
• This is known as MapReduce

MapReduce… programming model
• Processing data using special map() and reduce() functions
• The map() function is called on every item in the input and emits a series of intermediate key/value pairs (local calculation)
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key and its value list, and emits a value that is added to the output (final organization)

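The following is a tiny, single-process Python illustration of this flow (invented input records, no Hadoop involved): map() emits key/value pairs, the pairs are grouped by key, and reduce() produces one output value per key.

```python
# Minimal single-process illustration of the map -> group -> reduce flow.
# The input records are invented; a real MapReduce framework runs these
# phases across a cluster.
from collections import defaultdict

records = ["hadoop makes bigdata easy", "bigdata needs hadoop", "hadoop hadoop"]

def map_fn(record):
    # emit an intermediate (key, value) pair for every word
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # called once per unique key with all of its values
    return key, sum(values)

# map phase: local calculation on every input item
intermediate = [pair for record in records for pair in map_fn(record)]

# shuffle phase: group all values belonging to the same key
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# reduce phase: final organization, one call per unique key
output = [reduce_fn(key, values) for key, values in groups.items()]
print(output)   # e.g. [('hadoop', 4), ('makes', 1), ...]
```
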
Mummy’s MapReduce

Not just MapReduce
Earlier, count = count + 1 was sufficient, but now we need to:
1. Set up a cluster of machines, then divide the whole data set into blocks and store them on the local machines
2. Assign a master node that takes charge of all metadata, work scheduling and distribution, and job orchestration
3. Assign worker slots to execute map or reduce functions
4. Load balance (what if one machine in the cluster is very slow?)
5. Handle fault tolerance (what if the intermediate data is partially read, but the machine fails before all reduce/collation operations can complete?)
6. Finally, write the MapReduce code that solves our problem

Ok… Analysis on bigdata can give us awesome insights.
But the datasets are huge, complex, and difficult to process.
I found a solution: distributed computing, or MapReduce.
But this data storage and parallel processing looks complicated.
What is the solution?

Hadoop
• Hadoop is a bunch of tools; it has many components. HDFS and MapReduce are the two core components of Hadoop.
• HDFS: Hadoop Distributed File System
  • Makes it easy to store the data on commodity hardware
  • Built to expect hardware failures
  • Intended for large files and batch inserts
• MapReduce
  • For parallel processing
• So Hadoop is a software platform that lets one easily write and run applications that process bigdata.

Why Hadoop is useful
• Scalable: it can reliably store and process petabytes.
• Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
• Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.
• And Hadoop is free.

So what is Hadoop?
• Hadoop is a platform/framework
  • which allows the user to quickly write and test distributed systems
  • which automatically and efficiently distributes the data and work across machines
• Hadoop is not Bigdata
• Hadoop is not a database

Ok… Analysis on bigdata can give us awesome insights.
But the datasets are huge, complex, and difficult to process.
I found a solution: distributed computing, or MapReduce.
But this data storage and parallel processing looks complicated.
Ok, I can use the Hadoop framework… but I don’t know Java. How do I write MapReduce programs?

MapReduce made easy
• Hive:
  • Hive is for data analysts with strong SQL skills, providing an SQL-like interface and a relational data model
  • Hive uses a language called HiveQL, very similar to SQL
  • Hive translates queries into a series of MapReduce jobs
• Pig:
  • Pig is a high-level platform for processing big data on Hadoop clusters
  • Pig consists of a data flow language, called Pig Latin, that supports writing queries on large datasets, and an execution environment that runs programs from a console
  • Pig Latin programs consist of a series of dataset transformations that are converted, under the covers, into a series of MapReduce jobs
• Mahout:
  • Mahout is an open-source machine-learning library that facilitates building scalable machine-learning algorithms

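Another common way to avoid writing Java is Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as the mapper or reducer. The sketch below is illustrative only: the input format (each line ending in a tab-separated, space-separated tag list), the script name, and the invocation are assumptions, not code from this deck.

```python
# Hadoop Streaming sketch in Python (tag_count.py). One file plays both
# roles depending on the argument passed. Invocation would look roughly like:
#   hadoop jar hadoop-streaming.jar -input stack_data -output tag_counts \
#       -mapper "python tag_count.py map" -reducer "python tag_count.py reduce"
# (jar path, input/output names, and the record layout are placeholders)
import sys

def run_mapper():
    # emit (tag, 1) for every tag on every input line;
    # we assume the last tab-separated field holds space-separated tags
    for line in sys.stdin:
        tags = line.rstrip("\n").split("\t")[-1]
        for tag in tags.split():
            print(tag + "\t1")

def run_reducer():
    # streaming feeds the reducer lines sorted by key, so equal tags are adjacent
    current_tag, count = None, 0
    for line in sys.stdin:
        tag, value = line.rstrip("\n").split("\t")
        if tag != current_tag and current_tag is not None:
            print(current_tag + "\t" + str(count))
            count = 0
        current_tag = tag
        count += int(value)
    if current_tag is not None:
        print(current_tag + "\t" + str(count))

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```
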
Hadoop ecosystem

Bigdata ecosystem

Bigdata example
• The business problem:
  • Analyze this week’s Stack Overflow data (http://stackoverflow.com/)
  • What are the most popular topics this week?
• Approach:
  • Find some simple descriptive statistics for each field
    • Total questions
    • Total unique tags
    • Frequency of each tag, etc.
  • The tag with the maximum frequency is the most popular topic
  • Let’s use Hadoop to find these values, since we can’t rapidly process this data with the usual tools

Bigdata example: dataset
A 7 GB text file containing questions and their respective tags

Move the dataset to HDFS
• The file size is 6.99 GB; it has been automatically cut into several pieces/blocks, and the size of each block is 64 MB
• This can be done with a single simple command:
  bin/hadoop fs -copyFromLocal /home/final_stack_data stack_data
• The data was later copied into a Hive table

Data in HDFS: Hadoop Distributed File System
• Each block is 64 MB and the total file size is 7 GB, so there are 112 blocks in total

Processing the data
• Here is our query: what is the total number of entries in this file?
• MapReduce is about to start

MapReduce jobs in progress

The execution time
(screenshots of the query result and the runtime)
• Note: I ran Hadoop on a very basic machine (1.5 GB RAM, i3 processor, 32-bit virtual machine).
• This example is just for demo purposes; the same query would take much less time on a multi-node cluster setup.

Bigdata example: results
• The query returns the row count, which means there are nearly 6 million Stack Overflow questions and tags
• ‘C’ happens to be the most popular tag
• It took around 15 minutes to get these insights
• Similarly, we can run other MapReduce jobs on the tags to find the most frequent topics

Advanced analytics…
• In the above example, we have the Stack Overflow questions and their corresponding tags
• Can we use some supervised machine learning technique to predict the tags for new questions?
• Can you write the MapReduce code for a Naïve Bayes algorithm or a random forest?
• How does Wikipedia highlight some words in your text as hyperlinks?
• How can YouTube suggest relevant tags after you upload a video?
• How does Amazon recommend you a new product?
• How are companies leveraging bigdata analytics?

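As a toy, single-machine illustration of the tag-prediction idea (a sketch using scikit-learn on invented data, not the MapReduce implementation the slide asks about):

```python
# Toy sketch of tag prediction with Naive Bayes on invented data.
# A real solution over the full dataset would express the same idea as
# MapReduce jobs or use a library such as Mahout.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

questions = [
    "how do I join two tables in sql",
    "select rows where a column is null",
    "segmentation fault when freeing a pointer in c",
    "undefined behaviour with pointer arithmetic in c",
]
tags = ["sql", "sql", "c", "c"]   # one (primary) tag per question

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(questions)   # bag-of-words features

model = MultinomialNB()
model.fit(X, tags)

new_question = ["inner join on two tables returns duplicates"]
print(model.predict(vectorizer.transform(new_question)))   # likely ['sql']
```
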
Bigdata use cases
• Amazon
  • Amazon has been collecting customer information for years: not just addresses and payment information, but the identity of everything a customer has ever bought or even looked at.
  • While dozens of other companies do that too, Amazon is doing something remarkable with its data: it uses that data to build customer relationships.
• Ford
  • Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software.
  • The data lets Ford glean information on a range of issues, from how drivers are using their vehicles to the driving environment, which could help improve the quality of the vehicles.
• LinkedIn
  • Corporations and investors want to track the consumer market as closely as possible to spot trends that will inform their next product launches.
  • LinkedIn is a bank of data not just about people, but about how people make their money, what industries they work in, and how they connect to each other.

Bigdata use cases
• The largest retail company in the world (Fortune #1 of 500)
  • Largest sales data warehouse: Retail Link, a $4 billion project (1991); one of the largest “civilian” data warehouses in the world (2004: 460 terabytes, when the entire Internet was half as large)
  • Defines data science: what do hurricanes, strawberry Pop-Tarts, and beer have in common?
• An industrial giant
  • Includes financial and marketing applications, but with a special focus on industrial uses of big data
  • When will this gas turbine need maintenance? How can we optimize the performance of a locomotive? What is the best way to make decisions about energy finance?
• AT&T
  • AT&T has 300 million customers. A team of researchers is working to turn data collected through the company’s cellular network into a trove of information for policymakers, urban planners, and traffic engineers.
  • The researchers want to see how the city changes hourly by looking at calls and text messages relayed through cell towers around the region, noting that certain towers see more activity at different times.

Thank you
– Venkat Reddy

Recommended for you

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data

Slides used for the keynote at the even Big Data & Data Science http://eventos.citius.usc.es/bigdata/ Some slides are borrowed from random hadoop/big data presentations

hadoopmachine learningbig data
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)

1) Introduction to the key Big Data concepts 1.1 The Origins of Big Data 1.2 What is Big Data ? 1.3 Why is Big Data So Important ? 1.4 How Is Big Data Used In Practice ? 2) Introduction to the key principles of Big Data Systems 2.1 How to design Data Pipeline in 6 steps 2.2 Using Lambda Architecture for big data processing 3) Practical case study : Chat bot with Video Recommendation Engine 4) FAQ for student

big datalecturedata science
Large scale computing
Large scale computing Large scale computing
Large scale computing

LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures like distributed search engines and custom graph engines to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.

More Related Content

What's hot

Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
Frans van Noort
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
Tomy Rhymond
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
Ahmed Salman
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
Bart Vandewoestyne
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
DataWorks Summit
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
Serkan Özal
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
MaulikLakhani
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Amir Shaikh
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
Roman Nikitchenko
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
hybrid cloud
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
Cloudera, Inc.
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
C. Scyphers
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
Md. Hasan Basri (Angel)
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
KCC Software Ltd. & Easylearning.guru
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
Apache Apex
 

What's hot (20)

Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Emergent Distributed Data Storage
Emergent Distributed Data StorageEmergent Distributed Data Storage
Emergent Distributed Data Storage
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 

Similar to A data analyst view of Bigdata

Big Data
Big DataBig Data
Big Data
Mahesh Bmn
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
VIJAYAPRABAP
 
bigdata.pdf
bigdata.pdfbigdata.pdf
bigdata.pdf
AnjaliKumari301316
 
Big data technology
Big data technology Big data technology
Big data technology
omer mohamed abd alrhman
 
bigdata 2.pptx
bigdata 2.pptxbigdata 2.pptx
bigdata 2.pptx
AjayAgarwal107
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
Nagarjuna D.N
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
Sofian Hadiwijaya
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Roi Blanco
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
Trieu Nguyen
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
Bhupesh Bansal
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
Dharmesh Vaya
 
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Soujanya V
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 
The Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystemsThe Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystems
taimur hafeez
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Zohar Elkayam
 

Similar to A data analyst view of Bigdata (20)

Big Data
Big DataBig Data
Big Data
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
 
bigdata.pdf
bigdata.pdfbigdata.pdf
bigdata.pdf
 
Big data technology
Big data technology Big data technology
Big data technology
 
bigdata 2.pptx
bigdata 2.pptxbigdata 2.pptx
bigdata 2.pptx
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Big Data
Big DataBig Data
Big Data
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
Bigdataissueschallengestoolsngoodpractices 141130054740-conversion-gate01
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
The Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystemsThe Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystems
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 

More from Venkata Reddy Konasani

Transformers 101
Transformers 101 Transformers 101
Transformers 101
Venkata Reddy Konasani
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
Venkata Reddy Konasani
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
Venkata Reddy Konasani
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
Venkata Reddy Konasani
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
Venkata Reddy Konasani
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
Venkata Reddy Konasani
 
Decision tree
Decision treeDecision tree
Decision tree
Venkata Reddy Konasani
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
Venkata Reddy Konasani
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
Venkata Reddy Konasani
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
Venkata Reddy Konasani
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
Venkata Reddy Konasani
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
Venkata Reddy Konasani
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
Venkata Reddy Konasani
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Venkata Reddy Konasani
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
Venkata Reddy Konasani
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
Venkata Reddy Konasani
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
Venkata Reddy Konasani
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
Venkata Reddy Konasani
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
Venkata Reddy Konasani
 
ARIMA
ARIMA ARIMA

More from Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
ARIMA
ARIMA ARIMA
ARIMA
 

Recently uploaded

Unlocking Educational Synergy-DIKSHA & Google Classroom.pptx
Unlocking Educational Synergy-DIKSHA & Google Classroom.pptxUnlocking Educational Synergy-DIKSHA & Google Classroom.pptx
Unlocking Educational Synergy-DIKSHA & Google Classroom.pptx
bipin95
 
BRIGADA ESKWELA OPENING PROGRAM KICK OFF.pptx
BRIGADA ESKWELA OPENING PROGRAM KICK OFF.pptxBRIGADA ESKWELA OPENING PROGRAM KICK OFF.pptx
BRIGADA ESKWELA OPENING PROGRAM KICK OFF.pptx
kambal1234567890
 
Ardra Nakshatra (आर्द्रा): Understanding its Effects and Remedies
Ardra Nakshatra (आर्द्रा): Understanding its Effects and RemediesArdra Nakshatra (आर्द्रा): Understanding its Effects and Remedies
Ardra Nakshatra (आर्द्रा): Understanding its Effects and Remedies
Astro Pathshala
 
Lecture_Notes_Unit4_Chapter_8_9_10_RDBMS for the students affiliated by alaga...
Lecture_Notes_Unit4_Chapter_8_9_10_RDBMS for the students affiliated by alaga...Lecture_Notes_Unit4_Chapter_8_9_10_RDBMS for the students affiliated by alaga...
Lecture_Notes_Unit4_Chapter_8_9_10_RDBMS for the students affiliated by alaga...
Murugan Solaiyappan
 
How to Store Data on the Odoo 17 Website
How to Store Data on the Odoo 17 WebsiteHow to Store Data on the Odoo 17 Website
How to Store Data on the Odoo 17 Website
Celine George
 
National Learning Camp( Reading Intervention for grade1)
National Learning Camp( Reading Intervention for grade1)National Learning Camp( Reading Intervention for grade1)
National Learning Camp( Reading Intervention for grade1)
SaadaGrijaldo1
 
The Jewish Trinity : Sabbath,Shekinah and Sanctuary 4.pdf
The Jewish Trinity : Sabbath,Shekinah and Sanctuary 4.pdfThe Jewish Trinity : Sabbath,Shekinah and Sanctuary 4.pdf
The Jewish Trinity : Sabbath,Shekinah and Sanctuary 4.pdf
JackieSparrow3
 
AI_in_HR_Presentation Part 1 2024 0703.pdf
AI_in_HR_Presentation Part 1 2024 0703.pdfAI_in_HR_Presentation Part 1 2024 0703.pdf
AI_in_HR_Presentation Part 1 2024 0703.pdf
SrimanigandanMadurai
 
Credit limit improvement system in odoo 17
Credit limit improvement system in odoo 17Credit limit improvement system in odoo 17
Credit limit improvement system in odoo 17
Celine George
 
How to Handle the Separate Discount Account on Invoice in Odoo 17
How to Handle the Separate Discount Account on Invoice in Odoo 17How to Handle the Separate Discount Account on Invoice in Odoo 17
How to Handle the Separate Discount Account on Invoice in Odoo 17
Celine George
 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
marianell3076
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
heathfieldcps1
 
L1 L2- NLC PPT for Grade 10 intervention
L1 L2- NLC PPT for Grade 10 interventionL1 L2- NLC PPT for Grade 10 intervention
L1 L2- NLC PPT for Grade 10 intervention
RHODAJANEAURESTILA
 
NLC Grade 3.................................... ppt.pptx
NLC Grade 3.................................... ppt.pptxNLC Grade 3.................................... ppt.pptx
NLC Grade 3.................................... ppt.pptx
MichelleDeLaCruz93
 
NAEYC Code of Ethical Conduct Resource Book
NAEYC Code of Ethical Conduct Resource BookNAEYC Code of Ethical Conduct Resource Book
NAEYC Code of Ethical Conduct Resource Book
lakitawilson
 
Understanding and Interpreting Teachers’ TPACK for Teaching Multimodalities i...
Understanding and Interpreting Teachers’ TPACK for Teaching Multimodalities i...Understanding and Interpreting Teachers’ TPACK for Teaching Multimodalities i...
Understanding and Interpreting Teachers’ TPACK for Teaching Multimodalities i...
Neny Isharyanti
 
How to Install Theme in the Odoo 17 ERP
How to  Install Theme in the Odoo 17 ERPHow to  Install Theme in the Odoo 17 ERP
How to Install Theme in the Odoo 17 ERP
Celine George
 
How to Configure Time Off Types in Odoo 17
How to Configure Time Off Types in Odoo 17How to Configure Time Off Types in Odoo 17
How to Configure Time Off Types in Odoo 17
Celine George
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 12 - GLOBAL SUCCESS - FORM MỚI 2025 - HK1 (C...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 12 - GLOBAL SUCCESS - FORM MỚI 2025 - HK1 (C...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 12 - GLOBAL SUCCESS - FORM MỚI 2025 - HK1 (C...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 12 - GLOBAL SUCCESS - FORM MỚI 2025 - HK1 (C...
Nguyen Thanh Tu Collection
 
(T.L.E.) Agriculture: Essentials of Gardening
(T.L.E.) Agriculture: Essentials of Gardening(T.L.E.) Agriculture: Essentials of Gardening
A data analyst view of Bigdata

• 6. In fact, in a minute…
  • Email users send more than 204 million messages;
  • Mobile Web receives 217 new users;
  • Google receives over 2 million search queries;
  • YouTube users upload 48 hours of new video;
  • Facebook users share 684,000 bits of content;
  • Twitter users send more than 100,000 tweets;
  • Consumers spend $272,000 on Web shopping;
  • Apple receives around 47,000 application downloads;
  • Brands receive more than 34,000 Facebook 'likes';
  • Tumblr blog owners publish 27,000 new posts;
  • Instagram users share 3,600 new photos;
  • Flickr users, on the other hand, add 3,125 new photos;
  • Foursquare users perform 2,000 check-ins;
  • WordPress users publish close to 350 new blog posts.
  • And this is one year back… Damn!!
• 7. What is a large file?
  • Traditionally, many operating systems and their underlying file system implementations used 32-bit integers to represent file sizes and positions. Consequently, no file could be larger than 2^32 − 1 bytes (4 GB).
  • In many implementations the problem was exacerbated by treating the sizes as signed numbers, which further lowered the limit to 2^31 − 1 bytes (2 GB).
  • Files larger than this, too large for 32-bit operating systems to handle, came to be known as large files.
  • What the…
  • If you are using a 32-bit OS, then a 4 GB file is a large file.
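As a quick arithmetic check on those limits (this snippet is an illustration added here, not part of the original slide), the unsigned and signed 32-bit maxima work out to roughly 4 GiB and 2 GiB:

    # Unsigned vs. signed 32-bit file-size limits, expressed in GiB
    unsigned_limit = 2**32 - 1        # 4,294,967,295 bytes
    signed_limit = 2**31 - 1          # 2,147,483,647 bytes
    print(unsigned_limit / 2**30)     # ~4.0 GiB
    print(signed_limit / 2**30)       # ~2.0 GiB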
• 8. Definition of Bigdata
  • Sorry… there is no single standard definition…
• 9. Bigdata…
  • Any data that is difficult to
    • Capture
    • Curate
    • Store
    • Search
    • Share
    • Transfer
    • Analyze
    • and to create visualizations
• 10. Bigdata means
  • A collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications
  • "Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
  • BTW, is it Bigdata / big data / Big data / bigdata / BigData / Big Data?
• 11. Bigdata is not just about size
  • Volume
    • Data volumes are becoming unmanageable
  • Variety
    • Data complexity is growing; more types of data are captured than previously
  • Velocity
    • Some data is arriving so rapidly that it must either be processed instantly, or lost. This is a whole subfield called "stream processing"
• 12. Types of data
  • Relational Data (Tables/Transaction/Legacy Data)
  • Text Data (Web)
  • Semi-structured Data (XML)
  • Graph Data
    • Social Network, Semantic Web (RDF), …
  • Streaming Data
    • You can only scan the data once
  • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
• 13. What can be done with Bigdata?
  • Aggregation and Statistics
    • Data warehouse and OLAP
  • Indexing, Searching, and Querying
    • Keyword-based search
    • Pattern matching (XML/RDF)
  • Knowledge discovery
    • Data Mining
    • Statistical Modeling
  • Social media brand value analytics
  • Product sentiment analysis
  • Customer buying preference predictions
  • Video analytics
  • Fraud detection
• 14. Ok… Analysis on this bigdata can give us awesome insights
  • But, datasets are huge, complex and difficult to process
  • What is the solution?
• 15. Handling bigdata – Parallel computing
  • Imagine a 1 GB text file, all the status updates on Facebook in a day
    • Select count(*) from fb_status
  • Now suppose that a simple count of the number of rows takes 10 minutes.
  • What do you do if you have 6 months of data, a 200 GB file, and you still want the result in 10 minutes?
  • Parallel computing?
    • Put multiple CPUs in a machine (100?)
    • Write code that computes 200 partial counts in parallel and finally sums them up (a small single-machine sketch of this idea follows)
    • But you need a super computer
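As an illustration only (not from the deck), here is a minimal single-machine sketch of the "split, count in parallel, sum" idea. The chunk file names are hypothetical, and a real 200 GB job would need the data spread across machines, which is exactly where the next slides go:

    # Count rows of a large file by counting pre-split chunks in parallel,
    # then summing the partial counts. Chunk names below are made up.
    from multiprocessing import Pool

    def count_rows(chunk_path):
        # Count lines in one chunk of the status file
        with open(chunk_path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        chunks = ["fb_status_part_%02d.txt" % i for i in range(8)]
        with Pool(processes=8) as pool:
            partial_counts = pool.map(count_rows, chunks)   # local counts
        print("total rows:", sum(partial_counts))           # final sum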
• 16. Handling bigdata – Is there a better way?
  • Until 1985, there was no way to connect multiple computers; all systems were centralized systems.
    • So multi-core systems or super computers were the only options for big data problems
  • After 1985, we got powerful microprocessors and high-speed computer networks (LANs, WANs), which led to distributed systems
  • Now that we have distributed systems that make a collection of independent computers appear to its users as a single coherent system, can we use some cheap computers and process our bigdata quickly?
• 17. Distributed computing
  • We want to cut the data into small pieces and place them on different machines
  • Divide the overall problem into small tasks and run these small tasks locally
  • Finally, collate the results from the local machines
  • So, we want to process our bigdata with a parallel programming model and an associated implementation
  • This is known as MapReduce
• 18. MapReduce… Programming Model
  • Processing data using special map() and reduce() functions
  • The map() function is called on every item in the input and emits a series of intermediate key/value pairs (local calculation)
  • All values associated with a given key are grouped together
  • The reduce() function is called on every unique key and its value list, and emits a value that is added to the output (final organization); a small word-count sketch of this flow follows
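A toy, single-process walk-through of the map, group-by-key, reduce flow described above, using word counting. This is a sketch of the programming model only, not of how Hadoop itself is programmed:

    # map -> sort/group by key -> reduce, simulated in one process
    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        for word in line.split():
            yield (word, 1)                    # intermediate key/value pairs

    def reduce_fn(key, values):
        return (key, sum(values))              # one output value per key

    lines = ["big data is big", "data is everywhere"]
    intermediate = [pair for line in lines for pair in map_fn(line)]
    intermediate.sort(key=itemgetter(0))       # group values by key
    output = [reduce_fn(key, [v for _, v in grp])
              for key, grp in groupby(intermediate, key=itemgetter(0))]
    print(output)  # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 2)]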
• 19. Mummy's MapReduce
• 20. Not just MapReduce
  • Earlier, count = count + 1 was sufficient; but now we need to
    1. Set up a cluster of machines, then divide the whole data set into blocks and store them on the local machines
    2. Assign a master node that takes charge of all metadata, work scheduling and distribution, and job orchestration
    3. Assign worker slots to execute map or reduce functions
    4. Load balance (what if one machine in the cluster is very slow?)
    5. Handle fault tolerance (what if the intermediate data is partially read, but the machine fails before all reduce (collation) operations can complete?)
    6. Finally, write the MapReduce code that solves our problem
• 21. Ok… Analysis on bigdata can give us awesome insights
  • But, datasets are huge, complex and difficult to process
  • I found a solution: distributed computing, or MapReduce
  • But it looks like this data storage & parallel processing is complicated
  • What is the solution?
• 22. Hadoop
  • So Hadoop is a software platform that lets one easily write and run applications that process bigdata
  • Hadoop is a bunch of tools; it has many components. HDFS and MapReduce are the two core components of Hadoop
  • HDFS: Hadoop Distributed File System
    • Makes it easy to store the data on commodity hardware
    • Built to expect hardware failures
    • Intended for large files & batch inserts
  • MapReduce
    • For parallel processing
• 23. Why Hadoop is useful
  • Scalable: It can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers (in the thousands).
  • Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
  • Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
  • And Hadoop is free
• 24. So what is Hadoop?
  • Hadoop is a platform/framework
    • Which allows the user to quickly write and test distributed systems
    • Which is efficient in automatically distributing the data and work across machines
  • Hadoop is not Bigdata
  • Hadoop is not a database
• 25. Ok… Analysis on bigdata can give us awesome insights
  • But, datasets are huge, complex and difficult to process
  • I found a solution: distributed computing, or MapReduce
  • But it looks like this data storage & parallel processing is complicated
  • Ok, I can use the Hadoop framework… but I don't know Java, so how do I write MapReduce programs?
• 26. MapReduce made easy
  • Hive:
    • Hive is for data analysts with strong SQL skills, providing an SQL-like interface and a relational data model
    • Hive uses a language called HiveQL, very similar to SQL
    • Hive translates queries into a series of MapReduce jobs
  • Pig:
    • Pig is a high-level platform for processing big data on Hadoop clusters
    • Pig consists of a data flow language, called Pig Latin, that supports writing queries on large datasets, and an execution environment that runs programs from a console
    • Pig Latin programs consist of a series of dataset transformations that are converted, under the covers, into a series of MapReduce jobs
  • Mahout:
    • Mahout is an open-source machine-learning library that facilitates building scalable machine learning applications
• 27. Hadoop ecosystem
• 28. Bigdata ecosystem
• 29. Bigdata example
  • The business problem:
    • Analyze this week's stack overflow data (http://stackoverflow.com/)
    • What are the most popular topics this week?
  • Approach:
    • Find some simple descriptive statistics for each field
      • Total questions
      • Total unique tags
      • Frequency of each tag, etc.
    • The 'tag' with the maximum frequency is the most popular topic
    • Let's use Hadoop to find these values, since we can't rapidly process this data with the usual tools (a small sketch of the tag-frequency logic follows this list)
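Purely as an illustration of the tag-frequency logic (the deck itself runs this through Hive on Hadoop, and the real file layout is not shown, so the input format below is an assumption):

    # Toy map/reduce-style tag counting. Assumes each line looks like
    # "question_id<TAB>comma-separated tags"; the actual Stack Overflow
    # dump format used in the deck may differ.
    from collections import defaultdict

    def map_tags(line):
        _qid, tags = line.rstrip("\n").split("\t", 1)
        for tag in tags.split(","):
            yield (tag, 1)                    # intermediate (tag, 1) pairs

    def reduce_counts(pairs):
        counts = defaultdict(int)
        for tag, one in pairs:                # group and sum per unique tag
            counts[tag] += one
        return counts

    sample = ["1\tc,arrays", "2\tjava", "3\tc,pointers"]
    counts = reduce_counts(p for line in sample for p in map_tags(line))
    print(max(counts, key=counts.get))        # most frequent tag in the sample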
• 30. Bigdata example: Dataset
  • 7 GB text file, contains questions and respective tags
• 31. Move the dataset to HDFS
  • The file size is 6.99 GB; it has been automatically cut into several pieces/blocks, and the size of each block is 64 MB
  • This can be done by just using a simple command:
      bin/hadoop fs -copyFromLocal /home/final_stack_data stack_data
  • *Data later copied into a Hive table
• 32. Data in HDFS: Hadoop Distributed File System
  • Each block is 64 MB and the total file size is 7 GB, so there are 112 blocks in total (arithmetic check below)
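A quick arithmetic check of that block count (added here as an illustration, not part of the original slide):

    # 7 GB split into 64 MB HDFS blocks
    file_size_mb = 7 * 1024                 # 7168 MB
    block_size_mb = 64
    print(file_size_mb // block_size_mb)    # 112 blocks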
• 33. Processing the data
  • What is the total number of entries in this file?
  • Here is our query; MapReduce is about to start
• 34. MapReduce jobs in progress
• 35. The execution time (runtime) and the result
  • Note: I ran Hadoop on a very basic machine (1.5 GB RAM, i3 processor, 32-bit virtual machine).
  • This example is just for demo purposes; the same query will take much less time if we run it on a multi-node cluster setup
• 36. Bigdata example: Results
  • The query returns the count, which means there are nearly 6 million stack overflow questions and tags
  • Similarly, we can run other MapReduce jobs on the tags to find the most frequent topics
  • 'C' happens to be the most popular tag
  • It took around 15 minutes to get these insights
• 37. Advanced analytics…
  • In the above example, we have the stack overflow questions and their corresponding tags
  • Can we use some supervised machine learning technique to predict the tags for new questions? (a small single-machine sketch follows this list)
  • Can you write the MapReduce code for the Naïve Bayes algorithm / Random Forest?
  • How does Wikipedia highlight some words in your text as hyperlinks?
  • How can YouTube suggest relevant tags after you upload a video?
  • How does Amazon recommend you a new product?
  • How are companies leveraging bigdata analytics?
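A minimal single-machine sketch of the tag-prediction idea, using scikit-learn's Naive Bayes on made-up toy data. This is only an illustration of the supervised-learning question raised above; it is not MapReduce code and is not from the deck:

    # Toy tag prediction with Naive Bayes; questions and tags are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    questions = ["segmentation fault in pointer arithmetic",
                 "null pointer exception in spring controller",
                 "how to malloc a 2d array",
                 "jvm garbage collection pauses"]
    tags = ["c", "java", "c", "java"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(questions)        # bag-of-words features
    model = MultinomialNB().fit(X, tags)

    new_question = ["pointer to struct in c"]
    print(model.predict(vectorizer.transform(new_question)))   # likely ['c']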
• 38. Bigdata use cases
  • Amazon has been collecting customer information for years: not just addresses and payment information, but the identity of everything that a customer has ever bought or even looked at. While dozens of other companies do that too, Amazon is doing something remarkable with theirs; they are using that data to build customer relationships.
  • Ford collects and aggregates data from the 4 million vehicles that use in-car sensing and remote app management software. The data allows Ford to glean information on a range of issues, from how drivers are using their vehicles to the driving environment, which could help them improve the quality of the vehicle.
  • Corporations and investors want to be able to track the consumer market as closely as possible to signal trends that will inform their next product launches. LinkedIn is a bank of data not just about people, but about how people are making their money, what industries they are working in, and how they connect to each other.
• 39. Bigdata use cases
  • Largest retail company in the world (Fortune 1 of 500)
    • Largest sales data warehouse: Retail Link, a $4 billion project (1991)
    • One of the largest "civilian" data warehouses in the world: in 2004, 460 terabytes, with the Internet half as large
    • Defines data science: what do hurricanes, strawberry Pop-Tarts, and beer have in common?
  • Includes financial and marketing applications, but with a special focus on industrial uses of big data
    • When will this gas turbine need maintenance?
    • How can we optimize the performance of a locomotive?
    • What is the best way to make decisions about energy finance?
  • AT&T has 300 million customers. A team of researchers is working to turn data collected through the company's cellular network into a trove of information for policymakers, urban planners and traffic engineers. The researchers want to see how the city changes hourly by looking at calls and text messages relayed through cell towers around the region, noting that certain towers see more activity at different times.
• 40. Thank you
  -Venkat Reddy