Intro to Big Data using Hadoop




                       Sergejus Barinovas
                       sergejus.blogas.lt
                       fb.com/ITishnikai
                       @sergejusb
Information is powerful…
but it is how we use it that will define us
Data Explosion


 – text, audio, video, images
 – relational

(picture from Big Data Integration)
Big Data (globally)


– creates over 30 billion pieces of content per day

– stores 30 petabytes of data



– produces over 90 million tweets per day

Big Data (our example)



– logs over 300 gigabytes of transactions per day

– stores more than 1.5 terabytes of aggregated data
4 Vs of Big Data


 – volume
 – velocity
 – variety
 – variability
Big Data Challenges


Sort 10 TB on 1 node = 2.5 days

Sort 10 TB on a 100-node cluster = 35 minutes
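
The two sort times above can be compared with a little arithmetic. The sketch below uses only the slide's figures (2.5 days, 35 minutes, 100 nodes); the function name `compareSortTimes` is made up for illustration.

```javascript
// Figures from the slide: 10 TB sorted in 2.5 days on 1 node, 35 minutes on 100 nodes.
const MINUTES_PER_DAY = 24 * 60;

// Hypothetical helper: derives speedup and scaling efficiency from the two timings.
function compareSortTimes(singleNodeDays, clusterMinutes, nodeCount) {
  const singleNodeMinutes = singleNodeDays * MINUTES_PER_DAY;
  const speedup = singleNodeMinutes / clusterMinutes;
  return {
    singleNodeMinutes,
    speedup,
    efficiency: speedup / nodeCount, // 1.0 would be perfectly linear scaling
  };
}

const result = compareSortTimes(2.5, 35, 100);
console.log(`single node: ${result.singleNodeMinutes} minutes`);
console.log(`speedup: ${result.speedup.toFixed(1)}x`);
console.log(`efficiency: ${(result.efficiency * 100).toFixed(0)}%`);
```

With these numbers the speedup is roughly 103x on 100 nodes, which is close to linear scaling. The small excess over 100 is probably rounding in the quoted times.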
Big Data Challenges

“Fat” servers imply high cost
 – use cheap commodity nodes instead


A large number of cheap nodes implies frequent failures
 – leverage automatic fault tolerance

Big Data Challenges



We need a new data-parallel programming
model for clusters of commodity machines
MapReduce
to the rescue!
MapReduce


Published in 2004 by Google
 – MapReduce: Simplified Data Processing on Large Clusters




Popularized by Apache Hadoop project
 – used by Yahoo!, Facebook, Twitter, Amazon, …
MapReduce

Word Count Example
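 Input → Map → Shuffle & Sort → Reduce → Output

 Input (one split per Map task)
  – Map 1: the quick brown fox
  – Map 2: the fox ate the mouse
  – Map 3: how now brown cow

 Output (one list per Reduce task)
  – Reduce 1: the, 3; brown, 2; fox, 2; how, 1; now, 1
  – Reduce 2: quick, 1; ate, 1; mouse, 1; cow, 1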
Word Count Example
 Input → Map → Shuffle & Sort → Reduce → Output

 Map output
  – Map 1 (the quick brown fox): the, 1; quick, 1; brown, 1; fox, 1
  – Map 2 (the fox ate the mouse): the, 1; fox, 1; ate, 1; the, 1; mouse, 1
  – Map 3 (how now brown cow): how, 1; now, 1; brown, 1; cow, 1

 Shuffle & Sort (input to each Reduce task)
  – Reduce 1: the, 1; brown, 1; fox, 1; the, 1; fox, 1; the, 1; how, 1; now, 1; brown, 1
  – Reduce 2: quick, 1; ate, 1; mouse, 1; cow, 1
Word Count Example
 Input → Map → Shuffle & Sort → Reduce → Output

 Input (one split per Map task)
  – the quick brown fox
  – the fox ate the mouse
  – how now brown cow

 Shuffle & Sort (values grouped by key)
  – Reduce 1: the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]
  – Reduce 2: quick, [1]; ate, [1]; mouse, [1]; cow, [1]

 Output
  – Reduce 1: the, 3; brown, 2; fox, 2; how, 1; now, 1
  – Reduce 2: quick, 1; ate, 1; mouse, 1; cow, 1
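
The word count walk-through above can be reproduced in a single process. The sketch below is only an illustration of the three phases, not Hadoop code: the input splits come from the slides, while the helper names (`mapSplit`, `partition`, `shuffleAndSort`, `reduceGroup`) and the simple string hash used to pick a reducer are assumptions. Because of that hash, words may land on a different reducer than in the slides, but the totals are the same.

```javascript
// Input splits copied from the slides; one split per Map task.
const splits = [
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow",
];
const REDUCERS = 2; // the slides show two Reduce tasks

// Map: emit a (word, 1) pair for every word in a split.
function mapSplit(text) {
  return text.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

// Partition: choose a reducer for a key (assumed hash, for illustration only).
function partition(key, reducerCount) {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % reducerCount;
}

// Shuffle & Sort: group the values by key, then order the keys.
function shuffleAndSort(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: sum the list of values collected for one key.
function reduceGroup(key, values) {
  return [key, values.reduce((sum, v) => sum + v, 0)];
}

// Run the phases: map every split, route pairs to reducers, reduce each partition.
const mapped = splits.flatMap(mapSplit);
const partitions = Array.from({ length: REDUCERS }, () => []);
for (const pair of mapped) partitions[partition(pair[0], REDUCERS)].push(pair);

const output = partitions.map((pairs) =>
  shuffleAndSort(pairs).map(([key, values]) => reduceGroup(key, values))
);
console.log(output);
```

Every occurrence of a word is routed to the same reducer, which is why each reducer can compute a final count without talking to the other one.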
MapReduce philosophy
 – hide complexity

 – make it scalable

 – make it cheap

MapReduce popularized by

 Apache Hadoop project
Hadoop Overview

Open source implementation of
 – Google MapReduce paper

 – Google File System (GFS) paper


First released in 2006; in large-scale production at Yahoo! by 2008
 – wide adoption by Facebook, Twitter, Amazon, etc.
Hadoop Core



MapReduce (Job Scheduling / Execution System)


    Hadoop Distributed File System (HDFS)
Hadoop Core (HDFS)



     MapReduce (Job Scheduling / Execution System)

          Hadoop Distributed File System (HDFS)

• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
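
As a rough illustration of the three bullets above, the sketch below computes how many blocks a file occupies and builds the kind of block-to-node table a Name Node keeps as metadata. The block size and replica count come from the slide. The function names, the node names, and the round-robin placement are assumptions made for the example; real HDFS placement is more involved.

```javascript
const BLOCK_SIZE_MB = 64; // block size from the slide
const REPLICATION = 3;    // replica count from the slide

// Hypothetical helper: how a file of the given size is split and replicated.
function hdfsFootprint(fileSizeMB) {
  const blocks = Math.ceil(fileSizeMB / BLOCK_SIZE_MB);
  return {
    blocks,
    blockReplicas: blocks * REPLICATION,
    rawStorageMB: fileSizeMB * REPLICATION, // the last block only takes the space it needs
  };
}

// Hypothetical placement: put each block on REPLICATION distinct nodes, round-robin.
function placeBlocks(blockCount, dataNodes) {
  if (dataNodes.length < REPLICATION) {
    throw new Error("need at least " + REPLICATION + " data nodes");
  }
  const placement = [];
  for (let b = 0; b < blockCount; b++) {
    const replicas = [];
    for (let r = 0; r < REPLICATION; r++) {
      replicas.push(dataNodes[(b + r) % dataNodes.length]);
    }
    placement.push({ block: b, replicas });
  }
  return placement;
}

// A 1 GB (1024 MB) file on a made-up four-node cluster.
console.log(hdfsFootprint(1024));
console.log(placeBlocks(4, ["dn1", "dn2", "dn3", "dn4"]));
```

Losing any single Data Node in this layout still leaves two copies of every block, which is the point of replicating across three nodes.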

Hadoop Core (HDFS)



MapReduce (Job Scheduling / Execution System)

    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks


     MapReduce (Job Scheduling / Execution System)
          Hadoop Distributed File System (HDFS)



       Name Node                      Data Node
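
The idea of assigning tasks by data locality can be sketched as follows. This is a simplified illustration, not the Job Tracker's actual algorithm: the block locations, node names, slot counts, and the function `assignTasks` are all made up for the example.

```javascript
// Hypothetical block locations, as a Name Node might report them.
const blockLocations = {
  "block-0": ["node-a", "node-b", "node-c"],
  "block-1": ["node-b", "node-c", "node-d"],
  "block-2": ["node-a", "node-c", "node-d"],
};

// Assign one map task per block, preferring a Task Tracker that already holds the block.
function assignTasks(locations, trackers, slotsPerTracker) {
  const freeSlots = new Map(trackers.map((t) => [t, slotsPerTracker]));
  const assignments = [];
  for (const [block, holders] of Object.entries(locations)) {
    // First choice: a tracker on a node that stores a replica and has a free slot.
    let tracker = holders.find((t) => (freeSlots.get(t) || 0) > 0);
    const local = tracker !== undefined;
    // Fallback: any tracker with a free slot; the block is then read over the network.
    if (!local) tracker = trackers.find((t) => freeSlots.get(t) > 0);
    if (tracker === undefined) throw new Error("no free task slots");
    freeSlots.set(tracker, freeSlots.get(tracker) - 1);
    assignments.push({ block, tracker, local });
  }
  return assignments;
}

// Three trackers with one slot each; node-e holds no replicas.
console.log(assignTasks(blockLocations, ["node-a", "node-b", "node-e"], 1));
```

In this run the first two tasks are placed next to their data, while the third falls back to node-e because the nodes holding its replicas have no free slot.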
Hadoop Core (MapReduce)

  Job Tracker                 Task Tracker




MapReduce (Job Scheduling / Execution System)
    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (Job submission)

                           Task Tracker
Client




            Job Tracker




            Name Node      Data Node

Hadoop Ecosystem

            Pig (ETL)          Hive (BI)       Sqoop (RDBMS)


            MapReduce (Job Scheduling / Execution System)
Zookeeper




                                                               Avro
                  HBase


                 Hadoop Distributed File System (HDFS)
JavaScript MapReduce
// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
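
To try the two functions above without a cluster, they can be driven by small stand-ins for the objects the runtime passes in. The sketch below is an assumption-heavy test harness: `makeContext`, `makeIterator`, and `runLocally` are hypothetical helpers that only imitate the calls used on the slide (`context.write`, `values.hasNext`, `values.next`), and the sample lines are made up.

```javascript
// map and reduce repeated from the slide so that this sketch is self-contained.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-in for the context object: records every written pair.
function makeContext() {
    var pairs = [];
    return { pairs: pairs, write: function (key, value) { pairs.push([key, value]); } };
}

// Hypothetical stand-in for the values iterator handed to reduce.
function makeIterator(values) {
    var i = 0;
    return {
        hasNext: function () { return i < values.length; },
        next: function () { return values[i++]; }
    };
}

// Run map over every line, group the pairs by key, then run reduce per key.
function runLocally(lines) {
    var mapContext = makeContext();
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    var groups = new Map();
    mapContext.pairs.forEach(function (pair) {
        if (!groups.has(pair[0])) groups.set(pair[0], []);
        groups.get(pair[0]).push(pair[1]);
    });

    var reduceContext = makeContext();
    Array.from(groups.keys()).sort().forEach(function (key) {
        reduce(key, makeIterator(groups.get(key)), reduceContext);
    });
    return reduceContext.pairs;
}

console.log(runLocally(["The quick brown fox", "the fox ate the mouse"]));
```

Because map lowercases every word, "The" and "the" are counted together.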
Pig

words = LOAD '/example/count' AS (
      word: chararray,
      count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
Hive
CREATE EXTERNAL TABLE WordCount (
      word string,
      count int
)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;

Über Demo
Demo
Hadoop in the Cloud
Thanks!
Questions?

More Related Content

What's hot

InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing
inside-BigData.com
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
Wei-Yu Chen
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
bigdatasyd
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
Denis Shestakov
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Cloudera, Inc.
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
Chien Chung Shen
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
shubham kuwar
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
David Wellman
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Zekeriya Besiroglu
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
Eduard Hildebrandt
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
Adeel Ahmad
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
Oleksiy Krotov
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
Nam Nham
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
Purna Chander K
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Cloudera, Inc.
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
zingopen
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
Rakuten Group, Inc.
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
DataWorks Summit
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
Mitsuharu Hamba
 

What's hot (20)

InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing InfiniCortex and the Renaissance in Polish Supercomputing
InfiniCortex and the Renaissance in Polish Supercomputing
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
Distributed computing the Google way
Distributed computing the Google wayDistributed computing the Google way
Distributed computing the Google way
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 

Viewers also liked

YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource Manager
VertiCloud Inc
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
Amund Tveit
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
Romain Jacotin
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
James Chittenden
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
sscdotopen
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
Romain Jacotin
 

Viewers also liked (6)

YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource Manager
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
The Google File System (GFS)
The Google File System (GFS)The Google File System (GFS)
The Google File System (GFS)
 

Similar to Intro to Big Data using Hadoop

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
Dr Ganesh Iyer
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the Room
BTI360
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Mining
aravindan_raghu
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Hadoop
HadoopHadoop
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)
Takumi Asai
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
rhatr
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
Dmitry Makarchuk
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
Mark Kromer
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
tcloudcomputing-tw
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
Shay Sofer
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Mapreduse model
Mapreduse modelMapreduse model
Mapreduse model
Kalyaniwan
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
DataPlato, Crossing the line
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx
Shree Shree
 

Similar to Intro to Big Data using Hadoop (20)

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the Room
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Mining
 
Hadoop ecosystem
Intro to Big Data using Hadoop

  • 1. Intro to Big Data using Hadoop Sergejus Barinovas sergejus.blogas.lt fb.com/ITishnikai @sergejusb
  • 2. Information is powerful… but it is how we use it that will define us
  • 3. Data Explosion: text, audio, video, images, relational (picture from Big Data Integration)
  • 4. Big Data (globally) – creates over 30 billion pieces of content per day – stores 30 petabytes of data – produces over 90 million tweets per day
  • 5. Big Data (our example) – logs over 300 gigabytes of transactions per day – stores more than 1.5 terabytes of aggregated data
  • 6. 4 Vs of Big Data: volume, velocity, variety, variability
  • 7. Big Data Challenges: sorting 10 TB on 1 node = 2.5 days; on a 100-node cluster = 35 mins
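The two sort timings on slide 7 can be turned into a speedup figure with a few lines of arithmetic. The sketch below is only a reading of the slide's rounded numbers; the variable names are illustrative and nothing here was measured.

```javascript
// Figures quoted on slide 7: 10 TB sorted in 2.5 days on 1 node,
// and in 35 minutes on a 100-node cluster.
const singleNodeMinutes = 2.5 * 24 * 60; // 2.5 days expressed in minutes
const clusterMinutes = 35;
const nodes = 100;

// Speedup is the ratio of the two running times.
const speedup = singleNodeMinutes / clusterMinutes;

// Parallel efficiency is the speedup divided by the number of nodes.
const efficiency = speedup / nodes;

console.log(`speedup: ${speedup.toFixed(1)}x`);
console.log(`parallel efficiency: ${(efficiency * 100).toFixed(0)}%`);
```

The ratio comes out at roughly 100, so the slide is describing close to linear scaling. Because the inputs are rounded, an efficiency slightly above 100% should not be read literally.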
  • 8. Big Data Challenges “Fat” servers imply high cost – use cheap commodity nodes instead. A large number of cheap nodes implies frequent failures – leverage automatic fault tolerance
  • 9. Big Data Challenges We need a new data-parallel programming model for clusters of commodity machines
  • 11. MapReduce: published in 2004 by Google (“MapReduce: Simplified Data Processing on Large Clusters”); popularized by the Apache Hadoop project – used by Yahoo!, Facebook, Twitter, Amazon, …
  • 13. Word Count Example (pipeline: Input → Map → Shuffle & Sort → Reduce → Output)
       Input splits (one Map task each): "the quick brown fox" | "the fox ate the mouse" | "how now brown cow"
       Output of reducer 1: the, 3; brown, 2; fox, 2; how, 1; now, 1
       Output of reducer 2: quick, 1; ate, 1; mouse, 1; cow, 1
  • 14. Word Count Example (Map output and Reduce input)
       Map 1 emits: the, 1; quick, 1; brown, 1; fox, 1
       Map 2 emits: the, 1; fox, 1; ate, 1; the, 1; mouse, 1
       Map 3 emits: how, 1; now, 1; brown, 1; cow, 1
       Reducer 1 receives: the, 1; the, 1; the, 1; brown, 1; brown, 1; fox, 1; fox, 1; how, 1; now, 1
       Reducer 2 receives: quick, 1; ate, 1; mouse, 1; cow, 1
  • 15. Word Count Example (Shuffle & Sort groups and final output)
       Reducer 1: the, [1,1,1] → the, 3; brown, [1,1] → brown, 2; fox, [1,1] → fox, 2; how, [1] → how, 1; now, [1] → now, 1
       Reducer 2: quick, [1] → quick, 1; ate, [1] → ate, 1; mouse, [1] → mouse, 1; cow, [1] → cow, 1
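The three phases of the word count example (slides 13 to 15) can be imitated in a single process. This is a minimal sketch for following the data flow, not how Hadoop runs a job: the function names `mapSplit`, `shuffle` and `reduceGroup` are made up for illustration, and there is no partitioning across reducers or machines.

```javascript
// Input splits taken from the slides; one map call per split.
const splits = [
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow",
];

// Map: emit a (word, 1) pair for every word in a split.
function mapSplit(text) {
  return text.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

// Shuffle & sort: collect all values emitted for the same key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: sum the list of ones collected for one word.
function reduceGroup(values) {
  return values.reduce((sum, v) => sum + v, 0);
}

const mapped = splits.flatMap(mapSplit);
const grouped = shuffle(mapped);
const counts = new Map();
for (const [word, values] of grouped) {
  counts.set(word, reduceGroup(values));
}

console.log(grouped.get("the")); // [ 1, 1, 1 ]
console.log(Object.fromEntries(counts));
```

In a real cluster each `mapSplit` call would run on a different node, and the grouping step would move data over the network to the reducers.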
  • 16. MapReduce philosophy – hide complexity – make it scalable – make it cheap
  • 17. MapReduce popularized by Apache Hadoop project
  • 18. Hadoop Overview Open source implementation of – Google MapReduce paper – Google File System (GFS) paper First release in 2008 by Yahoo! – wide adoption by Facebook, Twitter, Amazon, etc.
  • 19. Hadoop Core MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS)
  • 20. Hadoop Core (HDFS) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) • Name Node stores file metadata • files split into 64 MB blocks • blocks replicated across 3 Data Nodes
  • 21. Hadoop Core (HDFS) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
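The HDFS figures on slide 20 (64 MB blocks, 3 replicas) are enough to estimate how a file is laid out. The sketch below is back-of-the-envelope arithmetic only. `hdfsFootprint` is a hypothetical helper, not an HDFS API, and it assumes the final partial block occupies only its actual length on disk.

```javascript
const MB = 1024 * 1024;
const BLOCK_SIZE = 64 * MB; // block size quoted on the slide
const REPLICATION = 3;      // replicas per block quoted on the slide

// Hypothetical helper: estimate block count and raw storage for one file.
function hdfsFootprint(fileSizeBytes) {
  const blocks = Math.ceil(fileSizeBytes / BLOCK_SIZE);
  return {
    blocks,                                 // entries the Name Node tracks for the file
    blockReplicas: blocks * REPLICATION,    // block copies spread over Data Nodes
    rawBytes: fileSizeBytes * REPLICATION,  // assumed raw disk usage across the cluster
  };
}

const oneGiB = 1024 * MB;
const result = hdfsFootprint(oneGiB);
console.log(result);
```

For a 1 GiB file this gives 16 blocks and 48 block replicas, so the cluster needs about three times the logical file size in raw disk space.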
  • 22. Hadoop Core (MapReduce) • Job Tracker distributes tasks and handles failures • tasks are assigned based on data locality • Task Trackers can execute multiple tasks MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
  • 23. Hadoop Core (MapReduce) Job Tracker Task Tracker MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
  • 24. Hadoop Core (Job submission) Task Tracker Client Job Tracker Name Node Data Node
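Slide 22 says tasks are assigned based on data locality. One simplified way to picture that is a greedy rule: run each task on a node that already stores the block, and fall back to any free node otherwise. Everything in this sketch is hypothetical (node names, slot counts, the `assignTasks` function). The real Job Tracker is more involved than this, so treat it as an illustration of the idea only.

```javascript
// Hypothetical cluster state: free task slots per Task Tracker node.
const freeSlots = { nodeA: 1, nodeB: 1, nodeC: 1 };

// Hypothetical input blocks and the Data Nodes holding their replicas.
const blocks = [
  { id: "block-1", replicas: ["nodeA", "nodeB"] },
  { id: "block-2", replicas: ["nodeA", "nodeB"] },
  { id: "block-3", replicas: ["nodeA", "nodeB"] },
];

// Prefer a node that already stores the block; otherwise use any node
// that still has a free slot, or null if the cluster is full.
function assignTasks(blockList, slotsByNode) {
  const slots = { ...slotsByNode }; // copy so the caller's state is not mutated
  return blockList.map((block) => {
    const local = block.replicas.find((node) => slots[node] > 0);
    const node = local ?? Object.keys(slots).find((n) => slots[n] > 0) ?? null;
    if (node !== null) slots[node] -= 1;
    return { block: block.id, node, dataLocal: local !== undefined };
  });
}

const assignments = assignTasks(blocks, freeSlots);
console.log(assignments);
```

With one slot per node, the first two tasks run where their data lives. The third has no free local node, so it is placed on `nodeC` and would have to read its block over the network.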
  • 25. Hadoop Ecosystem Pig (ETL) Hive (BI) Sqoop (RDBMS) MapReduce (Job Scheduling / Execution System) Zookeeper Avro HBase Hadoop Distributed File System (HDFS)
  • 26. JavaScript MapReduce
       // Map: split the input line into words and emit (word, 1) for each one.
       var map = function (key, value, context) {
           var words = value.split(/[^a-zA-Z]/);
           for (var i = 0; i < words.length; i++) {
               if (words[i] !== "") {
                   context.write(words[i].toLowerCase(), 1);
               }
           }
       };
       // Reduce: sum all values received for one word and emit (word, total).
       var reduce = function (key, values, context) {
           var sum = 0;
           while (values.hasNext()) {
               sum += parseInt(values.next());
           }
           context.write(key, sum);
       };
  • 27. Pig
       words = LOAD '/example/count' AS ( word: chararray, count: int );
       popular_words = ORDER words BY count DESC;
       top_popular_words = LIMIT popular_words 10;
       DUMP top_popular_words;
  • 28. Hive
       CREATE EXTERNAL TABLE WordCount ( word string, count int )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
       STORED AS TEXTFILE
       LOCATION "/example/count";
       SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
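The Pig and Hive slides both express the same query: order the word counts descending and keep the top rows. For comparison, the same logic in plain JavaScript might look like the sketch below. The sample data and the helper names `parseCounts` and `topWords` are assumptions for illustration. It works on a small in-memory string, whereas Pig and Hive run the query over the job output in HDFS.

```javascript
// Hypothetical sample of tab-separated "word<TAB>count" lines,
// matching the (word, count) schema declared in the Pig and Hive slides.
const sample = "the\t3\nbrown\t2\nfox\t2\nhow\t1\nquick\t1";

// Parse each non-empty line into a { word, count } row.
function parseCounts(text) {
  return text
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => {
      const [word, count] = line.split("\t");
      return { word, count: Number.parseInt(count, 10) };
    });
}

// Equivalent of ORDER BY count DESC followed by LIMIT n.
function topWords(rows, n) {
  return [...rows].sort((a, b) => b.count - a.count).slice(0, n);
}

const top = topWords(parseCounts(sample), 3);
console.log(top);
```

Sorting a copy of the array keeps the parsed rows untouched, which matches the declarative style of the Pig and Hive versions, where the source relation is not modified.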

Editor's Notes

  1. So, which is really the “enterprise” now?
  2. Volume – exceeds physical limits of vertical scalability. Velocity – decision window is small compared to the data change rate. Variety – many different formats make integration expensive. Variability – many options or variable interpretations confound analysis.
  3. --run MapReduce
     runJs("/example/mr/WordCount.js", "/example/data/davinci.txt", "/example/count");
     --create Hive table for the existing data
     CREATE EXTERNAL TABLE WordCount ( word string, count int )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
     STORED AS TEXTFILE
     LOCATION "/example/count";
     --select top words
     SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
     --execute Hive select
     hive.exec("select * from WordCount order by count desc limit 10;");
     --execute LINQ style Hive query
     hive.from("WordCount").orderBy("count DESC").take(10).run();
     --execute Pig script
     words = LOAD '/example/count' AS ( word: chararray, count: int );
     popular_words = ORDER words by count DESC;
     top_popular_words = LIMIT popular_words 10;
     DUMP top_popular_words;
     --execute LINQ style Pig script
     pig.from("/example/count", "word: chararray, count: int").orderBy("count DESC").take(10).run();