Big Data in the Microsoft Platform

Building Big Data Solutions in
the Microsoft Platform
Jesus Rodriguez
Tellago, Inc, Tellago Studios

About Me…
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc.
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com

Agenda
• Big Data Overview
• MS HDInsight
– Map Reduce
– HDFS
– Hive
– Pig
– Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future….

Big Data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?

A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user
data + 15 TB/day
• 2009: eBay has 6.5 PB user data +
50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

Processing Large Amounts of
Data is Complicated....

Successful Big Data = Scalable
Computing + Large Storage

Parallel Data Computing is
Complicated

Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible

Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives using 100 EC2 instances
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo! does fastest sort of a TB, 3.5 min over 910 nodes
• May 2009:
– Yahoo! does fastest sort of a TB, 62 sec over 1460 nodes
– Yahoo! sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera

Hadoop Ecosystem
• ETL Tools, BI Reporting, RDBMS
• Pig (Data Flow), Hive (SQL), Sqoop
• MapReduce (Job Scheduling/Execution System, Streaming/Pipes APIs)
• HBase (key-value store)
• HDFS (Hadoop Distributed File System)
• ZooKeeper (Coordination)
• Avro (Serialization)

HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using
commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System

HDFS at a Glance
Block Size = 64MB
Replication Factor = 3

Cost/GB is a few ¢/month
vs $/month
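
To make the two numbers above concrete, here is a rough sketch of how block size and replication factor translate into block counts and raw disk usage. The function name and the 1 GB example file are hypothetical, and the arithmetic assumes the last block only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across the cluster.
    block_replicas = blocks * REPLICATION_FACTOR
    # Raw disk usage is the logical size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_storage_mb

# Assumed example: a 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)  # 16 48 3072
```

Under these assumptions a 1 GB file becomes 16 blocks, stored as 48 block replicas, using about 3 GB of raw disk.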

Map Reduce Is…
• A programming model for expressing
distributed computations at a massive
scale
• An execution framework for organizing
and performing such computations
• An open-source implementation called
Hadoop
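
The programming model can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps using word count as the example; it does not use Hadoop, and all function names are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: collapse the list of values for one key into a total."""
    return word, sum(counts)

def word_count(lines):
    # In a real cluster, map and reduce tasks run in parallel on many nodes;
    # here everything runs sequentially in one process.
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

result = word_count(["big data is big", "compute moves to data"])
print(result["big"], result["data"], result["compute"])  # 2 2 1
```

The developer writes only the map and reduce functions; grouping by key, distribution across nodes, and recovery from failures are the job of the execution framework.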

Hive Is…
• A system for managing and querying structured data
built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files

• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
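
The "metadata on raw files" idea can be illustrated with a small Python sketch: the data stays as plain delimited text, and a separately stored schema is applied only when a query runs. The table layout, column names, and sample rows are invented for this example, and the code imitates the concept rather than Hive itself.

```python
from collections import defaultdict

# Hypothetical table metadata: column names and types, kept apart from the data.
SCHEMA = [("user", str), ("country", str), ("bytes", int)]

# Hypothetical raw file contents: tab-delimited text, as it might sit in HDFS.
RAW_LINES = [
    "alice\tUS\t120",
    "bob\tUS\t80",
    "carla\tES\t50",
]

def read_rows(lines, schema):
    """Apply the schema to raw lines at read time (schema on read)."""
    for line in lines:
        fields = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

def sum_bytes_by_country(lines):
    # Rough equivalent of the SQL:
    #   SELECT country, SUM(bytes) FROM logs GROUP BY country
    # Hive would compile such a query into MapReduce jobs; this runs in-process.
    totals = defaultdict(int)
    for row in read_rows(lines, SCHEMA):
        totals[row["country"]] += row["bytes"]
    return dict(totals)

print(sum_bytes_by_country(RAW_LINES))  # {'US': 200, 'ES': 50}
```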

Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a
high-level language (PigLatin) for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop

Pig Architecture
Grunt (Interactive shell) PigServer (Java API)

Parser (PigLatin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)
Pig Context
Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

Hadoop

HDInsight
Rocking Data Processing
with Pig

Sqoop Is…
• Easy import of data from many
databases to HDFS
• Generates code for use in MapReduce
applications
• Integrates with Hive

HDInsight
Bulk Data Loading Using
Sqoop

HDInsight Service Architecture

HDInsight
HDInsight Service
Overview

Hadoop is not a silver bullet...

Some Challenges
• Hadoop doesn’t power big data applications
– Not a transactional datastore. Slosh back and forth via
ETL
• Processing latency
– Non-incremental, must re-slurp entire dataset every
pass
• Ad-Hoc queries
– Bare metal interface, data import
• Graphs
– Only a handful of graph problems amenable to MR

Beyond Hadoop
• Percolator (incremental processing)
http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries)
http://research.google.com/pubs/pub36632.html
• Pregel (Big graphs)
http://dl.acm.org/citation.cfm?id=1807184

Takeaways
• Hadoop provides the foundation of big
data solutions
• Computing and storage are the
fundamental components of Hadoop
• HDInsight Server and Service are
Microsoft’s distributions of Hadoop
• HDInsight is just one component of
Microsoft’s BI strategy

Thanks
jesus.rodriguez@tellago.com
http://www.tellagostudios.com
http://jrodthoughts.com
http://twitter.com/#!/jrodthoughts
http://weblogs.asp.net/gsusx
