This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
Big Data in the Microsoft Platform
2. Building Big Data Solutions in the Microsoft Platform
Jesus Rodriguez
Tellago, Inc, Tellago Studios
4. About Me…
• Hackerpreneur
• Co-Founder Tellago, Tellago Studios, Inc.
• Microsoft Architect Advisor
• Microsoft MVP
• Oracle ACE
• Speaker, Author
• http://weblogs.asp.net/gsusx
• http://jrodthoughts.com
• http://moesion.com
5. Agenda
• Big Data Overview
• MS HDInsight
– Map Reduce
– HDFS
– Hive
– Pig
– Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future…
6. Big Data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliché?
7. A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
17. Hadoop Design Principles
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Shall Move to Data
• Simple Core, Modular and Extensible
18. Hadoop History
• 2002-2004: Doug Cutting and Mike Cafarella start working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4 TB of archives over 100 EC2 instances
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo! does fastest sort of a TB, 3.5 minutes over 910 nodes
• May 2009:
– Yahoo! does fastest sort of a TB, 62 seconds over 1460 nodes
– Yahoo! sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit, Hadoop World
• September 2009: Doug Cutting joins Cloudera
23. HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
24. HDFS at a Glance
Block Size = 64MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
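The two numbers above translate directly into block counts and raw disk usage. The following is a minimal sketch of that arithmetic, not an HDFS API: the function name `hdfs_footprint` and the 1 GB example file are hypothetical, and it assumes the final block of a file only occupies its actual size on disk.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Estimate block count and raw storage for one file (illustrative only)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        # Every block is stored REPLICATION_FACTOR times across the cluster.
        "block_replicas": blocks * REPLICATION_FACTOR,
        # Assumption: a partial last block uses only its real size on disk.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


print(hdfs_footprint(1024))  # a hypothetical 1 GB file
# {'blocks': 16, 'block_replicas': 48, 'raw_storage_mb': 3072}
```

Under these settings every logical gigabyte costs about three gigabytes of raw disk, which is why the low cost per GB of commodity hardware matters.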
27. Map Reduce Is…
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
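The programming model can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big clusters"]))
# {'big': 2, 'data': 1, 'clusters': 1}
```

The developer writes only the map and reduce functions; the execution framework is responsible for distributing the input, grouping by key, and handling failures.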
31. Hive Is…
• A system for managing and querying structured data built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility – Types, Functions, Formats, Scripts
– Scalability and Performance
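To make "metadata on raw files" and "Map-Reduce for execution" concrete, here is a rough Python sketch of how a grouped count could be evaluated over raw text lines. It is an analogy, not Hive itself: the `page_views` schema, the sample rows, and the helper names are all hypothetical, and the query in the docstring only indicates the kind of SQL-like statement being imitated.

```python
from collections import defaultdict

# Hypothetical table metadata: column names applied to raw tab-separated lines.
PAGE_VIEWS_SCHEMA = ["user_id", "country", "url"]


def parse_row(line, schema):
    """Apply the schema to one raw line at read time."""
    return dict(zip(schema, line.rstrip("\n").split("\t")))


def count_by(lines, schema, column):
    """Roughly: SELECT <column>, COUNT(*) FROM page_views GROUP BY <column>."""
    # Map stage: emit (group key, 1) for each row.
    pairs = ((parse_row(line, schema)[column], 1) for line in lines)
    # Shuffle and reduce stages: sum the counts per key.
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)


raw_lines = ["u1\tUS\t/home", "u2\tAR\t/home", "u3\tUS\t/about"]
print(count_by(raw_lines, PAGE_VIEWS_SCHEMA, "country"))
# {'US': 2, 'AR': 1}
```

The files stay as plain text in storage; only the metadata gives them a tabular shape when a query runs.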
35. Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a
high-level language (PigLatin) for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
• Ease of programming
• Optimization opportunities
• Extensibility
• Built upon Hadoop
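A data analysis program in this style is a chain of named steps: load, filter, group, aggregate. The sketch below imitates that dataflow in plain Python for illustration only; it is not PigLatin, and the record layout, the 100-byte threshold, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def load(lines):
    """LOAD-like step: parse comma-separated (user, bytes) records."""
    for line in lines:
        user, size = line.split(",")
        yield user, int(size)


def filter_large(records, min_bytes):
    """FILTER-like step: keep records at or above a size threshold."""
    return (r for r in records if r[1] >= min_bytes)


def total_by_user(records):
    """GROUP and aggregate step: sum bytes per user."""
    # groupby needs its input sorted by the grouping key.
    ordered = sorted(records, key=itemgetter(0))
    return {user: sum(size for _, size in group)
            for user, group in groupby(ordered, key=itemgetter(0))}


logs = ["ann,500", "bob,20", "ann,300", "bob,900"]
print(total_by_user(filter_large(load(logs), 100)))
# {'ann': 800, 'bob': 900}
```

Because each step only describes what to do with the data, an evaluation layer is free to reorder or combine steps, which is where the optimization opportunities come from.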
49. Some Challenges
• Hadoop doesn’t power big data applications
– Not a transactional datastore; data must be moved back and forth via ETL
• Processing latency
– Non-incremental; must re-read the entire dataset on every pass
• Ad-Hoc queries
– Bare metal interface, data import
• Graphs
– Only a handful of graph problems amenable to MR
52. Takeaways
• Hadoop provides the foundation of big data solutions
• Computing and storage are the fundamental components of Hadoop
• HDInsight Server and Service are Microsoft’s distributions of Hadoop
• HDInsight is just one component of Microsoft’s BI strategy