Big Data in the Microsoft Platform
Building Big Data Solutions in
the Microsoft Platform
Jesus Rodriguez
Tellago, Inc, Tellago Studios
Big Data?
About Me…
•   Hackerpreneur
•   Co-Founder Tellago, Tellago Studios, Inc.
•   Microsoft Architect Advisor
•   Microsoft MVP
•   Oracle ACE
•   Speaker, Author
•   http://weblogs.asp.net/gsusx
•   http://jrodthoughts.com
•   http://moesion.com
Agenda
• Big Data Overview
• MS HDInsight
   –   Map Reduce
   –   HDFS
   –   Hive
   –   Pig
   –   Sqoop
• HDInsight Service
• The Hadoop Ecosystem
• The Future….
Big Data?
•   A bunch of data?
•   An industry?
•   An expertise?
•   A trend?
•   A cliché?
A Clue?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user
  data + 15 TB/day
• 2009: eBay has 6.5 PB user data +
  50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
We Love Data!
But...
Processing Large Amounts of Data Is Complicated...
Successful Big Data = Scalable Computing + Large Storage
A Trivial Model
Not So Fast....
Parallel Data Computing Is Complicated
So Is Large Data Storage
Enter the World of Hadoop...
Hadoop Design Principles
•   System Shall Manage and Heal Itself
•   Performance Shall Scale Linearly
•   Compute Shall Move to Data
•   Simple Core, Modular and Extensible
Hadoop History
•   2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
•   2003-2004: Google publishes GFS and MapReduce papers
•   2004: Cutting adds DFS & MapReduce support to Nutch
•   2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
•   2007: NY Times converts 4 TB of archives using 100 EC2 instances
•   2008: Web-scale deployments at Yahoo!, Facebook, Last.fm
•   April 2008: Yahoo! sets the record for sorting 1 TB: 3.5 minutes on 910 nodes
•   May 2009:
     – Yahoo! sets the record for sorting 1 TB: 62 seconds on 1,460 nodes
     – Yahoo! sorts 1 PB in 16.25 hours on 3,658 nodes
•   June 2009, Oct 2009: Hadoop Summit, Hadoop World
•   September 2009: Doug Cutting joins Cloudera
Hadoop Ecosystem
ETL Tools            BI Reporting         RDBMS
Pig (Data Flow)      Hive (SQL)           Sqoop
MapReduce (Job Scheduling/Execution System)
HBase (key-value store)   (Streaming/Pipes APIs)
HDFS (Hadoop Distributed File System)
ZooKeeper (Coordination)
Avro (Serialization)
Microsoft & Hadoop
HDInsight
HDFS
HDFS Is…
• A distributed file system
• Redundant storage
• Designed to reliably store data using
  commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
HDFS at a Glance
Block Size = 64 MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
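To make the two numbers on this slide concrete, here is a minimal Python sketch of the arithmetic, assuming the block size and replication factor shown above. The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API; the sketch also assumes the last block is not padded, so raw storage is simply the file size times the replication factor.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Estimate how a file of the given size is laid out (illustrative only)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        # Every block is stored REPLICATION_FACTOR times across the cluster.
        "block_replicas": blocks * REPLICATION_FACTOR,
        # Assumption: partial blocks only occupy their actual size.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


print(hdfs_footprint(1024))
```

Under these assumptions a 1 GB file becomes 16 blocks and 48 block replicas, and occupies roughly 3 GB of raw disk, which is why cheap commodity storage matters to the cost comparison.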
HDInsight
HDFS
Demo
Map Reduce
Map Reduce Is…
• A programming model for expressing
  distributed computations at a massive
  scale
• An execution framework for organizing
  and performing such computations
• An open-source implementation called
  Hadoop
Map Reduce At a Glance
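The programming model can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle, and reduce steps using the classic word-count example; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "compute moves to data"])
print(counts)
```

In a real cluster the map calls run in parallel next to the data blocks and the shuffle moves data over the network; here everything runs in one process purely to show the shape of the computation.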
HDInsight
Map Reduce Demo
Hive
Hive Is…
• A system for managing and querying structured data
  built on top of Hadoop
   – Map-Reduce for execution
   – HDFS for storage
   – Metadata on raw files

• Key Building Principles:
   – SQL as a familiar data warehousing tool
   – Extensibility – Types, Functions, Formats, Scripts
   – Scalability and Performance
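The "metadata on raw files" idea can be pictured with a small Python sketch: the files stay as plain delimited text, and a schema is applied only when the data is read. This is an illustration of the concept, not Hive itself; the column names, the sample rows, and the helpers `read_rows` and `bytes_per_page` are all hypothetical.

```python
from collections import Counter

# Raw, tab-delimited lines as they might sit in a file (made-up sample data).
RAW_LINES = [
    "alice\t/home\t512",
    "bob\t/home\t256",
    "alice\t/docs\t1024",
]

# Table metadata kept separately from the data: column names and types.
SCHEMA = [("user", str), ("page", str), ("bytes", int)]


def read_rows(lines, schema, delimiter="\t"):
    """Apply the schema to raw lines at read time."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def bytes_per_page(rows):
    """Roughly: SELECT page, SUM(bytes) FROM logs GROUP BY page."""
    totals = Counter()
    for row in rows:
        totals[row["page"]] += row["bytes"]
    return dict(totals)


print(bytes_per_page(read_rows(RAW_LINES, SCHEMA)))
```

The point of the design is that loading data is cheap (just copy files), while the query layer supplies structure on demand.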
Hive Architecture
HDInsight
Hacking with Hive
Pig
Pig Is…
Apache Pig is a platform for analyzing large data sets that consists of a
  high-level language (PigLatin) for expressing data analysis programs,
  coupled with infrastructure for evaluating these programs.

•   Ease of programming
•   Optimization opportunities
•   Extensibility
•   Built upon Hadoop
Pig Architecture
Grunt (Interactive shell)        PigServer (Java API)
Parser (PigLatin → LogicalPlan)
Optimizer (LogicalPlan → LogicalPlan)
Pig Context
Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)
ExecutionEngine
Hadoop
HDInsight
Rocking Data Processing with Pig
Sqoop
Sqoop Is…
• Easy import of data from many
  databases to HDFS
• Generates code for use in MapReduce
  applications
• Integrates with Hive
Sqoop Architecture
HDInsight
Bulk Data Loading Using Sqoop
HDInsight Service
HDInsight Service Architecture
HDInsight
HDInsight Service Overview
Hadoop Considerations
Super Crowded Ecosystem
The Hadoop Ecosystem
Hadoop is not a silver bullet...
Some Challenges
• Hadoop doesn’t power big data applications
   – Not a transactional datastore. Slosh back and forth via ETL
• Processing latency
   – Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
   – Bare metal interface, data import
• Graphs
   – Only a handful of graph problems amenable to MR
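The processing-latency point can be illustrated with a toy comparison in Python. The batch function below re-reads every record on each pass, while the incremental version only touches new records. Both are hypothetical sketches meant to show the cost difference; they do not describe how Hadoop or any specific incremental system is implemented.

```python
def batch_total(all_records):
    """Batch style: every pass re-reads the entire dataset."""
    return sum(all_records), len(all_records)  # (result, records read)


class IncrementalTotal:
    """Incremental style: keep state and read only what is new."""

    def __init__(self):
        self.total = 0
        self.records_read = 0

    def add(self, new_records):
        for value in new_records:
            self.total += value
            self.records_read += 1
        return self.total


day1, day2 = [1, 2, 3], [4, 5]

# Batch: pass 1 reads 3 records, pass 2 reads all 5 again.
_, read_pass1 = batch_total(day1)
batch_result, read_pass2 = batch_total(day1 + day2)
batch_reads = read_pass1 + read_pass2

# Incremental: each record is read exactly once.
inc = IncrementalTotal()
inc.add(day1)
inc_result = inc.add(day2)

print(batch_result, batch_reads, inc_result, inc.records_read)
```

Both approaches reach the same total, but the batch version reads 8 records for 5 records of data, and that gap grows with every additional pass over a growing dataset.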
Beyond Hadoop
• Percolator (incremental processing)
http://research.google.com/pubs/pub36726.html
• Dremel (ad-hoc analysis queries)
http://research.google.com/pubs/pub36632.html
• Pregel (Big graphs)
http://dl.acm.org/citation.cfm?id=1807184
In the Meantime...
Takeaways
• Hadoop provides the foundation of big
  data solutions
• Computing and storage are the
  fundamental components of Hadoop
• HDInsight Server and Service are
  Microsoft’s distributions of Hadoop
• HDInsight is just one component of
  Microsoft’s BI strategy
Thanks
jesus.rodriguez@tellago.com
http://www.tellagostudios.com
http://jrodthoughts.com
http://twitter.com/#!/jrodthoughts
http://weblogs.asp.net/gsusx
