Bridging Unstructured & Structured Data with Hadoop and Vertica
Glenn Gebhart (ggebhart@vertica.com)
Steve Watt (swatt@hp.com)

Contents
Our background with Big Data
Accelerating and monitoring Apache Hadoop deployments with HP CMU
I have my Apache Hadoop Cluster deployed... Now what?
Sample application scenario with Apache Hadoop and Vertica
HP Confidential

Cluster Management Utility

Managing Scale Out with HP CMU
Proven cluster deployment and management tool
11 Years Experience
Proven with clusters of 3500+ nodes
Deployment and Management
Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
Command Line or GUI based cluster wide configuration
Manage a node individually or manage a cluster as a whole
Monitoring
Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
Extensible through Collectl integration

Tech Bubble? What does the Data Say?
Attribution: CC PascalTerjan via Flickr

But what if I could turn that into this?
And see how the amount invested this year differs from previous years?
Where is the money going?
What type of startups get the most investment funding?
Amount invested in Software Startups by Zip Code

How did you do that?
Attribution: CC Colin_K on Flickr

Apache Nutch: identify optimal seed URLs and crawl to a depth of 2
http://www.crunchbase.com/companies?c=a&q=privately_held
Crawl data is stored in segment dirs on the HDFS

Making the data STRUCTURED
Retrieving HTML
Prelim Filtering on URL
Company POJO then \t Out
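
The "Company POJO then \t Out" step can be sketched in plain Java. This is a minimal illustration, not the presenters' code: the class name `Company`, the `clean` helper, and the `toTsv` method are hypothetical, and the field list is assumed from the CrunchBase schema used in the Pig script. It shows one way to guarantee that every record becomes exactly one tab-delimited line.

```java
import java.util.StringJoiner;

/** Hypothetical POJO for one funding record; fields are assumed from the CrunchBase Pig schema. */
final class Company {
    final String name, city, state, sector, round, investor;
    final int month, year, amount;

    Company(String name, String city, String state, String sector, String round,
            int month, int year, String investor, int amount) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.sector = sector;
        this.round = round;
        this.month = month;
        this.year = year;
        this.investor = investor;
        this.amount = amount;
    }

    /** Collapse tabs and line breaks to a space so scraped text cannot break the column layout. */
    static String clean(String s) {
        return s == null ? "" : s.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Emit the record as one tab-delimited line, in the column order the Pig script loads. */
    String toTsv() {
        return new StringJoiner("\t")
            .add(clean(name)).add(clean(city)).add(clean(state))
            .add(clean(sector)).add(clean(round))
            .add(Integer.toString(month)).add(Integer.toString(year))
            .add(clean(investor)).add(Integer.toString(amount))
            .toString();
    }
}
```

In a MapReduce job, a mapper would typically build one such object per parsed page and write `toTsv()` as its output value; cleaning the fields first matters because text scraped from HTML often contains stray tabs and newlines.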

Aargh! My viz tool requires zipcodes to plot geospatially!

Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica

    -- Load the zip code lookup table and the structured CrunchBase records (tab-delimited)
    ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t')
        AS (State:chararray, City:chararray, ZipCode:int);
    CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t')
        AS (Company:chararray, City:chararray, State:chararray, Sector:chararray,
            Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

    -- Attach zip codes to each company by matching on (City, State)
    CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);

    -- Write the joined relation to a Vertica table
    STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}'
        USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');
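
To make the join semantics concrete, here is a small in-memory Java sketch of an inner join on (City, State). It is an illustration only and does not use Pig: the class `CityStateJoin` and its methods are hypothetical, rows are simplified to string arrays, and zip codes are kept as strings here (a deliberate difference from the script) so that leading zeros survive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory equivalent of an inner join on (City, State). */
final class CityStateJoin {

    /** Composite join key; matching is exact, as in an equality join. */
    static String key(String city, String state) {
        return city + "|" + state;
    }

    /** Index the lookup side. Each row is {state, city, zip}; a city may have many zip codes. */
    static Map<String, List<String>> indexZips(List<String[]> zipRows) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] r : zipRows) {
            index.computeIfAbsent(key(r[1], r[0]), k -> new ArrayList<>()).add(r[2]);
        }
        return index;
    }

    /**
     * Probe the index with company rows of the form {company, city, state}.
     * Emits one tab-delimited line per matching zip code; unmatched companies are dropped.
     */
    static List<String> join(List<String[]> companies, Map<String, List<String>> index) {
        List<String> out = new ArrayList<>();
        for (String[] c : companies) {
            List<String> zips = index.getOrDefault(key(c[1], c[2]), Collections.emptyList());
            for (String zip : zips) {
                out.add(String.join("\t", c[0], c[1], c[2], zip));
            }
        }
        return out;
    }
}
```

Two properties of an inner join are worth noticing in the sketch: a company in a city with several zip codes produces several output rows, and a company whose city and state are missing from the lookup table produces none.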

The Story So Far
Used Nutch to retrieve investment data from web site.
Used Hadoop to extract and structure the data.
Used Pig to add zipcode data.
End result is a collection of relations describing investment activity.
We’ve got raw data, now we need to understand it.

Why Vertica?
Vertica and Hadoop are complementary technologies.
Hadoop’s strengths:
  Analysis of unstructured data (screen scraping, natural language recognition)
  Non-numeric operations (graphics preparation)
Vertica’s strengths:
  Counting, adding, grouping, sorting, …
  Rich suite of advanced analytic functions
  All at TB+ scales.

Built from the Ground Up: The Four C’s of Vertica
Columnar storage and execution: achieve best data query performance with unique Vertica column store
Clustering: linear scaling by adding more resources on the fly
Compression: store more data, provide more views, use less hardware
Continuous performance: query and load 24x7 with zero administration

Getting Data From Here To There

Connecting Vertica And Hadoop
Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.
Vertica acts as a passive component; Hadoop/Pig connect to Vertica to read/write data.
Input retrieved from Vertica using standard SQL query.
Output written to Vertica table.

Vertica As a M/R Data Source

    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // Set the input format to retrieve data from Vertica
    job.setInputFormatClass(VerticaInputFormat.class);

    // Set the query to retrieve data from the Vertica DB
    VerticaInputFormat.setInput(job, "SELECT * FROM foo WHERE bar = 'baz'");

Vertica As a M/R Data Sink

    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // Set the output format to write data to Vertica
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VerticaRecord.class);
    job.setOutputFormatClass(VerticaOutputFormat.class);

    // Define the table which will hold the output
    VerticaOutputFormat.setOutput(job, <table name>, <truncate table?>,
        <col 1 def>, <col 2 def>, …, <col N def>);

Reading Data Via Pig

    -- Read some tuples
    A = LOAD 'sql://< Your query here >' USING com.vertica.pig.VerticaLoader(
        'server1,server2,server3', '< DB Name >', '5433', '< user >', '< password >');

Writing Data Via Pig

    -- Write some tuples
    STORE < some var > INTO '{ < table name > (< col 1 def >, < col 2 def >, … )}'
        USING com.vertica.pig.VerticaStorer('< server >', '< DB >', '5433', '< user >', '< password >');

Reporting And Data Visualization

Does My Favorite Application Work With Vertica?
Vertica is an ANSI SQL99 compliant DB.
Comes with drivers for ODBC, JDBC, and ADO.NET.
If your tool uses a SQL DB, and speaks one of these protocols, it’ll work just fine.
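
As a rough sketch of what "speaks JDBC" means in practice, the Java below runs an aggregate query through the standard `java.sql` API. Several things here are assumptions rather than facts from the slides: the class name `SectorFundingReport`, the `jdbc:vertica://` URL scheme, the presence of a Vertica JDBC driver on the classpath, and a `CrunchBaseZip` table whose `Amount` column is numeric.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical report: total amount invested per sector, largest first. */
final class SectorFundingReport {

    // Assumption: table and column names follow the STORE statement in the Pig script,
    // with Amount stored as a numeric column.
    static final String QUERY =
        "SELECT Sector, SUM(Amount) AS total_invested FROM CrunchBaseZip "
      + "GROUP BY Sector ORDER BY total_invested DESC";

    /** Build the connection URL; the jdbc:vertica:// scheme is an assumption about the driver. */
    static String jdbcUrl(String host, int port, String db) {
        return "jdbc:vertica://" + host + ":" + port + "/" + db;
    }

    /** Run the query and print one tab-delimited line per sector. Requires a reachable database. */
    static void printReport(String url, String user, String password) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement stmt = conn.prepareStatement(QUERY);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

A call such as `SectorFundingReport.printReport(SectorFundingReport.jdbcUrl("VerticaServer", 5433, "OSCON"), "dbadmin", "")` would reuse the connection details shown in the Pig script. Nothing in the code beyond the URL is Vertica-specific, which is the point the slide makes about standard drivers.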

Traditional Reports
Integrates smoothly with reporting frontends such as Jasper and Pentaho.
Scriptable via the vsql command line tool.
C/C++ SDK for parallelized, in-DB computation.
But… you have to know what questions you want to ask.

In Closing…
Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges.
Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
Getting them to work together is easy.