Search Analytics with Flume & HBase
Otis Gospodnetić • Sematext International
Agenda
Who I am
What, Why, How
Architecture Evolution
Role of Flume and HBase + Flume HBase Sink
Challenges
About Otis Gospodnetić
Lucene/Solr/Nutch/Mahout committer
Lucene in Action 1 & 2 co-author
Lucene Consulting since 2005
Sematext Int'l since 2007
About Sematext
Consulting, development, support for:
Big Data (Hadoop, HBase, Voldemort...)
Search (Lucene, Solr, Elasticsearch...)
Web Crawling (Nutch)
Machine Learning (Mahout)
What We Built
Analytics for Search
Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
Trending over time (see the row-key sketch after this list)
Comparisons of time periods
Top N reports
Various report filters
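The deck doesn't show the storage layout behind these reports, but reports like these map naturally onto time-bucketed HBase row keys, where trending and period comparisons become contiguous scans. A purely hypothetical illustration; the key layout and all names here are assumptions, not the actual Sematext schema:

```java
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical row-key layout for time-bucketed search analytics:
// <customer>|<report>|<yyyyMMddHH>. Keys for one customer/report sort by
// time, so "trending over time" and "comparisons of time periods" become
// plain range scans over adjacent rows.
public class ReportKeys {
  public static byte[] rowKey(String customer, String report, String hourBucket) {
    return Bytes.toBytes(customer + "|" + report + "|" + hourBucket);
  }

  public static void main(String[] args) {
    // e.g. scan [cust42|query_volume|2011030100, cust42|query_volume|2011031100)
    // to chart ten days of query volume for one customer.
    System.out.println(Bytes.toString(
        rowKey("cust42", "query_volume", "2011030100")));
  }
}
```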
Report Example
Why We Built it
We need it: search-hadoop.com & search-lucene.com
Search customers need it:
Want to know what their visitors are searching for
Want to know how their search is behaving
… subliminal msg: go use this site
How We Built it
JavaScript Beacons
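The deck doesn't include the beacon code itself, but the pattern is standard: a small JavaScript snippet on the search results page fires an HTTP request carrying the query, hit count, site, etc. to a collector whose log Flume then picks up. Below is a minimal hypothetical sketch of such a collector endpoint in Java; the parameter names (site, q, hits) and the log format are illustrative assumptions, not Sematext's actual implementation.

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical beacon collector: the JavaScript beacon on the search results
// page requests something like /beacon?site=...&q=...&hits=..., and we write
// one log line per search event for Flume to pick up.
public class BeaconServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // Tab-separated event line; in this sketch stdout stands in for the
    // access/event log that a Flume agent would tail.
    System.out.printf("%d\t%s\t%s\t%s%n",
        System.currentTimeMillis(),
        req.getParameter("site"),
        req.getParameter("q"),
        req.getParameter("hits"));
    // A real collector usually returns a 1x1 GIF; for logging purposes a
    // 204 No Content works just as well, since only the request matters.
    resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
  }
}
```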

Editor's Notes

  1. 10 days of data (5K/min)
  2. Flume is used simply to collect logs from multiple agents to a central place (HDFS). But at the end we still have a single log file that something (a raw log importer) then needs to process. HBase is not directly involved with Flume here, and there is no HBase sink in this scenario.
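As a sketch of what that raw log importer stage might look like (hypothetical: the tab-separated log format, the "search_events" table, and the column layout are assumptions, not the actual Sematext schema), the code below reads collected log lines and writes one HBase Put per search event, using the HTable client API current at the time:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical raw log importer: parse each collected log line
// (timestamp \t site \t query \t hits) and write it as a row in HBase.
public class RawLogImporter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "search_events");
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split("\t");
      // Row key: site + timestamp keeps one site's events together and
      // in time order (a real schema would likely be more elaborate).
      Put put = new Put(Bytes.toBytes(f[1] + ":" + f[0]));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("query"), Bytes.toBytes(f[2]));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("hits"), Bytes.toBytes(f[3]));
      table.put(put);
    }
    in.close();
    table.close();
  }
}
```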
  3. Making use of Flume's ability to plug in different Sinks: instead of just collecting data into a log file on HDFS, we hook the FLUME-247 Sink into Flume and make it write directly to HBase.
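The deck doesn't show the FLUME-247 sink's code or configuration, but conceptually an HBase sink just maps each Flume event's attributes onto an HBase Put instead of appending a line to a file on HDFS. A minimal sketch of that mapping, with the event field handling and the "attrs" column-family name assumed:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Conceptual core of any HBase sink: one incoming event becomes one Put.
// A production sink (such as FLUME-247's) additionally handles sink
// configuration, batching, and error handling.
public class EventToHBase {
  private final HTable table;

  public EventToHBase(HTable table) {
    this.table = table;
  }

  // rowKey and attributes would be extracted from the Flume event; every
  // event attribute is stored as one column in the assumed "attrs" family.
  public void append(byte[] rowKey, Map<String, byte[]> attributes)
      throws IOException {
    Put put = new Put(rowKey);
    for (Map.Entry<String, byte[]> e : attributes.entrySet()) {
      put.add(Bytes.toBytes("attrs"), Bytes.toBytes(e.getKey()), e.getValue());
    }
    table.put(put);
  }
}
```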
  4. 2h at 2K/min from 1 system (240K actions, 43 MB of input data). Resulting storage size under different prune/compress settings:
     1193 MB - no prune, no compress
     624 MB - prune sort index only, no compress
     408 MB - prune, no compress
     196 MB - no prune, compress
     106 MB - prune sort index only, compress
     64 MB - prune, compress
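That is roughly an 18x reduction from 1193 MB down to 64 MB, and with prune + compress the stored size (64 MB) ends up close to the raw input (43 MB). The compression half of that is a per-column-family table setting in HBase; a sketch using the client API of that era (table and family names are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Create a table whose single column family stores its data compressed.
// Table/family names are illustrative, not the actual Sematext schema.
public class CreateCompressedTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("search_events");
    HColumnDescriptor family = new HColumnDescriptor("d");
    // GZ ships with HBase; LZO was the common faster alternative at the time.
    family.setCompressionType(Compression.Algorithm.GZ);
    desc.addFamily(family);
    admin.createTable(desc);
  }
}
```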