Search Analytics with Flume and HBase
- 15. What We Built: Analytics for Search. Numerous reports (e.g. query volume, query rate, latency, term frequencies and comparisons, hit buckets, search origins, etc.)
- 21. Why We Built It: We need it for search-hadoop.com & search-lucene.com; our search customers need it, too, as they want to know what their visitors are searching for
- 42. HBaseLog4JAppender Cons: doesn't help with reliable delivery (e.g. when the network or HBase is down); non-centralized config on larger clusters (e.g. changing the destination table in HBase means touching every node)
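The non-centralized config problem follows from how Log4J appenders are wired up: a properties file like the sketch below (the property names and package are hypothetical, not from the talk) would have to live on every node, so changing e.g. the destination table means editing and reloading config across the whole cluster.

```properties
# Hypothetical per-node Log4J config. Every appender setting, including the
# HBase destination table, is duplicated on each machine in the cluster.
log4j.rootLogger=INFO, hbase
log4j.appender.hbase=com.example.log4j.HBaseLog4JAppender
log4j.appender.hbase.table=search_actions
log4j.appender.hbase.zookeeperQuorum=zk1,zk2,zk3
```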
- 54. Later these logs are processed by a MapReduce job. Search Action -> Metric Capture -> Log File -> Flume Agent -> Flume Collector -> Decorators -> HBase Sink -> HBase. Decorator: processes Flume Collector log events and prepares them for HBase
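The decorator step above can be sketched as a small transformation from a raw log event to an HBase row key. The log format and the key layout below are assumptions for illustration, not Sematext's actual schema.

```java
// Sketch of the kind of work a Flume decorator does before the HBase sink:
// parse a raw search-action log line and derive an HBase row key.
public class SearchEventDecorator {

    // Assumed input format: timestamp \t site \t query \t hits
    // Assumed row key: site + '|' + reversed timestamp, so the newest events
    // for a site sort first under HBase's lexicographic row ordering.
    public static String rowKey(String logLine) {
        String[] fields = logLine.split("\t");
        long ts = Long.parseLong(fields[0]);
        long reversed = Long.MAX_VALUE - ts;         // newest-first ordering
        return fields[1] + "|" + String.format("%019d", reversed);
    }
}
```

Reversing the timestamp is a common HBase key trick: scans over a site prefix then return the most recent events first without a sort step.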
- 56. Why Flume: reliable delivery (e.g. messages are queued locally if the destination is unreachable); easy, centralized management via Web UI or console
- 59. On Flume: slideshare.net/cloudera/inside-flume
- 66. Challenges “ HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn
- 68. Data pruning (variable levels); query string distribution is very long-tailed; lots of data to process, update, and aggregate
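One common way to prune a very long-tailed query distribution is to keep exact counts only for the top-N queries and collapse the tail into a single bucket. The sketch below illustrates that idea; the cutoff and the "(other)" bucket name are illustrative choices, not details from the talk.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of long-tail pruning: keep the top-N queries by count and fold the
// remaining tail into one "(other)" bucket.
public class LongTailPruner {

    public static Map<String, Integer> prune(Map<String, Integer> counts, int topN) {
        // Sort queries by count, highest first.
        List<Map.Entry<String, Integer>> sorted = counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .collect(Collectors.toList());

        Map<String, Integer> pruned = new LinkedHashMap<>();
        int tail = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i < topN) {
                pruned.put(sorted.get(i).getKey(), sorted.get(i).getValue());
            } else {
                tail += sorted.get(i).getValue();    // fold the tail together
            }
        }
        if (tail > 0) pruned.put("(other)", tail);
        return pruned;
    }
}
```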
- 69. Work @ Sematext We are hiring world-wide! Search & Data Analytics Machine Learning & NLP Biiig Data
Editor's Notes
- 10 days of data (5K/min)
- Flume is used simply to collect logs to a central place (HDFS) from multiple agents, but at the end we still have a single log file that something (a raw log importer) then needs to process. No HBase is involved directly with Flume here; there is no HBase sink in this scenario.
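The raw log importer's core logic can be sketched as a word-count-style aggregation. It is shown here as plain Java over in-memory lines; in the real job the same logic would sit in a MapReduce Mapper/Reducer pair. The tab-separated log format is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "raw log importer" step that processes the collected log file:
// map phase extracts the query term from each line, reduce phase sums per query.
public class RawLogImporter {

    // Assumed input format: timestamp \t site \t query \t hits
    public static Map<String, Integer> queryCounts(String[] logLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            String query = line.split("\t")[2];      // "map": emit (query, 1)
            counts.merge(query, 1, Integer::sum);    // "reduce": sum per query
        }
        return counts;
    }
}
```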
- Making use of Flume's ability to plug in different Sinks: instead of just collecting data to a log file on HDFS, we hook the FLUME-247 Sink up to Flume and make it write directly to HBase.
- 2 h, 2K actions/min, 1 system (240K actions, 43 MB of input data):
  - 1193 MB: no prune, no compress
  - 624 MB: prune sort index only, no compress
  - 408 MB: prune, no compress
  - 196 MB: no prune, compress
  - 106 MB: prune sort index only, compress
  - 64 MB: prune, compress