Search Analytics with Flume and HBase
- 15. What We Built: Analytics for Search. Numerous reports (e.g. query volume, query rate, latency, term frequencies and comparisons, hit buckets, search origins, etc.)
- 21. Why We Built It: We need it for search-hadoop.com & search-lucene.com; our search customers need it, too, as they want to know what their visitors are searching for
- 42. HBaseLog4JAppender Cons: doesn't help with reliable delivery (e.g. when the network or HBase is down); non-centralized config on larger clusters (e.g. changing the destination table in HBase means touching every node)
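The non-centralized config problem follows from how Log4J appenders are wired up: a properties file like the sketch below (the property names and package are hypothetical, not from the talk) would have to live on every node, so changing e.g. the destination table means editing and reloading config across the whole cluster.

```properties
# Hypothetical per-node Log4J config. Every appender setting, including the
# HBase destination table, is duplicated on each machine in the cluster.
log4j.rootLogger=INFO, hbase
log4j.appender.hbase=com.example.log4j.HBaseLog4JAppender
log4j.appender.hbase.table=search_actions
log4j.appender.hbase.zookeeperQuorum=zk1,zk2,zk3
```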
- 54. Later these logs are processed by a MapReduce job. Search Action -> Metric Capture -> Log File -> Flume Agent -> Flume Collector -> Decorators -> HBase Sink -> HBase. Decorator: processes Flume Collector log events and prepares them for HBase
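The decorator step above can be sketched as a small transformation from a raw log event to an HBase row key. The log format and the key layout below are assumptions for illustration, not Sematext's actual schema.

```java
// Sketch of the kind of work a Flume decorator does before the HBase sink:
// parse a raw search-action log line and derive an HBase row key.
public class SearchEventDecorator {

    // Assumed input format: timestamp \t site \t query \t hits
    // Assumed row key: site + '|' + reversed timestamp, so the newest events
    // for a site sort first under HBase's lexicographic row ordering.
    public static String rowKey(String logLine) {
        String[] fields = logLine.split("\t");
        long ts = Long.parseLong(fields[0]);
        long reversed = Long.MAX_VALUE - ts;         // newest-first ordering
        return fields[1] + "|" + String.format("%019d", reversed);
    }
}
```

Reversing the timestamp is a common HBase key trick: scans over a site prefix then return the most recent events first without a sort step.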
- 56. Why Flume: reliable delivery (e.g. messages are queued locally if the destination is unreachable); easy, centralized management via Web UI or console
- 59. On Flume: slideshare.net/cloudera/inside-flume
- 66. Challenges “ HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn
- 68. Data pruning (variable levels); query string distribution is very long-tailed; lots of data to process, update, and aggregate
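One common way to prune a very long-tailed query distribution is to keep exact counts only for the top-N queries and collapse the tail into a single bucket. The sketch below illustrates that idea; the cutoff and the "(other)" bucket name are illustrative choices, not details from the talk.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of long-tail pruning: keep the top-N queries by count and fold the
// remaining tail into one "(other)" bucket.
public class LongTailPruner {

    public static Map<String, Integer> prune(Map<String, Integer> counts, int topN) {
        // Sort queries by count, highest first.
        List<Map.Entry<String, Integer>> sorted = counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .collect(Collectors.toList());

        Map<String, Integer> pruned = new LinkedHashMap<>();
        int tail = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i < topN) {
                pruned.put(sorted.get(i).getKey(), sorted.get(i).getValue());
            } else {
                tail += sorted.get(i).getValue();    // fold the tail together
            }
        }
        if (tail > 0) pruned.put("(other)", tail);
        return pruned;
    }
}
```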
- 69. Work @ Sematext We are hiring world-wide! Search & Data Analytics Machine Learning & NLP Biiig Data
Editor's Notes
- 10 days of data (5K/min)
- Flume is used simply to collect logs to a central place (HDFS) from multiple agents, but at the end we still have a single log file that something (a raw log importer) then needs to process. No HBase is involved directly with Flume here; there is no HBase sink in this scenario.
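The raw log importer's core logic can be sketched as a word-count-style aggregation. It is shown here as plain Java over in-memory lines; in the real job the same logic would sit in a MapReduce Mapper/Reducer pair. The tab-separated log format is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "raw log importer" step that processes the collected log file:
// map phase extracts the query term from each line, reduce phase sums per query.
public class RawLogImporter {

    // Assumed input format: timestamp \t site \t query \t hits
    public static Map<String, Integer> queryCounts(String[] logLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            String query = line.split("\t")[2];      // "map": emit (query, 1)
            counts.merge(query, 1, Integer::sum);    // "reduce": sum per query
        }
        return counts;
    }
}
```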
- Making use of Flume's ability to plug in different Sinks: instead of just collecting data to a log file on HDFS, we hook the FLUME-247 Sink up to Flume and make it write directly to HBase.
- 2 h, 2K actions/min, 1 system (240K actions, 43 MB of input data):
  - 1193 MB: no prune, no compress
  - 624 MB: prune sort index only, no compress
  - 408 MB: prune, no compress
  - 196 MB: no prune, compress
  - 106 MB: prune sort index only, compress
  - 64 MB: prune, compress