Honu:
 A large scale data collection
and processing pipeline
Jerome Boulon
Netflix
Session Agenda


§  Honu
§  Goals
§  Architecture – Overview
§  Data Collection pipeline
§  Data Processing pipeline
§  Hive data warehouse
§  Honu Roadmap
§  Questions


                               2
Honu


§  Honu is a streaming data & log collection and processing
    pipeline built using:
 ›    Hadoop
 ›    Hive
 ›    Thrift


§  Honu has been running in production for 6 months and
    processes over a billion log events per day on EC2/EMR.


                                3
What are we trying to achieve?


§  Scalable log analysis to gain business insights:
 ›    Error logs (unstructured logs)
 ›    Statistical logs (structured logs - App specific)
 ›    Performance logs (structured logs – Standard + App specific)
§  Output required:
 ›    Engineers access:
      •  Ad-hoc queries and reporting
 ›    BI access:
      •  Flat files to be loaded into the BI system for cross-functional reporting.
      •  Ad-hoc queries for data examination, etc.

                                           4
Architecture - Overview

 [Diagram: Application → Collector (data collection pipeline);
  Collector output → M/R → Hive (data processing pipeline)]

                               5
Current Netflix deployment EC2/EMR


 [Diagram: Applications and Honu Collectors run on Amazon EC2; collectors
  write to S3; Hive & Hadoop EMR clusters process the data, with a separate
  Hive & Hadoop EMR cluster for ad-hoc query, backed by the Hive MetaStore.]
                                         10
Honu – Client SDK



 Structured Log API
 -  Log4j Appenders
 -  Hadoop Metric Plugin
 -  Tomcat Access Log

 -  Converts individual messages to batches (see the sketch after this slide)
 -  In-memory buffering system

 Communication layer:
 -  Discovery Service
 -  Transparent fail-over & load-balancing
 -  Thrift as a transport protocol & RPC (NIO)
                                       11
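
A minimal sketch of the batching idea above, in plain Java: the appender hands messages to a bounded in-memory buffer, and a background thread drains them in batches toward a collector. All class and method names here are hypothetical illustrations, not Honu's actual SDK API; the real communication layer adds discovery, fail-over across collectors, and Thrift NIO transport on top of the send step.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the client-side buffering/batching idea:
// messages are queued in memory and a background thread ships them in batches.
public class BatchingSender implements Runnable {

  private final BlockingQueue<String> buffer;   // bounded in-memory buffer
  private final int batchSize;                  // max messages per batch

  public BatchingSender(int bufferCapacity, int batchSize) {
    this.buffer = new LinkedBlockingQueue<String>(bufferCapacity);
    this.batchSize = batchSize;
  }

  // Called by the appender; drops the message if the buffer is full
  // (a real client could also block or spill to disk).
  public boolean append(String message) {
    return buffer.offer(message);
  }

  public void run() {
    List<String> batch = new ArrayList<String>(batchSize);
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Wait for at least one message, then drain up to batchSize.
        batch.add(buffer.take());
        buffer.drainTo(batch, batchSize - 1);
        send(batch);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        batch.clear();
      }
    }
  }

  // Placeholder for the Thrift RPC call to a collector behind the VIP.
  private void send(List<String> batch) {
    System.out.println("sending batch of " + batch.size() + " events");
  }
}

A sender thread would simply wrap this, e.g. new Thread(new BatchingSender(10000, 200)).start().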
Honu - Unstructured/Structured logs
Log4J Appender

§  Configuration using a standard Log4j properties file (see the sketch
    after this slide)
§  Control:
  ›    In-memory buffer size
  ›    Batch size
  ›    Number of senders + VIP address
  ›    Timeout




                                  12
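
For illustration only, a hypothetical log4j.properties fragment showing where the controls above would live. The appender class name and every property key below are assumptions, not Honu's documented configuration keys:

# Hypothetical example: appender class name and property keys are assumptions.
log4j.rootLogger=INFO, HONU
log4j.appender.HONU=org.honu.log4j.HonuAppender
# In-memory buffer size (bytes)
log4j.appender.HONU.bufferSizeBytes=10485760
# Messages per batch
log4j.appender.HONU.batchSize=200
# Number of sender threads and the collector VIP they talk to
log4j.appender.HONU.senders=2
log4j.appender.HONU.collectorVip=honu-collector.example.com:7101
# Send timeout (ms)
log4j.appender.HONU.timeoutMs=5000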
Honu - Structured Log API

[Diagram: App → Hive]

Using Annotations
§  Convert a Java class to a Hive table dynamically
§  Add/Remove columns
§  Supported Java types:
   •  All primitives
   •  Map
   •  Object, using its toString method

Using the Key/Value API
§  Produces the same result as annotations
§  Avoids unnecessary object creation
§  Fully dynamic
§  Thread safe
Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() {
    […]
  }

  @Column("clickIndex")
  public int getClickIndex() {
    […]
  }

  @Column("requestInfo")
  public Map getRequestInfo() {
    […]
  }
}

Usage:                                   Resulting message:

log.info(myAnnotatableObj);              DB=MyTable
                                         movieId=XXXX
                                         clickIndex=3
                                         requestInfo.returnCode=200
                                         requestInfo.duration_ms=300
                                         requestInfo.yyy=zzz

                                    14
Structured Log API - Key/Value API

KeyValueSerialization kv;
kv = new KeyValueSerialization();

[…]
kv.startMessage("MyTable");

kv.addKeyValue("movieId", "XXX");
kv.addKeyValue("clickIndex", 3);
kv.addKeyValue("requestInfo", requestInfoMap);

log.info(kv.generateMessage());

Resulting message:

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz

                                    15
Honu Collector


§  Honu collector:
 ›    Saves logs to the FS using local storage & the Hadoop FS API
      (see the sketch after this slide)
 ›    FS can be localFS, HDFS, S3n, NFS…
      •  FS fail-over coming (similar to Scribe)
 ›    Thrift NIO
 ›    Multiple writers (data grouping)


§  Output: DataSink (binary compatible with Chukwa)
 ›    Compression (LZO/GZIP via Hadoop codecs)
 ›    S3 optimizations
                                         16
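
The collector's write path boils down to creating a file on whichever filesystem the URI scheme selects and streaming batches through a Hadoop compression codec. A minimal sketch of that idea with the standard Hadoop FS API (the path, codec choice and record format are assumptions; the real DataSink writer is Chukwa-compatible and more involved):

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Illustration of the collector's write path: any Hadoop-supported FS
// (file://, hdfs://, s3n://, ...) sits behind the same FileSystem API,
// with compression handled by a Hadoop codec.
public class DataSinkWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The URI scheme decides the backing store; "file:///tmp/..." is just an example.
    Path sinkFile = new Path("file:///tmp/honu/dataSink-20100629-1400.gz");
    FileSystem fs = FileSystem.get(sinkFile.toUri(), conf);

    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    OutputStream out = codec.createOutputStream(fs.create(sinkFile));
    try {
      // A real collector would serialize Chukwa-compatible chunks here.
      out.write("DB=MyTable\tmovieId=XXXX\tclickIndex=3\n".getBytes("UTF-8"));
    } finally {
      out.close();
    }
  }
}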
Data Processing pipeline

 [Diagram: same overview as before, with the data processing pipeline
  (Collector output → M/R → Hive) highlighted]

                               17
Data Processing Pipeline
§  Proprietary Data Warehouse Workflows
   ›  Ability to test a new build with production data
   ›  Ability to replay some data processing



§  CheckPoint System
   ›    Keeps track of all current states for recovery


§  Demuxer: a Map-Reduce job that parses/dispatches all logs to the
    right Hive table (see the sketch after this slide)
   ›  Multiple parsers
   ›  Dynamic Output Format for Hive (tables & columns
      management)
       •  Default schema (Map, hostname & timestamp)
       •  Table-specific schema
       •  All tables partitioned by date, hour & batchID
                                        18
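
A minimal sketch of the demux idea above: a mapper parses each event, extracts the target table, and keys the record by table plus date/hour/batchID partition so a partition-aware output format can write one directory per Hive table. Field names and the partition encoding are assumptions for illustration; Honu's actual demuxer works on Chukwa-compatible DataSink chunks.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative demux mapper: parse "DB=<table>\tkey=value..." records and
// emit them keyed by "<table>/dt=<date>/hour=<hour>/batchid=<id>" so a
// partitioned output format can route each record to its Hive table.
public class DemuxMapperSketch
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String record = line.toString();

    // Extract the target table from the "DB=" field (assumed encoding).
    String table = null;
    for (String field : record.split("\t")) {
      if (field.startsWith("DB=")) {
        table = field.substring("DB=".length());
        break;
      }
    }
    if (table == null) {
      context.getCounter("demux", "unparseable").increment(1);
      return;
    }

    // Partition values would normally come from the event timestamp and batch;
    // hard-coded here purely for illustration.
    String partition = table + "/dt=2010-06-29/hour=14/batchid=0";
    context.write(new Text(partition), new Text(record));
  }
}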
Data Processing Pipeline - Details

[Diagram: the Map/Reduce demux writes one output directory per table
 (Table 1 … Table n), which is then loaded into Hive; Hive data lands
 on S3 and is compacted by an hourly merge]

§  Demux output:
 ›    1 directory per Hive table
 ›    1 file per partition * reducerCount
Hive Data warehouse


§  All data is on S3 when final (SLA: 2 hours)
 ›    All data (CheckPoint, DataSink & final Hive output) is
      saved on S3; everything else is transient
 ›    No need to maintain live instances to store n years of
      data
 ›    Start/Stop EMR query cluster(s) on demand
      •  Get the best cluster size for you
      •  Hive Reload partitions (EMR-specific); see the sketch after this slide




                                        20
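
The "Hive Reload partitions (EMR-specific)" step refers to registering the partitions already sitting on S3 with the Hive metastore when a query cluster is started. A minimal sketch of driving that over Hive JDBC, assuming the EMR-specific ALTER TABLE ... RECOVER PARTITIONS statement and the HiveServer1-era driver; host, port and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: register newly arrived S3 partitions with the Hive metastore using
// the EMR-specific "RECOVER PARTITIONS" statement, driven over Hive JDBC.
public class RecoverPartitionsSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");  // HiveServer1-era driver

    Connection conn =
        DriverManager.getConnection("jdbc:hive://emr-master:10000/default", "", "");
    Statement stmt = conn.createStatement();
    try {
      // Picks up every dt/hour/batchid partition found under the table's S3
      // location, instead of issuing one ALTER TABLE ... ADD PARTITION each.
      stmt.execute("ALTER TABLE my_table RECOVER PARTITIONS");
    } finally {
      stmt.close();
      conn.close();
    }
  }
}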
Roadmap for Honu


§  Open source: GitHub (end of July)
  ›    Client SDK
  ›    Collector
  ›    Demuxer
§  Multiple writers
§  Persistent queues (client & server)
§  Real-time integration with external monitoring systems
§  HBase/Cassandra investigation
§  Map/Reduce-based data aggregator

                                    21
Questions?

   Jerome Boulon
   jboulon@gmail.com
