Honu:
 A large scale data collection
and processing pipeline
Jerome Boulon
Netflix
Session Agenda


§  Honu
§  Goals
§  Architecture – Overview
§  Data Collection pipeline
§  Data Processing pipeline
§  Hive data warehouse
§  Honu Roadmap
§  Questions


                               2
Honu


§  Honu is a streaming data & log collection and processing
    pipeline built using:
 ›    Hadoop
 ›    Hive
 ›    Thrift


§  Honu has been running in production for 6 months and
    processes over a billion log events per day on EC2/EMR.


                                3
What are we trying to achieve?


§  Scalable log analysis to gain business insights:
 ›    Error logs (unstructured logs)
 ›    Statistical logs (structured logs - App specific)
 ›    Performance logs (structured logs – Standard + App specific)
§  Output required:
 ›    Engineers access:
      •  Ad-hoc queries and reporting
 ›    BI access:
      •  Flat files to be loaded into the BI system for cross-functional reporting.
      •  Ad-hoc queries for data examination, etc.

                                           4
Architecture - Overview

 [Diagram: Application → Collector (data collection pipeline);
  Collector output → M/R → Hive (data processing pipeline)]

                               5
Current Netflix deployment EC2/EMR


 [Diagram: Applications and Honu Collectors run on Amazon EC2; collectors
  write to S3; Hive & Hadoop EMR clusters process the data, with a separate
  Hive & Hadoop EMR cluster for ad-hoc query, backed by the Hive MetaStore.]
                                         10
Honu – Client SDK



 Structured Log API
 -  Log4j Appenders
 -  Hadoop Metric Plugin
 -  Tomcat Access Log

 -  Converts individual messages to batches (see the sketch after this slide)
 -  In-memory buffering system

 Communication layer:
 -  Discovery Service
 -  Transparent fail-over & load-balancing
 -  Thrift as a transport protocol & RPC (NIO)
                                       11
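
A minimal sketch of the batching idea above, in plain Java: the appender hands messages to a bounded in-memory buffer, and a background thread drains them in batches toward a collector. All class and method names here are hypothetical illustrations, not Honu's actual SDK API; the real communication layer adds discovery, fail-over across collectors, and Thrift NIO transport on top of the send step.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the client-side buffering/batching idea:
// messages are queued in memory and a background thread ships them in batches.
public class BatchingSender implements Runnable {

  private final BlockingQueue<String> buffer;   // bounded in-memory buffer
  private final int batchSize;                  // max messages per batch

  public BatchingSender(int bufferCapacity, int batchSize) {
    this.buffer = new LinkedBlockingQueue<String>(bufferCapacity);
    this.batchSize = batchSize;
  }

  // Called by the appender; drops the message if the buffer is full
  // (a real client could also block or spill to disk).
  public boolean append(String message) {
    return buffer.offer(message);
  }

  public void run() {
    List<String> batch = new ArrayList<String>(batchSize);
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Wait for at least one message, then drain up to batchSize.
        batch.add(buffer.take());
        buffer.drainTo(batch, batchSize - 1);
        send(batch);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        batch.clear();
      }
    }
  }

  // Placeholder for the Thrift RPC call to a collector behind the VIP.
  private void send(List<String> batch) {
    System.out.println("sending batch of " + batch.size() + " events");
  }
}

A sender thread would simply wrap this, e.g. new Thread(new BatchingSender(10000, 200)).start().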
Honu - Unstructured/Structured logs
Log4J Appender

§  Configuration using a standard Log4j properties file (see the sketch
    after this slide)
§  Control:
  ›    In-memory buffer size
  ›    Batch size
  ›    Number of senders + VIP address
  ›    Timeout




                                  12
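
For illustration only, a hypothetical log4j.properties fragment showing where the controls above would live. The appender class name and every property key below are assumptions, not Honu's documented configuration keys:

# Hypothetical example: appender class name and property keys are assumptions.
log4j.rootLogger=INFO, HONU
log4j.appender.HONU=org.honu.log4j.HonuAppender
# In-memory buffer size (bytes)
log4j.appender.HONU.bufferSizeBytes=10485760
# Messages per batch
log4j.appender.HONU.batchSize=200
# Number of sender threads and the collector VIP they talk to
log4j.appender.HONU.senders=2
log4j.appender.HONU.collectorVip=honu-collector.example.com:7101
# Send timeout (ms)
log4j.appender.HONU.timeoutMs=5000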
Honu - Structured Log API

[Diagram: App → Hive]

Using Annotations
§  Convert a Java class to a Hive table dynamically
§  Add/Remove columns
§  Supported Java types:
   •  All primitives
   •  Map
   •  Object, using its toString method

Using the Key/Value API
§  Produces the same result as annotations
§  Avoids unnecessary object creation
§  Fully dynamic
§  Thread safe
Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() {
    […]
  }

  @Column("clickIndex")
  public int getClickIndex() {
    […]
  }

  @Column("requestInfo")
  public Map getRequestInfo() {
    […]
  }
}

Usage:                                   Resulting message:

log.info(myAnnotatableObj);              DB=MyTable
                                         movieId=XXXX
                                         clickIndex=3
                                         requestInfo.returnCode=200
                                         requestInfo.duration_ms=300
                                         requestInfo.yyy=zzz

                                    14
Structured Log API - Key/Value API

KeyValueSerialization kv;
kv = new KeyValueSerialization();

[…]
kv.startMessage("MyTable");

kv.addKeyValue("movieId", "XXX");
kv.addKeyValue("clickIndex", 3);
kv.addKeyValue("requestInfo", requestInfoMap);

log.info(kv.generateMessage());

Resulting message:

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz

                                    15
Honu Collector


§  Honu collector:
 ›    Saves logs to the FS using local storage & the Hadoop FS API
      (see the sketch after this slide)
 ›    FS can be localFS, HDFS, S3n, NFS…
      •  FS fail-over coming (similar to Scribe)
 ›    Thrift NIO
 ›    Multiple writers (data grouping)


§  Output: DataSink (binary compatible with Chukwa)
 ›    Compression (LZO/GZIP via Hadoop codecs)
 ›    S3 optimizations
                                         16
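
The collector's write path boils down to creating a file on whichever filesystem the URI scheme selects and streaming batches through a Hadoop compression codec. A minimal sketch of that idea with the standard Hadoop FS API (the path, codec choice and record format are assumptions; the real DataSink writer is Chukwa-compatible and more involved):

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Illustration of the collector's write path: any Hadoop-supported FS
// (file://, hdfs://, s3n://, ...) sits behind the same FileSystem API,
// with compression handled by a Hadoop codec.
public class DataSinkWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The URI scheme decides the backing store; "file:///tmp/..." is just an example.
    Path sinkFile = new Path("file:///tmp/honu/dataSink-20100629-1400.gz");
    FileSystem fs = FileSystem.get(sinkFile.toUri(), conf);

    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    OutputStream out = codec.createOutputStream(fs.create(sinkFile));
    try {
      // A real collector would serialize Chukwa-compatible chunks here.
      out.write("DB=MyTable\tmovieId=XXXX\tclickIndex=3\n".getBytes("UTF-8"));
    } finally {
      out.close();
    }
  }
}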
Data Processing pipeline

 [Diagram: same overview as before, with the data processing pipeline
  (Collector output → M/R → Hive) highlighted]

                               17
Data Processing Pipeline
§  Proprietary Data Warehouse Workflows
   ›  Ability to test a new build with production data
   ›  Ability to replay some data processing



§  CheckPoint System
   ›    Keeps track of all current states for recovery


§  Demuxer: a Map-Reduce job that parses/dispatches all logs to the
    right Hive table (see the sketch after this slide)
   ›  Multiple parsers
   ›  Dynamic Output Format for Hive (tables & columns
      management)
       •  Default schema (Map, hostname & timestamp)
       •  Table-specific schema
       •  All tables partitioned by date, hour & batchID
                                        18
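
A minimal sketch of the demux idea above: a mapper parses each event, extracts the target table, and keys the record by table plus date/hour/batchID partition so a partition-aware output format can write one directory per Hive table. Field names and the partition encoding are assumptions for illustration; Honu's actual demuxer works on Chukwa-compatible DataSink chunks.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative demux mapper: parse "DB=<table>\tkey=value..." records and
// emit them keyed by "<table>/dt=<date>/hour=<hour>/batchid=<id>" so a
// partitioned output format can route each record to its Hive table.
public class DemuxMapperSketch
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String record = line.toString();

    // Extract the target table from the "DB=" field (assumed encoding).
    String table = null;
    for (String field : record.split("\t")) {
      if (field.startsWith("DB=")) {
        table = field.substring("DB=".length());
        break;
      }
    }
    if (table == null) {
      context.getCounter("demux", "unparseable").increment(1);
      return;
    }

    // Partition values would normally come from the event timestamp and batch;
    // hard-coded here purely for illustration.
    String partition = table + "/dt=2010-06-29/hour=14/batchid=0";
    context.write(new Text(partition), new Text(record));
  }
}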
Data Processing Pipeline - Details

[Diagram: the Map/Reduce demux writes one output directory per table
 (Table 1 … Table n), which is then loaded into Hive; Hive data lands
 on S3 and is compacted by an hourly merge]

§  Demux output:
 ›    1 directory per Hive table
 ›    1 file per partition * reducerCount
Hive Data warehouse


§  All data is on S3 when final (SLA: 2 hours)
 ›    All data (CheckPoint, DataSink & final Hive output) is
      saved on S3; everything else is transient
 ›    No need to maintain live instances to store n years of
      data
 ›    Start/Stop EMR query cluster(s) on demand
      •  Get the best cluster size for you
      •  Hive Reload partitions (EMR-specific); see the sketch after this slide




                                        20
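
The "Hive Reload partitions (EMR-specific)" step refers to registering the partitions already sitting on S3 with the Hive metastore when a query cluster is started. A minimal sketch of driving that over Hive JDBC, assuming the EMR-specific ALTER TABLE ... RECOVER PARTITIONS statement and the HiveServer1-era driver; host, port and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: register newly arrived S3 partitions with the Hive metastore using
// the EMR-specific "RECOVER PARTITIONS" statement, driven over Hive JDBC.
public class RecoverPartitionsSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");  // HiveServer1-era driver

    Connection conn =
        DriverManager.getConnection("jdbc:hive://emr-master:10000/default", "", "");
    Statement stmt = conn.createStatement();
    try {
      // Picks up every dt/hour/batchid partition found under the table's S3
      // location, instead of issuing one ALTER TABLE ... ADD PARTITION each.
      stmt.execute("ALTER TABLE my_table RECOVER PARTITIONS");
    } finally {
      stmt.close();
      conn.close();
    }
  }
}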
Roadmap for Honu


§  Open source: GitHub (end of July)
  ›    Client SDK
  ›    Collector
  ›    Demuxer
§  Multiple writers
§  Persistent queues (client & server)
§  Real-time integration with external monitoring systems
§  HBase/Cassandra investigation
§  Map/Reduce-based data aggregator

                                    21
Questions?

   Jerome Boulon
   jboulon@gmail.com
