This document discusses LinkedIn's use of Kafka, Hadoop, Storm, and Couchbase in their big data pipeline. It provides an overview of each technology and how LinkedIn uses them together. Specifically, it describes how LinkedIn uses Kafka to stream data to Hadoop for analytics and report generation. It also discusses how LinkedIn uses Hadoop to pre-build and warm Couchbase buckets for improved performance. The presentation includes a use case of streaming member profile and activity data through Kafka to both Hadoop and Couchbase clusters.
2. Agenda
• Define Problem Domain
  Justin Michaels | Solution Architect, Couchbase
• Use case at LinkedIn
  Michael Kehoe | Site Reliability Engineer, LinkedIn
• Supporting Technology Overview and Demo
  Matt Ingenthron | Senior Director, Couchbase
• Q&A
5. Lambda Architecture
[Diagram: Kafka producers feed a broker cluster with ordered subscriptions; a Storm spout per topic feeds the speed layer, Hadoop forms the batch layer, and Couchbase serves interactive and real-time applications. Numbered flow: (1) data, (2) batch, (3) speed, (4) serve, (5) query.]
6. • Hadoop … an open-source framework for distributed storage and
distributed processing of very large data sets on commodity hardware
• Kafka … an append-only write-ahead log that records messages to a
persistent store and allows subscribers to read and apply these
changes to their own stores in an appropriate time frame
• Storm … a distributed framework that uses custom-created "spouts"
and "bolts" to define information sources and manipulations for
processing streaming data
• Couchbase … an open-source, distributed NoSQL document-oriented
database optimized for interactive applications, with an integrated
data cache and an incremental MapReduce facility
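As a sketch of the append-only log model described above, here is a minimal in-memory commit log with per-subscriber offsets. All names and messages are hypothetical illustrations of the concept, not Kafka's actual API:

```python
class CommitLog:
    """Toy append-only log: producers append, and each subscriber reads
    forward from its own offset, mirroring Kafka's consumption model."""

    def __init__(self):
        self._messages = []   # the ordered, persistent record
        self._offsets = {}    # subscriber name -> next index to read

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def read(self, subscriber, max_messages=10):
        start = self._offsets.get(subscriber, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch

log = CommitLog()
log.append({"event": "page_view", "page": "/profile"})
log.append({"event": "click", "target": "connect-button"})

print(log.read("hadoop-etl"))   # both events, in order
print(log.read("hadoop-etl"))   # nothing new -> []
```

Because every subscriber tracks its own offset, slow consumers (say, a nightly Hadoop job) and fast consumers (a Storm topology) can read the same log at their own pace.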
11. Michael Kehoe
• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
12. Kafka @ LinkedIn
• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) @ LinkedIn
13. LinkedIn’s uses of Kafka
• Monitoring
  • InGraphs
• Traditional Messaging (Pub-Sub)
• Analytics
  • Who Viewed My Profile
  • Experiment reports
  • Executive reports
• Building block for distributed (log-based) applications
  • Pinot
  • Espresso
14. Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our products
• Information such as which page was viewed and which content was clicked on is sent into a Kafka cluster in each data center
• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
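The collect-and-aggregate flow above can be sketched in miniature. The per-data-center event streams and field names here are hypothetical, purely to illustrate central collection followed by a daily rollup:

```python
from collections import Counter
from itertools import chain

# Hypothetical tracking events, one stream per data center.
dc_streams = {
    "dc-east": [{"event": "page_view", "page": "/jobs"},
                {"event": "click", "target": "apply"}],
    "dc-west": [{"event": "page_view", "page": "/profile"},
                {"event": "page_view", "page": "/jobs"}],
}

# Central collection: merge every data center's stream into one feed,
# then aggregate event counts for the daily report.
all_events = list(chain.from_iterable(dc_streams.values()))
report = Counter(e["event"] for e in all_events)

print(dict(report))   # {'page_view': 3, 'click': 1}
```

In production this merge happens across Kafka clusters rather than Python dicts, and the aggregation runs as a Hadoop job over far more event types, but the shape of the computation is the same.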
15. Couchbase @ LinkedIn
• About 25 separate services with one or more clusters in multiple data
centers
• Up to 100 servers in a cluster
• Single and Multi-tenant clusters
16. Use Case: Jobs Cluster
• Read scaling: Couchbase serves ~80k QPS on 24-server cluster(s)
• Hadoop pre-builds the data by partition
• Couchbase 99th-percentile latencies
17. Hadoop to Couchbase
• Our primary use case for Hadoop with Couchbase is building (warming) / recovering Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.
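A minimal sketch of the bucket-warming idea, assuming Hadoop has already emitted key/value pairs grouped by partition. All names and records here are hypothetical, not LinkedIn's actual in-house tooling:

```python
# Hypothetical pre-built output of a Hadoop job: one list of key/value
# pairs per Couchbase vBucket-style partition.
prebuilt_partitions = {
    0: [("member:1001", {"title": "Engineer"}),
        ("member:1003", {"title": "Designer"})],
    1: [("member:1002", {"title": "Recruiter"})],
}

def warm_bucket(partitions):
    """Load every pre-built partition into the bucket before it takes
    traffic, so the first live reads are already cache hits."""
    bucket = {}
    for partition_id in sorted(partitions):
        for key, value in partitions[partition_id]:
            bucket[key] = value
    return bucket

bucket = warm_bucket(prebuilt_partitions)
print(len(bucket))                     # 3 keys warmed
print(bucket["member:1002"]["title"])  # Recruiter
```

Recovery follows the same path: if a cluster is lost, the bucket is rebuilt from the latest Hadoop output rather than repopulated lazily on cache misses.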
Editor's Notes
Note: Remove the logos from the animation and speed up build.
Distributed user communities relying on interactive applications require systems to be distributed. As a result, data is created in a variety of forms and places. As the complexity of the problems to be solved increases, applications demand a variety of development environments for tackling different problems. These complex, real-time applications combine different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
Users and consumers of information increasingly demand always-on, low-latency access to their data, along with a framework for businesses to understand what’s happening in real time while addressing Polyglot Persistence in managing data. The Lambda Architecture is a conceptual framework for generic data processing that evolved out of Twitter and was coined by Nathan Marz. In a way, the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale.
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
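The five steps above can be sketched as a toy merge of batch and real-time views. All keys and counts are made-up examples; real serving and speed layers would be Couchbase views and Storm state, respectively:

```python
# Batch view: counts pre-computed by the batch layer (Hadoop) up to the
# last batch run. Hypothetical data for illustration.
batch_view = {"profile_views:alice": 120, "profile_views:bob": 45}

# Real-time view: counts for events that arrived after the batch run,
# maintained by the speed layer (Storm).
realtime_view = {"profile_views:alice": 3, "profile_views:carol": 7}

def query(key):
    """Step 5: answer a query by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("profile_views:alice"))  # 123
print(query("profile_views:carol"))  # 7
```

When the next batch run completes, its output absorbs the recent events and the corresponding real-time entries can be discarded, which is what keeps the speed layer small.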
Hadoop is engineered for storage and analysis.
It can store petabytes of data, and it can be deployed to thousands of servers. It started with map/reduce, then added Hive. Today, we see efforts like Impala and Drill, along with Hortonworks’ Stinger Initiative and Tez. Some Hadoop distributions are bundling Storm and/or Spark. The analytical capabilities of Hadoop continue to evolve and improve. However, it’s not well suited to operational workloads: it’s not intended to serve as a backend for enterprise, mobile, or web applications, or to provide interactive data access.
The data generated by users is published to Apache Kafka.
Next, it’s pulled into Apache Storm for real-time analysis and processing, as well as into Hadoop.
Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop, combined with additional data, and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data and to meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL, and Hadoop to do so.