Couchbase to Hadoop at LinkedIn
Kafka is Enabling the Big Data Pipeline

Justin Michaels | Solution Architect, Couchbase
Michael Kehoe | Site Reliability Engineer, LinkedIn
Matt Ingenthron | Senior Director, Couchbase

Agenda
• Define Problem Domain
• Use case at LinkedIn
• Supporting Technology Overview and Demo
• Q&A
Lambda Architecture

[Diagram: the five numbered elements of the Lambda Architecture: DATA, BATCH, SPEED, SERVE, and QUERY layers.]
Lambda Architecture: Interactive and Real-Time Applications

[Diagram: the same DATA, BATCH, SPEED, SERVE, and QUERY layers with technologies mapped in. Kafka producers publish to a broker cluster; Hadoop implements the batch layer; Storm, consuming through a spout for each topic with ordered subscriptions, implements the speed layer; Couchbase serves interactive queries.]
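As a concrete sketch of the data layer above, the following is a minimal Kafka producer publishing a tracking event to the broker cluster, written against the standard Apache Kafka Java client. The broker address, topic name, and payload are illustrative assumptions, not details from the deck.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative bootstrap address; production configs list several brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The event is appended to the topic's log once; the batch (Hadoop)
            // and speed (Storm) layers then consume the same stream independently.
            producer.send(new ProducerRecord<>("page-views", "member-123",
                    "{\"page\":\"/profile\"}"));
        }
    }
}
```

The decoupling is the point of the architecture: the log is written once, and the batch and speed tracks read it at their own pace.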
• Hadoop … an open-source framework for distributed storage and distributed processing of very large data sets on commodity hardware
• Kafka … an append-only write-ahead log that records messages to a persistent store and lets subscribers read and apply these changes to their own stores in an appropriate time frame
• Storm … a distributed framework that uses custom-created "spouts" and "bolts" to define information sources and manipulations for processing streaming data
• Couchbase … an open-source, distributed NoSQL document-oriented database optimized for interactive applications, with an integrated data cache and incremental map-reduce facility
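To show how these pieces compose in the speed layer, here is a minimal sketch wiring a Kafka spout to a bolt in a Storm topology, using the classic storm-kafka integration of this deck's era. All names (ZooKeeper address, topic, component IDs) are illustrative assumptions, and the bolt is a hypothetical stand-in for real processing.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PageViewTopology {

    // Hypothetical terminal bolt standing in for real stream processing.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // A real speed-layer bolt would update a view, e.g. write to Couchbase.
            System.out.println("event: " + tuple.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        // "Spout for topic": track read offsets in ZooKeeper, decode bytes as strings.
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("localhost:2181"), "page-views",
                "/kafka-offsets", "page-view-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig));
        builder.setBolt("print", new PrintBolt()).shuffleGrouping("kafka-spout");

        new LocalCluster().submitTopology("page-views", new Config(),
                builder.createTopology());
    }
}
```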
[Diagram: a real-time data pipeline. A TRACKING and COLLECTION tier feeds two tracks: a REAL-TIME TRACK with complex event processing (REST, FILTER, and METRICS stages) backed by a real-time repository, and a BATCH TRACK backed by a perpetual store and an analytical DB. The ANALYSIS and VISUALIZATION tier spans business intelligence, monitoring, a dashboard, and a chat/voice system.]
Use Case at LinkedIn
Michael Kehoe
• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
Kafka @ LinkedIn
• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) @ LinkedIn
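Because Kafka is a publish-subscribe system over a distributed commit log, a subscriber is simply a consumer that reads the appended messages in order and applies them at its own pace. A minimal sketch with the Apache Kafka Java consumer follows; the topic, group name, and broker address are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each consumer group keeps its own position (offset) in the log.
        props.put("group.id", "profile-analytics");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Within a partition, messages arrive in append order.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n",
                            record.offset(), record.value());
                }
            }
        }
    }
}
```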
LinkedIn’s uses of Kafka
• Monitoring
  • InGraphs
• Traditional Messaging (Pub-Sub)
• Analytics
  • Who Viewed my Profile
  • Experiment reports
  • Executive reports
• Building block for (log-based) distributed applications
  • Pinot
  • Espresso
Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our products
• Information such as which page was viewed and which content was clicked is sent into a Kafka cluster in each data center
• These events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
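LinkedIn's actual Kafka-to-Hadoop pipeline is an in-house system not detailed in the deck, so the following is only a rough sketch of the shape of that flow: a plain Kafka consumer draining tracking events into an HDFS file for later report generation. Every path, topic, and name here is an illustrative assumption.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TrackingToHdfs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "hadoop-etl");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        FileSystem fs = FileSystem.get(new Configuration());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FSDataOutputStream out =
                     fs.create(new Path("/data/tracking/page-views/current.log"))) {
            consumer.subscribe(Collections.singletonList("page-views"));
            // A real job would batch by time window and roll files, e.g. daily.
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    out.writeBytes(record.value() + "\n");
                }
                out.hflush();  // make appended records visible to readers
            }
        }
    }
}
```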
Couchbase @ LinkedIn
• About 25 separate services, each with one or more clusters in multiple data centers
• Up to 100 servers in a cluster
• Single- and multi-tenant clusters
Use Case: Jobs Cluster
• Read scaling: Couchbase serves ~80k QPS on 24-server cluster(s)
• Hadoop pre-builds the data by partition
• Couchbase 99th-percentile latencies
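The read-scaling path comes down to key-value gets against the cluster. Here is a minimal sketch using the Couchbase Java SDK (the 2.x API of this deck's era); the host, bucket, and key names are illustrative assumptions, as the deck does not show LinkedIn's client code.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

public class JobsReader {
    public static void main(String[] args) {
        // Connect to one node; the SDK discovers the rest of the cluster.
        Cluster cluster = CouchbaseCluster.create("couchbase-host");
        Bucket bucket = cluster.openBucket("jobs");

        // A single key-value get: the hot path behind a high-QPS read workload.
        JsonDocument doc = bucket.get("job::12345");
        if (doc != null) {
            System.out.println(doc.content());
        }

        cluster.disconnect();
    }
}
```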
Hadoop to Couchbase
• Our primary use case for Hadoop → Couchbase is building (warming) / recovering Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc.
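That in-house loader is not shown in the deck, so the following is only a rough sketch of the idea: assuming Hadoop has pre-built the records into HDFS files of key/JSON lines, a warming job could stream them into a bucket with upserts (Couchbase Java SDK 2.x; every path and name here is an assumption).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.RawJsonDocument;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketWarmer {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Cluster cluster = CouchbaseCluster.create("couchbase-host");
        Bucket bucket = cluster.openBucket("jobs");

        // Illustrative input: one "key<TAB>json" record per line, pre-built by Hadoop.
        Path input = new Path("/data/prebuilt/jobs/part-00000");
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                // Upsert rebuilds (warms) the bucket without deleting existing keys.
                bucket.upsert(RawJsonDocument.create(parts[0], parts[1]));
            }
        }
        cluster.disconnect();
    }
}
```

Upsert fits the warming/recovery case because it is idempotent: re-running the job after a partial failure simply overwrites the same keys.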
Editor's Notes

1. Note: remove the logos from the animation and speed up the build. Distributed user communities relying on interactive applications require the systems behind them to be distributed. As a result, data is created in a variety of forms and places. As the complexity of the problems to be solved increases, applications demand a variety of development environments for tackling different problems, and these complex, real-time applications combine several of them. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
2. Users and consumers of information increasingly demand always-on, low-latency access to their data, as well as a framework for businesses to understand what is happening in real time, all while addressing Polyglot Persistence in managing data. The Lambda Architecture is a conceptual framework for generic data processing that evolved out of Twitter and was coined by Nathan Marz. In a way the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale. 1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views. 3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad hoc way. 4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 5. Any incoming query can be answered by merging results from the batch views and the real-time views.
3. Hadoop is engineered for storage and analysis. It can store petabytes of data, and it can be deployed to thousands of servers. It started with map/reduce, then added Hive. Today we see efforts like Impala and Drill, along with Hortonworks' Stinger Initiative and Tez, and some Hadoop distributions bundle Storm and/or Spark. The analytical capabilities of Hadoop continue to evolve and improve. However, it is not well suited to operational workloads: it is not intended to serve as a backend for enterprise, mobile, or web applications, nor to provide interactive data access.
5. The data generated by users is published to Apache Kafka. Next, it is pulled into Apache Storm for real-time analysis and processing, as well as into Hadoop. Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents, while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
6. The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster, while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop, where it is combined with additional data, and the combined set is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of the data and to meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL, and Hadoop to do so.