This document discusses LinkedIn's use of Kafka, Hadoop, Storm, and Couchbase in their big data pipeline. It provides an overview of each technology and how LinkedIn uses them together. Specifically, it describes how LinkedIn uses Kafka to stream data to Hadoop for analytics and report generation. It also discusses how LinkedIn uses Hadoop to pre-build and warm Couchbase buckets for improved performance. The presentation includes a use case of streaming member profile and activity data through Kafka to both Hadoop and Couchbase clusters.
2. Agenda
• Define Problem Domain
  Justin Michaels | Solution Architect, Couchbase
• Use case at LinkedIn
  Michael Kehoe | Site Reliability Engineer, LinkedIn
• Supporting Technology Overview and Demo
  Matt Ingenthron | Senior Director, Couchbase
• Q&A
5. Lambda Architecture
[Diagram: Kafka producers feed a broker cluster with ordered subscriptions; a Storm spout per topic feeds the speed layer, Hadoop forms the batch layer, and Couchbase serves interactive and real-time applications. Numbered flow: (1) data, (2) batch, (3) speed, (4) serve, (5) query.]
6. • Hadoop … an open-source framework for distributed storage and
distributed processing of very large data sets on commodity hardware
• Kafka … an append-only write-ahead log that records messages to a
persistent store and allows subscribers to read and apply these
changes to their own stores in an appropriate time frame
• Storm … a distributed framework that uses custom-created "spouts"
and "bolts" to define information sources and manipulations for
processing streaming data
• Couchbase … an open-source, distributed NoSQL document-oriented
database optimized for interactive applications, with an integrated
data cache and an incremental MapReduce facility
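As a sketch of the append-only log model described above, here is a minimal in-memory commit log with per-subscriber offsets. All names and messages are hypothetical illustrations of the concept, not Kafka's actual API:

```python
class CommitLog:
    """Toy append-only log: producers append, and each subscriber reads
    forward from its own offset, mirroring Kafka's consumption model."""

    def __init__(self):
        self._messages = []   # the ordered, persistent record
        self._offsets = {}    # subscriber name -> next index to read

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def read(self, subscriber, max_messages=10):
        start = self._offsets.get(subscriber, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch

log = CommitLog()
log.append({"event": "page_view", "page": "/profile"})
log.append({"event": "click", "target": "connect-button"})

print(log.read("hadoop-etl"))   # both events, in order
print(log.read("hadoop-etl"))   # nothing new -> []
```

Because every subscriber tracks its own offset, slow consumers (say, a nightly Hadoop job) and fast consumers (a Storm topology) can read the same log at their own pace.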
11. Michael Kehoe
• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of LinkedIn’s CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
12. Kafka @ LinkedIn
• Kafka was created by LinkedIn
• Kafka is a publish-subscribe system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages) @ LinkedIn
13. LinkedIn’s uses of Kafka
• Monitoring
  • InGraphs
• Traditional Messaging (Pub-Sub)
• Analytics
  • Who Viewed My Profile
  • Experiment reports
  • Executive reports
• Building block for distributed (log-based) applications
  • Pinot
  • Espresso
14. Use Case: Kafka to Hadoop (Analytics)
• LinkedIn tracks data to better understand how members use our products
• Information such as which page was viewed and which content was clicked on is sent into a Kafka cluster in each data center
• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
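The collect-and-aggregate flow above can be sketched in miniature. The per-data-center event streams and field names here are hypothetical, purely to illustrate central collection followed by a daily rollup:

```python
from collections import Counter
from itertools import chain

# Hypothetical tracking events, one stream per data center.
dc_streams = {
    "dc-east": [{"event": "page_view", "page": "/jobs"},
                {"event": "click", "target": "apply"}],
    "dc-west": [{"event": "page_view", "page": "/profile"},
                {"event": "page_view", "page": "/jobs"}],
}

# Central collection: merge every data center's stream into one feed,
# then aggregate event counts for the daily report.
all_events = list(chain.from_iterable(dc_streams.values()))
report = Counter(e["event"] for e in all_events)

print(dict(report))   # {'page_view': 3, 'click': 1}
```

In production this merge happens across Kafka clusters rather than Python dicts, and the aggregation runs as a Hadoop job over far more event types, but the shape of the computation is the same.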
15. Couchbase @ LinkedIn
• About 25 separate services with one or more clusters in multiple data
centers
• Up to 100 servers in a cluster
• Single and Multi-tenant clusters
16. Use Case: Jobs Cluster
• Read scaling: Couchbase serves ~80k QPS on 24-server cluster(s)
• Hadoop pre-builds the data by partition
• Couchbase 99th-percentile latencies
17. Hadoop to Couchbase
• Our primary use case for Hadoop with Couchbase is building (warming) / recovering Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, cache invalidation procedures, etc.
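A minimal sketch of the bucket-warming idea, assuming Hadoop has already emitted key/value pairs grouped by partition. All names and records here are hypothetical, not LinkedIn's actual in-house tooling:

```python
# Hypothetical pre-built output of a Hadoop job: one list of key/value
# pairs per Couchbase vBucket-style partition.
prebuilt_partitions = {
    0: [("member:1001", {"title": "Engineer"}),
        ("member:1003", {"title": "Designer"})],
    1: [("member:1002", {"title": "Recruiter"})],
}

def warm_bucket(partitions):
    """Load every pre-built partition into the bucket before it takes
    traffic, so the first live reads are already cache hits."""
    bucket = {}
    for partition_id in sorted(partitions):
        for key, value in partitions[partition_id]:
            bucket[key] = value
    return bucket

bucket = warm_bucket(prebuilt_partitions)
print(len(bucket))                     # 3 keys warmed
print(bucket["member:1002"]["title"])  # Recruiter
```

Recovery follows the same path: if a cluster is lost, the bucket is rebuilt from the latest Hadoop output rather than repopulated lazily on cache misses.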
Editor's Notes
Note: Remove the logos from the animation and speed up build.
Distributed user communities relying on interactive applications require systems to be distributed. As a result, data is created in a variety of forms and places. As the complexity of the problems to be solved increases, applications demand a variety of development environments for tackling different problems. These complex, real-time applications combine different problems. Reliably storing, providing access to, and analyzing this data landscape leads to the Polyglot Persistence of data.
Users and consumers of information increasingly demand always-on, low-latency access to their data, along with a framework for businesses to understand what’s happening in real time while addressing Polyglot Persistence in managing data. The Lambda Architecture is a conceptual framework for generic data processing that evolved out of Twitter and was coined by Nathan Marz. In a way, the architecture is an extended event-sourced system, but it aims to accommodate streaming data at large scale.
1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad hoc way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
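The five steps above can be sketched as a toy merge of batch and real-time views. All keys and counts are made-up examples; real serving and speed layers would be Couchbase views and Storm state, respectively:

```python
# Batch view: counts pre-computed by the batch layer (Hadoop) up to the
# last batch run. Hypothetical data for illustration.
batch_view = {"profile_views:alice": 120, "profile_views:bob": 45}

# Real-time view: counts for events that arrived after the batch run,
# maintained by the speed layer (Storm).
realtime_view = {"profile_views:alice": 3, "profile_views:carol": 7}

def query(key):
    """Step 5: answer a query by merging the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("profile_views:alice"))  # 123
print(query("profile_views:carol"))  # 7
```

When the next batch run completes, its output absorbs the recent events and the corresponding real-time entries can be discarded, which is what keeps the speed layer small.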
Hadoop is engineered for storage and analysis.
It can store petabytes of data, and it can be deployed to thousands of servers. It started with map/reduce, then added Hive. Today, we see efforts like Impala and Drill, along with Hortonworks’ Stinger Initiative and Tez. Some Hadoop distributions are bundling Storm and/or Spark. The analytical capabilities of Hadoop continue to evolve and improve. However, it’s not well suited to operational workloads: it’s not intended to serve as a backend for enterprise, mobile, or web applications, or to provide interactive data access.
The data generated by users is published to Apache Kafka.
Next, it’s pulled into Apache Storm for real-time analysis and processing, as well as into Hadoop.
Finally, Storm writes the data to Couchbase Server for real-time access by LivePerson agents while the data in Hadoop is eventually accessed via HP Vertica and MicroStrategy for offline business intelligence and analysis.
The data is first collected by a tracking and collection service. Next, Storm pulls the data in for filtering, enrichment, and statistical analysis. The raw data is written to one Couchbase Server cluster while the processed data is written to a separate Couchbase Server cluster. The processed data is accessed by a front end for visualization and analysis. In addition, the raw data is copied from Couchbase Server to Hadoop, combined with additional data, and the whole is moved into HBase for ad hoc analysis. PayPal was able to handle both the volume and the velocity of data and to meet both operational and analytical requirements. They relied on data capture, stream processing, NoSQL, and Hadoop to do so.