1© Cloudera, Inc. All rights reserved.
Apache Kafka for Oracle DBAs
What is Kafka
Why should you care
How to learn Kafka
2© Cloudera, Inc. All rights reserved.
• Oracle DBA
• Turned Oracle Consultant
• Turned Hadoop Solutions Architect
• Turned Developer
Committer on Apache Sqoop
Contributor to Apache Kafka and
Apache Flume
About me
3© Cloudera, Inc. All rights reserved.
Apache Kafka is a
publish-subscribe messaging
rethought as a
distributed commit log.
An Optical Illusion
4© Cloudera, Inc. All rights reserved.
• Redo log as an abstraction
• How redo logs are useful
• Pub-sub message queues
• How message queues are useful
• What exactly is Kafka
• How do people use Kafka
• Where can you learn more
We’ll talk about:
5© Cloudera, Inc. All rights reserved.
Redo Log:
The most crucial structure for
recovery operations …
store all changes made to the
database as they occur.
6© Cloudera, Inc. All rights reserved.
Important Point
The redo log is the only reliable source of
information about the current state of the database.
7© Cloudera, Inc. All rights reserved.
The Redo Log is used to:
• Recover a consistent state of the database
• Replicate the database (Data Guard, Streams, GoldenGate…)
• Update materialized view logs (well, it’s a log anyway)
If you look far enough back into the archive logs, you can reconstruct the entire database.
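The claim that you can reconstruct the database from its logs can be sketched in a few lines. This toy Python example (an illustration of the idea, not Oracle internals) replays an ordered list of change records to rebuild the current state:

```python
# Toy illustration: reconstruct current state by replaying a change log.
# The "redo log" here is just an ordered list of (operation, key, value) records.

def replay(log):
    """Apply every change in order; the final dict is the current state."""
    state = {}
    for op, key, value in log:
        if op in ("INSERT", "UPDATE"):
            state[key] = value
        elif op == "DELETE":
            state.pop(key, None)
    return state

redo_log = [
    ("INSERT", "emp:1", {"name": "Scott", "dept": 10}),
    ("INSERT", "emp:2", {"name": "Gwen", "dept": 20}),
    ("UPDATE", "emp:1", {"name": "Scott", "dept": 30}),
    ("DELETE", "emp:2", None),
]

state = replay(redo_log)
print(state)  # {'emp:1': {'name': 'Scott', 'dept': 30}}
```

Replaying only a prefix of the log gives the state as of any earlier point in time, which is exactly how point-in-time recovery works conceptually.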
8© Cloudera, Inc. All rights reserved.
What if…
You built an entire data storage system
that is just a transaction log?
9© Cloudera, Inc. All rights reserved.
Kafka can log
• Transactions from any database
• Clicks from websites
• Application logs (ERROR, WARN, INFO…)
• Metrics – CPU, memory, I/O
• Audit events
• And any system can read those logs: Hadoop, alerts, dashboards, databases.
10© Cloudera, Inc. All rights reserved.
Only one thing is missing
Q: How do you query a redo log?
A: Not very efficiently
Sometimes we just need the events – no need to query.
Other times, we need to load the results into a database.
While messages are in transit – we can do all kinds of transformations.
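The "transformations in transit" point can be illustrated with a hypothetical pipeline stage; the event and field names here are invented for the example:

```python
# Toy illustration: transforming messages while they are "in transit"
# between a producer and a consumer. Field names are made up for the example.

def mask_card(event):
    """Redact a credit-card field before the event reaches downstream systems."""
    out = dict(event)
    if "card" in out:
        out["card"] = "****" + out["card"][-4:]
    return out

in_transit = [
    {"user": "alice", "card": "4111111111111111"},
    {"user": "bob", "action": "login"},
]

delivered = [mask_card(e) for e in in_transit]
print(delivered[0]["card"])  # ****1111
```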
11© Cloudera, Inc. All rights reserved.
12© Cloudera, Inc. All rights reserved.
Publish-Subscribe
Message Queue
13© Cloudera, Inc. All rights reserved.
Raise your hand if this sounds familiar
“My next project was to get a working Hadoop setup…
Having little experience in this area, we naturally budgeted
a few weeks for getting data in and out, and the rest of our
time for implementing fancy algorithms.”
--Jay Kreps, Kafka PMC
14© Cloudera, Inc. All rights reserved.
Data pipelines start like this.
[Diagram: one client connected to one source.]
15© Cloudera, Inc. All rights reserved.
Then we reuse them.
[Diagram: several clients all reading from the same source.]
16© Cloudera, Inc. All rights reserved.
Then we add consumers to the existing sources.
[Diagram: several clients, now also feeding a second backend.]
17© Cloudera, Inc. All rights reserved.
Then it starts to look like this.
[Diagram: many clients and many backends, wired together point-to-point.]
18© Cloudera, Inc. All rights reserved.
With maybe some of this.
[Diagram: the same clients and backends, with even more cross-connections.]
19© Cloudera, Inc. All rights reserved.
Queues decouple systems, both statically and in time
20© Cloudera, Inc. All rights reserved.
This is where we are trying to get:
[Diagram: Kafka decouples data pipelines – source systems publish through Producers to Kafka Brokers, and Consumers feed Hadoop, security systems, real-time monitoring, and the data warehouse.]
21© Cloudera, Inc. All rights reserved.
Important notes:
• Producers and Consumers don’t need to know about each other
• Performance issues on Consumers don’t impact Producers
• Consumers are protected from herds of Producers
• Lots of flexibility in handling load
• Messages are available to anyone –
lots of new use cases: monitoring, audit, troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches
22© Cloudera, Inc. All rights reserved.
So… What is Kafka?
23© Cloudera, Inc. All rights reserved.
Kafka provides a fast, distributed, highly scalable,
highly available, publish-subscribe messaging system.
In turn this solves part of a much harder problem:
Communication and integration between
components of large software systems
24© Cloudera, Inc. All rights reserved.
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster; nodes are called brokers
The Basics
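A minimal in-memory sketch of these basics – producers push to named topics, consumers pull at their own pace. The class and method names are invented for the illustration; this is not the real Kafka client API:

```python
# In-memory sketch of the Kafka model: producers push to a named topic,
# consumers pull from an offset they choose. Invented names, not real Kafka.

class Topic:
    def __init__(self, name):
        self.name = name
        self.messages = []               # append-only log

    def push(self, message):             # producer side
        self.messages.append(message)
        return len(self.messages) - 1    # offset of the new message

    def pull(self, offset, max_messages=10):   # consumer side
        return self.messages[offset:offset + max_messages]

clicks = Topic("web-clicks")
clicks.push({"page": "/home"})
clicks.push({"page": "/pricing"})

# A consumer pulls from the offset it last saw; the broker pushes nothing.
batch = clicks.pull(offset=0)
print(len(batch))  # 2
```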
25© Cloudera, Inc. All rights reserved.
Topics, Partitions and Logs
26© Cloudera, Inc. All rights reserved.
Each partition is a log
27© Cloudera, Inc. All rights reserved.
Each Broker has many partitions
[Diagram: three brokers, each holding several partitions (Partition 0, 1, and 2) of the cluster’s topics.]
28© Cloudera, Inc. All rights reserved.
Producers load balance between partitions
[Diagram: a producer client spreading writes across partitions 0, 1, and 2 on three brokers.]
29© Cloudera, Inc. All rights reserved.
Producers load balance between partitions
[Diagram: the same producer client, balancing writes across partitions on three brokers.]
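The load balancing just described can be sketched as a partitioner: keyed messages hash to a stable partition, keyless ones round-robin across all partitions. Names and details are illustrative, not the actual Kafka producer code:

```python
# Sketch of producer-side partition selection: messages with a key always
# hash to the same partition (preserving per-key ordering); keyless messages
# are sprayed round-robin. Invented names, not the real Kafka client.

import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key=None):
    if key is None:
        return next(_round_robin)
    # crc32 is stable across runs, unlike Python's randomized hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved.
assert choose_partition("user-42") == choose_partition("user-42")
print(choose_partition("user-42"))   # some partition in 0..2
```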
30© Cloudera, Inc. All rights reserved.
Consumers
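Since the broker keeps only each consumer's position in the log (the offset), adding consumers is cheap: nothing about the log changes. A toy sketch with invented names:

```python
# Sketch of Kafka-style consumers: the broker keeps the log; each consumer
# tracks only its own offset. Fast and slow consumers never block each other.
# Invented names for illustration, not the real consumer API.

log = ["m0", "m1", "m2", "m3", "m4"]    # one partition's log

class Consumer:
    def __init__(self):
        self.offset = 0                  # per-consumer position, the only state

    def poll(self, log, n=2):
        batch = log[self.offset:self.offset + n]
        self.offset += len(batch)        # "commit" the new position
        return batch

fast, slow = Consumer(), Consumer()
fast.poll(log, n=5)                      # reads everything
slow.poll(log, n=1)                      # lags behind; nobody is blocked
print(fast.offset, slow.offset)          # 5 1
```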
31© Cloudera, Inc. All rights reserved.
Why is Kafka better than other MQs?
• Can keep data forever
• Scales very well – high throughput, low latency, lots of storage
• Scales to any number of consumers
32© Cloudera, Inc. All rights reserved.
How do people use Kafka?
• As a message bus
• As a buffer for replication systems (like Advanced Queuing in Streams)
• As a reliable feed for event processing
• As a buffer for event processing
• To decouple apps from the database (both OLTP and DWH)
33© Cloudera, Inc. All rights reserved.
Need More Kafka?
• https://kafka.apache.org/documentation.html
• My video tutorial: http://shop.oreilly.com/product/0636920038603.do
• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
• Try with Cloudera Manager: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_install.html
34© Cloudera, Inc. All rights reserved.
One more thing...
35© Cloudera, Inc. All rights reserved.
Schema is a MUST HAVE for
data integration
36© Cloudera, Inc. All rights reserved.
Kafka only stores bytes – so where’s the schema?
• People go around asking each other:
“So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
• Each message has a Schema ID
• Each topic has a Schema ID
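The centralized-repository option can be sketched as follows. The `registry`, `encode`, and `decode` names are hypothetical, and real implementations (such as the schema-repo project mentioned in the notes) differ:

```python
# Sketch of the "centralized schema repository" idea: messages carry only a
# small schema ID, and readers look the full schema up in a shared registry.
# All names here are hypothetical, for illustration only.

import json

registry = {}           # schema_id -> schema (the shared repository)

def register(schema_id, schema):
    registry[schema_id] = schema

def encode(schema_id, record):
    # Prefix the payload with the schema ID instead of the whole schema.
    return json.dumps({"schema_id": schema_id, "payload": record})

def decode(message):
    msg = json.loads(message)
    schema = registry[msg["schema_id"]]    # reader resolves the schema
    return schema, msg["payload"]

register(1, {"fields": ["user", "page"]})
wire = encode(1, {"user": "alice", "page": "/home"})
schema, payload = decode(wire)
print(schema["fields"])  # ['user', 'page']
```

Shipping an ID instead of the full schema keeps messages small while still letting every reader interpret the bytes.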
37© Cloudera, Inc. All rights reserved.
I ❤ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
• Add and remove fields without breaking anything
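The schema-evolution rule – new fields need defaults so records written under an older schema still decode – can be simulated with plain dicts (a stdlib sketch of the idea, not the avro library):

```python
# Stdlib simulation of Avro-style schema evolution: a newer reader schema
# adds a field with a default, so old records decode without breaking.

def read_with_schema(record, reader_schema):
    """Fill in defaults for fields the writer didn't know about."""
    out = {}
    for field, default in reader_schema.items():
        out[field] = record.get(field, default)
    return out

old_record = {"name": "Gwen"}                    # written before 'dept' existed
new_schema = {"name": None, "dept": "unknown"}   # new field with a default

decoded = read_with_schema(old_record, new_schema)
print(decoded)  # {'name': 'Gwen', 'dept': 'unknown'}
```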
38© Cloudera, Inc. All rights reserved.
Replicating from Oracle to Kafka?
Don’t lose the schema!
39© Cloudera, Inc. All rights reserved.
Schemas are Agile
• Leave MySQL and your favorite DBA aside for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Schemas allow validating data soon after it’s written
• No need to throw away data that doesn’t fit!
41© Cloudera, Inc. All rights reserved.
Thank you
@gwenshap
gshapira@cloudera.com
Editor's Notes
  1. Then we end up adding clients to use that source.
  2. But as we start to deploy our applications, we realize that clients need data from a number of sources. So we add them as needed.
  3. But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  4. Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where it is used as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producers push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  5. Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
  6. Kafka retains all messages for a fixed amount of time, not waiting for acks from consumers. The only metadata retained per consumer is its position in the log – the offset. So adding many consumers is cheap. On the other hand, consumers have more responsibility and are more challenging to implement correctly. And “batching” consumers is not a problem.
  7. 3 partitions, each replicated 3 times.
  8. They choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
  9. They choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
  10. Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition. Leaders obviously do more work, but they are balanced between nodes. We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
  11. Sorry, but “Schema on Read” is kind of B.S. We admit that there is a schema, but we want to “ingest fast”, so we shift the burden to the readers. But the data is written once and read many, many times by many different people. They each need to figure this out on their own? This makes no sense. Also, how are you going to validate the data without a schema?
  12. https://github.com/schema-repo/schema-repo There’s no data dictionary for Kafka