1© Cloudera, Inc. All rights reserved.
Apache Kafka for Oracle DBAs
What is Kafka
Why should you care
How to learn Kafka
2© Cloudera, Inc. All rights reserved.
• Oracle DBA
• Turned Oracle Consultant
• Turned Hadoop Solutions Architect
• Turned Developer
Committer on Apache Sqoop
Contributor to Apache Kafka and
Apache Flume
About me
3© Cloudera, Inc. All rights reserved.
Apache Kafka is a
publish-subscribe messaging
rethought as a
distributed commit log.
An Optical Illusion
4© Cloudera, Inc. All rights reserved.
• Redo log as an abstraction
• How redo logs are useful
• Pub-sub message queues
• How message queues are useful
• What exactly is Kafka
• How do people use Kafka
• Where can you learn more
We’ll talk about:
5© Cloudera, Inc. All rights reserved.
Redo Log:
The most crucial structure for
recovery operations …
store all changes made to the
database as they occur.
6© Cloudera, Inc. All rights reserved.
Important Point
The redo log is the only reliable source of
information about the current state of the database.
7© Cloudera, Inc. All rights reserved.
The Redo Log is used to:
• Recover a consistent state of the database
• Replicate the database (Data Guard, Streams, GoldenGate…)
• Update materialized view logs (well, it’s a log anyway)
If you look far enough back into the archive logs, you can reconstruct the entire database.
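The claim that you can reconstruct the database from its logs can be sketched in a few lines. This toy Python example (an illustration of the idea, not Oracle internals) replays an ordered list of change records to rebuild the current state:

```python
# Toy illustration: reconstruct current state by replaying a change log.
# The "redo log" here is just an ordered list of (operation, key, value) records.

def replay(log):
    """Apply every change in order; the final dict is the current state."""
    state = {}
    for op, key, value in log:
        if op in ("INSERT", "UPDATE"):
            state[key] = value
        elif op == "DELETE":
            state.pop(key, None)
    return state

redo_log = [
    ("INSERT", "emp:1", {"name": "Scott", "dept": 10}),
    ("INSERT", "emp:2", {"name": "Gwen", "dept": 20}),
    ("UPDATE", "emp:1", {"name": "Scott", "dept": 30}),
    ("DELETE", "emp:2", None),
]

state = replay(redo_log)
print(state)  # {'emp:1': {'name': 'Scott', 'dept': 30}}
```

Replaying only a prefix of the log gives the state as of any earlier point in time, which is exactly how point-in-time recovery works conceptually.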
8© Cloudera, Inc. All rights reserved.
What if…
You built an entire data storage system
that is just a transaction log?
9© Cloudera, Inc. All rights reserved.
Kafka can log
• Transactions from any database
• Clicks from websites
• Application logs (ERROR, WARN, INFO…)
• Metrics – CPU, memory, I/O
• Audit events
• And any system can read those logs: Hadoop, alerts, dashboards, databases.
10© Cloudera, Inc. All rights reserved.
Only one thing is missing
Q: How do you query a redo log?
A: Not very efficiently
Sometimes we just need the events – no need to query.
Other times, we need to load the results into a database.
While messages are in transit – we can do all kinds of transformations.
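The "transformations in transit" point can be illustrated with a hypothetical pipeline stage; the event and field names here are invented for the example:

```python
# Toy illustration: transforming messages while they are "in transit"
# between a producer and a consumer. Field names are made up for the example.

def mask_card(event):
    """Redact a credit-card field before the event reaches downstream systems."""
    out = dict(event)
    if "card" in out:
        out["card"] = "****" + out["card"][-4:]
    return out

in_transit = [
    {"user": "alice", "card": "4111111111111111"},
    {"user": "bob", "action": "login"},
]

delivered = [mask_card(e) for e in in_transit]
print(delivered[0]["card"])  # ****1111
```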
11© Cloudera, Inc. All rights reserved.
12© Cloudera, Inc. All rights reserved.
Publish-Subscribe
Message Queue
13© Cloudera, Inc. All rights reserved.
Raise your hand if this sounds familiar
“My next project was to get a working Hadoop setup…
Having little experience in this area, we naturally budgeted
a few weeks for getting data in and out, and the rest of our
time for implementing fancy algorithms.”
--Jay Kreps, Kafka PMC
14© Cloudera, Inc. All rights reserved.
Data pipelines start like this.
[Diagram: one client connected to one source.]
15© Cloudera, Inc. All rights reserved.
Then we reuse them.
[Diagram: several clients all reading from the same source.]
16© Cloudera, Inc. All rights reserved.
Then we add consumers to the existing sources.
[Diagram: several clients, now also feeding a second backend.]
17© Cloudera, Inc. All rights reserved.
Then it starts to look like this.
[Diagram: many clients and many backends, wired together point-to-point.]
18© Cloudera, Inc. All rights reserved.
With maybe some of this.
[Diagram: the same clients and backends, with even more cross-connections.]
19© Cloudera, Inc. All rights reserved.
Queues decouple systems, both statically and in time
20© Cloudera, Inc. All rights reserved.
This is where we are trying to get:
[Diagram: Kafka decouples data pipelines – source systems publish through Producers to Kafka Brokers, and Consumers feed Hadoop, security systems, real-time monitoring, and the data warehouse.]
21© Cloudera, Inc. All rights reserved.
Important notes:
• Producers and Consumers don’t need to know about each other
• Performance issues on Consumers don’t impact Producers
• Consumers are protected from herds of Producers
• Lots of flexibility in handling load
• Messages are available to anyone –
lots of new use cases: monitoring, audit, troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches
22© Cloudera, Inc. All rights reserved.
So… What is Kafka?
23© Cloudera, Inc. All rights reserved.
Kafka provides a fast, distributed, highly scalable,
highly available, publish-subscribe messaging system.
In turn this solves part of a much harder problem:
Communication and integration between
components of large software systems
24© Cloudera, Inc. All rights reserved.
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster; nodes are called brokers
The Basics
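A minimal in-memory sketch of these basics – producers push to named topics, consumers pull at their own pace. The class and method names are invented for the illustration; this is not the real Kafka client API:

```python
# In-memory sketch of the Kafka model: producers push to a named topic,
# consumers pull from an offset they choose. Invented names, not real Kafka.

class Topic:
    def __init__(self, name):
        self.name = name
        self.messages = []               # append-only log

    def push(self, message):             # producer side
        self.messages.append(message)
        return len(self.messages) - 1    # offset of the new message

    def pull(self, offset, max_messages=10):   # consumer side
        return self.messages[offset:offset + max_messages]

clicks = Topic("web-clicks")
clicks.push({"page": "/home"})
clicks.push({"page": "/pricing"})

# A consumer pulls from the offset it last saw; the broker pushes nothing.
batch = clicks.pull(offset=0)
print(len(batch))  # 2
```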
25© Cloudera, Inc. All rights reserved.
Topics, Partitions and Logs
26© Cloudera, Inc. All rights reserved.
Each partition is a log
27© Cloudera, Inc. All rights reserved.
Each Broker has many partitions
[Diagram: three brokers, each holding several partitions (Partition 0, 1, and 2) of the cluster’s topics.]
28© Cloudera, Inc. All rights reserved.
Producers load balance between partitions
[Diagram: a producer client spreading writes across partitions 0, 1, and 2 on three brokers.]
29© Cloudera, Inc. All rights reserved.
Producers load balance between partitions
[Diagram: the same producer client, balancing writes across partitions on three brokers.]
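The load balancing just described can be sketched as a partitioner: keyed messages hash to a stable partition, keyless ones round-robin across all partitions. Names and details are illustrative, not the actual Kafka producer code:

```python
# Sketch of producer-side partition selection: messages with a key always
# hash to the same partition (preserving per-key ordering); keyless messages
# are sprayed round-robin. Invented names, not the real Kafka client.

import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key=None):
    if key is None:
        return next(_round_robin)
    # crc32 is stable across runs, unlike Python's randomized hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved.
assert choose_partition("user-42") == choose_partition("user-42")
print(choose_partition("user-42"))   # some partition in 0..2
```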
30© Cloudera, Inc. All rights reserved.
Consumers
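Since the broker keeps only each consumer's position in the log (the offset), adding consumers is cheap: nothing about the log changes. A toy sketch with invented names:

```python
# Sketch of Kafka-style consumers: the broker keeps the log; each consumer
# tracks only its own offset. Fast and slow consumers never block each other.
# Invented names for illustration, not the real consumer API.

log = ["m0", "m1", "m2", "m3", "m4"]    # one partition's log

class Consumer:
    def __init__(self):
        self.offset = 0                  # per-consumer position, the only state

    def poll(self, log, n=2):
        batch = log[self.offset:self.offset + n]
        self.offset += len(batch)        # "commit" the new position
        return batch

fast, slow = Consumer(), Consumer()
fast.poll(log, n=5)                      # reads everything
slow.poll(log, n=1)                      # lags behind; nobody is blocked
print(fast.offset, slow.offset)          # 5 1
```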
31© Cloudera, Inc. All rights reserved.
Why is Kafka better than other MQs?
• Can keep data forever
• Scales very well – high throughput, low latency, lots of storage
• Scales to any number of consumers
32© Cloudera, Inc. All rights reserved.
How do people use Kafka?
• As a message bus
• As a buffer for replication systems (like Advanced Queuing in Streams)
• As a reliable feed for event processing
• As a buffer for event processing
• To decouple apps from the database (both OLTP and DWH)
33© Cloudera, Inc. All rights reserved.
Need More Kafka?
• https://kafka.apache.org/documentation.html
• My video tutorial: http://shop.oreilly.com/product/0636920038603.do
• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
• Try with Cloudera Manager: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_install.html
34© Cloudera, Inc. All rights reserved.
One more thing...
35© Cloudera, Inc. All rights reserved.
Schema is a MUST HAVE for
data integration
36© Cloudera, Inc. All rights reserved.
Kafka only stores bytes – so where’s the schema?
• People go around asking each other:
“So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
• Each message has a Schema ID
• Each topic has a Schema ID
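The centralized-repository option can be sketched as follows. The `registry`, `encode`, and `decode` names are hypothetical, and real implementations (such as the schema-repo project mentioned in the notes) differ:

```python
# Sketch of the "centralized schema repository" idea: messages carry only a
# small schema ID, and readers look the full schema up in a shared registry.
# All names here are hypothetical, for illustration only.

import json

registry = {}           # schema_id -> schema (the shared repository)

def register(schema_id, schema):
    registry[schema_id] = schema

def encode(schema_id, record):
    # Prefix the payload with the schema ID instead of the whole schema.
    return json.dumps({"schema_id": schema_id, "payload": record})

def decode(message):
    msg = json.loads(message)
    schema = registry[msg["schema_id"]]    # reader resolves the schema
    return schema, msg["payload"]

register(1, {"fields": ["user", "page"]})
wire = encode(1, {"user": "alice", "page": "/home"})
schema, payload = decode(wire)
print(schema["fields"])  # ['user', 'page']
```

Shipping an ID instead of the full schema keeps messages small while still letting every reader interpret the bytes.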
37© Cloudera, Inc. All rights reserved.
I ❤ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
• Add and remove fields without breaking anything
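The schema-evolution rule – new fields need defaults so records written under an older schema still decode – can be simulated with plain dicts (a stdlib sketch of the idea, not the avro library):

```python
# Stdlib simulation of Avro-style schema evolution: a newer reader schema
# adds a field with a default, so old records decode without breaking.

def read_with_schema(record, reader_schema):
    """Fill in defaults for fields the writer didn't know about."""
    out = {}
    for field, default in reader_schema.items():
        out[field] = record.get(field, default)
    return out

old_record = {"name": "Gwen"}                    # written before 'dept' existed
new_schema = {"name": None, "dept": "unknown"}   # new field with a default

decoded = read_with_schema(old_record, new_schema)
print(decoded)  # {'name': 'Gwen', 'dept': 'unknown'}
```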
38© Cloudera, Inc. All rights reserved.
Replicating from Oracle to Kafka?
Don’t lose the schema!
39© Cloudera, Inc. All rights reserved.
Schemas are Agile
• Leave MySQL and your favorite DBA aside for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Schemas allow validating data soon after it’s written
• No need to throw away data that doesn’t fit!
41© Cloudera, Inc. All rights reserved.
Thank you
@gwenshap
gshapira@cloudera.com
Editor's Notes
  1. Then we end up adding clients to use that source.
  2. But as we start to deploy our applications, we realize that clients need data from a number of sources. So we add them as needed.
  3. But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
  4. Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where it is used as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producers push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
  5. Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
  6. Kafka retains all messages for a fixed amount of time, not waiting for acks from consumers. The only metadata retained per consumer is its position in the log – the offset. So adding many consumers is cheap. On the other hand, consumers have more responsibility and are more challenging to implement correctly. And “batching” consumers is not a problem.
  7. 3 partitions, each replicated 3 times.
  8. They choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
  9. They choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
  10. Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition. Leaders obviously do more work, but they are balanced between nodes. We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
  11. Sorry, but “Schema on Read” is kind of B.S. We admit that there is a schema, but we want to “ingest fast”, so we shift the burden to the readers. But the data is written once and read many, many times by many different people. They each need to figure this out on their own? This makes no sense. Also, how are you going to validate the data without a schema?
  12. https://github.com/schema-repo/schema-repo There’s no data dictionary for Kafka