This document discusses Apache Kafka and how it can be used by Oracle DBAs. It begins by explaining how Kafka builds upon the concept of a database redo log by providing a distributed commit log service. It then discusses how Kafka is a publish-subscribe messaging system and can be used to log transactions from any database, application logs, metrics and other system events. Finally, it discusses how schemas are important for Kafka since it only stores messages as bytes, and how Avro can be used to define and evolve schemas for Kafka messages.
But as we start to deploy our applications we realizet hat clients need data from a number of sources. So we add them as needed.
But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with it’s history at LinkedIn and they use it as a high throughput relatively low latency commit log. It allows sources to push data without worrying about what clients are reading it. Note that producer push, and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
Topics are partitioned, each partition ordered and immutable. Messages in a partition have an ID, called Offset. Offset uniquely identifies a message within a partition
Kafka retains all messages for fixed amount of time.
Not waiting for acks from consumers.
The only metadata retained per consumer is the position in the log – the offset
So adding many consumers is cheap
On the other hand, consumers have more responsibility and are more challenging to implement correctly
And “batching” consumers is not a problem
3 partitions, each replicated 3 times.
The choose how many replicas must ACK a message before its considered committed.
This is the tradeoff between speed and reliability
The choose how many replicas must ACK a message before its considered committed.
This is the tradeoff between speed and reliability
can read from one or more partition leader. You can’t have two consumers in same group reading the same partition.
Leaders obviously do more work – but they are balanced between nodes
We reviewed the basic components on the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
Sorry, but “Schema on Read” is kind of B.S.
We admit that there is a schema, but we want to “ingest fast”, so we shift the burden to the readers.
But the data is written once and read many many times by many different people. They each need to figure this out on their own? This makes no sense.
Also, how are you going to validate the data without a schema?
https://github.com/schema-repo/schema-repoThere’s no data dictionary for Kafka