Kappa Architecture
for Event Processing
Piotr Czarnas
Querona CEO
The anatomy of event processing
What is an event
An event is something that just happened and requires a quick reaction. Examples:
• A user performed an action in the application
• A customer just ordered a product
• Information (data) received from an external partner
• A frequent customer ordered another product
The need for Event Streaming
High frequency events → Qualification → Reaction
• A lot of events happen
• Some of them are valuable
• Some require our reaction
• We have little time to act
How events are processed
Gather events → Store & forward → Process → React → Action
Valuable events require a reaction
Complex events
• The same event happened again
• An event connected with external data is a different event
Complex events are high-level events based on multiple data points
→ Complex events have a real business value
Complex events identification
Simple events:
• A customer logged in
• A customer dropped a shopping basket
Correlated with each other and with external data, these simple events can be converted to a complex event.
A complex event may be identified and added to the event stream, triggering a reaction.
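A minimal sketch of the correlation idea, assuming made-up SimpleEvent and ComplexEventDetector classes and an in-memory state per customer; a production setup would rather use a stream processing engine:
import java.util.*;

class SimpleEvent {
    final String customerId;
    final String type;            // e.g. "LOGIN", "BASKET_DROPPED"
    SimpleEvent(String customerId, String type) {
        this.customerId = customerId;
        this.type = type;
    }
}

class ComplexEventDetector {
    // Recently seen event types per customer (external data could be joined in here as well)
    private final Map<String, Set<String>> recentTypes = new HashMap<>();

    // Returns a complex event description when the correlation rule matches, otherwise null
    String onEvent(SimpleEvent event) {
        Set<String> types = recentTypes.computeIfAbsent(event.customerId, k -> new HashSet<>());
        types.add(event.type);
        if (types.contains("LOGIN") && types.contains("BASKET_DROPPED")) {
            types.clear();
            return "ABANDONED_PURCHASE:" + event.customerId;   // complex event, can be published back to Kafka
        }
        return null;
    }
}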
Source of events
• Actions performed by users in applications
• Messages from a corporate event bus (EAI)
• Complex events identified by correlation of multiple events
• Row changes in databases (CDC)
Analytical advancement
Analytical advancement ladder (each step adds business value):
• Descriptive analytics: What has happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: What can we do to make it happen?
Event processing value proposition
• Predictive analytics: learn what we can get from events
• Prescriptive analytics: identify and act on events
Event processing requires two processes: learning and acting
Event consumers
Data scientists & data analysts identify valuable events.
Events are consumed both for learning and for performing actions: reaction to current events and reaction to new events in the future.
Events need re-reading many times.
Kappa architecture
Classic Lambda Architecture
Limitations of the Lambda Architecture
• The batch layer and the speed layer require double processing
• Changes to the processing logic must be reimplemented in both processing pipelines
• The whole view of all data is possible only through a virtual query that is a union of the batch and the speed layer
But do we need a speed layer that is up-to-date all the time?
Lambda Architecture for log monitoring
Lambda Architecture is good for log monitoring, not for business events
Lambda Architecture for CDC data synchronization
Lambda Architecture is good for keeping a copy of rows from an OLTP database: insert, delete and update operations captured from the source DB are applied to a key/value store database (HBase, Cassandra, etc.).
Kappa Architecture
Only one processing logic!
Kappa architecture data lag
As long as the reaction time to an event is longer than the processing time, we can work with the data lag.
Each processing run writes its results to a new table (output table N, output table N+1, ...); the relevant lags are the 15-minute batch lag and the human reaction lag.
Kappa Architecture benefits
• Kafka is the only source
• Only one processing logic
• Multiple types of analyses possible
• New results available in a new table
Both predictive and prescriptive analytics are supported, so actionable analytics (learning + reacting) becomes much easier.
Event storage
What is Apache Kafka
Apache Kafka is a high-throughput publish-subscribe event bus: event publishers (System 1, System 2, ...) write events to a Kafka topic and event consumers (Consumer 1, Consumer 2, ...) read them independently.
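A minimal sketch of the publishing side; the topic name "events" and the JSON payload are examples:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<String, String>(producerProps);
// The key (here a customer id) also determines which partition the event is written to
producer.send(new ProducerRecord<String, String>("events", "customer-42", "{\"action\":\"order\"}"));
producer.close();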
Apache Kafka partitioning
Kafka rules:
• Topics are partitioned
• Partitions are append-only files
• Partitions are distributed across nodes
• Write speed: 1 million events / sec / partition
• Read speed: 2 million events / sec / partition
Apache Kafka consumer groups
Consumers (e.g. Consumer 1 and Consumer 2) form a consumer group.
All consumers in a group share a group.id.
Apache Kafka offset storage for a group.id
• An event streaming consumer must keep the last read offset for each partition
• Offset storage is specified by the offsets.storage configuration
• Offset stores: ZooKeeper, Kafka, custom
But in Kappa Architecture we do not care about offsets: we read everything again.
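For contrast, a minimal sketch of a conventional streaming consumer that does commit its offsets; it reuses the consumerProps from the setup slide further below, and the topic name "events" and processRecord are assumptions:
consumerProps.put("enable.auto.commit", "false");              // commit offsets explicitly
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(consumerProps);
consumer.subscribe(Arrays.asList("events"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(10000);
    for (ConsumerRecord<String, String> record : records)
        processRecord(record);                                 // application-specific handler
    consumer.commitSync();                                     // store the last read offsets for this group.id
}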
Waiting for new events on Apache Kafka
A consumer that has reached the end of an assigned partition waits for new events for the duration of the "poll" timeout period.
In the example (partitions 0–3), one consumer can still read from partition 0, while another consumer has reached the end of all its partitions and is waiting.
Reading events without waiting at the end of a partition
KafkaConsumer<~> consumer = ...
ConsumerRecords<~> records = consumer.poll(10000);
We must stop listening to a partition when we reach its last event, or the reader will keep waiting for (and consuming) new events forever.
Reading events the easy way (1/3): setup
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "consumer group here");
consumerProps.put("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
A random group.id must be used
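For example (one possible approach), the group.id can be randomized per run so that no previously committed offsets are reused:
consumerProps.put("group.id", "kappa-reader-" + java.util.UUID.randomUUID());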
Reading events the easy way (2/3): partition offset seek
KafkaConsumer<String, String> consumer =
new KafkaConsumer<String, String>(consumerProps);
List<PartitionInfo> partitionInfos =
consumer.partitionsFor("topic name here");
List<TopicPartition> topicPartitions =
partitionInfos.stream()
.map(pi -> new TopicPartition(pi.topic(), pi.partition()))
.collect(Collectors.toList());
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
But we can also find offsets by a timestamp and "rewind" to them
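A sketch of the timestamp rewind with the consumer API, reusing the consumer and topicPartitions from the snippet above; the 24-hour window is only an example:
// Rewind every partition to the first offset with a timestamp >= the chosen point in time
long startTimestamp = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
Map<TopicPartition, Long> timestampsToSearch = new HashMap<TopicPartition, Long>();
for (TopicPartition topicPartition : topicPartitions)
    timestampsToSearch.put(topicPartition, startTimestamp);
Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestampsToSearch);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : offsets.entrySet()) {
    if (entry.getValue() != null)                              // null when the partition has no such offset
        consumer.seek(entry.getKey(), entry.getValue().offset());
}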
Reading events the easy way (3/3): reading loop
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(topicPartitions);
int remainingPartitionsCount = endOffsets.size();
while(remainingPartitionsCount > 0) {
ConsumerRecords<String, String> consumerRecords = consumer.poll(10000);
for (ConsumerRecord<String, String> record : consumerRecords) {
TopicPartition recordPartition = new TopicPartition(record.topic(), record.partition());
long endOffset = endOffsets.get(recordPartition);
if (record.offset() == endOffset - 1) {
remainingPartitionsCount--;
consumer.pause(Arrays.asList(recordPartition));
}
if (record.offset() < endOffset)
processRecord(record);
}
if (consumerRecords.isEmpty())
break;
}
Bounded event reading on Apache Spark
1. Create a custom RDD or DataFrame that reads from Apache Kafka
2. Register your RDD in the Spark context
3. Just run SQL on the DataFrame (see the usage sketch below)
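A usage sketch under a few assumptions: an existing SparkSession named spark, a topic called "events" carrying JSON strings, and the KafkaTopicRDD class defined on the following slides (spark.read().json(Dataset<String>) requires Spark 2.2+):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// 1. Create the bounded Kafka RDD (KafkaTopicRDD is shown on the next slides)
KafkaTopicRDD kafkaRdd = new KafkaTopicRDD(
        spark.sparkContext(), "localhost:9092",
        "spark-" + java.util.UUID.randomUUID(), "events", 10000L);

// 2. Turn the JSON strings into a DataFrame and register it for SQL
Dataset<String> jsonStrings = spark.createDataset(kafkaRdd, Encoders.STRING());
Dataset<Row> events = spark.read().json(jsonStrings);
events.createOrReplaceTempView("events");

// 3. Just run SQL on the DataFrame
spark.sql("SELECT count(*) FROM events").show();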
Bounded Spark RDD (1/6): RDD declaration
public static class KafkaTopicRDD extends org.apache.spark.rdd.RDD<String> {
private static final ClassTag<String> STRING_TAG =
ClassManifestFactory$.MODULE$.fromClass(String.class);
private static final long serialVersionUID = 1L;
private String kafkaServer;
private String groupId;
private String topic;
private long timeout;
public KafkaTopicRDD(SparkContext sc, String kafkaServer, String groupId, String topic,
long timeout) {
super(sc, new ArrayBuffer<Dependency<?>>(), STRING_TAG);
this.kafkaServer = kafkaServer;
this.groupId = groupId;
this.topic = topic;
this.timeout = timeout;
}
Bounded Spark RDD (2/6): RDD’s compute
@Override
public Iterator<String> compute(Partition arg0, TaskContext arg1) {
KafkaTopicPartition p = (KafkaTopicPartition)arg0;
KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer();
TopicPartition partition = new TopicPartition(topic, p.partition);
kafkaConsumer.assign(Arrays.asList(partition));
kafkaConsumer.seek(partition, p.startOffset);
return new KafkaTopicIterator(kafkaConsumer, p.endOffset, this.timeout);
}
private KafkaConsumer<String, String> createKafkaConsumer() {
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", this.kafkaServer);
consumerProps.put("group.id", this.groupId);
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(consumerProps);
return consumer;
}
Each Kafka partition is processed as a separate Spark task
Bounded Spark RDD (3/6): Kafka → Spark partition
public static class KafkaTopicPartition implements Partition {
private static final long serialVersionUID = 1L;
private int partition;
private long startOffset;
private long endOffset;
public KafkaTopicPartition(int partition, long startOffset, long endOffset) {
this.partition = partition;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
@Override
public int index() { return partition; }
@Override
public boolean equals(Object obj) { return ... }
@Override
public int hashCode() { return index(); }
}
Bounded Spark RDD (4/6): partition enumeration
@Override
public Partition[] getPartitions() {
KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer();
List<PartitionInfo> partitionInfos = kafkaConsumer.partitionsFor(this.topic);
List<TopicPartition> topicPartitions = partitionInfos.stream()
.map(pi -> new TopicPartition(this.topic, pi.partition())).collect(Collectors.toList());
Map<TopicPartition, Long> beginOffsets = kafkaConsumer.beginningOffsets(topicPartitions);
Map<TopicPartition, Long> endOffsets = kafkaConsumer.endOffsets(topicPartitions);
Partition[] partitions = new Partition[partitionInfos.size()];
for(int i = 0; i < partitionInfos.size(); i++) {
PartitionInfo partitionInfo = partitionInfos.get(i);
TopicPartition topicPartition = topicPartitions.get(i);
partitions[i] = new KafkaTopicPartition(
partitionInfo.partition(),
beginOffsets.get(topicPartition),
endOffsets.get(topicPartition));
}
return partitions;
}
Bounded Spark RDD (5/6): events iterator
public static class KafkaTopicIterator extends AbstractIterator<String> {
private KafkaConsumer<String, String> kafkaConsumer;
private long endOffset, timeout;
private ConsumerRecords<String, String> recordsBatch;
private java.util.Iterator<ConsumerRecord<String, String>> recordIterator;
private ConsumerRecord<String, String> currentRecord;
private boolean lastRecordReached;
public KafkaTopicIterator(KafkaConsumer<String, String> kafkaConsumer, long endOffset, long timeout) {
this.kafkaConsumer = kafkaConsumer; this.endOffset = endOffset; this.timeout = timeout;
}
@Override
public String next() {
if (currentRecord == null && !hasNext())
throw new java.util.NoSuchElementException();
String value = currentRecord.value();
currentRecord = null;
return value;
}
Bounded Spark RDD (6/6): iterator’s hasNext
@Override
public boolean hasNext() {
if (currentRecord != null) return true;
if (lastRecordReached) return false;
// Poll again whenever the current batch is exhausted and the end offset has not been reached yet
while (recordIterator == null || !recordIterator.hasNext()) {
recordsBatch = this.kafkaConsumer.poll(this.timeout);
if (recordsBatch.isEmpty()) return false;
recordIterator = recordsBatch.iterator();
}
currentRecord = recordIterator.next();
if (currentRecord.offset() >= endOffset) {
currentRecord = null;
return false;
}
if (currentRecord.offset() >= endOffset - 1)
lastRecordReached = true;
return true;
}
The anatomy of a Kafka event
• Records in Kafka have a key and a value
• Both the key and the value are binary, serialized by a serializer of choice
• JSON, String or Avro serializers are useful
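For example, the value can be produced as a JSON document while the key stays a plain String; a sketch using the Jackson library (the OrderEvent class is made up):
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical payload class for illustration
class OrderEvent {
    public String customerId;
    public String productId;
}

class EventSerialization {
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    // The value becomes a JSON document; the key stays a plain String (the customer id).
    // Both are turned into bytes by the serializers configured on the producer.
    static ProducerRecord<String, String> toRecord(OrderEvent event) throws JsonProcessingException {
        String json = OBJECT_MAPPER.writeValueAsString(event);
        return new ProducerRecord<String, String>("events", event.customerId, json);
    }
}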
Apache Kafka log compaction
The default ("delete") cleanup policy removes entries older than the retention period (e.g. 1 week): log.cleanup.policy=delete
The "compact" cleanup policy keeps only the newest version of a record for each key: log.cleanup.policy=compact
(The example log contains records with keys 1 2 5 3 4 2 3 6 2 7; after compaction only the latest record per key remains.)
The compact cleanup policy is required for Kappa Architecture.
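A sketch of creating a compacted topic with the Kafka AdminClient (available in newer client versions; the topic name, partition count and replication factor are examples):
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

class CompactedTopicSetup {
    static void createCompactedTopic() throws InterruptedException, ExecutionException {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");
        try (AdminClient adminClient = AdminClient.create(adminProps)) {
            NewTopic compactedTopic = new NewTopic("events", 4, (short) 1)        // name, partitions, replication
                    .configs(Collections.singletonMap(
                            TopicConfig.CLEANUP_POLICY_CONFIG,                    // "cleanup.policy"
                            TopicConfig.CLEANUP_POLICY_COMPACT));                 // "compact"
            adminClient.createTopics(Collections.singletonList(compactedTopic)).all().get();
        }
    }
}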
Querona
Querona – Data Virtualization engine
• Your existing data sources: keep them as they are to reduce risk
• ETL-free virtual database, powered by Apache Spark: set up to facilitate & accelerate BI
• User-friendly front-ends: use what users have known for years
A complete Logical Data Warehouse: ETL-free, self-service, Big Data ready, utilizing Apache Spark.
QUERONA – Logical Data Warehouse
Sits between the data sources (CRM, ERP, OLTP, ...) and the client tools:
• Connects all data sources (~100)
• Simple data loading (3 clicks)
• Joins data from many sources (instant)
• Real-time data access
• Enables GDPR/RODO compliance
Why Querona
Data Virtualization (DV) is not a new idea, but since 2016 Gartner has considered DV a key trend in Data Warehousing and Data Analytics
• Self-service → more people can use data
• SQL Server wire compatibility → compatible with any client tool
• Apache Spark bundled → "Big Data Ready" in 5 minutes
• Competitive licensing model → DV available for all companies
Find any data source
Which table has first names?
Data preview in one place
What do we have here? Maybe we can correlate that with events?
Caching – just a few clicks
The data source is not capable of real-time access? Let's cache it on Apache Spark or in the cloud.
Joining data
More information about an event is in a CRM? Let's build a 360° customer profile as an SQL view!
Augmented events
Original events (Kafka) are joined with data from a CRM, a product database, an ERP and a marketing platform, and the augmented events are visible as SQL Server compatible views:
V_EVENTS_CUSTOMER_INFO
V_EVENTS_PRODUCT_INFO
V_EVENT_SALES
V_EVENT_CAMPAIGN_GOALS
External data sources for event augmentation
SaaS, social media, a business partner's database, partners, public data
Kappa Architecture full data lifecycle rules
• Treat Apache Kafka as a persistent event source
• Get ready for both event analysis (learn) and reacting to
events (act)
• Identify all additional data that may augment events
• Make sure that you can reprocess events at any time
• Expose complex events for consumption (dashboards,
activities created in CRM, etc.)
Piotr Czarnas
CEO
Querona Ltd.
piotr.czarnas@querona.com
+48 536 133 114
www.querona.com