Kappa Architecture
for Event Processing
Piotr Czarnas
Querona CEO
The anatomy of event processing
What is an event
An event is something that just happened and requires a quick reaction. Examples:
• A user performed an action in the application
• A customer just ordered a product
• Information (data) received from an external partner
• A frequent customer ordered another product
The need for Event Streaming
High frequency events → Qualification → Reaction
• A lot of events happen
• Some of them are valuable
• Some require our reaction
• We have little time to act
How events are processed
Gather events → Store & forward → Process → React → Action
Valuable events require a reaction
Complex events
• The same event happened again
• An event connected with external data is a different event
Complex events are high-level events based on multiple data points
→ Complex events have a real business value
Complex events identification
Simple events:
• A customer logged in
• A customer dropped a shopping basket
Correlated with each other and with external data, these simple events can be converted to a complex event.
A complex event may be identified and added to the event stream, triggering a reaction.
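A minimal sketch of the correlation idea, assuming made-up SimpleEvent and ComplexEventDetector classes and an in-memory state per customer; a production setup would rather use a stream processing engine:
import java.util.*;

class SimpleEvent {
    final String customerId;
    final String type;            // e.g. "LOGIN", "BASKET_DROPPED"
    SimpleEvent(String customerId, String type) {
        this.customerId = customerId;
        this.type = type;
    }
}

class ComplexEventDetector {
    // Recently seen event types per customer (external data could be joined in here as well)
    private final Map<String, Set<String>> recentTypes = new HashMap<>();

    // Returns a complex event description when the correlation rule matches, otherwise null
    String onEvent(SimpleEvent event) {
        Set<String> types = recentTypes.computeIfAbsent(event.customerId, k -> new HashSet<>());
        types.add(event.type);
        if (types.contains("LOGIN") && types.contains("BASKET_DROPPED")) {
            types.clear();
            return "ABANDONED_PURCHASE:" + event.customerId;   // complex event, can be published back to Kafka
        }
        return null;
    }
}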
Source of events
• Actions performed by users in applications
• Messages from a corporate event bus (EAI)
• Complex events identified by correlation of multiple events
• Row changes in databases (CDC)
Analytical advancement
Analytical advancement ladder (each step adds business value):
• Descriptive analytics: What has happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: What can we do to make it happen?
Event processing value proposition
• Predictive analytics: learn what we can get from events
• Prescriptive analytics: identify and act on events
Event processing requires two processes: learning and acting
Event consumers
Data scientists & data analysts identify valuable events.
Events are consumed both for learning and for performing actions: reaction to current events and reaction to new events in the future.
Events need re-reading many times.
Kappa architecture
Classic Lambda Architecture
Limitations of the Lambda Architecture
• The batch layer and the speed layer require double processing
• Changes to the processing logic must be reimplemented in both processing pipelines
• The whole view of all data is possible only through a virtual query that is a union of the batch and the speed layer
But do we need a speed layer that is up-to-date all the time?
Lambda Architecture for log monitoring
Lambda Architecture is good for log monitoring, not for business events
Lambda Architecture for CDC data synchronization
Lambda Architecture is good for keeping a copy of rows from an OLTP database: insert, delete and update operations captured from the source DB are applied to a key/value store database (HBase, Cassandra, etc.).
Kappa Architecture
Only one processing logic!
Kappa architecture data lag
As long as the reaction time to an event is longer than the processing time, we can work with the data lag.
Each processing run writes its results to a new table (output table N, output table N+1, ...); the relevant lags are the 15-minute batch lag and the human reaction lag.
Kappa Architecture benefits
• Kafka is the only source
• Only one processing logic
• Multiple types of analyses possible
• New results available in a new table
Both predictive and prescriptive analytics are supported, so actionable analytics (learning + reacting) becomes much easier.
Event storage
What is Apache Kafka
Apache Kafka is a high-throughput publish-subscribe event bus: event publishers (System 1, System 2, ...) write events to a Kafka topic and event consumers (Consumer 1, Consumer 2, ...) read them independently.
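A minimal sketch of the publishing side; the topic name "events" and the JSON payload are examples:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<String, String>(producerProps);
// The key (here a customer id) also determines which partition the event is written to
producer.send(new ProducerRecord<String, String>("events", "customer-42", "{\"action\":\"order\"}"));
producer.close();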
Apache Kafka partitioning
Kafka rules:
• Topics are partitioned
• Partitions are append-only files
• Partitions are distributed across nodes
• Write speed: 1 million events / sec / partition
• Read speed: 2 million events / sec / partition
Apache Kafka consumer groups
Consumers (e.g. Consumer 1 and Consumer 2) form a consumer group.
All consumers in a group share a group.id.
Apache Kafka offset storage for a group.id
• An event streaming consumer must keep the last read offset for each partition
• Offset storage is specified by the offsets.storage configuration
• Offset stores: ZooKeeper, Kafka, custom
But in Kappa Architecture we do not care about offsets: we read everything again.
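For contrast, a minimal sketch of a conventional streaming consumer that does commit its offsets; it reuses the consumerProps from the setup slide further below, and the topic name "events" and processRecord are assumptions:
consumerProps.put("enable.auto.commit", "false");              // commit offsets explicitly
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(consumerProps);
consumer.subscribe(Arrays.asList("events"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(10000);
    for (ConsumerRecord<String, String> record : records)
        processRecord(record);                                 // application-specific handler
    consumer.commitSync();                                     // store the last read offsets for this group.id
}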
Waiting for new events on Apache Kafka
A consumer that has reached the end of an assigned partition waits for new events for the duration of the "poll" timeout period.
In the example (partitions 0–3), one consumer can still read from partition 0, while another consumer has reached the end of all its partitions and is waiting.
Reading events without waiting at the end of a partition
KafkaConsumer<~> consumer = ...
ConsumerRecords<~> records = consumer.poll(10000);
We must stop listening to a partition when we reach its last event, or the reader will keep waiting for (and consuming) new events forever.
Reading events the easy way (1/3): setup
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "consumer group here");
consumerProps.put("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
A random group.id must be used
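For example (one possible approach), the group.id can be randomized per run so that no previously committed offsets are reused:
consumerProps.put("group.id", "kappa-reader-" + java.util.UUID.randomUUID());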
Reading events the easy way (2/3): partition offset seek
KafkaConsumer<String, String> consumer =
new KafkaConsumer<String, String>(consumerProps);
List<PartitionInfo> partitionInfos =
consumer.partitionsFor("topic name here");
List<TopicPartition> topicPartitions =
partitionInfos.stream()
.map(pi -> new TopicPartition(pi.topic(), pi.partition()))
.collect(Collectors.toList());
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
But we can also find offsets by a timestamp and "rewind" to them
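A sketch of the timestamp rewind with the consumer API, reusing the consumer and topicPartitions from the snippet above; the 24-hour window is only an example:
// Rewind every partition to the first offset with a timestamp >= the chosen point in time
long startTimestamp = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
Map<TopicPartition, Long> timestampsToSearch = new HashMap<TopicPartition, Long>();
for (TopicPartition topicPartition : topicPartitions)
    timestampsToSearch.put(topicPartition, startTimestamp);
Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestampsToSearch);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : offsets.entrySet()) {
    if (entry.getValue() != null)                              // null when the partition has no such offset
        consumer.seek(entry.getKey(), entry.getValue().offset());
}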
Reading events the easy way (3/3): reading loop
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(topicPartitions);
int remainingPartitionsCount = endOffsets.size();
while(remainingPartitionsCount > 0) {
ConsumerRecords<String, String> consumerRecords = consumer.poll(10000);
for (ConsumerRecord<String, String> record : consumerRecords) {
TopicPartition recordPartition = new TopicPartition(record.topic(), record.partition());
long endOffset = endOffsets.get(recordPartition);
if (record.offset() == endOffset - 1) {
remainingPartitionsCount--;
consumer.pause(Arrays.asList(recordPartition));
}
if (record.offset() < endOffset)
processRecord(record);
}
if (consumerRecords.isEmpty())
break;
}
Bounded event reading on Apache Spark
1. Create a custom RDD or DataFrame that reads from Apache Kafka
2. Register your RDD in the Spark context
3. Just run SQL on the DataFrame (see the usage sketch below)
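A usage sketch under a few assumptions: an existing SparkSession named spark, a topic called "events" carrying JSON strings, and the KafkaTopicRDD class defined on the following slides (spark.read().json(Dataset<String>) requires Spark 2.2+):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// 1. Create the bounded Kafka RDD (KafkaTopicRDD is shown on the next slides)
KafkaTopicRDD kafkaRdd = new KafkaTopicRDD(
        spark.sparkContext(), "localhost:9092",
        "spark-" + java.util.UUID.randomUUID(), "events", 10000L);

// 2. Turn the JSON strings into a DataFrame and register it for SQL
Dataset<String> jsonStrings = spark.createDataset(kafkaRdd, Encoders.STRING());
Dataset<Row> events = spark.read().json(jsonStrings);
events.createOrReplaceTempView("events");

// 3. Just run SQL on the DataFrame
spark.sql("SELECT count(*) FROM events").show();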
Bounded Spark RDD (1/6): RDD declaration
public static class KafkaTopicRDD extends org.apache.spark.rdd.RDD<String> {
private static final ClassTag<String> STRING_TAG =
ClassManifestFactory$.MODULE$.fromClass(String.class);
private static final long serialVersionUID = 1L;
private String kafkaServer;
private String groupId;
private String topic;
private long timeout;
public KafkaTopicRDD(SparkContext sc, String kafkaServer, String groupId, String topic,
long timeout) {
super(sc, new ArrayBuffer<Dependency<?>>(), STRING_TAG);
this.kafkaServer = kafkaServer;
this.groupId = groupId;
this.topic = topic;
this.timeout = timeout;
}
Bounded Spark RDD (2/6): RDD’s compute
@Override
public Iterator<String> compute(Partition arg0, TaskContext arg1) {
KafkaTopicPartition p = (KafkaTopicPartition)arg0;
KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer();
TopicPartition partition = new TopicPartition(topic, p.partition);
kafkaConsumer.assign(Arrays.asList(partition));
kafkaConsumer.seek(partition, p.startOffset);
return new KafkaTopicIterator(kafkaConsumer, p.endOffset, this.timeout);
}
private KafkaConsumer<String, String> createKafkaConsumer() {
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", this.kafkaServer);
consumerProps.put("group.id", this.groupId);
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(consumerProps);
return consumer;
}
Each Kafka partition is processed as a separate Spark task
Bounded Spark RDD (3/6): Kafka → Spark partition
public static class KafkaTopicPartition implements Partition {
private static final long serialVersionUID = 1L;
private int partition;
private long startOffset;
private long endOffset;
public KafkaTopicPartition(int partition, long startOffset, long endOffset) {
this.partition = partition;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
@Override
public int index() { return partition; }
@Override
public boolean equals(Object obj) { return ... }
@Override
public int hashCode() { return index(); }
}
Bounded Spark RDD (4/6): partition enumeration
@Override
public Partition[] getPartitions() {
KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer();
List<PartitionInfo> partitionInfos = kafkaConsumer.partitionsFor(this.topic);
List<TopicPartition> topicPartitions = partitionInfos.stream()
.map(pi -> new TopicPartition(this.topic, pi.partition())).collect(Collectors.toList());
Map<TopicPartition, Long> beginOffsets = kafkaConsumer.beginningOffsets(topicPartitions);
Map<TopicPartition, Long> endOffsets = kafkaConsumer.endOffsets(topicPartitions);
Partition[] partitions = new Partition[partitionInfos.size()];
for(int i = 0; i < partitionInfos.size(); i++) {
PartitionInfo partitionInfo = partitionInfos.get(i);
TopicPartition topicPartition = topicPartitions.get(i);
partitions[i] = new KafkaTopicPartition(
partitionInfo.partition(),
beginOffsets.get(topicPartition),
endOffsets.get(topicPartition));
}
return partitions;
}
Bounded Spark RDD (5/6): events iterator
public static class KafkaTopicIterator extends AbstractIterator<String> {
private KafkaConsumer<String, String> kafkaConsumer;
private long endOffset, timeout;
private ConsumerRecords<String, String> recordsBatch;
private java.util.Iterator<ConsumerRecord<String, String>> recordIterator;
private ConsumerRecord<String, String> currentRecord;
private boolean lastRecordReached;
public KafkaTopicIterator(KafkaConsumer<String, String> kafkaConsumer, long endOffset, long timeout) {
this.kafkaConsumer = kafkaConsumer; this.endOffset = endOffset; this.timeout = timeout;
}
@Override
public String next() {
if (currentRecord == null && !hasNext())
throw new java.util.NoSuchElementException();
String value = currentRecord.value();
currentRecord = null;
return value;
}
Bounded Spark RDD (6/6): iterator’s hasNext
@Override
public boolean hasNext() {
if (currentRecord != null) return true;
if (lastRecordReached) return false;
// Poll again whenever the current batch is exhausted and the end offset has not been reached yet
while (recordIterator == null || !recordIterator.hasNext()) {
recordsBatch = this.kafkaConsumer.poll(this.timeout);
if (recordsBatch.isEmpty()) return false;
recordIterator = recordsBatch.iterator();
}
currentRecord = recordIterator.next();
if (currentRecord.offset() >= endOffset) {
currentRecord = null;
return false;
}
if (currentRecord.offset() >= endOffset - 1)
lastRecordReached = true;
return true;
}
The anatomy of a Kafka event
• Records in Kafka have a key and a value
• Both the key and the value are binary, serialized by a serializer of choice
• JSON, String or Avro serializers are useful
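For example, the value can be produced as a JSON document while the key stays a plain String; a sketch using the Jackson library (the OrderEvent class is made up):
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical payload class for illustration
class OrderEvent {
    public String customerId;
    public String productId;
}

class EventSerialization {
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    // The value becomes a JSON document; the key stays a plain String (the customer id).
    // Both are turned into bytes by the serializers configured on the producer.
    static ProducerRecord<String, String> toRecord(OrderEvent event) throws JsonProcessingException {
        String json = OBJECT_MAPPER.writeValueAsString(event);
        return new ProducerRecord<String, String>("events", event.customerId, json);
    }
}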
Apache Kafka log compaction
The default ("delete") cleanup policy removes entries older than the retention period (e.g. 1 week): log.cleanup.policy=delete
The "compact" cleanup policy keeps only the newest version of a record for each key: log.cleanup.policy=compact
(The example log contains records with keys 1 2 5 3 4 2 3 6 2 7; after compaction only the latest record per key remains.)
The compact cleanup policy is required for Kappa Architecture.
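A sketch of creating a compacted topic with the Kafka AdminClient (available in newer client versions; the topic name, partition count and replication factor are examples):
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

class CompactedTopicSetup {
    static void createCompactedTopic() throws InterruptedException, ExecutionException {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");
        try (AdminClient adminClient = AdminClient.create(adminProps)) {
            NewTopic compactedTopic = new NewTopic("events", 4, (short) 1)        // name, partitions, replication
                    .configs(Collections.singletonMap(
                            TopicConfig.CLEANUP_POLICY_CONFIG,                    // "cleanup.policy"
                            TopicConfig.CLEANUP_POLICY_COMPACT));                 // "compact"
            adminClient.createTopics(Collections.singletonList(compactedTopic)).all().get();
        }
    }
}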
Querona
Querona – Data Virtualization engine
• Your existing data sources: keep them as they are to reduce risk
• ETL-free virtual database, powered by Apache Spark: set up to facilitate & accelerate BI
• User-friendly front-ends: use what users have known for years
A complete Logical Data Warehouse: ETL-free, self-service, Big Data ready, utilizing Apache Spark.
QUERONA – Logical Data Warehouse
Sits between the data sources (CRM, ERP, OLTP, ...) and the client tools:
• Connects all data sources (~100)
• Simple data loading (3 clicks)
• Joins data from many sources (instant)
• Real-time data access
• Enables GDPR/RODO compliance
Why Querona
Data Virtualization (DV) is not a new idea, but since 2016 Gartner has considered DV a key trend in Data Warehousing and Data Analytics
• Self-service → more people can use data
• SQL Server wire compatibility → compatible with any client tool
• Apache Spark bundled → "Big Data Ready" in 5 minutes
• Competitive licensing model → DV available for all companies
Find any data source
Which table has first names?
Data preview in one place
What do we have here? Maybe we can correlate that with events?
Caching – just a few clicks
The data source is not capable of real-time access? Let's cache it on Apache Spark or in the cloud.
Joining data
More information about an event is in a CRM? Let's build a 360° customer profile as an SQL view!
Augmented events
Original events (Kafka) are joined with data from a CRM, a product database, an ERP and a marketing platform, and the augmented events are visible as SQL Server compatible views:
V_EVENTS_CUSTOMER_INFO
V_EVENTS_PRODUCT_INFO
V_EVENT_SALES
V_EVENT_CAMPAIGN_GOALS
External data sources for event augmentation
SaaS, social media, a business partner's database, partners, public data
Kappa Architecture full data lifecycle rules
• Treat Apache Kafka as a persistent event source
• Get ready for both event analysis (learn) and reacting to
events (act)
• Identify all additional data that may augment events
• Make sure that you can reprocess events at any time
• Expose complex events for consumption (dashboards,
activities created in CRM, etc.)
Piotr Czarnas
CEO
Querona Ltd.
piotr.czarnas@querona.com
+48 536 133 114
www.querona.com