1
Single Message Transforms
Are Not the Transformations
You’re Looking For
Ewen Cheslack-Postava
Engineer, Apache Kafka Committer
2
The Challenge: Streaming Data Pipelines
3
Simplifying Streaming Data Pipelines with Apache Kafka
4
Kafka Connect
5
Streaming ETL
6
Single Message Transformations for Kafka Connect
Modify events before storing in Kafka:
• Mask sensitive information
• Add identifiers
• Tag events
• Store lineage
• Remove unnecessary columns
Modify events going out of Kafka:
• Route high-priority events to faster data stores
• Direct events to different Elasticsearch indexes
• Cast data types to match the destination
• Remove unnecessary columns
7
Single Message Transformations for Kafka Connect
8
Built-in Transformations
• InsertField – Add a field using either static data or record metadata
• ReplaceField – Filter or rename fields
• MaskField – Replace a field with a valid null value for its type (0, empty string, etc.); see the example after this list
• ValueToKey – Set the key to one of the value’s fields
• HoistField – Wrap the entire event as a single field inside a Struct or a Map
• ExtractField – Extract a specific field from a Struct or Map and include only this field in the result
• SetSchemaMetadata – Modify the schema name or version
• TimestampRouter – Modify the topic of a record based on the original topic and timestamp. Useful when a sink needs to write to different tables or indexes based on timestamps
• RegexRouter – Modify the topic of a record based on the original topic, a replacement string, and a regular expression
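As a minimal sketch, masking a single column with MaskField (connector settings elided; the ssn field name is hypothetical):
transforms=MaskSSN
transforms.MaskSSN.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.MaskSSN.fields=ssn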
9
Configuring Single Message Transformations
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
transforms=MakeMap,InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source
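To make the chain concrete, a sketch of how a line from test.txt flows through it (schemaless values shown as JSON):
# raw value from FileStreamSource:        "hello world"
# after MakeMap (HoistField$Value):       {"line": "hello world"}
# after InsertSource (InsertField$Value): {"line": "hello world", "data_source": "test-file-source"}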
10
Why “Single” Message Transformations?
11
Why “Single” Message Transformations?
12
Single Message Transformation Use Cases
• Data masking: Mask sensitive information before it is sent to Kafka.
• e.g.: Capture data from a relational database to Kafka, but the data includes PCI/PII information and your Kafka cluster is not certified yet. SMTs let you mask those fields before they ever reach Kafka.
• Event routing: Modify an event’s destination based on the contents of the event (applies to events that need to be written to different database tables).
• e.g.: Write events from Kafka to Elasticsearch, but each event needs to go to a different index, based on information in the event itself.
• Event enhancement: Add additional fields to events while replicating.
• e.g.: Capture events from multiple data sources to Kafka, and include information about the source of the data in each event.
• Partitioning: Set the key for the event based on event information before it gets written to Kafka.
• e.g.: When reading records from a database table, partition the records in Kafka based on customer ID.
• Timestamp conversion: Standardize time-based data when integrating different systems (sketch below).
• e.g.: There are many ways to represent time. Kafka events are often read from logs, which use a format like "[2017-01-31 05:21:00,298]", but the key-value store the events are being written into prefers timestamps as milliseconds since 1970.
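Later Kafka releases also added a TimestampConverter transform for this last case. A minimal sketch (the event_time field name and the format are hypothetical) converting a string timestamp to epoch milliseconds:
transforms=ToEpoch
transforms.ToEpoch.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.ToEpoch.field=event_time
transforms.ToEpoch.target.type=unix
transforms.ToEpoch.format=yyyy-MM-dd HH:mm:ss,SSS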
13
14
Unix Pipelines
15
Streaming Pipelines
16
Levels of Abstraction
17
Programming With Configuration
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
transforms=MakeMap,InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source
18
Programming With Configuration
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
transforms=MakeMap,InsertSource,InsertKey,ExtractStoreId
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=storeId
transforms.ExtractStoreId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractStoreId.field=storeId
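The effect on a record whose value contains, say, storeId 42 (a sketch):
# after InsertKey (ValueToKey):            key = {"storeId": 42}
# after ExtractStoreId (ExtractField$Key): key = 42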
19
Programming With Configuration
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
transforms=MakeMap,InsertSource,InsertKey,ExtractStoreId,MessageTypeRouter
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=storeId
transforms.ExtractStoreId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractStoreId.field=storeId
transforms.MessageTypeRouter.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.MessageTypeRouter.regex=(foo|bar|baz)-.*
transforms.MessageTypeRouter.replacement=$1-logs
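The router only rewrites topics that match the regex (a sketch with hypothetical topic names):
# topic foo-requests matches (foo|bar|baz)-.* and becomes foo-logs
# topic connect-test does not match and is left unchanged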
20
The Right Tool For The Job
21
The Right Tool For The Job
KStream<Integer, Integer> input = builder.stream("numbers-topic");
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");
22
Order of Operations
name=my-sink
topics=foo-logs-jetty,foo-logs-app,bar-logs-jetty,bar-logs-app
topic.index.map=foo-logs-jetty:foo-logs,\
                foo-logs-app:foo-logs,\
                bar-logs-jetty:bar-logs,\
                bar-logs-app:bar-logs
transforms=Router
transforms.Router.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.Router.topic.format=${topic}-${timestamp}
transforms.Router.timestamp.format=yyyyMMddHH
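The catch, sketched below: transformations run before the sink task sees a record, so by the time the connector consults topic.index.map the Router has already rewritten the topic and the mapping no longer matches anything:
# record arrives on foo-logs-jetty during hour 2017-01-31 05
# after TimestampRouter: topic = foo-logs-jetty-2017013105
# topic.index.map has no entry for foo-logs-jetty-2017013105, so the mapping never applies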
23
Order of Operations
24
Schemas
25
Implementing a Transformation
/**
 * Single message transformation for Kafka Connect record types.
 *
 * Connectors can be configured with transformations to make lightweight
 * message-at-a-time modifications.
 */
public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {

    /**
     * Apply transformation to the {@code record} and return another record object.
     *
     * The implementation must be thread-safe.
     */
    R apply(R record);

    /** Configuration specification for this transformation. */
    ConfigDef config();

    /** Signal that this transformation instance will no longer be used. */
    @Override
    void close();
}
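As a rough sketch of what implementing one looks like (this UpperCaseField class is hypothetical, handles only schemaless Map values, and skips error handling):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class UpperCaseField<R extends ConnectRecord<R>> implements Transformation<R> {

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
        .define("field", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                "Name of the String field to upper-case");

    private String fieldName;

    @Override
    public void configure(Map<String, ?> configs) {
        fieldName = new AbstractConfig(CONFIG_DEF, configs).getString("field");
    }

    @Override
    @SuppressWarnings("unchecked")
    public R apply(R record) {
        // Only handles schemaless (Map) values; a real transform would
        // also handle Struct values with schemas.
        if (!(record.value() instanceof Map))
            return record;
        Map<String, Object> value = new HashMap<>((Map<String, Object>) record.value());
        Object field = value.get(fieldName);
        if (field instanceof String)
            value.put(fieldName, ((String) field).toUpperCase());
        // Build a new record with the modified value, keeping everything else.
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), null, value, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() { }
}

Packaged onto the worker's classpath, it is then configured like any built-in via transforms.<name>.type.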
26
It seems easy, but…
27
When should I use each tool?
Kafka Connect & Single Message Transforms
• Simple, message-at-a-time
• Transformation can be performed inline
• Transformation does not interact with external systems
• Keep it simple
Kafka Streams
• Complex transformations, including:
• Aggregations
• Windowing
• Joins
• Transformed data stored back in Kafka, enabling reuse
• Write, deploy, and monitor a Java application
28
Conclusion
Single Message Transforms in Kafka Connect
• Lightweight transformation of individual messages
• Configuration-only data pipelines
• Pluggable, with lots of built-in transformations; stick to the built-ins when you can.
29
Thanks!
http://confluent.io/download
