Apache Samza 1.0 - What's New, What's Next
Apache Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Slack, Metamarkets, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
• Powers thousands of active jobs in production at LinkedIn
Stream Processing Architecture at LinkedIn
[Architecture diagram] Ingestion side: Espresso, Oracle, MySQL, Ambry, and the services tier, with change streams flowing through Brooklin and Kafka. Processing: near-real-time processing with Apache Samza. Results side: Venice, Pinot, Couchbase, and HDFS.
Samza Scale at LinkedIn
● 3K+ jobs
● 900B+ messages processed per day
● 3K+ machines
● 99.99% availability
What's New
● Faster Onboarding
○ Make it fast and simple to learn Samza and create new applications.
● Powerful APIs
○ Provide the right level of expressibility for every use case.
● Ease of Development
○ Offer the right abstractions and tools to get things done quickly.
● Better Operability
○ Make it effortless and cost effective to run applications at any scale.
Faster Onboarding
Revamped Website and Documentation
Samza Course on YouTube
https://bit.ly/2TCS9x7
Stream Processing Tutorials playlist on the LIEngineering YouTube channel.
Simpler Job Creation
● More samples in hello-samza
○ Samza SQL
○ EventHubs Consumer
○ Integration Tests
○ Running with YARN and Standalone
https://github.com/apache/samza-hello-samza
Powerful APIs
Example Application
Count the number of ‘Page Views’ for each member in a 5-minute window
[Pipeline diagram] Page View → Repartition by member id → (intermediate stream) → Window → Map → SendTo → Page View Per Member
Low Level API
Job 1: Repartitioner Job
public class PageViewRepartitioner implements StreamTask {
  private final SystemStream outputStream = new SystemStream("kafka", "pvMemberId");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
    String key = pageViewEvent.getMemberId();
    // Partition the intermediate stream by member id so the counter job sees
    // all page views for a member in the same task.
    OutgoingMessageEnvelope outMessage =
        new OutgoingMessageEnvelope(outputStream, key, key, pageViewEvent);
    collector.send(outMessage);
  }
}
Low Level API
Job 2: Page view counter job
public class PageViewCounter implements StreamTask {
  private final SystemStream outputStream = new SystemStream("kafka", "pageviewCount");
  private final HashMap<String, Integer> counter = new HashMap<>();
  private Instant lastTriggerTime = Instant.now();

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
    String memberId = pageViewEvent.getMemberId();
    counter.put(memberId, counter.getOrDefault(memberId, 0) + 1);
    // Emit and reset the per-member counts once the 5-minute window has elapsed.
    if (Duration.between(lastTriggerTime, Instant.now()).toMinutes() > 5) {
      counter.forEach((key, value) -> collector.send(new OutgoingMessageEnvelope(outputStream, key, value)));
      counter.clear();
      lastTriggerTime = Instant.now();
    }
  }
}
High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
High Level API
public class PageViewCountApplication implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor appDescriptor) {
    ...
    appDescriptor.getInputStream(pageViews)
        .partitionBy(m -> m.memberId, serde)
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))
        .map(PageViewCount::new)
        .sendTo(appDescriptor.getOutputStream(pageViewCounts));
  }
}
Apache Beam
● Event Time Processing Support
● Multi-language APIs (Python)*
● Sliding Windows & Multi-Way Joins
* coming soon
Apache Beam
public class PageViewCount {
  public static void main(String[] args) {
    ...
    pipeline
        .apply(KafkaIO.<PageViewEvent>read()
            .withTopic("PageView")
            .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
            .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
        .apply(Values.create())
        .apply(MapElements
            .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey())
        .apply(MapElements.via(newCounter()))
        .apply(KafkaIO.<Counter>write().withTopic("PageViewCount"));
    pipeline.run();
  }
}
Samza SQL
● Declarative Streaming SQL API
● Create, Validate and Deploy in minutes using SQL Shell
● Managed Service at LinkedIn
● Capabilities: Filters, Projections, Flatten, UDFs, Stream-Table Joins
Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
Samza APIs
● Complex Processing Pipelines
● Easy Repartitioning
● Complex Windows and Joins
● Event and Arrival Time Processing
● Multi-Language APIs (Java, Python, SQL)
[API stack diagram] Samza APIs: Low Level (StreamTask), High Level (StreamApplication), Samza SQL, and Apache Beam (event-time based windowed processing), spanning Java and Python.
Easier Development
Table API
● Evolution of the KVStore API
● Local and Remote K-V data sources
● Composition through hybrid tables
● Simplifies Stream-Table joins
● Remote Tables: Async I/O, Caching, Rate-limiting, and Retry
Stream Table Joins
Enrich ‘Page Views’ with Profile Info
[Pipeline diagram] Page Views → Join (remote table backed by the Member Database) → SendTo → Enriched Page Views
● Remote Table Features (sketched below)
○ Rate Limits to avoid DDoS
○ Async I/O
○ Caching / Retries
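For illustration, a rough sketch of how these features could be wired up with the 1.0 Table API; ProfileReadFunction, PageViewToProfileJoinFn, the table ids, and the exact builder methods are assumptions based on the table documentation, not verbatim production code:

// Sketch only: ProfileReadFunction, PageViewToProfileJoinFn, and the limits
// shown are hypothetical; builder method names may differ slightly by release.
TableDescriptor<Integer, Profile> remoteProfiles =
    new RemoteTableDescriptor<Integer, Profile>("remote-profiles")
        .withReadFunction(new ProfileReadFunction())   // async lookups against the member database
        .withReadRateLimit(1000)                       // cap reads/sec to avoid DDoS-ing the store
        .withReadRetryPolicy(new TableRetryPolicy().withFixedBackoff(Duration.ofMillis(100)));

// Wrap the remote table in a read-through cache to absorb hot keys.
TableDescriptor<Integer, Profile> cachedProfiles =
    new CachingTableDescriptor<>("cached-profiles", remoteProfiles)
        .withReadTtl(Duration.ofMinutes(5));

Table<KV<Integer, Profile>> profilesTable = appDesc.getTable(cachedProfiles);
appDesc.getInputStream(pageViews)
    .join(profilesTable, new PageViewToProfileJoinFn())
    .sendTo(enrichedPageViews);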
Table API
@Override
public void describe(StreamApplicationDescriptor appDesc) {
  ...
  TableDescriptor<Integer, Profile> tableDesc =
      new RocksDbTableDescriptor("profiles", serde);
  Table<KV<Integer, Profile>> profilesTable = appDesc.getTable(tableDesc);

  appDesc.getInputStream(profiles).sendTo(profilesTable);
  appDesc.getInputStream(pageViews)
      .map(m -> m.memberId)
      .join(profilesTable, new MyJoinFunc())
      .sendTo(decoratedProfiles);
}
Configuration Descriptors
Specify system, stream and table properties in code instead of configuration.
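For illustration, a minimal sketch of the descriptor style, assuming a local Kafka setup; the endpoints, stream names, and event types are placeholders rather than verbatim production code:

// Sketch: system, stream, and serde properties declared next to the logic
// instead of in a .properties file. Endpoints and types are placeholders.
public void describe(StreamApplicationDescriptor appDescriptor) {
  KafkaSystemDescriptor kafka = new KafkaSystemDescriptor("kafka")
      .withConsumerZkConnect(Collections.singletonList("localhost:2181"))
      .withProducerBootstrapServers(Collections.singletonList("localhost:9092"));

  KafkaInputDescriptor<PageViewEvent> pageViews =
      kafka.getInputDescriptor("PageView", new JsonSerdeV2<>(PageViewEvent.class));
  KafkaOutputDescriptor<PageViewCount> pageViewCounts =
      kafka.getOutputDescriptor("PageViewCount", new JsonSerdeV2<>(PageViewCount.class));

  // The rest of the pipeline is identical to the High Level API example above.
  appDescriptor.getInputStream(pageViews)
      ...
      .sendTo(appDescriptor.getOutputStream(pageViewCounts));
}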
Test Framework
● Test your application against in-memory data.
● No need to set up Kafka / YARN / ZooKeeper locally.
● Works for both Low Level and High Level API applications.
Test Framework
@Test
public void testApplication() throws Exception {
  // Generate Mock Data
  List<PageView> pageViews = generateMockInput(...);
  List<DecoratedPageView> expectedOutput = generateMockOutput(...);

  // Get In Memory System and Stream Descriptors
  InMemorySystemDescriptor inMemorySystem = new InMemorySystemDescriptor("test");
  InMemoryInputDescriptor<PageView> pvDescriptor = inMemorySystem.getInputDescriptor("page-views");
  InMemoryOutputDescriptor<DecoratedPageView> dpvDescriptor = inMemorySystem.getOutputDescriptor("decorated-page-views");

  // Configure the TestRunner
  TestRunner.of(new MyApplication())
      .addInputStream(pvDescriptor, pageViews) // Associate data with the descriptor
      .addOutputStream(dpvDescriptor, 10)
      .run(Duration.ofMillis(1000));

  // Add assertions on the output
  StreamAssert.containsInOrder(expectedOutput, dpvDescriptor, Duration.ofMillis(1000));
}
Offline Experimentation and Grandfathering
Application logic: count the number of ‘Page Views’ for each member in a 5-minute window and send the counts to ‘Page View Per Member’
[Pipeline diagram] Page View (input stream) → Repartition by member id → Window → Map → SendTo → Page View per Member (output stream)
Running the same pipeline over HDFS, with zero code changes:
PageView: hdfs://mydbsnapshot/PageViewFiles/
PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
Better Operability
Samza as a Service (YARN)
• Low Cost: applications run over-subscribed and can use 2 to 4x more CPU than requested
• Host Affinity for stateful jobs, plus cleanup of state stores
• Job Management: Samza Dashboard, metrics/alerting dashboards, ELK for log management
• Multitenant and Fully Managed: applications request containers/resources and the service manages allocation and resource isolation
• Failure Handling: YARN has built-in retries
Samza as a Library (Standalone)
• Handle process failures via an external monitoring service
• Coordination via ZooKeeper
• Enables canary support
• Host Affinity for stateful jobs
• Build event processing logic as part of a larger application
• Full control over how the app is hosted and over its entire lifecycle
• Applications are typically hosted in VMs/containers
Dedicated Clusters
● Dedicated machines for guaranteed capacity
● Isolation from noisy neighbors (hot machines)
● For large jobs with their own SRE teams
Heterogeneous Clusters
● Clusters with spinning disks instead of SSDs
● Lower cost-to-serve (C2S) for stateless jobs
Samza Diagnostics
● Error analysis for applications
○ Top N Errors
○ Latest N Errors
○ Exception Navigation
○ Application / Container Incarnations
Coming Soon
Faster Onboarding
● Bounded And Predictable Memory Usage
○ Avoid manual memory tuning during initial deploys
● More documentation, examples, and how-tos in hello-samza
Powerful APIs
● High Level API Async I/O support
● Python API via Apache Beam
● Samza SQL
○ Windowing (Aggregations)
○ Stream-Stream Joins
○ Nested data support
Sample Python Code
A Sample Pipeline: KafkaRead → Map → Window → Count → KafkaWrite
p = Pipeline(options=pipeline_options)
(p
 | 'read' >> ReadFromKafka(cluster="tracking",
                           topic="PageViewEvent", config=config)
 | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
 | 'windowing' >> beam.WindowInto(window.FixedWindows(60))
 | 'compute' >> beam.CombinePerKey(beam.combiners.CountCombineFn())
 | 'write' >> WriteToKafka(cluster="queuing",
                           topic="PageViewCount", config=config))
p.run().waitUntilFinish()
Easier Development
● Table API
○ Couchbase Table
○ Batching for Remote Tables
Better Operability
● Self-Serve Checkpoints
○ Set System / Stream / Partition Level Checkpoints
○ Set Time Based Checkpoints (e.g. "5 minutes ago") for all of the above
● State Restore Performance Improvements
○ Up to 60% faster restore times!
● Standby Containers With State Replication
● Host Affinity for Standalone
○ Support for stateful apps in ZK Standalone
● Queryable Local State
○ Read RocksDB store contents for debugging
Thank You!
samza.apache.org
dev@samza.apache.org
Apache Samza
0.7 July 2014
0.8 Dec 2014
0.9 Apr 2015
0.10 Dec 2015
0.11 Oct 2016
0.12 Feb 2017
0.13 June 2017
0.14 Jan 2018
1.0 Dec 2018
Context APIs
● Clear distinction between framework-created and application-created objects.
● Clear distinction between container-scoped and task-scoped objects.
● Ability to provide application context factories through the
ApplicationDescriptor.
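A rough sketch of what this looks like from a high-level operator, assuming a hypothetical MyContainerContext that holds a shared metrics client; the corresponding ApplicationContainerContextFactory would be registered through the ApplicationDescriptor as noted above:

// Sketch only: MyContainerContext and MyMetricsClient are hypothetical; the
// factory that creates them is registered on the ApplicationDescriptor (not shown).
public class CountingMapFn implements MapFunction<PageViewEvent, PageViewEvent> {
  private MyMetricsClient metrics; // container-scoped, shared by all tasks in the container

  @Override
  public void init(Context context) {
    // Framework-created contexts (job/container/task) and application-created
    // contexts are reachable from the same object, with explicit scopes.
    MyContainerContext appContext =
        (MyContainerContext) context.getApplicationContainerContext();
    this.metrics = appContext.getMetricsClient();
  }

  @Override
  public PageViewEvent apply(PageViewEvent pageView) {
    metrics.increment("page-views");
    return pageView;
  }
}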
Side Inputs
● Bounded (compacted) streams with periodic updates
● Bootstrap semantics (first consume "fully", then in continuous mode)
● Ideal for periodic data pushes from Hadoop
○ E.g., ML features generated offline.
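For illustration, a rough sketch of a local table bootstrapped from such an offline feature push; the stream name, MLFeatures type, and serdes are placeholders, and the side-input hooks shown follow the table documentation rather than verbatim production code:

// Sketch only: "hdfs.member-features", MLFeatures, and the serdes are placeholders;
// the withSideInputs / withSideInputsProcessor hooks may differ slightly by release.
TableDescriptor<String, MLFeatures> featureTable =
    new RocksDbTableDescriptor<>("member-features",
            KVSerde.of(new StringSerde(), new JsonSerdeV2<>(MLFeatures.class)))
        // Bootstrap: consume the compacted feature stream fully before normal
        // processing starts, then keep applying the periodic Hadoop pushes.
        .withSideInputs(Collections.singletonList("hdfs.member-features"))
        .withSideInputsProcessor((envelope, store) ->
            Collections.singletonList(
                new Entry<>((String) envelope.getKey(), (MLFeatures) envelope.getMessage())));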