Apache Samza 1.0 - What's New, What's Next
- 2. Apache Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Slack, Metamarkets, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.
• Powers thousands of active jobs in production at LinkedIn
- 3. Stream Processing Architecture at LinkedIn
[Architecture diagram. Labeled stages: Ingestion, Processing, Results. Components shown: Services Tier; Espresso, Oracle, MySQL, Ambry; Brooklin (Changes); Kafka; Near Real Time Processing (Apache Samza); HDFS; Venice, Pinot, Couchbase.]
- 4. Samza Scale At LinkedIn
3K+ Jobs
900B+ Msgs Processed/Day
3K+ Machines
99.99% Availability
- 5. What's New
● Faster Onboarding
○ Make it fast and simple to learn Samza and create new applications.
● Powerful APIs
○ Provide the right level of expressibility for every use case.
● Ease of Development
○ Offer the right abstractions and tools to get things done quickly.
● Better Operability
○ Make it effortless and cost-effective to run applications at any scale.
- 8. Samza Course on YouTube
https://bit.ly/2TCS9x7
YouTube LIEngineering channel, Stream Processing Tutorials playlist.
- 9. Simpler Job Creation
● More samples in hello-samza
○ Samza SQL
○ EventHubs Consumer
○ Integration Tests
○ Running with YARN and Standalone
https://github.com/apache/samza-hello-samza
- 11. Example Application
Count number of ‘Page Views’ for each member in a 5 minute window
[Diagram: Page View → Repartition by member id → (Intermediate Stream) → Window → Map → SendTo → Page View Per Member]
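The "Repartition by member id" step exists so that every page view for a given member reaches the same task before counting. A minimal plain-Java sketch of that idea follows. It has no Samza dependency, and the class name, the partition count, and the hash scheme are assumptions made for illustration, not Samza's actual partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of key-based repartitioning; not Samza code.
class KeyPartitioner {
    private final int numPartitions;

    KeyPartitioner(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
    int partitionFor(String memberId) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    // Groups member ids into per-partition lists, preserving arrival order.
    List<List<String>> repartition(List<String> memberIds) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : memberIds) {
            partitions.get(partitionFor(id)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        KeyPartitioner partitioner = new KeyPartitioner(4);
        System.out.println(partitioner.repartition(Arrays.asList("alice", "bob", "alice", "carol")));
    }
}
```

Because the partition depends only on the key, both "alice" events land in the same list, which is what lets a downstream counter keep per-member state locally.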
- 12. Low Level API
Job 1: Repartitioner Job
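import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewRepartitioner implements StreamTask {
    private final SystemStream outputStream = new SystemStream("kafka", "pvMemberId");

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
        String key = pageViewEvent.getMemberId();
        // Key the outgoing message by member id so that the key also determines the partition.
        OutgoingMessageEnvelope outMessage = new OutgoingMessageEnvelope(outputStream, key, pageViewEvent);
        collector.send(outMessage);
    }
}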
- 13. Low Level API
Job 2: Page view counter job
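import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounter implements StreamTask {
    private final SystemStream outputStream = new SystemStream("kafka", "pageviewCount");
    private final HashMap<String, Integer> counter = new HashMap<>();
    private Instant lastTriggerTime = Instant.now();

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        PageViewEvent pageViewEvent = (PageViewEvent) envelope.getMessage();
        String memberId = pageViewEvent.getMemberId();
        counter.put(memberId, counter.getOrDefault(memberId, 0) + 1);
        // Emit once at least 5 minutes have elapsed; this check only runs when a message arrives.
        if (Duration.between(lastTriggerTime, Instant.now()).toMinutes() >= 5) {
            counter.forEach((key, value) -> collector.send(new OutgoingMessageEnvelope(outputStream, key, value)));
            counter.clear();
            lastTriggerTime = Instant.now();
        }
    }
}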
- 14. High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
- 15. High Level API
public class PageViewCountApplication implements StreamApplication {
    @Override
    public void describe(StreamApplicationDescriptor appDescriptor) {
        ...
        appDescriptor.getInputStream(pageViews)
            .partitionBy(m -> m.memberId, serde)
            .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
                initialValue, (m, c) -> c + 1))
            .map(PageViewCount::new)
            .sendTo(appDescriptor.getOutputStream(pageViewCounts));
    }
}
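The window call above takes a key function, a window length, an initial value, and a fold function `(m, c) -> c + 1`. The semantics can be sketched in plain Java as below. This is an illustrative model only: the class `KeyedTumblingFold`, its methods, and the explicit timestamp argument are hypothetical and are not part of the Samza API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical plain-Java model of a keyed tumbling window with a fold function.
class KeyedTumblingFold<M, K, V> {
    private final Function<M, K> keyFn;
    private final long windowMillis;
    private final Supplier<V> initialValue;
    private final BiFunction<M, V, V> foldFn;
    // window start time -> (key -> folded value)
    private final Map<Long, Map<K, V>> panes = new HashMap<>();

    KeyedTumblingFold(Function<M, K> keyFn, long windowMillis,
                      Supplier<V> initialValue, BiFunction<M, V, V> foldFn) {
        this.keyFn = keyFn;
        this.windowMillis = windowMillis;
        this.initialValue = initialValue;
        this.foldFn = foldFn;
    }

    // Folds the message into the pane of the window that contains its timestamp.
    void add(M message, long timestampMillis) {
        long windowStart = timestampMillis - Math.floorMod(timestampMillis, windowMillis);
        Map<K, V> pane = panes.computeIfAbsent(windowStart, w -> new HashMap<>());
        K key = keyFn.apply(message);
        V current = pane.containsKey(key) ? pane.get(key) : initialValue.get();
        pane.put(key, foldFn.apply(message, current));
    }

    // Removes and returns the pane for the window starting at windowStart (empty if none).
    Map<K, V> close(long windowStart) {
        Map<K, V> pane = panes.remove(windowStart);
        return pane == null ? new HashMap<>() : pane;
    }

    public static void main(String[] args) {
        KeyedTumblingFold<String, String, Integer> counts =
            new KeyedTumblingFold<String, String, Integer>(m -> m, 300_000L, () -> 0, (m, c) -> c + 1);
        counts.add("alice", 1_000L);
        counts.add("alice", 2_000L);
        counts.add("bob", 3_000L);
        counts.add("alice", 301_000L); // falls into the next 5 minute window
        System.out.println(counts.close(0L));
    }
}
```

Counting is just one fold; swapping the initial value and fold function gives sums, maxima, or collected lists without changing the windowing logic.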
- 16. Apache Beam
● Event Time Processing Support
● Multi-language APIs (Python)*
● Sliding Windows & Multi-Way Joins
* coming soon
- 17. Apache Beam
public class PageViewCount {
    public static void main(String[] args) {
        ...
        pipeline
            .apply(KafkaIO.<PageViewEvent>read()
                .withTopic("PageView")
                .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
                // Watermark trails the event time by 60 seconds
                .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
            .apply(Values.create())
            .apply(MapElements
                .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply(Count.perKey())
            .apply(MapElements.via(newCounter()))
            .apply(KafkaIO.<Counter>write().withTopic("PageViewCount"));
        pipeline.run();
    }
}
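The pipeline above assigns each record an event timestamp, derives a watermark that trails it by 60000 ms, and groups records into 5 minute windows. A rough plain-Java model of those three pieces is sketched below. The class and method names are hypothetical, and the completion rule is a simplification for illustration, not Beam's actual trigger logic.

```java
// Hypothetical model of event-time fixed windows with a lagging watermark; not Beam code.
class EventTimeWindows {
    static final long WINDOW_MILLIS = 5 * 60 * 1000L;  // 5 minute windows, as in the example
    static final long WATERMARK_LAG_MILLIS = 60_000L;  // watermark trails event time by 60 s

    // Start of the fixed window that contains the event timestamp.
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    // Watermark estimate derived from the latest event timestamp seen.
    static long watermark(long latestEventTimeMillis) {
        return latestEventTimeMillis - WATERMARK_LAG_MILLIS;
    }

    // In this sketch, a window [start, start + size) is complete once the watermark reaches its end.
    static boolean isComplete(long windowStart, long watermarkMillis) {
        return watermarkMillis >= windowStart + WINDOW_MILLIS;
    }

    public static void main(String[] args) {
        long eventTime = 301_000L;
        System.out.println("window start: " + windowStart(eventTime));
        System.out.println("first window complete: " + isComplete(0L, watermark(eventTime)));
    }
}
```

The lag is what gives late events a chance to arrive: an event stamped just before a window boundary is still counted as long as it shows up within 60 seconds of newer events.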
- 18. Samza SQL
● Declarative Streaming SQL API
● Create, Validate and Deploy in minutes using SQL Shell
● Managed Service at LinkedIn
● Capabilities: Filters, Projections, Flatten, UDFs, Stream-Table Joins
- 19. Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
- 20. Samza APIs
● Complex Processing Pipelines
● Easy Repartitioning
● Complex Windows and Joins
● Event and Arrival Time Processing
● Multi-Language APIs (Java, Python, SQL)
- 21. Samza APIs
[Diagram: Samza API layers: Low Level (StreamTask); High Level (StreamApplication); Samza SQL; Apache Beam (event time based windowed processing); languages: Java, Python]
- 23. Table API
● Evolution of the KVStore API
● Local and Remote K-V data sources
● Composition through hybrid tables
● Simplifies Stream-Table joins
● Remote Tables: Async I/O, Caching, Rate-limiting, and Retry
- 24. Stream Table Joins
Enrich ‘Page Views’ with Profile Info
[Diagram: Page Views → Join (against a RemoteTable backed by the Member Database) → SendTo → Enriched Page Views]
● Remote Table Features
○ Rate Limits to avoid DDoS
○ Async I/O
○ Caching / Retries
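Two of the remote table features listed above, caching and retries, can be illustrated with a small read-through wrapper. The sketch below is plain Java under stated assumptions: `CachingRetryingReader`, its constructor arguments, and the unbounded in-memory cache are all hypothetical, and rate limiting and async I/O are left out for brevity. It is not the Samza Table API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache with bounded retries around a remote lookup.
class CachingRetryingReader<K, V> {
    private final Function<K, V> remoteGet;
    private final int maxAttempts;
    private final Map<K, V> cache = new HashMap<>();

    CachingRetryingReader(Function<K, V> remoteGet, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.remoteGet = remoteGet;
        this.maxAttempts = maxAttempts;
    }

    // Returns the cached value if present; otherwise calls the remote store,
    // retrying on runtime failures up to maxAttempts times.
    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                V value = remoteGet.apply(key);
                if (value != null) {
                    cache.put(key, value);
                }
                return value;
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) {
        CachingRetryingReader<Integer, String> profiles =
            new CachingRetryingReader<Integer, String>(id -> "profile-" + id, 3);
        System.out.println(profiles.get(42));
    }
}
```

A production version would also bound the cache size and expire entries; those concerns are omitted here to keep the control flow visible.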
- 25. Table API
@Override
public void describe(StreamApplicationDescriptor appDesc) {
    ...
    TableDescriptor<Integer, Profile> tableDesc =
        new RocksDbTableDescriptor("profiles", serde);
    Table<KV<Integer, Profile>> profilesTable = appDesc.getTable(tableDesc);
    appDesc.getInputStream(profiles).sendTo(profilesTable);
    appDesc.getInputStream(pageViews)
        .map(m -> m.memberId)
        .join(profilesTable, new MyJoinFunc())
        .sendTo(decoratedProfiles);
}
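Conceptually, the join above looks up each incoming member id in the profiles table and emits an enriched record. A minimal plain-Java sketch of that lookup is shown below. The class name, the string-based record format, and the choice to drop page views with no matching profile are assumptions made for this illustration, not behavior taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a stream-table join; not Samza code.
class StreamTableJoinSketch {
    // Enriches each page view (represented by its member id) with the matching profile.
    // Page views without a profile are dropped in this sketch.
    static List<String> enrich(List<Integer> pageViewMemberIds, Map<Integer, String> profiles) {
        List<String> enriched = new ArrayList<>();
        for (Integer memberId : pageViewMemberIds) {
            String profile = profiles.get(memberId);
            if (profile != null) {
                enriched.add(memberId + ":" + profile);
            }
        }
        return enriched;
    }

    public static void main(String[] args) {
        Map<Integer, String> profiles = new HashMap<>();
        profiles.put(1, "Ann");
        profiles.put(2, "Bo");
        System.out.println(enrich(Arrays.asList(1, 3, 2, 1), profiles));
    }
}
```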
- 27. Test Framework
● Test your application against in-memory data.
● No need to set up Kafka / Yarn / Zookeeper locally.
● Works for both Low Level and High Level API applications.
- 28. Test Framework
@Test
public void testApplication() throws Exception {
    // Generate mock data
    List<PageView> pageViews = generateMockInput(...);
    List<DecoratedPageView> expectedOutput = generateMockOutput(...);

    // Get in-memory system and stream descriptors
    InMemorySystemDescriptor inMemorySystem = new InMemorySystemDescriptor("test");
    InMemoryInputDescriptor<PageView> pvDescriptor =
        inMemorySystem.getInputDescriptor("page-views", new NoOpSerde<PageView>());
    InMemoryOutputDescriptor<DecoratedPageView> dpvDescriptor =
        inMemorySystem.getOutputDescriptor("decorated-page-views", new NoOpSerde<DecoratedPageView>());

    // Configure the TestRunner
    TestRunner.of(new MyApplication())
        .addInputStream(pvDescriptor, pageViews)  // Associate data with the descriptor
        .addOutputStream(dpvDescriptor, 10)       // Output stream with 10 partitions
        .run(Duration.ofMillis(1000));

    // Add assertions on the output
    StreamAssert.containsInOrder(expectedOutput, dpvDescriptor, Duration.ofMillis(1000));
}
- 29. Offline Experimentation and Grandfathering
Application logic: Count number of ‘Page Views’ for each member in a 5 minute window and send the counts to ‘Page View Per Member’
[Diagram: Page View (in stream) → Repartition by member id → Window → Map → SendTo → Page View per Member (out stream); both streams on HDFS]
PageView: hdfs://mydbsnapshot/PageViewFiles/
PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
Zero code changes
- 31. Samza as a Service (YARN)
• Low Cost: Applications run over-subscribed and can use 2 to 4x more CPU than requested
• Supports Host Affinity for stateful jobs, as well as cleanup of state stores
• Job Management: Samza Dashboard, Metrics/Alerting dashboards, ELK for log management
• Multitenant and Fully-Managed: Applications request containers/resources and the service manages allocation and resource isolation
• Failure Handling: YARN has built-in retries
- 32. Samza as a Library (Standalone)
• Handle process failures via an external monitoring service
• Coordination via Zookeeper
• Enables canary support
• Host Affinity for stateful jobs
• Build event processing logic as part of a larger application
• Full control over how the app is hosted and over its entire life cycle
• Applications are typically hosted in VMs/Containers
- 33. Dedicated Clusters
● Dedicated machines for guaranteed capacity
● Isolation from noisy neighbors (hot machines)
● For large jobs with their own SRE teams
- 35. Samza Diagnostics
● Error analysis for applications
○ Top N Errors
○ Latest N Errors
○ Exception Navigation
○ Application / Container Incarnations
- 37. Faster Onboarding
● Bounded And Predictable Memory Usage
○ Avoid manual memory tuning during initial deploys
● More documentation, examples, and how-tos in hello-samza
- 38. Powerful APIs
● High Level API Async I/O support
● Python API via Apache Beam
● Samza SQL
○ Windowing (Aggregations)
○ Stream-Stream Joins
○ Nested data support
- 39. Sample Python Code
[Diagram: A Sample Pipeline: KafkaRead → Map → Window → Count → KafkaWrite]
p = Pipeline(options=pipeline_options)
(p
 | 'read' >> ReadFromKafka(cluster="tracking", topic="PageViewEvent", config=config)
 | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
 # 60-second fixed windows
 | 'windowing' >> beam.WindowInto(window.FixedWindows(60))
 | 'compute' >> beam.CombinePerKey(beam.combiners.CountCombineFn())
 | 'write' >> WriteToKafka(cluster="queuing", topic="PageViewCount", config=config))
p.run().wait_until_finish()
- 41. Better Operability
● Self-Serve Checkpoints
○ Set System / Stream / Partition Level Checkpoints
○ Set Time Based Checkpoints (e.g. "5 minutes ago") for all of the above
● State Restore Performance Improvements
○ Up to 60% faster restore times!
● Standby Containers With State Replication
● Host Affinity for Standalone
○ Support for stateful apps in ZK Standalone
● Queryable Local State
○ Read RocksDB store contents for debugging
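A time-based checkpoint such as "5 minutes ago" ultimately has to be translated into a concrete offset per partition. One way to picture that translation is a lookup in a timestamp-to-offset index, sketched below in plain Java. The class, the index structure, and the "first message at or after the target time" rule are assumptions made for illustration. The slides do not describe how Samza implements this.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical: resolves a target time to an offset using a timestamp -> offset index.
class TimeBasedCheckpoint {
    // Returns the offset of the first message at or after targetMillis, or null if none exists.
    static Long offsetAtOrAfter(TreeMap<Long, Long> offsetsByTimestamp, long targetMillis) {
        Map.Entry<Long, Long> entry = offsetsByTimestamp.ceilingEntry(targetMillis);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> index = new TreeMap<>();
        long now = System.currentTimeMillis();
        index.put(now - 600_000L, 100L); // message written 10 minutes ago
        index.put(now - 240_000L, 200L); // message written 4 minutes ago
        long fiveMinutesAgo = now - 300_000L;
        System.out.println("resume from offset " + offsetAtOrAfter(index, fiveMinutesAgo));
    }
}
```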
- 43. Apache Samza
0.7 July 2014
0.8 Dec 2014
0.9 Apr 2015
0.10 Dec 2015
0.11 Oct 2016
0.12 Feb 2017
0.13 June 2017
0.14 Jan 2018
1.0 Dec 2018
- 44. Context APIs
● Clear distinction between framework-created and application-created objects.
● Clear distinction between Container-scoped and Task-scoped objects.
● Ability to provide application context factories through the ApplicationDescriptor.
- 45. Side Inputs
● Bounded (compacted) streams with periodic updates
● Bootstrap semantics (first consume "fully", then in continuous mode)
● Ideal for periodic data pushes from Hadoop
○ E.g., ML features generated offline.
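The bootstrap semantics above (consume the side input fully first, then process the main stream) can be modeled in a few lines of plain Java. The sketch below is illustrative only: the class, the two-element `String[]` record format, and the `"missing"` default are hypothetical, and the continuous-update phase that follows bootstrap is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of side-input bootstrap semantics; not Samza code.
class SideInputBootstrap {
    // Phase 1: consume the side input fully into a local store. Later records
    // for the same key overwrite earlier ones, as in a compacted stream.
    static Map<String, String> bootstrap(List<String[]> sideInputRecords) {
        Map<String, String> store = new HashMap<>();
        for (String[] record : sideInputRecords) {
            store.put(record[0], record[1]);
        }
        return store;
    }

    // Phase 2: process the main stream, looking up each key in the bootstrapped store.
    static List<String> process(List<String> mainStream, Map<String, String> store) {
        List<String> output = new ArrayList<>();
        for (String memberId : mainStream) {
            output.add(memberId + "=" + store.getOrDefault(memberId, "missing"));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> features = Arrays.asList(
            new String[] {"alice", "0.2"},
            new String[] {"bob", "0.7"},
            new String[] {"alice", "0.9"});
        Map<String, String> store = bootstrap(features);
        System.out.println(process(Arrays.asList("alice", "carol"), store)); // prints [alice=0.9, carol=missing]
    }
}
```

Finishing phase 1 before starting phase 2 is the point of bootstrapping: no main-stream message is processed against a partially loaded store.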