How Braze uses the MongoDB Aggregation Pipeline for Lean, Mean, and Efficient Scaling
Presenting Today
Zach McCormick
Senior Software Engineer
Braze
@zachmccormick
Quick Intro to Braze
Braze empowers you to humanize your brand–customer relationships at scale.
Tens of billions of messages sent monthly
Global customer presence on six continents
More than 1 billion MAU
How Does It All Work?
• Push, email, in-app messaging, and more for our customers
• Integration via an SDK and REST API
• Real-time audience segmentation
• Batch, event-driven, and transactional API messaging
What does this look like at scale?
• Nearly 11 billion user profiles
  • Our customers' end users
• Over 8 billion Sidekiq jobs per day
  • Segmentation, messaging, analytics, data processing
• Over 6 billion API calls per day
  • User activity, transactional messaging
• Over 350k MongoDB IOPS across clusters
  • Powered by over 1,200 MongoDB shards across 65 different MongoDB clusters
TOC
Frequency Capping
What is it? How does it work at Braze?
The Original Design
How did it originally work? What were the issues?
Redesign using the Aggregation Pipeline
What does the new solution look like? Why is it
better?
Looking at the Results
Did it really improve performance? What’s next?
Jobs with Frequency Capping Enabled
Frequency Capping
Why Use Frequency Capping?
Frequency capping limits how many messages a single user can receive within a given time window, so customers don't over-message their users.
Let's look at our dashboard…
Where does it fit in the process?
Message
Any piece of content sent to a user, often
highly customized by combining Liquid logic
with user profiles.
Channel
A conduit for messages, such as push, email,
in-app message, news feed card, etc.
Campaign
A single-step message send including
segmentation, messages for various
channels, delivery time options, etc.
Canvas
A multi-step “journey” or “workflow”
including entry conditions, segmentation at
various branches, multiple messages, delays,
etc.
Message Sending Pipeline at Braze
• Lots of steps
• Business logic at every step
• High level of parallelization
Pipeline stages (diagram): Audience Segmentation → Variant Selection → FREQUENCY CAPPING → Volume Limiting → Subscription Preferences → Channel Selection → Send Time Optimization → Enqueueing → Render Payloads → Send Messages → Write Analytics
The Original Design
User Collection Example
{
_id: 123,
first_name: "Zach",
last_name: "McCormick",
email: "zach.mccormick@braze.com",
custom: {
twitter_handle: "zachmccormick",
favorite_food: "Greek",
loves_coffee: true
},
campaign_summaries: {
"Coffee Addict Promo": {
last_received: Date('2019-06-01T12:00:03Z'),
last_opened_email: Date('2019-06-01T12:03:19Z')
}
}
}
Frequency Capping Algorithm
(diagram: data flow between MongoDB and the Sidekiq worker)
1. A MongoDB query on "Users" returns eligible users in batches, transferring the full document for each user.
2. For each rule, the worker counts the campaigns received in the rule's time window and checks the rule.
3. Ineligible campaigns are removed.
4. The non-frequency-capped users move on to the next pipeline step.
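To make that flow concrete, here is a minimal sketch of what the v1 check amounts to, assuming illustrative rule shapes and helper names (the deck doesn't show Braze's actual classes):

require 'time'

# Illustrative rule shape: at most max_messages in the last window_seconds.
RULES = [{ window_seconds: 86_400, max_messages: 3 }]

# v1: the Sidekiq worker receives full user documents and counts
# campaign_summaries entries that fall inside each rule's time window.
def frequency_capped?(user, rules, now = Time.now.utc)
  summaries = user.fetch('campaign_summaries', {})
  rules.any? do |rule|
    window_start = now - rule[:window_seconds]
    received = summaries.count { |_name, s| s['last_received'] >= window_start }
    received >= rule[:max_messages]
  end
end

# eligible_user_batches stands in for the batched query on "Users".
eligible_user_batches.each do |batch|
  sendable = batch.reject { |user| frequency_capped?(user, RULES) }
  # ...continue the send pipeline with sendable...
end

Note that every full user document crosses the wire just so the worker can count a handful of timestamps, which is exactly the problem the next slides call out.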
What are some potential problems?
Frequency Capping Problems
• User profiles can be HUGE
  • 16 MB max doc size + batch processing
  • Network IO & RAM usage
• Not particularly fast…
  • A flame graph of the Sidekiq job shows the time is mostly spent waiting on queries!
• What about the same campaign sent twice?
  • "Last received" timestamps alone aren't enough data:

campaign_summaries: {
  "Coffee Addict Promo": {
    last_received: Date('2019-06-01T12:00:03Z'),
    last_opened_email: Date('2019-06-01T12:03:19Z')
  }
}
Maybe we can make it smarter?
Optimization Attempt #1
Micro-optimizations
• What if we limit what parts of the user profile document we bring back?
• We have aggregate stats, so we know when certain campaigns were sent
• However…
  • What if the frequency capping window is fairly large?
  • What if the customer has hundreds of millions of users?
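A hedged sketch of that first optimization: use a projection so only the relevant campaign_summaries entries come back, not the whole profile. Here users is the Users collection handle, and recently_sent_campaigns (derived from the aggregate stats) is an assumption:

# Build a projection that keeps only the campaign_summaries subdocuments
# that could possibly matter for the rules being checked.
projection = recently_sent_campaigns.to_h do |name|
  ["campaign_summaries.#{name}", 1]
end

users.find(
  { _id: { '$in' => user_ids } },
  projection: projection
).each do |doc|
  # doc now holds only _id plus the relevant campaign_summaries entries
end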
The solution was good,
but it was not enough…
Redesign using the Aggregation Pipeline
Aggregations at Braze Today
• Documents representing daily aggregate
statistics per-campaign
• Messages sent, messages opened,
conversion events, etc.
• Aggregation Pipeline queries for graphs,
charts, and reports
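The stats schema isn't shown in the deck, but a report query in that spirit might look like this sketch, with the collection and field names (daily_stats, day, messages_sent, messages_opened) all assumed:

# Roll 30 days of per-campaign daily statistics into chart-ready points.
daily_stats.aggregate([
  { '$match' => { campaign: 'Coffee Addict Promo',
                  day: { '$gte' => Time.now.utc - 30 * 86_400 } } },
  { '$group' => { _id: '$day',
                  sent:   { '$sum' => '$messages_sent' },
                  opened: { '$sum' => '$messages_opened' } } },
  { '$sort' => { '_id' => 1 } }
])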
What are the goals?
Redesign Goals
• Less network IO
  • Expensive!
• Less RAM usage
  • For huge campaigns, occasional OOM errors (visible in server logs)
• Much faster execution
  • Micro-optimizations are only going to go so far
Can we still use only User documents?
User Collection Example
{
_id: 123,
first_name: "Zach",
last_name: "McCormick",
email: "zach.mccormick@braze.com",
custom: {
twitter_handle: "zachmccormick",
favorite_food: "Greek",
loves_coffee: true
},
campaign_summaries: {
"Coffee Addict Promo": {
last_received: Date('2019-06-01T12:00:03Z'),
last_opened_email: Date('2019-06-01T12:03:19Z')
}
}
}
Campaign Summaries use a hash keyed by campaign name, not an array – only the latest timestamps survive, so there is no per-send history to filter.
What about a new supplementary document?
• We don't want to store more data on User profiles – already too big in some cases
• What if a new collection holds arrays of received campaigns?
  • We can use $slice to keep the arrays reasonably sized
  • We can use the same IDs as User profiles to shard efficiently
• What would that look like?
UserCampaignInteractionData Collection Example
{
  _id: 123,
  emails_received: [
    {
      date: Date('2019-06-01T12:00:03Z'),
      campaign: "CampaignB",
      dispatch_id: "identifier-for-send"
    },
    {
      date: Date('2019-05-29T13:37:00Z'),
      campaign: "CampaignA",
      dispatch_id: "identifier-for-send"
    },
    …
  ],
  …
}
UserCampaignInteractionData Collection Example
{
_id: 123,
emails_received: […],
android_push_received: […],
ios_push_received: […],
webhooks_received: […],
sms_received: […],
…
}
UserCampaignInteractionData Collection Example
(side-by-side slide: the new UserCampaignInteractionData document, as shown above, next to the existing User document with its single campaign_summaries hash)
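The deck doesn't show the write path, but keeping those per-channel arrays bounded would plausibly use $push with $each/$sort/$slice, as in this sketch; the cap of 100 and the variable names are assumptions:

# Record a send in the supplementary collection. $push with $each/$sort/$slice
# appends the new entry and trims the array to the newest 100 in one update.
interactions.update_one(
  { _id: user_id },  # same _id as the User profile, so both shard the same way
  { '$push' => {
    'emails_received' => {
      '$each'  => [{ date: Time.now.utc,
                     campaign: campaign_name,
                     dispatch_id: dispatch_id }],
      '$sort'  => { date: -1 },  # newest first
      '$slice' => 100            # keep only the 100 most recent sends
    }
  } },
  upsert: true
)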
NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
   1. Only look at the relevant time window
   2. Don't include the current dispatch (for multi-channel sends)
   3. Exclude campaigns that don't count toward frequency capping

Resulting document (after the first projection):
{
  "Zach": {
    "email_86400": [
      {
        "dispatch_id": …,
        "date": …,
        "campaign": …
      },
      …
    ]
  }
}
NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
   1. Only look at the relevant time window
   2. Don't include the current dispatch (for multi-channel sends)
   3. Exclude campaigns that don't count toward frequency capping
3. Second projection
   1. Only bring back dispatch IDs

Resulting document (after the second projection):
{
  "Zach": {
    "email_86400": [
      "campaign-a-dispatch-id",
      "campaign-b-dispatch-id"
    ]
  }
}
UserCampaignInteractionData Query Example
first_projection["email_86400"] = {
  :$filter => {
    :input => "$emails_received",
    :cond => {
      :$and => [
        # first make sure the tuple we care about is within the rule's time window
        {:$gte => [
          "$$this.date", Time.utc(2019, 6, 9, 12, 0, 0)
        ]},
        # next make sure we don't include transactional messages
        {:$not => [
          {:$in => [
            "$$this.campaign", ["Txn Message One", "Txn Message Two"]
          ]}
        ]}
      ]
    }
  }
}
UserCampaignInteractionData Query Example
# the second $project stage runs on the output of the first, so it reads the
# already-filtered array and keeps only the dispatch IDs
second_projection["email_86400"] = "$email_86400.dispatch_id"
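Putting the stages together, the whole aggregation plausibly reads like the sketch below; user_ids, window_start, current_dispatch_id, and max_messages are stand-ins for values the worker already has:

pipeline = [
  # 1. Match stage: only the users in this batch.
  { '$match' => { _id: { '$in' => user_ids } } },
  # 2. First projection: $filter the channel array down to the rule's window,
  #    excluding the current dispatch (transactional exclusions would be
  #    added here too, as in the $filter example above).
  { '$project' => {
    'email_86400' => {
      '$filter' => {
        input: '$emails_received',
        cond: { '$and' => [
          { '$gte' => ['$$this.date', window_start] },
          { '$ne'  => ['$$this.dispatch_id', current_dispatch_id] }
        ] }
      }
    }
  } },
  # 3. Second projection: collapse each array to bare dispatch IDs, so almost
  #    nothing crosses the network.
  { '$project' => { 'email_86400' => '$email_86400.dispatch_id' } }
]

interactions.aggregate(pipeline).each do |doc|
  capped = doc['email_86400'].size >= max_messages
  # ...drop capped users from this send...
end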
Looking at the Results
Frequency Capping – Network Bandwidth
Frequency Capping Version 1: MongoDB → Sidekiq, transferring full user profiles
vs.
Frequency Capping Version 2: MongoDB → Sidekiq, transferring only dispatch IDs
Frequency Capping v1 vs. v2 Max Duration
Frequency Capping v1 vs. v2 Median Duration
How did this get deployed?
Deployment Strategies
• All functionality behind a feature flipper
  • Easy to turn on/off by customer
• Lots of excess code
  • Feature flipper logic is simple – use class X or class Y (see the sketch after this list)
• Feature flipped on slowly
  • Hourly and daily check-ins on Datadog
  • Minimize impact if something goes wrong
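A minimal sketch of that class-swap pattern; FeatureFlipper and both class names are illustrative stand-ins, not Braze's actual code:

# Both implementations expose the same interface, so callers never know
# which one the flipper picked.
def frequency_capping_strategy(customer)
  if FeatureFlipper.enabled?(:aggregation_frequency_capping, customer)
    AggregationPipelineFrequencyCapping.new  # v2: aggregation pipeline
  else
    LegacyFrequencyCapping.new               # v1: full-profile batch scan
  end
end

Rolling out slowly then amounts to enabling the flag customer by customer while watching the Datadog dashboards.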
What’s Next?
Frequency Capping by Tag
first_projection["email_marketing_86400"] = {
  :$filter => {
    :input => "$emails_received",
    :cond => {
      :$and => [
        …,
        # only include campaigns tagged "marketing"
        {:$in => [
          "$$this.campaign", ["July 4 Promo", "Memorial Day Sale", …]
        ]}
      ]
    }
  }
}
What else?
• Set the foundation for future expectations
• Customers are always going to want to send messages
• Faster and faster
• With more detailed segmentation
• With more complex inclusion/exclusion rules
Thank you! We are hiring!
braze.com/careers