How Braze uses the MongoDB Aggregation Pipeline for Lean, Mean, and Efficient Scaling
Presenting Today
Zach McCormick
Senior Software Engineer
Braze
@zachmccormick
Quick Intro to Braze
Braze empowers you to humanize your brand–customer relationships at scale.
Tens of billions of messages sent monthly
Global customer presence on six continents
More than 1 billion MAU
How Does It All Work?
• Push, email, in-app messaging, and more for our customers
• Integration via an SDK and REST API
• Real-time audience segmentation
• Batch, event-driven, and transactional API messaging
What does this look like at scale?
• Nearly 11 billion user profiles
  • Our customers' end users
• Over 8 billion Sidekiq jobs per day
  • Segmentation, messaging, analytics, data processing
• Over 6 billion API calls per day
  • User activity, transactional messaging
• Over 350k MongoDB IOPS across clusters
  • Powered by over 1,200 MongoDB shards across 65 different MongoDB clusters
TOC
Frequency Capping
What is it? How does it work at Braze?
The Original Design
How did it originally work? What were the issues?
Redesign using the Aggregation Pipeline
What does the new solution look like? Why is it
better?
Looking at the Results
Did it really improve performance? What’s next?
Jobs with Frequency Capping Enabled
Frequency Capping
Why Use Frequency Capping?
Frequency capping limits how many messages a single user can receive within a given time window, so customers don't over-message their users.
Let's look at our dashboard…
Where does it fit in the process?
Message
Any piece of content sent to a user, often
highly customized by combining Liquid logic
with user profiles.
Channel
A conduit for messages, such as push, email,
in-app message, news feed card, etc.
Campaign
A single-step message send including
segmentation, messages for various
channels, delivery time options, etc.
Canvas
A multi-step “journey” or “workflow”
including entry conditions, segmentation at
various branches, multiple messages, delays,
etc.
Message Sending Pipeline at Braze
• Lots of steps
• Business logic at every step
• High level of parallelization
Pipeline stages (diagram): Audience Segmentation → Variant Selection → FREQUENCY CAPPING → Volume Limiting → Subscription Preferences → Channel Selection → Send Time Optimization → Enqueueing → Render Payloads → Send Messages → Write Analytics
The Original Design
User Collection Example
{
_id: 123,
first_name: "Zach",
last_name: "McCormick",
email: "zach.mccormick@braze.com",
custom: {
twitter_handle: "zachmccormick",
favorite_food: "Greek",
loves_coffee: true
},
campaign_summaries: {
"Coffee Addict Promo": {
last_received: Date('2019-06-01T12:00:03Z'),
last_opened_email: Date('2019-06-01T12:03:19Z')
}
}
}
Frequency Capping Algorithm
(diagram: data flow between MongoDB and the Sidekiq worker)
1. A MongoDB query on "Users" returns eligible users in batches, transferring the full document for each user.
2. For each rule, the worker counts the campaigns received in the rule's time window and checks the rule.
3. Ineligible campaigns are removed.
4. The non-frequency-capped users move on to the next pipeline step.
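To make that flow concrete, here is a minimal sketch of what the v1 check amounts to, assuming illustrative rule shapes and helper names (the deck doesn't show Braze's actual classes):

require 'time'

# Illustrative rule shape: at most max_messages in the last window_seconds.
RULES = [{ window_seconds: 86_400, max_messages: 3 }]

# v1: the Sidekiq worker receives full user documents and counts
# campaign_summaries entries that fall inside each rule's time window.
def frequency_capped?(user, rules, now = Time.now.utc)
  summaries = user.fetch('campaign_summaries', {})
  rules.any? do |rule|
    window_start = now - rule[:window_seconds]
    received = summaries.count { |_name, s| s['last_received'] >= window_start }
    received >= rule[:max_messages]
  end
end

# eligible_user_batches stands in for the batched query on "Users".
eligible_user_batches.each do |batch|
  sendable = batch.reject { |user| frequency_capped?(user, RULES) }
  # ...continue the send pipeline with sendable...
end

Note that every full user document crosses the wire just so the worker can count a handful of timestamps, which is exactly the problem the next slides call out.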
What are some potential problems?
Frequency Capping Problems
• User profiles can be HUGE
  • 16 MB max doc size + batch processing
  • Network IO & RAM usage
• Not particularly fast…
  • A flame graph of the Sidekiq job shows the time is mostly spent waiting on queries!
• What about the same campaign sent twice?
  • "Last received" timestamps alone aren't enough data:

campaign_summaries: {
  "Coffee Addict Promo": {
    last_received: Date('2019-06-01T12:00:03Z'),
    last_opened_email: Date('2019-06-01T12:03:19Z')
  }
}
Maybe we can make it smarter?
Optimization Attempt #1
Micro-optimizations
• What if we limit what parts of the user profile document we bring back?
• We have aggregate stats, so we know when certain campaigns were sent
• However…
  • What if the frequency capping window is fairly large?
  • What if the customer has hundreds of millions of users?
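A hedged sketch of that first optimization: use a projection so only the relevant campaign_summaries entries come back, not the whole profile. Here users is the Users collection handle, and recently_sent_campaigns (derived from the aggregate stats) is an assumption:

# Build a projection that keeps only the campaign_summaries subdocuments
# that could possibly matter for the rules being checked.
projection = recently_sent_campaigns.to_h do |name|
  ["campaign_summaries.#{name}", 1]
end

users.find(
  { _id: { '$in' => user_ids } },
  projection: projection
).each do |doc|
  # doc now holds only _id plus the relevant campaign_summaries entries
end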
The solution was good,
but it was not enough…
Redesign using the Aggregation Pipeline
Aggregations at Braze Today
• Documents representing daily aggregate
statistics per-campaign
• Messages sent, messages opened,
conversion events, etc.
• Aggregation Pipeline queries for graphs,
charts, and reports
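The stats schema isn't shown in the deck, but a report query in that spirit might look like this sketch, with the collection and field names (daily_stats, day, messages_sent, messages_opened) all assumed:

# Roll 30 days of per-campaign daily statistics into chart-ready points.
daily_stats.aggregate([
  { '$match' => { campaign: 'Coffee Addict Promo',
                  day: { '$gte' => Time.now.utc - 30 * 86_400 } } },
  { '$group' => { _id: '$day',
                  sent:   { '$sum' => '$messages_sent' },
                  opened: { '$sum' => '$messages_opened' } } },
  { '$sort' => { '_id' => 1 } }
])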
What are the goals?
Redesign Goals
• Less network IO
  • Expensive!
• Less RAM usage
  • For huge campaigns, occasional OOM errors (visible in server logs)
• Much faster execution
  • Micro-optimizations are only going to go so far
Can we still use only User documents?
User Collection Example
{
_id: 123,
first_name: "Zach",
last_name: "McCormick",
email: "zach.mccormick@braze.com",
custom: {
twitter_handle: "zachmccormick",
favorite_food: "Greek",
loves_coffee: true
},
campaign_summaries: {
"Coffee Addict Promo": {
last_received: Date('2019-06-01T12:00:03Z'),
last_opened_email: Date('2019-06-01T12:03:19Z')
}
}
}
Campaign Summaries use a hash keyed by campaign name, not an array – only the latest timestamps survive, so there is no per-send history to filter.
What about a new supplementary document?
• We don't want to store more data on User profiles – already too big in some cases
• What if a new collection holds arrays of received campaigns?
  • We can use $slice to keep the arrays reasonably sized
  • We can use the same IDs as User profiles to shard efficiently
• What would that look like?
UserCampaignInteractionData Collection Example
{
  _id: 123,
  emails_received: [
    {
      date: Date('2019-06-01T12:00:03Z'),
      campaign: "CampaignB",
      dispatch_id: "identifier-for-send"
    },
    {
      date: Date('2019-05-29T13:37:00Z'),
      campaign: "CampaignA",
      dispatch_id: "identifier-for-send"
    },
    …
  ],
  …
}
UserCampaignInteractionData Collection Example
{
_id: 123,
emails_received: […],
android_push_received: […],
ios_push_received: […],
webhooks_received: […],
sms_received: […],
…
}
UserCampaignInteractionData Collection Example
(side-by-side slide: the new UserCampaignInteractionData document, as shown above, next to the existing User document with its single campaign_summaries hash)
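The deck doesn't show the write path, but keeping those per-channel arrays bounded would plausibly use $push with $each/$sort/$slice, as in this sketch; the cap of 100 and the variable names are assumptions:

# Record a send in the supplementary collection. $push with $each/$sort/$slice
# appends the new entry and trims the array to the newest 100 in one update.
interactions.update_one(
  { _id: user_id },  # same _id as the User profile, so both shard the same way
  { '$push' => {
    'emails_received' => {
      '$each'  => [{ date: Time.now.utc,
                     campaign: campaign_name,
                     dispatch_id: dispatch_id }],
      '$sort'  => { date: -1 },  # newest first
      '$slice' => 100            # keep only the 100 most recent sends
    }
  } },
  upsert: true
)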
NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
   1. Only look at the relevant time window
   2. Don't include the current dispatch (for multi-channel sends)
   3. Exclude campaigns that don't count toward frequency capping

Resulting document (after the first projection):
{
  "Zach": {
    "email_86400": [
      {
        "dispatch_id": …,
        "date": …,
        "campaign": …
      },
      …
    ]
  }
}
NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
   1. Only look at the relevant time window
   2. Don't include the current dispatch (for multi-channel sends)
   3. Exclude campaigns that don't count toward frequency capping
3. Second projection
   1. Only bring back dispatch IDs

Resulting document (after the second projection):
{
  "Zach": {
    "email_86400": [
      "campaign-a-dispatch-id",
      "campaign-b-dispatch-id"
    ]
  }
}
UserCampaignInteractionData Query Example
first_projection["email_86400"] = {
  :$filter => {
    :input => "$emails_received",
    :cond => {
      :$and => [
        # first make sure the tuple we care about is within the rule's time window
        {:$gte => [
          "$$this.date", Time.utc(2019, 6, 9, 12, 0, 0)
        ]},
        # next make sure we don't include transactional messages
        {:$not => [
          {:$in => [
            "$$this.campaign", ["Txn Message One", "Txn Message Two"]
          ]}
        ]}
      ]
    }
  }
}
UserCampaignInteractionData Query Example
# the second $project stage runs on the output of the first, so it reads the
# already-filtered array and keeps only the dispatch IDs
second_projection["email_86400"] = "$email_86400.dispatch_id"
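Putting the stages together, the whole aggregation plausibly reads like the sketch below; user_ids, window_start, current_dispatch_id, and max_messages are stand-ins for values the worker already has:

pipeline = [
  # 1. Match stage: only the users in this batch.
  { '$match' => { _id: { '$in' => user_ids } } },
  # 2. First projection: $filter the channel array down to the rule's window,
  #    excluding the current dispatch (transactional exclusions would be
  #    added here too, as in the $filter example above).
  { '$project' => {
    'email_86400' => {
      '$filter' => {
        input: '$emails_received',
        cond: { '$and' => [
          { '$gte' => ['$$this.date', window_start] },
          { '$ne'  => ['$$this.dispatch_id', current_dispatch_id] }
        ] }
      }
    }
  } },
  # 3. Second projection: collapse each array to bare dispatch IDs, so almost
  #    nothing crosses the network.
  { '$project' => { 'email_86400' => '$email_86400.dispatch_id' } }
]

interactions.aggregate(pipeline).each do |doc|
  capped = doc['email_86400'].size >= max_messages
  # ...drop capped users from this send...
end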
Looking at the Results
Frequency Capping – Network Bandwidth
Frequency Capping Version 1: MongoDB → Sidekiq, transferring full user profiles
vs.
Frequency Capping Version 2: MongoDB → Sidekiq, transferring only dispatch IDs
Frequency Capping v1 vs. v2 Max Duration
Frequency Capping v1 vs. v2 Median Duration
How did this get deployed?
Deployment Strategies
• All functionality behind a feature flipper
  • Easy to turn on/off by customer
• Lots of excess code
  • Feature flipper logic is simple – use class X or class Y (see the sketch after this list)
• Feature flipped on slowly
  • Hourly and daily check-ins on Datadog
  • Minimize impact if something goes wrong
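A minimal sketch of that class-swap pattern; FeatureFlipper and both class names are illustrative stand-ins, not Braze's actual code:

# Both implementations expose the same interface, so callers never know
# which one the flipper picked.
def frequency_capping_strategy(customer)
  if FeatureFlipper.enabled?(:aggregation_frequency_capping, customer)
    AggregationPipelineFrequencyCapping.new  # v2: aggregation pipeline
  else
    LegacyFrequencyCapping.new               # v1: full-profile batch scan
  end
end

Rolling out slowly then amounts to enabling the flag customer by customer while watching the Datadog dashboards.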
What’s Next?
Frequency Capping by Tag
first_projection["email_marketing_86400"] = {
  :$filter => {
    :input => "$emails_received",
    :cond => {
      :$and => [
        …,
        # only include campaigns tagged "marketing"
        {:$in => [
          "$$this.campaign", ["July 4 Promo", "Memorial Day Sale", …]
        ]}
      ]
    }
  }
}
What else?
• Set the foundation for future expectations
• Customers are always going to want to send messages
• Faster and faster
• With more detailed segmentation
• With more complex inclusion/exclusion rules
Thank you! We are hiring!
braze.com/careers