Emily Robinson
@robinson_es
10 Guidelines for A/B Testing
About Me
➔ R User (😱)
➔ Background in the social sciences
➔ Formerly at Etsy
➔ Data Scientist at DataCamp
What is A/B Testing?
A/B testing is everywhere
My perspective
Millions of visitors daily
Data engineering pipeline set-up
Generating numbers is easy;
generating numbers you
should trust is hard!
Source: Trustworthy online controlled experiments: five puzzling outcomes explained
Guidelines
Disclaimer
This is Bowen… I mean, Bobo
He is our fictional PM for the day
Situation
Bobo: Well, we’re hoping this test will increase registrations, search clicks, and course starts.
Problem
The test increased registrations by 5%, but decreased course starts by 3%.
1. Have one key metric per experiment
➔ Clarifies decision-making
➔ Can have additional “guardrail”
metrics that you don’t want to
negatively impact
Situation
Bobo: I have 100 test ideas. How long is each going to take to run? And which ones should we choose?
Problem
Ideas are cheap; prioritizing them is difficult.
2. Use your key metric to do a power calculation
➔ 80% power = if there’s an effect of
this size, 80% chance you detect it
➔ 10,000 daily visitors, 10% conversion
rate, how many days to detect a 5%
increase?
➔ https://bookingcom.github.io/powercalculator/
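As a rough check on the calculator above, here is a minimal sketch in base R (the presenter’s tool of choice), assuming the example numbers from the bullets: a 10% baseline conversion rate and a 5% relative lift.

  # Sample size for detecting a 5% relative lift (10% -> 10.5%)
  power.prop.test(p1 = 0.10,        # baseline conversion rate
                  p2 = 0.105,       # 5% relative increase
                  sig.level = 0.05,
                  power = 0.80)     # 80% power
  # Roughly 58,000 visitors per group; at 10,000 daily visitors
  # split 50/50, that is on the order of 11-12 days.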
Situation
Bobo: I checked the experiment today and we significantly increased conversion rate! Quick, stop the test!
Problem
Peeking: check the results often enough and stop at the first significant reading, and even a no-effect test will eventually look like a winner.
Source: http://varianceexplained.org/r/bayesian-ab-testing/, David Robinson
3. Run your experiment for the length you’ve planned on
➔ Stick to the length you arrived at with
your power analysis
➔ Advanced: always-valid inference
and sequential testing
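To see why peeking like Bobo’s is a problem, here is a small, purely illustrative A/A simulation in R (all numbers are made up): both groups have the same conversion rate, yet stopping at the first “significant” daily check flags far more than 5% of experiments.

  set.seed(42)
  n_sims  <- 1000
  n_days  <- 20
  per_day <- 500    # visitors per group per day (hypothetical)
  p       <- 0.10   # identical conversion rate in both groups (A/A test)
  peeked_fp <- mean(replicate(n_sims, {
    a <- cumsum(rbinom(n_days, per_day, p))   # cumulative conversions, group A
    b <- cumsum(rbinom(n_days, per_day, p))   # cumulative conversions, group B
    n <- per_day * seq_len(n_days)            # cumulative visitors per group
    pvals <- mapply(function(x1, x2, nn) prop.test(c(x1, x2), c(nn, nn))$p.value,
                    a, b, n)
    any(pvals < 0.05)   # "Quick, stop the test!" at the first significant peek
  }))
  peeked_fp   # well above the nominal 5% false positive rate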
Situation
Bobo: I know the test didn’t work overall, but when I look at Canadian users on mobile we increased conversion by 10%!
Problem
This is multiple hypothesis testing and will increase your false positive rate.
4. Don’t look for differences in every possible segment
➔ Pre-specify hypotheses
➔ Run separate tests
➔ Can use methods to adjust for
multiple hypothesis testing
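If you do end up slicing results by segments, one option is to adjust the p-values for the number of comparisons, e.g. Benjamini-Hochberg via base R’s p.adjust. A sketch with made-up p-values:

  segment_pvals <- c(overall = 0.20, canada_mobile = 0.03,
                     us_desktop = 0.40, new_users = 0.07)  # hypothetical values
  p.adjust(segment_pvals, method = "BH")
  # The Canadian-mobile p-value of 0.03 becomes 0.12 after adjustment,
  # so it no longer clears the 0.05 bar once you account for the
  # number of segments examined.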
Situation
Bobo: The experiment was a big success! The split was 50.5/49.5 instead of 50/50 as planned, but that’s so small it doesn’t matter, right?
Problem
With 200k people in your experiment, a 50.5/49.5 split has p < .0001. You have bucketing skew, also known as sample-ratio mismatch.
5. Make sure your experiment is balanced
➔ Use a proportion test to check
your split
➔ If unbalanced, do not use the
results
➔ Bad news: difficult to debug.
Check segments
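A minimal version of that proportion test in R, using the 200k-user, 50.5/49.5 example from the previous slide:

  # Did 101,000 of 200,000 users land in group A under a planned 50/50 split?
  prop.test(x = 101000, n = 200000, p = 0.5)
  # p-value is around 8e-06: a 50.5/49.5 split of 200k users is extremely
  # unlikely under a true 50/50 assignment, so investigate before trusting
  # any results.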
Situation
Bobo: I read this article about how much better multi-armed bandits are than traditional A/B tests. Why don’t we use that?
Problem
Those methods come with assumptions that are often not fully understood.
6. Don’t overcomplicate your methods
➔ Get the basics right
➔ Designing tests right > super
sophisticated methods
Situation
Bobo: Well, nothing went up, but nothing went down either, so let’s just launch it!
Problem
May be a negative effect too small to detect. Adds technical upkeep.
7. Be careful of launching things because they “don’t hurt”
➔ Decide whether to “launch on
neutral” beforehand
➔ Non-inferiority testing
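One common way to frame “launch on neutral” as a non-inferiority check, sketched in R with made-up counts and a hypothetical margin of 0.5 percentage points:

  margin      <- 0.005                                # largest drop you can live with
  conversions <- c(treatment = 1000, control = 1020)  # hypothetical counts
  visitors    <- c(treatment = 10000, control = 10000)
  prop.test(conversions, visitors)$conf.int           # CI for treatment - control
  # Launching "on neutral" is only defensible if the lower bound of this
  # interval stays above -margin; otherwise the data are still consistent
  # with a loss bigger than the one you said you could accept.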
Situation
Bobo: Hey, we just finished this experiment. Can you analyze it for us?
Problem
“To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.”
- Ronald Fisher
8. Have a data scientist/analyst involved in the whole process
➔ Helps decide whether it should
be an experiment at all
➔ Make sure you can measure
what you want
➔ Can surface problems along the
way
Situation
Bobo: Hey, we accidentally added everyone to the experiment. Can we still use our dashboards to monitor it?
Problem
Non-impacted people add noise, decreasing power.
9. Only include people in your analysis who could have been affected
➔ Start tracking people after the
user sees the change
➔ Can be tricky – e.g. changing
threshold for free shipping offer
from $25 to $35
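A sketch of the “start tracking after the user sees the change” idea using dplyr, with hypothetical table and column names:

  library(dplyr)
  # visits: one row per visit (user_id, visit_time, converted, ...)
  # first_exposure: one row per user who actually saw the change
  #                 (user_id, exposure_time)
  analysis_set <- visits %>%
    inner_join(first_exposure, by = "user_id") %>%  # drop never-exposed users
    filter(visit_time >= exposure_time)             # keep only post-exposure visits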
Situation
Bobo: We spent 6 months redesigning this page and made 50 changes to make it awesome, but the A/B test shows it did worse. Why?
Problem
Time was wasted, and with that many changes it’s hard or impossible to tell which one was the problem.
10. Focus on smaller, incremental tests
➔ Work in small design-develop-
measure cycles
➔ Test assumptions
Conclusion
Recap
1. Have one key metric per experiment
2. Use your key metric to do a power
calculation
3. Run your experiment for the length you’ve
planned on
4. Don’t look for differences in every possible
segment
5. Make sure your experiment groups are
balanced
6. Don’t overcomplicate your methods
7. Be careful of launching things because
they don’t hurt
8. Have a data scientist/analyst involved in
the whole process
9. Only include people in your analysis who
could have been affected
10. Focus on smaller, incremental tests
Research papers
➔ Controlled experiments on the web: survey and practical guide (2008)
➔ Seven rules of thumb for web site experiments (2014)
➔ A dirty dozen: twelve common metric interpretation pitfalls in online
controlled experiments (2017)
➔ Democratizing online controlled experiments at Booking.com (2017)
Blog posts and presentations
➔ Design for Continuous Experimentation by Dan McKinley
➔ Scaling Airbnb’s Experimentation Platform by Jonathan Parks
➔ Please, please don’t A/B test that by Tal Raviv
➔ How Etsy handles peeking in A/B Testing by Callie McRee and
Kelly Shen
Thank you!
hookedondata.org
@robinson_es
bit.ly/guidelinesab

Editor's Notes

  1. We’re going to talk about 10 situations you may encounter in your A/B testing journey. I want to make it clear this is not based on any particular person.
  2. And with that, here is Bobo. He’s our fictional PM for the day. Any resemblance in name or picture to a product manager I’ve worked with previously is purely coincidental.
  3. I’d come in, I’d get my team