Emily Robinson
@robinson_es
10 Guidelines for A/B Testing
About Me
➔ R User (😱)
➔ Background in the social sciences
➔ Formerly at Etsy
➔ Data Scientist at DataCamp
What is A/B Testing?
A/B testing is everywhere
My perspective
Millions of visitors daily
Data engineering pipeline set-up
Generating numbers is easy;
generating numbers you
should trust is hard!
Source: Trustworthy online controlled experiments: five puzzling outcomes explained
Guidelines
Disclaimer
This is Bowen… I mean, Bobo
He is our fictional PM for the day
Situation
Bobo: Well, we’re hoping this test will increase registrations, search clicks, and course starts.
Problem
The test increased registrations by 5%, but decreased course starts by 3%.
1. Have one key metric per experiment
➔ Clarifies decision-making
➔ Can have additional “guardrail”
metrics that you don’t want to
negatively impact
Situation
Bobo: I have 100 test ideas. How long is each going to take to run? And which ones should we choose?
Problem
Ideas are cheap; prioritizing them is difficult.
2. Use your key metric to do a power calculation
➔ 80% power = if there’s an effect of
this size, 80% chance you detect it
➔ 10,000 daily visitors, 10% conversion
rate, how many days to detect a 5%
increase?
➔ https://bookingcom.github.io/powercalculator/
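As a rough check on the calculator above, here is a minimal sketch in base R (the presenter’s tool of choice), assuming the example numbers from the bullets: a 10% baseline conversion rate and a 5% relative lift.

  # Sample size for detecting a 5% relative lift (10% -> 10.5%)
  power.prop.test(p1 = 0.10,        # baseline conversion rate
                  p2 = 0.105,       # 5% relative increase
                  sig.level = 0.05,
                  power = 0.80)     # 80% power
  # Roughly 58,000 visitors per group; at 10,000 daily visitors
  # split 50/50, that is on the order of 11-12 days.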
Situation
Bobo: I checked the experiment today and we significantly increased conversion rate! Quick, stop the test!
Problem
Peeking: check the results often enough and stop at the first significant reading, and even a no-effect test will eventually look like a winner.
Source: http://varianceexplained.org/r/bayesian-ab-testing/, David Robinson
3. Run your experiment for the length you’ve planned on
➔ Stick to the length you arrived at with
your power analysis
➔ Advanced: always-valid inference
and sequential testing
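To see why peeking like Bobo’s is a problem, here is a small, purely illustrative A/A simulation in R (all numbers are made up): both groups have the same conversion rate, yet stopping at the first “significant” daily check flags far more than 5% of experiments.

  set.seed(42)
  n_sims  <- 1000
  n_days  <- 20
  per_day <- 500    # visitors per group per day (hypothetical)
  p       <- 0.10   # identical conversion rate in both groups (A/A test)
  peeked_fp <- mean(replicate(n_sims, {
    a <- cumsum(rbinom(n_days, per_day, p))   # cumulative conversions, group A
    b <- cumsum(rbinom(n_days, per_day, p))   # cumulative conversions, group B
    n <- per_day * seq_len(n_days)            # cumulative visitors per group
    pvals <- mapply(function(x1, x2, nn) prop.test(c(x1, x2), c(nn, nn))$p.value,
                    a, b, n)
    any(pvals < 0.05)   # "Quick, stop the test!" at the first significant peek
  }))
  peeked_fp   # well above the nominal 5% false positive rate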
Situation
Bobo: I know the test didn’t work overall, but when I look at Canadian users on mobile we increased conversion by 10%!
Problem
This is multiple hypothesis testing and will increase your false positive rate.
4. Don’t look for differences in every possible segment
➔ Pre-specify hypotheses
➔ Run separate tests
➔ Can use methods to adjust for
multiple hypothesis testing
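If you do end up slicing results by segments, one option is to adjust the p-values for the number of comparisons, e.g. Benjamini-Hochberg via base R’s p.adjust. A sketch with made-up p-values:

  segment_pvals <- c(overall = 0.20, canada_mobile = 0.03,
                     us_desktop = 0.40, new_users = 0.07)  # hypothetical values
  p.adjust(segment_pvals, method = "BH")
  # The Canadian-mobile p-value of 0.03 becomes 0.12 after adjustment,
  # so it no longer clears the 0.05 bar once you account for the
  # number of segments examined.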
Situation
Bobo: The experiment was a big success! The split was 50.5/49.5 instead of 50/50 as planned, but that’s so small it doesn’t matter, right?
Problem
With 200k people in your experiment, a 50.5/49.5 split has p < .0001. You have bucketing skew, also known as sample-ratio mismatch.
5. Make sure your experiment is balanced
➔ Use a proportion test to check
your split
➔ If unbalanced, do not use the
results
➔ Bad news: difficult to debug.
Check segments
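A minimal version of that proportion test in R, using the 200k-user, 50.5/49.5 example from the previous slide:

  # Did 101,000 of 200,000 users land in group A under a planned 50/50 split?
  prop.test(x = 101000, n = 200000, p = 0.5)
  # p-value is around 8e-06: a 50.5/49.5 split of 200k users is extremely
  # unlikely under a true 50/50 assignment, so investigate before trusting
  # any results.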
Situation
Bobo: I read this article about how much better multi-armed bandits are than traditional A/B tests. Why don’t we use that?
Problem
Those methods come with assumptions that are often not fully understood.
6. Don’t overcomplicate your methods
➔ Get the basics right
➔ Designing tests right > super
sophisticated methods
Situation
Bobo: Well, nothing went up, but nothing went down either, so let’s just launch it!
Problem
May be a negative effect too small to detect. Adds technical upkeep.
7. Be careful of launching things because they “don’t hurt”
➔ Decide whether to “launch on
neutral” beforehand
➔ Non-inferiority testing
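One common way to frame “launch on neutral” as a non-inferiority check, sketched in R with made-up counts and a hypothetical margin of 0.5 percentage points:

  margin      <- 0.005                                # largest drop you can live with
  conversions <- c(treatment = 1000, control = 1020)  # hypothetical counts
  visitors    <- c(treatment = 10000, control = 10000)
  prop.test(conversions, visitors)$conf.int           # CI for treatment - control
  # Launching "on neutral" is only defensible if the lower bound of this
  # interval stays above -margin; otherwise the data are still consistent
  # with a loss bigger than the one you said you could accept.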
Situation
Bobo: Hey, we just finished this experiment. Can you analyze it for us?
Problem
“To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.”
- Ronald Fisher
8. Have a data scientist/analyst involved in the whole process
➔ Helps decide whether it should
be an experiment at all
➔ Make sure you can measure
what you want
➔ Can surface problems along the
way
Situation
Bobo: Hey, we accidentally added everyone to the experiment. Can we still use our dashboards to monitor it?
Problem
Non-impacted people add noise, decreasing power.
9. Only include people in your analysis who could have been affected
➔ Start tracking people after the
user sees the change
➔ Can be tricky – e.g. changing
threshold for free shipping offer
from $25 to $35
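A sketch of the “start tracking after the user sees the change” idea using dplyr, with hypothetical table and column names:

  library(dplyr)
  # visits: one row per visit (user_id, visit_time, converted, ...)
  # first_exposure: one row per user who actually saw the change
  #                 (user_id, exposure_time)
  analysis_set <- visits %>%
    inner_join(first_exposure, by = "user_id") %>%  # drop never-exposed users
    filter(visit_time >= exposure_time)             # keep only post-exposure visits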
Situation
Bobo: We spent 6 months redesigning this page and made 50 changes to make it awesome, but the A/B test shows it did worse. Why?
Problem
Time was wasted, and with that many changes it’s hard or impossible to tell which one was the problem.
10. Focus on smaller, incremental tests
➔ Work in small design-develop-
measure cycles
➔ Test assumptions
Conclusion
Recap
1. Have one key metric per experiment
2. Use your key metric to do a power
calculation
3. Run your experiment for the length you’ve
planned on
4. Don’t look for differences in every possible
segment
5. Make sure your experiment groups are
balanced
6. Don’t overcomplicate your methods
7. Be careful of launching things because
they don’t hurt
8. Have a data scientist/analyst involved in
the whole process
9. Only include people in your analysis who
could have been affected
10. Focus on smaller, incremental tests
Research papers
➔ Controlled experiments on the web: survey and practical guide (2008)
➔ Seven rules of thumb for web site experiments (2014)
➔ A dirty dozen: twelve common metric interpretation pitfalls in online
controlled experiments (2017)
➔ Democratizing online controlled experiments at Booking.com (2017)
Blog posts and presentations
➔ Design for Continuous Experimentation by Dan McKinley
➔ Scaling Airbnb’s Experimentation Platform by Jonathan Parks
➔ Please, please don’t A/B test that by Tal Raviv
➔ How Etsy handles peeking in A/B Testing by Callie McRee and
Kelly Shen
Thank you!
hookedondata.org
@robinson_es
bit.ly/guidelinesab

Editor's Notes

  1. We’re going to talk about 10 situations you may encounter in your A/B testing journey. I want to make it clear this is not based on any particular person.
  2. And with that, here is Bobo. He’s our fictional PM for the day. Any resemblance in name or picture to a product manager I’ve worked with previously is purely coincidental.
  3. I’d come in, I’d get my team