Optimize for €€€
HOW WE OPTIMIZE AT THE BIGGEST HOTEL CHAIN IN THE NETHERLANDS
@AM_Klaassen / @tomvdberg1
How we got here…
The quest for female speakers…
Female speakers…
?
Our lovely colleagues
What we do…
Conversion rate optimization
Analytics + Psychology
Lots of A/B-tests
Our clients
Adding direct value
Learning user behavior
Van der Valk
• Unique website for each hotel
• 1 centralized team
• Paid by % of their turnover
Optimization of Valk
• Increase revenue through CRO
• Reduce costs per experiment
“We’re not in the business of
science, we’re in the business
of making money”
3 Tactics
1.
Free test software
Jorrin Quest
GTM testing
http://gtmtesting.com
GTM testing
• Easy setup
• Free
• Preview mode
2.
Change of statistics
Focus on finding proof
Example test: Order flow
Example test: Set completion
[Screenshots: variation A and variation B]
Test result
• No effect on conversion
• Don't implement
• Re-test with higher volumes
Conclusion
What’s the alternative?
Frequentist statistics vs. Bayesian statistics
Example test: Set completion
[Screenshots: variation A and variation B]
Bayesian Test evaluation
Focus on risk assessment
88,4%
A test result is the probability that B outperforms A, ranging from 0% to 100%.
Implement B        Probability   Effect on revenue*
Expected risk      11,6%         - € 69.791
Expected uplift    88,4%         € 213.530
Contribution                     € 180.733
* Based on 6 months and an average order value of € 175
Make a risk assessment
Implement B
abtestguide.com/bayesian/
Roy Schieving
Annemarie Klaassen
Conclusion Bayesian
• Easier to understand and communicate
• Better suits the business
• Don't throw away good ideas (indicatively significant)
• Higher implementation rate and revenue
[Chart: Expected impact on revenue in 1 year, € Frequentist vs. € Bayesian, from € 0 up to € 6,000,000 across tests 1-39]
3.
Change of test method
A/B-test
Traffic allocation over time
[Chart: traffic allocation to A and B over time]
Profit or loss?
What’s the alternative?
A/B-test vs. Bandits
If you have multiple options: which bandit will pay out the most?
Explore / Exploit dilemma
[Diagram: Explore vs. Exploit?]
Bandit
[Diagram: starts with 100% explore]
Bandit
[Diagram: the explore/exploit split shifts over time, e.g. from 80% / 20% to 10% / 90%]
Traffic allocation over time
[Chart: traffic allocation to A and B over time]
More Profit!
When to use Bandits?
• Limited time frame
• Automation
• Earn over learn
How?
Example 1
Example 1
• Older target group
• Comparing prices on other sites
• Behavior of others
Example 1
Control        No message     11,1%
Variation B    Copy default   10,6%
Variation C                   10,4%
Variation D                   10,5%
Example 1 – Effect on revenue
[Chart: traffic allocation to A and B over time]
+ € 6.637
Example 2
Control        New control    9,2%
Variation B    Copy control   8,9%
Variation C                   7,8%
Variation D                   8,4%
Example 2 – Effect on revenue
[Chart: traffic allocation to A and B over time]
+ € 19.862
Conclusion Bandits
• Less regret during testing
• No implementation costs
• Optimizing per user segment
1. Free test software
2. Bayesian statistics
3. Bandit testing
Maximize revenue
“We’re not in the business of
science, we’re in the business
of making money”
TÄNAN VÄGA! (Thank you very much!)
Bayesian calculator: abtestguide.com/bayesian/
Slide deck: ondi.me/elitecamp2016
@AM_Klaassen
annemarie@onlinedialogue.com
nl.linkedin.com/in/amklaassen
@tomvdberg1
tom@onlinedialogue.com
nl.linkedin.com/in/vandenbergtom
Editor's Notes
  1. Hi everybody! First of all, thank you so much for having us over! We are really excited to be here. Before we start with the actual presentation, I would like to tell the story of how we actually ended up here on stage.
  2. TOM Somewhere in February Peep posted this tweet, where he asked for female speakers.
  3. TOM To which I replied…
  4. TOM So Peep asked for female speakers. How did I end up on stage? Annemarie decided to do a duo presentation. We both work as Web Analytics and Optimization Experts at Online Dialogue.
  5. AN Online Dialogue is a conversion optimization agency in the Netherlands. Some of you may actually be familiar with our crazy bosses Ton (who's a data geek) and Bart (the obviously crazy psychologist). They speak at several conferences all over the world about data and applied psychology. We work in multidisciplinary teams consisting of analysts, psychologists, UX designers, developers and project leads.
  6. AN But what is it that we do on a day-to-day basis?
  7. AN Well, we mainly focus on getting data insights (derived from web analytics data, surveys, heat and scroll maps, usability tests, et cetera; basically any data we can get our hands on) and combine those with consumer psychology and scientific research. This combination of data and psychology is then used to come up with several hypotheses to test in order to increase conversion.
  8. AN As you might expect, we run lots and lots of experiments.
  9. AN We do this for a bunch of clients in the Netherlands and also for some pretty cool international clients. For most of them we do high-velocity testing, which means we run multiple tests per week for them.
  10. AN The purpose of all those A/B-tests is on the one hand to add direct value in the short term – you want to increase the revenue that is coming in to the website. That’s what clients sign up for. They want more money out of their site. And on the other hand in the long run to really learn from user behavior. You want to know what it is that triggers the visitors on your website to be able to use those insights to come up with even better optimization efforts in the future.
  11. TOM Today we want to show you how we optimize the websites of the biggest hotel chain in the Netherlands. It's called Van der Valk. Maybe you have heard of it, because they also have a big hotel close to Schiphol airport. We have been their optimization partner for 5 years already. I will share the practical examples we have, and Annemarie will share the theory behind the different tactics we are using to optimize at Van der Valk.
  12. TOM They have 39 hotels in the Netherlands (and 1 in Germany). All the websites have the same look and feel but are unique for every hotel. Every hotel has its own color.
  13. TOM All the websites are maintained by 1 centralized team. They are paid a percentage of the total turnover. We are their Analytics & Optimization partner.
  14. TOM The manager of Van der Valk is mainly interested in earning as much money as possible. His team is dependent upon the revenue they get from online; they get a certain percentage of that revenue. So our focus is on maximizing ROI and revenue. How do we do this? Don't spend money on things you should not spend money on, and maximize the number of successful experiments.
  15. TOM Because Van der Valk is not in the business of science, we are in the business of making money.
  16. TOM Today we want to share 3 tactics we are using to optimize for money at Van der Valk.
  17. TOM The first tactic is free test software. We only use the test software to inject the test code and split traffic equally over the variations. We don't use the drag-and-drop features or the built-in analysis; we do that in the analytics program.
  18. TOM The free software we use is GTM testing, which was developed by our colleague Jorrin. GTM testing is a simple and free solution for A/B testing in Google Tag Manager. All the tests we run at Van der Valk are executed with GTM testing. Our colleague Jorrin wrote the script and is still adding new features.
  19. TOM GTM testing is easy to set up and it is free. For other tools, like VWO and Optimizely, you will pay at least a few hundred euros a month, and if you have a high-traffic website this will be more than 1,000 euros a month. We also created an easy preview mode to switch between the control and the variation. Our clients can quickly check both variations and give us feedback.
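A note for readers of these notes: the actual GTM testing script is JavaScript that runs inside Google Tag Manager, and the deck does not describe its internals. The sketch below is only a minimal Python illustration of the idea a split script has to implement: assigning each visitor to variation A or B with a stable 50/50 split, so a returning visitor keeps seeing the same variation. The hash-based approach, function name and test name are assumptions for illustration, not part of the tool.

```python
import hashlib

def assign_variation(visitor_id: str, test_name: str, variations=("A", "B")) -> str:
    """Deterministically assign a visitor to a variation (50/50 for two variations).

    Hashing the visitor id together with the test name keeps the assignment
    stable across page views, which is what a client-side testing script needs.
    """
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]

# Example: the same visitor always lands in the same bucket for this test.
print(assign_variation("visitor-123", "order-flow-set-completion"))
```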
  20. AN The second tactic we use at Valk to grow their revenue is actually a funny one, because we proposed a change in the statistics we use to evaluate A/B-tests. We had been using frequentist statistics with the so-called z- or t-test to evaluate A/B-tests. This test evaluation method is the most commonly used, but there is a big challenge with this type of evaluation.
  21. AN The most important challenge is that with a t-test an A/B-test can only have 2 outcomes: you either have a winner or no winner. And the focus is on finding those real winners. You want to take as little risk as possible. This is not so surprising if you take into account that t-tests have been used in a lot of medical research as well. Of course you don't want to bring a medicine to the market if you're not 100% sure that it won't make people worse or kill them. You don't want to take any risk whatsoever. But businesses aren't run this way. You need to take some risk in order to grow your business. Right? Now, Tom will show you an A/B-test we ran at Van der Valk which highlights this exact challenge.
  22. TOM In the data we saw that most visitors (around 50%) were exiting the website at this step in the checkout. Most visitors who actually started the form were also finishing it. So the optimization opportunity on this page was not in the form itself, but in getting more visitors to start the form.
  23. TOM We performed an A/B-test with the process indicator in the order flow. In the variation we added an extra step that visitors had already completed (selecting a room). This psychological tactic is called 'set completion': you make visitors feel they are already halfway through their order process, so they will be more likely to finish their booking.
  24. TOM Based on this test result you would conclude that it's no winner, that it shouldn't be implemented and that the measured uplift in conversion rate wasn't enough. So you would see this as a loser and move on to another test idea. However, there seems to be a positive movement (the measured uplift is 4,55%), but it isn't big enough to be recognized as a significant winner. You probably only need a few more conversions.
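For reference, this is roughly how the frequentist "winner / no winner" verdict mentioned above is reached: a two-proportion z-test on the conversion rates. The visitor and conversion counts below are hypothetical (the deck only reports the 4,55% measured uplift, not the underlying numbers); they are chosen merely to produce an uplift of that size.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Classic frequentist A/B evaluation: z-test on the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled conversion rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)                       # one-sided: is B better than A?
    return p_b / p_a - 1, z, p_value

# Hypothetical counts, not the real Van der Valk data:
uplift, z, p = two_proportion_z_test(conv_a=660, n_a=10000, conv_b=690, n_b=10000)
print(f"uplift {uplift:.2%}, z = {z:.2f}, p = {p:.3f}")  # p > 0.05, so the test calls it 'no winner'
```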
  25. TOM So the conclusion is that there is no significant effect on conversion, but there are significantly more visitors on the 3rd step. This is the last step before customers finish their booking. Based on these numbers we would advise not to implement the variation, or maybe to retest later with higher volumes.
  26. TOM But why re-test this? That might be a waste of time and money. Shouldn't it just be implemented?
  27. AN If frequentist statistics confronts us with these kinds of challenges, what's the alternative? Well, as I said earlier, the most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics). But over the last couple of years more and more software packages (like VWO and Google Optimize) have been switching to Bayesian statistics. And that's not without reason, because using Bayesian statistics makes more sense, since it better suits how businesses are run, and I will show you why.
  28. AN If you take a look at the test result Tom showed you earlier, you see that there was a measured uplift in conversion rate of 4,5%, but this just wasn't enough to declare it a significant winner. You might have a hard time explaining this to your manager, because all he sees is the 4,5% measured uplift. You would have to explain to him, using significance, p-values and confidence levels, why this is not a winner he needs to implement. And that's hard!
  29. AN If you use Bayesian statistics, however, to evaluate your A/B-test, then there is no difficult statistical terminology involved. There's no p-value, z-value, confidence level, et cetera. It just shows you the measured uplift and the probability that B is better than A. Easy, right? Everyone can understand this. Based on the same numbers of the A/B-test we showed you earlier, you have an 88,4% chance that B will actually be better than A. Probably every manager would like these odds.
  30. AN In short: the biggest advantage of using a Bayesian A/B-test evaluation method is that it doesn't have a binary outcome like the t-test does. A test result won't tell you winner / no winner, but gives a probability between 0% and 100% that the variation performs better than the original. In this example 88,4%. The question that remains is: does the chance of an uplift outweigh the chance of losing money if you implement the variation?
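One common way to compute such a probability (used by several Bayesian A/B calculators) is to place Beta posteriors on the two conversion rates and count how often B's rate beats A's across posterior draws. A minimal sketch, assuming numpy is available, a flat Beta(1,1) prior, and hypothetical counts (the deck only reports the resulting 88,4%, not the raw data):

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=200_000, seed=42):
    """P(B > A) from Beta(1,1) priors updated with the observed conversions."""
    rng = np.random.default_rng(seed)
    p_a = rng.beta(conv_a + 1, n_a - conv_a + 1, draws)  # posterior draws for A's conversion rate
    p_b = rng.beta(conv_b + 1, n_b - conv_b + 1, draws)  # posterior draws for B's conversion rate
    return float((p_b > p_a).mean())

# Hypothetical counts, not the actual test data:
print(f"P(B beats A) = {prob_b_beats_a(660, 10000, 690, 10000):.1%}")
```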
  31. AN What you can do with these numbers is make a proper risk assessment. If Valk decides to implement the new progress bar, they have an 11,6% chance of a drop in revenue of almost € 70.000 in 6 months' time (at an average order value of € 175). But on the other hand, they also have an 88,4% chance that the variation is actually better and brings in over € 200.000. If you multiply 11,6% by the drop in revenue and add 88,4% times the uplift in revenue, you have the total contribution of this test. This test is thus expected to contribute around € 180.000 in 6 months' time.
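A quick check of that arithmetic with the figures from the slide (amounts in euros, over 6 months at a € 175 average order value):

```python
# Risk assessment from the slide: probability-weighted effect on revenue.
p_b_better = 0.884
p_b_worse = 1 - p_b_better          # 0.116
uplift_if_better = 213_530          # € expected extra revenue if B really is better
loss_if_worse = -69_791             # € expected drop in revenue if B is actually worse

contribution = p_b_better * uplift_if_better + p_b_worse * loss_if_worse
print(f"Expected contribution: € {contribution:,.0f}")
# ≈ € 180,665; the slide shows € 180.733 (the small gap comes from rounding the probabilities)
```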
  32. AN You could argue that every test with a contribution higher than 0 (that's when the probability is higher than 50%) should be implemented. But then you don't take into account the cost of testing. As Tom mentioned earlier, Van der Valk is paid a percentage of their turnover, and running an A/B-test program (we don't work for free) and implementing variations costs money. To compensate for this, they are satisfied when the expected contribution of a test is at least € 150.000 in 6 months. The conclusion based on a Bayesian test evaluation is thus to implement the variation, because it will probably earn them more money! With a frequentist test evaluation you wouldn't, and you would leave money on the table.
  33. AN Recently we turned this Bayesian Excel calculator into a webtool as well. It's free for everyone to use. If you visit this URL you can input your test and business case data and calculate! It will return the chance that B outperforms A and also the contribution of the test to the business.
  34. AN In conclusion, if you change the evaluation method of A/B-testing to Bayesian you get a couple of advantages: test results are easier to understand, since you no longer have to communicate difficult statistical terminology; the outcome of a test can be seen as a business case question and is therefore better suited to commercial businesses; you no longer throw away good test ideas that are nearly significant, because they will probably earn you more money; and because of all this you will have a higher implementation rate and thereby more revenue.
  35. TOM The third tactic we use is a change of test method. We have been optimizing with just pure A/B-tests.
  36. TOM In an A/B test you randomly assign equal numbers of users to variation A and variation B. And after this period you decide either to implement B or keep A. Either A or B then receives 100% of the traffic.
  37. TOM During the test the traffic distribution doesn't change: if a test runs for 2 weeks, the traffic distribution stays 50% / 50%. We call this pure exploration. During the test period you are exploring which variation performs better. After the test the winning variation is determined and implemented (100% exploitation). If variation B is a winner, then you might regret showing 50% of your traffic an inferior option during the 2 weeks of testing. And also the other way around. The question is whether this winner is a definite winner: will it stay that way, or is it dependent upon seasonality, the weather and so on?
  38. AN If running A/B-tests may cause regret, what's the alternative? I'll explain the difference between an A/B-test and using so-called bandits. I will also highlight in which cases using a bandit is preferred to running normal A/B-tests.
  39. AN Before I explain how bandit testing works, I will first explain why it's called bandit testing. The name comes from the casino: slot machines are called one-armed bandits, because they tend to rob you of your money. Each of those one-armed bandits has a different payout schedule. A gambler will explore different slot machines to find out which one pays out the most, so he will spend most of his money on that slot machine. But he will still occasionally try the other slot machines as well, to make sure he's still spending his money right. So he exploits the one that pays out the most, but keeps exploring the others.
  40. AN The same principle can be applied to A/B-testing. You want to explore which variation performs better, but when you find out you want to exploit this variation as soon as possible. The distribution between exploring and exploiting may vary over time.
  41. AN Now that you know what a bandit is, I will tell you how it works. When a new visitor enters the site, the visitor is assigned to either Explore or Exploit. Visitors who are assigned to the Explore arm are randomly assigned to variation A or B; this is the same as with a pure A/B-test. The bandit starts with 100% exploring, since you don't have any knowledge yet about which variation performs better.
  42. AN When you have collected a number of conversions, you have information about the conversion rates of A and B. This historical information is used to determine the division between exploring and exploiting. Visitors who are assigned to Exploit will be served the best-performing variation: if B converts better than A, then B will be shown to new visitors. The split between exploring and exploiting is dynamic and is determined by two factors: the difference in conversion rate between A and B (the bigger the difference, the higher the exploitation rate) and the confidence in the results (the higher the confidence, the higher the exploitation rate). This means that at the beginning of the bandit you're not so confident in the results yet, but by the end of the test you are. In the end you might end up with 90% exploiting and 10% exploring.
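The deck does not name the exact allocation algorithm, but Thompson sampling is one common way to get exactly this behaviour (more traffic to the better variation as the measured difference and the confidence grow): draw a plausible conversion rate for each variation from its Beta posterior and serve the variation with the highest draw. A minimal sketch, assuming numpy and made-up variation names:

```python
import numpy as np

class ThompsonBandit:
    """Dynamic explore/exploit allocation via Thompson sampling over Beta posteriors."""

    def __init__(self, variations):
        self.stats = {v: {"conversions": 0, "visitors": 0} for v in variations}
        self.rng = np.random.default_rng()

    def choose(self):
        # Sample a plausible conversion rate per variation from its posterior and pick the best.
        draws = {
            v: self.rng.beta(s["conversions"] + 1, s["visitors"] - s["conversions"] + 1)
            for v, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, variation, converted):
        self.stats[variation]["visitors"] += 1
        self.stats[variation]["conversions"] += int(converted)

bandit = ThompsonBandit(["A", "B", "C", "D"])
variation = bandit.choose()              # which variation to show the next visitor
bandit.record(variation, converted=False)
```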
  43. AN On a timeline this might look something like this. You start with a 50/50 split, but over time – if B has a higher conversion rate – the split becomes something like 20/80. This way you limit the regret of showing the worse-performing variation to a percentage of your traffic. It also means that by the end of the test your overall conversion rate (and thus revenue) is higher than with a normal A/B-test, because more visitors have seen the better variation.
  44. AN In which cases is it smart to choose a bandit over an A/B-test? First of all, when you have a limited time frame to optimize, for example during promotions. You don't want to run an A/B-test in this short time frame, because by the time you have the result, it's worthless; you want to know which variation performs best during the test. Secondly, when the time frame is endless and you are doing continuous optimization, frequently adding and removing versions to be tested. This can be very useful when you face high seasonality in sales and behaviour: a winning variation may perform well in the last-minute season, but perform badly during other seasons. A continuous bandit can solve this problem. And lastly, when you simply don't care much about the learnings, but want to optimize for money. You want the regret during testing to be minimized and average conversion rates to be optimized. This is what Van der Valk prefers.
  45. TOM How did we optimize with bandits at the Van der Valk website?
  46. TOM On the homepage, where 50% of visitors enter the website, visitors select their dates. After this they receive an overview of all the available rooms and the prices.
  47. TOM We did several rounds of bandits on the most important page of the Van der Valk website. This is the page where visitors choose their room based on the dates they selected. Based on the impact we need to make and the extra revenue we can earn, this is the most important page of the Van der Valk website. And in earlier A/B-tests we saw that a specific, clear message can work here. Therefore we chose this page to start with bandits.
  48. TOM These are the first 2 messages we chose to run in the bandit. "You will immediately receive a confirmation" was chosen because the target group of the website is older and more uncertain; to give them more certainty we tell them they will receive a confirmation immediately. In a survey we did, we found that 50% of the visitors are comparing prices with other websites while they are on the Van der Valk website. By showing them when the last booking was made ("Last booking at this hotel was 3 minutes ago"), they get the feeling they can't wait too long. Which messages work?
  49. TOM As Annemarie said, in a bandit the traffic allocation differs per day and per variation. For the first 200 transactions we just run the bandit like an A/B test and keep the traffic split at 30% per variation; otherwise the traffic allocation in the bandit would change with every transaction. If the bandit only starts to change the traffic allocation after 200 transactions, the conversion rate per variation is more stable. The control outside the bandit gets the same percentage all the time (10%).
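The warm-up and holdout rules described here can be layered on top of whatever bandit is used: a fixed 10% of traffic always goes to the control outside the bandit, and until 200 transactions have been collected the bandit arms keep a fixed, equal split. A rough sketch of that routing logic, reusing the illustrative ThompsonBandit class from the earlier sketch; this is an assumption-laden reconstruction, not the actual Van der Valk implementation:

```python
import random

HOLDOUT_SHARE = 0.10          # control outside the bandit always gets 10% of traffic
WARMUP_TRANSACTIONS = 200     # keep a fixed, equal split until results are stable enough

def route_visitor(bandit, total_transactions):
    """Decide what the next visitor sees: holdout control, warm-up split, or bandit choice."""
    if random.random() < HOLDOUT_SHARE:
        return "control (outside bandit)"
    if total_transactions < WARMUP_TRANSACTIONS:
        # Warm-up phase: behave like a plain A/B/n test with an equal split per variation.
        return random.choice(list(bandit.stats))
    # After warm-up: let the bandit shift traffic towards the better-performing variations.
    return bandit.choose()
```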
  50. TOM If we compare the result of the bandit to a normal A/B test, we earned €7K extra in this period. After a couple of days the conversion rate of variation D was higher compared to the other variations and this wasn’t changing anymore. Therefore we decided to start a new round and add new variations.
  51. TOM In the second example we started a new bandit with all the traffic split equally again. This round was on the same page as round 1. > Free cancellation: less uncertainty for older visitors + offered by many sites (and compared a lot).
  52. TOM As Annemarie said, in a bandit the traffic allocation differs per day and per variation. For the first 200 transactions we just run the bandit like an A/B test and keep the traffic split at 33% per variation; otherwise the traffic allocation in the bandit would change with every transaction. If the bandit only starts to change the traffic allocation after 200 transactions, the conversion rate per variation is more stable. The control outside the bandit gets the same percentage all the time (10%).
  53. TOM If we compare the result of the bandit to a normal A/B test, we earned €20K extra in this period. After a couple of days the conversion rate of variation D was higher compared to the other variations and this wasn’t changing anymore. Therefore we decided to start a new round and add new variations.
  54. TOM
  55. AN To summarize, we talked about 3 tactics: we use free test software to limit the cost of testing; we use Bayesian statistics to evaluate A/B-tests; and we run bandit tests.
  56. AN All of this to maximize the revenue of Van der Valk.
  57. AN To make our client, Van der Valk happy!!
  58. AN Because again, we are in the business of making money and not in the business of science.
  59. AN Thank you for your attention!