The dark side of A/B testing




As designers and marketers, we like the clarity of A/B testing. We have an idea, we implement it, and then we let customers decide whether it works by running a controlled experiment: we split users into unbiased groups and measure one treatment against the other on a statistically significant, unbiased sample.

Deeply impressed by our own analytical rigor, we then scale up the winning treatment and move on to the next thing. You guessed it: I'm about to poke holes in one of tech's most sacred practices, A/B testing.

Let's start by recognizing the good. If you're not running any A/B tests today, you're clearly behind: you're making decisions based on the HiPPO (the Highest Paid Person's Opinion). The problem is that even if you respect your HiPPO a great deal, he or she is human and, as such, deeply biased in judgment.

If you follow only your HiPPO's intuition, you're blindly exploring the edge of the Grand Canyon. You might get lucky and not fall. Or, like the folks at MSN.com in the early 2000s, you might launch a massive site redesign driven entirely by your HiPPO, then immediately watch every key business metric drop, without knowing what caused the fall and with no way to roll it back quickly.

But you're better than that. You run A/B tests and you challenge your own assumptions, as well as your HiPPO's. Wait, you're not out of the woods either. Let's explore some pitfalls.

1. Time effects: What works today may not work in a month

Every A/B test, by definition, has a duration. After a while (which you have of course determined by calculating an unbiased, statistically significant sample size), you make a call: option B is better than option A. You then scale up B and move on to the next test.
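For the curious, here is a minimal sketch of that sample-size calculation, using the standard two-proportion formula. The baseline conversion rate, the lift to detect, and the alpha/power settings below are illustrative assumptions, not figures from the article.

```python
# Rough per-arm sample size for a two-sided test meant to detect a
# lift from p_a to p_b. All numbers here are hypothetical.
from scipy.stats import norm

def sample_size_per_arm(p_a=0.10, p_b=0.11, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    effect = abs(p_b - p_a)
    return int((z_alpha + z_beta) ** 2 * variance / effect ** 2)

print(sample_size_per_arm())  # on the order of 15,000 users per arm
```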

But what if user behavior was different during your test period? What if the novelty of option B is what made it successful, and after a few months the effect wears off? A concrete example from Grubhub: we make restaurant recommendations, and there's a baseline algorithm A behind what we display on our site. When we deploy a challenger algorithm B, one reason the challenger can win is simply that it surfaces new options. But will that lift hold? What if users were just trying out the new restaurants recommended by algorithm B, and then stopped paying attention to the recommendations module, just as they had with algorithm A?
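One way to check for that kind of novelty decay, sketched below under the assumption that you log per-user outcomes with hypothetical `variant`, `week`, and `converted` columns: track B's lift over A week by week and see whether it shrinks.

```python
# If B's relative lift over A shrinks week over week, the "new options"
# bump may be wearing off. Column names are hypothetical.
import pandas as pd

def weekly_lift(events: pd.DataFrame) -> pd.Series:
    rates = events.groupby(["week", "variant"])["converted"].mean().unstack()
    return (rates["B"] - rates["A"]) / rates["A"]  # relative lift per week

# weekly_lift(events).plot() makes a decaying lift obvious at a glance.
```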

There's a flip side to this. At Facebook, with News Feed, any significant change causes core metrics to dip at first, simply because the user base is so accustomed to the previous version. So you would reject ALL tests if you had to end them after a week, and Facebook users generate more than enough sample size to conclude every test within a week! That would, of course, be a mistake, because you didn't wait long enough for user behavior to stabilize.

You might ask yourself, "Can't I just run all my A/B tests forever? What if, after a month, I ramp the winning option B up to 95% of traffic, hold the other option at 5%, and keep monitoring the metrics?" That way you capture most of the business benefit of the winning option, but you can still react if a time-based effect bites you. Yes, you can do that; you can even run an advanced version of this approach, a multi-armed bandit, in which your A/B testing system automatically dials up the best-performing option, continually increasing the exposure of the winning variant.
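As a rough illustration of the bandit idea (not Grubhub's actual system), here is a minimal Thompson-sampling sketch: each arm's conversion rate gets a Beta posterior, and the arm with the highest sampled rate serves the next user, so the better arm naturally soaks up more traffic while the loser still gets occasional exposure.

```python
# Thompson sampling over two arms (A = control, B = challenger).
# The success/failure counts below are made-up examples.
import numpy as np

def choose_arm(successes, failures, rng=np.random.default_rng()):
    """Pick the arm whose sampled conversion rate is highest."""
    samples = [rng.beta(s + 1, f + 1) for s, f in zip(successes, failures)]
    return int(np.argmax(samples))

successes = [120, 150]   # conversions observed so far for A, B
failures = [880, 850]    # non-conversions observed so far for A, B
print(choose_arm(successes, failures))  # B wins most draws, but A still serves sometimes
```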

This method, however, has a significant problem: it pollutes your code base. Forking logic makes code hard to maintain. It also makes it very hard to experience your product the way a customer does. The proliferation of user-experience branches creates nooks and crannies that you never test and never stumble into yourself. Also known as bugs. So don't keep it running for long, and don't do it for every test.

Another possible defense is to rerun your tests from time to time. Confirm that the winners are still winners, especially the most important ones you declared many moons ago.

2. Interaction effects: Great separately, terrible together

Imagine working at a large company with multiple workstreams for outbound customer communications: email and push notifications. One of your teams is working on a new "abandoned cart" push notification. Another is working on a new email with product recommendations. Both ideas get built and A/B tested at the same time. Each of them wins, so you scale up both. Then BOOM, the moment you scale up both, your business metrics tank.

What?!? How can this be? You tested each of the options! It happens because you're now sending too many messages to your customers. Neither idea crossed that threshold on its own, but together they do. And the annoyance effect (why are you pinging me so much?!) outweighs the positives.

You've just run into another A/B testing pitfall. There's an assumption built into the overall framework: that tests are independent and don't affect one another. As the example above shows, that assumption can be false, and the interaction won't always be as obvious as it is here.

To keep this from happening, put one person in charge of tracking all currently running A/B tests. That person will be able to call out potential interaction effects. If you spot one, simply sequence the relevant tests instead of running them in parallel, as sketched below.
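A toy sketch of what that check can look like in practice, assuming you tag each experiment with the customer touchpoints it affects (the experiment names and tags below are invented): any pair of live tests sharing a tag is a candidate for sequencing rather than parallel execution.

```python
# Flag pairs of concurrent experiments that touch the same customer
# "surface" so they can be sequenced. Names and tags are hypothetical.
from itertools import combinations

experiments = {
    "abandoned_cart_push": {"outbound_messaging", "push"},
    "product_recs_email": {"outbound_messaging", "email"},
    "checkout_button_copy": {"checkout_page"},
}

for (a, tags_a), (b, tags_b) in combinations(experiments.items(), 2):
    overlap = tags_a & tags_b
    if overlap:
        print(f"Consider sequencing {a} and {b}: both touch {sorted(overlap)}")
```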

3. The pesky confidence interval: The more tests you run, the higher the risk of error

If your organization culturally encourages experimentation, one of the "fake" ways this manifests is that people run a whole series of tiny tests. You know the kind: bump the font size by one point, swap the order of two modules, change a few words in a product description. Aside from the fact that these changes probably won't turn your organization into the visionary of your industry (hey), there's also a poorly understood statistical problem that will bite you here.

Whenever you conclude an A/B test and claim that option B is better than option A, you're running a statistical calculation based on a t-test. Built into that calculation is the notion of a confidence level: the degree of certainty you're comfortable with. Set it to 90%, and 10% of the conclusions your A/B testing system draws will be wrong: it will declare option B better than option A when in fact it isn't.
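In code, that per-test calculation looks roughly like the sketch below, with simulated per-user conversion outcomes standing in for real data and alpha set to 0.10 to match the 90% figure above.

```python
# A two-sample t-test on simulated conversion outcomes (0/1 per user).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.binomial(1, 0.10, size=10_000)  # option A: ~10% conversion
b = rng.binomial(1, 0.11, size=10_000)  # option B: ~11% conversion

t_stat, p_value = stats.ttest_ind(b, a)
alpha = 0.10  # 90% confidence level, as in the text
print(f"p = {p_value:.3f}; declare B the winner: {p_value < alpha}")
```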

Now, what happens if you run 20 tiny tests, each with a 10% chance of a false positive? Your chance of finding at least one winner by mistake is 1 minus 0.9 to the 20th power, which is about 88%. That's right: your A/B testing framework will show you at least one, and probably two, "fake" winners across your series of 20 tiny tests, feeding a false signal back to the experimenting team that they've struck gold.
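The arithmetic behind that 88% figure, plus the most common guardrail (a Bonferroni-style stricter per-test threshold, mentioned here as a standard statistical fix rather than anything the article prescribes):

```python
alpha, n_tests = 0.10, 20
p_any_false_winner = 1 - (1 - alpha) ** n_tests
print(f"Chance of at least one fake winner: {p_any_false_winner:.0%}")  # ~88%

# A standard correction: demand a stricter threshold for each individual test.
print(f"Bonferroni per-test alpha: {alpha / n_tests}")  # 0.005
```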

How do you avoid this problem? Have someone review the list of tests. Forbid a zillion tiny changes. Be extremely skeptical if you're testing version eight of a concept that keeps failing.

The tactical issues I've outlined are all too easy to run into when you adopt A/B testing as a philosophy for your marketing and product teams. These aren't trivial missteps that only amateurs make; they're surprisingly common. Make sure you inoculate your team against them.

Alex Weinstein is the executive vice president of growth at Grubhub and the author of the Technology + Entrepreneurship blog at http://www.alexweinstein.net, where he explores data-driven decision-making in the face of uncertainty. Prior to Grubhub, he led growth and marketing technology efforts at eBay.
