A/B Testing – keeping it valid

by Kirsten Tanner | Mar 16, 2016

You’ve set up a solid A/B testing framework. You have a test you’d like to run, you’ve calculated the threshold you need to hit, and you’ve built the new version of your page. You hit the button to start the test.
Two weeks later, you can see that the results look great. Conversions are up by 25%! That’ll put a nice kick-up on the bottom line for your next reporting period. You apply the results of your test to your site proper, kick back, and expect the money to roll in. But it doesn’t. In fact, it drops by 10%, leaving you scrambling for answers. What went wrong?

You can’t fix by analysis what you bungled by design. – Light, Singer and Willett

Experimental design is a slippery thing – indeed, even professional scientists frequently stumble into statistical traps. Without careful separation of variables, your results can be confounded by hidden dependencies between them. For example, in the above test, perhaps you ran your test across the school holidays, so your sample had an unaccounted-for demographic skew. Or perhaps you ran the A/B test on your most popular product page, then rolled the change out across all of your product pages on the assumption they’d all benefit. These are classic mistakes.
So how can we make sure to avoid them?


Defining the hypothesis

The first step to any experiment is to define a specific, testable hypothesis that, if proven, would benefit your business. First, you’ll need your general reasoning; this is, necessarily, an ‘inductive’ process – that is, applying knowledge you already have about your business and extrapolating from it. For example, your reasoning might be that ‘some users are not purchasing our product because the photos are too small to make out details’.

In our lust for measurement, we frequently measure that which we can rather than that which we wish to measure… and forget that there is a difference. – George Udny Yule

Next, you need a specific hypothesis to test your reasoning. Perhaps you might increase your photo size by 30% so that people can see the products more clearly. Aha, this is the hypothesis, right? “Increasing the photo size by 30% will increase conversion”.
Well, not so fast. In increasing the photo size by 30%, you likely needed to move other items around in the layout. Perhaps the price now sits below the photos, where it used to sit beside them, to make room. So the hypothesis needs to incorporate that too: “Increasing the photo size by 30% and moving the price below the photos will increase conversion”. Now you’ve got something that’s not just testable, but that accurately reflects the test. It also incorporates all of the changes into the hypothesis – after all, if you did see an improvement, but people were actually buying more because the price had moved, you’d have drawn an incorrect conclusion. If you’re not happy with your final hypothesis, you may need to look at alternative experimental methods to isolate what you’d like to test. If you’re happy as is, the process can continue.
Defining the hypothesis is the cheapest point at which to remove ambiguity and exclude unnecessary variables. It only gets more complicated and costly as the process continues, so try to get your hypothesis as clean as you possibly can. If you need to bind two variables together for the sake of accuracy (such as photo size and price position), you can separate those two variables out in a future experiment if you want to get more specific.


Separating the variables

The key consideration when designing your experiment is that you must single out the variable you’re looking to test. The more factors there are in the test, the more tenuous the link between the results and what you’re attempting to test (your hypothesis). To take our ‘photo and price’ hypothesis, we’d also need to be careful not to change the layout of any other part of the page (for example, that other items aren’t pushed below the fold).
More challenging are the less obvious variables – particularly time-based ones. Demographic shifts, active marketing campaigns, and many other factors can shift the profile of users visiting your site away from its usual mix, which can make your results considerably less valuable. If at all possible, attempt to statistically correct for any external impacts, for example by resampling to obtain a more indicative data set – this is known as obtaining a representative sample.
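As a rough illustration of that kind of correction, here is a minimal post-stratification sketch in Python (pandas). It assumes you log a demographic segment per visit and have an estimate of your usual segment mix; the column names and figures are made up for the example:

```python
import pandas as pd

# Hypothetical data: one row per visit, with a demographic segment label and a
# conversion flag. Column names and figures are made up for this sketch.
visits = pd.DataFrame({
    "segment":   ["student"] * 600 + ["working_adult"] * 400,
    "converted": [1] * 30 + [0] * 570 + [1] * 36 + [0] * 364,
})

# The segment mix you believe is typical outside the school holidays.
reference_mix = {"student": 0.3, "working_adult": 0.7}

# Post-stratification: weight each visit so its segment's share matches the
# reference mix, then recompute the conversion rate with those weights.
observed_mix = visits["segment"].value_counts(normalize=True)
weights = visits["segment"].map(lambda s: reference_mix[s] / observed_mix[s])
adjusted_rate = (visits["converted"] * weights).sum() / weights.sum()

print(f"Raw conversion rate:      {visits['converted'].mean():.2%}")
print(f"Adjusted conversion rate: {adjusted_rate:.2%}")
```

Reweighting like this only helps if the segments you stratify on actually capture the external impact, so treat it as a rough correction rather than a substitute for running the test over a representative period.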
If the change you are making will affect both mobile and desktop, consider running them as two separate experiments for the sake of clarity; the same change can play out quite differently on the two platforms, and separating them makes it much clearer what is going on.
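How you split them depends entirely on your testing tool, but as a sketch of what ‘two separate experiments’ can mean in a home-grown framework, a deterministic hash of the user ID plus the experiment name buckets users independently for each experiment. The experiment names here are purely illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "variation")) -> str:
    """Deterministically bucket a user into a variant for a given experiment.

    Hashing the user ID together with the experiment name means the mobile
    and desktop experiments assign users independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Two separate experiments for the same change (names are illustrative).
print(assign_variant("user-123", "bigger-photos-mobile"))
print(assign_variant("user-123", "bigger-photos-desktop"))
```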


Defining a win

Before the experiment begins, it’s important to specify the conditions under which you’ll conclude that the variation improved the metric you’re testing against. Technically, this is called ‘disproving the null hypothesis’ – that is, showing that the difference between the variation and the control is too large to plausibly be down to chance. This requires a sufficiently large sample (Optimizely has a decent calculator).
Every experiment may be said to exist only to give the facts a chance of disproving the null hypothesis. – R. A. Fisher
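If you’d rather see the arithmetic than trust a calculator, the standard two-proportion approximation is easy to compute yourself. This sketch assumes a 4% baseline conversion rate and a one-percentage-point minimum detectable lift, both of which are just example figures:

```python
from scipy.stats import norm

def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.8):
    """Visitors needed in each variant to detect an absolute `lift` over a
    `baseline` conversion rate, via the two-proportion z-test approximation."""
    p1, p2 = baseline, baseline + lift
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5 +
                 z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / lift ** 2) + 1

# Example: 4% baseline conversion, hoping to detect a 1-point absolute lift.
print(sample_size_per_variant(0.04, 0.01))
```

The figure it produces is per variant, so double it for a two-arm test, and size up again if you plan to break the analysis down by segment afterwards.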
When conducting an experiment, it can also be highly useful to gather additional data, which can act as secondary goals. If your experimental version doesn’t improve conversion rates significantly, but does decrease bounce rate or increase time on page, then you’ve got some additional data to work with when preparing your next experiment.
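Secondary metrics deserve the same statistical care as the primary one. As a small illustration, a plain two-sample t-test on time on page might look like this (the numbers are purely made up):

```python
from scipy.stats import ttest_ind

# Hypothetical per-session time-on-page samples (seconds), logged alongside
# the primary conversion metric; the figures are purely illustrative.
control   = [42, 35, 58, 61, 29, 47, 53, 38, 44, 50]
variation = [55, 49, 63, 70, 41, 58, 66, 52, 57, 60]

stat, p_value = ttest_ind(variation, control)
print(f"Time on page: t = {stat:.2f}, p = {p_value:.3f}")
```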


Running the experiment

While running the experiment, I honestly find it best not to watch the results come in, beyond checking that it’s working as intended. Early results can and very often will be overturned as the sample size grows and becomes more representative of the userbase. Attempting to draw early conclusions can lead to prematurely ending the experiment, and with it a much higher chance of a false positive.
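If you want to convince yourself (or a stakeholder) why peeking is dangerous, a quick simulation of A/A tests – where there is no real difference to find – shows how checking for significance at many interim points inflates the false-positive rate well beyond the nominal 5%. The parameters below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_per_arm=10_000, checks=20,
                                conv_rate=0.05, trials=1_000):
    """Simulate A/A tests (both arms identical) and declare a 'win' if a
    two-proportion z-test crosses p < 0.05 at ANY of several interim checks."""
    check_points = np.linspace(n_per_arm // checks, n_per_arm, checks, dtype=int)
    z_crit = norm.ppf(0.975)
    false_positives = 0
    for _ in range(trials):
        a = rng.random(n_per_arm) < conv_rate
        b = rng.random(n_per_arm) < conv_rate
        for n in check_points:
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
                false_positives += 1
                break
    return false_positives / trials

# With 20 peeks, the rate of 'significant' A/A tests lands well above 5%.
print(f"{peeking_false_positive_rate():.1%}")
```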


Interpreting the result

Remember: if you didn’t reach significance, you can’t draw conclusions from the result. Don’t end the experiment early – let it run until it hits the sample size you calculated up front before making the call.
Other than that, make sure to delve into the results beyond the main figure. Look at demographics, behaviours, and other secondary stats – they can be extremely useful in setting up future experiments.
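When the experiment does finish, the headline figure and the segment breakdowns can be checked with the same two-proportion z-test. The counts below are invented purely to show the shape of the analysis; in this made-up case the overall lift turns out to be driven by mobile:

```python
from scipy.stats import norm

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return p_a, p_b, 2 * (1 - norm.cdf(abs(z)))

# (conversions, visitors) for control and variation; figures are invented.
segments = {
    "overall": (480, 12_000, 552, 12_050),
    "mobile":  (150,  5_000, 210,  5_020),
    "desktop": (330,  7_000, 342,  7_030),
}
for name, (ca, na, cb, nb) in segments.items():
    p_a, p_b, p_value = two_proportion_test(ca, na, cb, nb)
    print(f"{name:8s} control {p_a:.2%}  variation {p_b:.2%}  p = {p_value:.3f}")
```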
Through careful experimental design, the most common and insidious traps can be avoided, so that confidence can be placed in the results. A few extra hours spent on experimental design can prevent days of analysis later and capture the conversion gains sooner; it’s the cheapest point in the whole A/B testing process at which to secure good results.
