The Two Safeguards to Stopping an AB Test

Last modified January 14, 2019

I feel like I can spend only a few paragraphs and a graph motivating why waiting for statistical significance is important in AB testing.

Here is a test we ran for an ecommerce client:

[Image: AB testing statistical significance]

The goal we are measuring here is successful checkouts (visitors reaching the receipt page). The orange variation is beating the blue by 30.4% (2.96% vs. 2.27%), and the test shows 95% “chance to beat” or “statistical significance”!

This client has an 8-figure ecommerce business. Let’s say it’s $30 million a year in revenue (close enough… me hiding who they are and their revenue should give you confidence that if you work with us, I won’t go announcing your financials all over the internet).

So a 30% lift in orders is worth $9,000,000 in extra revenue a year!

(By the way, this is the power of AB testing).

This lure of extra revenue is enticing. Very much so. So much so that it can lead otherwise well-meaning people to make a grave mistake: stopping a test too early.

In this case, you might be tempted to think: well, the test has reached 95% significance, which everyone says is the cutoff. It had 3000 visitors to each variation. This is convincing. Let’s stop the test and run the orange variation at 100%!
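To see why that line of thinking feels so convincing, here is a rough sketch of where a number like "95%" can come from. It uses a one-sided pooled two-proportion z-test, which is my assumption for illustration; your testing tool may compute "chance to beat" differently. The order counts (68 and 89) are also assumptions, rounded from the reported rates and 3000 visitors per variation, so the lift comes out slightly different from the 30.4% quoted above.

```python
from math import sqrt
from statistics import NormalDist


def chance_to_beat(conv_a, n_a, conv_b, n_b):
    """One-sided pooled two-proportion z-test.

    Returns the confidence that B's true conversion rate exceeds A's.
    This is an illustrative approximation, not any specific tool's formula.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return NormalDist().cdf(z)


# Assumed counts: roughly 2.27% and 2.96% of 3000 visitors each.
blue_orders, orange_orders, visitors = 68, 89, 3000

lift = (orange_orders / visitors) / (blue_orders / visitors) - 1
confidence = chance_to_beat(blue_orders, visitors, orange_orders, visitors)

print(f"relative lift: {lift:.1%}")
print(f"chance to beat: {confidence:.1%}")
```

Under these assumptions the confidence lands right around the 95% mark, which is exactly the kind of number that tempts people to stop.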

But here’s a secret: both variations were the same.

This was, in CRO parlance, an A/A test. We ran it not to “test significance” (which Craig Sullivan argues is a waste of time, and I agree), nor to give me an opportunity to look smart by writing this article, but just to check if revenue goals were measuring properly (they were, thank you for asking).

So how in the world are you supposed to know that this test is not worth stopping? Or as CRO people say, how do you know when you’re supposed to “call” the test?

You actually need two safeguards to make sure you don’t get duped by random fluctuations of data, i.e. statistical noise.

Safeguard 1: Statistical Significance

Safeguard 2: Sample Size

In this case the test satisfied the first safeguard, statistical significance, and it still pointed to a winner that didn’t exist. That’s why the second safeguard, sample size, exists.
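As a rough illustration of the second safeguard, here is a minimal sketch of a pre-test sample size estimate using the standard normal-approximation formula for comparing two proportions. The function name, the 5% two-sided significance level, and the 80% power are my assumptions for the example, not settings from the test above.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a given relative lift.

    Normal-approximation formula for a two-sided test of two proportions.
    alpha and power defaults are common conventions, assumed here.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Baseline of 2.27% (the blue variation's rate) and a 30% relative lift.
needed = visitors_per_variation(0.0227, 0.30)
print(f"visitors needed per variation: {needed}")
```

With these assumed settings the estimate comes out at several thousand visitors per variation, well above the 3000 each variation had when it hit 95%.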

We’ll discuss what both mean, in English, not math — well maybe a little math, but mostly English — in future articles (you can join our newsletter here). But for now, take the above example as a warning to not just stop a test the moment it reaches 95% significance.
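If you want to convince yourself (or a skeptical boss), here is a small simulation sketch of many A/A tests. Everything in it is an assumption for illustration: both variations share a true rate of 2.27%, each test gets 3000 visitors per variation, someone peeks every 250 visitors, and "significant" means either variation reaches 95% on a pooled z-test. It compares how often a peeker would have seen a "winner" at some point versus how often one shows up only at the final look.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.95)  # one-sided 95% threshold, about 1.645


def looks_significant(conv_a, conv_b, n):
    """True if either arm 'beats' the other at 95% (pooled z-test, n visitors per arm)."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:  # no conversions yet, nothing to compare
        return False
    return abs(conv_a - conv_b) / n / std_err >= Z_95


def simulate_aa_tests(runs=300, visitors=3000, rate=0.0227, peek_every=250, seed=1):
    """Return (share of A/A tests that ever looked significant while peeking,
    share that looked significant at the final look only)."""
    rng = random.Random(seed)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        peeked_winner = False
        for n in range(1, visitors + 1):
            # Both arms convert at the same true rate: any "winner" is noise.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and looks_significant(conv_a, conv_b, n):
                peeked_winner = True
        stopped_early += peeked_winner
        significant_at_end += looks_significant(conv_a, conv_b, visitors)
    return stopped_early / runs, significant_at_end / runs


peeking_rate, end_only_rate = simulate_aa_tests()
print(f"false winners if you stop at the first 95%: {peeking_rate:.0%}")
print(f"false winners if you only check at the end: {end_only_rate:.0%}")
```

The exact percentages depend on the seed and the assumed settings, but the pattern is the point: the more often you check and the more willing you are to stop at the first 95%, the more identical variations get crowned as winners.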

Auxiliary Safeguards

Finally, in addition to the two safeguards above, there are also a few details you should pay attention to when deciding whether to stop a test:

First, are you tracking multiple goals through your purchase or signup funnel? 

You should be paying attention to supporting goals. Do the supporting goals show a consistent result? (They don’t have to, and they may not, but it’s good to know.) In the A/A test example above, most actually did, but some did not. For example, here is the goal that tracks clicks on the size dropdown on the PDP (product detail page). The results show largely no difference at all (note the 45% and 55% chances of beating each other, close to even):

[Image: clicks on size]

Tracking multiple goals will help give you a more holistic picture of what’s happening. I’m not saying all goals have to show the same result for a test result to be valid. The world is a complex place. But if just one or two goals are showing a big difference but the others aren’t, you should ask why that could be.

Second, pay attention to purchase cycles

Again, I have to word this in vague terms like “pay attention to”. If you’re thinking “OMG just tell me what I should do”, your frustration is noted.

“Purchase cycles,” for most online businesses, means calendar weeks. Conversion rates and audience types vary by day of the week, so if you start a test on Tuesday and end it on Friday, you risk seeing different results than if you had run it for a full week. As with anything, it’s good to get multiple weeks in so you know a single week wasn’t an anomaly.
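One way to turn that into a habit is to plan test length in whole weeks up front. Here is a minimal sketch; the function name, the two-week minimum, and the traffic numbers are all hypothetical, so plug in your own.

```python
from math import ceil


def duration_in_full_weeks(needed_per_variation, variations, daily_visitors, min_weeks=2):
    """Days to run a test, rounded up to whole calendar weeks.

    min_weeks=2 is an assumed floor so one odd week can't decide the result.
    """
    total_needed = needed_per_variation * variations
    days_for_sample = ceil(total_needed / daily_visitors)
    weeks = max(min_weeks, ceil(days_for_sample / 7))
    return weeks * 7


# Hypothetical: 8,600 visitors needed per variation, 2 variations,
# 1,000 visitors entering the test per day.
print(duration_in_full_weeks(8600, 2, 1000))  # 18 days of traffic, rounded up to 21
```

Rounding up means every day of the week gets equal representation, rather than stopping on whatever day the sample size happens to be reached.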

Third, pay attention to seasonality. For a swimsuit business, things that work in July and things that work in December may be different things. (They may not, but they may be.) Again, when you run test after test after test, you learn to talk in slightly vague terms because you’ve seen so many different results.
