Frequentist Stopping Rules

Stopping rules are what make most frequentist (a.k.a. classical) statistical tests valid. Unsurprisingly, they tell you when to stop an experiment and, equally unsurprisingly, they must be followed in order to get statistically valid results from the tests that depend on them. However, although stopping rules are critical to interpreting experimental outcomes, there are many misconceptions about what they are and how to use them. These misconceptions can lead to mistaken interpretations and, worse, bad decisions.

In this article I’ll explain (1) what they are, (2) how they work, (3) different types in null hypothesis significance testing (NHST), and, finally, (4) which ones we recommend for use in experiments at Credit Karma.

Introduction to stopping rules

We all know that having enough data is critical to determining if one sample is likely to have been sampled from a different population than another. But how do we know how much data is enough? The answer depends on the type of statistical test you plan to run. In most cases you’ll use a sample size calculator for that specific type of test. Here’s an example of some inputs for a z-test sample size calculator:

µ1 (expected mean for the first sample)

µ2 (expected mean for the second sample)

α (desired p-value threshold, e.g. .05)

β (desired Type II error rate, e.g. .20, which corresponds to a power of 1 − β = .80)

Putting in the relevant values for your experiment will provide you with the sample size that you’re likely to need to meet the desired α (alpha) and β (beta)[1]. For that α and β to hold with a z-test, you have to stop as soon as you reach that sample size (for reasons that will become clear later). Thus the sample size is the stopping rule: it is the rule for stopping that, when followed, results in the relevant statistical tests providing accurate outputs.
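To make the calculator's arithmetic concrete, here is a minimal sketch of one common sample size formula for a two-sided, two-sample z-test. It assumes equal group sizes and a known, shared standard deviation `sigma`, which is not among the inputs listed above. The function name `z_test_sample_size` is hypothetical, and the calculator you use may differ in its details.

```python
import math
from statistics import NormalDist


def z_test_sample_size(mu1, mu2, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample z-test.

    Assumes both groups share a known standard deviation `sigma`
    (an assumption of this sketch) and are the same size.
    """
    if mu1 == mu2:
        raise ValueError("mu1 and mu2 must differ to size the test")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * (sigma * (z_alpha + z_power) / (mu1 - mu2)) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: means of 0.0 and 0.5 with a standard deviation of 1.0.
    print(z_test_sample_size(0.0, 0.5, sigma=1.0))
```

Halving the expected difference between the means roughly quadruples the required sample size, which is why the expected means matter so much as inputs.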

Sample size is just one type of stopping rule, though. Others include time, minimum sample size, or even minimum sample size combined with minimum time[2]. One final note for this section: sometimes you will hear people refer to the output of a sample size calculator as the minimum sample size. This is a misconception that I address in the conclusion of this post.

First, the basis of stopping rules: probability

In order to understand how stopping rules work let’s get a refresher on some basic probability theory (feel free to skip this section if you eat probability for breakfast). We’ll use the example of a fair coin.

Imagine that you have a fair quarter, i.e. a quarter that is likely to come up heads 50% of the time and tails 50% of the time. What is the probability that after 4 flips 2 of those flips are heads? We can illustrate how probability works by looking at the outcome space.

  1. One flip
    • Either 0 heads or 1 head
    • {H, T}
    • See Figure 1
  2. Two flips
    • Either 0 heads, 1 head, or 2 heads
    • {HH, HT, TH, TT}
    • See Figure 2
  3. Three flips
    • Either 0 heads, 1 head, 2 heads, or 3 heads
    • {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
    • See Figure 3
  4. Four flips
    • Either 0 heads, 1 head, 2 heads, 3 heads, or 4 heads
    • {HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT}
    • See Figure 4
[Table 1: Figure 1 (one flip), Figure 2 (two flips), Figure 3 (three flips), Figure 4 (four flips)]


So the possible outcomes are 0, 1, 2, 3, or 4 heads in 4 flips. Importantly, some of these outcomes can arise from several different sequences of flips; the number of sequences that produce k heads in n flips is the binomial coefficient, or n-choose-k[3]. For example, an outcome of 1 head could be HTTT, THTT, TTHT or TTTH. We can calculate the likelihood of these outcomes quite easily using a binomial probability distribution[4], but it will be useful to think about it in terms of the probability trees we constructed earlier.
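The outcome space is small enough to enumerate by brute force. The sketch below (the helper name `head_counts` is hypothetical) lists every sequence of four flips, tallies how many sequences produce each number of heads, and prints each tally next to the corresponding binomial coefficient.

```python
from collections import Counter
from itertools import product
from math import comb


def head_counts(n_flips):
    """Count how many of the 2**n_flips sequences contain each number of heads."""
    sequences = product("HT", repeat=n_flips)
    return Counter(seq.count("H") for seq in sequences)


if __name__ == "__main__":
    counts = head_counts(4)
    for k in sorted(counts):
        # Each tally should match the binomial coefficient "4 choose k".
        print(k, counts[k], comb(4, k))
```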

Consider the probability tree in the table above for 4 flips. From this it’s easy to see that the probability of getting exactly two heads in four flips is 6/16 (or 37.5%). That is, out of the 16 equally likely outcomes for 4 flips, 6 of them have exactly 2 heads. Now that we understand how to calculate those probabilities, let’s see how that applies to NHST. In NHST it’s not enough just to know the probability of seeing a value; instead you need to know the probability of that value or any value more extreme, where “more extreme” means further from what the null hypothesis predicts.

Now suppose we have an (alternative) hypothesis that says that the coin is not fair; instead it’s biased toward tails. Furthermore, let’s say we get 3 tails in our 4 flips. We can look at the probability tree and ask what is the chance that we would see at least 3 tails in 4 flips if the null hypothesis, i.e. the hypothesis that the coin is fair, were true. We add the “at least” qualifier because NHST asks what the chances are that we would see a result as extreme as what we saw or more extreme. The probability tree we drew for a fair coin shows us that 3 tails or more in 4 flips is expected 5 out of 16 times. That is, if it were indeed a fair coin we would expect 3 or more tails 31.25% of the times that we flipped it 4 times. That is much greater than a standard α threshold of 0.05, and so we would fail to reject the null hypothesis; that is, we would not conclude that the coin is biased toward tails (the small sample size makes the probability tree easier to draw, but for the purposes of illustration we’ll need to ignore the fact that the sample size is too small to make these inferences).
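The "at least 3 tails" calculation can also be written as a sum of binomial terms, which is what a one-sided binomial test computes for a fixed number of flips. This is a minimal sketch using exact fractions; the function name `upper_tail` is hypothetical.

```python
from fractions import Fraction
from math import comb


def upper_tail(n_flips, k_min, p=Fraction(1, 2)):
    """P(X >= k_min) for X ~ Binomial(n_flips, p), as an exact fraction."""
    return sum(
        comb(n_flips, k) * p**k * (1 - p) ** (n_flips - k)
        for k in range(k_min, n_flips + 1)
    )


if __name__ == "__main__":
    # Under the null hypothesis of a fair coin, tails has probability 1/2.
    p_value = upper_tail(4, 3)
    print(p_value, float(p_value))
```

Note that the function takes the number of flips as given, which is exactly the assumption a stopping rule has to justify.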

For completeness I’ve also included a plot of a cumulative distribution function (CDF) showing the same. If you are more math or stats inclined, you may prefer to think of this in terms of the CDF instead of a probability tree.

[Figure: plot of the CDF for 4 flips]

You’ll notice that we have not explicitly stated a stopping rule in the example above. We asked what the probability out of four flips is, but that’s not strong enough. In order to be able to calculate a p-value, we need to specify how we obtained the sample of four flips. Was it from a larger sample of eight? Did we just flip the coin for 30 seconds, stop, and happen to have four flips? Not specifying the stopping rule up front is a frequent mistake in many analyses, and it can cause our p-value to come apart from the probability, under the null hypothesis, of seeing a value as extreme or more extreme. We’ll see how in the next few sections[5].

Types of stopping rules

The most common stopping rule is a sample size stopping rule. That is, the experimenter says, “I will stop once my sample size has reached 10,000.” The number 10,000 often comes from a sample size calculator like the one we saw in the introduction, which takes as inputs the mean of the control, maybe its standard deviation, and likely a couple of other parameters.

However, you could also use a stopping rule that is time based. For example, an experimenter could say “I will stop collecting data once one minute of coin-flipping has occurred.” Let’s consider what that would look like.

Recall Table 1. We can use the last column to represent the possible outcomes when we decide to stop collecting data after we have reached a sample size of 4. That is, these are the possible outcomes if we use a sample size stopping rule where N=4. Now assume that we can flip a coin about once every 15 seconds, but not exactly once every 15 seconds, and that instead of a sample size stopping rule we decide to stop after one minute. That is, we decide to use a time-based stopping rule[6]. This changes the outcome space considerably. Rather than just the last column from Table 1, the possible outcomes include the possibility that we might only get 3 flips in one minute instead of 4, or perhaps even 5 or 6.

So, given that we can flip a coin about once every 15 seconds let’s just say that we’re most likely to flip the coin 4 times in a minute, but it’s possible that we only flip it 3 times or maybe get an extra flip in for 5 total. And for the example we’ll just stipulate that 3 and 5 flips are each 25% likely in one minute, and 4 flips is 50% likely. Now the possible outcomes are:

  1. Three flips
    • Either 0 heads, 1 head, 2 heads, or 3 heads
    • {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
    • 25% likely to see 3 flips in one minute
  2. Four flips
    • Either 0 heads, 1 head, 2 heads, 3 heads, or 4 heads
    • {HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT}
    • 50% likely to see 4 flips in one minute
  3. Five flips
    • Either 0 heads, 1 head, 2 heads, 3 heads, 4 heads, or 5 heads
    • {HHHHH, HHHHT, HHHTH, HHHTT, HHTHH, HHTHT, HHTTH, HHTTT, HTHHH, HTHHT, HTHTH, HTHTT, HTTHH, HTTHT, HTTTH, HTTTT, THHHH, THHHT, THHTH, THHTT, THTHH, THTHT, THTTH, THTTT, TTHHH, TTHHT, TTHTH, TTHTT, TTTHH, TTTHT, TTTTH, TTTTT}
    • 25% likely to see 5 flips in one minute

[Figure: probability trees for 3 flips, 4 flips, and 5 flips]

Given that the potential outcomes are completely different when we stop by time, our probabilities also change. Notice that even if we flip 4 times, all of the flips from 3 and 5, properly weighted, are still part of the probability calculation since they were potential outcomes. Consider this probability tree where I only show the probabilities of outcomes of 3, 4, and 5 flips.

Recall that we stipulated that the likelihoods of flips in one minute are P(three) = .25, P(four) = .5, and P(five) = .25. That is, the probability that I only finish three flips in one minute is 25%. Again, let’s say that we want to know what the likelihood of seeing 75% tails or more extreme is (previously we expressed this as 3 tails in 4 flips, but since we aren’t confident we’ll flip just four times we’ll express it as a percentage).

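There are 56 possible sequences (8 in 3 flips, 16 in 4 flips, and 32 in 5 flips), but they are not all equally likely, so we can’t simply count them. Instead, we weight the probability within each tree by the likelihood of ending up in that tree. We know from our previous example that the likelihood that we would see 75% tails or more extreme in 4 flips is 5/16 or 31.25%. Looking at the most recent probability tree, we see that in 3 flips it is 1/8 or 12.5% (only TTT qualifies) and in 5 flips it is 6/32 or 18.75% (4 or 5 tails). Remember that we’re weighting 4 flips as twice as likely as either 3 or 5, so we get (.25 × 1/8) + (.5 × 5/16) + (.25 × 6/32) = 15/64, or 23.44%. This stopping rule gives us a very different result from the 31.25% we got with the sample size stopping rule!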
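One way to make the weighting explicit is to apply the law of total probability: compute the tail probability separately for each possible number of flips, then weight each by the stipulated chance of finishing that many flips in a minute, rather than pooling raw sequence counts. The sketch below does this with exact fractions; the names `FLIP_COUNT_WEIGHTS`, `tail_given_flips`, and `time_rule_tail` are hypothetical.

```python
from fractions import Fraction
from math import ceil, comb

# Stipulated in the text: chance of finishing 3, 4, or 5 flips in one minute.
FLIP_COUNT_WEIGHTS = {3: Fraction(1, 4), 4: Fraction(1, 2), 5: Fraction(1, 4)}


def tail_given_flips(n_flips, min_share=Fraction(3, 4)):
    """P(share of tails >= min_share) in n_flips tosses of a fair coin."""
    k_min = ceil(min_share * n_flips)  # fewest tails that reach the share
    favourable = sum(comb(n_flips, k) for k in range(k_min, n_flips + 1))
    return Fraction(favourable, 2**n_flips)


def time_rule_tail(weights=FLIP_COUNT_WEIGHTS):
    """Law of total probability over the number of flips completed."""
    return sum(w * tail_given_flips(n) for n, w in weights.items())


if __name__ == "__main__":
    for n in FLIP_COUNT_WEIGHTS:
        print(n, tail_given_flips(n))
    print("time-based rule:", time_rule_tail(), float(time_rule_tail()))
```

Changing the weights in `FLIP_COUNT_WEIGHTS` shows how sensitive the result is to the assumed distribution of flips per minute, which a fixed sample size rule never has to model.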

Here I’ll stress why it is important to note that this stopping rule produced a different p-value than the sample size stopping rule. Imagine that you ran an experiment where you decided that you would run the experiment for one week and then stop. Then you calculated the p-value as if you had used a sample size stopping rule instead of a time-based stopping rule. As you can see, you would have calculated the incorrect p-value. The p-value for time-based stopping rules is calculated differently than for sample size stopping rules, to account for the different outcome space, and ignoring that will result in an incorrect statistic.

Now is a good time to point out another “rule” experimenters sometimes follow. This is definitely something we’ve erroneously done here at Credit Karma and certainly other organizations have done the same. That’s when we’ve said “I will stop when I have collected at least 4 flips and at least one minute has gone by.” However, even though this is the stopping rule we use, we don’t calculate the p-value based on that stopping rule. Instead, we calculate the p-value as if we had used a sample size stopping rule. And again, the p-value and the probability of observing an outcome or something more extreme come apart.

Concretely, imagine we discard the 3-flip outcomes from the most recent example above because they do not satisfy the first conjunct of the antecedent (I have collected at least 4 flips). Renormalizing the remaining weights (.5 and .25 out of .75), our probability becomes ((.5 × 5/16) + (.25 × 6/32)) / .75 = 13/48, or 27.08%. That is still not the 31.25% we would calculate if we used a simple sample size stopping rule.

Conclusion: Recommended stopping rule

So, which stopping rule should we use at Credit Karma? Our experimentation group has provided a few tools (such as the web-based ad hoc bootstrapping tool) that allow our analysts and other experimenters to take a Bayesian approach to data analysis. We believe the best stopping rule is going to be a Bayesian stopping rule that we’ll cover in another, upcoming post.

However, if your team is using a frequentist approach to data analysis such as NHST, then our recommendation is that you use sample size as your stopping rule. There are certainly some challenges with this approach that it’s helpful to be aware of. For example, if you want to capture seasonality in your metrics and, say, seasonality for your metric is weekly, then it’s unlikely you’ll be able to identify upfront the sample size that exactly coincides with one week of data.

What can you do in this case? You could use a time-based stopping rule, but calculating it is a nontrivial task and it’s unlikely that your statistical tools are taking a time-based stopping rule into account. Instead, we strongly recommend that you move to a Bayesian approach to analysis[7].

At this point you might be thinking, “Well, sample sizes are minimums, right? Easy, then I’ll just keep collecting data until I hit my seasonality timespan as long as it’s above my minimum.” Unfortunately, that doesn’t work. Going into why is beyond the scope of this article (but a great topic for another article, see footnote 8 for a teaser[8]).

Great! So now you understand stopping rules, you’re using them as outlined here, and your p-values now accurately represent the likelihood of seeing an outcome as extreme or more extreme given the null hypothesis, right? Not so much. There are quite a few ways p-values can go wrong. For a couple of examples see the American Statistical Association statement on p-values, any number of Andrew Gelman’s posts and papers on p-values, or for a more sympathetic, but still critical view, alpha wars. Frequentist statistics is plagued by problems with getting p-values to map to likelihoods of observing a value as extreme or more extreme, including researcher degrees of freedom, data cutting (or what academics call supplementary analysis), underpowered results, and many others. However, by choosing the most appropriate stopping rule and using it as intended, you do get slightly closer to results that reflect the actual state of affairs.

Bonus: Peeking

A very common critique of experiment methodology is that experimenters sometimes peek at results and decide whether to continue the experiment or not. It’s easy to see why this is problematic if we revisit the probability tree from our previous example of 4 coin flips.

Suppose I peek after two flips and I tell myself, if there are two heads then no way am I continuing. I’ll only continue if there’s at least one tails. This changes the outcome space: the four sequences that start with two heads (HHHH, HHHT, HHTH, HHTT) can no longer occur as four-flip outcomes, which leaves 12 of the original 16.

Now what is the probability that we would see 3 tails or more extreme in 4 flips? 5/12 or 41.67%, since all five qualifying sequences (HTTT, THTT, TTHT, TTTH, TTTT) survive the peek. Whereas normally the probability that we see at least 3 tails in 4 flips is 31.25%, because we peeked and decided not to continue when there were already 2 heads, we’ve increased the actual p-value. However, if we used a standard p-value function (like those found in R or Excel) it would still spit out 31.25%. What’s critical to note here is that the chance of as extreme an outcome or more extreme and the calculated p-value come apart! Whereas the p-value calculated by a statistical test used to give you the chance of seeing as extreme a result or more extreme, it no longer reflects that, because it is being calculated on outcomes that are no longer possible given the actions of the experimenter[9].
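The effect of peeking can be checked by enumerating the four-flip sequences and removing the ones the peeking rule makes impossible. This is a minimal sketch; the helper name `share_with_three_plus_tails` is hypothetical, and it relies on the remaining sequences being equally likely once we condition on the experiment having continued.

```python
from fractions import Fraction
from itertools import product


def share_with_three_plus_tails(sequences):
    """Fraction of the given equally likely sequences with at least 3 tails."""
    sequences = list(sequences)
    extreme = [seq for seq in sequences if seq.count("T") >= 3]
    return Fraction(len(extreme), len(sequences))


all_four_flips = ["".join(seq) for seq in product("HT", repeat=4)]
# Peeking rule from the text: stop early if the first two flips are both heads,
# so those sequences never become four-flip outcomes.
after_peeking = [seq for seq in all_four_flips if not seq.startswith("HH")]

if __name__ == "__main__":
    print("no peeking:", share_with_three_plus_tails(all_four_flips))
    print("with peeking:", share_with_three_plus_tails(after_peeking))
```

The numerator is the same in both cases; only the denominator shrinks, which is why a p-value function that assumes all 16 outcomes understates the true tail probability here.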

Notes

  1. Recall that α is the probability of making a Type I error (https://xkcd.com/882/), viz. finding a difference when there is none. β, on the other hand, is the probability of making a Type II error, viz. not finding a difference when one exists; 1 − β is the power of the test.
  2. There are still more stopping rules in Bayesian statistics such as highest density interval (HDI) stopping rules.
  3. For a deeper dive see http://mathworld.wolfram.com/BinomialCoefficient.h...
  4. For example, to calculate the probability of 3 or more heads in 4 flips in R, simply run 1-pbinom(2, size=4, prob=0.5), which should be read as: one minus, i.e. the complement of, the probability of seeing two or fewer heads out of four flips where each flip has a 0.5 probability of being heads.
  5. You might think there is an implicit assumption in our analysis that we are using an N-based (i.e. sample-size based) stopping rule (a rule that says to stop after N samples have been collected), but if we don’t make it explicit it could call our p-value into question.
  6. This is not an infrequent approach in the industry. An experimenter might say that they’re going to run an experiment for a week at X% and then analyze the result at the end of the week.
  7. You might wonder how this solves the problem. For a good overview see optional stopping in data collection.
  8. You might be skeptical, but just consider the following:
> qbinom(.3, 100000, .5)/100000
[1] 0.49917
> qbinom(.3, 100001, .5)/100001
[1] 0.499175
> qbinom(.3, 100002, .5)/100002
[1] 0.49917
> qbinom(.3, 100003, .5)/100003
[1] 0.499175
# Or a more extreme example:
> qbinom(.3, 2, .5)/2
[1] 0.5
> qbinom(.3, 3, .5)/3
[1] 0.3333333
> qbinom(.3, 4, .5)/4
[1] 0.25
> qbinom(.3, 5, .5)/5
[1] 0.4
> qbinom(.3, 6, .5)/6
[1] 0.3333333

  9. For more on this see https://plato.stanford.edu/entries/statistics/#Exc...






About the Author

Robert is a staff software engineer on the experimentation and analytics platform who is passionate about solving large scale engineering and statistics problems.