Measure, compare, optimize: data-driven decisions with A/B testing

2023-06-26 by Mira Céline Klein

Which layout of an ad leads to more clicks? Would a different color or position of the order button lead to a higher conversion rate? Does a special offer really attract more customers - and which of two wordings would be better?

For a long time, you had to rely on gut feeling to answer these questions. A/B testing is an effective marketing tool that lets you answer them based on data instead.

What is A/B testing?

A/B testing is a process of testing two or more variants - for example, of a website or an advertisement - against each other to find out which one works best. These variants are usually referred to as "A" and "B". The criterion for the comparison could be anything: how many people order a product, subscribe to a newsletter or click on an advertisement.

To do this, website visitors are randomly assigned to one of two or more groups, between which the target metric (e.g. click-through rate, conversion rate...) can then be compared. This randomization ensures that the groups do not differ systematically in any other relevant dimension. This means that if your target metric takes on a significantly higher value in one group, you can be reasonably confident that this is due to your treatment and not due to any other variable.
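As a minimal sketch of the random assignment described above, the following R snippet assigns each visitor to group A or B with equal probability. The visitor IDs and the number of visitors are made up for illustration; in practice your testing tool usually handles this step.

```r
set.seed(2023)

# Hypothetical visitor IDs (assumption for illustration only)
visitors <- data.frame(visitorId = 1:1000)

# Assign every visitor independently to "A" or "B" with probability 0.5 each
visitors$group <- sample(c("A", "B"), size = nrow(visitors), replace = TRUE)

# Group sizes are roughly, but not exactly, equal
table(visitors$group)
```

Because the assignment depends only on chance, other characteristics of the visitors (device, time of day, previous purchases) should be balanced between the groups on average.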

Compared to other methods, performing an A/B test does not require extensive statistical knowledge. Nevertheless, some things need to be taken into account.

The process of an A/B test

In principle, an A/B test can be evaluated with any statistical environment. Today, however, special tools are often used that are integrated directly into the content management system (CMS) or the analytics platform. This simplifies the setup and evaluation of tests and also enables specialist departments to carry out and coordinate A/B tests.

The statistics behind an A/B test

In principle, evaluating an A/B test using classic statistical methods is not complicated. For example, there are tests that allow you to test two proportions from two independent groups for equality. If you want to compare two means, the t-test for independent samples might be the right choice. Even then, care must be taken to ensure that the result is not distorted by outliers. Further below we explain why mistakes can happen despite the relatively simple evaluation. For this, however, two important statistical concepts must first be explained: type I and type II error as well as the p-value.

Type I and type II error

There are two possible errors in a statistical decision (see also table 1): A type I error means that we observe a significant result even though there is no real difference between our groups. A type II error means that we observe a non-significant result even though there is in fact a difference. The probability of a type I error can be controlled and set in advance to a fixed number, such as 5%, often referred to as α or the significance level. The probability of a type II error (often called β), on the other hand, cannot be set directly. It decreases as the sample size and the magnitude of the actual effect increase. For example, if one of the designs performs a lot better than the other, the test is more likely to detect the difference than if the designs differ only slightly with respect to the target variable.

Therefore, the required sample size can be calculated in advance if you know α, the desired power (1 - β, the probability of detecting an effect that really exists) and the minimum size of the effect you want to detect (sample size planning or power analysis). If you know the average traffic on the website, you can then get a rough idea of how long you need to wait until the test is completed. Setting the rule for the end of the test in advance is often referred to as "fixed-horizon testing".

Table 1: Overview of possible errors and correct decisions in statistical tests
Statistical test is significant	Effect really exists: No	Effect really exists: Yes
No	True negative	Type II error (false negative)
Yes	Type I error (false positive)	True positive
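
The sample size planning described above can be sketched in R with power.prop.test from the stats package. All numbers below (baseline rate, smallest relevant rate, power, daily traffic) are assumptions chosen for illustration and should be replaced with values from your own setting.

```r
# Sample size planning for comparing two proportions.
# All inputs are illustrative assumptions, not values from the article.
baselineRate <- 0.04  # assumed click-through rate of the current variant
targetRate   <- 0.06  # assumed smallest rate of the new variant worth detecting
alpha        <- 0.05  # significance level (probability of a type I error)
targetPower  <- 0.80  # desired power (1 - probability of a type II error)

plan <- power.prop.test(
  p1 = baselineRate,
  p2 = targetRate,
  sig.level = alpha,
  power = targetPower
)

# Required number of observations per group, rounded up
nPerGroup <- ceiling(plan$n)
nPerGroup

# Rough test duration, assuming 400 visitors per day split evenly across both groups
visitorsPerDay <- 400
daysNeeded <- ceiling(2 * nPerGroup / visitorsPerDay)
daysNeeded
```

Smaller effects or a higher desired power increase the required sample size, and with it the time the test has to run.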

p-value

Statistical tests generally report the p-value, which is the probability of observing a result at least as extreme as the one actually observed purely by chance, assuming there is no effect. When the p-value is less than α, the result is said to be "significant".
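
To make this definition more tangible, the following sketch approximates a p-value by simulation instead of using a formula. It assumes illustrative counts of 20 clicks in group A and 40 clicks in group B with 500 visitors each, simulates many experiments in which both groups share the same click rate, and counts how often the simulated difference is at least as large as the observed one. Using the pooled rate as the "no effect" rate is a simplifying assumption.

```r
set.seed(7)

# Illustrative counts (assumption): clicks and visitors per group
clicksA <- 20
clicksB <- 40
nPerGroup <- 500

# Observed absolute difference in click counts (group sizes are equal)
observedDiff <- abs(clicksB - clicksA)

# Under "no effect", both groups share the same click rate;
# here it is estimated by the pooled rate
pooledRate <- (clicksA + clicksB) / (2 * nPerGroup)

# Simulate many experiments without a true effect
nSim <- 20000
simDiff <- abs(rbinom(nSim, nPerGroup, pooledRate) -
               rbinom(nSim, nPerGroup, pooledRate))

# Share of simulated experiments at least as extreme as the observed one
pValueSim <- mean(simDiff >= observedDiff)
pValueSim
```

The simulated value will not match the output of a classical test exactly, but it should be of a similar order of magnitude.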

What should be considered in an A/B test?

Although A/B testing is basically simple, there are a few pitfalls to watch out for.

Early ending

When running an A/B test, you may not always want to wait until the end, but rather take a look from time to time to see how the test is performing. Now what if you suddenly realize that your p-value has already dropped below your significance level? Doesn't that mean that the winner has already been determined and you can stop the test? While this conclusion is very tempting, it can also be very wrong.

The p-value varies greatly during the experiment, and even if it is considerably greater than α at the end of the fixed horizon, it may fall below α at some point along the way. In fact, it can be mathematically proven that, even if there is no true effect, you are guaranteed to get a significant p-value at some point if you extend your time horizon to infinity. This is why looking at the p-value multiple times is a bit like cheating: it makes the actual probability of a type I error much larger than the α you chose in advance. This is called "α-inflation". The more often you check your p-value during data collection, the greater the risk of drawing the wrong conclusions.

In the best case, you only end up changing the color or position of a button, although this has no effect. In the worst case, your company offers a special deal that incurs costs but doesn't actually generate any profit. In short, as tempting as it may seem, don't stop your A/B test prematurely just because you observe a significant result.

The following figure shows a simulation with 500 observations per group and a true click-through rate (CTR) of 6% in both groups, i.e. there is no actual difference. The line shows what you would observe if you looked at the p-value after each new observation. In the end (after 1000 observations in total), the p-value does the right thing: it is greater than 0.05, which is good because there is actually no difference between the groups. But you can see that the p-value still fell below the 5% threshold several times. If you had stopped the test prematurely, you would have drawn a wrong conclusion. Of course, this is only one possible result of a random simulation, but it illustrates the problem. If you keep looking at the p-value and stop the experiment as soon as p < 0.05, the type I error rate rises above the nominal 5%.

Example for the course of the p-value during a test
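
The effect of repeated looks can also be checked with a small simulation. The following sketch is not the code behind the figure; it is an independent illustration under assumed settings (500 observations per group, a CTR of 6% in both groups, a look at the p-value after every 50 observations per group). It compares how often a test without any true effect becomes significant at some look with how often it is significant at the planned end.

```r
set.seed(1)

# Simulate one A/B test without a true effect and check whether it is
# significant at any interim look and at the final look.
simulatePeeking <- function(nPerGroup = 500, ctr = 0.06, alpha = 0.05,
                            peekEvery = 50) {
  clicksA <- rbinom(nPerGroup, size = 1, prob = ctr)
  clicksB <- rbinom(nPerGroup, size = 1, prob = ctr)
  peeks <- seq(peekEvery, nPerGroup, by = peekEvery)

  pValues <- vapply(peeks, function(n) {
    successes <- c(sum(clicksA[1:n]), sum(clicksB[1:n]))
    # The test is undefined if nobody (or everybody) clicked so far
    if (sum(successes) == 0 || sum(successes) == 2 * n) return(1)
    suppressWarnings(prop.test(successes, c(n, n))$p.value)
  }, numeric(1))

  c(anyPeek = any(pValues < alpha),
    finalOnly = pValues[length(pValues)] < alpha)
}

# Repeat the experiment many times and compute the false positive rates
results <- replicate(1000, simulatePeeking())
falsePositiveRates <- rowMeans(results)
falsePositiveRates
```

The rate for anyPeek can never be lower than the rate for finalOnly, because the final look is one of the looks. How much higher it is depends on how often you look and on the test used.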

Test group size & test period

You should determine in advance how large the test groups must be and how long the test should run.

The test period must also be chosen carefully. For example, make sure that the test does not run only at certain times of day or on certain days of the week. Practitioners also recommend not running a test for too long, so that you can react quickly to results and avoid interfering with other marketing campaigns.

Example in R

The following code shows you how to test the difference between two rates in R, e.g., click-through rates or conversion rates. You can apply the code to your own data by replacing the URL to the example data with your file path. To test the difference between two proportions, you can use the function prop.test, which is equivalent to Pearson’s chi-squared test. For small samples you should use Fisher’s exact test instead. prop.test returns a p-value and a confidence interval for the difference between the two rates. A 95% confidence interval is interpreted as follows: if such an analysis were repeated many times, 95% of the resulting confidence intervals would contain the true difference.

library(readr)

# Specify file path:
dataPath <-
  "https://www.inwt-statistics.de/data/exampleDataABtest.csv"

# Read data
data <- read_csv(file = dataPath) 

# Inspect structure of the data
str(data) 
## $ group      : chr [1:1000] "A" "A" "A" "A" ...
## $ date       : Date[1:1000], format: "2023-06-05" "2023-06-04" "2023-06-02" "2023-06-05" ...
## $ clickedTrue: num [1:1000] 0 0 0 0 0 0 0 0 0 1 ...

# Change type of group to factor 
data$group <- as.factor(data$group) 

# Change type of click-through variable to factor with levels "0" and "1"
data$clickedTrue <- factor(data$clickedTrue, levels = c(0, 1))

# Compute frequencies and conduct test for proportions 
# (Frequency table with successes in the first column)
freqTable <- table(data$group, data$clickedTrue)[, c(2,1)] 

# print frequency table
freqTable 
##    
##       1   0
##   A  20 480
##   B  40 460

# Conduct significance test
prop.test(freqTable, conf.level = .95)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  freqTable
## X-squared = 6.4007, df = 1, p-value = 0.01141
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.071334055 -0.008665945
## sample estimates:
## prop 1 prop 2 
##   0.04   0.08
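
For small samples, the text above recommends Fisher's exact test. A minimal sketch with fisher.test, entering the counts from the frequency table above by hand so that the snippet runs without the data file:

```r
# Same counts as in the frequency table above: clicks (1) and non-clicks (0)
freqMatrix <- matrix(
  c(20, 480,
    40, 460),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), clicked = c("1", "0"))
)

# Fisher's exact test for a 2x2 table
fisherResult <- fisher.test(freqMatrix)

# p-value of the exact test
fisherResult$p.value

# Estimated odds ratio (odds of a click in A relative to B)
fisherResult$estimate
```

With group sizes as large as in this example, the exact test and prop.test should lead to the same conclusion; the exact test matters mainly when some cell counts are very small.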

Enhancements and improvements in A/B testing

As technology and methodologies advance, there are always new approaches to further improve A/B testing. Two such approaches are Bayesian A/B testing and the use of multi-armed bandits.

Bayesian approaches

Bayesian A/B testing uses the methods of Bayesian statistics to determine not only whether a difference exists, but also how large that difference is likely to be and how confident we can be in that statement. It yields a probability distribution for the performance of each variant, which is updated with the data collected so far and therefore gives continuous insight into the expected performance.

An advantage of the Bayesian A/B test is that, unlike the p-value, it provides an intuitive and easy-to-interpret output, e.g., "There is a 95% probability that variant B is better than variant A." This facilitates communication of test results and decision making.

Another advantage is that Bayesian A/B tests are more flexible than the classical frequentist tests. They allow you to stop the test at any time once sufficient data has been collected. Bayesian A/B testing is a powerful method, but it also requires a higher level of statistical understanding and care in its application to avoid misinterpretation and misuse.
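
One simple way to obtain a statement such as "the probability that B is better than A" is a Beta-Binomial model. The following sketch is only one possible setup, not a method prescribed here: it assumes a uniform Beta(1, 1) prior for both click-through rates and reuses the counts from the R example above (20 of 500 clicks in A, 40 of 500 in B).

```r
set.seed(42)

# Observed data (counts from the R example above)
clicksA <- 20
nA <- 500
clicksB <- 40
nB <- 500

# Assumption: uniform Beta(1, 1) prior on each click-through rate
priorAlpha <- 1
priorBeta  <- 1

# The posterior of each rate is Beta(prior + clicks, prior + non-clicks);
# draw samples from both posteriors
nDraws <- 100000
drawsA <- rbeta(nDraws, priorAlpha + clicksA, priorBeta + nA - clicksA)
drawsB <- rbeta(nDraws, priorAlpha + clicksB, priorBeta + nB - clicksB)

# Posterior probability that B has the higher click-through rate
probBBetter <- mean(drawsB > drawsA)
probBBetter

# 95% credible interval for the difference B - A
credibleInterval <- quantile(drawsB - drawsA, probs = c(0.025, 0.975))
credibleInterval
```

The result depends on the chosen prior. With little data, a different prior can change the conclusion noticeably, which is one reason why this approach requires care.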

Multi-armed bandits

Multi-armed bandits (MAB) are a concept from the field of machine learning and artificial intelligence that is often used to optimize A/B tests. The name "multi-armed bandit" is a reference to one-armed bandits (slot machines): each "arm" of the bandit represents an option that can be pulled, for example one variant of a website. MAB algorithms address the dilemma between exploration (trying new options and gathering information) and exploitation (using the information gathered so far to choose the best possible option). They do this by seeking a compromise between playing the arms with the best results to date (exploitation) and playing the arms about which little is known (exploration). You can read more about multi-armed bandits in our blog article Multi-Armed Bandits as an Alternative to A/B Testing.
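
To illustrate the trade-off between exploration and exploitation, here is a minimal sketch of Thompson sampling, one common MAB strategy. Note that this particular algorithm is chosen for illustration and is not named above. The true click-through rates and the number of visitors are hypothetical; in a real test the true rates are of course unknown.

```r
set.seed(123)

# Hypothetical true click-through rates, unknown to the algorithm
trueRates <- c(A = 0.04, B = 0.08)
nVisitors <- 5000

# Clicks and non-clicks observed so far per arm
successes <- c(A = 0, B = 0)
failures  <- c(A = 0, B = 0)

for (i in seq_len(nVisitors)) {
  # Draw a plausible rate for each arm from its current posterior,
  # assuming a Beta(1, 1) prior per arm
  sampledRates <- rbeta(2, 1 + successes, 1 + failures)

  # Show the variant whose sampled rate is highest
  arm <- which.max(sampledRates)

  # Observe whether this visitor clicks and update the counts
  clicked <- rbinom(1, size = 1, prob = trueRates[arm])
  successes[arm] <- successes[arm] + clicked
  failures[arm]  <- failures[arm] + (1 - clicked)
}

# Number of visitors that were shown each variant
pulls <- successes + failures
pulls
```

Early on, both variants are shown about equally often because little is known about them (exploration). As evidence accumulates, the better-performing variant is shown more and more often (exploitation).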

Links:

Online tool for calculating the sample size for A/B tests
