A/B Testing is a controlled experimental framework used to compare two or more variations of a product, webpage, advertisement, or feature to determine which version performs better based on a predefined metric (e.g., click-through rate, conversion rate, or revenue).
It works by randomly dividing a target audience into two groups:
- Group A (Control): Exposed to the existing version or baseline.
- Group B (Variant): Exposed to the new version being tested.
The performance of the two versions is analyzed to determine if the changes introduced in the variant lead to statistically significant improvements.
Key Components of A/B Testing
- Metric: A quantifiable outcome, such as CTR (Click-Through Rate), revenue, sign-ups, etc.
- Randomization: Ensures that the two groups are comparable and that the results are not biased by external factors.
- Controlled Environment: Only one element or variable is changed between the two groups to isolate the effect of that change.
- Statistical Analysis: Used to determine whether the observed differences in performance are statistically significant (e.g., via confidence intervals and p-values).
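To make the statistical-analysis component concrete, here is a minimal sketch of a 95% Wald confidence interval for the difference in conversion rates between two groups; the conversion counts are made-up numbers for illustration, not data from a real test.
import numpy as np
from scipy.stats import norm

# made-up conversion counts for illustration (not real test data)
conv_A, n_A = 500, 10000   # control: 5.0% conversion rate
conv_B, n_B = 560, 10000   # variant: 5.6% conversion rate

p_A, p_B = conv_A / n_A, conv_B / n_B
lift = p_B - p_A

# Wald standard error of the difference in proportions
se = np.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

# 95% confidence interval for the difference
z = norm.ppf(0.975)
ci_low, ci_high = lift - z * se, lift + z * se
print(f"Observed lift: {lift:.4f}, 95% CI: ({ci_low:.4f}, {ci_high:.4f})")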
Purpose
- Optimization: Improve user experience and business outcomes by identifying the most effective design or feature.
- Data-Driven Decisions: Make informed decisions based on empirical evidence rather than intuition or assumptions.
- Risk Mitigation: Test changes on a smaller scale before rolling them out to the entire user base, reducing the risk of negative impacts.
Examples of A/B Testing
- Advertising: Testing different ad creatives to see which one generates more clicks or conversions.
- Web Design: Comparing different layouts, color schemes, or call-to-action buttons to see which design leads to higher user engagement.
- Email Campaigns: Testing different subject lines or email content to see which version results in higher open rates or click-through rates.
- Product Features: Testing new features or changes to existing features to see how they impact user behavior and satisfaction.
A/B Testing is widely used in various industries, including marketing, product development, and UI/UX design, due to its simplicity and effectiveness.
Steps to Conduct A/B Testing
- Define the Objective
- Clearly define the goal of the test and the key metric to be measured.
- Questions to consider:
- What are you trying to improve?
- What is the primary metric for success?
- What is the current baseline performance?
- Why are you conducting this test?
- Example:
- Objective: Increase the click-through rate (CTR) on a call-to-action button.
- Metric: CTR (Click-Through Rate)
- Identify the Variants
- Decide which element will differ between the control (A) and the variant (B).
- Common Variants:
- Website elements: headlines, images, buttons, layouts.
- Ad creatives: text, visuals, calls to action.
- Email content: subject lines, body text, images.
- Segment the Audience
- Randomly divide the target audience into two groups to ensure comparability.
- Group A (Control): Exposed to the existing version.
- Group B (Variant): Exposed to the new version.
- Considerations:
- Randomization: Ensure that the assignment to groups is random to avoid selection bias (see the hash-based assignment sketch after this step).
- Equal Representation: Split traffic evenly between the two groups to ensure that each group is comparable.
- Example:
- If you have 10,000 visitors, randomly assign 5,000 to Group A and 5,000 to Group B.
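A minimal sketch of one common way to implement random but sticky assignment is to hash a stable user identifier into a bucket, so the same user always sees the same variant. The experiment name and the exact hashing scheme below are illustrative assumptions, not a prescribed implementation.
import hashlib

def assign_variant(user_id: str, experiment: str = "cta_button_test") -> str:
    """Deterministically assign a user to 'A' or 'B' with a ~50/50 split."""
    # hash the user id together with the experiment name so that
    # different experiments get independent assignments
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100        # bucket in the range 0-99
    return "A" if bucket < 50 else "B"    # 0-49 -> control, 50-99 -> variant

# the same user always gets the same variant across repeated calls
print(assign_variant("user_12345"))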
- Determine Sample Size
- Calculate the minimum sample size required for each group to detect a meaningful effect.
- Considerations:
- Baseline Performance: The current performance of the metric being tested (e.g., the current CTR).
- Minimum Detectable Effect (MDE): The smallest effect size that you want to be able to detect (e.g., a 1 percentage point lift in CTR).
- Confidence Level: Typically set at 95% (i.e., α = 0.05) to reduce the risk of Type I errors.
- Statistical Power: Aim for a power of 80% or higher (i.e., β = 0.20 or lower) to reduce the risk of Type II errors.
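For two proportions, a common closed-form approximation for the per-group sample size (the same formula used in the code example at the end of this page) is:
n per group ≈ (z_alpha + z_beta)² · 2 · p̄ · (1 - p̄) / δ²
where z_alpha is the standard normal quantile for the significance level (norm.ppf(1 - alpha) for a one-sided test, norm.ppf(1 - alpha / 2) for a two-sided test), z_beta = norm.ppf(1 - beta), p̄ is the average of the baseline rate and the expected variant rate, and δ is the MDE. With a 5% baseline CTR, a 1 percentage point MDE, α = 0.05, and 80% power, this works out to roughly 6,400 users per group for a one-sided test and roughly 8,200 for a two-sided test, matching the code example below.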
- Run the Experiment
- Serve both versions in parallel and collect data until each group reaches the required sample size; avoid stopping the test early just because interim results look significant.
- Analyze the Results
- Calculate the metrics for each group.
- Perform Hypothesis Testing.
Types of Errors
- Type I Error: False Positive (detecting an effect when there is none)
- α (alpha) level, commonly set at 0.05
- Confidence Level: 1 - α (e.g., a 95% confidence level)
- Type II Error: False Negative (failing to detect an effect when there is one)
- β (beta) level, commonly set at 0.20
- Statistical Power: Probability of correctly rejecting the null hypothesis when it is false (1 - β)
- p-value: Probability of observing the data, or something more extreme, if the null hypothesis is true.
Hypothesis Testing
- Null Hypothesis (H0): There is no difference between the control and variant.
- Alternative Hypothesis (H1): There is a difference between the control and variant.
Evaluating the Hypotheses:
- Use a two-proportion z-test or t-test to assess statistical significance.
- Compute the p-value:
- If p-value < alpha (e.g., 0.05), reject the null hypothesis and conclude that there is a significant difference.
- If p-value >= alpha, fail to reject the null hypothesis.
One-Sided vs Two-Sided Tests
- One-Sided Test: Tests for an effect in a specific direction (e.g., variant is better than control).
- Two-Sided Test: Tests for an effect in both directions (e.g., variant is different from control, either better or worse).
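For reference, here is a minimal sketch of what a pooled two-proportion z-test computes, including how one-sided and two-sided p-values are derived from the same z-statistic; the click and impression counts are made-up numbers for illustration.
import numpy as np
from scipy.stats import norm

# made-up click and impression counts for illustration
clicks_A, n_A = 300, 6000   # control
clicks_B, n_B = 380, 6000   # variant

p_A, p_B = clicks_A / n_A, clicks_B / n_B

# pooled proportion under the null hypothesis (no difference between groups)
p_pool = (clicks_A + clicks_B) / (n_A + n_B)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))

z = (p_B - p_A) / se
p_one_sided = norm.sf(z)           # H1: variant is better than control
p_two_sided = 2 * norm.sf(abs(z))  # H1: variant differs from control in either direction

print(f"z = {z:.3f}, one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")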
Advantages of A/B Testing
- Simplicity
- Actionable Insights
- Isolated Variables
- Broad Applicability
- Risk Mitigation
Disadvantages of A/B Testing
- Static traffic allocation: traffic is split at a fixed ratio (typically 50/50) for the entire duration of the test, regardless of interim performance.
- Wastes resources by continuing to show the worse-performing variant to a significant portion of users until the test ends.
Alternative Approaches:
- Multi-Armed Bandit Testing: Dynamically allocates more traffic to better-performing variants.
- Sequential Testing: Continuously monitors results and allows for early stopping of the test when significant results are observed.
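As a rough illustration of the multi-armed bandit idea, the sketch below uses Thompson sampling with Beta posteriors to shift traffic toward the better-performing variant as evidence accumulates; the true conversion rates are made-up and the loop is a simulation, not a production implementation.
import numpy as np

rng = np.random.default_rng(0)

# assumed (made-up) true conversion rates; unknown to the algorithm
true_rates = {"A": 0.05, "B": 0.06}

# Beta(1, 1) priors, tracked as success/failure counts per variant
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}
traffic = {"A": 0, "B": 0}

for _ in range(10000):
    # sample a plausible conversion rate for each variant from its posterior
    samples = {v: rng.beta(successes[v] + 1, failures[v] + 1) for v in true_rates}
    chosen = max(samples, key=samples.get)  # serve the variant that currently looks best

    # observe a simulated conversion and update that variant's posterior
    converted = rng.random() < true_rates[chosen]
    successes[chosen] += converted
    failures[chosen] += not converted
    traffic[chosen] += 1

print("Traffic allocation:", traffic)  # most traffic tends to shift toward B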
Code Example
import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)

# baseline performance
ctr_base = 0.05

# minimum detectable effect (MDE)
mde = 0.01
ctr_b = ctr_base + mde

# define Type I and Type II error rates
alpha = 0.05
beta = 0.2

# test z-scores
z_alpha_one_sided = norm.ppf(1 - alpha)
z_alpha_two_sided = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(1 - beta)

# average (pooled) CTR and the difference to detect
ctr_bar = (ctr_base + ctr_b) / 2
delta_ctr = mde

# required sample size per group
n_one_sided = int(((z_alpha_one_sided + z_beta) ** 2 * 2 * ctr_bar * (1 - ctr_bar)) / (delta_ctr ** 2))
n_two_sided = int(((z_alpha_two_sided + z_beta) ** 2 * 2 * ctr_bar * (1 - ctr_bar)) / (delta_ctr ** 2))

# simulate click data for each group
clicks_A_one_sided = np.random.binomial(n=n_one_sided, p=ctr_base)
clicks_B_one_sided = np.random.binomial(n=n_one_sided, p=ctr_b)
clicks_A_two_sided = np.random.binomial(n=n_two_sided, p=ctr_base)
clicks_B_two_sided = np.random.binomial(n=n_two_sided, p=ctr_b)

data_one_sided = pd.DataFrame({
    'Ads': ['A', 'B'],
    'Impressions': [n_one_sided, n_one_sided],
    'Clicks': [clicks_A_one_sided, clicks_B_one_sided]
})

data_two_sided = pd.DataFrame({
    'Ads': ['A', 'B'],
    'Impressions': [n_two_sided, n_two_sided],
    'Clicks': [clicks_A_two_sided, clicks_B_two_sided]
})

# do hypothesis testing
z_stat_one_sided, p_value_one_sided = proportions_ztest(data_one_sided['Clicks'], data_one_sided['Impressions'], alternative='smaller')
z_stat_two_sided, p_value_two_sided = proportions_ztest(data_two_sided['Clicks'], data_two_sided['Impressions'], alternative='two-sided')

# print results
print("One-sided test results:")
print(data_one_sided)
print(f"Z-statistic: {z_stat_one_sided}, P-value: {p_value_one_sided}")
if p_value_one_sided < 0.05:
    print("Reject the null hypothesis in one-sided test.")
else:
    print("Fail to reject the null hypothesis in one-sided test.")

print("\n---\n")

print("Two-sided test results:")
print(data_two_sided)
print(f"Z-statistic: {z_stat_two_sided}, P-value: {p_value_two_sided}")
if p_value_two_sided < 0.05:
    print("Reject the null hypothesis in two-sided test.")
else:
    print("Fail to reject the null hypothesis in two-sided test.")
Output
One-sided test results:
  Ads  Impressions  Clicks
0   A         6426     308
1   B         6426     388
Z-statistic: -3.117994752318808, P-value: 0.0009104302281918874
Reject the null hypothesis in one-sided test.

---

Two-sided test results:
  Ads  Impressions  Clicks
0   A         8158     377
1   B         8158     496
Z-statistic: -4.139814221510001, P-value: 3.4758719657070286e-05
Reject the null hypothesis in two-sided test.