
A/B Testing

October 5, 2025
6 min read

A/B Testing is a controlled experimental framework used to compare two or more variations of a product, webpage, advertisement, or feature to determine which version performs better based on a predefined metric (e.g., click-through rate, conversion rate, or revenue).

It works by randomly dividing a target audience into two groups:

  • Group A (Control): Exposed to the existing version or baseline.
  • Group B (Variant): Exposed to the new version being tested.

The performance of the two versions is analyzed to determine if the changes introduced in the variant lead to statistically significant improvements.
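
For intuition, here is a minimal sketch of how visitors might be assigned to the two groups; the hash-based bucketing, the experiment name, and the 50/50 split are illustrative assumptions rather than a prescribed implementation.

import hashlib

def assign_group(user_id: str, experiment: str = "cta_button_test") -> str:
    # Hash the user ID together with the experiment name so that the same
    # user always lands in the same group (deterministic but effectively random).
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # 50/50 split between control (A) and variant (B).
    return "A" if bucket < 50 else "B"

# Assign a few hypothetical users.
for uid in ["user_001", "user_002", "user_003"]:
    print(uid, "->", assign_group(uid))

Hashing on a stable user ID keeps the assignment consistent across visits, so the same person never sees both versions of the experience.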

Key Components of A/B Testing

  1. Metric: A quantifiable outcome, such as CTR (Click-Through Rate), revenue, sign-ups, etc.
  2. Randomization: Ensures that the two groups are comparable and that the results are not biased by external factors.
  3. Controlled Environment: Only one element or variable is changed between the two groups to isolate the effect of that change.
  4. Statistical Analysis: Used to determine whether the observed differences in performance are statistically significant (e.g., using confidence intervals and p-values).
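
These components can be captured in a small experiment specification before the test begins; the dataclass below is only an illustrative sketch, and its field names are assumptions rather than any standard schema.

from dataclasses import dataclass

@dataclass
class ExperimentSpec:
    name: str             # what is being tested
    metric: str           # primary success metric, e.g. "CTR"
    control: str          # description of version A (the baseline)
    variant: str          # description of version B (only one element changed)
    traffic_split: float  # fraction of traffic sent to the variant
    alpha: float          # significance level for the statistical analysis
    power: float          # desired statistical power

spec = ExperimentSpec(
    name="cta_button_color",
    metric="CTR",
    control="blue call-to-action button",
    variant="green call-to-action button",
    traffic_split=0.5,
    alpha=0.05,
    power=0.8,
)
print(spec)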

Purpose

  • Optimization: Improve user experience and business outcomes by identifying the most effective design or feature.
  • Data-Driven Decisions: Make informed decisions based on empirical evidence rather than intuition or assumptions.
  • Risk Mitigation: Test changes on a smaller scale before rolling them out to the entire user base, reducing the risk of negative impacts.

Examples of A/B Testing

  1. Advertising: Testing different ad creatives to see which one generates more clicks or conversions.
  2. Web Design: Comparing different layouts, color schemes, or call-to-action buttons to see which design leads to higher user engagement.
  3. Email Campaigns: Testing different subject lines or email content to see which version results in higher open rates or click-through rates.
  4. Product Features: Testing new features or changes to existing features to see how they impact user behavior and satisfaction.

A/B Testing is widely used in various industries, including marketing, product development, and UI/UX design, due to its simplicity and effectiveness.

Steps to Conduct A/B Testing

  1. Define the Objective
    • Clearly define the goal of the test and the key metric to be measured.
    • Questions to consider:
      • What are you trying to improve?
      • What is the primary metric for success?
      • What is the current baseline performance?
      • Why are you conducting this test?
    • Example:
      • Objective: Increase the click-through rate (CTR) on a call-to-action button.
      • Metric: CTR (Click-Through Rate)
  2. Identify the Variants
    • Decide which element to test, and define how it differs between the control (A) and the variant (B).
    • Common Variants:
      • Website elements: headlines, images, buttons, layouts.
      • Ad creatives: text, visuals, calls to action.
      • Email content: subject lines, body text, images.
  3. Segment the Audience
    • Randomly divide the target audience into two groups to ensure comparability.
      • Group A (Control): Exposed to the existing version.
      • Group B (Variant): Exposed to the new version.
    • Considerations:
      • Randomization: Ensure that the assignment to groups is random to avoid selection bias.
      • Equal Representation: Split traffic evenly between the two groups to ensure that each group is comparable.
    • Example:
      • If you have 10,000 visitors, randomly assign 5,000 to Group A and 5,000 to Group B.
  4. Determine Sample Size
$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2 \cdot 2 \cdot \bar{CTR} \cdot (1 - \bar{CTR})}{(\Delta CTR)^2}$$
    • Calculate the minimum sample size required for each group to detect a meaningful effect (a library-based cross-check appears after this list).
    • Considerations:
      • Baseline Performance: The current performance of the metric being tested (e.g., the current CTR).
      • Minimum Detectable Effect (MDE): The smallest effect size that you want to be able to detect (e.g., ΔCTR = CTR_B - CTR_base).
      • Confidence Level: Typically set at 95% to reduce the risk of Type I errors (e.g., α = 0.05).
      • Statistical Power: Aim for a power of 80% or higher to reduce the risk of Type II errors (e.g., β = 0.20).
  5. Run the Experiment
    • Serve both versions concurrently until each group reaches the required sample size, keeping everything else about the experience unchanged.
  6. Analyze the Results
    • Calculate the metrics for each group.
    • Perform Hypothesis Testing.
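
As a cross-check of the sample-size formula in step 4, the sketch below computes the per-group sample size with statsmodels' power calculator. It uses the same assumed baseline CTR of 5% and 1-percentage-point MDE as the code example at the end of this post; because it relies on an arcsine effect size (Cohen's h), the result will differ slightly from the closed-form formula.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

ctr_base = 0.05     # assumed baseline CTR
ctr_variant = 0.06  # baseline plus a 1-percentage-point MDE
alpha = 0.05
power = 0.8

# Cohen's h effect size for two proportions (arcsine transformation).
effect_size = proportion_effectsize(ctr_variant, ctr_base)

# Minimum sample size per group for a two-sided test at the chosen alpha and power.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,
    alternative='two-sided',
)
print(f"Required sample size per group: {int(round(n_per_group))}")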

Types of Errors

  • Type I Error: False Positive (detecting an effect when there is none)
    • α (alpha) level, commonly set at 0.05
    • Confidence Level: 1 - α (e.g., a 95% confidence level)
  • Type II Error: False Negative (failing to detect an effect when there is one)
    • β (beta) level, commonly set at 0.20
    • Statistical Power: Probability of correctly rejecting the null hypothesis when it is false (1 - β)
  • p-value: Probability of observing the data, or something more extreme, if the null hypothesis is true.
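
To make the relationship between α, β, and power concrete, the sketch below approximates the power of a one-sided two-proportion test for a fixed per-group sample size. The CTRs and the sample size of 6,426 are taken from the code example at the end of this post, and the normal-approximation formula simply inverts the sample-size formula in step 4, so the result should land close to the 80% target.

import numpy as np
from scipy.stats import norm

ctr_base, ctr_b = 0.05, 0.06  # assumed control and variant CTRs
n = 6426                      # per-group sample size (one-sided design)
alpha = 0.05

# Standard error of the difference in proportions under the pooled CTR.
ctr_bar = (ctr_base + ctr_b) / 2
se = np.sqrt(2 * ctr_bar * (1 - ctr_bar) / n)

# Power of a one-sided test: probability of exceeding the critical value
# when the true difference equals the MDE.
z_alpha = norm.ppf(1 - alpha)
power = norm.cdf((ctr_b - ctr_base) / se - z_alpha)
print(f"Power: {power:.3f}, Type II error rate (beta): {1 - power:.3f}")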

Hypothesis Testing

  • Null Hypothesis (H0): There is no difference between the control and variant.
  • Alternative Hypothesis (H1): There is a difference between the control and variant.

Evaluating the Hypotheses:

  • Use a two-proportion z-test (for rates such as CTR) or a t-test (for continuous metrics such as revenue) to assess statistical significance.
  • Compute the p-value:
    • If the p-value < α (e.g., 0.05), reject the null hypothesis and conclude that there is a significant difference.
    • If the p-value ≥ α, fail to reject the null hypothesis.
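
For illustration, the sketch below carries out a two-proportion z-test by hand; the click counts are hypothetical, and the pooled-proportion standard error is the usual construction for testing the null hypothesis of no difference between the groups.

import numpy as np
from scipy.stats import norm

# Hypothetical results: clicks and impressions for control (A) and variant (B).
clicks_a, n_a = 500, 10_000
clicks_b, n_b = 580, 10_000
p_a, p_b = clicks_a / n_a, clicks_b / n_b

# Pooled proportion and standard error under the null hypothesis.
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

# Two-sided test: H0: p_a = p_b vs H1: p_a != p_b.
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")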

One-Sided vs Two-Sided Tests

  • One-Sided Test: Tests for an effect in a specific direction (e.g., variant is better than control).
  • Two-Sided Test: Tests for an effect in both directions (e.g., variant is different from control, either better or worse).

Advantages of A/B Testing

  • Simplicity
  • Actionable Insights
  • Isolated Variables
  • Broad Applicability
  • Risk Mitigation

Disadvantages of A/B Testing

  • Traffic is allocated statically (typically 50/50) for the duration of the test, regardless of how the variants perform.
  • Resources are wasted by continuing to show the worse-performing variant to a significant portion of users.

Alternative Approaches:

  • Multi-Armed Bandit Testing: Dynamically allocates more traffic to better-performing variants.
  • Sequential Testing: Continuously monitors results and allows for early stopping of the test when significant results are observed.
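
For contrast with a fixed 50/50 split, here is a simplified epsilon-greedy multi-armed bandit simulation that shifts traffic toward the better-performing variant as evidence accumulates. The true CTRs, epsilon, and visitor count are illustrative assumptions, and production systems often use more sophisticated policies such as Thompson sampling.

import numpy as np

rng = np.random.default_rng(42)

true_ctr = [0.05, 0.06]  # assumed true CTRs for variants A and B
epsilon = 0.1            # fraction of traffic reserved for exploration
n_visitors = 50_000

clicks = np.zeros(2)
impressions = np.zeros(2)

for _ in range(n_visitors):
    if rng.random() < epsilon:
        # Explore: pick a variant at random.
        arm = int(rng.integers(2))
    else:
        # Exploit: pick the variant with the best observed CTR so far.
        observed_ctr = np.divide(clicks, impressions,
                                 out=np.zeros(2), where=impressions > 0)
        arm = int(np.argmax(observed_ctr))
    impressions[arm] += 1
    clicks[arm] += rng.random() < true_ctr[arm]

for i, name in enumerate(["A", "B"]):
    share = impressions[i] / n_visitors
    ctr = clicks[i] / max(impressions[i], 1)
    print(f"Variant {name}: {share:.1%} of traffic, observed CTR {ctr:.4f}")

The trade-off is that a bandit optimizes reward while the test is running, whereas a classic A/B test keeps the allocation fixed to maximize the precision of the final comparison.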

Code Example

import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)

# baseline performance (current CTR of the control)
ctr_base = 0.05
# minimum detectable effect (MDE)
mde = 0.01
ctr_b = ctr_base + mde

# define Type I and Type II error rates
alpha = 0.05
beta = 0.2

# critical z-scores for the chosen alpha and beta
z_alpha_one_sided = norm.ppf(1 - alpha)
z_alpha_two_sided = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(1 - beta)

# pooled CTR and delta CTR
ctr_bar = (ctr_base + ctr_b) / 2
delta_ctr = mde

# required sample size per group
n_one_sided = int(((z_alpha_one_sided + z_beta) ** 2 * 2 * ctr_bar * (1 - ctr_bar)) / (delta_ctr ** 2))
n_two_sided = int(((z_alpha_two_sided + z_beta) ** 2 * 2 * ctr_bar * (1 - ctr_bar)) / (delta_ctr ** 2))

# simulate click counts for each group
clicks_A_one_sided = np.random.binomial(n=n_one_sided, p=ctr_base)
clicks_B_one_sided = np.random.binomial(n=n_one_sided, p=ctr_b)
clicks_A_two_sided = np.random.binomial(n=n_two_sided, p=ctr_base)
clicks_B_two_sided = np.random.binomial(n=n_two_sided, p=ctr_b)

data_one_sided = pd.DataFrame({
    'Ads': ['A', 'B'],
    'Impressions': [n_one_sided, n_one_sided],
    'Clicks': [clicks_A_one_sided, clicks_B_one_sided]
})
data_two_sided = pd.DataFrame({
    'Ads': ['A', 'B'],
    'Impressions': [n_two_sided, n_two_sided],
    'Clicks': [clicks_A_two_sided, clicks_B_two_sided]
})

# hypothesis testing with two-proportion z-tests:
# 'smaller' tests H1: CTR_A < CTR_B (one-sided); 'two-sided' tests H1: CTR_A != CTR_B
z_stat_one_sided, p_value_one_sided = proportions_ztest(data_one_sided['Clicks'], data_one_sided['Impressions'], alternative='smaller')
z_stat_two_sided, p_value_two_sided = proportions_ztest(data_two_sided['Clicks'], data_two_sided['Impressions'], alternative='two-sided')

# print results
print("One-sided test results:")
print(data_one_sided)
print(f"Z-statistic: {z_stat_one_sided}, P-value: {p_value_one_sided}")
if p_value_one_sided < 0.05:
    print("Reject the null hypothesis in one-sided test.")
else:
    print("Fail to reject the null hypothesis in one-sided test.")

print("\n---\n")

print("Two-sided test results:")
print(data_two_sided)
print(f"Z-statistic: {z_stat_two_sided}, P-value: {p_value_two_sided}")
if p_value_two_sided < 0.05:
    print("Reject the null hypothesis in two-sided test.")
else:
    print("Fail to reject the null hypothesis in two-sided test.")

Output

One-sided test results:
  Ads  Impressions  Clicks
0   A         6426     308
1   B         6426     388
Z-statistic: -3.117994752318808, P-value: 0.0009104302281918874
Reject the null hypothesis in one-sided test.

---

Two-sided test results:
  Ads  Impressions  Clicks
0   A         8158     377
1   B         8158     496
Z-statistic: -4.139814221510001, P-value: 3.4758719657070286e-05
Reject the null hypothesis in two-sided test.