A/B Testing Fundamentals
I’ve been running A/B tests for a while now, and I keep coming back to these fundamental concepts. Writing them down helps me think through them more clearly - maybe they’ll be useful for you too.
What A/B Testing Actually Is #
A/B testing is basically a controlled experiment to figure out which version of something works better. The key word here is “controlled” - you’re not just comparing random things, you’re systematically testing one change at a time.
The core characteristics that make it work:
- Two variants: Control (what you have now) and treatment (what you’re testing)
- Random assignment: Users get randomly bucketed into groups
- Measurement: You track specific metrics to see what happens
That random assignment piece is crucial. Without it, you’re just looking at correlations, and we all know correlation doesn’t equal causation.
Why Random Assignment Matters #
Random assignment does three important things:
- Creates representative groups: When done right, your test groups should look like mini versions of your broader user base
- Minimizes bias: Both known and unknown variables get evenly distributed between groups
- Establishes causality: By controlling for everything except your one change, you can actually say that change caused the difference you observed
Control vs Treatment #
This is straightforward but worth being explicit about:
- Control: The current state, your baseline, what users see today
- Treatment: The new thing you’re testing, hopefully an improvement
When A/B Testing Makes Sense (And When It Doesn’t) #
A/B testing is great for:
Conversion Rate Optimization
- Testing checkout flows, signup processes, email campaigns
- Optimizing calls-to-action, form designs, landing pages
Feature Releases
- Rolling out new functionality gradually
- Validating UI/UX changes before full launch
Product Impact Measurement
- Understanding how changes affect key business metrics
- Testing marketing campaigns and pricing strategies
But it’s not always the right tool:
- When you don’t have enough traffic - If you can’t reach statistical significance in a reasonable timeframe, don’t bother
- Ethical concerns - Some tests could harm users or feel manipulative
- No clear hypothesis - If you can’t articulate what you expect to happen and why, you’re not ready to test
- Resource constraints - Sometimes the cost of setting up and running the test outweighs the potential benefit
Setting Up Tests Properly #
The Process I Follow #
- Define the goal clearly - What are you trying to improve and why?
- Form a testable hypothesis - “Based on X observation, if we change Y, then Z will happen because of reason R”
- Choose your metrics - Primary metric (the thing you care most about), secondary metrics (other things worth tracking), guardrail metrics (things that shouldn’t get worse)
- Calculate sample size - How many users do you need to detect a meaningful difference?
- Run the test - Resist the urge to peek at results early
- Analyze and act - Look at the data, make a decision, document what you learned
Picking Good Metrics #
I think about metrics in three categories:
Primary/North Star Metrics: The big picture stuff that directly reflects business value. Revenue per visitor, conversion rate, user retention.
Granular Metrics: More specific user behaviors that are easier to move and give you insight into what’s happening. Click-through rate, time on page, signup rate.
Guardrail Metrics: Things you don’t want to break while optimizing other things. Site speed, error rates, user satisfaction scores.
A good metric is:
- Stable: Doesn’t fluctuate wildly for no reason
- Sensitive: Actually responds when you make meaningful changes
- Measurable: You can track it accurately and consistently
- Non-gameable: Hard to manipulate without creating real value
Understanding Statistical Concepts #
The Math You Need to Know #
Significance level (α): Usually set at 5%. This is your false positive rate - how often you’ll conclude there’s a difference when there isn’t one.
Statistical Power (1-β): Usually set at 80%. This is your ability to detect a real difference when it exists.
Minimum Detectable Effect (MDE): The smallest change you care about detecting. Smaller effects require bigger sample sizes.
The relationship between these determines your sample size. You can’t optimize all three - if you want to detect smaller effects with higher confidence, you need more data.
Effect Size and Cohen’s Measures #
Effect size tells you not just whether there’s a difference, but how big that difference is. Cohen’s measures standardize this:
For proportions (like conversion rates): Cohen’s h
- Formula: h = 2 × (arcsin(√P₂) - arcsin(√P₁))
For means (like revenue per user): Cohen’s d
- Formula: d = (μ₂ - μ₁)/σ, where σ is the pooled standard deviation (both effect sizes are sketched in code after this list)
The interpretation is the same for both:
- 0.2 = Small effect
- 0.5 = Medium effect
- 0.8 = Large effect
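To make these concrete, here’s a minimal Python sketch of both formulas (the function names are mine, and the numbers in the second call are made up for illustration):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2 * (arcsin(sqrt(p2)) - arcsin(sqrt(p1)))."""
    return 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))

def cohens_d(mean1: float, mean2: float, pooled_sd: float) -> float:
    """Cohen's d for two means: (mean2 - mean1) / pooled standard deviation."""
    return (mean2 - mean1) / pooled_sd

print(round(cohens_h(0.10, 0.11), 4))        # ~0.0326: well below 0.2, a small effect
print(round(cohens_d(50.0, 54.0, 20.0), 2))  # 0.2: right at the "small" threshold
```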
Sample Size Calculation #
Once you have your effect size, you can calculate sample size per variant: n = 2(Zα + Zβ)² / effect_size², where Zα is the two-sided critical value (about 1.96 at α = 5%) and Zβ corresponds to your power (about 0.84 at 80%). The sketch after the example below works through the same numbers.
Example: Testing a checkout flow change
- Baseline conversion: 10%
- Target conversion: 11% (10% relative improvement)
- Cohen’s h ≈ 0.033 (small effect)
- Sample size needed: ~14,728 per variant
- With 5,000 daily visitors: ~6 day test
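Here’s a rough Python sketch of that calculation, assuming SciPy is available for the normal quantiles (the function name is mine):

```python
from math import asin, ceil, sqrt

from scipy.stats import norm  # assumed dependency for the z quantiles

def sample_size_per_variant(p_baseline: float, p_target: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """n per variant = 2 * (z_alpha/2 + z_beta)^2 / h^2, with h = Cohen's h."""
    h = 2 * (asin(sqrt(p_target)) - asin(sqrt(p_baseline)))
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for a two-sided 5% test
    z_beta = norm.ppf(power)           # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / h ** 2)

n = sample_size_per_variant(0.10, 0.11)
print(n)                   # ~14,7xx per variant (matches the figure above up to rounding)
print(ceil(2 * n / 5000))  # ~6 days at 5,000 visitors per day
```

Power calculators and libraries like statsmodels package the same math, so a cross-check there should land in the same ballpark.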
Common Pitfalls and How to Avoid Them #
Multiple Comparisons Problem #
When you test multiple metrics simultaneously, your false positive rate inflates. If you test 4 metrics at 5% significance each, your overall false positive rate jumps to about 19%.
Solutions:
- Bonferroni correction: Divide your significance level by number of tests
- Hierarchy approach: Have one primary metric, treat others as secondary
- Accept the tradeoff: Sometimes a slightly higher false positive rate is acceptable
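The arithmetic behind that 19% figure (assuming the four tests are independent) and the Bonferroni adjustment is short enough to show directly:

```python
def family_wise_error(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

alpha, k = 0.05, 4
print(round(family_wise_error(alpha, k), 3))  # 0.185, i.e. about 19%
print(alpha / k)                              # Bonferroni: test each metric at 0.0125
```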
Sample Ratio Mismatch (SRM) #
This happens when your user split deviates from what you planned (e.g., 48/52 instead of 50/50). It usually indicates a problem with your randomization.
Check for SRM using a chi-square test. If you find it, investigate:
- Randomization algorithm bugs
- Delayed start times for variants
- Data logging issues
- Bot filtering differences
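Here’s a sketch of that chi-square check with SciPy, using the 48/52 split from above as hypothetical counts; the p < 0.001 alert threshold is a common convention for SRM checks, not a hard rule:

```python
from scipy.stats import chisquare  # assumed dependency

observed = [48_000, 52_000]            # hypothetical counts: a 48/52 split
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # what a 50/50 split would have produced

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.3g}): investigate before trusting the results")
else:
    print(f"Split is consistent with 50/50 (p = {p_value:.3f})")
```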
Peeking at Results #
Every time you check results before reaching your planned sample size, you increase your false positive rate. What feels like 5% significance becomes ~14% if you peek 5 times.
Solutions:
- Set a test duration and stick to it
- Use sequential analysis methods if you must monitor
- Only stop early for dramatic effects or serious problems
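If you want to convince yourself of the inflation, a quick A/A simulation does it. This is only a sketch (the sizes and seed are arbitrary), and the exact number wobbles from run to run, but it lands near that ~14% figure:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_per_arm, n_peeks = 4_000, 5_000, 5
z_crit = 1.96  # the nominal two-sided 5% threshold

false_positives = 0
for _ in range(n_sims):
    # A/A test: both arms come from the same distribution, so any "win" is a false positive.
    a = rng.normal(0.0, 1.0, n_per_arm)
    b = rng.normal(0.0, 1.0, n_per_arm)
    for peek in range(1, n_peeks + 1):
        n = n_per_arm * peek // n_peeks  # equally spaced interim looks
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        if abs(diff / se) > z_crit:
            false_positives += 1
            break

print(false_positives / n_sims)  # typically ~0.14 instead of the nominal 0.05
```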
External Validity Issues #
Simpson’s Paradox: Results can flip when you segment your data. Always check if your overall results hold within important user segments.
Novelty Effects: Users might try new things just because they’re new. Run tests long enough for behavior to stabilize.
Change Aversion: The opposite problem - users might avoid new features initially even if they’re better.
Advanced Considerations #
Ratio Metrics and the Delta Method #
When your metric is a ratio (like revenue per session or clicks per view), standard t-tests don’t work properly because both the numerator and denominator vary from sample to sample, and they’re usually correlated.
The delta method accounts for this by considering the covariance between numerator and denominator. For a ratio R = X/Y:
Var(R) ≈ (1/μy²)[σx² + R²σy² - 2Rσxy]
This matters because two scenarios with the same ratio can have very different confidence intervals depending on the underlying variance.
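One way to read the formula above is to treat σx², σy², and σxy as the per-user variances and covariance, so the sample-mean terms each pick up a 1/n. Here’s a rough sketch under that reading; the synthetic data and helper name are mine:

```python
import numpy as np

def delta_method_ratio_se(x: np.ndarray, y: np.ndarray) -> float:
    """Approximate standard error of R = mean(x) / mean(y) via the delta method."""
    n = len(x)
    r = x.mean() / y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    # Var(R) ~ (1/mu_y^2) * [Var(x_bar) + R^2 * Var(y_bar) - 2R * Cov(x_bar, y_bar)],
    # where Var(x_bar) = var_x / n, etc.
    var_r = (var_x + r**2 * var_y - 2 * r * cov_xy) / (n * y.mean() ** 2)
    return float(np.sqrt(var_r))

rng = np.random.default_rng(7)
sessions = rng.poisson(3, size=10_000) + 1                   # hypothetical sessions per user
revenue = rng.gamma(2.0, 5.0, size=10_000) * sessions * 0.1  # hypothetical revenue per user
print(delta_method_ratio_se(revenue, sessions))              # SE of revenue per session
```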
Handling Multiple Metrics #
In practice, you usually care about more than one metric. Here’s how I approach it:
- Establish hierarchy: One primary metric, several secondary ones, guardrails
- Account for correlations: Related metrics will move together
- Set decision rules upfront: What combination of results will make you ship?
- Use statistical corrections when appropriate: But don’t be overly conservative
Practical Implementation #
Infrastructure You’ll Need #
User Assignment:
- Reliable randomization algorithm
- Consistent user identification (cookies, user IDs)
- Proper traffic allocation controls
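For the assignment piece, one common approach is deterministic hashing of the user ID together with the experiment name, so the same user always sees the same variant and different experiments get independent splits. A minimal sketch (the function name and experiment key are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic bucketing: the same (user, experiment) pair always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-12345", "checkout-flow-v2"))  # stable across calls and servers
```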
Data Collection:
- Event logging for all relevant metrics
- Data quality monitoring
- Real-time dashboards for monitoring
Analysis Tools:
- Statistical testing capabilities
- Confidence interval calculations
- Segmentation and drilling-down features
Quality Assurance #
A/A Tests: Run identical experiences to two groups to validate your setup. If you see significant differences, you have a problem.
Balance Checks: Verify that your randomization worked by checking that user characteristics are balanced between groups.
Data Quality Monitoring: Watch for unusual patterns, missing data, or technical issues that could bias results.
Making Decisions #
At the end of the day, A/B testing is about making better decisions. Here’s my framework:
- Statistical significance: Is the difference real or likely due to chance?
- Practical significance: Is the difference big enough to matter?
- Business context: Does this align with your strategy and priorities?
- Risk assessment: What are the downsides if you’re wrong?
- Implementation cost: Is the benefit worth the effort to build and maintain?
Key Takeaways #
A/B testing is a powerful tool, but it’s not magic. It requires:
- Clear thinking about what you’re trying to learn
- Proper statistical setup and execution
- Careful interpretation of results
- Good judgment about when and how to act
The math is important, but the thinking is more important. Start with a clear hypothesis, design a clean test, and be honest about what the data tells you.
Most importantly: A/B testing is about learning, not just optimizing. Even “failed” tests teach you something valuable about your users and your product.