Running an A/B test without enough data isn't just a waste of time; it's actively misleading. A result that looks like a 20% improvement after 80 visitors is almost certainly noise. Act on it, and you may ship a change that hurts your conversions while believing you've improved them. Understanding sample size is the difference between testing that builds a real advantage and testing that creates false confidence.

The most common A/B testing mistake: stopping too early

Most teams stop their tests too early. It's human nature: you launch a test, check the dashboard the next morning, and one variant is already ahead by 15%. Excitement kicks in. You call the winner and move on.

This is called the peeking problem, and it's one of the most well-documented errors in experimentation. The core issue: early in a test, random fluctuations in conversion rates look like real differences. With small sample sizes, coin flips can look like loaded coins. If you check your results repeatedly and stop whenever one variant looks ahead, you dramatically inflate the chance of a false positive, sometimes to 30-40%, even though a 95% significance threshold nominally allows only 5%.
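The inflation is easy to see in a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares stopping at the first significant look against judging only at the end. The function names, the 10% rate, the 20 looks, and the batch size are illustrative assumptions, and the p-value comes from a standard pooled two-proportion z-test rather than any particular tool's method.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=400, looks=20, visitors_per_look=100,
                      rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, rate when judged once at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        p = 1.0
        for _ in range(looks):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                ever_significant = True  # a peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += p < alpha
    return peeking_hits / n_tests, final_hits / n_tests


if __name__ == "__main__":
    peeking, final_only = simulate_aa_tests()
    print(f"false positives when stopping at any look: {peeking:.1%}")
    print(f"false positives when judging once at the end: {final_only:.1%}")
```

Judging once at the end should land near the nominal 5%, while stopping at any significant look should come out well above it. The exact figures depend on the seed and on how many looks you allow.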

The fix is to decide your required sample size before you start, and not call a winner until you've reached it.

What is statistical significance?

A result is statistically significant when it would be unlikely to arise from random chance alone. The industry standard is 95% confidence (written as p < 0.05), which means: if there were truly no difference between your two variants, you'd see a result this extreme only 5% of the time by chance alone.

In other words, at 95% confidence, you're accepting a 1-in-20 chance of calling a winner when the variants actually perform the same. That's a false positive, also called a Type I error. For most business decisions, 95% is an acceptable threshold. For high-stakes changes (pricing, checkout flow, legal copy) you might want 99%.

Statistical significance doesn't tell you the effect is large or important. It tells you only that the difference is unlikely to be noise. A 0.1% conversion improvement can be statistically significant with enough traffic, but it may not be worth building.
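To make the gap between "significant" and "important" concrete, here is a minimal sketch that applies a pooled two-proportion z-test to the same small lift at two traffic levels. The function name and the visitor counts are made up for illustration, and the z-test is one common choice, not necessarily what your testing tool uses.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same lift (5.00% to 5.05%) at two very different traffic levels.
p_small = two_proportion_p_value(500, 10_000, 505, 10_000)
p_huge = two_proportion_p_value(500_000, 10_000_000, 505_000, 10_000_000)

print(f"10,000 visitors per variant:     p = {p_small:.3f}")
print(f"10,000,000 visitors per variant: p = {p_huge:.2e}")
```

With 10,000 visitors per variant the lift is indistinguishable from noise. With ten million it clears p < 0.05 easily, yet it is still the same 0.05 percentage point improvement.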

The three variables that determine sample size

You don't need to understand the full statistics to use sample size correctly; you just need to know the three inputs:

1. Baseline conversion rate

Your current conversion rate before any changes. If 4 out of every 100 visitors buy something, your baseline is 4%. Lower baselines require larger samples to detect changes reliably.

2. Minimum Detectable Effect (MDE)

The smallest improvement you actually care about. If your conversion rate is 4% and you'd only act on a relative improvement of at least 10% (from 4% to 4.4%), that's your MDE. The smaller the improvement you want to detect, the more traffic you need. This is the variable people most often get wrong: they set the MDE too low, which demands enormous samples, or too high, which misses smaller real improvements.

A practical approach: ask yourself "what's the smallest lift that would make this change worth keeping permanently?" That's your MDE.

3. Confidence level

Usually 95%. Higher confidence (99%) requires more traffic. Lower (90%) requires less but accepts more false positives.

A useful rule of thumb

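For a 5% baseline conversion rate and a 10% relative MDE (detecting a lift from 5.0% to 5.5%), you need roughly 15,000 visitors per variant, or 30,000 total. At 1,000 daily visitors split evenly, that's a 30-day test. Smaller MDEs or lower baselines push this higher fast. A 20% relative MDE at the same baseline drops the requirement to roughly 4,000 per variant. These figures assume 95% confidence with no extra margin for statistical power, so a true lift of exactly the MDE would be detected only about half the time. Calculators that target the conventional 80% power roughly double them, to about 31,000 and 8,000 per variant.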

Always use a sample size calculator to get the exact number for your situation rather than guessing.
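The arithmetic behind a typical calculator can be sketched with the standard normal-approximation formula for comparing two proportions. The function name and parameters below are illustrative. The `power` argument is an assumption not discussed above: most calculators default to 80%, while the rule-of-thumb figures in this article appear to match `power=0.5`, which is an inference from the numbers rather than something the article states.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors needed in each variant of a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)  # 0 when power is 0.5
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    daily_visitors = 1_000  # hypothetical traffic, split evenly across two variants
    for power in (0.5, 0.8):
        n = visitors_per_variant(0.05, 0.10, power=power)
        days = math.ceil(2 * n / daily_visitors)
        print(f"power={power}: {n:,} per variant, about {days} days")
```

Treat the output as a sanity check on a real calculator, not a replacement for one. Different tools use slightly different variance formulas and will not agree to the last visitor.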

Practical guidelines for running clean tests

When to call it

Call a winner when both conditions are met:

  1. You've reached your pre-defined sample size.
  2. You've run for at least your minimum time window (typically 2 weeks).

If you've hit both conditions and the result still isn't statistically significant, that's information too. It means either the difference is real but smaller than your MDE, or there's no meaningful difference. Either way, that's a valid, actionable conclusion: you just don't have a winner, and you should move on to testing something else.
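The stopping rule above fits in a few lines. This is a minimal sketch with hypothetical names. The 14-day default mirrors the "typically 2 weeks" guideline and the 0.05 threshold mirrors 95% confidence, so adjust both to your own plan.

```python
def verdict(visitors_per_variant, required_per_variant, days_run, p_value,
            min_days=14, alpha=0.05):
    """Apply the two stopping conditions before looking at significance."""
    # Until both conditions are met, the p-value should not drive a decision.
    if visitors_per_variant < required_per_variant or days_run < min_days:
        return "keep running"
    if p_value < alpha:
        return "winner"
    # Enough data and still not significant: a valid result, just no winner.
    return "no winner"


print(verdict(6_000, 15_000, days_run=6, p_value=0.01))    # keep running
print(verdict(15_200, 15_000, days_run=16, p_value=0.03))  # winner
print(verdict(15_200, 15_000, days_run=16, p_value=0.40))  # no winner
```

Note that the first call returns "keep running" despite a p-value of 0.01. That is exactly the early result the peeking problem warns against acting on.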

What p-value actually means

The p-value is one of the most misunderstood concepts in statistics, but it's simpler than it sounds. A p-value of 0.04 means: if there were truly no difference between variant A and variant B, we'd see a result this extreme (or more) only 4% of the time by chance.

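What a p-value does not mean: it is not the probability that the winning variant is truly better, and it says nothing about how large the improvement is.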

Treat statistical significance as a signal to act, not as proof of a specific magnitude of improvement. Always look at the confidence interval alongside the p-value to understand the realistic range of the true effect.
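As a rough illustration of reading the interval alongside the p-value, the sketch below computes a normal-approximation (Wald) confidence interval for the difference between two conversion rates. The function name and the visitor counts are made up, and other interval methods exist that behave better at very low conversion counts.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in rates (B minus A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled standard error: each variant contributes its own variance.
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se


# Hypothetical test: 5.0% vs 5.7% conversion with 15,000 visitors per variant.
low, high = lift_confidence_interval(750, 15_000, 855, 15_000)
print(f"observed lift: {855 / 15_000 - 750 / 15_000:.2%} (absolute)")
print(f"95% interval:  {low:.2%} to {high:.2%}")
```

In this example the interval excludes zero, so the result is significant, but it runs from a fraction of a percentage point to more than a full point. The observed 0.7 point lift is the best single guess, not a guaranteed outcome.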
