
Did your A/B test actually win?

Enter visitors and conversions for each variant. Get the p-value, the relative uplift, and a plain-English call on whether to ship variant B or keep running the test.


The six numbers, briefly

What each metric actually means.

A/B testing is the difference between "I think this is better" and "I know this is better." Six metrics decide whether you have enough evidence to ship a change confidently or keep collecting data.

P-value Significance

p < 0.05 means significant at 95%

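The probability of seeing a difference at least as large as yours if A and B actually converted at the same rate. p = 0.03 means that with no real difference, a result this extreme would turn up only about 3% of the time. Standard threshold for shipping is p < 0.05. High-stakes decisions use p < 0.01.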

Relative uplift Effect size

= (B rate − A rate) ÷ A rate

How much better (or worse) B is than A, in percent. 5% to 6% is +20% relative. Marketers report relative uplift because it scales the impact: a 20% lift on a 5% baseline is meaningful, a 20% lift on a 0.5% baseline barely moves the needle in absolute terms.

Absolute uplift Effect size

= B rate − A rate (in percentage points)

The raw difference between the two rates. 5% vs 6% is +1 pp absolute. Statisticians prefer absolute; product folks prefer relative. They are two views of the same effect; the math (and the required sample size) is driven by the absolute difference.
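As a minimal sketch of the two uplift formulas above, assuming raw visitor and conversion counts as input (the function name `uplift` is illustrative, not part of the calculator):

```python
def uplift(conv_a, n_a, conv_b, n_b):
    """Return (absolute uplift in percentage points, relative uplift in percent)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    absolute_pp = (rate_b - rate_a) * 100
    # Relative uplift is undefined when the control has no conversions.
    relative_pct = (rate_b - rate_a) / rate_a * 100 if rate_a > 0 else float("nan")
    return absolute_pp, relative_pct


# 5% vs 6%: roughly +1 pp absolute, +20% relative.
print(uplift(50, 1000, 60, 1000))
# 0.5% vs 0.6%: the same +20% relative, but only about +0.1 pp absolute.
print(uplift(5, 1000, 6, 1000))
```

The second call shows why the same relative lift can mean very different things depending on the baseline.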

Required sample size Power

Lehr: ≈ 16 × p̄(1−p̄) ÷ δ²

Visitors per variant needed at 95% confidence and 80% power. p̄ is the average of the two conversion rates and δ is the absolute difference you want to detect. Smaller effects need way bigger samples: at the same baseline, detecting a +10% relative lift takes ~4x more traffic than a +20% lift, and ~16x more than a +40% lift.
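A rough sketch of Lehr's approximation in code, assuming a 5% baseline for illustration (the helper name `lehr_sample_size` is hypothetical and this is not necessarily how the calculator implements it):

```python
import math


def lehr_sample_size(rate_a, rate_b):
    """Approximate visitors per variant (two-sided 95% confidence, 80% power)."""
    p_bar = (rate_a + rate_b) / 2   # average of the two rates
    delta = rate_b - rate_a         # absolute difference to detect
    if delta == 0:
        raise ValueError("rates must differ to size a test")
    return math.ceil(16 * p_bar * (1 - p_bar) / delta ** 2)


baseline = 0.05
for lift in (0.10, 0.20, 0.40):
    n = lehr_sample_size(baseline, baseline * (1 + lift))
    print(f"+{lift:.0%} relative lift: about {n:,} visitors per variant")
```

Halving the lift roughly quadruples the sample because δ is squared in the denominator; the ratios are not exactly 4x and 16x because p̄ also shifts slightly with the target rate.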

Confidence level Threshold

95% = max 5% false positive rate

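How strict the significance threshold is. At 95% confidence, a test with no real difference between A and B will still come out "significant" about 5% of the time. 95% is the standard. 99% is stricter (for high-cost decisions). 90% is looser (for low-stakes iteration). Higher confidence requires more data.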

Statistical power Threshold

80% = max 20% miss rate

The probability of detecting a real effect when it exists. 80% is standard. With less power, you miss real winners; with more, you need bigger samples. The sample-size formula bakes in 80% power; if you want 90% power, multiply the sample size by about 1.3.
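The constant 16 in Lehr's formula is a rounding of 2 × (z for confidence + z for power)². The sketch below recomputes that multiplier for other settings using Python's standard library; the function name `sample_size_multiplier` is an illustrative assumption, not something the calculator exposes.

```python
from statistics import NormalDist


def sample_size_multiplier(confidence=0.95, power=0.80):
    """Constant that replaces Lehr's 16 for a two-sided test."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_power = NormalDist().inv_cdf(power)          # about 0.84 at 80% power
    return 2 * (z_alpha + z_power) ** 2


m80 = sample_size_multiplier(0.95, 0.80)
m90 = sample_size_multiplier(0.95, 0.90)
m99 = sample_size_multiplier(0.99, 0.80)
print(f"95% confidence, 80% power: {m80:.1f}")
print(f"90% power costs {m90 / m80:.2f}x the sample of 80% power")
print(f"99% confidence costs {m99 / m80:.2f}x the sample of 95% confidence")
```

This makes both trade-offs on the cards concrete: raising either confidence or power increases the multiplier, and with it the traffic you need.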

Three rules every A/B tester should know

2 weeks

minimum test duration. Covers weekday vs weekend behavior. Shorter tests miss temporal effects and inflate false positives.

No peeking

Pre-commit to a sample size, then look ONCE at the end. Peeking and stopping the moment p < 0.05 can inflate false positives above 30%.

+20%

typical minimum detectable effect for a meaningful test. Smaller effects require enormous samples; usually not worth the wait unless the change is cheap.

Sources: ConversionXL Statistics Guide, Evan Miller's "How Not to Run an A/B Test," Optimizely / VWO documentation.

Common questions

Honest answers.

What does statistically significant mean?

Statistically significant means the difference you observed between variant A and variant B is unlikely to be the result of random chance alone. The standard threshold is 95% confidence, which corresponds to a p-value below 0.05. When p is below 0.05, you can reasonably claim variant B is genuinely different (better or worse) than variant A. When p is 0.05 or higher, you do not have enough evidence to call a winner; if the test has not yet reached its planned sample size, keep running it.

What is the p-value?

The p-value is the probability of observing your result (or a more extreme one) if there were actually no real difference between A and B. A p-value of 0.03 means that, if A and B truly converted at the same rate, a difference at least this large would show up only about 3% of the time. It is not the probability that your result is noise. The accepted threshold for declaring a winner is p < 0.05 (95% confidence). For high-stakes decisions, use p < 0.01 (99%).

What statistical test does this tool use?

A two-tailed, two-proportion z-test, the standard test for comparing two conversion rates. It assumes each visitor's conversion is independent and that you have enough sample size (at least ~30 conversions per variant) for the normal approximation to be reasonable. For very small samples or rare events, exact tests like Fisher's exact test are technically more accurate, but the z-test is the usual choice for simple A/B test calculators like this one.
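For readers who want to check the numbers by hand, here is a minimal sketch of a two-tailed, two-proportion z-test. It assumes the common pooled-variance form; the page does not say which variant the tool uses, and the function name is illustrative.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for B's conversion rate vs A's."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best single estimate if A and B truly converted alike.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no conversions (or all conversions): nothing to test
    z = (rate_b - rate_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-tailed p-value.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# 5% vs 6% on 1,000 visitors each: p is about 0.33, not significant.
print(two_proportion_z_test(50, 1000, 60, 1000))
# The same rates on 10,000 visitors each: p falls well below 0.01.
print(two_proportion_z_test(500, 10000, 600, 10000))
```

The two calls use identical conversion rates; only the sample size changes the verdict.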

How big a sample size do I need?

Depends on your baseline conversion rate and the minimum uplift you care about detecting. A rough rule (Lehr's approximation): sample size per variant equals 16 times the average rate times (1 minus the average rate) divided by the squared absolute difference you want to detect. So at a 5% baseline rate trying to detect a 20% relative lift (5% to 6%, absolute difference 0.01), you need around 16 × 0.055 × 0.945 / 0.0001 ≈ 8,300 visitors per variant. The tool calculates this for you based on your observed numbers.

How long should I run an A/B test?

Long enough to hit the required sample size AND at least one full business cycle (typically 1-2 weeks). Two weeks accounts for weekday vs weekend behavior, payday cycles, and other temporal patterns that might bias a shorter test. Stopping a test early (peeking) because it looks significant inflates false positives; commit to a duration up front.
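One way to turn this into a planned duration, sketched under stated assumptions: traffic is split evenly across variants, `daily_visitors` is your own estimate of total daily traffic to the test, and rounding up to whole weeks is an illustrative choice for covering full weekday/weekend cycles rather than a rule from the sources.

```python
import math


def planned_test_days(n_per_variant, daily_visitors, variants=2, min_days=14):
    """Days to run: enough traffic, whole weeks, and at least min_days."""
    days_for_sample = math.ceil(n_per_variant * variants / daily_visitors)
    whole_weeks = math.ceil(days_for_sample / 7)
    return max(whole_weeks * 7, min_days)


# About 8,300 visitors per variant at 1,000 visitors a day: 3 full weeks.
print(planned_test_days(8316, 1000))
# A tiny sample is still held to the two-week minimum.
print(planned_test_days(500, 1000))
```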

What if variant B is worse than variant A?

The same math applies; you are now looking for evidence that B is significantly worse. If the p-value is below 0.05 and B has a lower rate, you can confidently say variant B underperforms. Keep variant A (your control) live. If the p-value is above 0.05, the test is inconclusive: B might be neutral or slightly worse, but you cannot say for sure without more data.

Can I stop the test as soon as it hits significance?

No. This is the most common A/B testing mistake. P-values fluctuate during a test; checking repeatedly and stopping the first time you see p < 0.05 inflates false positives dramatically (the actual error rate can climb above 30%). Pick the sample size up front based on your minimum detectable effect, run the test until you hit it, then look once. Bayesian methods and sequential testing exist to handle peeking, but the simplest rule is: decide duration in advance, then resist the urge to look until it ends.
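The inflation is easy to reproduce with a small simulation. The sketch below runs A/A tests (both variants share the same true rate, so every "significant" result is a false positive) and compares stopping at the first p < 0.05 across 20 interim looks with looking once at the end. The run count, batch size, 10% rate, and seed are arbitrary assumptions chosen to keep it fast.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(runs=500, looks=20, batch=100, rate=0.10, seed=1):
    """Return (false positive rate with peeking, rate with a single final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            # Each look adds one batch of visitors to both variants.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if p_value(conv_a, n, conv_b, n) < 0.05:
                ever_significant = True  # a peeker would have stopped here
        peeking_hits += ever_significant
        fixed_hits += p_value(conv_a, n, conv_b, n) < 0.05
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.0%}")
print(f"false positives with one look: {fixed_rate:.0%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher; more looks push it higher still.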

Is my data kept private?

Yes. Everything runs in your browser. Nothing you enter is sent to systeme.io or any other server, stored, or logged. You can verify in DevTools by watching the network tab while you use the calculator.

Ship winning variants on systeme.io

Run your funnels on systeme.io.

Build landing pages, sales funnels, online courses, email automations, and affiliate programs on one platform. Test variants, ship the winners. Free plan, 2,000 contacts.