
A/B testing

A/B testing replaces "I think this version is better" with "the data shows it is." Done well, it is the most reliable way to improve results. Done badly, it produces confident, wrong conclusions. Here is how to do it right, statistics and all.

12 min read · Updated June 2026

What A/B testing is

A/B testing is showing two versions of something to two randomly split groups at the same time, then measuring which one performs better on a goal you chose in advance.

Version A is the control, your current version, and version B is the variant with one change. You split live traffic between them, let the results come in, and the version that wins on your chosen metric becomes the new standard. Stripped to its essence, an A/B test is a small randomized controlled experiment, the same logic science uses, applied to a headline or a checkout page. As Harvard Business Review puts it, it is one of the simplest forms of a randomized experiment, and it has been around for roughly a century.

Its real value is what it replaces: opinion. Without testing, decisions default to whoever argues hardest or to the most senior voice in the room, the highest-paid person's opinion (often shortened to HiPPO). A/B testing settles those arguments with evidence from your actual audience. It is the experimental engine inside the broader discipline of conversion rate optimization: CRO is the whole practice of improving conversions, and A/B testing is how you prove that any given idea actually works before you commit to it.

How it works

The mechanics are simple, and each one exists for a reason. Get any of them wrong and the result stops meaning anything.

A control and a variant. Version A is what you have now; version B changes one thing. Everything else stays identical.
One variable. Change a single element so any difference can be attributed to that change, not a dozen at once.
Random, simultaneous split. Visitors are assigned at random and both versions run at once, so seasonality and day-of-week do not skew it.
One primary metric. Pick the single goal that defines success, like conversion rate, before the test starts.

Underneath all four is one idea: control everything except the thing you are testing, so the result is attributable. If version B changes the headline, the image, and the button all at once and it wins, you have learned that something worked but not what, which is almost useless for the next decision. And it all rests on a hypothesis: a stated prediction like "changing the headline to lead with the outcome will lift signups, because visitors care about the result more than the features." A test without a hypothesis is just a guess you happened to measure.
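
The random, simultaneous split described above can be sketched in a few lines. This is a minimal illustration, not how any particular testing tool works: the function name `assign_variant`, the experiment label `headline-test`, and the choice of SHA-256 hashing are all assumptions made for the example. Hashing the visitor ID keeps the assignment effectively random across visitors while guaranteeing that a returning visitor always sees the same version.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "headline-test") -> str:
    """Assign a visitor to 'A' (control) or 'B' (variant) with a 50/50 split.

    The experiment name is mixed into the hash so that different tests
    split the same audience independently of one another.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "A" if bucket < 50 else "B"


# Hypothetical visitor IDs, just to check the split is close to even.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1

print(counts)
```

Because both versions are served during the same period, a holiday or a slow Tuesday affects A and B alike.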

The statistics, honestly

This is where most A/B testing goes wrong, so it is worth getting right. You do not need to be a statistician, but you do need three ideas: significance, sample size, and why you must not stop early.

26%: false-positive rate when you peek and stop early, versus the 5% you think you have. (Evan Miller)
~1 in 7: A/B tests that produce a clear winner; most come back flat or inconclusive. (VWO, via NN/g)
1–2 wks: minimum to run a test, to cover full business cycles including weekends. (Industry consensus)

Statistical significance

Significance measures how unlikely your result would be if it were only random luck. The standard is 95% confidence, and here is the part nearly every blog gets wrong: 95% confidence does not mean there is a 95% chance your variant is better. It means that if there were truly no difference between the two versions, pure chance would produce a gap at least as large as the one you observed less than 5% of the time. So significance tells you a result is unlikely to be a fluke. It does not tell you that you are 95% certain B wins. Hold that distinction and you are already ahead of most testers.
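
To make that definition concrete, here is a minimal sketch of one common way to compute significance for conversion rates, a two-sided two-proportion z-test. The guide does not say which test any given tool uses, so treat the method, the function name `two_proportion_p_value`, and the example numbers as assumptions for illustration.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, p_value) for a two-sided two-proportion z-test.

    The p-value is the chance of seeing a gap at least this large
    if the two versions truly converted at the same rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the "no difference" assumption both arms share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical result: control 200/5000 (4.0%), variant 250/5000 (5.0%).
z, p = two_proportion_p_value(200, 5000, 250, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
print("significant at 95%" if p < 0.05 else "not significant at 95%")
```

A p-value below 0.05 is what "significant at 95% confidence" means here. It is a statement about how surprising the data would be under no difference, not the probability that B is better.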

Sample size

You cannot judge significance without enough data, and "enough" is something you calculate before you start, not eyeball as you go. The required sample size depends on your current conversion rate and the smallest improvement you care about detecting, called the minimum detectable effect. The catch is that detecting a small change, say a 2% lift, needs a far larger sample than detecting a big one, which is exactly why low-traffic sites should test bold changes rather than tiny tweaks. Free sample-size calculators, like Evan Miller's or Optimizely's, do the math for you; the discipline is using one before you launch.
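
For readers who want to see roughly what such a calculator does, here is a sketch using a standard normal-approximation formula for comparing two proportions. The 80% power setting is an assumption not stated in this guide, as are the function name and the example rates, and real calculators may use slightly different formulas, so expect small differences in the output.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed in EACH version.

    baseline      -- current conversion rate, e.g. 0.05 for 5%
    relative_lift -- minimum detectable effect, e.g. 0.20 for a 20% relative lift
    alpha         -- 0.05 corresponds to the usual 95% confidence
    power         -- chance of detecting the lift if it is real (assumed 80%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical page converting at 5%.
bold = sample_size_per_variant(0.05, 0.20)   # bold change: 5% -> 6%
small = sample_size_per_variant(0.05, 0.02)  # small tweak: 5% -> 5.1%
print(f"20% relative lift: about {bold:,} visitors per version")
print(f"2% relative lift: about {small:,} visitors per version")
```

The required sample grows with roughly the inverse square of the effect size, so shrinking the lift you want to detect by a factor of ten multiplies the traffic you need by about a hundred.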

Why you must not stop early

This is the single most important rule, and the most broken. The significance calculation assumes you fixed the sample size in advance. If you watch a running test and stop it the moment it crosses the significance line, you break that assumption and flood your results with false positives. Evan Miller's well-known analysis showed that checking after every visitor and stopping at the usual threshold produces a real false-positive rate around 26%, more than five times the 5% you believe you have. The fix is simple discipline: set your sample size and duration before you begin, and wait until the test is genuinely finished before you trust the numbers. Run for at least one to two full business cycles, so weekday and weekend behavior are both represented.
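
You can see the effect of peeking for yourself with a small simulation. The sketch below runs A/A tests, where both versions are identical and every "winner" is a false positive, and compares checking once at the end with checking after every visitor. All parameters (400 experiments, 1,000 visitors per version, a 10% conversion rate, peeking from the 50th visitor on) are assumptions chosen so it runs quickly; it is not a reproduction of Evan Miller's analysis, and the exact rate it prints will differ from his 26%.

```python
import math
import random


def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """z statistic for two versions that have each received n visitors."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation yet, so nothing to compare
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b / n - conv_a / n) / std_err


def false_positive_rate(peek: bool, experiments: int = 400,
                        visitors_per_arm: int = 1000, rate: float = 0.10,
                        min_n: int = 50, seed: int = 7) -> float:
    """Share of A/A tests declared 'significant' at the 95% level."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        for n in range(1, visitors_per_arm + 1):
            # Both versions convert at the SAME true rate.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            at_end = n == visitors_per_arm
            if (peek and n >= min_n) or at_end:
                if abs(z_score(conv_a, conv_b, n)) > 1.96:
                    false_positives += 1
                    break  # the tester stops and declares a winner
    return false_positives / experiments


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"check once at the end:     {fixed_rate:.1%} false positives")
print(f"check after every visitor: {peeking_rate:.1%} false positives")
```

The only thing that changes between the two runs is when the tester looks, which is why the fix is procedural rather than mathematical.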

Most tests do not win

One honest truth keeps testers sane: most tests do not produce a winner. By widely cited figures, only around one in seven tests wins clearly, and large experimentation teams at companies like Booking.com say the large majority of their tests fail to deliver the hoped-for result. The rest come back flat or inconclusive. This is not a problem with testing; it is the nature of it. A well-designed test teaches you something whatever the outcome, by killing a bad idea cheaply or confirming the current version holds. Treat A/B testing as a way to learn steadily, and be deeply skeptical of any case study promising a huge, effortless lift.

What to test

Test the things big enough to matter. The classic beginner trap is obsessing over a button color, which rarely moves the needle or teaches you much. Aim higher: the elements that genuinely change how people decide.

Headline / value proposition. The biggest lever. What the offer is and why it matters, in the first line people read.
Call to action. The button copy and placement. The wording matters far more than the color.
Page layout and hero. What leads, what order sections appear in, what the visitor sees first.
Form length. How many fields you ask for. Fewer usually lifts completion, but test it.
Media. Image versus video versus none in the hero, which can shift conversion either way.
Offer and pricing framing. How the offer and price are presented, like monthly versus annual or a payment plan.

The principle is to favor bold changes over trivial ones. Small tweaks quickly hit a ceiling, where each new test wins a fraction of a percent and eventually nothing, while a genuinely different approach can produce a real, detectable effect and teach you something about your audience. Bold changes also reach significance with a smaller sample, which matters if your traffic is modest. To decide what to test first, score your ideas by their potential impact and how easy they are to run, using a simple framework like PIE (potential, importance, ease) or ICE (impact, confidence, ease), and start with the high-impact, reasonable-effort ones.
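
As a rough illustration of that scoring step, here is a sketch of an ICE-style ranking. The 1 to 10 scale, the choice to average the three scores (some teams multiply them instead), and the example ideas with their scores are all assumptions for the example, not part of any official framework definition.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10: how much could this move the primary metric?
    confidence: int  # 1-10: how sure are you that it will?
    ease: int        # 1-10: how cheap is it to build and run?

    @property
    def ice(self) -> float:
        """Average of the three scores (one common ICE variant)."""
        return (self.impact + self.confidence + self.ease) / 3


# Hypothetical backlog with made-up scores.
backlog = [
    Idea("Change button color", impact=2, confidence=3, ease=10),
    Idea("Rewrite headline around the outcome", impact=9, confidence=7, ease=8),
    Idea("Shorten signup form", impact=6, confidence=6, ease=7),
]

ranked = sorted(backlog, key=lambda idea: idea.ice, reverse=True)
for idea in ranked:
    print(f"{idea.ice:4.1f}  {idea.name}")
```

The scores are subjective, so the value lies less in the numbers than in forcing every idea through the same three questions before it takes up test traffic.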

A/B vs multivariate

A/B testing is the most common method, but it has cousins. The difference that matters in practice is how much traffic each one needs.

Method | What it is | Traffic needed
A/B test | Two versions of a page, usually differing by one element. Also called split testing. | Lowest
Split-URL test | Variants live on different URLs. Used when the variant is a separate page or a full redesign. | Similar to A/B
Multivariate (MVT) | Several elements changed at once, measuring how the combinations interact. | Much higher

The warning to take from that table is about multivariate testing. Because it splits your traffic across every combination of changes, it needs far more visitors than a simple A/B test to reach significance, often many times more. Unless you have a high-traffic site, an A/B test is almost always the smarter choice. Start with A/B, and graduate to multivariate only once you have the volume to support it.
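
The arithmetic behind that warning is simple to sketch. The example assumes a full-factorial multivariate test, where every combination is shown, and a hypothetical requirement of 8,000 visitors per version; both are simplifying assumptions, and measuring interactions reliably can demand more than this estimate suggests.

```python
import math


def combinations(variants_per_element: list[int]) -> int:
    """Number of page versions when every combination is tested."""
    return math.prod(variants_per_element)


def total_visitors(variants_per_element: list[int],
                   visitors_per_version: int) -> int:
    """Total traffic if each version needs the same sample."""
    return combinations(variants_per_element) * visitors_per_version


PER_VERSION = 8_000  # hypothetical sample size per version

ab_test = [2]          # one element, two versions (control and variant)
mvt_test = [3, 2, 2]   # 3 headlines x 2 hero images x 2 button texts

print(f"A/B: {combinations(ab_test)} versions, "
      f"{total_visitors(ab_test, PER_VERSION):,} visitors")
print(f"MVT: {combinations(mvt_test)} versions, "
      f"{total_visitors(mvt_test, PER_VERSION):,} visitors")
```

Adding one more two-option element doubles the number of versions again, which is why multivariate tests outgrow modest traffic so quickly.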

How to run a test in 7 steps

Here is the full process, in order. The discipline is mostly in steps three and four, deciding the rules before you start, and in step six, sticking to them.

  1. Study your data to find the problem

    Use your analytics, heatmaps, and qualitative research to find where something underperforms or where visitors drop off, so you test a real opportunity rather than a random idea.

  2. Form a hypothesis

    Write a clear prediction: changing this element to that will improve this metric, because of this reason. The "because" forces you to think, and gives you something to learn from whatever the result.

  3. Define the metric and minimum effect

    Decide the single primary metric in advance, and the smallest improvement worth detecting. Choosing these up front stops you from fishing through the data afterward for any result that looks good.

  4. Calculate sample size and duration in advance

    Use a sample-size calculator with your baseline rate and minimum detectable effect to find how many visitors you need, and plan to run for at least one to two full business cycles. Write the numbers down.

  5. Build the control and the variant

    Keep the control exactly as it is, and create a variant that differs only by the element you are testing. Set up the tool to split traffic randomly and show both versions at the same time.

  6. Run it to completion without peeking

    Wait until you reach the planned sample size and duration. Do not stop the test the moment it appears to hit significance, because early peeking is what turns a 5% error rate into something closer to 26%.

  7. Analyze and act

    If you have a statistically significant winner, ship it and bank the gain. If the result is flat or inconclusive, record what you learned and start a new hypothesis. Either outcome moves you forward.
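
Steps six and seven can be combined into one small decision rule: refuse to call the test before the planned sample is reached, then decide from a 95% confidence interval on the difference in conversion rates. This is a sketch under assumptions: the function name `analyze`, the unpooled normal-approximation interval, and the example numbers are illustrative choices, not a description of any specific tool's report.

```python
import math
from statistics import NormalDist


def analyze(conv_a: int, n_a: int, conv_b: int, n_b: int,
            planned_n: int, alpha: float = 0.05) -> str:
    """Return a decision for a finished (or unfinished) A/B test."""
    if min(n_a, n_b) < planned_n:
        return "keep running"  # no peeking: the planned sample is not in yet

    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a
                        + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = diff - z_crit * std_err, diff + z_crit * std_err

    if low > 0:
        return "ship variant"   # whole interval is above zero
    if high < 0:
        return "keep control"   # variant is significantly worse
    return "inconclusive"       # interval includes zero: record and move on


# Hypothetical tests, each planned for 8,158 visitors per version.
print(analyze(40, 800, 52, 800, planned_n=8158))
print(analyze(408, 8200, 492, 8200, planned_n=8158))
print(analyze(410, 8200, 420, 8200, planned_n=8158))
```

Note that the first call returns "keep running" even though the variant looks ahead, which is exactly the discipline step six asks for.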

When not to A/B test

A/B testing is not the right tool for every situation, and pretending otherwise wastes time. Its hard requirement is volume: to reach a statistically significant result you often need thousands of visitors and a solid number of conversions per version. On a page that gets a few hundred visits a month, a test can run for months and never conclude, which is worse than not testing at all, because you make decisions on noise while believing they are data.
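
A quick back-of-the-envelope calculation shows how this plays out. The sketch below assumes a hypothetical requirement of 1,500 visitors per version, traffic split evenly between two versions, and a two-week minimum; the function name and all the numbers are illustrative assumptions.

```python
import math


def planned_duration_days(sample_per_version: int, daily_visitors: float,
                          versions: int = 2, min_weeks: int = 2) -> int:
    """Days to run, rounded up to whole weeks so every weekday is covered."""
    total_needed = sample_per_version * versions
    days = math.ceil(total_needed / daily_visitors)
    weeks = max(min_weeks, math.ceil(days / 7))
    return weeks * 7


# A page with about 300 visits a month (roughly 10 a day).
low_traffic = planned_duration_days(1500, daily_visitors=10)
# The same test on a page with 2,000 visits a day.
high_traffic = planned_duration_days(1500, daily_visitors=2000)

print(f"low-traffic page:  {low_traffic} days ({low_traffic // 7} weeks)")
print(f"high-traffic page: {high_traffic} days ({high_traffic // 7} weeks)")
```

If the first number comes out at many months, that is the signal to test something bolder or switch to qualitative research.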

If you do not have the traffic, you have two good options. First, test only bold, radical changes rather than small tweaks, because a larger effect can reach significance with a smaller sample. Second, lean on qualitative methods instead: user testing, session recordings, and surveys tell you why people behave as they do, which raw A/B numbers never reveal. In fact, even high-traffic teams pair the two, because A/B testing tells you what changed but not why. Use testing where you have the volume to do it properly, and use research and judgment everywhere else.

Common mistakes to avoid

Most bad A/B testing comes down to a short list of errors. Avoid these and you are doing better than most.

Stopping early. The number-one mistake. Calling a test the moment it looks significant inflates false positives to around 26%. Run to the planned end.

A sample that is too small. An underpowered test is just noise. Calculate the sample you need before launching, and do not read into a handful of conversions.

Changing several things at once. If the variant differs in many ways and wins, you cannot tell which change did it. Keep an A/B test to one variable.

Testing trivial elements. Button colors and tiny tweaks rarely produce a detectable effect or a useful lesson. Test the headline, the offer, the layout.

Misreading significance. Treating 95% confidence as a 95% chance the variant wins, or ignoring significance altogether. Know what the number means.

Ignoring outside factors. A sale, a holiday, or a traffic-source change during the test can fake a result. Account for what else was happening.

Run tests in systeme.io

Test your pages and emails, no extra tools

systeme.io has A/B testing built into the page builder and the email tool, so you can run a real split test on the things that matter, your landing pages and your subject lines, and see the results, without bolting on a separate testing platform. Build and test on the free plan.

A/B test landing pages. Split traffic between two versions of a page and compare conversions.
Test email subject lines. Try two subject lines and judge them on clicks, not just opens.
Built-in stats. See how each version performs without exporting to another tool.
Everything in one place. Pages, emails, and funnels together, so you test the whole journey.

A/B testing is one tool inside the wider practice of conversion rate optimization. To make sure you have the traffic to test with, see how to drive traffic to your website.

Frequently asked questions

What is A/B testing?

A/B testing, also called split testing, is showing two versions of something to two randomly split groups of people at the same time, then measuring which version performs better on a chosen goal. Version A is the control, your current version, and version B is the variant with one change. By splitting a live audience randomly and comparing the results, you replace opinion and guesswork with evidence. It is essentially a small randomized controlled experiment for your marketing, and it is the method that lets you find out what actually works instead of assuming.

How does A/B testing work?

You take your current version as the control and create a variant that differs by just one element, such as a headline or a call to action. You then split your live traffic randomly between the two, show both at the same time, and measure a single primary metric you chose in advance, like the conversion rate. Changing only one element is what lets you attribute any difference to that change rather than to chance or outside factors. Running both simultaneously to randomly assigned visitors removes the bias you would get from comparing one week against another.

What is statistical significance?

It is a measure of how unlikely your result would be if only random chance were at work, and it is widely misunderstood. The industry standard is 95% confidence, but 95% confidence does not mean there is a 95% chance your variant is better. It means that if there were truly no difference between the two versions, pure chance would produce a gap at least as large as the one you observed less than 5% of the time. So significance tells you the result is unlikely to be a fluke, not that you are 95% certain B wins. Getting this distinction right is what separates careful testers from the rest.

How long should an A/B test run?

Decide the length before you start, based on a pre-calculated sample size, and run for at least one to two full business cycles, which usually means a minimum of one to two weeks. Running full weeks matters because behavior differs by day, so a test that ends on a Tuesday may miss the weekend pattern entirely. Crucially, run for the planned duration even if the test appears to hit significance early. Stopping the moment it looks like a winner is one of the biggest mistakes in testing, because early readings are unreliable and change as more data arrives.

Why should you not stop a test early?

Because the significance calculation assumes you fixed the sample size in advance, and peeking at the results and stopping the moment they look significant breaks that assumption and floods your results with false positives. Evan Miller's well-known analysis showed that checking after every visitor and stopping at the usual threshold produces a false-positive rate around 26%, more than five times the 5% you think you are getting. The fix is simple: decide your sample size and duration before you begin, then wait until the test is actually over before believing the numbers. Patience is part of the method.

How much traffic do you need to A/B test?

Enough that a real difference can reach statistical significance, which often means thousands of visitors and a healthy number of conversions per version. Exactly how many depends on your current conversion rate and how small a change you want to detect: spotting a small lift needs far more traffic than spotting a big one. This makes A/B testing a poor fit for very low-traffic pages. If you do not have the volume, test bold, radical changes rather than tiny tweaks, since larger effects need smaller samples, or use qualitative methods like user testing and session recordings to learn instead.

What should you A/B test first?

Start with high-impact elements, not trivial ones. Testing your headline or value proposition, your main call to action, your offer, or your page layout can move the needle meaningfully, whereas testing a button color rarely teaches you much or produces a detectable effect. A useful rule is to test bold changes over tiny tweaks, because small tweaks quickly hit a ceiling where the gains shrink to nothing, while bigger changes both reach significance faster and teach you more. Prioritize ideas by their potential impact and how easy they are to run, and test the high-impact, reasonable-effort ones first.

Do most A/B tests produce a winner?

No, and that is normal. By widely cited industry figures, only around one in seven tests produces a clear winner, and large experimentation teams at companies like Booking.com report that the large majority of their tests do not produce the hoped-for result. Most tests come back flat or inconclusive. This is not failure: a well-designed test teaches you something whatever the outcome, by ruling out an idea or confirming the current version is fine. Treat testing as a way to learn steadily rather than a machine for guaranteed lifts, and be skeptical of case studies promising huge, easy wins.

Stop guessing, start testing

Run a real A/B test on your pages and emails, and let the data decide. Start on the free plan, with no card.
