A/B Testing Analytics Setup: How to Run Experiments You Can Actually Trust

Most A/B Tests Are Set Up Wrong

A/B testing is one of the most powerful tools available to product and marketing teams. It is also one of the most commonly misused. Teams run tests without a clear hypothesis, call results early when they see movement in the direction they hoped for, treat statistical significance as a finish line rather than a filter, and draw conclusions from experiments that were never properly instrumented to begin with.

The result is a testing culture that feels data-driven but is actually built on fragile foundations, where decisions are made based on experiment results that would not hold up to basic scrutiny.

Getting A/B testing right is not primarily a tool problem. It is an analytics setup problem. Before you run a single test, you need the right instrumentation, the right statistical framework, and the right process for deciding when a result is real versus noise. This guide covers all three.

Step 1: Define Your Hypothesis Before You Touch the Setup

The most common reason A/B tests fail to produce actionable results is not bad tracking; it is a bad hypothesis. An experiment without a clear, pre-specified hypothesis is not a test. It is a data fishing exercise.

A properly formed A/B test hypothesis has three components:

  • A specific change: what exactly are you testing? Not ‘we are testing the checkout page’ but ‘we are testing replacing the three-step checkout with a single-page layout.’
  • A predicted direction: which metric do you expect to move, and in which direction? ‘We expect this change to increase checkout completion rate by reducing friction at the payment step.’
  • A primary metric: one metric that will determine whether the experiment succeeded or failed. Not five metrics, not a scorecard: one. Secondary metrics can inform interpretation but should not determine the outcome.

Write the hypothesis down before the experiment launches. Share it with everyone who will see the results. If you find yourself changing the primary metric after the fact because the original one did not move, you are p-hacking — and your result is no longer valid.
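One way to make "write it down before launch" concrete is to store the hypothesis as an immutable record. The sketch below is only an illustration: the class name, the field names, and the registration date are assumptions, not part of any experimentation platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Hypothesis:
    """A pre-registered A/B test hypothesis (field names are illustrative)."""
    experiment_id: str
    change: str                # the specific change being tested
    primary_metric: str        # the single metric that decides the outcome
    expected_direction: str    # "increase" or "decrease"
    rationale: str             # why the change should move the metric
    registered_on: date        # recorded before the experiment launches
    secondary_metrics: tuple[str, ...] = ()

    def __post_init__(self):
        if self.expected_direction not in ("increase", "decrease"):
            raise ValueError("expected_direction must be 'increase' or 'decrease'")
        if self.primary_metric in self.secondary_metrics:
            raise ValueError("the primary metric must not also be a secondary metric")


# Example values taken from the checkout scenario above; the date is a placeholder.
checkout_hypothesis = Hypothesis(
    experiment_id="checkout_v2",
    change="Replace the three-step checkout with a single-page layout",
    primary_metric="checkout_completion_rate",
    expected_direction="increase",
    rationale="Reduces friction at the payment step",
    registered_on=date(2024, 1, 15),
    secondary_metrics=("average_order_value",),
)
```

Because the record is frozen, swapping the primary metric after the results arrive requires creating a new, visibly different record rather than quietly editing the old one.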

Step 2: Instrument the Experiment Correctly

Once you have a hypothesis, you need to make sure the experiment is tracked properly. This means confirming that your analytics platform will capture assignment data, exposure events, and outcome events in a way that connects them cleanly.

The three events every A/B test needs

  • Assignment event: fired when a user is assigned to a variant. This should fire once, the first time a user is bucketed into the experiment, with properties capturing the experiment ID and the variant they were assigned to. This defines the population that entered the experiment.
  • Exposure event: fired when a user actually sees the variant being tested. In some experiments, assignment and exposure are the same moment. In others — particularly feature flag-based experiments — users may be assigned to a variant but never actually see the changed experience. Tracking exposure separately prevents this from inflating your denominator with users who were never actually in the experiment.
  • Conversion event: the primary metric event that determines whether the variant performed better than control. This should already exist in your analytics implementation if your tracking plan is solid. If you have to create a new event just for the experiment, treat that as a signal that your base tracking has gaps.

Example: tracking an experiment in Mixpanel or GA4

When a user enters a checkout experiment, you would fire:

experiment_assigned({ experiment_id: "checkout_v2", variant: "single_page" })

When they see the new checkout layout:

experiment_viewed({ experiment_id: "checkout_v2", variant: "single_page" })

When they complete checkout:

purchase({ transaction_id: "…", value: 149.00, currency: "USD" })

Your analysis then measures purchase rate among users who fired experiment_viewed, broken down by variant.
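As a rough illustration of that analysis, the Python sketch below computes purchase rate among exposed users from a small, made-up event log. The tuple layout `(user_id, event_name, properties)` and the function name are assumptions for the example; a real export from Mixpanel or GA4 will look different.

```python
from collections import defaultdict

# Hypothetical event log in time order: (user_id, event_name, properties).
events = [
    ("u1", "experiment_assigned", {"experiment_id": "checkout_v2", "variant": "control"}),
    ("u1", "experiment_viewed", {"experiment_id": "checkout_v2", "variant": "control"}),
    ("u1", "purchase", {"value": 149.00, "currency": "USD"}),
    ("u2", "experiment_assigned", {"experiment_id": "checkout_v2", "variant": "control"}),
    ("u2", "experiment_viewed", {"experiment_id": "checkout_v2", "variant": "control"}),
    ("u3", "experiment_assigned", {"experiment_id": "checkout_v2", "variant": "single_page"}),
    ("u3", "experiment_viewed", {"experiment_id": "checkout_v2", "variant": "single_page"}),
    ("u3", "purchase", {"value": 149.00, "currency": "USD"}),
    # u4 was assigned but never saw the variant, so it must not be counted.
    ("u4", "experiment_assigned", {"experiment_id": "checkout_v2", "variant": "single_page"}),
    ("u4", "purchase", {"value": 149.00, "currency": "USD"}),
    ("u5", "experiment_assigned", {"experiment_id": "checkout_v2", "variant": "single_page"}),
    ("u5", "experiment_viewed", {"experiment_id": "checkout_v2", "variant": "single_page"}),
    ("u5", "purchase", {"value": 149.00, "currency": "USD"}),
]


def conversion_by_variant(events, experiment_id, conversion_event="purchase"):
    """Conversion rate per variant, counting only users who were exposed."""
    exposed = {}       # user_id -> variant recorded at first exposure
    converted = set()  # exposed users who converted after exposure
    for user_id, name, props in events:
        if name == "experiment_viewed" and props.get("experiment_id") == experiment_id:
            exposed.setdefault(user_id, props["variant"])
        elif name == conversion_event and user_id in exposed:
            converted.add(user_id)

    totals = defaultdict(lambda: {"exposed": 0, "converted": 0})
    for user_id, variant in exposed.items():
        totals[variant]["exposed"] += 1
        if user_id in converted:
            totals[variant]["converted"] += 1
    return {v: {**c, "rate": c["converted"] / c["exposed"]} for v, c in totals.items()}


rates = conversion_by_variant(events, "checkout_v2")
```

The denominator here is built from `experiment_viewed` rather than `experiment_assigned`, which is what keeps assigned-but-never-exposed users out of the comparison.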

Step 3: Choose the Right Experimentation Platform

The tool you use to run experiments affects how you assign users to variants, how you prevent cross-contamination between variants, and what statistical methods are available for analysis. The right platform depends on your stack and your experimentation volume.

Platform             | Best for                        | Statistical approach    | Integrations
---------------------|---------------------------------|-------------------------|------------------------
Statsig              | Product teams, high velocity    | Sequential / CUPED      | Mixpanel, GA4, Segment
Optimizely           | Web and feature experimentation | Frequentist + Bayesian  | GA4, Segment, CDPs
LaunchDarkly         | Feature flag-based experiments  | Requires external stats | Mixpanel, Datadog
VWO                  | Website / CRO teams             | Frequentist             | GA4, Segment
GA4 A/B              | Simple web experiments, free    | Bayesian                | Google Ads, Looker
Amplitude Experiment | Product analytics teams         | Sequential testing      | Amplitude, Segment

For most SaaS product teams, Statsig is currently the strongest option: it handles both feature flagging and experimentation, integrates cleanly with Segment and Mixpanel, and applies CUPED variance reduction, which can meaningfully shorten the time needed to reach significance. For simpler web optimisation use cases, GA4’s built-in A/B testing tool or VWO are both reasonable starting points.
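For readers wondering what CUPED actually does, here is a minimal sketch of the textbook adjustment: each user's metric is corrected using the same metric measured before the experiment started. This is not Statsig's implementation, and the function name and sample numbers are invented for illustration.

```python
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:     per-user values measured during the experiment (y)
    pre_metric: the same users' values from before the experiment (x)
    """
    x_bar, y_bar = mean(pre_metric), mean(metric)
    n = len(metric)
    # Sample covariance between the pre-period and in-experiment values.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)) / (n - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Made-up per-user values; in practice theta is estimated on all users pooled
# across variants, and the adjusted values are then compared between variants.
pre = [10.0, 12.0, 8.0, 15.0, 11.0, 9.0]
post = [11.0, 13.5, 8.5, 16.0, 12.5, 9.5]
adjusted = cuped_adjust(post, pre)
```

The adjustment leaves the mean unchanged and removes the part of the variance explained by pre-experiment behaviour, which is why fewer users are needed for the same power.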

Step 4: Calculate Your Required Sample Size Before You Launch

One of the most common A/B testing mistakes is launching experiments without calculating how many users are needed to detect a meaningful effect. Teams run tests for a week, see a number they like, and call the result significant without knowing whether they had anything close to enough data.

Sample size depends on three inputs:

  • Baseline conversion rate: what is your current conversion rate for the primary metric? If your checkout completion rate is 40%, that is your baseline.
  • Minimum detectable effect (MDE): what is the smallest improvement that would be worth implementing? If a 2 percentage point absolute improvement in checkout completion rate (from 40% to 42%) is commercially meaningful, set your MDE at 2 percentage points.
  • Statistical power and significance threshold: standard settings are 80% power and a 5% significance level (p < 0.05). These mean you have an 80% chance of detecting a real effect of your MDE size, and a 5% chance of a false positive when there is no real effect.

With a 40% baseline and a 2pp MDE at standard settings (two-sided test), the usual two-proportion approximation gives roughly 9,500 users per variant, or about 19,000 in total, before you can draw reliable conclusions. If your experiment page gets 500 visitors per week, you need to run for about 38 weeks before touching the results.
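To sanity-check a calculator's output, the arithmetic can be reproduced in a few lines. The sketch below uses the common normal-approximation formula for comparing two proportions with a two-sided test; individual calculators use slightly different variants, so small differences are expected. The function name and parameter names are assumptions for this example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed per variant to detect an absolute lift of mde_abs.

    Normal approximation for a two-sided test on two proportions:
    n = (z_alpha/2 + z_power)^2 * (p1(1-p1) + p2(1-p2)) / mde_abs^2
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)


# Inputs from the example: 40% baseline, 2pp MDE, 500 visitors per week in total.
n_per_variant = sample_size_per_variant(0.40, 0.02)
weeks_needed = ceil(2 * n_per_variant / 500)
```

Rerunning the function with a larger `mde_abs` shows how quickly the requirement drops, which is the trade-off to weigh when traffic is limited.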

Use a sample size calculator before every experiment. Statsig, Optimizely, and Evan Miller’s online calculator are all reliable options. If your traffic cannot support detecting your MDE at an acceptable power level, either increase the MDE (accept only larger improvements) or wait until you have the traffic.

Step 5: Understand Statistical Significance and Its Limits

Statistical significance (p < 0.05) tells you that the observed difference between variants is unlikely to be due to random chance alone. It does not tell you that the effect is large. It does not tell you that the effect will persist. It does not mean you should definitely ship the winning variant.

What p < 0.05 actually means

A p-value below 0.05 means that if there were no real difference between control and variant, you would see a result this extreme or more extreme less than 5% of the time by chance alone. That is a useful filter. It is not proof of a real effect.
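For a conversion-rate metric, that definition is typically applied with a two-proportion z-test. The sketch below is one standard way to compute the two-sided p-value; your platform may use a different method, and the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both groups share one pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a result at least this extreme if there is no real difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up counts: control converts 3,800 of 9,500 users, variant 4,000 of 9,500.
p = two_proportion_p_value(3800, 9500, 4000, 9500)
is_significant = p < 0.05
```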

Practical significance matters as much as statistical significance

A 0.1% improvement in conversion rate might reach statistical significance with a large enough sample, but it may not be worth the engineering cost to ship. Always pair statistical significance with a practical significance check: is the effect size large enough to matter for your business? If your minimum detectable effect was set correctly at the start, this question is already answered.
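One way to run that check is to compare the confidence interval for the lift against the MDE instead of looking at the p-value alone. The sketch below assumes the goal is an increase, reads the 0.1% figure as 0.1 percentage points, and uses invented counts and labels.

```python
from math import sqrt
from statistics import NormalDist


def difference_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (variant rate - control rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def classify(low, high, mde_abs):
    """Label a result by comparing the interval with zero and with the MDE."""
    if low <= 0 <= high:
        return "not statistically significant"
    if high < 0:
        return "significant, but in the wrong direction"
    if high < mde_abs:
        return "significant but below the MDE"
    if low >= mde_abs:
        return "significant and at least the MDE"
    return "significant, practical value uncertain"


# Made-up example: 10.0% vs 10.1% conversion with one million users per variant,
# judged against a hypothetical MDE of 0.5 percentage points.
low, high = difference_ci(100_000, 1_000_000, 101_000, 1_000_000)
verdict = classify(low, high, mde_abs=0.005)
```

A result whose whole interval sits below the MDE is real but small, which is exactly the case where shipping may not repay the engineering cost.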

Watch out for these common errors

  • Peeking: checking results before the pre-specified sample size is reached and stopping early when you see significance. This dramatically inflates false positive rates. Use sequential testing methods (available in Statsig and Amplitude Experiment) if you need to monitor results while an experiment is running.
  • Running too many variants: every additional variant increases the chance of a false positive. If you are testing four variants against a control, adjust your significance threshold accordingly (Bonferroni correction is a simple starting point).
  • Multiple metrics problem: if you are testing statistical significance across ten secondary metrics at p < 0.05, you have roughly a 40% chance of finding a significant result on at least one of them purely by chance (assuming the metrics are independent). Define your primary metric before launch and be skeptical of significance that appears only in secondary metrics.
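The arithmetic behind the last two points is short enough to check directly. The sketch below assumes the comparisons are independent, which real metrics rarely are, so treat the numbers as a rough guide; the function names are made up for the example.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Ten secondary metrics, each tested at the 5% level.
chance_of_false_positive = family_wise_error_rate(0.05, 10)

# Four variants compared against one control: four comparisons.
per_comparison_threshold = bonferroni_alpha(0.05, 4)
```

With four comparisons, each one would need p < 0.0125 rather than p < 0.05 to keep the overall false positive rate near 5%.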

Step 6: Analyse Segment-Level Results, But Carefully

A test that shows no overall winner can still contain useful signal at the segment level. An experiment that produced a neutral aggregate result for all users might show a strong positive effect for mobile users, or for users on the Pro plan, or for users in their first week.

Segment-level analysis is valuable, but it carries a higher risk of false positives than aggregate analysis because you are running multiple tests simultaneously across subgroups. Treat segment findings as hypotheses to test in a follow-up experiment, not as conclusions to act on directly.

The right workflow: run the aggregate analysis first and determine whether the primary metric moved. Then run segment analysis to generate hypotheses for what is driving the effect, or for where the next experiment should focus.
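That workflow can be sketched as a small function: test the aggregate first, then scan segments with a stricter threshold and label whatever passes as a follow-up hypothesis rather than a result. The segment names, counts, and the choice of a Bonferroni threshold are all assumptions for the example.

```python
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def analyse(segments, alpha=0.05):
    """segments maps name -> (control_conv, control_n, variant_conv, variant_n)."""
    # Step 1: aggregate result on the primary metric across all segments.
    totals = [sum(column) for column in zip(*segments.values())]
    aggregate_p = p_value(*totals)
    # Step 2: segment scan with a Bonferroni-adjusted threshold.
    threshold = alpha / len(segments)
    follow_ups = [name for name, counts in segments.items()
                  if p_value(*counts) < threshold]
    return {
        "aggregate_p": aggregate_p,
        "aggregate_significant": aggregate_p < alpha,
        "follow_up_hypotheses": follow_ups,  # candidates to retest, not conclusions
    }


# Made-up counts per segment.
segments = {
    "desktop": (400, 1000, 370, 1000),
    "mobile": (300, 1000, 380, 1000),
    "tablet": (100, 250, 98, 250),
}
report = analyse(segments)
```

In this invented data the aggregate result is neutral while one segment clears the adjusted threshold, which is the situation where a dedicated follow-up experiment on that segment is the appropriate next step.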

Step 7: Build a Systematic Experiment Backlog

Individual experiments produce individual results. A systematic experimentation programme produces compounding knowledge about what works for your users, your product, and your business.

The teams that get the most value from A/B testing are not the ones running the most experiments; they are the ones running the most informed experiments. That means keeping a shared experiment log that captures:

  • The hypothesis behind each test
  • The results, including null results, which are as informative as wins
  • What the result implies about user behaviour
  • What follow-up experiments the result suggests

A null result that is well-documented is not a failure. It is evidence that the change you tested does not matter to users in the way you expected, and that is genuinely useful information for directing the next test.
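A shared log does not need special tooling. As a minimal sketch, each experiment can be stored as one JSON line with the four items listed above; the class name, field names, and example text below are illustrative assumptions, not real results.

```python
import json
from dataclasses import asdict, dataclass, field

VALID_OUTCOMES = ("win", "loss", "null")


@dataclass
class ExperimentLogEntry:
    """One row in a shared experiment log (field names are illustrative)."""
    experiment_id: str
    hypothesis: str
    primary_metric: str
    outcome: str          # "win", "loss", or "null"
    result_summary: str   # what was observed on the primary metric
    implication: str      # what the result suggests about user behaviour
    follow_ups: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"outcome must be one of {VALID_OUTCOMES}")

    def to_json_line(self):
        """Serialise to a single line, suitable for appending to a .jsonl file."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder entry showing how a null result would be recorded.
entry = ExperimentLogEntry(
    experiment_id="checkout_v2",
    hypothesis="A single-page checkout increases checkout completion rate",
    primary_metric="checkout_completion_rate",
    outcome="null",
    result_summary="No detectable change at the planned sample size",
    implication="Number of checkout steps may not be the main source of friction",
    follow_ups=["Test a two-step checkout on mobile only"],
)
line = entry.to_json_line()
```

Restricting `outcome` to three values makes null results a first-class entry instead of something that gets left out of the log.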

When to Bring in Experimentation Expertise

Setting up basic A/B tests on a website is manageable with off-the-shelf tools. Building an experimentation programme that reliably surfaces true effects, with proper instrumentation, correct sample sizing, and statistical rigour across a product with multiple surfaces and a complex user journey, is a different challenge.

Common signs that your experimentation setup needs attention: results that seem significant but do not replicate in follow-up tests, inconsistent numbers between your experimentation platform and your analytics tool, no documented hypothesis or sample size calculation for tests, and a tendency to call winning results faster than the data supports.

At Kaliper, we set up and validate experimentation infrastructure for product and growth teams, from Statsig and Optimizely implementations through to the event tracking and statistical framework that make experiment results reliable. If your testing programme is not producing insights you can act on confidently, the setup is worth examining.

Final Thoughts

A/B testing done right is one of the most valuable tools in a product or growth team’s toolkit. Done poorly, it creates false confidence: teams make decisions based on results that would not survive scrutiny, while missing real effects because the experiment was not set up to detect them.

The foundation is not the tool. It is the hypothesis, the instrumentation, the sample size, and the discipline to wait for a real result before drawing conclusions. Get those right, and your experimentation programme will compound over time, with each test building on the knowledge of the last.

