For each test, Shoplift uses a Bayesian statistical model to determine whether your test results are valid at a 95% confidence level with 80% statistical power.
How Shoplift calculates statistical significance
Using the data collected during a test, Shoplift's Bayesian regression and inference model runs a Markov chain Monte Carlo (MCMC) algorithm to simulate outcome values from a posterior predictive distribution.
The posterior distribution is just the technical term for the range of outcomes that could result from a given test. As a test collects more data, the posterior distribution grows narrower, meaning that the possible range of outcomes shrinks and Shoplift's certainty in the results grows.
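To illustrate how a posterior narrows as data accumulates, here is a minimal sketch. It does not reproduce Shoplift's MCMC regression model. It assumes a much simpler stand-in, a Beta-Binomial model for a conversion rate with a uniform prior, and the function name and sample figures are hypothetical.

```python
import random

def credible_interval_width(conversions, sessions, draws=20000, seed=0):
    """Width of the central 95% credible interval for a conversion rate.

    Assumption: a uniform Beta(1, 1) prior and a Binomial likelihood, so the
    posterior is Beta(1 + conversions, 1 + sessions - conversions).
    """
    rng = random.Random(seed)
    alpha = 1 + conversions
    beta = 1 + sessions - conversions
    # Sample from the posterior and read off the 2.5% and 97.5% quantiles.
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws)]
    return upper - lower

# Same observed 3% conversion rate, ten times the traffic.
wide = credible_interval_width(conversions=30, sessions=1000)
narrow = credible_interval_width(conversions=300, sessions=10000)
print(f"1,000 sessions:  interval width {wide:.4f}")
print(f"10,000 sessions: interval width {narrow:.4f}")
```

With the same observed rate, the interval from the larger sample is narrower, which is the "shrinking range of outcomes" described above.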
How much data is required?
Bayesian inference has no minimum sample size requirement. If your conversion rates show consistent, distinct patterns, you can derive actionable insights even with limited traffic. This makes it possible to run worthwhile experiments on less-trafficked sections of your website. If you are used to the larger sample sizes that other methods demand, this may feel unfamiliar, but it is a distinct advantage of our approach.
That being said, Shoplift does impose both time- and sample-based limitations before determining if test results can be certified as valid. For all tests, the sample threshold is a minimum of 30 orders on each variant. For conversion rate tests, the time threshold is 3 days, and for average order value and revenue per session tests, the time threshold is 7 days.
We impose these time and sample thresholds so that data collected early on is not acted upon, even if the behavior is consistent. The reason is the "newness" effect (often called the novelty effect): shoppers tend to engage more with a new experience simply because it is new. This generally tapers off after a few days.
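The thresholds above can be expressed as a simple check. This is a sketch only: the function name, goal keys, and dictionary shape are hypothetical, and it assumes the thresholds are inclusive (exactly 30 orders or exactly 3 days is enough).

```python
# Thresholds taken from the text above; the identifiers are illustrative.
MIN_ORDERS_PER_VARIANT = 30
MIN_DAYS_BY_GOAL = {
    "conversion_rate": 3,
    "average_order_value": 7,
    "revenue_per_session": 7,
}

def meets_thresholds(goal, days_running, orders_by_variant):
    """Return True once both the time and the sample thresholds are met."""
    enough_time = days_running >= MIN_DAYS_BY_GOAL[goal]
    # Every variant, including the original, needs the minimum order count.
    enough_orders = all(
        orders >= MIN_ORDERS_PER_VARIANT
        for orders in orders_by_variant.values()
    )
    return enough_time and enough_orders

print(meets_thresholds("conversion_rate", 4, {"original": 41, "variant": 38}))
print(meets_thresholds("average_order_value", 4, {"original": 41, "variant": 38}))
```

The first call passes both thresholds. The second fails because average order value tests need 7 days.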
Understanding the statistical significance of your test
As a test collects more data, the possible range of outcomes shrinks and Shoplift's certainty in the results grows. Test reports display three key pieces of information about the validity of your test results:
Progress is a descriptive indicator of how reliably your test results can be certified as real. Any test is only a small sample of data in the overall lifetime of your store, so progress tells you whether your test results are a representative sample or an anomaly.
Probability to win is the estimated chance that one test experience outperforms another. The percentage reflects how certain the test is overall. As data is collected and your test approaches statistical significance, the probability to win for each variant becomes more credible.
Estimated time to significance
Estimated time to significance reflects the approximate remaining time necessary to reach a statistically significant result. This estimate varies based on the volume and velocity of data collected, and is updated dynamically as your test progresses.
If your test shows an improvement but the results are still in progress, keep the test running and collect more data. The improvement you are seeing is not yet statistically significant and could be due to random chance.
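As an illustration of the probability to win metric described above, the sketch below estimates the chance that a variant's conversion rate beats the original's. It is not Shoplift's implementation: it assumes independent Beta-Binomial posteriors with uniform priors and plain Monte Carlo sampling in place of the MCMC regression model, and all names and figures are hypothetical.

```python
import random

def probability_to_win(orig_conv, orig_sessions, var_conv, var_sessions,
                       draws=20000, seed=0):
    """Estimate P(variant rate > original rate) by sampling both posteriors.

    Assumption: each rate has a Beta(1 + conversions, 1 + non-conversions)
    posterior, i.e. a uniform prior with a Binomial likelihood.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        orig_rate = rng.betavariate(1 + orig_conv,
                                    1 + orig_sessions - orig_conv)
        var_rate = rng.betavariate(1 + var_conv,
                                   1 + var_sessions - var_conv)
        if var_rate > orig_rate:
            wins += 1
    return wins / draws

# Early in a test: a visible lift, but little data behind it.
early = probability_to_win(10, 500, 15, 500)
# Later: the same relative lift with ten times the data.
later = probability_to_win(100, 5000, 150, 5000)
print(f"early: {early:.1%}, later: {later:.1%}")
```

The same relative lift yields a much higher probability to win once more sessions back it up, which is why an early improvement should not be acted on.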
As data is collected for each test, Shoplift provides the raw results for the key validity metrics explained above. Your test report will also display a summary of how your test data is progressing in the form of various status banners displayed at the top of your test report page.
If your test is in the "Gathering data" stage, this means that the sample size of your test is not yet large enough to provide an indicator of credibility in your test results, and more data is needed.
This is the default status upon launching a test, and you should keep your test running until enough data has been gathered to provide subsequent status updates.
Trending (positive or negative)
As data is collected and confidence in your results improves, a "Trending" status will be displayed, which indicates that the changes you are testing are trending either positively or negatively in relation to your original.
While a "Trending" test provides an indication of the initial degree of success of your test, it is just an indication, and more data is needed to certify that your results are valid.
When your test has gathered enough data to indicate that significance will be reached shortly if the current trend continues, a "Nearing significance" status will be displayed. This indicates a strong probability that your test results are valid.
Shoplift does not advise making decisions on tests that have reached this status, because there may still be continued variability in test results.
When your test has gathered enough data to provide a confidence level of 95% or greater, a "Significant" status will be displayed, and your test results are certified as credible. This means that your test results have achieved 80% statistical power.
When a test enters this stage, you can confidently proceed with making decisions based on your test results.
If your test has a total duration of 14 days or more, and the estimated time to significance remains another 14-60 days out, your test will enter a "Long timeline expected" status. This indicates that while the changes you are testing may reach significance, it will require more data than you may be willing to wait for, depending on what you are testing.
If your test has a total duration of 14 days or more, and the estimated time to significance remains greater than 60 days out, your test will enter a "Significance unlikely" status. This indicates that either the changes you are testing are not large enough to result in a meaningful change in performance, or that your sample size remains too low to provide an indication of credibility in your results.
If a test reaches the "Significance unlikely" stage, Shoplift recommends ending the test and trying a new test idea. If you would like to continue to run the test and collect data, you can, and as soon as the sample size is large enough to provide one of the above statuses, your test will exit this status.
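The two long-running statuses above reduce to a pair of comparisons. A minimal sketch, with a hypothetical function name and with the boundary handling (14 and 60 days inclusive in the "Long timeline expected" range) taken as an assumption:

```python
def long_running_status(days_running, estimated_days_remaining):
    """Return a long-running status label, or None if neither applies."""
    if days_running < 14:
        # Both statuses require a total duration of 14 days or more.
        return None
    if estimated_days_remaining > 60:
        return "Significance unlikely"
    if estimated_days_remaining >= 14:
        return "Long timeline expected"
    # Significance is expected within 14 days; other statuses apply.
    return None

print(long_running_status(days_running=20, estimated_days_remaining=30))
print(long_running_status(days_running=20, estimated_days_remaining=90))
```

The first call returns "Long timeline expected" and the second returns "Significance unlikely". A test exits either status as soon as its estimated time to significance drops below the relevant range.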
When you end a test, your test report will provide a summary status that explains the credibility of your test results, depending on the data collected and the degree of certainty in your results when the test was ended.
If your test achieved significance, a "Significant" status will be displayed that certifies your original or variant experience as the winner at a confidence level of 95%.
If your variant is the winning experience, you can confidently proceed with implementing it as the default experience going forward. For more information on implementing a test variant, see Implementing a winning test.
Inconclusive (not enough data was gathered)
If your test was ended before reaching the "Significant" state, it will be marked as inconclusive because not enough data was gathered to determine a clear winner.
Inconclusive (determining a winner was unlikely)
If your test was ended while it was in the "Significance unlikely" state, it will be marked as inconclusive, but the banner will indicate that significance was unlikely to be achieved for this test idea. This distinction from the more generic Inconclusive state helps you tell apart the historical test ideas that might have reached significance with more data from those that were unlikely to reach it at all.
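The mapping from a test's last running status to its final summary status can be sketched as follows. The function name is hypothetical and the status strings are taken from this article. Treating every status other than "Significant" and "Significance unlikely" as "not enough data" is an assumption based on the descriptions above.

```python
def ended_status(status_when_ended):
    """Map the status a test had when it was ended to its summary status."""
    if status_when_ended == "Significant":
        return "Significant"
    if status_when_ended == "Significance unlikely":
        return "Inconclusive (determining a winner was unlikely)"
    # Gathering data, Trending, Nearing significance, Long timeline expected.
    return "Inconclusive (not enough data was gathered)"

for status in ("Trending", "Significance unlikely", "Significant"):
    print(f"{status} -> {ended_status(status)}")
```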
If your tests often end with no winner, the changes you are testing may not be substantial enough to affect performance in a meaningful way. In that case, we suggest testing more dramatic changes.