Understanding how to run product experiments is a science. I've learnt it in the school of hard knocks, and I hope to share what I've learnt so that you know the basics and don't have to start from scratch.
It is advisable to work with a data scientist (or even better, have one embedded in your team) if you're thinking of running one. They can help you with experiment design and project your likely impact based on historical data.
Experimental results can be directional or statistically-significant.
A directional result is one where you want to validate a hypothesis, but don't need it to be mathematically rigorous or to account for false positives or negatives.
You may be seeing a direction of change, but it could be due to random chance, seasonality or skewed data based on how you run the experiment.
The most common example is changing the onboarding flow for evaluators, which impacts business revenue. The bigger the experience change, the bigger the risk.
The possibility of a positive impact may also be higher with a bigger change, as small cosmetic changes may not move the needle. If possible, push for a more daring change to overcome the local-maximum problem.
You may have a degree of confidence in your new design based on qualitative research, but need bigger volume to validate your hypothesis as the small research sample size may not be representative of your customer base.
Running an experiment like this requires a pre-determined sample size based on the degree of confidence and precision required for the results. For smaller lifts or higher degrees of accuracy, a higher volume is required.
If the volume going through the flow where the change happens is very low, it may take months to run an experiment which is too costly. You might be better off validating your hypothesis using other methods.
Now that we know what type of experiment we are running, let's look at the basic phases of an experiment (excluding development).
We believe that change X (experience change) will result in an increase / decrease / no difference in Y (success metric) because of Z (e.g. belief, supporting research).
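To make the template concrete, here's a hypothetical filled-in example; the change, metric and research below are invented purely for illustration:

```python
# A hypothetical filled-in hypothesis -- the change, metric and research
# cited here are made up for the sake of the example.
hypothesis = {
    "change_x": "shortening the sign-up form from five fields to two",
    "direction": "increase",
    "success_metric_y": "sign-up completion rate",
    "because_z": "usability sessions showed drop-off on the address fields",
}
print(f"We believe that {hypothesis['change_x']} will result in an "
      f"{hypothesis['direction']} in {hypothesis['success_metric_y']} "
      f"because {hypothesis['because_z']}.")
```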
If there are multiple hypotheses, then you need to reduce your α *.
You should choose a target population that is most likely to be positively-impacted by the change.
This target population should ideally:
Measure of success refers to the relative (or absolute) change in the success metric.
Make it obvious whether you are quoting a relative or an absolute change when explaining results to stakeholders, as people usually think in absolute terms. You don't want to mislead them into thinking your impact is higher than it actually is.
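To make the distinction concrete, here's a minimal sketch; the 10% and 12% conversion rates are made-up numbers:

```python
# Hypothetical conversion rates, purely to illustrate relative vs absolute change.
control_rate = 0.10    # 10% of control users convert
variation_rate = 0.12  # 12% of variation users convert

absolute_change = variation_rate - control_rate                   # 0.02 -> 2 percentage points
relative_change = (variation_rate - control_rate) / control_rate  # 0.20 -> 20%

print(f"Absolute change: {absolute_change:.1%} (i.e. 2 percentage points)")
print(f"Relative change: {relative_change:.0%}")
```

Quoting "a 20% lift" when the metric moved by two percentage points is exactly the kind of ambiguity worth heading off.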
If there are multiple success metrics, you need to reduce your α *.
We assume a normal data distribution, with the standard 95% confidence and 80% statistical power for our results.
In inferential statistics, the null hypothesis's default position is that there is no statistically-significant difference between the two variants (the variation and the control experience).
We want to correctly accept it 95% of the time (5% chance of false positives) when there is no actual difference, and correctly reject it 80% of the time (20% chance of false negatives) when there is an actual difference.
Significance level α = 0.05 (5%). Significance level is the % of the time that the minimum effect size will be detected, assuming it doesn't exist.
The p-value represents this; it should usually be < α (0.05) for a result to be statistically-significant.
* If you have multiple hypotheses, multiple success metrics or a test with multiple variants, then you need to reduce your α.
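The post doesn't prescribe how to reduce α, but one common (and deliberately conservative) option is a Bonferroni correction, sketched below as an assumption rather than a recommendation:

```python
# Bonferroni correction: split the overall significance level across all
# comparisons (multiple hypotheses, success metrics or variants under test).
alpha = 0.05
num_comparisons = 3                        # e.g. three success metrics (placeholder)
adjusted_alpha = alpha / num_comparisons   # ~0.0167

# Each individual p-value must now be below the adjusted threshold
# (not 0.05) to be declared statistically significant.
print(f"Per-comparison alpha: {adjusted_alpha:.4f}")
```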
We usually want to be 95% likely to NOT DETECT the minimum effect size when it doesn't exist.
In this case, we don't want false positives.
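As a sketch of how the p-value is computed in practice, here's a two-proportion z-test using statsmodels; the conversion counts are made up, and your experimentation platform may well use a different test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results once the experiment matures.
conversions = [530, 482]   # converted users: [variation, control]
enrolled = [5000, 5000]    # enrolled users:  [variation, control]

z_stat, p_value = proportions_ztest(count=conversions, nobs=enrolled)

alpha = 0.05
print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant at the 95% confidence level.")
else:
    print("Not statistically significant; we fail to reject the null hypothesis.")
```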
Statistical power 1−β = 0.8 (80%). Statistical power is the % of the time that the minimum effect size will be detected, assuming it exists.
We usually want to be 80% likely to DETECT the minimum effect size when it exists.
In this case, we don't want false negatives.
Assuming that we have 2 cohorts in a traditional A/B test, we need to calculate the sample size for each cohort to detect the relative lift in the variation cohort when compared to control.
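A rough way to estimate that per-cohort sample size, assuming a conversion-style metric, the 95% / 80% settings above and statsmodels' power calculators; the baseline rate and the 5% relative lift are placeholder numbers:

```python
from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # current conversion rate (placeholder)
relative_lift = 0.05   # minimum relative lift worth detecting (placeholder)
target_rate = baseline_rate * (1 + relative_lift)

# Cohen's h effect size for comparing two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)

n_per_cohort = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 5% chance of a false positive
    power=0.80,   # 80% chance of detecting the lift if it exists
    ratio=1.0,    # equal-sized variation and control cohorts
)
print(f"~{ceil(n_per_cohort):,} users needed in EACH cohort")
```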
Resources that may help with estimating it are:
Assuming you have a baseline conversion rate and an expected weekly volume of users going through your funnel based on your target population, you need ~X weeks to enrol the sample size and ~Y weeks for the results to reach maturity.
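As a back-of-the-envelope sketch of that timeline; the sample size, weekly volume and maturity window below are placeholders:

```python
from math import ceil

n_per_cohort = 28_000   # from the power calculation above (placeholder)
num_cohorts = 2         # variation + control
weekly_volume = 9_000   # users entering this step of the funnel each week (placeholder)
maturity_weeks = 2      # time for the success metric to mature, e.g. a 14-day window (placeholder)

enrolment_weeks = ceil(n_per_cohort * num_cohorts / weekly_volume)
total_weeks = enrolment_weeks + maturity_weeks
print(f"~{enrolment_weeks} weeks to enrol, ~{total_weeks} weeks before results can be read")
```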
When you start the experiment, you have to monitor the volume of enrolment as it might differ from estimates.
It is advisable not to peek at the results before the experiment reaches maturity, as doing so might mislead you into drawing the wrong conclusions.
However, you can do so if you're running a sequential sampling test, which allows real-time decision-making.
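For a flavour of how sequential testing works, here's a minimal sketch of the classical sequential probability ratio test (SPRT) applied to a single stream of conversion outcomes. Real experimentation platforms typically use more sophisticated variants, so treat this purely as an illustration of the idea, with made-up rates:

```python
import math
import random

def sprt_decision(outcomes, p0=0.10, p1=0.11, alpha=0.05, beta=0.20):
    """Walk through conversion outcomes (1 = converted, 0 = not) and stop as
    soon as the evidence clearly favours the baseline rate p0 or the lifted
    rate p1. Returns ('p1', n), ('p0', n) or ('continue', n)."""
    upper = math.log((1 - beta) / alpha)  # decide in favour of p1 above this
    lower = math.log(beta / (1 - alpha))  # decide in favour of p0 below this
    llr = 0.0  # running log-likelihood ratio
    for i, x in enumerate(outcomes, start=1):
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "p1", i
        if llr <= lower:
            return "p0", i
    return "continue", len(outcomes)

# Simulated data with a true conversion rate of 11%, so the test should
# usually stop early in favour of the lifted hypothesis.
random.seed(1)
outcomes = [1 if random.random() < 0.11 else 0 for _ in range(50_000)]
print(sprt_decision(outcomes))
```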
Statistically-significant experiments take time to set up, design and run. Many startup founders grow their companies successfully without doing them.
On the other hand, as a company scales to millions of users, experimentation infrastructure is built to allow product teams to validate ideas quickly, benefit from the volume and, more importantly, understand the impact of their rollouts on business metrics such as revenue and retention.
As we move into the age of machine learning and big data, experimentation frameworks may change with the assistance of computing intelligence. How that changes the way we run experiments is an exciting development worth watching!
Leave me a message if you'd like to comment on this blog post!