Experimental Design
A/B Testing
- A controlled experiment, usually in the context of a website
- You test the performance of some change to your website (the variant) and measure conversion relative to your unchanged site (the control).
Ideally, choose the metric you are actually trying to influence:
- order amount
- profit
- ad clicks
- order quantity
But attributing actions downstream from your change can be hard
- Especially if you're running more than one experiment at once
Variance Is the Enemy
Common mistake:
- Run a test for some small period of time that results in a few purchases to analyze
- You take the mean order amount from A and B, and declare victory or defeat
- But there's so much random variation in order amounts to begin with that your result was just based on chance (see the sketch after this list)
- You then fool yourself into thinking some change to your website, which could actually be harmful, has made tons of money
- Sometimes you need to also look at conversion metrics with less variance
- Order quantities vs order dollar amounts, for example
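As a quick illustration of the pitfall above, here's a minimal sketch in which both groups draw order amounts from the same made-up distribution (mean $50, standard deviation $30), so any gap between the sample means is pure chance:
import numpy as np
rng = np.random.default_rng(42)
# Both groups draw from the SAME distribution -- there is no real effect
control = rng.normal(50.0, 30.0, 25)  # only 25 orders each, a "small" test
variant = rng.normal(50.0, 30.0, 25)
print(control.mean(), variant.mean())  # the means can easily differ by a few dollars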
T-Tests and P-Values
So, how do we know if a result is likely to be “real” as opposed to just random variation?
T-Test and P-Values!
The T-Statistic
- A measure of the difference between the two sets expressed in units of standard error
- The size of the difference relative to the variance in the data
- A high T value means there’s probably a real difference between the two sets
- Assumes a normal distribution of behavior
- This is a good assumption if you're measuring revenue as your conversion metric
- Otherwise, use a test suited to your data: Fisher's exact test (for clickthrough rates), the E-test (for transactions per user), or the chi-squared test (for product quantities purchased); see the sketch below
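For yes/no conversion data like clickthroughs, here's a minimal sketch of Fisher's exact test with scipy; the counts are invented for illustration:
import numpy as np
from scipy import stats
# Hypothetical 2x2 table of clickthrough counts (invented numbers):
#            clicked   not clicked
# control         30           970
# variant         45           955
table = np.array([[30, 970],
                  [45, 955]])
odds_ratio, p_value = stats.fisher_exact(table)
print(p_value)
# stats.chi2_contingency(table) is the analogous call for a chi-squared test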
The P-Value
- Think of it as the probability of A and B satisfying the "null hypothesis" (that there is no real difference between the control's and the treatment's behavior)
- So, a low p-value implies significance
- More precisely, it is the probability of an observation lying at a t-value at least as extreme, assuming the null hypothesis is true
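Here's how this looks in practice with scipy's independent t-test, on two synthetic sets of samples where B's true mean really is higher than A's: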
import numpy as np
from scipy import stats
A = np.random.normal(25.0, 5.0, 10000)  # 10,000 samples, mean 25, sd 5
B = np.random.normal(26.0, 5.0, 10000)  # same sd, but true mean is 1.0 higher
stats.ttest_ind(A, B)  # independent two-sample t-test
Ttest_indResult(statistic=-14.20191652286491, pvalue=1.4847065535663492e-45)
The t-statistic is a measure of the difference between the two sets expressed in units of standard error. Put differently, it's the size of the difference relative to the variance in the data. A high t value means there's probably a real difference between the two sets; you have "significance". The p-value is the probability of observing a t-value at least as extreme if the null hypothesis were true, so a low p-value also implies "significance." If you're looking for a "statistically significant" result, you want to see a very low p-value and a high t-statistic (more precisely, a high absolute value of the t-statistic). In the real world, statisticians seem to put more weight on the p-value.
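To make "expressed in units of standard error" concrete, here's the same statistic computed by hand from the A and B arrays above (assuming equal variances, which is what stats.ttest_ind does by default):
nA, nB = len(A), len(B)
# Pooled variance across both samples
pooled_var = ((nA - 1) * A.var(ddof=1) + (nB - 1) * B.var(ddof=1)) / (nA + nB - 2)
# Standard error of the difference between the two means
standard_error = np.sqrt(pooled_var * (1.0 / nA + 1.0 / nB))
t = (A.mean() - B.mean()) / standard_error
print(t)  # matches the statistic from stats.ttest_ind(A, B)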
# Re-draw B from the SAME distribution as A -- now there is no real difference
B = np.random.normal(25.0, 5.0, 10000)
stats.ttest_ind(A, B)
Ttest_indResult(statistic=-0.07828796143840591, pvalue=0.9375997767163184)
# Try again with 10x the sample size -- still no real difference
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)
stats.ttest_ind(A, B)
Ttest_indResult(statistic=0.5752102121870725, pvalue=0.5651497840962222)
Our p-value actually got a little lower, and the t-statistic a little larger, but still not enough to declare a real difference. So, you could have reached the right decision with just 10,000 samples instead of 100,000. Even a million samples doesn't help, so if we were to keep running this A/B test for years, you'd never achieve the result you're hoping for:
A = np.random.normal(25.0, 5.0, 1000000)
B = np.random.normal(25.0, 5.0, 1000000)
stats.ttest_ind(A, B)
Ttest_indResult(statistic=0.16312002647674892, pvalue=0.8704239486112085)
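As a sanity check, comparing a sample against itself gives a t-statistic of exactly zero and a p-value of 1.0: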
stats.ttest_ind(A, A)
Ttest_indResult(statistic=0.0, pvalue=1.0)
How do I know when I'm done with an A/B test?
- You have achieved significance (positive or negative)
- You no longer observe meaningful trends in your p-value
- That is, you don't see any indication that your experiment will "converge" on a result over time (see the sketch below)
- You reach some pre-established upper bound on time
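One way to check that second condition: re-run the test on growing slices of your data and watch whether the p-value is trending anywhere. A minimal sketch on synthetic data with no real difference, so the p-value should just wander:
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
A = rng.normal(25.0, 5.0, 100000)
B = rng.normal(25.0, 5.0, 100000)  # same distribution: no real difference
# Re-test on growing prefixes; a converging experiment shows the p-value
# trending steadily downward, while a null result just wanders
for n in (1000, 5000, 10000, 50000, 100000):
    t, p = stats.ttest_ind(A[:n], B[:n])
    print(n, round(p, 4))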