Experimental Design

A/B Testing

  • A controlled experiment, usually in the context of a website
  • You test the performance of some change to your website (the variant) and measure conversion relative to your unchanged site (the control).

Ideally choose what you are trying to influence:

  • order amount
  • profit
  • ad clicks
  • order quantity

But attributing actions downstream from your change can be hard

  • Especially if you’re running more than one experiment


  • A common mistake: you run a test for a short period of time that yields only a few purchases to analyze
  • You take the mean order amount from A and B, and declare victory or defeat
  • But order amounts vary so much at random to begin with that your result may be based on chance alone
  • You then fool yourself into thinking some change to your website, which could actually be harmful, has made tons of money
  • Sometimes you also need to look at conversion metrics with less variance
  • Order quantities vs. order dollar amounts, for example
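
To see how much chance alone can move the numbers, here is a minimal sketch that compares two groups drawn from the same distribution. The order amounts (mean 25, standard deviation 5), the sample sizes, and the helper name `mean_diff_spread` are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_diff_spread(n_orders, n_trials=2000):
    """Std. dev. of (mean of B - mean of A) when A and B share one distribution."""
    # Hypothetical order amounts: both groups have mean 25 and std. dev. 5,
    # so any difference between the group means is pure chance.
    a = rng.normal(25.0, 5.0, size=(n_trials, n_orders))
    b = rng.normal(25.0, 5.0, size=(n_trials, n_orders))
    return (b.mean(axis=1) - a.mean(axis=1)).std()

small = mean_diff_spread(10)    # a short test with a few purchases per group
large = mean_diff_spread(1000)  # a much longer test

print("typical chance difference, 10 orders per group:  ", round(small, 2))
print("typical chance difference, 1000 orders per group:", round(large, 2))
```

With only a handful of orders, the gap between the two means can easily be a couple of dollars even though nothing changed; with more data, the chance gap shrinks.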

T-Tests and P-Values

So, how do we know if a result is likely to be “real” as opposed to just random variation?

T-Tests and P-Values!

The T-Statistic

  • A measure of the difference between the two sets expressed in units of standard error
  • The size of the difference relative to the variance in the data
  • A high T value means there’s probably a real difference between the two sets
  • Assumes a normal distribution of behavior
    • This is a good assumption if you’re measuring revenue as conversion
    • For other metrics, consider Fisher’s exact test (for clickthrough rates), the E-test (for transactions per user), or the chi-squared test (for product quantities purchased)
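
As a rough sketch of two of the alternative tests named above, the snippet below runs Fisher’s exact test on clickthrough counts and a chi-squared test on quantities purchased. All counts are hypothetical numbers invented for illustration; `scipy.stats.fisher_exact` and `scipy.stats.chi2_contingency` are the SciPy calls assumed here.

```python
from scipy import stats

# Hypothetical clickthrough counts: [clicked, did not click]
click_table = [
    [120, 9880],  # control
    [150, 9850],  # variant
]
odds_ratio, p_fisher = stats.fisher_exact(click_table)
print("Fisher's exact test p-value:", p_fisher)

# Hypothetical number of orders containing 1, 2, or 3+ items
quantity_table = [
    [500, 300, 200],  # control
    [480, 310, 240],  # variant
]
chi2, p_chi2, dof, expected = stats.chi2_contingency(quantity_table)
print("Chi-squared test p-value:", p_chi2, "with", dof, "degrees of freedom")
```

As with the t-test, a low p-value suggests the difference between control and variant is unlikely to be chance alone.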

The P-Value

  • Think of it as the probability of seeing a difference at least this large if the “null hypothesis” were true (that is, if there were no real difference between the control’s and the treatment’s behaviour)
  • So, a low P-Value implies significance.
  • It is the probability of an observation lying at a t-value at least this extreme, assuming the null hypothesis.

import numpy as np
from scipy import stats

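# Simulated order amounts: control A (mean 25) and variant B (mean 26), std. dev. 5 each
A = np.random.normal(25.0, 5.0, 10000)
B = np.random.normal(26.0, 5.0, 10000)
# Two-sample t-test: returns the t-statistic and the p-value
stats.ttest_ind(A, B)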
Ttest_indResult(statistic=-14.20191652286491, pvalue=1.4847065535663492e-45)

The t-statistic is a measure of the difference between the two sets expressed in units of standard error. Put differently, it’s the size of the difference relative to the variance in the data. A high t value means there’s probably a real difference between the two sets; you have “significance”. The P-value is the probability of an observation lying at a t-value at least as extreme as the one you measured, assuming there is no real difference; so a low p-value also implies “significance.” If you’re looking for a “statistically significant” result, you want to see a very low p-value and a high t-statistic (well, a high absolute value of the t-statistic, more precisely). In the real world, statisticians seem to put more weight on the p-value result.
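
To make “units of standard error” concrete, here is a minimal sketch that computes the t-statistic and the two-sided p-value by hand and compares them with `stats.ttest_ind`. It assumes the equal-variance (pooled) form of the test, which is what `ttest_ind` uses by default; the seeded generator and the variable names are choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded so the run is repeatable
A = rng.normal(25.0, 5.0, 10000)
B = rng.normal(26.0, 5.0, 10000)

n_a, n_b = len(A), len(B)
df = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled sample variance of the two groups
pooled_var = ((n_a - 1) * A.var(ddof=1) + (n_b - 1) * B.var(ddof=1)) / df

# Standard error of the difference between the two means
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means, in units of standard error
t_manual = (A.mean() - B.mean()) / std_err

# Two-sided p-value: chance of a |t| at least this large under the null hypothesis
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(A, B)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The hand-computed values should agree with SciPy’s up to floating-point rounding.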

# Now give B the same mean as A, so there is no real difference between them
B = np.random.normal(25.0, 5.0, 10000)

stats.ttest_ind(A, B)
Ttest_indResult(statistic=-0.07828796143840591, pvalue=0.9375997767163184)

# Same comparison with ten times as many samples
A = np.random.normal(25.0, 5.0, 100000)
B = np.random.normal(25.0, 5.0, 100000)

stats.ttest_ind(A, B)
Ttest_indResult(statistic=0.5752102121870725, pvalue=0.5651497840962222)

Our p-value actually got a little lower, and the absolute value of the t-statistic a little larger, but still not enough to declare a real difference. So, you could have reached the right decision with just 10,000 samples instead of 100,000. Even a million samples doesn’t help, so if you were to keep running this A/B test for years, you’d never achieve the result you’re hoping for:

A = np.random.normal(25.0, 5.0, 1000000)
B = np.random.normal(25.0, 5.0, 1000000)

stats.ttest_ind(A, B)
Ttest_indResult(statistic=0.16312002647674892, pvalue=0.8704239486112085)

# Sanity check: comparing a set with itself gives t = 0 and p = 1
stats.ttest_ind(A, A)
Ttest_indResult(statistic=0.0, pvalue=1.0)

How do I know when I’m done with an A/B test?

  • You have achieved significance (positive or negative)
  • You no longer observe meaningful trends in your p-value
    • That is, you don’t see any indication that your experiment will “converge” on a result over time
  • You reach some pre-established upper bound on time
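
The three stopping criteria above could be sketched roughly as follows. The helper `stopping_decision`, its thresholds (`alpha`, `max_checks`, `window`, `tol`), and the rule “flat within `tol` over the last few checks means no trend” are all assumptions for illustration, not a standard procedure.

```python
import numpy as np
from scipy import stats

def stopping_decision(p_history, alpha=0.05, max_checks=30, window=5, tol=0.01):
    """Apply the three stopping criteria to a list of p-values, oldest first."""
    if p_history[-1] < alpha:
        return "stop: significant"
    if len(p_history) >= max_checks:
        return "stop: time limit reached"
    recent = p_history[-window:]
    # Crude "no trend" rule (an assumption): the p-value has barely moved lately
    if len(recent) == window and max(recent) - min(recent) < tol:
        return "stop: p-value no longer trending"
    return "continue"

# Simulated experiment with no real difference: 100 new orders per group per day
rng = np.random.default_rng(1)
a = rng.normal(25.0, 5.0, 3000)
b = rng.normal(25.0, 5.0, 3000)

p_history = []
for day in range(1, 31):
    n = day * 100
    _, p = stats.ttest_ind(a[:n], b[:n])
    p_history.append(p)
    decision = stopping_decision(p_history)
    if decision != "continue":
        break

print("day", day, "->", decision)
```

Whatever thresholds you pick, set them before the experiment starts, so the upper bound on time really is pre-established.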