Basic Statistics

Significant.
Maths
Statistics
Author

Gurpreet Johl

Published

July 20, 2025

1. Distributions

A distribution is just “the list of observed values”.

It is tempting to think of a distribution as a function or a plot, but these are ultimately representations of the distribution itself.

Basic concepts to be aware of regarding distributions:

  • Continuous vs discrete values
  • Mean, median and mode
  • Variance (and standard deviation). These are the analogues of MSE and RMSE.
  • Skewness
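
These basic descriptive statistics can be computed directly with Python's standard library; the data here is a small made-up example:

```python
import statistics

# A small made-up set of observed values.
data = [2, 3, 3, 5, 7, 7, 7, 12]

mean = statistics.mean(data)           # sum / count
median = statistics.median(data)       # middle value (midpoint of the two, if even count)
mode = statistics.mode(data)           # most frequent value
variance = statistics.pvariance(data)  # mean squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
```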

2. Central Limit Theorem

2.1. Background

We can sample from a distribution, which is to say we observe a subset of the possible observations.

The central limit theorem states three properties observed when repeatedly sampling \(N\) items from any distribution.

  1. The sample means are normally distributed
  2. The estimated sample mean converges to the true population mean
  3. The standard deviation of the estimated mean converges to \(\frac{\sigma_{pop}}{\sqrt{N}}\)

A tangible example of this might be if we wanted to estimate the words per page in a book.

  • Group A takes the painstaking approach of counting every word on every page.
  • Group B splits the work. They each count 20 pages and report the mean. But they don’t coordinate, so they each pick 20 random, possibly overlapping pages.

Group B’s mean will converge to the true Group A mean if they do this enough times.

In practice, we don’t actually sample a set number of observations thousands of times and plot the distribution. But the CLT result is useful because it tells us what we would have observed if we had done so. We can then use this for hypothesis testing. We have a sample value. We can say whether it is above or below the sample mean (which is a decent estimate of the true mean) and by how much (in terms of std devs).
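
The book example can be simulated directly. A minimal sketch, with randomly generated page counts standing in for a real book:

```python
import random
import statistics

random.seed(0)

# A made-up, non-Normal "population": the word count of every page.
population = [random.randint(150, 450) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Each Group B member samples N pages and reports the mean;
# repeat this many times and collect the estimates.
N = 20
sample_means = [statistics.mean(random.sample(population, N))
                for _ in range(2_000)]

# The estimates cluster around the true mean (CLT property 2)...
error = abs(statistics.mean(sample_means) - true_mean)

# ...and their spread matches sigma_pop / sqrt(N) (CLT property 3).
spread = statistics.stdev(sample_means)
predicted = statistics.pstdev(population) / N ** 0.5
```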

2.2. Z-Score

This leads on to the concept of Z-scores. This is essentially normalising values on any scale. So the number is telling us “how many standard deviations am I away from the mean?”

These numbers are tabulated, so we know what the probability is of obtaining a Z-score at least as extreme as this.
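
As a sketch, the table lookup can be replaced with the standard Normal CDF from Python's standard library (the mean and standard deviation here are made-up):

```python
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical population parameters
x = 130              # observed value

# "How many standard deviations am I away from the mean?"
z = (x - mu) / sigma

# Probability of a value at least this extreme (one-tailed).
p = 1 - NormalDist().cdf(z)
```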

3. Hypothesis Testing

3.1. Two Universes

We can think of hypothesis tests as proposing two “universes”:

\[ \begin{aligned} H_0 &: \text{This is a fair coin} \\ H_1 &: \text{This is not a fair coin} \end{aligned} \]

We conduct some experiments and calculate the probability of this outcome in universe 0. If this is sufficiently unlikely, we reject the null hypothesis, i.e. we don’t believe we live in universe 0, therefore we must live in universe 1.

3.2. P-values

As an example, if we flip the coin 6 times and get 6 heads in a row, the probability of this outcome if this were a fair coin is \(0.5^6 \approx 1.6\%\). In other words, the p-value is about 0.016.

This makes us feel “uneasy” so we start to question whether we actually live in universe 0. If we lived in universe 1 and the coin was always heads, then P(6 heads) would be 100% so we wouldn’t get an uneasy feeling.
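
The arithmetic for the coin example in each universe:

```python
# Universe 0: fair coin, P(heads) = 0.5.
p_value = 0.5 ** 6         # probability of 6 heads in 6 flips

# Universe 1 (an always-heads coin): the same outcome is certain.
p_always_heads = 1.0 ** 6
```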

The confidence level is the critical p-value at which we reject the null hypothesis. In other words, at what point we feel “uneasy” enough that we change our world view.

3.3. Hypothesis Testing Steps

The procedure for testing whether a sample mean is statistically significantly different from a population mean is as follows. This is a Z-test because we are testing based on the Z-score.

  1. State the null hypothesis
  2. State the alternative hypothesis
  3. Calculate the standard deviation of the sample mean (the standard error): \(\sigma_{sample} = \frac{\sigma_{pop}}{\sqrt{N}}\)
  4. Calculate the Z-score: \(Z = \frac{\bar{x} - \mu_{pop}}{\sigma_{sample}}\)
  5. Look up the p-value of our Z-score
  6. Compare p-value to confidence level. Accept or reject the null hypothesis.
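
The steps above can be sketched in Python, with made-up numbers for the population and sample:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers: population mean/std known, sample of N observations.
mu_pop, sigma_pop = 300, 50   # words per page, population
x_bar, N = 325, 40            # sample mean and size

# Step 3: standard deviation of the sample mean (the standard error).
sigma_sample = sigma_pop / sqrt(N)

# Step 4: Z-score of the observed sample mean.
z = (x_bar - mu_pop) / sigma_sample

# Step 5: one-tailed p-value from the standard Normal CDF.
p_value = 1 - NormalDist().cdf(z)

# Step 6: compare to the chosen confidence level.
reject_null = p_value < 0.05
```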

3.4. Rejection Region

An alternative approach to choosing whether to accept/reject the null hypothesis is to calculate the critical Z-value for the rejection region.

We proceed through the first 4 steps as before. But rather than converting our Z-score to a p-value and then comparing to our critical p-value (the confidence level), we go the other way. We convert the confidence level p-value to a critical Z-score. Then we can say whether our observed Z-score was more extreme than the critical Z-score in order to accept/reject the null hypothesis.

  1. Convert confidence level to critical Z-score
  2. Compare Z vs Z_critical. Accept or reject the null hypothesis.

We can shade in the “rejection region” of the normal distribution and see if our sample value lies in it.
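
A minimal sketch of the rejection-region approach, with a hypothetical observed Z-score:

```python
from statistics import NormalDist

alpha = 0.05  # confidence level (the critical p-value)

# One-tailed critical Z-score: the boundary of the rejection region.
z_critical = NormalDist().inv_cdf(1 - alpha)

# Reject H0 if the observed Z-score lands in the rejection region.
z_observed = 2.1  # hypothetical
reject_null = z_observed > z_critical
```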

3.5. Hypothesis Testing Assumptions

Z-test assumptions:

  1. Sample is selected at random.
  2. Observations are independent.
  3. The population’s standard deviation is known, OR the sample contains at least 30 samples. With an unknown population standard deviation, use a T-test, but above 30 samples the distribution basically converges to Normal anyway.

3.6. Proportion Testing

This is a very similar problem where we want to test whether the proportion within a population has changed.

For example, a 2016 study says that 58% of households have tablets (e.g. iPads). We survey 100 random households and find that 73 do. We want to know whether the true proportion is still 58% or has increased.

(Confusingly, p refers to population here, not probability)

H0: 58% or fewer households have tablets
H1: More than 58% of households have tablets.

First we have to check that our sample size is big enough:

\[ \begin{aligned} np &\ge 10 \\ nq &\ge 10 \\ \text{where } q &= 1 - p \end{aligned} \]

We then conduct a hypothesis test as before, with: \[ \begin{aligned} \mu_{\text{population}} &= p_{\text{population}} \, (0.58) \\ \sigma_{\text{population}} &= \sqrt{p q} \end{aligned} \]

The proportions are Normally-distributed as before so we can conduct a Z-test.
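
The tablet example worked through as a sketch:

```python
from math import sqrt
from statistics import NormalDist

p0, n, observed = 0.58, 100, 73
q0 = 1 - p0

# Sample-size check: both np and nq should be at least 10.
large_enough = n * p0 >= 10 and n * q0 >= 10

p_hat = observed / n  # observed sample proportion

# Standard error of the proportion under H0: sqrt(pq) / sqrt(n).
sigma_sample = sqrt(p0 * q0) / sqrt(n)

z = (p_hat - p0) / sigma_sample
p_value = 1 - NormalDist().cdf(z)  # one-tailed: H1 is "more than 58%"
```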

3.7. Failing to Reject the Null Hypothesis

The two options when hypothesis testing are “reject the null hypothesis” or “fail to reject the null hypothesis”.

We don’t accept any specific hypothesis. If our Z-score exceeds the critical Z-score, it doesn’t necessarily mean H1 is true; there could be some other hypothesis H2 that also explains the result. So we don’t “accept H1”.

Equally, we don’t “accept H0”. If our Z-score doesn’t reach the critical value, it may just be that we need more samples, the data wasn’t randomly sampled, etc.

4. t-testing

4.1. The t-distribution

The t-distribution describes a sample rather than the distribution of the population.

The t-distribution has heavier tails than the Normal distribution. The fewer “degrees of freedom”, the heavier the tails. In other words, the fewer observations we have, the more plausible it is that the true value could be further away than what we’ve observed.

As the DOF (symbol \(\nu\)) tends to infinity, the distribution converges towards a Normal distribution. At about 30, it’s almost identical to a Normal.

Use a t-test when:

  1. The population standard deviation is unknown AND
  2. Sample size is small (\(n \lt 30\))

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

\(\nu\) is basically always \(N-1\) (where \(N\) is the number of samples)

4.2. t-tests

The t-test hypothesis testing process is almost identical to the z-test. The only difference is we calculate a t-statistic (rather than a z-score). The rest is the same: we formulate our hypotheses, calculate the score and compare to our critical value.
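
A sketch of a one-sample t-test with made-up data; the critical value 2.262 is the tabulated two-tailed 95% value for \(\nu = 9\):

```python
from math import sqrt
import statistics

# Hypothetical small sample (n = 10) with unknown population sigma.
sample = [312, 289, 305, 330, 298, 315, 302, 294, 321, 308]
mu_0 = 300  # hypothesised population mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# t-statistic: same shape as the Z-score, but using s in place of sigma.
t = (x_bar - mu_0) / (s / sqrt(n))

# Critical value from a t-table: two-tailed, 95%, nu = n - 1 = 9 DOF.
t_critical = 2.262
reject_null = abs(t) > t_critical
```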

4.3. 1-tailed and 2-tailed tests

For 2-tailed tests, you split the rejection probability across both sides, e.g. a 5% confidence level means the rejection region is the top 2.5% and the bottom 2.5% of the distribution.

So a 2-tailed test is more strict, i.e. more difficult to reject the null hypothesis.

You can think of this as: a 1-tailed test incorporates prior knowledge that the variable can only have increased (or decreased), whereas a 2-tailed test has no such prior knowledge, so the change could be in either direction.

4.4. Misuse of p-values

The p-value only tells us how likely a result at least as extreme as ours would be if the null hypothesis were true; it supports a binary accept/reject decision and nothing more.

It doesn’t tell us about the magnitude or uncertainty of the effect we are claiming.

It is best practice to report the effect sizes and confidence intervals as well as the p-values, to give a fuller picture.

For example, a shampoo increases hair volume at a 5% confidence level. How much does it increase hair volume?

4.5. Confidence Intervals

This gives us a range that, at our chosen confidence level, contains the true population parameter.

It is the sample mean ± the margin of error. The margin of error is the standard error \(\frac{\sigma}{\sqrt{n}}\) multiplied by the Z-score.

The Z-score corresponds to the confidence level chosen, e.g. 1.96 for a 95% confidence level.
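
A sketch with hypothetical sample statistics:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample summary statistics.
x_bar, sigma, n = 64.5, 2.5, 100

confidence = 0.95
# Two-tailed: put half of the remaining 5% in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z * sigma / sqrt(n)
ci = (x_bar - margin_of_error, x_bar + margin_of_error)
```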
