Introduction to hypothesis testing

[In this course we will be learning how to formulate figure-of-merit statistics that can help to answer research questions like “Is quantity A significantly greater/less than quantity B?”, or “Does quantity X appear to be significantly related to quantity Y?”.  As we are about to discuss, statistics that can be used to answer these types of questions do so through the underlying probability distribution of the statistic.  Every statistic used for hypothesis testing has an underlying probability distribution.]

Let’s begin our discussion of hypothesis testing by looking at a data point, X, which under the null hypothesis is drawn from the Normal distribution with mean 0 and standard deviation 1 (i.e., the standard Normal distribution).  Recall that the standard Normal distribution is symmetric about 0, with tails extending out to plus and minus infinity; the further we get from zero, the lower the probability density.  Thus, if our observed X is close to zero, it is quite likely that it was randomly drawn from the standard Normal distribution.  If X is far from zero, however, say X = 4.3, the probability of observing such a high value is low.  In fact, the probability of observing a value of X at least that high is the integral of the upper tail of the Normal distribution from X to infinity.  This is called a “one-tailed” test of significance.  If, on the other hand, we want to assess the probability of observing a value of X at least that far from zero in either direction, then we concern ourselves with the probability of observing |X| at least as large as our observed value.  This is the integral of the probability distribution from minus infinity to -X, plus the integral from +X to infinity.  This is called a “two-tailed” test of significance.
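These tail integrals are easy to compute numerically.  As a minimal sketch using only the Python standard library (no statistics package required), note that the upper-tail probability of the standard Normal can be written in terms of the complementary error function: P(X ≥ x) = erfc(x/√2)/2.  The value x = 4.3 below is the example observation from the text.

```python
import math

def upper_tail_p(x):
    """P(X >= x) for X ~ Normal(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 4.3                              # the observed data point from the example
p_one = upper_tail_p(x)              # one-tailed: integral from x to infinity
p_two = 2 * upper_tail_p(abs(x))     # two-tailed: both tails, by symmetry

print(p_one)   # roughly 8.5e-06
print(p_two)   # exactly twice the one-tailed value
```

Note how small these probabilities are: a value as extreme as 4.3 is drawn from the standard Normal fewer than once in 100,000 tries.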


The p-value is the probability, assuming the null hypothesis is true, of observing data at least as extreme as what we actually observed.  Alpha is the probability cut-off below which we declare the observation improbable under the null hypothesis.  Usually a cut-off of alpha = 0.05 is used in analyses.  When p-value < alpha, we say that we have a “statistically significant” result.
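The decision rule itself is just a comparison.  A small sketch (the p-values passed in below are hypothetical, purely for illustration):

```python
def significant(p_value, alpha=0.05):
    """Return True when the result would be declared statistically significant,
    i.e. when the p-value falls strictly below the alpha cut-off."""
    return p_value < alpha

print(significant(0.012))  # True: reject the null hypothesis
print(significant(0.20))   # False: fail to reject the null hypothesis
```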

The use of alpha = 0.05 is somewhat controversial because it is arbitrary.  Moreover, at this cut-off we reject the null hypothesis one time in 20, on average, even when it is actually true.  This means that many spurious “statistically significant” results can make it into the literature, especially if multiple tests of significance were done in an analysis and the researchers did not correct their alpha for the number of tests they did (for example, if we did 100 tests of significance in an analysis, even when the null hypothesis is actually true, on average 5 of those tests would yield a “significant” result, causing us to wrongly reject the null hypothesis).
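The 100-test example is easy to verify by simulation.  Under a true null hypothesis, a p-value is uniformly distributed on (0, 1), so each test independently comes out “significant” with probability alpha.  A minimal sketch (the seed and trial counts are arbitrary choices):

```python
import random

random.seed(1)
alpha, n_tests, n_trials = 0.05, 100, 2000

# Each simulated analysis runs 100 tests of a TRUE null hypothesis and
# counts how many come out "significant" purely by chance.
avg_false_positives = sum(
    sum(random.random() < alpha for _ in range(n_tests))
    for _ in range(n_trials)
) / n_trials
print(avg_false_positives)   # close to alpha * n_tests = 5

# One common fix, the Bonferroni correction, divides alpha by the number
# of tests, keeping the chance of ANY false positive near the original alpha.
bonferroni_alpha = alpha / n_tests   # 0.0005 per test
```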

Because of this problem, one psychology journal (Basic and Applied Social Psychology, in 2015) actually banned the use of p-values in analyses published in its pages.

Type I error: Incorrectly rejecting the null hypothesis when it is actually true.  The type I error rate can be controlled by decreasing alpha; alpha also needs to be reduced when doing multiple tests of significance.

Type II error: Incorrectly failing to reject the null hypothesis when it is actually false.  Larger sample sizes reduce type II errors because they give better statistical power to distinguish between the null and alternate hypotheses.
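The effect of sample size on type II error can also be checked by simulation.  In this sketch the data really come from a Normal with mean 0.5 (a hypothetical true effect chosen for illustration), while the null hypothesis claims the mean is 0; we count how often a two-tailed z-test at alpha = 0.05 fails to reject that false null:

```python
import math
import random

random.seed(0)
z_crit = 1.96          # two-tailed critical value for alpha = 0.05
true_mean = 0.5        # hypothetical true effect; the null says the mean is 0

def type2_rate(n, trials=2000):
    """Fraction of simulated samples of size n that fail to reject the false null."""
    misses = 0
    for _ in range(trials):
        xs = [random.gauss(true_mean, 1.0) for _ in range(n)]
        z = (sum(xs) / n) * math.sqrt(n)   # z-statistic against H0: mean = 0
        if abs(z) < z_crit:                # failed to reject a false null
            misses += 1
    return misses / trials

print(type2_rate(10))   # sizeable type II error rate at n = 10
print(type2_rate(50))   # much smaller at n = 50
```

Increasing n from 10 to 50 shrinks the type II error rate dramatically, because the sample mean's spread falls as 1/√n and the two hypotheses become easier to tell apart.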

Example:

Null hypothesis (H0): “The person on trial is innocent.”

A type I error occurs when convicting an innocent person (a miscarriage of justice). “Beyond a reasonable doubt” is an attempt to make alpha in trials as small as possible to reduce the probability of rejecting this null when it is actually true.

A type II error occurs when letting a guilty person go free (an error of impunity).

A positive correct outcome occurs when convicting a guilty person. A negative correct outcome occurs when letting an innocent person go free.
