Hypothesis testing of count data (flowchart)

This page describes how to determine if count data are statistically consistent with some value. Count data are data counted in bins of some time span, for instance the number of influenza cases per day, or the number of murders per year.

In the discussion here, we assume that the probability distribution underlying the stochasticity of the data in each bin is the Poisson distribution.  Recall that the Poisson distribution is a one-parameter distribution with parameter lambda; the mean of the distribution is lambda, and the standard deviation of the distribution is sqrt(lambda).
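As a quick illustrative check of the mean and standard deviation stated above, the following R sketch simulates Poisson counts and compares the sample mean and standard deviation with lambda and sqrt(lambda). The value lambda=7, the sample size, and the seed are arbitrary choices made here for illustration.

```r
set.seed(42)                      # fixed seed so the illustration is reproducible
lambda <- 7                       # arbitrary illustrative value of the Poisson parameter
x <- rpois(100000, lambda)        # 100,000 simulated Poisson counts

sample_mean <- mean(x)            # should be close to lambda
sample_sd   <- sd(x)              # should be close to sqrt(lambda)

print(c(mean = sample_mean, expected_mean = lambda,
        sd = sample_sd, expected_sd = sqrt(lambda)))
```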

If the data instead count how often something happens out of N trials (for instance, the number of students who pass an exam out of a class of N people), the appropriate distribution is usually the Binomial distribution, not the Poisson distribution. Hypothesis testing with the Binomial distribution was discussed in a past module.

If the data are just counting how many events happened within certain time frames (bins), then the Poisson distribution is usually considered to underlie the stochasticity of the data within each bin.

With the Poisson distribution, recall that when lambda is large (say, greater than about 10; a value of 5 is really pushing it), the shape of the Poisson distribution approaches that of the Normal distribution with mean lambda and standard deviation sqrt(lambda).
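To see how the approximation improves as lambda grows, here is a minimal R sketch that compares the exact Poisson probabilities with the Normal density of mean lambda and standard deviation sqrt(lambda). The helper name max_rel_diff and the two lambda values are assumptions chosen for illustration; the discrepancy is expressed relative to the height of the Poisson peak.

```r
# Largest gap between the Poisson probabilities and the Normal density,
# relative to the height of the Poisson peak (hypothetical helper).
max_rel_diff <- function(lambda) {
  k <- 0:(4 * lambda + 10)                               # counts covering essentially all the probability
  p_pois <- dpois(k, lambda)                             # exact Poisson probabilities
  p_norm <- dnorm(k, mean = lambda, sd = sqrt(lambda))   # Normal density at the same counts
  max(abs(p_pois - p_norm)) / max(p_pois)
}

diff_small <- max_rel_diff(2)     # lambda well below 5
diff_large <- max_rel_diff(50)    # lambda well above 10

print(c(lambda_2 = diff_small, lambda_50 = diff_large))
```

The discrepancy for lambda=50 should come out much smaller than for lambda=2, which is the reason for the rule of thumb about the number of counts per bin.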

 

Question: is the number of counts in each bin at least 5 to 10?

Yes, there are at least 5 to 10 counts in each bin

With at least 5 to 10 counts per bin, the data are approximately Normally distributed in each bin, and so we can use the Pearson chi-square test.

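1) To test if the counts in each bin are all consistent with one another, first calculate the mean number of counts per bin, bar(X).  Then calculate the chi-square statistic

$$Q = \sum_{i=1}^{N_{\rm bins}} \frac{(X_i - \bar{X})^2}{\bar{X}}$$

where X_i is the number of counts in bin i.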

Then use pchisq(Q,df), with df=(Nbins-1), to test the null hypothesis that the data in each bin are consistent with being drawn from Poisson probability distributions with the same mean. Note that pchisq(Q,df) returns the lower-tail cumulative probability, so the p-value of the test is 1-pchisq(Q,df). If pchisq(Q,df) is close to 1 (say, greater than 0.95), which is the same as the p-value being less than 0.05, we reject the null hypothesis.  The degrees of freedom is (Nbins-1) because we used up a degree of freedom calculating bar(X).
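The following R sketch walks through test 1, assuming the statistic has the usual Pearson form, the sum over bins of (X_i - bar(X))^2 / bar(X). The counts are made up for illustration and the variable names are assumptions. Because pchisq() returns the lower-tail probability by default, the sketch reports both that value and the upper-tail p-value.

```r
# Hypothetical counts per bin (for example, cases per week); invented for illustration.
counts <- c(12, 15, 9, 11, 20, 14, 8, 13)

Nbins <- length(counts)
Xbar  <- mean(counts)                      # estimated common mean, bar(X)
Q     <- sum((counts - Xbar)^2 / Xbar)     # Pearson chi-square statistic
df    <- Nbins - 1                         # one degree of freedom used up estimating bar(X)

cdf_value <- pchisq(Q, df)                 # lower-tail probability; near 1 means Q is unusually large
p_value   <- 1 - cdf_value                 # upper-tail p-value; below 0.05 means reject

print(c(Q = Q, df = df, pchisq = cdf_value, p_value = p_value))
```

For these made-up counts the p-value is well above 0.05, so the null hypothesis of a common Poisson mean would not be rejected.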

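2) To test if the counts in each bin are consistent with particular expected values Xexpect_i, calculate the chi-square statistic

$$Q = \sum_{i=1}^{N_{\rm bins}} \frac{(X_i - X^{\rm expect}_i)^2}{X^{\rm expect}_i}$$

where X_i is the observed number of counts in bin i and X^{\rm expect}_i is the expected value Xexpect_i for that bin.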

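The number of degrees of freedom of the test will be df=Nbins, because no parameters were estimated from the data.  Use pchisq(Q,df) to test the null hypothesis that the count data are drawn from Poisson probability distributions with means equal to Xexpect_i.  Note that pchisq(Q,df) returns the lower-tail cumulative probability, so the p-value of the test is 1-pchisq(Q,df).  If pchisq(Q,df) is close to 1 (say, greater than 0.95), which is the same as the p-value being less than 0.05, then reject the null hypothesis.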
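Test 2 can be sketched in R in the same way, assuming the statistic is the sum over bins of (X_i - Xexpect_i)^2 / Xexpect_i. Both the observed counts and the expected values below are invented for illustration. The argument lower.tail=FALSE asks pchisq() for the upper-tail probability directly, which equals 1-pchisq(Q,df).

```r
# Hypothetical observed counts and expected values per bin; invented for illustration.
counts  <- c(12, 15, 9, 11, 20, 14, 8, 13)
Xexpect <- c(10, 12, 11, 12, 15, 13, 10, 12)

Q  <- sum((counts - Xexpect)^2 / Xexpect)       # Pearson chi-square statistic
df <- length(counts)                            # df = Nbins: nothing was estimated from the data

p_value <- pchisq(Q, df, lower.tail = FALSE)    # upper-tail p-value, same as 1 - pchisq(Q, df)

print(c(Q = Q, df = df, p_value = p_value))
```

With these made-up numbers the p-value is well above 0.05, so the counts would be judged consistent with the expected values.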

No, there are not at least 5 to 10 counts in each bin

Beyond the scope of this course (likelihood methods are needed)

 
