In this module we will discuss numerical methods that can be used to calculate the 95% CI on data that has been transformed by some function, if one knows the probability distribution underlying the stochasticity of the original data.
Propagation of uncertainties
When doing statistical analyses, given an assumed probability distribution that underlies the stochasticity in the data, you might want to apply a function to the data, then obtain the 95% confidence interval on that transformation.
As an example of this, let’s assume that we are doing an analysis of the annual per capita rate of public mass shootings (with four or more people killled) that involved high capacity firearms during the Federal Assault Weapons ban from September 13, 1994 to September 13, 2004, compared to the rate of mass shootings since the ban has lapsed, to the end of 2013.
During the ban, say we observed that there were 12 mass shootings involving high capacity firearms. The ban lasted 10 years, and the average population of the US was 279.2 million people during that time.
After the ban, we observed that there were 35 mass shootings, over a period of 13.3 years, and the average US population during that time was 310.9 million people.
Let’s call the number of mass shootings N, the population P, and the length of the time period T. The annual per capita rate of mass shootings is thus N/(P*T). We would like to estimate the 95% confidence interval on our estimated per capita rate of public mass shootings. To do this, we notice that the number of mass shootings can be assumed to be Poisson distributed, because it is count data. Thus, to estimate the 95% confidence interval on our transformed data, we can generate a large number of Poisson distributed random numbers with mean N, divide those random numbers by (P*T), and then use the R quantile() function to determine the 95% confidence interval. Like so:
N = 12 T = 10 P = 279.2 vN = rpois(1000000,N) vrate = vN/(P*T) q = quantile(vrate,probs=c(0.025,0.975)) cat("The average annual rate per million people is:",N/(P*T),"\n") cat("The 95% confidence interval is [",q,",",q,"]\n",sep="")
Now, how about if we wanted to determine the ratio, R, of the rate after the ban to the rate during the ban. This is N_2/(P_2*T_2) divided by N_1/(P_1*T_1). We also want to estimate the 95% confidence interval on this quantity to determine if the interval contains 1. If it does, then there is no significant difference between the two rates.
set.seed(4289588) N_1 = 12 T_1 = 10 P_1 = 279.2 N_2 = 35 T_2 = 13.3 P_2 = 310.9 vN_1 = rpois(1000000,N_1) vN_2 = rpois(1000000,N_2) vR = vN_2/(P_2*T_2) vR = vR/(vN_1/(P_1*T_1)) R = N_2/(P_2*T_2) R = R/(N_1/(P_1*T_1)) q = round(as.numeric(quantile(vR,probs=c(0.025,0.975))),2) cat("The ratio of the rates is:",round(R,2),"\n") cat("The 95% confidence interval is [",q,",",q,"]\n",sep="")
We see that the 95% CI does not include the number 1, thus we reject the null hypothesis that the two rates are statistically consistent.
If we wanted to get the p-value for the difference between the rates, we could have used population standardized Poisson regression, with a factor, vperiod, on the RHS of the regression equation indicating whether or not the period was during the ban, or after it:
vN = c(N_1,N_2) vP = c(P_1,P_2) vT = c(T_1,T_2) vperiod = c(1,2) myfit = glm(vN~offset(log(vP))+offset(log(vT))+factor(vperiod),family="poisson") print(summary(myfit)) pvalue = summary(myfit)$coef[2,4] cat("The p-value testing the null hypothesis the rates are statistically consistent is:",pvalue,"\n")
The “intercept” coefficient is the log of the rate when vperiod=1, and the sum of the first and second coefficients is the log of the rate when vperiod=2.
In a paper, we could have just stated the regression coefficient as our result, but it is more easily understood by the average reader if instead the results are expressed as a ratio of the rates, with the 95% confidence interval on that ratio. You can still use the p-value from the population standardized regression to assess the null hypothesis that the rates are statistically consistent.