AML 610 Fall seminar series: Module IV linear regression

In this module, students will become familiar with linear regression methods. Note that before proceeding with any regression analysis, it is important to first perform initial data exploration, both with visualizations (histograms, boxplots, and scatter plots) and with numerical summaries of the variables (means, standard deviations, maxima and minima, and correlations between variables).  In this way, you can determine if there are any unexpected “quirks” or problems with the data (and, more often than not, there are).


In this module we will be discussing fitting linear models to data, which is to say that we hypothesize that some relationship exists between our data, y, and some variable x that we think is likely linearly related (so our hypothesized model is y=a*x+b, and we wish to estimate the model parameters a and b).

One way we could do this is to try lots of different pairs of values of a and b, and choose the pair that appears to give us the “best fit” to the data.  But in the field of statistical data analysis, assessing goodness of fit isn’t just done by looking at a plot of the data with the model overlaid and thinking “Yeah, that looks like the best fit so far…”.  Instead, there are quantities (“figure of merit” statistics) that you can calculate that give an indication of how well the model fits the data.  In this module, we will be discussing the Least Squares statistic for assessment of goodness of fit.  The LS statistic is the sum of the squared distances between our data, y, and the model (in this case a*x+b) (figure: least_square_demo).

Can you see that a well fitting model will be the one that minimizes the sum of these squared distances?  Of the two models above, which looks like a better fit? (as we will see below, one of them has to be better).
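To make this trial-and-error idea concrete, here is a minimal R sketch (the data values and the two candidate parameter pairs are made up purely for illustration):

```r
# A small made-up data set that looks roughly linear
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8)

# The Least Squares figure-of-merit statistic for a candidate model y = a*x + b
least_squares = function(a, b, x, y) sum((y - (a*x + b))^2)

least_squares(2, 0, x, y)  # candidate a=2, b=0: small sum of squared distances
least_squares(1, 3, x, y)  # candidate a=1, b=3: much larger, so a worse fit
```

The candidate with the smaller sum of squared distances is the better fit, by this figure of merit.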

As an aside: looking at this data, do you think there is evidence of a “significant” linear trend?  Can you see that if there were fewer points, evidence of a potential trend would be sketchier?  Can you see that if the data have a lot of stochasticity (up-and-down variation from point to point), evidence of a potential trend would also be sketchier?  We will be discussing below the mathematical rigour of linear model fitting using least squares, and how to assess the significance of model parameters, testing the null hypothesis that they are consistent with zero.

[nextpage title="Underlying assumptions of least squares regression"]

Before we move on to talk about the statistical details of linear models, an important aside that needs to be made is that Least Squares is just one figure of merit statistic.  Others exist (for instance weighted least squares, Pearson chi-square, or Poisson or Negative Binomial maximum likelihood). The different FoM statistics assume different underlying things about the data.  In the case of Least Squares, it assumes that the stochasticity of all the data points is independently drawn from the same probability distribution, and that that probability distribution is Normal with mean zero and constant variance. This may or may not be a good assumption.  The problem is that if this underlying assumption is violated, the uncertainty estimates on the Least Squares parameter estimates will be unreliable, and the p-values testing the null hypothesis that the parameters are consistent with zero will also be wrong.

Whatever fitting method we use, it is important to understand all of the underlying assumptions of the method, and ensure that the data actually are consistent with those underlying assumptions. Keep this in mind whenever you fit any kind of model to data, using any method.

[nextpage title="Linear regression with Least Squares"]

Regression analyses are used to model the relationship between a variable Y (usually called the dependent variable, or response) and one or more explanatory (sometimes called independent) variables X1, X2, … Xp.  When p>1 the regression is also known as multiple regression. When p=1 the regression is also known as simple regression.

In regression analyses, the dependent variable is continuous, but the explanatory variables can be continuous, discrete, or categorical (like Male/Female, for instance).

Regression analyses can be done for a variety of different reasons, for instance prediction of future observations, or assessment of the potential relationship between the dependent and independent variables.

Linear regression models assume a linear relationship between Y and one or more explanatory variables, and the parameters are obtained from the best fit to the data by optimizing some FoM statistic like Least Squares.  Thus the model for the linear dependence of the data y on one explanatory variable, x, would look something like this for the i-th data point:

y_i = \alpha + \beta x_i + \epsilon_i
where epsilon is a term that describes the left-over stochasticity in the data, after taking into account the linear regression line (epsilon is also known as the residual), the x_i are the data corresponding to the independent, explanatory variable, and the y_i are the data for the dependent variable.  The alpha and the beta are the unknown parameters that we will fit for (and also epsilon, but we’ll get to that in a moment).  Note that the equation is linear in those parameters, which is why this is known as linear regression.

The explanatory variables themselves can appear as non-linear functions, as long as unknown parameters don’t appear in the function.  So, for a known frequency omega, and data points (t_i,y_i)

y_i = \alpha + \beta \sin(\omega t_i) + \epsilon_i
is also a linear model (and in fact, one that we will return to when we discuss harmonic linear regression later on in this AML 610 course to model periodic behaviour in the data).  However, an example of a non-linear model when alpha and beta are unknown is:

y_i = \alpha + \beta e^{\gamma t_i} + \epsilon_i
In this course we will be focussing on linear regression rather than non-linear regression. However, note that sometimes with a bit of thought, you can transform a non-linear model into a linear model (for instance, if we knew alpha from prior information, in practice we could subtract it from Y and then take the log of both sides and use a linear model regression method to determine gamma and log(beta)).

Now, recall that if we are using Least Squares to determine our model parameters, the underlying assumption of that method is that the epsilon_i are independent and identically distributed, with probability distribution Normal(mu=0, sigma), where the standard deviation sigma is the same for all data points (a property known as homoscedasticity).

Now, note that if we have N data points, the y data can be more compactly expressed as a vector of length N.  Similarly, the data for the independent variables can be expressed as a matrix. We can thus express the linear regression in matrix format as

Y = X\beta + \epsilon
where Y and epsilon are vectors of length N, beta is a vector of length p+1, X is an N×(p+1) matrix (including a column of ones for the intercept), and p is the number of explanatory variables.  X is often referred to as the “design matrix”.
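In R you can inspect the design matrix that lm() constructs with model.matrix(); for a single explanatory variable it is just a column of ones (for the intercept) next to the x values (a tiny made-up data set is used here for illustration):

```r
x = c(1, 2, 3)
y = c(2.2, 3.9, 6.1)
X = model.matrix(y ~ x)  # an N x (p+1) matrix: intercept column plus x
print(X)
```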

Linear regression involves estimating values of beta that describe as much of the variation in the data as possible, by minimizing the sum of the squared residuals.  The residuals are

\hat{\epsilon}_i = y_i - \hat{y}_i
where hat(y)_i is the predicted value of y_i from the linear regression model.

The left-over variance is the sum of the squared residuals (RSS or SSE),

SSE = \sum_{i=1}^{N} \hat{\epsilon}_i^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

In the Least Squares method this is equal to

SSE = (Y - X\beta)^T (Y - X\beta) \qquad \text{(Equation 2)}

To find the parameters that minimize the left-over variance involves simple calculus; we take the derivative of equation 2 with respect to the beta parameters, then set it equal to zero.  Plugging through the algebra yields that the value of beta that minimizes equation 2 is

\hat{\beta} = (X^T X)^{-1} X^T Y
A measure of how much of the data variance is described by the model is known as R^2

R^2 = 1 - {\sum_i (y_i - \hat{y}_i)^2 \over \sum_i (y_i - \bar{y})^2}
where bar(y) is the mean of the y vector (aside; note that the denominator in that expression is proportional to the variance of y). When R^2 is close to 1, then almost all the original variance in the data is described by the model (and the residual variance is very small).
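These matrix expressions are easy to verify against lm() in R; here is a sketch on simulated data like that used later in this module (set.seed() makes the example reproducible):

```r
set.seed(42)
n = 100
x = seq(0, 10, length=n)
y = 10*x + rnorm(n, 0, 25)

X = cbind(1, x)                           # the design matrix
beta_hat = solve(t(X) %*% X, t(X) %*% y)  # beta = (X^T X)^{-1} X^T y
yhat = X %*% beta_hat                     # model predictions
R2 = 1 - sum((y - yhat)^2)/sum((y - mean(y))^2)

# beta_hat matches coef(lm(y~x)), and R2 matches summary(lm(y~x))$r.squared
```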

[nextpage title="Determination of the uncertainties on least squares linear regression parameter estimates"]

Because of the stochasticity in the data, there will be uncertainty in the estimates of beta.  If there is a lot of stochasticity in the data and/or if there are just a few data points, the estimates of beta will have large uncertainties.

The covariance matrix for the parameters of a linear least squares model is

\text{Cov}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}
where sigma^2 is the variance of the fit residuals, estimated as

\hat{\sigma}^2 = {\sum_i \hat{\epsilon}_i^2 \over N - p - 1}
Under the null hypothesis that a parameter is equal to zero, the parameter estimate divided by its standard error follows a Student's t distribution with N-2 degrees of freedom (for simple regression; more generally, N-p-1).  As N gets large, the Student's t distribution approaches the Normal distribution.
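As a check of these formulas, the covariance matrix, standard errors, and t-based p-values can be computed by hand in R and compared to vcov() and the summary() table (a sketch on simulated data):

```r
set.seed(42)
n = 100
x = seq(0, 10, length=n)
y = 10*x + rnorm(n, 0, 25)
X = cbind(1, x)
b = lm(y ~ x)

sigma2 = sum(residuals(b)^2)/(n - 2)  # residual variance (N-2 dof for simple regression)
covmat = sigma2 * solve(t(X) %*% X)   # parameter covariance matrix
se = sqrt(diag(covmat))               # parameter standard errors
tval = coef(b)/se                     # t statistics
pval = 2*pt(-abs(tval), df=n - 2)     # two-sided Student's-t p-values

# covmat matches vcov(b); se, tval, and pval match the summary(b) table
```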

Questions: as N gets large (ie; lots of data), what happens to the uncertainties on the parameter estimates?

[nextpage title="Linear regression in R with the lm() method"]

In practice, you will probably never have to do the above matrix arithmetic to calculate the parameter estimates of a linear model and their uncertainties; this is because software like R, Matlab, SAS, SPSS, Stata, etc all have linear regression methods that do it for you.  In R, linear regression is done with the lm() method.

The lm() method outputs various fit diagnostics, including the parameter estimates, their uncertainties, and the p-value testing the null hypothesis that the parameter estimate is consistent with zero. It also outputs the R^2 of the model, and a statistic (called the F-statistic) that tests the null hypothesis that all model parameters except for the intercept are consistent with zero.  These various diagnostics are accessed with the summary() function.

As an example, let’s generate some simulated data with the underlying model y=10*x, and with underlying stochasticity epsilon~Normal(mu=0,sigma=25).

Copy and paste the following into R:

n = 100
x = seq(0,10,length=n)
y = 10*x+rnorm(n,0,25)
mydata = data.frame(x,y)


b = lm(y~x,data=mydata)
plot(mydata$x,mydata$y,xlab="x",ylab="y")
abline(b)
summary(b)

The ~ in the above input to lm() means that whatever variable you put on the LHS is assumed to be described by the model on the RHS (which can involve one or more explanatory variables).  When you run the above code, you will get the following plot:


And the following output:

Call:
lm(formula = y ~ x, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.718 -18.066   1.997  21.172  74.192 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.2908     5.3357   1.366    0.175    
x             8.4868     0.9218   9.206 6.39e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.88 on 98 degrees of freedom
Multiple R-squared:  0.4638,	Adjusted R-squared:  0.4583 
F-statistic: 84.76 on 1 and 98 DF,  p-value: 6.391e-15

The parameter estimates are in the first column, the second column has their estimated uncertainty, and the last column is the p-value testing the null hypothesis that the parameter estimate is consistent with zero.  Which of the parameters would we expect to be consistent with zero?

Calculate the 95% confidence interval on the parameter estimates.  Do the true parameters lie within these confidence intervals?
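A hint for this exercise: in R the 95% confidence intervals can be obtained with confint(), or by hand as the estimate plus or minus the appropriate Student's-t quantile times the standard error (a sketch, regenerating similar simulated data with a seed for reproducibility):

```r
set.seed(1)
n = 100
x = seq(0, 10, length=n)
y = 10*x + rnorm(n, 0, 25)  # true intercept 0, true slope 10
b = lm(y ~ x)

confint(b, level=0.95)  # 95% CIs for the intercept and slope

# equivalently, by hand:
est = coef(b)
se  = sqrt(diag(vcov(b)))
tq  = qt(0.975, df=df.residual(b))
cbind(lower=est - tq*se, upper=est + tq*se)
```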

The R^2 of the model is 0.46, meaning that the model explains 46% of the variance in the data (which isn’t too bad).  Ideally, we would like a  model that explains much of the variance in the data, with good predictive power for future events (ie; the model really describes the relationship between x and y), without having too many explanatory variables (more on that below).

To get the covariance matrix on model parameter estimates in R, we can use vcov(model_name).  Note that in the above code the model_name is b.  To get the parameter estimate correlation matrix in R, we would use cov2cor(vcov(model_name)). What is the correlation between the model parameter estimates?

If we wanted to fit a linear model without an intercept in R, we would use the expression lm(y~x-1).  Try this out.  What happens to the uncertainty on the slope?

If we wanted to fit a linear model with only an intercept (ie; y is constant) in R we would use the expression lm(y~1).  Try this out.  What happens to the R^2?  Why is there no F statistic in the summary?

Other things to try: change the n in the above code to n=5 and rerun.  What is the p-value testing the null hypothesis that the slope is equal to zero?  Now reset n to n=100, and generate the simulated data with y=10*x+rnorm(n,0,100).  What is the p-value now?

[nextpage title="Parsimonious model selection"]

You will notice at the bottom of the summary output for our original linear model that it not only gives the R^2, but also something called the adjusted R^2. This takes us to the topic of what happens if you put too many explanatory variables into the model; if, instead of just a linear model, we used a model with a polynomial with degree N-1, we would be able to exactly fit the data (thus R^2 would be exactly equal to 1).  However that polynomial, if extrapolated to predict the y at a new value of x, would almost certainly do a very poor job of properly estimating y for another independent, equivalent dataset.  This is because that polynomial is likely fitting to the random wiggles of the stochasticity in y in the original data set, rather than representing the true underlying relationship between x and y for that kind of data. The adjusted R^2 penalizes the R^2 statistic to correct for having too many explanatory variables in the model:

\bar R^2 = {1-(1-R^{2}){n-1 \over n-p-1}} = {R^{2}-(1-R^{2}){p \over n-p-1}}

From Wikipedia: “If a set of explanatory variables with a predetermined hierarchy of importance are introduced into a regression one at a time, with the adjusted R2 computed each time, the level at which adjusted R2 reaches a maximum, and decreases afterward, would be the regression with the ideal combination of having the best fit without excess/unnecessary terms.”

The principle behind the adjusted R2 statistic can be seen by rewriting the ordinary R2 as

R^{2} = {1-{\textit{VAR}_\text{err} \over \textit{VAR}_\text{tot}}}

where {\textit{VAR}_\text{err} = SS_\text{err}/n} and {\textit{VAR}_\text{tot} = SS_\text{tot}/n} are the sample variances of the estimated residuals and the dependent variable respectively, which can be seen as biased estimates of the population variances of the errors and of the dependent variable. These estimates are replaced by statistically unbiased versions: {\textit{VAR}_\text{err} = SS_\text{err}/(n-p-1)} and {\textit{VAR}_\text{tot} = SS_\text{tot}/(n-1)}.
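You can verify this relationship in R by computing the adjusted R^2 by hand from the ordinary R^2 and comparing it to the value that summary() reports (a sketch on simulated data):

```r
set.seed(1)
n = 100
x = seq(0, 10, length=n)
y = 10*x + rnorm(n, 0, 25)
b = lm(y ~ x)

R2 = summary(b)$r.squared
p = 1  # one explanatory variable
adjR2 = 1 - (1 - R2)*(n - 1)/(n - p - 1)
# adjR2 matches summary(b)$adj.r.squared
```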

There is also another statistic, called the Akaike Information Criterion (AIC), that can help determine if a model term improves the description of the data while also yielding a parsimonious model; in fact, model selection by minimizing the AIC is the preferred method over maximizing the adjusted R^2.  The optimal model minimizes the AIC.   There is also another statistic called the Bayesian Information Criterion that does much the same thing.

The R MASS library includes methods related to the AIC.  To access the AIC statistic of the linear model we calculated above, install the MASS library, and type

library(MASS)
AIC(b)
For the above simulated data, let’s see if a quadratic model gives us a better adjusted R^2 and/or AIC.  Try this:

n = 100
x = seq(0,10,length=n)
y = 10*x+rnorm(n,0,25)
mydata = data.frame(x,y)
b = lm(y~x+x^2,data=mydata)
summary(b)
Is the output what you expected?

It turns out that in R you can’t use powers of x directly on the RHS of a formula, because the ^ operator has a special meaning in formula notation (you can, however, wrap the power in I(), as in lm(y~x+I(x^2))).  Alternatively, you can define a new variable, called say x2, that is x2=x*x, and then put that new variable on the RHS if you want a quadratic term. Try this

mydata$x2 = mydata$x*mydata$x   # define the quadratic term as a new variable
ba = lm(y~x,data=mydata)
b = lm(y~x+x2,data=mydata)

Is the quadratic term significant? What are the adjusted R^2 and AIC statistics compared to those of the model with just the linear term? Does something perhaps seem a little odd?

Now add in a cubic term.  What is the significance of the parameters, and what are the adjusted R^2 and AIC statistics compared to those of the model with just the linear term?

Add in a fourth order term and get the R^2 and AIC statistics and compare them to those of the other models we’ve tried.

We know that the underlying model is linear (because we generated it with y=10*x plus stochasticity).  And yet, just based on adjusted R^2 and the AIC statistics, a model with higher order polynomials appears to be preferred. What is going on here?

It turns out that even just looking at the adjusted R^2 and/or the significance of the model parameters does not necessarily give the best indication of the most parsimonious model. The problem is that the stochasticity in the data can make the data appear to be consistent with a different model than the true underlying model. A model that is tuned to the stochasticity in the data may have poor predictive power for predicting future events.

Determination of whether or not our fitted model has good predictive power for future events brings us to the topic of model validation.

[nextpage title="Model validation"]

Model validation ensures that a hypothesized model whose parameters are fit to a particular data set has good predictive power for a similar data set.  A good way to validate a model is holdout cross-validation (sometimes loosely referred to as “bootstrapping” in this context): randomly divide the data into half, fit the model on one half (the training sample), then evaluate how much of the variance of the other half (the testing sample) is described by the predictions from the fitted model.  A model with good predictive power will describe much of the variance in the testing sample.
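Here is a minimal sketch of that procedure in R, on simulated data like that generated earlier in this module (the half/half split is random, so set.seed() is used for reproducibility):

```r
set.seed(123)
n = 100
x = seq(0, 10, length=n)
y = 10*x + rnorm(n, 0, 25)
mydata = data.frame(x, y)

# randomly split the data into training and testing halves
itrain = sample(1:n, n/2)
train = mydata[itrain, ]
test  = mydata[-itrain, ]

# fit on the training half, then predict the testing half
fit  = lm(y ~ x, data=train)
pred = predict(fit, newdata=test)

# fraction of the testing-sample variance described by the predictions
R2_test = 1 - sum((test$y - pred)^2)/sum((test$y - mean(test$y))^2)
R2_test
```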

The R script least_square_example.R gives an example of this validation procedure, used to compare the predictive performance of a linear and a quadratic model on the data set we generated above.

[nextpage title="More on parsimonious model selection"]

As discussed, we wish to obtain the most parsimonious model that describes as much of the variance in the data as possible.  If we have several explanatory variables that we believe the data might be related to, we could try all different combinations of the variables, and pick the model with the smallest AIC (and check using model validation to ensure that model has good predictive power). However, if there are a lot of variables this procedure rapidly becomes quite tedious. The MASS library in R conveniently has a function called stepAIC() that can do this for you.  The stepAIC() function has an option called "direction".  If direction="backward", the algorithm starts with the model containing all the variables, and drops them one at a time, at each step removing the variable whose removal most lowers the AIC.  It stops when dropping further terms would only increase the AIC.  If direction="forward", it starts from the simplest model and adds the candidate variables one at a time until the model with the lowest AIC is reached.  If direction="both", it considers both adding and dropping terms at each step to determine the model with the lowest AIC.  This last option is the preferable one, but can be computationally intensive if your model has a lot of potential explanatory variables.

As an example, let’s assume that we think daily average temperature and wind speed might affect the number of assaults per day in Chicago (perhaps by making people more irritable when the weather is bad).  Let’s use this code to try out a model with these explanatory variables (you first need to download the files chicago_crime_summary_b.txt and chicago_weather_summary.txt to your working directory, if you have not done so already):

library(chron)  # for the julian(month,day,year) function used below
cdat = read.table("chicago_crime_summary_b.txt",header=T)
cdat$jul = julian(cdat$month,cdat$day,cdat$year)
wdat = read.table("chicago_weather_summary.txt",header=T)
wdat$jul = julian(wdat$month,wdat$day,wdat$year)

# subset the samples such that they cover the same time period

# put temperature and wind info in the crime data frame
cdat$temperature[match(wdat$jul,cdat$jul)] = wdat$temperature
cdat$wind[match(wdat$jul,cdat$jul)] = wdat$wind

# regress the number of assaults (x4) on daily avg
# temperature and wind
mult.fig(1,main="Number daily assaults in Chicago, 2001 to 2012")  # mult.fig() is from the sfsmisc package
b = lm(x4~temperature+wind,data=cdat)
plot(cdat$jul,cdat$x4,xlab="Julian days",ylab="\043 assaults/day")
lines(cdat$jul,predict(b),col="red")  # overlay the fitted model

The code produces the following plot (figure: chicago_weather_and_crime_a):


Do you see a problem with this fit?

Try the following code:

mult.fig(1,main="Number of daily assaults in Chicago, 2001-2012")  # mult.fig() is from the sfsmisc package
cdat$jul2 = cdat$jul^2
bb = lm(x4~jul+jul2+temperature+wind,data=cdat)
plot(cdat$jul,cdat$x4,xlab="Julian days",ylab="\043 assaults/day")
lines(cdat$jul,predict(bb),col="red")  # overlay the fitted model

It produces the following plot (figure: chicago_weather_and_crime_b):

What do you think of this fit, based on the figure?

From the fit output, does it look like wind is needed in the fit?  Try using newfit=stepAIC(bb,direction="both") to determine what the most parsimonious model based on these explanatory variables is.  What is the AIC of the original model, bb?  What is the AIC of the newfit model?  How much of the variance did the model in bb explain?  How much of the variance did the model in newfit explain?

As an aside, what kind of possible confounding variables do you think we probably should think about including in the regression?

Also, as an aside, do you think the data are consistent with the underlying assumptions of Least Squares regression? (hint: start by plotting the residuals vs time.  Also, there is a function in fall_course_libs.R that can help you figure out if the data are consistent with the underlying assumptions of the statistical model).

[nextpage title="Factor regression"]

In regression analysis we can take into account the effect of what are known as categorical or factor variables; that is to say variables that can take on a variety of nominal levels.  Examples of categorical variables are gender, day-of-week, month-of-year,  political party, state, etc. An example of factor regression is regressing income on number of years in school, taking into account gender (figure: example_factor_regression):

In this particular example, the slope is assumed to be the same for both genders, but the intercept is different.  Is it obvious that in this case including gender in the regression will dramatically reduce the residual variance of the model? (ie; increase the R^2)

This is an example of one way to include factor variables in a linear regression fit; assume that for each category the intercept of the regression line is different in the different categories.  As we will see in a minute, if we use the lm() function for factor regression, by default R picks one of the categories as the “baseline” category, and then quotes an intercept relative to this baseline for each of the other categories.

Let’s look at an example: let’s assume that the number of assaults in Chicago per day is related to  a continuous explanatory variable x that is not a factor (say, temperature), and another explanatory variable w that is categorical (say, weekday).  To understand this better, let’s first fit a regression model with just temperature (plus a quadratic trend term in time), then aggregate the residuals by weekday.  Is it clear that if there is in fact no dependence on weekday, the residuals within-weekday should be statistically consistent with each other?

The following R code performs the fit and aggregates the results within weekday:

cdat$jul2 = cdat$jul^2
b = lm(x4~jul+jul2+temperature,data=cdat)
g = aggregate(b$residuals,by=list(cdat$weekday),FUN="mean")
eg = aggregate(b$residuals,by=list(cdat$weekday),FUN="se")  # se() (standard error of the mean) is assumed to be defined in fall_course_libs.R
glen = aggregate(b$residuals,by=list(cdat$weekday),FUN="length")

As an aside: here I used a quadratic term to take into account long-term trends.  Do you think higher order polynomials should be used?  How could we tell?  If we are trying to determine what explanatory variables like temperature may cause crime trends, what is the danger of adding in too many higher order terms in the trend polynomial?

In the file fall_course_libs.R I have included a function called factorplot() that takes a vector y and its uncertainties ey, and plots them along with the mean of y and its one standard deviation uncertainties:

y = g[[2]]
ey = eg[[2]]
ylen = glen[[2]]
factorplot(y,ey)  # factorplot() is defined in fall_course_libs.R (call signature assumed from the description above)

The code produces the following plot (figure: factorplot_example):

Note that the within-group sample standard deviation (red error bars) is a lot smaller than the pooled standard deviation (blue dotted lines) . Based on the plot, do you think the means of residuals within weekday appear to be statistically consistent?

Let’s test the p-value of that null hypothesis using the chi-square test:

v = (y-mean(y))/ey
dof = length(v)-1
p = 1-pchisq(sum(v^2),dof)
cat("The p-value testing the null-hypothesis that the factor levels are drawn from the same mean is",p,"\n")

Now let’s include the weekday factor level in the regression itself. First let’s fill the weekday name in cdat (recall that weekday=0 is Sunday and weekday=6 is Saturday).

vweekday = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
cdat$weekday_name = vweekday[cdat$weekday+1]

In order to include a factor in the linear regression with lm(), you simply add a term in the call function using factor(<variable name>).  R then picks one of the levels of that factor as the “baseline” level (the baseline also includes the overall intercept of the model), and it gives the intercept for the other levels relative to that baseline:

myfit = lm(x4~jul+jul2+temperature+factor(weekday_name),data=cdat)

Take a look at the output… notice that R picked “Friday” as the baseline. Why do you think that is?

To force R to put the factors in a particular order such that the first in the list is treated as the baseline, you would do the following:

vweekday = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
myfit = lm(x4~jul+jul2+temperature+factor(weekday_name,levels=vweekday),data=cdat)

Now, take a look at the intercept for Monday, and subtract it from Tuesday. Compare this to y[3]-y[2] from the y vector that we calculated above from the fit residuals within-weekday of the regression model with just temperature and the quadratic trend term. What do you notice?

Interpretation of the summary(myfit) output reveals that all of the factor levels are significant when compared to the Sunday level.

[nextpage title="Including interaction terms"]

In the above we examined adding higher order polynomial terms of one explanatory variable into the model.  If you believe that there are interaction terms between the explanatory variables, you can easily include them in a model.  Let’s start with a simple example of a drug treatment trial with men and women, in which some response is measured vs dosage (for instance, cholesterol level if it is a cholesterol-reducing drug).  It is not uncommon for men and women to metabolize drugs differently.

In the trial, 90 men and 90 women who had high cholesterol were recruited.  They were divided into groups of 10, each group was given a different dosage of a drug, and the response in their cholesterol level was evaluated after a period of time.  The data are in the file drug_treatment.txt.  Read this file into a data frame called ddat, and regress the response on the drug dosage without taking into account gender:

ddat = read.table("drug_treatment.txt",header=T)
b = lm(cholesterol~dosage,data=ddat)
plot(ddat$dosage,ddat$cholesterol,xlab="Drug dosage (mg)",ylab="Cholesterol (mg/dL)")
abline(b)  # overlay the fitted regression line

This produces the following plot (figure: drug_treatment_a).

What do you think about the quality of the fit?  Are the fit residuals consistent with being Normally distributed? (plot them vs dosage, and also see the norm_overlay_and_qq() function in fall_course_libs.R)

Now let’s include gender as a factor:

bb = lm(cholesterol~dosage+factor(gender),data=ddat)
plot(ddat$dosage,ddat$cholesterol,xlab="Drug dosage (mg)",ylab="Cholesterol (mg/dL)")
for (igender in levels(factor(ddat$gender))){  # overlay the fitted line for each gender
  asub = subset(ddat,gender==igender)
  lines(asub$dosage,predict(bb,newdata=asub))
}

producing this plot (figure: drug_treatment_b):

What do you think about the quality of this fit?  What is the R^2 compared to the original fit?  Do the factor terms appear to be significant?  Interpret the results as shown in summary(bb). What is the AIC compared to the original fit?  Do the residuals satisfy the underlying assumptions of the Least Squares model?

You can also include interaction terms in a factor regression model if you think that the slope of the regression on the continuous explanatory variable depends on the factor level.  In R you do this by multiplying the factor by the continuous variable:

bc = lm(cholesterol~dosage*factor(gender),data=ddat)
plot(ddat$dosage,ddat$cholesterol,xlab="Drug dosage (mg)",ylab="Cholesterol (mg/dL)")
for (igender in levels(factor(ddat$gender))){  # overlay the fitted line for each gender
  asub = subset(ddat,gender==igender)
  lines(asub$dosage,predict(bc,newdata=asub))
}

This produces the following plot (figure: drug_treatment_c)

and the fit summary information:

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            250.442396   0.818672 305.913   <2e-16 ***
dosage                  -0.252715   0.006878 -36.741   <2e-16 ***
factor(gender)M          1.487949   1.157777   1.285      0.2    
dosage:factor(gender)M   0.141957   0.009727  14.594   <2e-16 ***

The first term is the intercept for the first factor level (females). The second term is slope for the first factor level.  The third term is the difference in intercept for the males compared to the females.  The fourth term is the difference in slope for the males compared to the slope for the females.   What is the interpretation of the third term not being significant?
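A quick arithmetic check of this interpretation, using the numbers in the table above: the slope for males is the baseline (female) dosage slope plus the interaction coefficient, and it matches the slope from the male-only regression shown further below.

```r
female_slope = -0.252715  # the dosage coefficient (baseline level: females)
interaction  =  0.141957  # the dosage:factor(gender)M coefficient
male_slope = female_slope + interaction
male_slope  # -0.110758, the dosage slope for males
```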

Examine the residuals. Are they consistent with the underlying assumptions of the Least Squares method?

Try doing separate regressions with each factor level:

ddat_female = subset(ddat,gender=="F")
bc_female = lm(cholesterol~dosage,data=ddat_female)
cat("Female regression:\n")
print(summary(bc_female))

ddat_male = subset(ddat,gender=="M")
bc_male = lm(cholesterol~dosage,data=ddat_male)
cat("Male regression:\n")
print(summary(bc_male))

with output:

Female regression:
(Intercept) 250.442396 0.808410 309.80 <2e-16 ***
dosage -0.252715 0.006792 -37.21 <2e-16 ***

Male regression:
(Intercept) 251.930344 0.828807 303.97 <2e-16 ***
dosage -0.110758 0.006963 -15.91 <2e-16 ***

Compare the differences, and the uncertainties on the differences, of the slopes and intercepts from these two fits to the results of the combined fit with the interaction term above. What do you find?

Use the stepAIC() function in the MASS library on the model bc above.  What do you find?

If we wanted to fix the intercept to be the same for the factor levels, but only include different slopes for each factor level, we would do the following:

bd = lm(cholesterol~dosage:factor(gender),data=ddat)
plot(ddat$dosage,ddat$cholesterol,xlab="Drug dosage (mg)",ylab="Cholesterol (mg/dL)")
for (igender in levels(factor(ddat$gender))){  # overlay the fitted line for each gender
  asub = subset(ddat,gender==igender)
  lines(asub$dosage,predict(bd,newdata=asub))
}

producing the following plot (figure: drug_treatment_d)

You can also include interaction terms between continuous explanatory variables.  For instance, in the below code, we fit simulated data generated with the model y = 50*x*w:

n = 100000
x = rnorm(n,10,0.1)
w = rnorm(n,10,0.25)

y = 50*x*w + rnorm(n,0,25)

# fit the additive model, with no interaction term
myfit_a = lm(y~x+w)
mult.fig(1,main="True model y=50xw")  # mult.fig() is from the sfsmisc package
plot(x*w,myfit_a$residuals,xlab="x*w",ylab="Residuals from fit of y=x+w model")

# fit the model including the x:w interaction term
myfit_b = lm(y~x*w)

Notice that in the first fit we just fit the additive model with no interactions, and it appears to describe a large amount of the variance in the data, even though it is not the correct model. Notice that, unlike with the factor regression for assaults as a function of weekday example given above, plotting the residuals of this additive model fit vs x*w doesn’t seem to reveal any dependence on x*w (figure: interaction_example).

There is no apparent dependence of the linear fit residuals on the interaction term because coefficients of the linear terms mostly compensated for the fact the fit was missing the x:w component. So, just plotting the residuals of an additive model vs cross terms will not usually give an indication as to whether cross-terms may be needed in the fit.

That being said, just like as for any other variable, putting cross-term variables willy nilly into a fit for no well-motivated reason is a bad idea, because sometimes, just due to stochasticity in the data, the cross-term will appear to be significant, and the AIC will make it appear that the term should be in the fit, when it in fact really isn’t related to variation in the dependent variable.  This affects the prediction accuracy of the model.

In the myfit_b fit we take into account the cross-terms. Are the regression parameter estimates consistent with what we expect? Is this more complex model preferred?
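One way to answer the last question is to compare the AIC of the additive and interaction models; a sketch, regenerating the simulated data with set.seed() added for reproducibility:

```r
set.seed(7)
n = 100000
x = rnorm(n, 10, 0.1)
w = rnorm(n, 10, 0.25)
y = 50*x*w + rnorm(n, 0, 25)

myfit_a = lm(y ~ x + w)  # additive model, missing the cross-term
myfit_b = lm(y ~ x*w)    # model including the x:w interaction

AIC(myfit_a)
AIC(myfit_b)  # lower than AIC(myfit_a): the interaction model is preferred
```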

