Introduction
To R

Linear Regression Models

Skull Data

The data file skull.dat contains measurements on a number of skulls recovered from Egyptian archaeological sites. You can save them and look at the first ten lines using the following command:

> system("head skull.dat")
MaxBreadth BaseHeight BaseLength NasalHeight Year
131     138     89      49      -4000
125     131     92      48      -4000
131     132     99      50      -4000
119     132     96      44      -4000
136     143     100     54      -4000
138     137     89      56      -4000
139     130     108     48      -4000
125     136     93      48      -4000
131     134     102     51      -4000

Reading the Data

This file can be read into a data frame - but since the first line has the column descriptions, we can add head=T to read.table() and R will use those as the names of the columns:

> skull <- read.table('skull.dat',head=T)
> dim(skull)
[1] 150   5
> skull[1:5,]        First five rows...
  MaxBreadth BaseHeight BaseLength NasalHeight  Year 
1        131        138         89          49 -4000
2        125        131         92          48 -4000
3        131        132         99          50 -4000
4        119        132         96          44 -4000
5        136        143        100          54 -4000

We want to investigate the relation between the BaseHeight of the skull, and the BaseLength.

Plotting the Data

We can plot a scatter plot of the two quantities:

> plot(skull$BaseH,skull$BaseL)

and we want to fit a model that relates the Base Length to the Base Height.

The `lm()` function

We want to fit a linear model - of the form y = a.x + b - to the data. The lm() does that.

> lm(skull$BaseL ~ skull$BaseH)       Can abbreviate names...
Call:
lm(formula = skull$BaseL ~ skull$BaseH)

Coefficients:
 (Intercept) skull$BaseH 
     58.3103      0.2878

Note how lm() is called with lm(y ~ x) - this we read as y depends on x. This is the way that many models are constructed - with the quantity you are measuring (the response) as the y value and the quantities you control (the predictors) as the x variables. Of course in this situation its just two sets of measurements... Anyway...

Storing a Fit

Just as you can do with most R functions, you can store a lm() value in an object. The returned value from lm() is another list, like t.test(), but with different elements:

> sfit <- lm(skull$BaseL~skull$BaseH)
> is.list(sfit)
[1] TRUE
> names(sfit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

These are all important results from the fitted model. Perhaps the most important is the coefficients - you can use these to add the fitted model to the scatter plot:

> plot(skull$BaseH,skull$BaseL)
> sfit$coef
 (Intercept) skull$BaseH 
    58.31026   0.2878212
> abline(sfit$coeff)

The abline() function takes a vector of length two and draws a line with that slope and intercept.

Other components of the fit are useful for diagnostic purposes. For example you can look at the plot of the fitted values against the residuals:

> plot(sfit$fitted,sfit$resid)

this can show deviations from the data fitting a linear model well, if there is significant curvature in the plot.

You can also do a histogram of the residuals to look for outliers - if the data really comes from a linear model plus gaussian noise then the residuals should come from a normal distribution:

> hist(sfit$resid)

Object Class

Different objects in R can be assigned different classes. For example, we can find out which class has been assigned to the object storing the result of the lm() function by typing:

> class(sfit)
[1] "lm"

The fact that sfit is an object of the class lm implies that diagnostic plots as those seen above can be obtained simply by typing:

> plot(sfit)

This is possible because plot() is a generic function: the type of the plot produced is dependent on the type or class of the argument.

Contents