Introduction to R

Data Frames

Oh no, not another data structure. We've got vectors, matrices, lists - what else do we need?

Data frames are much like matrices - they have rows and columns, and hence two dimensions. But each column can store a different type of thing. The first column could hold numbers, the second character strings, the third a factor. Data frames are the best way to store data where each row corresponds to a unit, such as a person, and each column represents a measurement made on those units.
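
As a small illustration of mixed column types, a data frame can also be built directly with data.frame(). This is only a sketch: the object name people and its columns and values are made up for the example, and stringsAsFactors=FALSE is passed so that the name column stays character whatever the R version.

```r
# Hypothetical example data: three columns of three different types
people <- data.frame(
  height = c(1.72, 1.65, 1.80),        # numeric
  name   = c("Ann", "Bob", "Cy"),      # character
  group  = factor(c("a", "b", "a")),   # factor
  stringsAsFactors = FALSE             # keep 'name' as character
)

dim(people)            # rows and columns, as for a matrix
sapply(people, class)  # one class per column
```

Each column is a vector of a single type, but the types can differ from column to column, which is what a matrix cannot do.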

Reading a Data Frame from a File

R has a function for reading data from a file straight into a data frame. It is very common to get data files with one line per record, with each element separated either by spaces or commas. Consider this file, music.dat:
CAD3004,Frank Black, Frank Black, 15, CD
Col4851,Weather Report, Sweetnighter, 6, CD
Rep2257,Neil Young, Decade I, 19, CD
Rep4335,Neil Young, Weld, 12, CD
Chp1432,Red Hot Chili Peppers, Mother's Milk, 13, Tape
EMI1233,Primus, The Brown Album, 12, Tape
Atl4500,Led Zeppelin, Led Zep 3, 11, CD
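We use read.table() to read this file into an object. We give it the filename, and in this case we have to tell it that fields are separated by commas. The row.names=1 argument uses the first field as the row labels, and quote="" stops the apostrophe in Mother's Milk being treated as a quote character: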
> music <- read.table("music.dat",sep=",",row.names=1,quote="")
> music
                           V2               V3 V4    V5
CAD3004           Frank Black      Frank Black 15    CD
Col4851        Weather Report     Sweetnighter  6    CD
Rep2257            Neil Young         Decade I 19    CD
Rep4335            Neil Young             Weld 12    CD
Chp1432 Red Hot Chili Peppers    Mother's Milk 13  Tape
EMI1233                Primus  The Brown Album 12  Tape
Atl4500          Led Zeppelin        Led Zep 3 11    CD
Notice how some columns are numbers, and some are text. You can't do that with an ordinary matrix.
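
To try this without creating a file, a few of the same records can be passed through the text argument of read.table(). This is a sketch rather than the call used above: the variable name records is made up, and strip.white=TRUE is an addition that removes the blank following each comma in the text fields.

```r
# A few of the records from music.dat, supplied as text instead of a file
records <- "CAD3004,Frank Black, Frank Black, 15, CD
Col4851,Weather Report, Sweetnighter, 6, CD
Chp1432,Red Hot Chili Peppers, Mother's Milk, 13, Tape"

# quote="" stops the apostrophe in "Mother's Milk" being read as a quote mark;
# strip.white=TRUE (an addition) removes the blank after each comma
music <- read.table(text = records, sep = ",", row.names = 1,
                    quote = "", strip.white = TRUE)

# V4 (the track count) is integer; the text columns are character or
# factor, depending on the R version
sapply(music, class)
```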

Subscripts like a matrix...

The datafile has been read into something that looks a bit like a matrix. In fact you can use ordinary matrix subscripting to get rows and columns:
> music[,3]
[1] 15  6 19 12 13 12 11
> music[2,]
                    V2            V3 V4  V5 
Col4851 Weather Report  Sweetnighter  6  CD
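
The other matrix subscripting tricks work too. The sketch below uses a cut-down copy of the data built by hand with data.frame() (an assumption made so the example stands alone) and picks out single elements, rows by name, rows by a condition, and several columns at once.

```r
# A cut-down copy of the music data frame, built by hand for this sketch
music <- data.frame(
  V2 = c("Frank Black", "Weather Report", "Neil Young"),
  V3 = c("Frank Black", "Sweetnighter", "Decade I"),
  V4 = c(15, 6, 19),
  V5 = c("CD", "CD", "CD"),
  row.names = c("CAD3004", "Col4851", "Rep2257")
)

music[2, 3]                 # one element: row 2, column 3
music["Rep2257", ]          # a row picked out by its row name
music[music[, 3] > 10, ]    # rows where column 3 exceeds 10
music[, c(1, 3)]            # several columns at once give a smaller data frame
```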

Names like a list...

You can also use the column name labels much as you can with a list, using a $ sign:
> music$V5
[1]  CD    CD    CD    CD    Tape  Tape  CD
Levels:   CD  Tape
You can give the columns more sensible names by assigning a character vector to names():
> names(music) <- c('Artist','Title','Ntracks','Format')
> music$Title
[1]  Frank Black      Sweetnighter     Decade I         Weld           
[5]  Mother's Milk    The Brown Album  Led Zep 3      
Levels:   Decade I  Frank Black  Led Zep 3  Mother's Milk  
Sweetnighter  The Brown Album  Weld
Note that the first column, the catalogue number, isn't really part of the data frame: because we passed row.names=1 to read.table(), it was used as the row labels, and so it doesn't have a names() entry. You can get or set these values using row.names():
> row.names(music)
[1] "CAD3004" "Col4851" "Rep2257" "Rep4335" "Chp1432" "EMI1233" "Atl4500"
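
Both names() and row.names() can be assigned to as a whole or one element at a time. The following sketch uses a two-row data frame built by hand, and the replacement catalogue number EMI9999 is invented purely for illustration.

```r
# Two records built by hand so the example stands alone
music <- data.frame(
  V2 = c("Frank Black", "Primus"),
  V3 = c("Frank Black", "The Brown Album"),
  V4 = c(15, 12),
  row.names = c("CAD3004", "EMI1233")
)

names(music) <- c("Artist", "Title", "Ntracks")   # rename every column
names(music)[3] <- "Tracks"                       # or rename just one

music[["Tracks"]]                    # list-style access, same as music$Tracks
row.names(music)[2] <- "EMI9999"     # hypothetical new catalogue number
row.names(music)
```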

Adding Columns

You can add columns to a data frame using cbind() as with matrices. If you name the argument to cbind() then you'll set the names() element for the new column:
> music <- cbind(music,Rate=c(7,6,9,10,9,8,8))  # Marks out of 10
> music
                       Artist            Title Ntracks Format Rate 
CAD3004           Frank Black      Frank Black      15     CD    7
Col4851        Weather Report     Sweetnighter       6     CD    6
Rep2257            Neil Young         Decade I      19     CD    9
Rep4335            Neil Young             Weld      12     CD   10
Chp1432 Red Hot Chili Peppers    Mother's Milk      13   Tape    9
EMI1233                Primus  The Brown Album      12   Tape    8
Atl4500          Led Zeppelin        Led Zep 3      11     CD    8
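
Since a data frame also behaves like a list, a column can be added by assigning to a new name with $. This is an alternative to the cbind() call above, sketched on a cut-down data frame; the PerTrack column is a made-up derived quantity, there only to show that new columns can be computed from existing ones.

```r
# A cut-down data frame built by hand for this sketch
music <- data.frame(
  Artist  = c("Frank Black", "Weather Report", "Neil Young"),
  Ntracks = c(15, 6, 19),
  row.names = c("CAD3004", "Col4851", "Rep2257")
)

music$Rate <- c(7, 6, 9)                       # same effect as cbind(music, Rate=...)
music$PerTrack <- music$Rate / music$Ntracks   # hypothetical derived column

names(music)
```

The new vector must have one value per row, just as with cbind().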

Characters/Factors?

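Anything that read.table() finds that isn't all numbers gets turned into a factor. (This is the behaviour of R before version 4.0.0; from 4.0.0 onwards such columns stay as character vectors unless you pass stringsAsFactors=TRUE.) Sometimes you want things to be factors, sometimes you don't. For example, the Format field is a categorical field, and maybe Artist is too, but the Title probably shouldn't be. We can convert columns in place with as.character():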
> is.factor(music$Title)
[1] TRUE
> music$Title <- as.character(music$Title)
> music$Artist <- as.character(music$Artist)
> music
                       Artist            Title Ntracks Format Rate 
CAD3004 Frank Black            Frank Black          15     CD    7
Col4851 Weather Report         Sweetnighter          6     CD    6
Rep2257 Neil Young             Decade I             19     CD    9
Rep4335 Neil Young             Weld                 12     CD   10
Chp1432 Red Hot Chili Peppers  Mother's Milk        13   Tape    9
EMI1233 Primus                 The Brown Album      12   Tape    8
Atl4500 Led Zeppelin           Led Zep 3            11     CD    8
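
To see what the difference amounts to, here is a short sketch on a standalone vector (the names fmt and txt are made up for the example). A factor stores a small set of levels plus an integer code for each element, which suits categorical fields like Format; a character vector is just the strings themselves.

```r
fmt <- factor(c("CD", "CD", "Tape", "CD"))

levels(fmt)        # the distinct categories
as.integer(fmt)    # the integer codes stored underneath
table(fmt)         # counts per category, a typical use of a factor

txt <- as.character(fmt)   # back to plain strings
is.factor(txt)             # no longer a factor
nchar(txt)                 # string functions such as nchar() work on txt
```

Going the other way, factor(txt) turns the strings back into a factor.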
