Tuesday, May 11, 2010

Summary Statistics in R

There are plenty of useful techniques for manipulating data frame objects in R. This post summarizes some of them and gives code to implement each one. Before running the code in this post, you should read in the data set (which is available for download here). For tips on how to read Stata ".dta" files into an R data frame, see my previous post.
## Print the entire data frame to the screen by typing its name ##

caschool.df

## Look just at the variable names ##

names(caschool.df)

## Examine a particular variable using the $ extractor ##

caschool.df$testscr

## You can also think of the data frame as a matrix ##

caschool.df[,4] ## Extracts the 4th column
caschool.df[4,] ## Extracts the 4th observation
caschool.df[4,4:18]

## The last command extracts the 4th through 18th columns
## of the 4th observation.
##
## Note 4:18 is R shorthand for
## c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
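If the school data is not loaded yet, the same matrix-style indexing can be tried on a small made-up data frame. This is only a sketch: the name toy.df and all of its values are assumptions for illustration, not part of the caschool data.

```r
## Hypothetical toy data frame (not the caschool data)
toy.df = data.frame(id = 1:5,
                    x  = c(10, 20, 30, 40, 50),
                    y  = c(1.5, 2.5, 3.5, 4.5, 5.5),
                    z  = c(2, 4, 6, 8, 10))

toy.df[,2]     ## Extracts the 2nd column as a plain vector
toy.df[4,]     ## Extracts the 4th observation (a one-row data frame)
toy.df[4,2:4]  ## Columns 2 through 4 of the 4th observation

## The colon operator builds an integer sequence, so 2:4 holds
## the same values as c(2,3,4)
same.values = all(2:4 == c(2,3,4))
```

Note that a single column comes back as a vector, while a single row stays a data frame because its columns may have different types.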

## Compute summary statistics on the entire data frame ##
## or just one variable ##

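## Note: calling mean() on a whole data frame worked in older
## versions of R, but current versions return NA with a warning.
## Use colMeans() on the numeric columns instead.

num.cols = sapply(caschool.df, is.numeric) ## TRUE for numeric columns
colMeans(caschool.df[,num.cols]) ## Means of the numeric columns
round(colMeans(caschool.df[,num.cols]),2) ## Rounding gives output
## that is easier to read
var(caschool.df[,num.cols]) ## Returns var-cov matrix
diag(var(caschool.df[,num.cols])) ## Returns just the variances
mean(caschool.df$testscr) ## Returns mean of testscr
var(caschool.df$testscr) ## Returns scalar variance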

## Compute the summary statistics on columns 6 through 18 ##

colMeans(caschool.df[,6:18]) ## Column means (mean() on a data frame returns NA in current R)
var(caschool.df[,6:18]) ## Returns var-cov matrix
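In current versions of R, mean() no longer averages each column of a data frame, so colMeans() is one way to get column-by-column means. The sketch below shows the idea on a made-up data frame with one non-numeric column. The names toy.df, num.cols, col.means and vcov.mat are assumptions chosen for illustration.

```r
## Hypothetical toy data frame with one non-numeric column
toy.df = data.frame(name = c("a", "b", "c", "d"),
                    x    = c(1, 2, 3, 4),
                    y    = c(2, 4, 6, 8))

num.cols  = sapply(toy.df, is.numeric)   ## FALSE for name, TRUE for x and y
col.means = colMeans(toy.df[,num.cols])  ## Means of x and y
vcov.mat  = var(toy.df[,num.cols])       ## 2 x 2 var-cov matrix
diag(vcov.mat)                           ## Just the variances of x and y
```

Selecting the numeric columns first keeps the text column out of the calculation, which is the same reason the post restricts the school data to columns 6 through 18.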

## If you don't like typing the data frame's name every ##
## time you want to explore a variable, you can use ##
## the attach() command.

attach(caschool.df)

## Now, the variables in the data frame can be accessed ##
## without extracting them with the $

mean(testscr)


##-------------------------------------------------- ##
## Creating/Storing new variables ##
## Just using arithmetic definition of new variables ##
##-------------------------------------------------- ##

math_read_avg = (read_scr+math_scr)/2

## -------------------------------------------##
## You can attach it to the data frame ##
## In fact, for regression, you want to do so ##
## -------------------------------------------##

caschool.df=cbind(caschool.df,math_read_avg)

## -------------------------------------------------##
## ... and its name will be what you called it when ##
## you defined the variable ##
## -------------------------------------------------##

names(caschool.df)
caschool.df$math_read_avg
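The steps above can be tried end to end on a small made-up data frame. This is a sketch under stated assumptions: toy.df and its scores are invented, and the column avg2 is a hypothetical name used only to show an alternative way of adding a column.

```r
## Hypothetical toy data frame standing in for the test-score data
toy.df = data.frame(read_scr = c(690, 660, 640),
                    math_scr = c(680, 650, 630))

## Define the new variable with ordinary arithmetic
math_read_avg = (toy.df$read_scr + toy.df$math_scr)/2

## Attach it with cbind(); the new column takes the variable's name
toy.df = cbind(toy.df, math_read_avg)
names(toy.df)

## An alternative that skips the intermediate variable:
## assign straight into a new column with the $ extractor
toy.df$avg2 = (toy.df$read_scr + toy.df$math_scr)/2
```

Both routes leave the new variable inside the data frame, which is what a regression with a data argument will look for.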

## ------------------------------------------------- ##
## Too many data sets in R's active memory can cause
## problems. For example, math_scr might be a variable
## name in multiple data frames, especially if you work
## with a lot of data sets.
##
## In case of ambiguity, R uses the most recently
## attached definition, which masks the older variable
## of the same name.
##
## To avoid this problem, when you are done with an
## attached data set, you should use detach()
## ------------------------------------------------ ##

detach(caschool.df)
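To see the name clash in action, here is a minimal sketch with two made-up data frames that share a variable name. The names first.df, second.df and score are assumptions for illustration, and the example assumes a fresh session in which no other object called score exists.

```r
## Two hypothetical data frames that share a variable name
first.df  = data.frame(score = c(1, 2, 3))
second.df = data.frame(score = c(10, 20, 30))

attach(first.df)
mean.first = mean(score)    ## Uses first.df$score

attach(second.df)           ## R prints a message that score is masked
mean.second = mean(score)   ## Now uses second.df$score

## Detach both when done so later code cannot pick up score by accident
detach(second.df)
detach(first.df)
```

The data frames themselves are never changed by attach(); only the meaning of the bare name score shifts, which is why the mistake is easy to miss.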

## ------------------------------------------------ ##
## An added note: I like to leave my workspace image
## clutter-free. After working on some code, I save
## my code in a text file, but I do *not* save my
## workspace image.
##
## Following this practice can help you avoid
## referencing a variable name that you created months
## ago for a separate project
## ------------------------------------------------ ##
