R is a versatile and reliable statistics package. What makes it particularly attractive to students, statisticians, and researchers is that it's free. That’s right – you can get a leading statistical package for zero dinero. Still not smiling? Most fully licensed statistical packages cost $2,000 or more; R costs $0.
Now that you’re smiling, let’s look at how to set up R.
R SET UP
Step 1. Go to the R website at http://cran.r-project.org/
Step 2. Select your operating system (Linux, Mac, Windows)
(Note: the rest of these instructions are for Windows users)
Step 3. Click on “base”.
Step 4. Select the latest version of R for download (top of page)
Step 5. Save the R .exe file to a folder on your computer.
Step 6. After the download completes, open the folder and double-click the .exe file to start the installation. It is probably best to accept the default settings for now. The installation will create an R folder in the list of program files. If you download newer versions of R, this is where the program files will be stored.
After completing steps 1-6, look for the R shortcut icon on your desktop. Double click on it to launch R. You will see a window with the RGui (R graphical user interface). You are set to go and can enter all sorts of data and programming commands for running statistical functions (several of these are shown below). A great add-on to consider is RStudio, a graphical user interface for R. To get RStudio, go to rstudio.org, download the program, and follow the install instructions.
Programming syntax for statistical operations
Matrix Math Functions
a=c(1,4,7)
b=c(2,5,8)
c=c(3,6,9)
matrixA=cbind(a,b,c) # converts 3 vectors into a 3x3 matrix
matrixA=matrix(c(1,2,3,4,5,6,7,8,9),byrow=TRUE,ncol=3) # creates the same 3x3 matrix as above
t(matrixA) # transposes matrix A
matrixA%*%matrixB # multiplies matrices A and B (matrixB must be defined first)
solve(matrixA) # takes the inverse of a nonsingular matrix (note: the 1-to-9 example matrix above is singular, so solve() will error on it)
crossprod(matrixA) # shortcut to compute t(A)%*%A
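As a quick check of the functions above, here is a toy run with made-up values, using a matrix chosen to be invertible (the 1-to-9 example matrix is singular, so solve() would fail on it):

```r
# Toy run of the matrix functions above; A is invertible (det = 25)
A = matrix(c(2, 0, 1,
             1, 3, 0,
             0, 1, 4), byrow = TRUE, ncol = 3)
Ainv = solve(A)                        # inverse of A
round(A %*% Ainv, 10)                  # 3x3 identity matrix
all.equal(crossprod(A), t(A) %*% A)    # TRUE: crossprod(A) is t(A) %*% A
```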
Data Entry and Removal Techniques in datasets/dataframes and matrices
Enter continuous variable x x = c(v,a,l,u,e,s)
Enter categorical/factor variable z z = factor (c(l,e,v,e,l,s))
Combine vbls x, y, z into a matrix matrixname=cbind(x, y, z)
Combine vbls x, y, z into a dataframe dataframe= data.frame(x,y,z)
Convert dataframe into a matrix matrixname = as.matrix(dataframe)
Convert matrix into a dataframe dataframe = as.data.frame(matrixname)
Convert long data format into wide format dataframe = reshape(dataframe,idvar=c("var1","var2"),timevar="time",direction="wide")
Count number of TRUEs in a logical vector sum(vectorname) # TRUE counts as 1, FALSE as 0
Dimensions of matrix or dataframe dim(matrix or dataframe)
Change variable name colnames(dataframe) [#] <- "newname" # where [#] is the column number for the vbl
Change variable names (1) colnames(dataframe) = c("vbl1_name", "vbl2_name", "etc")
Change variable names (2) names(dataframe) = c("vbl1_name", "vbl2_name", "etc")
Add a variable to a dataframe dataframe = cbind(dataframe,variable)
Add a new variable to an existing dataframe dataframe$newvariable=c(v,a,l,u,e,s)
Add a new variable by summing vbls x and y dataframe$newvariable=dataframe$x + dataframe$y
view variables in a dataset names(dataframe)
view active dataframe and variables ls()
view available datasets in R data()
view first 10 rows and first 5 columns dataframe[1:10, 1:5]
view first 5 rows or last 5 rows head(dataframe) or tail(dataframe)
Remove active dataframe and values rm(dataframe)
Remove variables in a dataframe dataframe = remove.vars(dataframe, "variable", info = TRUE) # requires "gdata" pkg
Remove cases with missing data from a variable variable = na.omit(dataframe$variable) # returns the variable, not the whole dataframe
Remove cases with missing data from a dataframe dataframe = na.omit(dataframe)
Remove rows in dataframe dataframe = dataframe[-(3:5), ] # removes rows 3, 4, and 5
Get a dataframe from a package data("dataframe", package = "package name")
Load dataframe attach(dataframe)
Randomly select 100 values from standard normal x=rnorm(100)
Randomly select 10 values from a normal dist. x=rnorm(10, 100, 16) # with n=10, mean=100, sd=16
Randomly select N rows from a dataframe dataframe[sample(nrow(dataframe), N, replace = TRUE), ] # replace = TRUE or FALSE
Create a sequence of numbers from 1 to 50 x=1:50
Create a subset of dataframe datasubset = subset(dataframe, vblname == "level in factor vbl to select")
Check classification status of a dataframe class(dataframe)
Check classification status of a variable class(variable)
Check classification status of all vbls sapply(dataframe, class)
Check levels of a factor variable levels(dataframe$variable)
Check your working directory getwd()
Convert a numeric into a factor data$variable = as.factor(data$variable) # always use "$" even if dataframe is attached!
Convert a character into a numeric data$variable = as.numeric(data$variable)
Convert a variable into a factor with labels data$variable = factor(data$variable, levels=1:2, labels=c("male","female"))
Convert a variable into an ordered factor data$variable = factor(data$variable, ordered = TRUE)
Create grouping for continuous vbl split(data$continuous, data$factor)
Create grouping for continuous vbl group = data$continuous[data$factor=="factor level"]
Choose new referent category variable = relevel(variable, ref="value") #Check new referent with "table(variable)"
ignore missing data NAs ( . . . , na.rm = TRUE)
ignore NAs in specific variable ( . . . , na.omit(variable))
Open new graphics window in RStudio dev.new()
select age > 65 from a dataframe datasubset = data[ which(data$gender=='F' & data$age > 65), ]
select subset from dataframe datasubset = subset(dataframe, vblname == "level in factor vbl to select")
change decimal places options(digits=4)
mean center centered.vbl = scale(variable, center = TRUE, scale=FALSE)
standardize variables std.variable = scale(variable, center = FALSE, scale = TRUE)
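A few of the one-liners above, strung together end to end on made-up data (all variable names here are hypothetical):

```r
x = c(23, 31, 45, 52)                         # continuous variable
z = factor(c("low", "high", "low", "high"))   # factor variable
df = data.frame(x, z)                         # combine into a dataframe
df$x2 = df$x * 2                              # add a derived variable
datasubset = subset(df, z == "high")          # subset by factor level
dim(datasubset)                               # 2 rows, 3 columns
```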
Importing Data Files from Excel
Save data file to your desktop or other folder (e.g., R working directory) as a text file with the ".csv" extension, then open with RStudio by clicking the "import dataset" tab in the top right pane and then locating your data file. RStudio should recognize the first row of variable names as just that, variable names. Run the attach(dataset) command. Save file in R working directory (see below).
Importing Data Files from SPSS
library(foreign)
dataframe <- read.spss("C:/Location/SPSS_file.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
Note: Sometimes factor labels are lost in the transfer process. R may reclassify your categorical variables as "numeric". Check variable types in R with "sapply(dataframe,class)". If this is a problem, try assigning labels to categorical variables in SPSS (e.g., 0=female, 1=male). If a nominal variable cannot be assigned labels then try changing the variables to "string" in SPSS's "Type" column in the data view window. If all else fails, convert variables to factors in R using "dataframe$variable=as.factor(dataframe$variable)"
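If labels do get lost, the repair described above can be sketched like this (the 0/1 "sex" variable and its values are made up for illustration):

```r
# Suppose a 0/1 categorical variable came through the SPSS import as numeric
df = data.frame(sex = c(0, 1, 1, 0), score = c(10, 12, 9, 14))
sapply(df, class)                       # sex shows as "numeric"
df$sex = factor(df$sex, levels = 0:1, labels = c("female", "male"))
sapply(df, class)                       # sex is now "factor"
levels(df$sex)                          # "female" "male"
```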
#Saving a dataset in csv format (into R working directory)
To save a data set/frame in .csv format, use write.csv. Unless you’ve created row names via the rownames command, you typically want to set row.names = FALSE. You can keep dataframe and datafile names the same.
write.csv(dataframe, "datafile.csv", row.names=FALSE)
#Saving a dataset in SPSS format (into R working directory)
To save a data set/frame in SPSS format, use write.foreign from the "foreign" package (there is no write.spss function); it writes a text data file plus an SPSS syntax file that reads it.
library(foreign)
write.foreign(dataframe, "datafile.txt", "datafile.sps", package="SPSS")
#Loading a csv dataset (from R working directory)
To load a data set/frame in .csv format, use read.csv. If the first row contains variable names, include
"header=TRUE" (this is the default). You can keep the dataset name and filename the same.
dataframe=read.csv("datafile.csv", header=TRUE)
Exporting Data Files from R
When exporting a dataset from RStudio to Excel or SPSS, try saving it as a .csv file (see above) and opening the file in Excel or SPSS. A quicker, dirtier approach is to click on the appropriate dataset name in the top right workspace window of RStudio (the dataset will appear in the top left window), then copy and paste the dataset into Excel or SPSS and you're good to go.
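The csv round trip described above can be checked on toy data (the filename is arbitrary; tempdir() stands in for your working directory):

```r
df = data.frame(id = 1:3, group = c("a", "b", "a"))
path = file.path(tempdir(), "datafile.csv")
write.csv(df, path, row.names = FALSE)   # export
df2 = read.csv(path, header = TRUE)      # re-import
all.equal(dim(df), dim(df2))             # TRUE: same shape survives the trip
```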
Basic Stats & Descriptives
summary(dataframe) # statistics for dataframe
describeBy(dataframe, grouping_variable) #summary statistics by grouping factor (requires "psych" package). For all variables.
by(variable, grouping_variable, stat.desc) # summary statistics by grouping factor (requires "pastecs" package). For one variable.
mean (x)
median (x)
mode (x) # note: returns the storage mode (e.g., "numeric"), not the statistical mode
max (x)
min (x)
quantile (x)
IQR (x)
range (x) # gives lowest and highest score
sum (x) # gives sum of variable
sd (x) # unbiased standard deviation
var(x) # unbiased variance
summary (x) # summary stats
length (x) # sample size
cov (x, y) # covariance for x and y
scale(x) # to find z-scores
pnorm(scale(x)) # gives (Percentile Rank) areas to left of z-scores
pnorm(x, mean, sd) # gives PR for data points with a specified mean and sd
t = (scale(x))*10+50 # convert into t-scores
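A quick sanity check of the z- and t-score lines above, on made-up values:

```r
x = c(4, 8, 15, 16, 23, 42)
z = scale(x)                   # z-scores: mean 0, sd 1
t = z * 10 + 50                # t-scores: mean 50, sd 10
round(c(mean(t), sd(t)), 10)   # 50 and 10
```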
Check parametric assumptions
leveneTest(outcome vbl, grouping vbl) # Levene's test for homogeneity of variance (requires "car" package)
shapiro.test(dataframe$variable) # Shapiro's test for normality
Graphics using GrapheR package
A good "point and click" GUI for creating standard figures
library(GrapheR)
run.GrapheR()
Graphics (standard R)
plot (x) # creates scatter plot
plot(x,y) # creates scatter plot (outcome variable [y] listed second)
identify (x, y) # identify points on a plot (put cross hairs on a point, click, then hit "esc" to see the row number for that case)
abline(lm(y~x)) # draws regression line (you must do plot(x,y) first and minimize its window)
scatterplot(y~x, smoother=loessLine) # draws plot, linear reg. line, and smooth reg. line (requires "car" package); may use
smoother=gamLine or smoother=quantregLine smoothers instead
scatterplot(y~x | factor, smoother=loessLine) # draws plot, linear reg. line, and smooth reg. line for each level of a factor (requires "car")
plot(density(x)) # plots kernel density function
plot(scale(x),scale(y)) # puts both variables on the same scale
barplot(table(x)) # creates bar graph
boxplot(x) # creates boxplot
boxplot (x,y) # view both plots side by side
boxplot (y~x) # y is continuous and x is grouping factor
xyplot(y~x|z) # y and x are continuous, z is grouping factor (requires lattice package)
stem (x)
hist (x)
lines(density(variable.name)) # superimpose line on a histogram [do hist (x) first]
hist (x,10) # a histogram with 10 breaks
table (x)
Graphics with "lattice" package
plot x and y variables xyplot(y~x)
scatterplot for each level of a factor xyplot(y~x | factor) # viewing something by ( | ) factor is called conditioning
scatterplots grouped by 2 factors xyplot( y ~ x | factor1 + factor2, type = "o")
scatterplot matrix splom(dataframe) # where dataframe contains continuous variables
scatterplot matrix grouped by a factor splom(~dataframe | factor) # use "type = c()" commands shown below to add regression lines
plot x and y with regression line xyplot(y~x, data=dataframe, type=c("p","r"), xlim=c(,), ylim=c(,)) #adjust limits as needed
plot x and y with a smooth line xyplot(y~x, data=dataframe, type=c("g","p","smooth"), xlab="x") #adjust labels as needed
plot x and y grouped by a factor vbl xyplot(y~x | factor, data=dataframe, type=c("p","r"), xlim=c(,), ylim=c(,))
histogram histogram(~x)
histogram for each level of a factor histogram(~ variable | factor, data=dataframe)
density plot densityplot(~variable, data = dataframe)
density plot for each level of a factor densityplot(~ variable | factor, data=dataframe)
superimposed density plots densityplot(~ variable, groups=factor, plot.points=F, ref=T, auto.key=list(columns=3))
barplot barchart(factor ~ scale_vbl | factor1 + factor2)
normal QQ plot qqmath(~variable)
normal QQ plots by grouping variable qqmath(~variable | factor)
box and whisker plot bwplot(~variable)
box and whisker plot with factor bwplot(~variable | factor)
strip plot stripplot(~ variable)
strip plot by grouping variable stripplot(~ variable | factor)
bivariate 3D plot cloud(z~x*y)
Graphics with "ggplot2" package
generic expression mygraph = ggplot (dataframe, aes(vbl x, vbl y))
add geoms (layers) to generic expression mygraph + geom_??()
Geoms - insert geom aesthetics into "()" as needed [e.g., color = "name"; size = value; fill=color; alpha(color,value); weight=value; linetype=1, 2, 3, 4, 5, or 6; shape=0,1,2,3,4,5, or 6]:
bar graph [include just nominal vbl x in generic expression] geom_bar()
histogram [include just continuous vbl x in generic expression] geom_histogram()
boxplot [include just continuous vbl x in generic expression] geom_boxplot()
scatterplot geom_point()
plus connect dots geom_line()
plus smooth reg. line geom_smooth() #smooth regression line with 95% upper and lower bands
text box geom_text() # x=horizontal.location, y=vertical.location, label="name"
density plot geom_density()
error bars geom_errorbar() # ymin=lower.limit, ymax=upper limit of bars
horizontal and vertical lines geom_hline(), geom_vline() # yintercept = value, xintercept = value
Title + labs(title = "Title")
Save graphic ggsave("graphname.ext") # extensions: pdf, jpeg, tiff, png, bmp, svg, wmf
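Putting the pieces above together in one sketch on toy data (requires the "ggplot2" package; the filename in the commented ggsave line is arbitrary):

```r
library(ggplot2)
set.seed(1)
df = data.frame(x = 1:20, y = 1:20 + rnorm(20))
mygraph = ggplot(df, aes(x, y)) +
  geom_point() +                    # scatterplot layer
  geom_smooth(method = "lm") +      # regression line with 95% bands
  labs(title = "Toy scatterplot")   # title layer
# ggsave("toy.png", mygraph)        # uncomment to save the graphic
```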
ANALYSES
Correlation
cor.test (x,y) # gives Pearson r and p-value
cor.test (x,y, method = "spearman") # Spearman ranks correlation (add exact = FALSE if there are tied ranks)
cor(x,y)^2 # coefficient of determination (square cor(), not cor.test())
cor.test(rank(x),rank(y)) # gives Spearman rank correlation coefficient. For when data are not normal, data are ordinal, N is small, or the relationship is nonlinear.
cor(dataframe) # correlation matrix
# create a correlation matrix (requires "Deducer" package. See "??ggcorplot" for other goodies)
corr.matrix = cor.matrix(dataframe)
ggcorplot(corr.matrix, data = dataframe)
Linear Regression
Choose new referent category dataframe$variable = relevel(dataframe$variable, ref="value") #Check with "table(variable)"
model.matrix ( ~ variable, data = dataframe) # create matrix of model variables
sjt.lm(mymodel) # Creates nice SPSS-ish charts of regression data. Requires "sjPlot" package.
plot(mymodel) # produces useful plots for evaluating regression diagnostics
mymodel=lm(y~x) # one predictor
mymodel=lm(y~x1+x2+x3) # multiple predictors
summary(mymodel)
and then...abline(lm(y~x1)) # draws regression line (you must run "plot(x,y)" first and minimize its window)
coefficients(mymodel) #model coefficients
confint(mymodel, level=0.95) #95% CIs for coefficients
fitted(mymodel) # predicted values
residuals(mymodel) # residuals
anova(mymodel) # anova table
vcov(mymodel) # covariance matrix for model parameters
influence(mymodel) # regression diagnostics
Additional Regression functions (Thanks to "R Tutorials" for these):
+ x include variable x
- x remove variable x from list of predictor vbls
. include all predictor vbls
x : y include the interaction between vbls x and y
x * y include variables x and y and the interaction between them
x / y nesting: include vbl y nested within vbl x
x | y conditioning: include x given y
(x + y + z)^3 include these variables and all interactions up to three way
poly(x,3) polynomial regression: orthogonal polynomials
Error(a/b) specify the error term
I(x*y) as is: include a new variable consisting of these variables multiplied
- 1 intercept: delete the intercept (regress through the origin)
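A short sketch of a few of these operators on made-up data:

```r
set.seed(1)
d = data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
m1 = lm(y ~ x1 * x2, data = d)       # x1, x2, and the x1:x2 interaction
m2 = lm(y ~ x1 + x2 - 1, data = d)   # same predictors, no intercept
names(coef(m1))                      # includes "x1:x2"
names(coef(m2))                      # no "(Intercept)" term
```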
Logistic Regression
mymodel <- glm(y~x1+x2+x3, family="binomial", data=dataframe)
summary(mymodel) # display results
and then...
confint(mymodel) # 95% CI for the coefficients
exp(coef(mymodel)) # exponentiated coefficients to get Odds Ratios
exp(confint(mymodel)) # 95% CI for exponentiated coefficients (Odds Ratios)
plot(mymodel) # useful visual regression diagnostics
Testing Proportions
Single Sample Proportion test
prop.test(n, N, p = null prop, conf.level=.95)
Exact Binomial Test for a Single Sample
binom.test(n, N, p = nullprop) # use the exact binomial test when the sample size is small (it is valid at any sample size)
Two Samples Proportion Test
prop.test(c(n1, n2), c(N1, N2))
Fisher's Exact Test for Two Samples
fisher.test(matrix(c(f1,f2,m1,m2),nrow=2)) # use Fisher's exact test when the sample size is small. Here f1='yes' count and f2='no' count for one group, and m1='yes' count and m2='no' count for the other group. nrow=2 gives the matrix two rows; since matrix() fills column-wise, f1 and f2 form the first column (one group) and m1 and m2 form the second column (the other group), yielding a 2x2 matrix.
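A worked version of that layout, with made-up counts:

```r
f1 = 8; f2 = 2    # 'yes' and 'no' counts, group 1
m1 = 1; m2 = 5    # 'yes' and 'no' counts, group 2
tab = matrix(c(f1, f2, m1, m2), nrow = 2)
tab                      # group 1 in column 1, group 2 in column 2
fisher.test(tab)         # p-value and odds ratio estimate
```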
Testing Equality of Variance
F test for equality of variance
var.test(x1, x2) # use this when there are two continuous variables
Levene's test for homogeneity of variance
leveneTest(outcome vbl, grouping vbl) # Use this with a grouping variable and a continuous variable (requires "car" package)
T-Tests
Single-sample t-test
t.test(x, mu = a population mean value that you want to test the sample mean against)
Paired samples t-test
t.test(y1, y2, paired=TRUE)
F test for equality of variance
var.test(x1, x2)
Between groups t-test - assuming no equality of variance (Welch's)
t.test(y~x)
Between groups t-test - assuming equality of variance
t.test(y~x, var.equal = TRUE)
Ideal format when there are more than two categories in the grouping variable
t.test(continuous_vbl [grouping_vbl == 1], continuous_vbl [grouping_vbl == 2])
# You may add the following arguments to these t-tests: alternative="two.sided", "less", or "greater"; mu = ? (change the tested mean difference to something other than zero); var.equal = TRUE/FALSE (the default is FALSE, which is Welch's test: R assumes the variances are unequal unless you specify that they are equal); conf.level = 0.95; na.action=na.exclude (if you want to leave out missing values).
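For example, a one-sided Welch test using several of these arguments, on simulated data:

```r
set.seed(2)
g1 = rnorm(15, mean = 100, sd = 15)   # simulated group 1
g2 = rnorm(15, mean = 110, sd = 15)   # simulated group 2
t.test(g1, g2, alternative = "less", var.equal = FALSE, conf.level = 0.95)
```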
Non-Parametric Tests
# Add alt="two.sided", "less", or "greater" to specify alternative hypothesis
Single Sample test
wilcox.test(x, mu = population median value) # add "exact=FALSE" if there are tied ranks
Paired samples test (Wilcoxon Signed-Ranks)
wilcox.test(x1, x2, paired = TRUE) # add "exact=FALSE" if there are tied ranks
Independent, 2 samples test (MWU or Wilcoxon Rank Sum)
wilcox.test(y~x) # where x has two categories. Add "exact=FALSE" if there are tied ranks
Kruskal Wallis test
kruskal.test(y~x) # where variable x is a between subjects' factor with more than 2 categories
Post Hoc test for Kruskal
kruskalmc(y~x) # requires pgirmess package
ANOVA
Independent Groups, One-Way ANOVA
anova(lm(y ~ x)) # where y is the continuous outcome and x is the between subjects grouping factor
anova(lm(y ~ x1 * x2)) # separate multiple between subjects factors with an asterisk
or
print(summary(lm(y~x))) #produces F stat, p-value, and regression coefficients
or
myanova = aov(outcome ~ predictor1 * predictor2, data=mydata, na.action=na.exclude)
summary(myanova)
myWelchanova = oneway.test(outcome ~ predictor[s], data=mydata) # use when homogeneity assumption is not met
Chi-Square tests
Goodness of Fit
obs = c(n1, n2, n3, n4)
exp = c(prop1, prop2, prop3, prop4)
chisq.test (obs, p = exp)
Test for Independence (Crosstabs)
chisq.test(table(vbl.1, vbl.2)) # cross-tabulate the two variables first
df = stack(data.frame(x,y,z)) # stacks variables x, y, z into long format ('values' and 'ind' columns)
Odds Ratio & Relative Risk
mymatrix=matrix(c(100,105,150,165),2,2)
dimnames(mymatrix)=list(smoker=c("y","n"), death=c("y","n"))
library(epitools)
oddsratio(mymatrix, method="wald")
riskratio(mymatrix, rev="both") # reverses matrix
Agreement
Kappa Agreement # for categorical or ordinal data
library(irr)
rater1 = c(1,1,2,1,1,3,2,1,1)
rater2 = c(1,1,2,1,1,1,2,1,1)
rater3 = c(1,2,1,1,2,1,3,2,2)
ratings = data.frame(rater1,rater2,rater3)
kappa2(ratings[,c(1,2)]) # compare raters 1 & 2
kappa2(ratings[,c(1,3)]) # compare raters 1 & 3
kappa2(ratings[,c(2,3)]) # compare raters 2 & 3
Matrix Math Functions
a=c(1,4,7)
b=c(2,5,8)
c=c(3,6,9)
matrixA=cbind(a,b,c) # converts 3 vectors into a 3x3 matrix
matrixA=matrix(c(1,2,3,4,5,6,7,8,9),byrow=TRUE,ncol=3) # creates the same 3x3 matrix as above
t(matrixA) # transposes matrix A
matrixA%*%matrixB # multiplies matrixes A and B
solve(matrixA) # takes the inverse of a matrix
crossprod(matrixA) # shortcut to compute A*t(A)
Data Entry and Removal Techniques in datasets/dataframes and matrices
Enter continuous variable x x = c(v,a,l,u,e,s)
Enter categorical/factor variable z z = factor (c(l,e,v,e,l,s))
Combine vbls x, y, z into a matrix matrixname=cbind(x, y, z)
Combine vbls x, y, z into a dataframe dataframe= data.frame(x,y,z)
Convert dataframe into a matrix matrixname = as.matrix(dataframe)
Convert matrix into a dataframe dataframe = as.data.frame(matrixname)
Convert long data format into wide format dataframe = reshape(dataframe,idvar=c("var1","var2"),timevar="time",direction="wide")
Count number of TRUEs in a logical vector length(vectorname[vectorname==TRUE])
Dimensions of matrix or dataframe dim(matrix or dataframe)
Change variable name colnames(dataframe) [#] <- "newname" # where [#] is the column number for the vbl
Change variable names (1) colnames(dataframe) = c("vbl1_name", "vbl2_name", "etc")
Change variable names (2) names(dataframe) = c("vbl1_name", "vbl2_name", "etc")
Add a variable to a dataframe dataframe = cbind(dataframe,variable)
Add a new variable to an existing dataframe dataframe$newvariable=c(v,a,l,u,e,s)
Add a new variable by summing vbls x and y dataframe$newvariable=dataframe$x + dataframe$y
view variables in a dataset names(dataframe)
view active dataframe and variables ls()
view available datasets in R data()
view first 10 rows and first 5 columns Table(dataframe)[1:10, 1:5]
view first 5 rows or last 5 rows head(dataframe) or tail(dataframe)
Remove active dataframe and values rm(dataframe)
Remove variables in a dataframe dataframe = remove.vars(dataframe, "variable", info = TRUE) # requires "gdata" pkg
Remove cases with missing data from a variable dataframe = na.omit(dataframe$variable)
Remove cases with missing data from a dataframe dataframe = na.omit(dataframe)
Remove rows in dataframe dataframe = dataframe[-(3:5), ] # removes rows 3, 4, and 5
Get a dataframe from a package data("dataframe", package = "package name")
Load dataframe attach(dataframe)
Randomly select 100 values from normal dist. x=rnorm(100)
Randomly select 100 values from dist. x=rnorm(10, 100, 16) #with n=10, mean=100, sd=16
Randomly select N values from a dataframe sample(dataframe, N, replace = TRUE) # replace =TRUE or FALSE
Create a sequence of numbers from 1 to 50 x=1:50
Create a subset of dataframe datasubset = subset(dataframe, vblname == "level in factor vbl to select")
Check classification status of a dataframe class(dataframe)
Check classification status of a variable class(variable)
Check classification status of all vbls sapply(dataframe, class)
Check levels of a factor variable levels(dataframe$variable)
Check your working directory getwd()
Convert a numeric into a factor data$variable = as.factor(data$variable) # always use "$" even if dataframe is attached!
Convert a character into a numeric data$variable = as.numeric(data$frame)
Convert a variable into a factor with labels data$variable = factor(data$variable, levels=1:2, labels=c("male","female"))
Convert a variable into an ordered factor data$variable = factor(data$variable, ordered = TRUE)
Create grouping for continuous vbl split(data$continuous, data$factor)
Create grouping for continuous vbl group = data$continuous[data$factor=="factor level"]
Choose new referent category variable = relevel(variable, ref="value") #Check new referent with "table(variable)"
ignore missing data NAs ( . . . , na.rm = TRUE)
ignore NAs in specific variable ( . . . , na.omit(variable))
Open new window in Rstudio window()
select age > 65 from a dataframe datasubset = data[ which(data$gender=='F' & data$age > 65), ]
select subset from dataframe datasubset = subset(dataframe, vblname == "level in factor vbl to select")
change decimal places options(digits=4)
mean center centered.vbl = scale(variable, center = TRUE, scale=FALSE)
standardize variables std.variable = scale(variable, center = FALSE, scale = TRUE)
Importing Data Files from Excel
Save data file to your desktop or other folder (e.g., R working directory) as a text file with the ".csv" extension, then open with RStudio by clicking the "import dataset" tab in the top right pane and then locating your data file. RStudio should recognize the first row of variable names as just that, variable names. Run the attach(dataset) command. Save file in R working directory (see below).
Importing Data Files from SPSS
library(foreign)
dataframe <- read.spss("C:/Location/SPSS_file.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
Note: Sometimes factor labels are lost in the transfer process. R may reclassify your categorical variables as "numeric". Check variable types in R with "sapply(dataframe,class)". If this is a problem, try assigning labels to categorical variables in SPSS (e.g., 0=female, 1=male). If a nominal variable cannot be assigned labels then try changing the variables to "string" in SPSS's "Type" column in the data view window. If all else fails, convert variables to factors in R using "dataframe$variable=as.factor(dataframe$variable)"
#Saving a dataset in csv format (into R working directory)
To save a data set/frame in .csv format, use write.csv. Unless you’ve created row names via the rownames command, you typically want to set row.names = FALSE. You can keep dataframe and datafile names the same.
write.csv(dataframe, "datafile.csv", row.names=FALSE)
#Saving a dataset in SPSS format (into R working directory)
To save a data set/frame in .csv format, use write.csv. Unless you’ve created row names via the rownames command, you typically want to set row.names = FALSE. You can keep dataframe and datafile names the same.
write.spss(dataframe, "datafile.spss", row.names=FALSE)
#Loading a csv dataset (from R working directory)
To load a data set/frame in .csv format, use read.csv. If the first row are variable names, you may also include
"header=TRUE". You can keep the dataset name and filename the same.
dataframe=read.csv("datafile.csv", header=TRUE)
Exporting Data Files from R
When exporting dataset from RStudio to Excel or SPSS, try saving as a "csv" file (see above) and opening the file in excel or spss. A more quick and dirty approach is to click on the appropriate dataset name in the top right workspace window of RStudio (the dataset will appear in the top left window), then copy and paste the dataset into Excel and SPSS and you're good to go.
Basic Stats & Descriptives
summary(dataframe) # statistics for dataframe
describeBy(dataframe, grouping_variable) #summary statistics by grouping factor (requires "psych" package). For all variables.
by(variable, grouping variable, stat.desc) # summary statistics by grouping factor (requires "pastecs" package). For one variable.
mean (x)
median (x)
mode (x)
max (x)
min (x)
quantile (x)
IQR (x)
range (x) # gives lowest and highest score
sum (x) # gives sum of variable
sd (x) # unbiased standard deviation
var(x) # unbiased variance
summary (x) # summary stats
length (x) # sample size
cov (x, y) # covariance for x and y
scale(x) # to find z-scores
pnorm(scale(x)) # gives (Percentile Rank) areas to left of z-scores
pnorm(x, mean, sd) # gives PR for data points with a specified mean and sd
t = (scale(x))*10+50 # covert into t-scores
Check parametric assumptions
leveneTest(outcome vbl, grouping vbl) # Levene's test for homogeneity of variance (requires "car" package)
shapiro.test(dataframe$variable) # Shapiro's test for normality
Graphics using GrapheR package
A good "point and click" GUI for creating standard figures
library(GrapheR)
run.GrapheR()
Graphics (standard R)
plot (x) # creates scatter plot
plot(x,y) # creates scatter plot (outcome variable [y] listed second)
identify (x, y) # identify points on a plot (put cross hairs on point, click, then hit "esc" to see row number for that case
abline(lm(y~x)) # draws regression line (you must do plot(x,y) first and minimize its window)
scatterplot(y~x, smoother=loessLine) #draws plot, linear reg. line, and smooth reg. line (requires "car packages) or
may use smoother=gamLine or smoother=quantregLine smoothers instead
scatterplot(y~x | factor, smoother=loessLine) # draws plot, linear reg. line, and smooth reg. line for each level of a factor (requires "car")
plot(density(x)) # plots kernal density function
plot(scale(x),scale(y)) # puts both variables on the same scale
barplot(table(x)) # creates bar graph
boxplot(x) # creates boxplot
boxplot (x,y) # view both plots side by side
boxplot (y~x) # y is continuous and x is grouping factor
xyplot(y~x|z) # y and x are continuous, z is grouping factor (requires lattice package)
stem (x)
hist (x)
lines(density(variable.name)) # superimpose line on a histogram [do hist (x) first]
hist (x,10) # a histogram with 10 breaks
table (x)
Graphics with "lattice" package
plot x and y variables xyplot(y~x)
scatterplot for each level of a factor xyplot(y~x | factor) # viewing something by ( | ) factor is called conditioning
scatterplots grouped by 2 factors xyplot( y ~ x | factor1 + factor2, type = "o")
scatterplot matrix splom(dataframe) # where dataframe contains continuous variables
scatterplot matrix grouped by a factor splom(~dataframe | factor) # use "type = c()" commands shown below to add regression lines
plot x and y with regression line xyplot(y~x, data=dataframe, ("p","r"), xlim=c(,), ylim=c(,)) #adjust limits as needed
plot x and y with a smooth line xyplot(y~x, data=dataframe, ("g","p","smooth"), xlab="x") #adjust labels as needed
plot x and y grouped by a factor vbl xyplot(y~x | factor, data=dataframe, ("p","r"), xlim=c(,), ylim=c(,))
histogram histogram(~x)
histogram for each level of a factor histogram(~ variable | factor, data=dataframe)
density plot densityplot(~variable, data = dataframe)
density plot for each level of a factor densityplot(~ variable | factor, data=dataframe)
superimposed density plots densityplot(~ variable, groups=factor, plot.points=F, ref=T, auto.key=list(columns=3))
barplot barchart(factor ~ scale_vbl | factor1 + factor2)
normal QQ plot qqmath(~variable)
normal QQ plots by grouping variable qqmath(~variable | factor)
box and whisker plot bwplot(~variable)
box and whisker plot with factor bwplot(~variable | factor)
strip plot stripplot(~ variable)
strip plot by grouping variable stripplot(~ variable | factor)
bivariate 3D plot cloud(z~x*y)
Graphics with "ggplot2" package
generic expression mygraph = ggplot (dataframe, aes(vbl x, vbl y))
add geoms (layers) to generic expression mygraph + geom_??()
Geoms - insert geom aesthetics into "()" as needed [e.g., color = "name"; size = value; fill=color; alpha(color,value); weight=value; line, 2, 3, 4, 5, or 6; shape=0,1,2,3,4,5, or 6]:
bar graph [include just nominal vbl x in generic expression] geom_bar()
histogram [include just continuous vbl x in generic expression] geom_histogram()
boxplot [include just continuous vbl x in generic expression] geom_boxplot()
scatterplot geom_point()
plus connect dots geom_line()
plus smooth reg. line geom_smooth() #smooth regression line with 95% upper and lower bands
text box geom_text() # x=horizontal.location, y=vertical.location, label="name"
density plot geom_density()
error bars geom_errorbar() # ymin=lower.limit, ymax=upper limit of bars
horizontal and vertical lines geom_hline(), geom_vline() # yintercept = value, xintercept = value
Title + labs(title = "Title")
Save graphic ggsave("graphname.ext") # extensions: pdf, jpeg, tiff, png, bmp, svg, wmf
ANALYSES
Correlation
cor.test (x,y) # gives Pearson r and p-value
cor.test (x,y, method = "spearman") # Spearman ranks correlation (add exact = FALSE if there are tied ranks)
cor.test(x,y)^2 # coefficient of determination
cor.test(rank(x),rank(y)) # gives Spearman ranks correlation coefficient. FOR when data are not normal, data is ordinal, small N, or relationship is nonlinear.
cor(dataframe) # correlation matrix
# create a correlation matrix (requires deducer package. See "??ggcorplot" for other goodies)
corr.matrix = cor.matrix(dataframe)
ggcorplot(corr.matrix, data = dataframe)
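A quick worked example with made-up vectors (the numbers are illustrative only):

```r
# hypothetical data: hours studied and exam score
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 55, 61, 64, 70, 75)
result <- cor.test(hours, score)   # Pearson r with p-value and 95% CI
r <- unname(result$estimate)       # pull r out of the results object
r2 <- cor(hours, score)^2          # coefficient of determination
```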
Linear Regression
Choose new referent category dataframe$variable = relevel(dataframe$variable, ref="value") #Check with "table(variable)"
model.matrix ( ~ variable, data = dataframe) # create matrix of model variables
sjt.lm(mymodel) # Creates nice SPSS-ish charts of regression data. Requires "sjPlot" package.
plot(mymodel) # produces useful plots for evaluating regression diagnostics
mymodel=lm(y~x) # one predictor
mymodel=lm(y~x1+x2+x3) # multiple predictors
summary(mymodel)
and then...abline(lm(y~x1)) # draws the regression line on an existing scatterplot (run "plot(x1,y)" first and leave its window open)
coefficients(mymodel) #model coefficients
confint(mymodel, level=0.95) #95% CIs for coefficients
fitted(mymodel) # predicted values
residuals(mymodel) # residuals
anova(mymodel) # anova table
vcov(mymodel) # covariance matrix for model parameters
influence(mymodel) # regression diagnostics
Additional Regression functions (Thanks to "R Tutorials" for these):
+ x include variable x
- x remove variable x from list of predictor vbls
. include all predictor vbls
x : y include the interaction between vbls x and y
x * y include variables x and y and the interaction between them
x / y nesting: include vbl y nested within vbl x
x | y conditioning: include x given y
(x + y + z)^3 include these variables and all interactions up to three way
poly(x,3) polynomial regression: orthogonal polynomials
Error(a/b) specify the error term
I(x*y) as is: include a new variable consisting of these variables multiplied
- 1 intercept: delete the intercept (regress through the origin)
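A worked sketch with the built-in mtcars data (the choice of model is just an illustration):

```r
mymodel <- lm(mpg ~ wt + hp, data = mtcars)  # two predictors
summary(mymodel)                             # coefficients, R-squared, F test
confint(mymodel, level = 0.95)               # 95% CIs for the coefficients
head(fitted(mymodel))                        # first few predicted values
```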
Logistic Regression
mymodel <- glm(y~x1+x2+x3, family="binomial", data=dataframe)
summary(mymodel) # display results
and then...
confint(mymodel) # 95% CI for the coefficients
exp(coef(mymodel)) # exponentiated coefficients to get Odds Ratios
exp(confint(mymodel)) # 95% CI for exponentiated coefficients (Odds Ratios)
plot(mymodel) # useful visual regression diagnostics
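A minimal sketch using mtcars, where am (transmission type) is already coded 0/1; the predictor choice is illustrative:

```r
mymodel <- glm(am ~ wt, family = "binomial", data = mtcars)
summary(mymodel)
or <- exp(coef(mymodel))  # odds ratios: change in odds of a manual transmission per unit of weight
```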
Testing Proportions
Single Sample Proportion test
prop.test(n, N, p = null.prop, conf.level = 0.95)
Exact Binomial Test for a Single Sample
binom.test(n, N, p = null.prop) # the exact binomial test is valid at any sample size; prefer it when N is small
Two Samples Proportion Test
prop.test(c(n1, n2), c(N1, N2))
Fisher's Exact Test for Two Samples
fisher.test(matrix(c(f1,f2,m1,m2),nrow=2)) # use Fisher's exact test when sample sizes are small. Here f1='yes' count and f2='no' count for one group, and m1='yes' count and m2='no' count for the other group. matrix() fills column by column, so with nrow=2 the values f1 and f2 become the first column and m1 and m2 the second, giving a 2x2 table with one group per column.
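For instance, with hypothetical counts (8 yes / 2 no in one group, 3 yes / 7 no in the other):

```r
tab <- matrix(c(8, 2, 3, 7), nrow = 2)  # column 1 = group 1 (yes, no); column 2 = group 2
res <- fisher.test(tab)
res$p.value                             # two-sided p-value
```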
Testing Equality of Variance
F test for equality of variance
var.test(x1, x2) # use this when there are two continuous variables
Levene's test for homogeneity of variance
leveneTest(outcome.vbl, grouping.vbl) # use with a grouping variable and a continuous variable (requires "car" package)
T-Tests
Single-sample t-test
t.test(x, mu = a population mean value that you want to test the sample mean against)
Paired samples t-test
t.test(y1, y2, paired=TRUE)
F test for equality of variance
var.test(x1, x2)
Between groups t-test - assuming no equality of variance (Welch's)
t.test(y~x)
Between groups t-test - assuming equality of variance
t.test(y~x, var.equal = TRUE)
Ideal format when there are more than two categories in the grouping variable
t.test(continuous_vbl [grouping_vbl == 1], continuous_vbl [grouping_vbl == 2])
# You may add the following arguments to these t-tests: alternative = "two.sided", "less", or "greater"; mu = ? (to test a mean difference other than zero); var.equal = TRUE/FALSE (the default is FALSE, i.e., Welch's test: R assumes unequal variances unless you say otherwise); conf.level = 0.95; na.action = na.exclude (to leave out missing values).
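A worked sketch with the built-in sleep data (two drug groups, outcome "extra" hours of sleep):

```r
res  <- t.test(extra ~ group, data = sleep)                    # Welch's test (default)
res2 <- t.test(extra ~ group, data = sleep, var.equal = TRUE)  # pooled-variance version
res$p.value
```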
Non-Parametric Tests
# Add alt="two.sided", "less", or "greater" to specify alternative hypothesis
Single Sample test
wilcox.test(x, mu = population median value) # add "exact=FALSE" if there are tied ranks
Paired samples test (Wilcoxon Signed-Ranks)
wilcox.test(x1, x2, paired = TRUE) # add "exact=FALSE" if there are tied ranks
Independent, 2 samples test (MWU or Wilcoxon Rank Sum)
wilcox.test(y~x) # where x has two categories. Add "exact=FALSE" if there are tied ranks
Kruskal Wallis test
kruskal.test(y~x) # where variable x is a between-subjects factor with more than 2 categories
Post Hoc test for Kruskal
kruskalmc(y~x) # requires pgirmess package
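For example, with the built-in chickwts data (chick weight by six feed types):

```r
res <- kruskal.test(weight ~ feed, data = chickwts)
res$p.value  # very small p-value: the feed groups differ
```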
ANOVA
Independent Groups, One-Way ANOVA
anova(lm(y ~ x)) # where y is the continuous outcome and x is the between subjects grouping factor
anova(lm(y ~ x1 * x2)) # separate multiple between-subjects factors with an asterisk to get main effects plus interactions
or
print(summary(lm(y~x))) #produces F stat, p-value, and regression coefficients
or
myanova = aov(outcome ~ predictor1 * predictor2, data=mydata, na.action=na.exclude)
summary(myanova)
myWelchanova = oneway.test(outcome ~ predictor, data=mydata) # Welch's one-way test; use when the homogeneity-of-variance assumption is not met
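A worked sketch using the built-in PlantGrowth data (plant weight by three treatment groups):

```r
fit <- anova(lm(weight ~ group, data = PlantGrowth))
fit[["Pr(>F)"]][1]  # p-value for the group effect
```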
Chi-Square tests
Goodness of Fit
obs = c(n1, n2, n3, n4)
exp = c(prop1, prop2, prop3, prop4)
chisq.test (obs, p = exp)
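For example, testing hypothetical counts against equal expected proportions:

```r
obs <- c(30, 20, 25, 25)           # observed counts (hypothetical)
exp <- c(0.25, 0.25, 0.25, 0.25)   # expected proportions; must sum to 1
res <- chisq.test(obs, p = exp)
res$statistic                      # chi-square = sum((obs - 25)^2 / 25) = 2 here
```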
Test for Independence (Crosstabs)
chisq.test(table(vbl.1, vbl.2)) # build the contingency table from the two categorical variables first
df = stack(data.frame(x,y,z)) # if needed, stack separate columns into long format (a values column plus a group indicator)
Odds Ratio & Relative Risk
mymatrix=matrix(c(100,105,150,165),2,2)
dimnames(mymatrix)=list(smoker=c("y","n"), death=c("y","n"))
library(epitools)
oddsratio(mymatrix, method="wald")
riskratio(mymatrix, rev="both") # reverses matrix
Agreement
Kappa Agreement # for categorical or ordinal data
library(irr)
rater1 = c(1,1,2,1,1,3,2,1,1)
rater2 = c(1,1,2,1,1,1,2,1,1)
rater3 = c(1,2,1,1,2,1,3,2,2)
ratings = data.frame(rater1,rater2,rater3)
kappa2(ratings[,c(1,2)]) # compare raters 1 & 2
kappa2(ratings[,c(1,3)]) # compare raters 1 & 3
kappa2(ratings[,c(2,3)]) # compare raters 2 & 3