R is a versatile and reliable statistics package. What makes it particularly attractive to students, statisticians, and researchers is that it's free. That’s right: you can get a leading statistical package for zero dinero. Still not smiling? Most fully licensed statistical packages cost $2000 or more; R costs $0.
Now that you’re smiling, let’s look at how to set up R.
R SET UP
Step 1. Go to the R website at http://cran.r-project.org/
Step 2. Select your operating system (Linux, Mac, Windows)
(Note: the rest of these instructions are for Windows users)
Step 3. Click on “base”.
Step 4. Select the latest version of R for download (top of page)
Step 5. Save the R .exe file to a folder on your computer.
Step 6. After the file downloads, open the folder and double-click the .exe file to start the installation. It is probably best to accept the default installation settings for now. The installation will create an R folder in the list of program files. If you install newer versions of R, this is where the program files will be stored.
After completing steps 1-6, look for the R shortcut icon on your desktop. Double-click it to launch R. You will see a window with the RGui (R graphical user interface). You are set to go and can enter all sorts of data and programming commands for running statistical functions (several of these are shown below). A good add-on to consider is RStudio, a graphical user interface for R. To get RStudio, go to rstudio.org, download the program, and follow the install instructions.
Programming syntax for statistical operations
Matrix Math Functions
a=c(1,4,7)
b=c(2,5,8)
c=c(3,6,9)
matrixa=cbind(a,b,c) # these 4 lines create a 3x3 matrix
matrixa=matrix(c(1,2,3,4,5,6,7,8,9),byrow=TRUE,ncol=3) # creates the same 3x3 matrix as above
t(matrixa) # transposes matrix a
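matrixab=matrixa%*%matrixb # multiplies matrices a and b (matrixb must already be defined, with as many rows as matrixa has columns)
solve(matrixa) # takes the inverse of a square, non-singular matrix (the example matrix above is singular, so solve() returns an error for it)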
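Here is a small sketch of the matrix commands above that you can paste straight into R. The two 2x2 matrices (m1 and m2) are made-up values, chosen only so that an inverse exists; they are not part of the original list.

```r
# Hypothetical 2x2 matrices chosen so that the inverse exists
m1 <- matrix(c(2, 1, 1, 3), byrow = TRUE, ncol = 2)
m2 <- matrix(c(1, 0, 2, 1), byrow = TRUE, ncol = 2)

m1t   <- t(m1)           # transpose
m12   <- m1 %*% m2       # matrix product
m1inv <- solve(m1)       # inverse; works because det(m1) is not 0
check <- m1 %*% m1inv    # should be the identity matrix, up to rounding

# The 1-to-9 matrix has a determinant of (numerically) zero,
# so it has no inverse
singular     <- matrix(1:9, byrow = TRUE, ncol = 3)
singular_det <- det(singular)
```

If det() comes back as zero (or something tiny like 1e-16), expect solve() to complain rather than return an inverse.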
Data Entry Techniques
y = c( ) # enter continuous variable y values in () separated by commas.
x = factor(c( )) # enter factor variable x values in () separated by commas.
datasetname = data.frame(x,y,z) # combines vbls x, y, and z into one data frame named datasetname.
cbind(x, y, z) # combines x, y, and z column-wise into one matrix.
cbind(dataframename,newvariable) #adds new variable to an existing data frame.
dataset = edit (data.frame( )) # opens spreadsheet for data entry (not available in RStudio).
attach(dataset) # program will recognize dataset.
x=rnorm(100) # randomly draws 100 values from the standard normal distribution (mean 0, sd 1)
x=rnorm(10,mean=100,sd=16) # randomly draws 10 values from a normal distribution with the specified mean and sd
x=1:50 # creates a sequence of numbers from 1 to 50
attach(dataset) # Tells the program to recognize the dataset for analyses
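To see the data entry commands working together, here is a minimal sketch with made-up scores for six people in two groups. The variable names (w, group_means) are just for illustration. It uses dataset$y instead of attach(), which is one way to avoid name clashes between the data set and other variables in your workspace.

```r
# Hypothetical scores for six participants in two groups
y <- c(12, 15, 11, 20, 22, 19)                 # continuous variable
x <- factor(c("a", "a", "a", "b", "b", "b"))   # grouping factor
z <- 1:6                                       # sequence from 1 to 6

dataset <- data.frame(x, y, z)                 # combine into one data frame
dataset <- cbind(dataset, w = y * 2)           # add a new variable
str(dataset)                                   # show the structure of the data frame

# Refer to columns with $ rather than attach()
group_means <- tapply(dataset$y, dataset$x, mean)
```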
Importing Data Files from Excel and SPSS
From Excel: When the data file originates from Excel, save the data file to your desktop or other folder (e.g., R working directory) as a text file with the ".txt" extension, then open it with RStudio by clicking the "import dataset" tab in the top right pane and locating your data file. RStudio should recognize the first row of variable names as just that, variable names. Run the attach(dataset) command. Save file in R working directory (see below).
From SPSS: When the data file originates from SPSS, save the data file to your desktop or other folder (e.g., R working directory) as a tab delimited ".tab" file, then open it with RStudio by clicking the "import dataset" tab in the top right pane and locating your data file. RStudio should recognize the variable names. Run the attach(dataset) command. Save file in R working directory (see below).
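If you would rather type a command than click through the import tab, a tab-delimited text file can usually be read with read.delim. The sketch below first writes a tiny stand-in file (the file name and its contents are made up) so that it runs on its own; with a real file you would skip that step and point read.delim at your own path.

```r
txt_path <- file.path(tempdir(), "mydata.txt")           # hypothetical file name
writeLines(c("id\tscore", "1\t90", "2\t75"), txt_path)   # stand-in for a file saved from Excel or SPSS

dataset <- read.delim(txt_path, header = TRUE)   # header = TRUE: first row holds the variable names
names(dataset)
```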
#Saving a dataset in csv format (into R working directory)
# The following shows how to save a dataset in csv format to your working directory. To save a data frame in CSV format, use write.csv. Unless you’ve created row names via the rownames command, you typically want to set row.names = FALSE.
write.csv(dataset, file = "datasetfilename", row.names=FALSE) # you can keep dataset and filename the same
#Loading a csv dataset (from R working directory)
# load dataset using the import dataset command in the top right pane of RStudio, or use . . .
dataset = read.csv("datasetfilename") # you can keep the dataset name and filename the same
# check for appropriate dataset name using ls() command
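A quick way to convince yourself that saving and loading work is a round trip: write a small data frame out, read it back, and compare. This is only a sketch with made-up data, and it writes to a temporary folder (via tempdir()) rather than your working directory so it leaves nothing behind.

```r
dataset  <- data.frame(id = 1:3, score = c(90.5, 72, 85))   # made-up data
csv_path <- file.path(tempdir(), "dataset.csv")             # hypothetical file name

write.csv(dataset, file = csv_path, row.names = FALSE)      # save
reloaded <- read.csv(csv_path)                              # load it back

identical(names(reloaded), names(dataset))                  # same variable names?
```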
Exporting Data Files from R
When exporting a dataset from RStudio to Excel or SPSS, try saving it as a "csv" file (see above) and opening the file in Excel or SPSS. A quicker and dirtier approach is to click on the appropriate dataset name in the top right workspace window of RStudio (the dataset will appear in the top left window), then copy and paste the dataset into Excel or SPSS and you're good to go.
Viewing and Editing Data
x = edit(x) # opens window to edit variable “x”
data.entry (x) # opens spreadsheet to edit variable “x”
dataset=edit(dataset) # opens data editor to edit dataset. Run attach(dataset) after edits (see fix command below)
x[1]=3 # changes first value in variable ‘x’ to the number 3.
data ( ) # lists available data sets
x # just type variable name to see a list of values in the variable “x”
x[ ] # put a number in the brackets to bring up the datum at that position
ls ( ) # lists active variables
rm ( ) # removes/deletes variable and its data
rank ( ) # gives ranks for the data points
sort ( ) # sorts data from smallest to largest for specified variable
round (x, n) # round the elements of “x” to “n” decimal places
fix(dataset) # opens data editor and stores the edits in dataset. Edit variable names or data, then run "attach(dataset)" again so the changes are recognized
dataset = transform(dataset, x=x/10, y=y*2, z=c(2,4,6,...))
# use the "transform" command to perform data transformations and add variables to a data set
#Save and extract a dataset in R format:
save(datasetname, file = "datasetfilename.rda")
load("datasetfilename.rda")
# Check for appropriate datasetname using ls() command
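The transform, save, and load commands can be tried out together as below. It is a sketch on a made-up three-row data frame, and the file name is hypothetical. Note that load() puts the object back under its original name, so you do not assign its result to get the data.

```r
dataset <- data.frame(x = c(10, 20, 30), y = c(1, 2, 3))   # made-up data
dataset <- transform(dataset, x = x / 10, y = y * 2, z = c(2, 4, 6))

rda_path <- file.path(tempdir(), "dataset.rda")   # hypothetical file name
save(dataset, file = rda_path)

rm(dataset)                          # remove the object from the workspace
restored_names <- load(rda_path)     # load() returns the names of the restored objects
ls()                                 # "dataset" is listed again
```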
Basic Stats & Descriptives
mean (x)
median (x)
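mode (x) # caution: gives the storage mode of x (e.g., "numeric"), not the most frequent value; use table (x) to find the statistical mode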
max (x)
min (x)
quantile (x)
IQR (x)
range (x) # gives lowest and highest score
sum (x) # gives sum of variable
sd (x) # sample standard deviation (n-1 in the denominator)
var(x) # unbiased variance
summary (x) # summary stats
length (x) # sample size
cov (x, y) # covariance for x and y
scale(x) # to find z-scores
pnorm(scale(x)) # gives (Percentile Rank) areas to left of z-scores
pnorm(x, mean, sd) # gives PR for data points with a specified mean and sd
t = (scale(x))*10+50 # convert into t-scores
describeBy(dataset, group = grouping_variable_name) #summary statistics by group (requires psych package; older versions of psych called this describe.by)
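The standard-score commands are easier to follow on a tiny example. The six scores below are made up, and the names tscore and most_frequent are just for illustration. The last few lines show one possible way to get the most frequent value, since mode(x) in R reports the storage type instead.

```r
x <- c(2, 4, 4, 5, 7, 8)               # made-up scores

z      <- as.numeric(scale(x))         # z-scores; scale() returns a one-column matrix
tscore <- z * 10 + 50                  # T-scores: mean 50, sd 10
pr     <- pnorm(z)                     # percentile ranks, if x is roughly normal

# mode(x) reports the storage mode ("numeric"), not the most frequent value
storage       <- mode(x)
counts        <- table(x)
most_frequent <- as.numeric(names(counts)[which.max(counts)])
```

If several values tie for most frequent, which.max() returns only the first of them, so check table(x) by eye when that matters.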
Graphics
plot (x) # plots the values of x against their position (index) in the variable
plot(x,y) # creates scatter plot (outcome variable [y] listed second)
abline(lm(y~x)) # draws regression line (you must do plot(x,y) first and minimize its window)
plot(scale(x),scale(y)) # puts both variables on the same scale
barplot(x) # creates bar graph
boxplot(x) # creates boxplot
boxplot (x,y) # view both plots side by side
boxplot (y~x) # y is continuous and x is grouping factor
xyplot(y~x|z) # y and x are continuous, z is grouping factor (requires lattice package)
stem (x)
hist (x)
lines(density(variable.name)) # superimposes a density curve on a histogram [do hist (x, freq=FALSE) first so both are on the density scale]
hist (x,10) # a histogram with about 10 bins (R treats the number as a suggestion)
table (x)
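Here is a short sketch that strings a few of the graphics commands together on simulated data (the numbers are random and only for illustration). The freq = FALSE argument is an addition to the plain hist (x) call: it puts the histogram on the density scale so the curve drawn by lines() lines up with the bars.

```r
set.seed(1)                              # reproducible made-up data
x <- rnorm(100, mean = 100, sd = 16)
y <- 0.5 * x + rnorm(100, sd = 8)

plot(x, y)                               # scatter plot, outcome on the vertical axis
abline(lm(y ~ x))                        # add the fitted regression line to that plot

h <- hist(x, freq = FALSE)               # histogram on the density scale
lines(density(x))                        # density curve drawn over the histogram
```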
Regressions
Linear Regression
lm(y~x) # assumes that a data set with a "y" DV and "x" is already active
summary(lm(y~x)) # gives coefficients, standard errors, R-squared, and the F test for the regression
abline(lm(y~x)) # draws regression line (you must do plot(x,y) first and minimize its window)
or try...
model=lm(y~x, data=datafile.name)
summary(model)
and then...
coefficients(model) #model coefficients
confint(model, level=0.95) #95% CIs for coefficients
fitted(model) # predicted values
residuals(model) # residuals
anova(model) # anova table
vcov(model) # covariance matrix for model parameters
influence(model) # regression diagnostics
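To try the regression commands without your own data, you can simulate some. In this sketch the data frame name (datafile), the true intercept (3), and the true slope (2) are all made up, and predict() is one extra command not listed above that gives a predicted y for a new x.

```r
set.seed(42)
datafile   <- data.frame(x = 1:20)
datafile$y <- 3 + 2 * datafile$x + rnorm(20)   # true intercept 3, slope 2, plus noise

model <- lm(y ~ x, data = datafile)
coefs <- coefficients(model)                   # estimated intercept and slope
ci    <- confint(model, level = 0.95)          # 95% CIs for both coefficients
r2    <- summary(model)$r.squared              # proportion of variance explained

# predicted value for a new x (hypothetical value 25)
pred <- predict(model, newdata = data.frame(x = 25))
```

The estimates should land close to 3 and 2, which is a handy sanity check that the model is specified the way you intended.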
Multiple Regression
summary(lm(y ~ x1 + x2 + x3)) #assumes data set is already active
or try...
model=lm(y~x1+x2+x3, data=datafile.name)
summary(model)
and then...
coefficients(model) # model coefficients
confint(model, level=0.95) #95% CIs for coefficients
fitted(model) # predicted values
residuals(model) # residuals
anova(model) # anova table
vcov(model) # covariance matrix for model parameters
influence(model) # regression diagnostics
Logistic Regression (where Y is a binary factor and predictors are continuous variables)
summary(glm(y ~ x1 + x2 + x3, family=binomial)) # assumes data set is already active
or try...
model <- glm(y~x1+x2+x3,data=datasetname, family=binomial)
summary(model) # display results
and then...
confint(model) # 95% CI for the coefficients
exp(coef(model)) # exponentiated coefficients to get Odds Ratios
exp(confint(model)) # 95% CI for exponentiated coefficients (Odds Ratios)
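The logistic regression steps can be tried on simulated data like this. Everything about the data is made up (one predictor, true log-odds of -0.5 + 1.2 * x1). One swap to be aware of: the sketch uses confint.default(), which gives Wald intervals, rather than confint(model), which profiles the likelihood; the two are usually close but not identical.

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
p  <- plogis(-0.5 + 1.2 * x1)            # true probabilities from the assumed log-odds
y  <- rbinom(n, size = 1, prob = p)      # simulated 0/1 outcome
datasetname <- data.frame(y, x1)

model       <- glm(y ~ x1, data = datasetname, family = binomial)
odds_ratios <- exp(coef(model))              # exponentiated coefficients
or_ci       <- exp(confint.default(model))   # Wald 95% CIs on the odds-ratio scale
```

An odds ratio above 1 for x1 means higher x1 goes with higher odds of y = 1, which is what the simulation was built to produce.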
Correlation
cor (x,y) # gives Pearson r
cor(x,y)^2 # coefficient of determination
cor(rank(x),rank(y)) # gives Spearman rank correlation coefficient (same as cor(x, y, method="spearman"))
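A small made-up example shows that correlating the ranks gives the same number as asking cor() for Spearman directly. cor.test() is an extra command, not listed above, that adds a significance test and confidence interval for Pearson r.

```r
x <- c(1, 2, 3, 4, 5, 6)     # made-up data
y <- c(2, 1, 4, 3, 7, 9)

r           <- cor(x, y)                        # Pearson r
r_squared   <- r^2                              # coefficient of determination
rho_ranks   <- cor(rank(x), rank(y))            # Spearman via ranks
rho_builtin <- cor(x, y, method = "spearman")   # Spearman via the method argument
result      <- cor.test(x, y)                   # test of H0: correlation = 0
```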
Testing Proportions
Single Sample Proportion test
prop.test(n, N, p = nullprop, conf.level=.95) # n = number of successes, N = sample size, nullprop = proportion under the null hypothesis
Exact Binomial Test for a Single Sample
binom.test(n, N, p = nullprop) # use the exact binomial test when sample size is small (it is valid at any sample size, so you can also always use it)
Two Samples Proportion Test
prop.test(c(n1, n2), c(N1, N2))
Fisher's Exact Test for Two Samples
fisher.test(matrix(c(f1,f2,m1,m2),nrow=2)) # use Fisher's exact test when sample size is small. f1 represents, for example, the number of females falling into the 'yes' category and f2 the number of females falling into the 'no' category for a dichotomous variable. m1 represents the number of males falling into the 'yes' category and m2 the number of males falling into the 'no' category. R fills a matrix column by column, so with nrow=2 the first two values (f1 and f2) form the first column and the last two values (m1 and m2) form the second column, giving a 2x2 matrix with 'yes' in the first row and 'no' in the second.
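Labelling the rows and columns makes it easy to check that the counts ended up where you meant them to. The counts below are made up (8 of 10 females and 3 of 10 males answering 'yes'), and the dimnames labels are only for illustration.

```r
# Made-up counts: 8 of 10 females and 3 of 10 males answered "yes"
f1 <- 8; f2 <- 2
m1 <- 3; m2 <- 7

tab <- matrix(c(f1, f2, m1, m2), nrow = 2,
              dimnames = list(answer = c("yes", "no"),
                              sex    = c("female", "male")))
tab                                        # print the table to check the layout

fisher_result <- fisher.test(tab)          # exact test on the 2x2 table
prop_result   <- prop.test(c(f1, m1), c(f1 + f2, m1 + m2))   # two-sample proportion test
```

Printing tab before running the test is a cheap habit that catches most mix-ups between rows and columns.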
Common Parametric Tests
Between groups t-test equal variances
t.test(y~x, var.equal=TRUE) # Add alt="two.sided", "less", or "greater" to specify alternative hypothesis
Between Groups, t-test, unequal variances (Welch’s Test)
t.test(y~x, var.equal=FALSE) # Add alt="two.sided", "less", or "greater" to specify alternative hypothesis
Within Groups t-test
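t.test(y1, y2, paired = TRUE) # y1 and y2 hold the two sets of paired scores in matching order (newer versions of R no longer accept paired = TRUE with the y~x formula). Add alt="two.sided", "less", or "greater" to specify alternative hypothesis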
lm(y~x) # also running this will give the regression coefficients
Independent Groups, One-Way ANOVA
anova(lm(y ~ x)) # produces ANOVA table for continuous variable y and factor x
or
print(summary(lm(y~as.factor(x)))) #produces F stat, p-value, and regression coefficients
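With only two groups, the equal-variances t-test and the one-way ANOVA are the same test: the F statistic is the t statistic squared, and the p-values match. The sketch below checks this on made-up scores for two hypothetical groups.

```r
y <- c(12, 15, 11, 14, 20, 22, 19, 23)                     # made-up scores
x <- factor(rep(c("control", "treatment"), each = 4))      # two groups of four

tt        <- t.test(y ~ x, var.equal = TRUE)   # between-groups t-test, equal variances
aov_table <- anova(lm(y ~ x))                  # one-way ANOVA table

t_squared <- unname(tt$statistic)^2
f_value   <- aov_table[["F value"]][1]
c(t_squared = t_squared, F = f_value)          # the two numbers agree
```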
Factorial, Two-Way ANOVA (coming soon)
Non-Parametric Tests # Add alt="two.sided", "less", or "greater" to specify alternative hypothesis
wilcox.test(x, mu = 5) # Wilcoxon for comparing x data to a population value mu.
wilcox.test(x, y, paired = TRUE) # Wilcoxon-Signed-Ranks
wilcox.test(x,y) #Wilcoxon-Rank-Sum a.k.a Mann-Whitney-U
obs = c(n1, n2, n3, n4) # Chi-square test for goodness of fit (all 3 rows)
exp = c(prop1, prop2, prop3, prop4) # expected proportions under the null hypothesis (must sum to 1)
chisq.test (obs, p = exp)
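chisq.test(data.frame(vbl.1, vbl.2)) # Chi-square test for independence (vbl.1 and vbl.2 are columns of counts; for raw categorical variables use chisq.test(table(vbl.1, vbl.2)))
df = stack(data.frame(x,y,z)) # stacks x, y, and z into one data frame with columns "values" and "ind" for the Kruskal-Wallis test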
kruskal.test(values ~ ind, data = df)
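Here is a sketch of the goodness-of-fit and Kruskal-Wallis commands with made-up numbers. The expected proportions are stored in a variable called expected (a hypothetical name) rather than exp, simply to avoid reusing the name of R's exp() function.

```r
obs      <- c(18, 22, 30, 30)                 # made-up observed counts
expected <- c(0.25, 0.25, 0.25, 0.25)         # expected proportions, summing to 1
gof      <- chisq.test(obs, p = expected)     # chi-square goodness of fit

x <- c(3, 5, 4, 6); y <- c(7, 9, 8, 10); z <- c(2, 1, 3, 2)   # three made-up groups
df <- stack(data.frame(x, y, z))              # long format: columns "values" and "ind"
kw <- kruskal.test(values ~ ind, data = df)   # Kruskal-Wallis test across the groups
```

stack() is what turns three side-by-side columns into the one-column-of-scores, one-column-of-group-labels layout that kruskal.test expects.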