This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button (in RStudio), a document will be generated that includes both the content and the output of any embedded R code chunks within the document. Note, however, that to publish to certain file types, you will need to install additional software. You can embed an R code chunk like this:
#get current working directory
getwd()
#set working directory to the folder you want to use
#the working directory is where R looks to read files and where it writes files
setwd("C:/Users/") #MUST BE MODIFIED FOR YOUR COMPUTER
#you must change the line above to fit your preferred location
#in Windows, you can find a folder's full directory by
#right clicking the folder and selecting "Properties"
R is a programming language designed specifically for conducting statistics and generating graphics related to those statistics. It is a functional programming language, and such programming languages are well suited for data analysis. Although any statistical analysis can be run ‘by hand’ in R by typing in the formula, over 10,000 packages are now available that simplify many of these analyses.
First and foremost, then, R is the most powerful calculator you have ever ‘owned’. For example:
#addition and subtraction
2+2
3+5
#multiplication and division
(2+2)*(3+5)
(3+5)/(2+2)
#exponents
2^2
4^.5
We can also do matrix algebra, which underlies most statistical analyses. However, unless you take statistics courses in the Statistics Department, you are unlikely to need to do matrix algebra by hand. This website https://www.statmethods.net/advstats/matrix.html lists some common matrix manipulations, in case you need them.
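If you ever do want a taste of matrix algebra in R, here is a minimal sketch (the matrices below are made up purely for illustration):

```r
#a small, made-up example of matrix algebra in base R
A <- matrix(c(1, 2, 3, 4), nrow=2) #a 2x2 matrix, filled column-wise
B <- matrix(c(5, 6, 7, 8), nrow=2)
A + B     #element-wise addition
A %*% B   #matrix multiplication (plain * is element-wise!)
t(A)      #transpose
solve(A)  #inverse (A must be square and non-singular)
```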
Note that datasets are matrices, with dimensions n × p (n rows and p columns, where n is the number of observations and p is the number of variables). Indeed, it becomes useful to think of datasets this way when working with them in R, because you may want to select a subset of observations or variables for specific analyses. Additionally, when we refer to matrices and/or datasets in R, we refer to them by name, then row, then column, like this: data[row, column].
If you’re not sure how to do something in R, Google it (or use your preferred search engine, like DuckDuckGo, which doesn’t track and sell your information)! Many tutorials have been written for R and its various packages, and someone has likely already asked your question on websites such as stats.stackexchange.com.
If your code isn’t running, it usually means there’s a dreaded TYPO in your code! After many hours of work, or late at night, it can be very difficult to notice small errors. Best practice is to write and test your code piece by piece. Then you’ll know at each step that everything is working, rather than writing and running a large chunk of code only to find errors buried somewhere. Easy advice, but difficult to follow (and still frustrating when you can’t find the typo, missing comma, or missing parenthesis)!
RStudio helps you write your R code by suggesting completions for your statements, by highlighting different parts of the code in various colors, and by flagging lines that contain errors.
As mentioned, R now has over 10,000 packages available to help you conduct statistics. These packages span a variety of functions, from manipulating your data (e.g., dplyr) to running modern analytic techniques like machine learning with k-fold cross-validation (e.g., caret). For now, we will demonstrate how to install and load one of the packages most widely used among psychologists (and in general): Dr. William Revelle’s (Northwestern University) ‘psych’ package.
What is a package, you may ask? It is a collection of custom-built functions. A function wraps the code needed to run a certain analysis or perform an operation, so that you only have to specify a select few pieces of information. For every function that you use, someone has manually written the underlying code and made it publicly available for your benefit. Functions, then, save you the work of writing the code for an analysis yourself.
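To make this concrete, here is a tiny function of our own (the name se() and the numbers are just an illustration, not part of any package):

```r
#define a function that computes the standard error of the mean
#'x' is the single piece of information the user must supply
se <- function(x) {
  sd(x) / sqrt(length(x))
}
#once defined, we call it just like any built-in function
se(c(2, 4, 6, 8)) #about 1.29
```

Package authors write many such functions, document them, and bundle them together so you can load them all with a single library() call.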
#first we should load the packages we plan to use
#we'll be using several commands from Dr. William Revelle's 'psych' package
#install the package by deleting the hashtag from the line below
#install.packages("psych")
#load the package into the working environment
library(psych)
#note that the quotes ARE required to install, but not to load packages
#also, capitalization matters in R
So, we have specified our working directory to what we want and loaded in the package we plan to use. Now, we need to get some data loaded into the environment.
Although R comes with several built-in datasets, reading in your own data with read.csv() is a foundational skill. “.csv” files are “comma separated values” files, a plain-text type of spreadsheet. You can save spreadsheets as .csv in both Google Sheets and Microsoft Excel.
Go to the email I sent yesterday to download the dataset we will be using (if you have not already), and save it to the working directory you specified with setwd() above. In the code chunk below, we will save our dataset into the object ‘d’.
#normally you want to give your data frames a meaningful name
#this becomes especially important if you are working with multiple data files
#here we'll just use 'd' for convenience
d <- read.csv(file="states.csv", header=T, stringsAsFactors = F)
#header=TRUE if variable names are in the first row; 'T' and 'F' are short for 'TRUE' and 'FALSE'
#if a manipulation or categorical variable is stored as text labels, use stringsAsFactors=TRUE
#stringsAsFactors=TRUE makes a categorical variable usable in formulas without converting it to numbers
#stringsAsFactors=TRUE orders factor levels alphabetically, with codes beginning at 1
#for instance, if your data has a factor with 'males' and 'females', females=1 and males=2
#we define objects in R by using either 'object_name <- value' or 'object_name = value'
#'<-' is preferred because '=' has two meanings in R
#so, in R, we use these objects to store data, lists, analytic results, etc.
#how many observations and variables do we have? Use dim(), a base R function, to find out
dim(d) #51 observations and 18 variables
#preview the first several rows of the data
head(d)
#we can explore specific data values by referencing the rows and/or columns of interest
#let's look at the 10th row's values
d[10,]
#let's look at the 10th column's values
d[,10]
#so, reference rows before the ',', and reference columns after ','
#in other words, object[rows,columns], or object[n,p]
#we can view multiple rows and a specific set of columns
d[3:6, 1:3]
#lets see Indiana's personality!
d[16,14:18]
#summary() is a base R function that provides basic descriptive statistics
summary(d)
#we can also view the dataset in RStudio using View()
View(d)
#can also view the dataset by clicking on its name in the 'Environment' interface
#the first variable name got read in wrong
#to correct it, we can use the colnames() function
colnames(d)[1] <- "State" #colnames() takes the dataset in (); the [] select which column name(s) to change
#to rename multiple columns at once, we can create a list
#then we pass the list to the colnames() function
var_names <- c("State", "Children_poverty", "Quality_of_life")
colnames(d)[1:3] <- var_names #same as: colnames(d)[1:3] <- c("State", "Children_poverty", "Quality_of_life")
#we can also view (and refer to) columns using the variable names
d$State
#describe() is a psych package function that provides detailed descriptive statistics
describe(d)
## Warning in describe(d): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
#all the warning messages we see come from trying to describe the state labels
#the state labels are text, not numbers
#we can rerun without that column using a negative sign
describe(d[,-1])
#or by selecting the relevant columns
describe(d[,2:18])
So, above we demonstrated how we can select certain rows/observations and columns/variables (or exclude them). When working with data, it is often necessary to select a certain subset of data or to combine two datasets. We can select data using the commands demonstrated above, but to combine datasets, we use rbind() to bind rows (i.e., observations) together, and cbind() to bind columns (i.e., variables) together.
If you need to combine datasets using a reference column (like participant identifiers), that is beyond this tutorial, but the dplyr package has functions to do it.
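As a quick sketch of what such a merge looks like, base R’s merge() joins two data frames on a shared ID column (the toy data frames below are invented for illustration; dplyr’s left_join() and friends work similarly):

```r
#toy data frames sharing an 'id' column
demographics <- data.frame(id=c(1, 2, 3), age=c(21, 34, 28))
scores <- data.frame(id=c(2, 3, 4), score=c(95, 72, 88))
#inner join: keeps only ids present in both data frames (2 and 3)
merge(demographics, scores, by="id")
#left join: keeps every demographics row, filling NA where scores are missing
merge(demographics, scores, by="id", all.x=TRUE)
```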
#say we wanted to get rid of the Gallup variables (i.e., GallupJobSatisfaction, GallWorkStress, and GallFinWB)
#we identify which columns they are using describe()
#then we can drop those columns with two different methods
#method 1
d_new1 <- d[,-8:-10]
#method 2
d_new2 <- cbind(d[,1:7], d[,11:18])
#NOTE: when we edit our data, best practice is to assign it to a new object
#then, we still have the original dataset in case it's needed
#test if the new data frames are the same
identical(d_new1, d_new2)
#test cell by cell to see if they're the same
d_new1 == d_new2
#we can perform parallel operations to drop observations
#for instance, perhaps we want to exclude Washington, D.C., since it has no senators
d[1:10,1] #DC is in the 8th row
d_new3 <- d[-8,] #method 1
d_new4 <- rbind(d[1:7,], d[9:51,]) #method 2
identical(d_new3, d_new4) #check for equivalence
We can also use plots to summarize our data and inspect for assumptions, like normality. Here are some examples:
#histogram
hist(d[,3])
#hist(d$QualityOfLife) #another way of running the same code
#bivariate distribution plots
plot(d[,15:16]) #same as plot(d[,15], d[,16]) except labels are lost
plot(d[,2:8])
plot(d[,13:18])
#quantile-quantile, or QQ plots
qqnorm(d[,3])
#qqnorm(d$QualityOfLife) #another way of running the same code
#box and whisker plot
boxplot(d[,3])
#note that with this and other plots, we can change labels on the plot
#how to do so may change depending on the plot but generally is similar
boxplot(d[,3],
main="OECD Quality of Life in the United States", #sets main title
xlab=" ", #x axis label
ylab="Quality of Life", #y axis label
col="blue", #color of box
border="orange", #color of box borders and whiskers
horizontal=F, #if T, makes boxplot horizontal
notch=F, #if T, adds a notch to boxplot
ylim=c(50,80)) #sets y limits on graph
#scatterplot with regression line
plot(d$Neuroticism, d$AVGunemployment, main="Scatterplot Example",
ylab="Average State-level Unemployment 2013-2018 (%)", xlab="State-level Neuroticism", pch=19)
#pch changes the shape of the dots on the scatterplot
#add the regression line
abline(lm(d$AVGunemployment ~ d$Neuroticism))
#note that if you write the formula in reverse, the line won't match the plot!
#abline() needs the vertical-axis variable as the outcome (listed first in the formula, as in y=bx+e)
#more on writing linear models later
#we can use the 'car' (companion to applied regression) package to make fancier scatterplots
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
#add scatterplot with regression line, Lowess line confidence intervals, and box & whisker plots
scatterplot(d$AVGunemployment, d$Neuroticism,
xlab="Average State-level Unemployment 2013-2018 (%)", ylab="State-level Neuroticism")
Personally, I consider the correlation matrix (with means and SDs) to be the most important piece of information provided in papers. Often, you can guess the paper’s results from the correlation matrix, even if they then go on to use Multiple Regression or Structural Equation Modeling to analyze the data.
#bivariate correlations
cor(d$AVGunemployment, d$Neuroticism)
#correlation matrix
cor(d[,-1])
#with p values using corr.test() from psych package
corr.test(d$AVGunemployment, d$Neuroticism)
#can also create correlation matrix
corr_matrix <- corr.test(d[,-1])
#a note here on stored objects--they can have multiple dimensions themselves
#this is similar to the variables in a data frame
#corr.test() generates r, n, and p variables--correlation matrix, N used for each correlation, and p-value of correlations
#access them via corr_matrix$r, corr_matrix$n, and corr_matrix$p
#using the apaTables package, we can output APA formatted correlation matrices!
#install.packages("apaTables")
library(apaTables)
apa.cor.table(d[,14:18], filename="StatePersCors.doc", table.number=1, show.conf.interval=T)
Frequently, we may want to mean center and/or standardize variables (giving them mean 0 and SD 1, respectively). Additionally, variables sometimes need to be transformed to meet the assumptions of our statistical models (e.g., a normal distribution).
We can see from the descriptives we ran above that state-level Agreeableness is the most skewed variable and also has the highest kurtosis of any variable in our dataset, so we may want to transform it so that it has a more ‘normal’ distribution.
#center and/or scale with scale(). In other words, scale() can mean center and standardize variables
#when scale=F, only mean centers variables
d_scale1 <- scale(d[,2:18], scale=F)
describe(d_scale1)
#when scale=T, also makes variance=1 (i.e., unity) for all variables
d_scale2 <- scale(d[,2:18], scale=T)
describe(d_scale2)
#note that the state-level personality variables aren't affected, as they were already standardized
#compare the descriptions of d_scale1 and d_scale2 to describe(d)
#we could also center variables manually, one at a time
d_scale3 <- d
d_scale3$AVGunemployment <- d_scale3$AVGunemployment-mean(d_scale3$AVGunemployment)
#we can transform Agreeableness to make it more normally distributed
#visualize the skewness
hist(d$Agreeableness)
d_trans1 <- d
d_trans1$Agreeableness <- (d_trans1$Agreeableness+5)^2
#note that this is an unusual transformation because it is a standardized var
#visualize the transformation
hist(d_trans1$Agreeableness)
describe(d_trans1$Agreeableness) #see that skewness and kurtosis are much improved
describe(d$Agreeableness) #compare to original
R has useful functions for running almost any statistical test you can imagine (and many you cannot!). Base R has functions for t-tests and for ANOVA/regression via the general ‘linear model’.
#we don't have any discrete variables for a t-test, but we can still do one
ttest1 <- t.test(d[1:25,3], d[26:50,3])
#another way of writing the command outside of R markdown
#ttest1 <- t.test(d$QualityOfLife[1:25], d$QualityOfLife[26:50])
ttest1 #we should not see a significant difference because...
#it's just Quality of Life for alphabetical first half of states vs. second half
#basic regression
#we've already seen one example above when we plotted the regression line on scatterplot
model_1 <- lm(AVGunemployment ~ Agreeableness, data=d)
#lm() command, with outcome/Y variable on left side of equation, predictors/X variables on right side
model_1
summary(model_1) #notice the much more detailed output
#re-run with the transformed Agreeableness variable
model_1t <- lm(AVGunemployment ~ Agreeableness, data=d_trans1)
summary(model_1t)
#multiple regression--just add more predictors with +
model_2 <- lm(AVGunemployment ~ Extraversion + Agreeableness +
Conscientiousness + Neuroticism + Openness.to.Experience, data=d)
summary(model_2)
model_2$coefficients
model_2$residuals #more on this later
model_2$fitted.values #more on this later
#we can get standardized coefficients by using scale() on the variables
#we can use scale() earlier to create a new dataframe or within lm()
#note that the personality variables are already standardized
model_2st <- lm(scale(AVGunemployment) ~ Extraversion + Agreeableness +
Conscientiousness + Neuroticism + Openness.to.Experience, data=d)
summary(model_2st)
#hierarchical regression
#this is a technique to see if a variable adds incremental variance for predicting an outcome
model_3 <- lm(AVGunemployment ~ Extraversion + Agreeableness + Conscientiousness + Openness.to.Experience, data=d)
#add neuroticism to the model
model_4 <- lm(AVGunemployment ~ Extraversion + Agreeableness + Conscientiousness +
Openness.to.Experience + Neuroticism, data=d)
#test if the models are significantly different
anova(model_3, model_4)
#what if we want to include interactions in model?
#just add it to our model statement using varName*varName
model_5 <- lm(AVGunemployment ~ Extraversion + Agreeableness + Conscientiousness + Neuroticism
+ Openness.to.Experience + Extraversion*Agreeableness
+ Conscientiousness*Neuroticism + Extraversion*Neuroticism, data=d)
Plotting interactions can be done the easy way, or the hard way. Here’s an example of how to do it in a very easy way :)
#simplest way to plot interactions... is with a package called interactions :)
#install.packages("interactions")
library(interactions)
interact_plot(model_5, pred=Neuroticism, modx=Extraversion) #use the variables from interaction of interest
#high E and N is associated with high unemployment; high E and low N is associated with low unemployment
#N is not related to unemployment when E is low
#we can plot it with Neuroticism as the moderator too. Theory must lead the way here
interact_plot(model_5, pred=Extraversion, modx=Neuroticism)
#probing the interaction further via 'simple slopes'
#install.packages("emmeans")
library(emmeans)
## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
## Indicator predictors are now treated as 2-level factors by default.
## To revert to old behavior, use emm_options(cov.keep = character(0))
exta <- mean(d$Extraversion) + sd(d$Extraversion)
ext <- mean(d$Extraversion)
extb <- mean(d$Extraversion) - sd(d$Extraversion)
mylist <- list(Extraversion=c(extb,ext,exta))
emtrends(model_5, ~Extraversion, var="Neuroticism", at=mylist)
#provides the slope (and its confidence interval) of Neuroticism's relationship with unemployment at each specified level of Extraversion
resid_2 <- model_2$residuals
preds_2 <- model_2$fitted.values
#now we can use resid_2 and preds_2 to do stuff with the residuals and/or fitted values
#like students will be doing in PSY 631
#can also combine them together in a single object
residpred_mod2 <- cbind(model_2$residuals, model_2$fitted.values)
#equivalent to: cbind(resid_2, preds_2)
Note that the echo=FALSE parameter can be added to a code chunk to prevent printing of the R code that generated the results. Deleting results='hide' allows the output of R commands to be included in your R Markdown files.
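For reference, these options go in the chunk header. A sketch of a chunk that runs silently might look like this (summary(d) is just a stand-in for any code):

````
```{r, echo=FALSE, results='hide'}
summary(d) #the code runs, but neither the code nor its output appears in the knitted document
```
````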
#information on the base R package and its functions
#https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html
#dplyr helps with manipulating data frames
#library(dplyr)
#so does tidyr
#library(tidyr)
#a super easy-to-use package for meta analysis
#library(psychmeta)
#a package for confirmatory factor analysis and structural equation modeling
#library(lavaan)
#want to use machine learning? caret makes it easy!
#library(caret)
#a page with good info for probing interactions in r: https://stats.idre.ucla.edu/r/seminars/interactions-r/