07/06/2017

R basics

Materials

You can find the materials for today's course on my website. Either go to the Moodle page for MY591 and click on the links, or:

  • Assignment: jackblumenau.com/assignment.html
  • Slides: jackblumenau.com/introtor.html

It is worth noting that both the assignment and the slides were all created using R and RStudio. So was my website.

Beat that, STATA…

I have used STATA for years, why learn R?

Because:

  • R is free
  • R is open source and has excellent package support and development
  • R is (easily) customisable
  • R is very good for data visualisation
  • R can accomodate more than one data matrix in memory at the same time
  • You are about to learn some methods that are basically only available in R

On the other hand:

  • R has a steep(er) learning curve

R and RStudio

  • As in STATA, you can either write commands directly into the console, or use a script (do file) to save commands and then run them
  • It is perfectly possible to use the base R environment for everything, but most people do not do this
  • Today we will mostly be using RStudio, which is a graphical user interface (GUI) for R
  • RStudio has a host of useful features which make it an excellent environment for learning R
    • Syntax highlighting
    • Easy data overview
    • Looks a bit like STATA…

RStudio demonstration

Basic operations in R

1 + 1
## [1] 2
5 - 3
## [1] 2
6/2
## [1] 3
4*4
## [1] 16

Basic operations in R

  • Other typical mathematical functions are also hard-coded:
log(<number>)
exp(<number>)
sqrt(<number>)
mean(<numbers>)
sum(<numbers>)

Basic operations in R

R also understands logical operators:

  • < : less than
  • > : greater than
  • == : equal to (note, not =)
  • >= : greater than or equal to
  • <= : less than or equal to
  • != : not equal to
  • & : and
  • | : or

R objects

  • R is an object-oriented programming language
  • The entities that R creates and manipulates are known as objects
  • These may be variables, arrays of numbers, character strings, functions
  • Objects are created using the assignment operator: <-
  • Once created, an object is stored in your current workspace
my_object <- 10
print(my_object)
## [1] 10
my_other_object <- 4
print(my_object - my_other_object)
## [1] 6

Data Structures

vector - the basic building block of R.

A vector is a collection of entites which all share the same type.

We use the c() function to concatenate observations into a single vector.

  • Numeric
num_vec <- c(5, 4, 2, 100, 7.65)
print(num_vec)
## [1]   5.00   4.00   2.00 100.00   7.65
  • Character
char_vec <- c("apple", "pear", "plumb", "pineapple", "strawberry")
print(char_vec)
## [1] "apple"      "pear"       "plumb"      "pineapple"  "strawberry"

vector - the basic building block of R.

  • Logical
log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
print(log_vec)
## [1] FALSE FALSE FALSE  TRUE  TRUE
as.numeric(log_vec)
## [1] 0 0 0 1 1

vector - the basic building block of R.

  • Factor

A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:

fac_vec <- as.factor(c("a", "b","c","a","b","c"))
fac_vec
## [1] a b c a b c
## Levels: a b c
as.numeric(fac_vec)
## [1] 1 2 3 1 2 3

vector subsetting

To subset a vector, use square parenthesis to index the elements you would like via object[index]:

  • Numerical subsetting
char_vec
## [1] "apple"      "pear"       "plumb"      "pineapple"  "strawberry"
char_vec[1:3]
## [1] "apple" "pear"  "plumb"

vector subsetting

  • Logical subsetting
log_vec
## [1] FALSE FALSE FALSE  TRUE  TRUE
char_vec[log_vec]
## [1] "pineapple"  "strawberry"

vector operations

In R, mathematical operations on vectors occur elementwise:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1]  1  1  2  3  5  8 13
fib[2:8]
## [1]  1  2  3  5  8 13 21
fib[1:7] + fib[2:8]
## [1]  2  3  5  8 13 21 34

vector operations

It is also possible to perform logical operations on vectors:

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_greater_five <- fib > 5
print(fib_greater_five)
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

And we can combine logical operators

fib_greater_five_less_two <- fib > 5 | fib < 2
print(fib_greater_five_less_two)
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

matrix

A matrix is just a collection of vectors! I.e. it is a multidimensional vector where all elements are of the same type

my_matrix <- matrix(data = 1:100, nrow = 10, ncol = 10)
my_matrix
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100

data.frame

A data.frame is a matrix-like R object in which the columns can be of different types

my_data_frame <- data.frame(numbers = num_vec, fruits = char_vec, logical = log_vec)
my_data_frame
##   numbers     fruits logical
## 1    5.00      apple   FALSE
## 2    4.00       pear   FALSE
## 3    2.00      plumb   FALSE
## 4  100.00  pineapple    TRUE
## 5    7.65 strawberry    TRUE

matrix and data.frame subsetting

To subset a matrix or data.frame, you need to specify both rows and columns:

my_matrix[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1   11   21
## [2,]    2   12   22
## [3,]    3   13   23
my_data_frame[1, ]
##   numbers fruits logical
## 1       5  apple   FALSE

matrix and data.frame subsetting

We can also subset to remove rows or columns that we do not want to see by using the - operator applied to the c function:

my_matrix[-c(1:3), -c(1:3)]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   34   44   54   64   74   84   94
## [2,]   35   45   55   65   75   85   95
## [3,]   36   46   56   66   76   86   96
## [4,]   37   47   57   67   77   87   97
## [5,]   38   48   58   68   78   88   98
## [6,]   39   49   59   69   79   89   99
## [7,]   40   50   60   70   80   90  100

In the code, 1:3 creates a vector of the integers 1, 2 and 3, and the - operator negates these. We wrap the vector in the c function so that - applies to each element, and not just the first element.

list

A list is a collection of any set of object types

my_list <- list(something = num_vec, 
                another_thing = my_matrix[1:3,1:3], 
                something_else = "jack")
my_list
## $something
## [1]   5.00   4.00   2.00 100.00   7.65
## 
## $another_thing
##      [,1] [,2] [,3]
## [1,]    1   11   21
## [2,]    2   12   22
## [3,]    3   13   23
## 
## $something_else
## [1] "jack"

functions

R makes extensive use of functions, which all have the same basic structure.

function_name(argument_one, argument_two, ...)

Where

  • function_name is the name of the function
  • argument_one is the first argument passed to the function
  • argument_two is the seccond argument passed to the function

function example

Let's consider the mean() function. This function takes two main arguments:

mean(x, na.rm = FALSE)

Where x is a numeric vector, and na.rm is a logical value that indicates whether we'd like to remove missing values (NA).

vec <- c(1,2,3,4,5)
mean(x = vec, na.rm = FALSE)
## [1] 3
vec <- c(1,2,3,NA,5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75

function example

We can also perform calculations on the output of a function:

mean(vec) * 3
## [1] 9

Which means that we can also have nested functions:

sqrt(mean(vec))
## [1] 1.732051

We can also assign the output of any function to a new object for use later:

sqrt_mean_of_vec <- sqrt(mean(vec))

User defined functions

Functions are also objects, and we can create our own. We define a function as follows:

my_addition_function <- function(a, b){
  return(a + b)
}

my_addition_result <- my_addition_function(a = 5, b = 50)
print(my_addition_result)
## [1] 55

Note that the function itself, and the result returned by the function are both objects!

Help!

R has an inbuilt help facility which provides more information about any function:

help(summary)
?summary

The quality of the help varies hugely as some packages are well-maintained and others less so.

Alternative sources of help:

www.stackoverflow.com

www.google.com

Exercise One

Data manipulation and exploration

Reading data into R (.csv)

my_data <- read.csv(file = "my_file.csv")
  • my_data is an R data.frame object (you could call this anything)
  • my_file.csv is a .csv file with your data
  • <- is the assignment operator
  • In order for R to access my_file.csv, it will have to be saved in your current working directory
  • Use get_wd() to check your current working directory
  • Use set_wd() to change your current working directory

Reading data into R (.dta)

install.packages("foreign") ## install the "foreign" package
library(foreign) ## Loads an additional package that deals with STATA files
my_data <- read.dta(file = "my_file.dta") 
  • The foreign package is not automatically installed in R so we need to do that ourselves
  • my_data is still an R data.frame object (you could still call this anything)
  • my_file.dta is a .dta file with your data
  • <- is still an assignment operator

Descriptive statistics functions

R includes a number of helpful functions for exploring your data

View(my_data) # View data in spreadsheet-style
names(my_data) # Names of the variables in your data
head(my_data) # First six rows of the data
tail(my_data) # Last six rows of the data
str(my_data) # "Structure" of any R object
summary(my_data) # Summary statistics for each of the columns in the data
dim(my_data) # Dimensions of a matrix
range(my_data$my_variable) # Range of a numeric vector
quantile(my_data$my_variable) # Quantiles of a numeric vector

head

str

str(my_data) 
## 'data.frame':    1000 obs. of  4 variables:
##  $ x: num  -0.625 -1.32 1.791 -0.355 1.76 ...
##  $ y: num  1.636 0.536 0.267 1.524 0.908 ...
##  $ z: num  0.399 0.11 0.564 0.292 0.263 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 6 5 4 2 1 6 5 1 ...

summary

summary(my_data) 
##        x                  y                 z             g      
##  Min.   :-3.41936   Min.   :-3.0830   Min.   :0.0008453   a:170  
##  1st Qu.:-0.64878   1st Qu.:-0.2858   1st Qu.:0.2352761   b:160  
##  Median : 0.05486   Median : 0.4243   Median :0.4825389   c:160  
##  Mean   : 0.03154   Mean   : 0.4274   Mean   :0.4853983   d:171  
##  3rd Qu.: 0.73181   3rd Qu.: 1.1314   3rd Qu.:0.7299549   e:156  
##  Max.   : 3.74557   Max.   : 3.7938   Max.   :0.9987806   f:183

pairs

pairs(my_data)

range

Suppose you would like to find the range of a variable in your data.frame. The following will not work:

range(my_data) 
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables

Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…

range

Instead, we need to access one variable. There are two ways to do this.

  • We can subset using square parenthesis, or
  • We can use the $ operator and access the variable of interest by name
range(my_data[,1])
## [1] -3.419364  3.745572
range(my_data$x) 
## [1] -3.419364  3.745572

Plots and graphs

Introduction

  • Plots are one of the great strenghts f R.
  • There are two main frameworks for plotting
  • Base R graphics
  • ggplot
  • Which should you use? It is a matter of preference!

Base R plots

The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot:

plot(my_data$x, my_data$y)

Hmm, let's work on that.

Base R plots

The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:

plot(x = my_data$x, y = my_data$y, 
     xlab = "X variable", 
     ylab = "Y variable", 
     main = "Awesome plot title",
     pch = 19, # Solid points
     cex = 0.5, # Smaller points
     bty = "n", # Remove surrounding box
     col = as.factor(my_data$g) # Colour by grouping variable
     )

Base R plots

Base R plots

The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:

plot(x = my_data$g, y = my_data$x)

Base R plots

ggplot

Also popular is the ggplot2 library. This is a separate package (i.e. it is not a part of the base R environment) but is very widely used. But not by me.

  • Based on the Grammar of Graphics data visualisation scheme
  • Graphs are broken into scales and layers
  • Has slightly idiosyncratic language style!

ggplot

Let's recreate the previous scatter plot using ggplot:

library(ggplot2)
ggplot(data = my_data, aes(x= x, y= y, col = g)) + 
    geom_point() +
    xlab("X variable")+
    ylab("Y variable")+
    ggtitle("ggplot title")+
    theme(panel.background = element_rect("white"))

ggplot

ggplot

One nice feature of ggplot is that it is very easy to create facet plots:

library(ggplot2)
ggplot(data = my_data, aes(x= x, y= y, col = g)) + 
    geom_point() +
    xlab("X variable")+
    ylab("Y variable")+
    ggtitle("ggplot facet title")+
    theme(panel.background = element_rect("white"))+
    facet_wrap(~ g)

ggplot

Other examples of pretty R graphics

Other examples of pretty R graphics

Exercise Two

Linear Regression Models

lm

Linear regression models in R are implemented using the lm function.

my_lm <- lm(formula = y ~ x, data = my_data)

The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.

print(my_lm)
## 
## Call:
## lm(formula = y ~ x, data = my_data)
## 
## Coefficients:
## (Intercept)            x  
##      0.4178       0.3046

lm

We can specify multivariate models:

my_lm_multi <- lm(formula = y ~ x + z, data = my_data)

Interaction models:

my_lm_interact <- lm(formula = y ~ x * z, data = my_data)

Fixed-effect models:

my_lm_fe <- lm(formula = y ~ x + g, data = my_data)

And many more!

lm

The output of the lm function is a long list of interesting output.

When we call print(saved_model), we are presented with the estimated coefficients, and nothing else.

For some more information of the estimated model, use summary(saved_model):

my_lm_summary <- summary(my_lm)
print(my_lm_summary)

lm

## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6246 -0.6960 -0.0202  0.7016  3.2367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.41777    0.03209  13.020   <2e-16 ***
## x            0.30463    0.03238   9.407   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared:  0.08145,    Adjusted R-squared:  0.08053 
## F-statistic:  88.5 on 1 and 998 DF,  p-value: < 2.2e-16

lm

As with any other function, summary(saved_model) returns an object. Here, it is a list. What is saved as the output of this function?

names(my_lm_summary)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

If we want to extract other information of interest from the fitted model object, we can use the $ operator to do so:

print(my_lm_summary$r.squared)
## [1] 0.08145418

lm

Accessing elements from saved models can be very helpful in making comparisons across models:

my_lm_r2 <- summary(my_lm)$r.squared
my_lm_multi_r2 <- summary(my_lm_multi)$r.squared
my_lm_interact_r2 <- summary(my_lm_interact)$r.squared

r2.compare <- data.frame(
                  model = c("bivariate", "multivariate", "interaction"), 
                  r.squared = c(my_lm_r2, 
                         my_lm_multi_r2, 
                         my_lm_interact_r2))

lm

We can print the values:

print(r2.compare)
##          model  r.squared
## 1    bivariate 0.08145418
## 2 multivariate 0.08552710
## 3  interaction 0.09811655

Or we can plot them:

ggplot(r2.compare, aes(x = model, y = r.squared))+
  geom_point(size = 4)+
  ggtitle("R^2 Comparison")

lm

lm diagnostics

There are a number of functions that are helpful in producing model diagnostics:

  • residuals(saved_model) extracts the residuals from a fitted model
  • coefficients(saved_model) extracts coefficients
  • fitted(saved_model) extracts fitted values
  • plot(saved_model) is a convenience function for producing a number of useful diagnostics plots

lm residual plot

We can easily plot the residuals from a fitted model against an explanatory variable of interest:

par(mfrow = c(1,2)) # Divide the plotting region into 1 row and 2 cols
plot(x = my_data$x, y = residuals(my_lm_multi), 
     xlab = "x", ylab = "residuals",
     pch = 19, # Filled circles
     col="grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0
plot(x = my_data$z, y = residuals(my_lm_multi), 
     xlab = "z", ylab = "residuals",
     pch = 19, # Filled circles
     col="grey", # Change colour
     cex = 0.5) # Make point size smaller
abline(h = 0) # Add a horizontal line with y-intercept of 0

lm residual plot

Non-Linear Regression Models

glm

To estimate a range of non-linear models, the glm function is particularly helpful.

First, let us transform our outcome variable from a continuous measure to a binary measure:

my_data$y_bin <- as.numeric(my_data$y > median(my_data$y))
table(my_data$y_bin)
## 
##   0   1 
## 500 500

glm

Now we can estimate our model:

my_logit <- glm(formula = y_bin ~ x + z, data = my_data, family = "binomial")

Where:

  • formula is the model specification
  • data is the data
  • family is a description of the error distribution and link function to be used
    • binomial, poisson, gaussian etc…

glm

summary(my_logit)
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77744  -1.13471  -0.03837   1.14638   1.85331  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.17079    0.12745  -1.340     0.18    
## x            0.40330    0.06793   5.937  2.9e-09 ***
## z            0.32624    0.22678   1.439     0.15    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.3  on 999  degrees of freedom
## Residual deviance: 1346.9  on 997  degrees of freedom
## AIC: 1352.9
## 
## Number of Fisher Scoring iterations: 4

glm

OK, but no-one actually thinks in terms of log-odds, so let's translate that into something more meaningful.

my_logit_OR <- exp(cbind(OR = coef(my_logit), confint(my_logit)))
  • coef extracts the coefficients
  • confint extracts the confidence intervals
  • cbind binds the vectors together as separate columns
  • exp exponentiates the log-odds ratios
round(my_logit_OR, digits = 4)
##                 OR  2.5 % 97.5 %
## (Intercept) 0.8430 0.6563 1.0819
## x           1.4968 1.3122 1.7129
## z           1.3858 0.8889 2.1636

glm

Almost all of the convenience functions that we used for lm are also applicable to glm models:

summary(my_logit)
plot(my_logit)
residuals(my_logit)
coefficients(my_logit)
fitted(my_logit)

Other models

There are a number of external packages that can make fitting other model types (relatively) straightforward:

  • lmer4 - Linear, generalised linear, and nonlinear mixed models
  • mcgv - generalised additive models
  • survival - survival analysis
  • glmnet - lasso and elastic net regression models
  • randomForest - random forest models from machine learning
  • rjags and rstan - Bayesian models

To use a package that is not a part of the base R installation, use:

install.packages("survival")
library(survival)

predict

We can retreive the fitted values from the model using fitted(), but we may be interested in calculating predicted values for arbitrary levels of our covariates.

sim_data <- data.frame(x = c(0, 1))
y_hat <- predict(object = my_lm, newdata = sim_data)
y_hat
##         1         2 
## 0.4177653 0.7223981

Here, I am creating a data.frame with two observations of one variable (x).

I am then using the predict function, where

  • object = my_lm tells R the model object for which prediction is desired
  • newdata = sim_data tells R the values for which I would like predictions

predict

We can use the same syntax to retreive predictions for (marginally) more interesting models:

sim_data <- data.frame(x = c(0, 0, 1, 1), z = c(0, 1, 0, 1))
sim_data
##   x z
## 1 0 0
## 2 0 1
## 3 1 0
## 4 1 1
y_hat <- predict(my_lm_multi, newdata = sim_data)
y_hat
##         1         2         3         4 
## 0.3027153 0.5397412 0.6072847 0.8443107

predict

This can be especially useful when trying to visualise interactive models:

sim_data_z0 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 0)
sim_data_z1 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 1)
y_hat_z0 <- predict(my_lm_interact, newdata = sim_data_z0)
y_hat_z1 <- predict(my_lm_interact, newdata = sim_data_z1)
  • seq generates a regular sequences from one value to another value by given increments
  • I am creating two data.frames for prediction, in both cases varying the value of x, but first setting z to 0, and then setting z to 1

predict

# Create a plot of the data
plot(my_data$x, my_data$y, cex = 0.5, pch = 19, 
     col = "gray", bty = "n",
     xlab = "X", ylab = "Y", 
     main = "Fitted values for sim_data")

# Add a prediction line for z = 0
lines(x = sim_data_z0$x, y = y_hat_z0, lwd = 2)

# Add a prediction line for z = 1
lines(x = sim_data_z1$x, y = y_hat_z1, lwd = 2, col = "red")

# Add a legend
legend("bottomright", legend = c("z = 0", "z = 1"), col = c("black", "red"), lty = 1) 

predict

Can we just use margins?

Yes! Thomas Leeper has created a nice port of STATA's margins command:

library(margins)

cplot(my_lm_interact, x = "x", dx = "z", what = "effect")

cplot is equivalent to marginsplot in STATA.

The specification above tell R that we want to plot the marginal effect of z on y across the range of x.

Can we just use margins?

Thank you!

Any questions?

Other helpful functions

objects() & ls() # Which objects are currently loaded in my working environment?
rm() # Remove objects from my current environment
save() # Save R object(s) to disk
is.na() # Is this object a missing value? Or, which elements in this vector are missing?
rnorm() # Generate random numbers from a normal distribution
runif() # Generate random numbers from a uniform distribution

Other helpful packages

library(data.table)
library(dplyr)
library(plyr)
library(zoo)
library(reshape2)
library(shiny)