07/06/2017
You can find the materials for today's course on my website. Either go to the Moodle page for MY591 and click on the links, or:
It is worth noting that both the assignment and the slides were all created using R and RStudio. So was my website.
Beat that, STATA…
1 + 1
## [1] 2
5 - 3
## [1] 2
6/2
## [1] 3
4*4
## [1] 16
R also has many built-in mathematical functions, for example:

log(<number>)
exp(<number>)
sqrt(<number>)
mean(<numbers>)
sum(<numbers>)
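For example, each of these can be tried directly in the console (outputs shown in the slides' ## convention):

```r
log(exp(1))       # natural logarithm of e
## [1] 1
sqrt(16)
## [1] 4
mean(c(2, 4, 6))
## [1] 4
sum(1:10)
## [1] 55
```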
R also understands logical operators:

<  : less than
>  : greater than
== : equal to (note, not =)
>= : greater than or equal to
<= : less than or equal to
!= : not equal to
&  : and
|  : or

objects
We create objects using the assignment operator <- :
my_object <- 10
print(my_object)
## [1] 10
my_other_object <- 4
print(my_object - my_other_object)
## [1] 6
vector

A vector is the basic building block of R: a collection of entities which all share the same type.
We use the c()
function to concatenate observations into a single vector.
num_vec <- c(5, 4, 2, 100, 7.65)
print(num_vec)
## [1] 5.00 4.00 2.00 100.00 7.65
char_vec <- c("apple", "pear", "plumb", "pineapple", "strawberry")
print(char_vec)
## [1] "apple" "pear" "plumb" "pineapple" "strawberry"
A logical vector contains TRUE/FALSE values:

log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
print(log_vec)
## [1] FALSE FALSE FALSE TRUE TRUE
as.numeric(log_vec)
## [1] 0 0 0 1 1
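Because TRUE converts to 1 and FALSE to 0, sum() counts the TRUE values and mean() gives their proportion — a handy idiom:

```r
log_vec <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
sum(log_vec)   # number of TRUE values
## [1] 2
mean(log_vec)  # proportion of TRUE values
## [1] 0.4
```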
A factor is similar to a character vector, but here each unique element is associated with a numerical value which represents a category:
fac_vec <- as.factor(c("a", "b", "c", "a", "b", "c"))
fac_vec
## [1] a b c a b c
## Levels: a b c
as.numeric(fac_vec)
## [1] 1 2 3 1 2 3
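To recover the category labels, use levels(); to convert a factor back to plain text, use as.character():

```r
fac_vec <- as.factor(c("a", "b", "c", "a", "b", "c"))
levels(fac_vec)        # the unique categories
## [1] "a" "b" "c"
as.character(fac_vec)  # back to a character vector
## [1] "a" "b" "c" "a" "b" "c"
```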
vector subsetting

To subset a vector, use square brackets to index the elements you would like, via object[index]:
char_vec
## [1] "apple" "pear" "plumb" "pineapple" "strawberry"
char_vec[1:3]
## [1] "apple" "pear" "plumb"
We can also subset using a logical vector:

log_vec
## [1] FALSE FALSE FALSE TRUE TRUE
char_vec[log_vec]
## [1] "pineapple" "strawberry"
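In practice, the logical index often comes from a comparison on the vector itself:

```r
num_vec <- c(5, 4, 2, 100, 7.65)
num_vec[num_vec > 4]  # keep only elements greater than 4
## [1]   5.00 100.00   7.65
```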
vector operations

In R, mathematical operations on vectors occur elementwise:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1] 1 1 2 3 5 8 13
fib[2:8]
## [1] 1 2 3 5 8 13 21
fib[1:7] + fib[2:8]
## [1] 2 3 5 8 13 21 34
It is also possible to perform logical operations on vectors:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib_greater_five <- fib > 5
print(fib_greater_five)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
And we can combine logical operators:
fib_greater_five_less_two <- fib > 5 | fib < 2
print(fib_greater_five_less_two)
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
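The resulting logical vector can then be used to subset the original vector:

```r
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[fib > 5 | fib < 2]  # elements greater than 5 or less than 2
## [1]  1  1  8 13 21
```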
matrix

A matrix is just a collection of vectors! That is, it is a multidimensional vector in which all elements are of the same type.
my_matrix <- matrix(data = 1:100, nrow = 10, ncol = 10)
my_matrix
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100
data.frame

A data.frame is a matrix-like R object in which the columns can be of different types.
my_data_frame <- data.frame(numbers = num_vec, fruits = char_vec, logical = log_vec)
my_data_frame
##   numbers     fruits logical
## 1    5.00      apple   FALSE
## 2    4.00       pear   FALSE
## 3    2.00      plumb   FALSE
## 4  100.00  pineapple    TRUE
## 5    7.65 strawberry    TRUE
matrix and data.frame subsetting

To subset a matrix or data.frame, you need to specify both rows and columns:
my_matrix[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1   11   21
## [2,]    2   12   22
## [3,]    3   13   23
my_data_frame[1, ]
##   numbers fruits logical
## 1       5  apple   FALSE
We can also subset to remove rows or columns that we do not want to see, by using the - operator applied to the c function:
my_matrix[-c(1:3), -c(1:3)]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   34   44   54   64   74   84   94
## [2,]   35   45   55   65   75   85   95
## [3,]   36   46   56   66   76   86   96
## [4,]   37   47   57   67   77   87   97
## [5,]   38   48   58   68   78   88   98
## [6,]   39   49   59   69   79   89   99
## [7,]   40   50   60   70   80   90  100
In the code, 1:3 creates a vector of the integers 1, 2 and 3, and the - operator negates these. Wrapping the vector in the c function (or in parentheses) ensures that - is applied to the whole sequence: because of operator precedence, -1:3 would be read as (-1):3 rather than -(1:3).
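A small sketch of that precedence point:

```r
x <- c(10, 20, 30, 40)
x[-c(1, 3)]  # drop the 1st and 3rd elements
## [1] 20 40
-1:3         # parsed as (-1):3, not -(1:3)
## [1] -1  0  1  2  3
-(1:3)       # parentheses also work
## [1] -1 -2 -3
```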
list

A list is a collection of any set of object types.
my_list <- list(something = num_vec, another_thing = my_matrix[1:3, 1:3], something_else = "jack")
my_list
## $something
## [1]   5.00   4.00   2.00 100.00   7.65
##
## $another_thing
##      [,1] [,2] [,3]
## [1,]    1   11   21
## [2,]    2   12   22
## [3,]    3   13   23
##
## $something_else
## [1] "jack"
functions
R makes extensive use of functions, which all have the same basic structure.
function_name(argument_one, argument_two, ...)
Where:

function_name is the name of the function
argument_one is the first argument passed to the function
argument_two is the second argument passed to the function

function example

Let's consider the mean() function. This function takes two main arguments:
mean(x, na.rm = FALSE)
Where x is a numeric vector, and na.rm is a logical value that indicates whether we'd like to remove missing values (NA).
vec <- c(1, 2, 3, 4, 5)
mean(x = vec, na.rm = FALSE)
## [1] 3
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = TRUE)
## [1] 2.75
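Note that with the default na.rm = FALSE, any missing value makes the whole result NA:

```r
vec <- c(1, 2, 3, NA, 5)
mean(x = vec, na.rm = FALSE)  # the default behaviour
## [1] NA
```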
We can also perform calculations on the output of a function:
mean(vec) * 3
## [1] 9
Which means that we can also have nested functions:
sqrt(mean(vec))
## [1] 1.732051
We can also assign the output of any function to a new object for use later:
sqrt_mean_of_vec <- sqrt(mean(vec))
functions
Functions are also objects, and we can create our own. We define a function as follows:
my_addition_function <- function(a, b){
  return(a + b)
}
my_addition_result <- my_addition_function(a = 5, b = 50)
print(my_addition_result)
## [1] 55
Note that the function itself, and the result returned by the function are both objects!
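Function arguments can also have default values. This small made-up example (the function name and default are illustrative, not from the course materials) shows the pattern:

```r
# 'b' has a default value of 10, so it can be omitted when calling
add_with_default <- function(a, b = 10){
  return(a + b)
}
add_with_default(a = 5)         # uses the default b = 10
## [1] 15
add_with_default(a = 5, b = 1)  # overrides the default
## [1] 6
```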
R has an inbuilt help facility which provides more information about any function:
help(summary)
?summary
The quality of the help varies hugely as some packages are well-maintained and others less so.
Alternative sources of help:
www.stackoverflow.com
www.google.com
my_data <- read.csv(file = "my_file.csv")
my_data is an R data.frame object (you could call this anything)
my_file.csv is a .csv file with your data
<- is the assignment operator
If you only provide the name of my_file.csv, it will need to be saved in your current working directory
Use getwd() to check your current working directory
Use setwd() to change your current working directory

install.packages("foreign") ## Install the "foreign" package
library(foreign)            ## Load an additional package that deals with STATA files
my_data <- read.dta(file = "my_file.dta")

The foreign package is not automatically installed in R, so we need to install it ourselves
my_data is still an R data.frame object (you could still call this anything)
my_file.dta is a .dta file with your data
<- is still the assignment operator

R includes a number of helpful functions for exploring your data:
View(my_data)                  # View data in spreadsheet-style
names(my_data)                 # Names of the variables in your data
head(my_data)                  # First six rows of the data
tail(my_data)                  # Last six rows of the data
str(my_data)                   # "Structure" of any R object
summary(my_data)               # Summary statistics for each of the columns in the data
dim(my_data)                   # Dimensions of a matrix
range(my_data$my_variable)     # Range of a numeric vector
quantile(my_data$my_variable)  # Quantiles of a numeric vector
head
head(my_data)
##            x         y         z g
## 1 -0.6254612 1.6362552 0.3989290 d
## 2 -1.3203506 0.5363096 0.1101269 e
## 3  1.7908249 0.2674716 0.5642690 f
## 4 -0.3554073 1.5243062 0.2920953 e
## 5  1.7603602 0.9080044 0.2631187 d
## 6 -0.2914242 0.8280291 0.1351559 b
str
str(my_data)
## 'data.frame': 1000 obs. of 4 variables:
##  $ x: num -0.625 -1.32 1.791 -0.355 1.76 ...
##  $ y: num 1.636 0.536 0.267 1.524 0.908 ...
##  $ z: num 0.399 0.11 0.564 0.292 0.263 ...
##  $ g: Factor w/ 6 levels "a","b","c","d",..: 4 5 6 5 4 2 1 6 5 1 ...
summary
summary(my_data)
##        x                  y                 z              g
##  Min.   :-3.41936   Min.   :-3.0830   Min.   :0.0008453   a:170
##  1st Qu.:-0.64878   1st Qu.:-0.2858   1st Qu.:0.2352761   b:160
##  Median : 0.05486   Median : 0.4243   Median :0.4825389   c:160
##  Mean   : 0.03154   Mean   : 0.4274   Mean   :0.4853983   d:171
##  3rd Qu.: 0.73181   3rd Qu.: 1.1314   3rd Qu.:0.7299549   e:156
##  Max.   : 3.74557   Max.   : 3.7938   Max.   :0.9987806   f:183
pairs
pairs(my_data)
range
Suppose you would like to find the range of a variable in your data.frame. The following will not work:
range(my_data)
## Error in FUN(X[[i]], ...): only defined on a data frame with all numeric variables
Right, we cannot take the range of the entire data! Some of the variables are not numeric, and therefore do not have a range…
Instead, we need to access one variable. There are two ways to do this. Either index the column by position:

range(my_data[,1])
## [1] -3.419364 3.745572
Or use the $ operator and access the variable of interest by name:

range(my_data$x)
## [1] -3.419364 3.745572
plot

The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot:
plot(my_data$x, my_data$y)
Hmm, let's work on that.
The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:
plot(x = my_data$x,
     y = my_data$y,
     xlab = "X variable",
     ylab = "Y variable",
     main = "Awesome plot title",
     pch = 19,                     # Solid points
     cex = 0.5,                    # Smaller points
     bty = "n",                    # Remove surrounding box
     col = as.factor(my_data$g))   # Colour by grouping variable
The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:
plot(x = my_data$g, y = my_data$x)
ggplot

Also popular is the ggplot2 library. This is a separate package (i.e. it is not a part of the base R environment) but is very widely used. But not by me.
Let's recreate the previous scatter plot using ggplot:
library(ggplot2)
ggplot(data = my_data, aes(x = x, y = y, col = g)) +
  geom_point() +
  xlab("X variable") +
  ylab("Y variable") +
  ggtitle("ggplot title") +
  theme(panel.background = element_rect("white"))
One nice feature of ggplot is that it is very easy to create facet plots:
library(ggplot2)
ggplot(data = my_data, aes(x = x, y = y, col = g)) +
  geom_point() +
  xlab("X variable") +
  ylab("Y variable") +
  ggtitle("ggplot facet title") +
  theme(panel.background = element_rect("white")) +
  facet_wrap(~ g)
lm

Linear regression models in R are implemented using the lm function.
my_lm <- lm(formula = y ~ x, data = my_data)
The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.
print(my_lm)
##
## Call:
## lm(formula = y ~ x, data = my_data)
##
## Coefficients:
## (Intercept)            x
##      0.4178       0.3046
We can specify multivariate models:
my_lm_multi <- lm(formula = y ~ x + z, data = my_data)
Interaction models:
my_lm_interact <- lm(formula = y ~ x * z, data = my_data)
Fixed-effect models:
my_lm_fe <- lm(formula = y ~ x + g, data = my_data)
And many more!
The output of the lm function is a list containing a great deal of useful information.

When we call print(saved_model), we are presented with the estimated coefficients, and nothing else.

For some more information about the estimated model, use summary(saved_model):
my_lm_summary <- summary(my_lm)
print(my_lm_summary)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6246 -0.6960 -0.0202  0.7016  3.2367
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.41777    0.03209  13.020   <2e-16 ***
## x            0.30463    0.03238   9.407   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared: 0.08145, Adjusted R-squared: 0.08053
## F-statistic: 88.5 on 1 and 998 DF, p-value: < 2.2e-16
As with any other function, summary(saved_model) returns an object. Here, it is a list. What is saved as the output of this function?
names(my_lm_summary)
##  [1] "call"          "terms"         "residuals"     "coefficients"
##  [5] "aliased"       "sigma"         "df"            "r.squared"
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
If we want to extract other information of interest from the fitted model object, we can use the $ operator to do so:
print(my_lm_summary$r.squared)
## [1] 0.08145418
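The coefficients element of a model summary is a matrix, so the subsetting rules from earlier apply to it. A sketch on simulated data (the variable names mirror the slides, but the data here are made up, so the numbers will differ):

```r
# Simulate data analogous to my_data and fit the same bivariate model
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- 0.4 + 0.3 * sim$x + rnorm(100)
my_lm <- lm(y ~ x, data = sim)

coef_table <- summary(my_lm)$coefficients
coef_table[, "Std. Error"]   # standard errors for each term
coef_table["x", "Estimate"]  # the slope estimate for x
```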
lm
Accessing elements from saved models can be very helpful in making comparisons across models:
my_lm_r2 <- summary(my_lm)$r.squared
my_lm_multi_r2 <- summary(my_lm_multi)$r.squared
my_lm_interact_r2 <- summary(my_lm_interact)$r.squared
r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"),
  r.squared = c(my_lm_r2, my_lm_multi_r2, my_lm_interact_r2))
We can print the values:
print(r2.compare)
##          model  r.squared
## 1    bivariate 0.08145418
## 2 multivariate 0.08552710
## 3  interaction 0.09811655
Or we can plot them:
ggplot(r2.compare, aes(x = model, y = r.squared)) +
  geom_point(size = 4) +
  ggtitle("R^2 Comparison")
lm diagnostics

There are a number of functions that are helpful in producing model diagnostics:
residuals(saved_model) extracts the residuals from a fitted model
coefficients(saved_model) extracts coefficients
fitted(saved_model) extracts fitted values
plot(saved_model) is a convenience function for producing a number of useful diagnostic plots

lm residual plot

We can easily plot the residuals from a fitted model against an explanatory variable of interest:
par(mfrow = c(1, 2))  # Divide the plotting region into 1 row and 2 cols
plot(x = my_data$x,
     y = residuals(my_lm_multi),
     xlab = "x",
     ylab = "residuals",
     pch = 19,      # Filled circles
     col = "grey",  # Change colour
     cex = 0.5)     # Make point size smaller
abline(h = 0)       # Add a horizontal line with y-intercept of 0
plot(x = my_data$z,
     y = residuals(my_lm_multi),
     xlab = "z",
     ylab = "residuals",
     pch = 19,      # Filled circles
     col = "grey",  # Change colour
     cex = 0.5)     # Make point size smaller
abline(h = 0)       # Add a horizontal line with y-intercept of 0
glm

To estimate a range of non-linear models, the glm function is particularly helpful.
First, let us transform our outcome variable from a continuous measure to a binary measure:
my_data$y_bin <- as.numeric(my_data$y > median(my_data$y))
table(my_data$y_bin)
##
##   0   1
## 500 500
Now we can estimate our model:
my_logit <- glm(formula = y_bin ~ x + z, data = my_data, family = "binomial")
Where:
formula is the model specification
data is the data
family is a description of the error distribution and link function to be used (binomial, poisson, gaussian, etc.)
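The same pattern works for the other families. For instance, a sketch of a Poisson model for a count outcome (the data here are simulated purely for illustration):

```r
# Simulated count outcome, for illustration only
set.seed(2)
count_data <- data.frame(x = rnorm(50))
count_data$y <- rpois(50, lambda = exp(0.5 + 0.2 * count_data$x))

my_pois <- glm(formula = y ~ x, data = count_data, family = "poisson")
coef(my_pois)  # coefficients on the log scale
```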
summary(my_logit)
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -1.77744  -1.13471  -0.03837   1.14638   1.85331
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.17079    0.12745  -1.340     0.18
## x            0.40330    0.06793   5.937  2.9e-09 ***
## z            0.32624    0.22678   1.439     0.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1386.3 on 999 degrees of freedom
## Residual deviance: 1346.9 on 997 degrees of freedom
## AIC: 1352.9
##
## Number of Fisher Scoring iterations: 4
OK, but no-one actually thinks in terms of log-odds, so let's translate that into something more meaningful.
my_logit_OR <- exp(cbind(OR = coef(my_logit), confint(my_logit)))
coef extracts the coefficients
confint extracts the confidence intervals
cbind binds the vectors together as separate columns
exp exponentiates the log-odds ratios

round(my_logit_OR, digits = 4)
##                 OR  2.5 % 97.5 %
## (Intercept) 0.8430 0.6563 1.0819
## x           1.4968 1.3122 1.7129
## z           1.3858 0.8889 2.1636
Almost all of the convenience functions that we used for lm are also applicable to glm models:
summary(my_logit)
plot(my_logit)
residuals(my_logit)
coefficients(my_logit)
fitted(my_logit)
There are a number of external packages that can make fitting other model types (relatively) straightforward:
lme4 - linear, generalised linear, and nonlinear mixed models
mgcv - generalised additive models
survival - survival analysis
glmnet - lasso and elastic net regression models
randomForest - random forest models from machine learning
rjags and rstan - Bayesian models

To use a package that is not a part of the base R installation, use:
install.packages("survival")
library(survival)
predict
We can retrieve the fitted values from the model using fitted(), but we may be interested in calculating predicted values for arbitrary levels of our covariates.
sim_data <- data.frame(x = c(0, 1))
y_hat <- predict(object = my_lm, newdata = sim_data)
y_hat
##         1         2
## 0.4177653 0.7223981
Here, I am creating a data.frame with two observations of one variable (x).

I am then using the predict function, where:

object = my_lm tells R the model object for which prediction is desired
newdata = sim_data tells R the values for which I would like predictions

We can use the same syntax to retrieve predictions for (marginally) more interesting models:
sim_data <- data.frame(x = c(0, 0, 1, 1), z = c(0, 1, 0, 1))
sim_data
##   x z
## 1 0 0
## 2 0 1
## 3 1 0
## 4 1 1
y_hat <- predict(my_lm_multi, newdata = sim_data)
y_hat
##         1         2         3         4
## 0.3027153 0.5397412 0.6072847 0.8443107
This can be especially useful when trying to visualise interaction models:
sim_data_z0 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 0)
sim_data_z1 <- data.frame(x = seq(from = -2, to = 2, by = 0.01), z = 1)
y_hat_z0 <- predict(my_lm_interact, newdata = sim_data_z0)
y_hat_z1 <- predict(my_lm_interact, newdata = sim_data_z1)
seq generates a regular sequence from one value to another value by given increments

Here we create two data.frames for prediction, in both cases varying the value of x, but first setting z to 0, and then setting z to 1
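To see what seq is producing here, try it on its own with simpler values:

```r
seq(from = 0, to = 1, by = 0.25)  # a regular sequence from 0 to 1 in steps of 0.25
## [1] 0.00 0.25 0.50 0.75 1.00
```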
# Create a plot of the data
plot(my_data$x, my_data$y,
     cex = 0.5, pch = 19, col = "gray", bty = "n",
     xlab = "X", ylab = "Y", main = "Fitted values for sim_data")
# Add a prediction line for z = 0
lines(x = sim_data_z0$x, y = y_hat_z0, lwd = 2)
# Add a prediction line for z = 1
lines(x = sim_data_z1$x, y = y_hat_z1, lwd = 2, col = "red")
# Add a legend
legend("bottomright", legend = c("z = 0", "z = 1"),
       col = c("black", "red"), lty = 1)
margins?

Yes! Thomas Leeper has created a nice port of STATA's margins command:
library(margins)
cplot(my_lm_interact, x = "x", dx = "z", what = "effect")
cplot is equivalent to marginsplot in STATA.

The specification above tells R that we want to plot the marginal effect of z on y across the range of x.
Some other useful functions:

objects() & ls()  # Which objects are currently loaded in my working environment?
rm()              # Remove objects from my current environment
save()            # Save R object(s) to disk
is.na()           # Is this object a missing value? Or, which elements in this vector are missing?
rnorm()           # Generate random numbers from a normal distribution
runif()           # Generate random numbers from a uniform distribution
library(data.table)
library(dplyr)
library(plyr)
library(zoo)
library(reshape2)
library(shiny)