Exercise 1

Open RStudio and explore the program. Make sure you can identify the console, the script editor, the environment window, the packages window and the help window!

  1. In the script window, write code that creates a vector composed of 5 single-digit integers. To concatenate a series of values, use the c() function, with each argument separated by a comma. For example, a character vector of length three could be c("a", "b", "c"). Assign your integer vector using the assignment operator <- to an object with a name of your choosing.
my_vec <- c(1,2,3,4,5)
  2. Multiply your vector by 3, and assign the output to a new object. Print the values of your new object.
my_new_vec <- my_vec * 3
print(my_new_vec)
## [1]  3  6  9 12 15
  3. Add together the two objects that you have created so far, printing the result. Note that R operates on vectors element-wise.
print(my_new_vec + my_vec)
## [1]  4  8 12 16 20
  4. Create a logical vector of length five, again using the c() function. Make sure that you have a mix of TRUE and FALSE values in the vector. Use the logical vector to subset the numeric vector that you created in question 2 and print the result.
my_logical_vec <- c(T, T, T, F, F)
print(my_new_vec[my_logical_vec])
## [1] 3 6 9
  5. Subset to just the first two elements of the numeric vector that you created in question 2 and assign the result to the name my_short_vector.
my_short_vector <- my_new_vec[c(1,2)]
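
Equivalently, you could use the colon operator to build the index vector:

my_short_vector <- my_new_vec[1:2]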

Exercise 2

This exercise relates to the College data set, which comes from An Introduction to Statistical Learning by James et al. (2013). It contains a number of variables for 777 different universities and colleges in the US.

The variables are:

    • Private : public/private indicator
    • Apps : number of applications received
    • Accept : number of applicants accepted
    • Enroll : number of new students enrolled
    • Top10perc : new students from the top 10% of their high school class
    • Top25perc : new students from the top 25% of their high school class
    • F.Undergrad : number of full-time undergraduates
    • P.Undergrad : number of part-time undergraduates
    • Outstate : out-of-state tuition
    • Room.Board : room and board costs
    • Books : estimated book costs
    • Personal : estimated personal spending
    • PhD : percent of faculty with PhDs
    • Terminal : percent of faculty with a terminal degree
    • S.F.Ratio : student/faculty ratio
    • perc.alumni : percent of alumni who donate
    • Expend : instructional expenditure per student
    • Grad.Rate : graduation rate

You can either download the .csv file containing the data from the MY591 Moodle page, or read the data in directly from the website.

  1. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the working directory set to the correct location for the data. You can load this in R directly from the website, using:
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")

Or you can load it from a saved file, using:

college <- read.csv("path_to_my_file/College.csv")
  2. Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1] 
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[, -1] 
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
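
You can confirm that the university names are now stored as row names rather than as a column of data:

head(rownames(college))  # the first few row names: the university names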

  3. Use the str() function to look at the structure of the data. Which of the variables are numeric? Which are integer? Which are factors?
str(college)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : int  1660 2186 1428 417 193 587 353 1899 1038 582 ...
##  $ Accept     : int  1232 1924 1097 349 146 479 340 1720 839 498 ...
##  $ Enroll     : int  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : int  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : int  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: int  2885 2683 1036 510 249 678 416 1594 973 799 ...
##  $ P.Undergrad: int  537 1227 99 63 869 41 230 32 306 78 ...
##  $ Outstate   : int  7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
##  $ Room.Board : int  3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
##  $ Books      : int  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : int  2200 1500 1165 875 1500 675 1500 850 500 1800 ...
##  $ PhD        : int  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : int  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: int  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : int  7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
##  $ Grad.Rate  : int  60 56 54 59 15 55 63 73 80 52 ...
  4. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
  5. What are the mean and standard deviation of the Enroll and Top10perc variables?
mean(college$Enroll)
## [1] 779.973
mean(college$Top10perc)
## [1] 27.55856
sd(college$Enroll)
## [1] 929.1762
sd(college$Top10perc)
## [1] 17.64036
  6. Now remove the 15th through 85th observations. What are the mean and standard deviation of the Enroll and Top10perc variables in the subset of the data that remains?
college <- college[-c(15:85),]
mean(college$Enroll)
## [1] 784.8867
mean(college$Top10perc)
## [1] 27.51841
sd(college$Enroll)
## [1] 928.6599
sd(college$Top10perc)
## [1] 17.61432
  7. What is the range of the Books variable?
range(college$Books)
## [1]  110 2340
  8. Use the pairs() function to produce a scatterplot matrix of the first five columns or variables of the data. Recall that you can reference the first five columns of a matrix A using A[, 1:5].
pairs(college[,1:5])

  9. Use the plot() function to produce a scatterplot of S.F.Ratio versus Grad.Rate. Give the axes informative labels.
plot(college$S.F.Ratio, college$Grad.Rate, xlab = "Student/Faculty Ratio", ylab = "Graduation Rate")

  10. Compete with your neighbour to make the prettiest plot. You might want to look at ?plot and ?par for some ideas. If you are feeling very keen, try using ggplot2, but you will need to load the package first: library(ggplot2).
plot(college$S.F.Ratio, college$Grad.Rate, 
     xlab = "Student/Faculty Ratio", ylab = "Graduation Rate",
     main = "A really nice plot",
     pch = 19, col = "gray", bty = "n")

library(ggplot2)
ggplot(data = college, aes(x = S.F.Ratio, y = Grad.Rate, col = Private)) +
  geom_point() +
  xlab("Student/Faculty Ratio") +
  ylab("Graduation Rate") +
  ggtitle("A really nice plot") +
  facet_grid(~ Private)

Exercise 3

This exercise involves the Auto data set, available as Auto.csv from the MY591 website, or directly from http://www-bcf.usc.edu/~gareth/ISL/Auto.csv. Load this data into R. The data include characteristics for a number of different types of cars, including the following variables:

    • mpg : miles per gallon
    • cylinders : number of cylinders
    • displacement : engine displacement (cubic inches)
    • horsepower : engine horsepower
    • weight : vehicle weight (lbs)
    • acceleration : time to accelerate from 0 to 60 mph (seconds)
    • year : model year
    • origin : origin of the car (1 = American, 2 = European, 3 = Japanese)
    • name : vehicle name

Unfortunately, the people who made this data decided to code their missing values with a ?, which is an awful thing to do. When reading the data in from the .csv file, use na.strings = "?" to convert them to NAs. Then exclude all the NAs from the data.

auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", na.strings = "?")
auto <- na.omit(auto)
  1. Which of the predictors are quantitative, and which are qualitative?

Note: Sometimes when you load a dataset, a qualitative variable might have a numeric value. For instance, the origin variable is qualitative, but has integer values of 1, 2, 3. From mysterious sources (Googling), we know that this variable is coded 1 = usa; 2 = europe; 3 = japan. So we can convert it into a factor, using:

auto$originf <- factor(auto$origin, labels = c("usa", "europe", "japan"))
str(auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ originf     : Factor w/ 3 levels "usa","europe",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5] 33 127 331 337 355
##   .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year. Qualitative: name, origin, originf
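
A quick way to check this programmatically, rather than reading the str() output, is to ask for the class of each column:

sapply(auto, class)  # returns the class of every column in the data frame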

  2. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
qualitative_columns <- c(8, 9, 10)  # origin, name, originf
pairs(auto[, -qualitative_columns])

# Heavier weight correlates with lower mpg.
plot(auto$mpg, auto$weight)

# More cylinders, lower mpg.
plot(auto$mpg, auto$cylinders)

# Cars become more efficient over time.
plot(auto$mpg, auto$year)

# Let's plot mpg by region of origin
plot(auto$originf, auto$mpg, ylab = "mpg")

  3. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors (make sure you use the factor version of the origin variable that we created in question 1). Use the summary() function to print the results. Comment on the output.
lm.fit1 <-  lm(mpg ~ cylinders + displacement + horsepower + weight +
                 acceleration + year + originf, data=auto)
summary(lm.fit1)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + originf, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders     -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement   2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower    -1.818e-02  1.371e-02  -1.326 0.185488    
## weight        -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration   7.910e-02  9.822e-02   0.805 0.421101    
## year           7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originfeurope  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originfjapan   2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?

Yes. We can test the null hypothesis that all of the regression coefficients are zero using the F-statistic. Here the F-statistic is far from 1 (with a very small p-value), which is strong evidence against the null hypothesis: at least some of the predictors are related to the response.
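
If you want to check this directly, one way (a small sketch using the lm.fit1 object from above) is to extract the F-statistic from the model summary and compute its p-value by hand:

fstat <- summary(lm.fit1)$fstatistic  # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)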

ii. Which predictors appear to have a statistically significant relationship to the response?

Looking at the p-values associated with each predictor’s t-statistic, we see that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.
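
You can also pull these p-values straight out of the summary object:

summary(lm.fit1)$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient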

iii. What does the coefficient for the year variable suggest?

The regression coefficient for year, 0.7770269, suggests that, holding the other predictors fixed, average mpg increases by about 0.78 for each additional model year. In other words, cars become more fuel efficient over time, by almost 1 mpg per year.
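
You can extract the coefficient directly rather than reading it off the printed table:

coef(lm.fit1)["year"]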

iv. What is the R-squared for this model? Access the r-squared value from the summary of the model object, and save it as its own object.
fit1r2 <- summary(lm.fit1)$r.squared
  4. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm.fit1)

There seems to be a non-linear pattern in the residuals, suggesting that a linear model is not the best fit. From the leverage plot, point 14 appears to have high leverage, although it does not have a large residual.
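
If you would rather identify high-leverage points numerically than by eye, hatvalues() returns the leverage of each observation:

lev <- hatvalues(lm.fit1)  # leverage of each observation
which.max(lev)             # index of the highest-leverage point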

  5. I would like to know which of the following cars has the higher miles-per-gallon ratio:
    • A 6-cylinder European car which weighs 4000 lbs and accelerates from 0-60 in 12 seconds
    • A 4-cylinder Japanese car which weighs 2000 lbs and accelerates from 0-60 in 20 seconds

Use the predict() function to give a numerical prediction for each of these car types. Base your estimates on the following model:
lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)
lm.fit3 <-  lm(mpg ~ cylinders + weight + acceleration + originf, data=auto)

sim_example_one <- data.frame(cylinders = 6,
                                  acceleration = 12,
                                  weight = 4000,
                                  originf = "europe"
                                  )

sim_example_two <- data.frame(cylinders = 4,
                                  acceleration = 20,
                                  weight = 2000,
                                  originf = "japan"
                                  )

predict(lm.fit3, sim_example_one)
##        1 
## 16.25391
predict(lm.fit3, sim_example_two)
##        1 
## 32.76226
  6. Use the predict() function to further explore the fitted values from your model. You may want to use summary() again to work out some reasonable values of your covariates such that you are not extrapolating wildly away from the data. You will need to specify values for each of the variables that you included in your model. For example:
sim_data_low_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1), 
                                  weight = 2000, 
                                  cylinders = mean(auto$cylinders),
                                  horsepower = mean(auto$horsepower),
                                  acceleration = mean(auto$acceleration),
                                  year = mean(auto$year),
                                  originf = "europe"
                                  )
range(auto$displacement)
## [1]  68 455
range(auto$weight)
## [1] 1613 5140
sim_data_high_weight <- data.frame(displacement = seq(from = 68, to = 455, by = 1), 
                                  weight = 5000, 
                                  cylinders = mean(auto$cylinders),
                                  horsepower = mean(auto$horsepower),
                                  acceleration = mean(auto$acceleration),
                                  year = mean(auto$year),
                                  originf = "europe"
                                  )

y_hat_low_weight <- predict(lm.fit1, newdata = sim_data_low_weight)
y_hat_high_weight <- predict(lm.fit1, newdata = sim_data_high_weight)

# Create a plot of the data
plot(auto$displacement, auto$mpg, cex = 0.5, pch = 19, 
     col = "gray", bty = "n",
     main = "Fitted values for displacement and weight")

# Add a prediction line for weight = 2000
lines(x = sim_data_low_weight$displacement, y = y_hat_low_weight, lwd = 2)

# Add a prediction line for weight = 5000
lines(x = sim_data_high_weight$displacement, y = y_hat_high_weight, lwd = 2, col = "red")

# Add a legend
legend("topright", legend = c("weight = 2000", "weight = 5000"), col = c("black", "red"), lty = 1) 

Interestingly, the fitted values reveal that while the relationship between displacement and mpg in the raw data is negative, once we condition on the other variables in the model, the coefficient is positive.
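
You can confirm that the raw bivariate relationship is negative with a simple correlation:

cor(auto$displacement, auto$mpg)  # negative in the raw data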

  7. Use the * operator to fit linear regression models with interaction effects (you can choose which variables to interact - you almost certainly know more about the determinants of car efficiency than I do). Do any interactions appear to be statistically significant?
lm.fit2 <-  lm(mpg ~ cylinders * displacement + displacement * weight, data=auto)
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16

The interaction between displacement and weight is statistically significant, while the interaction between cylinders and displacement is not.

  8. Again, use the predict() function to create fitted values from the interaction model. Use the fitted values to illustrate the conditional effect of one of your variables of interest.
# We can re-use the simulated data that we created last time and just project the interaction model onto this!

y_hat_low_weight <- predict(lm.fit2, newdata = sim_data_low_weight)
y_hat_high_weight <- predict(lm.fit2, newdata = sim_data_high_weight)

# Create a plot of the data
plot(auto$displacement, auto$mpg, cex = 0.5, pch = 19, 
     col = "gray", bty = "n",
     main = "Conditional effect of displacement on mpg")

# Add a prediction line for weight = 2000
lines(x = sim_data_low_weight$displacement, y = y_hat_low_weight, lwd = 2)

# Add a prediction line for weight = 5000
lines(x = sim_data_high_weight$displacement, y = y_hat_high_weight, lwd = 2, col = "red")

# Add a legend
legend("topright", legend = c("weight = 2000", "weight = 5000"), col = c("black", "red"), lty = 1) 

  9. How does the R-squared from this model compare to that of the model without interactions? Again, access the R-squared value from the summary of the model object and print both R-squared values.
fit2r2 <- summary(lm.fit2)$r.squared
print(fit1r2)
## [1] 0.8241995
print(fit2r2)
## [1] 0.7272237

The interaction model has a lower R-squared (0.73 vs. 0.82), although it also uses a different, smaller set of predictors than lm.fit1, so this is not a like-for-like comparison.

Exercise 4 - Logistic regression

  1. Using the median() function, create a new variable which is equal to 1 (or TRUE) when mpg is above the median value, and 0 (or FALSE) otherwise.
auto$mpg_bin <- as.numeric(auto$mpg > median(auto$mpg))
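
A quick check that the split is roughly 50/50:

table(auto$mpg_bin)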
  2. Make use of this new variable to estimate a logistic regression with cylinders, weight, acceleration and origin as predictors of the binary mpg variable. The glm() function will be useful, and you should set the family argument equal to "binomial".
glm.fit1 <- glm(mpg_bin ~ cylinders  + weight +
                 acceleration + originf, data=auto, family = "binomial")
  3. Print a summary of the fitted model object and interpret two of the coefficients.
summary(glm.fit1)
## 
## Call:
## glm(formula = mpg_bin ~ cylinders + weight + acceleration + originf, 
##     family = "binomial", data = auto)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.50956  -0.25737   0.04781   0.39649   3.03403  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.7921523  1.6940903   5.190  2.1e-07 ***
## cylinders     -0.4137897  0.2395222  -1.728   0.0841 .  
## weight        -0.0033442  0.0006335  -5.279  1.3e-07 ***
## acceleration   0.1705619  0.0846498   2.015   0.0439 *  
## originfeurope  0.2036643  0.4701660   0.433   0.6649    
## originfjapan   0.3487730  0.5090198   0.685   0.4932    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 543.43  on 391  degrees of freedom
## Residual deviance: 214.43  on 386  degrees of freedom
## AIC: 226.43
## 
## Number of Fisher Scoring iterations: 7
  4. If you struggled with question 3, that is because log-odds ratios are completely non-intuitive. Let’s make some progress by converting the coefficients (coef()) and their associated confidence intervals (confint()) into odds ratios (exp()). Create a data.frame of these values and print it.
odds.ratios <- exp(cbind(OR = coef(glm.fit1), confint(glm.fit1)))
print(odds.ratios)
##                         OR       2.5 %       97.5 %
## (Intercept)   6582.3844043 268.5175349 2.114970e+05
## cylinders        0.6611400   0.4085758 1.048035e+00
## weight           0.9966614   0.9953509 9.978397e-01
## acceleration     1.1859710   1.0085434 1.406990e+00
## originfeurope    1.2258865   0.4905699 3.134691e+00
## originfjapan     1.4173275   0.5288169 3.958253e+00
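
If the odds ratios still feel abstract, another option (not required by the exercise; the covariate values below are purely illustrative) is to compare predicted probabilities for two hypothetical cars that differ only in weight:

# Two hypothetical cars, identical except for weight; other predictors
# held at round or typical in-sample values
sim <- data.frame(cylinders = 4,
                  weight = c(2000, 4000),
                  acceleration = mean(auto$acceleration),
                  originf = "usa")
predict(glm.fit1, newdata = sim, type = "response")  # P(above-median mpg)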