Exercise 1: Linear Regression
Part 1
The data at the URL “http://www-bcf.usc.edu/~gareth/ISL/Income1.csv” contain observations of income levels (in tens of thousands of dollars) and years of education. The data are not real; they were simulated.
a. Read the data into R and generate a ggplot with a fitted line.
b. Split the data into a training set and a test set. Note that we do not form a validation set here, because we have very few observations and will consider only a single model.
c. Fit a linear model with education (years of education) as the input variable and income as the response variable. What are the model coefficients, and how can you extract them? Inspect the model fit using summary().
d. Compute the fitted values of income for the observations in the training set.
e. Predict the income for new observations: people with 16.00, 12.52, 15.55, 21.09, and 18.36 years of education. Then make predictions for the test set as well and evaluate the root mean squared error (RMSE) on the test set.
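One possible way to work through the steps above is sketched below. It is a sketch under stated assumptions, not a reference solution: it assumes the CSV has columns named Education and Income, and, so that it runs without a network connection, it uses a small simulated stand-in data frame (hypothetical values) in place of the downloaded file. The 80/20 split and the seed are arbitrary choices.

```r
# Stand-in for: income <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Income1.csv")
# ASSUMPTION: the file has columns `Education` and `Income`.
# The values below are simulated for illustration only.
set.seed(1)
n <- 30
income <- data.frame(Education = runif(n, 10, 22))
income$Income <- -40 + 5.5 * income$Education + rnorm(n, sd = 5)

# Scatter plot with a fitted least-squares line (built only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(income, ggplot2::aes(x = Education, y = Income)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "lm", formula = y ~ x)
  # print(p) displays the plot
}

# Random 80/20 split into training and test sets
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- income[train_idx, ]
test  <- income[-train_idx, ]

# Fit the linear model and extract its coefficients
fit <- lm(Income ~ Education, data = train)
coef(fit)      # intercept and slope
summary(fit)   # standard errors, R-squared, etc.

# Fitted values for the training observations
train_fitted <- fitted(fit)

# Predictions for the new observations listed in the exercise
new_obs  <- data.frame(Education = c(16.00, 12.52, 15.55, 21.09, 18.36))
new_pred <- predict(fit, newdata = new_obs)

# Predictions and root mean squared error on the test set
test_pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Income - test_pred)^2))
rmse
```

With the real data, replacing the simulated data frame by the read.csv() call should be the only change needed, provided the column names match the assumption.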
Part 2
a. Now download the data from “http://www-bcf.usc.edu/~gareth/ISL/Income2.csv”, which include the same observations but also record “seniority”. Again, split the data into training and test sets.
b. Fit a new model that also includes the new variable (seniority) and print the model summary.
c. Compute the predicted values of income for the observations in the training set.
d. Predict the income levels for new observations with years of education equal to 16.00, 12.52, 15.55, 21.09, 18.36 and seniority equal to 123.74, 83.63, 90.94, 178.96, 125.17, respectively.
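A minimal sketch of the two-predictor version follows. As an assumption, the columns are taken to be named Education, Seniority, and Income, and a simulated stand-in data frame (hypothetical values) replaces the downloaded file so the code runs offline. The key point is that the data frame passed as newdata to predict() must contain one column per predictor, with the same names used in the model formula.

```r
# Stand-in for: income2 <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Income2.csv")
# ASSUMPTION: the file has columns `Education`, `Seniority`, and `Income`.
# The values below are simulated for illustration only.
set.seed(2)
n <- 30
income2 <- data.frame(
  Education = runif(n, 10, 22),
  Seniority = runif(n, 20, 190)
)
income2$Income <- -50 + 6 * income2$Education +
  0.17 * income2$Seniority + rnorm(n, sd = 5)

# Random 80/20 split into training and test sets
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train2 <- income2[train_idx, ]
test2  <- income2[-train_idx, ]

# Linear model with both predictors
fit2 <- lm(Income ~ Education + Seniority, data = train2)
summary(fit2)

# Predicted values for the training observations
# (predict() without newdata returns the fitted values)
train2_pred <- predict(fit2)

# New observations: one row per person, one column per predictor
new_obs2 <- data.frame(
  Education = c(16.00, 12.52, 15.55, 21.09, 18.36),
  Seniority = c(123.74, 83.63, 90.94, 178.96, 125.17)
)
new_pred2 <- predict(fit2, newdata = new_obs2)
new_pred2
```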
Exercise 2
In this exercise you will perform Lasso regression yourself. We will use the Boston dataset from the MASS package. The dataset contains information on the Boston suburbs housing market collected by David Harrison in 1978.
We will try to predict the median value of homes in each region (medv) from the attributes recorded in the other variables.
First install the package (if needed) and load it:
# install.packages("MASS")
library(MASS)
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
head(Boston, 3)
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
a. Split the data into training and test subsets.
b. Perform a Lasso regression with glmnet. Steps:
- Extract the input and output data from the Boston data.frame and convert them, if necessary, to the correct format.
- Use cross-validation to select the value for \(\lambda\).
- Inspect the computed coefficients for lambda.min and lambda.1se.
- Compute the predictions for the test dataset for the two choices of the tuning parameter, lambda.min and lambda.1se. Evaluate the MSE for each.
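The steps above could be sketched as follows. This is one possible approach rather than a reference solution: it assumes the glmnet package is installed (install.packages("glmnet")), and the 80/20 split and the seed are arbitrary choices. Because cross-validation folds are assigned at random, the selected values of lambda depend on the seed.

```r
library(MASS)     # provides the Boston data
library(glmnet)   # provides cv.glmnet()

set.seed(123)

# Split into training and test subsets (the 80/20 proportion is an assumption)
n <- nrow(Boston)
train_idx <- sample(seq_len(n), size = round(0.8 * n))

# glmnet expects a numeric matrix of inputs and a numeric response vector,
# so the data.frame columns are converted with as.matrix()
x <- as.matrix(Boston[, setdiff(names(Boston), "medv")])
y <- Boston$medv
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

# Cross-validation over a grid of lambda values; alpha = 1 is the Lasso penalty
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1)
cv_fit$lambda.min   # lambda with the smallest cross-validated error
cv_fit$lambda.1se   # largest lambda within one standard error of that minimum

# Coefficients at the two choices of lambda (a "." marks a coefficient set to zero)
coef(cv_fit, s = "lambda.min")
coef(cv_fit, s = "lambda.1se")

# Test-set predictions and mean squared error for each choice
pred_min <- predict(cv_fit, newx = x_test, s = "lambda.min")
pred_1se <- predict(cv_fit, newx = x_test, s = "lambda.1se")
mse_min <- mean((y_test - pred_min)^2)
mse_1se <- mean((y_test - pred_1se)^2)
c(lambda.min = mse_min, lambda.1se = mse_1se)
```

Since lambda.1se is never smaller than lambda.min, it applies at least as much shrinkage and typically yields a sparser model; which of the two gives the lower test MSE depends on the particular split.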