Decision Trees

#install.packages("tree")
require(ISLR)
Loading required package: ISLR
require(tree)
Loading required package: tree

We use the Carseats data from the ISLR package, with the “tree” package for fitting

dim(Carseats)
[1] 400  11

Sales is a quantitative variable; we turn it into a binary variable to make this a classification problem. The cutoff is Sales = 8: at most 8 becomes No (low), above 8 becomes Yes (high).

Carseats$High <- as.factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))

Fitting a tree-based model (removing Sales from the formula, because the binary response was created from it)

tree_carseats <- tree(High~.-Sales, data = Carseats)
summary(tree_carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"  "Advertising" "Age"         "US"         
Number of terminal nodes:  27 
Residual mean deviance:  0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400 

The summary shows the variables actually used in the splits (in no particular order), the number of terminal nodes, the residual mean deviance, and the misclassification error rate

Plotting the tree (we need to annotate the tree with text, otherwise it will just show the tree structure with no other information)

plot(tree_carseats)
text(tree_carseats,pretty=0)

There are a lot of splits in this tree, quite complex to look at! Since this is a classification problem, each terminal node contains either a Yes or a No, whichever is the majority in that region

Printing the tree gives full details at every node: the split rule, the number of observations, the deviance, the predicted class, and the class proportions. Quite tough to look at, really.

tree_carseats
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 400 541.500 No ( 0.59000 0.41000 )  
    2) ShelveLoc: Bad,Medium 315 390.600 No ( 0.68889 0.31111 )  
      4) Price < 92.5 46  56.530 Yes ( 0.30435 0.69565 )  
        8) Income < 57 10  12.220 No ( 0.70000 0.30000 )  
         16) CompPrice < 110.5 5   0.000 No ( 1.00000 0.00000 ) *
         17) CompPrice > 110.5 5   6.730 Yes ( 0.40000 0.60000 ) *
        9) Income > 57 36  35.470 Yes ( 0.19444 0.80556 )  
         18) Population < 207.5 16  21.170 Yes ( 0.37500 0.62500 ) *
         19) Population > 207.5 20   7.941 Yes ( 0.05000 0.95000 ) *
      5) Price > 92.5 269 299.800 No ( 0.75465 0.24535 )  
       10) Advertising < 13.5 224 213.200 No ( 0.81696 0.18304 )  
         20) CompPrice < 124.5 96  44.890 No ( 0.93750 0.06250 )  
           40) Price < 106.5 38  33.150 No ( 0.84211 0.15789 )  
             80) Population < 177 12  16.300 No ( 0.58333 0.41667 )  
              160) Income < 60.5 6   0.000 No ( 1.00000 0.00000 ) *
              161) Income > 60.5 6   5.407 Yes ( 0.16667 0.83333 ) *
             81) Population > 177 26   8.477 No ( 0.96154 0.03846 ) *
           41) Price > 106.5 58   0.000 No ( 1.00000 0.00000 ) *
         21) CompPrice > 124.5 128 150.200 No ( 0.72656 0.27344 )  
           42) Price < 122.5 51  70.680 Yes ( 0.49020 0.50980 )  
             84) ShelveLoc: Bad 11   6.702 No ( 0.90909 0.09091 ) *
             85) ShelveLoc: Medium 40  52.930 Yes ( 0.37500 0.62500 )  
              170) Price < 109.5 16   7.481 Yes ( 0.06250 0.93750 ) *
              171) Price > 109.5 24  32.600 No ( 0.58333 0.41667 )  
                342) Age < 49.5 13  16.050 Yes ( 0.30769 0.69231 ) *
                343) Age > 49.5 11   6.702 No ( 0.90909 0.09091 ) *
           43) Price > 122.5 77  55.540 No ( 0.88312 0.11688 )  
             86) CompPrice < 147.5 58  17.400 No ( 0.96552 0.03448 ) *
             87) CompPrice > 147.5 19  25.010 No ( 0.63158 0.36842 )  
              174) Price < 147 12  16.300 Yes ( 0.41667 0.58333 )  
                348) CompPrice < 152.5 7   5.742 Yes ( 0.14286 0.85714 ) *
                349) CompPrice > 152.5 5   5.004 No ( 0.80000 0.20000 ) *
              175) Price > 147 7   0.000 No ( 1.00000 0.00000 ) *
       11) Advertising > 13.5 45  61.830 Yes ( 0.44444 0.55556 )  
         22) Age < 54.5 25  25.020 Yes ( 0.20000 0.80000 )  
           44) CompPrice < 130.5 14  18.250 Yes ( 0.35714 0.64286 )  
             88) Income < 100 9  12.370 No ( 0.55556 0.44444 ) *
             89) Income > 100 5   0.000 Yes ( 0.00000 1.00000 ) *
           45) CompPrice > 130.5 11   0.000 Yes ( 0.00000 1.00000 ) *
         23) Age > 54.5 20  22.490 No ( 0.75000 0.25000 )  
           46) CompPrice < 122.5 10   0.000 No ( 1.00000 0.00000 ) *
           47) CompPrice > 122.5 10  13.860 No ( 0.50000 0.50000 )  
             94) Price < 125 5   0.000 Yes ( 0.00000 1.00000 ) *
             95) Price > 125 5   0.000 No ( 1.00000 0.00000 ) *
    3) ShelveLoc: Good 85  90.330 Yes ( 0.22353 0.77647 )  
      6) Price < 135 68  49.260 Yes ( 0.11765 0.88235 )  
       12) US: No 17  22.070 Yes ( 0.35294 0.64706 )  
         24) Price < 109 8   0.000 Yes ( 0.00000 1.00000 ) *
         25) Price > 109 9  11.460 No ( 0.66667 0.33333 ) *
       13) US: Yes 51  16.880 Yes ( 0.03922 0.96078 ) *
      7) Price > 135 17  22.070 No ( 0.64706 0.35294 )  
       14) Income < 46 6   0.000 No ( 1.00000 0.00000 ) *
       15) Income > 46 11  15.160 Yes ( 0.45455 0.54545 ) *

Creating a train-test split, building the tree on the training data and evaluating on the testing data

Total records = 400; Training dataset = 250, testing dataset = 150 observations

set.seed(1011)
index <- sample(1:nrow(Carseats), 250)
train_tree_carseats <- tree(High~.-Sales, data = Carseats, subset = index)
summary(train_tree_carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = Carseats, subset = index)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Age"         "CompPrice"   "Advertising" "Education"   "Income"      "US"         
Number of terminal nodes:  23 
Residual mean deviance:  0.3498 = 79.4 / 227 
Misclassification error rate: 0.088 = 22 / 250 

Plotting the above tree

plot(train_tree_carseats)
text(train_tree_carseats, pretty = 0)

The tree is similarly complex to the one built on the full data.

Predicting the results on the test dataset

test_tree_carseats_pred <- predict(train_tree_carseats, newdata = Carseats[-index,], type = "class")
table(test_tree_carseats_pred)
test_tree_carseats_pred
 No Yes 
 78  72 

Confusion matrix for test dataset

with(Carseats[-index,],table(test_tree_carseats_pred,High))
                       High
test_tree_carseats_pred No Yes
                    No  58  20
                    Yes 27  45

Misclassification rate from the above table

cat("Error Rate",(58+45)/150)
Error Rate 0.6866667
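
Equivalently, a one-liner that avoids hard-coding the table cells (a quick sketch using the objects above):

mean(test_tree_carseats_pred != Carseats$High[-index])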

Using CV to control the tree size. We use misclassification error as the pruning criterion.

cv_train_tree_carseats <- cv.tree(train_tree_carseats, FUN = prune.misclass)
cv_train_tree_carseats
$size
 [1] 23 17 16 14 10  8  6  5  4  2  1

$dev
 [1] 67 68 76 77 80 81 81 86 83 82 99

$k
 [1] -Inf  0.0  1.0  1.5  2.0  3.0  3.5  5.0  6.0  7.0 27.0

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"

The object contains the sizes of the tree as pruning proceeds, the corresponding CV deviance (here, the misclassification count), and k, the cost-complexity parameter

plot(cv_train_tree_carseats)

This plots the CV misclassification count across different tree sizes. The minimum occurs around sizes 17-23; we prune to a size of 13, trading a little CV error for a much simpler tree.

prune_carseats <- prune.misclass(train_tree_carseats, best = 13)
plot(prune_carseats)
text(prune_carseats, pretty = 0)

This tree is a little shallower than the previous ones, as a result of the cross-validated pruning.

Evaluating this on the testing dataset

cv_test_tree_carseats_pred <- predict(prune_carseats, Carseats[-index,], type = "class")
table(cv_test_tree_carseats_pred)
cv_test_tree_carseats_pred
 No Yes 
 78  72 

Calculating the misclassification rate again

table(cv_test_tree_carseats_pred, Carseats$High[-index])
                          
cv_test_tree_carseats_pred No Yes
                       No  59  19
                       Yes 26  46
cat("Error Rate", (59+46)/150)
Error Rate 0.7

We get similar performance with respect to the misclassification rate, but the tree is shallower.


More complicated tree-based models, using the Boston housing data (housing values and other statistics for 506 suburbs of Boston, based on the 1970 census)

Random Forests

require(MASS)
Loading required package: MASS
require(randomForest)
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Random forests build a lot of trees; at each split only a random subset of the predictors is considered as candidates, and the trees are averaged to reduce variance.

Creating a train-test split with 300 training and 206 test observations

set.seed(101)
data(Boston)
index <- sample(1:nrow(Boston), 300)

We use random forest for regression, with ‘medv’ as the response variable and every other variable as the predictor.

rf_boston <- randomForest(medv~., data = Boston, subset = index)
rf_boston

Call:
 randomForest(formula = medv ~ ., data = Boston, subset = index) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 12.68651
                    % Var explained: 83.45

500 trees were grown. The number of variables tried at each split is 4, the regression default of p/3 (rounded down) with p = 13 predictors. The reported MSR is the out-of-bag (OOB) mean of squared residuals.
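
To see how the OOB error evolves as trees are added, randomForest provides a plot method (a quick sketch):

plot(rf_boston)  # OOB mean squared error as a function of the number of trees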

m is the number of variables randomly selected as split candidates at each split. It is set by the mtry argument of randomForest(), and it is one of the main parameters we can tune.

Fitting random forests with m ranging from 1 to 13 (all p predictors), and recording the OOB and test errors
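
oob_rf_error <- double(13)
test_rf_error <- double(13)

for(m in 1:13)
{
  # fit a forest of 400 trees considering m variables at each split
  fit <- randomForest(medv~., data = Boston, subset = index, mtry = m, ntree = 400)
  oob_rf_error[m] <- fit$mse[400]    # OOB MSE after all 400 trees
  pred <- predict(fit, Boston[-index,])
  test_rf_error[m] <- with(Boston[-index,], mean((medv - pred)^2))    # test-set MSE
  cat(m, " ")    # progress indicator
}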

oob_rf_error
 [1] 22.72616 14.62290 13.73073 12.79109 12.38863 12.69210 12.41103 12.44833 12.71799 13.08525 12.91090 12.60250 12.83121

Plotting the OOB and test errors across m

matplot(1:m, cbind(oob_rf_error, test_rf_error), pch = 19, col = c("red", "blue"), type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("OOB", "Test"), pch = 19, col = c("red", "blue"))

If we want to do bagging, we can set mtry to the number of predictors (m = 13). The left-most values correspond to m = 1, where each split considers only a single randomly chosen variable.


Boosting

We run a simple boosting model on the same Boston data. We use the package “gbm” (Gradient Boosted Machines)

require(gbm)
Loading required package: gbm
Loaded gbm 2.1.5

We ask for 10,000 trees because boosting grows many shallow trees; interaction.depth = 4 allows up to 4 splits per tree (so at most 4 variables per tree).
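
boost_boston <- gbm(medv~., data = Boston[index,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(boost_boston)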

summary() gives the variable importance plot! The two variables that seem most important are lstat and rm

Checking out these two variables with partial dependence plots

par(mfrow = c(1,2))
plot(boost_boston, i = "lstat")

plot(boost_boston, i = "rm")

The higher the proportion of lower-status residents in the suburb, the lower the value of the house; the higher the number of rooms, the higher the value.


Looking at the test performance as a function of the number of trees

no_trees <- seq(from = 100, to = 10000, by = 100)
pred_matrix <- predict(boost_boston, newdata = Boston[-index,], n.trees = no_trees)

pred_matrix is a matrix of predictions on the test data: 206 test observations (rows) by 100 different values of n.trees (columns)

Computing the test error

We compute the squared difference between each column of predictions and medv, then take the mean of each column; this gives the test MSE for each number of trees.

boost_test_error <- with(Boston[-index,], apply((pred_matrix - medv)^2, 2, mean))

Plotting the boosting test error and comparing it with the best random forest test error

plot(no_trees, boost_test_error, pch = 19, ylab = "Mean Squared Error", xlab = "# of Trees", main = "Boosting Test Error")
abline(h = min(test_rf_error), col = "red")

The test error drops quickly and then levels off; there is no real increase or decrease after that.


---
title: "R Notebook"
output: html_notebook
---

### Decision Trees

```{r}
#install.packages("tree")
require(ISLR)
require(tree)
```

We use the Carseats data from the ISLR package, with the "tree" package for fitting

```{r}
data(Carseats)
dim(Carseats)
hist(Carseats$Sales)
```

Sales is a quantitative variable; we turn it into a binary variable to make this a classification problem. The cutoff is Sales = 8: at most 8 becomes No (low), above 8 becomes Yes (high).

```{r}
Carseats$High <- as.factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
```

Fitting a tree-based model (removing Sales from the formula, because the binary response was created from it)

```{r}
tree_carseats <- tree(High~.-Sales, data = Carseats)
summary(tree_carseats)

```

The summary shows the variables actually used in the splits (in no particular order), the number of terminal nodes, the residual mean deviance, and the misclassification error rate

Plotting the tree (we need to annotate the tree with text, otherwise it will just show the tree structure with no other information)
```{r}
plot(tree_carseats)
text(tree_carseats,pretty=0)
```

There are a lot of splits in this tree, quite complex to look at! Since this is a classification problem, each terminal node contains either a Yes or a No, whichever is the majority in that region

Printing the tree gives full details at every node: the split rule, the number of observations, the deviance, the predicted class, and the class proportions. Quite tough to look at, really.
```{r}
tree_carseats
```

---

Creating a train-test split, building the tree on the training data and evaluating on the testing data

Total records = 400; Training dataset = 250, testing dataset = 150 observations
```{r}
set.seed(1011)
index <- sample(1:nrow(Carseats), 250)

train_tree_carseats <- tree(High~.-Sales, data = Carseats, subset = index)
summary(train_tree_carseats)
```

Plotting the above tree
```{r}
plot(train_tree_carseats)
text(train_tree_carseats, pretty = 0)
```

The tree is similarly complex to the one built on the full data.

Predicting the results on the test dataset
```{r}
test_tree_carseats_pred <- predict(train_tree_carseats, newdata = Carseats[-index,], type = "class")
table(test_tree_carseats_pred)
```

Confusion matrix for test dataset
```{r}
table(test_tree_carseats_pred, Carseats$High[-index])

with(Carseats[-index,],table(test_tree_carseats_pred,High))
```

Misclassification rate from the above table
```{r}
cat("Error Rate",(58+45)/150)
```

Using CV to control the tree size. We use misclassification error as the pruning criterion.

```{r}
cv_train_tree_carseats <- cv.tree(train_tree_carseats, FUN = prune.misclass)
cv_train_tree_carseats
```

The object contains the sizes of the tree as pruning proceeds, the corresponding CV deviance (here, the misclassification count), and k, the cost-complexity parameter

```{r}
plot(cv_train_tree_carseats)
```

This plots the CV misclassification count across different tree sizes. The minimum occurs around sizes 17-23; we prune to a size of 13, trading a little CV error for a much simpler tree (see the sketch below).
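
Rather than reading the size off the plot, we can pick it programmatically; a small sketch using the cv object printed above (note which.min returns the first minimum, here a larger size than the hand-picked 13):

```{r}
# tree size with the lowest CV misclassification count
best_size <- cv_train_tree_carseats$size[which.min(cv_train_tree_carseats$dev)]
best_size
```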

```{r}
prune_carseats <- prune.misclass(train_tree_carseats, best = 13)
plot(prune_carseats)
text(prune_carseats, pretty = 0)
```

This tree is a little shallower than the previous ones, as a result of the cross-validated pruning.

Evaluating this on the testing dataset
```{r}
cv_test_tree_carseats_pred <- predict(prune_carseats, Carseats[-index,], type = "class")
table(cv_test_tree_carseats_pred)
```

Calculating the misclassification rate again
```{r}
table(cv_test_tree_carseats_pred, Carseats$High[-index])
cat("Error Rate", (59+46)/150)
```

We get similar performance with respect to the misclassification rate, but the tree is shallower.

---

More complicated tree-based models, using the Boston housing data (housing values and other statistics for 506 suburbs of Boston, based on the 1970 census)

### Random Forests

```{r}
require(MASS)
require(randomForest)
```

Random forests build a lot of trees; at each split only a random subset of the predictors is considered as candidates, and the trees are averaged to reduce variance.

Creating a train-test split with 300 training and 206 test observations
```{r}
set.seed(101)
data(Boston)
index <- sample(1:nrow(Boston), 300)
```

We use random forest for regression, with 'medv' as the response variable and every other variable as the predictor.


```{r}
rf_boston <- randomForest(medv~., data = Boston, subset = index)
rf_boston
```

500 trees were grown. The number of variables tried at each split is 4, the regression default of p/3 (rounded down) with p = 13 predictors. The reported MSR is the out-of-bag (OOB) mean of squared residuals.

m is the number of variables randomly selected as split candidates at each split. It is set by the mtry argument of randomForest(), and it is one of the main parameters we can tune.

Fitting random forests with m ranging from 1 to 13 (all p predictors), and recording the OOB and test errors

```{r}
oob_rf_error <- double(13)
test_rf_error <- double(13)

for(m in 1:13)
{
  # fit a forest of 400 trees considering m variables at each split
  fit <- randomForest(medv~., data = Boston, subset = index, mtry = m, ntree = 400)
  oob_rf_error[m] <- fit$mse[400]    # OOB MSE after all 400 trees
  pred <- predict(fit, Boston[-index,])
  test_rf_error[m] <- with(Boston[-index,], mean((medv - pred)^2))    # test-set MSE
  cat(m, " ")    # progress indicator
}

```


Plotting the OOB and test errors across m

- type = "b" means it plots both the points and connects them with lines
```{r}
matplot(1:m, cbind(oob_rf_error, test_rf_error), pch = 19, col = c("red", "blue"), type = "b", ylab = "Mean Squared Error")
legend("topright", legend = c("OOB", "Test"), pch = 19, col = c("red", "blue"))
```

If we want to do bagging, we can set mtry to the number of predictors (m = 13), as sketched below. The left-most values correspond to m = 1, where each split considers only a single randomly chosen variable.
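
A minimal bagging sketch under the same split, just a random forest that considers every predictor at each split:

```{r}
# bagging: mtry = p = 13, so all predictors are candidates at every split
bag_boston <- randomForest(medv~., data = Boston, subset = index, mtry = 13, ntree = 400)
bag_boston
```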

---

### Boosting

We run a simple boosting model on the same Boston data. We use the package "gbm" (Gradient Boosted Machines)

```{r}
require(gbm)
```

We ask for 10,000 trees because boosting grows many shallow trees; interaction.depth = 4 allows up to 4 splits per tree (so at most 4 variables per tree).

```{r}
boost_boston <- gbm(medv~., data = Boston[index,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(boost_boston)
```

summary() gives the variable importance plot! The two variables that seem most important are lstat and rm

Checking out these two variables with partial dependence plots
```{r}
par(mfrow = c(1,2))
plot(boost_boston, i = "lstat")
plot(boost_boston, i = "rm")
```

The higher the proportion of lower-status residents in the suburb, the lower the value of the house.
The higher the number of rooms, the higher the value.

---

Looking at the test performance as a function of the number of trees

```{r}
no_trees <- seq(from = 100, to = 10000, by = 100)
pred_matrix <- predict(boost_boston, newdata = Boston[-index,], n.trees = no_trees)

```

pred_matrix is a matrix of predictions on the test data: 206 test observations (rows) by 100 different values of n.trees (columns)
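
A quick sanity check on the shape (a small sketch):

```{r}
dim(pred_matrix)  # expect 206 rows (test observations) and 100 columns (values of n.trees)
```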

Computing the test error

We compute the squared difference between each column of predictions and medv, then take the mean of each column; this gives the test MSE for each number of trees.
```{r}
boost_test_error <- with(Boston[-index,], apply((pred_matrix - medv)^2, 2, mean))
```


Plotting the boosting test error and comparing it with the best random forest test error

```{r}
plot(no_trees, boost_test_error, pch = 19, ylab = "Mean Squared Error", xlab = "# of Trees", main = "Boosting Test Error")
abline(h = min(test_rf_error), col = "red")
```

The test error drops quickly and then levels off; there is no real increase or decrease after that.
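
To pick the number of trees with the lowest test error, a small sketch using the objects defined above:

```{r}
no_trees[which.min(boost_test_error)]  # n.trees value minimizing test MSE
min(boost_test_error)                  # the minimum test MSE itself
```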

---
