Using the USA Arrests dataset for unsupervised learning
```r
dimnames(USArrests)
```

```
[[1]]
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"       "California"     "Colorado"       "Connecticut"
 [8] "Delaware"       "Florida"        "Georgia"        "Hawaii"         "Idaho"          "Illinois"       "Indiana"
[15] "Iowa"           "Kansas"         "Kentucky"       "Louisiana"      "Maine"          "Maryland"       "Massachusetts"
[22] "Michigan"       "Minnesota"      "Mississippi"    "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"       "North Carolina" "North Dakota"   "Ohio"
[36] "Oklahoma"       "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"
[43] "Texas"          "Utah"           "Vermont"        "Virginia"       "Washington"     "West Virginia"  "Wisconsin"
[50] "Wyoming"

[[2]]
[1] "Murder"   "Assault"  "UrbanPop" "Rape"
```
```r
apply(USArrests, 2, mean)
```

```
  Murder  Assault UrbanPop     Rape
   7.788  170.760   65.540   21.232
```
```r
apply(USArrests, 2, var)
```

```
    Murder    Assault   UrbanPop       Rape
  18.97047 6945.16571  209.51878   87.72916
```
Principal components analysis is about the variance (the variables are centred, so the mean is made 0). We see that Assault has by far the largest variance. If a single variable has much more variance than the others, it will dominate the principal components.

We therefore standardize the variables before computing the PCA.
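The effect of standardization can be checked directly with `scale()`, which centres each column and divides by its standard deviation (a minimal sketch):

```r
# Standardize USArrests: subtract each column mean, divide by each column SD
z <- scale(USArrests)
apply(z, 2, mean)  # approximately 0 for every column
apply(z, 2, var)   # exactly 1 for every column
```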
```r
# prcomp will standardize the variables for us
pca.out <- prcomp(USArrests, scale = TRUE)
pca.out
```

```
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
```
The standard deviations are the SDs of the four principal components. The rotation matrix contains the loadings. The first PC is loaded roughly equally on Murder, Assault and Rape (all three crime variables) and less on the urban population. The sign of the loadings does not really matter, because flipping it does not change the variance explained. The second PC is loaded heavily on the urban population.
Plotting the first two components
The red arrows give the directions of the loadings of the principal components, while each state takes a position in the plot according to its first two scores. For example, Florida, Nevada and California are high on crime, while Maine, North Dakota and New Hampshire are low on crime. The vertical axis mostly reflects whether a state has a large urban population or not.
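A sketch of the plot described above, using base R's `biplot` on the `prcomp` output (`scale = 0` keeps the loadings on their natural scale):

```r
pca.out <- prcomp(USArrests, scale = TRUE)
# States plotted by their first two principal-component scores,
# with red arrows for the variable loadings
biplot(pca.out, scale = 0, cex = 0.6)
```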
We now work on simulated data, where we create clusters by shifting the means of Gaussian points.
```r
set.seed(101)
# Generate 100 observations on 2 variables
x <- matrix(rnorm(100 * 2), 100, 2)
# Generate means for the 4 clusters, used to shift the points
xmean <- matrix(rnorm(8, sd = 4), 4, 2)
# Which point is going to be placed in which cluster?
which <- sample(1:4, 100, replace = TRUE)
# xmean[which, ] produces a 100-row matrix with 2 columns
x <- x + xmean[which, ]
plot(x, col = which, pch = 19)
```
Running k-means on this dataset
```r
# Use 15 random starts: the algorithm runs 15 times,
# each time from a different initial cluster assignment
kmout <- kmeans(x, 4, nstart = 15)
kmout
```
```
K-means clustering with 4 clusters of sizes 32, 28, 20, 20

Cluster means:
        [,1]       [,2]
1 -0.5787702  4.7639233
2 -5.6518323  3.3513316
3  1.4989983 -0.2412154
4 -3.1104142  1.2535711

Clustering vector:
  [1] 2 4 1 2 4 1 2 4 1 1 3 1 1 3 4 3 2 3 2 2 2 2 2 3 1 1 4 2 4 1 2 3 2 4 4 3 3 4 3 3 2 4 4 2 2 3 2 1 2 4 2 1 1 3 3 4 3 1 1 1 4 2 2
 [64] 2 4 4 1 1 3 2 2 1 1 3 1 3 2 1 1 1 4 1 4 1 2 3 1 2 2 1 1 4 2 4 1 1 3 3 1 1

Within cluster sum of squares by cluster:
[1] 53.04203 42.40322 34.95921 48.52107
 (between_SS / total_SS = 85.7 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"
```
Plotting the data
```r
# Plot the k-means cluster assignments (open circles)
plot(x, col = kmout$cluster, cex = 2, pch = 1, lwd = 2)
# Overlay the original assignments (filled points)
points(x, col = which, pch = 19)
```
There are one or two mismatches between the open and filled points, but otherwise the two assignments agree closely, which is a reasonable result.
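The agreement can be quantified with a confusion table between the true assignments and the k-means labels, continuing from the objects above (the k-means label numbers are arbitrary, so agreement shows up as one dominant count per row):

```r
# Cross-tabulate true cluster labels against k-means labels
table(which, kmout$cluster)
```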
We use the same simulated dataset for hierarchical clustering.

We compute the clusters with different linkage functions and plot the dendrograms (bottom-up, i.e. agglomerative, clustering).
```r
hc_complete <- hclust(dist(x), method = "complete")
hc_single   <- hclust(dist(x), method = "single")
hc_average  <- hclust(dist(x), method = "average")
par(mfrow = c(1, 3))
plot(hc_complete)
plot(hc_single)
plot(hc_average)
```
Complete linkage is the preferred method.

`cutree` is a function used to cut the tree at a preferred height or number of clusters. We compare the result with the original assignments.
```r
# Cut the complete-linkage tree into 4 clusters
hc_cut <- cutree(hc_complete, 4)
table(hc_cut, which)
```

```
      which
hc_cut  1  2  3  4
     1  0 28  0  0
     2  1  0  0 20
     3 31  0  0  0
     4  0  0 20  0
```
The cluster labels are arbitrary: label 1 in `hc_cut` is not the same as label 1 in the original data, so what matters is that each row has one dominant count. The small off-diagonal numbers are the misassignments.
Comparing with K-Means clustering (4 clusters)
```r
table(hc_cut, kmout$cluster)
```

```
hc_cut  1  2  3  4
     1  0 28  0  0
     2  1  0  0 20
     3 31  0  0  0
     4  0  0 20  0
```
We see similar results for k-means and hierarchical clustering.

The dendrograms can be labelled with the original cluster assignments in order to see misassignments.
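A sketch of that labelling, continuing from the objects above: the `hclust` plot method accepts a `labels` argument, so each leaf can be printed with the true cluster number of its point.

```r
# Label each leaf of the dendrogram with its true cluster number
plot(hc_complete, labels = which, cex = 0.5)
```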