Computational HW 6

Computational Assignment 6:
Hierarchical, k Means and Spectral Clustering
-Written by James Wilson
-Edited by Andrew Nobel
In this assignment we will explore how to use three popular clustering tools in the R environment: k
means, hierarchical, and spectral clustering. We apply each of these methods to the TCGA dataset. The
data contain the expression of a random sample of 2000 genes for 217 patients. Here, we will focus on
clustering the patients based on patterns of their gene expression.
Note: Clustering is an unsupervised learning task in the sense that it searches for structure in the data
without additional knowledge of labels associated with the data. The TCGA dataset is special in
that we have auxiliary information about the patients, namely their breast cancer subtype. It is important
to note that we will cluster the patients while treating these subtypes as unknown. This auxiliary
information will be used only to validate the clustering itself.
1. Load and parse the TCGA data from the course website using the following commands:
#Load the data
trial.sample = read.table("http://www.unc.edu/~jameswd/TCGA_sample.txt",header = TRUE)
#Store the subtypes of tissue and the gene expression data
Subtypes = trial.sample[,1]
Gene.Expression = as.matrix(trial.sample[,2:2001])
As we are clustering the patients, we first must calculate pairwise distances between the patients.
The dist(X, method) function can be used to calculate the pairwise distances between the rows of X
using the specified method as a distance metric. Calculate the Euclidean and absolute (Manhattan)
pairwise distances between patients (the lower triangle of a 217 × 217 distance matrix) using the following code:
dist.euclidean = dist(Gene.Expression, method = "euclidean")
dist.abs = dist(Gene.Expression, method = "manhattan") #absolute distance
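To build intuition for why the choice of metric matters, here is a small sketch on a toy 3 × 2 matrix (not the TCGA data); the two metrics can disagree about which pair of rows is farthest apart.

```r
# Toy illustration: Euclidean vs. Manhattan (absolute) distance.
X.toy = matrix(c(0, 0,
                 3, 4,
                 6, 0), nrow = 3, byrow = TRUE)
dist(X.toy, method = "euclidean") # pairwise distances 5, 6, 5: rows 1 and 3 are farthest apart
dist(X.toy, method = "manhattan") # pairwise distances 7, 6, 7: rows 1 and 3 are now the closest pair
```

Under Euclidean distance rows 1 and 3 are the most dissimilar pair, while under Manhattan distance they are the most similar, so the metric itself can change the clustering.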
Questions
(a) The object Subtypes is a vector with the names of the tumor types of each subject. How many
subjects had a “Normal” tumor type and how many had a “Basal” tumor type?
(b) Clustering requires a distance matrix as input. Above we calculated the Euclidean and absolute
distances between the gene expressions of the patients. Give an informal reason why the choice
of a distance metric may change depending on the type of data you are clustering.
2. Hierarchical Clustering: There are many ways in R to perform hierarchical clustering. We will use the
version that visualizes both the dendrogram and the distance matrix associated with the clustering.
The function requires the GMD package. First load the GMD package using the following code:
install.packages("GMD")
library(GMD)
To run hierarchical clustering and view the rows/columns of the data matrix permuted according to
the resulting clustering, one can use the heatmap.2() function. This function takes many
arguments; I list the ones we require here, but please type ?heatmap.2 for more information:
• Colv and Rowv: logical value (TRUE or FALSE) specifying whether or not you want the
columns (rows) reordered according to hierarchical clustering
• trace: specifies whether or not you want to overlay a grid
• RowSideColors: (optional) - colors that you want the leaves of the dendrogram of the rows to be
colored
• dendrogram: specifies whether you want hierarchical clustering performed on the rows, columns,
both or none of the data matrix
• col: the color scheme used for the heatmap
• distfun: distance metric used on the data
Note that the heatmap.2() function performs the clustering with hclust(), which uses complete linkage
by default; this can be altered by supplying the hclustfun argument to the function. We will use the
default linkage on the TCGA data.
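If you want to see the effect of linkage directly, here is a sketch in base R on toy data (not the TCGA matrix); in heatmap.2() the analogous change is made through the hclustfun argument.

```r
# Sketch: hierarchical clustering with a non-default linkage in base R.
set.seed(1)
X.toy = matrix(rnorm(20), nrow = 5)         # 5 toy observations
d = dist(X.toy, method = "euclidean")
hc.complete = hclust(d)                     # default: complete linkage
hc.average = hclust(d, method = "average")  # average linkage
# In heatmap.2(), pass: hclustfun = function(d) hclust(d, method = "average")
```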
Run hierarchical clustering on the patients using the Euclidean and absolute distance matrix using
the following code. Print each of the plots generated.
#Euclidean distance: cluster the patients (rows)
heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c)
dist(c, method = "euclidean"), col = "redgreen", dendrogram = "row")
dev.new()
#Absolute distance: cluster the patients (rows)
heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c)
dist(c, method = "manhattan"), col = "redgreen", dendrogram = "row")
Now, let’s use the auxiliary information to get an idea of how well our clustering performed in each
case. To visualize our results, we will color the dendrogram leaves according to the true subtype labels of
each patient. Plot the results using the following code:
#Set up colors (grey for any subtype other than Basal or Normal)
colors = rep("grey", 217)
colors[which(Subtypes == "Basal")] = "blue"
colors[which(Subtypes == "Normal")] = "green"
#Euclidean distance: cluster the patients (rows)
heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c)
dist(c, method = "euclidean"), col = "redgreen", dendrogram = "row", RowSideColors = colors)
dev.new()
#Absolute distance: cluster the patients (rows)
heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c)
dist(c, method = "manhattan"), col = "redgreen", dendrogram = "row", RowSideColors = colors)
Note that the dendrograms are generated based on pairwise distances between the patients. The
resulting heatmap shows the original gene expression data matrix with the rows reordered according
to hierarchical clustering.
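While heatmap.2() handles the clustering internally, several of the questions below are easier to think about with explicit cluster labels in hand. One way to obtain them (a sketch on toy data, using base R's hclust() and cutree()) is:

```r
# Sketch: obtain explicit cluster labels by cutting a dendrogram.
set.seed(1)
X.toy = rbind(matrix(rnorm(10, mean = 0), nrow = 5),
              matrix(rnorm(10, mean = 5), nrow = 5))  # two separated toy groups
hc = hclust(dist(X.toy, method = "euclidean"))
labels = cutree(hc, k = 2)  # one label (1 or 2) per row
table(labels)               # cluster sizes
```

The same labels could then be compared across distance choices, or tabulated against Subtypes.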
Questions:
(a) Comment on the differences between the two dendrograms. Does one distance appear to cluster
differently than the other? Which do you prefer?
(b) How might you quantify the difference between the two clusterings without the use of the
auxiliary information?
(c) How might you quantify the difference between the two clusterings with the use of the auxiliary
information?
(d) Visually “cut” the dendrogram into two clusters. Draw a line on your plots showing where you
cut the dendrogram. How closely do the subtypes appear to cluster in these two groups?
(e) Suppose that you read a scientific paper where the authors use hierarchical clustering on a data
set and show a figure similar to what you just created. What kinds of questions might you be
inclined to ask the authors regarding their clustering? Does the flexibility of the clustering (the
choice of distance, the choice of linkage, etc.) make you more or less confident in the authors’
clustering?
3. k means: To run the k means algorithm on the rows of a matrix X, the kmeans(X, k, iter.max)
command can be used, where k is the number of clusters to find and iter.max is the maximum number
of iterations used to find the k partition. Once you run y = kmeans(X, k, iter.max), you can type
y$cluster to obtain the cluster labels of the data points. Also, you can type y$tot.withinss to obtain
the total within cluster sum of squares (WCSS) for the identified partition.
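As a quick sketch of this interface on toy data (not the TCGA matrix):

```r
# Sketch: run k means and inspect the labels and the WCSS.
set.seed(1)
X.toy = rbind(matrix(rnorm(20, mean = 0), nrow = 10),
              matrix(rnorm(20, mean = 6), nrow = 10))  # two separated toy groups
y = kmeans(X.toy, 2, iter.max = 100)
y$cluster       # 20 cluster labels, each 1 or 2
y$tot.withinss  # total within-cluster sum of squares (WCSS)
```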
Recall that k means searches for the partition of the data that minimizes the WCSS for a fixed k. To
get an idea of how k affects the WCSS, run k means for k from 1 to 20 and calculate the WCSS for
each partition. Then, plot the WCSS across k by using the following code:
withinss = rep(0, 20) #initialize storage for the WCSS values
for(k in 1:20){
z = kmeans(Gene.Expression, k, iter.max = 100)
withinss[k] = z$tot.withinss
}
plot(withinss, xlab = "k", ylab = "WCSS", main = "k means on TCGA")
Questions:
(a) Comment on WCSS as a function of k. Do your findings from the above plot make intuitive
sense? Why or why not?
(b) Comment about the WCSS at k = 2.
(c) Re-run k means for k = 2 and compare the cluster labels for the results at k = 2 with the true
subtype labels. What proportion of each cluster contains “Normal” and “Basal” subtypes?
4. Spectral clustering: Let X be an n × p data matrix. Spectral clustering first relies upon a similarity
matrix S. The spectral clustering algorithm then runs the k means algorithm (or some other
clustering algorithm) on the eigenvectors corresponding to the k smallest eigenvalues of the
normalized graph Laplacian of S.
We can use the linear algebra commands of Computer Assignment 2 to write a simple function for
normalized spectral clustering. For simplicity, we will specify S to be the n × n correlation matrix
associated with the rows of X.
Write an R function that will run spectral clustering on the rows of an n × p matrix X with the
following input arguments: k: number of clusters to be found, max.iter: number of iterations to run
the k means algorithm. Use the following lines of code as a template:
#Calculate the sample correlation of the rows of X (an n x n similarity matrix)
S = cor(t(X))
#Calculate the degree matrix D
D = diag(rowSums(S))
#Calculate the normalized Laplacian: L = I - D^{-1} S
L.norm = diag(nrow(S)) - solve(D)%*%S
#Find the k smallest eigenvectors and store them as an n x k matrix M
temp = eigen(L.norm)
indx = order(temp$values) #ascending order of eigenvalues
M = temp$vectors[, indx[1:k]]
#Run k means on M
resulting.cluster = kmeans(M,k,iter.max)$cluster
HINT: the above is almost complete; you just need to add an input line (with appropriate input
arguments) and a return() command.
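If the function syntax is unfamiliar, the general pattern the hint refers to looks like this (a generic sketch with a made-up function name, not the spectral clustering solution itself):

```r
# Generic pattern: an input line with arguments, a body, and a return() command.
column.means = function(X){
  out = colMeans(X)   # compute something from the inputs
  return(out)         # hand the result back to the caller
}
column.means(matrix(1:6, nrow = 2)) # returns 1.5 3.5 5.5
```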
Questions:
(a) Use your newly crafted spectral clustering algorithm on the TCGA data Gene.Expression,
setting k = 2 and iter.max = 100. Then, using the auxiliary Subtypes object, determine what
proportion of each cluster contains “Normal” and “Basal” subtypes.