Computational Assignment 6: Hierarchical, k Means and Spectral Clustering -Written by James Wilson -Edited by Andrew Nobel In this assignment we will explore how to use three popular clustering tools in the R environment: k means, hierarchical and spectral clustering. We apply each of these methods to the TCGA dataset. The data contains the expression of a random sample of 2000 genes for 217 patients. Here, we will focus on the clustering of the patients based on patterns of their gene expression. Note: Clustering is an unsupervised learning task in the sense that it searches for structure in the data without the additional knowledge of labels associated with the data. The TCGA dataset is a special in that we have auxillary information about the patients, namely their breast cancer subtype. It is important to note that we will cluster the patients while treating these subtypes as unknown. This auxillary information will only be used as validation to the clustering itself. 1. Load and parse the TCGA data from the course website using the following commands: #Load the data trial.sample = read.table("http://www.unc.edu/~jameswd/TCGA_sample.txt",header = TRUE) #Store the subtypes of tissue and the gene expression data Subtypes = trial.sample[,1] Gene.Expression = as.matrix(trial.sample[,2:2001]) As we are clustering the patients, we first must calculate pairwise distances between the patients. The dist(X, method) function can be used to calculate the pairwise distances between the rows of X using the specified method as a distance metric. Calculate the average Euclidean and average absolute pairwise distances between patients (a 217 × 217 matrix) using the following code: dist.euclidean = dist(Gene.Expression, method = "euclidean") dist.abs = dist(Gene.Expression, method = "manhattan") #absolute distance Questions (a) The object Subtypes is a vector with the names of the tumor types of each subject. How many subjects had a ”Normal” tumor type and how many had a ”Basal” tumor type? (b) Clustering requires a distance matrix as input. Above we calculated the Euclidean and absolute distances between the gene expressions of the patients. Give an informal reason why the choice of a distance metric may change depending on the type of data you are clustering. 2. Hierarchical Clustering: There are many ways in R to perform hierarchical clustering. We will use the version that visualizes both the dendrogram and the distance matrix associated with the clustering. The function requires the GMD package. First load the GMD package using the following code: 1 install.packages("GMD") library(GMD) To run hierarchical clustering and view the permuted columns/rows of the resulting distance matrix, one can use the heatmap.2() function. There are many arguments required for this function. I list the arguments we require here but please type ?heatmap.2 for more information: • Colv and Rowv: logical value (TRUE or FALSE) specifying whether or not you want the columns (rows) reordered according to hierarchical clustering • trace: specifies whether or not you want to overlay a grid • RowSideColors: (optional) - colors that you want the leaves of the dendrogram of the rows to be colored • dendrogram: specifies whether you want hierarchical clustering performed on the rows, columns, both or none of the data matrix • col: the color scheme used for the heatmap • distfun: distance metric used on the data Note that the heatmap.2() function automatically uses average linkage though this can be altered by specifying the linkage argument to the function. We will use average linkage on the TCGA data. Run hierarchical clustering on the patients using the Euclidean and absolute distance matrix using the following code. Print each of the plots generated. #Euclidean distance: cluster the patients (rows) heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c) dist(c, method = "euclidean"), col = "redgreen", dendrogram = "row") dev.new() #Absolute distance: cluster the patients (rows) heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c) dist(c, method = "manhattan"), col = "redgreen", dendrogram = "row") Now, let’s use the auxillary information to get an idea of how well our clustering performed on each case. To visualize our results, we will color the dendrogram according to the true subtype labels of each patient. Plot the results using the following code: #Set up colors colors = rep(0,217) colors[which(Subtypes == "Basal")] = "blue" colors[which(Subtypes == "Normal")] = "green" #Euclidean distance: cluster the patients (rows) heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c) dist(c, method = "euclidean"), col = "redgreen", dendrogram = "row", RowSideColors = colors) dev.new() #Absolute distance: cluster the patients (rows) heatmap.2(Gene.Expression, Rowv = TRUE, Colv = FALSE, trace = "none", distfun = function(c) dist(c, method = "manhattan"), col = "redgreen", dendrogram = "row", RowSideColors = colors) Note that the dendrograms are generated based on pairwise distances between the patients. The resulting heatmap shows the original gene expression data matrix with the rows reordered according to hierarchical clustering. 2 Questions: (a) Comment on the differences between the two dendrograms. Does one distance appear to cluster differently than the other? Which do you prefer? (b) How might you quantify the difference between the two clusterings without the use of the auxillary information? (c) How might you quantify the difference between the two clusterings with the use of the auxillary information? (d) Visually “cut” the dendrogram into two clusters. Draw a line on your plots showing where you cut the dendrogram. How closely do the subtypes appear to cluster in these two groups? (e) Suppose that you read a scientific paper where the authors use hierarchical clustering on a data set and show a figure similar to what you just created. What kinds of questions might you be inclined to ask the authors regarding their clustering? Does the flexibility of the clustering (the choice of distance, the choice of linkage, etc.) make you more or less confident in the authors’ clustering? 3. k means: To run the k means algorithm on the rows of a matrix X, the kmeans(X, k, iter.max) command can be used where k is the number of clusters to find and iter.max is the maximum number of iterations used to find the k partition. Once you run y = kmeans(x,k,iter.max), you can type y$cluster to obtain the cluster labels of the data points. Also, you can type y$tot.withinss to obtain the total within cluster sum of squares (WCSS) for the identified partition. Recall that k means searches for the partition of the data that minimizes the WCSS for a fixed k. To get an idea of how k affects the WCSS, run k means for k from 1 to 20 and calculate the WCSS for each partition. Then, plot the WCSS across k by using the following code: for(k in 1:20){ z = kmeans(Gene.Expression,k,iter.max = 100) withinss[k] = z$tot.withinss } plot(withinss, xlab = "k", ylab = "WCSS", main = "k means on TCGA") Questions: (a) Comment on WCSS as a function of k. Do your findings from the above plot make intuitive sense? Why or why not? (b) Comment about the WCSS at k = 2. (c) Re-run k means for k = 2 and compare the cluster labels for the results at k = 2 with the true subtype labels. What proportion of each cluster contains “Normal” and “Basal” subtypes? 4. Spectral clustering: Let X be an n × p data matrix. Spectral clustering first relies upon a similarity matrix S. Then, the spectral clustering algorithm runs the k-means algorithm (or some other clustering algorithm) on the k smallest eigenvectors of the normalized graph Laplacian of S. We can use the linear algebra commands of Computer Assignment 2 to write a simple function for normalized spectral clustering. For simplicity, we will specify S to be the p × p correlation matrix associated with the rows of X. Write an R function that will run spectral clustering on the rows of an n × p matrix X with the following input arguments: k: number of clusters to be found, max.iter: number of iterations to run the k means algorithm. Use the following lines of code as a template: 3 #Calculate the sample correlation of rows of X S = cor(t(X)) #Calculate D D = diag(rowSums(S)) #Calculate normalized laplacian L.norm = 1 - solve(D)%*%S #Find the k smallest eigenvectors and store them as n x k matrix X temp = eigen(L.norm) indx = order(temp$values) #ascending order M = temp$vectors[indx,1:k] #Run k means on M resulting.cluster = kmeans(X,k,iter.max)$cluster HINT: the above is almost complete, you just need to add a input line (with appropriate input arguments) and a return() command. Questions: (a) Use your newly crafted spectral clustering algorithm on the TCGA data Gene.Expression setting k = 2 and iter.max = 100. Now, using the auxillary Subtypes object, determine what proportion of each cluster contains “Normal” and “Basal” subtypes. 4
© Copyright 2025 ExpyDoc