
Visualization and Clustering
with High-dimensional
Genomic Data
Zhenqiu Liu, PhD
10/14/2014
Agenda
• Data Visualization
• Big data and their visualization
• PCA and MDS for data visualization
• Clustering and data mining
• Bioinformatics resources
Introduction
• Visualization is the use of graphical techniques to
communicate information and support reasoning
or analysis
• Visualizations are cost-effective because they
exploit
– powerful human visual processing capabilities and
– high quality graphics created at low cost
• Two kinds of visualizations
– Scientific Visualization
– Information Visualization
Visualization
• Scientific Visualization
– Graphical representations built from the results of mathematical models, computational data, and simulations
– Involves research in computer graphics, statistics, image processing, high-performance computing, and other areas
– It's not just a pretty picture or animation
• Information Visualization
– The use of computer-supported, interactive, visual representations of abstract data to amplify cognition
• Visualization is not only looking at a pretty picture…
– understanding the data
– being able to analyze and interpret the data
Goals of Information Visualization
Visualization should:
– Make large datasets coherent
(Present huge amounts of information compactly)
– Present information from various viewpoints
– Present information at several levels of detail
(from overviews to fine structure)
– Support visual comparisons
– Tell stories about the data
Why Visualization?
Use the eye for pattern recognition; people are good at
scanning
recognizing
remembering images
Graphical elements facilitate comparisons via
length
shape
orientation
texture
Animation shows changes across time
Color helps make distinctions
A Key Question
How do we convert abstract information into a visual representation, while still preserving the underlying meaning, and at the same time providing new insight?
Problem Solving - Example
• London cholera epidemic of 1854
• At that time, two hypotheses of causes of cholera:
– Cholera is related to miasmas concentrated in the
swampy areas of the city
– Cholera is related to ingestion of contaminated water
• Input Data
– Locations of deaths due to cholera
– Locations of water pumps
Dr Snow’s Cholera Map
[Map: dots locate deaths due to cholera; crosses locate water pumps]
Dr Snow’s Cholera Map
John Snow’s deduction that a cholera epidemic was caused by a bad water pump, circa 1854. Horizontal lines indicate locations of deaths.
From Visual Explanations by Edward Tufte, Graphics Press, 1997.
Dr Snow’s Cholera Map
• Plotting the input data on the map helped Dr
Snow
– to detect the epicentre of the epidemic
– Close to a pump on Broad Street
• Considered a classic case of visualization
helping reasoning with data
LifeLines
• Visualization of computerised medical records
• For a patient
– Horizontal lines (time lines) represent medical
problems, hospitalization and medications
– Icons on these lines represent events such as tests and
physician consultations
• All the patient information is fitted into one screen
Screenshot of LifeLines
Types of Visualization
(Kosslyn 89)
• Graphs
• Charts
• Maps
• Diagrams
Common Graph Types
[Example charts: bar chart of # of accesses per URL; line graph of # of accesses over days; plots of length of access vs. length of page]
Scatter Plots
• Qualitatively determine if variables
– are highly correlated
• linear mapping between horizontal & vertical axes
– have low correlation
• spherical, rectangular, or irregular distributions
– have a nonlinear relationship
• a curvature in the pattern of plotted points
• Place points of interest in context
– color representing special entities
Visualizing Temporal Data
• Traditionally time series are visualized
using trend graphs and seasonality graphs
– A time series can be expressed in terms of its
trend and seasonality components
– Data = trend + seasonal + remainder
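As a quick illustration of this decomposition, the sketch below uses R's stl() on the built-in monthly co2 dataset (an assumed example series, not data from these slides) to split a series into trend, seasonal, and remainder components:

# Decompose a monthly time series into trend + seasonal + remainder (illustrative sketch)
data(co2)                                # built-in monthly CO2 concentration series
fit <- stl(co2, s.window = "periodic")   # loess-based seasonal-trend decomposition
plot(fit)                                # panels: data, seasonal, trend, remainder
head(fit$time.series)                    # columns: seasonal, trend, remainder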
Trend and Seasonality in Time Series
[Figure: example trend and seasonality decomposition plots]
When to use which type?
• Line graph
– x-axis requires a quantitative variable
– Variables have continuous values
– Ordering among ordinals
• Bar graph
– comparison of relative point values
• Scatter plot
– convey overall impression of relationship between two variables
• Pie Chart?
– Emphasizing differences in proportion among a few numbers
Growth Chart Of GEO (RNA etc)
Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples, 16 billion
individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically
receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000 manuscripts.
GenBank growth chart (DNA sequences)
There are 126 billion bases in 135 million sequence records in the traditional GenBank divisions and 191 billion bases in 62 million sequence records in the WGS division as of April 2011.
Big Omics Data
• Many genes and samples; heterogeneous data structures and data types
• Big data collections vs. big data objects
• Big data collection: aggregates of many data sets (multi-source, multi-disciplinary, heterogeneous, and possibly distributed)
• Big data object: a single object too large
– for main memory
– for local disk
– even for remote disk
Basic Types of Omics Data
• Nominal (qualitative)
– (no inherent order) SNP, Sequencing, ...
• Ordinal (qualitative)
– (ordered, but not at measurable intervals)
– first, second, third, …
– Clinical phenotypes (e.g. cancer stages)
• Quantitative
– list of integers or reals
– Gene expression, protein expression.
• Structural (PPI networks)
Dimension Reduction
• High dimensional data points are difficult to visualize
• Always good to plot data in 2D
– Easier to detect or confirm the relationship among data points
– Catch stupid mistakes (e.g. in clustering)
• Two ways to reduce:
– By genes: some genes are similar or have little information
– By experiments: some experiments are similar or have little information
Principal Component Analysis
• An optimal linear transformation that chooses a new coordinate system for the data, projecting the data onto new axes (the principal components) ordered by the variance they capture
• Components are orthogonal (mutually uncorrelated)
• A few PCs may capture most of the variation in the original data
• E.g. reduce 2D data to 1D
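A minimal R sketch of this idea with prcomp(), assuming an expression matrix with samples in rows and genes in columns (simulated here; the variable name expr is ours, not from the slides):

# PCA of a simulated expression matrix: 20 samples x 100 genes (illustrative sketch)
set.seed(1)
expr <- matrix(rnorm(20 * 100), nrow = 20)          # rows = samples, columns = genes
pca  <- prcomp(expr, center = TRUE, scale. = TRUE)  # center and scale each gene
summary(pca)$importance[, 1:3]                      # variance explained by PC1-PC3
plot(pca$x[, 1], pca$x[, 2],                        # project samples onto PC1 vs PC2
     xlab = "PC1", ylab = "PC2", main = "Samples in PC space")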
PCA
[Figure: rotating from the original coordinate axes to the principal axes v1 and v2]
Dimension Reduction (PCA)
[Figure: 3D data (x, y, z) with new axes 1–3] Principal components pick out the directions in the data that capture the greatest variability.
Representing data in a reduced space
The first new axis is projected through the data so as to explain the greatest proportion of the variance in the data (most important). The second new axis is orthogonal and explains the next largest amount of variance.
Typical Analysis
• PCA analysis: plot the eigenvalues (scree plot) and select the number of components to keep
• Array projection: plot PC1 vs PC2, etc.
• Gene projection
[Figure: scree plot of eigenvalues]
Interpreting a PCA
Each axis represents a different “trend” or set of profiles.
– The further from the origin, the greater the loading/contribution (i.e. higher expression)
– Points in the same direction from the origin share the same trend
PCA of Gene Expression Data
[Figure: example PCA plot of gene expression data]
Multidimensional scaling (MDS)
• MDS deals with the following problem: given a set of observed similarities (or distances) between every pair of N items, find a representation of the items in a few dimensions such that the similarity (distance) structure nearly matches the original similarities (or distances).
• The numerical measure of how close the low-dimensional distances are to the original distances is called the stress.
MDS
1. MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to best preserve the distance structure after the mapping.
2. The original data can live in a high-dimensional or even non-metric space. The method only cares about the distance (dissimilarity) structure.
3. It can be shown that the results of PCA are exactly those of classical MDS if the distances calculated from the data matrix are Euclidean.
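A minimal R sketch of MDS (reusing the simulated matrix expr from the PCA sketch above): classical MDS with cmdscale() on Euclidean distances, and Kruskal's non-metric MDS with isoMDS(), which reports the stress of the solution:

# Classical and non-metric MDS on pairwise distances between samples (illustrative sketch)
library(MASS)
d   <- dist(expr, method = "euclidean")   # dissimilarity structure between samples
mds <- cmdscale(d, k = 2)                 # classical (metric) MDS in 2 dimensions
plot(mds[, 1], mds[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "Classical MDS of samples")
nmds <- isoMDS(d, k = 2)                  # Kruskal's non-metric MDS
nmds$stress                               # stress: mismatch between original and mapped distances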
PCA vs. MDS
Input data:
– PCA: data matrix (S subjects in G dimensions)
– MDS: dissimilarity structure (distance between any pair of subjects)
Method:
– PCA: “project” subjects to a low-dimensional space, preserving as much variance as possible
– MDS: find a low-dimensional space that best keeps the dissimilarity structure
Restrictions:
– PCA: data have to be in Euclidean space
– MDS: flexible to any data structure as long as the dissimilarity structure can be defined
Pros and cons:
– PCA: the PCs can be used in downstream analyses; if a new subject is added, it can be similarly projected
– MDS: flexibility and visualization, but if a new subject is added, it can't be shown in an existing MDS solution
PCA application: genomic study
• Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences can cause spurious associations in disease studies.
• PCA can be used to infer the underlying population structure.
References:
– Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reich (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904-909.
– Chao Tian, Peter K Gregersen & Michael F Seldin (2008). Accounting for ancestry: population substructure and genome-wide association studies.
Software for dimension reduction & visualization
PCA in R:
– prcomp(stats): Principal Components Analysis (preferred)
– princomp(stats): Principal Components Analysis
– screeplot(stats): screeplot of PCA results
PCA in IMSL (a commercial C library)
MDS in R:
– isoMDS(MASS): Kruskal's Non-metric Multidimensional Scaling
– cmdscale(stats): Classical (Metric) Multidimensional Scaling
– sammon(MASS): Sammon's Non-Linear Mapping
MDS: various software and resources about MDS
– http://www.granular.com/MDS/
Heatmap visualization:
– Treeview: http://rana.lbl.gov/EisenSoftware.htm
Visualization vs. Analysis?
• Applications to data mining and data discovery
– Visualization tools are helpful for exploring hunches and presenting results
• Examples: scatterplots
– They are the WRONG primary tool when the goal is to find a good prediction model in a complex situation
Data Mining and Machine Learning
• Machine learning and data mining can be used to recognize or classify complex items (objects, situations, etc.), to predict future data or events, and to explore the structure in the data.
• On the boundary of Computer Science and Statistics.
• Very exciting area.
Why Data Mining?
• A lot of data
• Data is noisy
• No clear biological theory
• Large number of features (genes)
• Complex relationships
• Let the data do the talking!
KDD Process
Unsupervised Learning
Unsupervised learning attempts to discover interesting structure in the available data (data mining, clustering).
Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all
Example 2: in medicine, identify patient subtypes based on their omics profiles.
– Personalized medicine
Supervised Learning
[Diagram: observations from an unknown system, paired with the property of interest, form the training dataset; an ML algorithm fits a model, and the model predicts the property (classification) for a new observation]
Topics
• Clustering and Data Mining
• Classification and Prediction
• Bioinformatics Resources
The challenge
• Biologists are estimated to produce 25,000,000,000,000,000 bytes (25 petabytes) of data each year (roughly 35 million CD-ROMs).
• How do we learn something from this data?
• Find patterns/structure in the data → use cluster analysis
Cluster analysis
• Definition: Clustering is the process of
grouping several objects into a number of
groups, or clusters.
• Goal: Objects in the same cluster are more
similar to one another than they are to
objects in other clusters.
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on predefined criteria.
Issues:
– Which genes / genomic technology to use?
– Which similarity or dissimilarity measure?
– Which method to use to join the clusters/observations?
– Which clustering algorithm?
– How to validate the resulting clusters?
Clustering of genes
[Workflow: omics data → for each gene, calculate a summary statistic and/or adjusted p-value → set of candidate DE genes → choose similarity metric and clustering algorithm → clustering → descriptive interpretation → biological verification]
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity
between two data objects
• Two main classes of metric:
– Correlation coefficients (similarity)
• Compares shape of expression curves
– Kernel matrix (e.g. string kernel)
– Distance metrics (dissimilarity)
• City Block (Manhattan) distance
• Euclidean distance
Correlation (a measure between -1 and 1)
• Pearson Correlation Coefficient (centered correlation)
  r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right)
  Sx = standard deviation of x, Sy = standard deviation of y
• Others include Spearman's rho and Kendall's tau
[Figure: examples of positive and negative correlation]
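A small R sketch of these correlation measures on two simulated profiles (x and y are made-up vectors, not data from the slides):

# Pearson, Spearman, and Kendall correlation between two profiles (illustrative sketch)
set.seed(2)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.5)   # y is positively correlated with x
cor(x, y, method = "pearson")        # centered (Pearson) correlation
cor(x, y, method = "spearman")       # Spearman's rho (rank-based)
cor(x, y, method = "kendall")        # Kendall's tau
1 - cor(x, y)                        # a common correlation-based dissimilarity for clustering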
Distance metrics
• City Block (Manhattan) distance:
– Sum of differences across dimensions
– Less sensitive to outliers
– Diamond-shaped clusters
  d(X, Y) = \sum_i |x_i - y_i|
• Euclidean distance:
– Most commonly used distance
– Sphere-shaped clusters
– Corresponds to the geometric distance in the multidimensional space
  d(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}
where gene X = (x1,…,xn) and gene Y = (y1,…,yn)
[Figure: genes X and Y plotted against Condition 1 and Condition 2, showing the Manhattan and Euclidean distances between them]
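A minimal R sketch computing both distances with dist() for two hypothetical gene profiles (the numbers are made up for illustration):

# Manhattan vs. Euclidean distance between two gene profiles (illustrative sketch)
X <- c(1.2, 0.5, 2.3, 1.8)               # gene X across 4 conditions
Y <- c(0.9, 1.1, 2.0, 2.5)               # gene Y across the same conditions
dist(rbind(X, Y), method = "manhattan")  # sum of absolute differences
dist(rbind(X, Y), method = "euclidean")  # square root of the sum of squared differences
# For a full genes-by-conditions matrix, dist() returns all pairwise distances between rows.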
Euclidean vs Correlation (I)
[Figure: example profiles compared using Euclidean distance vs. correlation]
Clustering algorithms
• Clustering algorithms come in two basic flavors:
– Partitioning
– Hierarchical
Hierarchical methods
• Hierarchical clustering methods produce a tree or
dendrogram.
• They avoid specifying how many clusters are
appropriate by providing a partition for each k
obtained from cutting the tree at some level.
• The tree can be built in two distinct ways
– bottom-up: agglomerative clustering (usually used).
– top-down: divisive clustering.
Agglomerative clustering illustration
[Figure: five points in two-dimensional space; agglomerative clustering successively merges {1,5}, then {1,2,5} and {3,4}, and finally {1,2,3,4,5}, producing a dendrogram]
Relationships between these pairwise distances – Clustering Algorithms
• Different algorithms
– Bottom-up or top-down
– A popular choice is hierarchical bottom-up (agglomerative) clustering
– The distance between a cluster and the remaining clusters can be measured using the minimum, maximum, or average distance
– The single linkage algorithm uses the minimum distance
Comparison of Linkage Methods
– Single linkage: join by the minimum distance between clusters
– Average linkage: join by the average distance
– Complete linkage: join by the maximum distance
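A short R sketch comparing these linkage options in hclust() (reusing the simulated matrix expr from the PCA sketch above):

# Hierarchical clustering of samples under different linkage rules (illustrative sketch)
d <- dist(expr, method = "euclidean")
hc_single   <- hclust(d, method = "single")    # join by minimum distance
hc_average  <- hclust(d, method = "average")   # join by average distance
hc_complete <- hclust(d, method = "complete")  # join by maximum distance
par(mfrow = c(1, 3))
plot(hc_single); plot(hc_average); plot(hc_complete)
cutree(hc_complete, k = 3)                     # cut the tree into 3 clusters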
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimizing the within-cluster sums of squares. Ideally, dissimilarity between clusters is maximized while dissimilarity within clusters is minimized.
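A minimal R sketch of one partitioning method, k-means, again on the simulated matrix expr (k and the data are assumptions for illustration):

# k-means partitioning into a pre-specified number of clusters (illustrative sketch)
set.seed(3)
km <- kmeans(expr, centers = 2, nstart = 25)    # k = 2, with 25 random restarts
km$cluster                                      # cluster assignment of each sample
km$tot.withinss                                 # total within-cluster sum of squares (the minimized criterion)
plot(prcomp(expr)$x[, 1:2], col = km$cluster,   # view the partition in PC space
     xlab = "PC1", ylab = "PC2")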
Partitioning methods
[Figure: example partition with K = 2]
Partitioning methods
[Figure: example partition with K = 4]
Cluster Analysis in R
– dist(): compute the dissimilarity matrix
– hclust(): hierarchical clustering
– heatmap(): clustered heatmap
– library(Heatplus): enhanced heatmaps
Many publications present both.
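A sketch of the typical pipeline with these functions, again on the simulated expr matrix (heatmap() builds the row and column dendrograms itself):

# From dissimilarities to a clustered heatmap (illustrative sketch)
d  <- dist(expr)                       # pairwise distances between samples (rows)
hc <- hclust(d, method = "average")    # hierarchical clustering
plot(hc)                               # dendrogram
heatmap(expr,                          # heatmap with row and column dendrograms
        distfun = dist, hclustfun = hclust,
        scale = "column")              # scale each gene (column) before plotting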
Summary
Data Mining (Clustering)
– Similarity measures
– Partitioning and hierarchical clustering
– Overlapping Clustering
Topics
• Clustering and Data Mining
• Classification (Discrimination) and
Prediction
• Bioinformatics resources
Classification and Prediction
[Diagram: a learning set (data with known classes) is fed to a classification technique to learn a classification rule (discrimination); the rule is then applied to data with unknown classes to predict their class assignment]
Classification in Bioinformatics
• Computational diagnostic: early cancer
detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding sites prediction
• Gene function prediction
• …
Learning set
[Diagram: predefined classes based on clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs); the objects are arrays and the feature vectors are gene expression profiles; a classification rule learned from the arrays predicts the class of a new array]
Reference: L van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Microarray Data Challenges to
Machine Learning Algorithms:
• Few samples for analysis (38 labeled)
• Extremely high-dimensional data (7129 gene expression values
per sample)
• Noisy data
• Complex underlying mechanisms, not fully understood
Some genes are more useful than others for building classification models
– Example: genes 36569_at and 36495_at are useful [scatter plot separates AML from ALL samples]
– Example: genes 37176_at and 36563_at are not useful
Importance of Feature (Gene) Selection
• The majority of genes are not directly related to cancer
• Having a large number of features enhances the model's flexibility, but makes it prone to overfitting
• Noise and the small number of training samples make this even more likely
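As a hedged sketch of one simple filter approach (not the method used in the cited studies): rank genes by a two-sample t-statistic between classes and keep the top few, assuming the simulated expr matrix from above and hypothetical ALL/AML labels:

# Simple t-statistic filter for gene selection (illustrative sketch, not the published method)
labels <- factor(rep(c("ALL", "AML"), each = 10))    # hypothetical class labels for 20 samples
tstats <- apply(expr, 2, function(g)                 # one t-statistic per gene (column)
  abs(t.test(g[labels == "ALL"], g[labels == "AML"])$statistic))
top_genes  <- order(tstats, decreasing = TRUE)[1:10] # keep the 10 most discriminative genes
expr_small <- expr[, top_genes]                      # reduced feature matrix for model building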
A 6-gene signature of lung metastasis
Landemaine T et al. Cancer Res. 2008 Aug 1;68(15):6092-9.
Topics
• Clustering and Data Mining
• Classification (Discrimination) and
Prediction
• Bioinformatics Resources
Five websites that all biologists
should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
– http://www.rcsb.org/PDB/
A few more resources to be aware of
• Human Genome Working Draft
– http://genome.ucsc.edu/
• TIGR (The Institute for Genomic Research)
– http://www.tigr.org/
• Celera
– http://www.celera.com/
• (Model) Organism-specific information:
– Yeast: http://genome-www.stanford.edu/Saccharomyces/
– Arabidopsis: http://www.tair.org/
– Mouse: http://www.jax.org/
– Fruitfly: http://www.fruitfly.org/
– Nematode: http://www.wormbase.org/
• Nucleic Acids Research Database Issue
– http://nar.oupjournals.org/ (first issue every year)
Challenges
• Confusing choice of tools
• Developed independently
• Written by and for nerds
• Need help from bioinformatician
Outline
• What is R
• What is Bioconductor
• Getting and using Bioconductor
• Overview of Bioconductor packages
• Demo
R
• R is a language and environment for
statistical computing and graphics.
• what sorts of things is R good at?
– there are very many statistical algorithms
– there are very many machine learning
algorithms
– visualization
– it is possible to write scripts that can be reused
Goals of Bioconductor
• Provide access to powerful statistical and
graphical methods for the analysis of genomic
data.
• Facilitate the integration of biological metadata
(GenBank, GO, LocusLink, PubMed) in the
analysis of experimental data.
• Allow the rapid development of extensible,
interoperable, and scalable software.
• Promote high-quality documentation and
reproducible research.
• Provide training in computational and statistical
methods.
Microarray data analysis
Installation
1. Main R software: download from CRAN
(cran.r-project.org), use latest release, now
1.8.0.
2. Bioconductor packages: download from
Bioconductor (www.bioconductor.org),
use latest release, now 1.3.
Available for Linux/Unix, Windows, and
Mac OS.
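The release numbers above reflect the slides' era; as a hedged sketch, a current Bioconductor release is normally installed from within R via the BiocManager package (the packages named below are examples, not requirements of these slides):

# Installing Bioconductor packages with BiocManager (current mechanism)
install.packages("BiocManager")               # BiocManager itself comes from CRAN
BiocManager::install()                        # core Bioconductor packages
BiocManager::install(c("limma", "Biobase"))   # example packages; substitute the ones you need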
Documentation and help
• R manuals and tutorials:available from the R website or
on-line in an R session.
• R on-line help system: detailed on-line documentation,
available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos(mean)
> example(hclust)
> demo()
> demo(image)
R cluster analysis packages
• cclust: convex clustering methods.
• class: self-organizing maps (SOM).
• cluster:
– AGglomerative NESting (agnes),
– Clustering LARge Applications (clara),
– DIvisive ANAlysis (diana),
– Fuzzy Analysis (fanny),
– MONothetic Analysis (mona),
– Partitioning Around Medoids (pam).
• e1071:
– fuzzy C-means clustering (cmeans),
– bagged clustering (bclust).
• flexmix: flexible mixture modeling.
• fpc: fixed point clusters, clusterwise regression and discriminant plots.
• GeneSOM: self-organizing maps.
• mclust, mclust98: model-based cluster analysis.
• mva:
– hierarchical clustering (hclust),
– k-means (kmeans).
• Specialized summary, plot, and print methods for clustering results.
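A small sketch of one of these, Partitioning Around Medoids from the cluster package, on the simulated expr matrix from earlier (k = 3 is an arbitrary choice):

# Partitioning Around Medoids (pam) from the cluster package (illustrative sketch)
library(cluster)
pm <- pam(expr, k = 3)         # partition samples around 3 medoids
pm$clustering                  # cluster assignment of each sample
pm$medoids                     # the representative (medoid) observations
plot(silhouette(pm))           # silhouette plot to assess cluster quality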
Hierarchical clustering
[Figure: dendrogram produced by the hclust function from the mva package]
Heatmaps
[Figure: clustered heatmap produced by the heatmap function from the mva package]
References
• R: www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.
• Bioconductor: www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
– mailing list.
Conclusions
• Visualization
• PCA and MDS visualization
• Clustering
• Classification
• Bioinformatics resources
• R and Bioconductor