Fall 2016

K-means clustering

  • k-means clustering is a method of cluster analysis which aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean.
  • It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.

K-means statistics

  • The basic idea behind K-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized:

\[\text{minimize}\left( \sum_{k=1}^{K}W(C_k) \right)\]

where \(C_k\) is the \(k^{th}\) cluster and \(W(C_k)\) is the within-cluster variation of the cluster \(C_k\).
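
With squared Euclidean distance (the usual default, and an assumption here rather than something stated above), the within-cluster variation is the sum of squared distances of the cluster members to the cluster mean \(\mu_k\):

\[W(C_k) = \sum_{x_i \in C_k} \| x_i - \mu_k \|^2\]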

K-means - Algorithm

K-means steps

  • Simplified example
    – Expression for two genes for 14 samples
  • Some structure can be seen

K-means steps

  • Choose \(K\) centroids
  • These are starting values that the user picks.
  • There are some data-driven ways to choose them.

K-means steps

  • Find the closest centroid for each point
  • This is where the distance metric is used
  • This gives the first partition into \(K\) clusters

K-means steps

  • Take the middle of each cluster
  • Re-compute each centroid as the middle (mean) of its cluster
  • Use the new centroids to re-calculate the distances

K-means steps

– Expression for two genes for 14 samples
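
A minimal sketch of these steps in R; the two-gene, 14-sample matrix below is simulated for illustration, not the data shown on the slides:

```r
set.seed(1)
# 14 samples x 2 genes: two groups of 7 samples centered at (2,2) and (6,6)
expr <- rbind(matrix(rnorm(14, mean = 2), ncol = 2),
              matrix(rnorm(14, mean = 6), ncol = 2))
colnames(expr) <- c("gene1", "gene2")

# K-means with K = 2; nstart restarts guard against poor starting centroids
km <- kmeans(expr, centers = 2, nstart = 25)

km$cluster       # cluster assignment of each sample
km$centers       # final centroids
km$tot.withinss  # total within-cluster variation being minimized
```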

PAM (K-medoids)

  • Centroid - The average of the samples within a cluster
  • Medoid - The “representative object” within a cluster
  • Initializing requires choosing medoids at random.
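
A minimal PAM sketch, assuming the `cluster` package; the toy matrix is the same simulated two-gene data used in the k-means sketch above:

```r
library(cluster)

set.seed(1)
expr <- rbind(matrix(rnorm(14, mean = 2), ncol = 2),  # same toy data as above
              matrix(rnorm(14, mean = 6), ncol = 2))

# PAM with K = 2; medoids are actual samples, not averages
pm <- pam(expr, k = 2)

pm$medoids     # the "representative objects"
pm$clustering  # cluster membership of each sample
```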

K-means limitations

  • Final results depend on starting values
  • How do we choose \(K\)? There are methods but not much theory saying what is best (one heuristic is sketched after this list).
  • Where are the pretty pictures?
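
One informal, data-driven heuristic for choosing \(K\) is the "elbow" plot sketched below: compute the total within-cluster variation for a range of \(K\) and look for the bend where adding clusters stops helping (the `expr` matrix is the simulated data from the k-means sketch):

```r
set.seed(1)
expr <- rbind(matrix(rnorm(14, mean = 2), ncol = 2),  # simulated two-gene data
              matrix(rnorm(14, mean = 6), ncol = 2))

# Total within-cluster sum of squares for K = 1..6
wss <- sapply(1:6, function(k) kmeans(expr, centers = k, nstart = 25)$tot.withinss)

plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
# Pick K near the "elbow" of the curve
```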

Self-organizing (Kohonen) maps

  • A self-organizing map (SOM) is a learning method that produces low-dimensional data (e.g. \(2D\)) from high-dimensional data (\(nD\)) through the use of self-organizing neural networks
  • E.g. an apple differs from a banana in more than two ways, but the two can be differentiated based on their size and color only.

Self-organizing (Kohonen) maps

If we represent apples and bananas as points and their similarity as lines, then

  • Two points connected by a shorter line are of the same kind
  • Two points connected by a longer line are of different kinds
  • A threshold \(t\) is chosen to decide whether a line counts as longer or shorter

Self-organizing (Kohonen) maps

  • We just created a map that differentiates an apple from a banana based on only two traits.
  • We have successfully “trained” the SOM; now anyone can use it to “map” apples and bananas
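
A minimal SOM sketch in R, assuming the `kohonen` package; the toy two-trait "fruit" data, grid size, and training length are arbitrary choices, not values from the slides:

```r
library(kohonen)

set.seed(1)
# Toy data: two traits (size, color score) for apples and bananas
traits <- rbind(matrix(rnorm(20, mean = c(8, 2)), ncol = 2, byrow = TRUE),   # apples
                matrix(rnorm(20, mean = c(15, 7)), ncol = 2, byrow = TRUE))  # bananas
colnames(traits) <- c("size", "color")

# Train a small 3 x 3 hexagonal map on the scaled traits
som_fit <- som(scale(traits),
               grid = somgrid(xdim = 3, ydim = 3, topo = "hexagonal"),
               rlen = 200)

plot(som_fit, type = "mapping")  # which map unit each object lands on
```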

SOM in gene expression studies

SOM example

Application of SOM

Genome Clustering

  • Goal: understand the phylogenetic relationships between different genomes.
  • Compute bootstrap support of individual genomes for different phylogenetic tree topologies, then cluster genomes based on their topology support.

Clustering Proteins based on the architecture of their activation loops

  • Align the proteins under investigation
  • Extract the functional centers
  • Turn 3D representation into 1D feature vectors
  • Cluster based on the feature vectors

Other approaches

  • Bi-clustering - cluster both the genes and the experiments simultaneously to find appropriate context for clustering
  • R packages: iBBiG, FABIA, biclust
  • stand-alone: BicAT (Biclustering Analysis Toolbox)

Dimensionality reduction techniques

Principal Components Analysis

  • Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
  • Also known as a dimensionality-reduction technique; related to, but distinct from, independent component analysis
  • PCA decomposes complex data relationship into simple components
  • New components are linear combinations of the original data

Principal Components Analysis

  • Performs a rotation of the data that maximizes the variance in the new axes
  • Projects high dimensional data into a low dimensional sub-space (visualized in 2-3 dims)
  • Often captures much of the total data variation in a few dimensions (< 5)
  • Exact solutions require a fully determined system (matrix with full rank), i.e. a “square” matrix with independent rows

Principal Components Analysis

  • PCA - linear projection of the data onto major principal components defined by the eigenvectors of the covariance matrix.
  • Criterion to be minimised: square of the distance between the original and projected data.

\[x_P=Px\]

where \(P\) is composed of the eigenvectors of the covariance matrix

\[C=\frac{1}{n-1} \sum_i{(x_i - \mu)(x_i - \mu)^T }\]
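
A minimal PCA sketch in R on a toy samples-by-variables matrix; `prcomp` centers (and here also scales) the data, and its rotation matrix plays the role of \(P\) above (the data and dimensions are invented):

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)   # 20 samples x 5 variables (toy data)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

pca$rotation              # eigenvectors of the covariance matrix (the matrix P)
pca$x                     # data projected onto the principal components
summary(pca)$importance   # standard deviation / proportion of variance per PC
```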

Principal Components Analysis

Example: Leukemia data sets by Golub et al.: Classification of ALL and AML

Principal Components Analysis

  • Eigenvalue: the amount of variance explained along its eigenvector.
  • The eigenvector with the largest eigenvalue is the first principal component. The eigenvector with the second-largest eigenvalue points in the direction of the second-largest variance.

Principal Components Analysis

PCA for gene expression

  • Given a gene-by-sample matrix \(X\) we decompose (centered and scaled) \(X\) as \(USV^T\)
  • We don’t usually care about total expression level and the dynamic range which may be dependent on technical factors
  • \(U\), \(V\) have orthonormal columns
  • The diagonal elements of \(S\) are the singular values; their squares are proportional to the variance explained by each component

PCA for gene expression

  • Columns of \(V\) are
    • Principal components
    • Eigengenes/metagenes that span the space of the gene transcriptional responses
  • Columns of \(U\) are
    • The “loadings”, or the correlation between each gene and the component
    • Eigenarrays/metaarrays that span the space of the sample expression profiles
  • Truncating \(U\), \(S\), \(V\) to the first \(k\) dimensions gives the best rank-\(k\) approximation of \(X\)
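
A minimal SVD sketch in R on a toy gene-by-sample matrix, illustrating the decomposition and the rank-\(k\) truncation (dimensions and data are invented):

```r
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)  # 100 genes x 10 samples (toy data)
X <- t(scale(t(X)))                       # center and scale each gene

s <- svd(X)                               # X = U S V^T
var_explained <- s$d^2 / sum(s$d^2)       # proportion of variance per component

k <- 2                                    # keep the first k components
Xk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # best rank-k approximation
```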

Singular Value Decomposition

PCA applied to cell cycle data

Other decomposition techniques

  • Non-negative matrix factorization
  • \(A=WH\) (\(A\), \(W\), \(H\) are non-negative)
  • \(H\) defines a meta-gene space: similar to eigengenes
  • Classification can be done in the meta-gene space

Jean-Philippe Brunet et al. PNAS 2004;101:4164-4169

NMF

  • Many computational methods
    • Cost function \(\|A-WH\|\)
    • Squared error - aka Frobenius norm
    • Kullback–Leibler divergence
  • Optimization procedure
    • Most use stochastic initialization, and the results don’t always converge to the same answer
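
A minimal sketch of the Lee-Seung multiplicative updates for the Frobenius-norm cost; dedicated implementations (e.g., the R `NMF` package) are more robust, and the toy matrix, rank, and iteration count here are arbitrary:

```r
set.seed(1)
A <- matrix(runif(50 * 10), nrow = 50)  # non-negative toy data: 50 genes x 10 samples
k <- 2                                  # number of metagenes

W <- matrix(runif(50 * k), nrow = 50)   # stochastic initialization
H <- matrix(runif(k * 10), nrow = k)

for (i in 1:200) {                      # each update step decreases ||A - WH||^2
  H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}

sum((A - W %*% H)^2)                    # squared-error (Frobenius) cost after fitting
```

Because of the random initialization, repeated runs generally end at different local minima, which is why the results do not always converge to the same answer.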

NMF

  • \(A=WH\) : Toy Biological interpretation
  • Assume \(k=2\)
  • We have 2 transcription factors that activate gene signatures \(W_1\) and \(W_2\)
  • \(H\) represents the activity of each factor in each sample
  • TF effects are additive
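
A toy numeric illustration of this additive picture (all numbers invented): each factor contributes its gene signature, weighted by its per-sample activity in \(H\).

```r
# Two invented TF signatures over four genes (columns of W)
W <- cbind(W1 = c(5, 4, 0, 0),   # TF1 activates genes 1-2
           W2 = c(0, 0, 3, 6))   # TF2 activates genes 3-4

# Invented activities of each TF in three samples (rows of H)
H <- rbind(c(1, 0, 0.5),
           c(0, 1, 0.5))

A <- W %*% H   # observed expression = additive combination of the two TF programs
A
```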

NMF

  • NMF operates in the original non-negative measurement space
  • Highly expressed genes matter more
  • Positivity constraint is advantageous: positive correlation among genes is more likely to be biologically meaningful
  • NMF may more accurately capture the data generating process

NMF vs. PCA

  • Results of PCA vs. NMF for reducing the 72-sample leukemia data for visualization. Sample 66 is mislabeled; in the 2-D display, the data reduced by NMF clearly reveal this mistake, while the data reduced by PCA do not. ‘PC’ stands for principal component and ‘BE’ for basis experiment.

Weixiang Liu, Kehong Yuan, Datian Ye “Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis” Journal of Biomedical Informatics, 2008.

Multidimensional scaling

MDS attempts to

  • Identify abstract variables which have generated the inter-object similarity measures
  • Reduce the dimension of the data in a non-linear fashion
  • Reproduce non-linear higher-dimensional structures on a lower-dimensional display

Kruskal’s stress

\[\text{stress}=\sqrt{\frac{\sum{(d_{ij} - \hat{d}_{ij})^2}}{\sum{d_{ij}^2}}}\]

  • Goodness-of-fit - Measures the degree of correspondence between distances among points on the MDS map and the input matrix.
  • Start with the observed distances \(d_{ij}\)
  • Fit monotonically ordered numbers \(\hat{d}_{ij}\)
  • Subtract, square, sum
  • Divide by a scaling factor (\(\sum d_{ij}^2\))
  • Take the square root
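
A literal transcription of the formula as an R function; `d` and `dhat` are plain numeric vectors of observed and fitted distances (real MDS software computes this internally):

```r
# Kruskal's stress: compare observed distances d with fitted distances dhat
kruskal_stress <- function(d, dhat) {
  sqrt(sum((d - dhat)^2) / sum(d^2))
}

kruskal_stress(d = c(1, 2, 3, 4), dhat = c(1.1, 1.9, 3.2, 3.8))  # toy example
```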

MDS Basic Algorithm

  • Obtain and order the \(M\) pairs of similarities
  • Try a configuration in \(q\) dimensions
    • Determine inter-item distances and reference numbers
    • Minimize Kruskal’s stress
  • Move the points around to obtain an improved configuration
  • Repeat until minimum stress is obtained
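
A minimal non-metric MDS sketch in R, assuming `MASS::isoMDS`, which iterates a configuration to minimize Kruskal's stress (the data are simulated):

```r
library(MASS)

set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30)  # 30 items in 5 dimensions (toy data)
d <- dist(X)                           # inter-item distances

fit <- isoMDS(d, k = 2)                # try a configuration in q = 2 dimensions
fit$stress                             # final Kruskal's stress (reported in percent)
plot(fit$points, xlab = "Dimension 1", ylab = "Dimension 2")
```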

Comparison Between PCA, MDS, and SOM

  • PCA tries to preserve the covariance of the original data
  • MDS tries to preserve the metric (ordering relations) of the original space
  • SOM tries to preserve the topology (local neighborhood relations), items projected to nearby locations are similar