Clustering QC

Fall 2016

Assess cluster fit and stability

Most often ignored.
Cluster structure is treated as reliable and precise
BUT! Clustering is generally VERY sensitive to noise and to outliers
Measure cluster quality based on how “tight” the clusters are.
Do genes in a cluster appear more similar to each other than genes in other clusters?

Clustering evaluation methods

Sum of squares
Homogeneity and Separation
Cluster Silhouettes and Silhouette coefficient: how similar genes within a cluster are to genes in other clusters
Rand index
Gap statistics
Cross-validation

Sum of squares

A good clustering yields clusters where genes have small within-cluster sum-of-squares (and high between-cluster sum-of-squares).

Homogeneity

Homogeneity is calculated as the average distance between each gene expression profile and the center of the cluster it belongs to

\[H_{k}=\frac{1}{N_g} \sum_{i \in k} d(X_i,C(X_i))\]

\(N_g\) - total number of genes in the cluster

Separation

– Separation is calculated as the weighted average distance between cluster centers

\[S_{ave}=\frac{1}{\sum_{k \neq l}{N_kN_l}} \sum_{k \neq l}{N_kN_ld(C_k,C_l)}\]

Homogeneity and separation

– Homogeneity reflects the compactness of the clusters while S reflects the overall distance between clusters
– Decreasing Homogeneity or increasing Separation suggest an improvement in the clustering results

Variance Ratio Criterion (VCR)

\[VRC_k=(SS_B/(K-1))/(SS_W/(N-K))\]

\(SS_B\) – between-cluster variation
\(SS_W\) – within-cluster variation

The goal is to maximize \(VRC_k\) over the clusters

\[\kappa_k=(VRC_{k+1} - VRC_k) - (VRC_k - VRC_{k-1})\]

Select K to minimize the value of kappaK
Calinski & Harabasz (1974)

Silhouette

Good clusters are those where the genes are close to each other compared to their next closest cluster.

\[s(i)=\frac{b(i)-a(i)}{max(a(i),b(i))}\]

\(b(i) = min(AVGD_{BETWEEN}(i,k))\)
\(a(i) = AVGD_{WITHIN}(i)\)
How well observation \(i\) matches the cluster assignment. Ranges \(-1 < s(i) < 1\)
Overall silhouette: \(SC=\frac{1}{N_g}\sum_{i=1}^{N_g}{s(i)}\)
Rousseeuw, Peter J. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 1987 http://www.sciencedirect.com/science/article/pii/0377042787901257

Silhouette plot

The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
Silhouette width near +1 indicates points that are very distant from neighboring clusters
Silhouette width near 0 indicate points that are not distinctly in one cluster or another
Negative width indicates points are probably assigned to the wrong cluster.

Rand index

Cluster multiple times

Clustering A: 1, 2, 2, 1, 1
Clustering B: 2, 1, 2, 1, 1

Compare pairs

\(a: \; = \; and \; =\), the number of pairs assigned to the same cluster in A and in B
\(b: \; \neq \; and \; \neq\), … different clusters in A and in B
\(c: \; \neq \; and \; =\), … same in A, different in B
\(d: \; = \; and \; \neq\), … same in B, different in A

Rand index

\[R=\frac{a+b}{a+b+c+d}\]

Adjust the Rand index to make it vary between -1 and 1 (negative if less than expected)
\(AdjRand = (Rand – expect(Rand)) / (max(Rand) – expect(Rand))\)

Gap statistics

Cluster the observed data, varying the total number of clusters \(k=1, 2, ... K\)
For each cluster, calculate the sum of the pairwise distances for all points

\[D_r=\sum_{i,i' \in C_r}{d_{ii'}}\]

Calculate within-cluster dispersion measures

\[W_k=\sum_{r=1}^k{\frac{1}{2n_r}D_r}\]

Gap statistics

Cross-validation approaches

Cluster while leave-out \(k\) experiments (or genes)
Measure how well cluster groups are preserved in left out experiment(s)
Or, measure agreement between test and training set

Clustering validity

Hypothesis: if the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the distance vector

WADP - robustness of clustering

If the input data deviate slightly from their current value, will we get the same clustering?
– Important in Microarray expression data analysis because of constant noise

Bittner M. et.al. "Molecular classification of cutaneous malignant melanoma by gene expression profiling" Nature 2000 http://www.nature.com/nature/journal/v406/n6795/full/406536A0.html

WADP - robustness of clustering

Perturb each original gene expression profile by \(N(0, 0.01)\)
Re-normalize the data, cluster
Cluster-specific discrepancy rate: \(D/M\). That is, for the \(M\) pairs of genes in an original cluster, count the number of gene pairs, \(D\), that do not remain together in the clustering of the perturbed data, and take their ratio.
The overall discrepancy ratio is the weighted average of the cluster-specific discrepancy rates.

WADP - robustness of clustering

If there were originally \(m_j\) genes in the cluster \(j\), then there are \(M_j=m_j(m_j-1)/2\) pairs of genes
In the new clustering, identify how many of these paris (\(D_j\)) still remain in the cluster
Calculate \(D_j/M_j\)

\[WADP=\frac{\sum_{j=1}^k{m_jD_j/M_j}}{\sum_{j=1}^k{m_j}}\]

Summary

Clustering pitfalls

Any data – even noise – can be clustered
It is quite possible for there to be several different classifications of the same set of objects.
It should be clear that any clustering produced should be related to the features in which the investigator in interested.