Clustering QC

Mikhail Dozmorov

Fall 2016

Assess cluster fit and stability

  • Most often ignored.
  • Cluster structure is treated as reliable and precise
  • BUT! Clustering is generally VERY sensitive to noise and to outliers
  • Measure cluster quality based on how “tight” the clusters are.
  • Do genes in a cluster appear more similar to each other than genes in other clusters?

Clustering evaluation methods

  • Sum of squares
  • Homogeneity and Separation
  • Cluster Silhouettes and Silhouette coefficient: how similar each gene is to its own cluster compared with the other clusters
  • Rand index
  • Gap statistics
  • Cross-validation

Sum of squares

  • A good clustering yields clusters where genes have small within-cluster sum-of-squares (and high between-cluster sum-of-squares).
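The two quantities can be computed directly. This is a minimal Python sketch, assuming squared Euclidean distance and cluster centers taken as coordinate-wise means; the function name `sum_of_squares` and the toy data are illustrative and not from the slides.

```python
def sum_of_squares(points, labels):
    """Return (within-cluster SS, between-cluster SS) for labeled points."""
    n = len(points)
    dim = len(points[0])
    # Grand mean of all points
    grand = [sum(p[j] for p in points) / n for j in range(dim)]
    # Group points by cluster label
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    ssw = 0.0
    ssb = 0.0
    for members in clusters.values():
        center = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Squared distances of members to their own center
        ssw += sum((p[j] - center[j]) ** 2 for p in members for j in range(dim))
        # Squared distance of the center to the grand mean, weighted by cluster size
        ssb += len(members) * sum((center[j] - grand[j]) ** 2 for j in range(dim))
    return ssw, ssb

# Toy data (assumed): two well-separated groups of two points each
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = sum_of_squares(points, [0, 0, 1, 1])  # groups match the labels
bad = sum_of_squares(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

On this toy data the labeling that matches the groups gives a small within-cluster and a large between-cluster sum of squares, and the mismatched labeling gives the reverse.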

Homogeneity

  • Homogeneity is calculated as the average distance between each gene expression profile and the center of the cluster it belongs to

H_k=\frac{1}{N_g}\sum_{i \in k}{d(X_i, C(X_i))}

N_g - total number of genes in the cluster

Separation

Separation is calculated as the weighted average distance between cluster centers

S_{ave}=\frac{1}{\sum_{k \neq l}{N_k N_l}}\sum_{k \neq l}{N_k N_l d(C_k, C_l)}

Homogeneity and separation

– Homogeneity reflects the compactness of the clusters, while Separation reflects the overall distance between clusters
– Decreasing Homogeneity or increasing Separation suggests an improvement in the clustering results
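Both measures can be sketched in a few lines of Python. The sketch assumes Euclidean distance for d and the coordinate-wise mean as the cluster center; the helper names `euclid` and `homogeneity_separation` and the toy profiles are illustrative only.

```python
import math
from itertools import combinations

def euclid(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def homogeneity_separation(profiles, labels):
    """Return ({cluster: H_k}, S_ave), assuming mean centers and Euclidean d."""
    groups = {}
    for p, lab in zip(profiles, labels):
        groups.setdefault(lab, []).append(p)
    centers = {lab: [sum(col) / len(m) for col in zip(*m)]
               for lab, m in groups.items()}
    # H_k: average distance from each member to its own cluster center
    h = {lab: sum(euclid(p, centers[lab]) for p in m) / len(m)
         for lab, m in groups.items()}
    # S_ave: distance between centers, weighted by N_k * N_l.
    # Unordered pairs give the same ratio as summing over all k != l.
    num = den = 0.0
    for k, l in combinations(groups, 2):
        w = len(groups[k]) * len(groups[l])
        num += w * euclid(centers[k], centers[l])
        den += w
    return h, (num / den if den else float("nan"))

profiles = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
h, s_ave = homogeneity_separation(profiles, [0, 0, 1, 1])
print(h, s_ave)
```

Comparing `h` and `s_ave` across candidate clusterings of the same data shows which one is more compact and which one is better separated.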

Variance Ratio Criterion (VRC)

VRC_k=\frac{SSB/(k-1)}{SSW/(N-k)}

  • SSB – between-cluster variation
  • SSW – within-cluster variation
  • N – total number of genes, k – number of clusters

The goal is to maximize VRC_k over the number of clusters k

\kappa_k=(VRC_{k+1}-VRC_k)-(VRC_k-VRC_{k-1})

  • Select k to minimize the value of \kappa_k
  • Calinski & Harabasz (1974)
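A minimal Python sketch of the criterion follows, assuming squared Euclidean distance and mean centers. The function names `vrc` and `kappa` are illustrative, and the VRC values passed to `kappa` are made-up numbers used only to show the arithmetic.

```python
def vrc(points, labels):
    """Variance ratio criterion: (SSB / (k - 1)) / (SSW / (N - k))."""
    n = len(points)
    dim = len(points[0])
    grand = [sum(p[j] for p in points) / n for j in range(dim)]
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)
    ssw = ssb = 0.0
    for m in groups.values():
        center = [sum(p[j] for p in m) / len(m) for j in range(dim)]
        ssw += sum((p[j] - center[j]) ** 2 for p in m for j in range(dim))
        ssb += len(m) * sum((center[j] - grand[j]) ** 2 for j in range(dim))
    return (ssb / (k - 1)) / (ssw / (n - k))

def kappa(vrc_by_k, k):
    """Second difference of VRC at k; needs VRC for k - 1, k and k + 1."""
    return (vrc_by_k[k + 1] - vrc_by_k[k]) - (vrc_by_k[k] - vrc_by_k[k - 1])

# Toy 1-D data (assumed): two tight groups far apart
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
v2 = vrc(points, [0, 0, 1, 1])

# Hypothetical VRC values for k = 2, 3, 4, for illustration only
example = {2: 10.0, 3: 30.0, 4: 32.0}
k3 = kappa(example, 3)
print(v2, k3)
```

In the hypothetical values, VRC rises sharply up to k = 3 and then levels off, so the second difference at k = 3 is strongly negative, which is the pattern the selection rule looks for.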

Silhouette

  • Good clusters are those where the genes are close to each other compared to their next closest cluster.

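s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}

  • b(i)=\min_k(AVGD_{BETWEEN}(i,k)), the smallest average distance from gene i to the genes of any other cluster k
  • a(i)=AVGD_{WITHIN}(i), the average distance from gene i to the other genes in its own cluster
  • How well observation i matches the cluster assignment. Ranges -1 \le s(i) \le 1
  • Overall silhouette: SC=\frac{1}{N_g}\sum_{i=1}^{N_g}{s(i)}, where N_g here is the total number of genes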
  • Rousseeuw, Peter J. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 1987 http://www.sciencedirect.com/science/article/pii/0377042787901257
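The silhouette definitions can be sketched in plain Python. The sketch assumes Euclidean distance, at least two clusters, and the common convention that a gene alone in its cluster gets s(i) = 0; the names `euclid` and `silhouette` and the toy data are illustrative.

```python
import math

def euclid(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette(points, labels):
    """Return the list of silhouette widths s(i), one per point."""
    n = len(points)
    scores = []
    for i in range(n):
        # Sum and count of distances from i to every other point, per cluster
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            lab = labels[j]
            sums[lab] = sums.get(lab, 0.0) + euclid(points[i], points[j])
            counts[lab] = counts.get(lab, 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sums[own] / counts[own]                             # within-cluster average
        b = min(sums[k] / counts[k] for k in sums if k != own)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
s_good = silhouette(points, [0, 0, 1, 1])
s_bad = silhouette(points, [0, 1, 0, 1])
sc_good = sum(s_good) / len(s_good)  # overall silhouette SC
sc_bad = sum(s_bad) / len(s_bad)
print(sc_good, sc_bad)
```

The labeling that follows the two groups gives a clearly positive overall silhouette, while the labeling that cuts across them gives a negative one.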

Silhouette plot

  • The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
  • Silhouette width near +1 indicates points that are very distant from neighboring clusters.
  • Silhouette width near 0 indicates points that are not distinctly in one cluster or another.
  • Negative width indicates points that are probably assigned to the wrong cluster.

Rand index

Cluster multiple times

  • Clustering A: 1, 2, 2, 1, 1
  • Clustering B: 2, 1, 2, 1, 1

Compare pairs

  • a: the number of pairs assigned to the same cluster in A and in B
  • b: … different clusters in A and in B
  • c: … same in A, different in B
  • d: … same in B, different in A

Rand index

R=\frac{a+b}{a+b+c+d}

  • Adjust the Rand index to make it vary between -1 and 1 (negative if less than expected)

  • AdjRand = (Rand – expect(Rand)) / (max(Rand) – expect(Rand))
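The pair counts and the unadjusted Rand index for the clusterings A and B above can be checked with a short Python sketch. The function names `pair_counts` and `rand_index` are illustrative; the adjustment step is not implemented here.

```python
from itertools import combinations

def pair_counts(A, B):
    """Count pairs: a (same in both), b (different in both),
    c (same in A, different in B), d (same in B, different in A)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(A)), 2):
        same_a = A[i] == A[j]
        same_b = B[i] == B[j]
        if same_a and same_b:
            a += 1
        elif not same_a and not same_b:
            b += 1
        elif same_a:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(A, B):
    """Fraction of pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(A, B)
    return (a + b) / (a + b + c + d)

A = [1, 2, 2, 1, 1]
B = [2, 1, 2, 1, 1]
print(pair_counts(A, B))  # (1, 3, 3, 3)
print(rand_index(A, B))   # 0.4
```

Of the 10 pairs among the five objects, only the last two objects are together in both clusterings (a = 1), and three pairs are apart in both (b = 3), so the two clusterings agree on 4 of 10 pairs.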

Gap statistics

  • Cluster the observed data, varying the total number of clusters k=1, 2, ... K
  • For each cluster, calculate the sum of the pairwise distances between all points in that cluster

D_r=\sum_{i,i' \in C_r}{d_{ii'}}

  • Calculate within-cluster dispersion measures

W_k=\sum_{r=1}^k{\frac{1}{2n_r}D_r}
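The two formulas above can be sketched in Python. The sketch assumes that d is the squared Euclidean distance and that the sum in D_r runs over ordered pairs (i, i'), which is what the factor 1/(2n_r) suggests; under these assumptions W_k equals the pooled within-cluster sum of squares. The names `sq_dist` and `within_dispersion` are illustrative.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (assumed choice of d)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def within_dispersion(points, labels):
    """W_k = sum over clusters r of D_r / (2 * n_r)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    w_k = 0.0
    for members in groups.values():
        # D_r: sum of distances over all ordered pairs within the cluster
        d_r = sum(sq_dist(x, y) for x in members for y in members)
        w_k += d_r / (2 * len(members))
    return w_k

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
w1 = within_dispersion(points, [0, 0, 0, 0])  # k = 1
w2 = within_dispersion(points, [0, 0, 1, 1])  # k = 2
print(w1, w2)
```

On this toy data W_k drops sharply when going from one cluster to two, which is the kind of change in dispersion the method tracks as k varies.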

Gap statistics

Cross-validation approaches

  • Cluster while leaving out k experiments (or genes)
  • Measure how well cluster groups are preserved in left out experiment(s)
  • Or, measure agreement between test and training set

Clustering validity

  • Hypothesis: if the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the distance vector

WADP - robustness of clustering

  • If the input data deviate slightly from their current value, will we get the same clustering?
    – Important in microarray expression data analysis because of constant noise

Bittner M. et al. "Molecular classification of cutaneous malignant melanoma by gene expression profiling" Nature 2000 http://www.nature.com/nature/journal/v406/n6795/full/406536A0.html

WADP - robustness of clustering

  • Perturb each original gene expression profile by adding random noise drawn from N(0, 0.01)
  • Re-normalize the data, cluster
  • Cluster-specific discrepancy rate: D/M. That is, for the M pairs of genes in an original cluster, count the number of gene pairs, D, that do not remain together in the clustering of the perturbed data, and take their ratio.
  • The overall discrepancy ratio is the weighted average of the cluster-specific discrepancy rates.

WADP - robustness of clustering

  • If there were originally m_j genes in the cluster j, then there are M_j=m_j(m_j-1)/2 pairs of genes
  • In the new clustering, identify how many of these pairs (D_j) no longer remain together in the same cluster
  • Calculate D_j/M_j

WADP=\frac{\sum_{j=1}^k{m_jD_j/M_j}}{\sum_{j=1}^k{m_j}}
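The WADP formula can be sketched in Python from two label vectors, one from the original clustering and one from the clustering of the perturbed data. The sketch takes D_j to be the number of gene pairs from original cluster j that do not stay together, following the discrepancy-rate definition, and it assumes that a single-gene cluster (no pairs) contributes a rate of 0. The function name `wadp` and the label vectors are illustrative.

```python
from itertools import combinations

def wadp(original, perturbed):
    """Weighted average of the cluster-specific discrepancy rates D_j / M_j."""
    groups = {}
    for idx, lab in enumerate(original):
        groups.setdefault(lab, []).append(idx)
    num = 0.0
    den = 0
    for members in groups.values():
        m_j = len(members)
        M_j = m_j * (m_j - 1) // 2  # number of gene pairs in original cluster j
        # D_j: pairs that are split apart in the perturbed clustering
        D_j = sum(1 for i, j in combinations(members, 2)
                  if perturbed[i] != perturbed[j])
        rate = D_j / M_j if M_j > 0 else 0.0  # assumed: singletons contribute 0
        num += m_j * rate
        den += m_j
    return num / den

original = [0, 0, 0, 1, 1]
perturbed = [0, 0, 1, 1, 1]  # the third gene moved to the other cluster
score = wadp(original, perturbed)
print(score)  # 0.4
```

Here the first cluster loses 2 of its 3 pairs and the second loses none, so the weighted average is (3 * 2/3 + 2 * 0) / 5 = 0.4. Identical clusterings give 0.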

Summary

Clustering pitfalls

  • Any data – even noise – can be clustered
  • It is quite possible for there to be several different classifications of the same set of objects.
  • It should be clear that any clustering produced should be related to the features in which the investigator is interested.