- Partitioning of a data set into subsets.
- A cluster is a group of relatively homogeneous cases or observations.
Given \(n\) objects, assign them to \(k\) groups (clusters) based on their similarity
No clustering function can simultaneously satisfy all three of Kleinberg's axioms: scale-invariance, richness, and consistency.
J. Kleinberg, "An Impossibility Theorem for Clustering," Advances in Neural Information Processing Systems (NIPS) 15, 2002. https://www.cs.cornell.edu/home/kleinber/nips15.pdf
Clustering works as expected only when the data to be clustered are preprocessed correctly
Normalize arrays: multiply all values in each column of data by a scale factor \(S\) so that the sum of the squares of the values in each column is 1.0 (a separate \(S\) is computed for each column); see the sketch below
Log-transforming centered genes is not the same as centering log-transformed genes.
In many cases, we are not interested in the absolute amplitudes of gene expression, but in the relative changes. Then, we standardize:
\[g_s = \frac{g - \bar{g}}{\sigma(g)}\]
Standardized gene expression vectors have a mean value of zero and a standard deviation of one.
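A minimal R sketch of both transformations, using a made-up 4×5 expression matrix (all names and values are illustrative):

```r
set.seed(1)
x <- matrix(rnorm(20, mean = 8, sd = 2), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("array", 1:5)))

# Normalize arrays: one scale factor S per column so each column's
# sum of squares becomes 1.0
S <- 1 / sqrt(colSums(x^2))
x_norm <- sweep(x, 2, S, `*`)
colSums(x_norm^2)            # all columns now have sum of squares 1.0

# Standardize genes: mean 0, standard deviation 1 per row
# (scale() works on columns, so transpose in and out)
x_std <- t(scale(t(x)))
round(rowMeans(x_std), 10)   # ~0 for every gene
apply(x_std, 1, sd)          # 1 for every gene
```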
\(Gene_1 = (E_{11}, E_{12}, \ldots, E_{1N})\)
\(Gene_2 = (E_{21}, E_{22}, \ldots, E_{2N})\)
\(Sample_1= (E_{11}, E_{21}, ..., E_{G1})\)
\(Sample_2= (E_{12}, E_{22}, ..., E_{G2})\)
\(E_{gi} = \text{expression of gene } g \text{ in sample } i\)
For all objects \(i\), \(j\), and \(h\): \(d(i,j) \ge 0\), with \(d(i,i) = 0\); \(d(i,j) = d(j,i)\) (symmetry); and \(d(i,j) \le d(i,h) + d(h,j)\) (triangle inequality)
Euclidean distance
Example: distance between gene 1 and gene 2:
\[d(Gene_1, Gene_2) = \sqrt{\sum_{i=1}^{N} (E_{1i} - E_{2i})^2}\]
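A quick R check of this formula on two made-up gene vectors, compared against stats::dist:

```r
gene1 <- c(2.1, 4.0, 3.3, 5.8)
gene2 <- c(1.9, 3.5, 4.1, 5.0)

sqrt(sum((gene1 - gene2)^2))   # explicit sqrt of the sum of squared differences
dist(rbind(gene1, gene2))      # same value from stats::dist (Euclidean default)
```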
Cosine similarity. From the Euclidean dot product between two non-zero vectors, \(A \cdot B = \|A\| \, \|B\| \cos\theta\), the cosine similarity is
\[\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}\]
Pearson correlation coefficient [-1, 1]
Each vector is centered by subtracting its own mean (Pearson correlation is the cosine similarity of mean-centered vectors)
Convert similarity \(s(i,j) \in [-1, 1]\) to dissimilarity \(d(i,j) \in [0, 1]\):
\[d(X_i, X_j) = \frac{1 - Cor(X_i, X_j)}{2}\]
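Both measures in base R, using the same toy vectors as above (values are made up):

```r
gene1 <- c(2.1, 4.0, 3.3, 5.8)
gene2 <- c(1.9, 3.5, 4.1, 5.0)

# Cosine similarity: dot product over the product of the vector norms
sum(gene1 * gene2) / (sqrt(sum(gene1^2)) * sqrt(sum(gene2^2)))

# Pearson correlation in [-1, 1], mapped to a dissimilarity in [0, 1]
(1 - cor(gene1, gene2)) / 2
```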
Jaccard coefficient, for binary variables (with \(a\) = number of 1–1 matches, \(b\), \(c\) = numbers of mismatches, \(d\) = number of 0–0 matches):
\[s_J = \frac{a}{a+b+c}\]
Tanimoto (Rogers–Tanimoto) coefficient:
\[s_T = \frac{a+d}{a+d+2(b+c)}\]
Both are similarities; the corresponding distances are \(1 - s_J\) and \(1 - s_T\).
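A sketch of the \(a\), \(b\), \(c\), \(d\) counts and both coefficients for two made-up binary vectors; stats::dist with method = "binary" returns the Jaccard distance \(1 - s_J\):

```r
x <- c(1, 1, 0, 1, 0, 0, 1, 0)
y <- c(1, 0, 0, 1, 1, 0, 1, 0)

a <- sum(x == 1 & y == 1)   # 1-1 matches
b <- sum(x == 1 & y == 0)   # 1 in x only
c <- sum(x == 0 & y == 1)   # 1 in y only
d <- sum(x == 0 & y == 0)   # 0-0 matches

a / (a + b + c)                        # Jaccard coefficient
(a + d) / (a + d + 2 * (b + c))        # Tanimoto coefficient
dist(rbind(x, y), method = "binary")   # 1 - Jaccard, from stats::dist
```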
Gower distance
J. C. Gower, "A General Coefficient of Similarity and Some of Its Properties," Biometrics, 1971. http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf
How the distance contribution is calculated for each variable \(f\) (contributions are then averaged over variables):
Quantitative: interval-scaled distance \(d_{ij}^{(f)}=\frac{|x_{if} - x_{jf}|}{R_f}\), where \(x_{if}\) is the value for object \(i\) in variable \(f\), and \(R_f\) is the range of variable \(f\) for all objects
Categorical: \(d_{ij}^{(f)} = 0\) when \(x_{if}\) and \(x_{jf}\) agree, and 1 otherwise
Ordinal: replace values by normalized ranks, then treat as interval-scaled using the range
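A minimal sketch with cluster::daisy on a made-up data frame containing one variable of each type:

```r
library(cluster)
df <- data.frame(
  weight = c(70, 82, 55),                        # quantitative (interval-scaled)
  group  = factor(c("A", "B", "A")),             # categorical
  stage  = ordered(c("I", "III", "II"),
                   levels = c("I", "II", "III")) # ordinal (normalized ranks)
)
daisy(df, metric = "gower")   # per-variable contributions, averaged
```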
Examples
| Function | Package | Distances |
|---|---|---|
| dist | stats | Euclidean, Manhattan, Canberra, maximum, binary |
| daisy | cluster, bioDist | Euclidean, Manhattan, Gower |
| distancematrix, distancevector | hopach | Euclidean, cor, cosine-angle (absolute versions) |
| vegdist | vegan | Jaccard, Gower, many others |
Other packages: cclust, e1071, flexmix, fpc, mclust, Mfuzz, class
\(S(n,1)=1\): one way to partition \(n\) objects into 1 group; likewise \(S(n,n)=1\) for \(n\) disjoint groups of one object each
\(S(n,2)=2^{n-1}-1\) ways to partition \(n\) objects into two non-empty groups
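A one-line illustration of how fast even this two-group count grows (the \(n\) values are chosen arbitrarily), which is why exhaustive search over all partitions is infeasible:

```r
n <- c(5, 10, 20, 50)
2^(n - 1) - 1   # ways to split n objects into two non-empty groups
```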
Both agglomerative and divisive approaches are only "step-wise" optimal: at each step the locally optimal split or merge is performed, and it cannot be undone later
Consequently, outliers will irreversibly change the clustering structure
Agglomerative/Bottom-Up
– Computationally simpler and more widely implemented.
– More "precision" at bottom of tree
– When looking for small clusters and/or many clusters, use agglomerative
Divisive/Top-Down
– More "precision" at top of tree.
– When looking for large and/or few clusters, use divisive
Results ARE sensitive to the choice of approach (and of distance and linkage)! A minimal comparison is sketched below.
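A side-by-side sketch, assuming the cluster package and the built-in USArrests data (both choices are illustrative):

```r
library(cluster)
d <- dist(USArrests)   # Euclidean distances on the built-in data set

hc_agg <- hclust(d, method = "average")   # bottom-up merging
hc_div <- diana(d)                        # top-down splitting

par(mfrow = c(1, 2))
plot(hc_agg, main = "Agglomerative", labels = FALSE)
pltree(hc_div, main = "Divisive")         # dendrogram of the diana object
```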
J. H. Ward, Jr., "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 1963. http://iv.slis.indiana.edu/sw/data/ward.pdf
An important issue is the form of input that Ward's method requires. For an input data matrix \(x\), R's hclust function must be called as hclust(dist(x)^2, method="ward"), although this is not mentioned in the function's documentation file.
Fionn Murtagh, "Ward's Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward's Criterion?" Journal of Classification, 2014. https://link.springer.com/article/10.1007/s00357-014-9161-z
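A sketch of the modern equivalents, assuming R >= 3.1.0, where the old method = "ward" was split into "ward.D" and "ward.D2", following the Murtagh (2014) paper cited above:

```r
x <- USArrests
hw1 <- hclust(dist(x)^2, method = "ward.D")   # squared distances, old "ward" behavior
hw2 <- hclust(dist(x),   method = "ward.D2")  # squares dissimilarities internally

all.equal(hw1$merge, hw2$merge)           # same merge sequence (up to ties)
all.equal(sqrt(hw1$height), hw2$height)   # heights should agree up to a square root
```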