Normalization

Fall 2016

Biological vs. technical variability in gene expression

What is ultimately of interest in the use of gene expression microarrays is the measurement of differences between experimental and reference states or between different groups of experimental units.
Observed differences in microarray gene expression studies, however, are recognized as arising from two sources:
- Biological variability – changes in signal intensity driven by changes between biological states (healthy – disease)
- Technical variability – non-biological sources of variability

Sources of technical variability

Systematic

Amount of extracted RNA, efficiencies of RNA extraction, reverse transcription, labeling, photodetection, GC content of probes
Similar non-bilological effect on many measurements

Corrections can be estimated from data and accounted for by normalization

Sources of technical variability

Stochastic

PCR yield, DNA quality, spotting efficiency, spot size, non-specific hybridization, stray signal
Noise components & "Schmutz" (dirt)

Too random to be explicitly accounted for – need to use error modeling

Why normalization

Main idea

Remove the systematic bias in the data as completely possible while preserving the variation in the gene expression that occurs because of biologically relevant changes in transcription.

The purpose of normalization is to adjust the gene expression values so that all genes on the array that are not differentially expressed have similar values across all arrays.

Goal of normalization

Assumption

The average gene does not change in its expression level in the biological sample being tested.
Most genes are not differentially expressed
Up- and down-regulated genes roughly cancel out the expression effect.

Two categories of normalization methods

Baseline (reference) based methods

Use a reference set of selected genes (housekeeping, invariant, spike-ins), or a baseline array

Complete (global, scaling) methods

Combine information from all arrays in a given dataset

Reference set

Housekeeping genes - responsible for essential activities of cell maintenance & survival but not involved in cell function or proliferation. Such genes will be similarly expressed in all samples.
Control genes - serve as artificial housekeeping gene set that should have equal expression across arrays or channels

Reference set

Invariant set - genes that have the same rank across experiments. Empirically chosen
All genes - appropriate when the majority of the genes are believed to be not differentially expressed

Problems - defining reference sets may be biased. E.g., invariant set genes will be selected from the center of the distribution

Within- and between array normalization

Intra-slide normalization (within array)

Applies to two-channel arrays
Normalizes expression values to make intensities in two channels consistent within each array

Inter-slide normalization (between array)

Normalizes expression values to achieve consistency between arrays
Generally done after within-array normalization

Normalization procedure

The normalized signal intensity ratio for clone \(k\) on array \(j\) will be \[x_{jk}=\log{\frac{R_{jk}}{G_{jk}}} - c_{jk}\]

Where

\(R_{jk}\) - the (background adjusted) Red signal
\(G_{jk}\) - the (background adjusted) Green signal
\(c_{jk}\) - the normalization factor

Calculating \(c_{jk}\)

Global normalization
Intensity dependent normalization
- Lo(w)ess
- Invariant set
- Quantile

Global normalization

\(c_{jk}\) is the same for all genes on array \(j\).

Underlying assumptions

Red & Green intensities have ~linear relationship through the origin;
All cDNA species within a sample will incorporate an equivalent amount of dye per mole cDNA;
There are no other variables that contribute to dye bias across slides.

Calculating \(c_{jk}\)

A constant \(c_j\) equal to the mean or median of the log ratios may be subtracted from all spots on array \(j\). For example,

\[c_{jk}=c_j=median\left(log\frac{R_{jk}}{G_{jk}}\right)\]

for all clones/probes \(k\) in \(S\).

Alternatively, fit a linear regression and use the estimated slope parameter as the constant.

Disadvantages of global normalization

Does not account for non-linearity of signal intensities.
Assumes cDNA from both dyes hybridized equally.
More commonly, intensity dependent normalization methods are used.

Intensity-dependent normalization

Corrects intensities depending on the level of intensity, thereby changing the shape of the distribution of data

Bland Altman (MA) plots
Fitting a non-linear exponential curve
LOWESS/LOESS regression

Intensity dependent normalization

Here the correction is still

\[x_{jk}=\left(log\frac{R_{jk}}{G_{jk}}\right)-c_{jk}\]

but now \(c_{jk}\) is the lowess fit, or \(c_{jk}=f_j(A_{jk})\) where \(f\) is some smoothing function fitted to array \(j\) over all clones/probes \(k\) in \(S\).

Robust locally weighted regression of intensity log-ratios \(M_{jk}\) on the average log-intensity \(A_{jk}\) overall (global lowess) can be used for intensity dependent normalization.
Other methods such as smoothing splines or exponential fits may also work well.

Intensity-dependent normalization: LOWESS

LOcally WEighted Scatterplot Smoothing (Cleveland, 1979)
First proposed for microarrays by Yang et al. (2002). Yang et al (2002) used local window of 40%.
Global LOWESS use implicit assumptions that, when stratified by mRNA abundance,
- Only a minority of genes are expected to be differentially expressed or,
- Any differential expression is as likely to be up-regulation as well as down-regulation

Intensity-dependent normalization: LOWESS

Loess normalization is based on MA plots.
Skewing reflects experimental artifacts such as the contamination of one RNA source with genomic DNA or rRNA, the use of unequal amounts of fluorescent probes.

Intensity-dependent normalization: LOWESS

Skewing can be corrected with local smoother: fitting a local regression curve to the data and subtracting the predicted value from the observed values
Goal: minimize the standard deviation and place the mean log ratio at 0

Print-tip lowess

LOWESS fits to the data within print-tip groups
Sub-array normalization

Affymetrix Method

Scaling (Affymetrix method, sadd_whitepaper): First, choose a baseline GeneChip against which all other GeneChips are normalized.
Calculate the 2% trimmed mean expression for the baseline GeneChip, represented by \(\widetilde{x}_{base}\).
Calculate the 2% trimmed mean expression for the \(j^{th}\) GeneChip, represented by \(\widetilde{x}_j\).
The scaling factor is taken to be \(\beta_j=\widetilde{x}_{base}/\widetilde{x}_j\), so that the scaled values on GeneChip \(j\) are

\[x_{jk}^{scaled}=\beta_j*\widetilde{x}_{jk}\]

Rank invariant set

Rather than using all genes for normalization, one may want to restrict the set of genes used for normalization by identifying those that are invariant.
First, for each chip all genes are ranked; the invariant set is the set of genes with the same rank for each of the chips.
- This is usually a very small number hence typically genes with approximately the same rank are used.
Once the set of rank invariant genes is identified, intensity dependent normalization (fitting some smooth fit) is typically applied.

Quantile normalization

Motivation from quantile-quantile plot
Normal quantile-quantile plot consists of a plot of the ordered values in your data versus the corresponding quantiles of a standard normal distribution
If the normal qqplot is fairly linear, your data are reasonably Gaussian; otherwise, they are not.

Between-array normalization methods

Quantile normalization: Make distribution of data equal across all samples. Final distribution is the average of each quantile across chips (Bolstad et.al., Bioinformatics (2003))

Quantile normalization

Given \(n\) arrays of length \(p\), form matrix \(X\) of dimension \(p × n\) where each array is a column.
Sort each column of \(X\) to get \(X_{sort}\). Remember to original order
Take the means across rows of \(X_{sort}\) and replace the values of \(X\) by those means. The resulting matrix is \(X_{sort}'\).
Get \(X_{normalized}\) by rearranging each column of \(X_{sort}'\) to have the same ordering as original \(X\).

Quantile normalization changes expression over many slides i.e. changes the correlation structure of the data, may affect subsequent analysis.

Biological vs. technical variability in gene expression

Sources of technical variability

Sources of technical variability

Why normalization

Goal of normalization

Two categories of normalization methods

Reference set

Reference set

Within- and between array normalization

Normalization procedure

Calculating \(c_{jk}\)

Global normalization

Calculating \(c_{jk}\)

Disadvantages of global normalization

Intensity-dependent normalization

Intensity dependent normalization

Intensity-dependent normalization: LOWESS

Intensity-dependent normalization: LOWESS

Intensity-dependent normalization: LOWESS

Print-tip lowess

Affymetrix Method

Rank invariant set

Quantile normalization

Between-array normalization methods

Quantile normalization

Comparison of normalization techniques

Variance stabilizing normalization (VSN)