library(DT)
library(limma)
# source('http://bioconductor.org/biocLite.R') # Import biocLite() function into R environment biocLite('limma')
library(limma)
# limmaUsersGuide() # Opens pdf manual for limma biocLite('samr')
library(samr)  # Load the library
library(affy)  # Load affy package 
library(GEOquery)
# Use biomaRt for annotation biocLite('biomaRt')
library(biomaRt)

SAM (Significance Analysis of Microarrays)

We’ll be comparing gene expression changes of lung cells in response to different bacterial strains. Data from Cobb et.al., 2004, obtained from GEO, GDS858

# Download GDS file, put it in the current directory, and load it: gds858 <- getGEO('GDS858', destdir='data')
gds858 <- getGEO(filename = "../data/GDS858.soft.gz", destdir = "data")  # If FTP doesn't work, read in from local file
eset <- GDS2eSet(gds858, do.log2 = TRUE)  # Convert the data to ESET object
## File stored at:
## /var/folders/tq/q7zhthbj7l574j7qwf1sm60m0000gq/T//RtmpMefjq2/GPL96.annot.gz
# help(ExpressionSet) # If needed, refresh your memory

Let’s check which infection statuses do we have:

pData(eset)$infection  # Let's check the infection status
##  [1] uninfected uninfected uninfected uninfected FRD875     FRD875     FRD875     FRD875     FRD1234    FRD1234    FRD1234    FRD1       FRD1       FRD1       FRD1       FRD440     FRD440     FRD440     FRD440    
## Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
table(pData(eset)$infection)  # How many samples are in each infection status
## 
##       FRD1    FRD1234     FRD440     FRD875 uninfected 
##          4          3          4          4          4

Select indexes of the columns associated with “uninfected” or “FRD440” infection status. Check that the right infection status was selected:

selected <- grep("uninfected|FRD440", pData(eset)$infection)  # Select indexes
pData(eset)$infection[selected]  # Check if the right infection status was selected
## [1] uninfected uninfected uninfected uninfected FRD440     FRD440     FRD440     FRD440    
## Levels: FRD1 FRD1234 FRD440 FRD875 uninfected

We selected 4 “uninfected” samples and 4 “FRD440” samples. Let’s make a vector of outcome measurements, defining groups of replicates. “Uninfected” samples are labeled as “1” and “FRD440” samples are labeled as “2”:

y <- c(rep(1, 4), rep(2, 4))  # Vector of outcome measurements
y  # Visualize it
## [1] 1 1 1 1 2 2 2 2

From the whole expression matrix we select a subset of columns corresponding to the selected indexes. We then quantile normalize the data:

exprs.selected <- exprs(eset)[, selected]  # Get a subset of expression values of the selected samples
exprs.selected.q <- normalizeQuantiles(exprs.selected)  # Quantile normalize the data

In order to perform the analysis, SAM requires a list object containing a data object with expression values in a form of p genes by n samples matrix (missing values allowed), a vector of length n of outcome measurements, vectors of gene names and gene IDs, both of length p, and a boolean indicating whether the data is log2-transformed. This object resembles ExpressionSet with slots having different data, but we assemble it as a list.

Now we have our expression matrix (exprs.selected.q), our vector of outcomes (y). Let’s use row names of the expression matrix as both gene names and IDs, and assemble our data:

genenames <- rownames(exprs(eset))  # Get row names = gene names or IDs
data <- list(x = exprs.selected.q, y = y, geneid = genenames, genenames = genenames, logged2 = TRUE)

SAM can perform a variety types of analyses, specified in “resp.type” parameter. Each analysis type requires specific formatting of outcome vector and expression data. Refer to help(samr) for details. Now, we’re performing “Two class unpaired” analysis:

samr.obj <- samr(data, resp.type = "Two class unpaired", nperms = 100)

Let’s check what we have. Everything obvious, isn’t it?

names(samr.obj)  # Get object names from samr.obj
##  [1] "n"                    "x"                    "xresamp"              "y"                    "argy"                 "censoring.status"     "testStatistic"        "nperms"               "nperms.act"          
## [10] "tt"                   "numer"                "sd"                   "sd.internal"          "s0"                   "s0.perc"              "evo"                  "perms"                "permsy"              
## [19] "nresamp"              "nresamp.perm"         "all.perms.flag"       "ttstar"               "ttstar0"              "eigengene"            "eigengene.number"     "pi0"                  "foldchange"          
## [28] "foldchange.star"      "sdstar.keep"          "resp.type"            "resp.type.arg"        "assay.type"           "stand.contrasts"      "stand.contrasts.star" "stand.contrasts.95"   "depth"               
## [37] "call"

Now, we have to choose the delta value that is able to give us the best compromise in terms of called genes, false genes and False Discovery Rate (FDR). In microarray analysis, is very important to have statistically robust results, but we have to keep in mind that too short gene lists are not able to describe the biological meaning of the experiment. In any case, keeping the FDR < 10% (the number of false positives is < 10%) is pretty safe in most cases.

In general, defining the cut-off is a subjective choice and there is no absolute best way to do it.

delta.table <- samr.compute.delta.table(samr.obj, min.foldchange = 1.5)  # Compute thresholds for different deltas
datatable(delta.table)  # Look at the whole range

Let’s select delta with median FDR <10% - subset the whole delta table and take the first row.

delta.table[delta.table[, "median FDR"] < 0.1, ][1, ]  # Check delta corresponding to median FDR ~0.1
##               delta     # med false pos 90th perc false pos            # called          median FDR       90th perc FDR               cutlo               cuthi 
##          1.44384287          3.45985729          6.91971458         38.00000000          0.09104888          0.18209775         -5.91083339          4.29822435
delta <- 1.5  # Select the delta

Let’s select the delta, and have a look at SAM plot:

samr.plot(samr.obj, delta)  # Check SAM plot

Do we have larger number of upregulated genes? Or downregulated? Let’s have a look at them:

siggenes.table <- samr.compute.siggenes.table(samr.obj, delta, data, delta.table, min.foldchange = 1.5)  # Summarize significant genes
names(siggenes.table)  # What data we have in the summary list
## [1] "genes.up"            "genes.lo"            "color.ind.for.multi" "ngenes.up"           "ngenes.lo"
nrow(siggenes.table$genes.up)  # How many upregulated genes
## [1] 31
nrow(siggenes.table$genes.lo)  # How many downregulated
## [1] 5
# Or
siggenes.table$ngenes.up
## [1] 31
siggenes.table$ngenes.lo
## [1] 5

Let’s have a look at actual differentially expressed genes:

datatable(siggenes.table$genes.up)  # Check how table with the results look like

We can write them in a clipboard (Windows only), then paste into Excel. Or export them to files:

# Write the results in clipboard write.table(siggenes.table$genes.up,'clipboard-128',sep='\t') Write the results in file
write.table(siggenes.table$genes.up, "results/genes.up.txt", sep = "\t")
write.table(siggenes.table$genes.lo, "results/genes.dn.txt", sep = "\t")

These results show pretty much everything we need to know. Except the Affy probe IDs do not tell any biological story up front - it would be nice to have at least gene names. Let’s extract those IDs and annotate them:

# Extract up- and downregulated IDs
up.ids <- siggenes.table$genes.up[, "Gene ID"]  # Upregulated
dn.ids <- siggenes.table$genes.lo[, "Gene ID"]  # Downregulated

We can simply get the platform data from GEO, and extract subsets of IDs:

# Annotate Affy probe IDs
Meta(gds858)$platform  # Check which platform do we have
## [1] "GPL96"
# gpl96 <- getGEO('GPL96', destdir = 'data') # Get the data for this platform from GEO
gpl96 <- getGEO(filename = "../data/GPL96.soft", destdir = "data")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, : not all columns named in 'colClasses' exist
datatable(Table(gpl96)[Table(gpl96)[, "ID"] %in% up.ids, c("ID", "Gene Symbol", "Gene Title")])  # Extract annotation for up.ids

Or, better, we can use biomaRt. It is a more flexible approach:

datatable(listDatasets(useMart("ensembl")))  # List available datasets
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # Load BIOMART dataset for homo sapiens
# Information - lists of filters and attributes
datatable(listFilters(mart))  # Filters, these are our IDs we'll be subsetting the BIOMART annotations on
datatable(listAttributes(mart))  # Attributes, these are annotations we would like to get for our IDs
Meta(gpl96)$title  # Check which microarray we have, to select the right attributes
## [1] "[HG-U133A] Affymetrix Human Genome U133A Array"
attr <- listAttributes(mart)  # Get all attributes as a table
attr[grep("affy", attr[, 1]), ]  # Parse them for anything that looks like from affymetrix
##                      name                    description         page
## 107          affy_hc_g110          Affy HC G110 probeset feature_page
## 108         affy_hg_focus         Affy HG FOCUS probeset feature_page
## 109   affy_hg_u133_plus_2   Affy HG U133-PLUS-2 probeset feature_page
## 110       affy_hg_u133a_2       Affy HG U133A_2 probeset feature_page
## 111         affy_hg_u133a         Affy HG U133A probeset feature_page
## 112         affy_hg_u133b         Affy HG U133B probeset feature_page
## 113        affy_hg_u95av2        Affy HG U95AV2 probeset feature_page
## 114          affy_hg_u95b          Affy HG U95B probeset feature_page
## 115          affy_hg_u95c          Affy HG U95C probeset feature_page
## 116          affy_hg_u95d          Affy HG U95D probeset feature_page
## 117          affy_hg_u95e          Affy HG U95E probeset feature_page
## 118          affy_hg_u95a          Affy HG U95A probeset feature_page
## 119         affy_hugenefl        Affy HuGene FL probeset feature_page
## 120          affy_hta_2_0          Affy HTA-2_0 probeset feature_page
## 121   affy_huex_1_0_st_v2   Affy HuEx 1_0 st v2 probeset feature_page
## 122 affy_hugene_1_0_st_v1 Affy HuGene 1_0 st v1 probeset feature_page
## 123 affy_hugene_2_0_st_v1 Affy HuGene 2_0 st v1 probeset feature_page
## 124        affy_primeview                 Affy primeview feature_page
## 125         affy_u133_x3p         Affy U133 X3P probeset feature_page

Now we know that our IDs correspond to “affy_hg_u133a” attributes in Biomart. Let’s extract annotations for them:

# Get annotations from Biomart
genes.up <- getBM(attributes = c("affy_hg_u133a", "external_gene_name", "description"), filters = "affy_hg_u133a", values = up.ids, mart = mart)  #, uniqueRows = T)
datatable(genes.up)  # Check your work

Do that for the downregulated genes, and write the results to file.

Clean your workspace

# Clean workspace unlink(c('GDS858.soft.gz','genes.up.txt','genes.dn.txt','GPL96.soft'))