library(DT)
library(limma)
# source('http://bioconductor.org/biocLite.R') # Import biocLite() function into R environment biocLite('limma')
library(limma)
# limmaUsersGuide() # Opens pdf manual for limma biocLite('samr')
library(samr) # Load the library
library(affy) # Load affy package
library(GEOquery)
# Use biomaRt for annotation biocLite('biomaRt')
library(biomaRt)
We’ll be comparing gene expression changes of lung cells in response to different bacterial strains. Data from Cobb et.al., 2004, obtained from GEO, GDS858
# Download GDS file, put it in the current directory, and load it: gds858 <- getGEO('GDS858', destdir='data')
gds858 <- getGEO(filename = "../data/GDS858.soft.gz", destdir = "data") # If FTP doesn't work, read in from local file
eset <- GDS2eSet(gds858, do.log2 = TRUE) # Convert the data to ESET object
## File stored at:
## /var/folders/tq/q7zhthbj7l574j7qwf1sm60m0000gq/T//RtmpMefjq2/GPL96.annot.gz
# help(ExpressionSet) # If needed, refresh your memory
Let’s check which infection statuses do we have:
pData(eset)$infection # Let's check the infection status
## [1] uninfected uninfected uninfected uninfected FRD875 FRD875 FRD875 FRD875 FRD1234 FRD1234 FRD1234 FRD1 FRD1 FRD1 FRD1 FRD440 FRD440 FRD440 FRD440
## Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
table(pData(eset)$infection) # How many samples are in each infection status
##
## FRD1 FRD1234 FRD440 FRD875 uninfected
## 4 3 4 4 4
Select indexes of the columns associated with “uninfected” or “FRD440” infection status. Check that the right infection status was selected:
selected <- grep("uninfected|FRD440", pData(eset)$infection) # Select indexes
pData(eset)$infection[selected] # Check if the right infection status was selected
## [1] uninfected uninfected uninfected uninfected FRD440 FRD440 FRD440 FRD440
## Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
We selected 4 “uninfected” samples and 4 “FRD440” samples. Let’s make a vector of outcome measurements, defining groups of replicates. “Uninfected” samples are labeled as “1” and “FRD440” samples are labeled as “2”:
y <- c(rep(1, 4), rep(2, 4)) # Vector of outcome measurements
y # Visualize it
## [1] 1 1 1 1 2 2 2 2
From the whole expression matrix we select a subset of columns corresponding to the selected indexes. We then quantile normalize the data:
exprs.selected <- exprs(eset)[, selected] # Get a subset of expression values of the selected samples
exprs.selected.q <- normalizeQuantiles(exprs.selected) # Quantile normalize the data
In order to perform the analysis, SAM requires a list object containing a data object with expression values in a form of p genes by n samples matrix (missing values allowed), a vector of length n of outcome measurements, vectors of gene names and gene IDs, both of length p, and a boolean indicating whether the data is log2-transformed. This object resembles ExpressionSet with slots having different data, but we assemble it as a list.
Now we have our expression matrix (exprs.selected.q), our vector of outcomes (y). Let’s use row names of the expression matrix as both gene names and IDs, and assemble our data:
genenames <- rownames(exprs(eset)) # Get row names = gene names or IDs
data <- list(x = exprs.selected.q, y = y, geneid = genenames, genenames = genenames, logged2 = TRUE)
SAM can perform a variety types of analyses, specified in “resp.type” parameter. Each analysis type requires specific formatting of outcome vector and expression data. Refer to help(samr)
for details. Now, we’re performing “Two class unpaired” analysis:
samr.obj <- samr(data, resp.type = "Two class unpaired", nperms = 100)
Let’s check what we have. Everything obvious, isn’t it?
names(samr.obj) # Get object names from samr.obj
## [1] "n" "x" "xresamp" "y" "argy" "censoring.status" "testStatistic" "nperms" "nperms.act"
## [10] "tt" "numer" "sd" "sd.internal" "s0" "s0.perc" "evo" "perms" "permsy"
## [19] "nresamp" "nresamp.perm" "all.perms.flag" "ttstar" "ttstar0" "eigengene" "eigengene.number" "pi0" "foldchange"
## [28] "foldchange.star" "sdstar.keep" "resp.type" "resp.type.arg" "assay.type" "stand.contrasts" "stand.contrasts.star" "stand.contrasts.95" "depth"
## [37] "call"
Now, we have to choose the delta value that is able to give us the best compromise in terms of called genes, false genes and False Discovery Rate (FDR). In microarray analysis, is very important to have statistically robust results, but we have to keep in mind that too short gene lists are not able to describe the biological meaning of the experiment. In any case, keeping the FDR < 10% (the number of false positives is < 10%) is pretty safe in most cases.
In general, defining the cut-off is a subjective choice and there is no absolute best way to do it.
delta.table <- samr.compute.delta.table(samr.obj, min.foldchange = 1.5) # Compute thresholds for different deltas
datatable(delta.table) # Look at the whole range
Let’s select delta with median FDR <10% - subset the whole delta table and take the first row.
delta.table[delta.table[, "median FDR"] < 0.1, ][1, ] # Check delta corresponding to median FDR ~0.1
## delta # med false pos 90th perc false pos # called median FDR 90th perc FDR cutlo cuthi
## 1.44384287 3.45985729 6.91971458 38.00000000 0.09104888 0.18209775 -5.91083339 4.29822435
delta <- 1.5 # Select the delta
Let’s select the delta, and have a look at SAM plot:
samr.plot(samr.obj, delta) # Check SAM plot
Do we have larger number of upregulated genes? Or downregulated? Let’s have a look at them:
siggenes.table <- samr.compute.siggenes.table(samr.obj, delta, data, delta.table, min.foldchange = 1.5) # Summarize significant genes
names(siggenes.table) # What data we have in the summary list
## [1] "genes.up" "genes.lo" "color.ind.for.multi" "ngenes.up" "ngenes.lo"
nrow(siggenes.table$genes.up) # How many upregulated genes
## [1] 31
nrow(siggenes.table$genes.lo) # How many downregulated
## [1] 5
# Or
siggenes.table$ngenes.up
## [1] 31
siggenes.table$ngenes.lo
## [1] 5
Let’s have a look at actual differentially expressed genes:
datatable(siggenes.table$genes.up) # Check how table with the results look like
We can write them in a clipboard (Windows only), then paste into Excel. Or export them to files:
# Write the results in clipboard write.table(siggenes.table$genes.up,'clipboard-128',sep='\t') Write the results in file
write.table(siggenes.table$genes.up, "results/genes.up.txt", sep = "\t")
write.table(siggenes.table$genes.lo, "results/genes.dn.txt", sep = "\t")
These results show pretty much everything we need to know. Except the Affy probe IDs do not tell any biological story up front - it would be nice to have at least gene names. Let’s extract those IDs and annotate them:
# Extract up- and downregulated IDs
up.ids <- siggenes.table$genes.up[, "Gene ID"] # Upregulated
dn.ids <- siggenes.table$genes.lo[, "Gene ID"] # Downregulated
We can simply get the platform data from GEO, and extract subsets of IDs:
# Annotate Affy probe IDs
Meta(gds858)$platform # Check which platform do we have
## [1] "GPL96"
# gpl96 <- getGEO('GPL96', destdir = 'data') # Get the data for this platform from GEO
gpl96 <- getGEO(filename = "../data/GPL96.soft", destdir = "data")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, : not all columns named in 'colClasses' exist
datatable(Table(gpl96)[Table(gpl96)[, "ID"] %in% up.ids, c("ID", "Gene Symbol", "Gene Title")]) # Extract annotation for up.ids
Or, better, we can use biomaRt. It is a more flexible approach:
datatable(listDatasets(useMart("ensembl"))) # List available datasets
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # Load BIOMART dataset for homo sapiens
# Information - lists of filters and attributes
datatable(listFilters(mart)) # Filters, these are our IDs we'll be subsetting the BIOMART annotations on
datatable(listAttributes(mart)) # Attributes, these are annotations we would like to get for our IDs
Meta(gpl96)$title # Check which microarray we have, to select the right attributes
## [1] "[HG-U133A] Affymetrix Human Genome U133A Array"
attr <- listAttributes(mart) # Get all attributes as a table
attr[grep("affy", attr[, 1]), ] # Parse them for anything that looks like from affymetrix
## name description page
## 107 affy_hc_g110 Affy HC G110 probeset feature_page
## 108 affy_hg_focus Affy HG FOCUS probeset feature_page
## 109 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset feature_page
## 110 affy_hg_u133a_2 Affy HG U133A_2 probeset feature_page
## 111 affy_hg_u133a Affy HG U133A probeset feature_page
## 112 affy_hg_u133b Affy HG U133B probeset feature_page
## 113 affy_hg_u95av2 Affy HG U95AV2 probeset feature_page
## 114 affy_hg_u95b Affy HG U95B probeset feature_page
## 115 affy_hg_u95c Affy HG U95C probeset feature_page
## 116 affy_hg_u95d Affy HG U95D probeset feature_page
## 117 affy_hg_u95e Affy HG U95E probeset feature_page
## 118 affy_hg_u95a Affy HG U95A probeset feature_page
## 119 affy_hugenefl Affy HuGene FL probeset feature_page
## 120 affy_hta_2_0 Affy HTA-2_0 probeset feature_page
## 121 affy_huex_1_0_st_v2 Affy HuEx 1_0 st v2 probeset feature_page
## 122 affy_hugene_1_0_st_v1 Affy HuGene 1_0 st v1 probeset feature_page
## 123 affy_hugene_2_0_st_v1 Affy HuGene 2_0 st v1 probeset feature_page
## 124 affy_primeview Affy primeview feature_page
## 125 affy_u133_x3p Affy U133 X3P probeset feature_page
Now we know that our IDs correspond to “affy_hg_u133a” attributes in Biomart. Let’s extract annotations for them:
# Get annotations from Biomart
genes.up <- getBM(attributes = c("affy_hg_u133a", "external_gene_name", "description"), filters = "affy_hg_u133a", values = up.ids, mart = mart) #, uniqueRows = T)
datatable(genes.up) # Check your work
Do that for the downregulated genes, and write the results to file.
Clean your workspace
# Clean workspace unlink(c('GDS858.soft.gz','genes.up.txt','genes.dn.txt','GPL96.soft'))