Slides: Intro to RStudio
Slides: Intro to R
Slides: Intro to RMarkdown
Slides: Intro to Mathjax
Slides: Intro to Presentations
Markdown demo
ioslides presentation template
Beamer presentation example
“Data Analysis and Visualization Using R.” This is a course that combines video, HTML and interactive elements to teach the statistical programming language R. http://varianceexplained.org/RData/
“R for beginners” book by Emmanuel Paradis, covers basics of R language. https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
A very short R introduction, https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
RStudio tutorial slides, http://dss.princeton.edu/training/RStudio101.pdf
R reference cards: Tom Short’s R reference card, https://cran.r-project.org/doc/contrib/Short-refcard.pdf, and R reference card 2.0 by Matt Baggott, https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf
R Markdown Reference Guide, including knitr and pandoc for presentations. https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
“R Programming for Data Science” bookdown by Roger Peng. From R basics to parallel programming. https://bookdown.org/rdpeng/rprogdatascience/
The “R” course, by Jonathan D. Rosenblatt, http://www.john-ros.com/Rcourse/. Git repository at https://github.com/johnros/Rcourse
“R for Data Science” by Garrett Grolemund and Hadley Wickham, http://r4ds.had.co.nz/. Git repository at https://github.com/hadley/r4ds
“Efficient R programming,” a book by Colin Gillespie, https://github.com/csgillespie/efficientR
Swirl R package, and swirl-related courses, http://swirlstats.com/, https://github.com/swirldev/swirl_courses
Interactive R basics tutorial, http://tryr.codeschool.com/
A curated list of awesome R frameworks, packages and software. Very rich list of references. https://github.com/garthtarr/awesome-R
Intro to R, short videos from Google developers, https://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP
Several R related playlists for novices, https://www.youtube.com/user/TheLearnR/playlists
Tips for giving a great talk, by Kayvon Fatahalian, https://www.cs.cmu.edu/~kayvonf/misc/cleartalktips.pdf
Video: “How to Speak”, Lecture Tips from Patrick Winston, https://vimeo.com/101543862
The IMRaD (Introduction – Method – Results – and – Discussion) format, http://sokogskriv.no/en/writing/structure/the-imrad-format/
The ICMJE standards of authorship, http://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html.
The Scientific Writing Resource, https://cgi.duke.edu/web/sciwriting/
Slides: Dplyr and ggplot2 basics
Tutorial: dplyr_ggplot2.R
Tutorial: Base R graphics
Tutorial: ggplot2
Data Manipulation Using R (& dplyr). PDF slides available at https://ramnarasimhan.files.wordpress.com/2014/10/data-manipulation-using-r_acm2014.pdf, and http://www.slideshare.net/Ram-N/data-manipulation-using-r-acm2014
Data Manipulation with dplyr. http://datascienceplus.com/data-manipulation-with-dplyr/
“Aggregating and analyzing data with dplyr” by Data Carpentry. http://www.datacarpentry.org/R-genomics/04-dplyr.html
Do your “data janitor work” like a boss with dplyr. http://www.gettinggeneticsdone.com/2014/08/do-your-data-janitor-work-like-boss.html
“Data visualization in R” by Data Carpentry. http://www.datacarpentry.org/R-genomics/05-data-visualization.html
“ggplot2 tutorial/slides/code examples/references” by Jenny Bryan. https://github.com/jennybc/ggplot2-tutorial
“R Graph Catalog”, visuals and code examples of graphs made with ggplot2. http://shiny.stat.ubc.ca/r-graph-catalog/
R Seminar: Introduction to ggplot2, comprehensive introduction, from UCLA. http://www.ats.ucla.edu/stat/r/seminars/ggplot2_intro/ggplot2_intro.htm
Well-documented ggplot2 tutorial from the Institute for Quantitative Social Science at Harvard, http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
Exploratory Data Analysis with R graphics by Roger Peng, https://www.youtube.com/playlist?list=PLjTlxb-wKvXPhZ7tQwlROtFjorSj9tUyZ, his other video R tutorials
Introduction to Data Visualization with R and ggplot2, https://www.youtube.com/watch?v=49fADBfcDD4&t=
“Data Visualization for Social Science”, A practical introduction with R and ggplot2, free book by Kieran Healy, Duke University http://socviz.co/
One-pager simple git guide. https://rogerdudler.github.io/git-guide/
One-pager of git commands. https://github.com/kbroman/Tools4RR/blob/master/04_Git/GitCommands/git_notes.md
Learn git interactively in 15 min. https://try.github.io/levels/1/challenges/1
Interactive git branching tutorial. http://learngitbranching.js.org/
“Git and GitHub guide” by Karl Broman. http://kbroman.org/github_tutorial/
Software Carpentry course on git. https://swcarpentry.github.io/git-novice/
Book “Version Control by Example” by Eric Sink. http://ericsink.com/vcbe/
Book(down) “Happy Git and GitHub for the useR” by Jenny Bryan. http://happygitwithr.com/
How to create pull requests. https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/
Quick Git and GitHub videos. http://www.dataschool.io/git-and-github-videos-for-beginners/
GitHub training videos. https://www.youtube.com/user/GitHubGuides/videos
http://biology-pages.info/ - Kimball biology pages
http://atlasgeneticsoncology.org/GeneticEng.html - educational items in human genetics
https://www.youtube.com/watch?v=NR0mdDJMHIQ - Mitotic cell division
http://www.hhmi.org/biointeractive/dna-packaging - how DNA is packaged into nucleosomes-chromosomes
https://www.youtube.com/watch?v=TNKWgcFPHqw - DNA replication 3D
https://www.youtube.com/watch?v=41_Ne5mS2ls and https://www.youtube.com/watch?v=5bLEDd-PSTQ - Transcription and Translation
http://vcell.ndsu.nodak.edu/animations/transcription/movie-flash.htm - mRNA transcription
http://vcell.ndsu.nodak.edu/animations/mrnaprocessing/movie-flash.htm - mRNA processing
http://vcell.ndsu.nodak.edu/animations/translation/movie-flash.htm - mRNA translation
http://sepuplhs.org/high/sgi/teachers/genetics_act16_sim.html - mRNA translation demo and exercise
http://learn.genetics.utah.edu/content/basics/transcribe/TranscribeTranslate.swf - transcribe and translate exercise
http://www.hhmi.org/biointeractive/coding-sequences-dna - Size of the genome and coding sequences
https://www.youtube.com/watch?v=46FC0OGzsSs - DNA structure and genetics 3D animation, self-watch
Slides: Microarray technology
Slides: Microarray databases
Microarray data
Tumor Analysis Best Practices Working Group. “Expression Profiling–Best Practices for Data Generation and Interpretation in Clinical Trials.” Nature Reviews. Genetics 5, no. 3 (March 2004): 229–37. doi:10.1038/nrg1297. https://www.ncbi.nlm.nih.gov/pubmed/14970825 - Microarray overview. Best practices for technology and analysis.
Lipshutz, R. J., S. P. Fodor, T. R. Gingeras, and D. J. Lockhart. “High Density Synthetic Oligonucleotide Arrays.” Nature Genetics 21, no. 1 Suppl (January 1999): 20–24. doi:10.1038/4447. https://www.ncbi.nlm.nih.gov/pubmed/9915496 - Description of Affymetrix technology
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
https://youtu.be/MRmpeBTwwWw - photolithographic manufacturing of oligonucleotide arrays
Slides: Image analysis
Log2 transformation
Mann-Whitney test
Trimmed mean exact formula
Mann Whitney step-by-step example, http://statweb.stanford.edu/~susan/courses/s141/hononpara.pdf
Robust summaries and log transformation, https://genomicsclass.github.io/book/pages/robust_summaries.html
Exploratory data analysis, https://genomicsclass.github.io/book/pages/exploratory_data_analysis.html
Slides: Bioconductor
Slides: Annotation, gene IDs
R: Install Bioconductor
R: ExpressionSet
R: Biomart
R: Annotation
ExpressionSet, SummarizedExperiment, GEOquery, biomaRt, week 3 from https://kasperdanielhansen.github.io/genbioconductor/
ExpressionSet and SummarizedExperiment, http://www.sthda.com/english/wiki/expressionset-and-summarizedexperiment
TutorialDurinck.ppt - detailed presentation of Bioconductor, analysis, http://www.nettab.org/2003/docs/TutorialDurinck.ppt
Slides: Quality control, spotted arrays
Slides: Quality control, Affymetrix arrays
Exercises and data
lab/MAplots_QC_spotted.R - Reading in two spotted arrays, playing with regular and Bland-Altman plots. Uses data_spotted files.
lab/MAplots_affy.R - constructing ExpressionSet, visualizing it as scatterplot, MA plot, smoothScatter. Volcano plot. Uses data_eset files.
data_analysis_fundamentals_manual.pdf - Affy arrays QC metrics, detection p-value (Wilcoxon), log-transformation. Read Chapters 1-2, and “Brief Information on the Databases Available on the NetAffx Analysis Center” at the end. http://www2.stat.duke.edu/~mw/ABS04/RefInfo/data_analysis_fundamentals_manual.pdf
Lipshutz, R. J., S. P. Fodor, T. R. Gingeras, and D. J. Lockhart. “High Density Synthetic Oligonucleotide Arrays.” Nature Genetics 21, no. 1 Suppl (January 1999): 20–24. doi:10.1038/4447. - Description of Affymetrix technology. PM-MM
Gautier, Laurent, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. “Affy–Analysis of Affymetrix GeneChip Data at the Probe Level.” Bioinformatics (Oxford, England) 20, no. 3 (February 12, 2004): 307–15. doi:10.1093/bioinformatics/btg405. - affy package description and details of probe-level analysis
Archer, Kellie J., Catherine I. Dumur, Suresh E. Joel, and Viswanathan Ramakrishnan. “Assessing Quality of Hybridized RNA in Affymetrix GeneChip Experiments Using Mixed-Effects Models.” Biostatistics (Oxford, England) 7, no. 2 (April 2006): 198–212. doi:10.1093/biostatistics/kxj001. - 3’/5’ ratio
Slides: Normalization
Lecture notes, normalization
Quantile normalization example
Exercises and data
lab/01_lowess_curve_fit_demo.R - Non-linear curve fitting exercise
lab/02_lowess_by_hand_demo.R - Manually fitting Loess curves on wine dataset
lab/03_lowess_2color.R - Loess on two-color arrays. MA diagnostic plots, global and print-tip loess normalization. Uses data_spotted files from 05b_Quality.
lab/04_normalization_affy.R - quantile normalization of one-color array. Uses data_affy files from 05b_Quality.
lab/05_quantile_demo.R - QQplot manual demo, affy quantile normalization. Uses data_affy files from 05b_Quality.
Bolstad, B. M., R. A. Irizarry, M. Astrand, and T. P. Speed. “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias.” Bioinformatics (Oxford, England) 19, no. 2 (January 22, 2003): 185–93. - Normalization methods description.
Cleveland, William S, and Susan J Devlin. “Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting.” Journal of the American Statistical Association 83, no. 403 (1988): 596–610. - Loess regression. Concept, statistics.
Fan, Jianqing, and Yi Ren. “Statistical Analysis of DNA Microarray Data in Cancer Research.” Clinical Cancer Research: An Official Journal of the American Association for Cancer Research 12, no. 15 (August 1, 2006): 4469–73. doi:10.1158/1078-0432.CCR-06-1033. - Steps in microarray data analysis, from preprocessing to differential expression and time course. Brief.
Carvalho, Benilton S., and Rafael A. Irizarry. “A Framework for Oligonucleotide Microarray Preprocessing.” Bioinformatics (Oxford, England) 26, no. 19 (October 1, 2010): 2363–67. doi:10.1093/bioinformatics/btq431. - Preprocessing for different microarray types - Affy, Illumina, Nimblegen - , and platforms - SNP, Exon, Expression, Tiling. Probe affinity effect figure
Huber, Wolfgang, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. “Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression.” Bioinformatics (Oxford, England) 18 Suppl 1 (2002): S96-104. - VSN paper. aka VST. arsinh: https://www.geogebra.org/m/f5gdhrmT
On non-linear curve fitting and goodness-of-fit: “Technical note: Curve fitting with the R Environment for Statistical Computing”, PDF, http://www.css.cornell.edu/faculty/dgr2/teach/R/R_CurveFit.pdf
qsmooth R package - Smooth quantile normalization (qsmooth) is a generalization of quantile normalization that averages the two types of assumptions about the data generation process: quantile normalization and quantile normalization between groups. https://github.com/stephaniehicks/qsmooth. Paper: Hicks, Stephanie C, Kwame Okrah, Joseph N Paulson, John Quackenbush, Rafael A Irizarry, and Hector Corrada Bravo. “Smooth Quantile Normalization,” November 2, 2016. http://biorxiv.org/lookup/doi/10.1101/085175. - Per-group quantile normalization accounting for the differences in variation among groups
Ballman, Karla V., Diane E. Grill, Ann L. Oberg, and Terry M. Therneau. “Faster Cyclic Loess: Normalizing RNA Arrays via Linear Models.” Bioinformatics (Oxford, England) 20, no. 16 (November 1, 2004): 2778–86. doi:10.1093/bioinformatics/bth327. - loess how-to. Method description. cyclic loess. Parallel implementation. Quantile normalization as another non-parametric normalization.
Hurvich, Clifford M., Jeffrey S. Simonoff, and Chih-Ling Tsai. “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion.” Journal of the Royal Statistical Society. Series B (Statistical Methodology) 60, no. 2 (1998): 271–93. http://www.jstor.org/stable/2985940.
Heider, Andreas, and Rüdiger Alt. “VirtualArray: A R/Bioconductor Package to Merge Raw Data from Different Microarray Platforms.” BMC Bioinformatics 14 (2013): 75. doi:10.1186/1471-2105-14-75. - Meta-analysis of multiple microarrays in GEO. Techniques for normalization and batch effect removal
“Automated parameter selection for LOESS regression.” http://blog.eighty20.co.za//technique%20review/2016/02/11/loess-Allan/
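Several of the references above (Bolstad et al. 2003, the qsmooth description) rely on plain quantile normalization. A minimal pure-Python sketch of the idea follows, assuming each array is a list of intensities of equal length and that ties may simply be broken by position (a simplification; the function name `quantile_normalize` and the toy data are illustrative and not taken from any package listed here).

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one column per array)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    # Sort each column, then average across columns at each rank
    # to build the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # order[r] is the index of the r-th smallest value in this column.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value with the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: three arrays, four probes each (made-up numbers).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After normalization every array has exactly the same set of values; only their order within each array differs.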
Slides: Expression summarization
Lecture notes: Expression summarization
Li & Wong summarization example
Median polish example
RMA example 1
RMA example 2
Exercises and data
lab/Tukey_MAS5.R - normalization example. Uses data from 05b_Quality/lab/data_affy/
lab/median_polish.R - median polish example. Uses data from 05b_Quality/lab/data_affy/
sadd_whitepaper.pdf - Statistical Algorithms Description Document. Summarization, noise, background correction, ideal mismatch. P/A calls (Wilcoxon). http://tools.thermofisher.com/content/sfs/brochures/sadd_whitepaper.pdf
Li, C., and W. Hung Wong. “Model-Based Analysis of Oligonucleotide Arrays: Model Validation, Design Issues and Standard Error Application.” Genome Biology 2, no. 8 (2001): RESEARCH0032. http://www.pnas.org/content/98/1/31.long
Irizarry, Rafael A., Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics (Oxford, England) 4, no. 2 (April 2003): 249–64. doi:10.1093/biostatistics/4.2.249. https://academic.oup.com/biostatistics/article/4/2/249/245074/Exploration-normalization-and-summaries-of-high
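The median polish and RMA examples above rest on Tukey's median polish, which decomposes a table into an overall term, row effects, column effects, and residuals. The sketch below is a minimal pure-Python version under stated assumptions: rows are probes, columns are arrays, values are already on the log2 scale, and the function name, iteration limit, and toy numbers are illustrative only.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Decompose table[i][j] into overall + row[i] + col[j] + resid[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [value - row_med[i] for value in resid[i]]
        # Move the median of the column effects into the overall term.
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Toy probe set: four probes (rows) measured on three arrays (columns).
log2_pm = [
    [7.0, 8.0, 9.5],
    [6.0, 7.1, 8.4],
    [8.2, 9.0, 10.6],
    [7.1, 7.9, 9.4],
]
overall, probe_eff, array_eff, resid = median_polish(log2_pm)
# One summary value per array: overall level plus that array's effect.
summaries = [overall + a for a in array_eff]
```

In an RMA-style summarization the per-array summaries play the role of the probe-set expression values, while the probe effects absorb probe-specific affinity differences.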
Slides: Introduction into differential expression analysis
Exercise: lab/DEG_ttest.R - Basic t-test and volcano plot
Exercise: lab/volcano.R - Demo of pretty volcano plot
Handouts: Power analysis
Slides: Linear models for microarray data analysis, limma, empirical Bayes
Exercise: lab/DiffExpr_Limma.Rmd - limma demo
Slides: Significance Analysis of Microarrays, SAM
Exercise: lab/DiffExpr_SAM.Rmd - SAM demo, uses GDS858 data. BiomaRt demo at the end
Slides: Multiple testing correction
Exercise: lab/Filtering.Rmd - filtering gene set
Exercise: lab/multitest.R - multiple testing correction and filtering. Uses Methyl.RData
Slides: Batch effect correction, ComBat, SVA
Batch effect exercise 1, and exercise 2
Krzywinski, Martin, and Naomi Altman. “Points of Significance: Power and Sample Size.” Nature Methods 10, no. 12 (November 26, 2013): 1139–40. doi:10.1038/nmeth.2738.
Cui, Xiangqin, and Gary A. Churchill. “Statistical Tests for Differential Expression in cDNA Microarray Experiments.” Genome Biology 4, no. 4 (2003): 210. - Differential expression analysis of microarrays, from fold-change to t-test, its moderated versions SAM and limma, and ANOVA.
Tong, Tiejun, and Hongyu Zhao. “Practical Guidelines for Assessing Power and False Discovery Rate for a Fixed Sample Size in Microarray Experiments.” Statistics in Medicine 27, no. 11 (May 20, 2008): 1960–72. doi:10.1002/sim.3237. - Power analysis. t-statistics, FDR types and definitions, then derivation of power calculations.
Lönnstedt, Ingrid, and Terry Speed. “Replicated Microarray Data.” Statistica Sinica 12, no. 1 (2002): 31–46. http://www.jstor.org/stable/24307034. - Empirical Bayes method for analyzing microarray replicates. Issues with simple approaches, proposed B statistics - a Bayes log posterior odds with two hyperparameters in the inverse gamma prior for the variances, and a hyperparameter in the normal prior of the nonzero means. Appendix - detailed definitions, derivations, and solutions.
Smyth, Gordon K. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (2004): Article3. doi:10.2202/1544-6115.1027. - Linear models for differential analysis, moderated t-statistics via shrinkage of sample variance. Empirical estimation of Bayesian prior variance distribution and shrinkage hyperparameters.
Phipson, Belinda, Stanley Lee, Ian J. Majewski, Warren S. Alexander, and Gordon K. Smyth. “Robust Hyperparameter Estimation Protects against Hypervariable Genes and Improves Power to Detect Differential Expression.” The Annals of Applied Statistics 10, no. 2 (June 2016): 946–63. doi:10.1214/16-AOAS920. - An extension of differential analysis using linear modeling and empirical Bayes by winsorizing outliers in estimating sample distribution.
Sartor, Maureen A., Craig R. Tomlinson, Scott C. Wesselkamper, Siva Sivaganesan, George D. Leikauf, and Mario Medvedovic. “Intensity-Based Hierarchical Bayes Method Improves Testing for Differentially Expressed Genes in Microarray Experiments.” BMC Bioinformatics 7 (December 19, 2006): 538. https://doi.org/10.1186/1471-2105-7-538. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-538 - Intensity-Based Moderated T-statistic (IBMT). Empirical Bayes approach allowing for the relationship between variance and gene signal intensity (estimated with loess). Brief description of previous methods (Smyth, Cyber-T). Details of Smyth hierarchical model and moderated t-statistics, estimation of hyperparameters with implementation of variance-signal. Software at http://eh3.uc.edu/ibmt/.
Lianbo Yu et al., “Fully Moderated T-Statistic for Small Sample Size Gene Expression Arrays,” Statistical Applications in Genetics and Molecular Biology 10, no. 1 (September 15, 2011), https://doi.org/10.2202/1544-6115.1701. https://www.degruyter.com/view/j/sagmb.2011.10.issue-1/1544-6115.1701/1544-6115.1701.xml - Third implementation of moderated t-statistics. First is Smyth 2004 model assuming $d_{0g}$ and $s_{0g}^2$ constant, second is IBMT (intensity-based moderated t) Sartor 2006 allows varying $s_{0g}^2$ with gene expression, third is the present FMT (fully moderated t) model allowing varying $d_{0g}$ and $s_{0g}^2$. Description of Smyth hierarchical model, estimation of hyperparameters. Adjusted log variances are fit with loess. Goal - increase in power - is demonstrated on simulated data.
Goeman, Jelle J., and Aldo Solari. “Multiple Hypothesis Testing in Genomics.” Statistics in Medicine 33, no. 11 (May 20, 2014): 1946–78. https://doi.org/10.1002/sim.6082. - multiple testing review
Benjamini, Yoav, and Yosef Hochberg. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological), 1995, 289–300. - FDR paper
Storey, John D., and Robert Tibshirani. “Statistical Significance for Genomewide Studies.” Proceedings of the National Academy of Sciences of the United States of America 100, no. 16 (August 5, 2003): 9440–45. https://doi.org/10.1073/pnas.1530509100. - q-value paper
Krzywinski, Martin, and Naomi Altman. “Points of Significance: Comparing Samples—part II.” Nature Methods 11, no. 4 (March 28, 2014): 355–56. doi:10.1038/nmeth.2900.
Lazar, Cosmin, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin Molter, David Y. Weiss-Solís, Robin Duque, Hugues Bersini, and Ann Nowé. “Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey.” Briefings in Bioinformatics 14, no. 4 (July 2013): 469–90. https://doi.org/10.1093/bib/bbs037. - Review of batch correction methods, definitions, methodologies
Leek, Jeffrey T., Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A. Irizarry. “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data.” Nature Reviews. Genetics 11, no. 10 (October 2010): 733–39. https://doi.org/10.1038/nrg2825. - Batch effect, types, sources, examples of wrong conclusions, SVA and ComBat methods
Johnson, W. Evan, Cheng Li, and Ariel Rabinovic. “Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods.” Biostatistics (Oxford, England) 8, no. 1 (January 2007): 118–27. https://doi.org/10.1093/biostatistics/kxj037. - ComBat paper. Batch effect removal using Empirical Bayes method.
Leek, Jeffrey T., and John D. Storey. “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLoS Genetics 3, no. 9 (September 2007): 1724–35. https://doi.org/10.1371/journal.pgen.0030161. - SVA paper
Chen, Chao, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. “Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods.” PLoS One 6, no. 2 (2011): e17238. https://doi.org/10.1371/journal.pone.0017238. - Comparing batch effect removal software. ComBat is best. Importance of standardization
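The FDR paper listed above (Benjamini and Hochberg 1995) defines a step-up procedure that is usually applied as adjusted p-values. A minimal pure-Python sketch of that adjustment is given below; the function name and the example p-values are illustrative assumptions, and the result is intended to match the usual monotone BH adjustment rather than reproduce any particular package.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # order[r - 1] is the index of the p-value with rank r (1 = smallest).
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
# Genes called significant at a 5% false discovery rate.
significant = [i for i, p in enumerate(adjusted_p) if p <= 0.05]
```

Each raw p-value is scaled by m / rank, and the running minimum enforces monotonicity, so a gene is never assigned a smaller adjusted value than a gene with a smaller raw p-value.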
Slides: Functional enrichment analysis
Exercise: lab/Enrichment_Genes.Rmd - differential expression using limma, annotations, functional enrichment
Exercise: lab/EnrichmentOverlay.Rmd - overlaying DEGs over a pathway
Slides: Chi-square, Fisher’s exact, McNemar tests
Bard, Jonathan B. L., and Seung Y. Rhee. “Ontologies in Biology: Design, Applications and Future Challenges.” Nature Reviews. Genetics 5, no. 3 (March 2004): 213–22. doi:10.1038/nrg1295. - Ontologies review
Ackermann, Marit, and Korbinian Strimmer. “A General Modular Framework for Gene Set Enrichment Analysis.” BMC Bioinformatics 10, no. 1 (2009): 47. doi:10.1186/1471-2105-10-47. - All steps for enrichment analysis, methods, statistics, GSEA.
Efron, Bradley, and Robert Tibshirani. “On Testing the Significance of Sets of Genes.” The Annals of Applied Statistics, 2007, 107–29. - maxmean statistics for enrichment analysis. Comparison with GSEA.
Huang, Da Wei, Brad T. Sherman, and Richard A. Lempicki. “Bioinformatics Enrichment Tools: Paths toward the Comprehensive Functional Analysis of Large Gene Lists.” Nucleic Acids Research 37, no. 1 (January 2009): 1–13. doi:10.1093/nar/gkn923. - Gene enrichment analyses tools. Statistics, concept of background. 68 tools, table
Hung, Jui-Hung, Tun-Hsiang Yang, Zhenjun Hu, Zhiping Weng, and Charles DeLisi. “Gene Set Enrichment Analysis: Performance Evaluation and Usage Guidelines.” Briefings in Bioinformatics 13, no. 3 (May 2012): 281–91. doi:10.1093/bib/bbr049. - Details of GSEA. Statistics, correction for multiple testing. Lack of gold standard - concept of mutual coverage.
Khatri, Purvesh, Marina Sirota, and Atul J. Butte. “Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges.” PLoS Computational Biology 8, no. 2 (2012): e1002375. doi:10.1371/journal.pcbi.1002375. - Review of enrichment analyses techniques, focus on pathways, Table 1 lists tools, limitations.
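Many of the over-representation tools reviewed above reduce to a one-sided Fisher's exact (hypergeometric) test against a chosen background. The sketch below shows that calculation in pure Python, assuming a simple four-number summary of the gene lists; the function name, argument names, and example counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(universe, in_set, selected, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    universe: number of genes in the background
    in_set:   background genes annotated to the gene set
    selected: number of differentially expressed genes
    overlap:  selected genes that are also in the gene set
    """
    upper = min(in_set, selected)
    total = comb(universe, selected)
    # Sum the probabilities of seeing `overlap` or more annotated genes
    # among the selected ones; comb() returns 0 for impossible draws.
    tail = sum(
        comb(in_set, k) * comb(universe - in_set, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up example: 10,000 background genes, 100 in the pathway,
# 200 differentially expressed genes, 8 of them in the pathway.
p_enrich = enrichment_pvalue(universe=10000, in_set=100, selected=200, overlap=8)
```

The choice of `universe` is the "concept of background" noted in the Huang et al. annotation: using all genes in the genome rather than the genes actually measured changes the p-value.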
Slides: Introduction to clustering, preprocessing, distance/similarity metrics, hierarchical/divisive clustering, single/average/complete linkage
Exercises: Average_linkage.R, Single_Linkage.R, Complete_Linkage.R, Divisive_Kmeans.R, CramersV_gower.R
Slides: Non-hierarchical Clustering and dimensionality reduction techniques
Exercises: Kmeans_PAM.R (uses BreastCancer.RData), PCA.R (uses nci60.tsv), MDS.Rmd
Slides: Clustering quality control measures
Patrik D’haeseleer, “How Does Gene Expression Clustering Work?,” Nature Biotechnology 23, no. 12 (December 2005): 1499–1501, https://doi.org/10.1038/nbt1205-1499. - Clustering distances. Recommendations for gene expression choices of clustering
Satagopan, Jaya M., and Katherine S. Panageas. “A Statistical Perspective on Gene Expression Data Analysis.” Statistics in Medicine 22, no. 3 (February 15, 2003): 481–99. doi:10.1002/sim.1350. - Intro into microarray technology, statistical questions. Hierarchical clustering - clustering metrics. MDS algorithm. Class prediction - linear discriminant analysis algorithm and cross-validation. SAS and S examples
Altman, Naomi, and Martin Krzywinski. “Points of Significance: Clustering.” Nature Methods 14, no. 6 (May 30, 2017): 545–46. doi:10.1038/nmeth.4299. - Clustering depends on gene scaling, clustering method, number of simulations in k-means clustering.
Abdi, Hervé, and Lynne J. Williams. “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics 2, no. 4 (July 2010): 433–59. doi:10.1002/wics.101. - PCA in-depth review. Mathematical formulations, terminology, examples, interpretation. Figures showing PC axes, rotations, projections, circle of correlation. Rules for selecting number of components. Rotation - varimax, promax, illustrated. Correspondence analysis for nominal variables, Multiple Factor Analysis for a set of observations described by several groups (tables) of variables. Appendices - eigenvalues and eigenvectors, positive semidefinite matrices, SVD
Wall, Michael. “Singular Value Decomposition and Principal Component Analysis,” n.d. https://link.springer.com/chapter/10.1007/0-306-47815-3_5 - SVD and PCA statistical intro. Relation of SVD to PCA, Fourier transform. Examples of applications, including genomics.
Lever, Jake, Martin Krzywinski, and Naomi Altman. “Points of Significance: Principal Component Analysis.” Nature Methods 14, no. 7 (June 29, 2017): 641–42. doi:10.1038/nmeth.4346. - PCA explanation, the effect of scale. Limitations
Meng, Chen, Oana A. Zeleznik, Gerhard G. Thallinger, Bernhard Kuster, Amin M. Gholami, and Aedín C. Culhane. “Dimension Reduction Techniques for the Integrative Analysis of Multi-Omics Data.” Briefings in Bioinformatics 17, no. 4 (July 2016): 628–41. doi:10.1093/bib/bbv108. - A must. Dimensionality reduction techniques - PCA and its derivatives, NMF. Table 1 - Terminology. Table 2 - methods, tools, visualization packages. Methods for integrative data analysis of multi-omics data.
Lee, Su-In, and Serafim Batzoglou. “Application of Independent Component Analysis to Microarrays.” Genome Biology 4, no. 11 (2003): R76. doi:10.1186/gb-2003-4-11-r76. - Independent Components Analysis theory and applications.
Stein-O’Brien, Genevieve L., Raman Arora, Aedin C. Culhane, Alexander Favorov, Casey Greene, Loyal A. Goff, Yifeng Li, et al. “Enter the Matrix: Interpreting Unsupervised Feature Learning with Matrix Decomposition to Discover Hidden Knowledge in High-Throughput Omics Data.” BioRxiv, October 2, 2017. doi:10.1101/196915. - Matrix factorization and visualization. Refs to various types of MF methods. Terminology, Fig 1 explanation of MF in terms of gene expression and biological processes. References to biological examples.
Libbrecht, Maxwell W., and William Stafford Noble. “Machine Learning Applications in Genetics and Genomics.” Nature Reviews. Genetics 16, no. 6 (June 2015): 321–32. doi:10.1038/nrg3920. - Machine learning in genomics. Supervised/unsupervised learning, semi-supervised, bayesian (incorporating prior knowledge), feature selection, imbalanced class sizes, missing data, networks.
Meng, Chen, Bernhard Kuster, Aedín C. Culhane, and Amin Moghaddas Gholami. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics 15 (May 29, 2014): 162. doi:10.1186/1471-2105-15-162. - MCIA - multiple co-inertia analysis for integrating multiple datasets. Statistics and implementation in omicade4 - Multiple co-inertia analysis of omics datasets. https://bioconductor.org/packages/release/bioc/html/omicade4.html
Video: Hierarchical clustering playlist, and other clustering videos from Viktor Lavrenko
Video: t-SNE talk “Visualizing data using embeddings” by Laurens van der Maaten
All general references recommended to read.