Slides: Intro, course logistics
Slides: Genomic technologies
Hagen, J. B. “The Origins of Bioinformatics.” Nature Reviews. Genetics 1, no. 3 (2000): 231–36. doi:10.1038/35042090.
E. S. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature 409, no. 6822 (February 15, 2001): 860–921, doi:10.1038/35057062. - Genome sequencing landmark paper.
Mardis, Elaine R. “Next-Generation DNA Sequencing Methods.” Annual Review of Genomics and Human Genetics 9 (2008): 387–402. doi:10.1146/annurev.genom.9.081307.164359. - Sequencing technologies review. DNA-/RNA-/ChIP-seq. Figures.
Mardis, Elaine R. “DNA Sequencing Technologies: 2006-2016.” Nature Protocols 12, no. 2 (February 2017): 213–18. doi:10.1038/nprot.2016.182. - DNA sequencing technologies introduction, references.
Heather, James M., and Benjamin Chain. “The Sequence of Sequencers: The History of Sequencing DNA.” Genomics 107, no. 1 (January 2016): 1–8. https://doi.org/10.1016/j.ygeno.2015.11.003. - Review of sequencing technologies, from pre-Sanger to current PacBio, Oxford Nanopore, Ion Torrent.
Cordaux, Richard, and Mark A. Batzer. “The Impact of Retrotransposons on Human Genome Evolution.” Nature Reviews. Genetics 10, no. 10 (October 2009): 691–703. doi:10.1038/nrg2640.
Rothberg, Jonathan M, and John H Leamon. “The Development and Impact of 454 Sequencing.” Nature Biotechnology 26, no. 10 (October 2008): 1117–24. doi:10.1038/nbt1485. - 454 sequencing, history of sequencing development, pyrosequencing. Other technologies - Box 1, Table 2
Goodwin, Sara, John D. McPherson, and W. Richard McCombie. “Coming of Age: Ten Years of next-Generation Sequencing Technologies.” Nature Reviews. Genetics 17, no. 6 (May 17, 2016): 333–51. doi:10.1038/nrg.2016.49. - From microarrays to long read sequencing, single-cell sequencing. Technology review. Table with all technologies and costs. Figures.
Green, Eric D., Mark S. Guyer, Teri A. Manolio, and Jane L. Peterson. “Charting a Course for Genomic Medicine from Base Pairs to Bedside.” Nature 470, no. 7333 (February 10, 2011): 204–13. doi:10.1038/nature09764. - Perspective on genomics development, technologies. Figure 1 - pictorial roadmap
Bioinformatics-Resources - A curated list of resources for learning bioinformatics. https://github.com/YTLogos/Bioinformatics-Resources
Biostar Handbook - bioinformatics survival guide. A practical overview for the data analysis methods of bioinformatics. https://www.biostarhandbook.com/index.html and on Git https://github.com/ialbert/biostar-handbook-web
biotools - A massive collection of references on the topics of bioinformatics, sequencing technologies, programming, machine learning, and more. https://github.com/jdidion/biotools
“Computational genomics with R” book by Altuna Akalin. Web site, https://compgenomr.github.io/book/, and GitHub repo, https://github.com/compgenomr/book
Links and references to many resources https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources
https://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf - read about Illumina technology, mostly figures
Video. The Genomic Landscape circa 2016 - Eric Green. Overview of genomics and medicine. https://www.youtube.com/watch?v=mhD3-_5Ee-A
Video. Any talks by Elaine Mardis. https://www.youtube.com/results?search_query=elaine+mardis
Slides: Unix intro, command line
Slides: Text manipulations, regular expression, sed, awk
Slides: Shell scripting
A Book for Anyone to Get Started with Unix. http://seankross.com/the-unix-workbench/, and the GitHub repository, https://github.com/seankross/the-unix-workbench
Data Coding 101 – Intro To Bash. Four episodes, video. https://data36.com/data-coding-bash-best-practices/
An interactive explainer of any shell command. http://explainshell.com/
Unix/Linux command reference sheets. https://cheat-sheets.s3.amazonaws.com/linux-commands-cheat-sheet-new.pdf and https://files.fosswire.com/2007/08/fwunixref.pdf
Survival guide for Unix newbies. http://matt.might.net/articles/basic-unix/
Settling into Unix tutorial. http://matt.might.net/articles/settling-into-unix/
Shell programming with bash tutorial. http://matt.might.net/articles/bash-by-example/
Master the power of command-line with a list of one-liner gems. http://www.commandlinefu.com/commands/browse
“The Unix shell”, Software Carpentry. https://swcarpentry.github.io/shell-novice/
A curated list of Terminal frameworks, plugins & resources for command-line interface (CLI) lovers. http://terminalsare.sexy and https://github.com/k4m4/terminals-are-sexy
“Data Science at the Command Line” by Jeroen Janssens, GitHub repository https://github.com/jeroenjanssens/data-science-at-the-command-line and an online book https://www.datascienceatthecommandline.com/
Heng Li’s “A Bioinformatician’s UNIX Toolbox”, http://lh3lh3.users.sourceforge.net/biounix.shtml
Bioinformatics one-liners by Stephen Turner, https://github.com/stephenturner/oneliners
Collection of bioinformatics-genomics bash one liners, using awk, sed etc. https://github.com/crazyhottommy/bioinformatics-one-liners
Links and references to many genomics and bioinformatics resources, https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources
Regular expression, Unix commands, Python quick reference, SQL reference card. http://practicalcomputing.org/files/PCfB_Appendices.pdf
Tutorial to sed by Bruce Barnett. http://www.grymoire.com/Unix/Sed.html
Vim introduction and tutorial. https://blog.interlinked.org/tutorials/vim_tutorial.html
Interactive Vim tutorial. http://www.openvim.com/
Vim reference card. http://web.mit.edu/merolish/Public/vi-ref.pdf
Slides: Genomic file formats and tools
Slides: Genomic resources
Slides: TCGA
Schroeder, Michael P., Abel Gonzalez-Perez, and Nuria Lopez-Bigas. “Visualizing Multidimensional Cancer Genomics Data.” Genome Medicine 5, no. 1 (2013): 9. https://doi.org/10.1186/gm413. - Omics visualization tools review, summary table.
Silva, Tiago C., Antonio Colaprico, Catharina Olsen, Fulvio D’Angelo, Gianluca Bontempi, Michele Ceccarelli, and Houtan Noushmehr. “TCGA Workflow: Analyze Cancer Genomics and Epigenomics Data Using Bioconductor Packages.” F1000Research 5 (December 28, 2016): 1542. https://doi.org/10.12688/f1000research.8923.2. - TCGA workflow
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44, no. 8 (May 5, 2016): e71. https://doi.org/10.1093/nar/gkv1507.
Cline, Melissa S., Brian Craft, Teresa Swatloski, Mary Goldman, Singer Ma, David Haussler, and Jingchun Zhu. “Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser.” Scientific Reports 3 (October 2, 2013): 2652. https://doi.org/10.1038/srep02652.
Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. “Reproducible RNA-Seq Analysis Using Recount2.” Nature Biotechnology 35, no. 4 (April 11, 2017): 319–21. https://doi.org/10.1038/nbt.3838. - recount2 - uniformly processed RNA-seq data, using Rail-RNA. Counts and bigWig files for differential region analysis using derfinder.
Parker, Joel S., Michael Mullins, Maggie C. U. Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, et al. “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes.” Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology 27, no. 8 (March 10, 2009): 1160–67. https://doi.org/10.1200/JCO.2008.18.1370. - PAM50 paper. BRCA classification into Luminal A, Luminal B, HER2-enriched, basal-like and normal-like subtypes. 50 genes determined by Prediction Analysis of Microarrays, remain significant predictors in uni- and multivariate analyses. Data links at https://www.biostars.org/p/77590/
UCSC genome browser video tutorials: http://genome.ucsc.edu/training/vids/
UCSC tutorial, common data formats, Galaxy tutorial, mentioning of VEP, IGV. http://web.cse.ohio-state.edu/~machiraju.1/teaching/CSE5599-BMI7830/Lectures/pdf/BMI7830-CSE5559-2014-huang-machiraju-Week-11-1.pdf
IGV tutorial, https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial
Analyze cancer genomics and epigenomics data using Bioconductor packages, https://github.com/BioinformaticsFMRP/TCGAWorkflow/, and the associated paper - Silva, Tiago C., Antonio Colaprico, Catharina Olsen, Fulvio D’Angelo, Gianluca Bontempi, Michele Ceccarelli, and Houtan Noushmehr. “TCGA Workflow: Analyze Cancer Genomics and Epigenomics Data Using Bioconductor Packages.” F1000Research 5 (December 28, 2016): 1542. https://doi.org/10.12688/f1000research.8923.2.
Tutorial for the sevenbridges cancer genomics cloud, https://github.com/tdelhomme/CancerGenomicsCloud_tutorial
VCF-tips-tricks, https://github.com/tdelhomme/VCF-tips-tricks
Slides: Alignment introduction
Exercise: naive_exact.R - naive exact matching
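A minimal Python sketch of naive exact matching, for readers who want to see the idea outside of R. The course exercise itself is not reproduced here; the function name `naive_exact` is an assumption made for illustration.

```python
def naive_exact(pattern: str, text: str) -> list[int]:
    """Return every 0-based offset where `pattern` occurs exactly in `text`."""
    occurrences = []
    # Try every possible alignment of the pattern against the text.
    for i in range(len(text) - len(pattern) + 1):
        # Compare character by character; all() stops at the first mismatch.
        if all(text[i + j] == pattern[j] for j in range(len(pattern))):
            occurrences.append(i)
    return occurrences


if __name__ == "__main__":
    print(naive_exact("AG", "AGCTTAGATAGC"))
```

The worst-case running time is proportional to len(pattern) * len(text), which is why indexed approaches are used for real read alignment.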
Slides: Hamming/Edit distance, global/local alignment overview
Exercise: naive_Hamming.R - Hamming distance matching
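A hedged Python sketch of approximate matching under Hamming distance (substitutions only, no gaps). The names `hamming_distance` and `naive_hamming` and the `max_mismatches` parameter are assumptions for illustration, not the contents of the R exercise.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def naive_hamming(pattern: str, text: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (offset, mismatches) for alignments with at most `max_mismatches`."""
    hits = []
    for i in range(len(text) - len(pattern) + 1):
        mismatches = 0
        for j, ch in enumerate(pattern):
            if text[i + j] != ch:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this offset
        else:
            # The inner loop finished without break: the alignment is a hit.
            hits.append((i, mismatches))
    return hits


if __name__ == "__main__":
    print(naive_hamming("ACT", "ACGTACTA", 1))
```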
Exercise: edDistRecursive.R - edit distance, recursive
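One way the recursive definition of edit distance could be written in Python, assuming unit costs for substitution, insertion and deletion. The function name `ed_dist_recursive` is an assumption; this is a sketch of the recurrence, not the course's R code.

```python
def ed_dist_recursive(a: str, b: str) -> int:
    """Edit distance by direct recursion on the last characters of a and b."""
    if not a:
        return len(b)  # insert every remaining character of b
    if not b:
        return len(a)  # delete every remaining character of a
    delta = 0 if a[-1] == b[-1] else 1
    return min(
        ed_dist_recursive(a[:-1], b[:-1]) + delta,  # match or substitution
        ed_dist_recursive(a[:-1], b) + 1,           # deletion from a
        ed_dist_recursive(a, b[:-1]) + 1,           # insertion into a
    )


if __name__ == "__main__":
    print(ed_dist_recursive("kitten", "sitting"))
```

The same subproblems are recomputed many times, so the running time grows exponentially with string length; this is the motivation for the dynamic programming version.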
Exercise: edDistDynamic.R - edit distance, dynamic programming
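A minimal Python sketch of edit distance by dynamic programming, again assuming unit costs. The name `ed_dist_dynamic` is an assumption for illustration.

```python
def ed_dist_dynamic(a: str, b: str) -> int:
    """Edit distance using a (len(a)+1) x (len(b)+1) table of prefix distances."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Distance between a prefix and the empty string is the prefix length.
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + delta,  # match or substitution
                dist[i - 1][j] + 1,          # deletion
                dist[i][j - 1] + 1,          # insertion
            )
    return dist[-1][-1]


if __name__ == "__main__":
    print(ed_dist_dynamic("intention", "execution"))
```

Each table cell is filled once, so time and memory are proportional to len(a) * len(b).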
Slides: Needleman-Wunsch global alignment
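A hedged Python sketch of the Needleman-Wunsch recurrence, computing only the optimal global alignment score. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration; the slides may use a different scoring scheme.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Global alignment: the answer is in the bottom-right cell.
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GCATGCG", "GATTACA"))
```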
Slides: Smith-Waterman local alignment
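A hedged Python sketch of the Smith-Waterman recurrence, returning only the best local alignment score. Compared with the global recurrence, every cell is floored at zero and the answer is the maximum over the whole table. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the table; row 0 is all zeros
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment start fresh at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Only two rows are kept in memory because the score alone is needed; recovering the aligned substrings would require the full table and a traceback.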
Slides: Burrows-Wheeler Transform
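A minimal Python sketch of the Burrows-Wheeler Transform and its inverse, built from sorted rotations. This is the textbook construction, far less efficient than what aligners use; the function names and the `$` terminator are assumptions for illustration.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Last column of the sorted rotations of text + terminator."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT characters, then sort; after n rounds the table
        # holds all rotations in sorted order.
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    print(bwt("banana"))
    print(inverse_bwt(bwt("banana")))
```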
Nagarajan, Niranjan, and Mihai Pop. “Sequence Assembly Demystified.” Nature Reviews. Genetics 14, no. 3 (March 2013): 157–67. doi:10.1038/nrg3367. - Gentle introduction into genome assembly. Technologies. Box2: Greedy, overlap-layout-consensus, De Bruijn. Problems
Pevzner, P. A., H. Tang, and M. S. Waterman. “An Eulerian Path Approach to DNA Fragment Assembly.” Proceedings of the National Academy of Sciences 98, no. 17 (August 14, 2001): 9748–53. https://doi.org/10.1073/pnas.171285098. - First paper applying de Bruijn graphs to genome assembly. Idea of breaking reads into fragments. In the typical approach, reads are vertices connected by edges if they overlap, and assembly becomes a Hamiltonian path problem - visit each vertex exactly once, NP-complete. In a de Bruijn graph, overlapping fragments are edges, and the problem becomes finding an Eulerian path - visit each edge exactly once. Error-correction algorithm.
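The de Bruijn idea described above can be sketched in Python as follows. This is a toy illustration under strong assumptions (error-free reads, each k-mer used once, an Eulerian path exists), not the algorithm from the paper; all function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes an Eulerian path exists (not checked)."""
    if not graph:
        return []
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n in sorted(balance) if balance[n] == 1), min(graph))
    remaining = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit the node
    return path[::-1]


def assemble(reads: list[str], k: int) -> str:
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ACGTT", "GTTGC", "TGCA"], 3))
```

When a (k-1)-mer repeats in the genome, several Eulerian paths can exist and the reconstruction is no longer unique, which is one of the assembly problems the reviews in this list discuss.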
Primer: Compeau, Phillip E C, Pavel A Pevzner, and Glenn Tesler. “How to Apply de Bruijn Graphs to Genome Assembly.” Nature Biotechnology 29, no. 11 (November 8, 2011): 987–91. doi:10.1038/nbt.2023.
Chaisson, Mark J. P., Richard K. Wilson, and Evan E. Eichler. “Genetic Variation and the de Novo Assembly of Human Genomes.” Nature Reviews Genetics 16, no. 11 (October 7, 2015): 627–40. doi:10.1038/nrg3933. - Genome assembling strategies, problems. OLC, De Bruijn, string graphs. Types of gaps.
Miller, Jason R., Sergey Koren, and Granger Sutton. “Assembly Algorithms for Next-Generation Sequencing Data.” Genomics 95, no. 6 (June 2010): 315–27. doi:10.1016/j.ygeno.2010.03.001. - Assembly tools for overlap/layout/consensus and the de Bruijn graph approaches. Issues with genome assembly, potential solutions.
String Graph Assembler. Simpson, J. T., and R. Durbin. “Efficient de Novo Assembly of Large Genomes Using Compressed Data Structures.” Genome Research 22, no. 3 (March 1, 2012): 549–56. doi:10.1101/gr.126953.111. - SGA - String Graph Assembler. From an FM-index. Velvet, ABySS, SOAPdenovo de Bruijn graph assemblers. BWA and FM explanation
Koren, Sergey, and Adam M. Phillippy. “One Chromosome, One Contig: Complete Microbial Genomes from Long-Read Sequencing and Assembly.” Current Opinion in Microbiology 23 (February 2015): 110–20. https://doi.org/10.1016/j.mib.2014.11.014. - Genome assembly overview focusing on long reads. Repeats (global and local) are problematic. Details on technologies: PacBio RS, Illumina’s Moleculo, ONT MinION. Assembling approaches: OLC, hierarchical hybrid (long reads correction using another technology) and non-hybrid (self long reads alignment-correction). Assembly augmentation: gap filling, scaffolding, read threading. Table 1 - long read assembly tools and descriptions.
Smith, T. F., and M. S. Waterman. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147, no. 1 (March 25, 1981): 195–97.
Burrows, Michael, and David J Wheeler. “A Block-Sorting Lossless Data Compression Algorithm,” 1994. - BWT paper
Ferragina, Paolo, and Giovanni Manzini. “Opportunistic Data Structures with Applications.” In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium On, 390–98. IEEE, 2000. https://dl.acm.org/citation.cfm?id=796543. - FM index paper
Li, Heng, and Richard Durbin. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics (Oxford, England) 25, no. 14 (July 15, 2009): 1754–60. https://doi.org/10.1093/bioinformatics/btp324. - BWA first paper
Li, Heng, and Richard Durbin. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics (Oxford, England) 26, no. 5 (March 1, 2010): 589–95. https://doi.org/10.1093/bioinformatics/btp698. - BWA second paper
Li, Heng. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM.” ArXiv Preprint ArXiv:1303.3997, 2013. - BWA-MEM (maximal exact matches). Automatically select end-to-end or gapped alignment.
Langmead, Ben, and Steven L Salzberg. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9, no. 4 (March 4, 2012): 357–59. https://doi.org/10.1038/nmeth.1923. - Bowtie2 gapped alignment
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics (Oxford, England) 29, no. 1 (January 1, 2013): 15–21. https://doi.org/10.1093/bioinformatics/bts635. - STAR gapped aligner. Algorithm, testing on simulated dataset.
Slides: Bioconductor for genomics
Slides: Biostrings. Source
Exercise: biostrings.R - Biostrings
Slides: IRanges, GenomicRanges, GenomicFeatures Source
Exercise: genomicRanges.R - IRanges, GenomicRanges
Exercise: annotation.R - OrgDb, TxDb, biomaRt, AnnotationHub, ExperimentHub
Exercise: shortread.R - basics of the ShortRead package, reading in FASTQ files, getting QC
Exercise: SummarizedExperiment.R - SummarizedExperiment using ‘airway’ data
Slides: Integrative analysis
“Materials for Genomics Data Science: Introduction to Bioconductor” course by Kasper Hansen. Includes videos, code examples and lecture material. Web-site, https://kasperdanielhansen.github.io/genbioconductor/, and GitHub repo, https://github.com/kasperdanielhansen/genbioconductor
Tutorial on Biostrings, GRanges, SummarizedExperiment, annotation resources, getting files into Bioconductor, https://www.bioconductor.org/help/course-materials/2017/OMRF/B2_Common_Operations.html
IRanges, GRanges, seqinfo, AnnotationHub, Biostrings, BSgenome, rtracklayer, https://kasperdanielhansen.github.io/genbioconductor/
Annotation tutorial (AnnoDb, TxDb, GRanges(List), OrganismDb, AnnotationHub, biomaRt). https://github.com/jmacdon/BiocAnno2016
MultiAssayExperiment tutorial, GRanges, UpSetR diagrams. https://github.com/waldronlab/MultiAssayExperimentWorkshop
Kristensen, Vessela N., Ole Christian Lingjærde, Hege G. Russnes, Hans Kristian M. Vollan, Arnoldo Frigessi, and Anne-Lise Børresen-Dale. “Principles and Methods of Integrative Genomic Analyses in Cancer.” Nature Reviews. Cancer 14, no. 5 (May 2014): 299–313. https://doi.org/10.1038/nrc3721. - Integrative, systematic analysis. Cancer-related databases, tools. TCGA, ENCODE, pathway enrichment, network analysis. Table 1 - tools and descriptions
Nguyen, Tin, Rebecca Tagett, Diana Diaz, and Sorin Draghici. “A Novel Approach for Data Integration and Disease Subtyping.” Genome Research, October 24, 2017, gr.215129.116. https://doi.org/10.1101/gr.215129.116. - PINS - Perturbation clustering for data INtegration and disease Subtyping. Integrative analysis of multiple data types via clustering to detect subgroups differing in survival. Perturbing the data by Gaussian noise, recluster, find number of clusters least affected by perturbations. Rand index to assess clustering quality. Connectivity matrices for individual dataset types, combined into consensus matrix. Data and code at http://www.cs.wayne.edu/tinnguyen/PINS/PINS.html
Zhang, Shihua, Chun-Chi Liu, Wenyuan Li, Hui Shen, Peter W. Laird, and Xianghong Jasmine Zhou. “Discovery of Multi-Dimensional Modules by Integrative Analysis of Cancer Genomic Data.” Nucleic Acids Research 40, no. 19 (October 2012): 9379–91. https://doi.org/10.1093/nar/gks725. - Integrative analysis of gene expression, methylation, miRNA expression. Using NMF. Good explanation of NMF. Tested on TCGA ovarian cancer data
Slides: RNA-seq introduction
Slides: Experimental design for RNA-seq
Slides: RNA-seq quality control, alignment
Slides: Gene/transcript quantification
Slides: RNA-seq data normalization
Slides: RNA-seq batch effect removal
Slides: RNA-seq differential expression
Slides: RNA-seq alternative splicing
Conesa, Ana, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, et al. “A Survey of Best Practices for RNA-Seq Data Analysis.” Genome Biology 17, no. 1 (December 2016). https://doi.org/10.1186/s13059-016-0881-8. - RNA-seq analysis roadmap, QC. Differential detection. TPM. Tools for alternative splicing detection and visualization. small RNA analysis. single cell. Integrative analysis, with methylation.
Garber, Manuel, Manfred G. Grabherr, Mitchell Guttman, and Cole Trapnell. “Computational Methods for Transcriptome Annotation and Quantification Using RNA-Seq.” Nature Methods 8, no. 6 (June 2011): 469–77. https://doi.org/10.1038/nmeth.1613. - RNA-seq alignment and quantification. Table of tools. Transcriptome reconstruction. Alternative splicing
Wang, Zhong, Mark Gerstein, and Michael Snyder. “RNA-Seq: A Revolutionary Tool for Transcriptomics.” Nature Reviews. Genetics 10, no. 1 (January 2009): 57–63. doi:10.1038/nrg2484. - RNA-seq review.
Williams, Alexander G., Sean Thomas, Stacia K. Wyman, and Alisha K. Holloway. “RNA-Seq Data: Challenges in and Recommendations for Experimental Design and Analysis: RNA-Seq Data: Experimental Design and Analysis.” In Current Protocols in Human Genetics, edited by Jonathan L. Haines, Bruce R. Korf, Cynthia C. Morton, Christine E. Seidman, J.G. Seidman, and Douglas R. Smith, 11.13.1-11.13.20. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2014. https://doi.org/10.1002/0471142905.hg1113s83. - RNA-seq basics, tools, simulations
Martin, Jeffrey A., and Zhong Wang. “Next-Generation Transcriptome Assembly.” Nature Reviews. Genetics 12, no. 10 (September 7, 2011): 671–82. https://doi.org/10.1038/nrg3068. - Transcriptome assembly. Sequencing technologies overview. Reference-based and de novo assembly, combined approach idea. Splice graph, De Bruijn graph.
Marioni, John C., Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. “RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18, no. 9 (September 2008): 1509–17. doi:10.1101/gr.079558.108. - Illumina sequencing - microarray comparison. Good agreement. Assessing lane effect with hypergeometric distribution. Likelihood ratio test for differential expression. Chi-squared goodness-of-fit test.
Peixoto, Lucia, Davide Risso, Shane G. Poplawski, Mathieu E. Wimmer, Terence P. Speed, Marcelo A. Wood, and Ted Abel. “How Data Analysis Affects Power, Reproducibility and Biological Insight of RNA-Seq Studies in Complex Datasets.” Nucleic Acids Research 43, no. 16 (September 18, 2015): 7664–74. https://doi.org/10.1093/nar/gkv736. - The importance of RNA-seq normalization and batch effect removal. RUVseq increases power, but the choice of the number of latent variables is important. Steps in RNA-seq data processing, normalization, exploratory data analysis. https://github.com/drisso/peixoto2015_tutorial
Wang, Eric T., Rickard Sandberg, Shujun Luo, Irina Khrebtukova, Lu Zhang, Christine Mayr, Stephen F. Kingsmore, Gary P. Schroth, and Christopher B. Burge. “Alternative Isoform Regulation in Human Tissue Transcriptomes.” Nature 456, no. 7221 (November 27, 2008): 470–76. https://doi.org/10.1038/nature07509. - Alternative splicing comparison between tissues. ~94% of genes are alternatively spliced. Variation in alternative splicing is much greater between tissues than between individuals.
Park, Eddie, Zhicheng Pan, Zijun Zhang, Lan Lin, and Yi Xing. “The Expanding Landscape of Alternative Splicing Variation in Human Populations.” The American Journal of Human Genetics 102, no. 1 (January 2018): 11–26. https://doi.org/10.1016/j.ajhg.2017.11.002. - Alternative splicing, detailed overview
Pachter, Lior. “Models for Transcript Quantification from RNA-Seq.” ArXiv Preprint ArXiv:1104.3889, 2011. - RNA-seq quantification statistics, expectation-maximization algorithm. https://arxiv.org/abs/1104.3889
Robinson, Mark D., and Alicia Oshlack. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11, no. 3 (2010): R25. https://doi.org/10.1186/gb-2010-11-3-r25 - TMM normalization method. Problems with library scaling normalization. Well-written intuitive motivating example. MA plot, trimming outliers, weighted (inverse of the variance) M average after discarding 30% of M outliers and lowest 5% of A values.
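A simplified Python sketch of the TMM idea summarized above: per-gene log ratios (M) and average log abundances (A), trimming of extreme M and A values, then an inverse-variance weighted mean of the remaining M values. It is an illustration only; the function name, trimming on both tails, and the tie handling are assumptions, and the edgeR implementation differs in details (for example, it also rescales the factors across samples).

```python
import math


def tmm_factor(sample: list[float], reference: list[float],
               m_trim: float = 0.30, a_trim: float = 0.05) -> float:
    """Simplified TMM scaling factor of `sample` relative to `reference`."""
    n_s, n_r = sum(sample), sum(reference)
    genes = []  # (M, A, weight) for genes with non-zero counts in both libraries
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue  # log ratios are undefined for zero counts
        m = math.log2((y_s / n_s) / (y_r / n_r))
        a = 0.5 * math.log2((y_s / n_s) * (y_r / n_r))
        var = (n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r)
        if var > 0:
            genes.append((m, a, 1.0 / var))
    n = len(genes)

    def keep(values: list[float], trim: float) -> set[int]:
        # Indices that survive trimming `trim` of the genes from each tail.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    kept = keep([g[0] for g in genes], m_trim) & keep([g[1] for g in genes], a_trim)
    if not kept:
        return 1.0  # nothing left to estimate from; leave the library unscaled
    weight_sum = sum(genes[i][2] for i in kept)
    log_factor = sum(genes[i][2] * genes[i][0] for i in kept) / weight_sum
    return 2.0 ** log_factor


if __name__ == "__main__":
    reference = [10] * 20
    sample = [10] * 18 + [500, 500]  # two genes dominate the sample library
    print(tmm_factor(sample, reference))
```

In the toy example the two highly expressed genes inflate the sample's library size; after trimming, the factor times the raw library size recovers the reference library size, which is the motivating situation the paper describes.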
“How not to perform a differential expression analysis (or science)” blog post by Lior Pachter, about Salmon-kallisto similarities and differences, general references. https://liorpachter.wordpress.com/2017/08/02/how-not-to-perform-a-differential-expression-analysis-or-science/
Robinson, Mark D., and Gordon K. Smyth. “Small-Sample Estimation of Negative Binomial Dispersion, with Applications to SAGE Data.” Biostatistics (Oxford, England) 9, no. 2 (April 2008): 321–32. doi:10.1093/biostatistics/kxm030 - Negative Binomial distribution instead of Poisson. Previous models: binomial, Poisson.
Lun, Aaron T. L., and Gordon K. Smyth. “No Counts, No Variance: Allowing for Loss of Degrees of Freedom When Assessing Biological Variability from RNA-Seq Data.” Statistical Applications in Genetics and Molecular Biology 16, no. 2 (April 25, 2017): 83–93. https://doi.org/10.1515/sagmb-2017-0010 - Negative impact of genes with zero counts on GLM framework for RNA-seq differential expression analysis. Overdispersion, GLM, quasi-likelihood F-test, adjusting degrees of freedom for zero-count genes.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15, no. 2 (2014): R29. https://doi.org/10.1186/gb-2014-15-2-r29. - voom paper
Love, Michael I, Wolfgang Huber, and Simon Anders. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15, no. 12 (December 2014). doi:10.1186/s13059-014-0550-8. - DESeq2 paper. Problems with fold-change ranking of genes - proposed solution using shrinkage of FCs. Generalized linear model using Negative Binomial distribution. Borrowing information - genes of similar average expression have similar dispersion. rlog-transformation. Look at the original DESeq publication, https://www.ncbi.nlm.nih.gov/pubmed/20979621, and the comments to the PubMed entry.
Witten, Daniela M. “Classification and Clustering of Sequencing Data Using a Poisson Model.” The Annals of Applied Statistics 5, no. 4 (December 2011): 2493–2518. doi:10.1214/11-AOAS493 - RNA-seq modeling with Poisson distribution. samples X genes matrix. Derivation of Poisson, negative binomial, using Poisson for linear discriminant analysis and clustering (Poisson dissimilarity).
Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14, no. 4 (March 6, 2017): 417–19. https://doi.org/10.1038/nmeth.4197 - Salmon paper. Pseudo-alignment, or using precomputed alignment to transcriptome. Dual-phase statistical inference procedure and sample-specific bias models that account for sequence-specific, fragment, GC content, and positional biases. Comparison with kallisto and sailfish. Tests on simulated (Polyester, RSEM-sim) and real (GEUVADIS, SEQC) data. Detailed Methods description. https://github.com/COMBINE-lab/Salmon
Jiang, Hui, and Wing Hung Wong. “Statistical Inferences for Isoform Expression in RNA-Seq.” Bioinformatics 25, no. 8 (April 15, 2009): 1026–32. doi:10.1093/bioinformatics/btp113 - Alternative splicing statistics - Poisson modeling. Problem - most reads are shared by more than one isoform. How to quantify isoform expression from exon counts. Detailed statistical derivations
Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” F1000Research 4 (November 17, 2016): 1070. doi:10.12688/f1000research.7035.2 - RNA-seq workflow. From count import, including tximport, through EDA, DESeq2, batch removal, time course analysis, visualization.
Griffith, Malachi, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, and Obi L. Griffith. “Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud.” PLoS Computational Biology 11, no. 8 (August 2015): e1004393. https://doi.org/10.1371/journal.pcbi.1004393 - RNA-seq technology and analysis introduction. Very interesting are supplementary tables http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393#sec009. The full tutorials are at https://github.com/griffithlab/rnaseq_tutorial/wiki
Sahraeian, Sayed Mohammad Ebrahim, Marghoob Mohiyuddin, Robert Sebra, Hagen Tilgner, Pegah T. Afshar, Kin Fai Au, Narges Bani Asadi, et al. “Gaining Comprehensive Biological Insight into the Transcriptome by Performing a Broad-Spectrum RNA-Seq Analysis.” Nature Communications 8, no. 1 (December 2017). doi:10.1038/s41467-017-00050-4 - RNAcocktail - RNA-seq tools benchmarking. All aspects of RNA-seq analysis, structured, Fig 1. Recommended tools Fig 8. https://bioinform.github.io/rnacocktail/, http://www.rna-seqblog.com/unleash-the-power-within-rna-seq/
Law, Charity W., Monther Alhamdoosh, Shian Su, Gordon K. Smyth, and Matthew E. Ritchie. “RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and EdgeR.” F1000Research 5 (2016): 1408. https://doi.org/10.12688/f1000research.9005.2 - Latest Rsubread-limma plus pipeline https://f1000research.com/articles/5-1408/v2. The complete R code for RNA-seq analysis tutorial https://www.bioconductor.org/help/workflows/RNAseq123/
Pertea, Mihaela, Daehwan Kim, Geo M. Pertea, Jeffrey T. Leek, and Steven L. Salzberg. “Transcript-Level Expression Analysis of RNA-Seq Experiments with HISAT, StringTie and Ballgown.” Nature Protocols 11, no. 9 (September 2016): 1650–67. https://doi.org/10.1038/nprot.2016.095. http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html - New Tuxedo suite. Protocol.
“RNA-Seq Methods and Algorithms” short video course by Harold Pimentel, pseudoalignment, kallisto, sleuth, practical. https://www.youtube.com/watch?v=96yBPM8lEt8&list=PLfFNmoa-yUIb5cYG2R1zf5rtrQQKZvKwG
“RNA-seq workflow: gene-level exploratory analysis and differential expression” - Full pipeline, DEG analysis using DESeq2, visualization, PCA, MDS, clustering, annotation. https://www.bioconductor.org/help/course-materials/2017/CSAMA/labs/2-tuesday/lab-03-rnaseq/rnaseqGene_CSAMA2017.html
“RNA-seq workflow: gene-level exploratory analysis and differential expression”, https://www.bioconductor.org/help/workflows/rnaseqGene/
“An RNA-seq Work Flow” - https://www.bioconductor.org/help/course-materials/2017/OSU/B3_RNASeq_Workflow.html
RNA-seq pipeline by Tommy Tang, https://gitlab.com/tangming2005/STAR_htseq_RNAseq_pipeline/tree/shark
RNA-seq analysis exercise using Galaxy, https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise, an example analysis using the Tophat+Cufflinks workflow.
“bcbio-nextgen” - Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis https://github.com/chapmanb/bcbio-nextgen, https://bcbio-nextgen.readthedocs.org
“Tutorial: RNA-seq differential expression & pathway analysis with Sailfish, DESeq2, GAGE, and Pathview” by Stephen Turner, http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html
Getting started with HISAT, StringTie, and Ballgown, https://davetang.org/muse/2017/10/25/getting-started-hisat-stringtie-ballgown/
“enrichOmics” - Functional enrichment analysis of high-throughput omics data. From basic ExpressionSet differential and functional enrichment analysis to genomic region enrichment analysis and MultiAssayExperiment demo. Uses the versatile EnrichmentBrowser package. https://github.com/waldronlab/enrichOmics
ENCODE RNA-, ChIP-, DNAse-, ATAC- and other seq pipelines, https://github.com/ENCODE-DCC/
Example scripts to be run on the Biostatistics Merlot cluster, https://github.com/mdozmorov/dcaf/tree/master/ngs.rna-seq
Slides: Methylation introduction
Slides: Methylation Illumina arrays
Slides: Methylation minfi tutorial
Slides: Methylation bisulfite sequencing
Slides: Cell type deconvolution
Pidsley, R., et al., and Susan J. Clark. “Critical Evaluation of the Illumina MethylationEPIC BeadChip Microarray for Whole-Genome DNA Methylation Profiling.” Genome Biology 2016 https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1066-1
Du, Pan, et al. “Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis.” BMC Bioinformatics, 2010 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-587
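A hedged Python sketch of the two methylation measures compared in the paper above, computed from methylated and unmethylated probe intensities, plus the conversion between them. The function names are hypothetical, and the default offsets (100 for Beta-values, 1 for M-values) are assumptions that should be checked against the paper or the array software in use.

```python
import math


def beta_value(meth: float, unmeth: float, offset: float = 100.0) -> float:
    """Beta-value: fraction of methylated signal, bounded between 0 and 1."""
    meth, unmeth = max(meth, 0.0), max(unmeth, 0.0)
    return meth / (meth + unmeth + offset)


def m_value(meth: float, unmeth: float, offset: float = 1.0) -> float:
    """M-value: log2 ratio of methylated to unmethylated signal."""
    meth, unmeth = max(meth, 0.0), max(unmeth, 0.0)
    return math.log2((meth + offset) / (unmeth + offset))


def beta_to_m(beta: float) -> float:
    """Logit (base 2) transform; valid for 0 < beta < 1, ignoring offsets."""
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m: float) -> float:
    """Inverse of beta_to_m."""
    return 2.0 ** m / (2.0 ** m + 1.0)


if __name__ == "__main__":
    print(beta_value(900, 0), m_value(1000, 1000), beta_to_m(0.5))
```

Beta-values are easier to interpret as a proportion, while M-values are unbounded and stretch the extremes near 0 and 1.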
Bock, Christoph, Eleni M Tomazou, Arie B Brinkman, Fabian Müller, Femke Simmer, Hongcang Gu, Natalie Jäger, Andreas Gnirke, Hendrik G Stunnenberg, and Alexander Meissner. “Quantitative Comparison of Genome-Wide DNA Methylation Mapping Technologies.” Nature Biotechnology 28, no. 10 (October 2010): 1106–14. https://doi.org/10.1038/nbt.1681. - Methylation intro, technology. Software tools, tables. Quality control and problems. Differential analysis. Public repositories.
Krueger, Felix, Benjamin Kreck, Andre Franke, and Simon R Andrews. “DNA Methylome Analysis Using Short Bisulfite Sequencing Data.” Nature Methods 9, no. 2 (January 30, 2012): 145–51. https://www.nature.com/articles/nmeth.1828. - Methylation intro, technologies to measure. Alignment problems, QC considerations, processing workflow. Theoretical, references.
Wreczycka, Katarzyna, Alexander Gosdschan, Dilmurat Yusuf, Bjoern Gruening, Yassen Assenov, and Altuna Akalin. “Strategies for Analyzing Bisulfite Sequencing Data,” August 9, 2017. https://www.sciencedirect.com/science/article/pii/S0168165617315936. - Bisulfite sequencing data analysis steps. Intro into methylation. Refs to packages.
Feng, Hao, Karen N. Conneely, and Hao Wu. “A Bayesian Hierarchical Model to Detect Differentially Methylated Loci from Single Nucleotide Resolution Sequencing Data.” Nucleic Acids Research 42, no. 8 (April 2014): e69–e69. https://doi.org/10.1093/nar/gku154 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005660/
Robinson, Mark D., Abdullah Kahraman, Charity W. Law, Helen Lindsay, Malgorzata Nowicka, Lukas M. Weber, and Xiaobei Zhou. “Statistical Methods for Detecting Differentially Methylated Loci and Regions.” Frontiers in Genetics 5 (2014): 324. https://doi.org/10.3389/fgene.2014.00324. - Methylation review, technology, databases, experimental design, statistics and tools for differential methylation detection, beta-binomial distribution, cell type deconvolution.
Krueger, Felix, and Simon R. Andrews. “Bismark: A Flexible Aligner and Methylation Caller for Bisulfite-Seq Applications.” Bioinformatics (Oxford, England) 27, no. 11 (June 1, 2011): 1571–72. https://doi.org/10.1093/bioinformatics/btr167. - Bismark paper. stranded and unstranded BS sequencing. Conversion of reads, genomes, best alignment strategy. https://www.bioinformatics.babraham.ac.uk/projects/bismark/
Xi, Yuanxin, and Wei Li. “BSMAP: Whole Genome Bisulfite Sequence MAPping Program.” BMC Bioinformatics 10, no. 1 (2009): 232. https://doi.org/10.1186/1471-2105-10-232. - BSMAP paper. Bisulphite conversion technology introduction, problems. BSMAP algorithm. Very good figures explaining all steps. https://code.google.com/archive/p/bsmap/
Chen, Yunshun, Bhupinder Pal, Jane E. Visvader, and Gordon K. Smyth. “Differential Methylation Analysis of Reduced Representation Bisulfite Sequencing Experiments Using EdgeR.” F1000Research 6 (November 28, 2017): 2055. https://doi.org/10.12688/f1000research.13196.1. - RRBS differential methylation analysis. Methylation intro. R code tutorial.
Venet, D., F. Pecasse, C. Maenhaut, and H. Bersini. “Separation of Samples into Their Constituents Using Gene Expression Data.” Bioinformatics (Oxford, England) 17 Suppl 1 (2001): S279-287 https://academic.oup.com/bioinformatics/article/17/suppl_1/S279/262438. - Statistical derivation of deconvolution.
Teschendorff, Andrew E., and Caroline L. Relton. “Statistical and Integrative System-Level Analysis of DNA Methylation Data.” Nature Reviews Genetics, November 13, 2017. https://doi.org/10.1038/nrg.2017.86. - Deconvolution of methylation profiles. Reference-based, reference-free, semi-reference-free. Table 1 - tools
Methylation statistics packages: Table 2 in Liu, Hongbo, Song Li, Xinyu Wang, Jiang Zhu, Yanjun Wei, Yihan Wang, Yanhua Wen, et al. “DNA Methylation Dynamics: Identification and Functional Annotation.” Briefings in Functional Genomics, 2016. https://www.ncbi.nlm.nih.gov/pubmed/27515490
R annotation and analysis packages for Illumina methylation arrays, http://www.hansenlab.org/software.html
Fast and accurate alignment of BS-Seq reads. https://github.com/brentp/bwa-meth/
https://github.com/crazyhottommy/DNA-methylation-analysis - notes on DNA methylation analysis (arrays and sequencing data)
Slides: Epigenomics introduction
Slides: Epigenomic enrichment
Slides: Hidden Markov Models intro, Chromatin segmentation
Stephen B. Baylin and Peter A. Jones, “A Decade of Exploring the Cancer Epigenome — Biological and Translational Implications,” Nature Reviews Cancer 11, no. 10 (September 23, 2011): 726–34, https://doi.org/10.1038/nrc3130. - Cancer epigenomics introduction, therapies
Kagohara, Luciane T., Genevieve L. Stein-O’Brien, Dylan Kelley, Emily Flam, Heather C. Wick, Ludmila V. Danilova, Hariharan Easwaran, et al. “Epigenetic Regulation of Gene Expression in Cancer: Techniques, Resources and Analysis.” Briefings in Functional Genomics, August 11, 2017. https://doi.org/10.1093/bfgp/elx018. - Review of epigenetic modifications, methylation, histones, chromatin states, 3D. Technologies, databases, software. Lots of references
Li, Bing, Michael Carey, and Jerry L. Workman. “The Role of Chromatin during Transcription.” Cell 128, no. 4 (February 23, 2007): 707–19. http://www.cell.com/abstract/S0092-8674(07)00109-2. - Transcription process and the role of chromatin modifications.
Zhou, Vicky W., Alon Goren, and Bradley E. Bernstein. “Charting Histone Modifications and the Functional Organization of Mammalian Genomes.” Nature Reviews. Genetics 12, no. 1 (January 2011): 7–18. https://www.nature.com/articles/nrg2905 - Histone marks review, ChIP-seq. Graphics of histone marks roles
Zhang, Z. D., A. Paccanaro, Y. Fu, S. Weissman, Z. Weng, J. Chang, M. Snyder, and M. B. Gerstein. “Statistical Analysis of the Genomic Distribution and Correlation of Regulatory Elements in the ENCODE Regions.” Genome Research 17, no. 6 (June 1, 2007): 787–97. https://doi.org/10.1101/gr.5573107. - ENCODE pilot project analysis. Non-random location of regulatory elements. Enrichment in TSSs, not in the middle or end of transcription sites. PCA and biplot representation of interrelatedness among TFs and histone marks, clustering.
Fu, Audrey Qiuyan, and Boris Adryan. “Scoring Overlapping and Adjacent Signals from Genome-Wide ChIP and DamID Assays.” Molecular BioSystems 5, no. 12 (December 2009): 1429–38. http://pubs.rsc.org/en/content/articlelanding/2009/mb/b906880e
Huen, David S., and Steven Russell. “On the Use of Resampling Tests for Evaluating Statistical Significance of Binding-Site Co-Occurrence.” BMC Bioinformatics 11 (June 30, 2010): 359. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-359 - Thorough review on permutation. Problem with different numbers of ROIs/epiregions in permutations. Regions may overlap (i.e., occur in clusters). Independent assignment is best.
McLean, Cory Y., Dave Bristor, Michael Hiller, Shoa L. Clarke, Bruce T. Schaar, Craig B. Lowe, Aaron M. Wenger, and Gill Bejerano. “GREAT Improves Functional Interpretation of Cis-Regulatory Regions.” Nature Biotechnology 28, no. 5 (May 2010): 495–501. https://doi.org/10.1038/nbt.1630. - Hypergeometric and binomial enrichment of regulatory regions in relation to genes and their ontologies.
Dozmorov, Mikhail G. “Epigenomic Annotation-Based Interpretation of Genomic Data: From Enrichment Analysis to Machine Learning.” Bioinformatics (Oxford, England) 33, no. 20 (October 15, 2017): 3323–30. https://doi.org/10.1093/bioinformatics/btx414
Eddy, Sean R. “What Is a Hidden Markov Model?” Nature Biotechnology 22, no. 10 (October 2004): 1315–16. https://doi.org/10.1038/nbt1004-1315
Eddy, S. R. “Multiple Alignment Using Hidden Markov Models.” Proceedings. International Conference on Intelligent Systems for Molecular Biology 3 (1995): 114–20. https://pdfs.semanticscholar.org/bc1e/ffe17026c1396697e4ea2b399f7a049202fc.pdf - HMMER hidden Markov models for protein sequence alignment. First paper
Schuster-Böckler, Benjamin, and Alex Bateman. “An Introduction to Hidden Markov Models.” In Current Protocols in Bioinformatics, edited by Andreas D. Baxevanis, Daniel B. Davison, Roderic D.M. Page, Gregory A. Petsko, Lincoln D. Stein, and Gary D. Stormo. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2007. https://doi.org/10.1002/0471250953.bia03as18 - Hidden Markov Models primer
L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE 77, no. 2 (February 1989): 257–86, https://doi.org/10.1109/5.18626 - Theory and statistics of Hidden Markov Models. Very detailed, thorough
“Hidden Markov Models” by bioinformaticsalgorithms.com. https://www.youtube.com/watch?list=PLQ-85lQlPqFPnk31Uut2ajVkBvlFmMtdx&v=cnvYujS9Rlk
Ernst, Jason, and Manolis Kellis. “Chromatin-State Discovery and Genome Annotation with ChromHMM.” Nature Protocols 12, no. 12 (December 2017): 2478–92. https://doi.org/10.1038/nprot.2017.124 - ChromHMM protocol. Intro about ChromHMM, other methods. Links to genome annotation models.
Hoffman, Michael M., Orion J. Buske, Jie Wang, Zhiping Weng, Jeff A. Bilmes, and William Stafford Noble. “Unsupervised Pattern Discovery in Human Chromatin Structure through Genomic Segmentation.” Nature Methods 9, no. 5 (May 2012): 473–76. https://doi.org/10.1038/nmeth.1937 - Segway - segmentation and prediction of genomic states. Using dynamic Bayesian network
Slides: ChIP-seq
Slides: Other epigenomic technologies
Park, Peter J. “ChIP–seq: Advantages and Challenges of a Maturing Technology.” Nature Reviews Genetics 10, no. 10 (October 2009): 669–80. https://doi.org/10.1038/nrg2641 - ChIP-seq review, basics of technology, alignment, peak calling, downstream analysis.
Bailey, Timothy, Pawel Krajewski, Istvan Ladunga, Celine Lefebvre, Qunhua Li, Tao Liu, Pedro Madrigal, Cenny Taslim, and Jie Zhang. “Practical Guidelines for the Comprehensive Analysis of ChIP-Seq Data.” Edited by Fran Lewitter. PLoS Computational Biology 9, no. 11 (November 14, 2013): e1003326. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326 - ChIP-seq computational workflow - sequencing depth, alignment, QC, peak calling, reproducibility (IDR), narrow/broad peaks, differential binding analysis, annotation, normalization.
Furey, Terrence S. “ChIP–seq and beyond: New and Improved Methodologies to Detect and Characterize Protein–DNA Interactions.” Nature Reviews Genetics 13, no. 12 (October 23, 2012): 840–52. https://www.nature.com/articles/nrg3306 - ChIP-seq technologies, narrow and broad peaks, DNase- and FAIRE-seq, chromatin conformation capture. Gentle introduction to the technology and terminology.
Buenrostro, Jason D, Paul G Giresi, Lisa C Zaba, Howard Y Chang, and William J Greenleaf. “Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, DNA-Binding Proteins and Nucleosome Position.” Nature Methods 10, no. 12 (December 2013): 1213–18. https://doi.org/10.1038/nmeth.2688 - ATAC-seq technology, correspondence to DNase-seq.
Zhang, Yong, Tao Liu, Clifford A. Meyer, Jérôme Eeckhoute, David S. Johnson, Bradley E. Bernstein, Chad Nusbaum, et al. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biology 9, no. 9 (2008): R137. https://doi.org/10.1186/gb-2008-9-9-r137 - MACS paper
Kharchenko, Peter V, Michael Y Tolstorukov, and Peter J Park. “Design and Analysis of ChIP-Seq Experiments for DNA-Binding Proteins.” Nature Biotechnology 26, no. 12 (December 2008): 1351–59. https://doi.org/10.1038/nbt.1508 - SPP - R package for analysis of ChIP-seq and other functional sequencing data. ChIP-seq technology, picture of strand-specific tag distribution. Strand cross-correlation as a method to decide whether tags should be included. Three types of anomalous tags. https://github.com/hms-dbmi/spp. http://compbio.med.harvard.edu/Supplements/ChIP-seq/
Rozowsky, Joel, Ghia Euskirchen, Raymond K Auerbach, Zhengdong D Zhang, Theodore Gibson, Robert Bjornson, Nicholas Carriero, Michael Snyder, and Mark B Gerstein. “PeakSeq Enables Systematic Scoring of ChIP-Seq Experiments Relative to Controls.” Nature Biotechnology 27, no. 1 (January 2009): 66–75. https://doi.org/10.1038/nbt.1518 - PeakSeq paper. https://github.com/gersteinlab/PeakSeq
Zhang, Xuekui, Gordon Robertson, Martin Krzywinski, Kaida Ning, Arnaud Droit, Steven Jones, and Raphael Gottardo. “PICS: Probabilistic Inference for ChIP-Seq.” Biometrics 67, no. 1 (March 2011): 151–63. https://doi.org/10.1111/j.1541-0420.2010.01441.x - PICS paper. https://bioconductor.org/packages/release/bioc/html/PICS.html
Zang, Chongzhi, Dustin E. Schones, Chen Zeng, Kairong Cui, Keji Zhao, and Weiqun Peng. “A Clustering Approach for Identification of Enriched Domains from Histone Modification ChIP-Seq Data.” Bioinformatics 25, no. 15 (August 1, 2009): 1952–58. https://doi.org/10.1093/bioinformatics/btp340 - SICER paper. Cut genome into non-overlapping windows and compute a score for each window based on a Poisson model. Identify “islands” vs “non-islands” by thresholding the scores and clustering windows with significant scores. For each island, compute the probability of observing the island with a given score. Constructing score distribution is involved. Excellent statistical description.
D’haeseleer, Patrik. “What Are DNA Sequence Motifs?” Nature Biotechnology 24, no. 4 (April 2006): 423–25. https://doi.org/10.1038/nbt0406-423
D’haeseleer, Patrik. “How Does DNA Sequence Motif Discovery Work?” Nature Biotechnology 24, no. 8 (August 2006): 959–61. https://doi.org/10.1038/nbt0806-959
Lawrence, C. E., S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment.” Science (New York, N.Y.) 262, no. 5131 (October 8, 1993): 208–14. http://science.sciencemag.org/content/262/5131/208.long - Gibbs sampling for alignment of multiple sequences. Statistical definitions.
Bailey, T. L., and C. Elkan. “Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers.” Proceedings. International Conference on Intelligent Systems for Molecular Biology 2 (1994): 28–36. https://www.sdsc.edu/~tbailey/papers/memeplus.tech.pdf - MM MEME algorithm to find multiple motifs of width W in a set of sequences. Background and motif models of position frequencies of letters. EM algorithm to learn motifs that maximize likelihood of the data. After the first model is found, the procedure is repeated to find other motifs.
Pei Fen Kuan et al., “A Statistical Framework for the Analysis of ChIP-Seq Data,” Journal of the American Statistical Association 106, no. 495 (September 2011): 891–903, https://doi.org/10.1198/jasa.2011.ap09706 - MOSAiCS (MOdel-based one and two Sample Analysis and Inference for ChIP-Seq) - regression framework that explicitly models mappability and GC biases for one- and two-sample. Evaluated on non-crosslinked, non-IP DNA. Negative Binomial is better fit for tag counts. https://bioconductor.org/packages/release/bioc/html/mosaics.html
Li, Qunhua, James B. Brown, Haiyan Huang, and Peter J. Bickel. “Measuring Reproducibility of High-Throughput Experiments.” The Annals of Applied Statistics 5, no. 3 (September 2011): 1752–79. https://doi.org/10.1214/11-AOAS466 - IDR - irreproducible discovery rate theoretical paper.
Gibbs sampling theory and R tutorial, http://appsilondatascience.com/blog/rstats/2017/10/09/gibbs-sampling.html
ChIP-seq pipeline by Tommy Tang, https://github.com/crazyhottommy/ChIP-seq-analysis
ChIP-seq workflow and analysis in R, https://www.bioconductor.org/help/course-materials/2017/OMRF/B4_ChIPSeq.html
ATAC-seq pipeline by Tommy Tang, https://gitlab.com/tangming2005/snakemake_ATACseq_pipeline/tree/shark
ATAC_seq Workshop. https://github.com/ThomasCarroll/ATAC_Workshop
“Analyzing ChIP-seq data with SICER”, SICER workshop with data. http://cistrome.org/~czang/chipseqdata.htm
Wei, Zheng, Wei Zhang, Huan Fang, Yanda Li, and Xiaowo Wang. “EsATAC: An Easy-to-Use Systematic Pipeline for ATAC-Seq Data Analysis.” Bioinformatics (Oxford, England), March 7, 2018. https://doi.org/10.1093/bioinformatics/bty141 - esATAC R package for full ATAC-seq data processing and analysis. https://www.bioconductor.org/packages/release/bioc/html/esATAC.html
Wang, Yong, and Nicholas E. Navin. “Advances and Applications of Single-Cell Sequencing Technologies.” Molecular Cell 58, no. 4 (May 21, 2015): 598–609. https://doi.org/10.1016/j.molcel.2015.05.005 - Single cell sequencing review, all technologies. Whole genome amplification.
Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (January 16, 2017): 14049. https://doi.org/10.1038/ncomms14049 - 10X technology. Details of each wet-lab step, sequencing, and basic computational analysis.
Bacher, Rhonda, and Christina Kendziorski. “Design and Computational Analysis of Single-Cell RNA-Sequencing Experiments.” Genome Biology 17 (April 7, 2016): 63. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0927-y - scRNA-seq analysis. Table 1 - categorized tools and description. Normalization, noise reduction, clustering, differential expression, pseudotime ordering.
Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14, no. 6 (May 15, 2017): 565–71. https://doi.org/10.1038/nmeth.4292 - single-cell RNA-seq normalization methods. Noise, sparsity. Cell- and gene-specific effects. Bulk RNA-seq normalization methods don’t work well. Overview of RPKM, TPM, TMM, DESeq normalizations. Spike-in based normalization methods.
Camara, Pablo G. “Methods and Challenges in the Analysis of Single-Cell RNA-Sequencing Data.” Current Opinion in Systems Biology 7 (February 2018): 47–53. https://doi.org/10.1016/j.coisb.2017.12.007 - Concise scRNA-seq review of computational analysis steps.
Dal Molin, Alessandra, and Barbara Di Camillo. “How to Design a Single-Cell RNA-Sequencing Experiment: Pitfalls, Challenges and Perspectives.” Briefings in Bioinformatics, January 31, 2018. https://doi.org/10.1093/bib/bby007 - single-cell RNA-seq review. From experimental to bioinformatics steps. Cell isolation (FACS, microfluidics, droplet-based), mRNA capture, RT and amplification (poly-A tails, template switching, IVT), quantitative standards (spike-ins, UMIs), transcript quantification, normalization, batch effect removal, visualization.
Brennecke, Philip, Simon Anders, Jong Kyoung Kim, Aleksandra A Kołodziejczyk, Xiuwei Zhang, Valentina Proserpio, Bianka Baying, et al. “Accounting for Technical Noise in Single-Cell RNA-Seq Experiments.” Nature Methods 10, no. 11 (September 22, 2013): 1093–95. https://doi.org/10.1038/nmeth.2645 - Single-cell noise. Technical, biological. Use spike-ins to estimate noise. Can be approximated with Poisson distribution.
Grün, Dominic, Lennart Kester, and Alexander van Oudenaarden. “Validation of Noise Models for Single-Cell Transcriptomics.” Nature Methods 11, no. 6 (June 2014): 637–40. https://doi.org/10.1038/nmeth.2930 - Quantification of sampling noise and global cell-to-cell variation in sequencing efficiency. Three noise models. 4-bases-long UMIs are sufficient for transcript quantification, improve CV. Negative binomial component of expressed genes. Statistics of transcript counting from UMIs, negative binomial distribution, noise models
Risso, Davide, Fanny Perraudeau, Svetlana Gribkova, Sandrine Dudoit, and Jean-Philippe Vert. “ZINB-WaVE: A General and Flexible Method for Signal Extraction from Single-Cell RNA-Seq Data.” BioRxiv, January 1, 2017. https://doi.org/10.1101/125112 - Zero-inflated negative binomial model for normalization, batch removal, and dimensionality reduction. Extends the RUV model with more careful definition of “unwanted” variation as it may be biological. Good statistical derivations in Methods. Refs to real and simulated scRNA-seq datasets
Ding, Bo, Lina Zheng, Yun Zhu, Nan Li, Haiyang Jia, Rizi Ai, Andre Wildberg, and Wei Wang. “Normalization and Noise Reduction for Single Cell RNA-Seq Experiments.” Bioinformatics (Oxford, England) 31, no. 13 (July 1, 2015): 2225–27. https://doi.org/10.1093/bioinformatics/btv122 - Fitting gamma distribution to log2 read counts of known spike-in ERCC controls, predicting RNA concentration from it.
Bacher, Rhonda, Li-Fang Chu, Ning Leng, Audrey P Gasch, James A Thomson, Ron M Stewart, Michael Newton, and Christina Kendziorski. “SCnorm: Robust Normalization of Single-Cell RNA-Seq Data.” Nature Methods 14, no. 6 (April 17, 2017): 584–86. https://doi.org/10.1038/nmeth.4263 - SCnorm - normalization for single-cell data. Quantile regression to estimate the dependence of transcript expression on sequencing depth for every gene. Genes with similar dependence are then grouped, and a second quantile regression is used to estimate scale factors within each group. Within-group adjustment for sequencing depth is then performed using the estimated scale factors to provide normalized estimates of expression. Good statistical methods description
Finak, Greg, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K. Shalek, Chloe K. Slichter, et al. “MAST: A Flexible Statistical Framework for Assessing Transcriptional Changes and Characterizing Heterogeneity in Single-Cell RNA Sequencing Data.” Genome Biology 16 (December 10, 2015): 278. https://doi.org/10.1186/s13059-015-0844-5 - MAST, scRNA-seq DEG analysis. CDR - the fraction of genes that are detectably expressed in each cell - added to the hurdle model that explicitly parameterizes distributions of expressed and non-expressed genes.
Poirion, Olivier B., Xun Zhu, Travers Ching, and Lana Garmire. “Single-Cell Transcriptomics Bioinformatics and Computational Challenges.” Frontiers in Genetics 7 (2016): 163. https://doi.org/10.3389/fgene.2016.00163 - single-cell RNA-seq review. Workflow, table 1 - tools, table 2 - data. Workflow sections describe each tool. References to all tools.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-Seq Data with Bioconductor.” F1000Research 5 (2016): 2122. https://doi.org/10.12688/f1000research.9501.2 - scRNA-seq analysis, from count matrix. Noise. Use of ERCCs and UMIs. QC metrics - library size, expression level, mitochondrial genes. Tests for batch effect. Finding highly variable genes, clustering. Several examples, with R code. https://bioconductor.org/help/workflows/simpleSingleCell/, https://bioconductor.org/packages/devel/bioc/html/scran.html, https://github.com/LTLA/SingleCellThoughts
Perraudeau, Fanny, Davide Risso, Kelly Street, Elizabeth Purdom, and Sandrine Dudoit. “Bioconductor Workflow for Single-Cell RNA Sequencing: Normalization, Dimensionality Reduction, Clustering, and Lineage Inference.” F1000Research 6 (July 21, 2017): 1158. https://doi.org/10.12688/f1000research.12122 - single-cell RNA-seq pipeline. Post-processing analysis in R, using SummarizedExperiment object, scone for QC, ZINB-WaVE for normalization and dim. reduction, clusterExperiment for clustering, slingshot for temporal inference, differential expression using generalized additive model
McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Wills. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-Seq Data in R.” Bioinformatics (Oxford, England) 33, no. 8 (April 15, 2017): 1179–86. https://doi.org/10.1093/bioinformatics/btw777
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. “Exploring the Single-Cell RNA-Seq Analysis Landscape with the ScRNA-Tools Database.” BioRxiv, January 1, 2018. https://doi.org/10.1101/206573 - www.scrna-tools.org - structured collection of scRNA-seq tools, from visualization, specialized tools, to full pipelines
BioC 2017 workshop on analysis of single-cell RNA-seq data: Normalization, dimensionality reduction, clustering, and lineage inference. https://github.com/fperraudeau/bioc2017singlecell
Workshops on single cell analysis. Presentation, R code with instructions for running a Docker container on Amazon EC2. https://github.com/broadinstitute/single_cell_analysis
List of software packages for single-cell data analysis, including RNA-seq, ATAC-seq, etc. https://github.com/seandavi/awesome-single-cell
Analysis of single cell RNA-seq data course, Cambridge University, UK. https://github.com/hemberg-lab/scRNA.seq.course and its bookdown version https://hemberg-lab.github.io/scRNA.seq.course/
Single cell data portal. https://portals.broadinstitute.org/single_cell
Conquer DB of scRNA-seq datasets: http://imlspenticton.uzh.ch:3838/conquer/
Ghodsi, Mohammadreza, Bo Liu, and Mihai Pop. “DNACLUST: Accurate and Efficient Clustering of Phylogenetic Marker Genes.” BMC Bioinformatics 12 (June 30, 2011): 271. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-271 - DNACLUST, metagenomics clustering of 16S sequencing. Comparison with CD-HIT and UCLUST. Recruit sequences within the radius around cluster seed. Explanation of distance, Needleman-Wunsch algorithm.
Grice, Elizabeth A., and Julia A. Segre. “The Human Microbiome: Our Second Genome.” Annual Review of Genomics and Human Genetics 13, no. 1 (September 22, 2012): 151–70. https://www.annualreviews.org/doi/abs/10.1146/annurev-genom-090711-163814 - Review on microbiome, 16S rRNA, tools, human microbiome project.
Kong, Heidi H. “Skin Microbiome: Genomics-Based Insights into the Diversity and Role of Skin Microbes.” Trends in Molecular Medicine 17, no. 6 (June 2011): 320–28. https://doi.org/10.1016/j.molmed.2011.01.013. - Skin microbiome review. Introduction about microbiome sequencing (culture and direct), 16S sequencing, other markers. Definitions. Skin-specific findings.
McDonald, Daniel, Zhenjiang Xu, Embriette R. Hyde, and Rob Knight. “Ribosomal RNA, the Lens into Life.” RNA 21, no. 4 (April 2015): 692–94. http://rnajournal.cshlp.org/content/21/4/692.full - Short review of 16S rRNA history, databases (RDP, Greengenes, SILVA), QIIME
Huson, D. H., A. F. Auch, J. Qi, and S. C. Schuster. “MEGAN Analysis of Metagenomic Data.” Genome Research 17, no. 3 (February 6, 2007): 377–86. https://genome.cshlp.org/content/17/3/377.abstract - MEGAN paper. Intro into metagenomics, sequencing, alignment analysis. LCA (lowest common ancestor) algorithm.
Kuczynski, Justin, Christian L. Lauber, William A. Walters, Laura Wegener Parfrey, José C. Clemente, Dirk Gevers, and Rob Knight. “Experimental and Analytical Tools for Studying the Human Microbiome.” Nature Reviews Genetics 13, no. 1 (December 16, 2011): 47–58. https://www.nature.com/articles/nrg3129
Hamady, Micah, and Rob Knight. “Microbial Community Profiling for Human Microbiome Projects: Tools, Techniques, and Challenges.” Genome Research 19, no. 7 (July 2009): 1141–52. http://genome.cshlp.org/cgi/pmidlookup?view=long&pmid=19383763 - Introduction, background of metagenomics. 16S vs. sequencing, experimental questions like read length, sampling depth, how to analyze.
Schloss, P. D., S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, et al. “Introducing Mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.” Applied and Environmental Microbiology 75, no. 23 (December 1, 2009): 7537–41. https://doi.org/10.1128/AEM.01541-09 - mothur paper. https://www.mothur.org/
Wang, Q., G. M. Garrity, J. M. Tiedje, and J. R. Cole. “Naive Bayesian Classifier for Rapid Assignment of RRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73, no. 16 (August 15, 2007): 5261–67. https://doi.org/10.1128/AEM.00062-07 - Naive Bayes classifier used in Ribosome Database Project (RDP). Good methods statistical description.
Sczyrba, Alexander, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen, Johannes Dröge, Ivan Gregor, et al. “Critical Assessment of Metagenome Interpretation-a Benchmark of Metagenomics Software.” Nature Methods 14, no. 11 (November 2017): 1063–71. https://doi.org/10.1038/nmeth.4458 - Critical Assessment of Metagenome Interpretation (CAMI) paper. Assessment of microbial genome assemblers, effect of sequencing depth, strain diversity, taxonomic binning, etc., recommendations on software and best practices.
List of software packages (and the people developing these methods) for microbiome (16S), metagenomics (WGS, shotgun sequencing), and pathogen identification/detection/characterization. https://github.com/stevetsa/awesome-microbes
“Analysis of Metagenomic Data (2016)” course. https://bioinformatics-ca.github.io/analysis_of_metagenomic_data_2016/
Metagenomic Assembly Workshop, https://2014-5-metagenomics-workshop.readthedocs.io/en/latest/index.html
Slides: DNA sequencing
Slides: SNPs
Slides: Heritability, Hardy-Weinberg equation
Slides: eQTLs
Slides: CNVs
1000 Genomes Project Consortium, Adam Auton, Lisa D. Brooks, Richard M. Durbin, Erik P. Garrison, Hyun Min Kang, Jan O. Korbel, et al. “A Global Reference for Human Genetic Variation.” Nature 526, no. 7571 (October 1, 2015): 68–74. https://www.nature.com/nature/journal/v526/n7571/full/nature15393.html
Bamshad, Michael J., Sarah B. Ng, Abigail W. Bigham, Holly K. Tabor, Mary J. Emond, Deborah A. Nickerson, and Jay Shendure. “Exome Sequencing as a Tool for Mendelian Disease Gene Discovery.” Nature Reviews. Genetics 12, no. 11 (September 27, 2011): 745–55. https://www.nature.com/articles/nrg3031 - Exome sequencing technology, limitations, use for diagnostics, family studies.
McCarthy, Davis J, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, The WGS500 Consortium, Jean-Baptiste Cazier, and Peter Donnelly. “Choice of Transcripts and Software Has a Large Effect on Variant Annotation.” Genome Medicine 6, no. 3 (2014): 26. https://genomemedicine.biomedcentral.com/articles/10.1186/gm543 - SNP annotation depends on transcripts and software. Types of SNPs. Ambiguous annotations
Pabinger, S., A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski. “A Survey of Tools for Variant Analysis of Next-Generation Genome Sequencing Data.” Briefings in Bioinformatics 15, no. 2 (March 1, 2014): 256–78. https://doi.org/10.1093/bib/bbs086 - SNP calling and analysis tools overview. Germline, somatic, CNV, SV detection. Variant annotation tools.
MacArthur, D. G., T. A. Manolio, D. P. Dimmock, H. L. Rehm, J. Shendure, G. R. Abecasis, D. R. Adams, et al. “Guidelines for Investigating Causality of Sequence Variants in Human Disease.” Nature 508, no. 7497 (April 24, 2014): 469–76. https://doi.org/10.1038/nature13127 - Definitions and guidelines to define pathogenicity of SNPs
Quinlan, Aaron R., and Ira M. Hall. “Characterizing Complex Structural Variation in Germline and Somatic Genomes.” Trends in Genetics: TIG 28, no. 1 (January 2012): 43–53. https://www.sciencedirect.com/science/article/pii/S0168952511001685 - SV review, types, how generated, technologies for detection (Box 1. depth, paired-end, split-read)
Alkan, Can, Bradley P. Coe, and Evan E. Eichler. “Genome Structural Variation Discovery and Genotyping.” Nature Reviews Genetics 12, no. 5 (May 2011): 363–76. https://www.nature.com/articles/nrg2958 - CNV, structural detection review
Zhang, Feng, Wenli Gu, Matthew E. Hurles, and James R. Lupski. “Copy Number Variation in Human Health, Disease, and Evolution.” Annual Review of Genomics and Human Genetics 10 (2009): 451–81. https://doi.org/10.1146/annurev.genom.9.081307.164217 - CNV review, mechanisms, analytical difficulties, roles in individual diseases
Trost, Brett, Susan Walker, Zhuozhi Wang, Bhooma Thiruvahindrapuram, Jeffrey R. MacDonald, Wilson W. L. Sung, Sergio L. Pereira, et al. “A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data.” American Journal of Human Genetics 102, no. 1 (January 4, 2018): 142–55. https://doi.org/10.1016/j.ajhg.2017.12.007 - CNV (>1kb) read-depth detection workflow, from experimental considerations to computational analysis. HuRef (NA12878) genome, supplemental files contain CNV genomic coordinates. CNVnator and ERDS perform optimally. Tools comparison, links to resources.
Pirooznia, Mehdi, Melissa Kramer, Jennifer Parla, Fernando S. Goes, James B. Potash, W. Richard McCombie, and Peter P. Zandi. “Validation and Assessment of Variant Calling Pipelines for Next-Generation Sequencing.” Human Genomics 8 (July 30, 2014): 14. https://humgenomics.biomedcentral.com/articles/10.1186/1479-7364-8-14 - SNP pipeline benchmarking, GATK vs. samtools. GATK is the best. Supplementary - actual commands to run. http://metamoodics.org/wiki/index.php?title=Whole_Exome_Sequencing_Analysis_Pipeline
Purcell, Shaun, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julian Maller, et al. “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses.” American Journal of Human Genetics 81, no. 3 (September 2007): 559–75. https://doi.org/10.1086/519795 - PLINK - a tool for whole-genome association studies data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. Details of each task. gPLINK - graphical user interface integrated with HaploView. http://zzz.bwh.harvard.edu/plink/
Shabalin, Andrey A. “Matrix EQTL: Ultra Fast EQTL Analysis via Large Matrix Operations.” Bioinformatics (Oxford, England) 28, no. 10 (May 15, 2012): 1353–58. https://doi.org/10.1093/bioinformatics/bts163 - eQTL detection using linear regression/ANOVA models. Genotype by gene expression matrix multiplication to calculate model statistics. Handling of covariates, correlation structure, FDR correction, handling of cis/trans qtls. http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/
Van der Auwera, Geraldine A., Mauricio O. Carneiro, Chris Hartl, Ryan Poplin, Guillermo Del Angel, Ami Levy-Moonshine, Tadeusz Jordan, et al. “From FastQ Data to High Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline.” Current Protocols in Bioinformatics 43 (2013): 11.10.1-33. https://doi.org/10.1002/0471250953.bi1110s43
Reed, Eric, Sara Nunez, David Kulp, Jing Qian, Muredach P. Reilly, and Andrea S. Foulkes. “A Guide to Genome-Wide Association Analysis and Post-Analytic Interrogation.” Statistics in Medicine 34, no. 28 (December 10, 2015): 3769–92. https://doi.org/10.1002/sim.6605 - GWAS R tutorial. Workflow details, file types, filtering steps, PCA, post-analysis visualization. Data from https://www.mtholyoke.edu/courses/afoulkes/Data/GWAStutorial/. Support site http://www.stat-gen.org/
Notes on whole exome and whole genome sequencing analysis. https://github.com/crazyhottommy/DNA-seq-analysis
Thousand Variant Callers Project Github Repo, links and short descriptions of different genomic variant callers. https://github.com/deaconjs/ThousandVariantCallersRepo
“Wrangling genomics” SNP calling pipeline. https://github.com/datacarpentry/wrangling-genomics
“Variant Annotation Workshop with FunciVAR, StateHub and MotifBreakR”, https://www.simoncoetzee.com/bioc2017.html
“GWAS tutorial”, https://poissonisfish.wordpress.com/2017/10/09/genome-wide-association-studies-in-r/, https://github.com/monogenea/GWAStutorial
Off-label workflow to simply call differences in two samples. https://gatkforums.broadinstitute.org/gatk/discussion/11315/off-label-workflow-to-simply-call-differences-in-two-samples
Basic walk-throughs for alignment and variant calling from NGS sequencing data, by Erik Garrison. https://github.com/ekg/alignment-and-variant-calling-tutorial