Reproducible research tools

Course: BIOS 691-001

Instructor: Mikhail Dozmorov, first dot last at vcuhealth dot edu

Credit: 1 credit-hour, 8 hours

Time: 9:00 am to 12:00 pm

Dates: June 11 to June 14, 2018

Location: One Capitol Square, Rm 5009

Office hours: By appointment

Course topics

  1. Overview. Steps in reproducible research
  2. Unix/command line basics, scripting
  3. Text manipulation with regular expressions, grep, awk, sed, vim
  4. Command line automation with Makefiles
  5. Remote computing, SSH
  6. Docker containers
  7. Amazon Elastic Compute Cloud (Amazon EC2)
  8. Best practices of data/code organization
  9. RStudio, R functions, and packages
  10. Literate programming with R Markdown/knitr, bibliography management with BibTeX
  11. Reproducible presentations and web publishing
  12. Data manipulation (dplyr) and visualization (ggplot2) in R, tidyverse
  13. Version control using Git/GitHub
  14. Data/code sharing repositories, licenses and copyright

Course Description

Reproducibility is the cornerstone of science. In data science, reproducibility means delegating the majority of scientific computations to automated workflows. Such automation minimizes the errors and irreproducibility of the point-and-click approach and makes it easier for others to trace and reconstruct analytical steps. Although the importance of computational reproducibility is commonly recognized, reproducible practices are still not widely adopted, in part because there is little systematic knowledge about the available tools for reproducible research.

This workshop-style course will cover methods, tools, and software for reproducibly managing, manipulating, analyzing, and visualizing large-scale biological data. The goal is to familiarize students with best practices and computational tools that will have immediate and long-term benefits in the everyday work of a data scientist.

This course is not a statistics class. It is a data science-oriented course. Some general knowledge of statistics and study design is helpful but isn’t required.

Expected Learning Outcomes

After successful completion of this course, students will be able to:

Prerequisites

R packages

Install the core packages listed below. If install.packages() generates errors, read the error messages carefully: most likely some dependency packages are missing. Install those dependencies first, then install the core packages again.

install.packages(c("dplyr", "readr", "tidyr", "ggplot2", "knitr", "rmarkdown", "shiny", "shinythemes", "lubridate"))
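If some of these packages are already installed, it may be convenient to install only the ones that are missing. The following is a minimal sketch of that idea, not part of the course materials: the helper name missing_packages() is hypothetical, and it treats a package as missing if its namespace cannot be loaded, which also flags packages that are installed but broken by a missing dependency.

```r
# Core packages named in this syllabus.
core_packages <- c("dplyr", "readr", "tidyr", "ggplot2", "knitr",
                   "rmarkdown", "shiny", "shinythemes", "lubridate")

# Hypothetical helper (an assumption, not from the course repository):
# return the subset of `pkgs` whose namespace cannot be loaded.
# Note that requireNamespace() loads each package's namespace as a side effect.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

# Only attempt installation in an interactive session, where a CRAN mirror
# can be chosen if none is configured.
if (interactive()) {
  to_install <- missing_packages(core_packages)
  if (length(to_install) > 0) {
    install.packages(to_install)
  } else {
    message("All core packages are already installed.")
  }
}
```

Running missing_packages(core_packages) again after installation shows which packages, if any, still fail to load; their error messages usually name the dependency that needs attention.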

Who should take this class?

Both undergraduate and graduate students are welcome to take the course. Auditing is possible, contingent on class capacity. Contact the instructor for auditing arrangements.

Class format

This course will rely mainly on in-class participation, followed by assigned reading and practice with the software tools.

There will be four connected modules, each focusing on an important area of computational reproducible research. Each module will be presented in a traditional seminar format combined with real-life demos of practical tasks. The students will learn about reproducible research actively, by doing it.

Required Textbook

None. Instead, a list of relevant reading will be provided.

Grading Procedure

Students are expected to attend every class and be on time. Participation counts toward the final grade. Homework will be assigned for each topic and also counts toward the final grade.

Source code

This course is available on GitHub: https://github.com/mdozmorov/BIOS691.2018

Acknowledgements

Previous versions of this course