Welcome! The course is offered through the Department of Biostatistics at Virginia Commonwealth University. Course support web pages: http://mdozmorov.github.io/BIOS692

Course Logistics

Course number: BIOS 692-802

Instructor: Mikhail Dozmorov, Ph.D., mikhail dot dozmorov at vcuhealth dot edu

Units/Credits: 1

Lectures: Monday-Thursday, June 5-8, 2017, 9:00am-10:50am, One Capitol Square, room 5009, 830 East Main Street, RVA

Course Description

Reproducibility is the cornerstone of science. In data science, reproducibility aims to delegate the majority of scientific computations to automated workflows. Such automation minimizes the errors and irreproducibility inherent in the point-and-click approach and makes it easier for others to trace and reconstruct analytical steps. Although the importance of computational reproducibility is commonly recognized, reproducible practices are still not widely adopted, in part because there is little systematic knowledge about the available tools for reproducible research.

This course will cover the philosophy and practical aspects of reproducible research in data science. The goal is to familiarize students with the best practices and computational tools that will have immediate and long-term benefits in the everyday work of a data scientist.

Expected Learning Outcomes

After successful completion of this course, students will be able to:

  • Know the main steps and best practices of reproducible research in data science
  • Use command line and other software tools to organize data and analysis
  • Effectively communicate the outcome of data analysis using literate programming and visualizations
  • Keep a history of changes using a version control system
  • Facilitate collaboration through code, data and results sharing


Who should take this class?

Both undergraduate and graduate students are welcome to take the course. Auditing is possible, contingent on class capacity; contact the instructor to arrange it.

Class format

This course will rely mainly on in-class participation, followed by assigned reading and practice with the software tools.

There will be four connected modules, each focusing on an important area of computational reproducible research. Each module will be presented in a traditional seminar format combined with real-life demos of practical tasks. The students will learn about reproducible research actively, by doing it.

Required Textbook

None. Instead, a list of relevant reading will be provided.

Grading Procedure

Students are expected to attend every class and be on time. Participation counts toward the final grade. Homework will be assigned for each topic and also counts toward the final grade.


Steps in reproducible research

  1. Overview Slides, References
  2. Linux/bash basics Slides, References
  3. Text manipulation with grep, awk, sed, vim Slides, References

Example file: chromInfo.txt.gz (0.8 KB)

Automating everything

  1. Best practices of data/code organization Slides, References
  2. Make/Makefiles Slides, References
  3. RStudio, R functions & packages Slides, References

Example makefile: Makefile

Reproducible reports

  1. Literate programming with Markdown/knitr Slides, References
  2. Data manipulation and visualization in R Slides, References
  3. Presentations and web publishing Slides, References

Version control, sharing, and collaboration

  1. Git/GitHub Slides, References
  2. Licenses Slides, References
  3. Data/code sharing repositories Slides, References

