class: center, middle, inverse, title-slide

# An Overview of Computational Reproducible Research and Tools
## Must know for modern data science
### Mikhail Dozmorov
### Virginia Commonwealth University
### 2021-11-09

---

<!--
## Reproducibility and scientific progress

- Science is the systematic enterprise of gathering knowledge about the universe and organizing and condensing that knowledge into testable laws and theories.

- The success and credibility of science are anchored in the willingness of scientists to **expose their ideas and results to independent testing and replication by other scientists.**

.small[ What is science? From http://www.aps.org/policy/statements/99_6.cfm ]
-->

## What is reproducible research?

- Reproducibility
- Replicability
- Repeatability
- Reliability
- Robustness
- Generalizability

**Flavors of reproducibility**

- Empirical reproducibility
- **Computational reproducibility**
- Statistical reproducibility

---
## What is reproducible research?

Reproducible research is the ultimate standard for strengthening scientific evidence by independent:

- Investigators
- Data
- Analytical methods
- Laboratories
- Instruments

---
## Irreproducibility estimates range from ~51% to 89%

.center[ <img src="img/irreproducibility_bar.png" height = 450> ]

.small[
- Leonard Freedman, Iain Cockburn, and Timothy Simcoe, “[The Economics of Reproducibility in Preclinical Research](https://doi.org/10.1371/journal.pbio.1002165).” PLOS Biol 2015
]

---
## Cost of irreproducibility

.center[ <img src="img/irreproducibility_chart.png" height = 450> ]

.small[
- Leonard Freedman, Iain Cockburn, and Timothy Simcoe, “[The Economics of Reproducibility in Preclinical Research](https://doi.org/10.1371/journal.pbio.1002165).” PLOS Biol 2015
]

---
## NIH focus on openness

.center[ <img src="img/NIH.png" height = 500> ]

---
## NIH focus on openness

- **Rigor and Reproducibility** https://grants.nih.gov/reproducibility/index.htm
- **Public Access Policy** https://publicaccess.nih.gov/
-
**Data Sharing Policies** https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html
- **Model Organism Sharing Policy** https://grants.nih.gov/grants/policy/model_organism/

---
## NSF stance on openness

.center[ <img src="img/NSF.png" height = 450> ]

.small[
- [NSF Data Management Plan](https://www.nsf.gov/bfa/dias/policy/dmp.jsp)
- [NSF Scientific Integrity Policy](https://www.nsf.gov/bfa/dias/policy/si/index.jsp)
]

---
## Steps in reproducible research

**The most important is the mindset, when starting, that the end product will be reproducible.**
.pull-right[– Keith Baggerly]

- Experimental design
- Data generation
- Data analysis
- Results interpretation
- Dissemination of results

---
## Common approach - write report around results

- Point and click approach: Use MS Excel for data entry, cleaning, preparation, and possibly statistical analysis.

.center[ <img src="img/excel.png" height = 280> ]

.small[ Zeeberg BR et al. “[Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics](https://doi.org/10.1186/1471-2105-5-80)” _BMC Bioinformatics_ 2004 ]

---
## Common approach - write report around results

**Problems**

- With point-and-click, there’s no way to record/save the steps that generated the (copy/pasted) results.
- Data files are kept separately from the analysis code, and from reports.
- After modifications of one of the files, it becomes unclear which version corresponds exactly to the reported results.
- Every time something changes, you have to regenerate the figures/results/reports by hand – very time consuming.

---
## Data organization in spreadsheets

- Explicitly import text data as text, numeric as numeric, etc.
- One thing in a cell (avoid comments, color coding, etc.)
- Choose comprehensive variable names. Use "_" instead of spaces.
Be consistent
- Save the data as CSV, comma-separated values
- Avoid calculations

.center[ <img src="img/excel_stats.png" height = 140> ]

.small[ http://kbroman.org/dataorg/ ]

---
## Better approach - write report that generates results

- The report is automated via code
- Data is attached to the well-documented code
- History of any changes should be preserved
- The final report should be self-sufficient and reproducible with a single command

---
class: center, middle

# What can we do

.center[ <img src="img/tools.png" height = 280> ]

### Tools to enhance reproducibility

---
## Know your Unix

- Unix is a family of operating systems and environments that exploits the power of linguistic abstractions to perform tasks

- Unix is not an acronym; it is a pun on "Multics". Multics was a large multi-user operating system that was being developed at Bell Labs shortly before Unix was created in the early '70s. Brian Kernighan is credited with the name

- Most data science is done in Unix-like environments

<center><img src="img/unix_plate.jpg" height="200px" /></center>

.small[ http://www.read.seas.harvard.edu/~kohler/class/aosref/ritchie84evolution.pdf ]

---
## Unix systems

- Three common types of laptop/desktop operating systems: Windows, Mac, Linux.
- Mac and Linux are both Unix-like.
- What that means for us: Unix-like operating systems are equipped with "shells" that provide a command line user interface.

<center><img src="img/itu-unix-linux-1024x576.jpg" height="370px" /></center>

<!--
## History of Unix

- Initial file system, command interpreter (shell), and process management started by Ken Thompson.
- File system and further development from Dennis Ritchie, as well as Doug McIlroy and Joe Ossanna.
- Vast array of simple, dependable tools that each do one simple task.
<center><img src="img/Ken_Thompson_(sitting)_and_Dennis_Ritchie_at_PDP-11_(2876612463).jpg" height="270px" /></center>

.small[ Ken Thompson (sitting) and Dennis Ritchie working together at a PDP-11 ]
-->

---
## Philosophy of Unix

- Vast array of simple, dependable tools.
- Each does one simple task, and does it really well.
- By combining these tools, one can conduct rather sophisticated analyses.
- The Linux help philosophy: "RTFM" (Read the Fine Manual).

---
## Shell, aka command line, aka terminal

- The shell is an interactive environment with a set of commands to initiate and direct computations
- The shell encloses the complexity of the OS, hence the name
- You type in commands
- The shell executes them

<center><img src="https://www.howtogeek.com/wp-content/uploads/2017/03/img_58c0939c2d487.png" height="370px" /></center>

.small[ https://en.wikipedia.org/wiki/Unix_shell ]

---
## Shell, aka command line, aka terminal

- The Bourne shell (`sh`) is a shell, or command-line interpreter, for computer operating systems
- Developed by Stephen Bourne at Bell Labs, 1976
- `bash` (the Bourne-Again shell) was later developed for the GNU project and incorporates features from the Bourne shell, `csh`, and `ksh`.
It is meant to be POSIX-compliant

https://en.wikipedia.org/wiki/Stephen_R._Bourne

---
## Getting to the command line on Windows

- [Awesome WSL - Windows Subsystem for Linux](https://github.com/sirredbeard/Awesome-WSL) - detailed guide for working on Linux in Windows
- **Cygwin**, http://www.cygwin.com/
- **Git Bash**, https://git-for-windows.github.io/
- Boot from a CD or USB (search for "linux usb")
- Install a whole Linux system as a virtual machine in **VirtualBox**, https://www.virtualbox.org/
- Remote access, SSH, [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/), [MobaXterm](https://mobaxterm.mobatek.net/)

---
## Interacting with shell

- Most commands take additional arguments that fine-tune their behavior
- If you don't know what a command does, use the command

```bash
man <command>
# E.g., man grep
```

- Press `q` to quit the `man` page viewer
- Most often, you’ll use `<command> -h` or `<command> --help`
- Some commands output help if executed without any arguments

---
## File system: Full vs. relative paths

<center><img src="img/file_paths.png" height="330px" /></center>

- `cd /` - go to the root directory
- `cd /usr/home/jack/bin` - go to the user’s sub-directory
- `cd ..` - go to the upper level directory
- `cd`, or `cd ~` - go to the user’s home directory
- `cd -` - go to the last visited directory

---
## Orienting in the filesystem

- `pwd` - print working directory
- `ls` - list files in the current directory
- `ls -1` - list files in _one_ column
- `ls -lah` - list files in `l`ong, `h`uman readable format, include `a`ll content, user, owner, permissions

---
## Creating, moving, copying, and removing files

- `touch <file>` - creates an empty file
- `nano <file>` - edit it
- `mkdir <dirname>` - creates a directory
- `cp <source_file> <target_file>` - copy a file to another location/file
- `mv <source_file> <target_file>` - move a file
- `rm <file>` - remove a file.
If multiple files are provided, removes all of them
- `rm -r <dirname>` - recursive removal (deletes a directory)

---
## Other essential commands

|           |          |
|-----------|----------|
| head/tail | cut      |
| for       | comm     |
| sort      | echo     |
| uniq      | basename |
| wc        | dirname  |
| tr        | history  |
| grep      | which    |
| join      | who      |
| kill      | grep     |
| tar       | seq      |
| gzip      | paste    |

---
## Scripts - plain text files with shell commands

- A script is a plain text file, conventionally with a `.sh` extension. It contains a list of shell commands executed by an interpreter
- Shebang (`#!`) defines the interpreter on the first line
    - `#!/bin/bash` - commands interpreted by bash
    - `#!/usr/bin/python` - interpreted by Python

```bash
#!/bin/bash
echo "Hello, world!"
```

- Running a script: `./hello_world.sh` (after `chmod +x hello_world.sh`) or `bash hello_world.sh`

---
class: center, middle

# Remote computing

### SSH (Secure Shell protocol)

---
## What is SSH?

- “SSH is a protocol for secure remote login and other secure network services over an insecure network.” – RFC 4251
- Secure communication channel between two computers
- Uses strong encryption and authentication to provide confidentiality and authenticity of the data.
- Many uses other than remote shell (e.g., copying files from/to a remote computer).

.small[https://datatracker.ietf.org/doc/html/rfc4251]

---
## Some implementations

- `OpenSSH` – common on UNIX systems, open-source implementation derived from the last free SSH release, a suite of networking utilities.
- `PuTTY` – client only, Windows.
- `MobaXterm` - Enhanced terminal for Windows with X11 server, tabbed SSH client, network tools and much more.
.small[ https://www.openssh.com/

https://www.chiark.greenend.org.uk/~sgtatham/putty/

https://mobaxterm.mobatek.net/ ]

<!--
## Layering of SSH Protocols

- **Transport Layer Protocol**
    - Provides server authentication, confidentiality, and integrity
- **User Authentication Protocol**
    - Authenticates the client-side user to the server
- **Connection Protocol**
    - Multiplexes the tunnel into logical channels
- New protocols can coexist with the existing ones
-->

---
## Basic use

```bash
ssh <ssh_server_name>
ssh -l <username> <ssh_server_name>
# Or
ssh <username>@<ssh_server_name>
# E.g., with X11 forwarding:
# ssh -X dozmorovm@merlot.bis.vcu.edu
# Run a single command on the server
ssh <ssh_server_name> <command_to_run>
```

---
## Encryption concepts

**Public and private keys** for passwordless network communications.

- Both public and private keys are generated by you – they are yours.
- A public key is a “lock” that can only be opened with the corresponding private key.
- The public key can be placed on any other computer you want to connect to.
- The private key stays private on any machine you’ll be connecting from.
.center[ <img src="img/ssh1.png" height = 250> ]

---
## Where SSH stores User Configuration Files

- `~/.ssh/`
    - `id_*` - private authentication keys
    - `id_*.pub` – public authentication keys
    - `known_hosts` – list of known public host keys
    - `authorized_keys` – list of allowed public authentication keys

.center[ <img src="img/ssh2.png" height = 300> ]

---
## Getting public and private keys

Generate your public and private keys

- First, check if you already have them,

```bash
ls -al ~/.ssh
```

- If not, generate,

```bash
ssh-keygen -t rsa -b 4096 -C your_email@example.com
```

---
## Add public key to any machine

- Copy your public key `~/.ssh/id_rsa.pub` to a remote machine
- Add the content of your public key to `~/.ssh/authorized_keys` on the remote machine
- Make sure the `~/.ssh/authorized_keys` has the right permissions (`read` + `write` for user, `nothing` for group and all)

```bash
# Append the local public key to the remote authorized_keys, then restrict its permissions
cat ~/.ssh/id_rsa.pub | ssh user@remote.machine.com 'mkdir -p .ssh; cat >> .ssh/authorized_keys; chmod 600 .ssh/authorized_keys'
```

.small[ http://mah.everybody.org/docs/ssh ]

---
## Secure copy (`scp`) files over the network

**scp**: securely copy a file from one computer to another.

- Use `scp` to securely transfer files between two Unix computers.
- `scp` encrypts both the file and any passwords exchanged.

The syntax for the scp command is:

```bash
scp [options] username1@source_host:directory1/filename1 \
    username2@destination_host:directory2/filename2
```

- Use `-r` (recursive) option to copy a directory

---
class: center, middle

# Installing software

### Conda

---
## Conda environment

Package, dependency and environment management for any language: Python, R, Java, and more. Install "miniconda" for your OS.

- **Environment** - a place isolated from the operating system where one can install software without risking conflicts with system software.
- If something goes wrong in the environment, it is easy to delete it and start again. Think about environments like subfolders (they are).
By default, you are in the "base" environment

https://docs.conda.io/en/latest/miniconda.html

---
## Conda environment

Making new environment:

```bash
conda create -n new-env
```

Choose name `new-env` reflecting what you plan to install, e.g.,

```bash
conda create -n UCSC
```

https://docs.conda.io/

---
## Conda environment

Install any programming language within the environment (add `-y` argument to answer Yes to all prompts):

```bash
conda create -n new-env python=3.9  # an environment with Python 3.9
conda create -n new-env r           # or, an environment with R
```

Activate/Deactivate environment:

```bash
conda activate new-env
# source activate new-env
conda deactivate
```

Note the change in command prompt, when in an environment, e.g. `(jupyterlab) mdozmorov@work:$`

---
## Conda environment

Within an environment, install software per instructions. Also, use `conda` itself to install software:

```bash
conda install -c bioconda ucsc-bigwigtowig
```

Search for "conda install 'software name'"

Create as many environments as you want. List them as:

```bash
conda info --envs
conda env list
```

Delete an environment.

```bash
conda remove --name new-env --all
```

---
class: center, middle

# Automate every step using scripts/pipelines

.center[ <img src="img/pipeline.png" height = 280> ]

### Pipelines are cool

---
## Makefiles - reproducibility on the command line

- I started with **Make**, a tool which controls the workflow of generating target/result files from the dependencies/source files.
- Automates/documents a workflow.
- Intelligently handles the dependencies among data files, code.
- Accounts for the updates in data, code.
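---
## Makefile example

A minimal sketch of how the points above could look in practice, not a recommended project layout. The file names (`data.csv`, `analysis.sh`, `report.sh`) are hypothetical placeholders. Recipes are written after `;` on the rule line; the more common layout puts each recipe on the next line, indented with a TAB.

```make
# Hypothetical file names: data.csv, analysis.sh, and report.sh are placeholders

# Default target: typing `make` builds the report
all: report.txt

# results.txt is rebuilt only if data.csv or analysis.sh is newer than it
results.txt: data.csv analysis.sh ; bash analysis.sh data.csv > results.txt

# report.txt depends on results.txt, so a change in the data propagates here
report.txt: results.txt report.sh ; bash report.sh results.txt > report.txt

# Remove generated files to start from scratch
clean: ; rm -f results.txt report.txt

# all and clean are commands, not files
.PHONY: all clean
```

With these rules, `make` rebuilds `results.txt` and `report.txt` only when a file they depend on is newer than the target, and `make -n` prints the commands without running them.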
---
## Modern workflow pipelines

- **Snakemake** workflow management system
- **Nextflow** - data-driven computational pipeline
- **WDL** - the Workflow Description Language
- **CWL** - Common Workflow Language

.small[ https://snakemake.readthedocs.io/

https://www.nextflow.io/

https://openwdl.org/

https://www.commonwl.org/ ]

---
## Snakefile example

```snakemake
rule all:
    input:
        "ERR458493_fastqc.html",
        "ERR458493_fastqc.zip"

rule make_fastqc:
    input:
        "ERR458493.fastq.gz"
    output:
        "ERR458493_fastqc.html",
        "ERR458493_fastqc.zip"
    shell:
        "fastqc ERR458493.fastq.gz"
```

.small[ https://github.com/ngs-docs/2020-GGG298/tree/master/Week4-snakemake-for-workflows ]

---
class: center, middle

# Create reproducible reports

.center[ <img src="img/knitr_wide.png" height = 280> ]

### Literate programming for humans and computers

---
## Automate everything

- R – free/open source programming language
- Runs on Windows, Mac, and Linux
- Extensible with a very large collection of actively developed packages
- Excellent graphics & report-creating capabilities

.center[ <img src="img/R.png" height = 280> ]

---
## R is reimagined with RStudio

.center[ <img src="img/Rstudio.png" height = 480> ]

.small[https://www.rstudio.com/products/rstudio/download/]

---
## Self-documenting code

.pull-left[
- A report containing a stream of **text and code** chunks
- Each code chunk loads data, computes results, shows figures
- Each text chunk explains how the code chunks work
- The resulting report is human- and machine-readable
]

.pull-right[
.center[ <img src="img/knuth.png" height = 380> ]

.small[Donald E. Knuth "[Literate Programming](https://doi.org/10.1093/comjnl/27.2.97)" _The Computer Journal_, 01 January 1984]
]

---
## R Markdown basics

- **Markdown** – a lightweight markup language with plain-text formatting syntax. Easily converted to HTML.
Developed in 2004

.center[ <img src="img/rmarkdown1.png" height = 450> ]

---
## R Markdown basics

.center[ <img src="img/rmarkdown2.png" height = 380> ]

.small[ [The R Markdown Cheat sheet](http://shiny.rstudio.com/articles/rm-cheatsheet.html)

[R Markdown Reference Guide PDF](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf) ]

---
## Literate programming with knitr

- **knitr** - an R package for dynamic report generation from documents written in R Markdown
- Supports RMarkdown, LaTeX, MathJax. PDF, HTML, DOCX output
- Developed in 2012 by Yihui Xie

.center[ <img src="img/knitr.png" height = 280> ]

.small[ http://yihui.name/knitr/ ]

---
## Literate programming with knitr

- Code chunks separated from text. Code inline with text allowed
- Graphics, code-generated and external images
- Tables, code-generated and in RMarkdown format
- Caching long code chunks
- Code chunks and results output fully customizable

---
## Literate programming with knitr

- Mix markdown with code

.center[ <img src="img/rmarkdown3.png" height = 450> ]

---
## Literate programming with knitr

- "Knit" a report with one command

.center[ <img src="img/rmarkdown4.png" height = 230> ]

.center[ <img src="img/rmarkdown5.png" height = 230> ]

---
## Reproducibility for other languages

**JupyterLab notebooks** (formerly IPython/Jupyter notebooks)

- Combine text, equations, code, and graphics
- Markdown support
- Multiple languages support (>40)

.center[ <img src="img/jupyter.png" height = 280> ]

.small[ https://jupyter.org/ ]

---
class: center, middle

# Track changes with version control (GitHub)

.center[ <img src="img/github_logo.png" height = 280> ]

### Collaborate with elegance

---
## Keeping history of changes

.center[ <img src="img/final.png" height = 450> ]

.small[ http://www.phdcomics.com/comics/archive.php?comicid=1531 ]

---
## Version control – what you did and when

**Git and GitHub – version control system**

- Each project stored in its own repository
- Keep history of changes – track what you
did
- Ability to go back if something breaks
- Branch out, go creative, then merge or revert the changes
- Collaborate through merging changes from multiple people

---
## Version control – what you did and when

- **Git** is a command line tool
- **GitHub.com** is a web-based storage for your project repositories
- RStudio has Git integration
- <span style="font-family: monospace; font-weight:bold; color:red;">git clone ...</span> – clone an existing repository
- <span style="font-family: monospace; font-weight:bold; color:red;">git add</span> – add a file to version control system
- <span style="font-family: monospace; font-weight:bold; color:red;">git commit</span> – make a snapshot of current changes
- <span style="font-family: monospace; font-weight:bold; color:red;">git push/pull</span> – send/get changes to/from GitHub

---
class: center, middle

# Reproducibility for the whole project

.center[ <img src="img/docker-cloud.png" height = 280> ]

### Docker and Cloud computing

---
## Reproducibility for the whole project

**Docker** - an envelope (or container) for the whole project environment, comparable to a lightweight virtual machine

- OS-independent, portable application images
- Preserves all application dependencies
- Easy to distribute

.center[ <img src="img/docker.png" height = 280> ]

.small[ https://www.docker.com/ ]

---
## Reproducibility: tools that enable reproducibility of analysis results

- **Docker**
    - Packages software with all necessary components to enable reproducible deployment in different computing environments
    - An open-source project to easily create lightweight, portable, self-sufficient images
    - A tool for creating a layered filesystem; each layer is versioned and can be shared across running instances, making for much more lightweight deployments
    - A company behind the project, as well as a site called the "Docker Hub" for sharing images
    - Docker **is not** a virtual machine
- **Singularity**
    - Similar to Docker but designed for scientific software
running in high-performance computing environments

.small[ https://www.docker.com/

https://sylabs.io/guides/3.5/user-guide/ ]

---
## Docker terminology

- **Image** - The blueprints of applications which form the basis of containers. Get them with `docker pull`
    - A snapshot of a filesystem at a certain point in time
    - The image is composed of layers that progressively stack on top of each other
    - Layers can be shared by running instances of an image

```bash
$ docker pull ubuntu
Pulling repository ubuntu
c4ff7513909d: Download complete
511136ea3c5a: Download complete
1c9383292a8f: Download complete
...
```

---
## Docker terminology

- **Container** - runtime instance of an image plus a read/write layer. Create them with `docker run`, list running containers with `docker ps`
    - A Docker container can be seen as a computer inside your computer
    - It can be saved and sent to your friends; when they start this computer and run your code, they should get the same results as you did

---
## Docker terminology

- **Dockerfile** - build script that defines:
    - an existing image as the starting point
    - a set of instructions to augment that image (each of which results in a new layer in the file system)
    - meta-data such as the ports exposed
    - the command to execute when the image is run

```bash
FROM ubuntu:18.04
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

LABEL authors="Nicolas Servant" \
      description="Docker image containing all requirements for the HiC-Pro pipeline"

## Install system tools
RUN apt-get update \
  && apt-get install -y build-essential \
  wget \
  unzip \
  bzip2 \
  gcc \
  g++ && apt-get clean
```

---
## Running docker (RStudio with R/Bioconductor)

```bash
docker run -e PASSWORD=mypass -p 8787:8787 -d --rm \
  -v $(pwd):/home/rstudio \
  bioconductor/bioconductor_docker:devel
```

- `bioconductor/bioconductor_docker:devel` - name of the Docker image (pulled automatically if it does not exist locally)
- `-p 8787:8787` - maps port 8787 on a local computer to port 8787 in the Docker container
- `-v
$(pwd):/home/rstudio` - maps the current folder to `/home/rstudio` in the Docker container

Open `http://localhost:8787/` in your web browser and log in with the `rstudio` user name and `mypass` password

---
## Cloud providers

| Platform              | Website                      |
|-----------------------|------------------------------|
| Amazon Web Services   | https://aws.amazon.com/      |
| Google Cloud Platform | https://cloud.google.com/    |
| Microsoft Azure       | https://azure.microsoft.com/ |
| IBM Cloud             | https://www.ibm.com/cloud    |
| Alibaba Cloud         | https://us.alibabacloud.com/ |

.small[ Langmead, Ben, and Abhinav Nellore. “[Cloud Computing for Genomic Data Analysis and Collaboration](https://doi.org/10.1038/nrg.2017.113).” Nature Reviews Genetics, January 30, 2018. ]

---
## Kubernetes

Kubernetes is a container orchestration platform

- Complex tasks may require running multiple applications in separate containers
- Containers can exchange information over the network
- A fault tolerance mechanism is needed to deal with failed containers

Kubernetes manages the entire lifecycle of individual containers, spinning up and shutting down resources as needed. If a container shuts down unexpectedly, Kubernetes will react by launching another container in its place. Think of a central node controlling worker nodes.

---
## Learn more

- [Reproducible research tools](https://mdozmorov.github.io/BIOS691.2018/) - the fundamental concepts in computational reproducible research. Through lectures and hands-on exercises students learn best practices of statistical data analysis and programming.
- [Bioinformatics notes](https://github.com/mdozmorov/Bioinformatics_notes) <img src="https://img.shields.io/github/stars/mdozmorov/Bioinformatics_notes?style=social" align="center">
- [Statistics notes](https://github.com/mdozmorov/Statistics_notes) <img src="https://img.shields.io/github/stars/mdozmorov/Statistics_notes?style=social" align="center">
- [Programming notes](https://github.com/mdozmorov/Programming_notes) <img src="https://img.shields.io/github/stars/mdozmorov/Programming_notes?style=social" align="center">
- [R notes](https://github.com/mdozmorov/R_notes) <img src="https://img.shields.io/github/stars/mdozmorov/R_notes?style=social" align="center">
- [Python notes](https://github.com/mdozmorov/Python_notes) <img src="https://img.shields.io/github/stars/mdozmorov/Python_notes?style=social" align="center">
- [Machine learning notes](https://github.com/mdozmorov/MachineLearning_notes) <img src="https://img.shields.io/github/stars/mdozmorov/MachineLearning_notes?style=social" align="center">