class: center, middle

### preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution

[bit.ly/preciseTAD](https://mdozmorov.github.io/Talk_preciseTAD/)

Mikhail Dozmorov, Ph.D.

Department of Biostatistics

Virginia Commonwealth University

<div class="my-footer">
<a href="https://dozmorovlab.github.io/"> <svg viewBox="0 0 576 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M528 32H48C21.5 32 0 53.5 0 80v16h576V80c0-26.5-21.5-48-48-48zM0 432c0 26.5 21.5 48 48 48h480c26.5 0 48-21.5 48-48V128H0v304zm352-232c0-4.4 3.6-8 8-8h144c4.4 0 8 3.6 8 8v16c0 4.4-3.6 8-8 8H360c-4.4 0-8-3.6-8-8v-16zm0 64c0-4.4 3.6-8 8-8h144c4.4 0 8 3.6 8 8v16c0 4.4-3.6 8-8 8H360c-4.4 0-8-3.6-8-8v-16zm0 64c0-4.4 3.6-8 8-8h144c4.4 0 8 3.6 8 8v16c0 4.4-3.6 8-8 8H360c-4.4 0-8-3.6-8-8v-16zM176 192c35.3 0 64 28.7 64 64s-28.7 64-64 64-64-28.7-64-64 28.7-64 64-64zM67.1 396.2C75.5 370.5 99.6 352 128 352h8.2c12.3 5.1 25.7 8 39.8 8s27.6-2.9 39.8-8h8.2c28.4 0 52.5 18.5 60.9 44.2 3.2 9.9-5.2 19.8-15.6 19.8H82.7c-10.4 0-18.8-10-15.6-19.8z"></path></svg> dozmorovlab.github.io</a> |
<a href="https://github.com/mdozmorov"> <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> mdozmorov</a> |
<a href="https://twitter.com/mikhaildozmorov"> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @mikhaildozmorov</a>
</div>

---

## The 3D structure of the genome

- The human genome is big: ~3.2 billion base pairs
- ~4 meters (~13 ft) of diploid genome is packed into a ~10 µm nucleus
- DNA from the ~30T cells of a human body, laid end to end, would span ~800 Earth-Sun distances

.center[<img src="img/genome_scales.png" width = 600>]

<div style="font-size: 
small;"> Human body has approximately 30 trillion human cells (excluding trillions of microbiome cells); Stretched haploid genome would be roughly 2 meters, so each diploid cell has 4 meters of DNA (1 m = 3.28 ft); 30 trillion * 4 meters = 120 trillion meters; Convert to miles: 120 trillion meters / 1609.34 = 7.45*10^{10} miles; Convert to Earth-Sun distances: 7.45*10^{10} miles / 91.43*10^6 miles = 814.83
</div>

---

## Topologically Associating Domains (TADs) and chromatin loops

- **TADs** – regions of increased chromatin interactions, separated by insulating boundaries
- **Loops** – points of enriched chromatin contacts, represented as high-intensity adjacent pixels on Hi-C maps
- Binding of CTCF and members of the cohesin complex (RAD21, SMC3) is a defining feature of domain boundaries

.center[<img src="img/tads_loops.png" width = 800>]

.small[Goloborodko et al. 2016, “Chromosome Compaction by Active Loop Extrusion.”]

---

## Pitfalls of existing TAD/loop callers

- Detect domains from Hi-C chromatin interaction matrices
- Heavily reliant on Hi-C data _resolution_, the size of genomic regions used to bin the genome for measuring interaction frequency (1-100 kb)
- Ignore the fundamental role of CTCF and other genome annotation data in domain formation

.center[<img src="img/tads_zoomin.png" width = 700>]

---

## Transcription factor enrichment at domain boundaries

- **Arrowhead** – an algorithm for finding contact domains
- **Peakachu** – a machine learning framework for chromatin loop detection
- **SpectralTAD** – hierarchical TAD detection using spectral clustering
- **Grubert** – experimentally obtained cohesin-mediated loops (ENCODE III)

.center[<img src="img/grubert_callers.png" width = 1100>]

.small[
Rao et al. 2014, “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles ...”

Salameh et al. 2020, “A Supervised Learning Framework for Chromatin Loop Detection ...”

Cresswell et al. 2020, “SpectralTAD: an R package for defining a hierarchy of topologically ...”

Grubert et al. 
2020, “Landscape of Cohesin-Mediated Chromatin Loops in the Human Genome.”
]

---

## Improving domain boundary detection using genome annotation data

- **Previous work:** using genome annotation data to predict enhancer-promoter interactions, genomic rearrangements (e.g., _3DPredictor_), Hi-C contact maps (e.g., _HiC-Reg_), and domain boundaries (e.g., _Lollipop_)
- **Limitations:** existing methods operate at the resolution of Hi-C data, with region sizes of 1-100 kb
- **Hypothesis:** high-resolution genome annotation data can inform more precise locations of domain boundaries

.small[
Tao, Huan, et al. “**Computational Methods for the Prediction of Chromatin Interaction and Organization Using Sequence and Epigenomic Profiles**,” _Briefings in Bioinformatics_, 2021 - Review of 48 chromatin interaction prediction methods using sequence and genome annotations.

Belokopytova P. & Fishman V. “**Predicting Genome Architecture: Challenges and Solutions.**” _Frontiers in Genetics_, 2021 - Review of chromatin architecture reconstruction using genome annotation data, biophysical and statistical/machine learning approaches
]

---

**preciseTAD** – learning boundary-transcription factor associations at the Hi-C resolution, then transferring the learned associations to the base-pair resolution (predict the probability of each base being a boundary, cluster high-probability bases with DBSCAN, find summit points with PAM)

.center[<img src="img/preciseTAD_fig1.png" width = 800>]

---

## Feature engineering, class imbalance

.pull-left[
.center[<img src="img/preciseTAD_feaureengineering.png" height = 360>]
]

.pull-right[
.center[<img src="img/preciseTAD_schema.png" height = 360>]
]

Distance between boundary and annotation (as the feature type), random undersampling, and 5 kb resolution Hi-C data perform best

---

## preciseTAD definitions

- Trained on boundaries detected by either _Arrowhead_ or _Peakachu_
- Genome annotation data from ENCODE (transcription factor binding sites, histone modifications, chromatin segmentation data)
- preciseTAD detects
    - **PTBRs**, _preciseTAD Boundary Regions_, 
regions enriched in bases with high boundary probability, identified with DBSCAN
    - **PTBPs**, _preciseTAD Boundary Points_, the most likely boundary summit points, identified with PAM

---

<div style="float: left; width: 70%;">
<center><img src="img/preciseTAD_fig3.png" width="700px" /></center>
</div>
<div style="float: right; width: 30%;">
<div>A) Transcription factors (TFs) provide the best predictive power</div>
<br>
<div>B) 4-8 TFs are sufficient for optimal prediction</div>
<br>
<div>C) The transcription factors CTCF, RAD21, SMC3, and ZNF143 are the most important predictors for all chromosomes</div>
<br>
<div>D) preciseTAD-predicted boundaries are closer to CTCF binding sites</div>
</div>

---

## preciseTAD-predicted boundaries are strongly enriched in CTCF and other architectural proteins

.center[<img src="img/preciseTAD_fig4AB.png" width = 1100>]

---

## preciseTAD-predicted domains resemble Grubert experimental data

Size distribution

.center[<img src="img/preciseTAD_fig4C.png" width = 1000>]

---

## preciseTAD-predicted domains resemble Grubert experimental data

Size distribution

.center[<img src="img/preciseTAD_fig4C.png" width = 1000>]

CTCF orientation

.center[<img src="img/preciseTAD_fig4D.png" width = 1000>]

---

## preciseTAD- and Lollipop-predicted boundaries are similarly enriched in CTCF and other architectural proteins

.center[<img src="img/preciseTAD_figS10BD.png" width = 1100>]

---

## Models trained in one cell line can predict boundaries in another cell line

.center[<img src="img/preciseTAD_fig5.png" height = 400>]

Training and testing on GM12878 data (blue, Arrowhead; red, Peakachu) versus training on K562 and testing on GM12878 data (black, dashed).
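---

## Sketch: from base-level probabilities to PTBRs and PTBPs

The definitions slide names two post-processing steps: grouping high-probability bases into regions (DBSCAN) and picking a summit point per region (PAM). The Python sketch below is illustrative only and is not the preciseTAD package API. The function names, the 0.9 threshold, `eps`, and `min_pts` are hypothetical. A simple gap-based grouping stands in for DBSCAN, and the summit is taken as the medoid, assuming one summit per region (PAM with a single cluster).

```python
from typing import List, Sequence


def boundary_regions(probs: Sequence[float], threshold: float = 0.9,
                     eps: int = 2, min_pts: int = 3) -> List[List[int]]:
    """Group high-probability bases into regions.

    Simplified 1D stand-in for DBSCAN: bases with probability >= threshold
    are kept, kept bases at most `eps` positions apart are chained together,
    and groups with fewer than `min_pts` bases are dropped as noise.
    """
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    regions: List[List[int]] = []
    current: List[int] = []
    for pos in kept:
        if current and pos - current[-1] > eps:
            if len(current) >= min_pts:
                regions.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_pts:
        regions.append(current)
    return regions


def summit(region: Sequence[int]) -> int:
    """Return the medoid: the base with the smallest total distance to the rest."""
    return min(region, key=lambda c: sum(abs(c - x) for x in region))


# Toy per-base boundary probabilities (made-up values, not real predictions).
probs = [0.1, 0.2, 0.95, 0.97, 0.99, 0.96, 0.3, 0.1, 0.1, 0.1,
         0.92, 0.1, 0.1, 0.93, 0.98, 0.95, 0.2]

regions = boundary_regions(probs)
ptbrs = [(r[0], r[-1]) for r in regions]   # region start/end coordinates
ptbps = [summit(r) for r in regions]       # one summit per region
print(ptbrs)
print(ptbps)
```

The isolated high-probability base at position 10 is discarded as noise, which is the behaviour a density-based method is chosen for.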
---

## Summary

- Current TAD/loop callers focus more on visual Hi-C data characteristics and less on the underlying biology
- preciseTAD learns boundary-genomic annotation associations at the Hi-C data resolution and transfers this knowledge to the base-pair level
- Binding of four transcription factors (CTCF, RAD21, SMC3, ZNF143) is sufficient to improve domain boundary detection
- Models trained in one cell type can accurately predict boundaries in another cell type without Hi-C data
- We predicted boundaries for 60 cell lines, imputing missing genomic annotations with Avocado (ongoing)

.small[
Stilianoudakis S, Marshall M, Dozmorov M. "**preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution**." bioRxiv (2021). https://doi.org/10.1101/2020.09.03.282186
]

---
class: center, middle

# Thank you

<br>

Stilianoudakis S, Marshall M, Dozmorov M. "**preciseTAD: A transfer learning framework for 3D domain boundary prediction at base-pair resolution**." 
bioRxiv (2021)

https://bioconductor.org/packages/preciseTAD/

https://dozmorovlab.github.io/preciseTAD/

<img src="img/phrma.png" height = 50>

.small[
The George and Lavinia Blick Research Fund

[bit.ly/preciseTAD](https://mdozmorov.github.io/Talk_preciseTAD/)
]
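
---

## Backup sketch: distance features and random undersampling

The "Feature engineering, class imbalance" slide reports that distance-type features combined with random undersampling worked best on 5 kb bins. The Python sketch below shows one way such a training table could be assembled. It is an illustration under stated assumptions, not the preciseTAD implementation: the function names and toy coordinates are hypothetical, the distance is measured from each bin center to the nearest annotation site, and no transformation of the distance is applied.

```python
import bisect
import random
from typing import List, Sequence, Tuple

Row = Tuple[int, int]  # (distance to nearest annotation site, boundary label)


def distance_to_nearest(center: int, sites: Sequence[int]) -> int:
    """Distance from a bin center to the nearest site (sites sorted, non-empty)."""
    i = bisect.bisect_left(sites, center)
    candidates = sites[max(i - 1, 0):i + 1]  # neighbours on either side
    return min(abs(center - s) for s in candidates)


def build_rows(n_bins: int, bin_size: int, boundaries: Sequence[int],
               sites: Sequence[int]) -> List[Row]:
    """Label each bin (1 if it contains a called boundary) and add its distance feature."""
    boundary_bins = {b // bin_size for b in boundaries}
    rows = []
    for k in range(n_bins):
        center = k * bin_size + bin_size // 2
        rows.append((distance_to_nearest(center, sites), int(k in boundary_bins)))
    return rows


def random_undersample(rows: Sequence[Row], seed: int = 0) -> List[Row]:
    """Randomly drop majority-class rows until both classes have the same size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    return minority + rng.sample(majority, len(minority))


# Toy example: ten 5 kb bins, two called boundaries, three annotation sites.
rows = build_rows(n_bins=10, bin_size=5000,
                  boundaries=[12000, 40000], sites=[11800, 26000, 41000])
balanced = random_undersample(rows)
print(rows)
print(balanced)
```

Boundary bins are rare relative to non-boundary bins, so undersampling the majority class keeps a classifier from defaulting to "non-boundary" for every bin.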