vignettes/hic_tutorial.Rmd
hic_tutorial.Rmd
# Install, if necessary, and load necessary libraries and set up R session if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(readr) # install.packages("readr") library(data.table) # install.packages("data.table") library(dplyr) # install.packages("dplyr") library(edgeR) # BiocManager::install("edgeR") library(BiocParallel) # BiocManager::install("BiocParallel") library(HiCcompare) # BiocManager::install("HiCcompare"), or, for the latest version, # install.packages("devtools") # devtools::install_github('dozmorovlab/HiCcompare', build_vignettes = TRUE, force = TRUE) library(multiHiCcompare) # BiocManager::install("multiHiCcompare", version = "devel") # Output fixed numbers, not scientific notation. Needed for the correct # representation of genomic coordinates options(scipen = 999)
Like most sequencing data, Hi-C data starts out as paired-end reads stored in FASTQ format. The fastq
files can be very large, depending on the depth of the sequencing. Several Hi-C data processing pipelines exist to convert raw Hi-C FASTQ reads into text-based chromatin interaction matrices (Ay and Noble 2015). Researchers looking to generate their own Hi-C matrices will need to familiarize themselves with the Hi-C data processing pipelines (Lajoie, Dekker, and Kaplan 2015). However, those interested in using the wide range of public Hi-C data deposited on Gene Expression Omnibus (GEO) repositories can frequently bypass the data processing steps, as most deposited Hi-C data also includes the processed chromatin interaction matrices. These matrices are typically stored in the text-based .hic
or HDF5-based .cool
formats developed by Aiden lab (http://aidenlab.org/data.html) and Mirny lab (ftp://cooler.csail.mit.edu/coolers), respectively, and can be converted to plain text files. For more information, see Stansfield et al. 2019, “R Tutorial: Detection of Differentially Interacting Chromatin Regions From Multiple Hi-C Datasets” (Stansfield et al. 2019), and the vignettes for the HiCcompare, multiHiCcompare, SpectralTAD and TADCompare R/Bioconductor packages.
\(n \times n\) contact matrices are the most intuitive format to represent chromosome-specific chromatin interactions. The genome (and each chromosome) is binned into a set of non-overlapping regions of fixed size \(r\). \(r\) corresponds to the resolution of Hi-C data; typical resolution includes 10kb, 25kb, 50kb, 100kb, and 1Mb data.
\(n \times n\) contact matrices are square and symmetric with entry \(ij\) corresponding to the number of contacts between region \(i\) and region \(j\). Below is an example of a \(75 \times 75\) region of a contact matrix derived from Rao et al. 2014 data, GM12878 cell line (Rao et al. 2014), chromosome 22, 50kb resolution. Note the symmetry around the diagonal - the typical shape of chromatin interaction matrix. The figure was created using the pheatmap package. \(n \times n\) are most commonly associated with data from the Bing Ren lab (http://chromosome.sdsc.edu/mouse/hi-c/download.html).
\(n \times (n+3)\) matrices consist of an \(n \times n\) matrix but with three additional leading columns containing the chromosome, the start of the region and the end of the region. Regions in this case are determined by the resolution of the data. These matrices are commonly associated with the TopDom
TAD caller (http://zhoulab.usc.edu/TopDom/). The subset of a typical \(n \times (n+3)\) matrix is shown below.
#> chr start end X18500000 X18550000 X18600000 X18650000
#> 1 chr22 18500000 18550000 13313 4817 1664 96
#> 2 chr22 18550000 18600000 4817 15500 5120 178
#> 3 chr22 18600000 18650000 1664 5120 11242 316
#> 4 chr22 18650000 18700000 96 178 316 162
Sparse 3-column matrices are matrices where the first and second columns refer to region \(i\) and region \(j\) of the chromosome, and the third column is the number of contacts between them. This style is becoming increasingly popular and is associated with raw data from Lieberman-Aiden lab (e.g., https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525), and is the data output produced by the Juicer tool (Durand et al. 2016). 3-column matrices are handled internally in the package by converting them to \(n \times n\) matrices using the HiCcompare package’s sparse2full()
function. The first 5 rows of a typical sparse 3-column matrix are shown below.
#> region1 region2 IF
#> 1: 16050000 16050000 12
#> 2: 16200000 16200000 4
#> 3: 16150000 16300000 1
#> 4: 16200000 16300000 1
#> 5: 16250000 16300000 1
#> 6: 16300000 16300000 10
The HiCcompare
R package was designed for working with processed Hi-C data in text format, typically, sparse matrices. Hi-C data in sparse upper triangular matrix format extracted from .hic
files can be loaded using standard R functionality. HiCcompare
can then be used to convert this Hi-C data into a full \(n \times n\) contact matrix using the sparse2full
function. This process can be reversed using the full2sparse
function.
Data aligned by HiC-Pro
can be converted into a more usable BEDPE
format using the hicpro2bedpe
function. The hicpro2bedpe
function takes the .matrix
and .bed
files produced by HiC-Pro
as input and creates a sparse upper triangular matrix containing start and end coordinates for each interacting region.
Public Hi-C data is available from several sources. GEO (https://www.ncbi.nlm.nih.gov/geo/) catalogs the data for many studies. A search for “Hi-C” returns 3,703 hits (as of July 13, 2020). Additionally, the Aiden Lab website (https://www.aidenlab.org/) lists many high-quality datasets. Finally, there is the cooler
repository (https://github.com/mirnylab/cooler), which provides a database of Hi-C data ready for download. More Hi-C studies and data can be found in our GitHub repository (https://github.com/mdozmorov/HiC_data).
First, we will need to download an example set of Hi-C data. We will use data from Rao 2017 (Rao et al. 2017). For simplicity, we will only use two replicates for each experimental condition. The experimental conditions are normal HCT-116 cells and HCT-116 cells treated with auxin for six hours. To download the .hic
files from GEO run the following commands in the terminal. Note: downloading the data will require about 30GB of hard drive space. You will also need to download the straw
software from https://github.com/theaidenlab/straw/wiki and install it.
This step is only for illustration purposes due to the significant amount of download time and computational resources. We will start the workshop with the processed data in a sparse upper triangular text format.
# Install straw (Linux and Windows installations are available)
wget https://github.com/theaidenlab/straw/tree/master/bin/Mac/straw
chmod +x straw
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2795nnn/GSM2795535/suppl/GSM2795535_Rao-2017-HIC001_30.hic
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2795nnn/GSM2795536/suppl/GSM2795536_Rao-2017-HIC002_30.hic
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2809nnn/GSM2809539/suppl/GSM2809539_Rao-2017-HIC008_30.hic
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2809nnn/GSM2809540/suppl/GSM2809540_Rao-2017-HIC009_30.hic
# Make directories for the contact map files
mkdir HIC001
mkdir HIC002
mkdir HIC008
mkdir HIC009
# Extract contact maps using straw by running the following commands in the terminal
# Or, put the commands into a script file, e.g., `straw.sh`, and run it
for i in {1..22}
do
./straw NONE GSM2795535_Rao-2017-HIC001_30.hic $i $i BP 500000 > HIC001/HIC001.NONE.chr$i.500000.txt
done
./straw NONE GSM2795535_Rao-2017-HIC001_30.hic X X BP 500000 > HIC001/HIC001.NONE.chrX.500000.txt
for i in {1..22}
do
./straw NONE GSM2795536_Rao-2017-HIC002_30.hic $i $i BP 500000 > HIC002/HIC002.NONE.chr$i.500000.txt
done
./straw NONE GSM2795536_Rao-2017-HIC002_30.hic X X BP 500000 > HIC002/HIC002.NONE.chrX.500000.txt
for i in {1..22}
do
./straw NONE GSM2809539_Rao-2017-HIC008_30.hic $i $i BP 500000 > HIC008/HIC008.NONE.chr$i.500000.txt
done
./straw NONE GSM2809539_Rao-2017-HIC008_30.hic X X BP 500000 > HIC008/HIC008.NONE.chrX.500000.txt
for i in {1..22}
do
./straw NONE GSM2809540_Rao-2017-HIC009_30.hic $i $i BP 500000 > HIC009/HIC009.NONE.chr$i.500000.txt
done
./straw NONE GSM2809540_Rao-2017-HIC009_30.hic X X BP 500000 > HIC009/HIC009.NONE.chrX.500000.txt
These steps will create four folders containing the sparse upper triangular matrices for chromosomes 1-22 and X for each sample. HIC001 and HIC002 are the two replicates for the normal HCT-116 cells, and HIC008 and HIC009 are the two replicates for the auxin-treated HCT-116 cells.
The original HiCCompare
R package can be used when only two Hi-C datasets are available to be compared (Stansfield et al. 2018). HiCcompare
provides a method for the joint normalization and differential analysis of two Hi-C datasets, but cannot be generalized to higher numbers of datasets. multiHiCcompare
will need to be used if more than two Hi-C datasets are to be compared (Stansfield, Cresswell, and Dozmorov 2019).
We now need to read the data into R. Open R and make sure the working directory is set to the directory where the Hi-C data is stored in the following folder structure.
.
├── HIC001
│ ├── HIC001.NONE.chr1.500000.txt.gz
│ ├── HIC001.NONE.chr10.500000.txt.gz
| ...
│ ├── HIC001.NONE.chr9.500000.txt.gz
│ └── HIC001.NONE.chrX.500000.txt.gz
├── HIC002
│ ├── HIC002.NONE.chr1.500000.txt.gz
│ ├── HIC002.NONE.chr10.500000.txt.gz
| ...
│ ├── HIC002.NONE.chr9.500000.txt.gz
│ └── HIC002.NONE.chrX.500000.txt.gz
├── HIC008
│ ├── HIC008.NONE.chr1.500000.txt.gz
│ ├── HIC008.NONE.chr10.500000.txt.gz
| ...
│ ├── HIC008.NONE.chr9.500000.txt.gz
│ └── HIC008.NONE.chrX.500000.txt.gz
└── HIC009
├── HIC009.NONE.chr1.500000.txt.gz
├── HIC009.NONE.chr10.500000.txt.gz
...
├── HIC009.NONE.chr9.500000.txt.gz
└── HIC009.NONE.chrX.500000.txt.gz
Execute the following commands:
# Set up parameters for reading in data # Chromosome names chr <- paste0('chr', c(1)) # First chromosome, or use c(1:22, 'X') for all chromosomes # Sample and Folder names (samples <- paste0('HIC00', c(1, 2, 8, 9))) #> [1] "HIC001" "HIC002" "HIC008" "HIC009" # Data resolution res <- 500000 # Read data sample_list <- list() chr_list <- list() for( j in 1:length(samples)) { for (i in 1:length(chr)) { fileName <- system.file("extdata", paste0(samples[j], "/", samples[j], ".NONE.", chr[i], ".", res, ".txt.gz"), package = "HiCcompareWorkshop") chr_list[[i]] <- read_tsv(fileName, col_names = FALSE) %>% as.data.table() # Add column indicating the chromosome chr_list[[i]] <- cbind(i, chr_list[[i]]) colnames(chr_list[[i]]) <- c('chr', 'region1', 'region2', 'IF') } sample_list[[j]] <- chr_list chr_list <- list() } # Collapse separate chromosome lists into one table per sample sample_list <- lapply(sample_list, rbindlist)
We now have a list of length 4 with each entry containing the sparse upper triangular matrix for one of the Hi-C datasets:
sample_list[[1]] #> chr region1 region2 IF #> 1: 1 0 0 17 #> 2: 1 500000 500000 11583 #> 3: 1 500000 1000000 5454 #> 4: 1 1000000 1000000 30858 #> 5: 1 500000 1500000 868 #> --- #> 102703: 1 247000000 249000000 1166 #> 102704: 1 247500000 249000000 1138 #> 102705: 1 248000000 249000000 823 #> 102706: 1 248500000 249000000 2043 #> 102707: 1 249000000 249000000 8336
The first column indicates the chromosome number. The second column is the start location in base pairs for the first interacting region. The third column is the start location for the second interacting region, and the fourth column is the interaction frequency (IF) for the interacting pair.
As with any sequencing data, Hi-C datasets contain biases. There are two primary sources of bias, sequence- and technology-driven.
The DNA sequence-driven biases include GC content, chromatin accessibility, and mappability (Yaffe and Tanay 2011; O’Sullivan et al. 2013), which tend to be consistent across datasets generated for the same organism. The similar sequence-driven biases affect chromatin interactions to the same extent when comparing Hi-C datasets.
The technology-driven biases include cross-linking preferences, restriction enzyme choice, batch effects, and biotin labeling (Lun and Smyth 2015). The technology-driven biases affect the data unpredictably and thus are harder to model. The multiHiCcompare
R package was specifically designed to correct for the technology-driven biases between datasets.
To remove biases between pairs of Hi-C datasets, first, we need to create a Hicexp
object using the Hi-C data:
# Create a Hicexp object for use by multiHiCcompare # Four objects are assigned into two groups rao2017 <- make_hicexp(data_list = sample_list, groups = c(1, 1, 2, 2)) rao2017 # class(rao2017) #> Hi-C Experiment Object #> 2 experimental groups #> Group 1 has 2 samples #> Group 2 has 2 samples
The Hicexp
object stores the Hi-C experiment data and is the main input into the other functions included in multiHiCcompare
. The user can view the IF information by using the hic_table
accessor function:
hic_table(rao2017) #> chr region1 region2 D IF1 IF2 IF3 IF4 #> 1: 1 0 0 0 17 12 5 13 #> 2: 1 500000 500000 0 11583 16945 4694 6252 #> 3: 1 500000 1000000 1 5454 7462 1942 2475 #> 4: 1 500000 1500000 2 868 1103 720 856 #> 5: 1 500000 2000000 3 480 633 521 647 #> --- #> 95549: 1 248000000 248500000 1 4009 3839 2290 2333 #> 95550: 1 248000000 249000000 2 823 856 507 533 #> 95551: 1 248500000 248500000 0 8283 8848 5391 5892 #> 95552: 1 248500000 249000000 1 2043 2178 848 950 #> 95553: 1 249000000 249000000 0 8336 10546 4272 5359
When comparing multiple Hi-C datasets, a joint normalization procedure increases power and reduces the number of false positives (Stansfield et al. 2018). multiHiCcompare
provides a cyclic loess method for the joint normalization of multiple Hi-C datasets. The method is based on representing the data on an MD plot.
The MD plot is similar to the MA plot (Bland-Altman plot), which is commonly used to visualize gene expression differences. \(M\) is defined as the log difference between the two data sets \(M = log_2(IF_2/IF_1)\), where \(IF_1\) and \(IF_2\) are interaction frequencies of the first and the second Hi-C datasets, respectively. \(D\) is defined as the distance between two interacting regions, expressed in unit-length of the \(X\) resolution of the Hi-C data. A loess regression curve is fit through the MD plot and used to remove global biases by centering the \(M\) differences around \(M=0\) baseline. The multiHiCcompare
R package includes two methods for the joint normalization of Hi-C data, cyclic loess and fast loess (fastlo) (Ballman et al. 2004). We will normalize the data using fastlo.
# MD plots before normalization MD_hicexp(rao2017, plot.chr = 1, plot.loess = TRUE)
# Normalize using Loess rao2017 <- fastlo(rao2017)
# Plot normalization results MD_hicexp(rao2017, plot.chr = 1, plot.loess = TRUE)
chr | region1 | region2 | D | IF1 | IF2 | IF3 | IF4 |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 14.03845 | 9.054028 | 6.583487 | 16.14285 |
1 | 500000 | 500000 | 0 | 7976.91924 | 10874.195738 | 7464.948223 | 8895.88720 |
1 | 500000 | 1000000 | 1 | 3735.70219 | 4822.706786 | 3074.815893 | 3531.66059 |
1 | 500000 | 1500000 | 2 | 784.64047 | 939.231570 | 866.952110 | 923.59278 |
1 | 500000 | 2000000 | 3 | 583.88128 | 727.500686 | 467.083320 | 516.20057 |
1 | 500000 | 2500000 | 4 | 283.38905 | 353.270921 | 264.837459 | 258.98222 |
The hic_table
slot for IFs has been updated with their normalized values. The MD plots show that the normalization has been performed correctly, and the cloud of points is centered and symmetric around 0, indicating that any biases between datasets have been removed. The MD plot displays resolution \(r\) unit genomic distance on the x-axis and the log2 difference between the two datasets on the y-axis. Any shift of the points away from y = 0 represents scaling differences between the datasets. The loess fit to the data on the MD plot will also model any trend biases between the datasets. Correctly normalized data should be centered around y = 0 and symmetric (without any clear trends) on the MD plot.
Note that if multiple cores are available, the runtime of multiHiCcompare
can be sped up by using the parallel
option. multiHiCcompare
was built with the Bioconductor BiocParallel
package. The number of cores to be used in parallel processing can be set as follows:
library(BiocParallel) # BiocManager::install("BiocParallel") # Check how many cores are available numCores <- parallel::detectCores() # Set the number of cores at least one less than the total number if(Sys.info()['sysname'] == "Windows") { # Windows settings register(SnowParam(workers = numCores - 1), default = TRUE) } else { # Unix settings register(MulticoreParam(workers = numCores - 1), default = TRUE) }
Now that multiple cores are registered, the user can utilize parallel processing in any of the normalization and difference detection steps by setting parallel = TRUE
in the function options.
multiHiCcompare
provides two main ways to perform a differential comparison between the groups or conditions of your Hi-C experiment. For simple experiments where only a comparison between two groups is made, the hic_exactTest()
function can be used. For more complex experiments with covariates or multiple groups, the hic_glm()
function should be used. Both of these functions make use of the edgeR
package for fitting negative binomial models to the Hi-C data. For the difference detection steps, multiHiCcompare
first splits the data up by distance using the progressive pooling described in the fastlo section. Each distance pool is then treated similarly to an independent RNA-seq data matrix on which edgeR
’s functions are applied to fit the specified model. This process is illustrated in Figure below.
Figure. The off-diagonal analysis of multiple Hi-C replicates. Dashed lines represent the off-diagonal vectors of interaction frequencies at a given distance between interacting regions. Right: Converted into a matrix format similar to RNA-seq data, IFs can be loess normalized, variance across replicates can be estimated using an empirical Bayes approach, and differences can be detected using log-linear GLMs.
Now that we have jointly normalized our data, we are ready to compare the conditions to find differentially interacting chromatin regions. For this example, we only have two conditions and no other covariates. Thus, we can use the exact test for our comparison:
# Perform exact test # May use "parallel = TRUE" option to speed up computations rao2017 <- hic_exactTest(rao2017, parallel = FALSE)
# Plot a composite MD plot with the results of a comparison MD_composite(rao2017, plot.chr = 1, D.range = 0.4)
Here we can see the results. The composite MD plot highlights where the significantly different interactions occur in relation to distance and the fold difference between groups.
chr | region1 | region2 | D | logFC | logCPM | p.value | p.adj |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | -0.0174526 | -0.0045321 | 1.0000000 | 1.0000000 |
1 | 500000 | 500000 | 0 | -0.2778767 | 9.3487766 | 0.2724635 | 0.3478203 |
1 | 500000 | 1000000 | 1 | -0.5220235 | 8.1336596 | 0.0167912 | 0.0329749 |
1 | 500000 | 1500000 | 2 | 0.0952323 | 6.0276831 | 0.5575227 | 0.6320454 |
1 | 500000 | 2000000 | 3 | -0.5944782 | 8.6530833 | 0.0008158 | 0.0033206 |
1 | 500000 | 2500000 | 4 | -0.4011353 | 7.6744958 | 0.0319779 | 0.0680869 |
The results table shares the same first four columns with the hic_table
, but the following columns indicate the results of the exact test. logFC
is the log fold change difference between the experimental groups, logCPM
is the log counts per million between the samples, p.value
is the un-adjusted p-value for the exact test, and p.adj
is the false discovery rate (FDR) corrected p-value from the exact test.
You can save the results for the downstream analysis.
# Save the Hicexp object save(rao2017, file = 'rao2017_exact.rda') # To start the downstream analysis # without re-running multiHiCcompare load the saved file # load('rao2017_exact.rda')
For more complex experiments, the exact test is no longer sufficient, and the generalized linear model (GLM) framework must be used. If, for example, we have some other covariate of interest that we wish to control for or if there are more than two experimental groups, the GLM functionality of multiHiCcompare
should be used. Here we show an example GLM analysis using two additional replicates from Rao 2017, which come from different biological samples.
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2795nnn/GSM2795538/suppl/GSM2795538_Rao-2017-HIC004_30.hic wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2809nnn/GSM2809543/suppl/GSM2809543_Rao-2017-HIC012_30.hic
Assuming you processed the following files and extract matrices as described above, we read the data into R and create a Hicexp
object as before:
# Set up parameters for reading in data # Chromosome names chr <- paste0('chr', c(1)) # First chromosome, or use c(1:22, 'X') for all chromosomes # Sample and Folder names (samples <- paste0('HIC0', c('01', '02', '04', '08', '09', '12'))) #> [1] "HIC001" "HIC002" "HIC004" "HIC008" "HIC009" "HIC012" # Data resolution res <- 500000 # Read data sample_list <- list() chr_list <- list() for( j in 1:length(samples)) { for (i in 1:length(chr)) { fileName <- system.file("extdata", paste0(samples[j], "/", samples[j], ".NONE.", chr[i], ".", res, ".txt.gz"), package = "HiCcompareWorkshop") chr_list[[i]] <- read_tsv(fileName, col_names = FALSE) %>% as.data.table() # Add column indicating the chromosome chr_list[[i]] <- cbind(i, chr_list[[i]]) colnames(chr_list[[i]]) <- c('chr', 'region1', 'region2', 'IF') } sample_list[[j]] <- chr_list chr_list <- list() } # Collapse separate chromosome lists into one table per sample sample_list <- lapply(sample_list, rbindlist)
Then, we create the Hicexp object that includes batch covariate, assuming the additional samples came from a different batch.
# Create a Hicexp object for use by multiHiCcompare # Add the covariate data.frame for biological sample source rao_glm <- make_hicexp(data_list = sample_list, groups = c(1, 1, 1, 2, 2, 2), covariates = data.frame(biosample = c(1, 1, 2, 1, 1, 2))) # View covariates meta(rao_glm) #> group biosample #> Sample1 1 1 #> Sample2 1 1 #> Sample3 1 2 #> Sample4 2 1 #> Sample5 2 1 #> Sample6 2 2
Now we can normalize as was done before:
rao_glm <- fastlo(rao_glm, parallel = FALSE)
We are ready to use the GLM functionality of multiHiCcompare
, which is similar to the strategy used in limma
and edgeR
packages.
First, we need to create a design matrix. The design matrix should contain the covariates of interest. Any categorical variables should be entered as factors. Next, the comparison of interest will need to be specified using either the contrast
or the coef
option. For this example, we are interested in the group difference; thus, we can set coef = 2
(the first coefficient is for the intercept) to test if the group effect is equal to 0. For more information on using contrast
and coef
, please, see the edgeR user manual.
# Make design matrix d <- model.matrix(~factor(meta(rao_glm)$group) + factor(meta(rao_glm)$biosample)) colnames(d) <- c("(Intercept)", "group", "factor") knitr::kable(d)
(Intercept) | group | factor |
---|---|---|
1 | 0 | 0 |
1 | 0 | 0 |
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 1 | 1 |
Now we are ready to perform the comparison. There are three methods by which hic_glm()
can be performed. The default method is to use the QLFTest
which makes use of the quasi-likelihood model. Additionally, there is the LRTest
which conducts a likelihood ratio test. The final method is the Treat
method, which conducts a test relative to a specified fold-change threshold. For this option, the M
option will need to be used to specify the log2 fold change threshold. See the multiHiCcompare vignette for more details.
# Plug into GLM function rao_glm <- hic_glm(rao_glm, design = d, coef = 2, method = "QLFTest", p.method = "fdr", parallel = FALSE)
# Plot a composite MD plot with the results of a comparison MD_composite(rao_glm, plot.chr = 1, D.range = 0.2)
Here we can see the results. The composite MD plot highlights where the significantly different interactions are occurring in relation to distance and the fold change of the difference between groups.
chr | region1 | region2 | D | logFC | logCPM | F | p.value | p.adj |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | -0.1021543 | -0.0243108 | 0.0316211 | 0.8650068 | 0.8946755 |
1 | 500000 | 500000 | 0 | 0.3204454 | 9.3528873 | 0.7814935 | 0.4123871 | 0.4878896 |
1 | 500000 | 1000000 | 1 | -0.2754725 | 8.0445835 | 1.3820008 | 0.2864798 | 0.3582978 |
1 | 500000 | 1500000 | 2 | 0.3323417 | 5.9134426 | 3.0192280 | 0.1355097 | 0.1915485 |
1 | 500000 | 2000000 | 3 | -0.2373112 | 8.5721266 | 0.9202522 | 0.3640999 | 0.4603879 |
1 | 500000 | 2500000 | 4 | -0.1304014 | 7.5473344 | 0.3783153 | 0.5547362 | 0.6357543 |
# Save the Hicexp object save(rao2017, file = 'rao2017_glm.rda') # To start the downstream analysis # without re-running multiHiCcompare load the saved file # load('rao2017_glm.rda')
The identification of differentially interacting chromatin regions (DIRs) opens up a problem of interpretation - what is so special about these regions from a genome regulation perspective? Answers to the following questions may help to better understand the regulatory role of differentially interacting regions.
Regions that are frequently detected as differentially interacting may be visualized using the Manhattan plot-like plotting function provided by multiHiCcompare
. The function manhattan_hicexp
allows the user to make a Manhattan plot showing the regions that are detected as significantly differentially interacting with any other regions (summarized p-value) or frequently identified as significantly differentially interacting (number of times a region is significantly differentially interacting with other regions).
The standard
method (default) displays the -log10-transformed p-values for each interacting pair of regions, which allows you to visualize the most significant interactions in the datasets. However, the standard
method requires more computational time than other methods. The p-value summarization methods include the addCLT
(default) (???), fisher
(???), and stouffer
(???) methods to combine the p-values for each region to produce a plot of the most significant regions.
The count
method creates a plot where the height corresponds to the number of times a region was detected as significant. The goal of these plots is to visualize the most significantly differentially interacting regions in the context of the linear genome.
The higher the dots are, the more significant/more frequent a region was detected as significantly differentially interacting. Use plot.chr
to focus on any given chromosome:
manhattan_hicexp(rao2017, method = 'fisher')
manhattan_hicexp(rao2017, method = 'count', plot.chr = 1)
It may be of interest to take a more in-depth look at the most significant regions that were frequently detected as differentially interacting. We can get started with this by using the topDirs
function that gives us a data.frame
of the regions and the count for the number of times each region was detected as differentially interacting, along with the Fisher combined p-value of the detected interactions. The topDirs
function is an analog of the limma::topTable
and edgeR::topTags
functions in that it allows us to filter the results by the average log fold-change (logfc_cutoff
), the average interaction frequency (the higher the average frequency, the more confident we are in the detected difference, logcpm_cutoff
), the adjusted p-value cutoff (p.adj_cutoff
), and the distance cutoff (D_cutoff
). The topDirs
function allows us to focus on the most significant regions while filtering out less interesting regions.
The return_df = 'bed'
option gives us a summary of the regions which are found to be interacting at least one time or more:
counts <- topDirs(rao2017, logfc_cutoff = 1, logcpm_cutoff = 2, p.adj_cutoff = 0.01, return_df = 'bed') knitr::kable(head(counts))
chr | start | end | count | avgD | avgLogFC | avgLogCPM | avgP.adj |
---|---|---|---|---|---|---|---|
chr1 | 206000000 | 206499999 | 107 | 45.10280 | -1.2141 | 6.9624 | 2.1193E-70 |
chr1 | 156000000 | 156499999 | 106 | 69.53774 | -1.0095 | 8.1635 | 3.9946E-69 |
chr1 | 156500000 | 156999999 | 102 | 68.78431 | -1.0139 | 8.2423 | 4.9611E-66 |
chr1 | 16000000 | 16499999 | 101 | 166.19802 | -1.0247 | 7.9119 | 4.7194E-66 |
chr1 | 145500000 | 145999999 | 95 | 48.92632 | -1.3783 | 7.2570 | 7.0017E-63 |
chr1 | 49500000 | 49999999 | 72 | 60.25000 | 1.1385 | 7.1517 | 2.7634E-47 |
We can now use the counts
data.frame as input for plotting the p-values of the top DIRs, summarized by Fisher’s method. To zoom in on a particular chromosome the plot.chr
option can be used:
plot_pvals(counts)
We can also plot the counts:
plot_counts(counts, plot.chr = 1)
The return_df = 'pairedbed'
will give the results in the form of interacting pairs:
pairs <- topDirs(rao2017, logfc_cutoff = 2, logcpm_cutoff = 4, p.adj_cutoff = 0.01, return_df = 'pairedbed') knitr::kable(head(pairs))
chr1 | start1 | end1 | chr2 | start2 | end2 | D | logFC | logCPM | p.value | p.adj |
---|---|---|---|---|---|---|---|---|---|---|
chr1 | 500000 | 999999 | chr1 | 50000000 | 50499999 | 99 | 4.5574 | 5.1487 | 2.5787E-04 | 5.4759E-03 |
chr1 | 500000 | 999999 | chr1 | 227000000 | 227499999 | 453 | -2.4551 | 5.6621 | 2.3725E-06 | 3.7135E-04 |
chr1 | 1000000 | 1499999 | chr1 | 50500000 | 50999999 | 99 | 2.0054 | 6.5709 | 2.7951E-04 | 5.6961E-03 |
chr1 | 1000000 | 1499999 | chr1 | 173000000 | 173499999 | 344 | -2.9429 | 4.7608 | 1.2840E-04 | 6.8989E-03 |
chr1 | 1000000 | 1499999 | chr1 | 229500000 | 229999999 | 457 | -2.2605 | 6.1349 | 6.2145E-08 | 2.3550E-05 |
chr1 | 1000000 | 1499999 | chr1 | 230000000 | 230499999 | 458 | -2.4601 | 5.3469 | 4.5318E-05 | 3.4711E-03 |
The coordinates of differentially interacting regions may be saved as .bed
files for downstream analysis in tools such as GenomeRunner (Dozmorov et al. 2016), LOLA (Sheffield and Bock 2016), or visualization in, e.g., UCSC Genome Browser (Karolchik et al. 2014):
# Regular BED format write_tsv(counts[, c('chr', 'start', 'end', 'count')], path = 'detected_regions.bed', col_names = FALSE) # Paired BED format write_tsv(pairs, path = 'detected_regions.pairedbed', col_names = FALSE)
Sometimes, a BED file of all regions in the genome needs to be saved, to be used as a “background” for random sampling:
# Get list of all 100KB regions in genome regions <- topDirs(rao2017, logfc_cutoff = 0, logcpm_cutoff = -1, D_cutoff = 0, p.adj_cutoff = 1, alpha = 2, return_df = 'bed' ) # Order regions regions <- regions[order(chr, start, end), ] # Remove unnecessary columns regions <- regions[, c('chr', 'start', 'end')] # Write into BED format write_tsv(regions, path = 'all_regions.bed', col_names = FALSE)
If gene expression data is available, DIRs may be checked for statistically significant overlap with differentially expressed (DE) genes. The hypothesis here is that regions detected as differentially interacting harbor genes that change their gene expression due to changes in chromatin interactions. In our example, we obtain a list of genes differentially expressed between the normal cells and auxin treated cells (Rao et al. 2017) and test whether they are co-localized with DIRs. First, we get the genomic coordinates (hg19/GRCh37) of all differentially interacting regions:
library(GenomicRanges) # BiocManager::install("GenomicRanges") # Make GRanges from significant regions sig.regions <- topDirs(rao2017, logfc_cutoff = 1, p.adj_cutoff = 10^-15, return_df = 'bed') sig.regions.gr <- makeGRangesFromDataFrame(sig.regions, seqnames.field = 'chr', start.field = 'start', end.field = 'end', keep.extra.columns = TRUE) sig.regions.gr #> GRanges object with 26 ranges and 5 metadata columns: #> seqnames ranges strand | count avgD avgLogFC #> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> #> [1] chr1 206000000-206499999 * | 107 45.1028 -1.2141 #> [2] chr1 156000000-156499999 * | 106 69.5377 -1.0095 #> [3] chr1 156500000-156999999 * | 102 68.7843 -1.0139 #> [4] chr1 16000000-16499999 * | 101 166.1980 -1.0247 #> [5] chr1 145500000-145999999 * | 95 48.9263 -1.3783 #> ... ... ... ... . ... ... ... #> [22] chr1 112500000-112999999 * | 41 80.5122 -1.0065 #> [23] chr1 147500000-147999999 * | 41 30.8049 -1.6457 #> [24] chr1 120500000-120999999 * | 35 27.8000 -1.1480 #> [25] chr1 144000000-144499999 * | 29 50.3103 -1.0657 #> [26] chr1 148000000-148499999 * | 6 21.6667 -1.9174 #> avgLogCPM avgP.adj #> <numeric> <character> #> [1] 6.9624 2.1193E-70 #> [2] 8.1635 3.9946E-69 #> [3] 8.2423 4.9611E-66 #> [4] 7.9119 4.7194E-66 #> [5] 7.2570 7.0017E-63 #> ... ... ... #> [22] 8.0560 1.8714E-27 #> [23] 6.2267 2.4473E-28 #> [24] 6.0484 3.5931E-24 #> [25] 6.2969 5.9417E-20 #> [26] 4.7078 1.9862E-19 #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Next, we get the genomic coordinates of all protein-coding genes in the genome. They will be used for a permutation test, to assess the average probability of overlap between DIRs and genes:
# Install annotables package for gene locations # devtools::install_github("stephenturner/annotables") library(annotables) library(dplyr) # Use annotables for hg19 symbols hg19_symbols <- grch37 %>% # Get genomic coordinates for hg19/GRCh37 genome assembly subset(.,biotype == "protein_coding") %>% # For protein-coding genes only subset(., chr %in% c(1:22, 'X')) # On autosomes and X chromosome # Make X chromosome numeric for compatibility with Hi-C data conventions hg19_symbols$chr[hg19_symbols$chr == 'X'] <- 23 hg19_symbols$chr <- paste0('chr', hg19_symbols$chr) hg19_symbols$strand <- ifelse(hg19_symbols$strand == -1, '-', '+') head(hg19_symbols) #> # A tibble: 6 x 9 #> ensgene entrez symbol chr start end strand biotype description #> <chr> <int> <chr> <chr> <int> <int> <chr> <chr> <chr> #> 1 ENSG000… 7105 TSPAN6 chr23 9.99e7 9.99e7 - protei… tetraspanin 6 [So… #> 2 ENSG000… 64102 TNMD chr23 9.98e7 9.99e7 + protei… tenomodulin [Sour… #> 3 ENSG000… 8813 DPM1 chr20 4.96e7 4.96e7 - protei… dolichyl-phosphat… #> 4 ENSG000… 57147 SCYL3 chr1 1.70e8 1.70e8 - protei… SCY1-like 3 (S. c… #> 5 ENSG000… 55732 C1orf1… chr1 1.70e8 1.70e8 + protei… chromosome 1 open… #> 6 ENSG000… 2268 FGR chr1 2.79e7 2.80e7 - protei… feline Gardner-Ra…
Then, we need to download the list of differentially expressed genes from GEO:
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE106nnn/GSE106886/suppl/GSE106886_Rao-2017-CMVnotreat_vs_RAD21treat.Genes.DESeq2.txt.gz
Now, we read the list of differentially expressed genes into R, get their genomic coordinates (hg19 genome assembly), and create the GRanges object:
fileName <- system.file("extdata", 'GSE106886_Rao-2017-CMVnotreat_vs_RAD21treat.Genes.DESeq2.txt.gz', package = "HiCcompareWorkshop") de.genes <- read.table(gzfile(fileName)) # Add "symbol" column de.genes <- de.genes %>% mutate(symbol = rownames(de.genes)) # Remove genes without differential expression statistics de.genes <- de.genes[ !is.na(de.genes[, "padj"]), ] # Select the most significant differentially expressed genes de.genes <- de.genes[de.genes[, "padj"] < 0.05, ] # FDR cutoff 0.05 # Merge differentially expressed genes with genomic coordinates de.genes <- left_join(de.genes, hg19_symbols, by = c('symbol' = 'symbol')) # Remove rows with NAs de.genes <- de.genes[complete.cases(de.genes),] # Make GRanges object for DE genes de.genes.gr <- GRanges(de.genes$chr, IRanges(start = de.genes$start, end = de.genes$end)) de.genes.gr #> GRanges object with 6380 ranges and 0 metadata columns: #> seqnames ranges strand #> <Rle> <IRanges> <Rle> #> [1] chr9 74729511-74872109 * #> [2] chr4 15004298-15071777 * #> [3] chr2 197059094-197458416 * #> [4] chr10 72432559-72522197 * #> [5] chr15 75011883-75017951 * #> ... ... ... ... #> [6376] chr11 58874658-58894883 * #> [6377] chr20 37377085-37400834 * #> [6378] chr8 13947373-15095848 * #> [6379] chr11 9481866-9550071 * #> [6380] chr3 180585929-180700541 * #> ------- #> seqinfo: 23 sequences from an unspecified genome; no seqlengths # Save the de.genes GRanges object # save(de.genes.gr, file = 'rao2017_genes.rda')
# To start the downstream analysis, load the saved file # load('rao2017_genes.rda') # Find overlaps olaps <- findOverlaps(sig.regions.gr, de.genes.gr) olaps #> Hits object with 37 hits and 0 metadata columns: #> queryHits subjectHits #> <integer> <integer> #> [1] 2 1502 #> [2] 2 1657 #> [3] 2 3512 #> [4] 2 5725 #> [5] 2 5726 #> ... ... ... #> [33] 21 1848 #> [34] 21 3419 #> [35] 22 643 #> [36] 22 686 #> [37] 24 1804 #> ------- #> queryLength: 26 / subjectLength: 6380
We can see that there are 37 overlaps between the DE genes and our significant regions. To test if this amount of overlap is significantly different from what can be expected by chance, we can perform a permutation test by testing the overlap of the DE genes with randomly selected regions from the genome. We can use multiHiCcompare
’s built-in permutation test for checking the enrichment of genomic features:
# use multiHiCcompare's permutation test function p.value <- perm_test(rao2017, de.genes.gr, p.adj_cutoff = 10^-15, logfc_cutoff = 1, num.perm = 1000) p.value #> [1] 0.5454545
We can see that for our toy example, the DE genes are not significantly enriched in our DIRs that were detected by mulitHiCcompare
(p-value = 5.455E-01).
Given the significant overlap between DE genes and DIRs, it may be of interest to test whether all genes overlapping DIRs are enriched in any canonical pathway or gene ontology annotation category. This can be done using the ROntoTools
R package (Voichita, Ansari, and Draghici 2020).
First, we need to get the genomic locations of the genes including strand information:
library(clusterProfiler) # BiocManager::install("clusterProfiler") # library(DOSE) # BiocManager::install("DOSE") library(ROntoTools) # BiocManager::install("ROntoTools") # library(graph) # BiocManager::install("graph") # Make GRanges out of all genes hg19_symbols.gr <- makeGRangesFromDataFrame(hg19_symbols, seqnames.field = 'chr', start.field = 'start', end.field = 'end', strand.field = 'strand', keep.extra.columns = TRUE) # Overlap genes with DIRs defined previously olap <- findOverlaps(sig.regions.gr, hg19_symbols.gr) length(olap) #> [1] 221
Next, we need to create a named vector for the number of times each region was detected as significantly interacting. The names for this vector will be the Entrez gene IDs. This vector will be used for the pathway enrichment to walk down the list of genes overlapping most-to-least frequently detected DIRs:
# Create "gene_counts" data.frame with the column for count and gene symbol genes_olap <- olap %>% as.data.frame %>% group_by(queryHits) %>% mutate(genes = hg19_symbols.gr@elementMetadata$symbol[subjectHits]) %>% dplyr::select(queryHits, genes) %>% distinct() tmp <- sig.regions %>% dplyr::select(count, avgLogFC, avgP.adj) %>% mutate(id = 1:nrow(sig.regions)) gene_counts <- left_join(genes_olap, tmp, by = c('queryHits' = 'id')) # Convert gene symbols into entrez ID entrez <- bitr(gene_counts$genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = 'org.Hs.eg.db') # Join the Entrez ID to the counts gene_counts <- left_join(gene_counts, entrez, by = c('genes' = 'SYMBOL')) # Remove unmapped entries gene_counts <- gene_counts[complete.cases(gene_counts), ] # Make the named vector of fold changes and pvalues for genes fc <- gene_counts$avgLogFC names(fc) <- paste0('hsa:', gene_counts$ENTREZID) pv <- as.numeric(gene_counts$avgP.adj) names(pv) <- paste0('hsa:', gene_counts$ENTREZID) # load KEGG pathways kpg <- keggPathwayGraphs("hsa", updateCache = FALSE, verbose = FALSE) # set edge weights kpg <- setEdgeWeights(kpg, edgeTypeAttr = "subtype", edgeWeightByType = list(activation = 1, inhibition = -1, expression = 1, repression = -1), defaultWeight = 0)
Now we can plug the fc
(fold-change) and pv
(p-value) vectors into ROntoTools
pathway analysis:
# Set node weights kpg <- setNodeWeights(kpg,weights = alpha1MR(pv), defaultWeight = 1) # Perform pathway analysis peRes <- pe(x = fc, graphs = kpg, ref = paste0('hsa:', as.character(hg19_symbols$entrez)), nboot = 200, verbose = FALSE) # Prepare results table kpn <- keggPathwayNames("hsa") table1 <- head(Summary(peRes, pathNames = kpn, totalAcc = FALSE, totalPert = FALSE, pAcc = FALSE, pORA = FALSE, comb.pv = NULL, order.by = "pPert"), n = 15) table1$pPert <- round(table1$pPert, digits = 3) table1$pPert.fdr <- round(table1$pPert.fdr, digits = 3)
# Print results knitr::kable(table1)
pathNames | pPert | pPert.fdr | |
---|---|---|---|
path:hsa04146 | Peroxisome | 0.015 | 0.583 |
path:hsa04740 | Olfactory transduction | 0.020 | 0.583 |
path:hsa04970 | Salivary secretion | 0.025 | 0.583 |
path:hsa04972 | Pancreatic secretion | 0.035 | 0.583 |
path:hsa04630 | JAK-STAT signaling pathway | 0.040 | 0.583 |
path:hsa04971 | Gastric acid secretion | 0.045 | 0.583 |
path:hsa04213 | Longevity regulating pathway - multiple species | 0.050 | 0.583 |
path:hsa03013 | RNA transport | 0.060 | 0.612 |
path:hsa04270 | Vascular smooth muscle contraction | 0.080 | 0.725 |
path:hsa04145 | Phagosome | 0.095 | 0.742 |
path:hsa04742 | Taste transduction | 0.104 | 0.742 |
path:hsa05165 | Human papillomavirus infection | 0.109 | 0.742 |
path:hsa04010 | MAPK signaling pathway | 0.139 | 0.742 |
path:hsa05230 | Central carbon metabolism in cancer | 0.139 | 0.742 |
path:hsa04141 | Protein processing in endoplasmic reticulum | 0.154 | 0.742 |
Here we can see the results of the Pathway analysis. These pathways do not make much sense in the context of the auxin vs. normal experiment and given only chromosome 1 was analyzed.
Another plausible hypothesis to test is the overlap between the DIRs and the boundaries of topologically associated domains (TADs).
We will first need to identify the TADs for the datasets being used. We can use the TopDom
R package for TAD identification. Note: TADs are typically called using Hi-C data at resolutions of >50KB, however for simplicity here, we continue to use the 50KB data. If the user plans to perform an analysis using TADs, she/he should call them at >50KB resolution.
# Install TopDom # remotes::install_github("HenrikBengtsson/TopDom", ref="master") library("TopDom")
Next, we will create the matrix file necessary for TopDom
. TopDom
requires an \(N \times (N+3)\) matrix in a text file. We can create this file for chromosome 1 as follows:
# Convert sparse matrix read in at beginning of tutorial to a full matrix mat <- sparse2full(sample_list[[1]][chr == 1, c('region1', 'region2', 'IF')]) # Create 3 extra columns necessary for TopDom bed <- data.frame(chr = 'chr1', start = colnames(mat), end = as.numeric(colnames(mat)) + resolution(rao2017)) # Merge 3 columns with full matrix mat <- cbind(bed, mat) # Write as a text file for input into TopDom write_tsv(mat, path = 'chr1.matrix', col_names = FALSE)
The user should now have a text file containing the \(N \times (N+3)\) contact matrix for chromosome 1 in the file chr1.matrix
. Now we can input the matrix into TopDom
and get the TAD boundaries:
TADs <- TopDom(data = "chr1.matrix", window.size = 5)
The results contain a BED file indicating the positions of the gaps, domains, and boundaries. We will pull out the locations of the boundaries to check if the DIRs are enriched within them:
# Pull out the bed file from the TopDom results with boundary locations boundaries <- TADs$bed # Subset to only boundaries boundaries <- boundaries[boundaries$name == "boundary",] # Convert to GRanges boundaries <- makeGRangesFromDataFrame(boundaries, seqnames.field = 'chrom', start.field = 'chromStart', end.field = 'chromEnd', keep.extra.columns = TRUE)
Similarly, we prepare a list of DIRs on chromosome 1:
# Make GRanges object for DIRs from `counts` object created with the topDIRs function chr1.dir <- counts[counts$chr == 'chr1', ] chr1.dir <- makeGRangesFromDataFrame(chr1.dir, seqnames.field = 'chr', start.field = 'start', end.field = 'end', keep.extra.columns = TRUE) # Find overlaps between boundaries and DIRs olaps <- findOverlaps(chr1.dir, boundaries) olaps #> Hits object with 2 hits and 0 metadata columns: #> queryHits subjectHits #> <integer> <integer> #> [1] 4 2 #> [2] 10 1 #> ------- #> queryLength: 26 / subjectLength: 6
Next, we perform a permutation test similar to the one performed in the previous section. This will test for enrichment of DIRs within the TAD boundaries:
# subset rao2017 Hicexp object to only chr1 chr1.rao2017 <- rao2017 slot(rao2017, "comparison") <- results(rao2017)[chr == 1, ] # perofrm permutation test p.value <- perm_test(rao2017, boundaries, p.adj_cutoff = 0.01, logfc_cutoff = 1, num.perm = 1000) p.value #> [1] 0.1448551
The DIRs do not seem to be enriched within TAD boundaries. This could be due to the simplifications we took for this tutorial. TADs should be called at resolutions of 50KB or higher, so it is possible our TAD boundaries are not as accurate as they should be. Additionally, it is possible that the changes induced by auxin do not target TAD boundaries but instead target smaller loop boundaries within the TADs.
The auxin treatment used in (Rao et al. 2017) was noted to destroy the RAD21 complex. Thus, our DIRs may correspond to changes at RAD21 binding sites.
We will need to download the location information for RAD21 binding sites for HCT-116 cells from CistromeDB. Search for “46207” and download the BED peaks 46207_peaks.bed
file into the working directory. Note the provided example has been subsetted to chromosome 1 only.
We can read the file into R:
rad21 <- read.table(system.file("extdata", '46207_chr1_peaks.bed', package = "HiCcompareWorkshop")) head(rad21) #> V1 V2 V3 V4 V5 #> 1 chr1 10149 10295 peak1 6.70350 #> 2 chr1 16184 16359 peak2 13.33365 #> 3 chr1 91401 91609 peak3 42.12526 #> 4 chr1 104859 105106 peak4 57.36115 #> 5 chr1 181917 182071 peak5 11.32777 #> 6 chr1 186798 187100 peak6 10.35612
This is a standard BED file. We will need to convert it into a GRanges
object so that we can input these locations into the permutation test function:
# convert X and Y chr names into 23 and 24 to correspond with multiHiCcompare results rad21$V1 <- sub("X", "23", rad21$V1) rad21$V1 <- sub("Y", "24", rad21$V1) # convert to GRanges rad21 <- GRanges(rad21$V1, IRanges(start = rad21$V2, end = rad21$V3)) # input into permutation test perm_test(rao2017, rad21, p.adj_cutoff = 10^-15, logfc_cutoff = 1, num.perm = 1000) #> [1] 0.97003
RAD21 sites are not significantly enriched in the DIRs. This is due to the oversimplification of analyzing chromosome 1 only. This enrichment is significant when all chromosomes are analyzed, confirming the published observations (Rao et al. 2017).
Ay, Ferhat, and William S Noble. 2015. “Analysis Methods for Studying the 3D Architecture of the Genome.” Genome Biol 16 (September): 183. https://doi.org/10.1186/s13059-015-0745-7.
Ballman, Karla V, Diane E Grill, Ann L Oberg, and Terry M Therneau. 2004. “Faster Cyclic Loess: Normalizing Rna Arrays via Linear Models.” Bioinformatics 20 (16): 2778–86. https://doi.org/10.1093/bioinformatics/bth327.
Dozmorov, Mikhail G, Lukas R Cara, Cory B Giles, and Jonathan D Wren. 2016. “GenomeRunner Web Server: Regulatory Similarity and Differences Define the Functional Impact of Snp Sets.” Bioinformatics 32 (15): 2256–63. https://doi.org/10.1093/bioinformatics/btw169.
Durand, Neva C, Muhammad S Shamim, Ido Machol, Suhas S P Rao, Miriam H Huntley, Eric S Lander, and Erez Lieberman Aiden. 2016. “Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-c Experiments.” Cell Syst 3 (1): 95–98. https://doi.org/10.1016/j.cels.2016.07.002.
Karolchik, Donna, Galt P Barber, Jonathan Casper, Hiram Clawson, Melissa S Cline, Mark Diekhans, Timothy R Dreszer, et al. 2014. “The Ucsc Genome Browser Database: 2014 Update.” Nucleic Acids Res 42 (Database issue): D764–70. https://doi.org/10.1093/nar/gkt1168.
Lajoie, Bryan R, Job Dekker, and Noam Kaplan. 2015. “The Hitchhiker’s Guide to Hi-c Analysis: Practical Guidelines.” Methods 72 (January): 65–75. https://doi.org/10.1016/j.ymeth.2014.10.031.
Lun, Aaron T L, and Gordon K Smyth. 2015. “DiffHic: A Bioconductor Package to Detect Differential Genomic Interactions in Hi-c Data.” BMC Bioinformatics 16: 258. https://doi.org/10.1186/s12859-015-0683-0.
O’Sullivan, Justin M, Michael D Hendy, Tatyana Pichugina, Graeme C Wake, and Jörg Langowski. 2013. “The Statistical-Mechanics of Chromosome Conformation Capture.” Nucleus 4 (5): 390–8. https://doi.org/10.4161/nucl.26513.
Rao, Suhas S. P., Su-Chen Huang, Brian Glenn St Hilaire, Jesse M. Engreitz, Elizabeth M. Perez, Kyong-Rim Kieffer-Kwon, Adrian L. Sanborn, et al. 2017. “Cohesin Loss Eliminates All Loop Domains.” Cell 171 (2): 305–320.e24. https://doi.org/10.1016/j.cell.2017.09.026.
Rao, Suhas S P, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, et al. 2014. “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping.” Cell 159 (7): 1665–80. https://doi.org/10.1016/j.cell.2014.11.021.
Sheffield, Nathan C, and Christoph Bock. 2016. “LOLA: Enrichment Analysis for Genomic Region Sets and Regulatory Elements in R and Bioconductor.” Bioinformatics 32 (4): 587–9. https://doi.org/10.1093/bioinformatics/btv612.
Stansfield, John C, Kellen G Cresswell, and Mikhail G Dozmorov. 2019. “MultiHiCcompare: Joint Normalization and Comparative Analysis of Complex Hi-c Experiments.” Bioinformatics, January. https://doi.org/10.1093/bioinformatics/btz048.
Stansfield, John C, Kellen G Cresswell, Vladimir I Vladimirov, and Mikhail G Dozmorov. 2018. “HiCcompare: An R-Package for Joint Normalization and Comparison of Hi-c Datasets.” BMC Bioinformatics 19 (1): 279. https://doi.org/10.1186/s12859-018-2288-x.
Stansfield, John C, Duc Tran, Tin Nguyen, and Mikhail G Dozmorov. 2019. “R Tutorial: Detection of Differentially Interacting Chromatin Regions from Multiple Hi-c Datasets.” Curr Protoc Bioinformatics 66 (1): e76. https://doi.org/10.1002/cpbi.76.
Voichita, Calin, Sahar Ansari, and Sorin Draghici. 2020. ROntoTools: R onto-Tools Suite.
Yaffe, Eitan, and Amos Tanay. 2011. “Probabilistic Modeling of Hi-c Contact Maps Eliminates Systematic Biases to Characterize Global Chromosomal Architecture.” Nat Genet 43 (11): 1059–65. https://doi.org/10.1038/ng.947.