Overview
Alignment
Hi-C reads are mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa
version 0.7.17. Specifically, we run:
bwa mem -SP5M -t<nthreads> <genome_index> <fastq1> <fastq2>
- The
-SP
option is used to ensure the results are equivalent to that obtained by runningbwa mem
on each mate separately, while retaining the right formatting for paired-end reads. This option skips a step inbwa mem
that forces alignment of a poorly aligned read given an alignment of its mate with the assumption that the two mates are part of a single genomic segment. - The
-5
option is used to report the 5' portion of chimeric alignments as the primary alignment. In Hi-C experiments, when a mate has chimeric alignments, typically, the 5' portion is the position of interest, while the 3' portion represents the same fragment as the mate. For chimeric alignments,bwa mem
reports two alignments: one of them is annotated as primary and soft-clipped, retaining the full-length of the original sequence. The other end is annotated as hard-clipped and marked as either 'supplementary' or 'secondary'. The-5
option forces the 5'end to be always annotated as primary. - The
-M
option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such aspicard MarkDuplicates
. - The
-t
option is used for multi-threading and should not affect the result.
Filtering
For filtering valid Hi-C alignments, we use pairtools
(previously called pairsamtools
). Specifically, we use version 0.2.2
. The filtering workflow outputs a pairs file containing a list of valid contacts.
This filtering workflow applies the following criteria:
- Reads marked as duplicates are removed.
- Full-length alignments that are unique are kept.
- An unmapped portion shorter than 20bp is ignored; and the rest of the alignment is still considered as full-length.
- In addition, clipped (chimeric) alignments are kept, if they are valid Hi-C contacts. This means one mate is clipped and the other is full-length and the 3'end of the clipped alignment is mapped with 2kb of the full-length alignment in the orientation that the two 3'ends are pointing toward each other.
- As with full-length alignments, any unmapped portion shorter than 20bp is ignored.
One of the design choices we have made is to include a lossless bam file as an output of the data processing. This output file, containing all the sequences in the original fastq files, the alignment results, and pairtools-provided flags for read filtering, is provided as a resource. To be able to produce this output file, the contents of the bam file is carried forward in the filtering workflow in intermediate pairsam
files. Users who are only interested in the valid contact lists may run the same analysis with more light-weight intermediate files.
Specifically, the filtering workflow consists of the following steps:
pairtools parse
- A bam file is read in, and a pairsam file is written out.
- The pairsam file is a pairs file, listing one read pair per line, with additional columns to track the sam-file lines, and a pairtools read classification.
- These classifications include information on whether the read aligned to 0, 1, or multiple places in the genome and whether it aligned end-to-end or if it was clipped.
- This tool also upper-triangularizes the reads, i.e. if the coordinate of second read is higher than the first, the reads are flipped.
-
For more details, see pairsamtools doc
-
pairtools sort
- A
pairsam
(or genericallypairs
) file is read in, and apairsam
file is written out. - The rows are sorted in chr1-chr2-pos1-pos2 order.
-
Note that the flipping order and sort order of chromosomes is not identical. See the docs for more details.
-
pairtools merge
- One or more
pairsam
(or genericallypairs
) files are read in, and apairsam
file is written out. -
The files are merged, preserving the sorted order.
-
pairtools dedup --mark-dups
- (equivalent to
pairtools markasdup
) - Duplicate alignments that share the same pair of 5'end coordinates +/- 3 bps are marked as identified.
-
An arbitrary one is retained with the original classification, while others get a duplicate classification.
-
pairtools select
- Only the reads with pairtools classification
UU
andUC
are retained and output to apairs
file.
Matrix aggregation and normalization
Depending on the experiment type, the Hi-C pipeline was used either as it is or without restriction enzyme-based intra-fragment contact filtering.
- In situ Hi-C and dilution Hi-C data have been processed with the Hi-C pipeline as it is.
- Data from similar experiment types were processed using the Hi-C pipeline either without restriction enzyme filtering or without balancing.
Experiment type | Restriction site file | no_balance |
---|---|---|
in situ Hi-C | yes | False |
dilution Hi-C | yes | False |
DNase Hi-C | no | False |
Micro-C | no | False |
Capture Hi-C | yes | True |
ChIA-PET | no | True |
4DN DCIC provides a Hi-C matrix in two different formats: .mcool
format and .hic
format. The two files are generated from the same pairs
file as input filtered contact list. Both files contain multiple resolutions.
.hic
format- A
.hic
file is produced by Juicertools (version 1.8.9-cuda8) and can be visualized using Juicebox - The matrix is normalized using the VC, VC_SQRT, KR methods, after filtering out intra-fragment contacts (contacts that fall in the same restriction fragment).
.mcool
format- An
.mcool
file is produced by Cooler (version 0.8.3) and can be visualized using HiGlass. - The matrix is normalized using the ICE (iterative correction and eigenvalue decomposition) matrix balancing algorithm, after the matrix-level removal of the diagonal and the rows/columns with a low value.
- The
.mcool
file also contains the normalization vectors generated by Juicertools (same as in a.hic
file generated from the samepairs
file)
Resolutions
Both mcool
and hic
files contain the following resolutions.
- 1kb, 2kb, 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb, 2.5Mb, 5Mb, 10Mb
Merging replicates
Replicates are merged in two steps before producing a matrix.
- Sequencing replicates (replicates within an experiment)
- Sequencing replicates are merged after the alignment but before duplicate removal step (
pairsam dedup
), since reads resulting from a single PCR duplication event may exist in both replicates. - Merging is performed on (intermediate)
pairsam
files usingpairsam merge
. - DCIC provides a merged output as a
pairs
file. - Biological replicates (replicates within an experiment set)
- Biological replicates are merged after duplicate removal step, since PCR duplication event happened independently in these replicates.
- Merging is performed on
pairs
files using run-merge-pairs.sh. - DCIC provides a merged output as a merged
pairs
file.
Source files
The pipeline components are pre-installed in a publicly available Docker image (duplexa/4dn-hic:v43
) on Docker Hub. The source code for the Docker image and pipeline description in Common Workflow Language (CWL) can be found on GitHub.
- Latest runs
Content-wise, 0.2.5
, 0.2.6
, 0.2.7
can be considered (nearly) identical.
0.2.7
- CWL : https://github.com/4dn-dcic/docker-4dn-hic/tree/v43/cwl
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v43
0.2.5
/0.2.6
- CWL : https://github.com/4dn-dcic/docker-4dn-hic/tree/v42.2/cwl
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v42.2
- Old runs
0.2.0
- CWL : https://github.com/4dn-dcic/pipelines-cwl/tree/0.2.0/cwl_awsem/
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v40
Example set of commands that were actually run as part of the pipeline can be found at https://github.com/4dn-dcic/docker-4dn-hic/blob/v42.2/HiCPipeline.md
Restriction Enzyme Check
For a bam file (produced in the alignment step), this process calculates the percentage of clipped sites with a restriction enzyme motif which matches the motif of the reported restriction enzyme. The result is available on the portal as the percent of clipped sites with RE motif field, located in the details section of the assessed bam file. Expected percentages for correctly reported restriction enzymes are in the neighborhood of 80%, while incorrect enzymes tend to produce values less than 10%.
The Docker image created for this step (4dndcic/4dn-re-checker/v1.1
) is available on Docker Hub. All source files can be found in the GitHub
re-checker v1.1 repository.