CUT&RUN Processing Pipeline – 4DN Data Portal

Overview

Designed as an alternative to ChIP-seq, CUT&RUN is a method for efficiently mapping DNA-protein interactions. The 4DN CUT&RUN processing pipeline consists of trimming, alignment, filtering, visualization, and peak calling. The final outputs of the CUT&RUN pipeline are a bigWig signal track and peaks in bed format, both of which can be visualized in HiGlass.

The CUT&RUN Docker image, used in all steps of the pipeline, can be found at https://hub.docker.com/r/4dndcic/cut-and-run-pipeline

Trimming

Input paired-end reads are merged according to orientation and trimmed with Trimmomatic version 0.36. In particular, Trimmomatic is called in paired-end mode (PE) with added parameters applied in the order they are specified:

ILLUMINACLIP:<path>/TruSeq3-PE.fa:2:15:4:1:true MINLEN:20

ILLUMINACLIP trims adapter sequences specified by TruSeq3-PE.fa
2 (seedMismatches) is a maximum mismatch count for full matches.
15 (palindromeClipThreshold) indicates the quality of match required for palindrome read alignment. The palindrome mode identifies adapter read-through by comparing the forward and reverse reads.
4 (simpleClipThreshold) indicates the accuracy a match between an adapter and read must have.
1 (minAdapterLength) indicates minimum adapter length for palindrome mode. This value is minimized to optimize adapter detection, leveraging a low false positive rate for palindrome mode.
true (keepBothReads) indicates that both reads should be kept when an adapter is trimmed in palindrome mode.
MINLEN:20 removes reads which are less than 20 bases (performed after all other trimming is completed).

To learn more about these settings, refer to the Trimmomatic documentation.

Alignment

After trimming, reads are aligned to a reference genome using bowtie2 version 2.2.6.

bowtie2 --dovetail --threads $threads -x [$index] -1 $fastq1 -2 $fastq2

The term 'dovetailing' describes mates which extend past one another, and is expected in CUT&RUN experiments. The --dovetail flag indicates that dovetailed alignments should be considered as concordant. See the bowtie2 documentation for an illustration of dovetailing.
Mouse and human index files are available on the data portal.

Filtering

Alignments in bam format are sorted and duplicates marked with Picard version 2.20.7. This consists of two steps:

Sorting bams with SortSam with SORT_ORDER=coordinate to specify that the input should be sorted by coordinate (used in the second step).
Marking duplicates for removal with MarkDuplicates.

In both steps, the flag VALIDATION_STRINGENCY=LENIENT is used to specify a more relaxed validation stringency (relative to STRICT). Refer to the Picard documentation for further details.

Duplicates are then removed using samtools version 1.9. Specifically, the command is:

samtools view -F 1024 -f 2 -b <input.bam>

-F 1024 omits reads marked as PCR or optical duplicates
-f 2 restricts the results to reads mapped in a proper pair

Finally, the alignments pass through a final set of cleaning and sorting recommended for peak calling with SEACR (see section below):

awk '$1==$4 && $1!="." && $6-$2 < 1000 {print $0}' <input.bam> | cut -f 1-6 | sort -k1,1 -k2,2n -k3,3n

awk '$1==$4 specifies that mates must be located on the same chromosome,
$1!="." specifies that mates cannot be null,
$6-$2 < 1000 specifies that mates' matched ends cannot be more than 1000 bases apart, and
cut -f 1-6 retains only the first six columns of the bam file.

Visualization and Peak Calling

Biological replicates are merged directly after filtering. Peak visualization is performed by a custom Gaussian smoother, which creates a Gaussian distribution around the midpoint of each read. Locations with more overlapping reads are thus represented by larger peaks.

The visualization output tracks are bedGraph files, which are converted into bigWig format using bedGraphToBigWig.

The final step of the pipeline is peak calling using SEACR version 1.3. The bedGraph files processed from both experimental and control experiments are needed to call peaks. In particular:

SEACR_1.3.sh <exp_bedgraph> <ctl_bedgraph> <norm> <stringency>

norm indicates the normalization experimental and control data and can be either "norm" (normalization occurs) or "non" (normalization does not occur). The default for this pipeline is "norm".
stringency specifies the threshold for peak calling. The default for this pipeline is relaxed, which compares the knee and peak of the curve, rather than stringent, which uses the peak of the curve only.

SEACR output is a bed file containing peaks. Each peak is detailed by its chromosome, start and end coordinates, total signal, and the location and value of the maximum signal within its coordinates. Peak calls can be viewed (alongside bigwig tracks) using HiGlass.

SEACR is designed specifically for CUT&RUN experiments. Learn more about SEACR from the SEACR GitHub page.

Source Files

The CUT&RUN pipeline is separated into two types of workflows: (1) The alignment workflow for control and experimental data and (2) the peak-calling workflow.

For the current version (v1), the Docker image source code and pipeline description in Common Workflow Language (CWL) are available on GitHub:

CWL: https://github.com/4dn-dcic/docker-4dn-cut-and-run-pipeline/tree/v1/cwl
Docker: https://github.com/4dn-dcic/docker-4dn-cut-and-run-pipeline/tree/v1/

CUT&RUN QC metrics will be available in the next version of the pipeline.