MSKCC-CTI/CTAinn is a Comprehensive TAPS Analysis pipeline built on Nextflow/nf-core. It is designed to be highly flexible and can be run in a wide range of computing environments, from a single laptop to a computing cluster or the cloud.
CTAinn processes TAPS (TET-assisted pyridine borane sequencing) data to analyze DNA methylation patterns. The pipeline takes raw FASTQ files from TAPS sequencing experiments and performs quality control, alignment, methylation calling, and comprehensive downstream analysis. It generates various outputs including quality metrics, methylation reports, and visualization files that enable researchers to understand DNA methylation patterns in their samples.
TAPS stands for TET-assisted pyridine borane sequencing.
The pipeline includes the following main steps:
1. Quality Control (FastQC) - Comprehensive quality assessment of raw sequencing reads
2. Concatenate FASTQs (cat) - Combines multiple FASTQ files for the same sample
3. Mapping, with one of:
   3.1. BWA-Meth - Alignment of TAPS reads to the reference genome
   OR
   3.2. BWA mem2 (the successor to bwa-mem) - Alignment of TAPS reads to the reference genome
4. Mark Duplicates (GATK4-MarkDuplicates) - Identification and marking of PCR duplicates
5. Methylation Calling, with one of:
   5.1. rasTair - Extraction of methylation calls from aligned reads
   5.2. asTair - Extraction of methylation calls from aligned reads
6. MultiQC (MultiQC) - Aggregation of all QC reports into a single dashboard
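The concatenation step works because gzip files can be joined byte-for-byte and still form a valid gzip file, so per-lane FASTQs for one sample can be merged without decompressing them. The following is a minimal sketch of that idea only. The file names and reads are made up for illustration, and it is not necessarily the exact command the pipeline runs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical per-lane files for one sample (names and reads are illustrative only).
workdir="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/S1_L001_R1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/S1_L002_R1.fastq.gz"

# Concatenate the compressed files directly; no decompression is needed.
cat "$workdir/S1_L001_R1.fastq.gz" "$workdir/S1_L002_R1.fastq.gz" \
    > "$workdir/S1_R1.merged.fastq.gz"

# The merged file decompresses to the records of both inputs, in order.
gzip -dc "$workdir/S1_R1.merged.fastq.gz"
```

For paired-end data, R1 and R2 files would need to be concatenated in the same lane order so that read pairs stay in sync.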
Note
If you are new to Nextflow, please refer to this page on how to set up Nextflow.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
TREATMENT_REP1,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
The samplesheet requires the following columns:
- sample: Unique sample identifier
- fastq_1: Path to forward reads (R1)
- fastq_2: Path to reverse reads (R2)
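The column requirements above can be checked before launching a run. The sketch below defines a hypothetical helper, check_samplesheet, which is not part of the pipeline. It assumes the header must match exactly and that all three fields are required on every row; the pipeline's own validation may be stricter or more lenient (for example, about single-end data).

```shell
#!/usr/bin/env bash

# check_samplesheet FILE  (hypothetical helper, not shipped with the pipeline)
# Returns 0 if FILE has the expected header and three non-empty,
# comma-separated fields on every data row. Otherwise prints the
# problems it found and returns 1.
check_samplesheet() {
    awk -F',' '
        # Strip a trailing carriage return in case the file has CRLF line endings.
        { sub(/\r$/, "") }
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        # Ignore blank lines.
        $0 == "" { next }
        NF != 3 {
            print "line " NR ": expected 3 columns, found " NF
            bad = 1
            next
        }
        $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": empty field"
            bad = 1
        }
        END {
            if (NR == 0) { print "file is empty"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

After sourcing the function, running check_samplesheet samplesheet.csv returns a non-zero exit status if any row is malformed. It does not check that the FASTQ files exist.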
You can run the pipeline using:
nextflow run </path/to/>/ctainn \
-profile <docker/singularity> \
--input samplesheet.csv \
--genome GRCh38 \
--outdir results
Parameters:
- --input: Path to samplesheet CSV file
- --outdir: Output directory path
- --email: Email address for completion notification
- --max_memory: Maximum memory to use (default: '128.GB')
- --max_cpus: Maximum CPUs to use (default: 12)
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
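As an illustration of the -params-file route, the sketch below writes the parameters from the example command into a YAML file. The file name params.yaml is an arbitrary choice, and the keys are simply the --input, --genome and --outdir flags with the leading dashes removed, which is how Nextflow maps params-file entries to pipeline parameters.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the pipeline parameters from the example command into a YAML params file.
# The file name is arbitrary; the values mirror the example command.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
genome: "GRCh38"
outdir: "results"
EOF

# Then launch with the params file instead of individual --flags, e.g.:
#   nextflow run </path/to/>/ctainn -profile <docker/singularity> -params-file params.yaml
```

Nextflow's own options, such as -profile and -params-file, take a single dash and stay on the command line; only pipeline parameters (double dash) belong in the file.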
mskcc-cti/ctainn was originally written by [email protected].
We thank the following people for their extensive assistance in the development of this pipeline:
- The nf-core community - Framework and best practices
If you would like to contribute to this pipeline, please see the contributing guidelines.
For support, please:
- Read the pipeline documentation
- Check existing issues
- Create a new issue with a detailed description of your problem
If you use mskcc-cti/ctainn for your analysis, please cite it using the following DOI: 10.5281/zenodo.XXXXXX
Key tools used in this pipeline:
- BWA-Meth
Pedersen BS, et al. Fast and accurate alignment of long bisulfite-seq reads. arXiv:1401.1129, 2014.
TO-DO: Complete the citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.