This Snakemake-based pipeline calls CHIP (clonal hematopoiesis of indeterminate potential) variants from next-generation whole-exome or whole-genome sequencing of human samples and produces a filtered VCF file containing high-confidence CHIP mutations.
Please download the following required reference and known-variation VCF files from the GRCh38 resource bundle in advance, and put all downloaded files into a single directory. This directory must match the gatk_db entry in your configuration file (config.yaml):
- GRCh38_full_analysis_set_plus_decoy_hla.fa
- GRCh38_full_analysis_set_plus_decoy_hla.dict
- GRCh38_full_analysis_set_plus_decoy_hla.fa.alt
- GRCh38_full_analysis_set_plus_decoy_hla.fa.bwt
- GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
- GRCh38_full_analysis_set_plus_decoy_hla.fa.sa
- GRCh38_full_analysis_set_plus_decoy_hla.fa.pac
- GRCh38_full_analysis_set_plus_decoy_hla.fa.ann
- GRCh38_full_analysis_set_plus_decoy_hla.fa.amb
- GRCh38_full_analysis_set_plus_decoy_hla.fa.0123
- GRCh38_full_analysis_set_plus_decoy_hla.fa.bwt.2bit.64

Download sources:
- reference_files: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/
- germline_resource: https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
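Before starting a long run, it may help to confirm that the gatk_db directory is complete. The following is a minimal sketch, not part of the pipeline: the function name `missing_reference_files` is hypothetical, and it assumes the germline resource is stored in gatk_db under the file name that appears in its download URL. Any index file for that VCF is not listed above, so it is not checked here.

```python
from pathlib import Path

# File names copied from the list above.
REFERENCE_PREFIX = "GRCh38_full_analysis_set_plus_decoy_hla"
REFERENCE_SUFFIXES = [
    ".fa", ".dict", ".fa.alt", ".fa.bwt", ".fa.fai", ".fa.sa",
    ".fa.pac", ".fa.ann", ".fa.amb", ".fa.0123", ".fa.bwt.2bit.64",
]
# Assumption: the germline resource keeps the name used in its download URL.
GERMLINE_RESOURCE = "af-only-gnomad.hg38.vcf.gz"


def missing_reference_files(gatk_db):
    """Return the expected file names that are absent from gatk_db."""
    directory = Path(gatk_db)
    expected = [REFERENCE_PREFIX + suffix for suffix in REFERENCE_SUFFIXES]
    expected.append(GERMLINE_RESOURCE)
    return [name for name in expected if not (directory / name).is_file()]
```

Calling `missing_reference_files("/path/to/reference")` returns an empty list when every expected file is present.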
- Two FASTQ files (named 'sampleID_1.fq.gz' and 'sampleID_2.fq.gz') containing paired-end next-generation sequencing (WES or WGS) data. Please place both files in the same directory, and set that directory as sample_dir in your configuration file (config.yaml).
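The naming convention for the input files can be checked with a few lines of Python. This is an illustrative sketch only; `fastq_pair` and `check_fastq_pair` are hypothetical helper names, not functions provided by the pipeline.

```python
from pathlib import Path


def fastq_pair(sample_dir, sample_name):
    """Build the two expected FASTQ paths for one sample."""
    directory = Path(sample_dir)
    return (directory / f"{sample_name}_1.fq.gz",
            directory / f"{sample_name}_2.fq.gz")


def check_fastq_pair(sample_dir, sample_name):
    """Return a list of problems; an empty list means both mates were found."""
    problems = []
    for path in fastq_pair(sample_dir, sample_name):
        if not path.is_file():
            problems.append(f"missing: {path}")
    return problems
```

For example, `check_fastq_pair("/path/to/sampleFolder", "your_sampleName")` reports which of the two mate files, if any, is absent.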
- conda >= 22.9.0 is required
cd path/to/download
git clone https://github.com/Shuhua-Group/CHIP-mutation-calling-NGS-pipeline
cd CHIP-mutation-calling-NGS-pipeline
conda env create -f environment.yaml
conda activate SomaticMC
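The conda >= 22.9.0 requirement can be verified before creating the environment. The sketch below parses the text printed by `conda --version` (for example `conda 22.9.0`); the helper names are hypothetical and the check is a convenience, not something the pipeline performs itself.

```python
import re


def parse_conda_version(text):
    """Extract (major, minor, patch) from text such as 'conda 22.9.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def conda_is_recent_enough(text, minimum=(22, 9, 0)):
    """Compare the parsed version against the minimum stated in this README."""
    return parse_conda_version(text) >= minimum
```

One way to use it is to pass in the captured output of `subprocess.run(["conda", "--version"], capture_output=True, text=True).stdout`.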
- The provided configuration file (config.yaml) is shown below; modify the items described in the comment lines:
## sampleName (your input files must be named 'sampleName_1.fq.gz' and 'sampleName_2.fq.gz')
sampleName: "your_sampleName"
## replace "/path/to/download/CHIP-mutation-calling-NGS-pipeline" with the absolute path of the directory where the pipeline was downloaded
download_dir: /path/to/download/CHIP-mutation-calling-NGS-pipeline
## replace "/path/to/reference" with the absolute path of the directory where the required reference data were downloaded
gatk_db: /path/to/reference
## replace "/path/to/sampleFolder" with the absolute path of the directory that holds the sample files (named 'sampleID_1.fq.gz' and 'sampleID_2.fq.gz')
sample_dir: /path/to/sampleFolder
threads: 32
mem_mb: 65536
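A common mistake is to leave one of the template placeholders unchanged or to enter a relative path. The sketch below checks an already-parsed configuration (a plain dictionary, however you choose to load the YAML) against the keys shown above. The function name `config_problems` is hypothetical, and the rule that threads and mem_mb must be positive integers is an assumption based on the template values.

```python
import os

# Placeholder values copied from the template config.yaml above.
PLACEHOLDERS = {
    "sampleName": "your_sampleName",
    "download_dir": "/path/to/download/CHIP-mutation-calling-NGS-pipeline",
    "gatk_db": "/path/to/reference",
    "sample_dir": "/path/to/sampleFolder",
}


def config_problems(cfg):
    """Return human-readable problems found in a parsed config dictionary."""
    problems = []
    for key, placeholder in PLACEHOLDERS.items():
        value = cfg.get(key)
        if value is None:
            problems.append(f"{key}: missing")
        elif value == placeholder:
            problems.append(f"{key}: still set to the template placeholder")
        elif key != "sampleName" and not os.path.isabs(str(value)):
            problems.append(f"{key}: not an absolute path")
    # Assumption: both resource settings should be positive integers.
    for key in ("threads", "mem_mb"):
        value = cfg.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key}: expected a positive integer")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the directories exist.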
Once the config file is ready, you can run the pipeline as follows:
snakemake -s snakemake_SMC --configfile config.yaml -c 32
In real-data testing on a 32-core server, analysing paired-end ~30x WGS data from one sample took ~78 hours in total with a peak memory use of ~9 GB, while ~30x WES data from one sample took ~9 hours in total with a peak memory use of ~9 GB.
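Because a full run can take many hours, it can be worth previewing the scheduled jobs first. The sketch below assembles the command shown above and optionally appends Snakemake's `-n` (dry-run) flag, which lists the jobs without executing them. The helper name `snakemake_command` is hypothetical, and the default core count simply mirrors the example command.

```python
def snakemake_command(configfile="config.yaml", cores=32, dry_run=False):
    """Build the argument list for the pipeline's Snakemake invocation."""
    command = ["snakemake", "-s", "snakemake_SMC",
               "--configfile", configfile, "-c", str(cores)]
    if dry_run:
        # -n asks Snakemake to plan the jobs without running them.
        command.append("-n")
    return command
```

The returned list can be passed to `subprocess.run(..., check=True)` from inside the pipeline directory, first with `dry_run=True` and then without it.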
You can also run the pipeline on a PBS or SLURM cluster. See the Snakemake documentation for more details.
If the pipeline runs correctly, the result files will be written to {download_dir}/output/, including:
- a filtered individual VCF (named *.somatic.final.vcf.gz) containing the high-confidence somatic variants that remain after hard filtering by Mutect2, written to {download_dir}/output/vcf/{sample}
- an individual VCF (named *.mutect2.vcf.gz) containing the raw, unfiltered somatic variant calls, written to {download_dir}/output/vcf/{sample}, to which you can apply your own filtering rules
- a BAM file (named *.recal_reads.bam) containing the reads pre-processed by GATK BQSR, written to {download_dir}/output/gatk/{sample}, which can be loaded directly into IGV (Integrative Genomics Viewer) to check read coverage
- all log files, saved in the {download_dir}/output/logs/ directory
- To further interpret the results, see the Mutect2 documentation: https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2
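As a first look at either VCF, the records can be tallied by their FILTER column (the seventh tab-separated field in the VCF format). The sketch below uses only the Python standard library, whose gzip module can also read bgzip-compressed files; the function name `count_filter_values` is hypothetical. Which FILTER values actually appear in the final file depends on the pipeline's filtering step and is not assumed here.

```python
import gzip
from collections import Counter


def count_filter_values(vcf_gz_path):
    """Count records per FILTER value in a gzip- or bgzip-compressed VCF."""
    counts = Counter()
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 7:
                counts[fields[6]] += 1
    return counts
```

Comparing the tallies for the *.mutect2.vcf.gz and *.somatic.final.vcf.gz files of one sample gives a quick sense of how many calls were removed by filtering.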
- Please cite the relevant paper if you use this pipeline in your work
- tech support: [email protected]