bm-tk

Simple pipeline for predicting bacterial base modification in bulk from PacBio HiFi sequencing data with kinetics tags.

Output BAMs are currently stored alongside the input files, named with the prefix jasmine_predict.{input_bam}. The pipeline implements a custom check for existing output and filters out any inputs which already have an output file in the expected location; these skipped inputs are logged. This behaviour can be disabled by setting --clobber true, which forces prediction to be rerun for all inputs.
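
As an illustration (the paths here are hypothetical), an output BAM sits next to its input with the prefix added:

# Hypothetical layout: the output BAM is written next to the input,
# with the jasmine_predict. prefix added to the filename
ls /data/runA/
# movie1.hifi_reads.bam
# jasmine_predict.movie1.hifi_reads.bam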

Currently, the pipeline will

  1. Filter out BAMs which appear irrelevant by name (containing fail, unassigned, subread, scrap, or fibertools_predict), or which already have existing output files.
  2. Filter out any BAMs which do not contain the required kinetics tags (CHECK_KINETICS); a manual check for these tags is sketched after this list.
  3. Predict 6mA, 5mC, and 5hmC base modification using jasmine (PREDICT_JASMINE)
  4. Extract modifications to a table using a custom Perl script (EXTRACT_CALLS). This currently only extracts modifications with a probability score > 240 (~0.94); the threshold is fixed and cannot be changed. Extraction is performed by default but is optional; disable it by setting --extract_calls false.
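
If you want to check a BAM by hand before running the pipeline, PacBio HiFi kinetics are typically carried in the fi/fp/ri/rp tags. A quick, illustrative check with samtools (input.bam is a placeholder path) is:

# Print any kinetics tags present on the first read of the BAM.
# fi/fp = forward IPD and pulse width, ri/rp = reverse IPD and pulse width.
samtools view input.bam | head -n 1 | tr '\t' '\n' | grep -E '^(fi|fp|ri|rp):'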

Installation notes

samtools should be available in the environment you launch the pipeline from.

Running using slurm and either apptainer/singularity or micromamba

This will show how to run the pipeline using either micromamba environments or singularity/apptainer. In both cases, we create a micromamba environment with nextflow installed. You could install nextflow in a different way; the important element is that samtools and nextflow are both available in the environment that will run nextflow.

If the machines you run the pipeline on do not have internet access, see the later section on running without internet access.

Install nextflow

Run

micromamba create -n nextflow nextflow conda samtools

We install conda within the environment because nextflow needs the conda binary to activate and deactivate environments. samtools is installed because each file's kinetics-tag check runs locally in the nextflow environment rather than being submitted as a job.
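
As an optional sanity check, you can confirm that the tools the pipeline relies on are visible in the new environment:

# Both commands should print version information if the environment was created correctly
micromamba activate nextflow
nextflow -version
samtools --version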

Pull the pipeline

(Optional). Nextflow can take a local copy of the pipeline to run. If your compute nodes have internet access, this step isn't strictly necessary.

nextflow pull apduncan/bm-tk -r v0.1

This pulls the v0.1 release tag. You could instead specify a different tag, a commit hash (e.g. 7097a95), or omit -r to pull the most recent commit on the main branch.

Move to directory where you will run the pipeline

Move to whichever directory you want pipeline logs and configuration to be kept in. Unlike many nextflow pipelines, output files will not be placed in this directory; output goes to the same location as the input BAMs.

Customise nextflow.config profile

This step isn't necessary if you are in our group; the default should work.

nextflow.config specifies profiles which give details for the submission system. The defaults work for our group; if you are using this elsewhere you will need to customise them. Take a copy of the default config

curl https://raw.githubusercontent.com/apduncan/bm-tk/refs/heads/main/nextflow.config > nextflow.config

You can either customise the nbi_slurm profile or copy it under a new name. If you are also using slurm, it should be enough to set your partition names in the queue fields.

Run pipeline

Activate your nextflow environment

micromamba activate nextflow

Then run the pipeline

nextflow run apduncan/bm-tk \
-profile nbi_slurm \
-work-dir /path/to/scratch \
-with-report \
-r main \
--bams "/glob/to/**/find*.bam"

Do this on a node where it is okay to start long-running jobs interactively, or put the above in a batch submission script.
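
For example, a minimal SLURM batch script wrapping the command above might look like the following; the partition, resource requests, paths, and glob are placeholders to adapt:

#!/bin/bash
#SBATCH --partition=ei-medium
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

# Make micromamba available in the non-interactive shell, then activate
# the environment that provides nextflow and samtools
eval "$(micromamba shell hook --shell bash)"
micromamba activate nextflow

nextflow run apduncan/bm-tk \
-profile nbi_slurm \
-work-dir /path/to/scratch \
-with-report \
-r main \
--bams "/glob/to/**/find*.bam"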

The pipeline should then run and produce your BAMs with predicted methylation.

This defaults to using singularity for execution. It will attempt to fetch the container image from the GitHub container registry automatically. If there is no internet access on the machines running these processes, see the later section. Similarly, if you want to use micromamba or an equivalent, see the section below.

Using micromamba/mamba/conda

Environments can be managed using micromamba or equivalents instead of containers.

To use micromamba, you can edit the profile in the nextflow.config file to:

  • Remove singularity.enabled = true from the profile scope
  • Add conda.enabled = true to the profile scope
  • Add conda.useMicromamba = true to the profile scope

It will look as follows:

profiles {
    ...
    nbi_slurm {
        conda.enabled = true
        conda.useMicromamba = true
        process {
...

Running without internet access

The main obstacle to running without internet access is that nextflow will not be able to pull the container or create the conda environment. However, we can do that on a node with internet access, then provide the path to the image or environment.

singularity or apptainer

All steps in the pipeline run using a single image, so the simplest method is to download this and provide a path to it at the command line. To use apptainer, simply substitute apptainer for singularity in the commands below.

singularity pull bmtk-latest.sif docker://ghcr.io/apduncan/bm-tk:latest

The pipeline can then be run with

nextflow run apduncan/bm-tk \
-with-singularity bmtk-latest.sif \
-profile nbi_slurm \
-work-dir /path/to/scratch \
-with-report \
-r main \
--bams "/glob/to/**/find*.bam"

The container path on the command line takes priority over the setting in nextflow.config, so it will use the image you pulled.

micromamba or equivalent

To create the environment, run

curl https://raw.githubusercontent.com/apduncan/bm-tk/refs/heads/main/env.yaml > env.yaml && \
micromamba env create -n bmtk --file env.yaml

Find the environment path

> micromamba env list | grep bmtk
bmtk                      /home/user/micromamba/envs/bmtk

Copy that path into the conda = setting of the profile in nextflow.config, e.g. for the nbi_slurm profile:

profiles {
    conda {
        conda.enabled = true
        process.conda = "/home/kam24goz/miniforge3/envs/pbbm"
    }
    nbi_slurm {
        conda.useMicromamba = true
        process {
            conda = "/home/user/micromamba/envs/bmtk"
            executor = 'slurm'
            queue = 'ei-medium'
            memory = '2GB'
            cpus = 2
...

When you submit the nextflow pipeline it should use this environment. Be sure to also put export NXF_OFFLINE='true' in your submission scripts, otherwise nextflow will waste time trying to phone home for updates.
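
For instance (assuming you pulled the pipeline to the local cache earlier, and adapting the placeholder paths), the relevant lines of an offline submission script would be:

# Stop nextflow checking the remote repository for updates
export NXF_OFFLINE='true'

nextflow run apduncan/bm-tk \
-profile nbi_slurm \
-work-dir /path/to/scratch \
--bams "/glob/to/**/find*.bam"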
