sr2silo can convert millions of short-read nucleotide alignments, supplied as .bam CIGAR alignments, into cleartext alignments, gracefully extracting insertions and deletions along the way. Optionally, sr2silo can translate and align each read in amino acid space using Diamond / blastX, again handling insertions and deletions.
Your input .bam/.sam contains lines such as:
294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
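Conceptually, converting a CIGAR alignment to cleartext means walking the CIGAR operations, placing read bases against the reference, and collecting insertions as they appear. The following is a minimal Python sketch of that idea, not sr2silo's actual implementation (expand_cigar and its behavior are invented for illustration; the 31S220M read above would be handled the same way):

import re

def expand_cigar(cigar: str, read_seq: str, ref_start: int):
    """Walk a CIGAR string and return the reference-aligned sequence
    plus a list of (reference position, inserted bases). Illustrative only."""
    aligned = []      # bases (or gaps) placed against the reference
    insertions = []   # insertions do not consume reference positions
    read_pos, ref_pos = 0, ref_start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in ("M", "=", "X"):     # consume read and reference
            aligned.append(read_seq[read_pos:read_pos + length])
            read_pos += length
            ref_pos += length
        elif op == "I":               # insertion: consumes read only
            insertions.append((ref_pos, read_seq[read_pos:read_pos + length]))
            read_pos += length
        elif op in ("D", "N"):        # deletion/skip: gap in the read
            aligned.append("-" * length)
            ref_pos += length
        elif op == "S":               # soft clip: skip read bases
            read_pos += length
        # H (hard clip) and P (padding) consume neither sequence here
    return "".join(aligned), insertions

# A toy read: 4 matches, a 2-base insertion, 3 matches
seq, ins = expand_cigar("4M2I3M", "ACTGTTACG", ref_start=79)
print(seq, ins)  # -> ACTGACG [(83, 'TT')]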
For each read, sr2silo outputs a JSON record (mock output):
{
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
...
},
"nucleotideInsertions":{
"main":[10 : ACTG]
},
"aminoAcidInsertions":{
"E":[],
...
"ORF1a":[2323 : TG, 2389 : CA],
...
"S":[23 : A]
},
"alignedNucleotideSequences":
{
"main":"NNNNNNNNNNNNNNNNNNCGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTGNNNNNNNNNNNNNNNNNNNNNNNN"
},
"unalignedNucleotideSequences":{
"main":"CGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGA...TACAGGTTCGCGACGTGCTCGTGTGAAAGATGGCACTTGTG"
},
"alignedAminoAcidSequences":{
"E":"",
...
"ORF1a":"...NMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...",
...
"S":""}
}
The complete output is written as an .ndjson.zst file (newline-delimited JSON, Zstandard-compressed).
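To inspect the output downstream, you can stream-decompress it with the zstandard package. A minimal sketch (read_records is our helper name; the field access follows the mock record above):

import io
import json

import zstandard  # pip install zstandard

def read_records(path: str):
    """Yield one JSON record per line from an .ndjson.zst file."""
    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

for record in read_records("results/output.ndjson.zst"):
    print(record["metadata"]["read_id"], record["aminoAcidInsertions"])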
When running sr2silo, particularly the import-to-loculus command, be aware of memory and storage requirements:
- Standard configuration uses 8GB RAM and one CPU core
- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
- Temporary storage needs (especially on clusters) can reach 30-50GB
For detailed information about resource requirements, especially for cluster environments, please refer to the Resource Requirements documentation.
Originally, this project was started to wrangle short-read genomic alignments from wastewater sampling into a format for easy import into Loculus and its sequence database SILO.
sr2silo is designed to process nucleotide alignments from .bam files with metadata, translate and align reads in amino acid space, gracefully handle all insertions and deletions, and upload the results to the backend LAPIS-SILO.
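The translate-and-align step relies on Diamond's blastx mode. If you want to run a comparable step yourself, a sketch follows (this is not sr2silo's internal call; database and file names are placeholders):

import subprocess

# Placeholder paths; the protein database would be built once with:
#   diamond makedb --in ref_proteins.fasta -d ref_db
subprocess.run(
    [
        "diamond", "blastx",
        "--db", "ref_db.dmnd",      # protein reference database
        "--query", "reads.fasta",   # nucleotide reads to translate-align
        "--out", "matches.tsv",     # alignment results
        "--outfmt", "6",            # BLAST tabular format
    ],
    check=True,  # raise if Diamond exits non-zero
)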
For the V-Pipe to SILO implementation, we carry through the following metadata:
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
"sample_id":"A1_05_2024_10_08",
"batch_id":"20241024_2411515907",
"sampling_date":"2024-10-08",
"sequencing_date":"2024-10-24",
"location_name":"Lugano (TI)",
"read_length":"250","primer_protocol":"v532",
"location_code":"05",
"flow_cell_serial_number":"2411515907"
"sequencing_well_position":"A1",
"primer_protocol_name":"SARS-CoV-2 ARTIC V5.3.2",
"nextclade_reference":"sars-cov-2"
}
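For scripting against these records, the metadata block maps naturally onto a typed structure. A sketch (the TypedDict itself is ours; field names come from the example above, where all values are strings):

from typing import TypedDict

class ReadMetadata(TypedDict):
    """Per-read metadata, mirroring the example record above."""
    read_id: str
    sample_id: str
    batch_id: str
    sampling_date: str
    sequencing_date: str
    location_name: str
    read_length: str
    primer_protocol: str
    location_code: str
    flow_cell_serial_number: str
    sequencing_well_position: str
    primer_protocol_name: str
    nextclade_reference: str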
To build the package and maintain dependencies, we use Poetry. We recommend installing it and becoming familiar with its basic functionality by reading the documentation.
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the environments/ directory:
For basic usage of sr2silo:
make setup
This creates the core conda environment with essential dependencies and installs the package using Poetry.
For development work:
make setup-dev
This command sets up the development environment with Poetry.
For working with the snakemake workflow:
make setup-workflow
This creates an environment specifically configured for running sr2silo in Snakemake workflows.
You can set up all environments at once:
make setup-all
After setting up the development environment:
conda activate sr2silo-dev
poetry install --with dev
poetry run pre-commit install
make test
or
conda activate sr2silo-dev
pytest
The sr2silo CLI has two main commands:
- run: Not yet implemented; reserved for future functionality
- import-to-loculus: Convert BAM alignments to SILO format and optionally upload them
The main command you'll use is import-to-loculus:
sr2silo import-to-loculus \
--input-file INPUT.bam \
--sample-id SAMPLE_ID \
--batch-id BATCH_ID \
--timeline-file TIMELINE.tsv \
--primer-file PRIMERS.yaml \
--output-fp OUTPUT.ndjson \
--reference sars-cov-2
- --input-file, -i: Path to the input BAM alignment file
- --sample-id, -s: Sample ID to use for metadata
- --batch-id, -b: Batch ID to use for metadata
- --timeline-file, -t: Path to the timeline metadata file
- --primer-file, -p: Path to the primers configuration file
- --output-fp, -o: Path for the output file (will be auto-suffixed with .ndjson.zst)
- --reference, -r: Reference genome to use (default: "sars-cov-2")
- --upload/--no-upload: Whether to upload results to S3 and submit to SILO (default: no-upload)
Here's a complete example with sample data:
sr2silo import-to-loculus \
--input-file ./data/sample/alignments/REF_aln_trim.bam \
--sample-id "A1_05_2024_10_08" \
--batch-id "20241024_2411515907" \
--timeline-file ./data/timeline.tsv \
--primer-file ./data/primers.yaml \
--output-fp ./results/output.ndjson \
--reference sars-cov-2
To also upload the results to SILO, add the --upload flag:
sr2silo import-to-loculus \
# ...same arguments as above... \
--upload
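If you need to process many samples, one option is to drive the CLI from Python. A minimal sketch using only the flags documented above (the sample sheet and all paths are placeholders):

import subprocess
from pathlib import Path

# Placeholder sample sheet: (sample_id, batch_id, bam_path)
samples = [
    ("A1_05_2024_10_08", "20241024_2411515907",
     Path("data/sample/alignments/REF_aln_trim.bam")),
]

for sample_id, batch_id, bam in samples:
    out = Path("results") / f"{sample_id}.ndjson"  # auto-suffixed to .ndjson.zst
    subprocess.run(
        [
            "sr2silo", "import-to-loculus",
            "--input-file", str(bam),
            "--sample-id", sample_id,
            "--batch-id", batch_id,
            "--timeline-file", "data/timeline.tsv",
            "--primer-file", "data/primers.yaml",
            "--output-fp", str(out),
            "--reference", "sars-cov-2",
        ],
        check=True,  # stop if any sample fails
    )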
The code quality checks that run on GitHub can be seen in
.github/workflows/test.yml
for the Python package CI/CD.
We use:
- Ruff to lint the code.
- Black to format the code.
- Pyright to check the types.
- Pytest to run the unit tests and test the workflows.
- Interrogate to check the documentation.
This project welcomes contributions and suggestions. For details, visit the repository's Contributor License Agreement (CLA) and Code of Conduct pages.