sr2silo can convert millions of short-read nucleotide alignments, supplied as .bam CIGAR alignments, into cleartext alignments, gracefully extracting insertions and deletions along the way. Optionally, sr2silo can translate and align each read in amino acid space using Diamond / blastX, again handling insertions and deletions.
Your input .bam/.sam contains lines such as:
294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
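Conceptually, converting a CIGAR alignment to cleartext means walking the CIGAR operations, placing read bases against the reference, and collecting insertions as they appear. The following is a minimal Python sketch of that idea, not sr2silo's actual implementation (expand_cigar and its behavior are invented for illustration; the 31S220M read above would be handled the same way):

import re

def expand_cigar(cigar: str, read_seq: str, ref_start: int):
    """Walk a CIGAR string and return the reference-aligned sequence
    plus a list of (reference position, inserted bases). Illustrative only."""
    aligned = []      # bases (or gaps) placed against the reference
    insertions = []   # insertions do not consume reference positions
    read_pos, ref_pos = 0, ref_start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in ("M", "=", "X"):     # consume read and reference
            aligned.append(read_seq[read_pos:read_pos + length])
            read_pos += length
            ref_pos += length
        elif op == "I":               # insertion: consumes read only
            insertions.append((ref_pos, read_seq[read_pos:read_pos + length]))
            read_pos += length
        elif op in ("D", "N"):        # deletion/skip: gap in the read
            aligned.append("-" * length)
            ref_pos += length
        elif op == "S":               # soft clip: skip read bases
            read_pos += length
        # H (hard clip) and P (padding) consume neither sequence here
    return "".join(aligned), insertions

# A toy read: 4 matches, a 2-base insertion, 3 matches
seq, ins = expand_cigar("4M2I3M", "ACTGTTACG", ref_start=79)
print(seq, ins)  # -> ACTGACG [(83, 'TT')]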
For each read, sr2silo outputs a JSON record (mock output):
{
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
...
},
"nucleotideInsertions":{
"main":[10 : ACTG]
},
"aminoAcidInsertions":{
"E":[],
...
"ORF1a":[2323 : TG, 2389 : CA],
...
"S":[23 : A]
},
"alignedNucleotideSequences":
{
"main":"NNNNNNNNNNNNNNNNNNCGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTGNNNNNNNNNNNNNNNNNNNNNNNN"
},
"unalignedNucleotideSequences":{
"main":"CGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGA...TACAGGTTCGCGACGTGCTCGTGTGAAAGATGGCACTTGTG"
},
"alignedAminoAcidSequences":{
"E":"",
...
"ORF1a":"...NMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...",
...
"S":""}
}
The complete output is written as an .ndjson.zst file (newline-delimited JSON, Zstandard-compressed).
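To inspect the output downstream, you can stream-decompress it with the zstandard package. A minimal sketch (read_records is our helper name; the field access follows the mock record above):

import io
import json

import zstandard  # pip install zstandard

def read_records(path: str):
    """Yield one JSON record per line from an .ndjson.zst file."""
    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

for record in read_records("results/output.ndjson.zst"):
    print(record["metadata"]["read_id"], record["aminoAcidInsertions"])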
When running sr2silo, particularly the import-to-loculus command, be aware of memory and storage requirements:
- Standard configuration uses 8GB RAM and one CPU core
- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
- Temporary storage needs (especially on clusters) can reach 30-50GB
For detailed information about resource requirements, especially for cluster environments, please refer to the Resource Requirements documentation.
Originally, this project was started to wrangle short-read genomic alignments from wastewater sampling into a format for easy import into Loculus and its sequence database SILO.
sr2silo is designed to process nucleotide alignments from .bam files with metadata, translate and align reads in amino acid space, gracefully handle all insertions and deletions, and upload the results to the backend LAPIS-SILO.
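The translate-and-align step relies on Diamond's blastx mode. If you want to run a comparable step yourself, a sketch follows (this is not sr2silo's internal call; database and file names are placeholders):

import subprocess

# Placeholder paths; the protein database would be built once with:
#   diamond makedb --in ref_proteins.fasta -d ref_db
subprocess.run(
    [
        "diamond", "blastx",
        "--db", "ref_db.dmnd",      # protein reference database
        "--query", "reads.fasta",   # nucleotide reads to translate-align
        "--out", "matches.tsv",     # alignment results
        "--outfmt", "6",            # BLAST tabular format
    ],
    check=True,  # raise if Diamond exits non-zero
)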
For the V-Pipe to SILO implementation, we carry through the following metadata:
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
"sample_id":"A1_05_2024_10_08",
"batch_id":"20241024_2411515907",
"sampling_date":"2024-10-08",
"sequencing_date":"2024-10-24",
"location_name":"Lugano (TI)",
"read_length":"250","primer_protocol":"v532",
"location_code":"05",
"flow_cell_serial_number":"2411515907"
"sequencing_well_position":"A1",
"primer_protocol_name":"SARS-CoV-2 ARTIC V5.3.2",
"nextclade_reference":"sars-cov-2"
}
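For scripting against these records, the metadata block maps naturally onto a typed structure. A sketch (the TypedDict itself is ours; field names come from the example above, where all values are strings):

from typing import TypedDict

class ReadMetadata(TypedDict):
    """Per-read metadata, mirroring the example record above."""
    read_id: str
    sample_id: str
    batch_id: str
    sampling_date: str
    sequencing_date: str
    location_name: str
    read_length: str
    primer_protocol: str
    location_code: str
    flow_cell_serial_number: str
    sequencing_well_position: str
    primer_protocol_name: str
    nextclade_reference: str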
To build the package and maintain dependencies, we use Poetry. We recommend installing it and becoming familiar with its basic functionality by reading the documentation.
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the environments/ directory:
For basic usage of sr2silo:
make setup
This creates the core conda environment with essential dependencies and installs the package using Poetry.
For development work:
make setup-dev
This command sets up the development environment with Poetry.
For working with the snakemake workflow:
make setup-workflow
This creates an environment specifically configured for running sr2silo in Snakemake workflows.
You can set up all environments at once:
make setup-all
After setting up the development environment:
conda activate sr2silo-dev
poetry install --with dev
poetry run pre-commit install
make test
or
conda activate sr2silo-dev
pytest
The sr2silo CLI has two main commands:
- run: Not yet implemented; reserved for future functionality
- import-to-loculus: Convert BAM alignments to SILO format and optionally upload them
The main command you'll use is import-to-loculus:
sr2silo import-to-loculus \
--input-file INPUT.bam \
--sample-id SAMPLE_ID \
--batch-id BATCH_ID \
--timeline-file TIMELINE.tsv \
--primer-file PRIMERS.yaml \
--output-fp OUTPUT.ndjson \
--reference sars-cov-2
- --input-file, -i: Path to the input BAM alignment file
- --sample-id, -s: Sample ID to use for metadata
- --batch-id, -b: Batch ID to use for metadata
- --timeline-file, -t: Path to the timeline metadata file
- --primer-file, -p: Path to the primers configuration file
- --output-fp, -o: Path for the output file (will be auto-suffixed with .ndjson.zst)
- --reference, -r: Reference genome to use (default: "sars-cov-2")
- --upload/--no-upload: Whether to upload results to S3 and submit to SILO (default: no-upload)
Here's a complete example with sample data:
sr2silo import-to-loculus \
--input-file ./data/sample/alignments/REF_aln_trim.bam \
--sample-id "A1_05_2024_10_08" \
--batch-id "20241024_2411515907" \
--timeline-file ./data/timeline.tsv \
--primer-file ./data/primers.yaml \
--output-fp ./results/output.ndjson \
--reference sars-cov-2
To also upload the results to SILO, add the --upload flag:
sr2silo import-to-loculus \
# ...same arguments as above... \
--upload
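If you need to process many samples, one option is to drive the CLI from Python. A minimal sketch using only the flags documented above (the sample sheet and all paths are placeholders):

import subprocess
from pathlib import Path

# Placeholder sample sheet: (sample_id, batch_id, bam_path)
samples = [
    ("A1_05_2024_10_08", "20241024_2411515907",
     Path("data/sample/alignments/REF_aln_trim.bam")),
]

for sample_id, batch_id, bam in samples:
    out = Path("results") / f"{sample_id}.ndjson"  # auto-suffixed to .ndjson.zst
    subprocess.run(
        [
            "sr2silo", "import-to-loculus",
            "--input-file", str(bam),
            "--sample-id", sample_id,
            "--batch-id", batch_id,
            "--timeline-file", "data/timeline.tsv",
            "--primer-file", "data/primers.yaml",
            "--output-fp", str(out),
            "--reference", "sars-cov-2",
        ],
        check=True,  # stop if any sample fails
    )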
The code quality checks that run on GitHub can be seen in
.github/workflows/test.yml
for the Python package CI/CD.
We use:
- Ruff to lint the code.
- Black to format the code.
- Pyright to check the types.
- Pytest to run the unit tests and test the workflows.
- Interrogate to check the documentation.
This project welcomes contributions and suggestions. For details, visit the repository's Contributor License Agreement (CLA) and Code of Conduct pages.