6). Running SOMA with example data

NOTE: Undergoing editing, text may not match included images/diagrams.

Downloading the example data

Download the test FASTQ data, stored on ENA.
Note: This is a public dataset of Illumina sequencing reads from the ZymoBIOMICS Microbial Community Standard. The original reads are available here and the relevant publication is available here.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/072/SRR15702472/SRR15702472_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/072/SRR15702472/SRR15702472_2.fastq.gz
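
Before moving on, it can be worth confirming that both downloads completed. The sketch below is not part of SOMA: count_reads is a hypothetical helper name, and the only assumptions are that gzip and awk are installed and that the FASTQ files use the usual four lines per record.

```shell
# Hypothetical helper (not part of SOMA): count the reads in a gzipped FASTQ
# file, assuming the standard four lines per record.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

for fq in SRR15702472_1.fastq.gz SRR15702472_2.fastq.gz; do
  # Skip files that are not present in the current directory.
  [ -f "$fq" ] || continue
  # gzip -t exits non-zero if the archive is truncated or corrupt.
  if gzip -t "$fq"; then
    echo "$fq: OK, $(count_reads "$fq") reads"
  else
    echo "$fq: failed integrity check, consider re-downloading" >&2
  fi
done
```

For paired-end data, both files should report the same number of reads.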

You'll then need to edit examples/SRR15702472_input.csv so that it contains the full paths to the SRR15702472_*.fastq.gz files.
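
If you are unsure what the full paths are, one way to print them for pasting into the CSV is sketched below. The helper name abs_path is hypothetical, and the sketch assumes the FASTQ files sit in the directory you run it from. It makes no assumption about the column layout of the CSV; follow the layout already present in the example file.

```shell
# Hypothetical helper: print the absolute path of a file, given a relative one.
abs_path() {
  # Resolve the directory part to an absolute path, then re-attach the file name.
  printf '%s/%s\n' "$(cd "$(dirname "$1")" && pwd)" "$(basename "$1")"
}

# Print the full path of each downloaded read file.
for fq in SRR15702472_1.fastq.gz SRR15702472_2.fastq.gz; do
  abs_path "$fq"
done
```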

Running SOMA

You can then run SOMA with the following command:

./run_soma --input examples/SRR15702472_input.csv --outdir PRJNA759733_results

If you need to restart the run, you can add '-resume', as below:

./run_soma --input examples/SRR15702472_input.csv --outdir PRJNA759733_results -resume

Summary report

We will start with the summary report, found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.summary_report.html (example here)
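
To locate the HTML reports described in this walkthrough, a small search over the output directory can help. This is a sketch that only assumes the report files end in _report.html, as the paths quoted on this page do; list_reports is a hypothetical helper name.

```shell
# Hypothetical helper: list every HTML report found under an output directory.
list_reports() {
  find "$1" -type f -name '*_report.html' | sort
}

# Only search if the results directory from the run command exists.
if [ -d PRJNA759733_results ]; then
  list_reports PRJNA759733_results
fi
```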

a). Sample background

These sections provide basic information on pipeline execution and sample metadata.

b). Read quality control

This section shows the pre- and post-filtering read quality control information. Here, for example, we can see that we started with 21 million reads and retained ~90% of them post-QC.
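
The retained percentage is simply the post-QC read count divided by the pre-QC read count. A minimal sketch of that arithmetic follows; percent_retained is a hypothetical helper, and the counts passed to it are illustrative values chosen to match the rounded figures above, not numbers copied from the report.

```shell
# Hypothetical helper: percentage of reads retained after QC.
# $1 = reads before QC, $2 = reads after QC
percent_retained() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { if (before > 0) printf "%.1f\n", 100 * after / before }'
}

# Illustrative counts only (not taken from the report).
percent_retained 21000000 18900000
```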

c). Taxonomic summary

We see broadly the expected species represented in this summary. Among the lower-abundance results we do see representatives of the right genera but the wrong species, although only in very small proportions (this is expected with Kraken2).

d). Binning summary

Here we see that the assembly and polishing stages have produced a 41 Mb assembly in 4317 contigs, with a reasonable contig N₅₀. Looking at the binning summary, we can see the assembly has been placed into 8 bins, which represent the majority of bases, and 7/8 of these bins are high-quality. Only a small proportion of contigs are assigned to bins, which can be explained by two factors: the unbinned contigs are predominantly short (<5 kb), and/or they represent the eukaryotic species (Saccharomyces cerevisiae and Cryptococcus neoformans) found in the sample, which are not binned by the tools used by SOMA. More detail on binning can be found below.
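
For readers unfamiliar with the contig N₅₀: it is the length of the shortest contig among the longest contigs that together cover at least half of the assembly. The sketch below shows one way to compute it from a FASTA file. It is not how SOMA itself calculates the value; fasta_lengths and n50 are hypothetical helpers, and assembly.fasta is a placeholder file name.

```shell
# Hypothetical helper: print the length of each sequence in a FASTA file.
fasta_lengths() {
  awk '/^>/ { if (n) print n; n = 0; next }
       { n += length($0) }
       END { if (n) print n }' "$1"
}

# Hypothetical helper: read contig lengths on stdin and print the N50.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END {
      half = total / 2
      # Walk from the longest contig down until half the assembly is covered.
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= half) { print len[i]; exit }
      }
    }'
}

# assembly.fasta is a placeholder; point this at an assembly of your own.
if [ -f assembly.fasta ]; then
  fasta_lengths assembly.fasta | n50
fi
```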

e). Bin quality control

We can see bins for the species in this sample (excluding Saccharomyces cerevisiae and Cryptococcus neoformans). All but one of them (7/8) are high-quality (based on CheckM completeness and contamination scores), and they have the expected assembly size and GC content.
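
To illustrate how completeness and contamination scores translate into a quality label, here is a minimal sketch. The thresholds (more than 90% completeness and less than 5% contamination) are a commonly used convention and are an assumption here; the cut-offs SOMA applies are not stated on this page. The three-column table layout (bin, completeness, contamination) and the helper name classify_bins are also hypothetical.

```shell
# Hypothetical helper: label each bin in a tab-separated table with columns
# bin, completeness (%), contamination (%). Thresholds are assumed, not SOMA's.
classify_bins() {
  awk -F '\t' -v min_comp=90 -v max_cont=5 '
    NR == 1 { next }   # skip the header row
    {
      label = ($2 > min_comp && $3 < max_cont) ? "high-quality" : "lower-quality"
      print $1 "\t" label
    }' "$1"
}
```

Calling classify_bins on such a table prints one line per bin with its label.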

f). in silico phenotyping

Depending on the samples identified, multiple metrics will automatically be evaluated to provide further details on individual bins.

In this section we can see the results of sequence typing (using MLST) and clonal complex assignment (where available). When the pipeline is re-run, you may not consistently get sequence type assignments for all species; this is due to variations in the assembly process, although you should expect most of the samples to be assigned a sequence type.
The next section summarizes the genes identified within specific bins; in this case we only provided targets for two species.
We also have the results of antimicrobial resistance typing (default: ResFinder & PointFinder) for all species of interest. A more detailed report across 4 different profilers can be found below.

g). Species-specific typing

Finally, there are three species-specific subworkflows, which report relevant metrics for the three species given.

We can see the results of E. coli / Shigella spp. typing (summarized across multiple tools), which suggest this is not Shiga toxin-producing Escherichia coli (STEC), enteroinvasive E. coli (EIEC) or other pathogenic E. coli. The classification is listed as 'Unknown' because few tools positively identify non-pathogenic E. coli, but that is the likely explanation here.
We also have the results of consensus typing of Salmonella samples; in this case the sample appears to be Salmonella enterica subspecies enterica, with antigenic profiles also reported.
Similarly, the results of Listeria monocytogenes serotyping using LisSero are also reported.

Taxonomic abundance report

Report found in: PRJNA759733_results/SRR15702472/summary/SAMEA10644972.PRJNA759733.taxonomy_report.html (example here)

a). Kraken2 results

Assuming SOMA was run with a Kraken2 database, this report will include figures showing sequence (per-read) classification results with Kraken2.

The first figure shows the proportion of reads assigned at various taxonomic ranks. In this case we can see ~48% of reads could be assigned at the species level and ~3% at the subspecies level. ~0.87% were missing a rank, which typically means the taxonomy ID hasn't been assigned a rank in the relevant database.
The second figure shows the 50 most abundant species. In this case we can see Bacillus subtilis subsp. spizizenii is the best-represented species by read count.
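
A per-rank breakdown like the one in the first figure can be approximated directly from a Kraken2 report file. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, reads in clade, reads assigned directly to the taxon, rank code, taxonomy ID, name); whether SOMA builds its figure this way is an assumption, and reads_by_rank is a hypothetical helper.

```shell
# Hypothetical helper: sum the reads assigned directly at each rank code
# (column 3, grouped by column 4) of a standard Kraken2 report, and print
# rank code, read count and percentage of all reads.
reads_by_rank() {
  awk -F '\t' '
    { direct[$4] += $3; total += $3 }
    END {
      if (total > 0)
        for (r in direct)
          printf "%s\t%d\t%.2f\n", r, direct[r], 100 * direct[r] / total
    }' "$1" | sort
}
```

In the output, rank code S corresponds to species and U to unclassified reads.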

b). Bracken (Kraken2) results

This figure shows the results of processing the Kraken2 results with Bracken (Bayesian Reestimation of Abundance with KrakEN), which re-estimates abundances from the Kraken2 read classifications.
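
If you want to inspect the Bracken estimates outside the report, the most abundant taxa can be pulled from a Bracken output table. This sketch assumes the standard Bracken output layout, a tab-separated file with a header row in which column 1 is the taxon name and column 7 is fraction_total_reads; top_species is a hypothetical helper and no particular SOMA output path is implied.

```shell
# Hypothetical helper: show the most abundant taxa in a Bracken output table.
# $1 = Bracken output file, $2 = number of rows to show (default 10)
top_species() {
  tail -n +2 "$1" \
    | sort -t "$(printf '\t')" -k7,7nr \
    | head -n "${2:-10}" \
    | cut -f 1,7
}
```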

Binning report

Report found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.summary_binning_report.html (example here)

a). Summary figure

This report shows a summary of the results of metagenomic binning and per-contig statistics. In the first figure, we can see 8 different bins (the non-grey coloured circles), representing 8 species.

Hovering the cursor over any of the bins will show detailed information on various metrics for the selected contig (top) and metrics across the bin (bottom). In this case we can see, for example, the contig length and the contig's nearest taxonomic hit (using Skani against GTDB). Across the entire bin, we can also see that the bin is high-quality (low/no contamination and high completeness).

We can also see a cluster of short contigs at the bottom of the plot (~38% GC content and ~50%), which most likely represents Saccharomyces cerevisiae and Cryptococcus neoformans; these species aren't in GTDB and so are not reported. The other major reason for contigs not being binned is that they are bacterial plasmids, which are harder to bin as they don't necessarily have the same coverage and GC content as the chromosomal DNA. We use geNomad to calculate the plasmid score. As you can see in the figure below, the nearest hit to the highlighted contig is Staphylococcus and the plasmid score is 0.9058, suggesting this is a plasmid linked to bin 000003.
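
GC content, used above to tell contig groups apart, is the percentage of bases that are G or C. A minimal sketch of the calculation for a FASTA file follows; gc_content is a hypothetical helper, it counts every non-header character as a base (so ambiguous bases such as N count towards the total), and it is not necessarily how SOMA derives the values shown in the plot.

```shell
# Hypothetical helper: percentage of G/C bases across all sequences in a FASTA file.
gc_content() {
  awk '!/^>/ {
         total += length($0)            # count bases before gsub edits the line
         gc += gsub(/[GCgc]/, "")       # gsub returns the number of G/C matches
       }
       END { if (total > 0) printf "%.1f\n", 100 * gc / total }' "$1"
}
```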

b). Quality summaries

The table and figures below provide summaries of bin quality, either as summary subplots (to make it easier to identify data quality issues) or as per-bin metrics.

Antimicrobial resistance report

Report found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.amr_report.html (example here)

SOMA can report the results of up to 4 tools that detect antimicrobial resistance (AMR) associated genes/mutations: ABRicate, AMRFinderPlus, ResFinder and RGI.

a). Summary report

Example results of running AMRFinderPlus on the test data are shown in the figure below. In the first section, the results for each metagenome-assembled genome are summarized, including the resistance genes identified and the drug class to which each is thought to confer resistance.

b). Detailed report

In the second section, results are reported per gene and include the amino acid mutation where applicable.
