6). Running SOMA with example data
NOTE: Undergoing editing, text may not match included images/diagrams.
- Downloading the example data
- Running SOMA
- Summary report
- Read QC report
- Taxonomic abundance report
- Binning report
- Antimicrobial resistance report
- Download the test FASTQ data, stored on ENA.
- Note: This is a public dataset of Illumina sequencing reads of the ZymoBIOMICS Microbial Community Standard. The original reads are available here and the relevant publication is available here.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/072/SRR15702472/SRR15702472_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/072/SRR15702472/SRR15702472_2.fastq.gz
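Large FTP downloads are occasionally truncated, so it can be worth checking both files before starting the pipeline. The sketch below is not part of SOMA; `count_fastq_reads` is a hypothetical helper that tests the gzip archive and counts reads, assuming the usual four lines per FASTQ record.

```shell
# Hypothetical helper (not part of SOMA): verify a gzipped FASTQ file
# and print the number of reads it contains.
count_fastq_reads() {
    local fastq_gz="$1"
    # gzip -t checks that the archive is intact before we trust the count.
    gzip -t "$fastq_gz" || return 1
    # Assumes standard 4-line FASTQ records.
    gzip -dc "$fastq_gz" | awk 'END { print int(NR / 4) }'
}

# Usage, once both wget commands have finished:
#   count_fastq_reads SRR15702472_1.fastq.gz
#   count_fastq_reads SRR15702472_2.fastq.gz
```

For paired-end data the two counts should be identical; a mismatch suggests an incomplete download.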
You'll then need to edit examples/SRR15702472_input.csv so that it contains the full paths to the SRR15702472_*.fastq.gz files.
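If you would rather not edit the file by hand, the path substitution can be scripted. This is a minimal sketch, not a SOMA feature: `set_fastq_dir` and the output file name are hypothetical. It assumes only that the two FASTQ file names appear in unquoted CSV fields, and it leaves the column layout untouched.

```shell
# Hypothetical helper (not part of SOMA): rewrite every CSV field ending in
# SRR15702472_1.fastq.gz or SRR15702472_2.fastq.gz so that it points at the
# given directory. Any existing path prefix in those fields is replaced.
# Assumes unquoted fields and a directory name without '#' or '&'.
set_fastq_dir() {
    local csv_in="$1" fastq_dir="$2" csv_out="$3"
    sed "s#[^,]*\(SRR15702472_[12]\.fastq\.gz\)#${fastq_dir}/\1#g" "$csv_in" > "$csv_out"
}

# Usage, run from the directory that holds the downloaded reads:
#   set_fastq_dir examples/SRR15702472_input.csv "$PWD" SRR15702472_input.fullpath.csv
```

The result is written to a new file rather than edited in place, so the original example CSV stays available for comparison.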
You can then run SOMA with the following command:
./run_soma --input examples/SRR15702472_input.csv --outdir PRJNA759733_results
If you need to restart the run, you can add '-resume', as below:
./run_soma --input examples/SRR15702472_input.csv --outdir PRJNA759733_results -resume
We will start with the summary report, found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.summary_report.html
(example here)
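All of the HTML reports discussed on this page are written to the sample's summary directory. As a quick way to confirm a run finished and produced them, something like the following could be used; `list_reports` is a hypothetical helper, and the `*_report.html` pattern is inferred from the report paths quoted on this page.

```shell
# Hypothetical helper (not part of SOMA): list the HTML reports found in
# <outdir>/<sample>/summary.
list_reports() {
    local outdir="$1" sample="$2" report
    for report in "$outdir/$sample/summary"/*_report.html; do
        # An unmatched glob is left as a literal pattern, so test for existence.
        if [ -e "$report" ]; then
            printf '%s\n' "$report"
        fi
    done
}

# Usage:
#   list_reports PRJNA759733_results SRR15702472
```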
These sections provide basic information on pipeline execution and sample metadata.
This section shows the pre- and post-filtering read quality control information. Here, for example, we can see that we started with 21 million reads and retained ~90% of them post-QC.
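The retained fraction is simply the post-QC read count divided by the pre-QC read count. As a small worked sketch (the function name is hypothetical, and the read counts are round illustrative numbers chosen to match the ~90% figure, not values taken from the report):

```shell
# Hypothetical helper: percentage of reads retained after QC.
pct_retained() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * after / before }'
}

# Illustrative numbers only: 21,000,000 reads in, 18,900,000 reads out.
#   pct_retained 21000000 18900000
```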
We see broadly the expected species represented in this summary. Among the lower-abundance results, we do see representatives of the right genera but the wrong species, although only in very small proportions (expected with Kraken2).
Here we see that the assembly and polishing stages have produced a 41 Mb assembly in 4317 contigs, with a reasonable contig N50. Looking at the binning summary, we can see the assembly has been placed into 8 bins, which represent the majority of bases, and that 7/8 bins are high-quality. The small proportion of contigs assigned to bins can be explained by two factors: first, the unbinned contigs will predominantly be shorter contigs (<5 kb); second, some of them will represent the eukaryotic species (Saccharomyces cerevisiae and Cryptococcus neoformans) found in the sample, which do not get binned by the tools used by SOMA. More detail on binning can be found below.
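For readers unfamiliar with the metric, the contig N50 is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly size. The sketch below shows one way to compute it from a FASTA file. It is not how SOMA computes the value in the report, and both function names and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: print one sequence length per FASTA record.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# Hypothetical helper: read lengths on stdin (one per line) and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            # Walk from the longest contig down until half the total is covered.
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= total / 2) { print len[i]; exit }
            }
        }'
}

# Usage (assembly.fasta is a placeholder name):
#   fasta_lengths assembly.fasta | n50
```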
We can see the bins for the bacterial species in this sample (excluding the eukaryotes Saccharomyces cerevisiae and Cryptococcus neoformans). All but one of them (7/8) are high-quality (based on CheckM completeness and contamination scores) and have the expected assembly size and GC content.
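To make the quality labels concrete, the sketch below classifies bins from their completeness and contamination values. The cut-offs SOMA applies are not stated on this page, so this assumes the widely used MIMAG-style thresholds (high: >90% completeness and <5% contamination; medium: >=50% and <10%), and it ignores the rRNA/tRNA criteria that MIMAG also lists. The three-column input layout (bin, completeness, contamination, with a header row) and the function name are hypothetical.

```shell
# Hypothetical helper: label each bin as high, medium or low quality.
# Input: tab-separated file with a header row and the columns
#   bin <TAB> completeness (%) <TAB> contamination (%)
# Thresholds are an assumption (MIMAG-style), not taken from SOMA.
classify_bins() {
    awk -F '\t' 'NR > 1 {
        quality = "low"
        if ($2 > 90 && $3 < 5)        quality = "high"
        else if ($2 >= 50 && $3 < 10) quality = "medium"
        print $1 "\t" quality
    }' "$1"
}

# Usage (bin_quality.tsv is a placeholder name):
#   classify_bins bin_quality.tsv
```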
Depending on the samples identified, multiple metrics will automatically be evaluated to provide further details on individual bins.
In this section we can see the results of sequence typing (using MLST) and clonal complex assignment (where available). When re-running, you may find that you do not consistently get sequence type assignments for all species. This is due to variations in the assembly process, although you should expect most of them to be assigned a sequence type.
The next section summarizes the genes identified within specific bins. In this case, we only provided targets for two species.
We also have the results of antimicrobial resistance typing (default: ResFinder & PointFinder) for all species of interest. A more detailed report across 4 different profilers can be found below.
Finally, three species-specific subworkflows report relevant metrics for the three species given.
We can see the results of E. coli / Shigella spp. typing (summarized across multiple tools), which suggest this is not Shiga toxin-producing Escherichia coli (STEC), enteroinvasive E. coli (EIEC) or another pathogenic E. coli. The classification is listed as 'Unknown' because few tools report positive identification of non-pathogenic E. coli, but that is likely the case here.
We also have the results of consensus typing of Salmonella samples. In this case, it appears to be Salmonella enterica subspecies enterica, with antigenic profiles also reported.
Similarly, the results of Listeria monocytogenes serotyping using LisSero are also reported.
Report found in: PRJNA759733_results/SRR15702472/summary/SAMEA10644972.PRJNA759733.taxonomy_report.html
(example here)
Assuming SOMA was run with a Kraken2 database, this report will include figures showing sequence (per-read) classification results with Kraken2.
The first figure shows the proportion of reads assigned at various taxonomic ranks. In this case, we can see that ~48% of reads could be assigned to a species and ~3% could be assigned at the subspecies level. ~0.87% were missing a rank, which typically means the taxonomy ID hasn't been assigned a rank in the relevant database.
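A similar per-rank breakdown can be derived by hand from a Kraken2 report file. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, directly assigned reads, rank code, taxonomy ID, name) and sums the directly assigned reads for each rank code. Whether the figure is built in exactly this way is an assumption, and `reads_per_rank` is a hypothetical helper.

```shell
# Hypothetical helper: summarise a standard Kraken2 report by rank code.
# Output columns: rank code, directly assigned reads, percentage of all reads.
# Rank codes include U (unclassified), G (genus), S (species); a trailing
# digit such as S1 marks a sub-rank below the lettered rank.
reads_per_rank() {
    awk -F '\t' '
        { direct[$4] += $3; total += $3 }
        END {
            if (total > 0)
                for (rank in direct)
                    printf "%s\t%d\t%.2f\n", rank, direct[rank], 100 * direct[rank] / total
        }' "$1" | sort
}

# Usage (the report file name is a placeholder):
#   reads_per_rank sample.kraken2.report.txt
```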
The second figure shows the 50 most abundant species. In this case, we can see that Bacillus subtilis subsp. spizizenii is the best-represented species by read count.
- This figure shows the results of processing the Kraken2 results with Bracken (Bayesian Reestimation of Abundance with KrakEN), which produces more accurate abundance estimates.
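If you want the same ranking outside the HTML report, the most abundant taxa can be pulled from a Bracken output table. This sketch assumes the standard Bracken columns (name, taxonomy_id, taxonomy_lvl, kraken_assigned_reads, added_reads, new_est_reads, fraction_total_reads); `top_species` and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: print the N taxa with the most Bracken-estimated reads.
# Output columns: name, new_est_reads, fraction_total_reads.
top_species() {
    local bracken_tsv="$1" n="${2:-50}"
    tail -n +2 "$bracken_tsv" \
        | sort -t "$(printf '\t')" -k6,6nr \
        | head -n "$n" \
        | cut -f 1,6,7
}

# Usage (the Bracken file name is a placeholder):
#   top_species sample.bracken.tsv 10
```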
Report found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.summary_binning_report.html
(example here)
This report shows a summary of the results of metagenomic binning and per-contig statistics. In the first figure, we can see 8 different bins (the non-grey coloured circles), representing 8 species.
Hovering the cursor over any of the bins will show detailed information on various metrics for the selected contig (top) and metrics across the bin (bottom). In this case we can see, for example, the contig length and the contig's nearest taxonomic hit (using Skani against GTDB). Across the entire bin, we can also see that the bin is high-quality (low/no contamination and high completeness).
We can also see a cluster of short contigs at the bottom of the plot (at ~38% and ~50% GC content), which most likely represent Saccharomyces cerevisiae and Cryptococcus neoformans; these species aren't in GTDB and so are not reported. Another major reason for contigs not being binned is that they are bacterial plasmids, which are harder to bin because they don't necessarily have the same coverage and GC content as the chromosomal DNA. We use geNomad to calculate the plasmid score. As you can see in the figure below, the nearest hit to the highlighted contig is Staphylococcus and the plasmid score is 0.9058, suggesting this is a plasmid linked to bin 000003.
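To shortlist likely plasmid contigs outside the interactive plot, the plasmid scores can be filtered directly. The sketch below assumes a geNomad-style tab-separated table whose header row contains columns named `seq_name` and `plasmid_score`. The 0.9 default cut-off is an arbitrary illustrative value, not a SOMA or geNomad recommendation, and `likely_plasmids` is a hypothetical helper.

```shell
# Hypothetical helper: print contigs whose plasmid score meets a cut-off.
# Columns are located by header name, so their order does not matter.
likely_plasmids() {
    local tsv="$1" min_score="${2:-0.9}"
    awk -F '\t' -v min="$min_score" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("seq_name" in col) || !("plasmid_score" in col)) {
                print "expected seq_name and plasmid_score columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(col["plasmid_score"]) + 0 >= min + 0 {
            print $(col["seq_name"]) "\t" $(col["plasmid_score"])
        }
    ' "$tsv"
}

# Usage (the summary file name is a placeholder):
#   likely_plasmids sample_plasmid_summary.tsv 0.9
```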
The table and figures below provide summaries of bin quality, either as summary subplots (to make it easier to identify data quality issues) or as per-bin metrics.
Report found in: PRJNA759733_results/SRR15702472/summary/SRR15702472.PRJNA759733.amr_report.html
(example here)
SOMA can report the results of up to 4 tools that detect antimicrobial resistance (AMR) associated genes/mutations: ABRicate, AMRFinderPlus, ResFinder and RGI.
Example results of running AMRFinderPlus on the test data are shown in the figure below. In the first section, the results for each metagenome-assembled genome are summarized, including the resistance genes identified and the drug class to which each is thought to confer resistance.
In the second section, results are reported per gene and include the amino acid mutation where applicable.