Ricardo H. Ramirez-Gonzalez edited this page Apr 9, 2022 · 2 revisions

IBSpy

Python library to identify Identical By State (IBS) regions

Overview

  1. Raw reads
  2. IBSpy count
  3. Build matrix
  4. Affinity propagation
  5. Minimum path and other plots
graph TB
  reads[/Query FastQ raw reads \]
  kmerFiles[/K-mer database\]
  kmerDBBuild[build k-mer databases ]
  trgetReference[/Target fasta file\]
  IBSpyCount[IBSpy Count]
  countsFiles[/k-mer counts per window\]
  metadata[/Metadata file with path to counts\]
  preferences[/Preference file\]
  buildMatrx[Build Matrix]
  kmerCountsDB[(Results and intermediate steps)]
  binaryFiles[\Query-Target <br> pickle files/]
  indexedFiles[\Tabix files per <br> target reference/]
  affinity[Affinity propagation]
  haplotypes[\Haplotypes per reference/]

  reads -.-> kmerDBBuild
  kmerDBBuild -.-> kmerFiles
  kmerFiles -.-> IBSpyCount
  kmerDBBuild --> IBSpyCount
  trgetReference -.-> IBSpyCount
  IBSpyCount -.-> countsFiles
  countsFiles -.- metadata
  metadata -.-> buildMatrx
  countsFiles -.-> buildMatrx
  IBSpyCount --> buildMatrx
  buildMatrx -.-> indexedFiles 
  preferences -.-> buildMatrx
  buildMatrx -.-> binaryFiles 
  binaryFiles  -.->  buildMatrx
  indexedFiles -.-> affinity
  kmerCountsDB -.- binaryFiles
  kmerCountsDB -.- indexedFiles
  preferences -.-> affinity
  buildMatrx --> affinity
  affinity -.-> haplotypes
  haplotypes -.- kmerCountsDB


Running IBSPy

IBSpy has relatively few options. You can list them with the --help option.

IBSPy --help
usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
             [-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        window size to analyze
  -k KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size of the database
  -d DATABASE, --database DATABASE
                        Kmer database
  -r REFERENCE, --reference REFERENCE
                        The reference with the position of the kmers
  -z, --compress        When an ouput file is present, it is compressed as .gz
  -o OUTPUT, --output OUTPUT
                        Output file. If missing, the ouptut is sent to stdout
  -f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
                        Database format 

To generate the table with the number of observed kmers and variants per window, using a kmer database from kmerGWAS, run the following command:

IBSpy --output "kmer_windows_LineXXX.tsv.gz" --compress --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --database_format kmerGWAS

For KMC3, pass the database name that was given when the database was created (the common prefix of the .kmc_pre and .kmc_suf files), not the name of either file.
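
To make the per-window table more concrete, here is a toy sketch of the idea: slide over the reference in fixed windows and check which reference kmers are present in the query kmer set. This is not IBSpy's implementation. The function names and the row layout are hypothetical, reverse-complement (canonical) kmers are ignored, and counting a "variation" as one run of consecutive missing kmers is an assumption made only for illustration.

```python
def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_windows(reference, query_kmers, k, window_size):
    """Return one (start, end, total, observed, variations) row per window.

    'variations' uses an ASSUMED definition for this sketch: the number of
    runs of consecutive reference k-mers that are absent from the query set.
    """
    rows = []
    for start in range(0, len(reference), window_size):
        window = reference[start:start + window_size]
        total = observed = variations = 0
        in_gap = False  # True while we are inside a run of missing k-mers
        for kmer in kmers(window, k):
            total += 1
            if kmer in query_kmers:
                observed += 1
                in_gap = False
            elif not in_gap:
                variations += 1
                in_gap = True
        rows.append((start, start + len(window), total, observed, variations))
    return rows


reference = "ACGTACGTTTGACCA"
query = "ACGTACGTATGACCA"  # one substitution relative to the reference
query_kmers = set(kmers(query, 3))
table = count_windows(reference, query_kmers, k=3, window_size=15)
print(table)
```

A single substitution removes several overlapping kmers but forms only one run of missing kmers, which is why a run-based count can be a more natural proxy for variants than the raw number of missing kmers.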

Running IBSplot

List the IBSplot options with --help.

IBSplot --help
usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
               [-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
               [-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
                        tvs file genetared by IBSpy output
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Windows size to count variations within
  -f FILTER_COUNTS, --filter_counts FILTER_COUNTS
                        Filter number of variaitons above this threshold to
                        compute GMM model, default=None
  -n N_COMPONENTS, --n_components N_COMPONENTS
                        Number of componenets for the GMM model, default=3
  -c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
                        type of covariance used for GMM model, default="full"
  -s STITCH_NUMBER, --stitch_number STITCH_NUMBER
                        Consecutive "outliers" in windows to stitch, default=3
  -o OUTPUT, --output OUTPUT
                        tsv file with variations count by windows and summary
                        statistics
  -r REFERENCE, --reference REFERENCE
                        genome reference name
  -q QUERY, --query QUERY
                        query sample
  -p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
                        histograms and ascatter files in .PDF format

IBSplot takes the output table generated by IBSpy as described above (e.g., "kmer_windows_LineXXX.tsv.gz"). It can aggregate the variant counts into larger windows. The example below uses 400,000 bp windows to fit a Gaussian mixture model (GMM) and generate the plots.
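
The aggregation into larger windows can be sketched as follows. This is only an illustration: the `rebin` helper and the `(seqname, start, variations)` row layout are hypothetical and are not taken from the IBSpy output format.

```python
from collections import defaultdict


def rebin(rows, new_window):
    """Sum (seqname, start, variations) rows into windows of new_window bp."""
    totals = defaultdict(int)
    for seqname, start, variations in rows:
        # Start coordinate of the larger window that contains this small one
        bin_start = (start // new_window) * new_window
        totals[(seqname, bin_start)] += variations
    return [(seq, s, s + new_window, n) for (seq, s), n in sorted(totals.items())]


# Sixteen hypothetical 50,000 bp windows on one chromosome
rows = [("chr1A", i * 50_000, i % 4) for i in range(16)]
merged = rebin(rows, 400_000)
print(merged)
```

Eight 50,000 bp windows fall into each 400,000 bp window, so the sixteen input rows collapse into two output rows.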

To generate the table with variant counts classified by the GMM as IBS or non-IBS, together with the plots, run the following command. The description of the GMM model is here

# minimal arguments
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
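
As a rough illustration of how a GMM can separate windows with few variations from windows with many, the sketch below fits a one-dimensional Gaussian mixture with plain expectation-maximisation, initialised by a small k-means pass. It is not IBSplot's code. All function names are hypothetical, the counts are made up, and treating the component with the lowest mean as "IBS" is an assumption for the example.

```python
import math


def kmeans_1d(data, k, n_iter=20):
    """Hard-assign 1-D points to k clusters; used only to initialise EM."""
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * len(ordered) / k)] for j in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - means[j]))
            groups[nearest].append(x)
        means = [sum(g) / len(g) if g else m for g, m in zip(groups, means)]
    return groups, means


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """E-step: posterior probability of each component for each point."""
    out = []
    for x in data:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        out.append([d / total if total > 0 else 1.0 / len(dens) for d in dens])
    return out


def fit_gmm_1d(data, k=3, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; return (weights, means, variances)."""
    groups, means = kmeans_1d(data, k)
    n = len(data)
    weights = [len(g) / n for g in groups]
    variances = [max(sum((x - m) ** 2 for x in g) / len(g), min_var) if g else 1.0
                 for g, m in zip(groups, means)]
    for _ in range(n_iter):
        resp = responsibilities(data, weights, means, variances)
        # M-step: re-estimate each component from its weighted points
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # component has no support; leave it unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)
    return weights, means, variances


# Made-up variation counts per window: low, intermediate, and high groups
counts = [2, 150, 3, 900, 1, 160, 2, 950, 4, 140, 3, 1000, 155, 920]
weights, means, variances = fit_gmm_1d(counts, k=3)
resp = responsibilities(counts, weights, means, variances)
ibs_component = means.index(min(means))  # ASSUMPTION: lowest-mean component = IBS
labels = ["IBS" if r.index(max(r)) == ibs_component else "non-IBS" for r in resp]
print(list(zip(counts, labels)))
```

In practice a tested library implementation of Gaussian mixtures would normally be preferable to a hand-written EM loop; the plain-Python version is used here only to keep the example free of dependencies.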

In addition, you can include some or all of the following options to tune the GMM parameters and find the settings that best separate IBS from non-IBS regions for the reference and query sample used:

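# minimal arguments plus the GMM tuning options
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3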
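
The help text describes --stitch_number only as 'Consecutive "outliers" in windows to stitch'. One plausible reading, sketched below purely as an assumption, is that short runs of non-IBS windows enclosed by IBS windows are relabelled as IBS. The `stitch` function and the label strings are hypothetical and may not match IBSplot's behaviour.

```python
from itertools import groupby


def stitch(labels, stitch_number=3):
    """Relabel short non-IBS runs enclosed by IBS windows as IBS (assumed rule)."""
    # Collapse the label sequence into (label, run_length) pairs
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    out = []
    for i, (label, length) in enumerate(runs):
        enclosed = 0 < i < len(runs) - 1  # has a neighbouring run on both sides
        if label == "non-IBS" and enclosed and length <= stitch_number:
            label = "IBS"
        out.extend([label] * length)
    return out


labels = ["IBS"] * 4 + ["non-IBS"] * 2 + ["IBS"] * 3 + ["non-IBS"] * 5 + ["IBS"]
stitched = stitch(labels, stitch_number=3)
print(stitched)
```

With two labels, every enclosed non-IBS run is flanked by IBS runs, so the run of two outlier windows is absorbed while the run of five is kept as non-IBS.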