Home
Python library to identify Identical By State regions
- Raw reads
- IBSpy count
- Build matrix
- Affinity propagation
- Minimum path and other plots
graph TB
reads[/Query FastQ raw reads \]
kmerFiles[/K-mer database\]
kmerDBBuild[build k-mer databases ]
trgetReference[/Target fasta file\]
IBSpyCount[IBSpy Count]
countsFiles[/k-mer counts per window\]
metadata[/Metadata file with path to counts\]
preferences[/Preference file\]
buildMatrx[Build Matrix]
kmerCountsDB[(Results and intermediate steps)]
binaryFiles[\Query-Target <br> pickle files/]
indexedFiles[\Tabix files per <br> target reference/]
affinity[Affinity propagation]
haplotypes[\Haplotypes per reference/]
reads -.-> kmerDBBuild
kmerDBBuild -.-> kmerFiles
kmerFiles -.-> IBSpyCount
kmerDBBuild --> IBSpyCount
trgetReference -.-> IBSpyCount
IBSpyCount -.-> countsFiles
countsFiles -.- metadata
metadata -.-> buildMatrx
countsFiles -.-> buildMatrx
IBSpyCount --> buildMatrx
buildMatrx -.-> indexedFiles
preferences -.-> buildMatrx
buildMatrx -.-> binaryFiles
binaryFiles -.-> buildMatrx
indexedFiles -.-> affinity
kmerCountsDB -.- binaryFiles
kmerCountsDB -.- indexedFiles
preferences -.-> affinity
buildMatrx --> affinity
affinity -.-> haplotypes
haplotypes -.- kmerCountsDB
IBSpy has relatively few options; you can list them with the --help
flag.
IBSpy --help
usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
             [-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        window size to analyze
  -k KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size of the database
  -d DATABASE, --database DATABASE
                        Kmer database
  -r REFERENCE, --reference REFERENCE
                        The reference with the position of the kmers
  -z, --compress        When an ouput file is present, it is compressed as .gz
  -o OUTPUT, --output OUTPUT
                        Output file. If missing, the ouptut is sent to stdout
  -f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
                        Database format
To generate the table with the number of observed k-mers and variants per window, using a k-mer database from kmerGWAS, run the following command:
IBSpy --output "kmer_windows_LineXXX.tsv.gz" --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --compress --database_format kmerGWAS
For KMC3, pass the database name that was used when the database was created (the prefix shared by its .kmc_pre and .kmc_suf files), not a filename.
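As a rough illustration of what a per-window k-mer count involves, the sketch below splits a reference sequence into fixed-size windows and counts how many of the reference k-mers in each window are present in a set of query k-mers. This is not IBSpy's implementation: the function name `count_kmers_per_window`, the in-memory `set` standing in for the k-mer database, and the output field names are assumptions made for illustration. Strand handling (canonical k-mers) and k-mers that span a window boundary are ignored.

```python
def count_kmers_per_window(reference, query_kmers, kmer_size, window_size):
    """Return one dict per window with total and observed k-mer counts.

    Hypothetical helper: a plain set replaces the real k-mer database.
    """
    rows = []
    for start in range(0, len(reference), window_size):
        window = reference[start:start + window_size]
        # k-mers that lie entirely inside this window
        kmers = [window[i:i + kmer_size]
                 for i in range(len(window) - kmer_size + 1)]
        observed = sum(1 for kmer in kmers if kmer in query_kmers)
        rows.append({
            "start": start,
            "end": start + len(window),
            "total_kmers": len(kmers),
            "observed_kmers": observed,
        })
    return rows


# Toy data: a 10 bp reference, 3-mers, 5 bp windows
reference = "ACGTACGTAC"
query_kmers = {"ACG", "CGT", "GTA"}
for row in count_kmers_per_window(reference, query_kmers, kmer_size=3, window_size=5):
    print(row)
```

In this toy case the second window contains the k-mer TAC, which is absent from the query set, so it reports fewer observed than total k-mers.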
List the IBSplot options with --help.
IBSplot --help
usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
               [-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
               [-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
                        tvs file genetared by IBSpy output
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Windows size to count variations within
  -f FILTER_COUNTS, --filter_counts FILTER_COUNTS
                        Filter number of variaitons above this threshold to
                        compute GMM model, default=None
  -n N_COMPONENTS, --n_components N_COMPONENTS
                        Number of componenets for the GMM model, default=3
  -c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
                        type of covariance used for GMM model, default="full"
  -s STITCH_NUMBER, --stitch_number STITCH_NUMBER
                        Consecutive "outliers" in windows to stitch, default=3
  -o OUTPUT, --output OUTPUT
                        tsv file with variations count by windows and summary
                        statistics
  -r REFERENCE, --reference REFERENCE
                        genome reference name
  -q QUERY, --query QUERY
                        query sample
  -p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
                        histograms and ascatter files in .PDF format
IBSplot takes the output table generated by IBSpy as described above (e.g., "kmer_windows_LineXXX.tsv.gz"). It can re-count variations in windows larger than those used by IBSpy. The example below uses 400,000 bp windows to fit a Gaussian mixture model (GMM) and generate the plots.
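Counting variations in larger windows amounts to summing the counts of the smaller windows that fall into the same larger bin. A minimal sketch, assuming each input row is a hypothetical `(start, variations)` pair taken from fixed-size windows; the actual column layout of the IBSpy table is not reproduced here, and `merge_windows` is an illustrative helper rather than IBSplot code.

```python
from collections import defaultdict


def merge_windows(rows, window_size):
    """Sum variation counts of small windows into bins of `window_size` bp."""
    totals = defaultdict(int)
    for start, variations in rows:
        # Start coordinate of the larger bin that contains this small window
        bin_start = (start // window_size) * window_size
        totals[bin_start] += variations
    return sorted(totals.items())


# Sixteen 50,000 bp windows with made-up counts (count == window index)
small = [(i * 50_000, i) for i in range(16)]
print(merge_windows(small, 400_000))
```

With 50,000 bp input windows, each 400,000 bp bin collects eight consecutive rows.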
To generate the table of variation counts, with each window classified by the GMM as IBS or non-IBS, and to generate the plots, run the following command. The GMM model is described separately.
# minimal arguments
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
In addition, you can pass some or all of the following options to tune the GMM parameters and find the IBS/non-IBS classification that best fits the reference and query samples used:
IBSplot --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3
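The help text describes `--stitch_number` only as the number of consecutive "outliers" in windows to stitch. One plausible reading, sketched below purely as an assumption (the helper `stitch` and the boolean labels are hypothetical, and IBSplot's actual rule may differ), is that a run of non-IBS windows no longer than the threshold is relabeled as IBS when IBS windows flank it on both sides.

```python
def stitch(labels, stitch_number=3):
    """Relabel short non-IBS runs (False) flanked by IBS windows (True).

    Assumed behavior for illustration only, not IBSplot's implementation.
    """
    result = list(labels)
    i = 0
    while i < len(result):
        if result[i]:
            i += 1
            continue
        # Find the end of this run of non-IBS windows
        j = i
        while j < len(result) and not result[j]:
            j += 1
        # A run is flanked only if it touches neither end of the sequence
        flanked = i > 0 and j < len(result)
        if flanked and (j - i) <= stitch_number:
            result[i:j] = [True] * (j - i)
        i = j
    return result


labels = [True, False, False, True, False, False, False, False, True]
print(stitch(labels))
```

Here the run of two non-IBS windows is absorbed into the surrounding IBS block, while the run of four exceeds the threshold and is left unchanged.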