Calculate archaic match percent for haplotype windows
Latest development version
pip install git+https://github.com/lparsons/archaic_match
Other (non-python) requirements
- sqlite3
- tabix
archaic_match max-match-pct
--vcf data/simulated_test/null/vcfs/Tenn_nonAfr_*_n1_0.1_n2_0.0.mod.vcf.gz
--populations data/simulated_test/null/Tenn.popfile # file with columns for sample and population
--archaic-populations Neand1 # archaic population to match against
--modern-populations EUR AFR # populations to generate match percents for
--chrom-sizes 1000000 # or file with chrom and size columns
[--window-size [BP] (default: 50000)]
[--step-size [BP] (default: 10000)]
[--informative_site_method {derived_in_archaic,derived_in_archaic_or_modern} [default: derived_in_archaic]
> output/simulated_test/null_tables/afr_eur-vs-neand1/tsv/max_match_pct_counts_all.tsv
archaic_match build-db
--match-pct-count output/simulated_test/null_tables/afr_eur-vs-neand1/tsv/max_match_pct_counts_all.tsv
--db output/simulated_test/null_tables/afr_eur-vs-neand1/max_match_pct_test.db
Compares each modern haplotype to each archaic haplotype.
For each window (as defined by window_size and step_size), calculates the frequency of informative sites and the maximum archaic match percent.
Additionally, a empirical pvalue is computed using the database of precomputed
match percents using windows that have an informative site frequency within
frequency_threshold
of the query window.
archaic_match max-match-pct
--vcf data/simulated_test/0.1pct/vcfs/Tenn_nonAfr_*_n1_0.1_n2_0.0.mod.vcf.gz
--populations data/simulated_test/null/Tenn.popfile # file with columns for sample and population
--archaic-populations Neand1 # archaic population to match against
--modern-populations EUR AFR # populations to generate match percents for
--chrom-sizes=1000000 # or file with chrom and size columns
[--window-size [BP] (default: 50000)]
[--step-size [BP] (default: 10000)]
[--informative_site_method {derived_in_archaic,derived_in_archaic_or_modern} [default: derived_in_archaic]
--match-pct-database output/simulated_test/null_tables/afr_eur-vs-neand1/max_match_pct.db
[--informative-site-range [FLOAT] (default: 0)]
[--overlap-regions [BED_FILE]]
> output/simulated_test/0.1_pct_pvalues.txt
Specifying a bed file with regions will generate two additional output columns:
-
region_overlap_bp
: The number of basepairs of the window that overlap an introgressed region. -
region_informative_sites
: The number of informative sites in the window that overlap an introgressed region.
This is helpful for assessing the amount of a window that in known to be introgressed in simulated data.
The format of the region bedfile should be a tab delimited file with the following columns:
- Chromsome (sequence id where the region is located)
- Start (zero based start position of the region)
- End (one-based end position of the region)
- Sample [Optional] (if specified, only windows associated with this sample will be reported as having overlap)
Note for TreeCalls bedfiles
The TreeCalls bedfiles output by the ??? simulator should have the haplotype
id converted into the associated sample name (e.g. 220 => msp_110). They
should also be combined into a single bedfile. This can be accomplished using
the included column_replace
command (e.g.):
column_replace \
data/simulated_test/0.05_pct/TreeCalls/Tenn_nonAfr_*_n1_0.05_n2_0.0.bed.merged.gz \
-d data/simulated_test/Tenn_haplotype_to_sample.txt \
-c 4 \
| sort -k 1,1 -k 2,2n \
> data/simulated_test/0/05_pct/TreeCalls/combined_introgressed_regions.bed