|
| 1 | +--- |
| 2 | +jupytext: |
| 3 | + text_representation: |
| 4 | + extension: .md |
| 5 | + format_name: myst |
| 6 | + format_version: 0.12 |
| 7 | + jupytext_version: 1.9.1 |
| 8 | +kernelspec: |
| 9 | + display_name: Python 3 |
| 10 | + language: python |
| 11 | + name: python3 |
| 12 | +--- |
| 13 | + |
| 14 | +# Bottlenecks |
| 15 | + |
| 16 | +**Konrad Lohse and Jerome Kelleher** |
| 17 | + |
| 18 | +The site frequency spectrum (SFS) summarises variants by their frequency in a sample and is a fundamental summary of sequence variation that forms the basis of many modern inference approaches (e.g. sweepfinder, DFE-alpha, dadi). First, the SFS is a lossless summary of unlinked variants, so any summary of sequence variation that ignores linkage (e.g. pairwise measures of diversity and divergence, F_st, Tajima's D and D) are summaries of the SFS. |
| 19 | + |
| 20 | +The SFS is convenient analytically, because it only depends on the mean length and frequency of genealogical branches. For many demographic models of interest the means can be derived analytically either using coalescent theory (cite Huang, TPB) or diffusion equations (cite dadi). A number of composite likelihood approaches have been developed based on either analytic results for the SFS (cite dadi Excoffier, Jaada). However, analytic expectations for the SFS break down for large samples and/or complex demographic models. |
| 21 | + |
| 22 | +In the following section we show how the SFS can be approximated using coalescence simulations and compare such approximations to analytic results. We will assume a simple toy history of a single panmictic population that is affected by an instaneous bottleneck at time T with strenght s (cite Galtier et al). The effect of this bottleneck is to induce sudden burst of coalescence, which simultaneous multiple merges. We measure bottleneck strength as the probability that a pair of lineages coalesces during the bottleneck (we could could of course convert s into on (imaginary) time period that would give the same probability of coalescence $s=1-e^{-T}$). |
| 23 | + |
| 24 | +We assume a sample of size 10 and use msprime to simulate 10,000 replicate genealogies. For each genealogy the function bottSFS records the unfolded SFS as the mean length of branches with n leafnodes (normalized by the total length of the genealogy) by iterating through all nodes in the tree.sequence. Note that we are simulating genealogies only, i.e. we do not need to simulate mutations. |
| 25 | + |
| 26 | +We use a for loop to record the SFS for a range of bottleneck strengths parameters in a dictionary: |
| 27 | + |
| 28 | +```{code-cell} ipython3 |
| 29 | +%matplotlib inline |
| 30 | +%config InlineBackend.figure_format = 'svg' |
| 31 | +import msprime |
| 32 | +import numpy as np |
| 33 | +import seaborn as sns |
| 34 | +import matplotlib.pyplot as plt |
| 35 | +``` |
| 36 | + |
| 37 | +```{code-cell} ipython3 |
| 38 | +def run_bott_sims(num_rep, num_samp, T, s): |
| 39 | + demographic_events = [ |
| 40 | + msprime.InstantaneousBottleneck(time=T, strength=s, population=0)] |
| 41 | + reps = msprime.simulate( |
| 42 | + sample_size=num_samp, Ne=Ne, num_replicates=num_rep, |
| 43 | + demographic_events=demographic_events) |
| 44 | + return reps |
| 45 | +
|
| 46 | +def approx_SFS(reps): |
| 47 | + B = np.zeros((num_rep, num_samp)) |
| 48 | + for rep_index, ts in enumerate(reps): |
| 49 | + assert ts.num_trees == 1 |
| 50 | + tree = ts.first() |
| 51 | + for u in tree.nodes(): |
| 52 | + nleaves = tree.num_samples(u) |
| 53 | + if tree.parent(u) != msprime.NULL_NODE: |
| 54 | + B[rep_index, nleaves] += tree.branch_length(u) |
| 55 | + data = np.mean(B, axis=0) |
| 56 | + data /= np.sum(data) |
| 57 | + return data |
| 58 | +
|
| 59 | +num_rep = 1000 |
| 60 | +num_samp = 10 |
| 61 | +Ne = 1 |
| 62 | +T = 0.5 |
| 63 | +taulist= np.array([0,1,2,3]) |
| 64 | +datalist = {} |
| 65 | +for tau in taulist: |
| 66 | + datalist[tau]= approx_SFS(run_bott_sims(num_rep, num_samp, T, tau)) |
| 67 | + |
| 68 | +# My guess/assumption is that currently bottleneck strength in msprime is scaled as an (imaginary) time tau (in units of 4N_e) generations. |
| 69 | +# It makes a lot more sense to express the bottleneck strength as the probability of pairwise coalescence |
| 70 | +# during the bottelenck s=1-np.exp(-tau/2) |
| 71 | +``` |
| 72 | + |
| 73 | +With increasing bottleneck strength the SFS becomes increasingly skewed (the leftmost blue bars show the SFS for a population of constant size). However, bottlenecks have a complex effect on the different frequency classes of the SFS: while the relative frequency of singletons increases, other frequency classes (e.g. doubletons) have a non-monotonic relationship with bottleneck strength: |
| 74 | + |
| 75 | +```{code-cell} ipython3 |
| 76 | +bar_width=0.2 |
| 77 | +index = np.arange(1,num_samp) |
| 78 | +j = 0 |
| 79 | +for s, B in datalist.items(): |
| 80 | + plt.bar(index + j * bar_width, B[1:], bar_width, label=str(s)) |
| 81 | + j += 1 |
| 82 | +``` |
| 83 | + |
| 84 | +## Comparison with analytic predictions |
| 85 | + |
| 86 | ++++ |
| 87 | + |
| 88 | +How does the approximate SFS compare to analytic expectations? For a population of constant size, the SFS is simply given by Watterson's correction factor, that is the total length branches with i leafnodes is given is 1/i. Reassuringly, in the limit of s=0 (no bottleneck), our SFS approximation based on simulated genealogies agrees with this prediction: |
| 89 | + |
| 90 | +```{code-cell} ipython3 |
| 91 | +expsfs=[(1/i) for i in range(1,10)] |
| 92 | +expsfs/=np.sum(expsfs) |
| 93 | +
|
| 94 | +fig, ax = plt.subplots() |
| 95 | +index = np.arange(1,10) |
| 96 | +bar_width = 0.4 |
| 97 | +opacity = 0.9 |
| 98 | +
|
| 99 | +simsfs = ax.bar(index, datalist[0][1:], bar_width, alpha=opacity, label='sim') |
| 100 | +expextsfs = ax.bar(index+ bar_width, expsfs, bar_width, alpha=opacity, label='exp') |
| 101 | +
|
| 102 | +fig.tight_layout() |
| 103 | +plt.show() |
| 104 | +``` |
| 105 | + |
| 106 | +The analytic prediction for the SFS under a bottleneck model is more complicated (Bunnefeld et al. 2015, Appendix). For a sample of n=4 lineages the SFS is: |
| 107 | + |
| 108 | +```{code-cell} ipython3 |
| 109 | +#We are assuming a bottleneck of strength tau = 4 N_e generations |
| 110 | +#and a bottleneck time of T=1 (2 in units of 4 Ne) |
| 111 | +#I am pretty sure the analytic prediction for the SFS is correct: the limit mfor s->0 is correct and |
| 112 | +#itr matches the automatically generated expression in the Mathematica .nb... |
| 113 | +
|
| 114 | +T=2 |
| 115 | +slist=[1-np.exp(-tau) for tau in taulist] |
| 116 | +
|
| 117 | +for s in slist: |
| 118 | + p=s*(-6 + 15*s - 20 * np.power(s,2) + 15 * np.power(s,3) - 6 * np.power(s,4) + np.power(s,5)) |
| 119 | + expsfsBottlN= [2/15*(np.exp(-6*T)*(15 *np.exp(6*T) - 9 *np.exp(5*T)*s - |
| 120 | + 5*np.exp(3*T)*s*(3 - 3*s + np.power(s,2)) + p)), |
| 121 | + 1/5*np.exp(-6*T)*(5*np.exp(6*T) - 6*np.exp(5*T)*s - p), |
| 122 | + 2/15*np.exp(-6*T)*(5*np.exp(6*T) - 9*np.exp(5*T)*s + 5*np.exp(3*T)*s*(3-3*s + np.power(s,2)) + p)] |
| 123 | +
|
| 124 | + expsfsBottlN/=np.sum(expsfsBottlN) |
| 125 | + print(expsfsBottlN) |
| 126 | +``` |
| 127 | + |
| 128 | +The fit between the SFS simulated with msprime and the analytic prediction is noty convincing (given the 100,000 replicates): |
| 129 | + |
| 130 | +```{code-cell} ipython3 |
| 131 | +num_samp = 4 |
| 132 | +num_rep = 1000 |
| 133 | +data4 = {} |
| 134 | +T = 1 |
| 135 | +for tau in taulist: |
| 136 | + data4[tau]= approx_SFS(run_bott_sims(num_rep, num_samp, T, tau/2)) |
| 137 | +``` |
| 138 | + |
| 139 | +```{code-cell} ipython3 |
| 140 | +fig, ax = plt.subplots() |
| 141 | +index = np.arange(1,4) |
| 142 | +bar_width = 0.4 |
| 143 | +print(data4[0][1:]) |
| 144 | +print(data4[1][1:]) |
| 145 | +print(data4[2][1:]) |
| 146 | +print(data4[3][1:]) |
| 147 | +
|
| 148 | +simsfs = ax.bar(index, data4[3][1:], bar_width, alpha=opacity, label='sim') |
| 149 | +expextsfs = ax.bar(index+ bar_width, expsfsBottlN, bar_width, label='exp') |
| 150 | +
|
| 151 | +fig.tight_layout() |
| 152 | +plt.show() |
| 153 | +``` |
| 154 | + |
| 155 | +## The distribution of nton branches |
| 156 | + |
| 157 | ++++ |
| 158 | + |
| 159 | +Given that the SFS only depends on mean branch lengths, it is interesting to inspect the probability density distribution of the underlying genealogical branches. Given the discrete event, the pfd of nton branches are discontinuous. |
| 160 | + |
| 161 | +```{code-cell} ipython3 |
| 162 | +s=1 |
| 163 | +demographic_events = [msprime.InstantaneousBottleneck(time=T, strength=s, population=0)] |
| 164 | +reps = msprime.simulate( |
| 165 | + sample_size=num_samp, Ne=Ne, num_replicates=num_rep, |
| 166 | + demographic_events=demographic_events) |
| 167 | +B = np.zeros((num_rep, num_samp)) |
| 168 | +for rep_index, ts in enumerate(reps): |
| 169 | + tree = next(ts.trees()) |
| 170 | + for u in tree.nodes(): |
| 171 | + nleaves = tree.num_samples(u) |
| 172 | + if tree.parent(u) != msprime.NULL_NODE: |
| 173 | + B[rep_index, nleaves]+=tree.branch_length(u) |
| 174 | +``` |
| 175 | + |
| 176 | +```{code-cell} ipython3 |
| 177 | +Btrans=np.array(B).T.tolist() |
| 178 | +sns.distplot(Btrans[1],axlabel="f(t)") |
| 179 | +sns.distplot(Btrans[2],axlabel="f(t)") |
| 180 | +sns.distplot(Btrans[3],axlabel="f(t)"); |
| 181 | +``` |
| 182 | + |
| 183 | +### To Do |
| 184 | + |
| 185 | ++++ |
| 186 | + |
| 187 | +1) Fix the scaling of the strength in msprime |
| 188 | +2) Fix the pdf plot above: |
| 189 | + - Label axes on the pdf plot above: y-> f(t), x -> t |
| 190 | + - Restrict X range to 15 |
| 191 | +3) Fix the x axes on all the barplots so that these correspond to classes in the SFS |
| 192 | + |
| 193 | +```{code-cell} ipython3 |
| 194 | +
|
| 195 | +``` |
0 commit comments